Metadata-Version: 2.4
Name: umap-stratified-split
Version: 0.1.0
Summary: Stratified dataset splitting via UMAP-based pseudo-labels
Author-email: "Bilal Qureshi, M.Sc." <bilal.qureshi@outlook.de>
License: MIT License
        
        Copyright (c) 2025 Bilal Q.
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/bilalqur/umap-stratified-split
Project-URL: Documentation, https://github.com/bilalqur/umap-stratified-split/docs
Project-URL: Issue Tracker, https://github.com/bilalqur/umap-stratified-split/issues
Keywords: umap,stratification,pytorch,dataset,split
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20
Requires-Dist: umap-learn<0.6,>=0.5.4
Requires-Dist: numba>=0.55
Requires-Dist: scikit-learn>=1.0
Requires-Dist: torch>=1.9
Requires-Dist: torchvision>=0.10
Requires-Dist: matplotlib>=3.3
Requires-Dist: pytest>=6.0
Dynamic: license-file

# UMAP-Stratified Dataset Split
![Python](https://img.shields.io/badge/Python-3.8+-blue)
> Stratified dataset splitting via UMAP‑based pseudo‑labels  
> A drop‑in alternative to `torch.utils.data.random_split` that ensures each split preserves the manifold structure of your data.

## 📖 Overview

When you randomly split a dataset, you risk concentrating your validation set in a narrow region of feature space, especially with small or highly clustered datasets.  
`umap-stratified-split` embeds your entire dataset with UMAP (using a fixed seed), clusters the embedding into “pseudo‑labels,” then performs a stratified split over those pseudo‑labels so that each subset draws proportionally from **every** cluster of the embedding.

This approach is grounded in the assumption that your dataset lies on a meaningful low-dimensional manifold and benefits from a uniform, structure-preserving sampling across this space.
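
The last step described above, stratifying over pseudo‑labels, can be sketched in plain Python. This is a minimal illustration rather than the package's actual implementation: the function name `stratified_split`, its parameters, and the example labels are hypothetical, and the pseudo‑labels are assumed to have already been produced by clustering a UMAP embedding (that part is not shown).

```python
import random
from collections import defaultdict


def stratified_split(pseudo_labels, val_fraction=0.2, seed=42):
    """Split sample indices so each cluster contributes ~val_fraction to validation.

    pseudo_labels: one cluster id per sample. They are assumed to come from
    clustering a UMAP embedding; that step is outside this sketch.
    """
    rng = random.Random(seed)  # local RNG so the split is repeatable

    # Group sample indices by their cluster id.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        by_cluster[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        if len(members) > 1:
            # Take roughly val_fraction of the cluster, but keep at least one
            # sample on each side so every cluster appears in both subsets.
            n_val = round(len(members) * val_fraction)
            n_val = min(max(n_val, 1), len(members) - 1)
        else:
            # A singleton cluster cannot be shared; keep it for training.
            n_val = 0
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical pseudo-labels: three clusters of sizes 10, 5 and 5.
labels = [0] * 10 + [1] * 5 + [2] * 5
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
```

Forcing at least one validation sample per non-singleton cluster is a design choice of this sketch: it trades exact proportions in very small clusters for coverage of every cluster.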

---

## 🧠 Theoretical Motivation

This method builds on the foundation laid by the UMAP algorithm:

> **Uniform Manifold Approximation and Projection (UMAP)** is a general-purpose nonlinear dimension reduction technique. Its core idea is to construct a graph-based fuzzy topological representation of data, then optimize a low-dimensional embedding that preserves local neighborhood structure [1].

UMAP relies on the following assumptions:

1. **The data is uniformly distributed on a Riemannian manifold**
2. **The Riemannian metric is locally constant** (or approximately so)
3. **The manifold is locally connected**

When these assumptions hold at least approximately, UMAP can build a low-dimensional embedding that preserves local neighborhood and cluster structure [2].

By clustering the UMAP embedding and stratifying across those clusters, this package provides a principled way to sample validation and training sets that are *representative of the entire data manifold*.

---

## ✅ Advantages of this Approach

- **Manifold-aware validation**: Ensures your validation set covers the same regions as your training set
- **Less validation bias**: Avoids selecting validation samples from narrow regions of the space
- **General-purpose**: Works for any data type (images, time series, embeddings, tabular, etc.)
- **Drop-in replacement**: Mimics the `random_split` API from PyTorch
- **Fully unsupervised**: Uses the geometry of the data, no true labels required
- **Reproducible**: Fixing the UMAP seed makes the embedding, and therefore the clustering and the split, repeatable
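
One way to check the coverage claims in this list on your own data is to compare cluster proportions between the full dataset and a validation subset. The sketch below is only an illustration: the helper names `cluster_proportions` and `max_proportion_gap` and the toy labels are hypothetical and not part of this package.

```python
from collections import Counter


def cluster_proportions(labels):
    """Return the share of samples that fall in each cluster."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.items()}


def max_proportion_gap(all_labels, val_indices):
    """Largest absolute difference in cluster share between the full
    dataset and the validation subset (0.0 means identical proportions)."""
    full = cluster_proportions(all_labels)
    val = cluster_proportions([all_labels[i] for i in val_indices])
    # Clusters missing from the validation subset count as a share of 0.
    return max(abs(full[c] - val.get(c, 0.0)) for c in full)


# Hypothetical pseudo-labels: two clusters of six samples each.
labels = [0] * 6 + [1] * 6
biased_val = [0, 1, 2]        # drawn only from cluster 0
balanced_val = [0, 1, 6, 7]   # two samples from each cluster

biased_gap = max_proportion_gap(labels, biased_val)
balanced_gap = max_proportion_gap(labels, balanced_val)
```

A gap near zero suggests the validation subset mirrors the cluster structure of the full dataset; a large gap indicates that some clusters are over- or under-represented.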

---

## ⚠️ When to Use / Limitations

This method works best when:

- Your data lies on a structured manifold (e.g. clustered, continuous trajectories)
- The standard random split leads to class imbalance or structural bias
- You don’t have true labels, but want stratified-like splits based on learned structure

Avoid using this method if:

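- Your data has no meaningful cluster structure (e.g. it is close to uniformly distributed), so a random split is already representative
- You are carving out a held-out *test set*: the embedding is fit on the entire dataset, so information about the test samples leaks into how the split is made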
- Your feature extractor or UMAP embedding fails to capture meaningful structure

---

## 📚 References

1. McInnes, L., Healy, J., & Melville, J. (2018). *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction*. arXiv preprint [arXiv:1802.03426](https://arxiv.org/abs/1802.03426)
2. Official UMAP implementation: https://github.com/lmcinnes/umap

---

## 🔧 Installation

### ✅ Recommended (fast & modern):

```bash
# Install via uv (recommended)
uv pip install umap-stratified-split
```

### 🛠️ From GitHub (latest main):

```bash
uv pip install git+https://github.com/bilalqur/umap-stratified-split.git
```

### 🧪 Local development mode:

```bash
git clone https://github.com/bilalqur/umap-stratified-split.git
cd umap-stratified-split
uv pip install -e .
```

---
