Metadata-Version: 2.4
Name: cdmltrain
Version: 1.1.2
Summary: Stream ML datasets from ZIP/ZSTD/S3 archives into PyTorch without disk extraction.
Home-page: https://github.com/prem85642/cdmltrain
Author: Prem Kumar Tiwari
Author-email: prem85642@gmail.com
Project-URL: Bug Tracker, https://github.com/prem85642/cdmltrain/issues
Project-URL: Source Code, https://github.com/prem85642/cdmltrain
Keywords: pytorch dataloader dataset streaming zip zstd s3 cloud machine-learning
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Pillow>=8.0.0
Provides-Extra: zstd
Requires-Dist: zstandard>=0.20.0; extra == "zstd"
Provides-Extra: gpu
Requires-Dist: torch>=1.9.0; extra == "gpu"
Provides-Extra: s3
Requires-Dist: s3fs>=2023.0.0; extra == "s3"
Requires-Dist: boto3>=1.20.0; extra == "s3"
Provides-Extra: full
Requires-Dist: zstandard>=0.20.0; extra == "full"
Requires-Dist: torch>=1.9.0; extra == "full"
Requires-Dist: s3fs>=2023.0.0; extra == "full"
Requires-Dist: boto3>=1.20.0; extra == "full"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# cdmltrain 🚀
### Stream ML Datasets Directly from ZIP / ZSTD / S3 Cloud — No Extraction. No Wasted Storage. Zero OOM.

[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/prem85642/cdmltrain/blob/main/LICENSE)
[![PyPI version](https://badge.fury.io/py/cdmltrain.svg)](https://pypi.org/project/cdmltrain/)
[![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-blue.svg)](https://www.python.org/)
[![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey)]()

---

## 🔥 The Problem This Solves

Every Data Scientist / ML Engineer hits this wall:

```
"Your 100 GB Kaggle dataset is a ZIP file.
 Extracting it takes 2 hours and needs 300 GB of free disk space.
 Your Colab/Kaggle notebook crashes with Out-of-Memory errors."
```

**`cdmltrain` eliminates this problem entirely.**

It lets PyTorch read images, audio, text, CSV, JSON — **any data** — directly from a compressed archive **into RAM**, skipping disk extraction completely.
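
To make the idea concrete, here is a minimal sketch using only Python's standard `zipfile` module. It is not `cdmltrain`'s actual implementation, and the archive and member names are made up for the demo. It only shows that a single member can be pulled into RAM without extracting anything to disk.

```python
import io
import zipfile

# Build a small demo archive in memory (stand-in for a dataset ZIP on disk).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("images/cat.bin", b"\x00" * 1024)
    zf.writestr("labels.csv", "file,label\ncat.bin,0\n")

# Opening the archive parses the central directory (the member index);
# no member data is decompressed at this point.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    # read() decompresses exactly one member into a bytes object held in RAM.
    labels = zf.read("labels.csv")

print(names)
print(labels.decode("utf-8"))
```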

---

## ✨ Key Features

| Feature | Description |
|---|---|
| 🗜️ **Format Agnostic** | Images, audio (.wav), text, CSV, JSON, binary — all supported |
| ⚡ **O(1) Random Access** | ZIP Central Directory indexing — jumps to any file instantly |
| 🧠 **Memory-Safe Cache** | Custom LRU cache enforces strict RAM limits — zero OOM crashes |
| 🔒 **Thread Safe** | Concurrent reads for PyTorch `DataLoader(num_workers=N)` |
| 🔧 **4-Tier Architecture** | Auto-selects best engine based on your hardware & data location |
| 🏎️ **ZSTD Support** | `.tar.zst` archives — 25x faster than ZIP deflate |
| 🎮 **GPU Direct Loader** | Streams data to CUDA VRAM with async prefetch |
| ☁️ **S3 Cloud Streaming** | Stream data directly from AWS S3 — zero local download |
| 🌍 **Cross-Platform** | Windows, Linux, macOS — works everywhere |
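
For readers curious how a size-bounded cache can keep RAM usage under a fixed limit, below is a rough sketch. The class name `ByteBudgetLRU` and its methods are hypothetical and are not part of the `cdmltrain` API. It evicts the least recently used entries once the total payload size exceeds the budget.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Hypothetical LRU cache that evicts by total payload size, not entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._items = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if len(value) > self.max_bytes:
            return  # too large to ever fit, so it is never cached
        if key in self._items:
            self.current_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.current_bytes += len(value)
        # Evict least recently used entries until we are back under budget.
        while self.current_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.current_bytes -= len(evicted)


cache = ByteBudgetLRU(max_bytes=1024)
cache.put("a", b"x" * 600)
cache.put("b", b"y" * 600)  # total would be 1200 bytes, so "a" is evicted
print(cache.get("a"), cache.current_bytes)
```

A cache shared between threads would also need a lock around `get` and `put`; that is left out here to keep the sketch short.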

---

## 🏗️ Architecture (4 Tiers — Auto-Selected)

```mermaid
graph TD
    A[Local File <br> .zip / .tar.zst] --> D{Auto-Select <br> Engine}
    B[Cloud AWS S3 <br> s3://bucket/data.zip] --> D

    subgraph Tier 1: Core
    E[1. Python CoreStreamEngine <br> <i>Zero dependencies, works anywhere</i>]
    end

    subgraph Tier 2: Fast Core
    F[2. C++ FastCoreEngine <br> <i>Bypasses GIL, faster multi-worker reads</i>]
    end

    subgraph Tier 3: ZSTD & GPU
    G[3. ZSTD / GPU Direct Loader <br> <i>25x faster decompression, CUDA VRAM streaming</i>]
    end

    subgraph Tier 4: S3 Cloud
    H[4. S3 Cloud Streaming Engine <br> <i>Byte-range HTTP requests, zero local download</i>]
    end

    D --> E
    D --> F
    D --> G
    D --> H

    E --> I(PyTorch DataLoader)
    F --> I
    G --> I
    H --> I
    
    I --> J((Model Training))
    
    classDef default fill:#1f2937,stroke:#3b82f6,stroke-width:2px,color:#f3f4f6;
    classDef tier fill:#111827,stroke:#10b981,stroke-width:2px,color:#e5e7eb;
    class E,F,G,H tier;
```

**The library auto-detects your hardware and data location, then picks the best tier. Same code always.**
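
The exact selection rules are not documented here, so the following is only a guess at what such a dispatcher could look like. The function `pick_tier`, its `has_cpp_extension` flag, and the returned labels are all hypothetical and chosen purely for illustration.

```python
def pick_tier(path, has_cpp_extension=False):
    """Hypothetical rule: choose an engine from the path and one capability flag."""
    if path.startswith("s3://"):
        return "tier4-s3"        # remote object, read over the network
    if path.endswith(".tar.zst"):
        return "tier3-zstd"      # ZSTD-compressed tar archive
    if path.endswith(".zip"):
        # Prefer the compiled engine when it is available.
        return "tier2-cpp" if has_cpp_extension else "tier1-python"
    raise ValueError(f"Unsupported archive type: {path}")


for candidate in ("data.zip", "data.tar.zst", "s3://bucket/data.zip"):
    print(candidate, "->", pick_tier(candidate))
```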

---

## 📦 Installation

```bash
# Core (Tier 1 + 2): works everywhere
pip install cdmltrain

# With ZSTD support (Tier 3)
pip install "cdmltrain[zstd]"

# With PyTorch for the GPU Direct Loader (Tier 3)
pip install "cdmltrain[gpu]"

# With AWS S3 Cloud Streaming (Tier 4)
pip install "cdmltrain[s3]"

# Everything
pip install "cdmltrain[full]"

# Or from source:
git clone https://github.com/prem85642/cdmltrain.git
cd cdmltrain
pip install .

# Optional after a source install: ZSTD support for .tar.zst archives
pip install zstandard
```

> **C++ Acceleration (Tier 2):**
> - Windows: Install [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
> - Linux: `sudo apt-get install build-essential`
> - If skipped: pure-Python engine runs automatically — no errors.

---

## 🚀 Quick Start

### Basic Usage (ZIP — Any Data Type)
```python
from cdmltrain import CDMLStreamDataset
from torch.utils.data import DataLoader

# Point directly to your ZIP — no extraction needed!
dataset = CDMLStreamDataset(
    zip_path="my_dataset.zip",
    max_cache_mb=2048    # RAM cache limit (MB)
)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in dataloader:
    # Train your model normally
    pass
```

### Image Dataset (with PyTorch Transforms)
```python
from cdmltrain import CDMLStreamDataset
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

dataset = CDMLStreamDataset("images.zip", transform=transform, is_image=True)
```

### ZSTD Archive (25x Faster Decompression) ⚡
```python
# Just change the file extension — everything else is identical!
dataset = CDMLStreamDataset("my_dataset.tar.zst", max_cache_mb=2048)
```

### S3 Cloud Streaming (Zero Download) ☁️
Stream terabytes of data directly from an S3 bucket without downloading anything. AWS credentials are automatically picked up from your environment or `~/.aws/credentials`.

```python
# Just use an s3:// URI!
dataset = CDMLStreamDataset(
    zip_path="s3://my-bucket/huge_dataset.zip",
    max_cache_mb=2048
)
```
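
Byte-range streaming works for ZIP files because the archive's index sits at the end of the file. The sketch below is an illustration under stated assumptions, not the library's S3 engine. It parses the end-of-central-directory (EOCD) record from only the last few bytes of an archive, and an in-memory buffer stands in for the remote object. The helper name `locate_central_directory` is hypothetical.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory record marker
EOCD_FORMAT = "<4sHHHHIIH"      # fixed 22-byte layout defined by the ZIP format
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)


def locate_central_directory(tail):
    """Parse the EOCD record found in the last bytes of an archive.

    Returns (entry_count, directory_size, directory_offset).
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos == -1:
        raise ValueError("EOCD record not found; fetch a larger tail")
    fields = struct.unpack(EOCD_FORMAT, tail[pos:pos + EOCD_SIZE])
    _, _, _, _, entry_count, directory_size, directory_offset, _ = fields
    return entry_count, directory_size, directory_offset


# Stand-in for a remote object: an in-memory archive with three members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for i in range(3):
        zf.writestr(f"sample_{i}.bin", bytes([i]) * 100)
archive = buffer.getvalue()

# Equivalent of an HTTP "Range: bytes=-64" request for the final 64 bytes.
tail = archive[-64:]
entry_count, directory_size, directory_offset = locate_central_directory(tail)
print(entry_count, directory_size, directory_offset)
```

With the offset and size known, a second range request can fetch just the central directory, and later requests can fetch individual members. Archives that use ZIP64 extensions carry additional records that this sketch does not handle.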

### GPU Direct Loader (NVIDIA VRAM Streaming) 🎮
For the highest possible performance, `cdmltrain` bypasses PyTorch's `DataLoader` multiprocessing bottleneck using CUDA Pinned Memory:

```python
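from cdmltrain import CDMLStreamDataset
from cdmltrain.gpu_loader import GPUDirectLoader

dataset = CDMLStreamDataset("my_dataset.zip", max_cache_mb=2048)

# Stream directly into GPU:0 VRAM using asynchronous DMA prefetching
loader = GPUDirectLoader(dataset, batch_size=64, device="cuda:0")

for batch in loader:
    # `batch` is already a CUDA tensor, so no `.to(device)` call is needed.
    # `model` is your own nn.Module, already moved to the same device.
    outputs = model(batch)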
```
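
The prefetching part of this idea can be illustrated without a GPU. The sketch below is a simplified, CPU-only stand-in and not the `GPUDirectLoader` code: a background thread keeps a small queue of upcoming batches filled while the consumer works on the current one. The function name `prefetch` and the `depth` parameter are hypothetical; pinned memory and CUDA copies are deliberately left out.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for batch in batches:
            buffer.put(batch)  # blocks while `depth` batches are already waiting
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


loaded = list(prefetch(iter([b"batch-0", b"batch-1", b"batch-2"])))
print(loaded)
```

Error handling is omitted: if the source iterator raised inside the worker, the consumer would wait forever, so a real loader would need to forward exceptions.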

---

## 🔬 Data Scientist Debugging Tools
When a deep learning model crashes at `Epoch 4, Batch 1400` because of a corrupt image inside the archive, you don't need to extract the 100 GB ZIP to find it. `cdmltrain` provides debugging hooks for exactly this case:

### 1. Extract a Corrupt File by Index
Instantly extract the broken image to your disk to inspect it visually:
```python
# Instantly fetch image #1400 and save it to the current folder
dataset.extract_sample_to_disk(idx=1400, export_path="./debug_folder/")
# Output: [CDML] Extracted file 'broken_dog.jpg' to -> ./debug_folder/broken_dog.jpg
```

### 2. Fetch a Specific File by Name
Want to look at the `labels.csv` file without touching the millions of images inside the ZIP?
```python
# Returns raw bytes (or a PIL Image if is_image=True)
csv_bytes = dataset.get_by_filename("train_labels.csv")
print(csv_bytes.decode('utf-8')[:100])
```
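
Once the bytes are in hand, the standard library is enough to turn them into a lookup table. The sketch below assumes a CSV with `file` and `label` columns. That layout is an assumption for the demo, not something `cdmltrain` prescribes, and the sample bytes stand in for the value returned by `get_by_filename`.

```python
import csv
import io


def parse_labels(csv_bytes):
    """Turn CSV bytes with `file` and `label` columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    return {row["file"]: int(row["label"]) for row in reader}


# Stand-in for the bytes returned by `dataset.get_by_filename(...)`.
sample = b"file,label\ndog_001.jpg,1\ncat_001.jpg,0\n"
labels = parse_labels(sample)
print(labels)
```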

---

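## 🖥️ 4-Tier Architecture Fallback

`cdmltrain` auto-detects your system capabilities and selects the fastest available engine. If the C++ extension (Tier 2) was not compiled, the pure-Python engine (Tier 1) runs automatically with no errors.

---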

## ⚙️ Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `zip_path` | `str` | *(required)* | Path to a `.zip` or `.tar.zst` file, or an S3 URI (`s3://bucket/data.zip`) |
| `transform` | `callable` | `None` | PyTorch/torchvision transform |
| `is_image` | `bool` | `False` | `True` enables PIL image decoding |
| `max_cache_mb` | `int` | `2048` | Max RAM for caching (MB) |

**GPU Direct Loader Parameters** (`GPUDirectLoader`):

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | `Dataset` | *(required)* | A `CDMLStreamDataset` instance |
| `batch_size` | `int` | `32` | Number of items per batch |
| `device` | `str` | `"cuda:0"` | Target CUDA device to stream into |

**`max_cache_mb` Guide:**

| Your RAM | Recommended |
|---|---|
| 4 GB (Colab Free) | `512` |
| 8 GB (Laptop) | `2048` |
| 16 GB (PC) | `6000` |
| 32 GB+ (Server) | `16000` |
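
If you would rather compute the value than look it up, the table above can be folded into a small helper. The function `recommended_cache_mb` is hypothetical and not part of the library, and the handling of in-between RAM sizes (rounding down to the nearest row) is an assumption.

```python
def recommended_cache_mb(total_ram_gb):
    """Map total system RAM (GB) to a `max_cache_mb` value from the guide."""
    if total_ram_gb >= 32:
        return 16000
    if total_ram_gb >= 16:
        return 6000
    if total_ram_gb >= 8:
        return 2048
    return 512  # 4 GB machines and anything smaller


for ram in (4, 8, 16, 32):
    print(ram, "GB ->", recommended_cache_mb(ram))
```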

---

## 📊 Benchmarks (Real Tests)

### ZSTD vs ZIP Speed (200 files × 10KB)

```mermaid
gantt
    title Files processed per second (Higher is better)
    dateFormat  X
    axisFormat %s
    
    section Baseline (ZIP)
    Tier 1/2 (29k/s)    :a1, 0, 29204
    
    section High-Perf (ZSTD)
    Tier 3 (739k/s) :a2, 0, 739653
```

| Engine | Speed | Speedup |
|---|---|---|
| ZIP (Deflate) — Tier 1/2 | 29,204 files/sec | baseline |
| ZSTD — Tier 3 | **739,653 files/sec** | **🔥 25x faster** |
| S3 Cloud Streaming — Tier 4 | Network Bound | **🔥 0 GB local disk space used** |

### GPU Direct Loader (Google Colab T4)
| Metric | Value |
|---|---|
| GPU | Tesla T4 (15.6 GB VRAM) |
| Batch device | `cuda:0` — data streamed to VRAM |
| Throughput | **17,512 items/sec** |
| Epoch time (100 items) | 0.0045s |

### Memory Safety Test
| Test | Result |
|---|---|
| Cache limit: 1MB, Data: 2MB | ✅ Stayed under 1MB |
| Thread safety: 8 workers | ✅ Zero race conditions |
| Corrupted ZIP | ✅ Rejected cleanly |
| 50MB single file | ✅ Byte-exact in 0.108s |
| 1000-file archive | ✅ Indexed in 0.04s |
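
To try something similar to the thread-safety check on your own machine, here is a stdlib-only sketch. It is not the project's test suite. It uses one common approach, a separate `ZipFile` handle per thread, which may differ from how `cdmltrain` handles concurrency internally.

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Demo archive: 50 members with known contents.
expected = {f"item_{i}.bin": bytes([i]) * 256 for i in range(50)}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, payload in expected.items():
        zf.writestr(name, payload)
archive_bytes = buffer.getvalue()

local = threading.local()


def read_member(name):
    # Each thread lazily opens its own handle, so file offsets are never shared.
    if not hasattr(local, "zf"):
        local.zf = zipfile.ZipFile(io.BytesIO(archive_bytes))
    return name, local.zf.read(name)


# Read every member four times across 8 worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(read_member, list(expected) * 4))

mismatches = [name for name, data in results.items() if data != expected[name]]
print("mismatches:", mismatches)
```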

---

## 🐛 Debugging / Inspection

Inspect any specific file without unzipping:
```python
dataset.extract_sample_to_disk(idx=42, export_path="./inspection/")
```

---

## 🛠️ Troubleshooting

### `ModuleNotFoundError: No module named 'cdmltrain'`
```bash
pip install cdmltrain
# or, from source:
git clone https://github.com/prem85642/cdmltrain.git && cd cdmltrain && pip install .
```

### `ModuleNotFoundError: No module named 'zstandard'`
```bash
pip install zstandard
```

### `Microsoft Visual C++ 14.0 required` (Windows)
Install [C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/). Or skip — pure Python engine works fine.

### Out of Memory on Colab
```python
dataset = CDMLStreamDataset("data.zip", max_cache_mb=512)  # Reduce cache
```

### `Bad CRC-32` error
```bash
python -c "import zipfile; print(zipfile.ZipFile('file.zip').testzip())"
# None = healthy, anything else = re-download
```

### `PIL.UnidentifiedImageError`
```python
dataset = CDMLStreamDataset("data.zip", is_image=False)  # Not an image dataset
```
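
If only some members are not images, it can help to find out which ones before training. The sketch below checks the leading bytes of each member against the JPEG and PNG signatures using the standard `zipfile` module. The helper `find_non_images` is hypothetical, and other image formats would need their own signatures added.

```python
import io
import zipfile

IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")  # JPEG, PNG


def find_non_images(zf):
    """Return member names whose first bytes match neither JPEG nor PNG."""
    suspects = []
    for info in zf.infolist():
        if info.is_dir():
            continue
        with zf.open(info) as member:
            header = member.read(8)  # only the first few bytes are read
        if not header.startswith(IMAGE_SIGNATURES):
            suspects.append(info.filename)
    return suspects


# Demo archive with two fake image headers and one CSV file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("img/ok.png", b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    zf.writestr("img/ok.jpg", b"\xff\xd8\xff\xe0" + b"\x00" * 16)
    zf.writestr("train_labels.csv", "file,label\nok.png,1\n")

with zipfile.ZipFile(buffer) as zf:
    suspects = find_non_images(zf)
print(suspects)
```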

---

## 🖥️ OS Compatibility

| Feature | Windows | Linux | macOS |
|---|---|---|---|
| ZIP Engine (Tier 1) | ✅ | ✅ | ✅ |
| ZSTD Engine (Tier 3) | ✅ | ✅ | ✅ |
| GPU Direct Loader (needs an NVIDIA CUDA GPU) | ✅ | ✅ | ✅ |
| C++ Fast Engine (Tier 2) | ✅ pre-built | ✅ compile via `pip install .` | ✅ compile via `pip install .` |

---

## 📁 Project Structure

```
cdmltrain/
├── cdmltrain/
│   ├── __init__.py          # Package entry point
│   ├── core.py              # Tier 1: Python CoreStreamEngine
│   ├── dataset.py           # CDMLStreamDataset (auto-tier selection)
│   ├── zstd_engine.py       # Tier 3: ZSTD streaming engine
│   ├── s3_engine.py         # Tier 4: Cloud Network Streaming (S3) engine
│   ├── gpu_loader.py        # Tier 3: GPU Direct Loader (CUDA pinned memory)
│   └── src/
│       └── fast_core.cpp    # Tier 2: C++ FastCoreEngine (pybind11)
├── demo.py                  # Quickstart demo
├── quickstart.ipynb         # Jupyter Notebook tutorial
├── test_enterprise_audit.py # Enterprise QA suite (8 tests)
├── test_zstd_benchmark.py   # ZSTD vs ZIP benchmark
├── test_zstd_compat.py      # Cross-format compatibility test
├── test_gpu_loader.py       # GPU Direct Loader test
├── setup.py                 # pip install configuration
├── requirements.txt         # Dependencies
└── LICENSE                  # MIT License
```

---

## 🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first.

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.

**Made with ❤️ for the ML community — because your model matters more than your storage bill.**
