Metadata-Version: 2.4
Name: dataspool
Version: 0.1.0
Summary: Lightweight versioning for data files with automatic change detection
Project-URL: Homepage, https://github.com/jmiloser/dataspool
Project-URL: Documentation, https://github.com/jmiloser/dataspool#readme
Project-URL: Repository, https://github.com/jmiloser/dataspool
Project-URL: Issues, https://github.com/jmiloser/dataspool/issues
Author-email: Jim Miloser <jmiloser@gmail.com>
License: MIT
License-File: LICENSE
Keywords: data,dataframe,pandas,version-control,versioning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: pandas>=2.0.0
Provides-Extra: all
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: pre-commit>=3.4.0; extra == 'all'
Requires-Dist: pyarrow>=13.0.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.4.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: parquet
Requires-Dist: pyarrow>=13.0.0; extra == 'parquet'
Description-Content-Type: text/markdown

# Dataspool

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Lightweight versioning for data files with automatic change detection.

Dataspool is an early-stage (alpha) library for tracking changes in pandas DataFrames with minimal configuration. It detects changes by hashing DataFrame content and records a new version only when that content differs from the previous one, which keeps the version history free of duplicates.

## Features

- Smart change detection - only saves when content actually changes
- Automatic versioning with timestamps and comprehensive metadata
- SHA256 content hashing for reliable change detection
- Complete version history and audit trail
- Zero configuration with sensible defaults
- Comprehensive test coverage
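
The first three features can be pictured with a small standard-library sketch. It is not dataspool's implementation: `save_if_changed`, the `_versions.json` index file, and the `vNNNN_<timestamp>` label format are all assumptions made for illustration. The sketch hashes the raw bytes, compares the digest with the most recently recorded one, and writes a new timestamped file only when they differ.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_if_changed(payload: bytes, base_filename: str, data_dir: Path) -> dict:
    """Write a versioned copy of payload only if its SHA256 is new.

    Hypothetical helper for illustration; not part of the dataspool API.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    # Assumed layout: one JSON index per dataset, listing versions in order.
    index_path = data_dir / f"{base_filename}_versions.json"
    history = json.loads(index_path.read_text()) if index_path.exists() else []

    digest = hashlib.sha256(payload).hexdigest()
    if history and history[-1]["hash"] == digest:
        # Content is unchanged since the last version, so nothing is written.
        return {"saved": False, "version": history[-1]["version"]}

    # A running counter keeps labels unique; the UTC timestamp records when.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    version = f"v{len(history) + 1:04d}_{stamp}"
    (data_dir / f"{base_filename}_{version}.csv").write_bytes(payload)

    history.append({"version": version, "hash": digest})
    index_path.write_text(json.dumps(history, indent=2))
    return {"saved": True, "version": version}
```

Calling `save_if_changed` twice with the same bytes would write one file and return `saved=False` the second time; a call with different bytes would add a second file and a second index entry.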

## Installation

```bash
pip install dataspool
```

## Quick Start

```python
import pandas as pd
from dataspool import save_with_version_control

df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Cherry"],
    "Price": [1.50, 0.75, 2.25]
})

result = save_with_version_control(df, "prices")

if result["saved"]:
    print(f"New version saved: {result['version']}")
else:
    print("No changes detected")
```

## Usage

### Basic Versioning

```python
from pathlib import Path
from dataspool import save_with_version_control

# Reuses the `df` defined in Quick Start.
result = save_with_version_control(
    df,
    base_filename="my_data",
    data_dir=Path("./data"),
    save_latest=True
)
```

### Version History

```python
from pathlib import Path

from dataspool import get_latest_version, get_version_history

history = get_version_history(Path("./data"))
for version in history:
    print(f"{version['timestamp']}: {version['row_count']} rows")

latest = get_latest_version(Path("./data"))
print(f"Latest: {latest['timestamp']}")
```

### Change Detection

```python
import pandas as pd
from dataspool import save_with_version_control

df1 = pd.DataFrame({"A": [1, 2, 3]})
result1 = save_with_version_control(df1, "test")  # saved=True

# Same content, different order - detected as identical
df2 = pd.DataFrame({"A": [3, 1, 2]})
result2 = save_with_version_control(df2, "test")  # saved=False

# Modified content - creates new version
df3 = pd.DataFrame({"A": [1, 2, 4]})
result3 = save_with_version_control(df3, "test")  # saved=True
```
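
One way to get the order-insensitive behaviour shown above is to hash rows in a canonical (sorted) order. The sketch below is an assumption about how such a hash could work, not dataspool's actual algorithm. `content_hash` is a hypothetical helper, and it takes plain tuples so that it runs without pandas.

```python
import hashlib
from typing import Iterable, Sequence


def content_hash(columns: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """SHA256 over column names plus rows in sorted order (hypothetical helper)."""
    hasher = hashlib.sha256()
    hasher.update(repr(tuple(columns)).encode("utf-8"))
    # Sorting the per-row text makes the digest independent of row order.
    for row_repr in sorted(repr(tuple(row)) for row in rows):
        hasher.update(row_repr.encode("utf-8"))
    return hasher.hexdigest()


# With pandas, rows could come from df.itertuples(index=False, name=None).
h1 = content_hash(["A"], [(1,), (2,), (3,)])
h2 = content_hash(["A"], [(3,), (1,), (2,)])  # same rows, different order
h3 = content_hash(["A"], [(1,), (2,), (4,)])  # one value changed

print(h1 == h2)  # True: reordering does not change the digest
print(h1 == h3)  # False: changed content yields a different digest
```

Sorting trades away sensitivity to row order, so a dataset where order carries meaning (for example a time series) would need the sort step removed.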

## Development

### Setup with uv

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/jmiloser/dataspool.git
cd dataspool
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
pytest
pytest --cov=dataspool --cov-report=html
```

### Code Quality

```bash
ruff format .
ruff check .
mypy src/
```

## Roadmap

- Support for Polars DataFrames
- Parquet format support
- Diff functionality between versions
- Version rollback and restore
- Cloud storage backends (S3, GCS, Azure)
- CLI tool for version inspection
- Automatic version pruning

## License

MIT License - see LICENSE file for details.

## Author

Jim Miloser - [GitHub](https://github.com/jmiloser)
