Metadata-Version: 2.4
Name: splurge-data-profiler
Version: 2025.2.0
Summary: A data profiling tool for delimited and database sources.
Author: Jim Schilling
Maintainer: Jim Schilling
License-Expression: MIT
Project-URL: Homepage, https://github.com/jim-schilling/splurge-data-profiler
Project-URL: Documentation, https://github.com/jim-schilling/splurge-data-profiler/blob/main/docs/README-details.md
Project-URL: Repository, https://github.com/jim-schilling/splurge-data-profiler.git
Project-URL: Issues, https://github.com/jim-schilling/splurge-data-profiler/issues
Project-URL: Changelog, https://github.com/jim-schilling/splurge-data-profiler/blob/main/CHANGELOG.md
Keywords: data-profiling,csv,tsv,dsv,data-lake,sqlite,type-inference,data-analysis,etl,data-processing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Filters
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sqlalchemy>=2.0.37
Requires-Dist: splurge-dsv>=2025.1.5
Requires-Dist: splurge-typer>=2025.0.1
Requires-Dist: splurge-tabular>=2025.0.0
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-xdist>=3.8.0; extra == "test"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.8.0; extra == "dev"
Requires-Dist: ruff>=0.12.12; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: license-file

# Splurge Data Profiler

[![Version](https://img.shields.io/badge/version-2025.2.0-blue.svg)](https://github.com/jim-schilling/splurge-data-profiler/releases)
[![Python Versions](https://img.shields.io/pypi/pyversions/splurge-data-profiler.svg)](https://pypi.org/project/splurge-data-profiler/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Documentation](https://img.shields.io/badge/docs-detailed-blue.svg)](https://github.com/jim-schilling/splurge-data-profiler/blob/main/docs/README-details.md)
[![Coverage](https://img.shields.io/badge/coverage-92%25-brightgreen.svg)](https://github.com/jim-schilling/splurge-data-profiler)

A data profiling tool for delimited and database sources. It infers column data types automatically and creates a data lake (a SQLite database) with an optimized schema.

## Features

- **DSV File Support**: Profile CSV, TSV, and other delimiter-separated value files
- **Automatic Type Inference**: Intelligently detect data types using adaptive sampling
- **Data Lake Creation**: Generate SQLite databases with optimized schemas
- **Inferred Tables**: Create tables with both original and type-cast columns
- **Flexible Configuration**: JSON-based configuration for customization
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Comprehensive Testing**: Unit, integration, edge-case, end-to-end, and performance tests
- **Error Handling and Performance**: Robust error handling and performance optimization for large datasets

## Installation

```bash
pip install splurge-data-profiler
```

## Quick Start

1. **Create a configuration file**:
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```

2. **Profile your data**:
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```
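
Because the data lake is a SQLite database, the output of a profiling run can be inspected with Python's standard `sqlite3` module. The following is a minimal sketch, not part of this package: `list_tables` and `column_types` are hypothetical helper names, and the location of the database file and the names of the generated tables depend on your configuration and are not assumed here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of user tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def column_types(conn, table):
    """Map each column of `table` to its declared SQLite type."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return {row[1]: row[2] for row in rows}
```

To use it, open the generated database with `sqlite3.connect(<path to your database file>)`, pass the connection to `list_tables`, and call `column_types` on a table of interest to review the declared column types.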

## CLI Usage

### Profile Command

Profile a DSV file and create a data lake:

```bash
python -m splurge_data_profiler profile <dsv_file> <config_file> [options]
```

**Options:**
- `--verbose`: Enable verbose output
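
For batch processing, the `profile` command can be driven from a script. The sketch below only assembles the command lines documented above. `build_profile_command` and `build_batch` are hypothetical helper names, and the `*.csv` pattern is an assumption about how your input files are named.

```python
import sys
from pathlib import Path


def build_profile_command(dsv_file, config_file, verbose=False):
    """Return the argv list for the documented `profile` subcommand."""
    cmd = [
        sys.executable, "-m", "splurge_data_profiler", "profile",
        str(dsv_file), str(config_file),
    ]
    if verbose:
        cmd.append("--verbose")
    return cmd


def build_batch(data_dir, config_file, pattern="*.csv", verbose=False):
    """Build one `profile` command per matching file, in sorted order."""
    return [
        build_profile_command(path, config_file, verbose)
        for path in sorted(Path(data_dir).glob(pattern))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the profiler exits with a non-zero status.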

### Create Config Command

Generate a sample configuration file:

```bash
python -m splurge_data_profiler create-config <output_file>
```

## Configuration

The configuration file is a JSON file that specifies where to create the data lake and how to parse your DSV file:

```json
{
  "data_lake_path": "./data_lake",
  "dsv": {
    "delimiter": ",",
    "encoding": "utf-8"
  }
}
```
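
If you prefer to generate configuration files programmatically, for example one per delimiter, the structure above can be built with the standard `json` module. This is a sketch that uses only the keys shown in the example. `build_config` is a hypothetical helper, and whether the profiler accepts or requires other keys is not covered here.

```python
import json


def build_config(data_lake_path="./data_lake", delimiter=",", encoding="utf-8"):
    """Return a config dict with the keys shown in the example above."""
    return {
        "data_lake_path": data_lake_path,
        "dsv": {
            "delimiter": delimiter,
            "encoding": encoding,
        },
    }


def to_json(config):
    """Serialize the config in the same indented style as the example."""
    return json.dumps(config, indent=2) + "\n"
```

For a tab-separated file, `build_config(delimiter="\t")` produces the same structure with a tab delimiter, and the output of `to_json` can be written to the path you pass to the `profile` command.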

## Documentation

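For detailed documentation, examples, and the API reference, see:
- [Detailed Documentation](https://github.com/jim-schilling/splurge-data-profiler/blob/main/docs/README-details.md)
- [Changelog](https://github.com/jim-schilling/splurge-data-profiler/blob/main/CHANGELOG.md)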

## Requirements

- Python 3.10+
- SQLAlchemy >= 2.0.37
- splurge-dsv >= 2025.1.5
- splurge-typer >= 2025.0.1
- splurge-tabular >= 2025.0.0

## Quality Assurance

This project maintains high code quality through comprehensive testing:

- **Unit Tests**: Core component testing with 100% coverage of critical paths
- **Integration Tests**: End-to-end workflow validation
- **Edge Case Tests**: Error handling and boundary condition testing
- **E2E Tests**: Complete user scenario validation
- **Performance Tests**: Large dataset processing validation

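Install the test dependencies via the `test` extra (pytest, pytest-cov, pytest-xdist), then run the tests from the project root: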
```bash
pytest
```

## License

Released under the MIT License. See the `LICENSE` file for details.
