Metadata-Version: 2.4
Name: ext-data
Version: 0.1.1
Summary: AI-powered document extraction system that classifies documents and extracts structured data
Author: Jyothi Sinha
Keywords: document-ai,ocr,nlp,information-extraction
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.23
Requires-Dist: pillow>=10.0
Requires-Dist: pytesseract>=0.3
Requires-Dist: requests>=2.31
Requires-Dist: python-docx>=1.0
Requires-Dist: python-pptx>=0.6
Requires-Dist: camelot-py[cv]>=0.11
Requires-Dist: tabula-py>=2.9
Requires-Dist: python-dateutil>=2.8
Requires-Dist: python-dotenv>=1.0
Requires-Dist: spacy>=3.7
Provides-Extra: ocr
Requires-Dist: pytesseract; extra == "ocr"
Requires-Dist: pillow; extra == "ocr"
Provides-Extra: tables
Requires-Dist: camelot-py[cv]; extra == "tables"
Requires-Dist: tabula-py; extra == "tables"
Provides-Extra: nlp
Requires-Dist: spacy; extra == "nlp"
Dynamic: license-file

# ext_data

`ext_data` is a lightweight document extraction package. It classifies documents automatically and extracts structured data using a hybrid NLP pipeline (spaCy + GLiNER).

## Features

- **Multi-format support** -- PDF, DOCX, PPTX, PNG, JPG
- **Automatic document classification**
- **Hybrid extraction** -- spaCy, GLiNER, regex, rules
- **Line item extraction** -- from tables and from free text
- **OCR support** -- via Tesseract
- **Structured JSON output** with confidence scores
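
To give a feel for the regex layer of the hybrid approach, here is a minimal sketch. The patterns, the `extract_fields` helper, and the sample text are illustrative assumptions, not the rules shipped in the package; only the field names are taken from the example output in this README.

```python
import re

# Hypothetical patterns for illustration only; the package's real rules may differ.
INVOICE_NUMBER = re.compile(r"\bINV-\d+\b")
TOTAL_AMOUNT = re.compile(
    r"total\s*(?:amount)?\s*[:\-]?\s*\$?\s*(\d+(?:\.\d{2})?)",
    re.IGNORECASE,
)


def extract_fields(text: str) -> dict[str, str]:
    """Return the fields whose pattern matched somewhere in the text."""
    fields: dict[str, str] = {}

    number = INVOICE_NUMBER.search(text)
    if number:
        fields["invoice_number"] = number.group(0)

    total = TOTAL_AMOUNT.search(text)
    if total:
        # Group 1 is the numeric part, without label or currency sign.
        fields["total_amount"] = total.group(1)

    return fields


sample_text = "Invoice INV-001\nVendor: Company\nTotal amount: 1000.00"
print(extract_fields(sample_text))
```

In a hybrid setup, rule-based matches like these would typically be combined with model predictions (spaCy, GLiNER) for fields that have no fixed pattern, such as vendor names.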

## Installation

### Prerequisites

- Python 3.10 – 3.12
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) installed and available on your `PATH`
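
A quick way to confirm both prerequisites before installing is sketched below. The `check_prerequisites` helper is hypothetical and not part of `ext_data`; it assumes the Tesseract executable is named `tesseract` and is expected on `PATH`.

```python
import shutil
import sys


def check_prerequisites(binary: str = "tesseract") -> list[str]:
    """Return a list of problems found; an empty list means everything looks fine."""
    problems: list[str] = []

    # The package metadata declares Requires-Python >= 3.10.
    if sys.version_info < (3, 10):
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.10 or newer is required, found {found}")

    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' was not found on PATH")

    return problems


for problem in check_prerequisites():
    print(problem)
```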

### Install package

```bash
pip install ext_data
```

Or from source:

```bash
git clone https://github.com/your-repo/ext_data.git
cd ext_data
pip install -e .
```

## Usage

### Single file

```bash
python -m ext_data.main --input file.pdf
```

### Folder

```bash
python -m ext_data.main --input documents/
```
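
The same commands can be driven from a Python script. This is a minimal sketch that only wraps the documented command line; `build_command` and `run_extraction` are hypothetical helper names, and the sketch makes no assumption about a Python API inside `ext_data`.

```python
import subprocess
import sys


def build_command(input_path: str) -> list[str]:
    """Build the argument list for the documented CLI invocation."""
    # sys.executable keeps the call inside the current virtual environment.
    return [sys.executable, "-m", "ext_data.main", "--input", input_path]


def run_extraction(input_path: str) -> int:
    """Run the CLI on a file or folder and return its exit code."""
    completed = subprocess.run(build_command(input_path), check=False)
    return completed.returncode
```

Calling `run_extraction("documents/")` is then equivalent to the folder command above, and `run_extraction("file.pdf")` to the single-file one.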

### Output

Results are saved as JSON. An example result for an invoice:

```json
{
  "document_type": "invoice",
  "fields": {
    "invoice_number": "INV-001",
    "vendor_name": "Company",
    "total_amount": "1000.00"
  },
  "line_items": [],
  "overall_confidence": 0.8,
  "status": "success"
}
```
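
Downstream code can load a result and decide whether it needs manual review. The sketch below uses the keys from the example above; the `needs_review` helper and the `0.75` threshold are assumptions chosen for illustration, not values defined by the package.

```python
import json
from decimal import Decimal


def needs_review(result: dict, threshold: float = 0.75) -> bool:
    """Flag results that failed or fall below the confidence threshold."""
    if result.get("status") != "success":
        return True
    return result.get("overall_confidence", 0.0) < threshold


# Same shape as the example output shown above.
sample = json.loads("""
{
  "document_type": "invoice",
  "fields": {
    "invoice_number": "INV-001",
    "vendor_name": "Company",
    "total_amount": "1000.00"
  },
  "line_items": [],
  "overall_confidence": 0.8,
  "status": "success"
}
""")

# Amounts arrive as strings; Decimal avoids float rounding on money values.
total = Decimal(sample["fields"]["total_amount"])
print(sample["document_type"], total, needs_review(sample))
```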

## Supported Types

- **Invoice** (enabled)
- Other document types exist in the codebase but are not enabled by default; they can be enabled manually

## Project Structure

```
ext_data/
├── main.py
├── ingestion/
├── parsers/
├── extraction/
└── output/
```

## License

MIT
