Metadata-Version: 2.4
Name: arabic-extract
Version: 0.1.0
Summary: Clean Arabic text extraction from PDFs and scanned images — OCR + visual-order repair in one pipeline
Project-URL: Homepage, https://github.com/balswyan/arabic-extract
Project-URL: Source, https://github.com/balswyan/arabic-extract
Project-URL: Bug Tracker, https://github.com/balswyan/arabic-extract/issues
Author: Bandar AlSwyan
License: MPL-2.0
Keywords: arabic,bidi,easyocr,nlp,ocr,pdf,presentation-forms,rtl,tesseract,text-extraction
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: arabic-repair>=0.1.0
Provides-Extra: all
Requires-Dist: easyocr>=1.7; extra == 'all'
Requires-Dist: pdfplumber>=0.9; extra == 'all'
Requires-Dist: pillow>=9; extra == 'all'
Requires-Dist: pymupdf>=1.23; extra == 'all'
Requires-Dist: pytesseract>=0.3; extra == 'all'
Provides-Extra: dev
Requires-Dist: pdfplumber>=0.9; extra == 'dev'
Requires-Dist: pillow>=9; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == 'easyocr'
Provides-Extra: pdf
Requires-Dist: pdfplumber>=0.9; extra == 'pdf'
Provides-Extra: pdf2image
Requires-Dist: pdf2image>=1.16; extra == 'pdf2image'
Requires-Dist: pillow>=9; extra == 'pdf2image'
Provides-Extra: pymupdf
Requires-Dist: pymupdf>=1.23; extra == 'pymupdf'
Provides-Extra: tesseract
Requires-Dist: pillow>=9; extra == 'tesseract'
Requires-Dist: pytesseract>=0.3; extra == 'tesseract'
Description-Content-Type: text/markdown

# arabic-extract

Clean Arabic text extraction from PDFs and scanned images — one call, clean output.

Combines PDF text extraction, image OCR, and [arabic-repair](https://pypi.org/project/arabic-repair/)
into a single pipeline. Handles the visual-order problem that breaks standard Arabic NLP pipelines.

## The problem it solves

Arabic PDFs and scanned documents often store text in **visual order** with **presentation-form characters**.
Standard tools (NFKC, CAMeL Tools) map the presentation forms back to base letters but cannot restore the
reversed word order, so retrieval recall stays broken at ~27%. arabic-extract applies arabic-repair automatically,
restoring both letter forms and word order before the text reaches your NLP pipeline.
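
The gap left by NFKC can be reproduced with the standard library alone. The sketch below is illustrative and does not use arabic-extract: `visual` is a hand-built sample (the word كتب written with presentation-form code points and stored in reversed order), chosen here as an assumption about what contaminated input looks like.

```python
import unicodedata

# Logical-order word: KAF, TEH, BEH (base letters).
logical = "\u0643\u062a\u0628"

# Hand-built contaminated sample (an assumption for illustration):
# the same word as presentation forms, stored in reversed order:
# BEH final form, TEH medial form, KAF initial form.
visual = "\ufe90\ufe98\ufedb"

# NFKC maps each presentation form back to its base letter...
normalized = unicodedata.normalize("NFKC", visual)

# ...but leaves the characters in the order they were stored.
still_reversed = normalized == logical[::-1]
matches_logical = normalized == logical

print("base letters restored:", all("\u0600" <= ch <= "\u06ff" for ch in normalized))
print("matches logical text:", matches_logical)
print("still reversed:", still_reversed)
```

In this sample the normalized string contains the right letters in the wrong order, which is the part arabic-repair is meant to fix.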

## Install

```bash
pip install "arabic-extract[pdf]"          # PDF text-layer extraction
pip install "arabic-extract[tesseract]"    # + image OCR via Tesseract (needs binary)
pip install "arabic-extract[easyocr]"      # + image OCR via EasyOCR (pure Python, ~200 MB)
pip install "arabic-extract[pymupdf]"      # + scanned PDF rendering via PyMuPDF
pip install "arabic-extract[all]"          # everything
```

**Tesseract binary** (for the tesseract extra):
- Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki and install the Arabic language pack
- Linux: `sudo apt install tesseract-ocr tesseract-ocr-ara`
- macOS: `brew install tesseract && brew install tesseract-lang`
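
Before selecting the Tesseract engine, it can help to confirm that the binary is on `PATH` and that the Arabic language data is present. A minimal sketch using only the standard library; `tesseract_has_arabic` is a hypothetical helper written for this example, not part of arabic-extract:

```python
import shutil
import subprocess


def tesseract_has_arabic() -> bool:
    """Return True if a tesseract binary is found and lists the 'ara' language."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False  # binary not installed or not on PATH
    proc = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the language list to stderr, so check both.
    langs = (proc.stdout + "\n" + proc.stderr).split()
    return "ara" in langs


if __name__ == "__main__":
    print("Arabic OCR available:", tesseract_has_arabic())
```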

## Quick start

```python
import arabic_extract as aocr

# PDF — auto-detects text layer vs scanned, repairs each page
result = aocr.extract("document.pdf")
print(result.text)           # clean logical Arabic, all pages joined
print(result.pages)          # per-page breakdown
print(result.contamination)  # how many words needed repair

# Scanned image
result = aocr.extract("scan.jpg")
print(result.text)

# Explicit PDF extraction
result = aocr.extract_pdf("document.pdf", engine="tesseract")

# Explicit image extraction
result = aocr.extract_image("scan.png", engine="easyocr")

# Normalize the output for downstream tools such as CAMeL Tools
# (normalize=True is already the default; shown explicitly here)
result = aocr.extract("document.pdf", normalize=True)
```

## How it works

```
Input PDF or image
    │
    ├─ PDF with text layer  → pdfplumber extracts text (visual order)
    │                                     ↓
    ├─ Scanned PDF          → render page as image → OCR engine
    │                                     ↓
    └─ Image file           → OCR engine (Tesseract or EasyOCR)
                                          ↓
                               arabic-repair (de-shape + restore order)
                                          ↓
                               NFKC / CAMeL Tools normalization
                                          ↓
                               Clean logical Arabic text
```

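A single PDF can have mixed pages, some with a text layer and some scanned.
Each page is routed independently: pages with a text layer go through pdfplumber,
and pages without one are rendered as images and sent to the OCR engine.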
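
The per-page routing in the diagram above can be sketched as follows. This is not the library's implementation: `PageOutcome`, `route_page`, and `fake_ocr` are hypothetical names, and the meaning given to `"text_layer_empty"` (no usable text layer and no OCR engine available) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PageOutcome:  # hypothetical container, not the library's class
    page_number: int
    method: str
    text: str


def route_page(
    page_number: int,
    text_layer: str,
    run_ocr: Optional[Callable[[int], str]] = None,
) -> PageOutcome:
    """Pick an extraction path for a single page (illustrative only)."""
    if text_layer.strip():
        # Usable embedded text: keep it and skip OCR.
        return PageOutcome(page_number, "text_layer", text_layer)
    if run_ocr is not None:
        # No embedded text: delegate to the OCR callback for this page.
        return PageOutcome(page_number, "ocr", run_ocr(page_number))
    # Assumption: nothing extractable and no OCR engine configured.
    return PageOutcome(page_number, "text_layer_empty", "")


def fake_ocr(page_number: int) -> str:
    """Stand-in for a real render-and-OCR step."""
    return f"ocr-text-{page_number}"


# Mixed document: page 1 has a text layer, page 2 is a scan (whitespace only).
layers = {1: "نص", 2: "   "}
outcomes = [route_page(n, text, fake_ocr) for n, text in layers.items()]
for outcome in outcomes:
    print(outcome.page_number, outcome.method)
```

In the real pipeline the repair and normalization steps would then run on each page's text regardless of which path produced it.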

## Per-page results

```python
result = aocr.extract("document.pdf")

for page in result.pages:
    print(f"Page {page.page_number} [{page.method}]: {page.text[:80]}")
    # method: "text_layer" | "ocr" | "text_layer_empty"
```
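
The `method` field makes it easy to audit how a document was processed. The sketch below uses stand-in page objects so it runs without a PDF; with a real result, `pages` would be `result.pages`. Only the `page_number` and `method` attributes shown above are assumed.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in pages for illustration; replace with result.pages in real use.
pages = [
    SimpleNamespace(page_number=1, method="text_layer"),
    SimpleNamespace(page_number=2, method="ocr"),
    SimpleNamespace(page_number=3, method="text_layer"),
    SimpleNamespace(page_number=4, method="text_layer_empty"),
]

# How many pages took each extraction path.
method_counts = Counter(page.method for page in pages)

# Pages worth spot-checking by hand, since OCR output tends to be noisier.
ocr_pages = [page.page_number for page in pages if page.method == "ocr"]

print(dict(method_counts))
print("OCR pages:", ocr_pages)
```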

## Ecosystem

| Package | Role |
|---|---|
| [arabic-rt](https://pypi.org/project/arabic-rt/) | Core shaping / fix / unfix engine |
| [arabic-repair](https://pypi.org/project/arabic-repair/) | Detect and repair visual-order contamination |
| [arabic-extract](https://pypi.org/project/arabic-extract/) | Full PDF + image extraction pipeline |
| [arabic-benchmark](https://github.com/balswyan/arabic-benchmark) | Benchmark proving the reordering gap |

## License

MPL-2.0
