Metadata-Version: 2.4
Name: rtm-pdf-processor
Version: 0.1.2
Summary: PDF to image and text extraction utility using Tesseract OCR
Author: Raviteja Mulukuntla
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pdf2image
Requires-Dist: pytesseract
Requires-Dist: tqdm

# RTM PDF Processor

Developed and maintained by Raviteja Mulukuntla.

A Python library that converts PDFs to images and extracts their text with Tesseract OCR:
- Before calling `loadPdfs`, create a `PDFS` directory in the project root and place your PDF files in it.
- Text is extracted in Hindi and English.
- Extracted text is saved as `.txt` files in a folder named `PdfTextFiles`.
- The full text of a file is also returned by `text = pdfProcessor.extractTextFromPdf(pdf)`.
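
Hindi and English OCR depends on tools outside Python: `pytesseract` calls the `tesseract` executable, and `pdf2image` calls Poppler's `pdftoppm`. The sketch below is one way to check for them before processing. The helper names are hypothetical and not part of `rtm_pdf_processor`, and the assumption that the library needs the `hin` and `eng` Tesseract language packs is inferred from the feature list, not confirmed by the library itself.

```python
import shutil
import subprocess


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def installed_tesseract_languages():
    """Return the language codes reported by `tesseract --list-langs`.

    Returns an empty list if the tesseract executable is not found.
    """
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the Tesseract version, the list is printed to stdout or stderr.
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header, the remaining lines are language codes.
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
    languages = installed_tesseract_languages()
    # Assumption: Hindi + English extraction needs the "hin" and "eng" packs.
    for code in ("hin", "eng"):
        status = "found" if code in languages else "not found"
        print(f"Tesseract language '{code}': {status}")
```

If a tool or language pack is reported missing, it has to be installed through the operating system's package manager; `pip install` only installs the Python wrappers.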

## Usage Guide
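
Because `loadPdfs` expects the `PDFS` directory to exist already, it can help to prepare the working directories first. The following is a minimal sketch using only the standard library; `prepare_dirs` and `list_pdfs` are hypothetical helpers, not part of `rtm_pdf_processor`. Whether the library creates the image and text folders on its own is not stated, so creating them here is only a precaution.

```python
from pathlib import Path


def prepare_dirs(pdfs_dir="./PDFS", images_dir="PdfImages", texts_dir="PdfTextFiles"):
    """Create the three working directories if they are missing; return them as Paths."""
    paths = [Path(pdfs_dir), Path(images_dir), Path(texts_dir)]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


def list_pdfs(pdfs_dir):
    """Return the PDF files in pdfs_dir, sorted, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(pdfs_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    pdfs, images, texts = prepare_dirs()
    found = list_pdfs(pdfs)
    if not found:
        print(f"No PDF files in {pdfs}; copy some there before calling loadPdfs.")
    else:
        print(f"{len(found)} PDF file(s) ready in {pdfs}")
```

With the directories in place, the library itself is used as follows.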

```python
from rtm_pdf_processor import RTMPdfProcessor
import os

pdfs_dir = os.getenv("PDFS_DIR", "./PDFS")
images_dir = os.getenv("Images_DIR", "PdfImages")
texts_dir = os.getenv("Texts_DIR", "PdfTextFiles")

pdfProcessor = RTMPdfProcessor(pdfs_dir, images_dir, texts_dir)

# Create the PDFS directory in the project root and place your PDF files in it before calling loadPdfs
pdf_files = pdfProcessor.loadPdfs()

for pdf in pdf_files:
    pdfContent = pdfProcessor.readPdf(pdf)
    print(f"Read {len(pdfContent)} bytes from {pdf}")
    text = pdfProcessor.extractTextFromPdf(pdf, True, True)
    print("*" * 50)
    print(f"Extracted text from {pdf}:")
    print(text)
```

## Installation

```bash
pip install rtm-pdf-processor
```
