Metadata-Version: 2.4
Name: autopdfparse
Version: 0.1.1
Summary: A Python package for parsing PDF documents using AI vision models
Project-URL: Homepage, https://github.com/ChidiRnweke/AutoPDFParse
Project-URL: Issues, https://github.com/ChidiRnweke/AutoPDFParse/issues
Author-email: Chidi Nweke <chidi125@gmail.com>
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: pillow>=11.2.1
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.25.5
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.51.0; extra == 'anthropic'
Requires-Dist: json-repair>=0.44.1; extra == 'anthropic'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.15.0; extra == 'gemini'
Provides-Extra: openai
Requires-Dist: openai>=1.79.0; extra == 'openai'
Description-Content-Type: text/markdown

# AutoPDFParse

RAG (Retrieval-Augmented Generation) is a powerful technique that combines the strengths of large language models (LLMs) with external knowledge sources to improve the quality and relevance of generated content. One key challenge in RAG is effectively parsing and understanding complex documents, such as PDFs, which often contain a mix of text, images, tables, and other layout-dependent content. This is where AutoPDFParse comes in.

AutoPDFParse is a Python package designed to simplify the process of parsing PDF documents using multimodal LLMs. It leverages the capabilities of advanced AI models to automatically detect layout-dependent content and extract relevant information, making it easier to work with complex documents.

It works in a two-step process:
1. **Visual Parsing**: Each page of the PDF is converted to an image and sent to a vision model (like OpenAI's GPT-4 Vision) to determine if the content is layout-dependent.
2. **Content Description**: If the content is layout-dependent, the image is sent to a description model (like OpenAI's GPT-4) to extract the text and other relevant information. If not, the raw text is extracted directly from the PDF.
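
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: `PageInput`, `parse_page`, and `StubService` are hypothetical names, and only the method names `is_layout_dependent` and `describe_image_content` are taken from the `VisionService` protocol shown in the custom provider section of this document.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class VisionService(Protocol):
    """The two methods a provider is expected to implement."""

    async def is_layout_dependent(self, image: str) -> bool: ...
    async def describe_image_content(self, image: str) -> str: ...


@dataclass
class PageInput:
    """Hypothetical container for one page."""

    raw_text: str   # text extracted directly from the PDF page
    image_b64: str  # base64-encoded rendering of the same page


async def parse_page(service: VisionService, page: PageInput) -> tuple[str, bool]:
    """Return (content, from_llm) for a single page."""
    # Step 1: ask the vision model whether layout matters on this page.
    if await service.is_layout_dependent(page.image_b64):
        # Step 2: layout matters, so let the model describe the page image.
        return await service.describe_image_content(page.image_b64), True
    # Otherwise fall back to the raw text.
    return page.raw_text, False


class StubService:
    """Toy stand-in: any image string starting with 'table' counts as layout-dependent."""

    async def is_layout_dependent(self, image: str) -> bool:
        return image.startswith("table")

    async def describe_image_content(self, image: str) -> str:
        return f"[described: {image}]"


def demo() -> list[tuple[str, bool]]:
    """Run both branches once with the stub service."""
    pages = [
        PageInput(raw_text="plain paragraph", image_b64="text-page"),
        PageInput(raw_text="a b c", image_b64="table-page"),
    ]

    async def run() -> list[tuple[str, bool]]:
        service = StubService()
        return [await parse_page(service, page) for page in pages]

    return asyncio.run(run())
```

Only pages flagged as layout-dependent incur a second model call; every other page uses the text already present in the PDF.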

This package is designed to work with multiple AI providers, including OpenAI, Google Gemini, and Anthropic Claude. It provides a simple and extensible interface for parsing PDF documents, making it easy to integrate into your projects.

## Features

- Automatic detection of layout-dependent content
- Visual parsing for complex documents with tables, charts, etc.
- Fallback to raw text extraction when appropriate
- Support for multiple AI providers (OpenAI, Anthropic Claude, Google Gemini)
- Easy extension with custom providers (just implement 2 functions!)
- Customizable system prompts for fine-tuning AI responses
- Built-in rate limiting for API requests
- Error handling and custom exceptions
- Async API for better performance, plus a mirrored synchronous API

## Installation

Basic installation:

```bash
pip install autopdfparse
```

With specific AI provider support:

```bash
# OpenAI
pip install "autopdfparse[openai]"

# Google Gemini
pip install "autopdfparse[gemini]"

# Anthropic Claude
pip install "autopdfparse[anthropic]"

# All providers
pip install "autopdfparse[openai,gemini,anthropic]"
```

## Usage

### Basic Example with OpenAI

```python
import asyncio
from autopdfparse import OpenAIParser

async def main():
    # Create the parser
    parser = await OpenAIParser.get_parser(
        api_key="your_openai_api_key"
    )

    # Parse the document from a file path
    result = await parser.parse_file("path/to/document.pdf")
    # or parse from bytes
    with open("path/to/document.pdf", "rb") as f:
        pdf_bytes = f.read()
        result = await parser.parse_bytes(pdf_bytes)
    
    # Get all content concatenated
    all_content = result.get_all_content()
    print(all_content)

if __name__ == "__main__":
    asyncio.run(main())
```

### Using Google Gemini

```python
import asyncio
from autopdfparse import GeminiParser

async def use_gemini():
    parser = await GeminiParser.get_parser(api_key="your_google_api_key")
    
    result = await parser.parse_file("path/to/document.pdf")
    # or parse from bytes
    with open("path/to/document.pdf", "rb") as f:
        pdf_bytes = f.read()
        result = await parser.parse_bytes(pdf_bytes)

    # Get all content concatenated
    print(result.get_all_content())
```

### Using Anthropic Claude

```python
import asyncio
from autopdfparse import AnthropicParser

async def use_claude():
    parser = await AnthropicParser.get_parser(api_key="your_anthropic_api_key")
    
    result = await parser.parse_file("path/to/document.pdf")
    # or parse from bytes
    with open("path/to/document.pdf", "rb") as f:
        pdf_bytes = f.read()
        result = await parser.parse_bytes(pdf_bytes)
    print(result.get_all_content())
```

### Configuring Rate Limits

You can configure the maximum number of concurrent API requests to avoid rate limiting issues:

```python
from autopdfparse import OpenAIParser

# Set the maximum number of concurrent API requests
OpenAIParser.MAX_CONCURRENT_REQUESTS = 5

# Then proceed with parsing
parser = await OpenAIParser.get_parser(...)
```
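
A cap on concurrent requests is commonly enforced with an `asyncio.Semaphore`. The sketch below illustrates that general technique with fake API calls. It is an assumption about how such a limit can work, not a description of AutoPDFParse's internals, and `run_limited` and `fake_api_call` are hypothetical names.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 5  # hypothetical limit, mirroring the setting above


async def run_limited(n_tasks: int, limit: int = MAX_CONCURRENT_REQUESTS) -> int:
    """Run n_tasks fake API calls and return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_api_call() -> None:
        nonlocal in_flight, peak
        # At most `limit` coroutines can be inside this block at once.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for a network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_api_call() for _ in range(n_tasks)))
    return peak
```

A lower limit trades throughput for a smaller chance of hitting provider rate limits on documents with many pages.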

### Accessing Individual Pages

```python
async def process_pages(parser):
    result = await parser.parse_file("path/to/document.pdf")
    
    # Access individual pages
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(page.content)
        print(f"Generated by LLM: {page._from_llm}")
```

### Custom Model Selection

```python
# For OpenAI
openai_parser = await OpenAIParser.get_parser(
    api_key="your_openai_api_key",
    description_model="gpt-4.1",     # Model for content description
    visual_model="gpt-4.1-mini",     # Model for layout detection
)

# For Gemini
gemini_parser = await GeminiParser.get_parser(
    api_key="your_google_api_key",
    description_model="gemini-1.5-pro",
    visual_model="gemini-1.5-flash"
)

# For Anthropic
anthropic_parser = await AnthropicParser.get_parser(
    api_key="your_anthropic_api_key",
    description_model="claude-3-opus-20240229",
    visual_model="claude-3-haiku-20240307"
)
```

### Error Handling

```python
from autopdfparse import OpenAIParser, PDFParsingError, APIError, ModelError

async def handle_errors():
    try:
        parser = await OpenAIParser.get_parser(api_key="your_openai_api_key")
        result = await parser.parse_file("path/to/document.pdf")
    except ModelError as e:
        print(f"Model error (missing dependency?): {e}")
    except PDFParsingError as e:
        print(f"PDF parsing error: {e}")
    except APIError as e:
        print(f"API error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
```

### Custom Providers and Prompts

#### Customizing Prompts

All providers use default system prompts for content detection and layout analysis. You can customize these prompts when instantiating a parser:

```python
from autopdfparse import OpenAIParser
from autopdfparse.default_prompts import describe_image_system_prompt

# Create a modified prompt
my_custom_prompt = describe_image_system_prompt + "\nAdditional instructions: Focus on extracting tabular data with precision."

parser = await OpenAIParser.get_parser(
    api_key="your_openai_api_key",
    description_prompt=my_custom_prompt  # Custom prompt for content description
)
```

#### Implementing a Custom Provider

You can create your own provider by implementing the `VisionService` protocol and using it with the `PDFParser` class. This typically involves creating your `VisionService` implementation and then a helper class or factory function to instantiate `PDFParser` with your service.

```python
import asyncio
from autopdfparse.services import VisionService, PDFParser
from dataclasses import dataclass


@dataclass
class CustomVisionService(VisionService):
    """A simple custom vision service example.
    It might take arguments in its __init__ method, e.g., API keys or configurations.
    For this example, it takes no arguments.
    """
    
        
    async def describe_image_content(self, image: str) -> str:
        """
        Custom implementation of image content description.
        
        Args:
            image: Base64 encoded image
            
        Returns:
            Description of the image content
        """
        # Your custom logic here - this example just returns a placeholder
        # If CustomVisionService had an API key, it would be used here.
        return "Hello world! This is a custom image description."
    
    async def is_layout_dependent(self, image: str) -> bool:
        """
        Custom implementation for detecting layout-dependent content.
        
        Args:
            image: Base64 encoded image
            
        Returns:
            True if layout-dependent, False otherwise
        """
        # Your custom logic here - this example always returns False
        return False


    @classmethod
    def get_parser(cls, **kwargs_for_vision_service) -> PDFParser:
        """
        Creates and returns a PDFParser instance configured with CustomVisionService.
        The PDF is not loaded at this stage.
        
        Args:
            **kwargs_for_vision_service: Arguments to pass to CustomVisionService constructor.
        
        Returns:
            A PDFParser instance.
        """
        # Instantiate your custom vision service
        # Pass any necessary arguments to CustomVisionService here
        vision_service = CustomVisionService(**kwargs_for_vision_service)
        
        # Return a PDFParser instance configured with your custom vision service
        return PDFParser(vision_service=vision_service)


# Usage
async def main():
    # Build a parser backed by the custom vision service
    custom_parser = CustomVisionService.get_parser()

    result = await custom_parser.parse_file("path/to/document.pdf")
    print(result.get_all_content())

if __name__ == "__main__":
    asyncio.run(main())
```

You can implement more sophisticated integrations with other AI providers or your own models by following this pattern.
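
As one illustration of a more elaborate provider, the sketch below wraps another vision service and caches its answers per image, so that a repeated page does not trigger a second model call. `CachingVisionService` is a hypothetical name and is not part of AutoPDFParse. The sketch only assumes the two async methods shown above.

```python
import asyncio
import hashlib


class CachingVisionService:
    """Hypothetical wrapper that memoizes the results of another vision service."""

    def __init__(self, inner) -> None:
        # `inner` is any object exposing the two async VisionService methods.
        self._inner = inner
        self._layout_cache: dict[str, bool] = {}
        self._description_cache: dict[str, str] = {}

    @staticmethod
    def _key(image: str) -> str:
        # Hash the base64 string so cache keys stay small.
        return hashlib.sha256(image.encode("utf-8")).hexdigest()

    async def is_layout_dependent(self, image: str) -> bool:
        key = self._key(image)
        if key not in self._layout_cache:
            self._layout_cache[key] = await self._inner.is_layout_dependent(image)
        return self._layout_cache[key]

    async def describe_image_content(self, image: str) -> str:
        key = self._key(image)
        if key not in self._description_cache:
            self._description_cache[key] = await self._inner.describe_image_content(image)
        return self._description_cache[key]
```

An instance could then be passed as `vision_service` to `PDFParser`, as in the example above, assuming `PDFParser` accepts any object that satisfies the protocol.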

### Synchronous API

In addition to the asynchronous API, AutoPDFParse also provides a synchronous API for environments where async/await cannot be used. The synchronous API mirrors the async API but uses blocking calls.

```python
from autopdfparse.sync import OpenAIParser

# Create a parser
parser = OpenAIParser.get_parser(api_key="your_openai_api_key")

# Parse the document (no await needed)
result = parser.parse_file("path/to/document.pdf")
# or parse from bytes
with open("path/to/document.pdf", "rb") as f:
    pdf_bytes = f.read()
    result = parser.parse_bytes(pdf_bytes)

# Access the content
content = result.get_all_content()
print(content)
```

The synchronous API supports all the same features as the asynchronous API.


Creating a custom provider with the synchronous API:

```python
from autopdfparse.sync.services import VisionService as SyncVisionService
from autopdfparse.sync.services import PDFParser as SyncPDFParser

class CustomSyncVisionService(SyncVisionService):
    """A simple synchronous custom vision service."""
    
    # Example constructor:
    # def __init__(self, custom_api_key: str):
    #     self.api_key = custom_api_key

    def describe_image_content(self, image: str) -> str:
        # Synchronous implementation
        return "Image description here (sync)"
    
    def is_layout_dependent(self, image: str) -> bool:
        # Synchronous implementation
        return False

    @classmethod
    def get_parser(cls, **kwargs_for_vision_service) -> SyncPDFParser:
        """
        Creates and returns a synchronous PDFParser instance configured 
        with CustomSyncVisionService. The PDF is not loaded at this stage.
        
        Args:
            **kwargs_for_vision_service: Arguments to pass to CustomSyncVisionService constructor.

        Returns:
            A synchronous PDFParser instance.
        """
        vision_service = CustomSyncVisionService(**kwargs_for_vision_service)
        return SyncPDFParser(vision_service=vision_service)

```

## Requirements

- Python 3.12+
- API key for your chosen provider (OpenAI, Google, or Anthropic)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. We especially welcome contributions for new providers, new default prompts, etc.
