Metadata-Version: 2.4
Name: arxiv-appendix-extractor
Version: 0.1.0
Summary: A tool to download arXiv papers, intelligently extract appendices with math formulas, and manage source files.
Author-email: Your Name <you@example.com>
License: MIT
Project-URL: Homepage, https://github.com/your-username/arxiv-appendix-extractor
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: lxml
Requires-Dist: feedparser

# arXiv Appendix Extractor

A command-line tool that downloads the LaTeX source files of arXiv papers, extracts their appendix sections, and cleans up the source files of papers that have no appendix.

This tool is ideal for researchers and data scientists who need to programmatically analyze the mathematical derivations and proofs often found in the appendices of scientific papers.

## Features

-   **Download by Date or ID**: Fetch papers from a specific date and category, or by a list of individual arXiv IDs.
-   **Intelligent Appendix Extraction**: Uses a hybrid approach:
    1.  **HTML-Assisted**: Prefers high-quality HTML from ar5iv.org for accurate section detection.
    2.  **Heuristic Fallback**: If HTML is unavailable, falls back to heuristics that locate appendices in the raw LaTeX source.
-   **Robust and Resumable**: Maintains a progress file to automatically resume interrupted jobs, saving time and bandwidth.
-   **Parallel Processing**: Processes papers in parallel across multiple CPU cores (set with `--workers`), which speeds up large jobs.
-   **Smart Cleanup**: Includes a separate utility to safely remove the source files of papers that were found to have no appendix, saving disk space.
-   **Easy to Use**: Packaged as a simple command-line tool.
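
To make the heuristic fallback listed above more concrete, here is a minimal sketch of how an appendix might be located in raw LaTeX. It is not the tool's actual implementation: the function name `extract_appendix` and the marker patterns are assumptions chosen for illustration, and a real heuristic would likely need to handle `\input` files and more marker variants.

```python
import re
from typing import Optional

# Hypothetical appendix markers; the tool's real heuristics may differ.
# Anchoring at the start of a line skips commented-out markers ("% \appendix").
APPENDIX_PATTERNS = [
    re.compile(r"^[ \t]*\\appendix\b", re.MULTILINE),
    re.compile(r"^[ \t]*\\begin\{appendices\}", re.MULTILINE),
    re.compile(
        r"^[ \t]*\\section\*?\{\s*(?:Appendix|Appendices)\b",
        re.MULTILINE | re.IGNORECASE,
    ),
]
END_PATTERN = re.compile(r"\\end\{document\}")


def extract_appendix(latex_source: str) -> Optional[str]:
    """Return the text from the first appendix marker up to \\end{document}.

    Returns None when no marker is found.
    """
    starts = []
    for pattern in APPENDIX_PATTERNS:
        match = pattern.search(latex_source)
        if match:
            starts.append(match.start())
    if not starts:
        return None

    start = min(starts)  # earliest marker wins
    end_match = END_PATTERN.search(latex_source, start)
    end = end_match.start() if end_match else len(latex_source)
    return latex_source[start:end].strip()
```

Calling `extract_appendix(source)` on a document that contains `\appendix` returns everything from that marker onward, formulas included; a document without any marker yields `None`, which is the case the cleanup utility is meant for.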

## Installation

```bash
# Available once the package has been built and published to PyPI
pip install arxiv-appendix-extractor
```

## Usage

### Main Extractor Tool

The main tool is `arxiv-extractor`. You must provide either `--date` or `--ids`.

**1. Process papers from a specific date:**

This command processes the first 10 papers from the `cs` (Computer Science) category on August 15, 2025, using 4 parallel workers.

```bash
arxiv-extractor --date 2025-08-15 --category cs --max-papers 10 --workers 4
```

**2. Process specific papers by ID:**

```bash
arxiv-extractor --ids 2310.06825 2403.00318
```
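
As a rough illustration of what an ID-based run has to resolve, the sketch below validates a new-style arXiv ID and builds the two URLs the hybrid approach would need: the source archive on arxiv.org and the HTML rendering on ar5iv.org. The helper name `arxiv_urls` is hypothetical, the URL layout is an assumption about how the tool fetches its inputs, and old-style IDs (such as `math/0301001`) and version suffixes are deliberately not handled here.

```python
import re

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN (no version suffix in this sketch).
NEW_STYLE_ID = re.compile(r"^\d{4}\.\d{4,5}$")


def arxiv_urls(arxiv_id: str) -> dict:
    """Return the source and HTML URLs for a new-style arXiv ID (hypothetical helper)."""
    if not NEW_STYLE_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    return {
        # Source archive (usually a gzipped tarball of the LaTeX files).
        "source": f"https://arxiv.org/e-print/{arxiv_id}",
        # ar5iv mirrors arXiv paths under its own domain.
        "html": f"https://ar5iv.org/abs/{arxiv_id}",
    }


if __name__ == "__main__":
    for paper_id in ["2310.06825", "2403.00318"]:
        print(paper_id, arxiv_urls(paper_id))
```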

All results and intermediate files will be saved in the `pipeline_output_refactored/` directory by default.
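
The resumable behavior described under Features depends on a progress file kept alongside these outputs. The sketch below shows one way such a file could work; the file format (a JSON list of finished IDs) and the function names are assumptions for illustration, not the tool's actual on-disk layout.

```python
import json
import os


def load_done(progress_path):
    """Return the set of paper IDs already recorded as finished."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, encoding="utf-8") as fh:
        return set(json.load(fh))


def mark_done(progress_path, done, paper_id):
    """Record one finished paper, writing via a temp file so a crash
    mid-write cannot leave a truncated progress file behind."""
    done.add(paper_id)
    tmp_path = progress_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, progress_path)


def run(paper_ids, progress_path, process):
    """Process only the papers not yet recorded; return the IDs handled now."""
    done = load_done(progress_path)
    handled = []
    for paper_id in paper_ids:
        if paper_id in done:
            continue  # finished in an earlier run
        process(paper_id)
        mark_done(progress_path, done, paper_id)
        handled.append(paper_id)
    return handled
```

Recording progress after every paper, rather than once at the end, is what lets an interrupted job pick up where it stopped without downloading the same sources again.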

### Cleanup Tool

After running the extractor, you can clean up the source files for papers that had no appendices.

The `arxiv-cleanup` tool requires the report file and the papers directory generated by the extractor.

**1. Perform a "dry run" (highly recommended first):**

This will show you which directories would be deleted, without actually deleting anything.

```bash
arxiv-cleanup \
  --report-file pipeline_output_refactored/final_appendix_results.json \
  --papers-dir pipeline_output_refactored/papers \
  --dry-run
```
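
For readers curious what a dry run amounts to, here is a minimal sketch of the selection logic. The report schema is an assumption (a JSON object mapping each paper ID to an entry with a boolean `has_appendix` field), as is the one-directory-per-paper layout; the real `final_appendix_results.json` may be structured differently.

```python
import json
import shutil
from pathlib import Path


def cleanup(report_file, papers_dir, dry_run=True):
    """Return the paper directories that have no appendix.

    The directories are deleted only when dry_run is False.
    """
    report = json.loads(Path(report_file).read_text(encoding="utf-8"))
    root = Path(papers_dir).resolve()
    targets = []
    for paper_id, entry in report.items():
        if entry.get("has_appendix"):  # assumed field name
            continue
        candidate = (root / paper_id).resolve()
        # Safety check: only touch direct children of the papers directory.
        if candidate.parent != root or not candidate.is_dir():
            continue
        targets.append(candidate)
        if not dry_run:
            shutil.rmtree(candidate)
    return targets


if __name__ == "__main__":
    for path in cleanup(
        "pipeline_output_refactored/final_appendix_results.json",
        "pipeline_output_refactored/papers",
        dry_run=True,
    ):
        print("would delete:", path)
```

Defaulting `dry_run` to `True` mirrors the advice above: listing the candidates first makes an accidental deletion much less likely.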

**2. Execute the cleanup:**

This will prompt you for confirmation before deleting files.

```bash
arxiv-cleanup \
  --report-file pipeline_output_refactored/final_appendix_results.json \
  --papers-dir pipeline_output_refactored/papers
```

To skip the confirmation prompt (e.g., in a script), add the `-y` flag.
