Metadata-Version: 2.4
Name: mark-everything
Version: 0.1.0
Summary: Convert any document (.docx, .pptx, .xlsx, .eml, .msg, .pdf) to clean Markdown.
Author: MarkEverything Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/your-org/mark-everything
Project-URL: Repository, https://github.com/your-org/mark-everything
Project-URL: Bug Tracker, https://github.com/your-org/mark-everything/issues
Keywords: markdown,docx,pptx,xlsx,pdf,email,document-conversion,office,word,powerpoint,excel
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Office/Business
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: python-docx>=1.1
Requires-Dist: python-pptx>=0.6
Requires-Dist: openpyxl>=3.1
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: Pillow>=10.0
Provides-Extra: msg
Requires-Dist: extract-msg>=0.48; extra == "msg"
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.23; extra == "pdf"
Provides-Extra: all
Requires-Dist: extract-msg>=0.48; extra == "all"
Requires-Dist: pymupdf>=1.23; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# MarkEverything

Convert any Office document or email into clean, structured Markdown — preserving headings, tables, images, diagrams, and meaning.

| Format | What is extracted |
|--------|-------------------|
| `.docx` | Headings, paragraphs, bold/italic, tables, hyperlinks, embedded images, SmartArt → Mermaid, OMML equations → LaTeX |
| `.pptx` | Per-slide sections, spatial reading order, connected shapes → Mermaid, embedded images, speaker notes |
| `.xlsx` | Sheet tables (used-range only), bold cells, Bar/Pie charts → Mermaid |
| `.eml` | YAML frontmatter headers, HTML/plain body, inline images, attachment list |
| `.msg` | Same as `.eml` via `extract-msg` |
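
As a rough illustration of the chart → Mermaid idea in the `.xlsx` row, the sketch below renders label/value pairs as a Mermaid `pie` diagram body. The helper name `pie_to_mermaid` and its signature are hypothetical and not part of MarkEverything's API; the library's actual output may be formatted differently.

```python
def pie_to_mermaid(title: str, slices: dict[str, float]) -> str:
    """Render label/value pairs as a Mermaid pie diagram (hypothetical helper)."""
    lines = [f"pie title {title}"]
    for label, value in slices.items():
        # Mermaid pie syntax: a double-quoted label, a colon, then a number.
        # Double quotes inside the label would end it early, so swap them out.
        safe_label = label.replace('"', "'")
        lines.append(f'    "{safe_label}" : {value:g}')
    return "\n".join(lines)


print(pie_to_mermaid("Revenue by region", {"EMEA": 42.5, "APAC": 30, "Americas": 27.5}))
```

Wrapping the returned text in a `mermaid` code fence makes it renderable by Markdown viewers that support Mermaid.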

---

## Requirements

- Python 3.10+
- See `requirements.txt` for library dependencies

---

## Installation

### From source (development)

```bash
git clone https://github.com/your-org/mark-everything.git
cd mark-everything
pip install -e ".[dev]"
```

### Runtime only

```bash
pip install -e .
```

### With `.msg` support

```bash
pip install -e ".[msg]"
```

---

## Quick Start

### Python API

```python
from mark_everything import convert
from mark_everything.models import ConversionConfig, ImageMode, MermaidDepth

# Minimal — all defaults (images embedded as base64)
result = convert("report.docx")
print(result.markdown)

# Full config
config = ConversionConfig(
    image_mode=ImageMode.PATH,       # save images to disk
    output_dir="output/assets",      # where to save them
    mermaid_depth=MermaidDepth.FULL, # attempt complex SmartArt
    exclude_sheets=["Internal"],     # skip this Excel sheet
    include_speaker_notes=True,      # include PPT speaker notes
)
result = convert("presentation.pptx", config)

# Inspect result
print(result.markdown)           # the Markdown string
print(result.asset_paths)        # list[Path] of saved images
print(result.warnings)           # list[str] of non-fatal issues
```
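
For converting a whole folder, one possible pattern is sketched below. It is a sketch under assumptions, not part of the package: `convert_tree` is a hypothetical helper, and it only relies on the converter being callable with a path string and returning an object with a `.markdown` attribute, as in the example above. The converter is passed in as a parameter so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Callable

# Suffixes taken from the format table; `.msg` needs the optional extra.
SUPPORTED_SUFFIXES = {".docx", ".pptx", ".xlsx", ".eml", ".msg"}


def convert_tree(src: Path, dst: Path, convert_fn: Callable) -> list[Path]:
    """Convert every supported file under src, mirroring its layout under dst.

    convert_fn is assumed to behave like mark_everything.convert: it takes a
    path string and returns an object with a .markdown string attribute.
    """
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        result = convert_fn(str(path))
        # Keep the original suffix (report.docx.md) so that report.docx and
        # report.pptx in the same folder do not overwrite each other.
        target = dst / path.relative_to(src)
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(result.markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, `convert_tree(Path("docs"), Path("out"), convert)` would use the real `convert` imported in the Quick Start.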

### Command-Line Interface

```bash
# Basic — print Markdown to stdout
mark-everything report.docx

# Write to file
mark-everything report.docx -o output/report.md

# Save images as files instead of base64
mark-everything report.docx --image-mode path --output-dir output/assets -o output/report.md

# PowerPoint — omit speaker notes
mark-everything deck.pptx --no-speaker-notes -o output/deck.md

# Excel: skip certain sheets, full Mermaid depth
mark-everything data.xlsx --exclude-sheets Internal Scratch --mermaid-depth full -o output/data.md

# Verbose logging
mark-everything email.eml -v -o output/email.md
```
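
To make the flag semantics concrete, here is a minimal `argparse` sketch that accepts the invocations shown above. It is not the project's actual `cli.py`; the long names `--output` and `--verbose` and the attribute names are assumptions, since the examples only show `-o` and `-v`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the flags used in the CLI examples."""
    parser = argparse.ArgumentParser(prog="mark-everything")
    parser.add_argument("input", help="document to convert")
    # Assumed long forms: the examples only show the short -o and -v flags.
    parser.add_argument("-o", "--output", default=None,
                        help="write Markdown to this file instead of stdout")
    parser.add_argument("--image-mode", choices=["base64", "path"], default="base64")
    parser.add_argument("--output-dir", default=None,
                        help="asset directory used with --image-mode path")
    parser.add_argument("--mermaid-depth", choices=["simple", "full"], default="simple")
    # nargs="+" lets several sheet names follow a single flag.
    parser.add_argument("--exclude-sheets", nargs="+", default=[], metavar="SHEET")
    # Speaker notes are on by default, so the flag switches them off.
    parser.add_argument("--no-speaker-notes", dest="include_speaker_notes",
                        action="store_false")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["data.xlsx", "--exclude-sheets", "Internal", "Scratch",
     "--mermaid-depth", "full", "-o", "output/data.md"]
)
print(args.exclude_sheets, args.mermaid_depth, args.output)
```

The defaults in this sketch mirror the Configuration Reference table, so an invocation with no flags corresponds to `convert(path)` with a default config.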

---

## Configuration Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `image_mode` | `"base64"` \| `"path"` | `"base64"` | Embed images as data-URIs or save to disk |
| `output_dir` | `Path \| None` | `None` | Asset directory when `image_mode="path"` |
| `mermaid_depth` | `"simple"` \| `"full"` | `"simple"` | SmartArt / connected-shape conversion depth |
| `exclude_sheets` | `list[str]` | `[]` | Excel sheet names to skip |
| `include_speaker_notes` | `bool` | `True` | Include PPT speaker notes as `> blockquotes` |

---

## Project Structure

```
mark_everything/
├── __init__.py          # Public API: convert()
├── engine.py            # Central router
├── models.py            # Pydantic config + result models
├── media.py             # Image encode / save helpers
├── mermaid.py           # Mermaid diagram builder
├── cli.py               # CLI entry-point
└── converters/
    ├── docx.py          # Word converter
    ├── pptx.py          # PowerPoint converter
    ├── xlsx.py          # Excel converter
    └── email.py         # Email (.eml / .msg) converter
```

---

## Design Principles

- **Pure functions**: all helpers are stateless; the same input always yields the same output.
- **Pydantic models**: `ConversionConfig` is validated and frozen; `ConversionResult` is immutable.
- **Structured logging**: `logging.getLogger(__name__)` throughout; no `print()` in library code.
- **No resource leaks**: all file handles and workbooks are opened inside `with` or `try`/`finally` blocks.
- **PEP 8 compliant**: enforced via `ruff` (configured in `pyproject.toml`).
- **Lazy imports**: heavy converters are imported only when their file type is actually requested.
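
The lazy-import principle could look roughly like the following. This is a sketch, not the contents of `engine.py`: the module paths follow the project structure listed above, but `load_converter`, its `importer` parameter, and the dispatch-by-suffix table are assumptions made for illustration.

```python
import importlib
from pathlib import Path
from types import ModuleType
from typing import Callable

# Module paths follow the Project Structure section; how engine.py actually
# dispatches is not shown in this README, so treat this table as hypothetical.
_CONVERTER_MODULES = {
    ".docx": "mark_everything.converters.docx",
    ".pptx": "mark_everything.converters.pptx",
    ".xlsx": "mark_everything.converters.xlsx",
    ".eml": "mark_everything.converters.email",
    ".msg": "mark_everything.converters.email",
}


def load_converter(
    path: str,
    importer: Callable[[str], ModuleType] = importlib.import_module,
) -> ModuleType:
    """Import the converter module for path only when it is first needed."""
    suffix = Path(path).suffix.lower()
    try:
        module_name = _CONVERTER_MODULES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path!r}") from None
    # importlib.import_module caches modules in sys.modules, so repeated
    # calls for the same file type do not re-import anything.
    return importer(module_name)
```

Because nothing is imported at module load time, a user who only converts `.docx` files never pays the import cost of the PowerPoint or Excel libraries.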

---

## License

MIT
