Metadata-Version: 2.4
Name: anyformat
Version: 0.6.0
Summary: Python SDK for the AnyFormat API
Author-email: Anyformat <info@anyformat.ai>
Requires-Python: <3.14,>=3.13
Requires-Dist: httpx>=0.27.0
Requires-Dist: pillow>=12.2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pypdfium2>=4.30.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.15
Description-Content-Type: text/markdown

# anyformat

Python SDK for the [AnyFormat](https://anyformat.ai) API. Build a document
workflow with a fluent builder, run a file through it, and read typed results —
including drawing the parser's layout boxes back onto the page.

```bash
pip install anyformat
```

```python
from anyformat.sdk import Client
from anyformat.workflow import Schema

client = Client(api_key="af_...")  # or set ANYFORMAT_API_KEY

workflow = (
    client.workflow("Invoices")
    .parse()
    .extract([Schema.string("vendor", "Vendor name on the invoice.")])
    .create()
)

result = workflow.run("invoice.pdf").wait()
print(result.fields["vendor"].value)        # -> "Acme Corp"
result.parse.draw("out/")                    # -> out/page_1.png with boxes drawn
```

The client reads `ANYFORMAT_API_KEY` from the environment if you omit `api_key`.
`base_url` defaults to `https://api.anyformat.ai`.

---

## Concepts in 30 seconds

- **`client.workflow(name)`** returns a fluent builder. Chain verbs
  (`.parse()`, `.classify()`, `.split()`, `.extract()`, `.validate()`) then
  `.create()` to register it and get a `Workflow`.
- **`workflow.run(file)`** uploads a file and returns a `Run`. **`run.wait()`**
  polls until processing finishes and returns a `Result`.
- **`result.fields`** is the extracted data; **`result.parse`** is the parser's
  output (markdown + per-block layout); **`result.raw`** is the full response.
- Each node's behaviour lives on its own result view — `draw()` is a method of
  `result.parse` because only the parser produces layout boxes.
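
To make the "polls until processing finishes" step concrete, here is a minimal sketch of a polling loop. It is not the SDK's implementation: `wait_until_done`, `get_status`, and the status strings are assumptions made for illustration. The defaults mirror the `timeout=300, poll_interval=3` shown in the quick reference.

```python
import time
from typing import Callable


def wait_until_done(
    get_status: Callable[[], str],
    timeout: float = 300,
    poll_interval: float = 3,
) -> str:
    """Poll get_status() until it reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in {"done", "failed"}:  # hypothetical terminal states
            return status
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        time.sleep(poll_interval)


# Simulated run that finishes on the third poll.
statuses = iter(["queued", "processing", "done"])
final_status = wait_until_done(lambda: next(statuses), timeout=5, poll_interval=0)
print(final_status)
```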

---

## Cookbook

Most recipes below run as-is once prefixed with this shared header; snippets that
reuse names such as `workflow`, `invoice`, or `statements` continue from an
earlier recipe:

```python
from anyformat.sdk import Client
from anyformat.workflow import Schema, ClassifyCategory, SplitterRule, ValidationRule

client = Client(api_key="af_...")
```

### 1. Parse a document → markdown + blocks

A parse-only workflow turns any document into structured markdown plus a typed
list of layout blocks (each with a bounding box, type, and confidence).

```python
workflow = client.workflow("Plain parse").parse().create()
result = workflow.run("contract.pdf").wait()

print(result.parse.markdown)          # full document as markdown
print(result.parse.text)              # markdown with HTML/anchors stripped

for block in result.parse.blocks:
    print(block.page, block.type, block.parse_confidence, block.bbox)
```

### 2. Visualize the bounding boxes

`result.parse.draw()` renders every block's box onto its page and writes one PNG
per page. Boxes are coloured by `parse_confidence`, matching the Studio:
**≥80 emerald · 50–79 amber · <50 rose · unknown blue**.

```python
workflow = client.workflow("Parse + draw").parse().create()
result = workflow.run("invoice.pdf").wait()

paths = result.parse.draw("out/")     # list of Paths: out/page_1.png, out/page_2.png, ...
print(paths)
```

The page background is the document you ran — it's retained automatically, so
you don't hand the file back. Options:

```python
# Override the background (path or bytes) — useful if you ran from raw bytes:
result.parse.draw("out/", source="invoice.pdf")

# Higher-resolution raster:
result.parse.draw("out/", dpi=300)

# No source available → boxes are drawn on blank canvases (layout map only).
```

PDFs are rasterized per page; image files (PNG/JPG/TIFF) are drawn on directly.
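
To estimate how large the PNGs will be for a given `dpi`, recall that PDF page sizes are measured in points of 1/72 inch. The sketch below assumes the rasterizer scales each page by `dpi / 72`, which is the usual convention but is not confirmed here for this SDK. `raster_size` is a hypothetical helper.

```python
def raster_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Pixel dimensions of a page measured in PDF points (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


print(raster_size(612, 792))        # US Letter at 150 dpi -> (1275, 1650)
print(raster_size(612, 792, 300))   # US Letter at 300 dpi -> (2550, 3300)
```

Doubling `dpi` quadruples the pixel count, so expect larger files and slower rendering at 300.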

### 3. Extract fields (linear `parse → extract`)

`Schema` is the field factory. Mix any field types in one extract.

```python
workflow = (
    client.workflow("Invoice extraction")
    .parse()
    .extract([
        Schema.string("vendor", "Vendor name on the invoice."),
        Schema.float("total", "Grand total amount."),
        Schema.date("issued_on", "Invoice issue date."),
        Schema.boolean("is_paid", "Whether the invoice is marked paid."),
        Schema.integer("line_count", "Number of line items."),
        Schema.enum("currency", "Currency of the totals.", options=[
            Schema.option("USD", "US Dollar."),
            Schema.option("EUR", "Euro."),
        ]),
        Schema.object("line_items", "One row per line item.", fields=[
            Schema.string("sku", "Stock keeping unit."),
            Schema.string("description", "Line description."),
            Schema.float("amount", "Line amount."),
        ]),
    ])
    .create()
)

result = workflow.run("invoice.pdf").wait()
field = result.fields["vendor"]
print(field.value, field.confidence, [e.text for e in field.evidence])
```

Field types: `string`, `integer`, `float`, `boolean`, `date`, `datetime`,
`enum`, `multi_select`, `object` (nested).

### 4. Classify, then extract per category

`classify()` adds a branching node. Each `extract()` after it must declare which
category routes into it via `branch=` (the category `id`, or the object itself).

```python
invoice = ClassifyCategory(id="INVOICE", name="Invoice", description="A vendor invoice.")
receipt = ClassifyCategory(id="RECEIPT", name="Receipt", description="A point-of-sale receipt.")

workflow = (
    client.workflow("Invoice or Receipt")
    .parse()
    .classify(invoice, receipt)
    .extract([Schema.string("vendor", "Vendor name.")], branch=invoice)     # object form
    .extract([Schema.string("merchant", "Merchant name.")], branch="RECEIPT")  # id form
    .create()
)

result = workflow.run("doc.pdf").wait()
for category in result.raw["classifications"]:
    print(category["category"], category["confidence"])
```

### 5. Split a multi-document file, then extract

`split()` carves one upload into sub-documents by rule. Extracts branch off the
splitter rule `id`.

```python
statements = SplitterRule(id="STMT", name="Statement", description="A bank statement.")
checks = SplitterRule(id="CHECK", name="Check", description="A scanned check.")

workflow = (
    client.workflow("Statement bundle")
    .parse()
    .split(statements, checks)
    .extract([Schema.string("account", "Account number.")], branch="STMT")
    .extract([Schema.float("amount", "Check amount.")], branch="CHECK")
    .create()
)

result = workflow.run("bundle.pdf").wait()
for split in result.raw["splits"]:
    print(split["name"], split["files"])
```

When you chain `split()` after `classify()`, tell the splitter which category
feeds it with `route_from=`:

```python
(
    client.workflow("Classify then split")
    .parse()
    .classify(invoice, receipt)
    .split(statements, checks, route_from=invoice)
    .extract([Schema.string("account", "Account number.")], branch="STMT")
    .create()
)
```

### 6. Validate extracted fields

`validate()` adds rules the model checks against the extracted values. It must
follow an `extract()`; use `branch=` to target a specific extract in a
multi-branch workflow.

```python
workflow = (
    client.workflow("Invoice with checks")
    .parse()
    .extract([
        Schema.float("subtotal", "Subtotal before tax."),
        Schema.float("tax", "Tax amount."),
        Schema.float("total", "Grand total."),
    ])
    .validate(
        ValidationRule(id="totals", description="total must equal subtotal + tax.", severity="error"),
        ValidationRule(id="positive", description="All amounts must be positive.", severity="warning"),
    )
    .create()
)
```
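
The rules above are evaluated by the model. If you also want a deterministic cross-check in your own code, a local version of the `totals` rule might look like the sketch below. `totals_consistent` and the one-cent tolerance are assumptions. Floating-point sums rarely match exactly, hence `math.isclose` rather than `==`.

```python
import math


def totals_consistent(
    subtotal: float, tax: float, total: float, tolerance: float = 0.01
) -> bool:
    """Local check mirroring the 'total must equal subtotal + tax' rule."""
    return math.isclose(total, subtotal + tax, abs_tol=tolerance)


print(totals_consistent(100.0, 8.25, 108.25))   # True
print(totals_consistent(0.1, 0.2, 0.3))         # True, despite float rounding
print(totals_consistent(100.0, 8.25, 110.0))    # False
```

With a real result, the three arguments would come from `result.fields["subtotal"].value` and its siblings.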

### 7. Different file inputs

`run()` accepts a path string, a `Path`, or raw `bytes`.

```python
from pathlib import Path

workflow.run("invoice.pdf")                       # path string
workflow.run(Path("scans") / "page.png")          # Path (PDF or image)
workflow.run(Path("invoice.pdf").read_bytes())    # raw bytes (no file handle left open)
```

### 8. Async client

`AsyncClient` mirrors the sync API with `await` on `create`, `run`, and `wait`.

```python
import asyncio
from anyformat.sdk import AsyncClient
from anyformat.workflow import Schema

async def main():
    client = AsyncClient(api_key="af_...")
    try:
        workflow = await (
            client.workflow("Async invoices")
            .parse()
            .extract([Schema.string("vendor", "Vendor name.")])
            .create()
        )
        run = await workflow.run("invoice.pdf")
        result = await run.wait()
        print(result.fields["vendor"].value)
    finally:
        await client.aclose()  # close the client even if a step above fails

asyncio.run(main())
```
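
The main payoff of the async client is processing several files concurrently. The sketch below shows the pattern with `asyncio.gather` and a semaphore that caps in-flight runs. `process` is a hypothetical stand-in for `await workflow.run(path)` followed by `await run.wait()`, so the example runs without network access. The limit of 4 is arbitrary.

```python
import asyncio


async def process(path: str) -> str:
    """Hypothetical stand-in for running one file through a workflow."""
    await asyncio.sleep(0)  # yields control, as a network call would
    return f"result for {path}"


async def process_all(paths: list[str], limit: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(limit)  # cap concurrent in-flight runs

    async def bounded(path: str) -> str:
        async with semaphore:
            return await process(path)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(process_all(["a.pdf", "b.pdf", "c.pdf"]))
print(results)
```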

---

## CLI

The package installs an `afx` command for one-off parses without writing code:

```bash
export ANYFORMAT_API_KEY=af_...
afx parse invoice.pdf
```

---

## API quick reference

| Call | Returns | Notes |
|------|---------|-------|
| `Client(api_key, base_url=...)` | `Client` | reads `ANYFORMAT_API_KEY` if `api_key` is omitted |
| `client.workflow(name)` | builder | chain verbs, then `.create()` |
| `.parse(mode="standard"\|"agentic")` | builder | exactly one per workflow |
| `.classify(*categories)` | builder | branching; needs `parse()` first |
| `.split(*rules, route_from=...)` | builder | `route_from` required after `classify()` |
| `.extract(fields, branch=...)` | builder | `branch` required after classify/split |
| `.validate(*rules, branch=...)` | builder | follows `extract()` |
| `.create()` | `Workflow` | registers the workflow |
| `workflow.run(file)` | `Run` | `file`: path \| `Path` \| `bytes` |
| `run.wait(timeout=300, poll_interval=3)` | `Result` | polls until done |
| `result.fields` | `dict[str, ExtractedField]` | `.value`, `.confidence`, `.evidence` |
| `result.parse` | `ParseView \| None` | `.blocks`, `.markdown`, `.text`, `.draw()` |
| `result.parse.draw(out_dir, source=None, dpi=150)` | `list[Path]` | one PNG per page |
| `result.raw` | `dict` | full results payload |
