Metadata-Version: 2.4
Name: contextractor
Version: 0.4.1
Summary: Drive the Contextractor Node crawler/extractor from Python — clean main-content text in txt, markdown, json, or html.
Project-URL: Homepage, https://apify.com/glueo/contextractor
Project-URL: Repository, https://github.com/contextractor/contextractor
Author-email: glueo <company@glueo.com>
License-Expression: Apache-2.0
Keywords: content-extraction,crawlee,crawler,markdown,playwright,trafilatura,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: nodejs-wheel-binaries<25,>=24.16
Provides-Extra: test
Requires-Dist: pytest-asyncio>=1.4; extra == 'test'
Requires-Dist: pytest>=9; extra == 'test'
Requires-Dist: pyyaml>=6; extra == 'test'
Description-Content-Type: text/markdown

# contextractor

<p align="center"><img width="220" src="https://www.contextractor.com/media/cover-mini.svg" alt="Contextractor"></p>

Crawl web pages and extract clean main-content text — `txt`, `markdown`, `json`,
or `html` — from Python. Built on [`rs-trafilatura`](https://github.com/Murrough-Foley/rs-trafilatura)
(extraction) and [Crawlee](https://crawlee.dev/) + [Playwright](https://playwright.dev/)
(crawling).

This package is a thin, typed wrapper that **drives the bundled Node engine** — it
does not reimplement the crawler. A self-contained Node runtime ships with the
wheel (via [`nodejs-wheel-binaries`](https://pypi.org/project/nodejs-wheel-binaries/)),
so **no Node.js install is required**.

## Install

```bash
pip install contextractor
python -m contextractor install   # one-time: download the Chromium browser
```

Platform wheels are published for macOS (arm64, x86_64), Linux (x86_64, aarch64;
glibc ≥ 2.28), and Windows (x64). Requires Python 3.12+.

## Quick start

```python
import contextractor

summary = contextractor.extract(
    ["https://example.com"],
    save=["markdown-kvs"],
    output_dir="./out",
    max_requests_per_crawl=10,
)
print(summary)
# ExtractSummary(total=1, succeeded=1, failed=0, skipped=0,
#                output_dir='/abs/out', manifest_path='/abs/out/manifest.json')
```

Extracted files and a `manifest.json` index are written to `output_dir`
(default: `./contextractor-output`). The manifest is a JSON array of records, each
tagged `status: "success" | "failed" | "skipped"`.
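
Because the manifest is a plain JSON array, it can be inspected with the standard library alone. The sketch below tallies records by `status`. It writes a small stand-in manifest so it runs without crawling anything; only the `status` field comes from the description above, and the `url` field in the sample records is an assumption made for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_statuses(manifest_path):
    """Tally manifest records by their `status` value."""
    records = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return Counter(record["status"] for record in records)


# Stand-in manifest: the `url` key is hypothetical, only `status` is documented.
sample = [
    {"url": "https://example.com", "status": "success"},
    {"url": "https://example.org", "status": "failed"},
    {"url": "https://example.net", "status": "success"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.json"
    manifest.write_text(json.dumps(sample), encoding="utf-8")
    counts = count_statuses(manifest)

# A Counter returns 0 for statuses that never occurred.
print(counts["success"], counts["failed"], counts["skipped"])
```

For a real run, pass `summary.manifest_path` to `count_statuses` instead of the temporary file.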

### Async

```python
import asyncio
import contextractor

async def main():
    summary = await contextractor.aextract(
        ["https://example.com", "https://example.org"],
        save=["markdown-dataset", "original-kvs"],
        output_dir="./out",
        max_concurrency=5,
    )
    print(summary.succeeded, "of", summary.total)

asyncio.run(main())
```

## Single-page extraction

`extract_one()` crawls exactly one URL (no link-following) and returns the
extracted content directly — no output directory, nothing written to disk:

```python
import contextractor

markdown = contextractor.extract_one("https://example.com")
print(markdown)  # str — markdown is the default format
```

Request several formats to get a `dict` keyed by format:

```python
contents = contextractor.extract_one(
    "https://example.com",
    formats=["markdown", "json", "original"],
)
print(contents["markdown"])  # extracted markdown
print(contents["original"])  # raw page HTML
```

Formats: `txt`, `markdown` (default), `json`, `html`, `original` (the raw page
HTML). With one requested format the return value is a `str`; with several it is
a `dict[str, str]`. `extract_one` accepts the single-page subset of the crawl
options (`ExtractOneOptions`) — e.g. `proxy`, `mode`, `user_agent`, `cookies`,
`headers`, `headless` — but not crawl-frontier options like `globs` or
`max_crawl_depth`. A page that cannot be fetched or extracted raises
`ContextractorError`. If the page yields no content for one of several requested
formats, that format's key is simply absent from the returned `dict`; when the
single requested format yields no content, `ContextractorError` is raised.
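
Since the return type depends on how many formats were requested, calling code may want to normalize it. The following is a minimal sketch of one way to do that; `as_format_dict` and `missing_formats` are hypothetical helpers, not part of the package, and the sample values stand in for real `extract_one()` results.

```python
def as_format_dict(result, requested):
    """Return an extract_one()-style result as a dict keyed by format.

    `result` is a str when one format was requested, a dict otherwise.
    """
    if isinstance(result, str):
        if len(requested) != 1:
            raise ValueError("a str result implies exactly one requested format")
        return {requested[0]: result}
    return dict(result)


def missing_formats(result, requested):
    """List requested formats whose key is absent (no content was produced)."""
    found = as_format_dict(result, requested)
    return [fmt for fmt in requested if fmt not in found]


# Stand-in results shaped like the two documented return types.
single = as_format_dict("# Title", ["markdown"])
absent = missing_formats({"markdown": "# Title"}, ["markdown", "json"])
print(single, absent)
```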

### Async single page

```python
import asyncio
import contextractor

async def main():
    markdown = await contextractor.aextract_one("https://example.com")
    print(markdown)

asyncio.run(main())
```

## Return value

`extract()` / `aextract()` return an `ExtractSummary`:

| Field | Meaning |
| --- | --- |
| `total` | Number of records in the manifest |
| `succeeded` / `failed` / `skipped` | Counts by record `status` |
| `output_dir` | Absolute path where files + manifest were written |
| `manifest_path` | Absolute path to `manifest.json` |

Partial failures (some URLs failed) **do not raise**; they are reflected in
`summary.failed`. Validation or configuration errors and crawl-level failures
(as opposed to individual URLs failing) raise `ContextractorError`; a missing
browser raises `MissingBrowserError`, which points you at
`python -m contextractor install`.
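
Because partial failures do not raise, a script that should fail on them has to check the counts itself. The sketch below shows one possible policy. `SummaryStandIn` is a hypothetical stand-in that mirrors the documented count fields so the example runs without the package, and `exit_code` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass


@dataclass
class SummaryStandIn:
    """Hypothetical stand-in mirroring the documented ExtractSummary counts."""

    total: int
    succeeded: int
    failed: int
    skipped: int


def exit_code(summary, max_failed=0):
    """Return 0 when the failure count is acceptable, 1 otherwise."""
    return 0 if summary.failed <= max_failed else 1


partial = SummaryStandIn(total=3, succeeded=2, failed=1, skipped=0)
print(exit_code(partial))                # 1: one failure, none tolerated
print(exit_code(partial, max_failed=1))  # 0: one failure is tolerated
```

In real code the object returned by `extract()` would take the place of `SummaryStandIn`, with the call itself wrapped in a `try`/`except` for the exception types named above.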

## Options

All crawl options are typed keyword arguments (`ExtractOptions`). A selection:

| Option | Type | Notes |
| --- | --- | --- |
| `save` | `list[str]` | `format-destination` tokens: `{txt,markdown,json,html,original}-{dataset,kvs}` (e.g. `markdown-kvs`, `original-dataset`). Default `markdown-kvs`; to save a format to both destinations, list it once per destination (e.g. `markdown-kvs` and `markdown-dataset`). Saving `original`/`html` to the dataset risks running out of memory on large pages |
| `mode` | `str` | `precision`, `balanced` (default), `recall` |
| `max_requests_per_crawl` | `int` | `0` = unlimited |
| `max_crawl_depth` | `int` | `0` = unlimited |
| `globs` / `exclude` | `list[str]` | enqueue / skip URL patterns |
| `headless` | `bool` | `False` runs a headed browser |
| `block_media` / `images` | `bool` | toggle media loading / image extraction |
| `links` / `comments` / `tables` | `bool` | `False` excludes that content |
| `proxy` | `list[str]` | `http`, `https`, `socks4`, `socks5` URLs |
| `cookies` | `list[dict]` | initial cookies (JSON) |
| `headers` | `dict[str, str]` | custom HTTP headers (JSON) |
| `selector` | `str` | restrict extraction to a CSS selector |
| `deduplication` | `str` | `minimal`, `standard` (default), `aggressive` |
| `output_dir` | `str` | where files + manifest are written |

Boolean options that have a CLI default are forwarded to the engine as a flag
only when you set them explicitly; otherwise the engine's own default applies.
Editor autocomplete and type-checkers see every option via the `ExtractOptions`
`TypedDict`.
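
The `save` tokens follow a fixed `format-destination` pattern, so they can be built and checked before a crawl starts. This is a minimal sketch using the formats and destinations listed in the table; `save_token` is a hypothetical helper, and the package may well perform its own validation.

```python
FORMATS = ("txt", "markdown", "json", "html", "original")
DESTINATIONS = ("dataset", "kvs")


def save_token(fmt, destination):
    """Build one `format-destination` token, rejecting unknown parts."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if destination not in DESTINATIONS:
        raise ValueError(f"unknown destination: {destination!r}")
    return f"{fmt}-{destination}"


# Save markdown to both destinations and the raw HTML to the key-value store.
save = [
    save_token("markdown", "kvs"),
    save_token("markdown", "dataset"),
    save_token("original", "kvs"),
]
print(save)
```

The resulting list can be passed as the `save` keyword argument.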

### Config file

Share configuration across runs with a JSON config file:

```json
{
  "mode": "precision",
  "save": ["markdown-kvs", "json-dataset"],
  "maxRequestsPerCrawl": 25,
  "maxCrawlDepth": 2
}
```

```python
contextractor.extract(
    ["https://example.com"],
    config_file="config.json",
    output_dir="./out",
)
```

Keyword arguments override values from the config file.
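
The config keys in the example above are camelCase (`maxRequestsPerCrawl`), while the Python keyword arguments are snake_case (`max_requests_per_crawl`). The sketch below generates a config file body from Python-style names. It assumes every config key is the plain camelCase form of its keyword argument, which the example suggests but this document does not state; `to_config_key` and `to_config` are hypothetical helpers.

```python
import json


def to_config_key(option_name):
    """Convert a snake_case option name to camelCase (assumed convention)."""
    head, *rest = option_name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_config(options):
    """Rename every key of a snake_case options dict to camelCase."""
    return {to_config_key(name): value for name, value in options.items()}


config = to_config(
    {
        "mode": "precision",
        "save": ["markdown-kvs", "json-dataset"],
        "max_requests_per_crawl": 25,
        "max_crawl_depth": 2,
    }
)
print(json.dumps(config, indent=2))
```

Writing the printed JSON to `config.json` would reproduce the file shown above.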

### Proxies

Only `http`, `https`, `socks4`, and `socks5` proxy URLs are accepted; an
unsupported scheme raises `ProxySchemeError` before anything runs. Proxy
credentials are never echoed in errors or logs.

```python
contextractor.extract(
    ["https://example.com"],
    proxy=["http://user:pass@proxy.example.com:3128"],
    output_dir="./out",
)
```
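
If proxy URLs come from user input or a config store, they can be screened with the standard library before being handed to `extract()`. The sketch below checks schemes against the four accepted ones and strips credentials for your own logging; both helpers are hypothetical and only approximate whatever check the package performs internally.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "socks4", "socks5"}


def unsupported_proxies(proxy_urls):
    """Return the proxy URLs whose scheme is not in the accepted set."""
    return [
        url
        for url in proxy_urls
        if urlsplit(url).scheme.lower() not in ALLOWED_SCHEMES
    ]


def redact(url):
    """Drop any user:password part so the URL is safe to log."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host += f":{parts.port}"
    return f"{parts.scheme}://{host}"


proxies = [
    "http://user:pass@proxy.example.com:3128",
    "ftp://proxy.example.com:21",
]
print(unsupported_proxies(proxies))
print(redact(proxies[0]))
```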

## Browser provisioning

Browsers are not bundled in the wheel. Run `python -m contextractor install` once
to download Chromium for the bundled engine. The standard
`PLAYWRIGHT_BROWSERS_PATH` environment variable is honored.
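
In CI or container builds it can be convenient to run the install step from Python with a fixed browser location. The sketch below only builds the command and environment; it does not run anything. `install_invocation` is a hypothetical helper and `/opt/pw-browsers` is an example path.

```python
import os
import sys


def install_invocation(browsers_path=None):
    """Build the command and environment for the one-time browser download."""
    env = dict(os.environ)
    if browsers_path is not None:
        # Standard Playwright variable naming the browser download directory.
        env["PLAYWRIGHT_BROWSERS_PATH"] = str(browsers_path)
    command = [sys.executable, "-m", "contextractor", "install"]
    return command, env


command, env = install_invocation("/opt/pw-browsers")
print(command[1:], env["PLAYWRIGHT_BROWSERS_PATH"])
```

To perform the download, pass the pair to `subprocess.run(command, env=env, check=True)`. The same variable has to be set when crawling so that the engine looks in the same directory.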

## Advanced

- `CONTEXTRACTOR_NODE_PATH` — point at a host Node binary to use instead of the
  bundled runtime.
- `storage_dir` — reuse a Crawlee storage directory across runs (defaults to a
  private temporary directory cleaned up after each call).
- `timeout` — per-process wall-clock limit (seconds).

## License

Apache-2.0
