Metadata-Version: 2.4
Name: ws-data-extractor
Version: 0.2.0
Summary: Python client for the Web Scraper Data Extractor API
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.24
Provides-Extra: release
Requires-Dist: build>=1.2.1; extra == "release"
Requires-Dist: twine>=5.0.0; extra == "release"
Dynamic: license-file

# ws-data-extractor (Python)

Public Python client for the Web Scraper Data Extractor API. The interface is intentionally minimal; the documentation spells out each method and option.

## Install

```bash
pip install ws-data-extractor==0.2.0
```

## Quickstart (sync)

```python
from ws_data_extractor import Client

client = Client(api_key="YOUR_KEY")
result = client.extract(
    url="https://web-scraper.io/product",
    prompt="Extract title and price."
)
print(result.data)
```

## Quickstart (async)

```python
import asyncio
from ws_data_extractor import AsyncClient

async def main():
    client = AsyncClient(api_key="YOUR_KEY")
    result = await client.extract(
        url="https://web-scraper.io/product",
        prompt="Extract title and price."
    )
    print(result.data)

asyncio.run(main())
```

## Configuration

Required:
- `api_key: str`

Optional:
- `base_url` (defaults to production API URL)
- `timeout_ms` (default 120000)
- `retries` (default 2)
- `user_agent`
- `headers` (merged into all requests)

Environment variables:
- `WS_API_KEY`
- `WS_API_BASE_URL`
- `WS_TIMEOUT_MS`
- `DOWNLOAD_ARTIFACT_HOST_ALLOWLIST`
- `DOWNLOAD_ARTIFACT_MAX_BYTES`
- `DOWNLOAD_ARTIFACT_ALLOWED_CONTENT_TYPES`
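
This section does not say whether the client reads these variables on its own. A minimal sketch for wiring them up explicitly follows. It assumes `WS_API_KEY`, `WS_API_BASE_URL`, and `WS_TIMEOUT_MS` map to the `api_key`, `base_url`, and `timeout_ms` constructor arguments. `client_kwargs_from_env` is a hypothetical helper, not part of the package, and it leaves the `DOWNLOAD_ARTIFACT_*` variables alone.

```python
import os
from typing import Any, Dict, Mapping, Optional


def client_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build Client keyword arguments from the documented WS_* variables.

    Assumption: the variable-to-argument mapping is inferred from the names.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, Any] = {}
    if env.get("WS_API_KEY"):
        kwargs["api_key"] = env["WS_API_KEY"]
    if env.get("WS_API_BASE_URL"):
        kwargs["base_url"] = env["WS_API_BASE_URL"]
    if env.get("WS_TIMEOUT_MS"):
        # Environment values are strings; timeout_ms is documented as a number.
        kwargs["timeout_ms"] = int(env["WS_TIMEOUT_MS"])
    return kwargs
```

Usage would then look like `Client(**client_kwargs_from_env())`.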

## Core API

Sync client:
- `extract(url, prompt, schema=None, schema_id=None, options=None)`
- `get_run(run_id)`
- `wait_run(run_id, timeout_ms=..., poll_interval_ms=...)`
- `download(url=None, options=None, profile_id=None, workflow=None)`
- `download_html(url=None, options=None, profile_id=None, workflow=None)`
- `get_schema(schema_id)`
- `get_schema_versions(schema_id, limit=20)`
- `run_template(template_id, parameters=None, max_pages=None, dataset_name=None, input_mode=None, dataset_input=None)`
- `list_template_runs(template_id, limit=None)`
- `resume_template_run(template_id, run_id)`
- `get_template_run_diagnostics(template_id, run_id)`
- `get_run_preview_rows(run_id, limit=None, cursor=None)`
- `resolve_urls(base_url, data, field)`

Async client:
- `extract(...)`: `await AsyncClient.extract(...)` waits for the async run to finish
- `extract_async(...)`: returns `ExtractAsyncResponse` (`run_id`, `status_url`, `websocket`,
  `websocket_token`, `websocket_token_expires_at`, and `raw`)
- `get_run(run_id)`
- `wait_run(run_id, timeout_ms=..., poll_interval_ms=...)`
- `download(...)`
- `download_html(...)`
- `get_schema(schema_id)`
- `get_schema_versions(schema_id, limit=20)`
- `run_template(template_id, parameters=None, max_pages=None, dataset_name=None, input_mode=None, dataset_input=None)`
- `list_template_runs(template_id, limit=None)`
- `resume_template_run(template_id, run_id)`
- `get_template_run_diagnostics(template_id, run_id)`
- `get_run_preview_rows(run_id, limit=None, cursor=None)`
- `resolve_urls(base_url, data, field)`

`wait_run(...)` polls `GET /v1.0/runs/{run_id}` and returns the final run payload for every terminal state:
- extract terminal statuses: `success`, `failed`
- template terminal statuses: `completed`, `completed_partial`, `failed`, `cancelled`
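
Because `wait_run(...)` returns for every terminal state, callers need to branch on the status themselves. A small sketch of that branching follows. It assumes the run payload is a dict with the status under a `status` key. `classify_run` is a hypothetical helper, and treating `completed_partial` as its own outcome is a design choice of this example.

```python
from typing import Any, Mapping


def classify_run(run: Mapping[str, Any]) -> str:
    """Map a run payload to 'ok', 'partial', 'failed', 'cancelled', or 'pending'.

    Assumption: the payload exposes its state as run["status"].
    """
    status = run.get("status")
    if status in ("success", "completed"):
        return "ok"
    if status == "completed_partial":
        return "partial"
    if status in ("failed", "cancelled"):
        return status
    # Any other value is treated as non-terminal.
    return "pending"
```

For example, `classify_run(client.wait_run(run_id))` lets a caller decide whether to read results, retry, or report an error.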

## Response shape

Both sync and async `extract(...)` return an `ExtractResponse` with these fields:
- `data` (can be any JSON type: dict, list, string, number, boolean, null)
- `schema_id`
- `schema_hash`
- `schema_version`
- `schema_state` (sync responses only when returned by the API)
- `validation_errors`
- `raw` (the full response payload; sync-only fields live here)

Sync-only fields such as `schema_used`, `duration_ms`, `detected_language`, and `screenshot_url` are available in `result.raw` when present.
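
Since these fields are only sometimes present, reading them defensively avoids `KeyError`. The sketch below assumes `result.raw` is a dict-like mapping with the sync-only fields at its top level. `sync_only_fields` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping

SYNC_ONLY_FIELDS = ("schema_used", "duration_ms", "detected_language", "screenshot_url")


def sync_only_fields(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Return only the sync-only fields that are actually present in the raw payload."""
    return {name: raw[name] for name in SYNC_ONLY_FIELDS if name in raw}
```

Calling `sync_only_fields(result.raw)` returns an empty dict for responses that carry none of these fields.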

## Options (pass-through to API)

`options` is forwarded as-is:
- `wait_ms`
- `wait_until` (`load`, `domcontentloaded`, `networkidle`)
- `wait_for_selector`
- `fail_on_selector_timeout`
- `screenshot`
- `full_size`
- `headers`
- `cookies`
- `country`
- `language`
- `time_zone`
- `geolocation`
- `user_agent`
- `timeout_ms`

Example:

```python
from ws_data_extractor import Client, ExtractOptions

client = Client(api_key="YOUR_KEY")
result = client.extract(
    url="https://web-scraper.io",
    prompt="Extract title",
    options=ExtractOptions(wait_ms=1500, wait_until="domcontentloaded")
)
```

Download options use the same client-facing names as extract options. The data-extractor API translates these to
downstream downloader fields internally (`wait_ms` → `delay_time`, `timeout_ms` → `timeout`, and `headers`/`cookies`
→ `request_headers`); callers should not send downstream downloader field names inside `options`.

Supported download options:
- `wait_ms`
- `wait_until`
- `wait_for_selector`
- `fail_on_selector_timeout`
- `screenshot`
- `full_size`
- `country`
- `language`
- `time_zone`
- `geolocation`
- `user_agent`
- `headers`
- `cookies`
- `timeout_ms`
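
A simple guard can catch downstream field names before a request is sent. The sketch below covers only the three downstream names mentioned above and assumes the options are available as a plain dict. `check_download_options` is a hypothetical helper; the package may or may not perform a similar check itself.

```python
from typing import Any, Dict

# Downstream downloader names mentioned above, with the client-facing name to use instead.
DOWNSTREAM_ONLY = {
    "delay_time": "wait_ms",
    "timeout": "timeout_ms",
    "request_headers": "headers / cookies",
}


def check_download_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError if options contain downstream downloader field names."""
    bad = sorted(set(options) & set(DOWNSTREAM_ONLY))
    if bad:
        hints = ", ".join(f"{name} (use {DOWNSTREAM_ONLY[name]})" for name in bad)
        raise ValueError(f"downstream downloader fields are not accepted in options: {hints}")
    return options
```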

## Schema usage

Provide either `schema` or `schema_id`:

```python
# Assumes `client` was created as in the Quickstart.
schema = {
    "type": "object",
    "properties": {"price": {"type": "number"}},
    "required": ["price"]
}

result = client.extract(
    url="https://web-scraper.io",
    prompt="Extract price",
    schema=schema
)
```

Fetch the active canonical schema with `get_schema(schema_id)`. Fetch version history with
`get_schema_versions(schema_id, limit=20)`, which returns the API payload containing `versions[]` entries with
`schema_version`, `schema_hash`, `created_at`, and `schema`.
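
One use for the version history is finding which stored version produced a given result. The sketch below assumes the `get_schema_versions(...)` payload is a dict whose `versions` list holds entries shaped as described above. `find_schema_version` is a hypothetical helper.

```python
from typing import Any, Mapping, Optional


def find_schema_version(payload: Mapping[str, Any], schema_hash: str) -> Optional[Mapping[str, Any]]:
    """Return the versions[] entry whose schema_hash matches, or None if absent."""
    for entry in payload.get("versions", []):
        if entry.get("schema_hash") == schema_hash:
            return entry
    return None
```

For example, `find_schema_version(client.get_schema_versions(schema_id), result.schema_hash)` would return the entry matching an `ExtractResponse`, provided it falls within the requested `limit`.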

## Template workflows

Template helpers use the existing flat `Client` / `AsyncClient` style. They do not introduce a separate S-SHOT-style
SDK or facade.

```python
from ws_data_extractor import Client

client = Client(api_key="YOUR_KEY")

run = client.run_template(
    "amazon-search-listings",
    parameters={"search_or_url": "wireless mouse"},
    max_pages=3,
    dataset_name="Wireless mouse search",
)

status = client.wait_run(run["run_id"])
preview = client.get_run_preview_rows(run["run_id"], limit=50)
diagnostics = client.get_template_run_diagnostics("amazon-search-listings", run["run_id"])
recent_runs = client.list_template_runs("amazon-search-listings", limit=10)
```

Resume is backend-gated. Call `resume_template_run(template_id, run_id)` only when the run payload exposes
`resume.available=true`; the API returns `409 resume_not_available` when no resumable checkpoint exists.
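
Checking the flag first avoids the `409`. The sketch below assumes `resume.available` corresponds to a nested `{"resume": {"available": true}}` object in the run payload. `can_resume` is a hypothetical helper.

```python
from typing import Any, Mapping


def can_resume(run: Mapping[str, Any]) -> bool:
    """Return True only when the run payload explicitly reports resume.available = true."""
    resume = run.get("resume")
    return isinstance(resume, Mapping) and resume.get("available") is True
```

A caller would fetch the payload with `get_run(run_id)` and call `resume_template_run(template_id, run_id)` only when `can_resume(...)` returns `True`.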

## Error handling

```python
from ws_data_extractor import Client, ApiError

client = Client(api_key="YOUR_KEY")
try:
    client.extract(url="https://web-scraper.io", prompt="Extract")
except ApiError as exc:
    print(exc.status_code, exc.error, exc.message, exc.step)
    print(exc.request_id)
    print(exc.payload)
```

## Manual multi-page flow (example)

```python
from ws_data_extractor import Client
from ws_data_extractor import dedupe_urls

client = Client(api_key="YOUR_KEY")

search_url = "https://www.amazon.es/s?k=dell+portatil"
search_prompt = (
    "Return JSON with key \"items\" as a list of the first 5 products. "
    "Each item should include: title, product_url (canonical)."
)

detail_prompt = (
    "Extract product title, price (with currency), rating, review_count, "
    "canonical product_url, and technical specifications as key/value pairs in \"specs\"."
)

search = client.extract(url=search_url, prompt=search_prompt)
product_urls = client.resolve_urls(search_url, search.data, field="product_url")
product_urls = dedupe_urls(product_urls)

results = [
    client.extract(url=url, prompt=detail_prompt).data
    for url in product_urls[:5]
]

print(results)
```

Notes:
- Keep the search prompt to fields that exist on the listing page (e.g., `title`, `product_url`).
- Ask for full details only on product pages.
- If the search step returns duplicates, call `dedupe_urls(...)` before fetching details.

## Follow-up helpers (no auto-follow)

```python
from ws_data_extractor import resolve_urls

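# `base_url` is the page URL that was passed to extract();
# `result` is the ExtractResponse returned for that page.
next_pages = resolve_urls(base_url, result.data, field="next_page_url")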
```

## HTML artifact downloads

`download(...)` returns the downloader payload, including `html_url` when the API stored HTML output.
`download_html(...)` follows that artifact URL and returns the same payload plus the fetched HTML body.

Artifact fetch safety is controlled by environment variables:
- `DOWNLOAD_ARTIFACT_HOST_ALLOWLIST`: optional comma-separated host/domain allowlist for `html_url`
- `DOWNLOAD_ARTIFACT_MAX_BYTES`: max HTML artifact size the client will read
- `DOWNLOAD_ARTIFACT_ALLOWED_CONTENT_TYPES`: optional comma-separated content-type allowlist
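
To illustrate what a comma-separated host allowlist check can look like, here is a minimal sketch. It is not the client's actual implementation. The matching rules are assumptions: an exact host or any subdomain of an entry is allowed, and an empty allowlist means no restriction. `parse_csv` and `host_allowed` are hypothetical names.

```python
from typing import List, Optional
from urllib.parse import urlsplit


def parse_csv(value: Optional[str]) -> List[str]:
    """Split a comma-separated environment value into lowercase, non-empty entries."""
    return [item.strip().lower() for item in (value or "").split(",") if item.strip()]


def host_allowed(url: str, allowlist: List[str]) -> bool:
    """Check the URL's host against the allowlist (exact match or subdomain)."""
    if not allowlist:
        # Assumption: the variable is optional, so an empty list imposes no restriction.
        return True
    host = (urlsplit(url).hostname or "").lower()
    return any(host == entry or host.endswith("." + entry) for entry in allowlist)
```

The leading dot in the suffix check keeps `evil-example.com` from matching an `example.com` entry.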

## Retries and idempotency (advanced)

Retries are enabled for transient failures (timeouts, connection errors, 429, 502, 503).
For POST requests, the client generates an idempotency key by default to make retries safe.
You can override this with `idempotency_key=...` when calling `extract` or `extract_async`.
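
The reason the key makes retries safe is that every attempt of one logical request carries the same key, so the server can recognize a repeat. The sketch below shows that pattern in isolation and is not the client's internal code. `send` is a hypothetical callable that performs one POST and returns the HTTP status, and the exponential backoff schedule is an assumption.

```python
import time
import uuid
from typing import Callable, Optional

RETRYABLE_STATUS = {429, 502, 503}


def post_with_retries(
    send: Callable[[str], int],
    retries: int = 2,
    idempotency_key: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(key) up to retries + 1 times, reusing one idempotency key throughout."""
    key = idempotency_key or str(uuid.uuid4())
    for attempt in range(retries + 1):
        last_attempt = attempt == retries
        try:
            status = send(key)
        except (TimeoutError, ConnectionError):
            if last_attempt:
                raise
        else:
            if status not in RETRYABLE_STATUS or last_attempt:
                return status
        # Assumed backoff: 0.5 s, 1 s, 2 s, ...
        sleep(0.5 * 2 ** attempt)
    raise AssertionError("unreachable: the final attempt always returns or raises")
```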

## Logging

The client uses a `ws_data_extractor` logger with structured extras:
- `request_id`
- `duration_ms`
- `endpoint`

Verbose logging is off by default.
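
One way to turn logging on and surface the extras is through the standard `logging` module. In the sketch below only the logger name and the three extra field names come from this README. The level at which the client emits records is not documented here, so `DEBUG` is an assumption, and `ExtrasFormatter` and `enable_client_logging` are hypothetical names.

```python
import logging

EXTRA_FIELDS = ("request_id", "duration_ms", "endpoint")


class ExtrasFormatter(logging.Formatter):
    """Append the structured extras to the formatted message when a record carries them."""

    def format(self, record: logging.LogRecord) -> str:
        base = super().format(record)
        extras = " ".join(
            f"{name}={getattr(record, name)}" for name in EXTRA_FIELDS if hasattr(record, name)
        )
        return f"{base} {extras}".rstrip()


def enable_client_logging(level: int = logging.DEBUG) -> logging.Logger:
    """Attach a stderr handler to the `ws_data_extractor` logger and set its level."""
    logger = logging.getLogger("ws_data_extractor")
    handler = logging.StreamHandler()
    handler.setFormatter(ExtrasFormatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```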

## FAQ

- **Rate limits**: 429 responses are retried when possible and honor `Retry-After`.
- **Billing**: the client does not change server-side billing behavior.
- **Compatibility**: API v1.0
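
For the rate-limit entry above: a `Retry-After` header carries either a number of seconds or an HTTP date. The sketch below shows how such a value can be turned into a delay. It is not the client's own parsing code, and `retry_after_seconds` is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to a non-negative delay in seconds, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: let the caller fall back to its own backoff.
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```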

## Maintainers

Release instructions live in `RELEASE.md`.
