Metadata-Version: 2.4
Name: geonode-scraper-sdk
Version: 0.2.0
Summary: Python SDK for the Geonode Scraper API
Author: Geonode Team
License-Expression: MIT
Keywords: OpenAPI,OpenAPI-Generator,Scraper API
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: urllib3<3.0.0,>=2.1.0
Requires-Dist: python-dateutil>=2.8.2
Requires-Dist: pydantic>=2.11
Requires-Dist: typing-extensions>=4.7.1
Provides-Extra: dev
Requires-Dist: pytest>=7.2.1; extra == "dev"
Requires-Dist: pytest-cov>=2.8.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Requires-Dist: types-python-dateutil>=2.8.19.14; extra == "dev"
Requires-Dist: ruff>=0.12.11; extra == "dev"
Dynamic: license-file

# Geonode Scraper SDK

Python SDK for the Geonode Scraper API. Supports single-URL extraction, batch
extraction, site crawling, URL mapping, job polling, and usage statistics.

## Requirements

- Python 3.10+

## Installation

```sh
pip install geonode-scraper-sdk
```

## Configuration and Authentication

Create a client configuration with your API base URL and API key.

```python
from geonode_scraper_sdk import Configuration

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)
```

If you do not set `host`, the generated client defaults to `http://localhost`.
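
To keep the API key out of source code, the settings can be read from environment variables. The sketch below is an illustration, not part of the SDK: the helper `load_settings` and the variable names `GEONODE_API_HOST` and `GEONODE_API_KEY` are hypothetical. It returns keyword arguments in the same shape as the `Configuration(...)` call above and fails fast when a value is missing, rather than silently falling back to `http://localhost`.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for Configuration from environment variables.

    GEONODE_API_HOST and GEONODE_API_KEY are hypothetical names chosen for
    this sketch; the SDK itself does not read them.
    """
    host = env.get("GEONODE_API_HOST", "").strip()
    api_key = env.get("GEONODE_API_KEY", "").strip()
    if not host:
        raise ValueError("GEONODE_API_HOST is not set")
    if not api_key:
        raise ValueError("GEONODE_API_KEY is not set")
    # Same shape as the Configuration(...) call shown above.
    return {"host": host.rstrip("/"), "api_key": {"APIKeyHeader": api_key}}
```

The result can then be passed on as `Configuration(**load_settings())`.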

## Quick Start

The following example runs a synchronous extraction, which blocks until the result is ready.

```python
from geonode_scraper_sdk import (
    ApiClient,
    ApiException,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    OutputFormat,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        response = api.extract_v1_extract_post(
            ExtractRequest(
                url="https://example.com",
                formats=[OutputFormat.MARKDOWN],
                processing_mode=ProcessingMode.SYNC,
            )
        )
        print(response.data.markdown)
        print(response.tokens_charged)
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
```

## Async Extraction Workflow

When `processing_mode=ProcessingMode.ASYNC`, the extract call returns an async
job response with a job ID and status URL.

```python
from geonode_scraper_sdk import (
    ApiClient,
    Configuration,
    ExtractRequest,
    ExtractionApi,
    ProcessingMode,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    submit = api.extract_v1_extract_post(
        ExtractRequest(
            url="https://example.com",
            processing_mode=ProcessingMode.ASYNC,
        )
    )

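    # Fetch the job once. It may still be running at this point, so in
    # practice repeat this call until the job reaches a final status.
    job = api.get_job_result_v1_extract_job_id_get(submit.job_id)
    print(job.status)
    if job.data and job.data.markdown:
        print(job.data.markdown)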
```

Use `get_job_result_v1_extract_job_id_get(job_id)` to poll a single job, or
`list_jobs_v1_extract_jobs_get(...)` to inspect and filter job history.
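
Polling a job usually means repeating the status call until it reports completion. The helper below is a minimal sketch of that loop and is not part of the SDK: `poll_until`, its parameters, and the default interval and timeout are assumptions. The caller supplies the completion check, because the terminal status values are not listed in this README.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        # Wait before asking again to avoid hammering the API.
        sleep(interval)
```

With the client above, this might be called as `poll_until(lambda: api.get_job_result_v1_extract_job_id_get(submit.job_id), lambda job: job.data is not None)`. That completion check is only a guess; substitute a test on whichever final `job.status` values your API returns.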

## Batch Extraction

Submit multiple URLs in one request and poll for results.

```python
from geonode_scraper_sdk import (
    ApiClient,
    BatchApi,
    BatchRequest,
    Configuration,
    OutputFormat,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = BatchApi(api_client)

    accepted = api.create_batch_v1_batch_post(
        BatchRequest(
            urls=["https://example.com", "https://example.org"],
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.accepted_urls)

    status = api.get_batch_status_v1_batch_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_urls, status.total_urls)
```
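
The status call takes `page` and `page_size`, so reading every result means walking the pages in order. The generator below is a sketch of that loop under two assumptions: a page shorter than `page_size` marks the end, and the caller knows which attribute of the status response holds the per-URL results. `iter_pages`, `fetch_page`, and `get_items` are hypothetical names, not SDK functions.

```python
from typing import Any, Callable, Iterator, Sequence


def iter_pages(
    fetch_page: Callable[[int, int], Any],
    get_items: Callable[[Any], Sequence[Any]],
    page_size: int = 10,
    max_pages: int = 1000,
) -> Iterator[Any]:
    """Yield items page by page until a page comes back short or empty."""
    for page in range(1, max_pages + 1):
        items = get_items(fetch_page(page, page_size))
        yield from items
        # A short page is taken to mean there is nothing left to fetch.
        if len(items) < page_size:
            break
```

For a batch job, `fetch_page` could be `lambda page, size: api.get_batch_status_v1_batch_job_id_get(job_id=accepted.job_id, page=page, page_size=size)`, with `get_items` returning the list of results from the status object. The same pattern applies to `get_crawl_status_v1_crawl_job_id_get`.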

## Site Crawling

Crawl a website from a seed URL up to a configurable depth and page limit.

```python
from geonode_scraper_sdk import (
    ApiClient,
    Configuration,
    CrawlApi,
    CrawlRequest,
    OutputFormat,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = CrawlApi(api_client)

    accepted = api.create_crawl_v1_crawl_post(
        CrawlRequest(
            url="https://example.com",
            depth=2,
            limit=50,
            formats=[OutputFormat.MARKDOWN],
        )
    )
    print(accepted.job_id, accepted.estimated_pages)

    status = api.get_crawl_status_v1_crawl_job_id_get(
        job_id=accepted.job_id, page=1, page_size=10
    )
    print(status.status, status.completed_pages, status.total_pages)
```

## URL Mapping

Discover all URLs under a base URL by combining sitemap parsing with HTML
link extraction. The call returns its result synchronously.

```python
from geonode_scraper_sdk import ApiClient, Configuration, MapApi, MapRequest

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = MapApi(api_client)

    result = api.map_urls_v1_map_post(MapRequest(url="https://example.com"))
    for link in result.links:
        print(link.url, link.source)
```
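
Because each link carries both `url` and `source`, the result can be grouped to see which discovery method found which URLs. The helper below is a sketch that relies only on those two attributes; `group_urls_by_source` is a hypothetical name, and the keys are whatever `str(link.source)` yields for the values your API returns.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_urls_by_source(links: Iterable[Any]) -> Dict[str, List[str]]:
    """Group link URLs by source, keeping only the first occurrence of each URL."""
    grouped = defaultdict(list)
    seen = set()
    for link in links:
        if link.url in seen:
            continue  # skip URLs already recorded under an earlier source
        seen.add(link.url)
        grouped[str(link.source)].append(link.url)
    return dict(grouped)
```

Calling `group_urls_by_source(result.links)` on the response above gives a plain dictionary of lists that is easy to print or serialize.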

## Error Handling

Non-2xx responses raise `ApiException` or one of its subclasses.
The exception includes the HTTP status, response body, and any deserialized
error model in `exc.data`.

```python
from geonode_scraper_sdk import (
    ApiClient,
    ApiException,
    Configuration,
    ExtractionApi,
    ExtractRequest,
)

configuration = Configuration(
    host="https://api.example.com",
    api_key={"APIKeyHeader": "your-api-key"},
)

with ApiClient(configuration) as api_client:
    api = ExtractionApi(api_client)

    try:
        api.extract_v1_extract_post(ExtractRequest(url="https://example.com"))
    except ApiException as exc:
        print(exc.status)
        print(exc.body)
        print(exc.data)
```
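
Since the exception exposes the HTTP status as `exc.status`, transient failures can be retried with a small wrapper. The sketch below is not part of the SDK: `call_with_retry`, the set of retryable statuses, and the backoff values are assumptions, and this README does not say which statuses the API uses for rate limiting or temporary outages.

```python
import time
from typing import Callable, Collection, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    retry_statuses: Collection[int] = (429, 502, 503, 504),
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() with exponential backoff when it raises a retryable status."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            # ApiException exposes the HTTP status as exc.status; exceptions
            # without that attribute are re-raised immediately.
            status = getattr(exc, "status", None)
            if status not in retry_statuses or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A call such as `call_with_retry(lambda: api.extract_v1_extract_post(request))` would then retry up to three times. Retrying a POST can submit the same work twice, so check how the API handles repeated requests before using this for job creation.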

## Request Options

`ExtractRequest` supports the following fields:

- `formats`: output formats to return; defaults to `[OutputFormat.HTML]`
- `render_js`: use a headless browser for JavaScript-rendered pages; defaults to `False`
- `processing_mode`: `ProcessingMode.SYNC` or `ProcessingMode.ASYNC`; defaults to `ProcessingMode.SYNC`
- `extract_links`: extract all links found on the page; defaults to `False`
- `proxy`: optional `ProxySettings` for country and proxy type selection
- `headers`: optional request headers dictionary
- `wait_config`: optional `WaitConfig` for explicit browser wait policy (`wait_until`, `wait_for`, `wait_timeout`)

Example with additional options:

```python
from geonode_scraper_sdk import (
    ExtractRequest,
    OutputFormat,
    ProcessingMode,
    ProxySettings,
    ProxyType,
    WaitConfig,
    WaitUntil,
)

request = ExtractRequest(
    url="https://example.com",
    formats=[OutputFormat.HTML, OutputFormat.MARKDOWN],
    render_js=True,
    processing_mode=ProcessingMode.SYNC,
    extract_links=True,
    proxy=ProxySettings(country="US", type=ProxyType.RESIDENTIAL),
    headers={"User-Agent": "geonode-scraper-sdk-demo"},
    wait_config=WaitConfig(
        wait_until=WaitUntil.NETWORKIDLE,
        wait_for="#content",
        wait_timeout=2000,
    ),
)
```

## API Reference

**ExtractionApi** (`/v1/extract`)
- `extract_v1_extract_post(extract_request)`
- `get_job_result_v1_extract_job_id_get(job_id)`
- `list_jobs_v1_extract_jobs_get(job_id, url, status, output, start_date, end_date, page, page_size)`

**BatchApi** (`/v1/batch`)
- `create_batch_v1_batch_post(batch_request)`
- `get_batch_status_v1_batch_job_id_get(job_id, page, page_size)`
- `cancel_batch_v1_batch_job_id_delete(job_id)`

**CrawlApi** (`/v1/crawl`)
- `create_crawl_v1_crawl_post(crawl_request)`
- `get_crawl_status_v1_crawl_job_id_get(job_id, page, page_size)`
- `cancel_crawl_v1_crawl_job_id_delete(job_id)`

**MapApi** (`/v1/map`)
- `map_urls_v1_map_post(map_request)`

**StatisticsApi** (`/v1/statistics`)
- `get_statistics_v1_statistics_get(start_date, end_date)`

**SystemApi** (`/health`)
- `health_check_health_get()`

**WebhooksApi** (`/v1/webhooks`)
- `list_webhooks_v1_webhooks_get(page, page_size)`
- `create_webhook_v1_webhooks_post(webhook_create)`
- `get_webhook_v1_webhooks_webhook_id_get(webhook_id)`
- `update_webhook_v1_webhooks_webhook_id_patch(webhook_id, webhook_update)`
- `delete_webhook_v1_webhooks_webhook_id_delete(webhook_id)`
- `list_deliveries_v1_webhooks_webhook_id_deliveries_get(webhook_id, page, page_size, status)`
- `rotate_secret_v1_webhooks_webhook_id_rotate_secret_post(webhook_id)`

## Advanced Usage

Each generated API method also exposes:

- `*_with_http_info()` to get the deserialized payload together with status and headers
- `*_without_preload_content()` to work with the raw HTTP response directly
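
As an illustration of the first variant, the helper below summarizes the object that a `*_with_http_info()` call returns. It assumes that object exposes `status_code`, `headers`, and `data`, as the response wrapper in OpenAPI Generator Python clients typically does; verify this against the generated code before relying on it. `summarize_response` is a hypothetical name.

```python
from typing import Any, Dict


def summarize_response(response: Any) -> Dict[str, Any]:
    """Collect the status, content type, and payload type from a response wrapper.

    Assumes the wrapper exposes status_code, headers, and data.
    """
    # Header names are case-insensitive, so normalize them before lookup.
    headers = {k.lower(): v for k, v in (response.headers or {}).items()}
    return {
        "status_code": response.status_code,
        "content_type": headers.get("content-type"),
        "payload_type": type(response.data).__name__,
    }
```

For example, `summarize_response(api.extract_v1_extract_post_with_http_info(request))` would report the status code next to the type of the deserialized payload.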
