Metadata-Version: 2.4
Name: kazparserbot
Version: 0.1.2
Summary: Keyword-driven web scraping pipeline (Serper + OpenAI).
Author: Kirill Yakunin
Author-email: yakunin.k@mail.ru
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: aiohttp (>=3.11.11,<4.0.0)
Requires-Dist: beautifulsoup4 (>=4.13.0,<5.0.0)
Requires-Dist: fire (>=0.7.0,<0.8.0)
Requires-Dist: openai (>=1.58.1,<2.0.0)
Requires-Dist: pillow (>=11.1.0,<12.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: tenacity (>=9.0.0,<10.0.0)
Description-Content-Type: text/markdown

# kazparserbot

Keyword-driven scraping pipeline using Serper (search + scrape) and OpenAI.

## Install

```bash
pip install kazparserbot
```

## Required environment variables

Put these in your environment or in a local `.env` file (see `.env.template`):

- `SERPER_API_KEY` (required)
- `OPENAI_API_KEY` (recommended)
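
A quick pre-flight check can catch a missing key before a long run starts. The following is a minimal sketch, not part of the package: `check_env` is a hypothetical helper, and it only inspects variables already present in the process environment (it does not read a `.env` file).

```python
import os

REQUIRED_VARS = ("SERPER_API_KEY",)
OPTIONAL_VARS = ("OPENAI_API_KEY",)


def check_env(environ=None):
    """Return (missing_required, missing_optional) variable names.

    Hypothetical helper: empty strings are treated as missing.
    """
    env = os.environ if environ is None else environ
    missing_required = [name for name in REQUIRED_VARS if not env.get(name)]
    missing_optional = [name for name in OPTIONAL_VARS if not env.get(name)]
    return missing_required, missing_optional
```

Calling `check_env()` with no arguments inspects `os.environ`; passing a plain dict makes the check easy to test.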

## Library usage (recommended)

### Basic usage (sync)

```python
from kazparserbot import KazParserBot

bot = KazParserBot.from_env()

result = bot.scrape_keywords_sync(
    ["keyword1", "keyword2"],
    google_results_to_get_per_query=10,
    top_results_to_get=5,
    queries_to_generate_per_keyword=3,
)
```
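
Code that already runs inside an event loop may prefer to await the `scrape_keywords` coroutine (listed under Public API) instead of the sync wrapper. A minimal sketch, assuming only the documented keyword arguments; `scrape_in_event_loop` is a hypothetical helper name, and `bot` is any object exposing that coroutine:

```python
import asyncio  # only needed for asyncio.run at the top level of a script


async def scrape_in_event_loop(bot, keywords):
    """Await the coroutine API directly instead of the sync wrapper."""
    # Keyword arguments mirror the defaults documented for scrape_keywords.
    return await bot.scrape_keywords(
        keywords,
        google_results_to_get_per_query=10,
        top_results_to_get=5,
        queries_to_generate_per_keyword=3,
    )


# Usage, assuming kazparserbot is installed and the API keys are set:
#   from kazparserbot import KazParserBot
#   result = asyncio.run(scrape_in_event_loop(KazParserBot.from_env(), ["keyword1"]))
```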

### With image retrieval enabled

Setting `collect_imgs_and_context=True` downloads images to an `imgs/` folder in the current working directory and returns image metadata under the `__imgs__` key of the result.

```python
from kazparserbot import KazParserBot

bot = KazParserBot.from_env()

result = bot.scrape_keywords_sync(
    ["keyword1", "keyword2"],
    collect_imgs_and_context=True,
)

imgs = result.get("__imgs__", [])
```
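
Because `__imgs__` shares the result dict with the other entries, it can help to split it off before further processing. The helpers below are a hypothetical sketch, not part of the package. They assume the documented image fields, and additionally assume that `url` is the page an image was found on (the source does not spell this out).

```python
from collections import defaultdict

IMGS_KEY = "__imgs__"


def split_result(result):
    """Separate the image metadata list from the remaining entries."""
    other_entries = {k: v for k, v in result.items() if k != IMGS_KEY}
    return other_entries, result.get(IMGS_KEY, [])


def group_images_by_page(imgs):
    """Map each `url` value to the local file paths of its images.

    Assumption: `url` identifies the source page of the image.
    """
    by_page = defaultdict(list)
    for img in imgs:
        by_page[img["url"]].append(img["file_path"])
    return dict(by_page)
```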

## Public API

### `kazparserbot.KazParserBot`

Create an instance:

- `KazParserBot.from_env(...)`
  - `dotenv_path`: path to a `.env` file (optional)
  - `load_dotenv_first`: whether to load `.env` before reading env vars (default `True`)
  - `gpt_concurrency`: max in-flight OpenAI requests (default `100`)
  - `serper_concurrency`: max in-flight Serper requests (default `100`)
  - `http_timeout_s`: per-request HTTP timeout in seconds for Serper + image fetching (default `30.0`)

Core methods:

- `await bot.scrape_keywords(keywords, *, google_results_to_get_per_query=10, top_results_to_get=5, queries_to_generate_per_keyword=3, collect_imgs_and_context=False)`
  - `keywords`: list of keywords to process
  - `google_results_to_get_per_query`: how many Serper search results to fetch per query
  - `top_results_to_get`: how many results to keep per language (RU and KK, i.e. Russian and Kazakh); internally the model selects `top_results_to_get * 2` results in total
  - `queries_to_generate_per_keyword`: number of queries to generate per keyword in each language (RU and KK)
  - `collect_imgs_and_context`: if `True`, downloads images and extracts nearby text context
  - Returns `dict[str, list[dict]]`. When image retrieval is enabled, images are stored under `__imgs__` as a list of dicts containing:
    - `url`, `img_url`, `context_text_before`, `context_text_after`, `file_path`

- `bot.scrape_keywords_sync(keywords, **kwargs)`
  - Synchronous wrapper around `scrape_keywords(...)` (same kwargs)

- `bot.scrape_from_files(*, keywords_json, output_json, google_results_to_get_per_query=10, top_results_to_get=5, queries_to_generate_per_keyword=3, collect_imgs_and_context=False)`
  - `keywords_json`: path to a JSON file containing a list of keywords
  - `output_json`: path to write results JSON
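
The numeric parameters above multiply quickly with batch size. A rough back-of-the-envelope sketch, assuming (not confirmed by the source) that each generated query triggers one Serper search and that the counts apply per keyword; `estimate_batch` is a hypothetical helper, not part of the package:

```python
LANGUAGES = 2  # RU and KK, as described for the parameters above


def estimate_batch(num_keywords, queries_to_generate_per_keyword=3, top_results_to_get=5):
    """Rough per-batch counts; assumes one search request per generated query."""
    queries = num_keywords * queries_to_generate_per_keyword * LANGUAGES
    kept = num_keywords * top_results_to_get * LANGUAGES
    return {"search_queries": queries, "kept_results": kept}
```

Under these assumptions and the default parameters, 50 keywords would come to about 300 search queries.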

### `kazparserbot.Settings`

- `Settings.from_env(dotenv_path=None, load_dotenv_first=True)`
  - Validates that `SERPER_API_KEY` is present and reads `OPENAI_API_KEY` if it is set.

## CLI usage

After installing:

```bash
kazparserbot keywords.json output.json \
  --google_results_to_get_per_query=10 \
  --top_results_to_get=5 \
  --queries_to_generate_per_keyword=3 \
  --collect_imgs_and_context=True
```

The legacy entrypoint is still available:

```bash
python scrap_by_keywords.py keywords.json output.json --collect_imgs_and_context=True
```
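
Both commands read keywords from a JSON file containing a list. One way to produce that file is sketched below; `write_keywords_file` is a hypothetical helper, and the blank-stripping and de-duplication are conveniences added here, not documented requirements of the tool.

```python
import json
from pathlib import Path


def write_keywords_file(path, keywords):
    """Write keywords as a JSON list, dropping blanks and duplicates."""
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = kw.strip()
        if kw and kw not in seen:
            seen.add(kw)
            cleaned.append(kw)
    # ensure_ascii=False keeps Cyrillic keywords human-readable in the file.
    Path(path).write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return cleaned
```

For example, `write_keywords_file("keywords.json", ["keyword1", "keyword2"])` writes a file that can be passed as the first CLI argument.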

