Metadata-Version: 2.4
Name: waveflowdb_client
Version: 1.0.5
Summary: VectorLake SDK — Deterministic backend engine powering agent workflows
Author-email: "agentanalytics.ai" <nitin@agentanalytics.ai>
License: MIT License
        
        Copyright (c) 2026 agentanalytics.ai
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://agentanalytics.ai
Project-URL: Documentation, https://www.agentanalytics.ai/docs/waveflow-db
Keywords: vector db,VECTOR QUERY LANGUAGE,waveflowdb,agentanalytics,VQL
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: numpy
Requires-Dist: tqdm
Dynamic: license-file

# WaveflowDB SDK — VectorLake Python Client v2.3.0

A Python SDK for interacting with **WaveflowDB** and performing semantic retrieval with optional hybrid (keyword + vector) filtering.

This SDK provides:
- Full `Config` management via constructor args, environment variables, or `.env` file
- Document ingestion — direct payload mode or batched filesystem mode
- Document deletion — remove documents from the index by filename
- Targeted insertion — insert documents at a controlled position in the corpus (`run_insert_path` / `run_insert_path_direct`)
- MD5-based change detection for the `add` action (implemented in `run_utils.py` and used by `run.py`) so unchanged files are never re-extracted
- Two processing modes — `"sequential"` (chunk-then-upload, supports `start_from_batch` / `end_batch`) and `"streaming"` (pipelined extraction + upload)
- Semantic search and top-matching-doc retrieval with optional hybrid filtering (`search_type="flat"` / `"flat_filter"`)
- Namespace and document metadata queries
- Full-text exact keyword search across the corpus (`full_corpus_search`)
- Structured CSV logging for performance, API errors, and skipped files
- Simple `{stem}_part{num}.txt` chunk naming for idempotent re-runs
- **Image & OCR extraction** — local (PyMuPDF/Pillow + Tesseract) and cloud engines (AWS Textract, GCP Vision, Azure Document Intelligence, Mathpix, LlamaParse)
- A ready-to-run `run.py` launcher (via `waveflowdb_client/run_utils.py`) with checkpointing and failure auditing

> This README documents the SDK exactly as implemented in `client.py`, `config.py`, `utils.py`, `extractors.py`, `cloud_extractors.py`, `models.py`, `exceptions.py`, `__init__.py`, and `run_utils.py` / `run.py`. There is no scale-out cluster module (`cluster_controller.py` / `cluster_api.py`) in this codebase — if you've seen cluster documentation elsewhere for this SDK, it does not apply to the version covered here.

---

## 🚀 Getting Started

### 1. Install dependencies

```bash
pip install waveflowdb_client
```

Optional extras for richer file support:

```bash
pip install PyPDF2 python-docx python-dotenv tqdm
```

Optional extras for local OCR (image-bearing PDFs, scanned documents, standalone images):

```bash
pip install pymupdf pytesseract pillow   # local OCR — requires the Tesseract binary on PATH
```

`pymupdf` is used to render PDF pages for OCR; `pillow` + `pytesseract` are required for standalone image files (PNG/JPG/TIFF/BMP/WebP). If `pillow`/`pytesseract` are missing, standalone image OCR raises a `FileProcessingError` rather than silently skipping.

Optional extras for cloud OCR (install only the engine you use):

```bash
pip install boto3                           # AWS Textract
pip install google-cloud-vision             # GCP Vision
pip install azure-ai-documentintelligence  # Azure Document Intelligence
pip install requests                        # Mathpix + LlamaParse (REST APIs)
pip install llama-parse                     # LlamaParse (optional SDK wrapper)
```

### 2. Configure credentials

Settings resolve in this priority order: **constructor args → environment variables → `.env` file → hard-coded defaults.**

Create a `.env` file next to `run.py`:

```
VECTOR_LAKE_API_KEY=your_api_key_here
VECTOR_LAKE_HOST=https://waveflow-analytics.com
USER_ID=your@email.com
NAMESPACE=your_namespace
```

`USER_ID` and `NAMESPACE` are read directly by `run_utils.py` (not by `Config`) and forwarded as `user_id` / `vector_lake_description` to every SDK call.

### 3. Use `run.py`

`run.py` is the ready-to-use launcher. Open it, set `ACTION` in the `CONFIG` block near the top, and run:

```bash
python run.py
```

See [`run.py` — Operations Launcher](#️-runpy--operations-launcher) below for the full action list and settings.

### Constructor API (alternative to `run.py`)

```python
from waveflowdb_client import Config, VectorLakeClient

cfg = Config(
    api_key="your_api_key_here",
    host="https://waveflow-analytics.com",
    vector_lake_path="/path/to/documents",
)
client = VectorLakeClient(cfg)
```

---

## ⚙️ Configuration Reference

All values can be set via constructor keyword argument or the listed environment variable. Constructor args always win. Environment variables are read at `Config.__init__` time; the `.env` file is loaded with `override=False`, so real environment variables take precedence over `.env` file values.
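
The resolution order can be pictured with a small stand-alone sketch. `resolve_setting` and its arguments are hypothetical names used only for illustration; they are not part of the SDK, and `Config` may implement the lookup differently.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    ctor_value: Optional[str],
    env_name: str,
    dotenv_values: Mapping[str, str],
    default: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Hypothetical helper: return the first value found, in priority order."""
    environ = os.environ if environ is None else environ
    if ctor_value is not None:        # 1. constructor argument
        return ctor_value
    if env_name in environ:           # 2. real environment variable
        return environ[env_name]
    if env_name in dotenv_values:     # 3. .env file entry (never overrides step 2)
        return dotenv_values[env_name]
    return default                    # 4. hard-coded default
```

Passing `environ` explicitly keeps the sketch easy to test without touching the real process environment.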

### Core

| Constructor arg | Env var | Default | Description |
|---|---|---|---|
| `api_key` | `VECTOR_LAKE_API_KEY` | — (required) | API key. `ConfigError` is raised at init if missing. |
| `host` | `VECTOR_LAKE_HOST` | `https://waveflow-analytics.com` | Server base URL. Trailing slash is stripped automatically. |
| `service_port` | `VECTOR_LAKE_PORT` | `None` | Optional port. When set, query/upload URLs become `host:port` / `host:port+1` instead of `host/query` / `host/upload` (see URL helpers below). |
| `timeout` | `VECTOR_LAKE_TIMEOUT` | `240` | HTTP request timeout in seconds. |
| `max_retries` | `VECTOR_LAKE_MAX_RETRIES` | `2` | Total request attempts per call (not "retries in addition to the first attempt" — see [Retry and Backoff](#-retry-and-backoff)). |
| `max_files_per_batch` | `VECTOR_LAKE_MAX_FILES_PER_BATCH` | `1000` | Maximum chunk files per upload batch. |
| `max_batch_size_mb` | `VECTOR_LAKE_MAX_BATCH_SIZE_MB` | `0.5` | Maximum payload size per batch, in MB. Accepts fractional values. |
| `streaming_minibatch_files` | `VECTOR_LAKE_STREAMING_MINIBATCH_FILES` | `100` | Streaming mode: flush after this many chunks are ready. Acts mainly as a safety ceiling — the size threshold below usually fires first for normal text/PDF chunks. |
| `streaming_minibatch_size_mb` | `VECTOR_LAKE_STREAMING_MINIBATCH_SIZE_MB` | `0.5` | Streaming mode: flush if the pending buffer exceeds this size. **Must be ≥ `max_batch_size_mb`** — `Config` raises `ConfigError` at init otherwise. |
| `extraction_workers` | `VECTOR_LAKE_EXTRACTION_WORKERS` | `4` | Parallel extraction threads used by `create_batches` / `streaming_create_batches`. Set to `1` for fully sequential extraction. |
| `vector_lake_path` | `VECTOR_LAKE_PATH` | `upload` | Directory containing source files. Created automatically if missing. |
| `log_dir` | `VECTOR_LAKE_LOG_DIR` | `logs` | Directory for CSV log files. Created automatically if missing. |
| `allowed_extensions` | `VECTOR_LAKE_ALLOWED_EXTENSIONS` | see below | Comma-separated extension list (lower-cased). |

Default `allowed_extensions`: `txt, csv, json, jsonl, py, docx, pdf, png, jpg, jpeg, tiff, tif, bmp, webp`. Note: `ipynb` is **not** in the default list — `extract_text` has an extractor for `.ipynb` (`_extract_ipynb`), but you must add `ipynb` to `VECTOR_LAKE_ALLOWED_EXTENSIONS` / `allowed_extensions` yourself for `.ipynb` files to be picked up by `FileProcessor.is_supported`.

When the local OCR stack (`pymupdf`/`pytesseract`/Pillow) is importable, image extensions (`png, jpg, jpeg, tiff, tif, bmp, webp`) are unioned into the allowed list automatically regardless of what `allowed_extensions` says.
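
As a rough illustration of the two rules above (comma-separated, lower-cased input, plus the automatic image-extension union), here is a minimal sketch. `parse_allowed_extensions` and `IMAGE_EXTENSIONS` are hypothetical names, and trimming whitespace around each entry is an assumption rather than documented behaviour.

```python
from typing import Set

# Image extensions listed in the paragraph above.
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "tiff", "tif", "bmp", "webp"}


def parse_allowed_extensions(raw: str, ocr_available: bool) -> Set[str]:
    """Hypothetical helper: normalise a comma-separated extension list."""
    # Assumption: surrounding whitespace is ignored and empty entries are dropped.
    exts = {part.strip().lower() for part in raw.split(",") if part.strip()}
    if ocr_available:
        # Image types are added regardless of what the caller listed.
        exts |= IMAGE_EXTENSIONS
    return exts
```

For example, adding `ipynb` would mean passing something like `"txt,pdf,ipynb"` as the raw value.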

### Local OCR

| Constructor arg | Env var | Default | Description |
|---|---|---|---|
| `enable_ocr` | `VECTOR_LAKE_ENABLE_OCR` | `True` | Master switch for local OCR. |
| `ocr_lang` | `VECTOR_LAKE_OCR_LANG` | `eng` | Tesseract language code. |

### Cloud OCR — universal

| Constructor arg | Env var | Default | Description |
|---|---|---|---|
| `cloud_ocr_provider` | `VECTOR_LAKE_CLOUD_OCR_PROVIDER` | `None` | `aws`, `gcp`, `azure`, `mathpix`, or `llama_parse`. Lower-cased and hyphens normalized to underscores. `ConfigError` if not one of these five. |
| `cloud_ocr_strategy` | `VECTOR_LAKE_CLOUD_OCR_STRATEGY` | `hybrid` | `hybrid` = local OCR first, cloud fallback when local output is short. `cloud` = always use the cloud engine. `ConfigError` if not `hybrid`/`cloud`. |
| `cloud_ocr_threshold` | `VECTOR_LAKE_CLOUD_OCR_THRESHOLD` | `100` | Hybrid mode only: if local OCR returns fewer than this many characters, the SDK falls back to the cloud engine. |

### AWS Textract

| Constructor arg | Env var | Default |
|---|---|---|
| `aws_region` | `AWS_DEFAULT_REGION` | `us-east-1` |
| `aws_textract_mode` | `VECTOR_LAKE_AWS_MODE` | `text` (`text` \| `tables`) |
| `aws_access_key_id` | `AWS_ACCESS_KEY_ID` | — |
| `aws_secret_access_key` | `AWS_SECRET_ACCESS_KEY` | — |
| `aws_session_token` | `AWS_SESSION_TOKEN` | — |
| `aws_s3_bucket` | `VECTOR_LAKE_AWS_S3_BUCKET` | — (required for files > 5 MB) |

> **⚠️ Known issue:** `CloudOCRFactory.build()` reads `getattr(config, "cloud_ocr_mode", "text")` to decide AWS Textract mode, but `Config` only ever sets `aws_textract_mode` — there is no `cloud_ocr_mode` attribute on `Config`. As a result, setting `aws_textract_mode="tables"` (via constructor or `VECTOR_LAKE_AWS_MODE`) has **no effect** when going through `Config` → `CloudOCRFactory.build()`; Textract always runs in `"text"` mode through this path. To get table extraction today, instantiate `AWSTextractExtractor(mode="tables", ...)` directly and pass it in rather than relying on `cloud_ocr_provider="aws"` + `aws_textract_mode="tables"`.
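
The mechanism behind this issue can be reproduced without the SDK. The sketch below uses a hypothetical stand-in class, `FakeConfig`, that sets only `aws_textract_mode`, and then performs the same kind of `getattr` lookup the note describes.

```python
class FakeConfig:
    """Hypothetical stand-in for Config: it sets aws_textract_mode only."""

    def __init__(self, aws_textract_mode: str = "text") -> None:
        self.aws_textract_mode = aws_textract_mode


cfg = FakeConfig(aws_textract_mode="tables")

# The lookup described in the note: the attribute does not exist,
# so getattr silently returns the fallback value.
mode_seen_by_factory = getattr(cfg, "cloud_ocr_mode", "text")

# A lookup of the attribute that is actually set returns the user's choice.
mode_user_asked_for = getattr(cfg, "aws_textract_mode", "text")

print(mode_seen_by_factory, mode_user_asked_for)
```

Because `getattr` with a default never raises, a mismatch like this produces no error, which is why the setting appears to be accepted but has no effect.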

### Google Cloud Vision

| Constructor arg | Env var | Default |
|---|---|---|
| `gcp_vision_mode` | `VECTOR_LAKE_GCP_MODE` | `document` (`document` \| `image`) |
| `gcp_credentials_json` | `GOOGLE_APPLICATION_CREDENTIALS` | — |
| `gcp_language_hints` | `VECTOR_LAKE_GCP_LANGUAGE_HINTS` | — (comma-separated, e.g. `en,de`) |

> **⚠️ Same known issue applies to GCP**: `CloudOCRFactory.build()` reads `cloud_ocr_mode` (default `"document"`) for GCP too, not `gcp_vision_mode`. Setting `gcp_vision_mode="image"` has no effect through this path today.

### Azure Document Intelligence

| Constructor arg | Env var | Default |
|---|---|---|
| `azure_endpoint` | `AZURE_FORM_RECOGNIZER_ENDPOINT` | required |
| `azure_api_key` | `AZURE_FORM_RECOGNIZER_KEY` | required |
| `azure_model_id` | `VECTOR_LAKE_AZURE_MODEL_ID` | `prebuilt-read` |
| `azure_include_tables` | `VECTOR_LAKE_AZURE_INCLUDE_TABLES` | `True` |

Available Azure pre-built models: `prebuilt-read`, `prebuilt-layout`, `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`, or any custom model ID from your Azure portal. (Unlike AWS/GCP, Azure's `azure_model_id` and `azure_include_tables` are read correctly by the factory — there is no equivalent bug here.)

### Mathpix

| Constructor arg | Env var | Default |
|---|---|---|
| `mathpix_app_id` | `MATHPIX_APP_ID` | required |
| `mathpix_app_key` | `MATHPIX_APP_KEY` | required |
| `mathpix_output_format` | `VECTOR_LAKE_MATHPIX_FORMAT` | `markdown` (`markdown` \| `mmd` \| `text` \| `latex`) |

### LlamaParse

| Constructor arg | Env var | Default |
|---|---|---|
| `llama_parse_api_key` | `LLAMA_CLOUD_API_KEY` | required |
| `llama_parse_result_type` | `VECTOR_LAKE_LLAMA_RESULT_TYPE` | `markdown` (`markdown` \| `text`) |
| `llama_parse_language` | `VECTOR_LAKE_LLAMA_LANGUAGE` | `en` |

### Quick-start `Config` examples

```python
# Local OCR only (default)
Config(api_key="...")

# AWS Textract — hybrid (note the mode caveat above)
cfg = Config(
    api_key="...",
    cloud_ocr_provider="aws",
    aws_region="eu-west-1",
)

# AWS Textract — large files via S3
cfg = Config(
    api_key="...",
    cloud_ocr_provider="aws",
    aws_s3_bucket="my-bucket",     # required for files > 5 MB
)

# Google Cloud Vision
cfg = Config(
    api_key="...",
    cloud_ocr_provider="gcp",
    gcp_credentials_json="/path/to/key.json",
    gcp_language_hints=["en", "de"],
)

# Azure Document Intelligence — layout model, tables included
cfg = Config(
    api_key="...",
    cloud_ocr_provider="azure",
    azure_endpoint="https://my-resource.cognitiveservices.azure.com/",
    azure_api_key="...",
    azure_model_id="prebuilt-layout",
    azure_include_tables=True,
)

# Mathpix — scientific PDFs and equations
cfg = Config(
    api_key="...",
    cloud_ocr_provider="mathpix",
    mathpix_app_id="...",
    mathpix_app_key="...",
    mathpix_output_format="mmd",
)

# LlamaParse — pure cloud, no local OCR
cfg = Config(
    api_key="...",
    cloud_ocr_provider="llama_parse",
    cloud_ocr_strategy="cloud",     # skip local entirely
    llama_parse_api_key="...",
)
```

### URL helpers

`Config` builds endpoint URLs from `host` and `service_port`:

```python
base_url_query  = f"{host}:{service_port}"   if service_port else f"{host}/query"
base_url_upload = f"{host}:{service_port+1}" if service_port else f"{host}/upload"
```

So if you leave `service_port` unset (the default), the SDK talks to `{host}/query/...` and `{host}/upload/...`. If you set `service_port`, it instead talks to `{host}:{service_port}/...` for query endpoints and `{host}:{service_port+1}/...` for upload endpoints.
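
To make the two cases concrete, the rule above can be wrapped in a small function. `build_base_urls` is a hypothetical name for this sketch; the trailing-slash stripping mirrors what the `host` row in the Core table describes.

```python
from typing import Optional, Tuple


def build_base_urls(host: str, service_port: Optional[int] = None) -> Tuple[str, str]:
    """Hypothetical helper: return (query_base_url, upload_base_url)."""
    host = host.rstrip("/")  # trailing slash is stripped, as noted for `host`
    if service_port:
        # Port mode: the upload service sits on the next port up.
        return f"{host}:{service_port}", f"{host}:{service_port + 1}"
    # Path mode (default): both services share the host.
    return f"{host}/query", f"{host}/upload"
```

With `service_port=8000`, for instance, the query base is `host:8000` and the upload base is `host:8001`.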

---

## ⚡ Processing Modes

All path-based upload methods (`add_documents`, `refresh_documents`, `run_insert_path`) support `processing_mode`:

### `"sequential"` (default)

All source files are chunked upfront (OCR, PDF extraction, CSV/JSONL splitting, and so on) before the first HTTP request is made. Batches are then uploaded one at a time, with a 1.0 second delay between batches (the internal `batch_delay` parameter). This is the only mode that supports `start_from_batch` / `end_batch` resume control.

```
  Files → chunk ALL → [batch 1] → [batch 2] → [batch 3] → ...
```

### `"streaming"`

Upload begins the moment the first minibatch is ready (governed by `streaming_minibatch_files` / `streaming_minibatch_size_mb`), with a 0.5 second delay between minibatch uploads. `start_from_batch` / `end_batch` are **not** supported in streaming mode — batch numbers aren't known upfront. `run_utils.py` layers its own checkpoint-based recovery on top (see [Checkpoints](#checkpoints) below) by tracking completed file stems and skipping them on restart.

```
  Files → [chunk file 1] → [upload minibatch 1] ──────────────►
           [chunk file 2]         → [upload minibatch 2] ─────►
           [chunk file 3]                 → [upload minibatch 3]►
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `processing_mode` | str | `"sequential"` | `"sequential"` or `"streaming"` |
| `streaming_minibatch_files` | int | `100` | Max chunk filenames per streaming minibatch |
| `streaming_minibatch_size_mb` | float | `0.5` | Max total byte size per streaming minibatch |

```python
client.add_documents(
    user_id="u1",
    vector_lake_description="my_lake",
    processing_mode="streaming",
)
```
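
The two flush thresholds can be sketched as a plain generator. This is an illustration of the idea, not the SDK's `streaming_create_batches`; `iter_minibatches` is a hypothetical name, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = Tuple[str, bytes]  # (chunk filename, chunk payload)


def iter_minibatches(
    chunks: Iterable[Chunk],
    max_files: int = 100,
    max_size_mb: float = 0.5,
) -> Iterator[List[Chunk]]:
    """Hypothetical helper: yield a minibatch whenever either threshold is hit."""
    max_bytes = int(max_size_mb * 1024 * 1024)  # assumption: binary megabytes
    pending: List[Chunk] = []
    pending_bytes = 0
    for name, data in chunks:
        pending.append((name, data))
        pending_bytes += len(data)
        # Flush when enough chunks are ready or the buffer exceeds the size limit.
        if len(pending) >= max_files or pending_bytes > max_bytes:
            yield pending
            pending, pending_bytes = [], 0
    if pending:  # final partial minibatch
        yield pending
```

Because the generator yields as soon as a threshold is crossed, a consumer can start uploading the first minibatch while later files are still being chunked.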

---

## 🗂️ Supported File Types

The allowed extension list is the single source of truth, defined in `Config.allowed_extensions` (or `VECTOR_LAKE_ALLOWED_EXTENSIONS`). The default set is:

```
txt, csv, json, jsonl, py, docx, pdf, png, jpg, jpeg, tiff, tif, bmp, webp
```

| Extension | Processing strategy |
|-----------|---------------------|
| `txt` | Plain text, paragraph-chunked |
| `py` | Plain text read, paragraph-chunked |
| `ipynb` | Code cells extracted, paragraph-chunked — **not in the default allowed-extensions list**, must be added explicitly |
| `pdf` | Text extracted via PyMuPDF; embedded images OCR'd if `enable_ocr=True` |
| `docx` | Text + inline images extracted; images OCR'd if `enable_ocr=True` |
| `csv` | Row-safe split by byte size; header row repeated in every chunk |
| `jsonl` / `ndjson` | Record-safe split; each line validated as JSON before being written |
| `json` | Kept atomic — never split |
| `png`, `jpg`, `jpeg`, `tiff`, `tif`, `bmp`, `webp` | Full-image OCR via local Tesseract (requires Pillow + pytesseract) or a cloud engine |

To add or remove extensions without changing code, set `VECTOR_LAKE_ALLOWED_EXTENSIONS` in your `.env` (comma-separated).

---

## 📦 Chunk Naming

All chunks written to disk follow this convention:

```
{stem}_part{num}.txt
```

Example — a PDF that splits into 3 parts:
```
electrolytes-test-report_part1.txt
electrolytes-test-report_part2.txt
electrolytes-test-report_part3.txt
```

All formats (`pdf`, `docx`, `py`, `txt`, etc.) produce `.txt` chunk files. The `VersionedChunkName` dataclass (in `models.py`) provides `build()`, `parse()`, and the static `is_versioned()` helper for working with this convention:

```python
from waveflowdb_client import VersionedChunkName

VersionedChunkName.build("report.pdf", part=3).filename   # "report_part3.txt"
VersionedChunkName.parse("report_part3.txt")               # VersionedChunkName(stem='report', part=3, ext='.txt', version=1)
VersionedChunkName.is_versioned("report_part3.txt")        # True
```

Chunks already on disk are reused on re-run when their content is unchanged — see [Stale-chunk handling](#stale-chunk-handling) for exactly how `add` vs. `refresh`/`insert` differ here.
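
If you need to recognise chunk files without importing the SDK (for example in a cleanup script), the naming convention is simple enough to handle with a regular expression. The helpers below are a stand-alone sketch with hypothetical names; they are not how `VersionedChunkName` is implemented.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Matches "{stem}_part{num}.txt" as described above.
_CHUNK_RE = re.compile(r"^(?P<stem>.+)_part(?P<num>\d+)\.txt$")


def build_chunk_name(source_filename: str, part: int) -> str:
    """Hypothetical helper: chunk filename for a given source file and part."""
    return f"{Path(source_filename).stem}_part{part}.txt"


def parse_chunk_name(chunk_filename: str) -> Optional[Tuple[str, int]]:
    """Hypothetical helper: return (stem, part), or None if not a chunk name."""
    match = _CHUNK_RE.match(chunk_filename)
    if match is None:
        return None
    return match.group("stem"), int(match.group("num"))
```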

---

## 🖼️ Image & OCR Extraction

The SDK extracts text from images embedded in PDFs and DOCX files, as well as from standalone image files. Every OCR-derived block is wrapped in a provenance marker so chunk files carry a full audit trail:

```
<<<IMAGE_EXTRACT source="scan.pdf" page=2 image_index=1 method="embedded_image_ocr">>>
... extracted text ...
<<<END_IMAGE_EXTRACT>>>
```

The actual `method` values used by each extractor:

| Source | `method` value |
|---|---|
| Embedded image inside a PDF/DOCX, local OCR | `embedded_image_ocr` |
| Full-page OCR fallback inside a PDF, local OCR | `full_page_ocr` |
| Standalone image file, local OCR | `standalone_image_ocr` |
| AWS Textract | `aws_textract` |
| Google Cloud Vision | `gcp_vision` |
| Azure Document Intelligence | `azure_doc_intelligence` |
| Mathpix | `mathpix` |
| LlamaParse | `llama_parse` |
| Generic cloud base class (should not normally surface) | `cloud_ocr` |
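
The marker format is regular enough to parse with the standard library. The sketch below is an independent illustration, not the SDK's `list_image_regions` / `strip_image_markers`; the function names are hypothetical, and it assumes every marker carries exactly the attributes shown in the example above, in that order.

```python
import re
from typing import Dict, List

# Assumption: attributes appear exactly as in the example marker above.
_MARKER_RE = re.compile(
    r'<<<IMAGE_EXTRACT source="(?P<source>[^"]*)" page=(?P<page>\d+) '
    r'image_index=(?P<image_index>\d+) method="(?P<method>[^"]*)">>>\n'
    r"(?P<text>.*?)\n<<<END_IMAGE_EXTRACT>>>",
    re.DOTALL,
)


def find_image_regions(full_text: str) -> List[Dict[str, object]]:
    """Hypothetical helper: one dict per OCR block found in the text."""
    return [
        {
            "source": m.group("source"),
            "page": int(m.group("page")),
            "image_index": int(m.group("image_index")),
            "method": m.group("method"),
            "text": m.group("text"),
        }
        for m in _MARKER_RE.finditer(full_text)
    ]


def drop_marker_lines(full_text: str) -> str:
    """Hypothetical helper: remove marker lines but keep the OCR text itself."""
    kept = [
        line
        for line in full_text.splitlines()
        if not (line.startswith("<<<IMAGE_EXTRACT ") or line == "<<<END_IMAGE_EXTRACT>>>")
    ]
    return "\n".join(kept)
```

Filtering on the `method` field is a convenient way to audit which engine produced a given block.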

### Local OCR (Tesseract)

Enabled by default (`enable_ocr=True`) when the local extraction stack is importable. PDF/DOCX embedded-image OCR needs `pymupdf` + `pytesseract`; standalone image files additionally need `pillow`.

```python
from waveflowdb_client import Config, VectorLakeClient

cfg = Config(
    api_key="...",
    enable_ocr=True,       # default True
    ocr_lang="eng",        # Tesseract language code
)
client = VectorLakeClient(cfg)
```

If OCR is requested (`enable_ocr=True`) for a PDF/DOCX but `pymupdf`/`pytesseract` aren't installed, the SDK raises `FileProcessingError` rather than silently extracting text-layer-only content. Set `enable_ocr=False` if you want plain text-layer extraction without the rich stack installed.

### Cloud OCR Engines

Select a cloud engine via `cloud_ocr_provider`. All engines return the same `ExtractedContent` dataclass and use identical provenance markers, so they are interchangeable downstream.

| Provider | `cloud_ocr_provider` | Best for | Install |
|----------|----------------------|----------|---------|
| AWS Textract | `"aws"` | Tables, forms, handwriting, multi-column (subject to the mode caveat above) | `pip install boto3` |
| Google Cloud Vision | `"gcp"` | Highest accuracy, 100+ languages | `pip install google-cloud-vision` |
| Azure Document Intelligence | `"azure"` | Forms, invoices, receipts, custom models | `pip install azure-ai-documentintelligence` |
| Mathpix | `"mathpix"` | Math, scientific papers, LaTeX equations | `pip install requests` |
| LlamaParse | `"llama_parse"` | LLM-native parsing, complex layouts, MDX | `pip install requests` or `pip install llama-parse` |

### OCR strategy — hybrid vs. cloud-only

```
cloud_ocr_strategy="hybrid"  (default)
─────────────────────────────────────────────────────────
  File
   │
   ▼
  Local OCR (Tesseract)
   │
   ├── text length ≥ cloud_ocr_threshold (default 100 chars)?
   │       YES → return local result (fast, free)
   │       NO  → fall back to cloud engine
   │
   ▼
  Cloud OCR engine
   │
   ▼
  ExtractedContent (markers identify actual engine used)

cloud_ocr_strategy="cloud"
─────────────────────────────────────────────────────────
  Skip local entirely — always send to the cloud engine.
  Use when document quality is consistently poor, or you
  need math/formula extraction (Mathpix).
```
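
The decision in the diagram reduces to a few lines of control flow. The sketch below uses hypothetical callables in place of real OCR engines; `run_ocr` is not an SDK function, and comparing the raw character count (without stripping whitespace) is an assumption.

```python
from typing import Callable, Tuple


def run_ocr(
    local_ocr: Callable[[], str],
    cloud_ocr: Callable[[], str],
    strategy: str = "hybrid",
    threshold: int = 100,
) -> Tuple[str, str]:
    """Hypothetical helper: return (text, engine_used) for one file."""
    if strategy not in ("hybrid", "cloud"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy == "hybrid":
        local_text = local_ocr()
        # Enough local text: keep it and never call the cloud engine.
        if len(local_text) >= threshold:
            return local_text, "local"
    # Either strategy == "cloud", or local output was too short.
    return cloud_ocr(), "cloud"
```

In hybrid mode the cloud callable is only invoked when the local result falls below the threshold, which is what keeps the common case fast and free.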

### Using extractors directly

```python
from pathlib import Path
from waveflowdb_client import (
    PDFExtractor, DOCXExtractor, ImageFileExtractor,   # local
    AWSTextractExtractor, CloudOCRFactory,              # cloud
    strip_image_markers, list_image_regions,
)

# Local extraction
result = PDFExtractor().extract(Path("report.pdf"))
print(result.full_text)          # text + OCR blocks with markers (see ExtractedContent below)

# Strip markers to get clean plain text
clean = strip_image_markers(result.full_text)

# List all image regions with metadata
for region in list_image_regions(result.full_text):
    print(region)  # {source, page, image_index, method, text}

# Cloud extraction via factory (for AWS/GCP this path is subject to the mode caveat above: the default mode is always used)
from waveflowdb_client import Config
cfg = Config(api_key="...", cloud_ocr_provider="aws")
extractor = CloudOCRFactory.build(cfg)
result = extractor.extract(Path("scanned.pdf"))

# To force AWS Textract "tables" mode today, instantiate directly instead of via Config:
direct = AWSTextractExtractor(mode="tables", region_name="eu-west-1")
result = direct.extract(Path("invoice.pdf"))
```

> **Note:** Local extractors (`PDFExtractor`, `DOCXExtractor`, `ImageFileExtractor`, `strip_image_markers`, `list_image_regions`) and cloud extractors are conditionally imported in `waveflowdb_client/__init__.py` — wrapped in `try/except ImportError`. If `pymupdf`/`pytesseract` aren't installed, the local-extractor names simply aren't exported; the rest of the SDK still works. Same for cloud extractors and their respective provider SDKs.

---

## 🔌 Bringing Your Own Extractor

The SDK does not care how your text was produced. If you have a custom parser, a proprietary OCR model, or any pre-processing pipeline, the pattern is always: **extract with your code → write `.txt` files to the upload folder → call the SDK as normal.**

```python
from pathlib import Path
from waveflowdb_client import Config, VectorLakeClient
import os
from dotenv import load_dotenv

load_dotenv()

UPLOAD_DIR = Path("upload")
UPLOAD_DIR.mkdir(exist_ok=True)

def my_extractor(source_path: Path) -> str:
    """Replace this with any extraction logic you need."""
    return source_path.read_text(encoding="utf-8", errors="replace")

for src in Path("raw_data").glob("*.txt"):
    text = my_extractor(src)
    if text.strip():
        (UPLOAD_DIR / f"{src.stem}.txt").write_text(text, encoding="utf-8")

cfg = Config(
    api_key=os.getenv("VECTOR_LAKE_API_KEY"),
    vector_lake_path=str(UPLOAD_DIR),
)
client = VectorLakeClient(cfg)

result = client.add_documents(
    user_id=os.getenv("USER_ID"),
    vector_lake_description=os.getenv("NAMESPACE"),
)
print(result)
```

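`add_documents()` is safe to call repeatedly for this pattern. When you go through `run.py` / `run_utils.py`, the MD5 sidecar mechanism (see [Stale-chunk handling](#stale-chunk-handling)) skips files whose content hasn't changed. When you call `client.add_documents()` directly, as in the example above, there is no MD5 check: chunk files already on disk are reused as-is, so clear stale chunks yourself if a source file has changed.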

### Which upload method to use

| Situation | Method |
|-----------|--------|
| All files are brand-new to the namespace | `add_documents()` / `run_add()` |
| All files already exist and you're updating them | `refresh_documents()` / `run_refresh()` |
| Mix of new and updated, re-running after changes | `add_documents()` — handled by MD5-aware invalidation when called via `run.py` / `run_utils.py`; the raw SDK method itself does not do MD5 comparison (see below) |

> **Important distinction:** MD5-based change detection (`_invalidate_changed_files`) and unconditional chunk purging (`_purge_chunks_for_files`) both live in `run_utils.py`, not in `client.py`/`utils.py`. If you call `client.add_documents()` directly (bypassing `run.py`/`run_utils.py`) against a `vector_lake_path` with stale chunks already on disk, the SDK's own `BatchManager.create_batches()` will reuse those existing chunk files rather than re-extracting — there's no MD5 check at that layer. The MD5 sidecar workflow described in this README is something `run_utils.py` adds on top of the raw SDK for `run.py` users; if you build directly on `VectorLakeClient`, you're responsible for clearing stale chunks yourself (e.g. delete the `chunks/` directory, or call `client.refresh_documents()` for files that already exist in the namespace, which purges unconditionally inside `_regenerate_chunks_for_refresh`).

---

## 📋 `files_meta` — Root File Type Identification

Every batch upload automatically attaches a `files_meta` array alongside `files_name` and `files_data`. Each entry corresponds to one chunk and carries:

```json
{
  "filename":  "invoice_part1.txt",
  "extension": "pdf"
}
```

| Field | Description |
|-------|-------------|
| `filename` | The chunk filename as uploaded (e.g. `invoice_part1.txt`) |
| `extension` | The original source file extension — the root type before chunking |

### How the SDK resolves `extension`

The module-level function `_source_ext(chunk_name, base_path)` in `client.py` is called whenever `files_meta` isn't supplied explicitly. It parses the chunk stem via `VersionedChunkName.parse()`, globs `base_path` for a non-chunk file with that stem, and takes its real extension. If no source file is found (e.g. direct-upload mode), it falls back to the chunk's own extension (`"txt"`).

```
  invoice_part1.txt
       │
       ▼
  VersionedChunkName.parse()  →  stem = "invoice"
       │
       ▼
  glob("upload/invoice.*")  →  invoice.pdf  found
       │
       ▼
  extension = "pdf"          ←  sent in files_meta
```

For direct-mode uploads (`add_documents(files_name=..., files_data=...)`, `run_insert_path_direct`), the auto-generated metadata stub instead just takes the suffix of the supplied filename directly — there's no chunk parsing or globbing involved since there's no chunk on disk.
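
The chunk-to-source lookup described above can be approximated as follows. This is a sketch of the described behaviour, not the actual `_source_ext`; `resolve_source_ext` is a hypothetical name, and picking the first match in sorted order when several source files share a stem is an assumption.

```python
import glob
import re
import tempfile
from pathlib import Path

_CHUNK_RE = re.compile(r"^(?P<stem>.+)_part\d+\.txt$")


def resolve_source_ext(chunk_name: str, base_path: Path) -> str:
    """Hypothetical helper: extension of the source file behind a chunk."""
    match = _CHUNK_RE.match(chunk_name)
    if match:
        pattern = f"{glob.escape(match.group('stem'))}.*"
        for candidate in sorted(base_path.glob(pattern)):
            # Skip chunk files; only a real source file counts.
            if candidate.is_file() and not _CHUNK_RE.match(candidate.name):
                return candidate.suffix.lstrip(".").lower()
    # No source file found: fall back to the chunk's own extension.
    return Path(chunk_name).suffix.lstrip(".").lower()


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "invoice.pdf").write_bytes(b"%PDF-1.4")
    found = resolve_source_ext("invoice_part1.txt", base)    # source file present
    fallback = resolve_source_ext("orphan_part1.txt", base)  # no source file
```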

### Supplying your own `files_meta`

Any public upload method (`add_documents`, `refresh_documents`, `run_insert_path`, `run_insert_path_direct`) accepts an explicit `files_meta` list of dicts, forwarded verbatim to the server instead of the auto-generated `{filename, extension}` stub. When supplied, it must match the number of source files (not chunks) being uploaded.

---

## 📊 CSV and JSONL Handling

### CSV

`_split_csv_safe` in `utils.py` splits CSV files row-safely: the header row is repeated at the top of every chunk, rows are never split across chunk boundaries, and a single row that exceeds the byte limit on its own raises `FileProcessingError`. The reader tries `utf-8`, then `cp1252`, then `latin-1` encodings in order, falling back to `utf-8` with `errors="replace"` if none cleanly parse the header.
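
A simplified version of the row-safe split looks like this. It is a sketch of the technique only: `split_csv_rows` is a hypothetical name, `ValueError` stands in for `FileProcessingError`, the encoding fallback is omitted, and counting the header toward each chunk's byte budget is an assumption.

```python
import csv
import io
from typing import List


def _render(row: List[str]) -> str:
    """Serialise one row back to a CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


def split_csv_rows(text: str, max_bytes: int) -> List[str]:
    """Hypothetical helper: split CSV text into chunks that each repeat the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header_line = _render(rows[0])
    header_size = len(header_line.encode("utf-8"))
    if len(rows) == 1:
        return [header_line]

    chunks: List[str] = []
    current: List[str] = []
    current_size = header_size
    for row in rows[1:]:
        line = _render(row)
        size = len(line.encode("utf-8"))
        if header_size + size > max_bytes:
            # A row that cannot fit even alone is an error, never a split.
            raise ValueError("single row exceeds the byte limit")
        if current and current_size + size > max_bytes:
            chunks.append(header_line + "".join(current))
            current, current_size = [], header_size
        current.append(line)
        current_size += size
    if current:
        chunks.append(header_line + "".join(current))
    return chunks
```

Repeating the header means every chunk is a valid CSV document on its own.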

### JSONL / NDJSON

`_split_jsonl_safe` validates every non-blank line as JSON before writing it (raising `FileProcessingError` with the line number on failure). Chunks are cut by byte size, but only at record boundaries, so a single record is never split across two chunks.
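
The same idea for JSONL, as a minimal sketch: `split_jsonl_records` is a hypothetical name, `ValueError` stands in for `FileProcessingError`, and letting an oversized single record occupy its own chunk is an assumption about edge-case behaviour.

```python
import json
from typing import List


def split_jsonl_records(text: str, max_bytes: int) -> List[str]:
    """Hypothetical helper: validate each line, then group whole records by size."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped, not validated
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        record = line + "\n"
        size = len(record.encode("utf-8"))
        # Start a new chunk rather than splitting a record.
        if current and current_size + size > max_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        chunks.append("".join(current))
    return chunks
```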

### JSON

Kept atomic by the `FileProcessor`/`extract_text` pipeline — `.json` files are never split into multiple chunks regardless of size.

---

## 🔄 Document Sync — `add` vs. `refresh` vs. `insert`

There is no `sync_documents()` method on `VectorLakeClient`. The three relevant public methods are:

- `add_documents()` — for brand-new files. Existing on-disk chunks (if any) are reused by `BatchManager.create_batches()` without re-extraction.
- `refresh_documents()` — for files that already exist in the namespace. Stale chunks are purged and regenerated before upload (`_regenerate_chunks_for_refresh`), guaranteeing fresh content is always sent.
- `run_insert_path()` / `run_insert_path_direct()` — insert at a server-controlled position via the dedicated `insert_docs` endpoint, rather than `add_docs`'s implicit append behavior.

### Stale-chunk handling

This is implemented **in `run_utils.py`**, not in the SDK core, and is what `run.py` relies on:

| Action (via `run.py`) | Behaviour |
|---|---|
| `add` | `_invalidate_changed_files()` — computes an MD5 of each source file and compares it to a `.md5` sidecar in `chunks/`. Only changed (or first-time) files have their chunks purged; unchanged files keep their cache, skipping redundant OCR. |
| `refresh` / `insert` | `_purge_chunks_for_files()` — unconditionally deletes all `*_part*.txt` chunks (for the given files, or all files if none specified) before extraction, so the server always receives current content. |
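
The `add` row can be illustrated with a stand-alone sketch of MD5-sidecar invalidation. This is not the code in `run_utils.py`: `invalidate_if_changed` is a hypothetical name, and the `{stem}.md5` sidecar filename is an assumption made for the example.

```python
import glob
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def invalidate_if_changed(source: Path, chunks_dir: Path) -> bool:
    """Hypothetical helper: purge cached chunks when the source's MD5 changed.

    Returns True when chunks were invalidated (changed or first-time file).
    """
    sidecar = chunks_dir / f"{source.stem}.md5"  # assumption: sidecar naming
    current = file_md5(source)
    if sidecar.exists() and sidecar.read_text().strip() == current:
        return False  # unchanged: keep cached chunks, skip re-extraction
    for chunk in chunks_dir.glob(f"{glob.escape(source.stem)}_part*.txt"):
        chunk.unlink()
    sidecar.write_text(current)
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    chunks_dir = root / "chunks"
    chunks_dir.mkdir()
    src = root / "report.txt"
    src.write_text("version 1")
    first_run = invalidate_if_changed(src, chunks_dir)    # first-time file
    (chunks_dir / "report_part1.txt").write_text("version 1")
    second_run = invalidate_if_changed(src, chunks_dir)   # unchanged
    cache_kept = (chunks_dir / "report_part1.txt").exists()
    src.write_text("version 2")
    third_run = invalidate_if_changed(src, chunks_dir)    # changed
    cache_purged = not (chunks_dir / "report_part1.txt").exists()
```

The `refresh` / `insert` behaviour corresponds to running only the purge loop, without the digest comparison.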

### Resumable batch ranges

In **sequential** mode, `add_documents` / `refresh_documents` / `run_insert_path` support `start_from_batch` and `end_batch`:

```
  Batch 1 ✓  Batch 2 ✓  Batch 3 ✓  [CRASH]  Batch 4 ...  Batch 39
  ──────────────────────────────────────────────────────────────────
                                     ↑
                              start_from_batch=4
                              Re-run resumes here
```

This is a parameter you control directly when calling the SDK. `run.py`/`run_utils.py` layers automatic resume on top of this (see [Checkpoints](#checkpoints)).
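
As a sketch of what a batch range means, the helper below selects batches by number. `select_batches` is a hypothetical name, and treating batch numbers as 1-indexed with an inclusive `end_batch` is an assumption based on the diagram above.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_batches(
    batches: Sequence[T],
    start_from_batch: int = 1,
    end_batch: Optional[int] = None,
) -> List[Tuple[int, T]]:
    """Hypothetical helper: (batch_number, batch) pairs for the requested range."""
    if start_from_batch < 1:
        raise ValueError("start_from_batch must be >= 1")
    # Assumption: end_batch is inclusive and clamped to the number of batches.
    last = len(batches) if end_batch is None else min(end_batch, len(batches))
    return [(n, batches[n - 1]) for n in range(start_from_batch, last + 1)]
```

Applied to the scenario in the diagram, `start_from_batch=4` skips the three batches that already succeeded and continues through batch 39.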

---

## Search Modes — `flat` vs. `flat_filter`

Both values require `hybrid_filter=True`. `get_matching_docs` returns an `InvalidSearchTypeError` response dict if `search_type` is anything other than `"flat"`, `"flat_filter"`, or `None`, and a `ValidationError` response dict if `search_type` is set but `hybrid_filter` is `False`.

### `search_type="flat"`

Semantic search runs across **all** documents in the namespace; `hybrid_filter` acts as a scoring signal rather than restricting scope. Best for exploratory search and maximum recall.

### `search_type="flat_filter"`

Hybrid filter runs as a hard gate first; semantic ranking then only considers documents that passed. Best for targeted, high-precision retrieval.

### `search_type=None`

Pure semantic search with no hybrid filter — the server decides the strategy. This is the SDK default if you don't pass `search_type` at all.

```python
result = client.get_matching_docs(
    query="quarterly revenue breakdown",
    user_id="u1",
    vector_lake_description="finance_lake",
    hybrid_filter=True,
    search_type="flat_filter",
    top_docs=5,
)
```
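
The validation rules for `search_type` and `hybrid_filter` can be summarised in a small pre-flight check. `check_search_args` is a hypothetical helper that only returns the name of the error described above; the real response dict shape and the order in which the SDK applies the two checks are assumptions.

```python
from typing import Optional

VALID_SEARCH_TYPES = ("flat", "flat_filter")


def check_search_args(search_type: Optional[str], hybrid_filter: bool) -> Optional[str]:
    """Hypothetical helper: error name for a bad combination, else None."""
    if search_type is None:
        return None  # pure semantic search, nothing to validate
    if search_type not in VALID_SEARCH_TYPES:
        return "InvalidSearchTypeError"
    if not hybrid_filter:
        return "ValidationError"  # search_type set without hybrid_filter=True
    return None
```

Running a check like this before calling `get_matching_docs` lets a caller fail fast instead of inspecting the returned dict for an error.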

---

## 🖥️ `run.py` — Operations Launcher

`run.py` is the ready-to-use local launcher shipped with the SDK. All logic lives in `waveflowdb_client/run_utils.py`; `run.py` itself is a `CONFIG` block plus a small dispatcher, so you only ever need to edit the top of the file.

```bash
python run.py
```

### Full source

This is the actual `run.py` shipped with the SDK:

```python
"""
run.py — WaveflowDB SDK launcher
═══════════════════════════════════════════════════════════════════
SETUP (3 steps)

  1. Create a .env file next to this script:

         VECTOR_LAKE_API_KEY=your_api_key_here
         VECTOR_LAKE_HOST=https://waveflow-analytics.com
         USER_ID=your@email.com
         NAMESPACE=your_namespace

  2. Drop files to index into the folder set by VECTOR_LAKE_PATH
     (default: ./upload).  Supported: txt, pdf, docx, csv, json,
     jsonl, py, ipynb, png, jpg, jpeg, tiff, bmp, webp.

  3. Set ACTION below and run:  python run.py

All logic lives in run_utils.py.  Edit only the CONFIG block here.
═══════════════════════════════════════════════════════════════════
"""

import sys

from waveflowdb_client.run_utils import (
    run_add,
    run_delete,
    run_docs_info,
    run_full_corpus_search,
    run_health,
    run_insert,
    run_namespace_details,
    run_query,
    run_refresh,
)

# ════════════════════════════════════════════════════════════════════
#  CONFIG — edit only this section
# ════════════════════════════════════════════════════════════════════

# ── Action ──────────────────────────────────────────────────────────
# "add"     — Index new documents from VECTOR_LAKE_PATH.
#             Changed files are detected via MD5; unchanged files reuse cache.
# "refresh" — Re-index existing documents. Stale chunks always purged first.
# "insert"  — Insert documents at a server-controlled position.
#             Stale chunks always purged first.
# "delete"  — Remove documents from the index (set FILES_TO_DELETE below).
# "query"   — Semantic / VQL search.
# "health"  — Ping the server and confirm connectivity.
# "info"    — List documents indexed in the namespace.
# "ns"      — Show storage and quota metadata for the namespace.
# "rfc"     — Exact keyword search across the full corpus.
# "all"     — Run every action in sequence (delete is skipped unless
#             FILES_TO_DELETE is populated).
ACTION = "ns"

# ── Upload settings  (add / refresh / insert / all) ─────────────────
# "streaming"  — Upload begins as soon as the first batch of chunks is
#                ready.  Best for large or OCR-heavy collections.
# "sequential" — All files are chunked first, then uploaded one batch
#                at a time.  Required for checkpoint/resume support.
MODE             = "streaming"
RESUME           = True   # sequential only: auto-resume from last checkpoint.
START_FROM_BATCH = 1      # sequential only: used when RESUME=False.
END_BATCH        = None   # Stop after this batch number; None = run all.

# ── Query settings  (query / all) ───────────────────────────────────
# VQL syntax: {keyword}  {(phrase)}  {A or B}


QUERY = """select top 4
          where query is "Each chunk carries a repeated header row so the backend always 
          has full column context regardless of which segment" 
          contains  {}"""




TOP_DOCS  = 1      # Maximum results returned.
THRESHOLD = 0.1    # Minimum relevance score (0–1); lower = broader recall.
WITH_DATA = True   # True = include document text snippets in results.

# ── Delete settings  (delete / all) ─────────────────────────────────
# List exact filenames as they appear in the index.
FILES_TO_DELETE = [
    # "old_report.pdf",
    # "draft_notes.txt",
]

# ── Full-corpus search settings  (rfc / all) ────────────────────────
RFC_KEYWORD  = "{pre-wedding jitters}"
RFC_TOP_DOCS = 10

# ════════════════════════════════════════════════════════════════════
#  DISPATCH — nothing to edit below this line
# ════════════════════════════════════════════════════════════════════

# Shared kwargs forwarded to every upload action.
_UPLOAD_KWARGS = dict(
    mode=MODE,
    resume=RESUME,
    start_from_batch=START_FROM_BATCH,
    end_batch=END_BATCH,
)

# Shared kwargs forwarded to run_query.
_QUERY_KWARGS = dict(
    search_type="flat_filter",
    top_docs=TOP_DOCS,
    threshold=THRESHOLD,
    with_data=WITH_DATA,
)

# All actions executed by ACTION="all" in this order.
# "delete" is intentionally excluded — it runs only when FILES_TO_DELETE
# is populated, to prevent accidental data loss.
_ALL_SEQUENCE = [
    "health",
    "ns",
    "info",
    "add",
    "refresh",
    "insert",
    "query",
    "rfc",
]


def _run_one(action: str) -> None:
    """Execute a single named action."""

    if action == "add":
        run_add(**_UPLOAD_KWARGS)

    elif action == "refresh":
        run_refresh(**_UPLOAD_KWARGS)

    elif action == "insert":
        run_insert(**_UPLOAD_KWARGS)

    elif action == "delete":
        if not FILES_TO_DELETE:
            print(
                "\nERROR: FILES_TO_DELETE is empty.\n"
                "       Add filenames to remove in the CONFIG section above.\n"
            )
            sys.exit(1)
        run_delete(FILES_TO_DELETE)

    elif action == "query":
        run_query(QUERY, **_QUERY_KWARGS)

    elif action == "health":
        run_health()

    elif action == "info":
        run_docs_info()

    elif action == "ns":
        run_namespace_details()

    elif action == "rfc":
        run_full_corpus_search(RFC_KEYWORD, top_docs=RFC_TOP_DOCS)

    else:
        valid = ", ".join(_ALL_SEQUENCE + ["delete", "all"])
        print(f"\nERROR: Unknown ACTION={action!r}\n       Valid choices: {valid}\n")
        sys.exit(1)


if __name__ == "__main__":

    if ACTION == "all":
        print("\n══ Running ALL actions ══\n")
        passed, failed = [], []

        for act in _ALL_SEQUENCE:
            print(f"\n── {act.upper()} {'─' * (54 - len(act))}")
            try:
                _run_one(act)
                passed.append(act)
            except Exception as exc:
                print(f"  ✗  {act} failed: {type(exc).__name__}: {exc}")
                failed.append(act)

        # Run delete only when explicitly configured — avoids accidental loss.
        if FILES_TO_DELETE:
            print(f"\n── DELETE {'─' * 61}")
            try:
                run_delete(FILES_TO_DELETE)
                passed.append("delete")
            except Exception as exc:
                print(f"  ✗  delete failed: {type(exc).__name__}: {exc}")
                failed.append("delete")

        # Summary
        print(f"\n{'═' * 70}")
        print(f"  Completed: {len(passed)}   Failed: {len(failed)}")
        if passed:
            print(f"  ✓  {', '.join(passed)}")
        if failed:
            print(f"  ✗  {', '.join(failed)}")
        print(f"{'═' * 70}\n")

    else:
        _run_one(ACTION)
```

> Note: the docstring and inline comments in this file mention "VQL syntax" and `.ipynb` support. Both reflect what ships (see [Supported File Types](#️-supported-file-types) for the `ipynb` allowed-extensions caveat). The default `RFC_KEYWORD` (`"{pre-wedding jitters}"`) and the empty `{}` in the default `QUERY` are also exactly what ships in the file. `full_corpus_search` passes `RFC_KEYWORD` to the server as a plain string; the SDK does not parse or strip the braces client-side.

### CONFIG block

The part of `run.py` you actually edit:

```python
ACTION = "add"   # see table below
```

### Supported `ACTION` values

| `ACTION` | What it does |
|----------|---------------|
| `"add"` | Index new documents from `VECTOR_LAKE_PATH` via `run_add()`. Changed files are detected via MD5; unchanged files reuse cached chunks. |
| `"refresh"` | Re-index existing documents via `run_refresh()`. Stale chunks are always purged first. |
| `"insert"` | Insert documents at a server-controlled position via `run_insert()`. Stale chunks are always purged first. |
| `"delete"` | Remove documents from the index via `run_delete()` (set `FILES_TO_DELETE`). |
| `"query"` | Semantic / hybrid search via `run_query()` (set `QUERY`). |
| `"health"` | Ping the server via `run_health()`. |
| `"info"` | List indexed documents via `run_docs_info()`. |
| `"ns"` | Show storage and quota metadata via `run_namespace_details()`. |
| `"rfc"` | Exact keyword search via `run_full_corpus_search()` (set `RFC_KEYWORD`). |
| `"all"` | Run `health`, `ns`, `info`, `add`, `refresh`, `insert`, `query`, `rfc` in that order. `delete` is skipped unless `FILES_TO_DELETE` is populated, to prevent accidental data loss. |

There is no `"update"` alias for `"refresh"` — `ACTION` must be one of the exact values above, or `run.py` prints an error listing valid choices and exits.

### Upload settings (`add` / `refresh` / `insert` / `all`)

```python
MODE             = "streaming"   # "streaming" | "sequential"
RESUME           = True          # sequential: auto-resume from last checkpoint; streaming: skip already-completed file stems
START_FROM_BATCH = 1             # sequential only: used when RESUME=False (or no checkpoint exists)
END_BATCH        = None          # sequential only: stop after this batch number; None = run all
```

- `"streaming"` — upload begins as soon as the first minibatch of chunks is ready. Best for large or OCR-heavy collections. `START_FROM_BATCH` / `END_BATCH` have no effect in this mode.
- `"sequential"` — all files are chunked first, then uploaded one batch at a time. Required for `START_FROM_BATCH` / `END_BATCH` control and for `last_ok_batch`-based checkpoint/resume.

### Query settings (`query` / `all`)

```python
QUERY = """select top 4
          where query is "..."
          contains {keyword}"""

TOP_DOCS  = 1      # maximum results returned
THRESHOLD = 0.1    # minimum relevance score (0-1); lower = broader recall
WITH_DATA = True   # True = include document text snippets in results
```

`run_query()` always passes `search_type="flat_filter"` when called from `run.py`'s dispatcher (`_QUERY_KWARGS`), so `hybrid_filter` is enabled automatically.

> Note on the query string itself: `select ... where query is "..." contains {...}` is the natural-language query syntax this SDK's backend expects — `get_matching_docs` forwards the full `query` string to the server as-is along with the `hybrid_filter` / `search_type` / `top_docs` / `threshold` parameters. The client-side code in this repository does not itself parse braces or `select/where/contains` syntax; that parsing happens server-side.

### Delete settings (`delete` / `all`)

```python
FILES_TO_DELETE = [
    "old_report.pdf",
    "draft_notes.txt",
]
```

`ACTION = "delete"` exits with an error if `FILES_TO_DELETE` is empty, rather than silently doing nothing. With `ACTION = "all"`, the delete step runs only when the list is populated; an empty list skips it without an error.

### Full-corpus search settings (`rfc` / `all`)

```python
RFC_KEYWORD  = "shipped"
RFC_TOP_DOCS = 10
```

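`full_corpus_search` is a plain keyword scan (see `client.py`'s `full_corpus_search` method), so an ordinary search term or phrase is enough here; brace syntax is not needed. Braces in the value, such as those in the shipped default, are sent to the server unchanged.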

### Stale-chunk handling

See [Stale-chunk handling](#stale-chunk-handling) above — `run.py`'s upload actions always run either `_invalidate_changed_files()` (for `add`) or `_purge_chunks_for_files()` (for `refresh`/`insert`) before dispatching to the SDK.

### Checkpoints

Checkpoints are written to:

```
{VECTOR_LAKE_PATH}/chunks/checkpoint_{USER_ID}_{NAMESPACE}_{operation}.json
```

`USER_ID` and `NAMESPACE` are sanitized (`/` → `_`, `@` → `_at_`) so the filename is always filesystem-safe, and different namespaces/operations never collide.
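
As a rough sketch of how such a filename could be derived: `sanitize` and `checkpoint_path` are hypothetical names, not the helpers in `run_utils.py`, and the code applies only the two substitutions documented above.

```python
from pathlib import Path


def sanitize(value: str) -> str:
    """Apply the two documented substitutions: '/' -> '_' and '@' -> '_at_'."""
    return value.replace("/", "_").replace("@", "_at_")


def checkpoint_path(vector_lake_path: str, user_id: str,
                    namespace: str, operation: str) -> Path:
    """Build the checkpoint location from the documented filename pattern."""
    name = f"checkpoint_{sanitize(user_id)}_{sanitize(namespace)}_{operation}.json"
    return Path(vector_lake_path) / "chunks" / name


print(checkpoint_path("./upload", "alice@example.com", "finance_q3", "add_docs"))
```

Because the operation is part of the filename, an `add` run and a `refresh` run against the same namespace keep separate checkpoints.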

**Sequential checkpoint** (stores the last successfully completed batch number):

```json
{
  "last_ok_batch": 14,
  "operation": "add_docs",
  "user_id": "alice_at_example.com",
  "namespace": "finance_q3",
  "mode": "sequential",
  "updated_at": "2025-08-01T10:22:00+00:00"
}
```

**Streaming checkpoint** (stores completed source-file stems):

```json
{
  "completed_stems": ["report_2024", "appendix_a"],
  "operation": "add_docs",
  "user_id": "alice_at_example.com",
  "namespace": "finance_q3",
  "mode": "streaming",
  "updated_at": "2025-08-01T10:22:00+00:00"
}
```

Checkpoints are cleared automatically (`_clear_checkpoint`) once a run completes with zero failed batches. To force a fresh start, set `RESUME = False` or delete the checkpoint file manually.

### Failure auditing

Every failed batch — from `run_add`, `run_refresh`, `run_insert`, or `run_delete` — is appended to `logs/failed_files.jsonl` by `_log_failed_files()` in `run_utils.py`:

```json
{
  "timestamp": "2025-08-01T10:23:45Z",
  "operation": "add_docs",
  "user_id": "alice_at_example.com",
  "namespace": "finance_q3",
  "batch_num": 7,
  "files": ["report_part3.txt", "report_part4.txt"],
  "error_type": "HTTP_500",
  "error_message": "Internal server error",
  "traceback": null,
  "extra": {}
}
```
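
A minimal sketch for reading this log back, assuming each record occupies a single line (the usual JSON Lines convention; the record above is pretty-printed only for readability). `failed_files_by_operation` is a hypothetical helper, not part of `run_utils.py`.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def failed_files_by_operation(path: str) -> Dict[str, List[str]]:
    """Group the filenames recorded in a failed_files.jsonl log by operation."""
    log = Path(path)
    if not log.exists():
        return {}
    grouped: Dict[str, List[str]] = defaultdict(list)
    with log.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            grouped[record.get("operation", "unknown")].extend(record.get("files", []))
    return dict(grouped)


print(failed_files_by_operation("logs/failed_files.jsonl"))
```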

### Full action map

```
  run.py
  │
  ├── UPLOAD
  │   ├── run_add()      "add"      — new documents, MD5-aware caching
  │   ├── run_refresh()  "refresh"  — existing documents, chunks always purged
  │   └── run_insert()   "insert"   — controlled-position insert, chunks always purged
  │
  ├── DELETE
  │   └── run_delete()   "delete"   — remove by filename (FILES_TO_DELETE required)
  │
  ├── QUERY
  │   ├── run_query()                "query" — flat_filter semantic/hybrid search
  │   └── run_full_corpus_search()   "rfc"   — exact keyword scan
  │
  └── INFO
      ├── run_health()             "health" — ping server
      ├── run_namespace_details()  "ns"     — storage & quota metadata
      └── run_docs_info()          "info"   — list indexed documents
```

### Sample output — `run_health()`

```
──────────────────────────────────────────────────────────────────────
  HEALTH CHECK
──────────────────────────────────────────────────────────────────────
{
  "status": "ok",
  "message": "...",
  ...
}
```

(The exact response shape comes from the server; the SDK forwards it as-is via `_print_result`.)

### Sample output — `run_namespace_details()`

```
──────────────────────────────────────────────────────────────────────
  NAMESPACE DETAILS
──────────────────────────────────────────────────────────────────────
{
  "reply": {
    "content": [
      {
        "vector_lake_description": "customdataset_",
        "faiss_disk_size_mb": "14.43 mb",
        "source_store_files": 166,
        "source_store_used_mb": "30.9062 mb",
        "source_store_quota_remaining_mb": "12257.09 mb"
      }
    ]
  },
  "message": "Information retrieved successfully",
  "status_code": 200
}
```

This response body's exact shape is server-defined; `_print_result` in `run_utils.py` pretty-prints whatever JSON dict comes back.
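
If you need the numbers rather than the printed JSON, something like the following could work. It assumes the response keeps the shape of the sample above (`reply.content` as a list, sizes as strings such as `"12257.09 mb"`); since the shape is server-defined, treat `quota_remaining_mb` as a hypothetical helper that may need adjusting.

```python
from typing import Dict


def quota_remaining_mb(response: dict) -> Dict[str, float]:
    """Map each namespace to its remaining source-store quota in MB."""
    reply = response.get("reply")
    if not isinstance(reply, dict):
        return {}  # error dicts and unexpected shapes yield nothing
    out: Dict[str, float] = {}
    for entry in reply.get("content", []):
        raw = str(entry.get("source_store_quota_remaining_mb", "")).strip()
        number = raw.split()[0] if raw else ""
        try:
            out[entry.get("vector_lake_description", "")] = float(number)
        except ValueError:
            continue  # unrecognised format: skip rather than guess
    return out
```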

---

## 📚 VectorLakeClient API Reference

All public methods return a plain `dict` and never raise — every method body is wrapped in `try/except VectorLakeError` / `except Exception`, converting failures into `{"success": False, "error": "...", "message": "..."}` (or the richer schema from `APIError.to_response()` for HTTP failures, which adds `status_code` and `response_text`).

### `add_documents(...)`

Add new documents to the index.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `user_id` | str | required | Caller identity forwarded verbatim to the server. |
| `vector_lake_description` | str | required | Namespace / lake identifier. |
| `start_from_batch` | int | `1` | (Path mode, sequential) Resume from this 1-based batch. |
| `end_batch` | int \| None | `None` | (Path mode, sequential) Stop after this batch (inclusive). `None` = all. |
| `intelligent_segmentation` | bool | `True` | Whether the server applies intelligent chunking. |
| `session_id` | str \| None | `None` | Optional session correlation token. |
| `files` | List[str] \| None | `None` | (Path mode) Explicit source filenames. `None` = all supported files in `vector_lake_path`. |
| `files_name` | List[str] \| None | `None` | (Direct mode) Ordered filenames. |
| `files_data` | List[str] \| None | `None` | (Direct mode) Ordered contents — must match `files_name` length. |
| `files_meta` | List[dict] \| None | `None` | Per-file metadata; auto-generated `{filename, extension}` when omitted. |
| `processing_mode` | str | `"sequential"` | `"sequential"` or `"streaming"`. |
| `on_batch_success` | callable \| None | `None` | Optional callback invoked with the list of chunk filenames after each successful batch — used by `run_utils.py`'s streaming checkpoint logic. |

Direct mode activates whenever both `files_name` and `files_data` are not `None` — no disk I/O, no chunking, no progress bar, single request via `_direct_upload`.

### `refresh_documents(...)`

Same signature and parameter table as `add_documents`. The behavioral difference: in path mode, stale on-disk chunks for the affected source files are always purged and regenerated (`_regenerate_chunks_for_refresh`) before upload. Direct mode never touches disk chunks, since content comes from `files_data` directly.

### `delete_documents(user_id, vector_lake_description, files_name, session_id=None)`

Removes documents from the index by filename. Lightweight metadata-only call — no chunking, no batching.

- `files_name` must be non-empty — an empty list returns a `ValidationError` response immediately.
- Basenames are extracted automatically (`os.path.basename`) — full paths are accepted and stripped.

```python
client.delete_documents(
    user_id="u1",
    vector_lake_description="my_lake",
    files_name=["report.pdf", "notes.txt"],
)
```

### `run_insert_path(...)`

Insert documents at a server-controlled position, reading source files from disk (path mode only — no direct-mode `files_name`/`files_data` parameters on this method).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `user_id` | str | required | Caller identity. |
| `vector_lake_description` | str | required | Namespace / lake identifier. |
| `start_from_batch` | int | `1` | Resume point (sequential mode). |
| `end_batch` | int \| None | `None` | Upper batch bound (sequential mode). |
| `intelligent_segmentation` | bool | `True` | Server-side segmentation. |
| `session_id` | str \| None | `None` | Optional session token. |
| `files` | List[str] \| None | `None` | Explicit source filenames; `None` = all supported files. |
| `files_meta` | List[dict] \| None | `None` | Per-file metadata. |
| `processing_mode` | str | `"sequential"` | `"sequential"` or `"streaming"`. |
| `on_batch_success` | callable \| None | `None` | Same purpose as in `add_documents`. |

### `run_insert_path_direct(user_id, vector_lake_description, files_name, files_data, intelligent_segmentation=True, session_id=None, files_meta=None)`

Direct-mode insert — bypasses disk I/O entirely. `files_name` and `files_data` are required and must be non-empty and equal length, or a `ValidationError` response is returned.

```python
client.run_insert_path_direct(
    user_id="u1",
    vector_lake_description="my_lake",
    files_name=["appendix.txt"],
    files_data=["Section A content ..."],
)
```

### `get_matching_docs(...)`

Retrieve top-matching document chunks via semantic search with optional hybrid filtering.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | str | required | Natural-language search query. |
| `user_id` | str | required | Caller identity. |
| `vector_lake_description` | str | required | Namespace / lake identifier. |
| `pattern` | str | `"static"` | `"static"` for the persistent index; `"dynamic"` is set automatically when `files_name`/`files_data` are supplied. |
| `session_id` | str \| None | `None` | Optional session token. |
| `hybrid_filter` | bool | `False` | Activates hybrid keyword + vector filtering. **Required (`True`)** when `search_type` is set. |
| `search_type` | str \| None | `None` | `"flat"`, `"flat_filter"`, or `None` (server default / pure semantic). |
| `top_docs` | int | `10` | Maximum chunks to return. |
| `threshold` | float | `0.2` | Minimum similarity score (0–1). |
| `files_name` | List[str] \| None | `None` | (Dynamic mode) Temporary files to search over. |
| `files_data` | List[str] \| None | `None` | (Dynamic mode) Must match `files_name` length. |
| `with_data` | bool | `False` | If `True`, raw chunk text is included in the response (routes to the `top_matching_docs_with_data` endpoint instead of `top_matching_docs`). |

Validation: `InvalidSearchTypeError` response if `search_type` isn't `None`/`"flat"`/`"flat_filter"`; `ValidationError` response if `search_type` is set but `hybrid_filter=False`; `ValidationError` response if `files_name`/`files_data` lengths mismatch.
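
The three rules can be mirrored in a small pre-flight check. This is a sketch only: `check_search_args` is a hypothetical helper, the order of the checks is an assumption, and the SDK returns full response dicts rather than bare error names.

```python
from typing import List, Optional

VALID_SEARCH_TYPES = (None, "flat", "flat_filter")


def check_search_args(search_type: Optional[str],
                      hybrid_filter: bool,
                      files_name: Optional[List[str]] = None,
                      files_data: Optional[List[str]] = None) -> Optional[str]:
    """Return the error name the documented rules would produce, or None."""
    if search_type not in VALID_SEARCH_TYPES:
        return "InvalidSearchTypeError"
    if search_type is not None and not hybrid_filter:
        return "ValidationError"
    if (files_name is not None and files_data is not None
            and len(files_name) != len(files_data)):
        return "ValidationError"
    return None


print(check_search_args("flat_filter", hybrid_filter=True))
print(check_search_args("flat", hybrid_filter=False))
```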

### `health_check(user_id, vector_lake_description, session_id=None)`

Pings the server and confirms namespace connectivity.

### `get_namespace_details(user_id, vector_lake_description=None, session_id=None)`

Returns storage/quota metadata for one namespace (if `vector_lake_description` supplied) or all namespaces belonging to `user_id`.

### `get_docs_information(user_id, vector_lake_description, session_id=None, keyword=None, threshold=70)`

Returns document-level metadata. `keyword` applies fuzzy filename matching at the given `threshold` (0–100). When `keyword` is not supplied, it is omitted from the request.

### `full_corpus_search(user_id, vector_lake_description, keyword, session_id=None, top_docs=10)`

Exact keyword scan across all indexed text — distinct from `get_matching_docs`'s vector similarity search. Useful for compliance lookups or exact-identifier searches.

---

## Error Handling

All exceptions inherit from `VectorLakeError` (defined in `exceptions.py`) and expose `.to_response()`. Since every public method catches and converts exceptions internally, you generally only need to check the `"error"` key in the returned dict.

| Exception | When raised |
|-----------|-------------|
| `ConfigError` | API key missing, or `streaming_minibatch_size_mb < max_batch_size_mb`, or an invalid `cloud_ocr_provider` / `cloud_ocr_strategy` — all at `Config.__init__` time. |
| `ValidationError` | Mismatched `files_name`/`files_data` lengths, empty `files_name` on delete/insert, or `search_type` set without `hybrid_filter=True`. |
| `InvalidSearchTypeError` (subclass of `ValidationError`) | `search_type` is not `None`, `"flat"`, or `"flat_filter"`. |
| `DocumentNotFoundError` (subclass of `ValidationError`) | Carries `.filename`. Defined in `exceptions.py`, but not currently raised anywhere in `client.py` — `delete_documents` does not verify the filename exists before calling the server. |
| `UnsupportedFileTypeError` (subclass of `FileProcessingError`) | File extension not in `allowed_extensions`. Carries `.filename` and `.allowed`. |
| `FileProcessingError` | File I/O, encoding, OCR-stack-missing, or parsing failure (e.g. malformed JSONL line, empty CSV, oversized single row/record). |
| `APIError` | HTTP 4xx/5xx from the server, or a network-level failure that exhausted retries. Carries `.status_code` and `.response_text`. |
| `ThrottleError` (subclass of `APIError`) | HTTP 429. Carries `.retry_after`. |
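
Because failures arrive as dicts, a caller-side check might look like the sketch below. `describe_failure` is a hypothetical helper; it assumes a failed call carries a non-empty `"error"` value and, for HTTP failures, a `status_code`, as described above.

```python
from typing import Optional


def describe_failure(result: dict) -> Optional[str]:
    """Return a one-line description if `result` is an error dict, else None."""
    if not result.get("error"):
        return None
    parts = [str(result["error"])]
    if result.get("message"):
        parts.append(str(result["message"]))
    if result.get("status_code") is not None:
        parts.append(f"HTTP {result['status_code']}")
    return " | ".join(parts)


failure = {"success": False, "error": "ValidationError", "message": "files_name is empty"}
print(describe_failure(failure))
```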

---

## Structured Logging

### Python `logging` (module logger)

`client.py`'s logger (`logging.getLogger(__name__)`) propagates to the root logger, which `run_utils.py` configures with both a file handler (`logs/run_operations.log`, level `DEBUG`) and a stdout handler (level `INFO`). If you use the raw SDK without `run_utils.py`, no handlers are attached by default — configure `logging` yourself to see output.
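
For raw SDK use, a minimal configuration could look like this. It assumes the module logger is named `waveflowdb_client.client` (which follows from `logging.getLogger(__name__)` and the package layout), so a handler on the parent `waveflowdb_client` logger receives its records. `enable_sdk_logging` is a hypothetical helper, not an SDK function.

```python
import logging
import sys


def enable_sdk_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a stdout handler so SDK log records become visible."""
    logger = logging.getLogger("waveflowdb_client")
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


enable_sdk_logging()
```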

Representative log lines actually emitted by the client (sequential mode):

```
Batch 3/39  ✓  221.4ms  ok=3 fail=0  ETA 312s  sources=[report]  chunks=[report_part3.txt]
Batch 5/39  ✗  1820.0ms  ok=4 fail=1  ETA 280s  sources=[notes]  chunks=[notes_part1.txt]
```

And streaming mode:

```
Batch #2  ✓  1823.4ms  ok=2 fail=0  sources=[report]  chunks=[report_part1.txt]
```

A run-completion summary line is also logged via `_print_run_summary`:

```
ADD_DOCS COMPLETE — All batches succeeded | range #1→#39 | uploaded 39/39 | skipped 0 | avg 1.84s max 3.12s
```

### CSV logs (`log_dir`, default `logs/`)

`Logger` (in `utils.py`) writes three CSV files, each created with a header row on first write:

| File | Columns | Written from |
|------|---------|---------------|
| `performance.csv` | `timestamp`, plus whatever keyword args are passed to `log_performance(**kwargs)` — in practice `operation`, `batch_num`, `latency_ms`, `request_kb`, `response_kb`, `http_status` | `_make_request`, on every HTTP call |
| `api_errors.csv` | `timestamp`, `operation`, `batch_num`, `message` | `_make_request`, on throttling/HTTP-error/exhausted-retry paths |
| `skipped_files.csv` | `timestamp`, `filename`, `reason` | `Logger.log_skipped_file` (available for callers; not currently invoked anywhere in `client.py`/`utils.py` itself) |

In streaming or otherwise concurrent contexts, rows in `performance.csv` may arrive out of strict chronological order; sort by `timestamp` or the relevant batch number column when analysing.
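
One way to put rows back in order, sketched with the standard library. It assumes `timestamp` values are ISO-8601 strings (which sort correctly as text) and that a `batch_num` column is present; adjust if your file differs. `rows_in_time_order` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List


def rows_in_time_order(csv_text: str) -> List[Dict[str, str]]:
    """Parse performance.csv content and sort rows by timestamp, then batch."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def batch_key(row: Dict[str, str]) -> int:
        value = row.get("batch_num") or ""
        return int(value) if value.isdigit() else -1

    return sorted(rows, key=lambda r: (r.get("timestamp") or "", batch_key(r)))


sample = (
    "timestamp,operation,batch_num,latency_ms\n"
    "2025-08-01T10:22:03,add_docs,2,1823.4\n"
    "2025-08-01T10:22:01,add_docs,1,221.4\n"
)
for row in rows_in_time_order(sample):
    print(row["timestamp"], row["batch_num"])
```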

### `failed_files.jsonl`

This file (and its JSON-lines schema shown under [Failure auditing](#failure-auditing)) is written by `run_utils.py`, not by the core SDK (`client.py`/`utils.py`). If you call `VectorLakeClient` methods directly without going through `run.py`, failures are *not* automatically appended anywhere on disk beyond the `api_errors.csv`/`run_operations.log` paths above — you'd need to inspect the returned dict and persist failures yourself.

### Run reports (`chunks/`)

After every sequential or streaming run, a JSON report is written to `{vector_lake_path}/chunks/run_{run_id}.json` by `_save_run_report`:

```json
{
  "run_id": "a3f1b2",
  "timestamp": "2024-01-15T10:23:01",
  "operation": "add_docs",
  "parallel_batches": 1,
  "start_batch": 1,
  "end_batch": 39,
  "successful": 38,
  "failed": 1,
  "server_skipped": 0,
  "batches": [
    { "batch_number": 1, "success": true,  "server_skip": false, "files": ["report_part1.txt"], "processing_time": 3.2, "elapsed_ms": 220.1, "error": null },
    { "batch_number": 5, "success": false, "server_skip": false, "files": ["notes_part1.txt"],  "processing_time": 1.1, "elapsed_ms": 1820.0, "error": "upstream timeout" }
  ]
}
```

The dict returned by `add_documents`/`refresh_documents`/`run_insert_path` (built by `_build_run_result`) mirrors this, plus `mode`, `total_batches`, `successful_batches`, `failed_batches`, and `failed_batch_nums` — use `failed_batch_nums` together with `start_from_batch`/`end_batch` to target a retry.
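
A targeted retry can be planned by collapsing `failed_batch_nums` into inclusive ranges. `retry_ranges` is a hypothetical helper; it assumes batch numbering is unchanged between the failed run and the retry (same files, same chunking).

```python
from typing import Iterable, List, Tuple


def retry_ranges(failed_batch_nums: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse failed batch numbers into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for num in sorted(set(failed_batch_nums)):
        if ranges and num == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], num)  # extend the current run
        else:
            ranges.append((num, num))  # start a new run
    return ranges


for start, end in retry_ranges([5, 6, 7, 12]):
    print(f"start_from_batch={start}, end_batch={end}")
```

Each pair maps onto one sequential-mode call with the matching `start_from_batch` and `end_batch`.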

---

## Retry and Backoff

```
  Request attempt (up to max_retries total attempts)
      │
      ├── HTTP 429     → wait Retry-After header (or 2^attempt s) → retry
      ├── Timeout      → wait 2^attempt s → retry
      ├── ConnectionError → wait 2^attempt s → retry
      ├── Other RequestException → return error dict immediately (no retry)
      ├── HTTP 4xx/5xx (non-429) → return error dict immediately (no retry)
      └── Success (HTTP < 400)   → return response dict
```

`max_retries` (default `2`, env `VECTOR_LAKE_MAX_RETRIES`) is the total number of attempts in the `for attempt in range(self.config.max_retries)` loop — so the default of `2` means at most 2 HTTP attempts total for timeout/connection/429 cases, not 2 retries on top of an initial attempt.
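
The attempt counting can be sketched as follows. This is not the SDK's `_make_request`: the built-in `TimeoutError` and `ConnectionError` stand in for the HTTP library's exceptions, `Throttled` is a hypothetical stand-in for an HTTP 429 response, and skipping the sleep after the final attempt is an assumption.

```python
import time
from typing import Callable, Optional


class Throttled(Exception):
    """Stand-in for an HTTP 429 response; carries the Retry-After value."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("HTTP 429")
        self.retry_after = retry_after


def call_with_retries(send: Callable[[], dict],
                      max_retries: int = 2,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `send` at most `max_retries` times in total."""
    last_error = "no attempts made"
    for attempt in range(max_retries):
        try:
            return send()
        except Throttled as exc:
            wait = exc.retry_after if exc.retry_after is not None else 2 ** attempt
            last_error = "HTTP 429"
        except (TimeoutError, ConnectionError) as exc:
            wait = 2 ** attempt
            last_error = type(exc).__name__
        except Exception as exc:  # any other failure: error dict, no retry
            return {"success": False, "error": type(exc).__name__, "message": str(exc)}
        if attempt < max_retries - 1:  # assumption: no sleep after the final attempt
            sleep(wait)
    return {"success": False, "error": "APIError",
            "message": f"retries exhausted: {last_error}"}
```

With the default `max_retries=2`, a request that always times out is sent twice and then reported as a failure.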

---

## 🎯 No Schema Required

```
  Traditional approach                  WaveflowDB approach
  ────────────────────                  ───────────────────
  1. Define a schema               vs.  1. Upload raw files
  2. Extract & map every field               (PDF, txt, docx, csv, ...)
  3. Maintain schema consistency
  4. Update schema for new fields    →  2. Query immediately
  5. Re-index on schema change
```

| Feature | Traditional | WaveflowDB |
|---------|-------------|------------|
| Data ingestion | Extract, map, validate | Direct upload |
| Schema definition | Required upfront | Not required |
| Query capability | Exact field matching | Semantic + optional hybrid filtering |
| New document types | Requires schema update | Works immediately |
| Maintenance | High | Low |

---

## Package Layout

```
waveflowdb_client/
├── __init__.py          # public exports; conditional imports for OCR/cloud stacks
├── client.py            # VectorLakeClient — all public SDK methods
├── config.py            # Config — env/.env/constructor-arg resolution
├── exceptions.py         # VectorLakeError hierarchy
├── extractors.py         # local extractors: PDFExtractor, DOCXExtractor, ImageFileExtractor
├── cloud_extractors.py   # AWS / GCP / Azure / Mathpix / LlamaParse extractors + CloudOCRFactory
├── models.py             # DocumentInfo, MatchingDocsResponse, HealthResponse, BatchResult, VersionedChunkName
├── utils.py              # FileProcessor, BatchManager, Logger, chunking/extraction helpers
└── run_utils.py          # run_add/run_refresh/run_insert/run_delete/run_query/run_health/
                           # run_namespace_details/run_docs_info/run_full_corpus_search,
                           # plus MD5 invalidation, chunk purging, checkpointing, failure logging

run.py                    # CONFIG block + dispatcher; imports everything from run_utils
```

`Path(self.log_dir).mkdir(...)` and `Path(self.vector_lake_path).mkdir(...)` are called at the end of `Config.__init__`, so both directories are created automatically on first use — no manual setup required beyond the `.env` file.

---

## Support

For API or platform support, visit: **https://db.agentanalytics.ai**

---

## License

Copyright DIBR tech private ltd.
