Metadata-Version: 2.4
Name: beaver-index-importer
Version: 0.1.2
Summary: A CLI tool to import siglip2 image index files (.jsonl, .pkl) into a BeaverDB.
Project-URL: Homepage, https://github.com/yudivian/beaver-index-importer
Project-URL: Issues, https://github.com/yudivian/beaver-index-importer/issues
Author-email: Yudivián Almeida Cruz <yudivian@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.13
Requires-Dist: beaver-db[full]
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.60.0
Description-Content-Type: text/markdown

# Image Index Importer for BeaverDB

A command-line tool to import image index files (generated by `siglip2-image-indexer`) into a BeaverDB database for vector search.

This tool uses the full path of each image as its unique ID. Running the importer multiple times with the same or updated index files automatically **updates** existing entries and adds new ones (an "**upsert**" operation), preventing duplicates.
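The upsert behavior described above can be illustrated with a minimal sketch. It uses a plain dictionary as a hypothetical stand-in for a collection and assumes each index entry carries its image path under a `path` key; neither reflects the actual BeaverDB API or the real index file layout.

```python
# Hypothetical in-memory stand-in for a collection; this is NOT the BeaverDB API.
def upsert(collection: dict, entries: list[dict]) -> tuple[int, int]:
    """Insert new entries and overwrite existing ones, keyed by image path."""
    inserted = updated = 0
    for entry in entries:
        doc_id = entry["path"]  # assumed field name holding the full image path
        if doc_id in collection:
            updated += 1
        else:
            inserted += 1
        collection[doc_id] = entry  # same ID always maps to one document
    return inserted, updated


collection = {}
first = upsert(collection, [{"path": "/photos/a.jpg", "embedding": [0.1, 0.2]}])
second = upsert(
    collection,
    [
        {"path": "/photos/a.jpg", "embedding": [0.3, 0.4]},  # existing: updated
        {"path": "/photos/b.jpg", "embedding": [0.5, 0.6]},  # new: inserted
    ],
)
```

Because the path is the key, importing the same file twice leaves the number of stored documents unchanged.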

## Installation

The tool can be installed using two methods. The key dependencies, including `beaver-db[full]`, `tqdm`, and `PyYAML`, will be installed automatically.

### 1. Install from PyPI (Recommended for Users)

If the package is published on PyPI, you can install it directly:

```bash
pip install beaver-index-importer
```

### 2. Install from Source (For Developers)

To install the package in **editable mode**, either for development or to use the latest source code:

Clone the repository, change into the cloned **`beaver-index-importer`** directory, and install the package in editable mode:

```bash
# Clone the repository and enter its directory
git clone https://github.com/yudivian/beaver-index-importer
cd beaver-index-importer

# Install the package in editable mode
pip install -e .
```

## Usage

You can run the importer using command-line arguments, a YAML configuration file, or a combination of both. Command-line arguments will always take precedence over settings in the configuration file.

### 1. Using Command-Line Arguments

The most direct way is to provide all parameters via the command line.

```bash
beaver-index-importer --index-file /path/to/your/image_index.jsonl --db-file /path/to/your/images.db
```

### 2. Using a Configuration File

For easier management, you can define your settings in a YAML file (e.g., `config.yml`).

**Example `config.yml`:**

```yaml
index_file: /path/to/your/image_index.jsonl
db_file: /path/to/your/images.db
collection: my_image_collection
mode: upsert
```

Then, run the importer by pointing it to your config file:

```bash
beaver-index-importer --config config.yml
```
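The precedence rule (command-line arguments override the configuration file) can be sketched as a layered merge. This is an illustration under stated assumptions, not the tool's actual implementation: `build_parser`, `resolve_settings`, and `DEFAULTS` are hypothetical names, and `file_config` stands in for the dictionary obtained from parsing `config.yml`.

```python
import argparse

# Defaults as listed in the options table; the names below are illustrative only.
DEFAULTS = {"collection": "images", "mode": "upsert"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="beaver-index-importer")
    parser.add_argument("-c", "--config")
    # No argparse defaults here, so an omitted flag stays None and
    # does not mask a value coming from the config file.
    parser.add_argument("--index-file")
    parser.add_argument("--db-file")
    parser.add_argument("--collection")
    parser.add_argument("--mode")
    return parser


def resolve_settings(file_config: dict, argv: list[str]) -> dict:
    """Merge defaults < config file < command line."""
    args = vars(build_parser().parse_args(argv))
    args.pop("config", None)
    cli = {key: value for key, value in args.items() if value is not None}
    return {**DEFAULTS, **file_config, **cli}


# Stand-in for the parsed contents of the example config.yml.
file_config = {
    "index_file": "/path/to/your/image_index.jsonl",
    "db_file": "/path/to/your/images.db",
    "collection": "my_image_collection",
    "mode": "upsert",
}
settings = resolve_settings(file_config, ["--config", "config.yml", "--mode", "sync"])
```

In this sketch, `--mode sync` on the command line replaces `mode: upsert` from the file, while the other settings are taken from the file unchanged.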

## Execution Examples

The table below shows one example command for each import mode, plus one that imports into a custom collection.

| Mode or Option | Purpose | Command Example |
| :--- | :--- | :--- |
| **`upsert`** | Default. Updates existing documents and inserts new ones. | `beaver-index-importer --index-file index.jsonl --db-file images.db --mode upsert` |
| **`rebuild`** | Deletes the entire collection before importing. | `beaver-index-importer --index-file index.jsonl --db-file images.db --mode rebuild` |
| **`insert-only`** | Only adds documents that do not exist; ignores existing ones. | `beaver-index-importer --index-file index.jsonl --db-file images.db --mode insert-only` |
| **`update-only`** | Only updates existing documents; ignores new ones. | `beaver-index-importer --index-file index.jsonl --db-file images.db --mode update-only` |
| **`sync`** | Performs `upsert` and then removes documents from the DB that are missing from the index file. | `beaver-index-importer --index-file index.jsonl --db-file images.db --mode sync` |
| **`--collection`** | Imports to a collection with a specific name. | `beaver-index-importer --index-file index.jsonl --db-file images.db --collection my_photos` |

## Import Modes

The tool supports five different modes of operation, controlled by the `--mode` argument (or `mode` in the config file).

| Mode | Description | Behavior |
| :--- | :--- | :--- |
| **`upsert` (Default)** | **Update or Insert.** | Uses the image path as a unique ID. It updates existing documents and inserts new ones. Ideal for incremental synchronization. |
| **`rebuild`** | **Clear and Re-insert.** | Completely **deletes all documents** from the collection before importing the new data. Use to start from a clean slate. |
| **`insert-only`** | **Insert New Documents Only.** | Only inserts documents that **do not** have an existing ID in the database. Existing documents are skipped (not updated). |
| **`update-only`** | **Update Existing Documents Only.** | Only updates documents that **already exist** in the database. New documents from the index file are skipped (not inserted). |
| **`sync`** | **Mirror/Synchronize.** | Performs an `upsert` of all index file documents, and then **removes** any documents from the database that are **missing** from the current index file. This ensures the database exactly mirrors the index source. |
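One way to read the table is as set operations over the IDs found in the index file and the IDs already stored in the database. The sketch below follows that reading; `plan_import` is a hypothetical helper written for illustration and is not part of the tool.

```python
def plan_import(mode: str, index_ids: set[str], db_ids: set[str]) -> dict[str, set[str]]:
    """Return which IDs each mode would insert, update, and delete."""
    new = index_ids - db_ids        # in the index file, not yet in the database
    existing = index_ids & db_ids   # present in both
    stale = db_ids - index_ids      # in the database, missing from the index file
    plans = {
        "upsert": {"insert": new, "update": existing, "delete": set()},
        # rebuild clears the collection first, so every index entry is a fresh insert
        "rebuild": {"insert": set(index_ids), "update": set(), "delete": set(db_ids)},
        "insert-only": {"insert": new, "update": set(), "delete": set()},
        "update-only": {"insert": set(), "update": existing, "delete": set()},
        "sync": {"insert": new, "update": existing, "delete": stale},
    }
    if mode not in plans:
        raise ValueError(f"unknown mode: {mode}")
    return plans[mode]


index_ids = {"/photos/a.jpg", "/photos/b.jpg"}
db_ids = {"/photos/b.jpg", "/photos/c.jpg"}
sync_plan = plan_import("sync", index_ids, db_ids)
```

With these sample IDs, `sync` inserts `/photos/a.jpg`, updates `/photos/b.jpg`, and deletes `/photos/c.jpg`, whereas `upsert` would leave `/photos/c.jpg` in place.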

## All Options

| Argument | Short | Description | Default |
| :--- | :--- | :--- | :--- |
| `--config` | `-c` | Path to a YAML configuration file. | `None` |
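| `--index-file` | | Path to the `.jsonl` or `.pkl` index file to import. | **Required** (on the command line or as `index_file` in the config file) |
| `--db-file` | | Path to the BeaverDB database file to create or update. | **Required** (on the command line or as `db_file` in the config file) |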
| `--collection` | | The name of the collection to use inside the database. | `images` |
| `--mode` | | The import strategy: `upsert`, `rebuild`, `insert-only`, `update-only`, or `sync`. | `upsert` |