Metadata-Version: 2.4
Name: assort
Version: 1.0.0
Summary: Using AI to sort lists of unstructured text through iterative batches.
Author-email: Tom Gorbett <wthomasgorbett@gmail.com>
License: MIT
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: openai>=1.14.0
Requires-Dist: tiktoken>=0.7.0

# assort

Text clustering and sorting with an LLM that discovers categories, classifies items, optionally merges overlapping themes, and cleans up labels. Designed for quick drop-in use with a single function call and clear cost tracking.

## Highlights

- Discovers category names from your data
- Sorts each item with calibrated confidence per category
- Merges overlapping themes when the model judges a high likelihood of overlap
- Refines the miscellaneous bucket when it is too large
- Optionally renames categories to be clearer and more specific
- Tracks tokens and estimated cost in USD
- Simple one-function API that returns results and rich stats

## Install

```bash
pip install assort
```

You also need an OpenAI API key available to the runtime, for example:

```bash
export OPENAI_API_KEY=sk-your-key-here
```

## Quick start

```python
from assort import assort

texts = [
    "Build a responsive landing page in React",
    "How to index a Postgres table",
    "Cognitive behavioral therapy exercises",
    "Vector search with Azure AI Search",
    "Tailwind utility classes for layouts",
    "Managing anxiety before a big presentation",
]

results, stats = assort(
    texts,
    min_clusters=3,
    max_clusters=6,
    description="Short notes that mix software topics and mental health topics",
)

print(results["sorted_results"])
print(round(stats["cost_usd"], 4), "USD")
```

Example shape of `sorted_results`:

```python
{
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts"
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table"
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation"
    ],
    "Miscellaneous": []
}
```

## API

### assort

```python
results, stats = assort(
    batch,
    min_clusters=2,
    max_clusters=5,
    policy=None,
    description="",
    print_estimate=False,
    confirm=False,
    max_budget=None,
    model=None,
    rename_final=True,
)
```

Parameters

- `batch`
  List of strings to categorize. Empty or blank strings are ignored.

- `min_clusters` and `max_clusters`
  Bounds for initial category discovery.

- `policy`
  `Policy.fuzzy` or `Policy.exhaustive`. The choice affects the internal cost estimate. Both modes perform miscellaneous refinement.

- `description`
  Optional corpus context. Helps the model choose better category boundaries and names.

- `print_estimate`
  If true, a cost estimate is computed before any model calls are made. The estimate is also used when `confirm` or `max_budget` is set.

- `confirm`
  If true, the function prompts in the console before running. Useful for scripts that are run interactively.

- `max_budget`
  Float in USD. If the estimate exceeds this amount, the function returns an empty result without calling the model.

- `model`
  Optional model name that overrides the default. If omitted, a capable multimodal GPT model is used.

- `rename_final`
  If true, the library proposes clearer category names at the end based on samples from each group.

Returns

- `results`
  Dict with the key `sorted_results`, which maps each category name to the list of original items assigned to it. A `Miscellaneous` bucket is always present.

- `stats`
  Dict with detailed run information

  - `model`
  - `items_total`
  - `initial_categories_count`
  - `final_categories_count`
  - `miscellaneous_count`
  - `calls` with counts for internal steps
  - `retries` for API retries with backoff
  - `tokens` with `input` and `output` counts
  - `combination_attempts` and `combined_merges`
  - `elapsed_seconds`
  - `cost_usd` estimated from token counts and the internal price table
  - `category_sizes` mapping category to item count
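
The return shape described above can be post-processed with plain dictionary operations. The sketch below assumes only what this section documents, namely a `sorted_results` mapping of category name to list of items. The helper name `item_to_category` is hypothetical and not part of the library. It reads the key with `.get`, so an empty result yields an empty lookup; the exact shape of the empty result returned when `max_budget` is exceeded is an assumption here.

```python
from typing import Dict, List


def item_to_category(results: Dict[str, Dict[str, List[str]]]) -> Dict[str, str]:
    """Invert `sorted_results` into an item -> category lookup.

    Hypothetical helper, not part of assort. If the same text appears in
    more than one category, the last category seen wins.
    """
    lookup: Dict[str, str] = {}
    for category, items in results.get("sorted_results", {}).items():
        for item in items:
            lookup[item] = category
    return lookup


# Hand-written sample in the documented shape (not real model output).
sample = {
    "sorted_results": {
        "Front End Engineering": ["Build a responsive landing page in React"],
        "Miscellaneous": [],
    }
}
lookup = item_to_category(sample)
print(lookup["Build a responsive landing page in React"])
```

A lookup like this is convenient when you want to attach the assigned category back onto rows in a spreadsheet or database.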

## How it works

- Category discovery
  The model reads a sample of your corpus and proposes a set of category names between your bounds.

- Sorting
  Each item is scored against every discovered category with a confidence of high, medium, or low. High-confidence categories win. Ties are broken by simple rules.

- Combining overlapping themes
  When items frequently score high for the same pairs or sets of categories, the library asks the model whether they should be combined. If the model judges the likelihood of overlap to be high, the library requests a single concise name and merges the categories.

- Refining miscellaneous
  If `Miscellaneous` grows larger than a data-guided threshold, the same discovery and sorting routine runs on that subset. Where possible, items are pulled out into new, focused categories.

- Renaming for clarity
  At the end, the library uses a small sample from each category to propose clearer names that preserve meaning. Names are deduplicated.
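
The sorting and combining steps can be illustrated with a small sketch. This is not the library's implementation. The function names, the `Miscellaneous` fallback for items with no high score, the first-in-order tie-break, and the `min_shared` threshold are all assumptions made for illustration, since the description above only says that high confidence wins and that ties are broken by simple rules.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

MISC = "Miscellaneous"


def pick_category(scores: Dict[str, str]) -> str:
    """Return the first category scored "high", else Miscellaneous.

    Assumed rule: when several categories are "high", the one listed
    first wins (dicts preserve insertion order).
    """
    for category, confidence in scores.items():
        if confidence == "high":
            return category
    return MISC


def overlap_candidates(
    all_scores: List[Dict[str, str]], min_shared: int = 2
) -> List[Tuple[str, str]]:
    """List category pairs that are both "high" for at least `min_shared` items."""
    pair_counts: Counter = Counter()
    for scores in all_scores:
        highs = sorted(c for c, conf in scores.items() if conf == "high")
        pair_counts.update(combinations(highs, 2))
    return [pair for pair, n in pair_counts.items() if n >= min_shared]


# Hand-written scores for three items (not real model output).
scored = [
    {"Databases": "high", "Search": "high", "Wellbeing": "low"},
    {"Databases": "high", "Search": "high", "Wellbeing": "low"},
    {"Databases": "low", "Search": "medium", "Wellbeing": "low"},
]
print([pick_category(s) for s in scored])
print(overlap_candidates(scored))
```

In this sketch the pair counting only nominates candidates. As described above, the decision to merge is left to the model.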

## Cost, tokens, and budgets

- Token accounting uses `tiktoken` with an encoder chosen for the active model.
- The estimate and the final `cost_usd` are derived from token counts and an internal price table. Treat these as helpful approximations.
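- Use `max_budget` to stop a run before any calls are made when the estimate exceeds your limit. The check is against the estimate, so it is only as accurate as the estimate itself.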
- Use `print_estimate` or `confirm` when running in scripts where you want an explicit checkpoint.
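
As a rough illustration of how a cost figure follows from token counts, the sketch below multiplies input and output tokens by per-million-token prices and applies a budget check. The prices are placeholders, not the library's internal price table or any provider's real rates, and the names `PRICES_PER_MILLION`, `estimate_cost_usd`, and `within_budget` are hypothetical.

```python
from typing import Optional

# Placeholder prices in USD per one million tokens (illustrative only).
PRICES_PER_MILLION = {"example-model": {"input": 1.00, "output": 2.00}}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token counts into an approximate USD cost."""
    price = PRICES_PER_MILLION[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


def within_budget(estimate: float, max_budget: Optional[float]) -> bool:
    """True when no budget is set or the estimate does not exceed it."""
    return max_budget is None or estimate <= max_budget


cost = estimate_cost_usd("example-model", input_tokens=1000, output_tokens=500)
print(round(cost, 4), within_budget(cost, 0.75))
```

Because real prices change over time, any figure computed this way is best read as an order-of-magnitude guide.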

## Advanced examples

Run with a budget and keep original names:

```python
from assort import assort, Policy

texts = [...]
results, stats = assort(
    texts,
    min_clusters=4,
    max_clusters=8,
    description="Product feedback notes",
    policy=Policy.fuzzy,
    max_budget=0.75,
    rename_final=False,
)
```

Inspect stats for simple analytics:

```python
from assort import assort

# `texts` is a list of strings, as in the Quick start example.
results, stats = assort(texts)

sizes = stats["category_sizes"]
by_size = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
for name, count in by_size:
    print(name, count)
```

## Behavior notes

- Nondeterministic sampling is used during corpus selection, so results can vary between runs.
- The module keeps a single OpenAI client and encoder in module scope. In-process concurrency is not recommended. Use separate processes for parallel work.
- The console prompt only appears when `confirm=True`. Avoid this in non-interactive environments.
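
One way to follow the last note is to enable `confirm` only when a terminal is attached. The sketch below uses the standard library's `isatty()` check. The helper name `can_prompt` is hypothetical, and passing its result as `confirm=` is a suggestion rather than documented library behavior.

```python
import io
import sys
from typing import Optional, TextIO


def can_prompt(stream: Optional[TextIO] = None) -> bool:
    """Return True only if `stream` (default: sys.stdin) is an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    # sys.stdin can be None in some embedded or GUI launch contexts.
    return stream is not None and stream.isatty()


# An in-memory stream is never a terminal, so no prompt should be shown.
print(can_prompt(io.StringIO("y\n")))
```

A call such as `assort(texts, confirm=can_prompt())` would then prompt in a terminal session and skip the prompt in cron jobs or CI pipelines.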
