Metadata-Version: 2.4
Name: contextbuddy
Version: 0.4.2
Summary: From raw PDFs to compressed prompts in 3 lines. Cut your LLM bill by 60%.
Author: ContextBuddy contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/mohithgowdak/ContextBuddy
Project-URL: Repository, https://github.com/mohithgowdak/ContextBuddy
Project-URL: Issues, https://github.com/mohithgowdak/ContextBuddy/issues
Project-URL: Changelog, https://github.com/mohithgowdak/ContextBuddy/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/mohithgowdak/ContextBuddy#readme
Keywords: llm,prompt,context,compression,routing,tokens,cost,rag,pdf,vector-store
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.6.0; extra == "tiktoken"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.30.0; extra == "anthropic"
Provides-Extra: mcp
Requires-Dist: mcp[cli]; extra == "mcp"
Provides-Extra: ollama
Requires-Dist: httpx>=0.27.0; extra == "ollama"
Provides-Extra: sbert
Requires-Dist: sentence-transformers>=3.0.0; extra == "sbert"
Provides-Extra: gemini
Requires-Dist: google-genai>=0.3.0; extra == "gemini"
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == "pdf"
Provides-Extra: web
Requires-Dist: httpx>=0.27.0; extra == "web"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "web"
Provides-Extra: docx
Requires-Dist: python-docx>=1.1.0; extra == "docx"
Provides-Extra: loaders
Requires-Dist: pymupdf>=1.24.0; extra == "loaders"
Requires-Dist: httpx>=0.27.0; extra == "loaders"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "loaders"
Requires-Dist: python-docx>=1.1.0; extra == "loaders"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1.0; extra == "langchain"
Provides-Extra: codegraph
Requires-Dist: tree-sitter==0.21.3; extra == "codegraph"
Requires-Dist: tree-sitter-languages==1.10.2; extra == "codegraph"
Provides-Extra: all
Requires-Dist: tiktoken>=0.6.0; extra == "all"
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.30.0; extra == "all"
Requires-Dist: google-genai>=0.3.0; extra == "all"
Requires-Dist: sentence-transformers>=3.0.0; extra == "all"
Requires-Dist: httpx>=0.27.0; extra == "all"
Requires-Dist: pymupdf>=1.24.0; extra == "all"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "all"
Requires-Dist: python-docx>=1.1.0; extra == "all"
Requires-Dist: mcp[cli]; extra == "all"
Requires-Dist: tree-sitter==0.21.3; extra == "all"
Requires-Dist: tree-sitter-languages==1.10.2; extra == "all"
Dynamic: license-file

<p align="center">
  <h1 align="center">ContextBuddy</h1>
  <p align="center">
    <strong>From raw PDFs to compressed prompts in 3 lines. Cut your LLM bill by 60%.</strong>
  </p>
</p>

<p align="center">
  <a href="https://github.com/mohithgowdak/ContextBuddy"><img src="https://img.shields.io/github/stars/mohithgowdak/ContextBuddy?style=social" alt="Stars"></a>
  <img src="https://img.shields.io/badge/version-0.4.2-blue" alt="Version">
  <img src="https://img.shields.io/badge/python-3.9%2B-blue" alt="Python">
  <a href="LICENSE"><img src="https://img.shields.io/github/license/mohithgowdak/ContextBuddy" alt="License"></a>
  <img src="https://img.shields.io/badge/dependencies-0_(core)-brightgreen" alt="Deps">
</p>
<!--
<p align="center">
  <a href="docs/demo.md">Create the demo GIF</a> •
  Drop it at <code>assets/cli-demo.gif</code>
</p>
-->
<!-- Once recorded, uncomment this:
<p align="center">
  <img src="assets/cli-demo.gif" alt="ContextBuddy CLI demo" width="820" />
</p>
-->

```
   ______            __            __  ____            __    __
  / ____/___  ____  / /____  _  __/ /_/ __ )__  ______/ /___/ /_  __
 / /   / __ \/ __ \/ __/ _ \| |/_/ __/ __  / / / / __  / __  / / / /
/ /___/ /_/ / / / / /_/  __/>  </ /_/ /_/ / /_/ / /_/ / /_/ / /_/ /
\____/\____/_/ /_/\__/\___/_/|_|\__/_____/\__,_/\__,_/\__,_/\__, /
                                                            /____/
        The missing compression layer for every LLM stack.
```

> **One line. 60% cheaper. Zero core dependencies.**
> ContextBuddy sits between your raw data and your LLM call, strips the noise, keeps every entity, and shows you the savings -- in tokens and dollars -- on every single request.


## Install

### PyPI

```bash
pip install contextbuddy
```

### TestPyPI (for pre-release testing)

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple contextbuddy==0.4.0
```

### Optional extras

```bash
# MCP server tools
pip install "contextbuddy[mcp]==0.4.0"

# Python codegraph (tree-sitter call edges)
pip install "contextbuddy[codegraph]==0.4.0"
```

```
┌──────────────────────────────────────────────────────┐
│                   ContextBuddy                       │
├──────────────────────────────────────────────────────┤
│  Tokens   before    15000   after      3000          │
│  Saved    -80.0%              Est. $0.0600           │
│  [████████████████████████░░░░░░] 12000 tokens freed │
├──────────────────────────────────────────────────────┤
│  Chunks   total 12    kept 4    pruned 8             │
│  Entities INV-92831, 2026-04-01, acct_12345          │
└──────────────────────────────────────────────────────┘
```

---

## What is ContextBuddy?

ContextBuddy is a **lightweight, open-source Python library** that acts as a context middleware between your raw data (PDFs, web pages, documents, databases) and your LLM call. Its entire job is to take a massive, messy prompt -- like 20 pages of scraped text -- compress it, filter out the noise, preserve critical entities, and pass a clean, token-efficient prompt to any LLM.

Think of it as the **missing layer** in every AI stack: the part that makes sure you're not paying for 15,000 tokens when only 3,000 actually matter.

---

## The Problem

Every developer building with LLMs hits the same wall:

1. **You're overpaying.** You send 15,000 tokens of scraped text to GPT-4 when only 3,000 tokens actually matter. That's 5x the cost for worse results.
2. **Context is noisy.** Raw PDFs, web scrapes, and database dumps are full of irrelevant paragraphs, boilerplate, and filler. Your LLM wastes attention on noise.
3. **Critical details get lost.** When you manually truncate context to save tokens, you accidentally drop the one invoice ID or date the user asked about.
4. **Existing frameworks are bloated.** LangChain has 100+ dependencies. LlamaIndex has 50+. You just want to load a PDF and ask a question.

---

## The Solution

ContextBuddy solves all four problems in a single library:

- **Semantic pruning** -- scores every paragraph against your question and drops irrelevant content before it hits the expensive model.
- **Entity preservation** -- automatically extracts IDs, dates, URLs, phone numbers, and other critical data points, ensuring they are never accidentally pruned.
- **Token budgeting** -- enforces a strict token limit so your context always fits the window you set.
- **ROI telemetry** -- prints exactly how many tokens (and dollars) you saved on every call. Developers screenshot this and share it.

It works with **any LLM** -- OpenAI, Anthropic, Google, or local models -- because it only touches the prompt, not the model.

---

## Who is this for?

ContextBuddy works at every scale. The value just shows up differently:

| Scale | How they use it | Why it matters |
|-------|----------------|----------------|
| **Solo dev / hobbyist** | Drop-in middleware, skip LangChain entirely | Zero deps, 3 lines, no infrastructure to manage |
| **Startup (seed to Series A)** | Full pipeline replacing LangChain stack | Cut API bill from $10k to $3k/month, ship in days not weeks |
| **Mid-size company** | Compression layer inside their existing LangChain/LlamaIndex stack | Bolt on to existing code, save 60% without rewriting anything |
| **Enterprise** | Cost governance + smart routing across teams | ROI telemetry for budgeting, model routing to manage spend at scale |

The bigger the company, the more it overpays on tokens. A team running 1M LLM calls/day can spend $30k+/month on tokens. A compression middleware that saves 60% of that is worth $18k/month to them -- and it plugs in with 3 lines.

Specifically built for:

- **AI engineers** building RAG pipelines who want to cut API costs without sacrificing answer quality.
- **Startups** shipping LLM-powered products who need to keep their OpenAI/Anthropic bill under control.
- **Solo developers** who want multi-doc RAG without installing LangChain and 100 transitive dependencies.
- **Platform teams** who need cost visibility and governance over LLM spend across the organization.
- **Agent builders** who need their tools to pass compressed, high-signal context to function calls.
- **LangChain/LlamaIndex users** who want to cut costs without rewriting a single retriever -- drop ContextBuddy into your existing pipeline as a compression step (for LangChain, just install the `[langchain]` extra).

---

## Why ContextBuddy over the alternatives?

| Feature | LangChain | LlamaIndex | LightRAG | **ContextBuddy** |
|---------|-----------|------------|----------|------------------|
| Install size | 100+ deps | 50+ deps | 20+ deps | **0 core deps** |
| Lines to first RAG | ~30 | ~15 | ~10 | **3** |
| Cost optimization | None | None | None | **Built-in** |
| ROI telemetry | None | None | None | **Every call** |
| Vector DB required | Yes | Yes | Yes | **No** |
| Context compression | None | None | None | **Semantic pruning + budgeting** |
| PDF/URL/DOCX loading | Separate install | Built-in | Separate | **Built-in (optional deps)** |
| LangChain compatible | N/A (is LangChain) | Adapter needed | No | **Native (`[langchain]` extra)** |

**ContextBuddy does 80% of what LangChain does in 10% of the code.** Zero dependencies for the core. Optional extras for PDFs, web scraping, accurate tokenizers, and native LangChain integration.

---

## Install

```bash
pip install contextbuddy
```

Optional extras:

```bash
pip install "contextbuddy[pdf]"         # PDF loading (pymupdf)
pip install "contextbuddy[web]"         # URL/web scraping (httpx + bs4)
pip install "contextbuddy[tiktoken]"    # Accurate OpenAI token counts
pip install "contextbuddy[openai]"      # OpenAI embeddings
pip install "contextbuddy[ollama]"      # Free local semantic embeddings (requires Ollama)
pip install "contextbuddy[sbert]"       # Free local semantic embeddings (sentence-transformers)
pip install "contextbuddy[loaders]"     # All document loaders
pip install "contextbuddy[langchain]"   # LangChain integration (langchain-core)
pip install "contextbuddy[mcp]"         # MCP server for Cursor / Claude Desktop
pip install "contextbuddy[all]"         # Everything (including MCP + LangChain)
```

---

## MCP Server (Cursor / Claude Desktop)

ContextBuddy ships an MCP server that gives AI assistants direct access to codebase search and context compression — no manual copy-paste needed.

```bash
pip install "contextbuddy[mcp]"
```

### Setup in Cursor

1. Copy `.cursor/mcp.json.example` to `.cursor/mcp.json`
2. Replace `/path/to/your/repo` with the absolute path to your project
3. Restart Cursor — the server starts automatically

```json
{
  "mcpServers": {
    "contextbuddy": {
      "command": "python",
      "args": ["-m", "contextbuddy.mcp.server"],
      "env": {
        "CONTEXTBUDDY_ALLOWED_ROOTS": "/absolute/path/to/your/repo"
      }
    }
  }
}
```

> **Note:** `.cursor/mcp.json` is gitignored (it contains your local path). Commit `.cursor/mcp.json.example` instead.

### Slash commands (in Cursor chat)

| Command | What it does |
|---|---|
| `/cb <question>` | Quick codebase search + compression |
| `/cb_deep <question>` | Semantic + graph search (best quality, requires indexes) |
| `/cb_index` | Build vector + graph indexes for the repo |

The server exposes 12 tools. The LLM picks them automatically based on your question — no explicit invocation needed.

---

## LangChain Integration

ContextBuddy plugs directly into LangChain as a **native compression layer**. No glue code, no adapters -- just install the extra and use the two provided classes.

```bash
pip install "contextbuddy[langchain]"
```

> Requires `langchain-core>=0.1.0`. If it is missing, importing `ContextBuddyCompressor` or `ContextBuddyRetriever` will raise a helpful `ImportError` telling you exactly what to install.

### `ContextBuddyCompressor`

A drop-in `base_compressor` for LangChain's `ContextualCompressionRetriever`. It scores retrieved documents against the query, prunes irrelevant ones, preserves entities, and enforces a token budget -- all before the LLM sees a single token.

### `ContextBuddyRetriever`

Wraps any `MemoryStore` (or any object with a `.search(query, top_k)` method). Runs semantic search, compresses the results, and returns standard LangChain `Document` objects. Use it anywhere LangChain expects a `BaseRetriever`.

### Example: both classes in action

```python
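# Beyond the `[langchain]` extra (which installs langchain-core only), this
# example also needs the `langchain` and `langchain-openai` packages.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI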
from contextbuddy import (
    ContextBuddyCompressor,
    ContextBuddyRetriever,
    MemoryStore,
    load,
)

# --- Option A: Compress results from an existing LangChain retriever ---
compressor = ContextBuddyCompressor(max_context_tokens=3000, min_relevance=0.15)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_retriever,  # any LangChain BaseRetriever
)
docs = compression_retriever.invoke("What are the payment terms?")
# `docs` contains only the chunks that survived pruning + budgeting

# --- Option B: Full retrieval + compression from a ContextBuddy store ---
store = MemoryStore()
store.add(load("./contracts/"))

retriever = ContextBuddyRetriever(store=store, max_context_tokens=2000, top_k=20)
docs = retriever.invoke("What is the late penalty clause?")
# Returns Document objects -- plug straight into any LangChain chain

# Use with any LangChain chain
llm = ChatOpenAI(model="gpt-4o-mini")
for doc in docs:
    print(doc.page_content[:120], "...")
```

| Class | Purpose | Key params |
|---|---|---|
| `ContextBuddyCompressor` | Prune docs from any retriever | `max_context_tokens`, `min_relevance`, `conservative_mode` |
| `ContextBuddyRetriever` | Search a `MemoryStore` + compress | `store`, `max_context_tokens`, `min_relevance`, `top_k` |

Both classes are exported from the top-level package: `from contextbuddy import ContextBuddyCompressor, ContextBuddyRetriever`.

---

## Embedding Levels (what to use)

ContextBuddy is **compression-first**. Embeddings are optional -- you only upgrade when you need more semantic accuracy.

| Level | What you get | Cost | Dependencies | When to use |
|---|---|---:|---|---|
| **Level 0 (default)** | Hash/BM25-style relevance (fast, decent) | **$0** | **None (core)** | Most business/technical docs with shared vocabulary |
| **Level 1 (free semantic, local)** | True semantic similarity (offline) | **$0** | **Optional** | Synonyms/paraphrases matter; you want better recall without paying APIs |
| **Level 2 (paid semantic)** | Best-in-class embeddings | $$ | **Optional** | Multilingual / high-stakes accuracy / heavy paraphrasing |

### Level 0 (default): zero-dependency

- Works out of the box, no setup.
- Best when the question and answer share some vocabulary.

```python
from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(max_context_tokens=4000))
```

### Level 1 (free semantic): local embeddings (recommended upgrade)

Pick one:

- **Ollama** (best DX, keeps your Python deps light): `pip install "contextbuddy[ollama]"`  
  Requires [Ollama](https://ollama.com/) installed and a local embedding model pulled.
- **Sentence Transformers** (in-process, heavier install): `pip install "contextbuddy[sbert]"`

```python
from contextbuddy import ContextEngine, ContextEngineConfig, OllamaEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=OllamaEmbedder(model="nomic-embed-text"),  # local + free
)
```

```python
from contextbuddy import ContextEngine, ContextEngineConfig, SentenceTransformersEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=SentenceTransformersEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
```

### Level 2 (paid semantic): OpenAI / Gemini

- Use when you want the highest semantic accuracy and you're okay with external API calls.
- Install: `pip install "contextbuddy[openai]"` for OpenAI or `pip install "contextbuddy[gemini]"` for Gemini.

```python
from contextbuddy import ContextEngine, ContextEngineConfig, OpenAIEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)
```

```python
from contextbuddy import ContextEngine, ContextEngineConfig, GeminiEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000),
    embedder=GeminiEmbedder(model="text-embedding-004"),
)
```

---

## 90-second quickstart (the only path you need)

Compress a huge, noisy context into a budgeted prompt before the LLM call -- in three lines.

```python
from contextbuddy import ContextEngine, ContextEngineConfig

engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))

huge_context = """
Invoice INV-92831 issued 2026-04-01 for account_id=acct_12345.
Amount: $4,500.00 USD. Payment due within 30 days.

... 20 pages of unrelated notes, meeting transcripts, old emails ...

Ticket ACME-2041: chargebacks for user_id=usr_9z8y7x6w.
"""

final_prompt, report = engine.build_prompt(
    user_prompt="Summarize the invoice and ticket. Include all IDs and dates.",
    context=huge_context,
)

print(f"{report.reduction_pct}% smaller, ${report.estimated_savings} saved per call")
# Pass `final_prompt` to any LLM (OpenAI, Anthropic, Gemini, local -- ContextBuddy doesn't care).
```

When you're ready to call an LLM, use `engine.run(...)` (sync) or `engine.arun(...)` (async) and pass any `llm_call` callable. See [5 Ways to Use It](#5-ways-to-use-it-pick-your-level) for loaders, full RAG, LangChain, and pipeline patterns.

---

## Benchmarks (quality gate)

ContextBuddy includes a small benchmark harness so "more compression" doesn't silently break correctness.

```bash
python -m pip install -e .
python -m contextbuddy bench --gate --json bench-report.json
```

See `docs/benchmarks/benchmarks.md` and `benchmarks/datasets/v0.sample.json`.

---

## Docs

Start at `docs/index.md`.

---

## What ContextBuddy guarantees

- **Entity survival.** Any regex-matched entity (IDs, emails, URLs, dates, money, tickets, phones, UUIDs, version strings) always survives compression.
- **Never larger.** Output is always shorter than input -- or unchanged if input already fits the budget.
- **Never empty.** If input has content, output is non-empty. Empty output is treated as a bug, not a valid result.
- **Deterministic core.** Same input + same config = same output. No randomness in the core pipeline.
- **Zero core dependencies.** Works on a fresh Python 3.9+ install. `pip install contextbuddy` -> done.
- **Budget respected.** Final prompt always fits `max_context_tokens`. No mid-sentence cuts.

## What ContextBuddy does not do

- **Not an agent framework.** It compresses context; it doesn't orchestrate tools, memory, or loops. Pair with LangGraph/CrewAI if you need that.
- **Not a vector database.** The in-memory store is great up to ~100k chunks. Above that, use Pinecone/Weaviate and plug ContextBuddy in as the compression layer.
- **Doesn't call LLMs itself.** You always pass `llm_call=...`. Works with OpenAI, Anthropic, Gemini, Ollama, anything.
- **Doesn't learn.** Scoring is algorithmic (BM25 + stemmer + synonyms + n-grams). No training, no drift.
- **Doesn't ship a UI.** It's a library, not a product.

---

## 5 Ways to Use It (pick your level)

### Path 1: Compress raw text (3 lines)

```python
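from openai import OpenAI  # any LLM client works; OpenAI is used for illustration
from contextbuddy import ContextEngine, ContextEngineConfig

client = OpenAI()  # reads OPENAI_API_KEY from the environment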

engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))
result = engine.run(
    user_prompt="Summarize the key points.",
    context=huge_raw_text,
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
```

### Path 2: Load files + compress (3 lines)

```python
from contextbuddy import ContextEngine, ContextEngineConfig, load

engine = ContextEngine(ContextEngineConfig(dev_mode=True, max_context_tokens=4000))
result = engine.run(
    user_prompt="What are the payment terms?",
    context=load("contract.pdf"),
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
```

### Path 3: Multi-document RAG (3 lines)

```python
from contextbuddy import Retriever, MemoryStore, load

store = MemoryStore().add(load("./docs/"))
result = Retriever(store, dev_mode=True).query(
    "What are the payment terms?",
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p),
)
```

### Path 4: Full pipeline (one-liner setup)

```python
from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./docs/", dev_mode=True)
result = pipeline.query("Summarize the contract", llm_call=my_llm)
```

### Path 5: LangChain pipeline

Drop ContextBuddy into any existing LangChain retriever as a `ContextualCompressionRetriever`. Your retriever stays the same -- ContextBuddy just compresses what it returns.

```python
# ContextualCompressionRetriever ships in the `langchain` package, not langchain-core.
from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor

compressor = ContextBuddyCompressor(max_context_tokens=3000)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_retriever,
)
docs = retriever.invoke("What is the refund policy?")
# Only high-relevance, budget-fitting chunks survive. Entities always kept.
```

No rewrites required. Install `contextbuddy[langchain]`, add 4 lines, and your pipeline is 60% cheaper.

---

## Architecture

```
Your Files (PDFs, URLs, DOCX, TXT, CSV, directories)
    |
    v
+--------------+
|  Loaders     |  load("file.pdf") / load("https://...") / load("./dir/")
+------+-------+
       |
       v
+--------------+
|  Store       |  In-memory vector index (auto-dedup, metadata, persistence)
+------+-------+
       |
       v
+--------------+
|  Retriever   |  Semantic search -> top-k chunks
+------+-------+
       |
       v
+--------------+
|  Compressor  |  Prune -> entity keep-list -> token budget -> compose
+------+-------+
       |
       v
+--------------+
|  Router      |  Score query complexity -> pick cheap or expensive model
+------+-------+
       |
       v
+--------------+
|  Cache       |  Embedding cache + response cache (skip redundant work)
+------+-------+
       |
       v
  Your LLM (OpenAI / Anthropic / Google / Local)
```

Every layer is optional. Use one, use all, or use any combination.

---

## How Compression Actually Works (No ML, No NumPy)

ContextBuddy doesn't use a neural network to compress your context. The entire pipeline is algorithmic, using techniques that predate deep learning by decades -- but combined in a way that delivers results competitive with embedding-based approaches. Here's exactly what happens when you call `engine.run()`:

### Step 1: Chunking

Your raw text (PDF/web/code/plain text) is chunked into **coherent units** using a document-aware chunker:

- **Generic text**: paragraph/sentence-aware merging (avoids tiny orphan fragments)
- **PDF**: normalizes line-break artifacts and avoids page-wise chunking
- **Contracts**: groups clause/section headers with their bodies (keeps related content together)
- **Python code**: keeps imports + functions/classes intact (no mid-function splits)

The goal is not "more chunks" -- it's **better chunk boundaries**, so relevance scoring and budgeting keep the right information with fewer tokens.

### Step 2: Relevance Scoring (HybridScorer -- the secret sauce)

This is where ContextBuddy differs from most compression libraries. Instead of relying on a single signal, the default `HybridScorer` combines **four scoring signals** into one relevance score:

**Signal 1: BM25 (70% weight)** -- The same algorithm that powers Elasticsearch and Lucene. It handles term-frequency saturation (saying "payment" 10 times isn't 10x more relevant than once), document-length normalization (longer paragraphs don't cheat the ranking), and inverse-document-frequency weighting (rare words matter more than common ones). This alone is a massive upgrade over naive keyword matching.
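
For intuition, here is a minimal Okapi BM25 sketch showing the three effects named above (saturation via `k1`, length normalization via `b`, and IDF). It is not ContextBuddy's internal code: `bm25_scores` is a hypothetical name, tokenization is plain whitespace splitting, and stemming is left out.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each chunk against the query with Okapi BM25 (whitespace tokens)."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())
    scores: List[float] = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger IDF (the always-positive Lucene variant).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates term frequency; b normalizes by chunk length.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

With these defaults, a chunk that repeats "payment" ten times scores higher than one that mentions it once, but by far less than ten times.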

**Signal 2: Stemming (built into BM25)** -- A lightweight suffix-stripping stemmer normalizes word forms before scoring. "payments" matches "payment". "running" matches "run". "organized" matches "organizing". No NLTK, no spaCy -- just 120 lines of pure Python implementing the most impactful Porter stemmer rules.

**Signal 3: Synonym Expansion (15% weight)** -- A built-in thesaurus of ~200 word groups covering business, legal, tech, medical, and general vocabulary. When you ask about "car insurance," the scorer automatically expands "car" to also check for "automobile," "vehicle," and "auto" in every paragraph. "Buy" matches "purchase." "Salary" matches "compensation." "Error" matches "bug." All offline, zero API calls.

**Signal 4: Character N-gram Fuzzy Matching (15% weight)** -- Catches morphological variants and typos that stemming misses. "optimise" matches "optimize." "colour" matches "color." Works by computing Jaccard similarity over character trigrams -- if two words share enough 3-character substrings, they're treated as partial matches.
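
A minimal sketch of trigram Jaccard similarity as described for Signal 4. The function names are illustrative, and the threshold at which a pair counts as a "partial match" is not specified here, so none is assumed.

```python
from typing import Set


def char_trigrams(word: str) -> Set[str]:
    """Return the set of 3-character substrings of a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity (shared trigrams / all trigrams) of two words."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(trigram_jaccard("optimise", "optimize"))  # 0.5 (4 shared of 8 distinct)
print(trigram_jaccard("colour", "color"))       # 0.4 (2 shared of 5 distinct)
```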

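The BM25 score (which already includes stemming), the synonym score, and the n-gram score are each normalized to [0, 1] and combined with configurable weights (70/15/15 by default). The result: paragraphs that are genuinely relevant to your question score high, even when they use synonyms, different word forms, or variant spellings.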

```python
from contextbuddy import HybridScorer

scorer = HybridScorer()
scores = scorer.score(
    query="What is the car insurance policy?",
    chunks=[
        "The automobile coverage plan includes collision and liability.",  # scores HIGH (synonym match)
        "Employee cafeteria hours are 12pm to 2pm.",                      # scores LOW (irrelevant)
    ],
)
```
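
The normalize-and-weight step can be sketched as follows. This assumes min-max normalization per signal, which is one common choice and not necessarily what `HybridScorer` does internally. `min_max` and `combine` are hypothetical helpers, and the default weights mirror the 70/15/15 split described above.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; an empty or constant list maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(
    bm25: Sequence[float],
    synonym: Sequence[float],
    ngram: Sequence[float],
    weights: Sequence[float] = (0.70, 0.15, 0.15),
) -> List[float]:
    """Weighted sum of three normalized per-chunk score lists."""
    parts = [min_max(bm25), min_max(synonym), min_max(ngram)]
    return [
        sum(w * part[i] for w, part in zip(weights, parts))
        for i in range(len(bm25))
    ]
```

Because the weights sum to 1, the combined score also stays within [0, 1].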

### Step 3: Entity Extraction

Regex patterns scan every paragraph for critical data: emails, URLs, dates, dollar amounts, IDs, phone numbers, ticket numbers, etc. Any paragraph containing a detected entity is **force-kept** regardless of its relevance score, so you never accidentally drop the invoice ID the user asked about.
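
A minimal sketch of the force-keep rule, using a handful of illustrative patterns. The library's real pattern set is not shown in this README, so `ENTITY_PATTERNS`, `find_entities`, and `force_keep` are hypothetical names and the regexes are deliberately simple.

```python
import re
from typing import Dict, List

# Illustrative patterns only; a real pattern set would be larger and stricter.
ENTITY_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "ticket": re.compile(r"\b[A-Z]{2,10}-\d+\b"),
}


def find_entities(paragraph: str) -> List[str]:
    """Return every substring matched by any pattern, in pattern order."""
    found: List[str] = []
    for pattern in ENTITY_PATTERNS.values():
        found.extend(pattern.findall(paragraph))
    return found


def force_keep(paragraphs: List[str], scores: List[float], min_relevance: float) -> List[str]:
    """Keep a paragraph if it is relevant enough OR contains any entity."""
    return [
        p for p, s in zip(paragraphs, scores)
        if s >= min_relevance or find_entities(p)
    ]
```

In this sketch a paragraph with a relevance score of 0.0 still survives as long as it contains, say, an invoice ID or a date.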

### Step 4: Budget Enforcement

The surviving paragraphs are sorted by importance (entity-containing chunks first, then by relevance score) and greedily packed into the token budget. If not even a single chunk fits, the top chunk is extractively summarized (leading sentences are kept until the limit is reached). The final prompt always fits the budget you set.
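
A minimal sketch of greedy packing with the fallback described above. It assumes whitespace word counts as a stand-in for a real tokenizer and a naive `". "` sentence split; `pack_to_budget` is a hypothetical helper, not the library's API.

```python
from typing import Callable, List, Tuple


def pack_to_budget(
    chunks: List[Tuple[str, float, bool]],  # (text, relevance, has_entity)
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily keep the most important chunks that fit in `max_tokens`."""
    # Entity-bearing chunks first, then higher relevance first.
    ranked = sorted(chunks, key=lambda c: (not c[2], -c[1]))
    kept: List[str] = []
    used = 0
    for text, _score, _has_entity in ranked:
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            kept.append(text)
            used += cost
    if not kept and ranked:
        # Nothing fit: keep leading sentences of the top-ranked chunk.
        summary = ""
        for sentence in ranked[0][0].split(". "):
            candidate = f"{summary}. {sentence}" if summary else sentence
            if count_tokens(candidate) > max_tokens:
                break
            summary = candidate
        if summary:
            kept.append(summary)
    return kept
```

Note that the greedy loop skips a chunk that does not fit and keeps trying smaller ones, so a low-ranked short chunk can still make it into the budget.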

### Scorer Comparison

| | `HybridScorer` (default) | `SemanticScorer` + `LocalHashEmbedder` | `SemanticScorer` + `OpenAIEmbedder` |
|--|--|--|--|
| Understands synonyms | **Yes** (built-in thesaurus) | No | Yes |
| Handles word forms | **Yes** (stemming) | No | Yes |
| Fuzzy matching | **Yes** (n-grams) | No | No |
| IDF weighting | **Yes** (BM25) | No | Yes |
| Needs API key | **No** | No | Yes |
| Needs internet | **No** | No | Yes |
| Dependencies | **Zero** | Zero | `openai` package |
| Cost | **Free** | Free | ~$0.0002/doc |
| Latency | **<5ms** | <2ms | ~200ms |

The `HybridScorer` is the default because it gives the best results for zero cost and zero dependencies. For production use cases with highly specialized vocabulary (niche medical terms, non-English content), you can still swap in `OpenAIEmbedder` for true neural semantic matching:

```python
from contextbuddy import ContextEngine, ContextEngineConfig
from contextbuddy.embedder import OpenAIEmbedder

engine = ContextEngine(
    ContextEngineConfig(max_context_tokens=4000, dev_mode=True),
    embedder=OpenAIEmbedder(),  # neural embeddings for edge cases
)
```

Or bring your own scorer -- any object with a `score(query=..., chunks=...) -> List[float]` method works.
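
As an illustration of that protocol, here is a toy scorer with the `score(query=..., chunks=...)` shape. `KeywordOverlapScorer` is a made-up example class; how a custom scorer is handed to the engine is not covered in this README, so the sketch only shows the object itself.

```python
from typing import List


class KeywordOverlapScorer:
    """Toy scorer: fraction of query words that appear in each chunk."""

    def score(self, query: str, chunks: List[str]) -> List[float]:
        query_words = set(query.lower().split())
        if not query_words:
            return [0.0 for _ in chunks]
        return [
            len(query_words & set(chunk.lower().split())) / len(query_words)
            for chunk in chunks
        ]


scores = KeywordOverlapScorer().score(
    query="payment terms",
    chunks=["Payment terms are net 30.", "Cafeteria hours are 12pm to 2pm."],
)
print(scores)  # [1.0, 0.0]
```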

---

## Document Loaders

```python
from contextbuddy.loaders import load

load("report.pdf")                    # PDF (pip install contextbuddy[pdf])
load("https://docs.example.com")      # Web page (pip install contextbuddy[web])
load("notes.docx")                    # Word doc (pip install contextbuddy[docx])
load("data.csv")                      # CSV (rows as chunks)
load("config.json")                   # JSON (keys/items as chunks)
load("./documents/")                  # Entire directory (recursive)
load(["a.pdf", "b.txt", "c.docx"])   # Batch load
```

Zero-dep formats: `.txt`, `.md`, `.csv`, `.json`, `.log`, `.xml`, `.yaml`, `.html`

---

## Vector Store

```python
from contextbuddy import MemoryStore, PersistentStore, load

# In-memory (default)
store = MemoryStore()
store.add(load("report.pdf"), metadata={"source": "report.pdf"})
store.add(load("notes.txt"), metadata={"source": "notes.txt"})
results = store.search("payment terms", top_k=10)

# Persistent (survives restarts)
store = PersistentStore("./my_index.json")
store.add(load("./docs/"))
# Auto-saves to disk. Reloads on next init.
```

Features: auto-deduplication, metadata tracking, serialization, pure-Python cosine search.
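
"Pure-Python cosine search" can be pictured with the sketch below: a brute-force scan over stored vectors using only the standard library. The function names and the `(text, vector)` index layout are assumptions for illustration, not the store's actual internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the best k."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A linear scan like this costs one pass over the index per query, which is why a dedicated vector database becomes attractive at larger scales.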

---

## Smart Model Router

Route simple queries to cheap models. Route complex ones to expensive models. All offline.

```python
from contextbuddy import Router, Pipeline

router = Router([
    {"max_complexity": 0.3, "model": "gpt-4o-mini"},
    {"max_complexity": 1.0, "model": "gpt-4o"},
])

pipeline = Pipeline.from_directory("./docs/", router=router, dev_mode=True)
result = pipeline.query(
    "Summarize the contract",
    llm_calls={
        "gpt-4o-mini": lambda p: cheap_client.responses.create(model="gpt-4o-mini", input=p),
        "gpt-4o": lambda p: expensive_client.responses.create(model="gpt-4o", input=p),
    },
)
```
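
One plausible reading of the rule list above is "use the first model whose `max_complexity` ceiling covers the query's complexity score". The sketch below implements only that lookup. `pick_model` is a hypothetical function, the inclusive boundary is an assumption, and how ContextBuddy computes the complexity score is not shown here.

```python
from typing import Dict, List, Union

Rule = Dict[str, Union[float, str]]


def pick_model(complexity: float, rules: List[Rule]) -> str:
    """Return the model of the lowest rule whose ceiling covers `complexity`."""
    for rule in sorted(rules, key=lambda r: float(r["max_complexity"])):
        if complexity <= float(rule["max_complexity"]):
            return str(rule["model"])
    raise ValueError(f"no rule covers complexity {complexity}")


rules: List[Rule] = [
    {"max_complexity": 0.3, "model": "gpt-4o-mini"},
    {"max_complexity": 1.0, "model": "gpt-4o"},
]
print(pick_model(0.2, rules))  # gpt-4o-mini
print(pick_model(0.8, rules))  # gpt-4o
```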

---

## Caching

```python
from contextbuddy import Pipeline, EmbeddingCache, ResponseCache

pipeline = Pipeline.from_directory(
    "./docs/",
    embedding_cache=EmbeddingCache(persist_path="./cache/embeddings.json"),
    response_cache=ResponseCache(ttl_seconds=3600),
)
# First query embeds + calls LLM. Second identical query: instant.
```
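
To show what a TTL-based response cache does conceptually, here is a small stand-alone sketch. `TTLResponseCache` is a hypothetical class, not the library's `ResponseCache`; keying on a SHA-256 hash of the prompt and the injectable clock are choices made for this illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """Cache responses by prompt hash and expire them after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self._entries[self._key(prompt)] = (self._clock(), response)

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return response
```

An identical prompt within the TTL window returns the stored response without calling the LLM; once the entry expires, the next call goes through as usual.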

---

## Agent Tools

ContextBuddy generates OpenAI-compatible function/tool schemas for agents:

```python
from contextbuddy.tools import make_search_tool, make_compress_tool, handle_tool_call

tools = [make_search_tool(store), make_compress_tool(engine)]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)

# Dispatch tool calls (`tool_calls` is None when the model answers directly)
for tc in response.choices[0].message.tool_calls or []:
    result = handle_tool_call(tc, tools)
```

---

## Streaming

```python
for chunk in engine.run(
    user_prompt="Summarize",
    context=load("report.pdf"),
    llm_call=lambda p: client.responses.create(model="gpt-4o-mini", input=p, stream=True),
    stream=True,
):
    print(chunk, end="")
```

---

## OpenAI Drop-in Wrapper

Zero code changes to your existing app:

```python
import openai
from contextbuddy import wrap_openai

client = wrap_openai(openai.OpenAI(), max_context_tokens=4000, dev_mode=True)
# Use client.chat.completions.create() exactly as before.
# System messages are automatically compressed.
```

---

## CLI (no API key needed)

```bash
echo "Your huge context..." | python -m contextbuddy compress \
    --prompt "What are the key points?" \
    --max-tokens 2000 \
    --show-prompt

python -m contextbuddy compress \
    --file report.txt \
    --prompt "Extract action items" \
    --model gpt-4o
```

---

## Entity Types Preserved

| Category | Examples |
|----------|----------|
| Emails | `alice@example.com` |
| URLs | `https://api.example.com/v2/users` |
| Dates | `2026-04-13`, `04/13/2026`, `2026-04-13T10:30` |
| UUIDs | `550e8400-e29b-41d4-a716-446655440000` |
| Tickets | `JIRA-1234`, `ACME-2041` |
| Phone numbers | `+1-555-867-5309` |
| Money | `$4,500.00`, `1000 USD` |
| IPs | `192.168.1.100` |
| ID-like values | `account_id=acct_12345` |
| Versions | `v2.1.0` |
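
As a rough illustration of how regex-based entity detection can work, the sketch below covers a few of the categories above. The patterns and the `find_entities` helper are simplified assumptions written for this example; they are not the patterns ContextBuddy ships.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns (not the library's own).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ticket": re.compile(r"\b[A-Z]{2,10}-\d+\b"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_entities(text: str) -> Dict[str, List[str]]:
    """Return every match per category, in order of appearance."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = (
    "Invoice INV-92831 for $4,500.00 was sent to alice@example.com "
    "on 2026-04-01 from 192.168.1.100."
)
found = find_entities(sample)
```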

---

## Pre-built Model Pricing

```python
from contextbuddy.pricing import (
    OPENAI_GPT4O, OPENAI_GPT4O_MINI, OPENAI_GPT41, OPENAI_GPT41_MINI,
    OPENAI_O3, OPENAI_O4_MINI,
    CLAUDE_OPUS_4, CLAUDE_SONNET_4, CLAUDE_HAIKU_35,
    GEMINI_25_PRO, GEMINI_25_FLASH,
    LOCAL_FREE,
    get_pricing,  # get_pricing("gpt-4o") -> ModelPricing
)
```

---

## Programmatic Report

```python
report = engine.last_report

report.original_prompt_tokens   # 15000
report.final_prompt_tokens      # 3000
report.reduction_pct            # 80.0
report.estimated_savings        # 0.06
report.kept_chunks              # 4
report.total_chunks             # 12
report.entities                 # ["INV-92831", "2026-04-01", ...]
```

---

## Public API Reference

| Export | Module | Description |
|---|---|---|
| `ContextEngine` | `contextbuddy.engine` | Core compression engine |
| `ContextEngineConfig` | `contextbuddy.engine` | Configuration dataclass |
| `ContextReport` | `contextbuddy.engine` | Compression telemetry / ROI report |
| `HybridScorer` | `contextbuddy.hybrid_scorer` | BM25 + stemming + synonyms + n-grams scorer |
| `SemanticScorer` | `contextbuddy.scoring` | Embedding-based cosine scorer |
| `MemoryStore` | `contextbuddy.store.memory` | In-memory vector store |
| `PersistentStore` | `contextbuddy.store.persistent` | Disk-backed vector store |
| `Retriever` | `contextbuddy.retriever` | Search + compress pipeline |
| `Pipeline` | `contextbuddy.pipeline` | Full end-to-end pipeline |
| `Router` | `contextbuddy.router` | Complexity-based model router |
| `EmbeddingCache` | `contextbuddy.cache` | Persistent embedding cache |
| `ResponseCache` | `contextbuddy.cache` | TTL response cache |
| `ContextBuddyCompressor` | `contextbuddy.langchain` | LangChain `BaseDocumentCompressor` |
| `ContextBuddyRetriever` | `contextbuddy.langchain` | LangChain `BaseRetriever` with compression |
| `wrap_openai` | `contextbuddy.wrappers` | OpenAI client drop-in wrapper |
| `load` | `contextbuddy.loaders` | Universal file/URL/directory loader |
| `get_pricing` | `contextbuddy.pricing` | Model pricing lookup |
| `Embedder` | `contextbuddy.types` | Protocol for custom embedders |
| `Tokenizer` | `contextbuddy.types` | Protocol for custom tokenizers |

---

## Real-World Use Cases

### Customer Support Bot

Your chatbot pulls a customer's full history (invoices, tickets, emails, notes) for every query -- ~15,000 tokens. Most of it is irrelevant.

```python
from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./customer_data/acct_12345/", dev_mode=True, max_context_tokens=3000)
answer = pipeline.query("What was my last invoice amount?", llm_call=my_llm)
# [ContextBuddy] 15000 -> 2800 tokens (81.3% reduction). Est. savings: $0.0305
# Entity keep-list preserved: INV-92831, $4,500.00, 2026-04-01, acct_12345
```

At 10,000 queries/day: **$11,250/month without ContextBuddy vs $2,250/month with it.**
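
The monthly figures can be reproduced with simple arithmetic. The sketch below assumes a 30-day month, an input price of $2.50 per million tokens, and roughly 3,000 tokens per compressed prompt (the configured budget); those inputs are assumptions chosen because they match the numbers above, so substitute your own model's pricing.

```python
def monthly_input_cost(
    tokens_per_query: int,
    queries_per_day: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Input-token cost per month in USD (output tokens are ignored here)."""
    total_tokens = tokens_per_query * queries_per_day * days
    return total_tokens * usd_per_million_tokens / 1_000_000


without_compression = monthly_input_cost(15_000, 10_000, 2.50)  # 11250.0
with_compression = monthly_input_cost(3_000, 10_000, 2.50)      # 2250.0
```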

### Legal Document Review

A law firm uploads a 50-page contract. Lawyers ask questions about specific clauses.

```python
from contextbuddy import Pipeline

pipeline = Pipeline.from_directory("./contracts/", dev_mode=True, max_context_tokens=4000)
answer = pipeline.query("What are the payment terms and late penalties?", llm_call=my_llm)
```

ContextBuddy loads the PDF, indexes 200+ paragraphs, retrieves the relevant ones, prunes to the 5 that matter, and preserves all clause numbers, dates, and dollar amounts. Without it, you'd need LangChain + a vector database + 50 lines of glue code.

### Internal Knowledge Base

500 internal docs (Confluence exports, PDFs, Markdown). Engineers ask questions via a Slack bot.

```python
from contextbuddy import Pipeline, PersistentStore, Router

pipeline = Pipeline(
    store=PersistentStore("./index.json"),
    router=Router([
        {"max_complexity": 0.3, "model": "gpt-4o-mini"},
        {"max_complexity": 1.0, "model": "gpt-4o"},
    ]),
    dev_mode=True,
)
pipeline.add("./company_docs/")
answer = pipeline.query(slack_message, llm_calls={"gpt-4o-mini": cheap_fn, "gpt-4o": expensive_fn})
```

Simple questions ("What's the WiFi password?") route to the cheap model. Complex questions ("Compare our auth architecture options") route to the expensive one. **The router alone saves 60-70% on top of compression.**

---

## When NOT to Use ContextBuddy

Being honest:

- **Full agent orchestration** (multi-step reasoning, tool chains, long-term memory) -- use LangGraph or CrewAI instead. ContextBuddy compresses context; it doesn't orchestrate agents.
- **Very large-scale vector search** -- if you have 100M+ documents and need sub-millisecond search, use Pinecone or Weaviate directly. ContextBuddy's in-memory store is designed for <100k chunks.
- **Already deep in LangChain and it's working** -- don't rewrite. Instead, add ContextBuddy as a compression layer with zero disruption:

```python
from langchain.retrievers import ContextualCompressionRetriever
from contextbuddy import ContextBuddyCompressor

# 4 lines. Your existing retriever stays untouched.
compressor = ContextBuddyCompressor(max_context_tokens=4000)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=your_existing_langchain_retriever,
)
docs = compressed_retriever.invoke(user_question)
# Irrelevant chunks pruned, entities preserved, token budget enforced.
# Pass `docs` to your chain exactly as before -- just cheaper.
```

Or, if you prefer the lower-level approach:

```python
from contextbuddy import ContextEngine

engine = ContextEngine(max_context_tokens=4000)

# Inside your LangChain chain, after retrieval but before the LLM call:
compressed_prompt, report = engine.build_prompt(
    user_prompt=user_question,
    context=retrieved_documents,   # from your existing LangChain retriever
)
# Pass compressed_prompt to your LLM instead of the raw retrieved docs
```

---

## FAQ

**Will this hurt answer quality?**
It can if you prune too aggressively. Start with `min_relevance=0.10` and inspect the compressed prompt in dev mode. The entity keep-list ensures critical data points survive.

**Does it send my data anywhere?**
Not by default. The built-in embedder and vector store run 100% locally with zero dependencies. Only if you explicitly plug in `OpenAIEmbedder` does it call an external API.

**Does it work with async frameworks (FastAPI, etc.)?**
`engine.arun()` is async-compatible -- the LLM call is awaited. Note: the compression step (chunking + scoring) runs synchronously inside the coroutine. For high-concurrency workloads, wrap compression with `asyncio.to_thread(engine.build_prompt, ...)`. True async compression is planned for a future release.
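
Here is a minimal sketch of the `asyncio.to_thread` pattern. `build_prompt_blocking` is a hypothetical stand-in for a synchronous compression call such as `engine.build_prompt`, so the example runs without ContextBuddy installed.

```python
import asyncio
import time
from typing import List


def build_prompt_blocking(user_prompt: str, context: str) -> str:
    """Hypothetical stand-in for a synchronous compression step."""
    time.sleep(0.05)  # simulate blocking work
    return f"{user_prompt}\n\n{context[:100]}"


async def handle_request(user_prompt: str, context: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(build_prompt_blocking, user_prompt, context)


async def main() -> List[str]:
    # Two requests are compressed concurrently instead of one after the other.
    return await asyncio.gather(
        handle_request("Summarize", "doc A " * 50),
        handle_request("Summarize", "doc B " * 50),
    )


prompts = asyncio.run(main())
```

`asyncio.to_thread` is available from Python 3.9, which matches this package's minimum supported version.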

**Does it work with streaming?**
Yes. Pass `stream=True` to `engine.run()`. ContextBuddy emits the ROI report, then yields LLM chunks.

**How accurate is the token count?**
The default `HeuristicTokenizer` uses a 4-chars-per-token rule. For exact counts: `pip install "contextbuddy[tiktoken]"`.
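
The 4-chars-per-token rule is easy to reason about. A sketch of the idea follows; the function name and the choice to round up are assumptions, and `HeuristicTokenizer` may round differently.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count: characters divided by 4, rounded up."""
    return math.ceil(len(text) / chars_per_token)


short = estimate_tokens("abcde")       # 5 chars
longer = estimate_tokens("a" * 32)     # 32 chars
```

Heuristics like this are fine for budgeting but can drift noticeably on code, non-English text, or long numbers, which is when an exact tokenizer is worth installing.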

**Can I use this in production?**
Yes. The core pipeline is deterministic, dependency-free, and fast (<10ms for typical payloads). Set `dev_mode=False` to disable telemetry.

**How is this different from LangChain?**
ContextBuddy is **compression-first**. LangChain retrieves context but sends it all to the LLM. ContextBuddy retrieves, compresses, preserves entities, and shows you exactly how much you're saving. Zero core dependencies vs 100+. And with the `[langchain]` extra, the two work together -- ContextBuddy plugs in as the compression layer LangChain never had.

**Does it work with LangChain?**
Yes, natively. Install `contextbuddy[langchain]` and use `ContextBuddyCompressor` as a drop-in `base_compressor` for `ContextualCompressionRetriever`, or use `ContextBuddyRetriever` to wrap a `MemoryStore`. See the [LangChain Integration](#langchain-integration) section.

**How does compression work without an LLM?**
It doesn't need one. The pipeline has four stages: (1) document-aware chunking, (2) relevance scoring via BM25 + stemming + synonym expansion + character n-gram fuzzy matching, (3) entity force-keep — any chunk containing an ID, date, dollar amount, UUID, etc. is kept regardless of score, (4) greedy budget packing. No neural network, no API calls, no randomness. Sub-5ms on a typical payload.
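
Stages 3 and 4 can be sketched in a few lines. Everything below is illustrative: the `ENTITY` pattern, the `tokens` estimate, and the `pack` function are assumptions written for this example, and relevance scores are passed in rather than computed. Note that in this sketch force-kept chunks count toward the budget but are never dropped, so they can exceed it.

```python
import re
from typing import List, Tuple

# Simplified entity pattern: ticket-like IDs, ISO dates, dollar amounts.
ENTITY = re.compile(r"\b[A-Z]{2,10}-\d+\b|\b\d{4}-\d{2}-\d{2}\b|\$\d[\d,]*(?:\.\d{2})?")


def tokens(text: str) -> int:
    """Rough token estimate (4 characters per token)."""
    return max(1, len(text) // 4)


def pack(chunks: List[Tuple[str, float]], budget: int) -> List[str]:
    """chunks are (text, relevance score) pairs; returns kept texts in original order."""
    keep = set()
    used = 0
    # Entity force-keep: any chunk containing an entity survives regardless of score.
    for i, (text, _) in enumerate(chunks):
        if ENTITY.search(text):
            keep.add(i)
            used += tokens(text)
    # Greedy budget packing: add remaining chunks, best score first, while they fit.
    rest = sorted(
        (i for i in range(len(chunks)) if i not in keep),
        key=lambda i: chunks[i][1],
        reverse=True,
    )
    for i in rest:
        cost = tokens(chunks[i][0])
        if used + cost <= budget:
            keep.add(i)
            used += cost
    return [chunks[i][0] for i in sorted(keep)]


chunks = [
    ("Thanks for contacting support, we value your business.", 0.05),
    ("Invoice INV-92831 totals $4,500.00.", 0.10),
    ("Payment is due within 30 days of the invoice date.", 0.90),
    ("Our office is closed on public holidays.", 0.20),
]
kept = pack(chunks, budget=25)
```

The low-scoring invoice line survives because it contains entities, the highest-scoring remaining chunk fills the rest of the budget, and the two filler chunks are dropped.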

**How do you guarantee compression quality without an LLM?**
Two ways. First, the entity keep-list is a hard guarantee — regex-matched entities (IDs, dates, money, tickets) always survive, no matter what the scorer says. Second, every release must pass a benchmark gate: 100% entity survival rate and a minimum answer coverage threshold. If a code change breaks either, it doesn't ship. You can run the gate yourself: `python -m contextbuddy bench --gate`.

**Do I need to set up OpenAI/Gemini/Meta embeddings manually?**
No. Each provider is a one-line install:
```bash
pip install "contextbuddy[openai]"    # OpenAI
pip install "contextbuddy[gemini]"    # Google Gemini
pip install "contextbuddy[ollama]"    # Meta Llama / any local model via Ollama (no API key)
pip install "contextbuddy[sbert]"     # sentence-transformers (fully local, no API key)
```
Then pass the embedder as a single argument to `ContextEngine`. Your API key goes in the environment (`OPENAI_API_KEY`, `GOOGLE_API_KEY`). Nothing else to configure.

**What about Meta / Llama embeddings specifically?**
Meta doesn't offer a hosted embedding API, so the practical path is Ollama — install [Ollama](https://ollama.com), pull a model (`ollama pull nomic-embed-text`), and use `OllamaEmbedder`. Runs fully local, no API key, no data leaving your machine, zero cost.

**Why use this over other tools?**
Most other tools solve retrieval -- fetching the right documents. None of them compress what they retrieve. They send all 20 chunks to the LLM regardless of relevance. ContextBuddy cuts that down to the 4 that actually matter, preserves every entity, enforces a token budget, and shows you the dollar savings on every call. It's not a replacement for those tools -- it's the compression layer they're all missing. And it plugs into all of them with 3 lines.

---

## Why I Built This

I'm a recent CS grad. I was deep in the rabbit hole of context engineering -- reading papers, watching talks, experimenting with how LLMs actually use the context you feed them. And I kept hitting the same wall.

I had a project that needed RAG. Load some PDFs, ask questions, get answers. Simple, right? So I reached for LangChain. And then I spent two days wrestling with 100+ dependencies, cryptic abstractions, and a codebase that felt like it was designed for a different problem. I just wanted to load a PDF and compress the context before sending it to an LLM. I didn't need an agent framework. I didn't need a plugin ecosystem. I needed maybe 200 lines of focused code.

So I closed my laptop, went for a walk, and thought: **what if the entire layer between "raw data" and "LLM call" was just... simple?**

That's what ContextBuddy is. It's the library I wished existed when I started.

The core insight was that most LLM applications are sending 5-10x more context than they need to. You scrape a 50-page contract, dump the whole thing into GPT-4, and pay for 15,000 tokens when only 3,000 matter. The LLM doesn't even perform better with the extra noise -- it performs *worse*. Context engineering isn't about stuffing more tokens in. It's about sending the *right* tokens.

I built ContextBuddy with a few principles:

1. **Zero dependencies for the core.** If you just want to compress text, you shouldn't need to install anything else. No numpy. No torch. No tiktoken. Just Python.
2. **Three lines to integrate.** If it takes more than that, developers will bounce. I know because I bounced.
3. **Show the ROI.** Every call prints exactly how many tokens and dollars you saved. Not because it's a gimmick -- because developers need to justify tool choices to their managers, and a screenshot of "$0.12 saved per call" does that instantly.
4. **Grow with you.** Start with 3 lines. When you need PDF loading, add it. When you need a vector store, add it. When you need model routing, add it. You should never have to rip out ContextBuddy and replace it with LangChain because you outgrew it.

I'm not claiming this replaces LangChain for every use case. If you need multi-step agent orchestration with tool chains and long-term memory, LangChain/LangGraph is the right call. But for the 80% of LLM applications that just need to load data, compress context, and call a model? ContextBuddy does it in a fraction of the code, with zero bloat, and it shows you exactly how much money you're saving.

This started as a side project born out of frustration. I'm sharing it because I think every developer building with LLMs deserves a simpler option.

If it saves you time or money, star the repo. That's all I ask.

---

## Contributing

```bash
git clone https://github.com/mohithgowdak/ContextBuddy.git
cd ContextBuddy
pip install -e ".[dev]"
pytest
```

---

## License

MIT License. See [LICENSE](LICENSE).
