Metadata-Version: 2.4
Name: genassert
Version: 0.2.1
Summary: pytest-native semantic testing for LLM and generative AI applications. No servers. No SaaS. Works with OpenAI, Anthropic, LiteLLM and any LLM client.
Project-URL: Homepage, https://github.com/genassert/genassert
Project-URL: Documentation, https://genassert.readthedocs.io
Project-URL: Repository, https://github.com/genassert/genassert
Project-URL: Bug Tracker, https://github.com/genassert/genassert/issues
Project-URL: Changelog, https://github.com/genassert/genassert/blob/main/CHANGELOG.md
Author: genassert contributors
License: MIT
License-File: LICENSE
Keywords: agent testing,ai quality assurance,ai testing,anthropic,claude testing,gen ai,genai testing,generative ai,generative ai testing,golden baseline,gpt testing,hallucination detection,langchain,llm,llm assertions,llm evaluation,llm quality,llm testing,machine learning testing,openai,prompt testing,pytest,pytest plugin,rag testing,regression testing,semantic assertions,semantic testing,testing
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.9
Provides-Extra: all
Requires-Dist: jsonschema>=4.0.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: pydantic>=2.0.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.7.0; extra == 'all'
Requires-Dist: tiktoken>=0.5.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: openai>=1.0.0; extra == 'dev'
Requires-Dist: pydantic>=2.0.0; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: sentence-transformers>=2.7.0; extra == 'dev'
Provides-Extra: jsonschema
Requires-Dist: jsonschema>=4.0.0; extra == 'jsonschema'
Provides-Extra: judge
Requires-Dist: torch>=2.0.0; extra == 'judge'
Requires-Dist: transformers>=4.40.0; extra == 'judge'
Provides-Extra: local
Requires-Dist: sentence-transformers>=2.7.0; extra == 'local'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: pydantic
Requires-Dist: pydantic>=2.0.0; extra == 'pydantic'
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5.0; extra == 'tiktoken'
Description-Content-Type: text/markdown

# genassert

**pytest-native semantic testing for LLM applications.**  
No servers. No SaaS. No config. Works with OpenAI, Anthropic, LiteLLM, and any LLM client.

[![PyPI version](https://badge.fury.io/py/genassert.svg)](https://pypi.org/project/genassert/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![pytest](https://img.shields.io/badge/framework-pytest-orange)](https://docs.pytest.org/)

---

## Why genassert?

Traditional `assert response == expected` breaks the moment your LLM changes a word.  
`genassert` gives you **semantic assertions** — tests that check *meaning*, not strings.

| Problem | Traditional testing | genassert |
|---------|-------------------|-----------|
| LLM changes wording | Test breaks | Test passes (same meaning) |
| Response drifts over time | No detection | Baseline regression alert |
| Wrong tone shipped | No check | `assert_tone(response, "professional")` |
| Hallucination in response | No check | `assert_no_hallucination(response, facts)` |
| Response too long | Manual count | `assert_token_budget(response, 200)` |
| Schema mismatch | Try/except JSON | `assert_schema(response, MyPydanticModel)` |

---

## Install

```bash
# Minimal install (uses hash-based fallback embedder)
pip install genassert

# Recommended: local embeddings — no API cost, runs in CI for free
pip install "genassert[local]"

# OpenAI embeddings backend
pip install "genassert[openai]"

# Everything
pip install "genassert[all]"
```

---

## Quick Start

```python
# test_my_llm.py
import pytest
from genassert import (
    assert_intent,
    assert_tone,
    assert_no_hallucination,
    assert_token_budget,
    assert_schema,
)

@pytest.mark.llm
def test_summarizer():
    response = my_summarize_function("Long article about climate change...")

    # Check the response is actually a summary
    assert_intent(response, "a concise summary of the article")

    # Check it's neutral — no opinion
    assert_tone(response, "neutral")

    # Check it doesn't hallucinate key facts
    assert_no_hallucination(response, known_facts=[
        "The article is about climate change",
        "CO2 levels are rising",
    ])

    # Check it's not too long
    assert_token_budget(response, max_tokens=250)
```

Run it:
```bash
pytest test_my_llm.py -v
```

That's it. No config files. No API keys needed (with `[local]` install).

---

## All Assertions

### `assert_intent(response, expected_intent, threshold=0.72)`

Checks that the response semantically addresses the expected intent.

```python
assert_intent(response, "a polite refusal to the user's request")
assert_intent(response, "Python code that reads a CSV file", threshold=0.80)
assert_intent(response, "step-by-step instructions for setting up Docker")
```
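
To make the `threshold` parameter concrete, here is a minimal sketch of a similarity-threshold check. It assumes cosine similarity over embedding vectors, which this README does not confirm as genassert's exact method. `cosine_similarity`, `check_threshold`, and the toy vectors are hypothetical and not part of the genassert API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_threshold(response_vec, intent_vec, threshold=0.72):
    """Raise AssertionError when the similarity falls below the threshold."""
    score = cosine_similarity(response_vec, intent_vec)
    if score < threshold:
        raise AssertionError(
            f"similarity {score:.3f} is below threshold {threshold:.2f}"
        )
    return score


# Made-up 3-dimensional "embeddings", for illustration only.
intent_vec = [1.0, 0.0, 0.0]
close_vec = [0.9, 0.1, 0.0]   # points almost the same way as intent_vec
far_vec = [0.0, 1.0, 0.0]     # orthogonal to intent_vec

score = check_threshold(close_vec, intent_vec)  # passes
```

Raising the threshold makes the check stricter; lowering it tolerates looser paraphrases.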

### `assert_tone(response, expected_tone, threshold=0.65)`

Checks the tone/style of the response.

**Built-in tones:** `professional`, `casual`, `friendly`, `formal`, `neutral`, `empathetic`, `assertive`, `humorous`, `concise`

```python
assert_tone(response, "professional")
assert_tone(response, "friendly and concise")       # custom description
assert_tone(response, "formal but empathetic")      # combine tones
```

### `assert_no_hallucination(response, known_facts)`

Checks that the response does NOT contradict known facts.

```python
assert_no_hallucination(response, known_facts=[
    "The product costs $49 per month",
    "The free trial lasts 14 days",
    "Python was created by Guido van Rossum",
])
```

### `assert_token_budget(response, max_tokens, tokenizer="approx")`

Checks the response doesn't exceed a token budget.

```python
assert_token_budget(response, max_tokens=200)                       # fast approx
assert_token_budget(response, max_tokens=200, tokenizer="tiktoken") # exact (pip install "genassert[tiktoken]")
assert_token_budget(response, max_tokens=800, tokenizer="chars")    # character-based
```
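
For intuition about the `approx` mode, the sketch below uses the common rule of thumb of roughly four characters per token for English text. This is an assumption, not necessarily the heuristic genassert uses. `approx_token_count` and `within_budget` are hypothetical helpers.

```python
import math


def approx_token_count(text, chars_per_token=4):
    """Rough token estimate: about one token per four characters."""
    return math.ceil(len(text) / chars_per_token)


def within_budget(text, max_tokens):
    """True when the estimated token count fits the budget."""
    return approx_token_count(text) <= max_tokens


sample = "Paris is the capital of France."  # 31 characters
estimate = approx_token_count(sample)        # ceil(31 / 4)
```

An estimate like this is fast but can drift from a real tokenizer's count, so leave some headroom in the budget or switch to `tokenizer="tiktoken"` when the limit is tight.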

### `assert_schema(response, schema)`

Checks that the response (JSON string) matches a Pydantic model or JSON schema.

```python
from pydantic import BaseModel

class Summary(BaseModel):
    title: str
    body: str
    word_count: int

result = assert_schema(response, Summary)
print(result.title)   # validated Pydantic instance
```

```python
# Or a raw JSON Schema dict
schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}
assert_schema(response, schema)
```
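
Conceptually, a schema assertion can fail in two ways: the response is not valid JSON, or the JSON lacks the expected fields. A stdlib-only sketch of those two failure modes follows. `check_required_strings` is a hypothetical helper, far simpler than Pydantic or JSON Schema validation, and it is not genassert's implementation.

```python
import json


def check_required_strings(raw, required):
    """Parse a JSON string and confirm each required key holds a string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise AssertionError("response JSON is not an object")
    for key in required:
        if key not in data:
            raise AssertionError(f"missing required field: {key}")
        if not isinstance(data[key], str):
            raise AssertionError(f"field {key!r} is not a string")
    return data


parsed = check_required_strings('{"title": "Climate update"}', ["title"])
```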

### `assert_similar_to(response, reference, threshold=0.80)`

Checks that the response is semantically close to a reference string.
Useful for golden-baseline regression.

```python
score = assert_similar_to(response, golden_response, threshold=0.85)
print(f"Similarity: {score:.3f}")
```

---

## Golden Baseline Regression Testing

Record a known-good response once, then detect regressions on every CI run.

```python
from genassert import record_baseline, compare_baseline

# Step 1: record (run once, then commit the .genassert_baselines/ directory)
response = my_summarize("article...")
record_baseline("summarizer_v1", response)

# Step 2: compare on every subsequent run
def test_summarizer_no_regression():
    response = my_summarize("article...")
    compare_baseline("summarizer_v1", response, threshold=0.85)
```

Or use the `llm_record` pytest fixture, which reflects the `--record-baselines` flag:

```python
from genassert import record_baseline, compare_baseline

def test_summarizer_baseline(llm_record):
    response = my_summarize("article...")
    if llm_record:
        record_baseline("summarizer", response, overwrite=True)
    else:
        compare_baseline("summarizer", response)
```

```bash
# First run — record
pytest --record-baselines

# Every subsequent run — compare
pytest
```
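
The record/compare workflow can be pictured as: store one response per name on disk, then score later responses against it. The sketch below is an illustration only. The JSON file layout and the helpers `save_baseline` and `check_baseline` are assumptions, and `difflib.SequenceMatcher` is a character-level stand-in for the embedding similarity genassert would use.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path


def save_baseline(directory, name, response, overwrite=False):
    """Write the response to <directory>/<name>.json (hypothetical layout)."""
    path = Path(directory) / f"{name}.json"
    if path.exists() and not overwrite:
        raise FileExistsError(f"baseline {name!r} already exists")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"name": name, "response": response}),
                    encoding="utf-8")
    return path


def check_baseline(directory, name, response, threshold=0.85):
    """Compare a new response to the stored one; raise if it drifted."""
    path = Path(directory) / f"{name}.json"
    stored = json.loads(path.read_text(encoding="utf-8"))["response"]
    # Stand-in metric: character-level ratio, not semantic similarity.
    score = SequenceMatcher(None, stored, response).ratio()
    if score < threshold:
        raise AssertionError(
            f"baseline {name!r} drifted: {score:.3f} < {threshold:.2f}"
        )
    return score


with tempfile.TemporaryDirectory() as tmp:
    golden = "CO2 levels are rising worldwide."
    save_baseline(tmp, "summarizer_v1", golden)
    same_score = check_baseline(tmp, "summarizer_v1", golden)
```

Committing the baseline directory keeps the reference response under version control, so a changed baseline shows up in code review like any other diff.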

---

## Local Judge (Zero API Cost)

Use `LocalJudge` for complex, nuanced evaluations that go beyond embedding similarity:

```python
from genassert import LocalJudge

judge = LocalJudge()   # uses a tiny local model (auto-downloaded)

result = judge.evaluate(
    response="Paris is the capital of France.",
    criterion="The response correctly answers a geography question.",
)

assert result.passed
print(f"Score: {result.score:.2f}")
print(f"Reasoning: {result.reasoning}")
```

Install the local judge backend:
```bash
pip install "genassert[judge]"   # installs transformers + torch
```

---

## pytest CLI Options

```bash
# Skip all LLM tests (useful in fast unit-test runs)
pytest --skip-llm

# Override similarity threshold globally
pytest --llm-threshold=0.75

# Record golden baselines
pytest --record-baselines
```

---

## Configuration

All settings are read from environment variables; no config files are needed:

| Variable | Default | Description |
|----------|---------|-------------|
| `genassert_EMBED_BACKEND` | `auto` | Embedding backend: `auto`, `local`, `openai`, or `fallback` |
| `genassert_EMBED_MODEL` | `all-MiniLM-L6-v2` | Embedding model name |
| `genassert_JUDGE_MODEL` | `Qwen/Qwen2.5-0.5B-Instruct` | Local judge model |
| `genassert_BASELINE_DIR` | `.genassert_baselines` | Baseline storage directory |
| `OPENAI_API_KEY` | — | Required for `openai` backend |
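
If you want to log the effective settings at the start of a CI run, a small helper like the one below can resolve each variable from the table against its documented default. `effective_settings` is a hypothetical helper; how genassert reads these variables internally is not shown here.

```python
import os

# Names and defaults copied from the table above.
DEFAULTS = {
    "genassert_EMBED_BACKEND": "auto",
    "genassert_EMBED_MODEL": "all-MiniLM-L6-v2",
    "genassert_JUDGE_MODEL": "Qwen/Qwen2.5-0.5B-Instruct",
    "genassert_BASELINE_DIR": ".genassert_baselines",
}


def effective_settings(env=None):
    """Return each documented variable with its override or its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Passing a dict instead of os.environ keeps the example deterministic.
settings = effective_settings({"genassert_EMBED_BACKEND": "local"})
```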

---

## Embedding Backends

| Backend | Speed | Cost | Accuracy | Install |
|---------|-------|------|----------|---------|
| `local` (sentence-transformers) | Fast | Free | High | `pip install "genassert[local]"` |
| `openai` | Moderate | ~$0.0001/test | Very high | `pip install "genassert[openai]"` |
| `fallback` (hash-based) | Instant | Free | Smoke test only | Built-in |

Set backend:
```bash
export genassert_EMBED_BACKEND=local    # recommended for CI
export genassert_EMBED_BACKEND=openai   # highest accuracy
export genassert_EMBED_BACKEND=fallback # no deps, structural tests only
```
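
To see why a hash-based embedder is only suitable for smoke tests, consider this sketch of a hashed bag-of-words embedding. It is an assumption about how such a fallback might work, not genassert's actual code; `hashed_embedding` and `cosine` are hypothetical. Because each word is mapped to a bucket by its hash, two texts score high only when they share words, so a paraphrase with different vocabulary tends to score low even though the meaning is the same.

```python
import hashlib
import math


def hashed_embedding(text, dim=64):
    """Count words into `dim` buckets chosen by a stable hash of each word."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


identical = cosine(hashed_embedding("refund approved"),
                   hashed_embedding("refund approved"))
# Word order is ignored: both texts fill the same buckets.
reordered_equal = (hashed_embedding("refund approved")
                   == hashed_embedding("approved refund"))
# Shares no words with "refund approved", so any overlap is a hash collision.
paraphrase = cosine(hashed_embedding("refund approved"),
                    hashed_embedding("we will return your money"))
```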

---

## Framework Compatibility

genassert is **framework-agnostic**. Use it with any LLM client:

```python
# OpenAI
import openai
client = openai.OpenAI()
response = client.chat.completions.create(...).choices[0].message.content

# Anthropic
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(...).content[0].text

# LiteLLM
import litellm
response = litellm.completion(...).choices[0].message.content

# LangChain
from langchain_openai import ChatOpenAI
response = ChatOpenAI().invoke("...").content

# Any string output works: genassert only needs the final response string
from genassert import assert_intent
assert_intent(response, "your expected intent here")
```

---

## Real-World Example: Testing a RAG Chatbot

```python
import pytest
from genassert import assert_intent, assert_no_hallucination, assert_token_budget, assert_schema

PRODUCT_FACTS = [
    "The product is called DataFlow Pro",
    "The price is $99 per month",
    "There is a 30-day free trial",
    "It supports Python, JavaScript, and Go",
]

@pytest.mark.llm
class TestChatbot:
    def test_pricing_question(self, chatbot):
        response = chatbot.ask("How much does it cost?")
        assert_intent(response, "information about pricing or cost")
        assert_no_hallucination(response, PRODUCT_FACTS)
        assert_token_budget(response, max_tokens=150)

    def test_technical_question(self, chatbot):
        response = chatbot.ask("What languages are supported?")
        assert_intent(response, "list of supported programming languages")
        assert_no_hallucination(response, PRODUCT_FACTS)

    def test_structured_response(self, chatbot):
        from pydantic import BaseModel
        class PricingInfo(BaseModel):
            price: str
            trial_days: int

        response = chatbot.ask_structured("Return pricing as JSON")
        assert_schema(response, PricingInfo)
```
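
The `chatbot` fixture above is not provided by genassert; it is whatever wraps your application. For trying the test layout offline, a stub like the following could live in `conftest.py`. `StubChatbot` and its canned answers are hypothetical; swap in your real client.

```python
# conftest.py
import json

import pytest


class StubChatbot:
    """Canned responses so the test layout can be exercised without an LLM."""

    def ask(self, question):
        if "cost" in question.lower():
            return "DataFlow Pro costs $99 per month, with a 30-day free trial."
        return "DataFlow Pro supports Python, JavaScript, and Go."

    def ask_structured(self, prompt):
        # Returns a JSON string shaped like the PricingInfo model above.
        return json.dumps({"price": "$99 per month", "trial_days": 30})


@pytest.fixture
def chatbot():
    """Fixture requested by name in the TestChatbot methods."""
    return StubChatbot()
```

A stub only verifies that the tests are wired up correctly; the assertions become meaningful once the fixture returns your real chatbot.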

---

## CI Integration

```yaml
# .github/workflows/llm-tests.yml
name: LLM Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install genassert
        run: pip install "genassert[local]" pytest

      - name: Run LLM tests
        run: pytest tests/ -m llm -v
        env:
          genassert_EMBED_BACKEND: local   # free, no API key needed
```

---

## License

MIT © genassert contributors

---

## Related Projects

- [pytest](https://docs.pytest.org/) — the test framework genassert is built on
- [sentence-transformers](https://www.sbert.net/) — local embedding models
- [Pydantic](https://docs.pydantic.dev/) — data validation
- [LiteLLM](https://github.com/BerriAI/litellm) — unified LLM client

---

*genassert is the missing pytest plugin for the LLM era.*  
*Stop shipping broken AI features. Start testing them.*
