Metadata-Version: 2.1
Name: evalora
Version: 0.1.2
Summary: Easy AI evals — drop into any AI code in 4 lines
Author-email: Evalora <hello@evalora.dev>
License: MIT
Project-URL: Homepage, https://evalora.dev
Project-URL: Documentation, https://evalora.dev/docs
Project-URL: Repository, https://github.com/ImJustRicky/evalora-python
Project-URL: Issues, https://github.com/ImJustRicky/evalora-python/issues
Keywords: ai,eval,llm,evaluation,testing,judge,openai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: httpx>=0.27
Requires-Dist: rich>=13.0
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2; extra == "langchain"
Requires-Dist: langchain-openai>=0.1; extra == "langchain"

# evalora

Dead-simple LLM evals. No account needed to start.

<!-- badges -->

## Install

```bash
pip install evalora
```

## Quick Start

```python
from evalora import Evalora

ev = Evalora()
run = ev.run()
run.check(lambda p: "4", "What is 2+2?", expected="4")
run.done()
```

No account, no API key. Install, run, see results.

## Key Features

- **[Local-first](docs/getting-started.md)** -- works offline with zero config, results save to `.evalora/`
- **[Exact match + semantic similarity](docs/scoring-modes.md)** -- normalized string matching or embedding-based cosine similarity
- **[AI judge panels](docs/judges.md)** -- 10 built-in presets, binary or rubric scoring with majority vote
- **[Templates](docs/advanced.md#templates)** -- 7 pre-configured eval setups: `rag-quality`, `chatbot`, `code-gen`, `summarization`, `translation`, `safety-audit`, `classification`
- **[Built-in validators](docs/validators.md)** -- `max_length`, `no_refusal`, `is_json`, and more -- pass as strings or imports
- **[Ghost mode](docs/advanced.md#ghost-mode)** -- polymorphic: `True`, `"light"`, `"hash"`, or per-field dict control
- **[Custom scorers](docs/advanced.md#custom-scorers)** -- functions returning 0.0-1.0 with configurable threshold
- **[Auto-logging](docs/advanced.md#wrap-auto-logging)** -- `ev.wrap(OpenAI())` captures all LLM calls
- **[Framework integrations](docs/advanced.md#framework-integrations)** -- LangChain, LlamaIndex, CrewAI, OpenAI Agents
- **[Cost guard](docs/advanced.md#cost-guard)** -- `max_cost=5.00` stops eval early at 90% budget
- **[Webhooks](docs/advanced.md#webhook)** -- POST results on completion, auto Slack format
- **[Trend detection](docs/advanced.md#trend-analysis)** -- `ev.trend("name", last=10, fail_on_regression=True)`
- **[File loading](docs/advanced.md#file-loading)** -- `eval_each("data.csv")` or `.json` or `.jsonl`
- **[Async support](docs/advanced.md#async-support)** -- async functions auto-detected in `eval_each`
- **[Resume on crash](docs/advanced.md#resume-crash-recovery)** -- checkpoint after every item, pick up where you left off
- **[Pairwise comparison](docs/advanced.md#pairwise-comparison)** -- head-to-head model comparison with position bias randomization
- **[pytest plugin](docs/advanced.md#pytest-plugin)** -- `evalora_run` fixture
- **[Connected mode](docs/getting-started.md)** -- send results to [evalora.dev](https://evalora.dev) for dashboards and team collaboration

## Modes at a Glance

| Mode | How it works | Config |
|---|---|---|
| Exact match | Normalized string matching (negation-aware) | `scoring="exact"` (default) |
| Semantic similarity | Cosine similarity via embeddings | `embedding_api_key="sk-..."` (auto-detected) |
| AI judges (binary) | LLM votes pass/fail with reason | `judges=["Accuracy", "Safety"]` |
| AI judges (rubric) | LLM scores 1-5 on criteria | `judges=[{..., "criteria": [...]}]` |
| Templates | Pre-configured judge + validator combos | `template="rag-quality"` |
| Custom scorers | Your functions return 0.0-1.0 | `scorers=[fn]` |
| Validators | Instant checks before judges | `validators=["not_empty", "is_json"]` |
| Pairwise | Head-to-head comparison of two runs | `run_pairwise(...)` |
| Human review | Items go to dashboard review queue | Dashboard config |

## Examples

### Basic eval

```python
ev = Evalora()
run = ev.run(name="math test")
run.eval_each([
    ("What is 2+2?", "4"),
    ("What is 3+1?", "4"),
], fn=ask)
```

### AI judges (string presets)

```python
run = ev.run(
    judges=["Accuracy", "Safety", "Helpfulness"],
    provider_keys={"openai": "sk-..."},
)
run.eval_each(items, fn=ask)
```

### Template

```python
run = ev.run(
    template="rag-quality",
    provider_keys={"openai": "sk-..."},
)
run.eval_each("questions.csv", fn=ask)
```

### Semantic similarity

```python
run = ev.run(
    embedding_api_key="sk-...",  # auto-detects semantic mode
    similarity_threshold=0.85,
)
run.check(ask, "What is the capital of France?", expected="Paris is the capital")
run.done()
```

### Ghost mode

```python
run = ev.run(ghost="full")       # redact everything
run = ev.run(ghost="light")      # redact responses only
run = ev.run(ghost="hash")       # hash all text
run = ev.run(ghost=True, ghost_server=False)  # fully offline
```

### Auto-log with wrap

```python
from openai import OpenAI

client = ev.wrap(OpenAI(), run=run)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# ^ auto-logged to the run
```

### CI pipeline

```python
ev = Evalora(api_key="evk_...", verbose=False)
run = ev.run(eval_id="...", name="nightly regression")
status = run.eval_each(fn=ask, raise_on_fail=True)
# Raises EvaloraError if any items fail
```

## Connected Mode

Sign up at [evalora.dev](https://evalora.dev), grab an API key, and results appear in your dashboard:

```python
ev = Evalora(api_key="evk_...")
run = ev.run(eval_id="your-eval-id", name="production check")
run.eval_each(fn=ask)
```

The dashboard is the source of truth for scoring mode, server judges, and human review. See [Getting Started](docs/getting-started.md) for details.

## Documentation

- [Getting Started](docs/getting-started.md) -- install, first eval, local vs connected mode
- [Scoring Modes](docs/scoring-modes.md) -- exact match, semantic similarity, judges, human review
- [AI Judges](docs/judges.md) -- presets, binary, rubric, majority voting, error handling
- [Validators](docs/validators.md) -- built-in library, string shorthand, custom validators
- [Configuration](docs/configuration.md) -- all parameters for `Evalora()`, `run()`, `wrap()`, `trend()`, and `eval_each()`
- [Advanced](docs/advanced.md) -- ghost mode, templates, scorers, wrap, webhooks, file loading, async, integrations, pytest, trend, resume, pairwise
- [API Reference](docs/api-reference.md) -- classes, methods, data types, exceptions

## Dashboard

[evalora.dev](https://evalora.dev) -- run history, comparisons, team collaboration, and webhook notifications.

## License

MIT
