Metadata-Version: 2.4
Name: veralith
Version: 0.2.1
Summary: Hallucination diagnosis for RAG systems — Sufficiency, Faithfulness, Completeness verdicts plus rule-based remediation.
Author: Srijan Shekhar, Kaustav Dasgupta
License: MIT
Project-URL: Homepage, https://github.com/SrijanShekhar21/VeralithAI
Project-URL: Repository, https://github.com/SrijanShekhar21/VeralithAI
Project-URL: Issues, https://github.com/SrijanShekhar21/VeralithAI/issues
Keywords: rag,llm,evaluation,hallucination,openai,langchain,observability,agents,veralith
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.40.0
Requires-Dist: pydantic>=2.6
Requires-Dist: python-dotenv>=1.0
Requires-Dist: tenacity>=8.2
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: httpx>=0.27
Provides-Extra: langchain
Requires-Dist: langchain>=0.1.0; extra == "langchain"
Provides-Extra: sample
Requires-Dist: chromadb>=0.5.0; extra == "sample"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# Veralith

**Hallucination diagnosis for RAG systems.** One line in your RAG pipeline, structured reports on *what failed* and *how to fix it* — not just a yes/no hallucination flag.

Veralith decomposes every `(query, context, response)` trace into atomic sub-questions and claims, runs three LLM-as-judge metrics (Sufficiency, Faithfulness, Completeness), and classifies the trace into one of six diagnostic cells with a concrete remediation suggestion. Traces stream into your dashboard at [app.veralithai.com](https://app.veralithai.com).

> Status: 0.2.x (hosted). Public API stable; new diagnostic features land in minor bumps.

---

## Why Veralith

A monolithic *"is this response hallucinated?"* judge is a smoke alarm — it tells you something is wrong but not what or where. Veralith is a diagnostic dashboard:

- **Sufficiency** — was the *retrieval* adequate for each part of the query?
- **Faithfulness** — is each *claim* in the response grounded in the retrieved context?
- **Completeness** — does the response actually *answer* every part of the query?

Cross-tabulating these gives you a named failure mode (retrieval gap, intrinsic hallucination, padded answer, etc.) plus actionable fixes (lower temperature, bump retrieval-K, tighten generator prompt, ...) for every trace.

---

## 5-minute quickstart

### 1. Get an API key

Sign up at [app.veralithai.com](https://app.veralithai.com), create a project, generate a key. Keys look like `vk_live_...`.

### 2. Install + configure

```bash
pip install veralith
```

```bash
export VERALITH_API_KEY=vk_live_...
```

### 3. Add one line to your RAG pipeline

```python
import veralith

def answer(query: str) -> str:
    chunks = my_retriever(query)
    response = my_generator(query, chunks)

    veralith.log(query=query, context=chunks, response=response)  # POSTs to Veralith
    return response
```

That's it. The trace POSTs to `api.veralithai.com`, the eval runs server-side (5 OpenAI judge calls, ~30s), and the result lands in your dashboard with a typed diagnosis and a concrete suggestion.

---

## Integration patterns

### 1. Explicit one-liner — works with any RAG stack

```python
import veralith

def answer(query: str) -> str:
    chunks = my_retriever(query)
    response = my_generator(query, chunks)

    veralith.log(query=query, context=chunks, response=response)
    return response
```

### 2. Decorator — zero code reshape

```python
import veralith

@veralith.trace
def my_rag(query: str):
    chunks = my_retriever(query)
    response = my_generator(query, chunks)
    return response, chunks   # the decorator captures (response, context)
```
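
To make the return contract concrete, here is an illustrative sketch of how a decorator of this shape can capture a trace. It is not Veralith's implementation: `trace_sketch`, `sink`, and `demo_rag` are hypothetical names, and passing the `(response, chunks)` tuple back to the caller unchanged is an assumption of the sketch.

```python
import functools
from typing import Any, Callable


def trace_sketch(sink: Callable[..., Any]):
    """Hypothetical decorator factory; NOT Veralith's implementation."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(query: str, *args: Any, **kwargs: Any):
            # Contract from the example above: the function returns (response, chunks).
            response, chunks = fn(query, *args, **kwargs)
            # Hand the trace to whatever records it (a list, a logger, an HTTP client).
            sink(query=query, context=chunks, response=response)
            # Assumption of this sketch: the tuple is returned unchanged.
            return response, chunks

        return wrapper

    return decorator


captured = []


@trace_sketch(lambda **trace: captured.append(trace))
def demo_rag(query: str):
    chunks = ["The price-to-earnings ratio is share price / earnings per share."]
    return f"Answer to: {query}", chunks


demo_rag("What is a P/E ratio?")
print(captured[0]["response"])  # Answer to: What is a P/E ratio?
```

The point of the pattern is that the wrapped function only has to return its context alongside its response; the query is taken from the first positional argument.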

### 3. LangChain — zero-code auto-tracing

```python
import veralith.adapters.langchain as adapter
adapter.install()

# every RetrievalQA.invoke() now auto-traces to Veralith
```

### 4. Offline evaluation (no Veralith account, nothing sent to Veralith)

For CI tests, prompt tuning, or environments where traces must not be sent to the hosted Veralith service:

```python
import veralith

# Runs the judges locally with your OPENAI_API_KEY; persist=False skips storing the result.
result = veralith.evaluate(
    query="What is a P/E ratio?",
    context=["The price-to-earnings ratio is share price ÷ earnings per share."],
    response="A P/E ratio is share price divided by earnings per share.",
    persist=False,
)
print(result.diagnosis.failure_cell.value)  # 'complete_grounded'
```

`evaluate()` runs the full eval pipeline locally using your `OPENAI_API_KEY`. Nothing is sent to Veralith, but the judge calls still go to OpenAI, so the machine needs network access to the OpenAI API. Use this for tests; use `log()` for production.
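
For a CI gate, one option is a small check that turns a result into a pass/fail decision. The sketch below is an assumption-laden example, not part of the SDK: `check_trace` and `ALLOWED_CELLS` are hypothetical names, and it reads only attributes documented in this README (`diagnosis.failure_cell.value` and `errors`). A stand-in object replaces the real result so the sketch runs without any API calls.

```python
from types import SimpleNamespace

# Cells treated as passing in this sketch; the choice is an assumption, tune it per project.
ALLOWED_CELLS = {"complete_grounded"}


def check_trace(result, allowed=ALLOWED_CELLS) -> list:
    """Return a list of problems; an empty list means the trace passes."""
    problems = []
    if result.errors:
        problems.append(f"metric errors: {sorted(result.errors)}")
    if result.diagnosis is None:
        problems.append("no diagnosis produced")
    elif result.diagnosis.failure_cell.value not in allowed:
        problems.append(f"failure cell: {result.diagnosis.failure_cell.value}")
    return problems


# Stand-in shaped like an evaluation result (hypothetical data, no network needed).
fake_result = SimpleNamespace(
    errors={},
    diagnosis=SimpleNamespace(failure_cell=SimpleNamespace(value="incomplete_grounded")),
)
print(check_trace(fake_result))  # ['failure cell: incomplete_grounded']
```

In a pytest suite, the same function could be called on the value returned by `veralith.evaluate(...)` and asserted to be empty.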

---

## What Veralith detects

Each evaluated trace lands in one of six cells from the cross-tab of Completeness × Faithfulness. The cell name follows the pattern `<completeness>_<faithfulness>`, so you can decode any cell without a lookup chart:

| | Grounded (every claim supported) | Ungrounded (some claim invented) |
|---|---|---|
| **Complete answer** | `complete_grounded` | `complete_ungrounded` |
| **Incomplete answer** | `incomplete_grounded` | `incomplete_ungrounded` |
| **Extra unrequested content** | `extra_grounded` | `extra_ungrounded` |

Read each cell as *"the response is `<X>` and the claims are `<Y>`."* So `incomplete_ungrounded` means the response *didn't cover everything asked AND some of what it did say is unsupported* — the worst-case trace.

Each trace also gets a Sufficiency level (HIGH/LOW); the threshold is learned per knowledge base from the distribution of healthy traces. The cell and the Sufficiency level together drive a rule-based suggester that maps every diagnosis to a concrete remediation (lower temperature, bump K, tighten the generator prompt, etc.).
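
Because the cell names follow the `<completeness>_<faithfulness>` pattern, they can be decoded mechanically. The following is a minimal sketch that assumes only the six names in the table above; `Cell`, `decode_cell`, and `needs_attention` are hypothetical helpers, not SDK functions.

```python
from typing import NamedTuple

COMPLETENESS = {"complete", "incomplete", "extra"}
FAITHFULNESS = {"grounded", "ungrounded"}


class Cell(NamedTuple):
    completeness: str
    faithfulness: str


def decode_cell(name: str) -> Cell:
    """Split a cell name such as 'extra_ungrounded' into its two axes."""
    completeness, _, faithfulness = name.partition("_")
    if completeness not in COMPLETENESS or faithfulness not in FAITHFULNESS:
        raise ValueError(f"unknown cell name: {name!r}")
    return Cell(completeness, faithfulness)


def needs_attention(name: str) -> bool:
    """True for every cell except the healthy 'complete_grounded' one."""
    cell = decode_cell(name)
    return cell.completeness != "complete" or cell.faithfulness != "grounded"


print(decode_cell("incomplete_ungrounded"))
# Cell(completeness='incomplete', faithfulness='ungrounded')
print(needs_attention("complete_grounded"))  # False
```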

---

## Configuration

| Variable | Default | Purpose |
|---|---|---|
| `VERALITH_API_KEY` | — | **Required for `log()`** — get one at [app.veralithai.com](https://app.veralithai.com) |
| `VERALITH_API_URL` | `https://api.veralithai.com` | Override if self-hosting or testing against staging |
| `OPENAI_API_KEY` | — | Required only for offline `evaluate()` (the hosted backend uses its own key) |
| `VERALITH_JUDGE_MODEL` | `gpt-4o` | Model for S/F/C judges (offline `evaluate()` only) |
| `VERALITH_DECOMPOSER_MODEL` | `gpt-4o-mini` | Model for decomposition (offline `evaluate()` only) |

Hosted evaluations are billed against your project's monthly trace quota (200/month on the free tier). Offline `evaluate()` calls run against your own `OPENAI_API_KEY` and cost about $0.005 per trace.
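
To fail fast on a missing key, a startup check along these lines can help. It is a sketch under stated assumptions: `VeralithEnv`, `read_env`, and `missing_for` are hypothetical helpers, not SDK functions, and the defaults are copied from the table above rather than read from the library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VeralithEnv:  # hypothetical helper type, not part of the SDK
    api_key: Optional[str]
    api_url: str
    openai_api_key: Optional[str]
    judge_model: str
    decomposer_model: str


def read_env(env: Optional[Mapping[str, str]] = None) -> VeralithEnv:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return VeralithEnv(
        api_key=env.get("VERALITH_API_KEY"),
        api_url=env.get("VERALITH_API_URL", "https://api.veralithai.com"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        judge_model=env.get("VERALITH_JUDGE_MODEL", "gpt-4o"),
        decomposer_model=env.get("VERALITH_DECOMPOSER_MODEL", "gpt-4o-mini"),
    )


def missing_for(mode: str, cfg: VeralithEnv) -> list:
    """List the variables still needed for 'log' (hosted) or 'evaluate' (offline)."""
    required = {
        "log": [("VERALITH_API_KEY", cfg.api_key)],
        "evaluate": [("OPENAI_API_KEY", cfg.openai_api_key)],
    }
    return [name for name, value in required[mode] if not value]


# Explicit mapping instead of os.environ so the example is deterministic.
cfg = read_env({"VERALITH_API_KEY": "vk_live_example"})
print(missing_for("log", cfg))       # []
print(missing_for("evaluate", cfg))  # ['OPENAI_API_KEY']
```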

---

## The result object

```python
class EvaluationResult:
    trace_id: int
    query: str
    sub_questions: list[SubQuestion]           # decomposed Q
    claims: list[Claim]                         # decomposed R
    sufficiency: list[SufficiencyJudgment]      # per-Qi verdicts
    faithfulness: list[FaithfulnessJudgment]    # per-Ri verdicts + grounding chunks
    completeness: CompletenessJudgment | None   # Ri ↔ Qi alignment
    diagnosis: Diagnosis | None                 # failure_cell + sufficiency level + counts
    suggestion: Suggestion                      # title + body + actionable steps
    created_at: datetime
    errors: dict[str, str]                       # any per-metric failures
    latency_ms: dict[str, float]                 # per-phase wall-clock timing
```

Every field is typed, and the nested objects (`SubQuestion`, `Claim`, the judgment types, `Diagnosis`, `Suggestion`) are Pydantic models.
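
As an example of consuming these fields, the sketch below condenses a result into one log line. `summarize` is a hypothetical helper, and the phase names used as `latency_ms` keys are made up for the demonstration, since the real keys are not listed here. It guards against `diagnosis` being `None`, which the type above allows.

```python
from types import SimpleNamespace


def summarize(result) -> str:
    """One-line summary built from fields documented in the class above."""
    cell = result.diagnosis.failure_cell.value if result.diagnosis else "undiagnosed"
    total_ms = sum(result.latency_ms.values())
    # Slowest phase by wall-clock time; 'n/a' when no timings were recorded.
    slowest = max(result.latency_ms, key=result.latency_ms.get, default="n/a")
    failed = ", ".join(sorted(result.errors)) or "none"
    return (
        f"trace {result.trace_id}: {cell} | {total_ms:.0f} ms total, "
        f"slowest phase: {slowest} | failed metrics: {failed}"
    )


# Stand-in result with hypothetical phase names and timings.
demo_result = SimpleNamespace(
    trace_id=7,
    diagnosis=SimpleNamespace(failure_cell=SimpleNamespace(value="complete_grounded")),
    latency_ms={"decompose": 1200.0, "faithfulness": 8400.0},
    errors={},
)
print(summarize(demo_result))
```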

---

## Roadmap

What's in 0.2:
- Three judges (Sufficiency, Faithfulness, Completeness) with batched LLM calls.
- Diagnostic classifier and rule-based suggester.
- **Hosted backend** — `log()` posts to `api.veralithai.com`, dashboard view at `app.veralithai.com`.
- Outcome-based threshold calibration per knowledge base.
- SDK: `log()`, `@trace`, LangChain adapter.
- Offline `evaluate()` for CI / tests.
- Cost tracking with per-trace budget guard.

On the roadmap:
- **Self-heal via Claude Code MCP** (0.2.5) — Veralith hands a diagnosis to your local Claude Code, which fixes the underlying RAG code in a branch and opens a PR.
- LLM-enriched trace-specific suggestions (`Suggestion.detailed_body`).
- Cross-trace pattern detection ("you keep hallucinating on time-sensitive queries").
- Additional judges (reasoning validity, temporal validity).
- More framework adapters (LlamaIndex, raw OpenAI tools).

---

## Authors

Srijan Shekhar and Kaustav Dasgupta.

## License

MIT — see [LICENSE](LICENSE).

## Links

- Source: https://github.com/SrijanShekhar21/VeralithAI
- Issues: https://github.com/SrijanShekhar21/VeralithAI/issues
