Metadata-Version: 2.4
Name: chatbot-auditor
Version: 0.1.0
Summary: Detect the 7 ways AI chatbots silently fail customers
Project-URL: Homepage, https://github.com/HemantBK/chatbot-auditor
Project-URL: Documentation, https://github.com/HemantBK/chatbot-auditor#readme
Project-URL: Repository, https://github.com/HemantBK/chatbot-auditor
Project-URL: Issues, https://github.com/HemantBK/chatbot-auditor/issues
Project-URL: Changelog, https://github.com/HemantBK/chatbot-auditor/blob/main/CHANGELOG.md
Author: BK
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: ai-evaluation,ai-reliability,chatbot,conversational-ai,customer-experience,llm,llm-observability,responsible-ai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Customer Service
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.6
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Provides-Extra: intercom
Requires-Dist: httpx>=0.27; extra == 'intercom'
Provides-Extra: llm
Requires-Dist: scikit-learn>=1.4; extra == 'llm'
Requires-Dist: sentence-transformers>=2.5; extra == 'llm'
Provides-Extra: server
Requires-Dist: fastapi>=0.110; extra == 'server'
Requires-Dist: uvicorn>=0.29; extra == 'server'
Provides-Extra: zendesk
Requires-Dist: httpx>=0.27; extra == 'zendesk'
Description-Content-Type: text/markdown

# chatbot-auditor

[![CI](https://github.com/HemantBK/chatbot-auditor/actions/workflows/ci.yml/badge.svg)](https://github.com/HemantBK/chatbot-auditor/actions/workflows/ci.yml)
[![Docs](https://github.com/HemantBK/chatbot-auditor/actions/workflows/docs.yml/badge.svg)](https://HemantBK.github.io/chatbot-auditor/)
[![PyPI version](https://img.shields.io/pypi/v/chatbot-auditor.svg)](https://pypi.org/project/chatbot-auditor/)
[![Python versions](https://img.shields.io/pypi/pyversions/chatbot-auditor.svg)](https://pypi.org/project/chatbot-auditor/)
[![License: Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-203%20passing-brightgreen.svg)](#)
[![Coverage](https://img.shields.io/badge/coverage-90%25-brightgreen.svg)](#)
[![mypy: strict](https://img.shields.io/badge/mypy-strict-blue.svg)](https://mypy.readthedocs.io/en/stable/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

> **Detect the 7 ways AI chatbots silently fail customers** — from existing conversation logs, using tools your ML team doesn't need to install.

Your chatbot dashboard says "95% of conversations resolved." The real number is usually 60–70%. The gap is customers who gave up, got stuck in loops, asked for a human and were refused, or silently walked away. Your chatbot records all of them as "resolved."

`chatbot-auditor` reads your conversation logs and tells you the truth.

---

## Table of contents

- [Why this exists](#why-this-exists)
- [The 7 failure modes](#the-7-failure-modes)
- [Quick start](#quick-start)
- [Features](#features)
- [Architecture](#architecture)
- [Where this fits](#where-this-fits)
- [Installation](#installation)
- [Documentation](#documentation)
- [Contributing](#contributing)
- [License](#license)

---

## Why this exists

- **75%** of consumers are frustrated by AI customer support
- **56%** of unhappy customers leave without complaining
- **88%** will not return after a negative chatbot interaction
- Chatbot platforms (Intercom, Zendesk, Drift) grade their own homework — their dashboards are designed to make the chatbot look good
- [**Air Canada**](https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know) was sued because its chatbot promised a non-existent refund policy
- [**DPD**](https://www.theguardian.com/technology/2024/jan/20/dpd-ai-chatbot-swears-calls-itself-useless-and-criticises-firm)'s chatbot swore at a customer and wrote a poem about how terrible the company was
- [**Klarna**](https://www.cnbc.com/2025/05/08/klarna-reverses-ai-strategy-hiring-back-human-customer-service-agents.html) reversed course after firing 700 agents when customer satisfaction tanked

Companies don't need a *better* chatbot. They need to know *when* their current chatbot is failing. That's what this library does.

---

## The 7 failure modes

| # | Mode | What it catches | Detector |
|---|------|-----------------|----------|
| 1 | **Death Loop** | Bot gives the same answer 3+ times | `DeathLoopDetector` |
| 2 | **Silent Churn** | Customer left without saying anything | `SilentChurnDetector` |
| 3 | **Escalation Burial** | Bot refused to transfer to a human | `EscalationBurialDetector` |
| 4 | **Sentiment Collapse** | Customer got frustrated, bot didn't notice | `SentimentCollapseDetector` |
| 5 | **Confident Lies** | Bot promised something outside policy | `ConfidentLiesDetector` |
| 6 | **Brand Damage** | Bot said something embarrassing | `BrandDamageDetector` |
| 7 | **Confident Misinformation** | Bot stated wrong facts | `ConfidentMisinformationDetector` |

[Deep dive on each →](https://HemantBK.github.io/chatbot-auditor/concepts/failure-modes/)

---

## Quick start

```bash
pip install chatbot-auditor
```

```python
from chatbot_auditor import audit, Conversation, Message, Role

conv = Conversation(
    id="demo",
    messages=[
        Message(role=Role.USER, content="I need a refund for my order"),
        Message(role=Role.BOT, content="Please check our FAQ at example.com/faq."),
        Message(role=Role.USER, content="I already did. Can someone process it?"),
        Message(role=Role.BOT, content="Please check our FAQ at example.com/faq."),
        Message(role=Role.USER, content="this is useless"),
        Message(role=Role.BOT, content="Please check our FAQ at example.com/faq."),
    ],
)

for d in audit([conv]):
    print(f"[{d.severity.value.upper()}] {d.detector}: {d.explanation}")
```

Output:

```
[MEDIUM] death_loop: Bot gave 3 consecutive similar responses ...
[HIGH]   silent_churn: Conversation of 6 messages ended without ...
```

Or from the command line:

```bash
chatbot-audit analyze conversations.json --format markdown --output audit.md
```

---

## Features

| Category | What you get |
|----------|--------------|
| **7 detectors** | Full framework for chatbot-specific failures. Works out of the box. |
| **4 adapters** | JSON, CSV, Intercom, Zendesk. Point at any source, same pipeline. |
| **Zero-dep defaults** | Stdlib-only detection for the core five detectors. No API keys required. |
| **Pluggable backends** | Optional sentence-transformers semantic similarity, LLM moderation, etc. |
| **Report generator** | Markdown and self-contained HTML. Email-safe, Slack-compatible, XSS-escaped. |
| **FastAPI server** | Drop-in HTTP service with bearer-token auth, Docker-ready. |
| **CLI** | `analyze`, `analyze-intercom`, `analyze-zendesk` with `--format` and `--output`. |
| **Knowledge bases** | Optional `PolicyBase` / `FactBase` to cross-check bot claims. |
| **Full type safety** | `mypy --strict` passes on every public symbol. |
| **Benchmarked** | Precision / recall / F1 on a synthetic corpus. Reproducible. |

---

## Architecture

Detectors are pure functions over `Conversation` objects. Adapters feed conversations in; reporters format detections out. Every layer is independently swappable.

```mermaid
flowchart LR
    subgraph Sources
        A[Intercom API]
        B[Zendesk API]
        C[JSON / JSONL files]
        D[CSV / TSV files]
        E[Your custom source]
    end

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F

    subgraph Pipeline
        F[Adapter.fetch] --> G[Conversation]
        G --> H[DetectorRegistry.run]
        H --> I[Detection]
    end

    subgraph Outputs
        I --> J[Python API]
        I --> K[JSON]
        I --> L[Markdown]
        I --> M[HTML]
        I --> N[REST API]
    end

    classDef src fill:#e0f2fe,stroke:#0284c7
    classDef pipe fill:#fef3c7,stroke:#d97706
    classDef out fill:#dcfce7,stroke:#16a34a
    class A,B,C,D,E src
    class F,G,H,I pipe
    class J,K,L,M,N out
```

Full system design in [ARCHITECTURE.md](ARCHITECTURE.md).

---

## Where this fits

The AI observability and conversational-quality space is crowded, and rightly so — there are several strong tools depending on who you are and what you're measuring. `chatbot-auditor` is deliberately narrow in scope:

- **It's built for CX and support leaders, not ML engineers.** The output uses language the VP of Support already thinks in — "silent churn", "escalation burial", "confident lies" — rather than ML metrics like "faithfulness" or "perplexity". That framing is the whole point.

- **It operates on existing conversation logs, not during the LLM call.** You don't need to instrument your chatbot, change the runtime, or add an SDK to your production path. You point the library at logs you already have and it tells you what's wrong.

- **The 7-mode framework is the contribution, not the code.** The detection algorithms are mostly simple statistics. What makes this project different is that each failure mode is *named*, *defined*, and *detectable with clear false-positive characteristics* — a vocabulary teams can use to discuss chatbot quality the way DevOps teams discuss SLIs and SLOs.

- **It runs zero-cost by default.** The five core detectors need no API keys, no model downloads, no cloud accounts — Python's stdlib is enough. Richer backends (embeddings, LLM moderation) are opt-in when the defaults aren't enough.

- **It stops at detection.** There's no chatbot to replace, no agent to train, no dashboard SaaS to sign up for. Detections leave as JSON / Markdown / HTML / HTTP responses, and what your team does with them is up to you.

If you're evaluating LLMs during invocation, tracing prompts, or running model-level evals, the LLM observability tools in the ecosystem will serve you better. If you're coaching human agents or running call-center QA, the voice-QA category is a different product. This library is specifically for **"my AI chatbot is in production handling real customers — what's it actually doing, and where is it quietly failing?"**

---

## Installation

```bash
# Minimum — just the core library and CLI
pip install chatbot-auditor

# With API adapters
pip install "chatbot-auditor[intercom,zendesk]"

# With semantic similarity for paraphrased loop detection
pip install "chatbot-auditor[llm]"

# With the HTTP server
pip install "chatbot-auditor[server]"

# Everything
pip install "chatbot-auditor[intercom,zendesk,llm,server]"
```

Verify:

```bash
chatbot-audit version
```

---

## Documentation

- **[Full docs site](https://HemantBK.github.io/chatbot-auditor/)** — tutorials, reference, guides
- **[Getting started](https://HemantBK.github.io/chatbot-auditor/getting-started/first-audit/)** — run your first audit in under a minute
- **[The 7 failure modes](https://HemantBK.github.io/chatbot-auditor/concepts/failure-modes/)** — what each detector catches and why
- **[Audit Intercom data](https://HemantBK.github.io/chatbot-auditor/tutorials/audit-intercom/)** — end-to-end with real data
- **[Write a custom detector](https://HemantBK.github.io/chatbot-auditor/tutorials/custom-detector/)** — add your own failure modes
- **[Self-host the HTTP server](https://HemantBK.github.io/chatbot-auditor/tutorials/self-host/)** — Docker, auth, deployment
- **[LLM & embedding backends](https://HemantBK.github.io/chatbot-auditor/tutorials/llm-backends/)** — swap in richer scorers
- **[API reference](https://HemantBK.github.io/chatbot-auditor/reference/detectors/)** — auto-generated from docstrings
- **[Architecture](ARCHITECTURE.md)** — system design, extension points, design decisions

## Examples

Runnable scripts in [`examples/`](examples/):

- [`01_audit_json_file.py`](examples/01_audit_json_file.py) — smallest end-to-end audit
- [`02_audit_csv_file.py`](examples/02_audit_csv_file.py) — CSV export from any platform
- [`03_custom_detector.py`](examples/03_custom_detector.py) — write your own failure mode
- [`04_knowledge_bases.py`](examples/04_knowledge_bases.py) — `PolicyBase` + `FactBase`
- [`05_embeddings_backend.py`](examples/05_embeddings_backend.py) — semantic similarity for high-paraphrase loops

---

## Contributing

Issues and pull requests welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for dev setup, coding standards, and how to add a detector.

- **Report a bug** → [GitHub issues](https://github.com/HemantBK/chatbot-auditor/issues/new/choose)
- **Propose a feature** → same, pick the feature-request template
- **Report a security issue** → see [SECURITY.md](SECURITY.md) — do not open a public issue
- **Get help / ask a question** → [SUPPORT.md](SUPPORT.md)

## Development

```bash
git clone https://github.com/HemantBK/chatbot-auditor.git
cd chatbot-auditor
uv sync --all-extras
uv run pytest          # 203 tests
uv run mypy            # strict type check
uv run ruff check .    # lint
uv run ruff format .   # format
uv run mkdocs serve    # live docs
```

---

## License

Apache License 2.0 — see [LICENSE](LICENSE) and [NOTICE](NOTICE).

**Original author: BK.** If you use, fork, or build on this project, preserve the `NOTICE` file as required by the license. The copyright header `Copyright 2026 BK` must remain intact in every derivative source file.

## Acknowledgments

Framework and failure-mode research grounded in public incident reports: Air Canada (2024), DPD (2024), Klarna (2025), Cursor (2024), and industry analyses from Qualtrics, Gartner, and others cited in the [failure modes docs](https://HemantBK.github.io/chatbot-auditor/concepts/failure-modes/).
