Metadata-Version: 2.4
Name: evidence-evaluator
Version: 0.1.0
Summary: 6-stage agentic pipeline for clinical evidence quality evaluation
Project-URL: Homepage, https://github.com/SciSpark-ai/evidence_evaluator_cli
Project-URL: Repository, https://github.com/SciSpark-ai/evidence_evaluator_cli
Author-email: SciSpark <info@scispark.ai>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.25
Requires-Dist: litellm>=1.40
Requires-Dist: numpy>=1.24
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: scipy>=1.10
Requires-Dist: statsmodels>=0.14
Requires-Dist: tomli>=2.0; python_version < '3.11'
Provides-Extra: dev
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: pdf-local
Requires-Dist: pdfplumber>=0.11; extra == 'pdf-local'
Description-Content-Type: text/markdown

# Evidence Evaluator

[![PyPI](https://img.shields.io/pypi/v/evidence-evaluator)](https://pypi.org/project/evidence-evaluator/)
[![CI](https://img.shields.io/github/actions/workflow/status/SciSpark-ai/evidence_evaluator_cli/ci.yml?branch=main)](https://github.com/SciSpark-ai/evidence_evaluator_cli/actions)
![License](https://img.shields.io/badge/license-MIT-blue)
[![Python](https://img.shields.io/pypi/pyversions/evidence-evaluator)](https://pypi.org/project/evidence-evaluator/)

6-stage agentic pipeline for clinical evidence quality evaluation.

---

## What it does

Evidence Evaluator processes a clinical research paper (PDF, DOI, PMID, or raw text) through six sequential stages and produces a structured evidence quality report in Markdown, JSON, or both. Stage 3 and the Stage 5 score engine are fully deterministic; Stages 0, 1, 2, and 4 and the narrative portion of Stage 5 call an LLM using your own API key.

```
Input (PDF / DOI / PMID / text)
  -> Stage 0: Study Type Routing
  -> Stage 1: Variable Extraction (3x majority vote)
  -> Stage 2: MCID & Domain Search (LLM tool-use)
  -> Stage 3: Deterministic Math Audit
  -> Stage 4: Bias Risk Assessment
  -> Stage 5: Report Synthesis + Optional Score
  -> Output: Markdown Report + JSON
```

Stage 2 drives PubMed E-utilities and CrossRef searches using native LLM tool-use. No search API keys are required.

---

## Quick start

```bash
pip install evidence-evaluator
evidence-evaluator config --init
evidence-evaluator evaluate paper.pdf
```

---

## Installation

**From PyPI:**

```bash
pip install evidence-evaluator
```

**From source (editable install with dev dependencies):**

```bash
git clone https://github.com/SciSpark-ai/evidence_evaluator_cli.git
cd evidence_evaluator_cli
pip install -e ".[dev]"
```

**With local PDF extraction support (optional):**

```bash
pip install "evidence-evaluator[pdf-local]"
```

The `pdf-local` extra installs `pdfplumber` for offline PDF text extraction. Without it, PDF input requires a model that supports native PDF input (Claude 3.5+, GPT-4o, etc.).

**Note on PMID input:** For open-access papers, the tool automatically fetches full text from PubMed Central (PMC). For paywalled papers, only the abstract is available via PMID; provide the PDF for a full evaluation.

---

## Configuration

### First-run setup

Run the interactive setup wizard to save your API key and preferred model:

```bash
evidence-evaluator config --init
```

The wizard prompts for your LLM provider, API key, and model selection, then writes to `~/.evidence-evaluator/config.toml`.

### Other config commands

```bash
evidence-evaluator config --show                          # Print current config
evidence-evaluator config --set model claude-opus-4-20250514
evidence-evaluator config --set api_key sk-ant-...
```

### Environment variables

| Variable | Description |
|---|---|
| `EVIDENCE_EVALUATOR_API_KEY` | Primary API key for the configured provider |
| `ANTHROPIC_API_KEY` | Anthropic key (read directly by litellm) |
| `OPENAI_API_KEY` | OpenAI key (read directly by litellm) |
| `EVIDENCE_EVALUATOR_MODEL` | Model override |

### Config file location

`~/.evidence-evaluator/config.toml`

### Resolution order

CLI flags > environment variables > `~/.evidence-evaluator/config.toml` > built-in defaults
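
The precedence above can be pictured as a first-match lookup. The sketch below is an illustration only, not the package's actual code: `resolve_setting`, `ENV_VARS`, and `BUILTIN_DEFAULTS` are hypothetical names, and the config file is assumed to be already parsed into a flat dict.

```python
import os

# Hypothetical built-in defaults; the model default is the one documented in this README.
BUILTIN_DEFAULTS = {"model": "claude-sonnet-4-20250514"}

# Environment variable names from the table above, keyed by setting name.
ENV_VARS = {
    "model": "EVIDENCE_EVALUATOR_MODEL",
    "api_key": "EVIDENCE_EVALUATOR_API_KEY",
}


def resolve_setting(name, cli_value=None, file_config=None, environ=None):
    """Return the first value found: CLI flag > env var > config file > default."""
    environ = os.environ if environ is None else environ
    file_config = file_config or {}
    if cli_value is not None:
        return cli_value
    env_name = ENV_VARS.get(name)
    if env_name and environ.get(env_name):
        return environ[env_name]
    if name in file_config:
        return file_config[name]
    return BUILTIN_DEFAULTS.get(name)
```

For example, `resolve_setting("model", cli_value="gpt-4o")` returns `"gpt-4o"` regardless of what the environment or config file contain.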

---

## Usage examples

**Evaluate a PDF:**

```bash
evidence-evaluator evaluate paper.pdf
```

**Evaluate by DOI:**

```bash
evidence-evaluator evaluate --doi 10.1056/NEJMoa1911303
```

**Evaluate by PMID:**

```bash
evidence-evaluator evaluate --pmid 31535829
```

**Evaluate raw text:**

```bash
evidence-evaluator evaluate --text "Abstract: ..."
```

**JSON output only:**

```bash
evidence-evaluator evaluate paper.pdf --output-format json
```

**Both Markdown and JSON:**

```bash
evidence-evaluator evaluate paper.pdf --output-format both
```

**Run an individual stage (useful for debugging or re-running):**

```bash
evidence-evaluator run-stage 3 --input context.json
evidence-evaluator run-stage 5 --input context.json
```

**Quiet mode (suppress all output except the output file path):**

```bash
evidence-evaluator evaluate paper.pdf --quiet
```

**Single extraction, no majority vote (faster and cheaper):**

```bash
evidence-evaluator evaluate paper.pdf --no-vote
```

**Force local PDF parsing via pdfplumber (for offline use, cost control, or privacy):**

```bash
evidence-evaluator evaluate paper.pdf --force-local-pdf
```

---

## Python API

```python
import asyncio

from evidence_evaluator import evaluate, evaluate_async

# Sync
result = evaluate(
    path="paper.pdf",              # or doi="10.1056/...", pmid="31535829", text="..."
    model="claude-sonnet-4-20250514",
    api_key="sk-ant-...",
    output_format="both",          # "markdown" | "json" | "both"
    include_score=True,
)

# Async: `await` needs a running event loop, so wrap the call in a coroutine
async def main():
    return await evaluate_async(
        path="paper.pdf",
        model="claude-sonnet-4-20250514",
        api_key="sk-ant-...",
    )

result = asyncio.run(main())

# Result fields:
# result["report_text"]   -> structured report string
# result["score"]         -> score dict (if include_score=True)
# result["context"]       -> full PipelineContext object
# result["output_paths"]  -> list of written file paths
```

---

## Pipeline stages

**Stage 0 — Study Type Routing**

An LLM classifies the paper's study design (e.g., RCT, observational, diagnostic, phase 0/1) and sets downstream routing flags. Phase 0/1 studies skip stages 2 and 3, and the final score is locked to 1–2. Key output: `study_type`, `confidence`, `human_review_flag`.

**Stage 1 — Variable Extraction (3x majority vote)**

An LLM extracts all quantitative variables needed for downstream scoring: sample sizes, event counts, loss-to-follow-up (LTFU) counts, p-value, effect size, blinding status, randomization method, PICO elements, and more. By default, three independent LLM calls run in parallel and their outputs are compared field by field. Fields where all three agree receive high confidence; fields with any disagreement are flagged in `low_confidence_fields`. Use `--no-vote` to run a single extraction. Key output: `ExtractedVariables`, `PICO`, `extraction_qa`.
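
The field-by-field comparison can be sketched as below. This is a minimal illustration under stated assumptions, not the package's implementation: `majority_vote` is a hypothetical helper, each extraction is assumed to be a plain dict, and the README does not say how a three-way disagreement is resolved (this sketch keeps the first run's value).

```python
from collections import Counter


def majority_vote(extractions):
    """Merge per-field values from several extraction runs.

    Returns (merged, low_confidence_fields). A field is flagged as low
    confidence unless every run returned the same value.
    """
    merged, low_confidence_fields = {}, []
    fields = sorted({key for run in extractions for key in run})
    for field in fields:
        # repr() makes unhashable values (lists, dicts) countable.
        votes = Counter(repr(run.get(field)) for run in extractions)
        # On a tie, most_common keeps the value seen first (assumption).
        winner_repr, count = votes.most_common(1)[0]
        merged[field] = next(
            run.get(field)
            for run in extractions
            if repr(run.get(field)) == winner_repr
        )
        if count < len(extractions):
            low_confidence_fields.append(field)
    return merged, low_confidence_fields
```

With three runs that agree on everything except one field, that field keeps the value two runs share and is the only entry in `low_confidence_fields`.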

**Stage 2 — MCID & Domain Search (LLM tool-use)**

An LLM drives a tool-use loop (up to 5 rounds) that calls `search_pubmed`, `search_crossref`, and `fetch_abstract` to locate the MCID (minimal clinically important difference) for the outcome and domain-specific sample size thresholds. The MCID lookup follows a four-tier hierarchy: (1) COMET/OMERACT registry, (2) PubMed systematic reviews, (3) clinical guidelines, and (4) Cohen's d proxy. Key output: `mcid`, `mcid_unit`, `source`, `tier`, `effect_vs_mcid`, `domain_nnt_threshold`.

**Stage 3 — Deterministic Math Audit**

No LLM is used. This stage computes the fragility index (FI), number needed to treat (NNT), diagnostic odds ratio (DOR), and statistical power, compares LTFU against FI, and de-duplicates overlapping grade deductions. The LTFU > FI hard rule applies a -2 grade deduction with no exceptions. Key output: computation traces, grade deltas, boundary-matrix-capped deduction set.
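
As a rough illustration of two of these computations, the sketch below derives a fragility index and an NNT from raw event counts using only the standard library. It assumes the conventional definition of the fragility index: flip non-events to events in the arm with fewer events until a two-sided Fisher's exact test is no longer significant at 0.05. The README does not state which test the package uses, and all function names here are hypothetical.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9)
    )


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Count event flips in the lower-event arm needed to lose significance."""
    if events_a > events_b:  # make arm A the one with fewer events
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while events_a < n_a and fisher_exact_p(
        events_a, n_a - events_a, events_b, n_b - events_b
    ) < alpha:
        events_a += 1  # convert one non-event into an event
        flips += 1
    return flips


def nnt(events_treated, n_treated, events_control, n_control):
    """Number needed to treat = 1 / absolute risk reduction."""
    arr = events_control / n_control - events_treated / n_treated
    return 1 / arr  # assumes the treatment arm has the lower event rate
```

A result that is not significant to begin with has a fragility index of 0, and `nnt(20, 100, 30, 100)` is approximately 10 (a 10-percentage-point absolute risk reduction).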

**Stage 4 — Bias Risk Assessment**

An LLM evaluates risk of bias across standard domains (e.g., selection bias, performance bias, attrition bias, reporting bias) using the paper text and extracted variables. For diagnostic studies, QUADAS-2 criteria are applied. Key output: per-domain `evidence_found`, `judgment`, `delta`, `reasoning`; `overall_concern`; `surrogate_endpoint` flag; `heterogeneity` assessment.

**Stage 5 — Report Synthesis + Optional Score**

The deterministic score engine aggregates all grade deltas, applies the boundary matrix, and computes the final 1–5 score. An LLM then generates the narrative report sections. Key output: `report_text`, `score_path` (sequence of grade adjustments), `score` (1–5 integer), `narrative`. The score disclaimer is always included in output.

---

## Supported providers

| Provider | Example model string |
|---|---|
| Anthropic (Claude) | `claude-opus-4-20250514`, `claude-sonnet-4-20250514` |
| OpenAI | `gpt-4o`, `gpt-4o-mini` |
| Any litellm-supported provider | `together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1`, `ollama/llama3`, `bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0` |

The default model is `claude-sonnet-4-20250514`. Use the `--model` flag or `EVIDENCE_EVALUATOR_MODEL` env var to override. Any model string accepted by [litellm](https://docs.litellm.ai/docs/providers) works.

---

## Configuration reference

Full annotated `~/.evidence-evaluator/config.toml`:

```toml
[llm]
model = "claude-sonnet-4-20250514" # LLM model (sonnet default; opus via -m flag)
api_key = "sk-ant-api03-..."       # Your LLM API key

[pipeline]
majority_vote = true               # Run 3x extraction and take majority (Stage 1)
max_search_rounds = 5              # Max tool-use rounds in Stage 2
fail_fast = true                   # Stop on first stage error (false = best-effort)

[output]
format = "markdown"                # "markdown" | "json" | "both"
directory = "."                    # Output directory for report files
include_score = true               # Include heuristic 1-5 score in report
```
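
To read this file from your own scripts, a minimal sketch follows. It is not the package's loader: `load_config` and `DEFAULTS` are hypothetical, and the default values are assumed from the annotated example above. The `tomli` fallback mirrors the package's declared dependency for Python versions below 3.11.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib  # backport declared as a dependency for Python < 3.11

# Assumed defaults, taken from the annotated example config.
DEFAULTS = {
    "llm": {"model": "claude-sonnet-4-20250514"},
    "pipeline": {"majority_vote": True, "max_search_rounds": 5, "fail_fast": True},
    "output": {"format": "markdown", "directory": ".", "include_score": True},
}


def load_config(text):
    """Parse TOML text and overlay it on the defaults, section by section."""
    parsed = tomllib.loads(text)
    return {
        section: {**values, **parsed.get(section, {})}
        for section, values in DEFAULTS.items()
    }
```

Pass it the file contents, for example `load_config(Path("~/.evidence-evaluator/config.toml").expanduser().read_text())` with `Path` imported from `pathlib`.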

---

## CLI reference

```
Usage: evidence-evaluator evaluate [OPTIONS] [PATH]

  Evaluate a clinical research paper.

Arguments:
  PATH  Path to a PDF file (optional; use flags for other input types)

Options:
  --doi TEXT                  Evaluate by DOI
  --pmid TEXT                 Evaluate by PubMed ID
  --text TEXT                 Evaluate raw text input
  -f, --output-format TEXT    Output format: markdown | json | both
                              [default: markdown]
  -o, --output-dir PATH       Directory to write output files  [default: ./]
  -m, --model TEXT            LLM model override
  -k, --api-key TEXT          API key override
  --no-score                  Disable the heuristic 1-5 score
  --no-vote                   Single extraction, skip 3x majority vote
  --resume PATH               Resume from a checkpoint JSON file
  --fail-fast                 Stop on first error  [default]
  --best-effort               Continue past stage errors, produce partial report
  --force-local-pdf           Always use pdfplumber instead of native LLM PDF
  -v, --verbose               Enable debug-level logging
  -q, --quiet                 Suppress all output except the output file path
  --help                      Show this message and exit.
```

---

## Development

**Setup:**

```bash
git clone https://github.com/SciSpark-ai/evidence_evaluator_cli.git
cd evidence_evaluator_cli
pip install -e ".[dev]"
```

**Run tests:**

```bash
pytest tests/ -v
```

**Lint:**

```bash
ruff check src/ tests/
```

**Test coverage:**

```bash
pytest tests/ --cov=src/evidence_evaluator --cov-report=term-missing
```

The test suite includes 161 tests: 74 ported Stage 3 math tests (covering 147 original assertions), 35 ported Stage 5 score-engine tests (covering 70 original assertions), Click CLI tests, mocked LLM stage tests, search client tests, pipeline orchestrator tests, and Pydantic model validation tests.

---

## Domain rules

Key evidence-based medicine (EBM) rules implemented in the pipeline:

**LTFU definition** — Loss to follow-up (LTFU) includes exclusions, withdrawals, and AE-related dropouts. Deaths counted as primary endpoint events are not LTFU.

**LTFU > FI hard rule** — If LTFU exceeds the fragility index, a -2 grade deduction applies unconditionally. This deduction is never de-duplicated against other deductions.

**MCID tier hierarchy** — MCID lookup stops at the first successful tier: (1) COMET/OMERACT registry, (2) PubMed systematic reviews, (3) clinical guidelines, (4) Cohen's d proxy. Tier 3 HR values are converted via `CER x (1 - HR)`.
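
The stop-at-first-success behaviour and the HR conversion can be sketched as follows. CER is read here as the control event rate and HR as the hazard ratio, so `CER x (1 - HR)` approximates an absolute risk difference. The function names and the lookup callables are hypothetical stand-ins for the real registry and search calls.

```python
def hr_to_absolute_difference(cer, hr):
    """Approximate absolute risk difference from a hazard ratio: CER x (1 - HR)."""
    return cer * (1 - hr)


def first_successful_tier(lookups):
    """Run tier lookups in order and stop at the first that returns a value.

    `lookups` is a list of zero-argument callables, one per tier; each
    returns an MCID value or None when that tier finds nothing.
    """
    for tier, lookup in enumerate(lookups, start=1):
        value = lookup()
        if value is not None:
            return tier, value
    return None, None
```

For instance, with a made-up control event rate of 0.20 and a guideline HR of 0.75, the converted threshold is about 0.05, and a lookup that first succeeds at the third callable reports tier 3 without calling the fourth.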

**Effect vs MCID** — Comparison is binary only: "exceeds" or "below". The word "borderline" is never used.

**De-duplication** — Among the candidate deductions for {power < 0.80, N < domain threshold, NNT > threshold}, only the largest single delta is applied.
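
A minimal sketch of how de-duplication and the LTFU > FI hard rule might combine. It assumes deltas are zero or negative and that "largest" means the most severe deduction; `total_deduction` is a hypothetical helper, and the per-rule delta values in the usage note are made up because this README does not list them.

```python
def total_deduction(overlapping_deltas, ltfu, fragility_index):
    """Combine grade deltas (all <= 0) under the two rules described above.

    `overlapping_deltas` maps each triggered rule in the de-duplication
    group (power, N, NNT) to its delta.
    """
    # De-duplication: keep only the single most severe deduction in the group.
    deduped = min(overlapping_deltas.values(), default=0)
    # LTFU > FI hard rule: -2, applied on top and never de-duplicated.
    hard_rule = -2 if ltfu > fragility_index else 0
    return deduped + hard_rule
```

With `{"power": -1, "n": -1}` and the DAPA-HF values LTFU = 34 and FI = 62, the total is -1: the two overlapping deductions collapse into one and the hard rule does not fire.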

**Study type routing** — Phase 0/1 studies skip stages 2 and 3; the score is locked to 1–2.

**Initial grade** — Sample size (N) takes precedence over phase label. Grade 5 requires all three: multi-center + double-blind + N > 1000.

**Boundary matrix** — Allowed final score range (maximum and minimum) per initial grade:

| Initial grade | Base | Max | Min |
|---|---|---|---|
| 5 | 5 | 5 | 3 |
| 4 | 4 | 4 | 2 |
| 3 | 3 | 4 | 2 |
| 2 | 2 | 3 | 1 |
| 1 | 1 | 1 | 1 |
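
Read as a clamp on the final score, the table above can be sketched like this. `BOUNDARY_MATRIX` and `clamp_score` are hypothetical names, and the sketch assumes the summed grade delta is added to the initial grade before the bounds are applied.

```python
# (min, max) final score per initial grade, copied from the table above.
BOUNDARY_MATRIX = {5: (3, 5), 4: (2, 4), 3: (2, 4), 2: (1, 3), 1: (1, 1)}


def clamp_score(initial_grade, total_delta):
    """Apply the summed grade delta, then clamp to the allowed range."""
    low, high = BOUNDARY_MATRIX[initial_grade]
    return max(low, min(high, initial_grade + total_delta))
```

Under this reading, an initial grade of 5 with a total delta of -4 still ends at 3, and an initial grade of 1 stays at 1 whatever the delta.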

---

## Validated result

First end-to-end evaluation (2026-03-29): **DAPA-HF** (McMurray et al., NEJM 2019, PMID 31535829)

```
Study type: RCT_intervention (Phase III)
N: 2373 intervention / 2371 control
Fragility Index: 62 (robust)
NNT: 20.4
Post-hoc Power: 99.8%
LTFU (34) < FI (62): safe
MCID: exceeds (Tier 3)
Bias: all RoB 2.0 domains low risk
Score: 5/5 Excellent
```

---

## Disclaimer

The 1–5 score is a heuristic generated by a rule engine. Scoring design choices are pending expert calibration. This tool does not replace clinical judgment.

---

## Citation

If you use Evidence Evaluator in published research, please cite:

> Claw4S (2026). *Evidence Evaluator: An agentic pipeline for clinical evidence quality evaluation.* Research note.

---

## License

MIT — see [LICENSE](LICENSE) for details.
