Metadata-Version: 2.4
Name: agentworkbench
Version: 0.3.8
Summary: Local-first harness for learning, evaluating, comparing, and optimizing repo-native agents.
License-Expression: MIT
Keywords: agents,agent-evals,cli,developer-tools,evaluation,optimization
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0.1
Requires-Dist: pydantic<3,>=2.8
Requires-Dist: scipy>=1.11
Requires-Dist: scikit-learn>=1.3
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: mypy>=1.11; extra == "dev"
Requires-Dist: pytest>=8.2; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.12.20240917; extra == "dev"
Requires-Dist: twine>=5.1; extra == "dev"
Requires-Dist: vulture>=2.11; extra == "dev"
Dynamic: license-file

# Agent Workbench

Agent Workbench (`awb`) is a local-first Python package and CLI for evaluating, measuring, and iteratively improving repo-native coding agents. It is a disciplined experiment controller—it proposes, measures, and narrows options. You (or your coding agent) still apply the changes.

## Framing Thesis

Agent Workbench is the local evidence and control plane for changing repo-native agents safely. It should make every important agent surface visible, connect behavior changes to approved evals, and preserve enough artifacts that a release decision can be replayed without relying on manual recollection.

The practical bar is simple: prompts, tools, skills, plugins, MCPs, media, scorers, traces, evidence records, and promotion state are all product surfaces. If one of them can change behavior, it needs a contract, a reportable diff, and a path through full-suite evaluation before promotion.

## Release-Ready Guides

- [Documentation Index](docs/README.md) — the recommended reading order and workflow map.
- [CLI Reference](docs/cli_reference.md) — complete command reference for all `awb` commands.
- [Troubleshooting](docs/troubleshooting.md) — common errors and solutions.
- [Codex Agent Guide](docs/codex_agent_guide.md) — exact workflow for a downstream Codex agent using `awb` to optimize a target agent.
- [AI Agent Assurance](docs/assurance.md) — evidence packs, readiness gates, control maps, and CI checks for agent release review.
- [Contracts](docs/contracts/README.md) — scorer, media, artifact, trace, evidence, and release-readiness contracts.
- [Adaptation Surfaces](docs/adaptation_surfaces.md) — the full map of what a coding agent or human can change after behavior tests fail.
- [Repository Dogfood Release Path](docs/release_dogfood.md) — how this repo promotes its own canonical streams using full runs.
- [Release And Laptop Setup](docs/release_and_laptop_setup.md) — validation, build, publish, and work-laptop installation steps.
- [Multi-Turn And Chain Features](docs/multi_turn_and_chains.md) — conversation, chain, and workflow details.
- [Domain Package Migration](docs/restructure_migration.md) — canonical subpackage imports, compatibility shims, and the coding-agent loop.

## What It Does

Agent Workbench helps optimizer engineers understand, evaluate, and improve coding agents (Codex, Copilot, Claude Code) that operate within their own repository. The harness:

1. **Learns** the repo — discovers prompts, tools, skills, plugins, MCPs, and runtime behavior
2. **Captures** what "better" means — creates behavior briefs defining desired agent behavior
3. **Generates** evals — creates test cases from briefs
4. **Runs** experiments — executes agents against eval cases with controlled variables
5. **Analyzes** results — diagnoses failures, compares candidates, and recommends improvements

Every improvement should be traceable to failing approved behavior tests,
explicit regression evidence, or an approved benchmark gap. Workbench should
push changes toward general behavior that holds across nondeterministic runs,
not toward case-id shortcuts, exact expected-output branches, or canned answers.
A candidate is not ready to promote until it passes full-suite checks without
unacceptable regressions.

## Installation

```bash
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"

awb --help
python -m agentworkbench --help
```

## Quickstart

```bash
# Initialize workbench in current repo
awb init

# Discover the repo structure
awb learn

# Create a behavior brief
awb brief start --brief-id my-brief --agent-id my-agent

# Generate evals from an approved brief
awb evals generate --brief-id my-brief --eval-id my-eval

# Run the agent against eval cases
awb run --agent-id my-agent --eval-id my-eval --subset smoke

# Check loop readiness before promotion
awb check --agent-id my-agent --eval-id my-eval --subset full

# Compare against baseline
awb compare --baseline baseline-id --candidate candidate-id

# Diagnose failures
awb diagnose --eval-id my-eval
```

The golden path is evidence-first: learn the repo, approve the behavior brief,
approve the eval, run the full suite, diagnose failures, choose the right
adaptation surface, apply a bounded change, rerun full, compare against the best
compatible baseline, and promote only if the candidate is non-regressing. Use
smoke runs for debugging, not promotion decisions.

For a fuller operator walkthrough, start with [docs/quickstart.md](docs/quickstart.md).

---

# Core Concepts

## Agents

An **agent spec** defines a target agent with:

- **Adapter**: How to invoke the agent (command, python, provider)
- **Prompts**: System prompts and prompt variants
- **Model**: For provider adapters (model name, provider, temperature)
- **Tools**: External tools the agent can use
- **Skills**: Internal capabilities (grounding, memory recall, etc.)
- **Plugins**: Additional capabilities (knowledge base, workflow memory)
- **MCPs**: Model Context Protocol servers
- **Entrypoints**: How to run the agent (command, python function, API)

### Example Agent Spec

```yaml
id: groq-support-baseline
adapter: provider
prompts:
  system: target_agent/prompts/system.md
  reviewer: target_agent/prompts/reviewer.md
model:
  name: groq/llama-3.1-8b-instant
  provider: groq
  temperature: 0
tools:
  - name: policy_search
    enabled: true
  - name: incident_lookup
    enabled: true
skills:
  - name: policy_grounding
    enabled: true
  - name: incident_triage
    enabled: true
```

## Behavior Briefs

A **behavior brief** captures what "better" means for the agent:

- **Goal**: What the agent should accomplish
- **Desired behaviors**: Explicit behaviors to encourage
- **Worse signals**: Behaviors to avoid
- **Preferred surfaces**: Which tools/skills/MCPs to use
- **Workflow scenarios**: Multi-step workflows the agent should handle

### Creating a Brief

```bash
awb brief start --brief-id my-brief --agent-id my-agent
# Follow prompts to define goal, behaviors, etc.

awb brief finalize --brief-id my-brief
```

## Eval Suites

An **eval suite** contains test cases for evaluating an agent:

- **Cases**: Individual test cases with input, expected output, and metadata
- **Splits**: Named subsets (smoke, full, etc.)
- **Scorers**: Functions that score agent outputs

### Eval Case Structure

```json
{
  "id": "case-1",
  "input": "What should I do after a failed benchmark run?",
  "expected": "diagnose before rerun",
  "rubric": ["Answers are grounded and actionable"],
  "tags": ["output", "smoke"],
  "metadata": {
    "process_expectations": {
      "required_tools": ["repo_search"],
      "max_step_count": 3
    },
    "score_weights": {
      "output_quality": 0.7,
      "process_alignment": 0.3
    }
  }
}
```

### Multi-Turn Eval Cases

```json
{
  "id": "support-multiturn-refund",
  "input": "What refund amount should we offer?",
  "turns": [
    {"role": "user", "content": "A customer wants a refund"},
    {"role": "assistant", "content": "I found the customer's account..."}
  ],
  "metadata": {
    "expected_turns": [
      {"step": 0, "expected_tools": ["policy_search"]},
      {"step": 1, "expected_tools": ["customer_memory"]}
    ]
  }
}
```

### Branching Multi-Turn Benchmarks

Use the `metadata.dialogue` contract to define multiple valid conversation paths to the same goal, so that agent strategies can be compared:

```json
{
  "id": "support-multiturn-refund",
  "metadata": {
    "dialogue": {
      "goal": "Resolve refund request",
      "paths": [
        {
          "path_id": "direct",
          "strategy": "direct",
          "turns": [
            {"role": "user", "content": "I want a refund"},
            {"role": "assistant", "expected": "I can help with that. Order ID?"}
          ]
        },
        {
          "path_id": "alternate",
          "strategy": "alternative",
          "turns": [
            {"role": "user", "content": "Can I get my money back?"},
            {"role": "assistant", "expected": "I'll look into your refund. What is the order number?"}
          ]
        }
      ]
    }
  }
}
```

This enables testing whether agents follow **valid alternate routes** to the same goal and measuring path similarity.

```python
from agentworkbench import BranchingConversationBuilder, PathStrategy

benchmark = (
    BranchingConversationBuilder("support-escalation")
    .add_conversation_path(
        path_id="direct_resolution",
        turns=[
            {"role": "user", "content": "I need help with my order"},
            {"role": "assistant", "content": "I'll help. Order ID?", "expected": "I'll help"},
            {"role": "user", "content": "ORD-123"},
            {"role": "assistant", "expected": "Order shipped yesterday"},
        ],
        expected_outcome="resolved",
        strategy=PathStrategy.DIRECT,
    )
    .add_conversation_path(
        path_id="escalation_path",
        turns=[
            {"role": "user", "content": "I need help with my order"},
            {"role": "assistant", "content": "I'll help. Order ID?"},
            {"role": "user", "content": "I don't have it"},
            {"role": "assistant", "content": "Let me look it up by email"},
            {"role": "user", "content": "user@example.com"},
            {"role": "assistant", "expected": "Escalating to support"},
        ],
        expected_outcome="escalated",
        strategy=PathStrategy.ALTERNATIVE,
    )
    .with_path_comparison_metric("efficiency", weight=0.4)
    .with_path_comparison_metric("user_satisfaction", weight=0.6)
    .save(project_root="./my-project")
)
```

This enables testing whether agents choose the most appropriate path based on context, and comparing path efficiency across different strategies.
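
To make the idea of matching a run to a declared path concrete, the sketch below picks the `metadata.dialogue` path whose expected assistant text is closest to an observed reply. The `closest_path` helper and the use of `difflib` string similarity are assumptions for illustration only; they are not how Agent Workbench computes path similarity.

```python
# Hypothetical sketch: not part of the agentworkbench API.
from difflib import SequenceMatcher

DIALOGUE = {
    "goal": "Resolve refund request",
    "paths": [
        {"path_id": "direct", "turns": [
            {"role": "user", "content": "I want a refund"},
            {"role": "assistant", "expected": "I can help with that. Order ID?"},
        ]},
        {"path_id": "alternate", "turns": [
            {"role": "user", "content": "Can I get my money back?"},
            {"role": "assistant",
             "expected": "I'll look into your refund. What is the order number?"},
        ]},
    ],
}


def closest_path(dialogue: dict, assistant_reply: str) -> tuple[str, float]:
    """Return (path_id, similarity) for the path closest to the observed reply."""
    best_id, best_ratio = "", -1.0
    for path in dialogue["paths"]:
        # Join the expected assistant text declared for this path.
        expected = " ".join(
            turn["expected"]
            for turn in path["turns"]
            if turn["role"] == "assistant" and "expected" in turn
        )
        ratio = SequenceMatcher(None, expected.lower(), assistant_reply.lower()).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = path["path_id"], ratio
    return best_id, best_ratio


print(closest_path(DIALOGUE, "I can help with that. Order ID?"))
```

Character-level similarity is a crude proxy; a real comparison would more likely look at tools used, outcomes, and route policies.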

---

# CLI Reference

## Initialization

### `awb init`

Scaffolds workbench files:

```bash
awb init                    # Initialize in current directory
```

Creates `.agent-workbench/` with:
- `agents/` — Agent specs
- `briefs/` — Behavior briefs
- `evals/` — Eval suites
- `scorers/` — Custom scorers
- `workbench.yaml` — Workbench configuration

### `awb learn`

Discovers repo structure:

```bash
awb learn
```

Detects:
- Prompt files
- Tool definitions
- Skill implementations
- Plugin code
- MCP configurations
- Entry points

## Behavior Briefs

### `awb brief start`

Creates a new behavior brief:

```bash
awb brief start --brief-id support-brief --agent-id my-agent --goal "Help users with support tickets"
```

### `awb brief draft`

Seeds a behavior brief from an agent spec, linked evals, and discovery artifacts:

```bash
awb brief draft --brief-id my-brief --agent-id my-agent
```

### `awb brief finalize`

Finalizes an approved brief:

```bash
awb brief finalize --brief-id my-brief
```

### `awb brief packet`

Generates a coding agent packet:

```bash
awb brief packet --brief-id my-brief --agent-id my-agent
```

The packet includes `Human Checkpoints` when the brief is too subjective or under-specified for the package to lock down alone. Those checkpoints cover target-surface confirmation, representative benchmark examples, answer acceptance decisions, or short guidance to the model. They are not prompts for the human to write a gold answer.

## Eval Suites

### `awb evals generate`

Generates eval cases from a brief:

```bash
awb evals generate --brief-id my-brief --eval-id my-eval
```

### `awb evals expand`

Appends draft coverage from the linked brief and recent experiment failures, then resets approval so benchmark drift stays explicit:

```bash
awb evals expand --eval-id my-eval --experiment my-agent-my-eval-full-20260410112233445566
```

### `awb evals approve`

Approves an eval suite:

```bash
awb evals approve --eval-id my-eval
```

### `awb evals review`

Shows eval suite details in a human-readable review report:

```bash
awb evals review --eval-id my-eval
```

The review report includes `Human Checkpoints Before Approval` when cases still need human help to define subjective acceptance rules or workflow contracts. Record those answers with `awb human-review answer --decision ... --guidance ...` so the package can stop re-asking the same checkpoint.

## Human Review

### `awb human-review answer`

Records a human decision or short guidance without asking the human to write an ideal answer:

```bash
awb human-review answer \
  --queue-id eval-my-eval \
  --checkpoint-id eval-subjective-acceptance \
  --decision accept_candidate \
  --guidance "Prefer the shorter grounded answer."
```

Recorded answers are automatically applied back into the relevant brief/eval artifacts so the same checkpoint can disappear from later `brief packet`, `evals review`, and `diagnose` runs.

### `awb human-review apply`

Re-applies stored answers into benchmark artifacts and refreshes the derived review reports:

```bash
awb human-review apply --queue-id eval-my-eval
awb human-review apply --checkpoint-id eval-subjective-acceptance --dry-run
```

## Running Experiments

### `awb run`

Executes an agent against eval cases:

```bash
awb run --agent-id my-agent --eval-id my-eval --subset smoke
awb run --agent-id my-agent --eval-id my-eval --case-id case-1 --case-id case-2
awb run --agent-id my-agent --eval-id my-eval --tag workflow
awb run --agent-id my-agent --eval-id my-eval --match "refund"
awb run --agent-id my-agent --eval-id my-eval --case-limit 10
```

### `awb run chain`

Runs cases through multiple agents sequentially:

```bash
awb run chain \
  --agent-id analyzer-agent \
  --agent-id router-agent \
  --eval-id my-eval \
  --subset chain

# With step configuration
awb run chain \
  --agent-id agent-a \
  --agent-id agent-b \
  --eval-id my-eval \
  --subset smoke \
  --step-config '{"agent_id": "agent-b", "branch_condition": "prior_score >= 0.5"}'
```
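
For intuition about a `branch_condition` such as `prior_score >= 0.5`, the sketch below evaluates that kind of expression without `eval()`. The `should_run_step` helper, the supported operators, and the reading that a step runs only when the condition is true are all assumptions; the actual condition grammar is defined by `awb`.

```python
# Hypothetical evaluator for conditions shaped like "prior_score >= 0.5".
import operator

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def should_run_step(condition: str, prior_score: float) -> bool:
    """Evaluate '<name> <op> <number>' against the previous step's score."""
    name, op, threshold = condition.split()
    if name != "prior_score" or op not in _OPS:
        raise ValueError(f"unsupported condition: {condition!r}")
    return _OPS[op](prior_score, float(threshold))


print(should_run_step("prior_score >= 0.5", 0.62))  # True
print(should_run_step("prior_score >= 0.5", 0.40))  # False
```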

#### Chain Handoff Template

Customize how output is passed between chain steps:

```bash
awb run chain \
  --agent-id agent-a \
  --agent-id agent-b \
  --chain-handoff-template "Given: {prior_output}\n\nOriginal: {original_input}"
```

Template variables:
- `{prior_output}` — Previous agent's output
- `{prior_structured_output}` — JSON of structured output
- `{prior_tool_calls}` — JSON of tool calls
- `{prior_evidence}` — JSON of evidence records
- `{prior_trace}` — JSON of trace events
- `{original_input}` — Original case input
- `{case_id}` — Case ID
- `{step}` — Current step index
- `{chain_context}` — Full accumulated context
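
The variables above behave like named placeholders. As a rough mental model, the sketch below fills a template with Python's `str.format`, JSON-encoding any non-string values. The `render_handoff` helper is hypothetical, and the real renderer in `awb` may escape or encode values differently.

```python
# Hypothetical sketch: awb's real template renderer may behave differently.
import json

TEMPLATE = "Given: {prior_output}\n\nOriginal: {original_input}"


def render_handoff(template: str, context: dict) -> str:
    """Fill `{name}` placeholders; JSON-encode any non-string values."""
    values = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in context.items()
    }
    # Extra keys in `values` are ignored by str.format.
    return template.format(**values)


context = {
    "prior_output": "Refund is allowed within 30 days.",
    "prior_tool_calls": [{"name": "policy_search"}],
    "original_input": "Can I get a refund?",
    "case_id": "case-1",
    "step": 1,
}
print(render_handoff(TEMPLATE, context))
```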

### `awb run workflow`

Runs routed multi-agent workflows, including fan-out/fan-in graphs and nested chain nodes:

```bash
awb run workflow \
  --workflow-file groq-routing-workflow \
  --eval-id workflow-routing-live \
  --subset full

awb run workflow \
  --workflow-file groq-routing-nested-chain \
  --eval-id workflow-routing-live \
  --subset smoke
```

Workflow nodes can be plain agents or `node_type: chain` compositions, so a router can hand work to a nested planner-worker-reviewer style lane instead of only a single downstream agent.

Compatibility alias:

```bash
awb run-workflow --workflow-file groq-routing-workflow --eval-id workflow-routing-live --subset full
```

## Comparison & Diagnosis

### `awb check`

Runs local loop-readiness checks for long-running Codex optimization loops:

```bash
awb check
awb check --agent-id my-agent --eval-id my-eval --subset full
awb check --experiment exp-id
awb check --workflow-file workflow.yaml --eval-id my-eval --subset full
awb check --chain-agent-id planner --chain-agent-id reviewer --eval-id my-eval --subset full
awb check --format text --fail-on none
```

The command writes JSON and Markdown reports under `.agent-workbench/reports/`, including `check-latest.json` and `check-latest.md`. By default it exits non-zero only for `critical` findings. Promotion and `compare --best` include a `loop_readiness` gate when `loop_checks.block_promotion_on_critical` is enabled.

Check categories:

- `code_structure`: large files, long functions/classes, and unclear target surface splits.
- `agent_topology`: unnecessary multi-agent setups, missing routing/handoff structure, workflow loops, dead ends, and unmatched routes.
- `eval_contract`: missing rationales for required/forbidden tools, MCPs, prompt keys, workflow steps, and workflow sequences.
- `route_policy`: multi-path dialogue cases without route policies, under-specified invalid routes/applicability, undeclared required tools, invalid completed routes, and provisional novel routes needing review.
- `deterministic_fallback`: deterministic-path advisories such as hard-coded case IDs, exact expected strings, or canned final-output fallbacks in target runtime files, which may compromise evaluation integrity.
- `loop_hygiene`: draft evals, benchmark drift, smoke-versus-full promotion policy issues (decisions based only on smoke subsets), missing promoted baselines, review items, gates, and scorer-audit issues.

### `awb compare`

Compares experiment results:

```bash
awb compare --baseline baseline-id --candidate candidate-id
awb compare --best --candidate candidate-id
```

### `awb diagnose`

Analyzes failures and recommends improvements:

```bash
awb diagnose --eval-id my-eval --subset full
awb diagnose --experiment exp-id
```

Diagnosis reports include an `Ask The Human` section when a regression or change cannot be trusted without human preference input or benchmark clarification.

### `awb status`

Shows optimization stream status:

```bash
awb status --eval-id my-eval
awb status
```

### `awb path-compare`

Summarizes dialogue path behavior and strategy effectiveness from a completed experiment:

```bash
awb path-compare --experiment exp-id
```

This command inspects multi-turn results to visualize which conversation paths were taken and which strategies were most effective.

### `awb route-policy`

Authors, lints, explains, and reviews route policies for multi-turn dialogue benchmarks:

```bash
awb route-policy template --strategy direct
awb route-policy draft --eval-id my-dialogue-eval
awb route-policy draft --eval-id my-dialogue-eval --apply
awb route-policy lint --eval-id my-dialogue-eval --agent-id my-agent
awb route-policy explain --eval-id my-dialogue-eval
awb route-policy benchmark --eval-id awb-canonical-dialogue
```

`metadata.dialogue.route_policy` defines case signals, valid routes, invalid route patterns, and route-score weights. Workbench uses it to decide whether an agent’s dialogue route was valid, context-appropriate, and optimal among acceptable alternatives. Novel goal-achieving routes can pass provisionally, but `path-compare` and `check` keep them visible until a human or coding agent approves, rejects, or converts them:

```bash
awb route-policy review --experiment exp-id --case-id case-1 --decision convert --route-id safe_alternate
```

See [Route Policy V1](docs/route_policy.md) for full metadata examples and authoring patterns.

## Artifacts

### `awb artifacts show`

Inspect stored experiment, case, or compare artifacts:

```bash
awb artifacts show experiment --experiment exp-id
awb artifacts show case --experiment exp-id --case-id case-1
awb artifacts show compare --latest
```

### `awb artifacts export`

Export a full experiment artifact bundle:

```bash
awb artifacts export --experiment exp-id --output exp-id.json
```

### `awb artifacts review`

Generate an HTML review bundle:

```bash
awb artifacts review --experiment exp-id --output review.html
```

Legacy aliases still work for compatibility: `run-chain`, `run-workflow`, `replay`, `show`, `export`, `review`.

## Red Teaming

### `awb redteam generate`

Builds adversarial, robustness, or stress suites from an approved source eval:

```bash
awb redteam generate \
  --source-eval-id provider-tool-interface \
  --output-eval-id provider-tool-interface-redteam \
  --mode adversarial \
  --mode robustness \
  --mode stress
```

Generated suites carry attack metadata for family, severity, mode, and source-case lineage, and they include mode-specific splits alongside `smoke` and `full`.

### `awb redteam report`

Summarizes outcomes for a completed red-team experiment:

```bash
awb redteam report --experiment <experiment-id>
```

## Optimization

**Important**: Agent Workbench provides a disciplined experiment framework for evaluating and comparing agent configurations. It measures, proposes, and ranks mutations—but it does **not** automatically rewrite agent code. A coding agent or human is still required to implement changes based on the generated proposals and patch templates.

Mutations should be motivated by failed behavior cases, regressions, or approved
benchmark gaps. Do not patch toward the benchmark by adding case-specific
deterministic fallbacks; patch the underlying surface that explains the failure.
Valid targets include specification, evaluation, prompting, skills, tools,
orchestration, memory/state, data, retrieval, context assembly, model choice,
output contracts, verification, execution environment, migration and
infrastructure, observability, safety and policy, human interaction, and the
adaptation policy itself. See [Adaptation Surfaces](docs/adaptation_surfaces.md).

### `awb propose`

Generates bounded mutation proposals with target files, risks, validation commands, and acceptance checks:

```bash
awb propose --eval-id my-eval --subset full
```

### `awb optimize`

Runs a bounded optimization loop over the existing search primitives, then writes a single artifact that includes the measured frontier, finalist revalidation on the preferred subset, promotion-style gate checks, and a proposal-to-patch handoff:

```bash
awb optimize --agent-id my-agent --eval-id my-eval --subset full --budget 6
awb optimize --agent-id my-agent --eval-id my-eval --subset full --budget 6 --dry-run
awb optimize --resume optimize-plan-my-agent-my-eval-20260421000000000000
```

**Note**: The optimizer narrows options and produces ranked recommendations. It is not a fully autonomous coding agent—you must review the proposals and apply changes manually or via a coding agent.

You can also add an optional model/runtime axis:

```bash
awb optimize --agent-id my-agent --eval-id my-eval --subset smoke --budget 8 --temperature 0 --temperature 0.2 --max-tokens 256
```

Use `--dry-run` to write a launch plan under `.agent-workbench/optimizations` without starting candidate runs. Use `--timeout-seconds` for a best-effort timeout checked between stages, and `--resume <optimize-id>` to reload an existing optimization artifact.

### `awb patch-template`

Creates editable implementation templates:

```bash
awb patch-template --proposal-id prop-1
```

### `awb promote`

Promotes a candidate to baseline:

```bash
awb promote --experiment candidate-id
awb promote --experiment candidate-id --baseline baseline-id
```

### `awb assurance`

Builds an assurance evidence pack from completed experiment artifacts. The report does not certify an agent; it records the available evidence, SHA-256 artifact fingerprints, surface exposure, readiness level, residual risks, and next actions for engineering, security, and governance review:

```bash
awb assurance report --experiment candidate-id
awb assurance report --experiment candidate-id --format text
awb assurance report --experiment candidate-id --baseline baseline-id --redteam-experiment redteam-id
```

Use `assurance check` in CI or release gates. It writes the same reports, then exits non-zero when readiness is below the configured threshold:

```bash
awb assurance check --experiment candidate-id
awb assurance check --experiment candidate-id --fail-below needs_review
```

Reports are written under `.agent-workbench/reports/` as `assurance-<experiment-id>.json`, `assurance-<experiment-id>.md`, `assurance-latest.json`, and `assurance-latest.md`. See [AI Agent Assurance](docs/assurance.md) for policy configuration and workflow guidance.

### `awb ablate`

Tests surface ablation:

```bash
awb ablate --agent-id my-agent --eval-id my-eval --surfaces tools
awb ablate --agent-id my-agent --eval-id my-eval --surfaces prompts,tools --limit 4
```

### `awb matrix`

Runs the same eval across multiple agents or providers and reports the results as a matrix:

```bash
awb matrix --agent-id agent-a --agent-id agent-b --eval-id my-eval
```

### `awb factorial`

Factorial experiments:

```bash
awb factorial --experiment exp-id
```

---

# Scoring

## Default Scoring

The default scorer bundle includes:

- **output_quality** (0.7): Did the output match expectations?
- **process_alignment** (0.3): Did the agent use expected tools/skills?
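
As a rough illustration of how the two default weights might combine, the sketch below computes a weighted average from per-metric scores. The `combine_scores` helper is hypothetical and not part of `agentworkbench`; the package's real aggregation may differ.

```python
# Hypothetical helper: not part of the agentworkbench API.
DEFAULT_WEIGHTS = {"output_quality": 0.7, "process_alignment": 0.3}


def combine_scores(
    metric_scores: dict[str, float],
    weights: dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Return the weight-normalized average of the metrics that have a weight."""
    total_weight = sum(weights[name] for name in metric_scores if name in weights)
    if total_weight == 0:
        raise ValueError("no weighted metrics present")
    weighted = sum(
        score * weights[name]
        for name, score in metric_scores.items()
        if name in weights
    )
    return weighted / total_weight


# 0.6 * 0.7 + 0.7 * 0.3 = 0.63
print(round(combine_scores({"output_quality": 0.6, "process_alignment": 0.7}), 2))
```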

## Scorer Bundles

Predefined scorer bundles:

| Bundle | Use Case |
|--------|----------|
| `default` | General purpose |
| `support_agent` | Support desk agents |
| `coding_agent` | Code generation agents |
| `retrieval_agent` | RAG/retrieval agents |
| `router_agent` | Decision/routing agents |
| `release_gate` | Release decision agents |

## Custom Scoring

Generate and lint scorer files before relying on them for release claims:

```bash
awb scorer template --kind heuristic
awb scorer template --kind dialogue
awb scorer lint --eval-id my-eval
awb scorer schema
```

Add `score_cases()` to `.agent-workbench/scorers/<eval-id>_default.py`:

```python
from agentworkbench.scorer_utils import score_output_quality, score_process_alignment
from agentworkbench.scorers.workflow import score_conversation_turns

def score_cases(run_artifact: dict, case: dict) -> list[dict]:
    scores = [
        score_output_quality(run_artifact, case),
        score_process_alignment(run_artifact, case),
    ]
    turn_scores = score_conversation_turns(run_artifact, case)
    scores.extend(turn_scores)
    return scores
```

Custom scores should include judgment provenance. Prefer built-in scorer helpers or `enrich_score_with_judgment(..., heuristic_score())` for deterministic scoring.

## Turn-Level Scoring

For multi-turn cases, define `expected_turns` in case metadata:

```json
{
  "metadata": {
    "expected_turns": [
      {"step": 0, "expected_tools": ["policy_search"], "expected_skills": ["policy_grounding"]},
      {"step": 1, "expected_tools": ["customer_memory"], "expected_skills": ["memory_recall"]}
    ]
  }
}
```

This generates per-turn scores: `turn_0_alignment`, `turn_1_alignment`, etc.
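
One plausible way to read a per-turn alignment score is the fraction of expected tools and skills that were actually observed in that turn. The sketch below implements that reading; the `turn_alignment_scores` helper, the shape of `observed_turns`, and the formula itself are assumptions rather than the package's scoring rule.

```python
# Hypothetical scoring rule: fraction of expected tools and skills seen per turn.
def turn_alignment_scores(
    expected_turns: list[dict],
    observed_turns: list[dict],
) -> dict[str, float]:
    scores: dict[str, float] = {}
    for expected in expected_turns:
        step = expected["step"]
        wanted = set(expected.get("expected_tools", [])) | set(
            expected.get("expected_skills", [])
        )
        # A missing observed turn counts as using nothing.
        observed = observed_turns[step] if step < len(observed_turns) else {}
        seen = set(observed.get("tools", [])) | set(observed.get("skills", []))
        scores[f"turn_{step}_alignment"] = (
            len(wanted & seen) / len(wanted) if wanted else 1.0
        )
    return scores


expected_turns = [
    {"step": 0, "expected_tools": ["policy_search"], "expected_skills": ["policy_grounding"]},
    {"step": 1, "expected_tools": ["customer_memory"], "expected_skills": ["memory_recall"]},
]
observed_turns = [
    {"tools": ["policy_search"], "skills": ["policy_grounding"]},
    {"tools": ["customer_memory"], "skills": []},
]
print(turn_alignment_scores(expected_turns, observed_turns))
# {'turn_0_alignment': 1.0, 'turn_1_alignment': 0.5}
```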

---

# Configuration

## Workbench Config

`.agent-workbench/workbench.yaml`:

```yaml
artifacts_dir: .agent-workbench/runs
snapshot_mode: auto
env_files:
  - .env.providers.local
gates:
  min_average_score: 0.75
  min_passed_case_delta: 0
  require_clean_benchmark: true
  max_regressed_cases: 0
loop_checks:
  enabled: true
  fail_on_severity: critical
  block_promotion_on_critical: true
  large_file_line_threshold: 500
  large_function_line_threshold: 80
  min_rationale_chars: 12
  expected_string_match_min_chars: 8
```

## Acceptance Gates

| Gate | Description |
|------|-------------|
| `min_average_score` | Minimum average score to pass |
| `min_passed_case_delta` | Minimum improvement in passed cases |
| `require_clean_benchmark` | No benchmark changes allowed |
| `max_regressed_cases` | Maximum allowed regressions |
| `min_stability_runs` | Minimum runs for stability check |
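
The sketch below shows how three of these gates could be applied to a baseline and candidate summary. It is a simplified illustration: the `failed_gates` helper and the summary field names are assumptions, and `require_clean_benchmark` and `min_stability_runs` are left out.

```python
# Hypothetical gate check; field names on the summaries are assumptions.
GATES = {
    "min_average_score": 0.75,
    "min_passed_case_delta": 0,
    "max_regressed_cases": 0,
}


def failed_gates(
    baseline: dict,
    candidate: dict,
    regressed_cases: int,
    gates: dict = GATES,
) -> list[str]:
    """Return the names of the gates the candidate does not satisfy."""
    failures = []
    if candidate["average_score"] < gates["min_average_score"]:
        failures.append("min_average_score")
    passed_delta = candidate["passed_cases"] - baseline["passed_cases"]
    if passed_delta < gates["min_passed_case_delta"]:
        failures.append("min_passed_case_delta")
    if regressed_cases > gates["max_regressed_cases"]:
        failures.append("max_regressed_cases")
    return failures


baseline = {"average_score": 0.65, "passed_cases": 3}
candidate = {"average_score": 0.80, "passed_cases": 4}
print(failed_gates(baseline, candidate, regressed_cases=0))  # []
print(failed_gates(baseline, baseline, regressed_cases=1))
# ['min_average_score', 'max_regressed_cases']
```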

---

# Architecture

## Models

Key data models in `agentworkbench/models.py`:

- **AgentSpec**: Agent definition
- **BehaviorBrief**: What "better" means
- **EvalSuiteSpec**: Eval suite configuration
- **EvalCase**: Individual test case
- **RunArtifact**: Execution result
- **ExperimentRecord**: Complete experiment
- **ComparisonRecord**: Baseline vs candidate comparison
- **ChainStepConfig**: Chain step configuration
- **ConversationMetrics**: Multi-turn conversation stats

## Adapters

| Adapter | Description |
|---------|-------------|
| `command` | Run shell commands |
| `python` | Call Python functions |
| `provider` | OpenAI-compatible API calls |

### Adapter Environment Variables

- `AWB_INPUT` — Case input
- `AWB_TURNS_JSON` — Multi-turn conversation
- `AWB_CHAIN_CONTEXT_JSON` — Chain context
- `AWB_PROMPTS_JSON` — Prompt definitions
- `AWB_TOOLS_JSON` — Tool definitions
- `AWB_CASE_PATH` — Case file path
- `AWB_OUTPUT_PATH` — Output file path
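
A command-adapter script can read these variables directly from its environment. The sketch below is a minimal echo agent; the `run_adapter` function and the JSON shape it writes to `AWB_OUTPUT_PATH` are assumptions, so check the adapter contract for the fields Workbench actually expects.

```python
# Hypothetical command-adapter entrypoint; the output schema is an assumption.
import json
import os
from pathlib import Path


def run_adapter(env: dict[str, str]) -> dict:
    """Read the case input from AWB_* variables and write a JSON result."""
    case_input = env.get("AWB_INPUT", "")
    turns = json.loads(env.get("AWB_TURNS_JSON") or "[]")
    result = {
        "output": f"echo: {case_input}",  # replace with a real agent call
        "turn_count": len(turns),
    }
    output_path = env.get("AWB_OUTPUT_PATH")
    if output_path:
        # Only write a file when the harness supplied a destination.
        Path(output_path).write_text(json.dumps(result), encoding="utf-8")
    return result


if __name__ == "__main__":
    print(json.dumps(run_adapter(dict(os.environ))))
```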

## Scorers

Located in `agentworkbench/scorers/`:

- `output_quality` — Output matching expectations
- `process_alignment` — Tool/skill usage
- `grounding` — Evidence-based answers
- `json_schema` — Structured output validation
- `safety` — Forbidden term checking
- `workflow` — Turn-level alignment

---

# Examples

See `examples/` for complete examples:

- `live-agent-demo/` — Local command adapter demo
- `live-provider-demo/` — Provider (Groq/Cerebras) demo

### Running Examples

```bash
cd examples/live-provider-demo

# Run smoke tests
awb run --agent-id groq-support-baseline --eval-id support-ops-agent --subset smoke

# Run multiturn tests
awb run --agent-id groq-support-baseline --eval-id support-ops-agent --subset multiturn

# Run chain tests
awb run chain \
  --agent-id groq-support-baseline \
  --agent-id groq-router-workflow \
  --eval-id support-ops-agent \
  --subset chain

# Run routed workflow tests
awb run workflow \
  --workflow-file groq-routing-workflow \
  --eval-id workflow-routing-live \
  --subset full

# Generate and run a red-team smoke suite
awb redteam generate \
  --source-eval-id provider-tool-interface \
  --output-eval-id provider-tool-interface-redteam \
  --mode adversarial \
  --mode robustness \
  --mode stress
awb evals approve --eval-id provider-tool-interface-redteam
awb run --agent-id groq-provider-tool-interface --eval-id provider-tool-interface-redteam --subset smoke

# Diagnose
awb diagnose --eval-id support-ops-agent --subset chain
```

---

# Output Interpretation

## Experiment Results

```json
{
  "experiment": {
    "experiment_id": "...",
    "agent_id": "groq-support-baseline",
    "eval_id": "support-ops-agent",
    "average_score": 0.65,
    "passed_cases": 3,
    "total_cases": 5
  },
  "aggregate": {
    "metric_averages": {
      "output_quality": 0.6,
      "process_alignment": 0.7,
      "turn_0_alignment": 1.0
    },
    "behavior_summary": {
      "average_turn_count": 2.0,
      "escalation_rate": 0.0,
      "tools_used": ["policy_search", "incident_lookup"]
    }
  }
}
```
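
When reading a result like this, two quick numbers are the pass rate and the lowest-scoring metric. The sketch below derives both from the fields shown above; the `summarize` helper is hypothetical and only assumes the JSON shape of this example.

```python
# Hypothetical helper that reads the result shape shown above.
RESULT = {
    "experiment": {"passed_cases": 3, "total_cases": 5, "average_score": 0.65},
    "aggregate": {
        "metric_averages": {
            "output_quality": 0.6,
            "process_alignment": 0.7,
            "turn_0_alignment": 1.0,
        }
    },
}


def summarize(result: dict) -> tuple[float, str]:
    """Return (pass rate, name of the lowest-scoring metric)."""
    experiment = result["experiment"]
    pass_rate = experiment["passed_cases"] / experiment["total_cases"]
    metrics = result["aggregate"]["metric_averages"]
    weakest_metric = min(metrics, key=metrics.get)
    return pass_rate, weakest_metric


print(summarize(RESULT))  # (0.6, 'output_quality')
```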

## Chain Diagnosis

```json
{
  "chain_diagnosis": {
    "is_chain_run": true,
    "weakest_step": 0,
    "weakest_agent": "groq-support-baseline",
    "step_failure_counts": {
      "0": {"groq-support-baseline": 2},
      "1": {"groq-router-workflow": 1}
    }
  }
}
```
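
The `weakest_step` and `weakest_agent` fields can be read as the step and agent with the most recorded failures. The sketch below recomputes them from `step_failure_counts` under that assumption; the `weakest_step` function and its tie-breaking rule (earliest step wins) are illustrative, not the package's implementation.

```python
# Hypothetical reimplementation of the weakest-step summary.
def weakest_step(step_failure_counts: dict[str, dict[str, int]]) -> tuple[int, str]:
    """Return (step index, agent id) with the most recorded failures."""
    candidates = [
        (count, -int(step), agent)  # negative step so ties favor earlier steps
        for step, agents in step_failure_counts.items()
        for agent, count in agents.items()
    ]
    if not candidates:
        raise ValueError("no failures recorded")
    _count, neg_step, agent = max(candidates)
    return -neg_step, agent


counts = {
    "0": {"groq-support-baseline": 2},
    "1": {"groq-router-workflow": 1},
}
print(weakest_step(counts))  # (0, 'groq-support-baseline')
```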

---

# Troubleshooting

## `awb doctor`

Run diagnostics:

```bash
awb doctor
```

Checks:
- Required environment variables
- Repo structure
- Config validity

## Common Issues

### Missing API Keys

Set required environment variables:

```bash
export GROQ_API_KEY=your-key
export CEREBRAS_API_KEY=your-key
```

### Scorer Not Found

Ensure the scorer file exists at `.agent-workbench/scorers/<eval-id>_default.py`.

### Adapter Errors

Check agent spec adapter configuration and entrypoint validity.

---

# Security

Please report vulnerabilities via the process in `SECURITY.md`.

# Code of Conduct

This project follows the Contributor Covenant. See `CODE_OF_CONDUCT.md`.

# License

MIT — see `LICENSE`.
