Dashboard
Server: local
Port: 8000
Status: Online
Execute
Your project
{% if grouped_agents.local or grouped_tasks.local %}
{% if grouped_agents.local %}
Agents: {{ grouped_agents.local|length }}
{% endif %} {% if grouped_tasks.local %}
Tasks: {{ grouped_tasks.local|length }}
{% endif %}
{% else %}
No local agents or tasks discovered yet.
{% endif %}
Built-in examples
Agents: {{ grouped_agents.bundled|length }}
Tasks: {{ grouped_tasks.bundled|length }}

Runs are deterministic, and artifacts are written to .agent_bench/runs/. Replay accepts overrides.

Global metrics
Avg success rate
{% if baselines and baselines|length > 0 %} {{ ((baselines | map(attribute='success_rate') | sum) / baselines|length * 100) | round(1) }}% {% else %}—{% endif %}
Total runs (recent)
{{ recent_runs|length if recent_runs else 0 }}

Aggregate performance data from persisted runs.
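{# A minimal Python sketch of the "Avg success rate" figure above, assuming each entry in baselines is a mapping with a success_rate between 0 and 1. The helper name average_success_rate is hypothetical and not part of the project. The figure is an unweighted mean over agent/task combos, so a combo with one run counts as much as a combo with fifty.

```python
def average_success_rate(baselines):
    """Return the mean success_rate as a percentage rounded to one decimal.

    Returns None when there are no baseline rows, which corresponds to the
    placeholder dash rendered by the template.
    """
    if not baselines:
        return None
    total = sum(row["success_rate"] for row in baselines)
    return round(total / len(baselines) * 100, 1)


# Hypothetical rows: one combo always succeeds, one succeeds half the time.
example_rows = [dict(success_rate=1.0), dict(success_rate=0.5)]
example_average = average_success_rate(example_rows)
```

If a run-weighted figure is preferred, each row could be weighted by its runs count instead; that would be a change in meaning, not a bug fix.
#}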

Task database
Your project
{% if grouped_tasks.local %}
{% for task in grouped_tasks.local %} {{ task.ref }} {% endfor %}
{% else %}

Local tasks will appear here after you scaffold or add one.

{% endif %}
Built-in examples
{{ grouped_tasks.bundled|length }} bundled task{{ '' if grouped_tasks.bundled|length == 1 else 's' }} available for reference.
{% if selected_task_meta %}
{{ selected_task_meta.ref }} · {{ selected_task_meta.suite }}

{{ selected_task_meta.description or 'No description available.' }}

{% else %}

Select a task to see its description.

{% endif %}
Execution logs
{% if trace_id %} {% endif %}
Reset
{% if recent_runs %}
    {% for entry in recent_runs %}
  • {{ entry.agent }} → {{ entry.task_ref }}
    Seed {{ entry.seed }} {% if entry.failure_type %} {{ entry.failure_type }} {% else %} success {% endif %} trace {% if compare_suggestions and compare_suggestions.run_b == entry.run_id and compare_suggestions.run_a %} compare {% elif compare_suggestions and compare_suggestions.run_a == entry.run_id and compare_suggestions.run_b %} compare {% endif %}
    {% endfor %}
{% else %}

No runs logged yet.

{% endif %}
Output buffer
JSON
{% if error %}
{{ error }}
{% elif result %} {% set success = result.success %} {% if success %}Success{% else %}Failure{% endif %} {% if result_download_id %} {% endif %}
{{ result | tojson(indent=2) }}
{% else %}

Results will appear here after you launch a run.

{% endif %}
Trace dump
{% if trace_run %}
RUN_ID: {{ trace_run.run_id }} · SEED: {{ trace_run.seed }}
{% endif %}
{% if trace_error %}
{{ trace_error }}
{% elif trace_run %}
Run ID: {{ trace_run.run_id }}
Agent: {{ trace_run.agent }}
Task: {{ trace_run.task_ref }}
Seed: {{ trace_run.seed }}
{% if compare_suggestions and compare_suggestions.run_b == trace_run.run_id and compare_suggestions.run_a %} {% elif compare_suggestions and compare_suggestions.run_a == trace_run.run_id and compare_suggestions.run_b %} {% endif %}
{% if trace_taxonomy %}

Outcome

{{ trace_taxonomy.label }}
{% endif %} {% if trace_budget_series %}

Budget burn

Steps
Tool calls
{% endif %} {% if trace_io_summary %}

IO audit

Total: {{ trace_io_summary.total }} · FS: {{ trace_io_summary.filesystem }} · NET: {{ trace_io_summary.network }}
{% for entry in trace_io_summary.steps %}
Step {{ entry.step }}{% if entry.action %} · {{ entry.action }}{% endif %}
{% for item in entry.io %} {{ item.type or 'io' }}{% if item.op %} · {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endfor %}
{% endif %}
{% if trace_run.action_trace %} {% for entry in trace_run.action_trace %}
Step {{ entry.step }} · {{ entry.action.type }}
{{ entry | tojson(indent=2) }}
{% endfor %} {% else %}

Trace is empty for this run.

{% endif %}
{% else %}

Select a trace from Execution logs or launch a run to view its steps.

{% endif %}
Quick-start pairings

One-click launch for known-good agent+task combinations. Equivalent to agent-bench run pairing <name>.

{% for p in pairings %}
{{ p.name }}
{{ p.agent }}
{{ p.task }}
{{ p.description }}
{% if p.last_run_id is not none %} {% else %}
no runs yet
{% endif %}
{% endfor %}
Benchmarks
{% if trace_id %} {% endif %}
Reset
{% if published_baseline %}
Latest published
{{ published_baseline.generated_at }}
Stored at:
{{ published_baseline._path }}
{% if published_baseline.metadata %}
Filters:
agent={{ published_baseline.metadata.agent_filter or 'all' }}, task={{ published_baseline.metadata.task_filter or 'all' }}
{% endif %}
{% endif %} {% if baselines %}
{{ baselines | length }} agent/task combos tracked.
Derived from persisted runs.
Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest
{% for row in baselines %}
{{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %}{% if row.last_success %}Success{% else %}Failure{% endif %} · view trace{% else %}—{% endif %}
{% endfor %}
{% else %}

Baseline stats will appear after you record a few runs.

{% endif %}
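{# A minimal Python sketch of how baseline rows like the ones in this section could be derived from persisted runs. It is an illustration under stated assumptions, not the project's implementation: the function name build_baselines and the input keys steps and tool_calls are hypothetical, runs are assumed to be ordered oldest to newest, and a run counts as a success when failure_type is empty. The output keys mirror the row fields the template reads.

```python
from collections import defaultdict


def _mean(values):
    """Average a list of numbers, or return None when it is empty."""
    return sum(values) / len(values) if values else None


def build_baselines(runs):
    """Group runs by (agent, task_ref) and summarize each group."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["agent"], run["task_ref"])].append(run)

    rows = []
    for (agent, task_ref), items in groups.items():
        latest = items[-1]  # relies on the oldest-to-newest ordering assumption
        successes = sum(1 for r in items if not r.get("failure_type"))
        rows.append(dict(
            agent=agent,
            task_ref=task_ref,
            success_rate=successes / len(items),
            avg_steps=_mean([r["steps"] for r in items if r.get("steps") is not None]),
            avg_tool_calls=_mean([r["tool_calls"] for r in items if r.get("tool_calls") is not None]),
            last_seed=latest.get("seed"),
            runs=len(items),
            last_run_id=latest.get("run_id"),
            last_success=not latest.get("failure_type"),
        ))
    return rows
```

Averages fall back to None when no run in a group recorded the value, which matches the placeholder dash the table shows for missing step and tool-call figures.
#}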
Replay compare
Inspect divergence between two runs with guided selection and drift filters

Compare runs

{% if recent_runs %} {% for r in recent_runs %} {% endfor %} {% if compare_suggestions and compare_suggestions.run_a and compare_suggestions.run_b %}

Suggested pair: {{ compare_suggestions.run_a[:8] }} → {{ compare_suggestions.run_b[:8] }}, based on recent runs for the same agent/task{% if compare_inputs.run_a == compare_suggestions.run_a and compare_inputs.run_b == compare_suggestions.run_b %}, prefilled below{% endif %}.

{% endif %} {% endif %}
{% set compare_run_a = compare_inputs.run_a if compare_inputs else '' %} {% if recent_runs %}
Use recent for A: {% for r in recent_runs[:3] %} {% endfor %}
{% endif %} {% set compare_run_b = compare_inputs.run_b if compare_inputs else '' %} {% if recent_runs %}
Use recent for B: {% for r in recent_runs[:3] %} {% endfor %}
{% endif %}
{% if recent_runs %}

Tip: start from recent runs, use the suggested pair when it matches your investigation, then refine with the drift filter.

{% endif %} {% if compare_error %}
{{ compare_error }}
{% elif compare_diff %}

Summary

Agent match: {{ compare_diff.summary.same_agent }}
Task match: {{ compare_diff.summary.same_task }}
Success match: {{ compare_diff.summary.same_success }}
Steps: {{ compare_delta.steps_a }} → {{ compare_delta.steps_b }} (Δ {{ compare_delta.steps_delta }})
Tool calls: {{ compare_delta.tools_a }} → {{ compare_delta.tools_b }} (Δ {{ compare_delta.tools_delta }})
Changed steps
{{ compare_changed_step_count }}
Steps with IO drift
{{ compare_io_step_count }}
{% if compare_diff.summary.io_audit and (compare_diff.summary.io_audit.added or compare_diff.summary.io_audit.removed) %}
IO added: {{ compare_diff.summary.io_audit.added }} IO removed: {{ compare_diff.summary.io_audit.removed }}
{% endif %} {% if compare_taxonomy_summary %}

Taxonomy shift

{% for item in compare_taxonomy_summary %}
{{ item.label }}
A: {{ item.run_a }} B: {{ item.run_b }} {% if item.same %} ✓ match {% else %} ✗ changed {% endif %}
{% endfor %}
{% endif %} {% if compare_budget_badges %}

Budget delta (B − A)

{% for badge in compare_budget_badges %} {{ badge.label }}: {% if badge.value > 0 %}+{% endif %}{{ badge.value }}{{ badge.suffix or '' }} {% endfor %}
{% endif %} {% if compare_step_summary %}

Delta view

Showing {{ compare_step_summary|length }} of {{ compare_step_summary_total }} changed step{{ '' if compare_step_summary_total == 1 else 's' }} {% if compare_filters and compare_filters.drift and compare_filters.drift != 'all' %} for {{ compare_filters.drift }} drift. {% else %} across all drift categories. {% endif %}

Step | Baseline action | Current action | Action changed? | Result changed? | IO drift?
{% for entry in compare_step_summary %}
{{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.action_changed }} | {{ entry.result_changed }} | {{ entry.has_io_drift }}
{% endfor %}
{% elif compare_step_summary_total %}

Delta view

No changed steps match the current drift filter.

{% endif %} {% set io_steps = compare_diff.step_diffs | selectattr('io_audit_delta', 'defined') | list %} {% if io_steps %}

IO Drift

{% for entry in io_steps %}
Step {{ entry.step }} — IO changes {% set delta = entry.io_audit_delta %} {% if delta.added %}
Added
{% for item in delta.added %} {{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endif %} {% if delta.removed %}
Removed
{% for item in delta.removed %} {{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endif %} {% if not delta.added and not delta.removed %}

No IO drift for this step.

{% endif %}
{% endfor %}
{% endif %}

Step differences

{% if compare_diff.step_diffs %}
{% for entry in compare_diff.step_diffs %}
Step {{ entry.step }}
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
{% endfor %}
{% else %}

No step-level differences detected.

{% endif %}
{% endif %}
IO Audit
Filesystem & network access recorded per step
{% if recent_runs %} {% for r in recent_runs %} {% endfor %} {% endif %}

Browse run

{% if recent_runs %}
{% for r in recent_runs %} {% endfor %}
{% endif %}
Select a recent run above or enter a run ID, then click Load.

IO diff — Run A vs Run B

{% if recent_runs %}
Click to set Run A → B in order:
{% for r in recent_runs %} {% endfor %}
{% endif %}
Select two runs above and click Diff.
Plugin Registry
Discovered task plugins & lint status
Your project
{{ grouped_plugins.local|length }} local plugin{{ '' if grouped_plugins.local|length == 1 else 's' }}
Built-in examples
{{ grouped_plugins.bundled|length }} bundled plugin{{ '' if grouped_plugins.bundled|length == 1 else 's' }}
{% if plugin_registry %}
{{ plugin_registry|length }} plugin{{ '' if plugin_registry|length == 1 else 's' }}
{% for p in plugin_registry %}
{{ p.id }} v{{ p.version }}
{{ p.suite }} {% if p.source == 'local' %} local {% else %} bundled {% endif %} {% if p.lint_ok == true %} ✓ lint {% elif p.lint_ok == false %} ✗ lint {% endif %}
{% if p.description %}

{{ p.description | truncate(100, True, '…') }}

{% endif %} {% if p.actions %}
{% for action in p.actions %} {{ action }} {% endfor %}
{% endif %} {% if p.lint_errors %}
    {% for err in p.lint_errors %}
  • {{ err }}
    {% endfor %}
{% endif %}
{% endfor %}
{% else %}

No task plugins discovered.

{% endif %}