Aggregate performance data from persisted runs.
{{ task.ref }}
{% endfor %}
Local tasks will appear here after you scaffold or add one.
{% endif %}{{ selected_task_meta.description or 'No description available.' }}
{% else %}Select a task to see its description.
{% endif %}
{% for entry in recent_runs %}
- {{ entry.agent }} → {{ entry.task_ref }}
{% endfor %}
No runs logged yet.
{% endif %}{{ result | tojson(indent=2) }}
{% else %}
Results will appear here after you launch a run.
{% endif %}Outcome
Budget burn
IO audit
Step {{ entry.step }}{% if entry.action %} · {{ entry.action }}{% endif %}
Step {{ entry.step }} · {{ entry.action.type }}
{{ entry | tojson(indent=2) }}
Trace is empty for this run.
{% endif %}Select a trace from Recent Runs or run a task to view its steps.
{% endif %}
One-click launch for known-good agent+task combinations. Equivalent to `agent-bench run pairing <name>`.
| Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest |
|---|---|---|---|---|---|---|---|
| {{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %} {% if row.last_success %}Success{% else %}Failure{% endif %} · view trace {% else %} — {% endif %} |
Baseline stats will appear after you record a few runs.
{% endif %}Compare runs
{% if recent_runs %} {% if compare_suggestions and compare_suggestions.run_a and compare_suggestions.run_b %}Suggested pair: {{ compare_suggestions.run_a[:8] }} → {{ compare_suggestions.run_b[:8] }} based on recent runs for the same agent/task{% if compare_inputs.run_a == compare_suggestions.run_a and compare_inputs.run_b == compare_suggestions.run_b %}, prefilled below{% endif %}.
{% endif %} {% endif %} {% if recent_runs %}Tip: start from recent runs, use the suggested pair when it matches your investigation, then refine with the drift filter.
{% endif %} {% if compare_error %}Summary
Taxonomy shift
Budget delta (B − A)
Delta view
Showing {{ compare_step_summary|length }} of {{ compare_step_summary_total }} changed step{{ '' if compare_step_summary_total == 1 else 's' }} {% if compare_filters and compare_filters.drift and compare_filters.drift != 'all' %} for {{ compare_filters.drift }} drift. {% else %} across all drift categories. {% endif %}
| Step | Baseline action | Current action | Action changed? | Result changed? | IO drift? |
|---|---|---|---|---|---|
| {{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.action_changed }} | {{ entry.result_changed }} | {{ entry.has_io_drift }} |
Delta view
No changed steps match the current drift filter.
{% endif %} {% set io_steps = compare_diff.step_diffs | selectattr('io_audit_delta', 'defined') | list %} {% if io_steps %}IO Drift
Step {{ entry.step }} — IO changes
{% set delta = entry.io_audit_delta %} {% if delta.added %}No IO drift for this step.
{% endif %}Step differences
{% if compare_diff.step_diffs %}Step {{ entry.step }}
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
No step-level differences detected.
{% endif %}Browse run
IO diff — Run A vs Run B
{{ p.description | truncate(100, True, '…') }}
{% endif %} {% if p.actions %}{{ action }}
{% endfor %}
{% for err in p.lint_errors %}
- {{ err }} {% endfor %}
No task plugins discovered.
{% endif %}