Metadata-Version: 2.4
Name: token-miser
Version: 0.4.0
Summary: Benchmark Claude Code configuration packages
Project-URL: Homepage, https://github.com/caylent/token_miser
Project-URL: Repository, https://github.com/caylent/token_miser
Project-URL: Issues, https://github.com/caylent/token_miser/issues
Author: Rubin Johnson
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: claude,claude-code,devtools,kanon,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.13
Requires-Dist: anthropic>=0.40
Requires-Dist: google-genai>=1.0.0
Requires-Dist: kanon-cli==1.2.0
Requires-Dist: pyyaml>=6.0
Description-Content-Type: text/markdown

# token-miser

Claude Code's behavior changes dramatically depending on what's in `CLAUDE.md`, `settings.json`, and any hooks you've wired up. token-miser gives you reproducible numbers instead of vibes.

It runs identical tasks under different configurations (called packages) and measures token usage, cost, and task-completion quality. It supports Claude Code and Codex from the same CLI, so you can compare a vanilla setup against a tuned package — across one agent or both — and know whether the improvement is real.

## Benchmark Results

Benchmark data is committed to `docs/results-{suite}.json` in this repo after each CI run. Open `docs/index.html` locally to view the dashboard.

All experiment data is collected by the repo owner on a consistent setup. See [Submitting a package](#submitting-a-package-for-benchmarking) if you want your package included.

## Concepts

- **Task** — A YAML file describing work for an agent: a prompt, a target repo, success criteria, and a quality rubric.
- **Package** — A configuration directory with a `manifest.yaml` that bundles `CLAUDE.md`/`AGENTS.md`, optional `settings.json` (hooks, permissions), and optional services.
- **Suite** — A named collection of tasks (e.g. `quick`, `standard`, `axis`, `domain-python-api`).
- **Run** — A single execution of a task under one package.
- **Experiment** — One or more runs across baseline/package and agent combinations.

## Requirements

- Python 3.13+
- [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) installed and authenticated (`claude auth status`)
- [Codex CLI](https://developers.openai.com/codex/) installed and authenticated for Codex runs (`codex`)

## Install

```bash
pip install token-miser
```

Or with uv:

```bash
uv tool install token-miser
```

## Quick start

The primary workflow is three commands:

```bash
# 1. Run the benchmark suite
token-miser tune --suite quick --package caveman --yes

# 2. View the cross-package comparison
token-miser matrix --suite quick

# 3. Export a CI score artifact
token-miser score --suite quick --package caveman --output score.json
```

Additional tune examples:

```bash
# Test with both Claude and Codex
token-miser tune --suite quick --package caveman --agent both --yes

# Run one package across the axis suite with both agents
token-miser tune --suite axis --package c-structured --agent both --yes
```

For single-task experiments:

```bash
# Run one task against a package
token-miser run \
  --task benchmarks/tasks/bm-axis-explore.yaml \
  --baseline vanilla \
  --package caveman

# Compare results
token-miser compare --task bm-axis-explore

# List recorded runs, then inspect one by ID
token-miser history
token-miser show 1
```

## Commands

| Command | Purpose |
|---------|---------|
| `tune` | Run a benchmark suite against baseline + package (primary workflow) |
| `matrix` | Cross-package comparison grid (text + JSON export) |
| `score` | Quality-first score report for CI artifacts |
| `run` | Execute a single task under baseline and/or package |
| `compare` | Side-by-side comparison of runs for a task |
| `analyze` | Statistical summary (mean, stdev, median per package) |
| `history` | List all recorded runs |
| `show <id>` | Inspect a specific run in detail |
| `digest` | Export run data for git tracking |
| `suite` | List, validate, or pre-clone suite repos |
| `packages` / `list` | List available packages |
| `publish` | Publish a package to a git repo for kanon distribution |
| `tasks` | List available task YAML files |
| `migrate` | Initialize or migrate the database |

### tune

```bash
token-miser tune \
  --suite axis \
  --package caveman \
  --agent both \
  --model sonnet \
  --timeout 600 \
  --yes
```

| Flag | Default | Purpose |
|------|---------|---------|
| `--suite` | `standard` | Benchmark suite name |
| `--package` | (optional) | Package to test: name or path |
| `--agent` | `claude` | `claude`, `codex`, `openai` (alias), or `both` |
| `--model` | agent-specific | Claude defaults to `sonnet`; Codex to `gpt-5.4`. For Bedrock, use the full cross-region inference ID (e.g. `us.anthropic.claude-sonnet-4-5-20250929-v1:0`) |
| `--timeout` | `600` | Per-task timeout in seconds |
| `--skip-baseline` | off | Reuse the last baseline run |
| `--bare` | off | Skip hooks/plugins (cheaper, less realistic) |
| `--yes` | off | Skip confirmation prompts |

### run

```bash
token-miser run \
  --task benchmarks/tasks/bm-axis-explore.yaml \
  --baseline vanilla \
  --package caveman \
  --agent codex \
  --timeout 600
```

| Flag | Default | Purpose |
|------|---------|---------|
| `--task` | (required) | Path to task YAML |
| `--baseline` | (required) | Baseline: `vanilla`, package name, or path |
| `--package` | (optional) | Package to test: name or path |
| `--agent` | `claude` | `claude`, `codex`, `openai` (alias), or `both` |
| `--model` | agent-specific | Claude defaults to `sonnet`; Codex to `gpt-5.4` |
| `--timeout` | `600` | Per-invocation timeout in seconds |

### score

```bash
token-miser score --suite quick --package caveman --output score.json
```

Generates a quality-first score for the most recent tune session. The output JSON is the artifact uploaded by the CI grading workflow and consumed by badge URLs.

| Flag | Default | Purpose |
|------|---------|---------|
| `--suite` | (required) | Suite name matching the tune run |
| `--package` | (required) | Package name (e.g. `lean`) or full label (e.g. `claude:lean`) |
| `--output` | `token-miser-score.json` | Path to write JSON |

The score report includes the quality rate, cost and token totals for qualifying runs, per-task normalized figures (`cost_per_qual_task`, `tokens_per_qual_task`), and deltas against the baseline package.
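
To make the per-task normalized figures concrete, here is a minimal sketch of how they could be derived. It assumes a "qualifying" run is one that passed its quality check and that each figure is the total over qualifying runs divided by their count. The `Run` dataclass and `summarize` function are hypothetical names for illustration, not token-miser's actual code.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one run; field names are illustrative."""
    task_id: str
    passed: bool
    cost_usd: float
    tokens: int


def summarize(runs: list[Run]) -> dict[str, float]:
    # Assumption: only runs that passed count as "qualifying".
    qualifying = [r for r in runs if r.passed]
    n = len(qualifying)
    return {
        # Expressed as a percentage, matching the "%" suffix used by the badge.
        "quality_rate": 100 * n / len(runs) if runs else 0.0,
        "cost_per_qual_task": sum(r.cost_usd for r in qualifying) / n if n else 0.0,
        "tokens_per_qual_task": sum(r.tokens for r in qualifying) / n if n else 0.0,
    }


runs = [
    Run("a", True, 0.10, 1000),
    Run("b", True, 0.20, 3000),
    Run("c", False, 0.50, 9000),
    Run("d", True, 0.30, 2000),
]
summary = summarize(runs)
print(summary)
```

Under these assumptions, a failed run affects the quality rate but leaves the per-task cost and token averages untouched.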

### Global flags

| Flag | Default | Purpose |
|------|---------|---------|
| `--packages-dir` | `$TOKEN_MISER_PACKAGES_DIR` or `./packages` | Directory containing packages |

Package names (no `/`) resolve to `{packages-dir}/{name}/`. Paths with `/` are used as-is.

## Suites

Six benchmark suites ship in `benchmarks/suites/`:

| Suite | Tasks | Focus |
|-------|-------|-------|
| `quick` | 8 | Fast screening |
| `standard` | 15 | General-purpose benchmark |
| `axis` | 8 | Interaction patterns (explore, multiturn, diff, testrun, bashheavy, smallio, reasoning, bigoutput) |
| `domain-python-api` | 8 | Python API development |
| `domain-iac` | 8 | Infrastructure as code |
| `domain-frontend` | 8 | Frontend development |

```bash
token-miser suite list         # show available suites
token-miser suite validate     # check all task YAMLs are valid
token-miser suite prep         # pre-clone repos to speed up runs
```

## Task format

```yaml
id: my-task
name: "Human-readable name"
repo: "https://github.com/owner/repo"
starting_commit: "abc1234"
prompt: |
  What you want the agent to do...
success_criteria:
  - type: file_exists
    paths: ["some/file.py"]
  - type: command_exits_zero
    command: "uv run pytest"
quality_rubric:
  - dimension: "correctness"
    prompt: "Score 0-1 based on..."
```

Task `repo` fields support `${VAR}` expansion from environment variables.

## How it works

For each run in an experiment:

1. Create an isolated temp directory as `HOME`
2. Clone the task's target repo and check out the starting commit
3. Copy the selected agent's auth/config into the isolated HOME
4. Apply the selected package (copy targets, deep-merge `settings.json`)
5. Invoke the selected backend:
   - Claude: `claude --print --dangerously-skip-permissions --output-format json`
   - Codex: `codex exec --json --sandbox workspace-write --full-auto`
6. Check success criteria against the workspace
7. Optionally score quality via Claude-as-judge (requires `ANTHROPIC_API_KEY`)
8. Store results in SQLite (`~/.token_miser/results.db`)
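
The isolation in step 1 can be sketched as follows. This is an illustrative approximation, not token-miser's implementation: `run_with_isolated_home` is a hypothetical helper, and a child Python process stands in for the agent CLI so the example runs without Claude Code or Codex installed.

```python
import os
import subprocess
import sys
import tempfile


def run_with_isolated_home(cmd: list[str], timeout: int = 600) -> subprocess.CompletedProcess[str]:
    """Run cmd with HOME pointing at a fresh temp directory (illustrative only)."""
    with tempfile.TemporaryDirectory(prefix="tm-home-") as home:
        # Copy the current environment but override HOME for the child process.
        env = {**os.environ, "HOME": home}
        return subprocess.run(
            cmd, env=env, cwd=home, capture_output=True, text=True, timeout=timeout
        )


# Stand-in for the agent CLI: a child process that reports the HOME it sees.
result = run_with_isolated_home(
    [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
)
print(result.stdout.strip())
```

Because the child only ever sees the temporary `HOME`, whatever configuration the package copies there cannot leak into, or pick up state from, your real home directory.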

## Packages

24 packages ship in `packages/`, spanning different optimization strategies:

| Package | Strategy |
|---------|----------|
| `caveman` | Minimal tool use, think before acting |
| `c-structured` | Structured output, systematic approach |
| `tdd-strict` | Strict TDD — failing test first, always |
| `thorough` | Maximize correctness — read everything, explain reasoning |
| `token-miser` | Minimize tokens — terse output, lazy reads, no extras |
| `lean` | Minimal overhead, skip unnecessary steps |
| `personal` | Rubin's full personal config — everything, everywhere, all at once |
| `rtk` | Rust CLI hook that compresses verbose command outputs |
| ... | See `token-miser list` for all 24 |

Each package is a directory with a `manifest.yaml`:

```yaml
name: my-package
version: 0.1.0
author: your-name
description: What this package does
targets:
  - path: AGENTS.md
    dest: AGENTS.md
  - path: settings.json    # optional — deep-merged into experiment config
    dest: settings.json
```

Packages can include `CLAUDE.md`/`AGENTS.md` instructions, `settings.json` for hooks and permissions, and hook scripts. The `settings.json` is deep-merged with any existing experiment configuration rather than overwriting it.
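
A deep merge of this kind can be sketched as below. The exact semantics are an assumption: here nested dictionaries are merged key by key, while lists and scalars from the package replace existing values. `deep_merge` is a hypothetical function and the settings values are illustrative.

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict with overlay merged into base, recursing into nested dicts."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of overwriting.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: lists and scalars from the overlay replace the base value.
            merged[key] = value
    return merged


existing = {"permissions": {"allow": ["Read"]}, "env": {"FOO": "1"}}
package = {"permissions": {"deny": ["WebFetch"]}, "hooks": {"PreToolUse": []}}
merged = deep_merge(existing, package)
print(merged)
```

In this example the existing `permissions.allow` entry survives alongside the package's `permissions.deny`, which a plain overwrite of `settings.json` would not preserve.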

To use packages from another directory:

```bash
export TOKEN_MISER_PACKAGES_DIR=~/.claude/packages
token-miser list
token-miser tune --package lean
```

If you install packages via [kanon](https://github.com/caylent-solutions/kanon), they land in `~/.claude-marketplaces/`. Point token-miser there:

```bash
export TOKEN_MISER_PACKAGES_DIR=~/.claude-marketplaces
token-miser list
```

### Shared instructions

Codex uses `AGENTS.md`. Claude Code uses `CLAUDE.md`. To share one instruction file across agents, use the `@AGENTS.md` import pattern in `CLAUDE.md`:

```md
# CLAUDE.md
@AGENTS.md
```

## Matrix runs

For package screening across a benchmark suite:

```bash
SUITE=quick REPEATS=1 AGENTS=claude,codex \
  ./scripts/run-suite-shared-baseline.sh
```

Runs one shared `vanilla` baseline per `agent x repeat x suite`, then reuses it across packages.

For stricter order-balanced comparisons:

```bash
SUITE=quick REPEATS=2 AGENTS=claude,codex \
  ./scripts/run-suite-crossover.sh
```

Alternates baseline/package order by repeat to control for run-to-run drift.
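
The alternation can be pictured with a short sketch. Which arm goes first on the first repeat is an assumption here, and `crossover_order` is a hypothetical function rather than the logic in the shell script.

```python
def crossover_order(repeats: int) -> list[tuple[str, str]]:
    """Return the (first, second) arm order for each repeat, alternating every repeat."""
    orders = []
    for repeat in range(repeats):
        if repeat % 2 == 0:
            # Assumption: even-numbered repeats run the baseline first.
            orders.append(("baseline", "package"))
        else:
            orders.append(("package", "baseline"))
    return orders


for index, order in enumerate(crossover_order(2), start=1):
    print(f"repeat {index}: {order[0]} then {order[1]}")
```

With an even number of repeats, each arm runs first equally often, so any systematic drift between the first and second run of a pair affects both arms alike.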

## Data

All results are stored locally in `~/.token_miser/results.db` (SQLite). Runs record the agent backend, so Claude and Codex experiments coexist in the same history and reports.

```bash
token-miser history            # compact list of all runs
token-miser show 12            # full token breakdown for one run
token-miser matrix --suite axis --json results.json   # export comparison grid
```

## CI grading

Add automated benchmarking and a quality badge to any package repo.

**Prerequisites**: AWS Bedrock access (Claude Sonnet 4.5 via cross-region inference, us-east-2). Add `AWS_ROLE_TO_ASSUME` as a repository variable in GitHub → Settings → Variables → Actions pointing to an IAM role with `bedrock:InvokeModel` permissions for `us.anthropic.claude-sonnet-4-5-20250929-v1:0`.

**Setup**:
1. Copy [`docs/grade-template.yml`](docs/grade-template.yml) to `.github/workflows/grade.yml` in your package repo
2. Configure your AWS OIDC role ([AWS OIDC guide](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services))
3. Push — the workflow runs on every PR and posts a quality score comment

**Workflow status badge** (green = all tasks passed):

```markdown
[![grade](https://github.com/<owner>/<repo>/actions/workflows/grade.yml/badge.svg)](https://github.com/<owner>/<repo>/actions/workflows/grade.yml)
```

**Score JSON badge** (shows quality rate %, updates each run):

```markdown
[![quality](https://img.shields.io/badge/dynamic/json?url=<raw-score-json-url>&query=%24.quality_rate&label=quality&suffix=%25&color=blue)](https://github.com/<owner>/<repo>/actions/workflows/grade.yml)
```

Host `score.json` somewhere public (GitHub Pages, a gist) and replace `<raw-score-json-url>` with its raw URL.
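
If you generate the badge URL in a script, the percent-encoding can be left to the standard library. This is a small sketch: `quality_badge_url` is a hypothetical helper and `https://example.com/score.json` is a placeholder for wherever you host the file.

```python
from urllib.parse import urlencode


def quality_badge_url(score_json_url: str) -> str:
    """Build the shields.io dynamic JSON badge URL shown above."""
    params = {
        "url": score_json_url,
        "query": "$.quality_rate",  # encoded as %24.quality_rate
        "label": "quality",
        "suffix": "%",              # encoded as %25
        "color": "blue",
    }
    return "https://img.shields.io/badge/dynamic/json?" + urlencode(params)


# Placeholder URL; substitute the raw URL of your hosted score.json.
print(quality_badge_url("https://example.com/score.json"))
```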

**PR comment format** (posted automatically on each pull request):

```
## token-miser grade

|  | this package | baseline |
|---|---|---|
| **quality rate** | 87% (7/8) | 50% (4/8) |
| **cost / task** | $0.1234 | $0.1456 |
| **tokens / task** | 12,345 | 14,567 |
```

## Development

The source repo is private (`caylent/token_miser`). If you have access:

```bash
git clone https://github.com/caylent/token_miser.git
cd token_miser
uv sync
uv run pytest -q
uv run token-miser --help
```

## Submitting a package for benchmarking

All benchmark data is collected by me (Rubin Johnson) on a consistent hardware and model setup. I don't accept self-reported numbers — I run all experiments myself so results are comparable.

**To submit a package:**

1. Open a GitHub issue titled `benchmark request: <package-name>`
2. Include your package files (`CLAUDE.md` / `AGENTS.md`, optional `settings.json`, optional hooks)
3. Describe the strategy in 2-3 sentences: what behavior it changes and why
4. Include reproduction details: model, suite, any env requirements
5. If you have your own numbers, include them — I'll run independently and compare

Packages that test a genuinely new strategy dimension are most likely to get picked up.

If your package is already in a [kanon](https://github.com/caylent-solutions/kanon) registry, link to it in the issue.

## Ecosystem

[kanon](https://github.com/caylent-solutions/kanon) distributes versioned configuration packages. token-miser benchmarks them by running identical tasks under baseline and package configurations, then comparing token usage, cost, and quality. Results can be published back to kanon so teams converge on the best-performing package.

```
kanon distributes  ->  token-miser measures  ->  publish to kanon
(versioned packages)   (A/B comparison)         (best package wins)
```
