Metadata-Version: 2.4
Name: sysml-bench
Version: 0.1.0
Summary: LLM tool augmentation benchmark for SysML v2 model comprehension.
Project-URL: Homepage, https://gitlab.com/nomograph/sysml-bench
Project-URL: Repository, https://gitlab.com/nomograph/sysml-bench
Project-URL: Documentation, https://gitlab.com/nomograph/sysml-bench/-/blob/main/README.md
Project-URL: Issues, https://gitlab.com/nomograph/sysml-bench/-/issues
Project-URL: Changelog, https://gitlab.com/nomograph/sysml-bench/-/blob/main/CHANGELOG.md
Author-email: Andrew Dunn <andrew@dunn.dev>
License: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.12
Requires-Dist: fastembed>=0.4
Requires-Dist: litellm>=1.40
Requires-Dist: numpy>=1.26
Requires-Dist: pydantic>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: scipy>=1.11
Requires-Dist: tiktoken>=0.7.0
Description-Content-Type: text/markdown

# sysml-bench

[![Nomograph Labs](https://img.shields.io/badge/Nomograph_Labs-1a1a1a?style=flat&labelColor=f2f0eb&color=1a1a1a)](https://nomograph.ai)
[![License: MIT](https://img.shields.io/badge/License-MIT-1a1a1a.svg)](LICENSE)
[![Pipeline](https://gitlab.com/nomograph/sysml-bench/badges/main/pipeline.svg)](https://gitlab.com/nomograph/sysml-bench/-/pipelines)
[![PyPI](https://img.shields.io/pypi/v/sysml-bench?color=1a1a1a)](https://pypi.org/project/sysml-bench/)
[![Dataset](https://img.shields.io/badge/🤗_Dataset-sysml--v2--reasoning--benchmark-1a1a1a)](https://huggingface.co/datasets/nomograph/sysml-v2-reasoning-benchmark)

Benchmarking harness for evaluating LLM performance on SysML v2 model
comprehension tasks with CLI tool augmentation. Part of
[Nomograph Labs](https://nomograph.ai).

**Repository**: [gitlab.com/nomograph/sysml-bench](https://gitlab.com/nomograph/sysml-bench)

## What This Is

A Python evaluation framework that measures how different tool configurations
affect LLM accuracy on structured systems engineering tasks. The benchmark
uses a real SysML v2 model corpus (an Eve Online Mining Frigate design) and
tests five models across 40+ experimental conditions with 132 active tasks.

The primary question: **does giving an LLM more tools improve its ability to
answer questions about a SysML v2 model?** The answer depends on the task
type, the model, and whether the agent receives guidance on tool selection.

## Key Observations

The evaluation produced 14 observations. Selected highlights:

| Observation | Result |
|-------------|--------|
| **O12: Guided tool selection** | One sentence of tool selection guidance eliminates the discovery penalty entirely — guided graph 0.885 vs unguided 0.750 |
| **O4: Render vs assembly** | Pre-rendered views score 0.868 vs 0.399 for agentic assembly on explanation tasks (47pp gap, 6.6x lower cost) |
| **O1: Tool-task interaction** | Graph tools hurt discovery but help layer tasks. The effect is heterogeneous — no single tool set dominates |
| **O10: Corpus scaling** | Performance collapses from 0.880 at 19 files to 0.423 at 95 files; graph tools and vectors don't help at scale |
| **O2: Model quality gap** | Sonnet 0.880 vs best OpenAI 0.529 — 35pp gap, but GPT-4o-mini is 87x more cost-efficient |
| **O8: CLI vs RAG** | CLI dominates on discovery (+0.289); RAG edges CLI on reasoning (+0.136) |
| **O14: Structural traces** | Graph tools don't help even at 4-5 hop traces on a 19-file corpus |
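
The percentage-point gaps quoted in the table follow directly from the score pairs. A minimal sketch of that arithmetic, using only scores listed above (the helper name `pp_gap` is illustrative and not part of the harness):

```python
def pp_gap(higher: float, lower: float) -> float:
    """Return the gap between two scores in percentage points."""
    return (higher - lower) * 100


# Score pairs copied from the observations table.
pairs = {
    "O12 guided vs unguided graph": (0.885, 0.750),
    "O4 pre-rendered vs agentic assembly": (0.868, 0.399),
    "O10 19 files vs 95 files": (0.880, 0.423),
    "O2 Sonnet vs best OpenAI": (0.880, 0.529),
}

for label, (higher, lower) in pairs.items():
    print(f"{label}: {pp_gap(higher, lower):.1f}pp")
```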

## Corpus

**Eve Online Mining Frigate** — a SysML v2 model of a fitted mining ship
designed for the Eve Online universe. 19 files, 798 elements, 1,515
relationships. Covers requirements, concerns, stakeholders, logical
architecture, COTS modules, interfaces, verification cases, and rollup
analysis.

A secondary corpus of 95 files (SysML v2 specification examples) is used for
scaling experiments.

## Task Categories

### Primary corpus (Eve Online Mining Frigate, 19 files)

| Category | Count | File | Purpose |
|----------|-------|------|---------|
| Discovery | 16 | `eve_discovery.yaml` | Extract attributes, follow references, enumerate elements |
| Reasoning | 12 | `eve_reasoning.yaml` | Multi-hop traversal, counterfactual analysis, exhaustive enumeration |
| Explanation | 8 | `eve_explanation.yaml` | Generate human-readable descriptions from model data |
| Generative | 8 | `eve_generative.yaml` | Open-ended generation tasks with LLM-scored evaluation |
| Layer | 20 | `eve_layer_tools.yaml` | RFLP layer classification, coverage metrics |
| Boundary | 8 | `eve_boundary.yaml` | 2-3 hop traversals at the graph-tool benefit threshold |
| Vector-sensitive | 8 | `eve_vector_sensitive.yaml` | Paraphrase-gap tasks testing semantic retrieval |
| Structural trace | 8 | `eve_structural_trace.yaml` | 4-5 hop traces and exhaustive chain enumeration |

### Multi-corpus discovery (scaling and generalization)

| Corpus | Count | File | Purpose |
|--------|-------|------|---------|
| SysML v2 examples | 20 | `examples_discovery.yaml` | Discovery on 95-file specification corpus |
| Arrowhead | 6 | `arrowhead_discovery.yaml` | Discovery on Arrowhead framework model |
| Drone | 4 | `drone_discovery.yaml` | Discovery on drone system model |
| HVAC / HSUV | 6 | `hvac_hsuv_discovery.yaml` | Discovery on HVAC and hybrid SUV models |
| Vehicle | 8 | `vehicle_discovery.yaml` | Discovery on vehicle system model |

**Total: 132 active tasks** across 13 task files. An additional 82 archived tasks
from earlier experimental phases are in `eval/tasks/archive/`.
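
The total can be cross-checked against the two tables: 88 tasks on the primary corpus plus 44 multi-corpus discovery tasks. A short tally, with the counts copied by hand from the tables rather than read from the YAML files:

```python
# Task counts copied from the two tables above, one entry per task file.
primary = {
    "Discovery": 16, "Reasoning": 12, "Explanation": 8, "Generative": 8,
    "Layer": 20, "Boundary": 8, "Vector-sensitive": 8, "Structural trace": 8,
}
multi_corpus = {
    "SysML v2 examples": 20, "Arrowhead": 6, "Drone": 4,
    "HVAC / HSUV": 6, "Vehicle": 8,
}

primary_total = sum(primary.values())
multi_total = sum(multi_corpus.values())
file_count = len(primary) + len(multi_corpus)

print(f"primary={primary_total} multi-corpus={multi_total} "
      f"total={primary_total + multi_total} files={file_count}")
```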

## Tool Sets

| Tool Set | Tools | Schema Tokens | Description |
|----------|-------|---------------|-------------|
| `cli_search` | `sysml_search`, `read_file` | ~250 | Search + read. Baseline. |
| `cli_graph` | cli_search + `sysml_trace`, `sysml_check`, `sysml_query`, `sysml_inspect` | ~1120 | Graph traversal tools |
| `cli_render` | cli_graph + `sysml_render` | ~1300 | Server-side Markdown rendering |
| `cli_full` | cli_render + `sysml_stat`, `sysml_plan` | ~1500 | Full tool set including planning |
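
The tool sets are cumulative: each row adds tools to the one above it. A sketch of that layering as plain Python lists, assuming `cli_graph` extends `cli_search` the same way the later rows extend their predecessors (the variable names are illustrative, not the harness's own configuration):

```python
# Tool names copied from the table above; each set extends the previous one.
CLI_SEARCH = ["sysml_search", "read_file"]
CLI_GRAPH = CLI_SEARCH + ["sysml_trace", "sysml_check", "sysml_query", "sysml_inspect"]
CLI_RENDER = CLI_GRAPH + ["sysml_render"]
CLI_FULL = CLI_RENDER + ["sysml_stat", "sysml_plan"]

TOOL_SETS = {
    "cli_search": CLI_SEARCH,
    "cli_graph": CLI_GRAPH,
    "cli_render": CLI_RENDER,
    "cli_full": CLI_FULL,
}

for name, tools in TOOL_SETS.items():
    print(f"{name}: {len(tools)} tools")
```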

## Scoring

Per-field structured scoring with the following types:

| Type | Mechanism |
|------|-----------|
| `Bool` | Exact boolean match → 1.0 or 0.0 |
| `Float` | Within tolerance (default ±0.05) → 1.0 or 0.0 |
| `Str` | Case-insensitive exact match with qualified-name suffix matching (`A::B` matches `B`) |
| `StrContains` | Case-insensitive substring match |
| `ListStr` | Set-based F1 score with threshold (default 0.8). Supports qualified-name matching. Binary: ≥threshold → 1.0, else 0.0 |
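
The field types above can be sketched as small pure functions. This is an illustration of the described behavior, not the harness's actual code: the function names are hypothetical, the `Float` tolerance is assumed to be absolute, `StrContains` is assumed to look for the expected string inside the answer, and the F1 computation uses one plausible way of counting qualified-name matches.

```python
def names_match(expected: str, actual: str) -> bool:
    """Case-insensitive match; `A::B` matches `B` (qualified-name suffix)."""
    e, a = expected.strip().lower(), actual.strip().lower()
    return e == a or e.endswith("::" + a) or a.endswith("::" + e)


def score_bool(expected: bool, actual: bool) -> float:
    return 1.0 if expected == actual else 0.0


def score_float(expected: float, actual: float, tolerance: float = 0.05) -> float:
    # Assumption: the tolerance is absolute, not relative.
    return 1.0 if abs(expected - actual) <= tolerance else 0.0


def score_str(expected: str, actual: str) -> float:
    return 1.0 if names_match(expected, actual) else 0.0


def score_str_contains(expected: str, actual: str) -> float:
    # Assumption: the expected string must appear inside the answer.
    return 1.0 if expected.lower() in actual.lower() else 0.0


def score_list_str(expected: list[str], actual: list[str],
                   threshold: float = 0.8) -> float:
    """Set-based F1 over the two lists, binarized at `threshold`."""
    exp = {s.strip().lower() for s in expected}
    act = {s.strip().lower() for s in actual}
    if not exp and not act:
        return 1.0
    matched_act = sum(any(names_match(e, a) for e in exp) for a in act)
    matched_exp = sum(any(names_match(e, a) for a in act) for e in exp)
    if matched_act == 0 or matched_exp == 0:
        return 0.0
    precision = matched_act / len(act)
    recall = matched_exp / len(exp)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 if f1 >= threshold else 0.0
```

Because `ListStr` is binarized, an answer with one spurious item out of four can still score 1.0, while an answer with two spurious items out of five falls below the default threshold and scores 0.0.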

Task score = mean of field scores. Condition score = mean of task scores across N runs.
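
A minimal sketch of this aggregation, assuming the condition score averages over every (task, run) pair. The function names, task IDs, field names, and score values below are hypothetical and chosen only for illustration:

```python
from statistics import mean


def task_score(field_scores: dict[str, float]) -> float:
    """Mean of the per-field scores for one task in one run."""
    return mean(field_scores.values())


def condition_score(runs: list[dict[str, dict[str, float]]]) -> float:
    """Mean of task scores over all runs.

    `runs` holds one dict per run, mapping task ID -> field scores.
    """
    return mean(task_score(fields) for run in runs for fields in run.values())


# Hypothetical field scores for two tasks over two runs.
runs = [
    {"D1": {"mass": 1.0, "owner": 0.0}, "D2": {"count": 1.0}},
    {"D1": {"mass": 1.0, "owner": 1.0}, "D2": {"count": 0.0}},
]

print(condition_score(runs))
```

When every run covers the same tasks, this is equivalent to averaging the per-run means.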

## Reproduction

### Prerequisites

- Python ≥3.12
- [uv](https://docs.astral.sh/uv/) package manager
- `nomograph-sysml` CLI binary on `$PATH`
- `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` environment variables

### Setup

```bash
git clone https://gitlab.com/nomograph/sysml-bench.git
cd sysml-bench
uv sync
```

### Run an experiment

```bash
uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search \
    --runs 3 --max-turns 15 \
    --output results/my-experiment.json
```
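
To compare tool sets on the same task file, the command above can be generated once per tool set. The sketch below only builds and prints the commands; it uses the flags shown above and nothing else. The per-tool-set output file naming and the helper name `build_command` are assumptions.

```python
import shlex

# Tool set names from the "Tool Sets" table.
TOOL_SETS = ["cli_search", "cli_graph", "cli_render", "cli_full"]


def build_command(tool_set: str, runs: int = 3) -> list[str]:
    """Build the argv list for one experiment with the given tool set."""
    return [
        "uv", "run", "python", "-m", "eval.llm_cli",
        "--task-file", "eval/tasks/eve_discovery.yaml",
        "--models", "claude-sonnet-4-20250514",
        "--tool-set", tool_set,
        "--runs", str(runs), "--max-turns", "15",
        # Assumption: one output file per tool set under results/.
        "--output", f"results/{tool_set}.json",
    ]


for tool_set in TOOL_SETS:
    print(shlex.join(build_command(tool_set)))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.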

### With Docker

```bash
docker pull registry.gitlab.com/nomograph/sysml-bench:latest
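# Mount a host directory at /results so the output file survives container exit.
docker run -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
    -v "$(pwd)/results:/results" \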
    registry.gitlab.com/nomograph/sysml-bench:latest \
    python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --runs 1 --max-turns 15 \
    --output /results/my-experiment.json
```

### With guided system prompt

```bash
uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_graph \
    --system-prompt-file eval/prompts/guided_discovery.txt \
    --runs 5 --max-turns 15 \
    --output results/guided-experiment.json
```

### With vector search

```bash
uv run python -m eval.llm_cli \
    --task-file eval/tasks/eve_discovery.yaml \
    --models claude-sonnet-4-20250514 \
    --tool-set cli_search --vectors \
    --runs 3 --max-turns 15 \
    --output results/vector-experiment.json
```

## Results

Result JSON files are stored separately in
[nomograph/sysml-bench-results](https://gitlab.com/nomograph/sysml-bench-results)
(private — contains model outputs and cost data). Each result file is a JSON
document containing run metadata and per-task scored results with cost, token
counts, and per-field scoring breakdowns.

When running experiments, output files are written to a local `results/`
directory (gitignored in this repo).

## Dataset

The benchmark tasks and baseline results are available on HuggingFace:

```python
from datasets import load_dataset
tasks = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="tasks")
results = load_dataset("nomograph/sysml-v2-reasoning-benchmark", split="results")
```

See [nomograph/sysml-v2-reasoning-benchmark](https://huggingface.co/datasets/nomograph/sysml-v2-reasoning-benchmark) for the full dataset card.

## Known Limitations

1. **Corpus size**: Primary corpus is 19 files. Results may not generalize to
   production-scale models (1000+ files). Scaling experiments (O10) show
   dramatic performance drops at 95 files.

2. **Model versions**: All results are from specific model snapshots
   (claude-sonnet-4-20250514, gpt-4o-2024-08-06, etc.). Results may differ
   with future model versions.

3. **Stochastic variance**: Some tasks show high run-to-run variance (e.g.,
   D12, D13 are bimodal). N=3-5 replication mitigates but does not eliminate
   this.

4. **ListStr scoring**: The set-based F1 scorer handles flat string lists
   only. Compound answer types (list-of-dicts) require decomposition into
   separate fields.

5. **Single domain**: All tasks target SysML v2 models. The observations may
   not generalize to other modeling languages or domains.
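
The decomposition mentioned in limitation 4 can be sketched as follows. The compound answer, its keys, and the helper name `decompose` are hypothetical; the point is only that a list of dicts becomes one flat string list per key, each of which can then be scored as a separate `ListStr` field.

```python
# Hypothetical compound answer: each entry pairs a module with its slot.
compound_answer = [
    {"module": "MinerI", "slot": "high"},
    {"module": "CargoExpander", "slot": "low"},
]


def decompose(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Turn a list of dicts into one flat string list per key."""
    fields: dict[str, list[str]] = {}
    for row in rows:
        for key, value in row.items():
            fields.setdefault(key, []).append(value)
    return fields


print(decompose(compound_answer))
```

One consequence of this approach is that the pairing between keys is no longer checked: an answer with the right modules and the right slots, but matched up incorrectly, would still score well on both fields.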

## License

MIT — see [LICENSE](LICENSE)

## Citation

```bibtex
@article{dunn2026sysmlbench,
  title={sysml-bench: Evaluating Tool-Augmented {LLMs} on {SysML} v2 Model Comprehension},
  author={Dunn, Andrew},
  year={2026},
  journal={arXiv preprint},
  url={https://gitlab.com/nomograph/sysml-bench},
  note={Nomograph Labs}
}
```
