Metadata-Version: 2.4
Name: conjure-eval
Version: 0.3.1
Summary: Public-slice harness for the CONJURE transformative-creativity benchmark.
Author-email: Patrick Cooper <patrick.cooper@colorado.edu>
License: Apache-2.0
Keywords: benchmark,lean4,mathlib,llm,creativity
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: verify

# conjure-eval

Public-slice harness for the CONJURE transformative-creativity benchmark.
It ships the 393-instance public split, roughly 70 percent of the 560-instance
Phase 4.8 frozen corpus (510 closed-problem instances across 17 Lakatos
families plus the 50-instance C4-OPEN axis of formalised open mathematical
conjectures; SHA-256
`a8c9842ea4d59072802689603b1e38c679fd1695194aa1cf73f81c076903daf6`),
so that frontier-model developers can self-evaluate locally before submitting
to the hidden split.

This package contains:

- The frozen public-slice corpus JSON (`conjure_eval.data.public_corpus`).
- A CLI for inspecting the corpus, driving a model pass, and checking
  submission files before they are sent to the hidden-split adjudicator.
- The deterministic split provenance, so any third party can re-derive the
  public/hidden split byte-for-byte from the source corpus.
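
A byte-for-byte comparison like the one mentioned in the last item is usually
done by hashing the file and comparing digests. The following is a minimal
sketch using only the standard library. The helper names `sha256_of_file` and
`matches_expected` are hypothetical and not part of `conjure-eval`, and this
README does not say which file the published SHA-256 covers, so the path you
pass in is your own choice.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of the file's size.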

## What this package is and isn't

`conjure-eval` is a self-service developer convenience: it lets a model team
inspect the public contracts, run their model against the public slice, and
smoke-test their submission format before sending results to the benchmark
author. It does not ship the hidden split, and it does not run the
kernel-verified tight-mode adjudicator that produces the headline accept rate.
Those live in the private `blanc` repository and are operated by the benchmark
author against frozen model snapshots; the headline number reported in the
brief is the hidden-split rate.

## Install

```bash
pip install conjure-eval
```

## Usage

```bash
# List all 393 public-slice instance IDs
conjure-eval list-public

# Inspect a single instance
conjure-eval show C1-bv-001

# Drive a model pass (OpenAI-compatible endpoint)
conjure-eval run \
    --base-url https://your-endpoint/v1 \
    --api-key-env MY_API_KEY \
    --model your-model-name \
    --out submissions.jsonl

# Check submission file well-formedness before sending
conjure-eval verify-submission submissions.jsonl

# Print corpus provenance fields
conjure-eval provenance
```
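
Before running `verify-submission`, it can help to confirm that the output
file is valid JSON Lines at all. The sketch below is a standalone pre-check
and does not reproduce what `verify-submission` does. The submission schema is
not documented in this README, so the check assumes only that every line
should be a single JSON object. The function name `find_malformed_lines` is
hypothetical.

```python
import json
from typing import Iterable


def find_malformed_lines(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, reason) for each line that is not a JSON object."""
    problems: list[tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        # Assumption: each submission record is an object, not a list or scalar.
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
    return problems


if __name__ == "__main__":
    with open("submissions.jsonl", encoding="utf-8") as handle:
        for number, reason in find_malformed_lines(handle):
            print(f"line {number}: {reason}")
```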

## Provenance

The public corpus is a deterministic 70/30 axis-stratified slice of the
560-instance Phase 4.8 frozen corpus maintained in the private `blanc`
repository (510 closed-problem instances + 50 C4-OPEN instances). The split
seed is `4317`. Anyone with the source corpus can reproduce both slices via
`scripts/build_conjure_split.py`. Every C4-OPEN instance carries a
snapshot-pinned open-status certificate generated by
`scripts/certify_open_status.py`; certificate JSONs live under
`instances/open_status_certificates/`.
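
To illustrate what a deterministic, seeded, axis-stratified split looks like
in general, here is a minimal sketch. It is not the algorithm in
`scripts/build_conjure_split.py` and will not reproduce the published split.
The function name `stratified_split`, the `axis_of` callback, the per-axis
rounding rule, and the use of Python's `random.Random` are all assumptions
made for illustration.

```python
import random
from collections import defaultdict
from typing import Callable, Iterable


def stratified_split(
    instance_ids: Iterable[str],
    axis_of: Callable[[str], str],
    public_fraction: float = 0.7,
    seed: int = 4317,
) -> tuple[list[str], list[str]]:
    """Split IDs into (public, hidden), sampling each axis separately."""
    by_axis: dict[str, list[str]] = defaultdict(list)
    for instance_id in instance_ids:
        by_axis[axis_of(instance_id)].append(instance_id)

    rng = random.Random(seed)
    public: list[str] = []
    hidden: list[str] = []
    # Sort axes and members so the result does not depend on input order.
    for axis in sorted(by_axis):
        members = sorted(by_axis[axis])
        rng.shuffle(members)
        # Assumption: round per axis, so totals may differ slightly from 70%.
        cut = round(len(members) * public_fraction)
        public.extend(members[:cut])
        hidden.extend(members[cut:])
    return sorted(public), sorted(hidden)
```

Because each axis is rounded independently, the overall public share of a
split like this need not be exactly the nominal fraction.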
