CTO Eval Suite

This directory holds the test-first promotion and regression suite for the CTO WebUI coding agent PRD.

The suite is evidence-based: no run is accepted on prose alone. Scoring must inspect transcripts, diffs, logs, screenshots, approval events, capsule artifacts, and report YAML.

Run the static PRD gate from the Hermes root:

pytest -q tests/e2e/test_j_cto_webui_prd.py

From cto/, score all current evidence reports:

for r in evals/reports/*.yaml; do python3 evals/runners/score.py "$r"; done
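
The shell loop above keeps going after a failed score and exits with the status of the last report only. If a single pass/fail result is wanted, the same loop can be sketched in Python. This is a minimal sketch, not part of the suite: `score_all` is a hypothetical helper, and it assumes `score.py` exits non-zero when a report fails, which this README does not state.

```python
import subprocess
import sys
from pathlib import Path


def score_all(reports_dir="evals/reports",
              scorer="evals/runners/score.py",
              run=subprocess.run):
    """Score every *.yaml report; return the names of reports that failed.

    Assumption: the scorer signals failure through a non-zero exit code.
    `run` is injectable so the loop can be exercised without the real scorer.
    """
    failed = []
    for report in sorted(Path(reports_dir).glob("*.yaml")):
        # Same command as the shell loop: python3 score.py <report>
        result = run([sys.executable, scorer, str(report)])
        if result.returncode != 0:
            failed.append(report.name)
    return failed
```

Called from cto/ as `sys.exit(1 if score_all() else 0)`, this would fail the whole pass when any single report fails.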

Run the deterministic local CTO/WebUI regression execution slice from cto/:

./evals/runners/run-webui-cto.sh

Run the executable promotion-suite readiness gate from cto/:

python3 evals/runners/run-promotion-suite.py
python3 evals/runners/score.py evals/reports/2026-05-25-promotion-suite-readiness.yaml

Run the isolated deterministic fixture execution gate from cto/:

python3 evals/runners/run-promotion-fixtures.py
python3 evals/runners/score.py evals/reports/2026-05-25-promotion-fixture-execution.yaml

Run the live-promotion readiness gate from cto/:

python3 evals/runners/run-live-promotion-readiness.py
python3 evals/runners/score.py evals/reports/2026-05-25-live-promotion-readiness.yaml

Run the section-20 acceptance audit from cto/:

python3 evals/runners/audit-acceptance.py
python3 evals/runners/score.py evals/reports/2026-05-25-acceptance-audit.yaml

Check Codex comparative readiness from cto/:

./evals/runners/run-codex-cli.sh

fixtures/manifest.yaml is the deterministic contract layer for the full PRD promotion suite. It proves every required eval has a prompt, evidence expectations, event expectations, and gates. It does not claim live promotion success or Codex CLI parity.
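
The completeness property described above can be illustrated with a small check. This is a sketch under assumptions: the key names (`evals`, `id`, `prompt`, `evidence`, `events`, `gates`) are hypothetical and may not match the real manifest schema, and the manifest is shown as an already-parsed dict rather than loaded from YAML.

```python
# Hypothetical field names; the real manifest schema may differ.
REQUIRED_FIELDS = ("prompt", "evidence", "events", "gates")


def missing_fields(manifest):
    """Map each eval id to the required fields it lacks (empty dict means complete)."""
    problems = {}
    for entry in manifest.get("evals", []):
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems[entry.get("id", "<unnamed>")] = missing
    return problems


example = {
    "evals": [
        {"id": "eval-a", "prompt": "p", "evidence": ["diff"],
         "events": ["approval"], "gates": ["g1"]},
        {"id": "eval-b", "prompt": "p", "evidence": [], "events": ["approval"]},
    ]
}
print(missing_fields(example))  # {'eval-b': ['evidence', 'gates']}
```

A check like this only establishes that the contract is fully specified. It says nothing about whether a live run would pass, which matches the limits stated above.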

audit-acceptance.py maps every PRD section 20 acceptance criterion to current evidence and explicit external blockers. It is scoreable evidence for the audit surface, not a production-parity claim.
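
One way to picture that mapping is a per-criterion status derived from its evidence and blockers. The sketch below is illustrative only: the record shape, the status names, and the rule that any blocker outranks evidence are assumptions, not the behavior of audit-acceptance.py.

```python
def audit_status(criterion):
    """Classify one acceptance criterion (hypothetical record shape).

    Assumed rule: an explicit external blocker outranks evidence, and a
    criterion with neither is an audit gap.
    """
    if criterion.get("blockers"):
        return "blocked"
    if criterion.get("evidence"):
        return "evidenced"
    return "unmapped"


def unmapped(criteria):
    """Return ids of criteria that have neither evidence nor a blocker."""
    return [c["id"] for c in criteria if audit_status(c) == "unmapped"]


criteria = [
    {"id": "20.1", "evidence": ["reports/example.yaml"]},
    {"id": "20.2", "blockers": ["external dependency"]},
    {"id": "20.3"},
]
print([audit_status(c) for c in criteria])  # ['evidenced', 'blocked', 'unmapped']
print(unmapped(criteria))                   # ['20.3']
```

Under this reading, the audit surface is complete when `unmapped` returns an empty list, even while some criteria remain blocked.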