cto/evals
2026-05-25 13:27:29 -04:00
artifacts Upgrade CTO webui coding profile 2026-05-25 12:57:33 -04:00
fixtures Upgrade CTO webui coding profile 2026-05-25 12:57:33 -04:00
reports Harden CTO sandcastle provider gate 2026-05-25 13:27:29 -04:00
runners Refresh CTO WebUI audit eval proof 2026-05-25 13:21:01 -04:00
expectations.yaml Upgrade CTO webui coding profile 2026-05-25 12:57:33 -04:00
manifest.yaml Upgrade CTO webui coding profile 2026-05-25 12:57:33 -04:00
README.md Add CTO live promotion readiness gate 2026-05-25 13:11:24 -04:00

CTO Eval Suite

This directory holds the test-first promotion and regression suite for the CTO WebUI coding agent PRD.

The suite is evidence-based: no run is accepted on prose alone. Scoring must inspect transcripts, diffs, logs, screenshots, approval events, capsule artifacts, and report YAML.

Run the static PRD gate from the Hermes root:

pytest -q tests/e2e/test_j_cto_webui_prd.py

Score all current evidence reports from cto/:

for r in evals/reports/*.yaml; do python3 evals/runners/score.py "$r"; done
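The shell loop above runs every report through the scorer but does not summarize which ones failed. A minimal Python sketch of the same loop that collects the failures is shown here. It assumes the scorer exits with a non-zero status when a report does not pass; this README does not state that, so treat it as an assumption. The function name `score_all` is hypothetical and not part of the suite.

```python
import subprocess
from pathlib import Path


def score_all(report_dir, scorer_cmd):
    """Run the scorer once per *.yaml report and return the names that failed.

    Assumption: the scorer signals a failing report with a non-zero exit code.
    """
    failed = []
    # Sort so the order is stable from run to run.
    for report in sorted(Path(report_dir).glob("*.yaml")):
        result = subprocess.run([*scorer_cmd, str(report)])
        if result.returncode != 0:
            failed.append(report.name)
    return failed
```

From cto/, a call such as `score_all("evals/reports", ["python3", "evals/runners/score.py"])` would mirror the loop above and return an empty list when every report scores cleanly.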

Run the deterministic local CTO/WebUI regression execution slice from cto/:

./evals/runners/run-webui-cto.sh

Run the executable promotion-suite readiness gate from cto/:

python3 evals/runners/run-promotion-suite.py
python3 evals/runners/score.py evals/reports/2026-05-25-promotion-suite-readiness.yaml

Run the isolated deterministic fixture execution gate from cto/:

python3 evals/runners/run-promotion-fixtures.py
python3 evals/runners/score.py evals/reports/2026-05-25-promotion-fixture-execution.yaml

Run the live-promotion readiness gate from cto/:

python3 evals/runners/run-live-promotion-readiness.py
python3 evals/runners/score.py evals/reports/2026-05-25-live-promotion-readiness.yaml

Check Codex comparative readiness from cto/:

./evals/runners/run-codex-cli.sh

fixtures/manifest.yaml is the deterministic contract layer for the full PRD promotion suite. It verifies that every required eval has a prompt, evidence expectations, event expectations, and gates. It does not claim live promotion success or Codex CLI parity.
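The contract described above (each eval carries a prompt, evidence expectations, event expectations, and gates) can be illustrated with a small completeness check. This is only a sketch: the key names `evals`, `prompt`, `evidence`, `events`, and `gates` are hypothetical, since the real manifest schema is not shown here, and the function works on an already-parsed mapping rather than reading the YAML file itself.

```python
# Hypothetical field names; the real manifest schema may differ.
REQUIRED_FIELDS = ("prompt", "evidence", "events", "gates")


def missing_fields(manifest):
    """Map each eval id to the required fields it lacks or leaves empty."""
    problems = {}
    for eval_id, entry in manifest.get("evals", {}).items():
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems[eval_id] = missing
    return problems
```

An empty result means every eval in the mapping satisfies the contract. As with the manifest itself, passing this check says nothing about live promotion success.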