Calibration evidence
How Prism shows that the sentinel can discriminate between good, mediocre, and bad trading-agent reasoning.
Prism's sentinel is useful only if it can separate sound reasoning from broken reasoning before capital moves. Calibration evidence is the public trust surface for that claim.
Public startup gate
The public guarantee is a small, reproducible startup discrimination gate:
| Case | Expected quality | Score and verdict |
|---|---|---|
| Good trace | Coherent thesis, evidence-linked risks, calibrated uncertainty | 65 PASS |
| Mediocre trace | Partially supported but overconfident | 42 WARN |
| Bad trace | Unsupported conclusion and broken calibration | 20 REJECT |
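The scores in this table can be checked mechanically against the gate's two conditions: a minimum gap between the good and bad traces, and strict ordering across all three. The following is a minimal sketch of that check, not the code in `test_calibration.py`; `check_gate` and `MIN_GAP` are illustrative names assumed here, and the scores are treated as plain integers.

```python
# Illustrative sketch only: check_gate and MIN_GAP are hypothetical names,
# not part of Prism's published code.
MIN_GAP = 30  # required separation between good and bad scores, in points


def check_gate(good: int, mediocre: int, bad: int, min_gap: int = MIN_GAP) -> dict:
    """Evaluate the two startup-gate conditions for one set of scores."""
    gap = good - bad
    gap_ok = gap >= min_gap                 # good and bad are far enough apart
    monotonic = good > mediocre > bad       # strict ordering across all three
    return {
        "gap": gap,
        "gap_ok": gap_ok,
        "monotonic": monotonic,
        "passed": gap_ok and monotonic,
    }


if __name__ == "__main__":
    # Scores taken from the table above.
    print(check_gate(good=65, mediocre=42, bad=20))
```

Checking both conditions separately matters: a mediocre trace that outscores the good trace would still satisfy the gap requirement, so the gap alone does not establish ordering.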
The required gap between good and bad reasoning is at least 30 points. The current gate separates them by 45 points (65 - 20) and preserves monotonic ordering:
good > mediocre > bad
65 > 42 > 20
Reproduce locally:
uv run pytest apps/sentinel/src/tests/test_calibration.py
Private calibration corpus
Prism also maintains a private local corpus used for release-quality work:
| Corpus slice | Count |
|---|---|
| Total rows | 60 |
| Real harvested Trading-R1 traces | 28 |
| Synthetic seed traces | 20 |
| Mutated adversarial traces | 12 |
| Human-reviewed labels | 43 |
| Frozen pilot slice | 54 |
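One way to read this table: the three source slices (harvested, synthetic, mutated) add up to the 60 total rows, which suggests they partition the corpus, while the human-reviewed and frozen-pilot counts look like overlapping subsets of it. That reading is an assumption, not something the table states. The sketch below only redoes the arithmetic on the published counts; `summarize` and the dictionary keys are illustrative names.

```python
# Counts copied from the table above. Treating the three source slices as a
# partition of the corpus is an assumption made for this sketch.
TOTAL = 60
SOURCES = {"harvested": 28, "synthetic": 20, "mutated": 12}
HUMAN_REVIEWED = 43
FROZEN_PILOT = 54


def summarize() -> dict:
    """Derive shares of the corpus from the published summary counts."""
    return {
        "sources_sum_to_total": sum(SOURCES.values()) == TOTAL,
        "source_share": {name: round(n / TOTAL, 3) for name, n in SOURCES.items()},
        "human_reviewed_share": round(HUMAN_REVIEWED / TOTAL, 3),
        "frozen_pilot_share": round(FROZEN_PILOT / TOTAL, 3),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```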
Raw calibration rows are not committed. Harvested traces can include wallet, position, requester, or market context, so Prism publishes only summary evidence and keeps the corpus local and private.
What this does and does not claim
What it claims:
- the sentinel passes a reproducible startup discrimination gate;
- good, mediocre, and bad reasoning traces produce different verdict bands;
- a broader private corpus exists for release-quality review and regression work;
- raw private rows are intentionally withheld to avoid leaking sensitive context.
What it does not claim yet:
- production-grade LLM-as-judge agreement;
- a public gold set;
- nightly CI release gating on the full corpus;
- that every historical verdict includes a structured issue ledger.
Legacy receipts without a structured issue ledger remain review-gated by the capital gate.
Dashboard evidence
The live dashboard evidence page is here:
https://prism-dashboard-production-e6e3.up.railway.app/calibration
The stats page also includes a calibrationGap metric, but that is different: it is the live score spread between high-scoring and low-scoring verdicts. The /calibration page is the startup discrimination and corpus-evidence surface.
Corpus workflow
The local calibration package supports:
build, harvest, label, freeze, sync, eval, inspect, validate
Inspect the CLI surface:
uv run python -m prism_calibration.cli --help