PrismDocs

Calibration evidence

How Prism proves the sentinel can discriminate good, mediocre, and bad trading-agent reasoning.

Prism's sentinel is useful only if it can separate sound reasoning from broken reasoning before capital moves. Calibration evidence is the public trust surface for that claim.

Public startup gate

The public guarantee is a small, reproducible startup discrimination gate:

Case            Expected quality                                                 Score  Verdict
Good trace      Coherent thesis, evidence-linked risks, calibrated uncertainty   65     PASS
Mediocre trace  Partially supported but overconfident                            42     WARN
Bad trace       Unsupported conclusion and broken calibration                    20     REJECT

The gate requires a gap of at least 30 points between the good and bad traces. The current scores differ by 45 points (65 - 20) and preserve monotonic ordering:

good > mediocre > bad
65   > 42       > 20
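
The two gate conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the contents of test_calibration.py: the function name check_discrimination and the dictionary layout are hypothetical, and only the scores and the 30-point threshold come from this page.

```python
# Scores from the startup gate table on this page.
GATE_SCORES = {"good": 65, "mediocre": 42, "bad": 20}
REQUIRED_GAP = 30  # minimum good-minus-bad separation, in points


def check_discrimination(scores: dict, required_gap: int = REQUIRED_GAP) -> int:
    """Return the good-minus-bad gap, raising ValueError if a gate condition fails.

    Hypothetical helper for illustration; not part of the Prism codebase.
    """
    good, mediocre, bad = scores["good"], scores["mediocre"], scores["bad"]
    # Condition 1: strict monotonic ordering good > mediocre > bad.
    if not good > mediocre > bad:
        raise ValueError(f"ordering violated: {good}, {mediocre}, {bad}")
    # Condition 2: the good/bad gap must meet the required threshold.
    gap = good - bad
    if gap < required_gap:
        raise ValueError(f"gap {gap} is below required {required_gap}")
    return gap


gap = check_discrimination(GATE_SCORES)
print(gap)  # 45
```

A mediocre score that rises above the good score, or a good/bad gap that shrinks below 30, would raise an error under this sketch.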

Reproduce locally:

uv run pytest apps/sentinel/src/tests/test_calibration.py

Private calibration corpus

Prism also maintains a private local corpus used for release-quality work:

Corpus slice                       Count
Total rows                         60
Real harvested Trading-R1 traces   28
Synthetic seed traces              20
Mutated adversarial traces         12
Human-reviewed labels              43
Frozen pilot slice                 54
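
As a quick arithmetic check, the three source slices (harvested, synthetic, mutated) sum to the total of 60 rows. The sketch below assumes those slices are disjoint and that the human-reviewed and frozen-pilot counts are subsets of the total; the page does not say so explicitly, although the counts are consistent with it. The key names and the summarize helper are hypothetical.

```python
# Counts copied from the corpus summary on this page; key names are illustrative.
CORPUS = {
    "total": 60,
    "harvested": 28,
    "synthetic": 20,
    "mutated": 12,
    "human_reviewed": 43,
    "frozen_pilot": 54,
}


def summarize(corpus: dict) -> dict:
    """Check that the source slices add up to the total and compute simple shares."""
    source_sum = corpus["harvested"] + corpus["synthetic"] + corpus["mutated"]
    if source_sum != corpus["total"]:
        raise ValueError(f"source slices sum to {source_sum}, not {corpus['total']}")
    return {
        "source_sum": source_sum,
        # Assumes both counts are subsets of the total rows.
        "human_reviewed_share": corpus["human_reviewed"] / corpus["total"],
        "frozen_pilot_share": corpus["frozen_pilot"] / corpus["total"],
    }


summary = summarize(CORPUS)
print(summary["source_sum"])                     # 60
print(f"{summary['frozen_pilot_share']:.0%}")    # 90%
```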

Raw calibration rows are not committed. Harvested traces can include wallet, position, requester, or market context, so Prism publishes summary evidence and keeps the corpus local/private.

What this does and does not claim

What it claims:

  • the sentinel passes a reproducible startup discrimination gate;
  • good, mediocre, and bad reasoning traces produce different verdict bands;
  • a broader private corpus exists for release-quality review and regression work;
  • raw private rows are intentionally withheld to avoid leaking sensitive context.

What it does not claim yet:

  • production-grade LLM-as-judge agreement;
  • a public gold set;
  • nightly CI release gating on the full corpus;
  • that every historical verdict includes a structured issue ledger.

Legacy receipts without a structured issue ledger remain review-gated by the capital gate.

Dashboard evidence

The live dashboard evidence page is here:

https://prism-dashboard-production-e6e3.up.railway.app/calibration

The stats page also reports a calibrationGap metric, which measures something different: the live score spread between high-scoring and low-scoring verdicts. The /calibration page, by contrast, is the surface for startup discrimination and corpus evidence.

Corpus workflow

The local calibration package supports:

build, harvest, label, freeze, sync, eval, inspect, validate

Inspect the CLI surface:

uv run python -m prism_calibration.cli --help
