§ VIII · Long-form
8 min read · last revised 2026-04-22 · snapshot 2026-06-15T03:47Z
Evaluation
Brier score, log-loss, RPS, and closing-line value; what each metric measures, how to read the Ledger, and why calibration matters more than accuracy.
Graded fairly: the proper-scoring frame
Calibration is the property we care about. Proper scoring rules are the math that makes calibration enforceable. Everything below is one of those rules, or a hypothesis test built on top of one.
The implementation lives in evaluation/accuracy_metrics.py and
evaluation/clv_tracker.py; every threshold, every bootstrap parameter,
and every level on this page is read from
evaluation/pre_reg_constants.yaml, which was sealed and signed before
any 2026 forecast was published.
The epistemic argument for why this frame is the only honest way to grade a probabilistic model lives at Why probabilities.
The three scoring rules
Brier score
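The multiclass Brier score, summed over outcomes:
$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(p_{i,k} - y_{i,k}\right)^2$$
where $p_{i,k}$ is the model's predicted probability for outcome $k$ on match $i$ and $y_{i,k}$ is the one-hot indicator of the realized outcome. The score is non-negative and bounded above by $2$ for $K = 3$. Lower is better.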
The Brier score penalizes large probability errors most heavily because the penalty is quadratic: the more probability a forecast places on an outcome that did not happen, the faster the cost grows. Geometrically, the score is the squared Euclidean distance between the forecast, a point on the probability simplex, and the realized one-hot vector.
We report the unnormalized form. Some references divide by a constant or otherwise rescale the score to a fixed range. The unnormalized form is what the code returns and what the cross-validation battery output encodes.
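As an illustration of the unnormalized form, here is a minimal sketch in plain Python. It is not the code in evaluation/accuracy_metrics.py; the function name `brier_multiclass` and the (H, D, A) index order are assumptions made for the example.

```python
from typing import Sequence


def brier_multiclass(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean over matches of the squared error summed across outcomes (unnormalized)."""
    total = 0.0
    for p, realized in zip(probs, outcomes):
        # One-hot indicator: 1.0 for the realized outcome, 0.0 otherwise.
        total += sum((p_k - (1.0 if k == realized else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)


# Assumed outcome order: index 0 = Home, 1 = Draw, 2 = Away.
forecasts = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5)]
realized = [0, 0]  # both matches ended in a Home win
print(brier_multiclass(forecasts, realized))  # about 0.68
```

The worst case, all probability on a single wrong outcome, scores 2, which is the upper bound for a three-way market.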
Ranked Probability Score
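The Ranked Probability Score (Epstein 1969) generalizes the Brier score to ordered categorical outcomes by squaring the cumulative distribution errors instead of the per-outcome errors:
$$\mathrm{RPS} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{K-1}\sum_{k=1}^{K-1}\left(\sum_{j=1}^{k}\left(p_{i,j} - y_{i,j}\right)\right)^2$$
with $p_{i,j}$ the predicted probability and $y_{i,j}$ the one-hot outcome indicator for category $j$ on match $i$. The implementation uses the natural ordering H ≻ D ≻ A. The normalization by $K - 1$ puts the score in $[0, 1]$ for $K = 3$.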
What RPS gets right that Brier does not: ordered miss-distance. Misclassifying a Home win as an Away win is more wrong than misclassifying it as a Draw, because the categories have a natural rank. Brier treats both mistakes equivalently. RPS rewards distributional sharpness on the ordered 1X2 outcome.
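The ordered miss-distance can be seen in a small sketch. This is an illustration under the H, D, A ordering, not the project's implementation; `rps_match` is a hypothetical name.

```python
from typing import Sequence


def rps_match(p: Sequence[float], realized: int) -> float:
    """Ranked Probability Score for one match, normalized by K - 1."""
    k_total = len(p)
    cum_p = 0.0  # cumulative forecast probability
    cum_y = 0.0  # cumulative one-hot outcome
    total = 0.0
    for k in range(k_total - 1):  # the last cumulative term is always 1 - 1 = 0
        cum_p += p[k]
        cum_y += 1.0 if k == realized else 0.0
        total += (cum_p - cum_y) ** 2
    return total / (k_total - 1)


# Realized outcome: Home win (index 0 in the assumed H, D, A order).
print(rps_match((0.0, 1.0, 0.0), 0))  # all mass on Draw: 0.5
print(rps_match((0.0, 0.0, 1.0), 0))  # all mass on Away: 1.0
```

Both forecasts have the same Brier score (2), but RPS charges the Away forecast twice as much because it misses by two ranks instead of one.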
Log-loss
The cross-entropy loss with a clipping floor:
$$\mathrm{LL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log\max\left(p_{i,k},\ \varepsilon\right)$$
with the floor $\varepsilon$ sealed in pre_reg_constants.yaml under scoring_rules.log_loss_eps_min. Without the clipping floor a probability of exactly $0$ on the realized outcome would produce an infinite loss; the clip preserves a finite penalty for confidently wrong predictions while keeping the metric well-defined.
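Log-loss is the steepest of the three rules at high-confidence wrong predictions. Assigning probability $p$ to the realized outcome incurs $-\log p$, which grows without bound as $p$ falls toward the floor. Assigning probability $q$ to a wrong outcome leaves at most $1 - q$ for the realized one and therefore incurs at least $-\log(1 - q)$. This is the metric the kill criterion is built on, because it is the most discriminating of the three at the model-comparison stage.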
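A minimal sketch of the clipped log-loss follows. The floor value `EPS_MIN` below is a placeholder chosen for the example, not the sealed value under scoring_rules.log_loss_eps_min, and the function name is hypothetical.

```python
import math
from typing import Sequence

EPS_MIN = 1e-15  # placeholder floor for illustration only; the sealed value lives in the YAML


def clipped_log_loss(probs: Sequence[Sequence[float]], outcomes: Sequence[int],
                     eps: float = EPS_MIN) -> float:
    """Mean negative log-probability of the realized outcome, floored at eps."""
    losses = [-math.log(max(p[realized], eps)) for p, realized in zip(probs, outcomes)]
    return sum(losses) / len(losses)


# A coin-flip forecast on the realized outcome costs log(2), about 0.693.
print(clipped_log_loss([(0.5, 0.3, 0.2)], [0]))
# A probability of exactly 0 on the realized outcome stays finite thanks to the floor.
print(clipped_log_loss([(0.0, 0.5, 0.5)], [0]))
```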
Why three rules, not one
Different rules disagree usefully when models trade off calibration in different ways. Brier is the general-purpose calibration measure, RPS rewards distributional sharpness on ordered outcomes, and log-loss is the most punishing of confident wrong predictions.
Reporting all three is the pre-registered commitment; reporting only one would leave the project free to pick the rule that flatters the result. The reliability diagram (a visual calibration check) is published on the Transparency Ledger alongside the per-rule metric values.
Bootstrap confidence intervals
We use the percentile bootstrap with $B$ resamples, where $B$ is sealed under accuracy_bootstrap.n_bootstrap. For any per-match loss array we report the empirical mean and the bootstrap 95% CI on that mean (evaluation/accuracy_metrics.py::_bootstrap_ci).
Resampling is from the match-level loss series (one observation per settled match), so the CI reflects the genuine sampling variability we have at our small sample size.
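The procedure can be sketched with the standard library alone. This is not the body of `_bootstrap_ci`; the function name, the default of 2000 resamples, and the seed are assumptions for the example.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def percentile_bootstrap_ci(losses: Sequence[float], n_bootstrap: int = 2000,
                            level: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for the mean of a per-match loss array."""
    rng = random.Random(seed)
    n = len(losses)
    # Resample matches with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(losses, k=n)) for _ in range(n_bootstrap))
    lo_idx = int((1.0 - level) / 2.0 * n_bootstrap)
    hi_idx = int((1.0 + level) / 2.0 * n_bootstrap) - 1
    return fmean(losses), means[lo_idx], means[hi_idx]


per_match_losses = [0.42, 0.91, 0.35, 1.20, 0.66, 0.58, 0.77, 0.49]
mean, lower, upper = percentile_bootstrap_ci(per_match_losses)
print(mean, lower, upper)
```

With only a handful of settled matches the interval is wide, which is the point: the CI width is an honest statement of how little the sample can say.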
Pairwise model comparison: Diebold-Mariano
The test statistic
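Given two models with per-match losses $L^{(1)}_i$ and $L^{(2)}_i$, define the difference series $d_i = L^{(1)}_i - L^{(2)}_i$. The Diebold-Mariano (1995) test statistic is the standardized mean of that series:
$$\mathrm{DM} = \frac{\bar d}{\sqrt{\hat\sigma^2_{\mathrm{LR}} / N}}$$
where $\bar d$ is the sample mean and $\hat\sigma^2_{\mathrm{LR}}$ is the long-run heteroskedasticity- and autocorrelation-consistent variance estimator. The null is $H_0: \mathbb{E}[d_i] = 0$, that the two models have equal expected loss.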
Newey-West HAC variance and Harvey-Leybourne-Newbold correction
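The HAC variance uses the Newey and West (1987) Bartlett kernel:
$$\hat\sigma^2_{\mathrm{LR}} = \hat\gamma_0 + 2\sum_{j=1}^{m}\left(1 - \frac{j}{m+1}\right)\hat\gamma_j$$
where $\hat\gamma_j$ is the $j$-th sample autocovariance of the loss-difference series $d_i$ and $m$ is the bandwidth. The bandwidth follows a standard rule of thumb in the sample size $N$. The variance of the mean is then $\hat\sigma^2_{\mathrm{LR}} / N$.
At our sample size the asymptotic normal approximation is unreliable, so we apply the Harvey, Leybourne, and Newbold (1997) small-sample correction:
$$\mathrm{DM}^{*} = \sqrt{\frac{N + 1 - 2h + h(h-1)/N}{N}}\;\mathrm{DM}$$
where $h$ is the forecast horizon. The corrected statistic is compared to a Student $t$-distribution with $N - 1$ degrees of freedom rather than to a standard normal. With $N$ in the dozens of matches plus a small hold-out, the correction matters: it widens the rejection thresholds appropriately for the sample size we actually have.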
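A compact sketch of the statistic, the Bartlett-kernel variance, and the small-sample rescaling is given here. It is an illustration, not the project's code: the function name is hypothetical, the bandwidth is passed in explicitly because the sealed rule of thumb is not reproduced, and the horizon defaults to 1 on the assumption of one-step-ahead forecasts.

```python
import math
from typing import Sequence, Tuple


def dm_with_hln(loss_a: Sequence[float], loss_b: Sequence[float],
                bandwidth: int, horizon: int = 1) -> Tuple[float, float]:
    """Return (DM, HLN-corrected DM) for two per-match loss series."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(j: int) -> float:
        # j-th sample autocovariance of the loss-difference series.
        return sum((d[t] - d_bar) * (d[t - j] - d_bar) for t in range(j, n)) / n

    # Newey-West long-run variance with Bartlett weights 1 - j / (bandwidth + 1).
    long_run_var = autocov(0) + 2.0 * sum(
        (1.0 - j / (bandwidth + 1)) * autocov(j) for j in range(1, bandwidth + 1)
    )
    dm = d_bar / math.sqrt(long_run_var / n)  # undefined if the difference series is constant
    hln_factor = math.sqrt((n + 1 - 2 * horizon + horizon * (horizon - 1) / n) / n)
    return dm, hln_factor * dm


dm, dm_star = dm_with_hln([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], bandwidth=0)
print(dm, dm_star)  # the corrected statistic is smaller in magnitude
```

The corrected value would then be compared against Student-t critical values with N - 1 degrees of freedom; the standard library has no t-distribution, so that lookup is left out of the sketch.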
Pre-registered rejection thresholds
Three locked levels live in
pre_reg_constants.yaml::diebold_mariano.comparisons:
| Comparison | $\alpha$ | Notes |
|---|---|---|
| M★ vs M0 | 0.05 | Primary test of model-vs-baseline |
| M★ vs Market | 0.05 | Primary test of model-vs-de-vigged-market |
| All other pairwise | 0.005 | Bonferroni-corrected for the four shadow comparisons |
The Bonferroni correction on the shadow comparisons keeps the family-wise error rate below $0.05$ across the four shadow-vs-baseline tests. The level $\alpha = 0.005$ is stricter than the plain four-way split of $0.05 / 4 = 0.0125$; it accounts for the fact that some shadow comparisons appear under multiple loss types (Brier and log-loss), which inflates the effective number of tests beyond a naive count.
The kill criterion as a Diebold-Mariano application
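The pre-registered kill criterion is the same DM-style comparison applied between M★'s log-loss and M0's log-loss on the cross-validation hold-out. The decision rule is:
$$\frac{\overline{\mathrm{LL}}_{M0} - \overline{\mathrm{LL}}_{M\star}}{\widehat{\mathrm{SE}}} \ge \tau$$
where $\widehat{\mathrm{SE}}$ is the standard error of the log-loss gap and $\tau$ is locked at 2.0 standard errors in pre_reg_constants.yaml::kill_criterion.threshold_standard_errors. Phase 8 returned 1.75 SE, falling 0.25 short of the bar.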
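To make the arithmetic concrete, here is a small sketch of a gap measured in standard-error units and checked against the 2.0 threshold. It assumes a naive paired-difference standard error and hypothetical function names; the formal statement lives at Kill criteria.

```python
import math
from statistics import fmean, stdev
from typing import Sequence

THRESHOLD_SE = 2.0  # value quoted in the text for kill_criterion.threshold_standard_errors


def gap_in_standard_errors(baseline_losses: Sequence[float],
                           champion_losses: Sequence[float]) -> float:
    """Mean per-match log-loss gap (baseline minus champion) in paired-difference SE units."""
    d = [b - c for b, c in zip(baseline_losses, champion_losses)]
    se = stdev(d) / math.sqrt(len(d))  # naive i.i.d. standard error of the mean gap
    return fmean(d) / se


def clears_bar(gap_se: float, threshold: float = THRESHOLD_SE) -> bool:
    """True when the gap reaches the pre-registered threshold."""
    return gap_se >= threshold


print(clears_bar(1.75))  # False: 1.75 SE is 0.25 short of the 2.0 bar
```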
The mathematical statement, the live status block, and the operational consequence live at Kill criteria. What this page contributes is the metric the criterion is built on: log-loss as a proper scoring rule with a Bonferroni-aware rejection threshold and an honest small-sample correction.
Market efficiency: the Nyberg test
The hypothesis
The Nyberg (2014) test asks whether the market closing line incorporates all information in the model's probabilities. If the closing line already absorbs the model's signal, then conditional on the closing line the model probabilities should add no further explanatory power for realized outcomes. Rejection means the model contains information beyond what the closing line reflects.
The multinomial logit specification
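For each match $i$ and outcome $k \in \{H, A\}$ (with Draw as the reference category), fit:
$$\log\frac{\Pr(y_i = k)}{\Pr(y_i = D)} = \alpha_k + \beta_k\,\operatorname{logit}\left(p^{\mathrm{close}}_{i,k}\right) + \gamma_k\,\operatorname{logit}\left(p^{\mathrm{model}}_{i,k}\right)$$
where $p^{\mathrm{close}}_{i,k}$ is the de-vigged closing probability and $p^{\mathrm{model}}_{i,k}$ is the model probability. Logits are clipped at the bound sealed under nyberg.logit_clip.
The likelihood ratio test
The null is $H_0: \gamma_H = \gamma_A = 0$ (the model coefficients contribute nothing once the closing line is conditioned on). The likelihood ratio statistic is:
$$\mathrm{LR} = 2\left(\ell_{\mathrm{unrestricted}} - \ell_{\mathrm{restricted}}\right)$$
where $\ell_{\mathrm{unrestricted}}$ and $\ell_{\mathrm{restricted}}$ are the maximized log-likelihoods with and without the model coefficients.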
Pre-registered critical values from nyberg.comparisons:
| Comparison | $\alpha$ | Critical value |
|---|---|---|
| M★ primary | 0.05 | 5.991 |
| Shadow models (Bonferroni) | 0.0125 | 8.668 |
Rejection at the M★ level says the model's probabilities contain information the market closing line did not absorb. Failure to reject says the closing line already does the same work the model does.
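The M★ critical value of 5.991 is the 5% point of a chi-square distribution with 2 degrees of freedom, which has a closed form. The sketch below assumes exactly two restricted coefficients; the function names and the log-likelihood numbers are made up for illustration.

```python
import math


def lr_statistic(loglik_restricted: float, loglik_unrestricted: float) -> float:
    """Likelihood ratio statistic from the two maximized log-likelihoods."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def chi2_df2_pvalue(lr: float) -> float:
    """Upper-tail probability of a chi-square with 2 degrees of freedom: exp(-x / 2)."""
    return math.exp(-lr / 2.0)


def chi2_df2_critical(alpha: float) -> float:
    """Critical value of a chi-square with 2 degrees of freedom: -2 * ln(alpha)."""
    return -2.0 * math.log(alpha)


print(round(chi2_df2_critical(0.05), 3))  # 5.991

# Hypothetical log-likelihoods, for illustration only.
lr = lr_statistic(loglik_restricted=-62.4, loglik_unrestricted=-59.1)
print(lr > chi2_df2_critical(0.05), chi2_df2_pvalue(lr))
```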
Wald HC3 robustness and opening-line re-fit
A Wald test on the same null hypothesis, using the HC3 sandwich variance estimator, is reported as a robustness panel alongside the LR test. Empirically the two should agree; persistent disagreement signals a mis-specified covariance structure that the LR test alone would not catch.
The same multinomial logit is re-fit using opening-line de-vigged probabilities rather than closing-line. The opening-line re-fit asks: does the model contain information that the opening line did not have?
This catches a subtle robustness concern that the closing-line specification cannot. If the closing line moves in response to the model's published edges, the closing line will partially absorb the model's information by construction, even when the model genuinely was first to it. The opening-line re-fit isolates the model's information advantage at the moment the line opens, before any market adjustment to the model could have happened.
Closing Line Value (CLV)
Why CLV at small N
CLV is the trading community's gold standard for edge detection at small sample size. For a 64-match World Cup with M★'s edges flagged on perhaps 30 to 40 matches, win-rate analysis is dominated by sampling variance: a few unlucky bounces and the win rate looks bad even when the underlying probabilities were correct. CLV is dominated by signal as long as the model's edges actually move the market, because every match contributes a CLV observation regardless of whether the bet won or lost.
Two CLV definitions
The implementation logs both forms (clv_tracker.py::compute_clv_series). The probability-space form is the primary metric:
$$\mathrm{CLV}^{\mathrm{prob}} = p^{\mathrm{close}} - p^{\mathrm{open}}$$
The log-odds form is logged as a secondary, for cross-checking against trading-desk conventions:
$$\mathrm{CLV}^{\mathrm{log}} = \log\frac{o^{\mathrm{open}}}{o^{\mathrm{close}}}$$
where $p$ is the de-vigged probability (power method) on the chosen outcome and $o$ is the decimal odds. Both forms are computed for every paired open/close observation. Inferential procedures use the probability-space form.
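A sketch of the two CLV forms with a power-method de-vig is shown here. It is not compute_clv_series: the function names are hypothetical, the sign convention assumes positive CLV means the close moved toward the chosen outcome, and the de-vig assumes a book with a positive overround.

```python
import math
from typing import List, Sequence, Tuple


def devig_power(decimal_odds: Sequence[float]) -> List[float]:
    """Power-method de-vig: find k with sum((1 / o_i) ** k) == 1, return (1 / o_i) ** k."""
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must be greater than 1")
    raw = [1.0 / o for o in decimal_odds]
    lo, hi = 1.0, 1.0
    while sum(r ** hi for r in raw) > 1.0:  # expand until the powered sum drops to 1 or below
        hi *= 2.0
    for _ in range(100):  # bisection on the exponent
        mid = (lo + hi) / 2.0
        if sum(r ** mid for r in raw) > 1.0:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2.0
    return [r ** k for r in raw]


def clv_pair(open_odds: Sequence[float], close_odds: Sequence[float],
             pick: int) -> Tuple[float, float]:
    """Return (probability-space CLV, log-odds CLV) for the chosen outcome index."""
    clv_prob = devig_power(close_odds)[pick] - devig_power(open_odds)[pick]
    clv_log = math.log(open_odds[pick] / close_odds[pick])
    return clv_prob, clv_log


# Home opened at 2.10 and closed at 1.95: the market moved toward the pick.
print(clv_pair([2.10, 3.40, 3.60], [1.95, 3.50, 3.90], pick=0))
```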
The naive Z-test
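A naive i.i.d. Z-test for $H_0: \mathbb{E}[\mathrm{CLV}] = 0$:
$$Z = \frac{\overline{\mathrm{CLV}}}{s_{\mathrm{CLV}} / \sqrt{n}}$$
where $\overline{\mathrm{CLV}}$ and $s_{\mathrm{CLV}}$ are the sample mean and standard deviation of the $n$ CLV observations. Pre-specified rejection is one-sided, at the critical value sealed under clv.z_test.critical_value. The test is reported as a floor, not as the primary inferential procedure. CLV time series have non-trivial autocorrelation across consecutive matches in the same market regime, and the i.i.d. assumption is too strong. The Hall and Mueller bootstrap below handles the dependence honestly.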
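For reference, the naive statistic is a one-liner. The sketch below uses hypothetical names and made-up CLV values, and it reports a one-sided p-value instead of hard-coding the sealed critical value.

```python
import math
from statistics import NormalDist, fmean, stdev
from typing import Sequence, Tuple


def naive_clv_z(clv: Sequence[float]) -> Tuple[float, float]:
    """Return (Z, one-sided p-value) for H0: mean CLV = 0 under an i.i.d. assumption."""
    n = len(clv)
    z = fmean(clv) / (stdev(clv) / math.sqrt(n))
    p_one_sided = 1.0 - NormalDist().cdf(z)  # upper tail: evidence of positive CLV
    return z, p_one_sided


# Made-up probability-space CLV observations, for illustration only.
z, p = naive_clv_z([0.012, -0.004, 0.020, 0.007, 0.015, -0.002, 0.009])
print(z, p)
```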
The Hall-Mueller bootstrap Sharpe (primary)
This is the primary inferential procedure for genuine edge detection. Five steps:
- Compute the realized Sharpe ratio of the CLV series, $\widehat{\mathrm{SR}} = \overline{\mathrm{CLV}} / s_{\mathrm{CLV}}$.
- Choose the block length, with bounds in clv.bootstrap.
- Generate stationary block bootstrap (Politis and Romano 1994) resamples of the CLV series.
- Compute the bootstrap Sharpe for each resample.
- Apply the Hall and Mueller (1997) bias correction to each bootstrap Sharpe.
The 90% and 95% percentile CIs of the corrected distribution are reported.
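The resampling steps can be sketched as follows. This covers the stationary bootstrap and the percentile lower bounds only; the bias correction is deliberately left out because its formula is not reproduced here. The function names, the mean block length, and the resample count are assumptions for the example.

```python
import random
from statistics import fmean, stdev
from typing import List, Sequence


def sharpe(x: Sequence[float]) -> float:
    """Mean over sample standard deviation; assumes the series is not constant."""
    return fmean(x) / stdev(x)


def stationary_resample(x: Sequence[float], mean_block: float, rng: random.Random) -> List[float]:
    """One stationary-bootstrap resample: geometric block lengths, circular wrap-around."""
    n = len(x)
    out = []
    idx = rng.randrange(n)
    for _ in range(n):
        out.append(x[idx])
        if rng.random() < 1.0 / mean_block:
            idx = rng.randrange(n)  # start a new block at a random position
        else:
            idx = (idx + 1) % n     # continue the current block, wrapping at the end
    return out


def bootstrap_sharpes(x: Sequence[float], mean_block: float,
                      n_resamples: int, seed: int) -> List[float]:
    """Sorted bootstrap Sharpe ratios (uncorrected)."""
    rng = random.Random(seed)
    return sorted(sharpe(stationary_resample(x, mean_block, rng)) for _ in range(n_resamples))


def percentile_lower_bound(sorted_values: Sequence[float], level: float) -> float:
    """Lower end of the two-sided percentile interval at the given level."""
    return sorted_values[int((1.0 - level) / 2.0 * len(sorted_values))]


clv_series = [((i * 37) % 11 - 4) / 100 for i in range(40)]  # synthetic series
dist = bootstrap_sharpes(clv_series, mean_block=3.0, n_resamples=500, seed=1)
print(percentile_lower_bound(dist, 0.95), percentile_lower_bound(dist, 0.90))
```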
Bootstrap reproducibility is governed by a seed derived deterministically from the frozen code SHA.
Anyone with access to the same code SHA can reproduce the full bootstrap distribution byte-for-byte.
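One way such a derivation could look is sketched below. The project's actual scheme is not reproduced on this page, so the hashing choice, the label argument, and the function name are all assumptions; the SHA string is a dummy value.

```python
import hashlib


def seed_from_code_sha(code_sha: str, label: str = "clv_bootstrap") -> int:
    """Hypothetical derivation: hash the code SHA plus a label into a 64-bit integer seed."""
    digest = hashlib.sha256(f"{code_sha}:{label}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Dummy SHA for illustration; the same input always yields the same seed.
seed = seed_from_code_sha("0123456789abcdef0123456789abcdef01234567")
print(seed)
```

Feeding the resulting integer to a seeded generator makes every resample a pure function of the code version, which is what lets a third party regenerate the same bootstrap distribution.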
The three-state decision rule
The bootstrap output decides between three labels
(clv.edge_decision.labels):
| Label | Condition |
|---|---|
| GENUINE_EDGE | 95% CI lower bound $> 0$ |
| WEAK_EDGE | 95% CI lower bound $\le 0$, but 90% CI lower bound $> 0$ |
| NO_EDGE | otherwise |
The three-state structure is itself a pre-registered choice. We could have used a single binary cutoff on the 95% interval. We chose three states because the boundary case (the 95% bound just barely failing) is genuinely ambiguous; labelling it WEAK_EDGE rather than forcing it into either bucket is more honest about what we know.
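The rule reduces to two comparisons. The sketch below assumes the threshold on the Sharpe lower bounds is zero; the function name is hypothetical.

```python
def edge_label(ci95_lower: float, ci90_lower: float, threshold: float = 0.0) -> str:
    """Map the bootstrap lower bounds to one of the three pre-registered labels."""
    if ci95_lower > threshold:
        return "GENUINE_EDGE"
    if ci90_lower > threshold:
        return "WEAK_EDGE"  # the 95% bound fails but the 90% bound still clears
    return "NO_EDGE"


print(edge_label(0.04, 0.09))    # GENUINE_EDGE
print(edge_label(-0.01, 0.03))   # WEAK_EDGE
print(edge_label(-0.08, -0.02))  # NO_EDGE
```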
Pseudo-CLV: shadow models in M★'s clothes
The "Champion's Clothes" principle. Each shadow model M0 through M3 is hypothetically dressed in M★'s exact market-facing machinery:
- the same edge thresholds, 3% on mainline markets and 5% on derivatives, from market.edge_threshold_*,
- the same Volatility Gate state from the live snapshot, so shadow bets are also suppressed when news shocks fire,
- the same Kelly fractions for mainlines and for longshots, from kelly.fraction_*,
- the same opening and closing lines from forecast_log.
The only thing that varies across shadow books is the probability vector each model supplies. The same CLV math runs against each. The implementation lives in evaluation/pseudo_clv.py; the CLV math is delegated to clv_tracker and never re-implemented.
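The idea of holding the market layer fixed while swapping the probability vector can be sketched as below. This is not evaluation/pseudo_clv.py: the edge definition (model probability minus de-vigged opening probability), the function name, and the example numbers are assumptions; only the 3% and 5% thresholds come from the text.

```python
from typing import Dict, List, Sequence

EDGE_THRESHOLD: Dict[str, float] = {"mainline": 0.03, "derivative": 0.05}


def flag_bets(model_probs: Sequence[float], open_probs: Sequence[float],
              market_type: str, gate_open: bool = True) -> List[int]:
    """Outcome indices a book would flag, given one model's probabilities."""
    if not gate_open:
        return []  # the Volatility Gate suppresses every book alike
    threshold = EDGE_THRESHOLD[market_type]
    return [k for k, (p_model, p_open) in enumerate(zip(model_probs, open_probs))
            if p_model - p_open >= threshold]


open_line = (0.45, 0.29, 0.26)  # de-vigged opening probabilities (made up)
books = {
    "champion": (0.52, 0.27, 0.21),  # hypothetical probability vectors
    "shadow": (0.47, 0.29, 0.24),
}
for name, probs in books.items():
    print(name, flag_bets(probs, open_line, "mainline"))
```

Each flagged index would then be passed to the shared CLV routine with the same open and close lines, so any difference between books traces back to the probabilities alone.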
What pseudo-CLV answers: had M0, M1, M2, or M3 been the Champion instead of M★, would it also have generated edges that the market subsequently moved toward? The counterfactual is constrained, not unconstrained. We are not asking what M★ would look like as a different model entirely. We are asking how each shadow model's probabilities would have priced under M★'s exact market layer, edge thresholds, gate, and Kelly sizer. That constraint is what makes the comparison interpretable.
What pseudo-CLV does not answer. It is still bound to the small sample size that makes CLV inference noisy in the first place. A WEAK_EDGE label on a shadow book under the sanity-gate-warning regime does not mean that shadow book would have cleared the kill criterion's structural bar; the kill criterion is a log-loss-based primary criterion plus a 2-SE sanity gate, and pseudo-CLV is a market-edge-based bootstrap. They answer different questions.
A note on the current state. M★ is M2_fifa, sealed in
data/calibration/champion_model.json with CHAMPION_LOCKED: true;
the paired-difference SE convention in
evaluation/cv_battery_result.json reads a 1.75 SE gap and the
sanity-gate warning fired, while the marginal-SE convention in
champion_model.json reads 6.22 SE and clears the bar (see
Kill criteria for the dual reading). No real
Kelly stakes are placed against any model's edges, because the project
does not place real bets at any time; pseudo-CLV continues to grade
us anyway, because it is the same math run on hypothetical bets logged
in forecast_log: a paper portfolio that the metric tracks even when
no real money is at risk. CLV on M★ tracks M2's hypothetical
performance against the closing market; pseudo-CLV for M0, M1, and M3
remains an interesting counterfactual.
What the metrics do not do
The boundary list. Things this evaluation framework deliberately does not capture:
- No win/loss accounting. We do not grade by wins. Sampling variance on a 64-match tournament is too high to learn from win counts alone.
- No expected-value framing against a bankroll. The project does not place real bets, so EV-against-bankroll is not a metric we report.
- No backtest of an alternate betting strategy beyond what pseudo-CLV captures. Hypothetical strategies that change the edge threshold, the Kelly fraction, or the gate rules would be post-hoc by definition, and reporting them would compromise the pre-registration.
- No matchup-level or team-level calibration drift tests. With a 64-match tournament, any per-team binning would have single-digit observation counts. The reliability diagram on the Transparency Ledger reports stage-level breakdowns (Group, R32, R16, QF, SF, Final) at most.
- No latency analysis. The simulation engine is fast enough that sub-second forecasting is not a constraint we model or report.
- No comparison against forecasting services that do not publish their probabilities. We compare against bookmaker-implied closing lines because those are the only well-defined external probabilities we can match the engine against. Comparisons against undisclosed models would require trusting unverifiable claims.
Where to go next
- Kill criteria: the formal mathematical statement of the kill criterion, the live status block, and the operational consequence of firing.
- Pre-registration: the OSF DOI, the signed Git tag, and the sealed pre_reg_constants.yaml that locked every level on this page.
- Models: the cross-validation battery that fired the kill criterion using the metrics described here.
- Why probabilities: the epistemic argument for why proper scoring rules are the only honest grading frame.
- Transparency Ledger: the live ablation surface, with the reliability diagram, the CLV tile plot, and the rolling per-stage metric values.