§ V · Status page

8 min read · last revised 2026-04-22 · snapshot 2026-07-30T01:17Z

Kill criteria

If M★ performs worse than the null baseline by the Round of 16, this project publishes a null result.

By The 45% Problem project

In Phase 8, the kill criterion's pre-flight sanity gate fired a warning. M2, the candidate that won the cross-validation log-loss battery, beat M0 by 1.75 standard errors under one of two SE conventions on disk, falling short of the pre-registered 2.0 SE bar by 0.25 SE under that reading. Under the other reading, in data/calibration/champion_model.json, the gap is 6.22 SE and the bar is cleared decisively. The pre-registered consequence on a sanity-gate firing was pivot_paper_framing, the procedural obligation reflected in this essay. M2 stays as M★, sealed in champion_model.json with CHAMPION_LOCKED: true. The live R16 checkpoint was evaluated on 2026-07-07 and did not fire: M2 was 1.47 SE better than M0 on the 72 pre-registered forecasts.

This page describes the criterion that fired the warning, the inequality it encodes, the two SE conventions that produce different gap numbers from the same locked data, the two stages at which the criterion is evaluated, and what the firing has and has not changed about the project's operations. The procedural argument for why a pre-registered stopping rule is necessary at all sits at the bottom; the substantive event the rule produced sits at the top.

The mathematical statement

The kill criterion is a two-condition gate. Both conditions must hold for M★ to remain the live trading model. Failure on either condition fires the criterion.

The two-condition gate

The first condition is that M★ has the lowest mean cross-validation log-loss across the candidate set $\{M_0, M_1, M_2, M_3\}$. The second condition is that M★'s log-loss must beat M0's by at least 2 standard errors of the difference.

The two-condition structure is load-bearing. Without the first condition, the project could re-label any model as M★ retroactively. Without the second condition, the project could promote a model that beats M0 by a hair on sampling variance and call the result an edge. The pre-registration sealed both conditions before the tournament began.

The decision inequality

Let $d_i = \mathcal{L}^{M_0}_i - \mathcal{L}^{M^{\star}}_i$ be the per-match log-loss difference (positive $d$ means M★ is better than M0). The criterion requires:

$$\overline{d} \;\geq\; 2 \cdot \mathrm{SE}\!\big(\overline{d}\big)$$

The criterion fires when this inequality fails:

$$\overline{d} \;<\; 2 \cdot \mathrm{SE}\!\big(\overline{d}\big)$$

The firing condition is "M★ does not beat M0 by 2 SE." This can happen in two ways. M★ might beat M0 by less than 2 SE, which is what the paired-difference SE reading in evaluation/cv_battery_result.json reports for Phase 8: a 1.75 SE gap, falling 0.25 SE short of the bar. M★ might also fail to beat M0 at all, which would be a stronger failure mode. Both fire the criterion under that reading; both trigger the same pre-registered consequence (pivot_paper_framing). The marginal-SE reading in champion_model.json reports 6.22 SE for the same comparison, which clears the bar. The two readings, and what they mean, are documented in the "Dual SE reading" subsection below.
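The inequality can be sketched in a few lines of Python. This is a minimal illustration, not the project's pipeline: the function name `kill_criterion` is hypothetical, and it assumes that $\mathrm{SE}(\overline{d})$ is the plain sample standard deviation of the per-match differences divided by $\sqrt{n}$, with no autocorrelation correction.

```python
import math
from statistics import mean, stdev


def kill_criterion(ll_m0, ll_mstar, threshold_se=2.0):
    """Evaluate d_bar >= threshold_se * SE(d_bar) on paired per-match log-losses.

    Assumption: SE(d_bar) = sample std of the differences / sqrt(n).
    """
    if len(ll_m0) != len(ll_mstar) or len(ll_m0) < 2:
        raise ValueError("need two equal-length series with at least 2 matches")
    # d_i = L_i^{M0} - L_i^{M*}; a positive value means M* is better than M0.
    d = [a - b for a, b in zip(ll_m0, ll_mstar)]
    d_bar = mean(d)
    se = stdev(d) / math.sqrt(len(d))
    return {
        "d_bar": d_bar,
        "se": se,
        # Gap expressed in SE units; undefined if every difference is identical.
        "gap_in_se": d_bar / se if se > 0 else None,
        # The criterion fires when the pre-registered inequality fails.
        "fires": d_bar < threshold_se * se,
    }
```

Both failure modes described above come out of the same comparison: a positive but noisy mean difference fires the criterion just as a negative one does.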

The threshold value is 2.0 standard errors, sealed in pre_reg_constants.yaml::kill_criterion.threshold_standard_errors and mirrored at kill.ll_gap_se. The threshold cannot be modified during the tournament without an OSF amendment, which is itself a public artifact.

Why two standard errors

The choice of 2 SE rather than zero is a deliberate buffer at our sample size. With 64 World Cup matches plus a small calibration hold-out, a 1-SE rule fires on roughly 16% of repeated draws of an honest null hypothesis. The 2-SE bar moves the false-positive rate to roughly 2.5% one-sided, while still being permissive enough that any genuine model improvement should clear it.

The 2-SE choice also accounts for the multiple-comparison structure across the four shadow models. A 1-SE rule applied independently to each shadow candidate would inflate the family-wise false-positive rate to roughly 50%; the 2-SE rule, combined with the Bonferroni correction on the Diebold-Mariano comparisons described in Evaluation, keeps the family-wise rate below 5%.
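The tail rates quoted in this subsection can be checked with a short calculation. The sketch below assumes a standard normal test statistic and fully independent comparisons across four candidates; both are simplifying assumptions made for illustration, and the helper names are hypothetical.

```python
import math


def upper_tail(z):
    """One-sided P(Z >= z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def family_wise_rate(per_test_rate, n_tests):
    """P(at least one false positive) across n_tests independent tests (an assumption)."""
    return 1.0 - (1.0 - per_test_rate) ** n_tests


one_se = upper_tail(1.0)                  # about 0.159
two_se = upper_tail(2.0)                  # about 0.023
fw_one_se = family_wise_rate(one_se, 4)   # about 0.50
fw_two_se = family_wise_rate(two_se, 4)   # about 0.088
```

Under these assumptions the 2-SE rule on its own leaves a family-wise rate near 9% across four comparisons, which is consistent with the text's reliance on the Bonferroni correction to bring the rate under 5%.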

Dual SE reading

The same 0.041 log-loss gap between M2 and M0 produces two different SE numbers in the two locked CV files. Both files are signed; both are sealed; both are part of the OSF pre-registration record. They disagree because they use different standard-error conventions, not because the underlying data disagrees.

The marginal-SE reading

data/calibration/champion_model.json reports $\sigma_{CV} = 0.006587$, M2's own marginal sigma over the five-fold cross-validation mean. Dividing the magnitude of the locked $\Delta_{\mathrm{vs}\,M_0} = -0.04096$ by this sigma yields a gap of $6.22$ SE. The file carries CHAMPION_LOCKED: true. Under this convention the sanity bar is cleared decisively.

The marginal-SE reading treats each model's CV-mean uncertainty as a property of that model alone, comparable to a confidence interval on the model's own log-loss. It does not directly model the per-fold correlation between M2 and M0 losses on the same matches.

The paired-difference SE reading

evaluation/cv_battery_result.json reports $\mathrm{m\_star\_vs\_m0\_gap\_se} = 1.7518$, computed as a paired-difference SE on per-fold log-loss differences. The file carries sanity_gate_passed: false and a decision_narrative that contains the phrase "WARNING: sanity gate NOT passed". Under this convention the bar is not cleared and the sanity-gate warning fires.

The paired-difference SE acknowledges that M2 and M0 are evaluated on the same per-fold match samples and that their per-match log-losses are correlated. It is the SE convention most commonly used in the forecast-evaluation literature (Diebold-Mariano and its descendants). The pre-registered Diebold-Mariano machinery described in Evaluation is closer in spirit to this reading than to the marginal one.
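To make the distinction concrete, the sketch below computes three standard errors on five per-fold log-losses. The fold values are invented for illustration and are not the project's data; the function names are hypothetical. The paired SE is taken on the fold-by-fold differences, the unpaired SE treats the two models as independent, and the marginal SE looks at one model alone.

```python
import math
from statistics import stdev


def standard_error(values):
    """Sample standard deviation divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


def paired_difference_se(loss_a, loss_b):
    """SE of the mean of fold-by-fold differences (models scored on the same folds)."""
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    return standard_error(diffs)


def unpaired_difference_se(loss_a, loss_b):
    """SE of a difference of means if the two series were independent."""
    return math.hypot(standard_error(loss_a), standard_error(loss_b))


# Hypothetical per-fold log-losses: a noisy baseline and a steadier challenger.
m0 = [1.10, 0.98, 1.12, 0.94, 1.03]
m2 = [1.00, 0.97, 1.01, 0.97, 1.00]

paired = paired_difference_se(m0, m2)
unpaired = unpaired_difference_se(m0, m2)
marginal_m2 = standard_error(m2)
```

In this toy example the paired SE falls below the unpaired SE because the two series move together, yet it stays above the challenger's own marginal SE because the baseline's variability still enters the difference.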

How to read the two together

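Neither file is wrong. The two readings answer different questions about the same data. The marginal SE asks "how precisely do we know M2's own CV mean log-loss?" The paired-difference SE asks "how precisely do we know the gap between M2 and M0 on the same evaluation samples?" When two models' losses are positively correlated at the per-match level, as they are here, a paired-difference SE is usually smaller than an unpaired SE of the difference. It is still larger than M2's marginal sigma in this case: a 1.7518 SE gap on a 0.04096 difference implies a paired SE of about 0.0234, against 0.006587, because the marginal sigma leaves M0's variability out entirely. That is why the marginal reading yields the larger SE multiple.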

The protocol seals both conventions implicitly by sealing both files. The pre-registration's primary criterion (lowest mean CV log-loss with adequate gap to runner-up) does not depend on the SE convention, and M2 wins under that criterion in both files. The 2.0-SE sanity gate does depend on the convention. The honest report is that the sanity-gate warning fired under the paired-difference reading and did not fire under the marginal reading. The locked file is champion_model.json; M2 is M★; the warning is documented here.
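The two readings can be reproduced from the numbers quoted on this page. The sketch below is arithmetic only; the constant names are hypothetical, and the implied paired SE is back-computed here rather than read from either locked file.

```python
THRESHOLD_SE = 2.0        # pre-registered bar
DELTA_VS_M0 = -0.04096    # M2 minus M0 mean CV log-loss (negative means M2 is better)
SIGMA_CV_M2 = 0.006587    # marginal sigma reported in champion_model.json
PAIRED_GAP_SE = 1.7518    # m_star_vs_m0_gap_se reported in cv_battery_result.json

# Marginal reading: size of the gap in units of M2's own sigma.
marginal_gap_se = abs(DELTA_VS_M0) / SIGMA_CV_M2

# Back-computed (not published): the paired SE implied by the reported 1.7518 multiple.
implied_paired_se = abs(DELTA_VS_M0) / PAIRED_GAP_SE

readings = {"marginal": marginal_gap_se, "paired_difference": PAIRED_GAP_SE}
for name, gap in readings.items():
    status = "CLEARED" if gap >= THRESHOLD_SE else "WARNING"
    print(f"{name}: {gap:.2f} SE / {THRESHOLD_SE:.1f} SE -> {status}")
# prints:
# marginal: 6.22 SE / 2.0 SE -> CLEARED
# paired_difference: 1.75 SE / 2.0 SE -> WARNING
```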

The two checkpoints

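The same 2-SE threshold is applied at two stages of the project. The Phase 8 sanity gate runs on cross-validation hold-out data before the tournament begins. The R16 live checkpoint runs on cumulative tournament forecasts once the Round of 16 settles. Both stages use the same machinery and the same 2.0 SE threshold; they differ in which sample of log-losses they evaluate and in the direction of the test. The Phase 8 gate asks whether M★ beats M0 by at least 2 SE, while the live checkpoint fires only if M★ is worse than M0 by at least 2 SE.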

The Phase 8 sanity gate

The Phase 8 sanity gate ran the kill criterion as a pre-flight check on the cross-validation hold-out (calibration 2010 to 2021, hold-out 2022 World Cup), before the 2026 tournament started. The full adjudication table:

Model	Mean CV LL	Marginal SE	Δ vs M0	DM p vs M0	Status
M2_fifa	0.99337	0.00659	−0.04096	0.0032	Champion (`CHAMPION_LOCKED: true`); sanity-gate warning under paired-difference SE
M3_macro	1.02694	0.02949	−0.00739	0.3443	Eligible; below M0 in point estimate
M0_elo	1.03433	0.03844	0.000	1.0000	Baseline
M1_form	1.08110	0.07514	+0.04677	0.0061	Disqualified (significantly worse than M0)

Values are from the locked data/calibration/cv_battery_results.json. M2 cleared the primary criterion by having the lowest mean log-loss across the candidate set, with a 1.49 SE gap to the runner-up (M3_macro) and a Diebold-Mariano test against M0 returning $p = 0.003$.
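The first condition of the gate, lowest mean CV log-loss across the candidate set, can be re-derived from the table. The sketch below copies the table's values into a hypothetical dictionary (the layout is an assumption, not the schema of the locked file) and recomputes the champion and the Δ vs M0 column.

```python
CV_TABLE = {
    # model: (mean CV log-loss, marginal SE), copied from the adjudication table
    "M2_fifa": (0.99337, 0.00659),
    "M3_macro": (1.02694, 0.02949),
    "M0_elo": (1.03433, 0.03844),
    "M1_form": (1.08110, 0.07514),
}
BASELINE = "M0_elo"

# Condition 1: the champion is the model with the lowest mean CV log-loss.
ranking = sorted(CV_TABLE, key=lambda model: CV_TABLE[model][0])
champion = ranking[0]

# Recompute the delta column: negative means better than the baseline.
baseline_ll = CV_TABLE[BASELINE][0]
delta_vs_m0 = {
    model: round(mean_ll - baseline_ll, 5)
    for model, (mean_ll, _se) in CV_TABLE.items()
}
```

The recomputed deltas match the table to five decimals, and the ranking places M3_macro second, M0_elo third, and M1_form last.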

The 2-SE sanity gate, applied with the paired-difference SE in evaluation/cv_battery_result.json, reports a 1.75 SE gap and sanity_gate_passed: false. The same gap measured with M2's marginal sigma in champion_model.json is 6.22 SE. The pre-registered consequence (kill.action: pivot_paper_framing) took effect under the paired-difference reading: the paper's framing pivoted to the sanity-gate-warning narrative, and the project committed to acknowledging the warning transparently in the public ledger and the vault essays. M★ was not demoted; the pre-registered action is a framing pivot, not an automatic identity change. The pipeline did not abort; the engine continues to run, and the forecast log continues to record every model's probabilities, edges, and divergences.

The R16 live checkpoint

The same 2-SE rule was re-evaluated once the Round of 16 settled, on 2026-07-07. The check runs once, after all eight R16 matches are settled; it is not re-run weekly thereafter.

The R16 live check is the kill criterion's first contact with live tournament data. M★ is M2_fifa, sealed in champion_model.json. The comparison is a real M2-versus-M0 head-to-head on live forecasts, not a degenerate self-comparison. The evaluation happens when the Round of 16 settles, as pre-registered: the statistic is the cumulative match-level log-loss of M★ (M2) against the M0 baseline over the 72 pre-registered group-stage forecasts, scored against their realized outcomes, with a paired per-match standard error. The kill criterion fires if M★ is worse than M0 by at least 2 standard errors on that statistic.

If the criterion fires, the pre-registered consequence is the same pivot_paper_framing action described for Phase 8: the paper pivots to the pre-committed contingency framing. The model is not silently swapped; only an OSF amendment can change M★'s identity mid-tournament (see the margin note below). Whether or not the criterion fires, the project publishes the evaluation report within 72 hours of the Round of 16 settling, following the template committed in the pre-registration. This live checkpoint is a separate evaluation from the Phase 8 sanity gate above (the 1.75 SE and 6.22 SE readings on cross-validation hold-out data); it applies the same 2-SE rule to live tournament forecasts instead.
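The live statistic described above can be sketched end to end: score each forecast by the log of the probability it gave to the realized outcome, take paired per-match differences, and fire only if M★ is worse than M0 by at least 2 SE. The function names, the outcome keys, and the three toy forecasts are hypothetical, and the SE is assumed to be the plain sample standard deviation over $\sqrt{n}$.

```python
import math
from statistics import mean, stdev


def match_log_loss(probs, outcome):
    """Minus the log of the probability assigned to the realized outcome."""
    return -math.log(probs[outcome])


def live_checkpoint(forecasts_m0, forecasts_mstar, outcomes, threshold_se=2.0):
    """Paired per-match comparison; fires if M* is worse than M0 by >= threshold_se SE."""
    ll_m0 = [match_log_loss(p, o) for p, o in zip(forecasts_m0, outcomes)]
    ll_ms = [match_log_loss(p, o) for p, o in zip(forecasts_mstar, outcomes)]
    d = [a - b for a, b in zip(ll_m0, ll_ms)]  # positive means M* is better
    se = stdev(d) / math.sqrt(len(d))
    if se == 0:
        raise ValueError("all per-match differences are identical; SE is zero")
    gap_in_se = mean(d) / se
    return {"gap_in_se": gap_in_se, "fires": gap_in_se <= -threshold_se}


# Toy forecasts over home (H), draw (D), away (A); not project data.
TOY_M0 = [{"H": 0.50, "D": 0.30, "A": 0.20},
          {"H": 0.40, "D": 0.30, "A": 0.30},
          {"H": 0.30, "D": 0.30, "A": 0.40}]
TOY_MSTAR = [{"H": 0.60, "D": 0.25, "A": 0.15},
             {"H": 0.45, "D": 0.30, "A": 0.25},
             {"H": 0.25, "D": 0.30, "A": 0.45}]
TOY_OUTCOMES = ["H", "H", "A"]

toy = live_checkpoint(TOY_M0, TOY_MSTAR, TOY_OUTCOMES)
```

With the sign convention used here, a reported gap of +1.47 SE sits on the "M★ better" side of zero, so the firing condition is not met.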

The block below is this checkpoint's own live event, a separate construction from the Phase 8 cross-validation readings above. It reports a gap measured in paired per-match standard errors over the 72 pre-registered group-stage forecasts re-scored against their realized results, published once when the Round of 16 settled. Before that it showed the pre-registered interim sentence; it is never numerically compared to the 1.75 SE or 6.22 SE cross-validation numbers.

R16 LIVE CHECKPOINT · DID NOT FIRE · 2026-07-07

On the 72 pre-registered group-stage forecasts re-scored against their realized results, M2 (M★) was 1.47 SE better than M0, against the 2.0 SE threshold.

M2 (M★) was not worse than M0 by 2 or more standard errors; the kill criterion did not fire. The full evaluation is published as promised, as the templated ablation report at /data/latest/ablation.json.

Paired per-match SE over the settled group results; a separate construction from the Phase 8 cross-validation readings above.

Operational response

The Phase 8 sanity gate firing triggered a specific, pre-registered set of changes. The changes are operational rather than computational. The engine still runs the same way; the claims the project is willing to make from its outputs have changed.

What changed:

The paper's framing pivoted from "we built a model that finds market mispricings" to the kill-criterion narrative described in The 45% Problem.
The vault essays, this page, and the live ledger acknowledge the sanity-gate warning under the paired-difference SE reading. The project's brand commitment is to acknowledge the warning visibly, not to bury it inside the locked file.
The R16 live checkpoint is treated as a real test of M2 against M0 on live forecasts, not a procedural formality.

What did not change:

The engine continues to run on its 60-second tick during live tournament hours and 5-minute tick off-hours. The forecast-log API exists, and the frozen champion batch pins the M0 and M2 (M★) per-match distributions the published surfaces score against; per-match logs for the M1 and M3 shadow variants were not committed.
The de-vigging machinery, divergence calculation, and Volatility Gate all remain active. They continue to produce flagged divergences and gate decisions against M★'s (M2_fifa) probabilities.
CLV continues to be tracked on the M★ row, measuring whether M2's probabilities lead the market over time. The same metric runs against the four shadow models as pseudo-CLV (see Evaluation).
The R16 live checkpoint was evaluated on 2026-07-07 when the Round of 16 settled, on a live M2-versus-M0 comparison; it did not fire, with M2 1.47 SE better than M0 on the 72 pre-registered forecasts.

The signed Git tag v1.0.0-mstar-lock and the OSF pre-registration both record the state of the project at the moment of the Phase 8 firing. The signed tag cannot be moved or backdated without invalidating its cryptographic signature; the OSF record cannot be amended without producing a visible fork in the audit trail. Together they make the firing, and the operational response that followed it, permanently verifiable.

Live status

The block below reflects the current state of the kill criterion as of the most recent snapshot. It updates nightly with each build.

KILL CRITERION · 2026-04-23

CLEARED: 6.22 SE / 2.0 SE · marginal sigma (champion_model.json) · CHAMPION_LOCKED = true; M2_fifa sealed under the protocol's primary criterion.

WARNING: 1.75 SE / 2.0 SE · paired-difference SE (cv_battery_result.json) · Sanity gate did not clear the 2.0 SE threshold under this convention; M2_fifa retained per the protocol's primary criterion (`pivot_paper_framing`).

Condition: M2 vs M0 stratified CV log-loss

Two SE readings; locked under the marginal reading; warning logged under the paired-difference reading; R16 live checkpoint is the next adjudication.

The status block reads from the snapshot and renders the dual reading: a WARNING: 1.75 SE / 2.0 SE (paired-difference) badge alongside a CLEARED: 6.22 SE / 2.0 SE (marginal) reference and the Phase 8 timestamp. The R16 live checkpoint resolved on 2026-07-07 and did not fire, with M2 1.47 SE better than M0 on the 72 pre-registered forecasts, so the Phase 8 badge state stands.

The Transparency Ledger shows the full kill-criteria check panel alongside the rolling calibration metrics (Brier, log-loss, RPS) and the per-team probability cards. The ledger and this page read from the same snapshot, so they cannot disagree about the criterion's state.

Why pre-registering the stopping rule matters

Pre-registering the kill criterion before the tournament prevents the worst form of result-chasing: keeping a model alive indefinitely because no one has formally decided when to stop. The same logic applies to post-hoc threshold tuning. With the 2 SE bar sealed before the cross-validation battery ran, there was no path to retroactively soften the bar to 1.75 SE in order to keep M2 as M★.

Pre-registration also removes the option to quietly withdraw the project from the public record if the criterion fires. The OSF DOI 10.17605/OSF.IO/8B5HD and the signed Git tag v1.0.0-mstar-lock are public, time-stamped, and cryptographically verifiable. The Phase 8 sanity-gate warning is now part of that permanent record. The model cards have been updated with the dual SE reading; the terminal displays the warning badge alongside the cleared marginal reading; the forecast log carries every model's probabilities and divergences against M★ (M2_fifa) in every row.

Publishing a null result under a pre-registered stopping rule is not failure. It is the project working as designed.

Source: this project's pre-registration, OSF 2026-04-22

The Phase 8 sanity-gate warning is the strongest available vindication of the pre-registration discipline. Without the 2-SE bar, the project would have launched a public website with M2 as M★ and either quietly suppressed the paired-difference SE warning that one of the two locked files surfaces, or reported only the marginal reading that clears the bar without acknowledging the other. Pre-registration forces both readings into the public record. With the bar in place, the project published an honest dual reading on day one and a clear plan for the R16 checkpoint that follows on live tournament data. The criterion was not theatre; it bound on the very first run.

Where to go next

The 45% Problem: the lead essay that frames the project's purpose and reads the Phase 8 sanity-gate warning as part of the project's brand commitment.
Models: the four-candidate ablation, the cross-validation battery, and the table that adjudicated champion selection.
Evaluation: Brier, log-loss, RPS, and the Diebold-Mariano machinery the kill criterion is built on.
Pre-registration: the OSF DOI, the signed Git tag, the sealed pre_reg_constants.yaml, and the procedural commitments the criterion enforces.
Notation: the symbol table for $\mathcal{L}$, $d_i$, $\overline{d}$, $\mathrm{SE}$, and related quantities.
