CUSUM-$\epsilon$: False-Alarm-Calibrated Rollback Thresholds for Runtime Training Stability Controllers

FARS·2026-03-02·Run ID: FA0388

Abstract

Neural network training can suffer from rare but severe destabilizing updates that cause loss spikes or divergence. Runtime stability controllers detect anomalies and trigger checkpoint rollbacks to maintain training reliability. Existing one-step threshold controllers may trigger unnecessary rollbacks on transient noise. We investigate whether CUSUM (Cumulative Sum) sequential tests, which accumulate evidence over multiple steps before triggering, can improve rollback controllers by reducing false alarms while maintaining detection power. We implement CUSUM-$\epsilon$ and calibrate it to match the nominal rollback rate of the baseline Or-$\epsilon$ one-step threshold for a fair comparison. Contrary to expectations, Or-$\epsilon$ achieves 1.88$\times$ lower peak excess loss than CUSUM-$\epsilon$ on ResNet-18/CIFAR-10 with synthetic perturbations. The key insight is that detection delay is more harmful than false alarms when perturbations are large and immediately detectable: CUSUM's multi-step evidence accumulation allows corrupted updates to compound before rollback. A fast-initial-response (FIR) partial reset improves CUSUM by 35% but does not close the gap. For runtime training stability, one-step thresholds are preferred in this regime.
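To make the comparison concrete, here is a minimal sketch of the two detector styles the abstract contrasts: a one-step threshold that alarms on a single large statistic, and a one-sided CUSUM that accumulates evidence across steps and supports a FIR (fast initial response) partial reset after an alarm. The class and parameter names (`drift`, `h`, `fir_fraction`) are illustrative assumptions, not the paper's exact implementation.

```python
def one_step_alarm(x, threshold):
    """Or-style one-step rule: roll back as soon as one statistic exceeds the threshold."""
    return x > threshold


class Cusum:
    """One-sided CUSUM detector (illustrative sketch, not the paper's code)."""

    def __init__(self, drift, h, fir_fraction=0.0):
        self.drift = drift       # per-step allowance subtracted from the statistic
        self.h = h               # decision threshold on the cumulative sum
        self.fir = fir_fraction  # FIR partial reset: restart at fir * h instead of 0
        self.s = 0.0             # cumulative evidence

    def update(self, x):
        # Accumulate positive deviations above the drift allowance.
        self.s = max(0.0, self.s + x - self.drift)
        if self.s > self.h:
            self.s = self.fir * self.h  # partial (FIR) reset after an alarm
            return True                 # trigger rollback
        return False
```

The sketch makes the abstract's failure mode visible: with `drift=1.0` and `h=5.0`, a sustained anomaly of magnitude 4.0 only trips the CUSUM on its second update (the sum reaches 3.0, then 6.0), while a one-step threshold of 3.0 alarms immediately. That one extra step is exactly the detection delay during which corrupted updates can compound.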

Resources