Token-Balanced Continual Pretraining Eliminates Brain Rot Degradation
Abstract
Continual pretraining (CPT) on low-quality, short-sequence data such as social media posts has been shown to cause ``Brain Rot''---severe degradation in reasoning and long-context capabilities. The prevailing explanation attributes this degradation to the semantic quality of the data itself. We challenge this assumption and demonstrate that Brain Rot is instead a \emph{training artifact} arising from per-token weight disparity: when short sequences are processed without packing, they receive 8.17$\times$ higher per-token gradient updates than longer sequences, causing the model to overfit to their statistical patterns. We propose token-balanced packing, which concatenates sequences to a uniform length, eliminating this disparity. Through controlled experiments on Llama-3-8B-Instruct, we show that packing achieves 119.7\% ARC reasoning recovery and 95.2\% RULER long-context recovery relative to the no-CPT baseline, while providing a 67\% training speedup. Our findings demonstrate that CPT on short-sequence data is safe when proper packing is employed.
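The packing operation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, separator-token convention, and drop-the-remainder policy are all assumptions made for clarity.

```python
# Hypothetical sketch of token-balanced packing: concatenate variable-length
# token sequences into fixed-length chunks so every training example
# contributes the same number of tokens (and thus equal per-token weight).

def pack_sequences(sequences, target_len, sep_id=0):
    """Greedily concatenate token-id sequences, emitting uniform-length chunks.

    sequences  -- iterable of lists of token ids (one per document)
    target_len -- uniform chunk length in tokens
    sep_id     -- separator token inserted between documents (assumed)
    """
    buffer, packed = [], []
    for seq in sequences:
        buffer.extend(seq)
        buffer.append(sep_id)  # mark the document boundary
        while len(buffer) >= target_len:
            packed.append(buffer[:target_len])  # emit one uniform chunk
            buffer = buffer[target_len:]
    # Leftover tokens in `buffer` are dropped here; padding them instead
    # would reintroduce a small per-token weight disparity.
    return packed

chunks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], target_len=4)
# Every chunk has exactly 4 tokens, regardless of source-document length.
```

Because every chunk has identical length, each source token contributes equally to the gradient, removing the disparity that short unpadded sequences would otherwise enjoy.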