Entropy Dynamics Do Not Provide Reliable Execution-Free Selection Signals for Code Generation
Abstract
Best-of-N sampling improves code generation but requires execution to select among candidates. Entropy dynamics (EDIS) have shown promise for detecting reasoning errors in math problems by identifying instability patterns in per-token entropy trajectories. We test whether entropy dynamics can provide execution-free selection signals for code generation by adapting EDIS as nEDIS, with pre-registered success criteria. Our experiments yield a clear negative result: nEDIS fails the pre-registered criterion, underperforming even random first-sample selection by 12.8 to 27.5 percentage points on HumanEval and MBPP. We identify entropy sparsity as a key failure mode: with instruction-tuned code models, 88.3% of entropy values are exactly zero, which undermines spike detection. The optimization required to improve nEDIS contradicts the original hypothesis, suggesting that the method captures length bias rather than meaningful entropy dynamics. This negative result can spare others wasted effort and suggests that alternative approaches are needed for execution-free code selection.
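To make the sparsity statistic concrete, the following minimal sketch computes a per-token entropy trajectory from next-token distributions and measures the fraction of exactly-zero entries. It is an illustration only, not the nEDIS implementation: the function names `token_entropy` and `zero_entropy_fraction`, the `tol` parameter, and the toy distributions are all assumptions introduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def zero_entropy_fraction(trajectory, tol=0.0):
    """Fraction of positions whose entropy is at most `tol` (hypothetical helper)."""
    if not trajectory:
        return 0.0
    return sum(1 for h in trajectory if h <= tol) / len(trajectory)

# Toy data (not from the paper): three fully confident steps, one uncertain step.
steps = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0],
]
trajectory = [token_entropy(p) for p in steps]
sparsity = zero_entropy_fraction(trajectory)
print(f"zero-entropy fraction: {sparsity:.2f}")
```

When most positions carry zero entropy, as in this toy trajectory, a spike detector has very few non-zero points to work with, which is the failure mode described above.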