Equation-Consistency Gated Reflection for Small Language Models: A Training-Free Approach to Preventing Self-Correction Regressions
Abstract
Self-reflection has emerged as a promising approach for improving reasoning in large language models, yet small models (7-9B parameters) often exhibit ``pseudo-reflection,'' where self-critique introduces more errors than it corrects. We observe that naive self-reflection causes Llama-3-8B-Instruct accuracy to drop from 79.68% to 55.27% on GSM8K, with 36.25% of originally correct answers becoming incorrect after reflection. To address this, we propose Equation-Consistency Gated Reflection (ECGR), a training-free method that uses deterministic arithmetic verification via SymPy to gate self-reflection outputs. ECGR extracts arithmetic equations from solutions, verifies their consistency, and selects the solution with the higher consistency score. On GSM8K and GSM-Plus, ECGR reduces correct-to-incorrect regression rates by over 92% (from 36.25% to 2.76% on GSM8K), demonstrating that simple equation checking can effectively prevent self-correction regressions. However, low equation coverage (43%) limits practical gains over simpler baselines like self-consistency.
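The gating mechanism described above can be illustrated with a minimal sketch. The equation pattern, scoring rule, and function names below are illustrative assumptions, not the authors' implementation; the sketch only shows the core idea of extracting `lhs = rhs` statements, verifying them with SymPy, and keeping the reflected solution only when it scores higher.

```python
import re
import sympy

# Hypothetical pattern: capture "lhs = rhs" arithmetic statements such as "3 + 4 = 7".
EQ_RE = re.compile(r"(\d[\d+\-*/.() ]*)=\s*(-?\d+(?:\.\d+)?)")

def consistency_score(solution: str) -> float:
    """Fraction of extracted equations that SymPy verifies as true.

    Returns 1.0 when no equations are found (no evidence of error);
    this convention is an assumption, not specified in the abstract.
    """
    equations = EQ_RE.findall(solution)
    if not equations:
        return 1.0
    correct = 0
    for lhs, rhs in equations:
        try:
            # Deterministic check: lhs - rhs must simplify to zero.
            if sympy.simplify(sympy.sympify(lhs) - sympy.sympify(rhs)) == 0:
                correct += 1
        except (sympy.SympifyError, TypeError):
            pass  # unparseable expression counts as unverified
    return correct / len(equations)

def gated_reflection(original: str, reflected: str) -> str:
    """Keep the reflected solution only if its equation-consistency
    score is strictly higher; otherwise fall back to the original."""
    if consistency_score(reflected) > consistency_score(original):
        return reflected
    return original
```

On a solution whose reflection flips a correct step (e.g. rewriting "3 + 4 = 7" as "3 + 4 = 8"), the reflected text scores lower, so the gate retains the original answer, which is the regression-prevention behavior the abstract reports.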