SinkCast: An Empirical Study of Inference-Time Correction for BF16 RoPE Shift-Invariance
\begin{abstract} BFloat16 (BF16) precision is standard for large language model inference, but its limited mantissa breaks the shift-invariance property of Rotary Position Embedding (RoPE), so attention outputs vary with the absolute position offset. This inconsistency poses challenges for position-independent caching (PIC) systems that reuse KV caches across different position contexts. We hypothesize that the error concentrates at attention-sink positions---the initial tokens that receive disproportionate attention---and propose SinkCast, an inference-time correction that selectively recomputes sink-key logits in FP32 and applies a closed-form correction to BF16 FlashAttention outputs. Our evaluation on Llama-3.1-8B and Mistral-7B-v0.3 yields negative results: the sink key accounts for only 5--8% of the total shift-error (refuting the localization premise), SinkCast closes at most 36% of the gap (far below the 80% target), and downstream evaluation shows only a marginal 0.91-point overall improvement (no practical benefit). These findings indicate that BF16 RoPE shift-error is distributed across all key positions rather than localized at sinks, so sink-focused correction approaches are insufficient and alternative solutions are needed. \textit{WARNING: This paper was generated by an automated research system. The code is publicly available.}\footnote{\url{https://gitlab.com/fars-a/sinkcast-bos-fp32-rope}} \end{abstract}
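The BF16 shift-invariance violation at the heart of the abstract can be illustrated in a few lines. The sketch below is not the paper's code: it emulates bfloat16 by rounding each intermediate float32 to its top 16 bits (round half-up, an approximation of hardware round-to-nearest-even) and computes a single 2-D RoPE attention logit at two different absolute offsets with the same relative distance. The helper names `to_bf16` and `rope_logit` are illustrative assumptions. In exact arithmetic the logit depends only on the relative distance; under per-operation BF16 rounding the two logits disagree.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep the top 16 bits of the float32 bit pattern,
    rounding half-up (approximates hardware round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def rope_logit(q, k, pos_q, pos_k, bf16=False):
    """Attention logit for a single 2-D RoPE pair at frequency 1:
    rotate q by pos_q, k by pos_k, then take the dot product.
    In exact arithmetic this depends only on pos_q - pos_k."""
    f = to_bf16 if bf16 else (lambda v: v)
    cq, sq = f(math.cos(pos_q)), f(math.sin(pos_q))
    ck, sk = f(math.cos(pos_k)), f(math.sin(pos_k))
    q0 = f(q[0] * cq - q[1] * sq)  # rotated query components
    q1 = f(q[0] * sq + q[1] * cq)
    k0 = f(k[0] * ck - k[1] * sk)  # rotated key components
    k1 = f(k[0] * sk + k[1] * ck)
    return f(f(q0 * k0) + f(q1 * k1))

q, k = (0.7, -0.3), (0.5, 0.9)
# Same relative distance (5 positions), different absolute offsets:
a = rope_logit(q, k, 5, 0, bf16=True)
b = rope_logit(q, k, 1005, 1000, bf16=True)
print(a, b)  # the two bf16 logits disagree by roughly 1e-2
# Without bf16 rounding, the gap is float64 round-off only:
print(abs(rope_logit(q, k, 5, 0) - rope_logit(q, k, 1005, 1000)))
```

The gap is why KV caches computed at one absolute offset cannot simply be reused at another: every rotated key contributes its own rounding error, consistent with the paper's finding that the error is distributed across positions rather than concentrated at the sink.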