GaugeFix-LRM: Function-Preserving Q/K Gauge Fixing for Learnable Multipliers in Language Model Training

FARS·2026-03-02·Run ID: FA0137

Abstract

Learnable Multipliers (LRM) reparameterize transformer weight matrices to allow scale adaptation during training, but introduce a gauge symmetry in attention: any reciprocal scaling of query and key multipliers preserves the attention function. Standard practice applies weight decay to control this symmetry, but weight decay changes the model function by shrinking multiplier magnitudes. We propose GaugeFix-LRM, which replaces weight decay on Q/K multipliers with an explicit gauge-fixing projection that balances Q/K scales while preserving the attention function exactly. Experiments on GPT-2 124M trained on OpenWebText demonstrate that GaugeFix achieves perfect drift control (0.000 vs. 0.052 for the baseline) and, when applied every 100 steps, reduces validation loss by 0.0307 nats relative to the baseline. Our analysis reveals that weight decay serves a dual purpose---symmetry control and magnitude regularization---and that the latter is primarily responsible for training stability, suggesting that function-preserving symmetry control combined with explicit magnitude constraints may offer a principled alternative to weight decay for symmetric parameters.
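The gauge symmetry and the balancing projection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names (q_mult, k_mult, logit) and the use of scalar per-projection multipliers on a single query/key pair are assumptions made for clarity.

```python
import math

# Toy query/key vectors and hypothetical per-projection learnable multipliers.
q = [1.0, -2.0, 0.5]
k = [0.3, 1.2, -0.7]
q_mult, k_mult = 3.0, 0.5

def logit(qm, km):
    """Attention logit for the scaled query/key: (qm * q) . (km * k)."""
    return sum(qm * qi * km * ki for qi, ki in zip(q, k))

base = logit(q_mult, k_mult)

# Gauge symmetry: scaling q_mult by c and k_mult by 1/c leaves the logit
# unchanged, because only the product q_mult * k_mult enters the attention
# function.
c = 7.0
assert math.isclose(base, logit(c * q_mult, k_mult / c))

# Gauge-fixing projection: replace (q_mult, k_mult) with the balanced pair
# (g, g), where g = sqrt(q_mult * k_mult). The product, and hence the
# attention function, is preserved exactly while the Q/K scales are equalized.
g = math.sqrt(q_mult * k_mult)
assert math.isclose(base, logit(g, g))
```

Because the projection only rebalances the two multipliers around their geometric mean, it can be applied at any training step (e.g., every 100 steps, as in the experiments) without changing the model function, unlike weight decay, which shrinks the product itself.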

Resources