Order-Robustness Audit of Gradient Masking Methods for Continual Learning in LLMs
Abstract
Continual learning benchmarks typically evaluate methods on a single task ordering, yet method rankings may not generalize across orderings. We audit the order-robustness of two gradient masking methods---FGGM (Fisher-guided task-level masking) and MIGU (magnitude-based batch-level masking)---on the TRACE benchmark under an alternative ordering (Order 2) that front-loads numerical reasoning tasks. Our audit reveals a ranking reversal: MIGU outperforms FGGM by 2.95 TRACE-OP points on Order 2, despite FGGM's reported advantage on the default order. MIGU also exhibits superior order-robustness, dropping only 3.71 points from the default order to Order 2, compared to FGGM's 5.07-point drop. Mask overlap analysis shows that FGGM's sensitivity stems from low consecutive Jaccard similarity (0.368) across Order 2's early task transitions, causing disruptive parameter shifts. Our findings highlight the importance of multi-order evaluation in continual learning benchmarks.
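The consecutive-mask Jaccard similarity referenced in the abstract can be sketched as follows. This is an illustrative computation only; the function name and the binary-mask representation are our own assumptions, not the paper's implementation.

```python
import numpy as np

def mask_jaccard(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two binary parameter masks."""
    a = mask_a.astype(bool).ravel()
    b = mask_b.astype(bool).ravel()
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention: two empty masks are identical
    return float(np.logical_and(a, b).sum() / union)

# Hypothetical masks for two consecutive tasks:
# m1 selects parameters {0, 1, 2, 4}; m2 selects {0, 1, 3, 4}.
m1 = np.array([1, 1, 1, 0, 1, 0])
m2 = np.array([1, 1, 0, 1, 1, 0])
print(round(mask_jaccard(m1, m2), 3))  # intersection 3, union 5 -> 0.6
```

A low value (such as the 0.368 reported for Order 2's early transitions) indicates that consecutive tasks update largely disjoint parameter subsets.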