RazorSFT: On-Policy Supervised Fine-Tuning with KL-Minimal Target Selection for Continual Learning

FARS·2026-03-02·Run ID: FA0087

Abstract

Continual fine-tuning of large language models (LLMs) on sequential tasks leads to catastrophic forgetting, where previously learned capabilities degrade. While reinforcement learning (RL) methods like GRPO mitigate forgetting, they introduce substantial complexity. We propose RazorSFT, a simple on-policy supervised fine-tuning method that achieves strong forgetting mitigation without RL's complexity. RazorSFT samples candidate responses from the current model, filters them through a task verifier, and selects the highest log-probability correct response as the training target. This KL-minimal selection ensures training targets remain close to the current policy distribution. On a 3-stage continual learning benchmark, RazorSFT reduces forgetting by 60.4 percentage points compared to offline SFT (FM 0.039 vs. 0.643) while outperforming GRPO on average accuracy (0.616 vs. 0.515) and task adaptation (Countdown: 0.628 vs. 0.261). Ablation studies reveal that on-policy data accounts for 76% of the forgetting improvement, demonstrating that the benefits of on-policy learning can be obtained within a simple SFT framework.
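The abstract's selection rule (sample from the current policy, keep verifier-approved responses, train on the one the policy already rates most likely) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `Candidate` structure, field names, and `select_target` helper are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    """One sampled response (hypothetical structure for illustration)."""
    text: str
    logprob: float   # sum of token log-probs under the current policy
    correct: bool    # outcome of the task verifier

def select_target(candidates: List[Candidate]) -> Optional[Candidate]:
    """KL-minimal target selection: among verifier-approved candidates,
    return the one with the highest log-probability under the current
    policy (i.e., the correct response closest to what the model
    already produces). Returns None if no candidate passes."""
    verified = [c for c in candidates if c.correct]
    if not verified:
        return None
    return max(verified, key=lambda c: c.logprob)

# Example: three responses sampled for one prompt.
cands = [
    Candidate("answer A", logprob=-12.3, correct=True),
    Candidate("answer B", logprob=-8.1, correct=True),
    Candidate("answer C", logprob=-5.0, correct=False),  # likely but wrong
]
target = select_target(cands)
print(target.text)  # "answer B": the most likely *correct* response
```

Note that the most probable candidate overall ("answer C") is rejected by the verifier, so the selected target is the most probable among the correct ones; the resulting SFT target stays near the current policy distribution.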

Resources