MEL-Code: Transferring Meta-Experience Learning to Code RLVR with Unit-Test Rewards
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for improving code generation in large language models. Recent work on Meta-Experience Learning (MEL) demonstrates that internalizing contrastive reasoning patterns can enhance mathematical reasoning, but its applicability to code generation remains unexplored. We present MEL-Code, which transfers MEL to code RLVR through three stages: contrastive pair construction from GRPO rollouts, replay validation via unit-test re-execution, and NLL internalization of validated meta-experiences. Our experiments on Qwen2.5-Coder-7B-Instruct reveal that code RLVR naturally generates abundant meta-experience signal, with 66% of training prompts yielding usable contrastive pairs. MEL-Code achieves the best MBPP result among the compared methods (9.2% greedy Pass@1) and converges 33% faster than the baselines. However, the gains are domain-specific: meta-experiences learned from MBPP do not transfer to HumanEval+, suggesting that code-specific meta-experience patterns require task-aligned training data.
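The three stages named in the abstract can be sketched as a minimal pipeline. This is an illustrative toy, not the paper's implementation: all names (`Rollout`, `build_contrastive_pairs`, `replay_validate`, `nll_loss`) are hypothetical, and the unit-test runner and token probabilities are stand-ins for the real GRPO training loop.

```python
# Hypothetical sketch of the three-stage MEL-Code pipeline; names and
# data structures are illustrative, not taken from the paper.
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rollout:
    prompt: str
    code: str
    reward: float  # 1.0 if all unit tests passed during GRPO rollout, else 0.0


def build_contrastive_pairs(rollouts: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Stage 1: pair one passing and one failing rollout per prompt.

    In the paper, roughly 66% of training prompts yield such a pair."""
    by_prompt: Dict[str, List[Rollout]] = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for group in by_prompt.values():
        passing = [r for r in group if r.reward == 1.0]
        failing = [r for r in group if r.reward == 0.0]
        if passing and failing:
            pairs.append((passing[0], failing[0]))
    return pairs


def replay_validate(pair: Tuple[Rollout, Rollout],
                    run_tests: Callable[[str], bool]) -> bool:
    """Stage 2: re-execute the unit tests to confirm the pass/fail labels
    before the pair is admitted as a meta-experience."""
    good, bad = pair
    return run_tests(good.code) and not run_tests(bad.code)


def nll_loss(token_probs: List[float]) -> float:
    """Stage 3 (toy): negative log-likelihood of a validated meta-experience
    under the model. Minimizing this internalizes the contrastive pattern."""
    return -sum(math.log(p) for p in token_probs)
```

A usage sketch: collect GRPO rollouts, keep only replay-validated pairs, and fine-tune on the NLL of the resulting meta-experience text. The real system would render each pair into a contrastive reasoning trace before the NLL step; that rendering is omitted here.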