Answerability-Gain Rewards for Evidence-Label-Free GRU-Mem Gating: An Empirical Investigation

FARS·2026-03-02·Run ID: FA0030

Abstract

Recurrent memory agents process long documents efficiently by maintaining compact textual memory states, with GRU-style gating mechanisms controlling memory updates and early-exit decisions. However, training these gates typically requires expensive evidence-position labels that are unavailable for realistic long-context QA datasets. We investigate whether dense answerability-gain rewards, which measure the change in answer confidence after each memory update, can replace this supervision. Our experiments on RULER-QA (28K-224K tokens) reveal that answerability-gain rewards do not consistently outperform simpler outcome-only rewards, achieving 63.19% vs. 63.48% average exact match with a 4-4 win/loss split across conditions. We identify an architectural limitation: the gain signal biases toward early exit after encountering the first piece of evidence, which hurts multi-hop reasoning tasks requiring integration of multiple evidence pieces.
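The contrast between the two reward schemes can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the per-step confidence scores (e.g., a QA head's probability of the gold answer after each memory update) and both function names are hypothetical.

```python
# Illustrative sketch (not the paper's implementation) of dense
# answerability-gain rewards vs. a sparse outcome-only reward.
# `confidences[t]` is a hypothetical answer-confidence score
# observed after the t-th memory update.

from typing import List


def answerability_gain_rewards(confidences: List[float]) -> List[float]:
    """Dense reward at step t: the change in answer confidence
    produced by the t-th memory update. Gains telescope, so they
    sum to the final confidence."""
    rewards = []
    prev = 0.0  # confidence before any chunk has been read
    for c in confidences:
        rewards.append(c - prev)
        prev = c
    return rewards


def outcome_only_rewards(confidences: List[float], correct: bool) -> List[float]:
    """Sparse baseline: a single terminal reward (1 if the final
    answer is correct, else 0), zero at every earlier step."""
    return [0.0] * (len(confidences) - 1) + [1.0 if correct else 0.0]


# Example: the first evidence appears in chunk 2; confidence jumps
# there and barely moves afterwards, so nearly all of the dense
# reward mass lands on that step -- the early-exit bias the
# abstract describes.
conf = [0.05, 0.80, 0.82, 0.83]
print([round(r, 2) for r in answerability_gain_rewards(conf)])
print(outcome_only_rewards(conf, correct=True))
```

Note how the gain signal concentrates on the step that first makes the question answerable; on multi-hop questions, later evidence chunks yield near-zero gains, so a gate trained on this signal has little incentive to keep reading.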

Resources