Execution-Signature Recycling: Deduplicating Unit-Test Failure Feedback for Test-Time Code Scaling

FARS·2026-03-02·Run ID: FA0163

Abstract

Test-time scaling improves code generation by sampling multiple candidates and using execution feedback to guide selection or refinement. However, when multiple candidates fail for similar reasons, feeding every failure back to the model is redundant and may waste its context. We propose Execution-Signature Recycling (ESR), a training-free method that clusters candidates by their execution signatures (the set of failing tests and error types) and conditions subsequent generations on a deduplicated failure bank. We evaluate ESR on HumanEval+ with Qwen2.5-Coder-7B-Instruct under a fixed 16-generation budget. While ESR achieves the highest mean Pass@1 (87.60%), it does not significantly outperform the simpler Self-Debug baseline (86.99%): the 95% confidence interval for the difference, [-0.81, 2.24], includes zero. This negative result suggests that for strong code generation models, per-sample self-debugging may be sufficient, and cross-sample feedback aggregation does not provide reliable additional benefits.
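
To make the signature idea concrete, the following is a minimal sketch of how failing candidates might be grouped by execution signature and collapsed into a deduplicated failure bank. It is an illustration under stated assumptions, not the implementation evaluated here: the names `Failure`, `execution_signature`, and `build_failure_bank` are hypothetical, the input format (a list of code and failure pairs) is assumed, and ordering the bank by cluster size is a choice made only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One failing unit test for a candidate (hypothetical record type)."""
    test_id: str     # assumed identifier of the failing test
    error_type: str  # e.g. "AssertionError", "TypeError"


def execution_signature(failures):
    """Order-independent signature: the set of (failing test, error type) pairs."""
    return frozenset((f.test_id, f.error_type) for f in failures)


def build_failure_bank(candidates):
    """Cluster failing candidates by signature, keeping one entry per cluster.

    `candidates` is a list of (code, failures) pairs. Candidates with no
    failures are treated as passing and skipped.
    """
    clusters = defaultdict(list)
    for code, failures in candidates:
        if failures:
            clusters[execution_signature(failures)].append(code)

    bank = [
        {"signature": sorted(sig), "example": codes[0], "count": len(codes)}
        for sig, codes in clusters.items()
    ]
    # Assumption for this sketch: most common failure modes come first.
    bank.sort(key=lambda entry: -entry["count"])
    return bank


candidates = [
    ("def f(x): return x + 1", [Failure("test_zero", "AssertionError")]),
    ("def f(x): return x + 2", [Failure("test_zero", "AssertionError")]),
    ("def f(x): return x.y", [Failure("test_zero", "AttributeError")]),
    ("def f(x): return x", []),  # passing candidate, not added to the bank
]
bank = build_failure_bank(candidates)
```

In this example the two candidates that fail `test_zero` with the same error type share one bank entry, so a follow-up prompt would carry two distinct failure descriptions rather than three.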

Resources