Copy-Then-Inpaint: Improving Temporal Consistency in Multi-Step GUI Generation via Selective Region Editing
Abstract
Multi-step GUI trajectory generation is essential for training autonomous GUI agents, but current generative models suffer from temporal drift: visual inconsistencies that compound across steps. Existing approaches regenerate the entire frame at each step, ignoring that most GUI actions modify only small regions. We propose Copy-Then-Inpaint, a three-stage pipeline that (1) predicts change regions with a vision-language model, (2) applies masked inpainting to generate only the changed content, and (3) composites the result onto the previous frame to preserve unchanged pixels. On GEBench Type 2, our method significantly improves temporal consistency (CONS +5.7) and overall quality (GE-Score +6.1) without sacrificing task completion. Ablation studies confirm that semantic mask alignment is essential and that mask dilation is necessary for coherent generation at region boundaries.
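The compositing step of the pipeline can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: the function name `composite_step`, the dilation routine, and the parameter `dilation_iters` are assumptions; the paper does not specify how dilation is performed, only that it is needed at region boundaries.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 4-connected (cross) structuring element.

    A hypothetical stand-in for whatever dilation the paper uses; it grows
    the predicted change mask so generation extends slightly past the
    region boundary, avoiding visible seams.
    """
    for _ in range(iterations):
        padded = np.pad(mask, 1)
        mask = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
                | padded[1:-1, :-2] | padded[1:-1, 2:])
    return mask

def composite_step(prev_frame, inpainted, change_mask, dilation_iters=1):
    """Copy unchanged pixels from the previous frame; take inpainted
    pixels inside the (dilated) predicted change region.

    prev_frame, inpainted: (H, W, 3) float arrays.
    change_mask: (H, W) boolean array from the mask predictor.
    """
    mask = dilate(change_mask, iterations=dilation_iters)
    mask3 = mask[..., None].astype(prev_frame.dtype)
    # out = mask * generated + (1 - mask) * previous
    return mask3 * inpainted + (1.0 - mask3) * prev_frame
```

Because unchanged pixels are copied verbatim rather than regenerated, they cannot drift across steps; only the masked region depends on the generator.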