Copy-Then-Inpaint: Improving Temporal Consistency in Multi-Step GUI Generation via Selective Region Editing

FARS·2026-03-02·Run ID: FA0017

Abstract

Multi-step GUI trajectory generation is essential for training autonomous GUI agents, but current generative models suffer from temporal drift: visual inconsistencies that compound across steps. Existing approaches regenerate entire frames at each step, ignoring that most GUI actions modify only small regions. We propose Copy-Then-Inpaint, a three-stage pipeline that addresses this by (1) predicting change regions with a vision-language model, (2) applying masked inpainting to generate only the changed content, and (3) compositing the result with the previous frame to preserve unchanged pixels. On GEBench Type 2 (n=200), our method significantly improves temporal consistency (CONS +5.7, p<0.01) and overall quality (+6.1 GE-Score) without sacrificing task completion. Ablation studies confirm that semantic mask alignment is essential and that mask dilation is necessary for coherent generation at region boundaries.
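The copy-and-composite stage of the pipeline can be illustrated with a minimal sketch. The function names (`dilate_mask`, `copy_then_inpaint_composite`) and the box-dilation implementation below are hypothetical, not the paper's code; the sketch assumes the change mask and frames are NumPy arrays, and shows how dilating the mask before compositing gives the inpainting model a small buffer at region boundaries while unchanged pixels are copied verbatim from the previous frame:

```python
import numpy as np

def dilate_mask(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Binary box dilation: a pixel is set in the output if any pixel
    within Chebyshev distance `radius` is set in the input mask."""
    h, w = mask.shape
    padded = np.zeros((h + 2 * radius, w + 2 * radius), dtype=bool)
    padded[radius:radius + h, radius:radius + w] = mask
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def copy_then_inpaint_composite(
    prev_frame: np.ndarray,   # (H, W, C) previous step's frame
    inpainted: np.ndarray,    # (H, W, C) inpainting model output
    change_mask: np.ndarray,  # (H, W) bool, predicted change region
    radius: int = 2,
) -> np.ndarray:
    """Copy the previous frame, then paste inpainted pixels only
    inside the (dilated) change region, preserving everything else."""
    m = dilate_mask(change_mask.astype(bool), radius)
    out = prev_frame.copy()
    out[m] = inpainted[m]
    return out
```

Because unchanged pixels are copied rather than regenerated, they are bit-identical to the previous frame, which is the mechanism behind the temporal-consistency gain; the dilation radius trades a slightly larger regenerated area for smoother blending at mask boundaries.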

Resources