Delta-Map Belief Updates for Stable Spatial Revision in Vision-Language Models
Abstract
Vision-language models can extract spatial information from images but struggle to maintain and revise spatial beliefs over time. Existing approaches either regenerate cognitive maps from scratch, discarding valuable prior context, or carry the prior map forward but regenerate it in full, which is wasteful when changes are sparse. We propose delta-map updates, a sparse belief-revision mechanism that preserves unchanged spatial beliefs and updates only the elements affected by new observations. On the Theory of Space benchmark, providing the prior map as context together with explicit preserve/overwrite rules improves false-belief identification F1 by 16.7 percentage points over scratch regeneration. Delta-map updates match the performance of full regeneration (F1 = 0.479 vs. 0.477) while producing structured outputs that are 52–63% smaller. Our analysis supports the sparse-evidence premise: only 30% of objects require updating per step, which favors targeted updates over full regeneration.
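To make the preserve/overwrite semantics concrete, here is a minimal sketch of applying a delta to a cognitive map. It assumes a hypothetical map format that maps object names to 2D grid coordinates, plus a hypothetical delta convention where `None` marks an object for removal. The helpers `apply_delta` and `touched_fraction` are illustrative only and do not reflect the paper's actual output schema.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical representation: object name -> (x, y) grid cell.
Position = Tuple[int, int]
CognitiveMap = Dict[str, Position]
Delta = Dict[str, Optional[Position]]


def apply_delta(prior: CognitiveMap, delta: Delta) -> CognitiveMap:
    """Return a revised map without mutating `prior`.

    Objects absent from `delta` are preserved unchanged; objects listed in
    `delta` are overwritten with the new position, or removed if the value
    is None (a hypothetical deletion convention).
    """
    updated = dict(prior)
    for obj, pos in delta.items():
        if pos is None:
            updated.pop(obj, None)
        else:
            updated[obj] = pos
    return updated


def touched_fraction(prior: CognitiveMap, delta: Delta) -> float:
    """Fraction of prior objects that the delta overwrites or removes."""
    if not prior:
        return 0.0
    return sum(1 for obj in delta if obj in prior) / len(prior)


# Usage: a 10-object map where the new observation affects 3 objects.
prior = {f"obj{i}": (i, i) for i in range(10)}
delta = {"obj0": (5, 5), "obj1": None, "obj2": (9, 0)}
new_map = apply_delta(prior, delta)

# The delta serializes to far fewer characters than the full revised map.
delta_size = len(json.dumps(delta))
full_size = len(json.dumps(new_map))
```

Under this sketch, the model only has to emit `delta`. The unchanged entries are carried over mechanically, which is where the savings in output size come from when few objects change.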