Stutter-Invariance Metamorphic Audits for Text World-Model Rollouts
Abstract
Large language models (LLMs) are increasingly used as world models for text-based environments, enabling model-based planning without costly real-world interactions. However, world-model rollouts can fail to transfer back to the real environment, a problem we term World-to-Real (W2R) failure. We propose a metamorphic audit based on \emph{stutter invariance}: we insert state-preserving commands (e.g., \texttt{look} in TextWorld) into rollouts and measure the resulting observation drift. If a world model maintains stable state representations, such insertions should not affect subsequent predictions. We evaluate the audit on TextWorld with a Qwen2.5-7B world model, where it achieves an AUROC of 0.767 for W2R failure prediction. However, this performance is statistically tied with a simpler sampling-consistency baseline (AUROC 0.757) that merely re-runs generation with different random seeds. Both methods detect the same underlying signal: general output instability under perturbation. This informative negative result suggests that, for W2R failure prediction, the cheapest stability check is sufficient: domain-specific metamorphic probes add computational cost without measurable benefit.
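As an illustration of the stutter-invariance idea, the following is a minimal sketch, not the implementation evaluated here. It assumes a hypothetical world-model interface (a callable that maps a history of command/observation pairs and a next command to a predicted observation) and a hypothetical drift metric (one minus the Jaccard overlap of whitespace tokens). The names \texttt{token\_drift}, \texttt{stutter\_drift}, \texttt{stable\_model}, and \texttt{unstable\_model} are invented for this example, and the two toy models stand in for an LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption of this sketch): a world model maps
# (history of (command, observation) pairs, next command) -> predicted observation.
Step = Tuple[str, str]
WorldModel = Callable[[List[Step], str], str]


def token_drift(a: str, b: str) -> float:
    """Assumed drift metric: 1 - Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def stutter_drift(model: WorldModel, commands: List[str], stutter: str = "look") -> float:
    """Mean drift between a plain rollout and rollouts with one stutter inserted.

    For each step i, the stutter command is inserted just before command i,
    and the prediction for command i is compared with the plain prediction.
    """
    # Plain rollout: feed the model its own predicted observations.
    history: List[Step] = []
    plain: List[str] = []
    for cmd in commands:
        obs = model(history, cmd)
        plain.append(obs)
        history.append((cmd, obs))

    drifts: List[float] = []
    for i, cmd in enumerate(commands):
        prefix = list(zip(commands[:i], plain[:i]))
        stutter_obs = model(prefix, stutter)
        perturbed = model(prefix + [(stutter, stutter_obs)], cmd)
        drifts.append(token_drift(plain[i], perturbed))
    return sum(drifts) / len(drifts) if drifts else 0.0


def stable_model(history: List[Step], command: str) -> str:
    """Toy model that ignores state-preserving 'look' steps when tracking state."""
    moves = [c for c, _ in history if c != "look"]
    if command == "look":
        return f"you are in room {len(moves)}"
    return f"you move to room {len(moves) + 1}"


def unstable_model(history: List[Step], command: str) -> str:
    """Toy model that wrongly treats every step, including 'look', as a move."""
    if command == "look":
        return f"you are in room {len(history)}"
    return f"you move to room {len(history) + 1}"


if __name__ == "__main__":
    rollout = ["go north", "go east"]
    print("stable drift:  ", stutter_drift(stable_model, rollout))
    print("unstable drift:", stutter_drift(unstable_model, rollout))
```

In this toy setting the stable model shows zero drift, while the model that miscounts \texttt{look} as a move shows positive drift. A per-rollout drift score of this kind is what a failure predictor would then be scored on with AUROC.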