[ECCV'26] BeTTER: Diagnose the Illusion of Embodied Reasoning in Vision-Language-Action Models

Abstract

Embodied reasoning is crucial for general-purpose robotic agents. Although recent Vision-Language-Action (VLA) models and embodied Vision-Language Models (VLMs) aim to integrate high-level planning with spatial understanding, their true reasoning capabilities remain unclear. We argue that current evaluation protocols, ranging from long-horizon manipulation tasks to standard visual-text robotic benchmarks, fail to accurately diagnose reasoning deficiencies or to distinguish genuine reasoning abilities from shortcut correlations learned during training. To address this gap, we introduce BeTTER, a systematic diagnostic benchmark designed to rigorously decouple physical execution from reasoning. Our benchmark enables more precise evaluation of embodied reasoning by applying controlled diagnostic perturbations to test a model’s cognitive robustness. Through extensive experiments, we reveal that state-of-the-art models suffer from severe reasoning failures in dynamic environments, exhibiting semantic feature collapse, behavioral inertia, and state-tracking deficits. These deficiencies expose a considerable gap between benchmark success and true embodied reasoning. Our findings emphasize the need for more rigorous diagnostic benchmarks and improved training strategies to ensure genuine embodied reasoning in future robotic models.

Publication
European Conference on Computer Vision (ECCV), 2026.