[ECCV'26] RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

Abstract

This paper focuses on a critical challenge in robotics: translating human motions generated from text into executable actions for real humanoid robots. While existing text-to-motion (T2M) generation methods achieve semantic alignment between language and motion, they often produce physically infeasible motions unsuitable for real-world deployment. To bridge the gap between T2M and humanoid execution, we propose Reinforcement Learning from Physical Feedback (RLPF), a novel framework that integrates text-conditioned motion generation with motion-conditioned humanoid whole-body control. RLPF employs a low-level motion tracking policy to assess the feasibility of generated motions in physical simulators, producing rewards for fine-tuning the high-level motion generator. Moreover, RLPF introduces an alignment verification module to preserve semantic alignment with text instructions. This joint optimization ensures both physical feasibility and text alignment of T2M generators. Furthermore, to overcome the limitation of a frozen tracking policy, we propose Hierarchical RLPF, a co-training framework that iteratively optimizes the high-level T2M generator and the low-level tracking policy. Extensive experiments show that RLPF and Hierarchical RLPF greatly outperform baseline methods in generating physically feasible motions while maintaining semantic alignment with text instructions.
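
As a rough illustration of the reward idea described above, the following sketch combines a tracking-based feasibility score with an alignment check. It is not the paper's implementation: the function names, the exponential form of the tracking reward, the alignment threshold, and the penalty value are all assumptions made for illustration, and the motions are stand-in lists of joint values rather than simulator output.

```python
import math
from typing import Sequence

# Hypothetical motion layout: a list of frames, each a list of joint values.
Motion = Sequence[Sequence[float]]


def tracking_reward(reference: Motion, tracked: Motion, scale: float = 1.0) -> float:
    """Feasibility score in (0, 1]: exp(-scale * mean squared tracking error).

    `reference` stands for the generated motion and `tracked` for what a
    tracking policy achieved in simulation (both assumed, not from the paper).
    """
    if not reference or len(reference) != len(tracked):
        raise ValueError("motions must be non-empty and have equal length")
    squared_error = 0.0
    count = 0
    for ref_frame, trk_frame in zip(reference, tracked):
        if len(ref_frame) != len(trk_frame):
            raise ValueError("frames must have the same number of joint values")
        for r, t in zip(ref_frame, trk_frame):
            squared_error += (r - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one joint value")
    return math.exp(-scale * squared_error / count)


def rlpf_reward(
    reference: Motion,
    tracked: Motion,
    alignment_score: float,
    alignment_threshold: float = 0.5,  # assumed gating threshold
    penalty: float = -1.0,             # assumed reward for misaligned motions
) -> float:
    """Gate the feasibility reward on a text-alignment score (an assumption)."""
    if alignment_score < alignment_threshold:
        return penalty
    return tracking_reward(reference, tracked)


if __name__ == "__main__":
    generated = [[0.0, 1.0], [0.5, 0.5]]
    simulated = [[0.0, 0.8], [0.4, 0.5]]
    print(rlpf_reward(generated, simulated, alignment_score=0.9))
    print(rlpf_reward(generated, simulated, alignment_score=0.1))
```

Gating on the alignment score is only one way to keep the generator from trading text fidelity for trackability; a weighted sum of the two terms would be an equally plausible reading of the abstract.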

Publication
European Conference on Computer Vision (ECCV), 2026.
Links