[NeurIPS'25] OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Abstract

Recent advances in large multimodal models (LMMs) have greatly improved video comprehension, yet their performance remains limited in first-person scenarios. The interactive nature of egocentric videos is critical for applications like embodied intelligence, but it introduces complex visual contexts that conventional models struggle to capture. To bridge this gap, we introduce OpenMMEgo, with innovations across three dimensions: data, model, and training strategy. To provide rich spatiotemporal visual knowledge, we curate a large-scale, high-quality dataset named OME10M, comprising over 8.2M egocentric video QA pairs synthesized from the Ego4D series. We also establish OMEBench, a comprehensive benchmark for rigorous assessment of egocentric understanding. To handle the frequent viewpoint shifts inherent in egocentric videos, we implement semantic-aware visual token compression. This is complemented by a curriculum learning strategy that fosters stable learning across data of varying complexity. OpenMMEgo consistently improves the performance of LMMs on egocentric benchmarks without sacrificing general video understanding performance. Notably, Qwen2.5-VL tuned with OpenMMEgo substantially outperforms other models of the same size in egocentric video understanding. All components will be open-sourced.
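The abstract mentions semantic-aware visual token compression without detailing the mechanism. The sketch below is purely illustrative and not the paper's method: it shows one plausible scheme in which consecutive visual tokens with high cosine similarity are merged, shrinking the token sequence while preserving semantically distinct content. The function name, threshold, and token shapes are assumptions for illustration.

```python
# Illustrative sketch only; not the OpenMMEgo implementation.
# Merges consecutive visual tokens whose embeddings are nearly identical,
# which reduces redundancy from frames that share the same viewpoint.
import torch
import torch.nn.functional as F

def compress_tokens(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """tokens: (num_tokens, dim) visual tokens for one frame or clip."""
    kept = [tokens[0]]
    for tok in tokens[1:]:
        sim = F.cosine_similarity(tok, kept[-1], dim=0)
        if sim > sim_threshold:
            # Near-duplicate of the previous kept token: average the two.
            kept[-1] = (kept[-1] + tok) / 2
        else:
            # Semantically distinct token: keep it as a new entry.
            kept.append(tok)
    return torch.stack(kept)

# Example: 196 patch tokens of dimension 1024, compressed per frame.
frame_tokens = torch.randn(196, 1024)
compressed = compress_tokens(frame_tokens)
print(compressed.shape)  # (<=196, 1024)
```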

Publication
Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS), Dec. 2-7, 2025.