Generalist Agents

Reinforcement Learning, Foundation Models

Recently developed foundation models, such as large language models and multi-modal models, open up great opportunities to build generally capable agents when combined with reinforcement learning. This project focuses on learning skills and foundation models, and on connecting them to build generalist agents. Below, we introduce some of our studies. For more details, please refer to the papers.

Plan4MC

We study how to build a multi-task agent in Minecraft. Without human demonstrations, solving long-horizon tasks in this open-ended environment with reinforcement learning (RL) is extremely sample-inefficient. To tackle this challenge, we decompose solving Minecraft tasks into learning basic skills and planning over those skills. We propose three types of fine-grained basic skills in Minecraft and use RL with intrinsic rewards to accomplish them with high success rates. For skill planning, we use large language models to find the relationships between skills and build a skill graph in advance. When the agent is solving a task, our skill search algorithm walks the skill graph and generates a suitable skill plan for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, many of which require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin on most tasks.
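As a rough illustration of planning over a skill graph, the following Python sketch runs a depth-first search from a target item back to skills that need no prerequisites. It is not the Plan4MC implementation: the Skill class, the GRAPH entries, the item quantities, and the plan_skills function are all hypothetical names and values chosen for this example, and the graph is assumed to be acyclic with every requirement consumed on use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Skill:
    """A basic skill: the items it consumes and how many units it yields."""
    name: str
    yields: int = 1                                          # units produced per execution
    requires: Dict[str, int] = field(default_factory=dict)   # item -> units consumed


# Hypothetical skill graph, keyed by the item each skill produces.
GRAPH: Dict[str, Skill] = {
    "log": Skill("harvest_log"),
    "planks": Skill("craft_planks", yields=4, requires={"log": 1}),
    "stick": Skill("craft_stick", yields=4, requires={"planks": 2}),
    "wooden_pickaxe": Skill("craft_wooden_pickaxe",
                            requires={"planks": 3, "stick": 2}),
}


def plan_skills(target: str, graph: Dict[str, Skill],
                inventory: Optional[Dict[str, int]] = None) -> List[str]:
    """Depth-first search over an acyclic skill graph.

    Returns skill names in the order they should be executed.
    """
    have = dict(inventory or {})
    plan: List[str] = []

    def satisfied(skill: Skill) -> bool:
        return all(have.get(item, 0) >= n for item, n in skill.requires.items())

    def obtain(item: str, count: int) -> None:
        skill = graph[item]
        while have.get(item, 0) < count:
            # Re-check after each pass: obtaining one prerequisite
            # may consume units of another prerequisite.
            while not satisfied(skill):
                for req, n in skill.requires.items():
                    obtain(req, n)
            for req, n in skill.requires.items():
                have[req] = have.get(req, 0) - n
            have[item] = have.get(item, 0) + skill.yields
            plan.append(skill.name)

    obtain(target, 1)
    return plan


print(plan_skills("wooden_pickaxe", GRAPH))
```

With this toy graph, the printed plan harvests a log and crafts planks twice, because crafting sticks uses up planks that the pickaxe also needs. A real planner would additionally have to handle tools that are not consumed, skills that fail, and items obtainable in more than one way.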

State-to-Go Transformer

Learning from visual observation (LfVO), which aims to recover policies from visual observation data alone, is a promising yet challenging problem. Existing LfVO approaches either adopt only inefficient online learning schemes or require additional task-specific information such as goal states, making them unsuitable for open-ended tasks. To address these issues, we propose a two-stage framework for learning from visual observation. In the first stage, we introduce the State-to-Go (STG) Transformer and pretrain it offline to predict and differentiate latent transitions of demonstrations. In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks, where an agent learns solely from these intrinsic rewards. Empirical results on Atari and Minecraft show that our method outperforms baselines and, in some tasks, even achieves performance comparable to a policy learned from environmental rewards. These results shed light on the potential of using video-only data to solve difficult visual reinforcement learning tasks, rather than relying on complete offline datasets containing states, actions, and rewards.
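The second stage can be pictured with a small Python sketch in which a frozen, pretrained model scores each transition and that score replaces the environment reward. This is only a toy under stated assumptions, not the STG Transformer or its reward formula: ToyTransitionModel, its exponential distance score, toy_step, and collect_intrinsic_rewards are hypothetical stand-ins, and raw observations play the role of latents.

```python
import math
from typing import Callable, List, Tuple

Obs = List[float]
Transition = Tuple[Obs, float, float, Obs]  # (obs, action, intrinsic reward, next obs)


class ToyTransitionModel:
    """Stand-in for a pretrained, frozen stage-one model (hypothetical)."""

    def predict_next(self, latent: Obs) -> Obs:
        # Assumed "demonstration dynamics": every coordinate increases by 1.
        return [x + 1.0 for x in latent]

    def expertness(self, latent: Obs, next_latent: Obs) -> float:
        # Score in (0, 1]; 1.0 means the transition matches the prediction.
        predicted = self.predict_next(latent)
        return math.exp(-math.dist(predicted, next_latent))


def toy_step(obs: Obs, action: float) -> Tuple[Obs, float]:
    """Toy environment: the action is added to every coordinate."""
    return [x + action for x in obs], 0.0  # the environment reward is unused


def collect_intrinsic_rewards(model: ToyTransitionModel,
                              step_fn: Callable[[Obs, float], Tuple[Obs, float]],
                              policy: Callable[[Obs], float],
                              obs: Obs, horizon: int) -> List[Transition]:
    """Roll out `policy` and label every step with an intrinsic reward only."""
    transitions: List[Transition] = []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, _env_reward = step_fn(obs, action)  # discarded on purpose
        reward = model.expertness(obs, next_obs)
        transitions.append((obs, action, reward, next_obs))
        obs = next_obs
    return transitions


model = ToyTransitionModel()
imitating = collect_intrinsic_rewards(model, toy_step, lambda o: 1.0, [0.0, 0.0], 3)
idle = collect_intrinsic_rewards(model, toy_step, lambda o: 0.0, [0.0, 0.0], 3)
print([r for _, _, r, _ in imitating], [round(r, 3) for _, _, r, _ in idle])
```

A policy whose transitions match the demonstration-like dynamics receives the maximum score at every step, while an idle policy receives a lower one, so any standard RL algorithm trained on these transitions is pushed towards demonstration-like behaviour without ever seeing environment rewards.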

Publications

[NeurIPS'23] Learning from Visual Observation via Offline Pretrained State-to-Go Transformer

Bohan Zhou, Ke Li, Jiechuan Jiang, and Zongqing Lu

Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS), Dec. 10-16, 2023.

(Acceptance Rate: 26.1% = 3221/12343)

Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks

Haoqi Yuan, Chi Zhang, Hongcheng Wang, Feiyang Xie, Penglin Cai, Hao Dong, Zongqing Lu

NeurIPS 2023 Workshop on Foundation Models for Decision Making.

[ICLR'24] Pre-Training Goal-Based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie, Zongqing Lu

Twelfth International Conference on Learning Representations (ICLR), May 7-11, 2024.

(Acceptance Rate: 1.2%, oral)

[ICLR'24] Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Sipeng Zheng, Jiazheng Liu, Yicheng Feng, and Zongqing Lu

Twelfth International Conference on Learning Representations (ICLR), May 7-11, 2024.

(Acceptance Rate: 31%)

[NAACL'24] AdaRefiner: Refining Decisions of Language Models with Adaptive Feedback

Wanpeng Zhang and Zongqing Lu

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Findings, June 16–21, 2024.

[NAACL'24] LLaMA Rider: Spurring Large Language Models to Explore the Open World

Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, and Zongqing Lu

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Findings, June 16–21, 2024.

Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F Karlsson, Bo An, and Zongqing Lu

ICLR 2024 Workshop on LLM Agents.

[ECCV'24] Pre-Trained Visual Dynamics Representations for Efficient Policy Learning

Hao Luo, Bohan Zhou, and Zongqing Lu

European Conference on Computer Vision (ECCV), Sep. 29-Oct. 4, 2024.

[ECCV'24] Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo, Ziluo Ding, and Zongqing Lu

European Conference on Computer Vision (ECCV), Sep. 29-Oct. 4, 2024.

[ECCV'24] UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, and Zongqing Lu

European Conference on Computer Vision (ECCV), Sep. 29-Oct. 4, 2024.

[ECCV'24] Visual Grounding for Object-Level Generalization in Reinforcement Learning

Haobin Jiang and Zongqing Lu

European Conference on Computer Vision (ECCV), Sep. 29-Oct. 4, 2024.

[NeurIPS'24] RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, and Jiaya Jia

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), Dec. 9-15, 2024.

(Acceptance Rate: 25.8%, oral)

[NAACL'25] LLM-Based Explicit Models of Opponents for Multi-Agent Games

Xiaopeng Yu, Wanpeng Zhang, and Zongqing Lu

Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), April 29-May 4, 2025.

[ICLR'25] Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

Hao Luo and Zongqing Lu

The Thirteenth International Conference on Learning Representations (ICLR), April 24-28, 2025.

(Acceptance Rate: 32.08%)

[ICLR'25] MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Junpeng Yue, Xinrun Xu, Börje F. Karlsson, and Zongqing Lu

The Thirteenth International Conference on Learning Representations (ICLR), April 24-28, 2025.

(Acceptance Rate: 32.08%)

[ICLR'25] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu

The Thirteenth International Conference on Learning Representations (ICLR), April 24-28, 2025.

(Acceptance Rate: 32.08%)

[ICML'25] Cradle: Empowering Foundation Agents towards General Computer Control

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, YuJie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng YAN, Zongqing Lu

Forty-Second International Conference on Machine Learning (ICML), July 13-19, 2025.

(Acceptance Rate: 26.9% = 3260/12107)

[ICML'25] Scaling Large Motion Models with Million-Level Human Motions

Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Weishuai Zeng, Qin Jin, Zongqing Lu

Forty-Second International Conference on Machine Learning (ICML), July 13-19, 2025.

(Acceptance Rate: 26.9% = 3260/12107)

[ACL'25] Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Wenye Lin, Jonathan Roberts, Yunhan Yang, Samuel Albanie, Zongqing Lu, Kai Han

The 63rd Annual Meeting of the Association for Computational Linguistics (ACL), July 27-August 1, 2025.

[IROS'25] NOLO: Navigate Only Look Once

Bohan Zhou, Zhongbin Zhang, Jiangxing Wang, Zongqing Lu

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 19-25, 2025.

[ICCV'25] GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye

International Conference on Computer Vision (ICCV), October 19-23, 2025.

(Acceptance Rate: 24%)

[ICCV'25] MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model

Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu

International Conference on Computer Vision (ICCV), October 19-23, 2025.

(Acceptance Rate: 24%)

[ICCV'25] Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

International Conference on Computer Vision (ICCV), October 19-23, 2025.

(Acceptance Rate: 24%)

[ICCV'25] VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, Zongqing Lu

International Conference on Computer Vision (ICCV), October 19-23, 2025.

(Acceptance Rate: 24%)