Generalist Agents

Reinforcement Learning, Foundation Models

Recently developed foundation models, such as large language models and multi-modal models, open great opportunities to build generally capable agents, combined with reinforcement learning. This project focuses on learning skills and foundation models and connecting them to build generalist agents. In the following, we introduce some of our studies. For more details, please refer to the papers.

Plan4MC

We study building a multi-task agent in Minecraft. Without human demonstrations, solving long-horizon tasks in this open-ended environment with reinforcement learning (RL) is extremely sample inefficient. To tackle the challenge, we decompose solving Minecraft tasks into learning basic skills and planning over the skills. We propose three types of fine-grained basic skills in Minecraft, and use RL with intrinsic rewards to accomplish basic skills with high success rates. For skill planning, we use Large Language Models to find the relationships between skills and build a skill graph in advance. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing for more than 10 skills. Our method outperforms baselines in most tasks by a large margin.

State-to-Go Transformer

Learning from visual observation (LfVO), aiming at recovering policies from only visual observation data, is promising yet a challenging problem. Existing LfVO approaches either only adopt inefficient online learning schemes or require additional task-specific information like goal states, making them not suited for open-ended tasks. To address these issues, we propose a two-stage framework for learning from visual observation. In the first stage, we introduce and pretrain State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations. Subsequently, in the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks where an agent learns merely from intrinsic rewards. Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards.

Publications

[NeurIPS'23] Learning from Visual Observation via Offline Pretrained State-to-Go Transformer

Bohan Zhou, Ke Li, Jiechuan Jiang, and Zongqing Lu

Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS), Dec. 10-16, 2023.

(Acceptance Rate: 26.1%=³²²¹⁄₁₂₃₄₃)

Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks

Haoqi Yuan, Chi Zhang, Hongcheng Wang, Feiyang Xie, Penglin Cai, Hao Dong, Zongqing Lu

NeurIPS 2023 Workshop on Foundation Models for Decision Making.

[ICLR'24] Pre-Training Goal-Based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie, Zongqing Lu

Eighth International Conference on Learning Representations (ICLR), May 7-11, 2024.

(Acceptance Rate: 1.2%, oral)

[ICLR'24] Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Sipeng Zheng, Jiazheng Liu, Yicheng Feng, and Zongqing Lu

Eighth International Conference on Learning Representations (ICLR), May 7-11, 2024.

(Acceptance Rate: 31%)

[NAACL'24] AdaRefiner: Refining Decisions of Language Models with Adaptive Feedback

Wanpeng Zhang and Zongqing Lu

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Findings, June 16–21, 2024.

[NAACL'24] LLaMA Rider: Spurring Large Language Models to Explore the Open World

Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, and Zongqing Lu

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Findings, June 16–21, 2024.

Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F Karlsson, Bo An, and Zongqing Lu

arXiv:2403.03186