[NeurIPS'24] Pre-Trained Multi-Goal Transformers with Prompt Optimization for Efficient Online Adaptation

Abstract

Efficiently solving unseen tasks remains a challenge in reinforcement learning (RL), especially for long-horizon tasks composed of multiple subtasks. Pre-training policies on task-agnostic datasets has emerged as a promising approach, yet existing methods still require substantial RL interaction to learn new tasks. We introduce MGPO, a method that leverages the power of Transformer-based policies to model sequences of goals, enabling efficient online adaptation through prompt optimization. In its pre-training phase, MGPO utilizes hindsight multi-goal relabeling and behavior cloning. This combination equips the policy to model diverse long-horizon behaviors that align with varying goal sequences. During online adaptation, the goal sequence, conceptualized as a prompt, is optimized to improve task performance. We adopt a multi-armed bandit framework for this process, improving prompt selection based on the returns of online trajectories. Our experiments across various environments demonstrate that MGPO holds substantial advantages in sample efficiency, online adaptation performance, robustness, and interpretability compared with existing methods.
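To make the bandit view of prompt optimization concrete, here is a minimal Python sketch: each candidate goal-sequence prompt is treated as an arm, arms are selected with a UCB1 rule, and arm values are updated from the returns of online rollouts. The `PromptBandit` class, the candidate prompt set, the `rollout` stub, and the use of UCB1 specifically are illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random

class PromptBandit:
    """UCB1 bandit over candidate goal-sequence prompts (illustrative)."""

    def __init__(self, prompts):
        self.prompts = prompts              # candidate goal sequences
        self.counts = [0] * len(prompts)    # times each prompt was tried
        self.values = [0.0] * len(prompts)  # running mean return per prompt

    def select(self):
        # Try every prompt once before applying the UCB1 rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        ucb = [
            self.values[i] + math.sqrt(2.0 * math.log(total) / self.counts[i])
            for i in range(len(self.prompts))
        ]
        return max(range(len(self.prompts)), key=lambda i: ucb[i])

    def update(self, i, ret):
        # Incrementally update the mean return of prompt i.
        self.counts[i] += 1
        self.values[i] += (ret - self.values[i]) / self.counts[i]


def rollout(prompt):
    # Placeholder: condition the pre-trained multi-goal policy on the goal
    # sequence `prompt`, run one episode, and return the episode return.
    return random.gauss(len(prompt), 1.0)  # dummy return for the sketch


if __name__ == "__main__":
    candidates = [("g1",), ("g1", "g2"), ("g2", "g3", "g1")]
    bandit = PromptBandit(candidates)
    for _ in range(50):  # online adaptation episodes
        arm = bandit.select()
        bandit.update(arm, rollout(candidates[arm]))
    best = max(range(len(candidates)), key=lambda i: bandit.values[i])
    print("best prompt:", candidates[best])
```

Because prompt selection reduces to comparing scalar returns across a discrete set of goal sequences, this scheme needs no gradient updates to the pre-trained policy, which is what makes the online adaptation sample-efficient and the chosen prompt interpretable.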

Publication
Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), Dec. 9-15, 2024.