[AAAI'23] Online Tuning for Offline Decentralized Multi-Agent Reinforcement Learning

Abstract

Offline reinforcement learning can learn effective policies from a fixed dataset, which is promising for real-world applications. However, in offline decentralized multi-agent reinforcement learning, the discrepancy between each agent's behavior policy and its learned policy means that the transition dynamics in the offline experiences do not match the transition dynamics in online execution. This mismatch creates severe errors in value estimates and leads to uncoordinated, low-performing policies. One way to overcome this problem is to bridge offline training with online tuning. However, to preserve both deployment efficiency and sample efficiency, we can collect only very limited online experience, which is insufficient on its own for updating the agent policies. To exploit both offline and online experiences when tuning the agents' policies, we introduce online transition correction (OTC), which implicitly corrects the offline transition dynamics by modifying sampling probabilities. We design two types of distances, i.e., embedding-based and value-based distances, to measure the similarity between transitions, and further propose an adaptive rank-based prioritization that samples transitions according to this similarity. OTC is simple yet effective at increasing data efficiency and improving agent policies during online tuning. Empirically, OTC outperforms baselines on a variety of tasks.
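To make the core idea concrete, the sketch below illustrates rank-based prioritized sampling of offline transitions, where transitions whose dynamics look closest to the online transitions are drawn more often. It is only a minimal illustration under assumptions: the function names, the scalar per-transition distance input, and the fixed exponent alpha are hypothetical and do not reproduce the paper's adaptive scheme or its embedding-based and value-based distance definitions.

```python
import numpy as np

def rank_based_priorities(distances, alpha=0.7):
    """Rank-based prioritization: offline transitions with the smallest
    distance to observed online transitions receive the highest sampling
    probability. alpha (assumed fixed here) controls how sharply the
    distribution concentrates on the top-ranked transitions."""
    ranks = np.argsort(np.argsort(distances)) + 1  # rank 1 = most similar
    priorities = (1.0 / ranks) ** alpha
    return priorities / priorities.sum()

def sample_offline_batch(buffer, distances, batch_size, alpha=0.7, rng=None):
    """Sample offline transitions with probability proportional to their
    rank-based priority, so the effective offline transition dynamics are
    pulled toward those seen during online execution."""
    rng = rng or np.random.default_rng()
    probs = rank_based_priorities(distances, alpha)
    idx = rng.choice(len(buffer), size=batch_size, p=probs)
    return [buffer[i] for i in idx]

# Toy usage: 1000 stored transitions, each with a (hypothetical) scalar
# distance to the closest matching online transition.
buffer = list(range(1000))
distances = np.random.rand(1000)
batch = sample_offline_batch(buffer, distances, batch_size=32)
```

Reweighting the replay distribution in this way corrects the offline transition dynamics implicitly, without modifying the stored transitions themselves.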

Publication
Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), February 7-14, 2023.