Open access · Posted Content

The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games

02 Mar 2021 · arXiv: Learning
Abstract: Proximal Policy Optimization (PPO) is a popular on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that on-policy methods are significantly less sample efficient than their off-policy counterparts in multi-agent problems. In this work, we investigate Multi-Agent PPO (MAPPO), a variant of PPO which is specialized for multi-agent settings. Using a 1-GPU desktop, we show that MAPPO achieves surprisingly strong performance in three popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. In the majority of environments, we find that compared to off-policy baselines, MAPPO achieves strong results while exhibiting comparable sample efficiency. Finally, through ablation studies, we present the implementation and algorithmic factors that are most influential to MAPPO's practical performance.
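
The abstract describes MAPPO as PPO applied per agent together with a centrally trained value function. As a rough illustration only (the network names, the clipping constant, and the use of a concatenated global state are assumptions for this sketch, not details from the paper), a single MAPPO-style loss computation might look like:

```python
# Illustrative MAPPO-style loss: decentralized actor, centralized critic.
# Hyperparameters and the "global_state" input are assumptions for this sketch.
import torch

def mappo_loss(actor, critic, obs, global_state, actions, old_log_probs,
               returns, advantages, clip_eps=0.2, value_coef=0.5):
    # Decentralized actor: each agent's action distribution from its own observation.
    dist = actor(obs)                                  # e.g. a Categorical over actions
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)       # importance ratio pi_new / pi_old

    # PPO clipped surrogate objective, applied per agent.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Centralized critic: value function conditioned on global (joint) state.
    value_loss = (critic(global_state).squeeze(-1) - returns).pow(2).mean()

    return policy_loss + value_coef * value_loss
```

The only structural difference from single-agent PPO in this sketch is that the critic conditions on information beyond a single agent's own observation.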


Topics: Hyperparameter (53%)
Citations

25 results found


Open access · Posted Content
14 Jun 2020
Abstract: Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we evaluate and compare three different classes of MARL algorithms (independent learners, centralised training with decentralised execution, and value decomposition) in a diverse range of multi-agent learning tasks. Our results show that (1) algorithm performance depends strongly on environment properties and no algorithm learns efficiently across all learning tasks; (2) independent learners often achieve equal or better performance than more complex algorithms; (3) tested algorithms struggle to solve multi-agent tasks with sparse rewards. We report detailed empirical data, including a reliability analysis, and provide insights into the limitations of the tested algorithms.


Topics: Reinforcement learning (58%)

10 Citations


Open access · Journal Article · DOI: 10.1109/JIOT.2021.3078462
Liang Yu, Shuqi Qin, Meng Zhang, Chao Shen +2 more · Institutions (3)
Abstract: Buildings globally account for about 30% of total energy consumption and carbon emissions, raising severe energy and environmental concerns. It is therefore important and urgent to develop novel smart building energy management (SBEM) technologies to advance energy-efficient and green buildings. However, this is a nontrivial task due to the following challenges. First, it is generally difficult to develop an explicit building thermal dynamics model that is both accurate and efficient enough for building control. Second, there are many uncertain system parameters (e.g., renewable generation output, outdoor temperature, and the number of occupants). Third, there are many spatially and temporally coupled operational constraints. Fourth, building energy optimization problems cannot be solved in real time by traditional methods when they have extremely large solution spaces. Fifth, traditional building energy management methods have their own applicable premises, which means they have low versatility when confronted with varying building environments. With the rapid development of Internet of Things technology and computational capability, artificial intelligence has shown significant competence in control and optimization. As a general artificial intelligence technique, deep reinforcement learning (DRL) is promising for addressing the above challenges. Notably, recent years have seen a surge of DRL for SBEM. However, a systematic overview of different DRL methods for SBEM has been lacking. To fill this gap, this article provides a comprehensive review of DRL for SBEM from the perspective of system scale. In particular, we identify existing unresolved issues and point out possible future research directions.


Topics: Efficient energy use (57%), Energy management (55%), Building automation (55%)

9 Citations


Open access · Posted Content
Chi Jin, Qinghua Liu, Tiancheng Yu · Institutions (2)
29 Sep 2021 · arXiv: Learning
Abstract: Modern reinforcement learning (RL) commonly engages practical problems with large state spaces, where function approximation must be deployed to approximate either the value function or the policy. While recent progress in RL theory addresses a rich set of RL problems with general function approximation, such successes are mostly restricted to the single-agent setting. It remains elusive how to extend these results to multi-agent RL, especially given the new challenges arising from its game-theoretic nature. This paper considers two-player zero-sum Markov Games (MGs). We propose a new algorithm that can provably find the Nash equilibrium policy using a polynomial number of samples, for any MG with low multi-agent Bellman-Eluder dimension -- a new complexity measure adapted from its single-agent version (Jin et al., 2021). A key component of our new algorithm is the exploiter, which facilitates the learning of the main player by deliberately exploiting her weakness. Our theoretical framework is generic and applies to a wide range of models, including but not limited to tabular MGs, MGs with linear or kernel function approximation, and MGs with rich observations.
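
For context, the solution concept the algorithm provably approximates is the minimax (Nash) value of the zero-sum MG. A standard statement of the corresponding Bellman equation (notation assumed here, not taken from the paper) is:

```latex
% Minimax Bellman equation for a two-player zero-sum Markov Game:
% \mu is the max-player's mixed strategy, \nu the min-player's.
V^{*}(s) \;=\; \max_{\mu \in \Delta(\mathcal{A})} \; \min_{\nu \in \Delta(\mathcal{B})}
  \; \mathbb{E}_{a \sim \mu,\; b \sim \nu}
  \Big[ r(s, a, b) \;+\; \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a, b)} \big[ V^{*}(s') \big] \Big]
```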


5 Citations


Open access · Posted Content
Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen +3 more · Institutions (1)
Abstract: Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, can have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not require agents to share parameters, nor do they need any restrictive assumptions on the decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO, and MADDPG on all tested tasks, thereby establishing a new state of the art.
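
The multi-agent advantage decomposition lemma mentioned here states, roughly, that the joint advantage telescopes into per-agent advantages, each conditioned on the actions of the agents handled before it (this is a paraphrase, not the paper's exact statement):

```latex
% Multi-agent advantage decomposition (paraphrased): for any ordering i_1,...,i_n
% of the agents, the joint advantage splits into per-agent advantages, each
% conditioned on the actions already chosen by the preceding agents.
A_{\pi}^{i_{1:n}}\!\left(s, \mathbf{a}^{i_{1:n}}\right)
  \;=\; \sum_{m=1}^{n} A_{\pi}^{i_m}\!\left(s, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\right)
```

This decomposition is what licenses the sequential, agent-by-agent policy update scheme: each agent improves its own term given the agents updated before it, which is how the joint policy can improve monotonically.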


Topics: Reinforcement learning (58%), Trust region (55%)

2 Citations


Open access · Posted Content
19 Aug 2021 · arXiv: Learning
Abstract: Policy gradient (PG) methods are popular reinforcement learning (RL) methods in which a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, first, quantifying the contributions of the number of agents and agents' explorations to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. For settings that use deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG methods in MARL. On benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA algorithms by a significant margin.
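
As background for the "optimal baseline", the generic per-agent policy-gradient estimator with a baseline is shown below (this is the standard textbook form, not the paper's derivation). Any baseline that does not depend on agent i's own action leaves the estimator unbiased but changes its variance; OB is the variance-minimizing choice.

```latex
% Per-agent MAPG estimator with a baseline b (generic form). Because b does not
% depend on agent i's own action a^i, subtracting it keeps the gradient unbiased
% while altering the variance of the estimate.
\nabla_{\theta_i} J \;=\;
  \mathbb{E}_{s,\,\mathbf{a}}\!\left[
    \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a^{i} \mid s\right)
    \left( Q^{\pi}\!\left(s, \mathbf{a}\right) - b\!\left(s, \mathbf{a}^{-i}\right) \right)
  \right]
```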


2 Citations


References

32 results found


Open access · Posted Content
20 Jul 2017 · arXiv: Learning
Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
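
For reference, the clipped form of the "surrogate" objective proposed in this paper, with probability ratio between the new and old policies and an advantage estimate, is:

```latex
% PPO clipped surrogate objective, with
% r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)
% and advantage estimate \hat{A}_t; \epsilon is the clipping parameter.
L^{\mathrm{CLIP}}(\theta) \;=\;
  \hat{\mathbb{E}}_t\!\left[
    \min\!\left( r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
  \right]
```

Clipping removes the incentive for the ratio to move outside the interval [1 - ε, 1 + ε], which is what allows multiple epochs of minibatch updates on the same data without the destructive policy changes that plain policy gradient would suffer.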


Topics: Gradient descent (58%), Reinforcement learning (55%), Trust region (54%)

5,348 Citations


Proceedings Article · DOI: 10.1109/IROS.2012.6386109
Emanuel Todorov, Tom Erez, Yuval Tassa · Institutions (1)
24 Dec 2012
Abstract: We describe a new physics engine tailored to model-based control. Multi-joint dynamics are represented in generalized coordinates and computed via recursive algorithms. Contact responses are computed via efficient new algorithms we have developed, based on the modern velocity-stepping approach which avoids the difficulties with spring-dampers. Models are specified using either a high-level C++ API or an intuitive XML file format. A built-in compiler transforms the user model into an optimized data structure used for runtime computation. The engine can compute both forward and inverse dynamics. The latter are well-defined even in the presence of contacts and equality constraints. The model can include tendon wrapping as well as actuator activation states (e.g. pneumatic cylinders or muscles). To facilitate optimal control applications and in particular sampling and finite differencing, the dynamics can be evaluated for different states and controls in parallel. Around 400,000 dynamics evaluations per second are possible on a 12-core machine, for a 3D humanoid with 18 degrees of freedom and 6 active contacts. We have already used the engine in a number of control applications. It will soon be made publicly available.
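
To make the XML model format and the compile-then-step workflow concrete, here is a minimal sketch using the present-day official Python bindings (the mujoco package) rather than the C++ API described in the abstract; the toy model is purely illustrative and not taken from the paper.

```python
# Minimal MuJoCo usage sketch via the official Python bindings (`pip install mujoco`).
# The XML model below is a toy example: a single free-falling sphere.
import mujoco

XML = """
<mujoco>
  <worldbody>
    <body name="ball" pos="0 0 1">
      <freejoint/>
      <geom type="sphere" size="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)   # compile the XML into an optimized data structure
data = mujoco.MjData(model)                   # runtime state: qpos, qvel, contacts, ...

for _ in range(1000):
    mujoco.mj_step(model, data)               # one forward-dynamics step

print(data.qpos)                              # generalized coordinates after simulation
```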


Topics: Physics engine (56%), Inverse dynamics (53%), Optimal control (52%)

2,593 Citations


Open access · Book Chapter · DOI: 10.1016/B978-1-55860-335-6.50027-1
Michael L. Littman · Institutions (1)
10 Jul 1994
Abstract: In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
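
The "Q-learning-like algorithm" referred to above replaces the max in ordinary Q-learning with the value of a matrix game solved at each state; a paraphrased form of the value and the update, with the learner's action a, the opponent's action o, and learning rate α, is:

```latex
% Minimax-Q (paraphrased): the state value is the maximin over the learner's
% mixed strategies, which is why the optimal policy can be probabilistic.
V(s) \;=\; \max_{\pi \in \Delta(\mathcal{A})} \; \min_{o \in \mathcal{O}} \;
  \sum_{a \in \mathcal{A}} \pi(a)\, Q(s, a, o)
\qquad
Q(s, a, o) \;\leftarrow\; (1-\alpha)\, Q(s, a, o) \;+\; \alpha \left( r + \gamma\, V(s') \right)
```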


2,171 Citations


Open access · Proceedings Article
01 Jan 2016
Abstract: Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
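
Below is a minimal sketch of the proportional-prioritization scheme the abstract describes, with priorities tied to TD-error magnitude and importance-sampling weights correcting the bias of non-uniform sampling. The sum-tree data structure the paper uses for efficiency is omitted, and the exponents alpha and beta are illustrative defaults rather than the paper's tuned settings.

```python
# Illustrative proportional prioritized replay (no sum-tree; O(n) sampling).
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so each is replayed at least once.
        p = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0); self.priorities.pop(0)
        self.buffer.append(transition); self.priorities.append(p)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                                   # P(i) proportional to p_i^alpha
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps            # priority tracks |TD error|
```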


Topics: Reinforcement learning (53%)

1,859 Citations


Open access · Proceedings Article
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb +2 more · Institutions (3)
07 Jun 2017
Abstract: We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
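
The actor-critic adaptation described above trains, for each agent, a decentralized actor alongside a critic that conditions on all agents' observations and actions; a paraphrased deterministic policy-gradient form of that update (notation assumed, not copied verbatim from the paper) is:

```latex
% Centralized-critic actor update (paraphrased): agent i's actor \mu_i sees only its
% own observation o_i, while the critic Q_i conditions on the joint observations x
% and all agents' actions, which removes the non-stationarity from agent i's view.
\nabla_{\theta_i} J(\mu_i) \;=\;
  \mathbb{E}_{x,\, a \sim \mathcal{D}}\!\left[
    \nabla_{\theta_i} \mu_i\!\left(o_i\right)\,
    \nabla_{a_i} Q_i^{\mu}\!\left(x, a_1, \dots, a_N\right)\Big|_{a_i = \mu_i(o_i)}
  \right]
```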


Topics: Reinforcement learning (57%)

1,269 Citations


Performance Metrics
No. of citations received by the paper in previous years

Year    Citations
2021    23
2020    2
Network Information
Related Papers (5)
Proximal Policy Optimization Algorithms · 20 Jul 2017, arXiv: Learning

John Schulman, Filip Wolski +3 more

Counterfactual Multi-Agent Policy Gradients · 29 Apr 2018

Jakob Foerster, Gregory Farquhar +3 more

Dota 2 with Large Scale Deep Reinforcement Learning · 01 Jan 2019, arXiv: Learning

Christopher Berner, Greg Brockman +23 more