QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
Citations
Cites background from "QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning"
...Several other heuristics (with no theoretical backing) regarding either reward or value function factorization have been proposed to mitigate the scalability issue (Guestrin et al., 2002a,b; Kok and Vlassis, 2004; Sunehag et al., 2018; Rashid et al., 2018)....
[...]
...Specifically, to mitigate the partial information issue above, a great deal of work assumes the existence of a central controller that can collect information such as joint actions, joint rewards, and joint observations, and even design policies for all agents (Hansen et al., 2004; Oliehoek and Amato, 2014; Lowe et al., 2017; Foerster et al., 2017; Gupta et al., 2017; Foerster et al., 2018; Dibangoye and Buffet, 2018; Chen et al., 2018; Rashid et al., 2018)....
[...]
..., 2004; Oliehoek and Amato, 2014; Kraemer and Banerjee, 2016), and has been widely adopted in recent (deep) MARL works (Lowe et al., 2017; Foerster et al., 2017; Gupta et al., 2017; Foerster et al., 2018; Chen et al., 2018; Rashid et al., 2018)....
[...]
Cites background from "QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning"
...Since agents share common characteristics, such as actions, domain knowledge, and goals (homogeneous agents), scalability can be achieved by (partially) centralized training and decentralized execution [121], [122]....
[...]
References
"QMIX: Monotonic Value Function Fact..." refers methods in this paper
...n their entire action-observation history. Hausknecht and Stone (2015) propose deep recurrent Q-networks (DRQN) that make use of recurrent neural networks. Typically, gated architectures such as LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Chung et al., 2014) are used to facilitate learning over longer timescales. 3.3 Independent Q-Learning Perhaps the most commonly applied method in multi-agent learning is independent Q-learnin...
[...]
"QMIX: Monotonic Value Function Fact..." refers background in this paper
...Deep Q-networks (DQNs) (Mnih et al., 2015) use a replay memory to store the transition tuple 〈s, u, r, s′〉, where the state s′ is observed after taking the action u in state s and receiving reward r....
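A minimal sketch of such a replay memory may help make the snippet concrete. It stores the transition tuple described above and samples uniform minibatches for training; the class and method names here are illustrative, not from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory storing transition tuples <s, u, r, s'>."""

    def __init__(self, capacity):
        # Oldest transitions are evicted first once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store the transition observed after taking action u in state s.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a minibatch of past transitions for training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```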
[...]
...(2017) extend this approach to deep neural networks using DQN (Mnih et al., 2015)....
[...]
"QMIX: Monotonic Value Function Fact..." refers methods in this paper
...Typically, gated architectures such as LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Chung et al., 2014) are used to facilitate learning over longer timescales....
[...]
...Architecture and Training The architecture of all agent networks is a DRQN with a recurrent layer comprised of a GRU with a 64-dimensional hidden state, with a fully-connected layer before and after....
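The agent network just described can be sketched as follows, assuming PyTorch. The fully-connected/GRU/fully-connected structure and the 64-dimensional hidden state follow the snippet; the class name and input/action dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRQNAgent(nn.Module):
    """Sketch of the agent network: a fully-connected layer, a GRU with a
    64-dimensional hidden state, and a fully-connected output layer
    producing one Q-value per discrete action."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def init_hidden(self, batch_size=1):
        # Zero recurrent state at the start of each episode.
        return torch.zeros(batch_size, self.hidden_dim)

    def forward(self, obs, hidden):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, hidden)  # recurrent state summarises the action-observation history
        return self.fc2(h), h    # per-action Q-values and the new hidden state
```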
[...]
"QMIX: Monotonic Value Function Fact..." refers methods in this paper
...This approach does not address the nonstationarity introduced by the changing policies of the learning agents, and thus, unlike Q-learning, has no convergence guarantees even in the limit of infinite exploration....
[...]
...Independent Q-Learning Perhaps the most commonly applied method in multi-agent learning is independent Q-learning (IQL) (Tan, 1993), which decomposes a multi-agent problem into a collection of simultaneous single-agent problems that share the same environment....
[...]
...In this section, we propose a new approach called QMIX which, like VDN, lies between the extremes of IQL and centralised Q-learning, but can represent a much richer class of action-value functions....
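QMIX achieves this richer representation by mixing the per-agent Q-values through a network whose weights are generated by state-conditioned hypernetworks; constraining those weights to be non-negative (via an absolute value) guarantees that the joint value is monotonic in each agent's Q-value. A sketch of this kind of monotonic mixer is below, assuming PyTorch; layer sizes and names (QMixer, embed_dim) are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Sketch of a monotonic mixing network: hypernetworks conditioned on
    the global state produce the mixing weights, and taking their absolute
    value keeps dQ_tot/dQ_a >= 0 for every agent a."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks map the state to the mixing network's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) per-agent Q-values; state: (batch, state_dim).
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)
```

Because the state enters only through the hypernetworks, Q_tot can depend on extra state information in non-monotonic ways while remaining monotonic in each agent's own Q-value, which is what preserves decentralised greedy action selection.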
[...]
...Independent Q-learning (Tan, 1993) trains independent action-value functions for each agent using Q-learning (Watkins, 1989)....
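A minimal sketch of one such independent learner follows, in tabular form; class name and hyperparameters are illustrative. Each agent maintains its own Q-table and applies the standard update, which is exactly what makes its learning target nonstationary when the other agents change their policies.

```python
import numpy as np

class IQLAgent:
    """One independent tabular Q-learner: it treats all other agents as
    part of the environment and updates only its own action-value table."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = np.random.default_rng()

    def act(self, s):
        # Epsilon-greedy over this agent's own action-value estimates.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[s]))

    def update(self, s, a, r, s_next):
        # Standard one-step Q-learning (Watkins, 1989) TD update.
        target = r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.alpha * (target - self.q[s, a])
```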
[...]
...Sparse cooperative Q-learning (Kok & Vlassis, 2006) is a tabular Q-learning algorithm that learns to coordinate the actions of a group of cooperative agents only in the states in which such coordination is necessary, encoding those dependencies in a coordination graph....
[...]