Markov games as a framework for multi-agent reinforcement learning
References
von Neumann, J. and Morgenstern, O. Theory of Games and Economic Behavior. Princeton University Press, 1944.
Watkins, C. J. C. H. and Dayan, P. Technical Note: Q-Learning. Machine Learning, 8:279-292, 1992.
Frequently Asked Questions (10)
Q2. What future work is mentioned in the paper "Markov games as a framework for multi-agent reinforcement learning"?
Identifying an opponent of this type for the Q-learning agent described in this paper would be an interesting topic for future research.
Q3. What is the main idea of the paper?
In particular, the paper describes a reinforcement learning approach to solving two-player zero-sum games in which the “max” operator in the update step of a standard Q-learning algorithm is replaced by a “minimax” operator that can be evaluated by solving a linear program.
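To make that concrete, here is a minimal sketch of the update in Python, assuming a tabular Q indexed by (state, own action, opponent action) as in the paper's formulation; the function names, the step-size and discount defaults, and the use of scipy.optimize.linprog to encode the minimax step are illustrative choices, not the paper's own code.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Solve max over mixed policies pi of min over opponent actions o
    of sum_a pi[a] * Q_s[a, o], by linear programming.

    Q_s: array of shape (n_actions, n_opponent_actions) for one state.
    Returns (value, pi): the game value and the agent's mixed policy.
    """
    n_a, n_o = Q_s.shape
    # Variables: pi[0..n_a-1] and the scalar value v.
    # linprog minimizes, so minimize -v to maximize v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For each opponent action o: v - sum_a pi[a] * Q_s[a, o] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # pi must be a probability distribution (sums to 1, entries in [0, 1]).
    A_eq = np.ones((1, n_a + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

def minimax_q_update(Q, V, pi, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One minimax-Q step: the usual Q-learning blend, except V(s')
    is the minimax value of the matrix game Q[s'], not a plain max."""
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    V[s], pi[s] = minimax_value(Q[s])
```

Each LP solves the matrix game at one state: maximize v subject to the mixed policy earning at least v against every opponent action, which is the standard linear-programming form of a zero-sum matrix game.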
Q4. What is the need for probabilistic action choice?
The need for probabilistic action choice stems from the agent's uncertainty about its opponent's current move and its need to avoid being "second guessed."
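As a generic illustration (not an example from the paper): in matching pennies, any deterministic choice can be predicted and exploited for a sure loss, whereas the mixed strategy that plays each side with probability 1/2 guarantees an expected payoff of zero against every opponent.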
Q5. What is the effect of the discount factor on the players?
For current purposes, the discount factor has the desirable effect of goading the players into trying to win sooner rather than later.
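As a quick illustration (not from the paper): with a discount factor of 0.9, a win worth +1 contributes 0.9 to the discounted sum if it arrives after one step, but only 0.9^5 ≈ 0.59 after five steps, so earlier wins are strictly preferred.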
Q6. What is the alternative approach to solving an MDP?
Solving an MDP using value iteration involves applying Equations 1–2 simultaneously over all s ∈ S. Watkins [Watkins, 1989] proposed an alternative approach that involves performing the updates asynchronously without the use of the transition function, T.
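A minimal sketch of Watkins' asynchronous update in Python, assuming a tabular Q stored as an (n_states, n_actions) array; the function name and the step-size and discount defaults are illustrative:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One asynchronous Q-learning step (Watkins, 1989).

    No transition function T appears: the sampled successor state
    s_next stands in for the expectation over T in Equations 1-2.
    """
    target = r + gamma * np.max(Q[s_next])   # bootstrapped return estimate
    Q[s, a] += alpha * (target - Q[s, a])    # blend new estimate into old
    return Q
```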
Q7. What is the learning rule for a Markov game?
This learning rule converges to the correct values for Q and V, assuming that every action is tried in every state infinitely often and that new estimates are blended with previous ones using a slow enough exponentially weighted average [Watkins and Dayan, 1992].
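Written out (a standard reading of the rule, not a quote from the paper), the blend is Q(s, a) ← (1 − α_t) Q(s, a) + α_t (r + γ V(s′)), and "slow enough" is usually taken to mean the step sizes satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞.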
Q8. How many steps will be needed before the system reaches convergence?
The use of linear programming in the innermost loop of a learning algorithm is somewhat problematic since the computational complexity of each step is large and typically many steps will be needed before the system reaches convergence.
Q9. What is the policy for maximizing the expected sum of discounted reward?
In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward.
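For reference, the value-iteration equations such a policy satisfies, reconstructed in standard MDP notation consistent with the paper's transition function T(s, a, s′) and discount factor γ (pairing them with the paper's Equations 1–2 is an inference from Q6):

```latex
\begin{align}
Q(s, a) &= R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s') \tag{1}\\
V(s)    &= \max_{a \in A} Q(s, a) \tag{2}
\end{align}
```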
Q10. What is the purpose of reinforcement learning?
Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments.