
Peter Sunehag

Researcher at Google

Publications -  65
Citations -  1924

Peter Sunehag is an academic researcher at Google. He has contributed to research topics including reinforcement learning and Markov decision processes. He has an h-index of 15 and has co-authored 62 publications receiving 1348 citations. His previous affiliations include NICTA and the Australian National University.

Papers
Proceedings Article

Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward

TL;DR: This work addresses the problem of cooperative multi-agent reinforcement learning with a single joint reward signal by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions.
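The core idea of the decomposition can be sketched briefly: the joint team value is modeled as the sum of per-agent value functions, so the greedy joint action decomposes into independent per-agent argmaxes. A minimal illustration of that additivity assumption (function names are illustrative, not from the paper's code):

```python
import numpy as np

def team_q_value(per_agent_q_values):
    """The team value is modeled as the sum of agent-wise values
    (the additivity assumption of value decomposition)."""
    return sum(per_agent_q_values)

def greedy_joint_action(per_agent_q):
    """Because the sum is maximized term by term, each agent can
    pick its own greedy action independently.
    per_agent_q: one Q-value array per agent, indexed by action."""
    return [int(np.argmax(q)) for q in per_agent_q]
```

Under this assumption, each agent only ever needs its own value head at execution time, which is what makes the approach scale to decentralized execution.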
Posted Content

Reinforcement Learning in Large Discrete Action Spaces.

TL;DR: This paper leverages prior information about the actions to embed them in a continuous space over which the policy can generalize, and uses approximate nearest-neighbor lookups so that reinforcement learning methods can be applied to large-scale problems previously intractable with existing methods.
Posted Content

Value-Decomposition Networks For Cooperative Multi-Agent Learning

TL;DR: In this paper, a value decomposition network is proposed to decompose the team value function into agent-wise value functions, which leads to superior results when combined with weight sharing, role information and information channels.
Proceedings Article

The Sample-Complexity of General Reinforcement Learning

TL;DR: This paper presents a new algorithm for general reinforcement learning in which the true environment is known to belong to a finite class of N arbitrary models, and shows that compactness is a key criterion for determining the existence of uniform sample-complexity bounds.
Posted Content

Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

TL;DR: The new agent's superiority over agents that either ignore the combinatorial or sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system.