
Showing papers by "Aviv Tamar" published in 2017


Posted Content
TL;DR: An adaptation of actor-critic methods is presented that considers the action policies of other agents and successfully learns policies requiring complex multi-agent coordination.
Abstract: We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

1,477 citations
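
As a rough illustration of the centralized-critic idea in this abstract, here is a minimal PyTorch sketch: each actor sees only its own observation, while the critic conditions on every agent's observation and action. Module names and sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: acts from its own observation only."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Critic conditions on ALL agents' observations and actions,
    which sidesteps the non-stationarity a per-agent critic sees."""
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, all_obs, all_acts):
        # all_obs: (batch, n_agents, obs_dim); all_acts: (batch, n_agents, act_dim)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.net(x)
```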


Proceedings Article
07 Jun 2017
TL;DR: In this article, an actor-critic method was used to learn multi-agent coordination policies in cooperative and competitive multi-player RL games, where agent populations are able to discover various physical and informational coordination strategies.
Abstract: We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

1,273 citations


Proceedings Article
06 Aug 2017
TL;DR: Constrained Policy Optimization (CPO) as discussed by the authors is the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.
Abstract: For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2016; Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

768 citations
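
For intuition, the per-iteration update CPO solves can be sketched as a constrained trust-region problem (notation assumed here, not quoted from the paper): maximize a surrogate return subject to a bound on the constraint cost and on the average KL divergence from the previous policy.

```latex
\begin{align}
\pi_{k+1} = \arg\max_{\pi \in \Pi_\theta}\;
  & \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A^{\pi_k}(s,a) \right] \\
\text{s.t.}\;\;
  & J_C(\pi_k) + \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A_C^{\pi_k}(s,a) \right] \le d \\
  & \bar{D}_{\mathrm{KL}}(\pi \,\|\, \pi_k) \le \delta
\end{align}
```

The KL trust region keeps consecutive policies close, which is what the paper's bound on expected returns ties to the change in return and constraint cost.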


Proceedings ArticleDOI
30 Nov 2017
TL;DR: The preliminary results regarding the power of data-driven routing suggest that applying ML (specifically, deep reinforcement learning) to this context yields high performance and is a promising direction for further research.
Abstract: Recently, much attention has been devoted to the question of whether/when traditional network protocol design, which relies on the application of algorithmic insights by human experts, can be replaced by a data-driven (i.e., machine learning) approach. We explore this question in the context of the arguably most fundamental networking task: routing. Can ideas and techniques from machine learning (ML) be leveraged to automatically generate "good" routing configurations? We focus on the classical setting of intradomain traffic engineering. We observe that this context poses significant challenges for data-driven protocol design. Our preliminary results regarding the power of data-driven routing suggest that applying ML (specifically, deep reinforcement learning) to this context yields high performance and is a promising direction for further research. We outline a research agenda for ML-guided routing.

162 citations
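
To make the setup concrete, here is a hedged Python sketch of one plausible way to cast intradomain traffic engineering as an RL problem: observe recent demand matrices, output per-link weights, and reward low maximum link utilization. The state/action/reward choices and all names below are illustrative assumptions.

```python
import numpy as np

class RoutingEnv:
    """Toy environment: observe a history of demand matrices, emit
    per-link weights, get rewarded for low max link utilization."""
    def __init__(self, n_nodes, n_links, history=5):
        self.n_nodes, self.n_links, self.history = n_nodes, n_links, history

    def reset(self):
        self.demands = np.random.rand(self.history, self.n_nodes, self.n_nodes)
        return self.demands.flatten()              # observation

    def step(self, link_weights):
        next_demand = np.random.rand(self.n_nodes, self.n_nodes)
        utilization = self._route(next_demand, link_weights)
        reward = -utilization.max()                # minimize max utilization
        self.demands = np.roll(self.demands, -1, axis=0)
        self.demands[-1] = next_demand
        return self.demands.flatten(), reward, False, {}

    def _route(self, demand, weights):
        # Placeholder for shortest-path routing induced by `weights`
        # plus flow assignment; here loads are spread uniformly.
        return np.full(self.n_links, demand.sum() / self.n_links)
```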


Posted Content
TL;DR: Constrained Policy Optimization (CPO) as mentioned in this paper is the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.
Abstract: For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2016; Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

137 citations


Proceedings ArticleDOI
19 Aug 2017
TL;DR: This work introduces the value iteration network (VIN), a fully differentiable neural network with a 'planning module' embedded within, and shows that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
Abstract: We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

72 citations
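
The heart of the VIN is the differentiable value-iteration module: each iteration applies a convolution to stacked reward and value maps and then takes a max over the action channels. A minimal PyTorch sketch, with sizes as illustrative assumptions:

```python
import torch
import torch.nn as nn

class VIModule(nn.Module):
    """K iterations of convolutional value iteration."""
    def __init__(self, n_actions=8, k=20):
        super().__init__()
        self.k = k
        # Maps (reward, value) channels to one Q-channel per action.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3,
                                padding=1, bias=False)

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W) learned reward image
        v = torch.zeros_like(reward_map)
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward_map, v], dim=1))
            v, _ = q.max(dim=1, keepdim=True)   # V(s) = max_a Q(s, a)
        return v
```

Because every step is a convolution or a max, the whole planning computation is differentiable and trains end-to-end with standard backpropagation, as the abstract notes.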


Journal ArticleDOI
TL;DR: This work provides sampling-based algorithms for optimization under a coherent-risk objective; the approach is suitable for problems in which tuneable parameters control the distribution of the cost, such as in reinforcement learning or approximate dynamic programming with a parameterized policy.
Abstract: We provide sampling-based algorithms for optimization under a coherent-risk objective. The class of coherent-risk measures is widely accepted in finance and operations research, among other fields, and encompasses popular risk-measures such as conditional value at risk and mean-semi-deviation. Our approach is suitable for problems in which tuneable parameters control the distribution of the cost, such as in reinforcement learning or approximate dynamic programming with a parameterized policy. Such problems cannot be solved using previous approaches. We consider both static risk measures and time-consistent dynamic risk measures. For static risk measures, our approach is in the spirit of policy gradient methods, while for the dynamic risk measures, we use actor-critic type algorithms.

52 citations
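
For the static-risk case, a concrete instance is a likelihood-ratio gradient estimator for CVaR: estimate the value-at-risk from samples, then average score-function terms over the worst-case tail. The sketch below is a plain Monte Carlo version under assumed interfaces, not the paper's exact algorithm.

```python
import numpy as np

def cvar_policy_gradient(sample, theta, n=1000, alpha=0.05):
    """Estimate the gradient of CVaR_alpha of the trajectory cost.
    `sample(theta)` returns (cost, grad_log_prob) for one trajectory
    drawn under the policy parameterized by theta (assumed interface)."""
    costs, scores = zip(*[sample(theta) for _ in range(n)])
    costs = np.asarray(costs)                  # (n,)
    scores = np.asarray(scores)                # (n, dim(theta))
    var = np.quantile(costs, 1.0 - alpha)      # empirical value-at-risk
    tail = costs >= var                        # worst alpha-fraction of runs
    # Score-function estimator concentrated on the tail trajectories:
    return (scores[tail] * (costs[tail] - var)[:, None]).sum(axis=0) / (alpha * n)
```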


Proceedings ArticleDOI
01 May 2017
TL;DR: In this paper, a policy improvement scheme for MPC is proposed for the iterative learning setting, where the same task is repeated several times; the main idea is that, between executions, MPC can be run offline with a longer horizon, resulting in a hindsight plan.
Abstract: Model predictive control (MPC) is a popular control method that has proved effective for robotics, among other fields. MPC performs re-planning at every time step. Re-planning is done with a limited horizon, due to computational and real-time constraints, and often also for robustness to potential model errors. However, the limited horizon leads to suboptimal performance. In this work, we consider the iterative learning setting, where the same task can be repeated several times, and propose a policy improvement scheme for MPC. The main idea is that between executions we can, offline, run MPC with a longer horizon, resulting in a hindsight plan. To bring the next real-world execution closer to the hindsight plan, our approach learns to re-shape the original cost function with the goal of satisfying the following property: short horizon planning (as realistic during real executions) with respect to the shaped cost should result in mimicking the hindsight plan. This effectively consolidates long-term reasoning into the short-horizon planning. We empirically evaluate our approach in contact-rich manipulation tasks both in simulated and real environments, such as peg insertion by a real PR2 robot.

49 citations
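
The scheme above alternates real short-horizon executions with offline long-horizon re-planning. The hedged pseudocode below captures that structure; run_mpc, env.replay, and shaping.fit are hypothetical helpers, not the paper's API.

```python
def train_hindsight_mpc(env, base_cost, shaping, iterations=10,
                        short_h=5, long_h=50):
    """Sketch of the policy-improvement loop (all helpers assumed)."""
    for _ in range(iterations):
        # 1) Real execution: short-horizon MPC on the shaped cost.
        traj = run_mpc(env,
                       cost=lambda x, u: base_cost(x, u) + shaping(x, u),
                       horizon=short_h)
        # 2) Offline: re-plan from the same start with a long horizon
        #    and the ORIGINAL cost -- the hindsight plan.
        hindsight = run_mpc(env.replay(traj.start), cost=base_cost,
                            horizon=long_h)
        # 3) Re-shape the cost so that short-horizon planning under
        #    the shaped cost mimics the hindsight plan.
        shaping.fit(states=hindsight.states, targets=hindsight.actions,
                    horizon=short_h)
    return shaping
```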


Posted Content
TL;DR: This work shows that conventional discriminative methods can easily be fooled into providing incorrect labels with very high confidence on out-of-distribution examples, posits that a generative approach is the natural remedy for this problem, and proposes a method for classification using generative models.
Abstract: The discriminative approach to classification using deep neural networks has become the de-facto standard in various fields. Complementing recent reservations about safety against adversarial examples, we show that conventional discriminative methods can easily be fooled to provide incorrect labels with very high confidence to out of distribution examples. We posit that a generative approach is the natural remedy for this problem, and propose a method for classification using generative models. At training time, we learn a generative model for each class, while at test time, given an example to classify, we query each generator for its most similar generation, and select the class corresponding to the most similar one. Our approach is general and can be used with expressive models such as GANs and VAEs. At test time, our method accurately "knows when it does not know," and provides resilience to out of distribution examples while maintaining competitive performance for standard examples.

48 citations
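
At test time the method searches each class-conditional generator for its closest generation to the query. One hedged PyTorch sketch of that search, with the optimizer settings and generator interface as assumptions:

```python
import torch

def classify(x, generators, steps=200, lr=0.05, latent_dim=64):
    """generators: one decoder z -> x_hat per class (e.g. GAN or VAE).
    Returns the class whose best generation is nearest to x."""
    best_dist, best_class = float("inf"), None
    for c, g in enumerate(generators):
        z = torch.zeros(1, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):                 # search the latent space
            opt.zero_grad()
            loss = ((g(z) - x) ** 2).mean()
            loss.backward()
            opt.step()
        with torch.no_grad():                  # final distance for class c
            dist = ((g(z) - x) ** 2).mean().item()
        if dist < best_dist:
            best_dist, best_class = dist, c
    return best_class
```

A large best_dist across all classes then signals an out-of-distribution input, which is how the method "knows when it does not know."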


Posted Content
TL;DR: This work investigates the power of data-driven routing protocols and suggests that applying ideas and techniques from deep reinforcement learning to this context yields high performance, motivating further research along these lines.
Abstract: Can ideas and techniques from machine learning be leveraged to automatically generate "good" routing configurations? We investigate the power of data-driven routing protocols. Our results suggest that applying ideas and techniques from deep reinforcement learning to this context yields high performance, motivating further research along these lines.

32 citations


Proceedings Article
01 Jan 2017
TL;DR: This work proposes a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method, by periodically re-training the last hidden layer of a DRL network with a batch least squares update.
Abstract: Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.
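
The last-layer update has a closed form: a ridge-style least-squares solve whose prior is centered on the current weights. A minimal NumPy sketch (shapes and the per-action-head decomposition are assumptions):

```python
import numpy as np

def ls_update(features, targets, w_current, lam=1.0):
    """Regularized batch least-squares update for one last-layer head.
    features:  (N, d) penultimate-layer activations over a large batch
    targets:   (N,)  regression targets (e.g. Q-learning targets)
    w_current: (d,)  current weights; the Bayesian prior pulls toward
               them, which is what prevents over-fitting to recent data."""
    d = features.shape[1]
    A = features.T @ features + lam * np.eye(d)
    b = features.T @ targets + lam * w_current
    return np.linalg.solve(A, b)
```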

Posted Content
TL;DR: In this paper, a deep neural network is used to learn and represent a generalized reactive policy (GRP) that maps a problem instance and a state to an action, and the learned GRPs efficiently solve large classes of challenging problem instances.
Abstract: We present a new approach to learning for planning, where knowledge acquired while solving a given set of planning problems is used to plan faster in related, but new problem instances. We show that a deep neural network can be used to learn and represent a \emph{generalized reactive policy} (GRP) that maps a problem instance and a state to an action, and that the learned GRPs efficiently solve large classes of challenging problem instances. In contrast to prior efforts in this direction, our approach significantly reduces the dependence of learning on handcrafted domain knowledge or feature selection. Instead, the GRP is trained from scratch using a set of successful execution traces. We show that our approach can also be used to automatically learn a heuristic function that can be used in directed search algorithms. We evaluate our approach using an extensive suite of experiments on two challenging planning problem domains and show that our approach facilitates learning complex decision making policies and powerful heuristic functions with minimal human input. Videos of our results are available at this http URL.
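
Since the GRP is trained from successful execution traces, the learning problem reduces to supervised imitation: fit a network mapping (problem-instance encoding, state) to the demonstrated action. A hedged PyTorch sketch with an assumed architecture:

```python
import torch
import torch.nn as nn

class GRP(nn.Module):
    """Maps a problem-instance encoding plus a state to action logits."""
    def __init__(self, instance_dim, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(instance_dim + state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, instance, state):
        return self.net(torch.cat([instance, state], dim=-1))

def train(grp, traces, epochs=10, lr=1e-3):
    """traces: iterable of batched (instance, state, expert_action) tensors."""
    opt = torch.optim.Adam(grp.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for instance, state, action in traces:
            opt.zero_grad()
            loss_fn(grp(instance, state), action).backward()
            opt.step()
```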

Posted Content
TL;DR: The Least Squares Deep Q-Network (LS-DQN) as discussed by the authors proposes a hybrid approach, which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method.
Abstract: Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.

Posted Content
TL;DR: This work incorporates SA, in the form of vigor, into hierarchical RL by defining and learning situationally aware options in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP); this is achieved using the Situationally Aware oPtions (SAP) policy gradient algorithm, which comes with a theoretical convergence guarantee.
Abstract: Hierarchical abstractions, also known as options -- a type of temporally extended action (Sutton et al., 1999) -- enable a reinforcement learning agent to plan at a higher level, abstracting away from lower-level details. In this work, we learn reusable options whose parameters can vary, encouraging different behaviors based on the current situation. In principle, these behaviors can include vigor, defence, or even risk-averseness. These are some examples of what we refer to in the broader context as Situational Awareness (SA). We incorporate SA, in the form of vigor, into hierarchical RL by defining and learning situationally aware options in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved using our Situationally Aware oPtions (SAP) policy gradient algorithm, which comes with a theoretical convergence guarantee. We learn reusable options in different scenarios (i.e., winning/losing) in a RoboCup soccer domain. These options learn to execute with different levels of vigor, resulting in human-like behaviours such as 'time-wasting' in the winning scenario. We show the potential of the agent to exit bad local optima using reusable options in RoboCup. Finally, using SAP, the agent mitigates feature-based model misspecification in a Bottomless Pit of Death domain.
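
One way to picture a situationally aware option: a learned, situation-dependent parameter (here, vigor) modulates an otherwise fixed option policy. The sketch below is purely illustrative and assumes a scalar situation signal such as the score difference; it is not the SAP algorithm itself.

```python
import numpy as np

class SituationalOption:
    """Option whose vigor is a learned function of the situation."""
    def __init__(self, theta):
        self.theta = theta                      # parameters of the vigor map

    def vigor(self, situation):
        # e.g. situation = own score minus opponent score:
        # winning -> low vigor ("time-wasting"), losing -> high vigor.
        return 1.0 / (1.0 + np.exp(self.theta[0] * situation + self.theta[1]))

    def act(self, state, situation, base_policy):
        # Scale the base option's action magnitude by the current vigor.
        return self.vigor(situation) * base_policy(state)
```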