Author

Parsa Mahmoudieh

Bio: Parsa Mahmoudieh is an academic researcher from University of California, Berkeley. The author has contributed to research in topics: Task (project management) & Imitation. The author has an h-index of 7 and has co-authored 7 publications receiving 446 citations.

Papers
Proceedings ArticleDOI
23 Apr 2018
TL;DR: Imitating expert demonstrations is a powerful mechanism for learning to perform tasks from raw sensory observations, as discussed by the authors: the expert typically provides multiple demonstrations of a task at training time, which generates data in the form of observation-action pairs from the agent's point of view.
Abstract: Imitating expert demonstration is a powerful mechanism for learning to perform tasks from raw sensory observations. The current dominant paradigm in learning from demonstration (LfD) [3,16,19,20] requires the expert to either manually move the robot joints (i.e., kinesthetic teaching) or teleoperate the robot to execute the desired task. The expert typically provides multiple demonstrations of a task at training time, and this generates data in the form of observation-action pairs from the agent's point of view. The agent then distills this data into a policy for performing the task of interest. Such a heavily supervised approach, where it is necessary to provide demonstrations by controlling the robot, is incredibly tedious for the human expert. Moreover, for every new task that the robot needs to execute, the expert is required to provide a new set of demonstrations.
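The heavily supervised pipeline the abstract describes boils down to behavioral cloning: supervised regression from observations to the expert's actions. The sketch below is illustrative only; the network sizes, dimensions, and training data are hypothetical placeholders, not the authors' setup.

import torch
import torch.nn as nn

# Hypothetical observation/action sizes, for illustration only.
obs_dim, act_dim = 64, 7

policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(obs, expert_actions):
    # One supervised update: imitate the expert's action for each observation.
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Every new task requires collecting a fresh set of such (observation, action) pairs, which is the tedium the paper argues against.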

238 citations

Posted Content
TL;DR: The authors consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses, which offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward.
Abstract: Reinforcement learning optimizes policies for expected cumulative reward. Need the supervision be so narrow? Reward is delayed and sparse for many tasks, making it a difficult and impoverished signal for end-to-end optimization. To augment reward, we consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses. These losses offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward. While current results show that learning from reward alone is feasible, pure reinforcement learning methods are constrained by computational and data efficiency issues that can be remedied by auxiliary losses. Self-supervised pre-training and joint optimization improve the data efficiency and policy returns of end-to-end reinforcement learning.
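As a rough illustration of the kind of auxiliary supervision the abstract describes, the sketch below adds a self-supervised forward-dynamics loss over (state, action, successor) triples on top of a standard RL objective. The shared encoder, layer sizes, and the rl_loss placeholder are assumptions for illustration, not the authors' architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration.
obs_dim, act_dim, feat_dim = 64, 4, 32

encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
# Predict successor features from current features and the action taken.
forward_head = nn.Linear(feat_dim + act_dim, feat_dim)

def auxiliary_loss(s, a_onehot, s_next):
    phi_s, phi_next = encoder(s), encoder(s_next)
    pred_next = forward_head(torch.cat([phi_s, a_onehot], dim=-1))
    return F.mse_loss(pred_next, phi_next.detach())

# Joint optimization: the auxiliary term supplies a dense learning signal even
# when the environment reward is sparse or absent.
# total_loss = rl_loss + beta * auxiliary_loss(s, a_onehot, s_next)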

92 citations

Posted Content
TL;DR: In the zero-shot imitation setting introduced in this paper, an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss; the role of the expert is only to communicate the goals (i.e., what to imitate) during inference.
Abstract: The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at this https URL

87 citations

Proceedings Article
15 Feb 2018
TL;DR: In this article, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference, and the learned policy is then employed to mimic the expert after seeing just a sequence of images demonstrating the desired task.
Abstract: The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at this https URL
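A rough sketch of the two ingredients named in the abstract, a goal-conditioned skill policy and a forward consistency loss, is given below: the policy maps (current observation, goal observation) to an action, and a learned forward model penalizes the predicted action by the mismatch between the outcome it would produce and the outcome of the action the agent actually took while exploring. All shapes, networks, and the loss weighting are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the paper operates on images with convolutional encoders.
obs_dim, act_dim = 64, 4

policy = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
forward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

def forward_consistency_loss(obs, goal, taken_action_onehot, next_obs):
    # Goal-conditioned policy: soft action distribution given where we are and where we want to be.
    pred_action = F.softmax(policy(torch.cat([obs, goal], dim=-1)), dim=-1)
    # Next states implied by the predicted action and by the action actually taken.
    pred_next = forward_model(torch.cat([obs, pred_action], dim=-1))
    true_next = forward_model(torch.cat([obs, taken_action_onehot], dim=-1))
    # Consistency in outcome space, plus grounding the forward model in real transitions.
    return F.mse_loss(pred_next, true_next.detach()) + F.mse_loss(true_next, next_obs)

Because the loss compares outcomes rather than raw actions, different actions that reach the same next state are not penalized, which is the intuition behind forward consistency.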

72 citations

Proceedings ArticleDOI
26 May 2015
TL;DR: This work develops a cooperative launching system for a 13.2 gram ornithopter micro-aerial vehicle (MAV), the H2Bird, by carrying it on the back of a 32 gram hexapedal millirobot, the VelociRoACH, and determining the necessary initial velocity and pitch angle for takeoff from force data collected in a wind tunnel.
Abstract: In this work, we develop a cooperative launching system for a 13.2 gram ornithopter micro-aerial vehicle (MAV), the H2Bird, by carrying it on the back of a 32 gram hexapedal millirobot, the VelociRoACH. We determine the necessary initial velocity and pitch angle for takeoff using force data collected in a wind tunnel and use the VelociRoACH to reach these initial conditions for successful launch. Within the success region predicted from the wind tunnel data, we completed a successful launch in 75 percent of the 12 trials. Although carrying the H2Bird on top of the VelociRoACH at a stride frequency of 17 Hz increases our average power consumption by about 24.5 percent over solo running, the H2Bird, in turn, provides stability advantages to the VelociRoACH: the variance in pitch and roll velocity with the H2Bird is about 90 percent less than without it. Additionally, with the H2Bird flapping at 5 Hz during transport, we observed an increase of 12.7 percent in the steady-state velocity. Lastly, we found that the costs of transport for carrying the H2Bird with and without flapping (6.6 and 6.8) are lower than the solo costs of transport for the VelociRoACH and for the H2Bird (8.1 and 10.1).
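For reference, the dimensionless cost of transport quoted above is conventionally defined as input power normalized by weight times speed; this is the standard definition of the metric, not a formula taken from the paper:

\mathrm{COT} = \frac{P}{m\, g\, v}

where P is the power drawn, m the total mass being moved, g the gravitational acceleration, and v the forward velocity, so lower values indicate more efficient locomotion.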

16 citations


Cited by
Posted Content
TL;DR: This work formulates curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model, which scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and ignores the aspects of the environment that cannot affect the agent.
Abstract: In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch. Demo video and code available at this https URL
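The mechanism described above can be summarized in a short sketch: features are trained with an inverse-dynamics objective (predict the action from consecutive observations), and the intrinsic reward is the forward model's prediction error in that feature space. The layer sizes and the discrete-action encoding below are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the paper uses convolutional encoders over image observations.
obs_dim, act_dim, feat_dim = 64, 4, 32

encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
inverse_model = nn.Linear(2 * feat_dim, act_dim)         # (phi(s), phi(s')) -> action logits
forward_model = nn.Linear(feat_dim + act_dim, feat_dim)  # (phi(s), a) -> predicted phi(s')

def curiosity_step(obs, action_onehot, next_obs):
    phi, phi_next = encoder(obs), encoder(next_obs)
    # Inverse-dynamics loss shapes the features to keep only what the agent can influence.
    inv_logits = inverse_model(torch.cat([phi, phi_next], dim=-1))
    inv_loss = F.cross_entropy(inv_logits, action_onehot.argmax(dim=-1))
    # Intrinsic reward: error in predicting the next feature vector.
    pred_next = forward_model(torch.cat([phi, action_onehot], dim=-1))
    fwd_error = ((pred_next - phi_next.detach()) ** 2).sum(dim=-1)
    return fwd_error.detach(), inv_loss + fwd_error.mean()  # (reward, training loss)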

638 citations

Proceedings Article
06 Aug 2017
TL;DR: A method for learning expressive energy-based policies for continuous states and actions, previously feasible only in tabular domains, is proposed, yielding a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.
Abstract: We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
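The central object in the abstract, a Boltzmann policy over Q-values together with its soft value function, is easy to write down in the discrete-action case. The sketch below shows only that case and deliberately omits the amortized Stein variational sampler the paper uses to draw actions in continuous spaces.

import torch

def soft_value(q_values: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # V(s) = alpha * log sum_a exp(Q(s, a) / alpha); q_values has shape [batch, num_actions].
    return alpha * torch.logsumexp(q_values / alpha, dim=-1)

def boltzmann_policy(q_values: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # pi(a|s) proportional to exp(Q(s, a) / alpha); larger alpha gives more uniform exploration.
    return torch.softmax(q_values / alpha, dim=-1)

# Example with two states and three actions each.
q = torch.tensor([[1.0, 2.0, 0.5], [0.0, 0.0, 3.0]])
print(boltzmann_policy(q, alpha=0.5))
print(soft_value(q, alpha=0.5))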

591 citations

Posted Content
TL;DR: In this article, the authors propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before, and apply it to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.
Abstract: We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.

484 citations

Posted Content
TL;DR: The authors perform the first large-scale study of purely curiosity-driven learning, i.e., without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite, where curiosity is an intrinsic reward function that uses prediction error as the reward signal.
Abstract: Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at this https URL
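Point (b) of the abstract compares feature spaces for computing the prediction error; the sketch below illustrates the random-features variant, in which the encoder is a fixed, randomly initialized network and only the forward model is trained. Sizes and the discrete-action encoding are assumptions for illustration.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
obs_dim, act_dim, feat_dim = 64, 4, 32

random_encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
for p in random_encoder.parameters():
    p.requires_grad_(False)  # frozen random projection, never trained

forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, feat_dim))

def intrinsic_reward(obs, action_onehot, next_obs):
    with torch.no_grad():
        phi, phi_next = random_encoder(obs), random_encoder(next_obs)
    pred = forward_model(torch.cat([phi, action_onehot], dim=-1))
    err = ((pred - phi_next) ** 2).sum(dim=-1)
    return err.detach(), err.mean()  # (curiosity reward, forward-model loss)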

473 citations