Author

Parsa Mahmoudieh

Bio: Parsa Mahmoudieh is an academic researcher from University of California, Berkeley. The author has contributed to research in topics: Task (project management) & Imitation. The author has an h-index of 7 and has co-authored 7 publications receiving 446 citations.

Papers
Proceedings ArticleDOI
23 Apr 2018
TL;DR: Imitating expert demonstrations is a powerful mechanism for learning to perform tasks from raw sensory observations, as discussed by the authors: the expert typically provides multiple demonstrations of a task at training time, which generates data in the form of observation-action pairs from the agent's point of view.
Abstract: Imitating expert demonstration is a powerful mechanism for learning to perform tasks from raw sensory observations. The current dominant paradigm in learning from demonstration (LfD) [3,16,19,20] requires the expert to either manually move the robot joints (i.e., kinesthetic teaching) or teleoperate the robot to execute the desired task. The expert typically provides multiple demonstrations of a task at training time, and this generates data in the form of observation-action pairs from the agent's point of view. The agent then distills this data into a policy for performing the task of interest. Such a heavily supervised approach, where it is necessary to provide demonstrations by controlling the robot, is incredibly tedious for the human expert. Moreover, for every new task that the robot needs to execute, the expert is required to provide a new set of demonstrations.
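The heavily supervised pipeline the abstract describes boils down to behavioral cloning: supervised regression from observations to the expert's actions. The sketch below is illustrative only; the network sizes, dimensions, and training data are hypothetical placeholders, not the authors' setup.

import torch
import torch.nn as nn

# Hypothetical observation/action sizes, for illustration only.
obs_dim, act_dim = 64, 7

policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(obs, expert_actions):
    # One supervised update: imitate the expert's action for each observation.
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Every new task requires collecting a fresh set of such (observation, action) pairs, which is the tedium the paper argues against.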

238 citations

Posted Content
TL;DR: The authors consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses, which offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward.
Abstract: Reinforcement learning optimizes policies for expected cumulative reward. Need the supervision be so narrow? Reward is delayed and sparse for many tasks, making it a difficult and impoverished signal for end-to-end optimization. To augment reward, we consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses. These losses offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward. While current results show that learning from reward alone is feasible, pure reinforcement learning methods are constrained by computational and data efficiency issues that can be remedied by auxiliary losses. Self-supervised pre-training and joint optimization improve the data efficiency and policy returns of end-to-end reinforcement learning.
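As a rough illustration of the kind of auxiliary supervision the abstract describes, the sketch below adds a self-supervised forward-dynamics loss over (state, action, successor) triples on top of a standard RL objective. The shared encoder, layer sizes, and the rl_loss placeholder are assumptions for illustration, not the authors' architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration.
obs_dim, act_dim, feat_dim = 64, 4, 32

encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
# Predict successor features from current features and the action taken.
forward_head = nn.Linear(feat_dim + act_dim, feat_dim)

def auxiliary_loss(s, a_onehot, s_next):
    phi_s, phi_next = encoder(s), encoder(s_next)
    pred_next = forward_head(torch.cat([phi_s, a_onehot], dim=-1))
    return F.mse_loss(pred_next, phi_next.detach())

# Joint optimization: the auxiliary term supplies a dense learning signal even
# when the environment reward is sparse or absent.
# total_loss = rl_loss + beta * auxiliary_loss(s, a_onehot, s_next)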

92 citations

Posted Content
TL;DR: In the zero-shot imitation setting introduced in this paper, an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss; the role of the expert is only to communicate the goals (i.e., what to imitate) during inference.
Abstract: The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at this https URL

87 citations

Proceedings Article
15 Feb 2018
TL;DR: In this article, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference, and the learned policy is then employed to mimic the expert after seeing just a sequence of images demonstrating the desired task.
Abstract: The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at this https URL
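A rough sketch of the two ingredients named in the abstract, a goal-conditioned skill policy and a forward consistency loss, is given below: the policy maps (current observation, goal observation) to an action, and a learned forward model penalizes the predicted action by the mismatch between the outcome it would produce and the outcome of the action the agent actually took while exploring. All shapes, networks, and the loss weighting are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the paper operates on images with convolutional encoders.
obs_dim, act_dim = 64, 4

policy = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
forward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

def forward_consistency_loss(obs, goal, taken_action_onehot, next_obs):
    # Goal-conditioned policy: soft action distribution given where we are and where we want to be.
    pred_action = F.softmax(policy(torch.cat([obs, goal], dim=-1)), dim=-1)
    # Next states implied by the predicted action and by the action actually taken.
    pred_next = forward_model(torch.cat([obs, pred_action], dim=-1))
    true_next = forward_model(torch.cat([obs, taken_action_onehot], dim=-1))
    # Consistency in outcome space, plus grounding the forward model in real transitions.
    return F.mse_loss(pred_next, true_next.detach()) + F.mse_loss(true_next, next_obs)

Because the loss compares outcomes rather than raw actions, different actions that reach the same next state are not penalized, which is the intuition behind forward consistency.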

72 citations

Proceedings ArticleDOI
26 May 2015
TL;DR: This work develops a cooperative launching system for a 13.2 gram ornithopter micro-aerial vehicle (MAV), the H2Bird, by carrying it on the back of a 32 gram hexapedal millirobot, the VelociRoACH, and determining the necessary initial velocity and pitch angle for takeoff from force data collected in a wind tunnel.
Abstract: In this work, we develop a cooperative launching system for a 13.2 gram ornithopter micro-aerial vehicle (MAV), the H2Bird, by carrying it on the back of a 32 gram hexapedal millirobot, the VelociRoACH. We determine the necessary initial velocity and pitch angle for takeoff using force data collected in a wind tunnel and use the VelociRoACH to reach these initial conditions for successful launch. Within the success region predicted from the wind tunnel data, we completed a successful launch in 75 percent of the 12 trials. Although carrying the H2Bird on top of the VelociRoACH at a stride frequency of 17 Hz increases our average power consumption by about 24.5 percent over solo running, the H2Bird, in turn, provides stability advantages to the VelociRoACH: the variance in pitch and roll velocity with the H2Bird is about 90 percent less than without it. Additionally, with the H2Bird flapping at 5 Hz during transport, we observed an increase of 12.7 percent in the steady-state velocity. Lastly, we found that the costs of transport for carrying the H2Bird with and without flapping (6.6 and 6.8) are lower than the solo costs of transport for the VelociRoACH and for the H2Bird (8.1 and 10.1).
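For reference, the dimensionless cost of transport quoted above is conventionally defined as input power normalized by weight times speed; this is the standard definition of the metric, not a formula taken from the paper:

\mathrm{COT} = \frac{P}{m\, g\, v}

where P is the power drawn, m the total mass being moved, g the gravitational acceleration, and v the forward velocity, so lower values indicate more efficient locomotion.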

16 citations


Cited by
Posted Content
TL;DR: This work formulates curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model, which scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and ignores the aspects of the environment that cannot affect the agent.
Abstract: In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch. Demo video and code available at this https URL
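The mechanism described above can be summarized in a short sketch: features are trained with an inverse-dynamics objective (predict the action from consecutive observations), and the intrinsic reward is the forward model's prediction error in that feature space. The layer sizes and the discrete-action encoding below are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the paper uses convolutional encoders over image observations.
obs_dim, act_dim, feat_dim = 64, 4, 32

encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
inverse_model = nn.Linear(2 * feat_dim, act_dim)         # (phi(s), phi(s')) -> action logits
forward_model = nn.Linear(feat_dim + act_dim, feat_dim)  # (phi(s), a) -> predicted phi(s')

def curiosity_step(obs, action_onehot, next_obs):
    phi, phi_next = encoder(obs), encoder(next_obs)
    # Inverse-dynamics loss shapes the features to keep only what the agent can influence.
    inv_logits = inverse_model(torch.cat([phi, phi_next], dim=-1))
    inv_loss = F.cross_entropy(inv_logits, action_onehot.argmax(dim=-1))
    # Intrinsic reward: error in predicting the next feature vector.
    pred_next = forward_model(torch.cat([phi, action_onehot], dim=-1))
    fwd_error = ((pred_next - phi_next.detach()) ** 2).sum(dim=-1)
    return fwd_error.detach(), inv_loss + fwd_error.mean()  # (reward, training loss)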

638 citations

Proceedings Article
06 Aug 2017
TL;DR: A method for learning expressive energy-based policies for continuous states and actions, previously feasible only in tabular domains, is proposed, yielding a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.
Abstract: We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
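The central object in the abstract, a Boltzmann policy over Q-values together with its soft value function, is easy to write down in the discrete-action case. The sketch below shows only that case and deliberately omits the amortized Stein variational sampler the paper uses to draw actions in continuous spaces.

import torch

def soft_value(q_values: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # V(s) = alpha * log sum_a exp(Q(s, a) / alpha); q_values has shape [batch, num_actions].
    return alpha * torch.logsumexp(q_values / alpha, dim=-1)

def boltzmann_policy(q_values: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # pi(a|s) proportional to exp(Q(s, a) / alpha); larger alpha gives more uniform exploration.
    return torch.softmax(q_values / alpha, dim=-1)

# Example with two states and three actions each.
q = torch.tensor([[1.0, 2.0, 0.5], [0.0, 0.0, 3.0]])
print(boltzmann_policy(q, alpha=0.5))
print(soft_value(q, alpha=0.5))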

591 citations

Posted Content
TL;DR: In this article, the authors propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before, and apply it to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.
Abstract: We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.

484 citations

Posted Content
TL;DR: The authors perform the first large-scale study of purely curiosity-driven learning, i.e., without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite, where curiosity is an intrinsic reward function that uses prediction error as the reward signal.
Abstract: Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at this https URL
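Point (b) of the abstract compares feature spaces for computing the prediction error; the sketch below illustrates the random-features variant, in which the encoder is a fixed, randomly initialized network and only the forward model is trained. Sizes and the discrete-action encoding are assumptions for illustration.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
obs_dim, act_dim, feat_dim = 64, 4, 32

random_encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
for p in random_encoder.parameters():
    p.requires_grad_(False)  # frozen random projection, never trained

forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, feat_dim))

def intrinsic_reward(obs, action_onehot, next_obs):
    with torch.no_grad():
        phi, phi_next = random_encoder(obs), random_encoder(next_obs)
    pred = forward_model(torch.cat([phi, action_onehot], dim=-1))
    err = ((pred - phi_next) ** 2).sum(dim=-1)
    return err.detach(), err.mean()  # (curiosity reward, forward-model loss)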

473 citations