
Showing papers on "Reinforcement learning" published in 2008


Journal ArticleDOI
01 Mar 2008
TL;DR: The benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied, and an outlook for the field is provided.
Abstract: Multiagent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must, instead, discover a solution on their own, using learning. A significant part of the research on multiagent learning concerns reinforcement learning techniques. This paper provides a comprehensive survey of multiagent reinforcement learning (MARL). A central issue in the field is the formal statement of the multiagent learning goal. Different viewpoints on this issue have led to the proposal of many different goals, among which two focal points can be distinguished: stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents. The MARL algorithms described in the literature aim---either explicitly or implicitly---at one of these two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting. A representative selection of these algorithms is discussed in detail in this paper, together with the specific issues that arise in each category. Additionally, the benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied. Finally, an outlook for the field is provided.

1,878 citations


Journal ArticleDOI
TL;DR: This paper examines learning of complex motor skills with human-like limbs, and combines the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning with the theory of stochastic policy gradient learning.

921 citations


Journal ArticleDOI
TL;DR: It is shown that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms.

659 citations


Journal ArticleDOI
TL;DR: This document reviews phenomenological models of short-term and long-term synaptic plasticity, in particular spike-timing dependent plasticity (STDP), and focuses on phenomenological synaptic models that are compatible with integrate-and-fire type neuron models where each neuron is described by a small number of variables.
Abstract: Synaptic plasticity is considered to be the biological substrate of learning and memory. In this document we review phenomenological models of short-term and long-term synaptic plasticity, in particular spike-timing dependent plasticity (STDP). The aim of the document is to provide a framework for classifying and evaluating different models of plasticity. We focus on phenomenological synaptic models that are compatible with integrate-and-fire type neuron models where each neuron is described by a small number of variables. This implies that synaptic update rules for short-term or long-term plasticity can only depend on spike timing and, potentially, on membrane potential, as well as on the value of the synaptic weight, or on low-pass filtered (temporally averaged) versions of the above variables. We examine the ability of the models to account for experimental data and to fulfill expectations derived from theoretical considerations. We further discuss their relations to teacher-based rules (supervised learning) and reward-based rules (reinforcement learning). All models discussed in this paper are suitable for large-scale network simulations.

606 citations
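
As a rough illustration of the pair-based STDP rule the abstract describes (compatible with integrate-and-fire models, the update depends only on spike timing via low-pass filtered traces), here is a minimal Python sketch; the amplitudes, time constants, and toy spike trains are assumptions for illustration, not values from the paper.

import numpy as np

A_PLUS, A_MINUS = 0.01, 0.012     # potentiation / depression amplitudes (assumed)
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # trace time constants in ms (assumed)

def stdp_update(w, pre_spikes, post_spikes, dt=1.0, w_min=0.0, w_max=1.0):
    """Update synaptic weight w given binary spike trains (arrays of 0/1)."""
    x_pre, x_post = 0.0, 0.0  # low-pass filtered spike traces
    for pre, post in zip(pre_spikes, post_spikes):
        x_pre += -x_pre * dt / TAU_PLUS + pre
        x_post += -x_post * dt / TAU_MINUS + post
        # pre-before-post potentiates, post-before-pre depresses
        w += A_PLUS * x_pre * post - A_MINUS * x_post * pre
        w = float(np.clip(w, w_min, w_max))
    return w

rng = np.random.default_rng(0)
pre = (rng.random(1000) < 0.02).astype(float)
post = np.roll(pre, 5)  # post fires ~5 ms after pre -> net potentiation
print(stdp_update(0.5, pre, post))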


Journal ArticleDOI
TL;DR: The latest dispatches from the forefront of reinforcement learning are reviewed, and some of the territories where lie monsters are mapped.

585 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of Model-based Interval Estimation and a new variation called MBIE-EB are presented, with proofs of their efficiency even under worst-case conditions.

503 citations


Journal ArticleDOI
TL;DR: A well-known, coherent Bayesian approach to decision making is reviewed, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration, and paradigmatic psychological and neural examples of each problem are discussed.
Abstract: Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.

491 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: This work proposes a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL), and shows that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
Abstract: Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.

484 citations
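
A minimal sketch of the idea, with the scheduler cast as a tabular Q-learning agent; the state features, action set, reward, and the toy dynamics in toy_step are illustrative assumptions, not the paper's actual DRAM timing model (the paper uses a hardware-friendly approximate implementation).

import random
from collections import defaultdict

ACTIONS = ["precharge", "activate", "read_write", "noop"]
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.05
Q = defaultdict(float)

def choose(state):
    if random.random() < EPS:                        # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def toy_step(state, action):
    """Stand-in dynamics: reward 1 when a read/write hits an open row."""
    queue, row_open = state
    if action == "read_write" and row_open and queue > 0:
        return 1.0, (queue - 1, random.random() < 0.5)
    if action == "activate":
        return 0.0, (queue, True)
    if action == "precharge":
        return 0.0, (queue, False)
    arrivals = 1 if random.random() < 0.3 else 0     # new requests queue up
    return 0.0, (min(queue + arrivals, 8), row_open)

state = (4, False)
for t in range(50000):
    action = choose(state)
    reward, nxt = toy_step(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)    # one-step Q-learning backup
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = nxt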


Book ChapterDOI
08 Dec 2008
TL;DR: This paper extends previous work on policy learning from the immediate reward case to episodic reinforcement learning, resulting in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives.
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM™ robot arm.

411 citations
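
The abstract's EM-inspired policy search admits a very compact sketch: perturb the policy parameters, weight each perturbation by its (positive) episodic return, and average. The quadratic toy return, noise level, and batch size below are assumptions for illustration, not the paper's setup.

import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(5)   # policy parameters (e.g., motor-primitive weights)
sigma = 0.3           # exploration noise in parameter space

def episode_return(params):
    target = np.array([0.5, -0.2, 0.1, 0.8, -0.4])
    return np.exp(-np.sum((params - target) ** 2))  # positive "improper" return

for it in range(200):
    eps = sigma * rng.standard_normal((20, theta.size))          # exploration
    returns = np.array([episode_return(theta + e) for e in eps])
    # E- and M-step collapse into a return-weighted average of the noise
    theta += (returns[:, None] * eps).sum(0) / returns.sum()
print(episode_return(theta))   # approaches 1 as theta nears the optimum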


Journal ArticleDOI
TL;DR: The importance of understanding the human-teacher/robot-learner partnership in order to design algorithms that support how people want to teach and simultaneously improve the robot's learning behavior is demonstrated.

403 citations


Proceedings Article
08 Dec 2008
TL;DR: This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm.
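
To make the diameter parameter D concrete, here is a sketch that computes it for a small *deterministic* MDP, where the expected travel time under the best policy reduces to graph shortest paths (general MDPs need a stochastic shortest-path computation); the 4-state chain is an invented example.

import numpy as np

S, A = 4, 2
nxt = np.array([[1, 0], [2, 1], [3, 1], [3, 2]])  # nxt[s, a] = successor state

dist = np.full((S, S), np.inf)
for s in range(S):
    dist[s, s] = 0.0
for _ in range(S):                 # Bellman-Ford-style sweeps over actions
    for s in range(S):
        for a in range(A):
            dist[s] = np.minimum(dist[s], 1 + dist[nxt[s, a]])
D = dist.max()
print("diameter D =", D)           # the regret bound scales as O(D*S*sqrt(A*T))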

Journal Article
TL;DR: This paper compares a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large state spaces and hidden state.
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks.
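
A heavily simplified sketch of cooperative synapse coevolution: one subpopulation per network weight, candidate networks assembled row-wise, selection on whole-network fitness, then each weight column permuted so weights recombine into new collaborations. The fitness function, population sizes, and unconditional permutation are toy assumptions (CoSyNE permutes probabilistically based on fitness).

import numpy as np

rng = np.random.default_rng(2)
n_weights, pop = 8, 32
subpops = rng.standard_normal((pop, n_weights))   # row i = candidate network

def fitness(w):                    # stand-in for, e.g., a pole-balancing return
    return -np.sum((w - 0.7) ** 2)

for gen in range(100):
    scores = np.array([fitness(w) for w in subpops])
    elite = subpops[np.argsort(scores)[-pop // 4:]]           # keep top quarter
    children = elite[rng.integers(0, len(elite), pop - len(elite))]
    children = children + 0.1 * rng.standard_normal(children.shape)
    subpops = np.vstack([elite, children])
    # CoSyNE's key step: permute each weight column independently so weights
    # from different networks are recombined in the next generation
    for j in range(n_weights):
        subpops[:, j] = rng.permutation(subpops[:, j])
print(max(fitness(w) for w in subpops))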

Journal ArticleDOI
01 Mar 2008
TL;DR: This paper mathematically and experimentally proves that the simultaneous consideration of randomness and opposition is more advantageous than pure randomness, and applies this scheme to accelerate differential evolution (DE).
Abstract: For many soft computing methods, we need to generate random numbers to use either as initial estimates or during the learning and search process. Recently, results for evolutionary algorithms, reinforcement learning and neural networks have been reported which indicate that the simultaneous consideration of randomness and opposition is more advantageous than pure randomness. This new scheme, called opposition-based learning, has the apparent effect of accelerating soft computing algorithms. This paper mathematically and also experimentally proves this advantage and, as an application, applies it to accelerate differential evolution (DE). By taking advantage of random numbers and their opposites, the optimization, search or learning process in many soft computing techniques can be accelerated when there is no a priori knowledge about the solution. The mathematical proofs and the results of conducted experiments confirm each other.
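
A minimal sketch of opposition-based initialization for DE: for each random candidate x in [a, b], the opposite point a + b - x is also evaluated and the better half is kept before standard DE/rand/1/bin iterations begin. The sphere objective and DE constants are the usual textbook defaults, assumed here for illustration.

import numpy as np

rng = np.random.default_rng(3)
dim, NP, a, b = 5, 20, -5.0, 5.0
f = lambda x: np.sum(x ** 2)                     # toy objective (sphere)

X = rng.uniform(a, b, (NP, dim))
O = a + b - X                                    # opposite population
both = np.vstack([X, O])
pop = both[np.argsort([f(x) for x in both])[:NP]]  # fittest NP of the 2*NP

F, CR = 0.5, 0.9                                 # standard DE/rand/1/bin step
for gen in range(100):
    for i in range(NP):
        r1, r2, r3 = pop[rng.choice(NP, 3, replace=False)]
        mutant = r1 + F * (r2 - r3)
        cross = rng.random(dim) < CR
        trial = np.where(cross, mutant, pop[i])
        if f(trial) < f(pop[i]):                 # greedy selection
            pop[i] = trial
print(min(f(x) for x in pop))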

Proceedings ArticleDOI
05 Jul 2008
TL;DR: Object-Oriented MDPs (OO-MDPs) are introduced, a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities; a learning algorithm for deterministic OO-MDPs is presented and a polynomial bound on its sample complexity is proved.
Abstract: Rich representations in reinforcement learning have been studied for the purpose of enabling generalization and making learning feasible in large state spaces. We introduce Object-Oriented MDPs (OO-MDPs), a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities. We introduce a learning algorithm for deterministic OO-MDPs and prove a polynomial bound on its sample complexity. We illustrate the performance gains of our representation and algorithm in the well-known Taxi domain, plus a real-life videogame.
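A small sketch of what an object-oriented state looks like, using the Taxi domain the abstract mentions; the class and relation names are illustrative assumptions (the paper's formalism defines object classes, attributes, and relations whose truth values condition learned transition effects).

from dataclasses import dataclass

@dataclass
class Taxi:
    x: int
    y: int

@dataclass
class Passenger:
    x: int
    y: int
    in_taxi: bool

def touching(o1, o2) -> bool:
    """A relation over object attributes, used in OO-MDP transition conditions."""
    return abs(o1.x - o2.x) + abs(o1.y - o2.y) <= 1

state = {"taxi": Taxi(0, 0), "passenger": Passenger(0, 1, False)}
# A learned rule might read: if touching(taxi, passenger) and action == "pickup",
# the effect sets passenger.in_taxi = True -- and the same rule generalizes to
# any grid size or passenger location, which is the source of the sample savings.
print(touching(state["taxi"], state["passenger"]))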

Journal ArticleDOI
TL;DR: The techniques covered are case-based reasoning, rule-based systems, artificial neural networks, fuzzy models, genetic algorithms, cellular automata, multi-agent systems, swarm intelligence, reinforcement learning and hybrid systems.

Journal ArticleDOI
Daeyeol Lee
TL;DR: Molecular genetic studies have also begun to identify genetic mechanisms for personal traits related to reinforcement learning and complex social decision making, further illuminating the biological basis of social behavior.
Abstract: Decision making in a social group has two distinguishing features. First, humans and other animals routinely alter their behavior in response to changes in their physical and social environment. As a result, the outcomes of decisions that depend on the behavior of multiple decision makers are difficult to predict and require highly adaptive decision-making strategies. Second, decision makers may have preferences regarding consequences to other individuals and therefore choose their actions to improve or reduce the well-being of others. Many neurobiological studies have exploited game theory to probe the neural basis of decision making and suggested that these features of social decision making might be reflected in the functions of brain areas involved in reward evaluation and reinforcement learning. Molecular genetic studies have also begun to identify genetic mechanisms for personal traits related to reinforcement learning and complex social decision making, further illuminating the biological basis of social behavior.

Journal ArticleDOI
TL;DR: The demonstration that reinforcement learning applies selectively to formally equivalent aspects of task performance supports broader consideration of two-system models in analyses of learning and decision making. The results also suggest a privileged role for surface geometry in determining spatial context and support the idea of a "geometric module," albeit for location rather than orientation.
Abstract: Associative reinforcement provides a powerful explanation of learned behavior. However, an unproven but long-held conjecture holds that spatial learning can occur incidentally rather than by reinforcement. Using a carefully controlled virtual-reality object-location memory task, we formally demonstrate that locations are concurrently learned relative to both local landmarks and local boundaries but that landmark-learning obeys associative reinforcement (showing “overshadowing” and “blocking” or “learned irrelevance”), whereas boundary-learning is incidental, showing neither overshadowing nor blocking nor learned irrelevance. Crucially, both types of learning occur at similar rates and do not reflect differences in levels of performance, cue salience, or instructions. These distinct types of learning likely reflect the distinct neural systems implicated in processing of landmarks and boundaries: the striatum and hippocampus, respectively [Doeller CF, King JA, Burgess N (2008) Proc Natl Acad Sci USA 105:5915–5920]. In turn, our results suggest the use of fundamentally different learning rules by these two systems, potentially explaining their differential roles in procedural and declarative memory more generally. Our results suggest a privileged role for surface geometry in determining spatial context and support the idea of a “geometric module,” albeit for location rather than orientation. Finally, the demonstration that reinforcement learning applies selectively to formally equivalent aspects of task-performance supports broader consideration of two-system models in analyses of learning and decision making.

Book
06 Nov 2008
TL;DR: Reactive Search and Intelligent Optimization is an excellent introduction to the main principles of reactive search, as well as an attempt to develop some fresh intuition for the approaches.
Abstract: Reactive Search integrates sub-symbolic machine learning techniques into search heuristics for solving complex optimization problems. By automatically adjusting the working parameters, a reactive search self-tunes and adapts, effectively learning by doing until a solution is found. Intelligent Optimization, a superset of Reactive Search, concerns online and off-line schemes based on the use of memory, adaptation, incremental development of models, experimental algorithms applied to optimization, intelligent tuning and design of heuristics. Reactive Search and Intelligent Optimization is an excellent introduction to the main principles of reactive search, as well as an attempt to develop some fresh intuition for the approaches. The book looks at different optimization possibilities with an emphasis on opportunities for learning and self-tuning strategies. While focusing more on methods than on problems, problems are introduced wherever they help make the discussion more concrete, or when a specific problem has been widely studied by reactive search and intelligent optimization heuristics. Individual chapters cover reacting on the neighborhood; reacting on the annealing schedule; reactive prohibitions; model-based search; reacting on the objective function; relationships between reactive search and reinforcement learning; and much more. Each chapter is structured to show basic issues and algorithms; the parameters critical for the success of the different methods discussed; and opportunities and schemes for the automated tuning of these parameters. Anyone working in decision making in business, engineering, economics or science will find a wealth of information here.

Journal ArticleDOI
01 Oct 2008
TL;DR: The results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems and show that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism.
Abstract: The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration of the application of quantum computation to artificial intelligence.
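
A classical toy sketch of the quantum-inspired mechanism: actions carry probability amplitudes, an action is "measured" with probability |amplitude|^2, and amplitudes of rewarded actions are strengthened and renormalized. The additive update constant k is an assumption standing in for the paper's Grover-style amplitude amplification.

import numpy as np

rng = np.random.default_rng(4)
n_actions = 4
amp = np.ones(n_actions) / np.sqrt(n_actions)   # uniform superposition

def measure(amp):
    return rng.choice(len(amp), p=amp ** 2)     # collapse postulate

def reinforce(amp, action, reward, k=0.1):
    amp = amp.copy()
    amp[action] += k * reward                   # strengthen rewarded eigen action
    return amp / np.linalg.norm(amp)            # keep sum of |amp|^2 equal to 1

for t in range(500):
    a = measure(amp)
    r = 1.0 if a == 2 else 0.0                  # toy bandit: action 2 pays off
    amp = reinforce(amp, a, r)
print(amp ** 2)  # probability mass concentrates on the rewarded action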

Proceedings ArticleDOI
05 Jul 2008
TL;DR: The convergence properties of several variations of Q-learning when combined with function approximation are analyzed, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings.
Abstract: We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
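
For reference, a sketch of the method class the paper analyzes, Q-learning with linear function approximation: Q(s, a) is approximated by phi(s, a) . theta and updated with a semi-gradient TD rule. The toy environment and random features are assumptions; convergence in general requires conditions like those the paper identifies.

import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, d = 10, 2, 6
phi = rng.standard_normal((n_states, n_actions, d)) / np.sqrt(d)  # features
theta = np.zeros(d)
alpha, gamma = 0.05, 0.9

def q(s, a):
    return phi[s, a] @ theta

s = 0
for t in range(5000):
    if rng.random() < 0.1:                                   # epsilon-greedy
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax([q(s, b) for b in range(n_actions)]))
    s2 = (s + 1) % n_states if a == 0 else int(rng.integers(n_states))
    r = 1.0 if s2 == n_states - 1 else 0.0
    td = r + gamma * max(q(s2, b) for b in range(n_actions)) - q(s, a)
    theta += alpha * td * phi[s, a]                          # semi-gradient step
    s = s2
print(theta)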

Journal ArticleDOI
09 Apr 2008-PLOS ONE
TL;DR: The results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.
Abstract: Cooperation plays a key role in the evolution of complex systems. However, the level of cooperation extensively varies with the topology of agent networks in the widely used models of repeated games. Here we show that cooperation remains rather stable by applying the reinforcement learning strategy adoption rule, Q-learning on a variety of random, regular, small-world, scale-free and modular network models in repeated, multi-agent Prisoner's Dilemma and Hawk-Dove games. Furthermore, we found that using the above model systems other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) to the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies enabling the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.
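
A minimal sketch of Q-learning as a strategy-adoption rule in a networked repeated Prisoner's Dilemma: each node keeps Q-values over {cooperate, defect}, plays its neighbors, and acts randomly with a small "innovation" probability. The payoff table, ring topology, and constants are toy assumptions, not the paper's parameterization.

import numpy as np

rng = np.random.default_rng(6)
N, alpha, eps = 50, 0.1, 0.02
payoff = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
Q = {i: {"C": 0.0, "D": 0.0} for i in range(N)}
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}  # ring network

for rnd in range(2000):
    acts = {i: ("C" if rng.random() < 0.5 else "D") if rng.random() < eps  # innovation
            else max(Q[i], key=Q[i].get) for i in range(N)}
    for i in range(N):
        r = float(np.mean([payoff[(acts[i], acts[j])] for j in neighbors[i]]))
        Q[i][acts[i]] += alpha * (r - Q[i][acts[i]])           # strategy update
print(sum(acts[i] == "C" for i in range(N)) / N)               # cooperation level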

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input.
Abstract: In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.
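
The evaluation step at the core of the paper can be sketched as fitting a linear value function by minimizing an empirical Bellman-residual-style loss along a single trajectory. The toy chain, cosine features, and plain least-squares shortcut are assumptions; the paper's estimator and finite-sample analysis treat the bias of this loss far more carefully.

import numpy as np

rng = np.random.default_rng(7)
gamma, d, T = 0.9, 4, 2000
states = rng.random(T + 1)                       # 1-d continuous trajectory
rewards = states[:-1]                            # toy rewards along the path
feat = lambda s: np.array([np.cos(k * np.pi * s) for k in range(d)])
Phi = np.array([feat(s) for s in states[:-1]])   # features of visited states
Phi2 = np.array([feat(s) for s in states[1:]])   # features of successors
A = Phi - gamma * Phi2                           # residual design matrix
# min_w sum_t (V(s_t) - r_t - gamma * V(s_{t+1}))^2  ==  min_w ||A w - r||^2
w = np.linalg.lstsq(A, rewards, rcond=None)[0]
print(w)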

Journal ArticleDOI
TL;DR: It is demonstrated in numerical simulations that an appropriate feedback controller can be acquired within a few thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world.
Abstract: In this paper we describe a learning framework for a central pattern generator (CPG)-based biped locomotion controller using a policy gradient method. Our goals in this study are to achieve CPG-based biped walking with a 3D hardware humanoid and to develop an efficient learning algorithm with CPG by reducing the dimensionality of the state space used for learning. We demonstrate in numerical simulations that an appropriate feedback controller can be acquired within a few thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world. Numerical simulations and hardware experiments evaluate the walking velocity and stability. The results suggest that the learning algorithm is capable of adapting to environmental changes. Furthermore, we present an online learning scheme with an initial policy for a hardware robot to improve the controller within 200 iterations.
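
The learning loop itself is compact: a policy-gradient update of a few feedback gains that modulate the CPG. Below is a sketch with a score-function gradient estimate; the stand-in rollout_return objective and all constants are assumptions, since the paper evaluates real CPG dynamics on a humanoid.

import numpy as np

rng = np.random.default_rng(8)
K = np.zeros(3)                          # low-dimensional feedback gains

def rollout_return(gains):               # toy stand-in for walking performance
    best = np.array([0.4, -0.3, 0.9])
    return -np.sum((gains - best) ** 2) + 0.01 * rng.standard_normal()

alpha, sigma, n = 0.05, 0.1, 10
for trial in range(2000):
    eps = sigma * rng.standard_normal((n, K.size))       # perturbed rollouts
    R = np.array([rollout_return(K + e) for e in eps])
    grad = (R - R.mean()) @ eps / (n * sigma ** 2)       # baseline-subtracted
    K += alpha * grad
print(K)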

Journal ArticleDOI
TL;DR: Behavioral, neuropsychological, functional neuroimaging, and computational studies of basal ganglia and dopamine contributions to learning in humans implicate the basal ganglia in incremental, feedback-based learning that involves integrating information across multiple experiences.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: It is shown that linear value-function approximation is equivalent to a form of linear model approximation, and a relationship between the model-approximation error and the Bellman error is derived, which can guide feature selection for model improvement and/or value- function improvement.
Abstract: We show that linear value-function approximation is equivalent to a form of linear model approximation. We then derive a relationship between the model-approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value-function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.
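
The stated equivalence can be checked numerically in a few lines: fit a linear model in feature space (F predicts next features, b predicts reward), solve that model exactly, and compare with the usual linear fixed-point (LSTD-style) solution on the same data. The random data below are an assumption purely to exercise the identity.

import numpy as np

rng = np.random.default_rng(9)
n, d, gamma = 500, 5, 0.9
Phi = rng.standard_normal((n, d))        # features of visited states
Phi2 = rng.standard_normal((n, d))       # features of successor states
r = rng.standard_normal(n)

# (1) least-squares linear model: Phi @ F ~ Phi2 and Phi @ b ~ r
F = np.linalg.lstsq(Phi, Phi2, rcond=None)[0]
b = np.linalg.lstsq(Phi, r, rcond=None)[0]
w_model = np.linalg.solve(np.eye(d) - gamma * F, b)   # exact value of the model

# (2) linear TD fixed point on the same data
w_lstd = np.linalg.solve(Phi.T @ (Phi - gamma * Phi2), Phi.T @ r)
print(np.allclose(w_model, w_lstd))      # -> True: the two solutions coincide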

Journal ArticleDOI
01 Aug 2008
TL;DR: Several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms are described.
Abstract: This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
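
Two of the paper's combination rules, majority voting and Boltzmann multiplication, reduce to a few lines over the action preferences produced by the individual learners. The example probability vectors are placeholders; in the paper each one would come from a different RL algorithm's derived policy.

import numpy as np

def majority_voting(policies):
    """policies: list of action-probability vectors, one per RL algorithm."""
    votes = np.zeros_like(policies[0])
    for p in policies:
        votes[np.argmax(p)] += 1.0          # each learner votes for its greedy action
    return votes / votes.sum()

def boltzmann_multiplication(policies):
    prod = np.prod(np.vstack(policies), axis=0)  # multiply action probabilities
    return prod / prod.sum()

pols = [np.array([0.6, 0.3, 0.1]),   # e.g. Q-learning's policy
        np.array([0.2, 0.5, 0.3]),   # e.g. Sarsa's policy
        np.array([0.5, 0.4, 0.1])]   # e.g. actor-critic's policy
print(majority_voting(pols), boltzmann_multiplication(pols))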

Book ChapterDOI
15 Sep 2008
TL;DR: In this paper, the authors extend this approach to include explicit coordination between neighboring traffic lights, which is achieved using the max-plus algorithm, which estimates the optimal joint action by sending locally optimized messages among connected agents.
Abstract: Since traffic jams are ubiquitous in the modern world, optimizing the behavior of traffic lights for efficient traffic flow is a critically important goal. Though most current traffic lights use simple heuristic protocols, more efficient controllers can be discovered automatically via multiagent reinforcement learning, where each agent controls a single traffic light. However, in previous work on this approach, agents select only locally optimal actions without coordinating their behavior. This paper extends this approach to include explicit coordination between neighboring traffic lights. Coordination is achieved using the max-plus algorithm, which estimates the optimal joint action by sending locally optimized messages among connected agents. This paper presents the first application of max-plus to a large-scale problem and thus verifies its efficacy in realistic settings. It also provides empirical evidence that max-plus performs well on cyclic graphs, though it has been proven to converge only for tree-structured graphs. Furthermore, it provides a new understanding of the properties a traffic network must have for such coordination to be beneficial and shows that max-plus outperforms previous methods on networks that possess those properties.
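
A minimal sketch of max-plus message passing on a three-light chain 1 - 2 - 3, where each pair of neighbors shares a payoff table over joint actions (0 = north-south green, 1 = east-west green). The payoff tables are toy assumptions; on this tree one forward and one backward pass is exact, and the paper's contribution is running the same iteration on large cyclic networks.

import numpy as np

f12 = np.array([[2.0, 0.0], [0.0, 3.0]])   # payoff f12[a1, a2]
f23 = np.array([[1.0, 0.0], [0.0, 2.0]])   # payoff f23[a2, a3]

# mu_{i->j}(a_j) = max_{a_i} [ f_ij(a_i, a_j) + messages into i from others ]
mu_1to2 = f12.max(axis=0)                        # max over a1, indexed by a2
mu_3to2 = f23.max(axis=1)                        # max over a3, indexed by a2
mu_2to1 = (f12 + mu_3to2[None, :]).max(axis=1)   # indexed by a1
mu_2to3 = (f23 + mu_1to2[:, None]).max(axis=0)   # indexed by a3

a1 = int(np.argmax(mu_2to1))                     # each agent maximizes its belief
a2 = int(np.argmax(mu_1to2 + mu_3to2))
a3 = int(np.argmax(mu_2to3))
print(a1, a2, a3)   # -> 1 1 1, the jointly optimal green configuration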

Proceedings ArticleDOI
05 Jul 2008
TL;DR: The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, it back up the set of expected rewards that are maximal for some set of linear preferences.
Abstract: We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector, w). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector, w.
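
A sketch of the vector-valued backup: instead of a scalar max, each state keeps the set of expected-reward vectors that are optimal for at least one weight vector w. Pruning here checks optimality on sampled w's for simplicity, whereas the paper prunes exactly with convex hull operations; the two-criteria toy choice is an assumption.

import numpy as np

gamma = 0.9
W = [np.array([t, 1 - t]) for t in np.linspace(0, 1, 51)]  # sampled preferences

def prune(vectors):
    """Keep vectors that are best for at least one sampled preference w."""
    keep = {max(range(len(vectors)), key=lambda i: w @ vectors[i]) for w in W}
    return [vectors[i] for i in sorted(keep)]

rewards = {"left": np.array([1.0, 0.0]),    # two actions with opposite trade-offs
           "right": np.array([0.0, 1.0])}
V_next = [np.zeros(2)]                      # successor state's vector set
Q_sets = {a: prune([r + gamma * v for v in V_next]) for a, r in rewards.items()}
V0 = prune([q for qs in Q_sets.values() for q in qs])
print(V0)   # both vectors survive: each is optimal for some preference w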

Proceedings ArticleDOI
05 Jul 2008
TL;DR: A novel algorithm is introduced that transfers samples from the source tasks that are most similar to the target task, and it is shown empirically that, following the proposed approach, the transfer of samples is effective in reducing the learning complexity.
Abstract: The main objective of transfer in reinforcement learning is to reduce the complexity of learning the solution of a target task by effectively reusing the knowledge retained from solving a set of source tasks. In this paper, we introduce a novel algorithm that transfers samples (i.e., tuples 〈s, a, s', r〉) from source to target tasks. Under the assumption that tasks have similar transition models and reward functions, we propose a method to select samples from the source tasks that are most similar to the target task, and then to use them as input for batch reinforcement-learning algorithms. As a result, the number of samples an agent needs to collect from the target task to learn its solution is reduced. We empirically show that, following the proposed approach, the transfer of samples is effective in reducing the learning complexity, even when some source tasks are significantly different from the target task.
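
A rough sketch of the transfer idea: pool (s, a, s', r) tuples from several source tasks, score how compatible each is with the few transitions observed in the target task, and hand the best-scoring ones to a batch RL learner. The Gaussian compatibility score and 1-d toy tasks are assumptions standing in for the paper's principled task-similarity measure.

import numpy as np

rng = np.random.default_rng(11)

def make_task(drift, n=300):
    s = rng.random(n)
    a = rng.integers(0, 2, n).astype(float)
    s2 = s + drift * (2 * a - 1) + 0.01 * rng.standard_normal(n)
    r = -np.abs(s2 - 0.5)
    return np.stack([s, a, s2, r], axis=1)          # rows of (s, a, s', r)

sources = [make_task(0.1), make_task(0.11), make_task(0.5)]  # last: dissimilar
target = make_task(0.1)[:30]                        # only a few target samples

def compatibility(sample, target, bw=0.05):
    s, a, s2, _ = sample
    mask = (np.abs(target[:, 0] - s) < bw) & (target[:, 1] == a)
    if not mask.any():
        return 0.0
    return float(np.exp(-np.mean((target[mask, 2] - s2) ** 2) / bw ** 2))

pool = np.vstack(sources)
scores = np.array([compatibility(x, target) for x in pool])
transfer = pool[np.argsort(scores)[-200:]]          # best samples for batch RL
print(len(transfer))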

Journal ArticleDOI
TL;DR: The proposed method works in the setting of learning resolved motion rate control on a real, physical Mitsubishi PA-10 medical robotics arm and demonstrates feasibility for complex high degree-of-freedom robots.
Abstract: One of the most general frameworks for phrasing control problems for complex, redundant robots is operational-space control. However, while this framework is of essential importance for robotics and well understood from an analytical point of view, it can be prohibitively hard to achieve accurate control in the face of modeling errors, which are inevitable in complex robots (e.g. humanoid robots). In this paper, we suggest a learning approach for operational-space control as a direct inverse model learning problem. A first important insight for this paper is that a physically correct solution to the inverse problem with redundant degrees of freedom does exist when learning of the inverse map is performed in a suitable piecewise linear way. The second crucial component of our work is based on the insight that many operational-space controllers can be understood in terms of a constrained optimal control problem. The cost function associated with this optimal control problem allows us to formulate a learning algorithm that automatically synthesizes a globally consistent desired resolution of redundancy while learning the operational-space controller. From the machine learning point of view, this learning problem corresponds to a reinforcement learning problem that maximizes an immediate reward. We employ an expectation-maximization policy search algorithm in order to solve this problem. Evaluations on a three degrees-of-freedom robot arm are used to illustrate the suggested approach. The application to a physically realistic simulator of the anthropomorphic SARCOS Master arm demonstrates feasibility for complex high degree-of-freedom robots. We also show that the proposed method works in the setting of learning resolved motion rate control on a real, physical Mitsubishi PA-10 medical robotics arm.
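
The machine-learning core of the abstract, RL with an immediate reward solved by an EM-style policy search, can be sketched as reward-weighted regression on a toy 1-d "robot": exploratory joint velocities are scored by an exponentiated operational-space cost, and the linear policy is refit toward high-reward actions. The toy Jacobian, cost weights, and policy form are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(12)

def reward(qdot, xdot_des, J=np.array([1.0, 0.5])):
    task_err = (J @ qdot - xdot_des) ** 2     # operational-space tracking error
    effort = 0.1 * qdot @ qdot                # cost that resolves the redundancy
    return np.exp(-(task_err + effort))       # exponentiated immediate cost

W = np.zeros(2)                               # linear policy: qdot = W * xdot
for it in range(300):
    xdot = rng.uniform(-1, 1, 50)                            # desired velocities
    qdots = np.outer(xdot, W) + 0.2 * rng.standard_normal((50, 2))  # exploration
    r = np.array([reward(q, x) for q, x in zip(qdots, xdot)])
    # reward-weighted regression: refit W toward the high-reward actions
    W = (r[:, None] * qdots * xdot[:, None]).sum(0) / (r * xdot ** 2).sum()
print(W)   # approaches the reward-optimal, consistent inverse mapping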