Topic

Reward-based selection

About: Reward-based selection is a research topic. Over the lifetime, 365 publications have been published within this topic, receiving 14,137 citations.


Papers
Proceedings ArticleDOI
04 Jul 2004
TL;DR: This work models the expert as maximizing a reward function expressible as a linear combination of known features, and gives an algorithm for learning the demonstrated task that uses "inverse reinforcement learning" to try to recover the unknown reward function.
Abstract: We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.

3,110 citations
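The iteration structure described in the abstract above can be sketched briefly. In the sketch below, the toy two-state MDP, its one-hot features, and the stand-in "expert" policy are assumptions made for illustration only; what follows the paper is the loop of solving an MDP for the current reward weights, measuring feature expectations, and projecting toward the expert's feature expectations.

```python
import numpy as np

# Illustrative two-state, two-action MDP; dynamics, features, and the
# "expert" policy are assumptions, not taken from the paper.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # P[s, a, s'] transition probabilities
              [[0.8, 0.2], [0.2, 0.8]]])
phi = np.eye(n_states)                        # phi[s]: known feature vector of state s

def solve_mdp(w, n_iter=200):
    """Value iteration for the linear reward R(s) = w . phi(s); returns a greedy policy."""
    R = phi @ w
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = R[:, None] + gamma * (P @ V)      # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def feature_expectations(policy, horizon=200):
    """Discounted feature expectations mu(pi), starting from state 0."""
    P_pi = P[np.arange(n_states), policy]     # P_pi[s, s'] under the given policy
    d = np.zeros(n_states); d[0] = 1.0
    mu = np.zeros(phi.shape[1])
    for t in range(horizon):
        mu += (gamma ** t) * (d @ phi)
        d = d @ P_pi
    return mu

# Stand-in for the expert's demonstrations: feature expectations of a fixed policy.
mu_E = feature_expectations(np.zeros(n_states, dtype=int))

# Projection loop: solve the MDP for the current reward weights, then move the
# achieved feature expectations toward the expert's.
policy = np.ones(n_states, dtype=int)
mu_bar = feature_expectations(policy)
for _ in range(20):
    w = mu_E - mu_bar                         # current estimate of the reward weights
    if np.linalg.norm(w) < 1e-6:              # feature expectations (nearly) matched
        break
    policy = solve_mdp(w)
    mu = feature_expectations(policy)
    denom = (mu - mu_bar) @ (mu - mu_bar)
    if denom < 1e-12:
        break
    mu_bar = mu_bar + ((mu - mu_bar) @ (mu_E - mu_bar)) / denom * (mu - mu_bar)

print("final reward-weight estimate:", mu_E - mu_bar)
print("policy matching the expert:", policy)
```

In the paper itself the per-iteration policy can come from any reinforcement-learning algorithm, and feature expectations are estimated from sampled trajectories rather than computed exactly as done here.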

Journal ArticleDOI
TL;DR: Human subjects indicated their preference between a hypothetical $1,000 reward, available with various probabilities or delays, and a certain reward of variable amount available immediately; the function relating the subjectively equivalent certain-immediate amount to the delay had the same general shape (hyperbolic) as the function found by Mazur (1987) to describe pigeons' delay discounting.
Abstract: Human subjects indicated their preference between a hypothetical $1,000 reward available with various probabilities or delays and a certain reward of variable amount available immediately. The function relating the amount of the certain-immediate reward subjectively equivalent to the delayed $1,000 reward had the same general shape (hyperbolic) as the function found by Mazur (1987) to describe pigeons' delay discounting. The function relating the certain-immediate amount of money subjectively equivalent to the probabilistic $1,000 reward was also hyperbolic, provided that the stated probability was transformed to odds against winning. In a second experiment, when human subjects chose between a delayed $1,000 reward and a probabilistic $1,000 reward, delay was proportional to the same odds-against transformation of the probability to which it was subjectively equivalent.

1,249 citations
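The two discounting forms described in the abstract can be written out directly: Mazur's hyperbolic delay discounting and probability discounting via the odds against winning. In this sketch the discount parameters k_d and k_p are illustrative assumptions, not the values fitted in the study.

```python
# Hyperbolic discounting of delayed and probabilistic rewards, as in the study above.
# k_d and k_p are illustrative parameters, not the paper's fitted values.

def discounted_value_delay(amount, delay, k_d=0.05):
    """Mazur (1987) hyperbolic delay discounting: V = A / (1 + k_d * D)."""
    return amount / (1.0 + k_d * delay)

def discounted_value_probability(amount, p, k_p=1.0):
    """Probability discounting with the odds-against transformation: theta = (1 - p) / p."""
    odds_against = (1.0 - p) / p
    return amount / (1.0 + k_p * odds_against)

# A hypothetical $1,000 reward delayed by 12 months vs. offered with probability 0.5
print(discounted_value_delay(1000, delay=12))        # subjective value of the delayed reward
print(discounted_value_probability(1000, p=0.5))     # subjective value of the risky reward
```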

Journal ArticleDOI
03 Aug 2006, Neuron
TL;DR: The results suggest that, beyond its role in learning, motivation, and salience, a primary task of the dopaminergic system is to convey signals about upcoming stochastic rewards, such as their expected reward and risk.

664 citations

Journal ArticleDOI
TL;DR: This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework, and a detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels.
Abstract: This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.

397 citations
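For context, the R-learning method studied in this paper maintains relative action values together with a separately estimated average reward per step. The sketch below shows one tabular update; the environment loop, exploration strategy, step sizes, and the synthetic transition are assumptions made for illustration.

```python
import numpy as np

# One tabular R-learning update (the average-reward method studied above).
# Step sizes and the example transition are illustrative; exploration is omitted.

def r_learning_update(R, rho, s, a, reward, s_next, beta=0.1, alpha=0.01):
    """Update the relative action values R[s, a] and the average-reward estimate rho."""
    R[s, a] += beta * (reward - rho + R[s_next].max() - R[s, a])
    # The average reward rho is estimated independently of the relative values and is
    # updated only when a greedy (non-exploratory) action was taken.
    if R[s, a] == R[s].max():
        rho += alpha * (reward - rho + R[s_next].max() - R[s].max())
    return rho

# Example: 5 states, 2 actions, one synthetic transition (s=0, a=1, r=1.0, s'=2).
R = np.zeros((5, 2))
rho = 0.0
rho = r_learning_update(R, rho, s=0, a=1, reward=1.0, s_next=2)
```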

Proceedings Article
12 Dec 2011
TL;DR: A probabilistic algorithm that allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert's policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.

336 citations
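The representational idea in this paper, a reward modeled as a nonlinear function of state features by a Gaussian process whose per-feature length-scales indicate relevance, can be sketched with an off-the-shelf GP regressor. This is not the paper's full GPIRL algorithm, which fits the GP by maximizing the likelihood of the expert's demonstrations rather than regressing on known reward values; the feature data and reward targets below are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Sketch of the representation only: a GP maps state features to a nonlinear reward,
# with one length-scale per feature (ARD) indicating that feature's relevance.
# The feature samples and reward targets are synthetic stand-ins.

rng = np.random.default_rng(0)
features = rng.uniform(size=(50, 2))               # phi(s) for 50 sampled states
reward_targets = np.sin(3 * features[:, 0])        # stand-in reward; feature 1 is irrelevant

kernel = RBF(length_scale=[1.0, 1.0])              # anisotropic (ARD) RBF kernel
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4).fit(features, reward_targets)

# A large learned length-scale marks the corresponding feature as irrelevant to the reward.
print("learned length-scales:", gp.kernel_.length_scale)
print("reward at a new state:", gp.predict(features[:1]))
```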


Network Information
Related Topics (5)
Reinforcement learning: 46K papers, 1M citations, 80% related
Inference: 36.8K papers, 1.3M citations, 76% related
Heuristics: 32.1K papers, 956.5K citations, 73% related
Probabilistic logic: 56K papers, 1.3M citations, 70% related
Convex optimization: 24.9K papers, 908.7K citations, 69% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2019    2
2018    3
2017    25
2016    24
2015    25
2014    23