
Author

Shalabh Bhatnagar

Other affiliations: University of Marne-la-Vallée, Indian Institutes of Technology, University of Maryland, College Park

Bio: Shalabh Bhatnagar is an academic researcher from the Indian Institute of Science. The author has contributed to research on stochastic approximation and Markov decision processes, has an h-index of 30, and has co-authored 294 publications that have received 4,300 citations. Previous affiliations of Shalabh Bhatnagar include University of Marne-la-Vallée and Indian Institutes of Technology.

Papers published on a yearly basis (chart): years shown are 1995 and 1997 through 2023.

Papers


Proceedings Article

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Richard S. Sutton¹, Hamid Reza Maei¹, Doina Precup², Shalabh Bhatnagar³, David Silver¹, Csaba Szepesvári¹, Eric Wiewiora¹

University of Alberta¹, McGill University², Indian Institute of Science³

14 Jun 2009

TL;DR: This paper introduces two new gradient-based TD algorithms with better convergence rates than GTD: GTD2, which uses a different objective function, and linear TD with gradient correction (TDC), which uses the conventional TD update plus an additional term that is initially zero. TDC appears to extend linear TD to off-policy learning.

Abstract: Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
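The abstract describes TDC as the conventional linear TD update plus a correction term that starts at zero, and GTD2 as a related rule derived from a different objective. The following is a minimal sketch of one update step of each, written from the commonly cited form of these rules rather than taken from this page. The function names, the pure-Python vector representation, and the omission of importance-sampling weights are assumptions made for illustration only.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def td_error(theta: Vector, phi: Vector, reward: float,
             phi_next: Vector, gamma: float) -> float:
    """One-step TD error for a linear value estimate theta . phi."""
    return reward + gamma * dot(theta, phi_next) - dot(theta, phi)


def tdc_update(theta: Vector, w: Vector, phi: Vector, reward: float,
               phi_next: Vector, alpha: float, beta: float,
               gamma: float) -> Tuple[Vector, Vector]:
    """One TDC step (sketch): the TD update plus a gradient-correction term.

    The correction term is proportional to phi . w, so it vanishes
    while the auxiliary weights w are still zero.
    """
    delta = td_error(theta, phi, reward, phi_next, gamma)
    phi_w = dot(phi, w)
    new_theta = [t + alpha * (delta * f - gamma * phi_w * fn)
                 for t, f, fn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w


def gtd2_update(theta: Vector, w: Vector, phi: Vector, reward: float,
                phi_next: Vector, alpha: float, beta: float,
                gamma: float) -> Tuple[Vector, Vector]:
    """One GTD2 step (sketch): theta moves along (phi - gamma * phi_next)."""
    delta = td_error(theta, phi, reward, phi_next, gamma)
    phi_w = dot(phi, w)
    new_theta = [t + alpha * (f - gamma * fn) * phi_w
                 for t, f, fn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w
```

With `w` initialized to zero, the first `tdc_update` call changes `theta` exactly as conventional TD would, while the first `gtd2_update` call leaves `theta` unchanged and only adapts `w`. Each step costs a constant number of vector operations, which is consistent with the abstract's remark that TDC roughly doubles the computation of conventional TD.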


605 citations

Journal Article

Natural actor-critic algorithms

Shalabh Bhatnagar¹, Richard S. Sutton², Mohammad Ghavamzadeh³, Mark Lee²

Indian Institute of Science¹, University of Alberta², French Institute for Research in Computer Science and Automation³

01 Nov 2009-Automatica

TL;DR: Four new reinforcement learning algorithms based on actor-critic, natural-gradient, and function-approximation ideas are presented together with their convergence proofs; these are described as the first convergence proofs and the first fully incremental algorithms of this kind.

530 citations

Journal Article

Reinforcement Learning With Function Approximation for Traffic Signal Control

L A Prashanth¹, Shalabh Bhatnagar¹

Indian Institute of Science¹

01 Jun 2011-IEEE Transactions on Intelligent Transportation Systems

TL;DR: A reinforcement learning (RL) algorithm with function approximation for traffic signal control is proposed; it incorporates state-action features, is easily implementable in high-dimensional settings, and outperforms all the other algorithms compared on every road network setting considered.

Abstract: We propose, for the first time, a reinforcement learning (RL) algorithm with function approximation for traffic signal control. Our algorithm incorporates state-action features and is easily implementable in high-dimensional settings. Prior work, e.g., the work of Abdulhai, on the application of RL to traffic signal control requires full-state representations and cannot be implemented, even in moderate-sized road networks, because the computational complexity grows exponentially in the number of lanes and junctions. We tackle this problem of the curse of dimensionality by effectively using feature-based state representations that use a broad characterization of the level of congestion as low, medium, or high. One advantage of our algorithm is that, unlike prior work based on RL, it does not require precise information on queue lengths and elapsed times at each lane but instead works with the aforementioned features. The number of features that our algorithm requires is linear in the number of signaled lanes, thereby leading to several orders of magnitude reduction in the computational complexity. We perform implementations of our algorithm on various settings and show performance comparisons with other algorithms in the literature, including the works of Abdulhai and Cools, as well as the fixed-timing and the longest queue algorithms. For comparison, we also develop an RL algorithm that uses full-state representation and incorporates prioritization of traffic, unlike the work of Abdulhai. We observe that our algorithm outperforms all the other algorithms on all the road network settings that we consider.
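To make the feature idea concrete, the sketch below maps each signaled lane to a single coarse congestion level, so the feature vector has one entry per lane and its length grows linearly with the number of lanes. The thresholds, the numeric codes 0.0, 0.5, and 1.0, and the function names are hypothetical choices for illustration; the abstract does not specify them. Numeric queue lengths are used here only to drive the example, since the abstract states that the algorithm itself does not need precise queue information.

```python
from typing import List

# Hypothetical numeric codes for the three congestion levels.
LOW, MEDIUM, HIGH = 0.0, 0.5, 1.0


def congestion_level(queue_length: float, low_threshold: float,
                     high_threshold: float) -> float:
    """Classify one lane as low, medium, or high congestion."""
    if queue_length < low_threshold:
        return LOW
    if queue_length < high_threshold:
        return MEDIUM
    return HIGH


def congestion_features(queue_lengths: List[float], low_threshold: float,
                        high_threshold: float) -> List[float]:
    """Build one coarse feature per signaled lane (length is linear in lanes)."""
    return [congestion_level(q, low_threshold, high_threshold)
            for q in queue_lengths]
```

For example, `congestion_features([2, 10, 30], 5, 20)` yields one level per lane for a three-lane junction. A full-state representation would instead have to enumerate joint queue configurations across lanes, which is the exponential growth the abstract refers to.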


272 citations

Proceedings Article

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Shalabh Bhatnagar¹, Doina Precup², David Silver³, Richard S. Sutton³, Hamid Reza Maei³, Csaba Szepesvári³

Indian Institute of Science¹, McGill University², University of Alberta³

07 Dec 2009

TL;DR: This work presents a Bellman error objective function and two gradient-descent TD algorithms that optimize it, and proves the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution.


Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms' effectiveness.


249 citations

Proceedings Article

Toward Off-Policy Learning Control with Function Approximation

Hamid Reza Maei¹, Csaba Szepesvári¹, Shalabh Bhatnagar², Richard S. Sutton¹

University of Alberta¹, Indian Institute of Science²

21 Jun 2010

TL;DR: The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.


Abstract: We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.


233 citations


Cited by


Journal Article

I and i

Kevin Barraclough

08 Dec 2001-BMJ

TL;DR: A personal reflection on i, the square root of minus one, which struck the author at school as an odd beast, an intruder hovering on the edge of reality.

Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …


33,785 citations

Journal Article

Machine learning

Thomas G. Dietterich¹

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.


Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).


13,246 citations

Data Mining - Concepts and Techniques.


Petra Perner

01 Jan 2002

9,314 citations

Posted Content

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller

19 Dec 2013-arXiv: Learning

TL;DR: This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, which outperforms all previous approaches on six of the games and surpasses a human expert on three of them.


Abstract: We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.


8,757 citations

Journal Article

Convergence of Probability Measures

J. F. C. Kingman¹

University of Sussex¹

01 Nov 1969-Journal of the Royal Statistical Society, Series C (Applied Statistics)

TL;DR: A review of P. Billingsley's book Convergence of Probability Measures (Wiley, 1968).

Abstract: Convergence of Probability Measures. By P. Billingsley. Chichester, Sussex, Wiley, 1968. xii, 253 p. 9 1/4 in. 117s.

5,689 citations
