
Showing papers by "Shalabh Bhatnagar published in 2020"


Journal ArticleDOI
TL;DR: It is illustrated that the change point method detects change in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward.
Abstract: Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios, RL methods yield sub-optimal decisions. In this paper, we therefore consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward accrued when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect changes in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method detects changes in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
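
A minimal sketch of the overall pattern described above, not the paper's specific change-point algorithm: a CUSUM-style statistic on the observed rewards stands in for the change detector, and a flagged change simply resets the Q-learner's step size and exploration so it can re-adapt to the new environment model. The gym-like `env` interface, thresholds, and schedules are illustrative assumptions.

```python
import numpy as np

# Sketch only: a CUSUM-style test on the reward stream stands in for the paper's
# change detector; a detected change resets the Q-learner's step size and exploration.
class CusumDetector:
    def __init__(self, drift=0.05, threshold=5.0):
        self.drift, self.threshold = drift, threshold
        self.mean, self.n, self.stat = 0.0, 0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n                # running mean of the monitored statistic
        self.stat = max(0.0, self.stat + abs(x - self.mean) - self.drift)
        if self.stat > self.threshold:                        # change point flagged
            self.mean, self.n, self.stat = 0.0, 0, 0.0
            return True
        return False

def q_learning_with_change_detection(env, n_states, n_actions, steps=10_000, gamma=0.9):
    # `env` is a hypothetical gym-like MDP with reset()/step(a)
    Q = np.zeros((n_states, n_actions))
    detector, alpha, eps = CusumDetector(), 0.5, 1.0
    s = env.reset()
    for _ in range(steps):
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, done, _ = env.step(a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # standard Q-learning update
        if detector.update(r):
            alpha, eps = 0.5, 1.0                              # re-learn after a detected change
        else:
            alpha, eps = max(0.05, alpha * 0.999), max(0.05, eps * 0.999)
        s = env.reset() if done else s_next
    return Q
```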

23 citations


Journal ArticleDOI
TL;DR: It is shown that the gradient and/or Hessian estimates in the resulting algorithms with DPs are asymptotically unbiased, so that the algorithms are provably convergent; convergence rates are also derived to establish the superiority of the first-order and second-order algorithms.
Abstract: We introduce deterministic perturbation (DP) schemes for the recently proposed random directions stochastic approximation and propose new first-order and second-order algorithms. In the latter case, these are the first second-order algorithms to incorporate DPs. We show that the gradient and/or Hessian estimates in the resulting algorithms with DPs are asymptotically unbiased, so that the algorithms are provably convergent. Furthermore, we derive convergence rates for the first-order and second-order algorithms, for the special cases of convex and quadratic optimization problems respectively, to establish their superiority. Numerical experiments are used to validate the theoretical results.
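
For concreteness, here is a sketch of a two-sided directional gradient estimate with deterministic perturbations. The paper's specific DP construction is not reproduced; as a stand-in, the columns of a Hadamard matrix (entries ±1) are cycled through and the estimates over one full cycle are averaged, which makes the estimate exact for quadratics and unbiased in the limit under the usual smoothness assumptions.

```python
import numpy as np

# Deterministic-perturbation directional gradient estimate (sketch).
# Cycling through the +/-1 columns of a Hadamard matrix is used here as a
# stand-in for the paper's DP sequences; n must be a power of 2 in this sketch.

def hadamard(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def dp_gradient_estimate(f, x, delta=1e-3):
    n = x.size
    H = hadamard(n)
    g = np.zeros(n)
    for k in range(n):
        d = H[:, k]                      # deterministic +/-1 perturbation direction
        g += d * (f(x + delta * d) - f(x - delta * d)) / (2.0 * delta)
    return g / n                          # cycle-average of d d^T is the identity

# Example: gradient of a quadratic, compared with the exact gradient 2 A x.
A = np.diag([1.0, 2.0, 3.0, 4.0])
f = lambda x: float(x @ A @ x)
x0 = np.array([1.0, -1.0, 0.5, 2.0])
print(dp_gradient_estimate(f, x0), 2 * A @ x0)
```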

11 citations


Journal ArticleDOI
TL;DR: In this paper, the asymptotic behavior of a stochastic approximation scheme on two timescales with set-valued drift functions and in the presence of nonadditive iterate-dependent Markov noise is studied.
Abstract: In this paper, we study the asymptotic behavior of a stochastic approximation scheme on two timescales with set-valued drift functions and in the presence of nonadditive iterate-dependent Markov noise.

11 citations


Posted Content
TL;DR: A linear policy is used for realizing end-foot trajectories in the quadruped robot Stoch 2; it is not only computationally lightweight but also uses minimal sensing and actuation capabilities, thereby justifying the approach.
Abstract: In this paper, with a view toward fast deployment of locomotion gaits in low-cost hardware, we use a linear policy for realizing end-foot trajectories in the quadruped robot Stoch 2. In particular, the parameters of the end-foot trajectories are shaped via a linear feedback policy that takes the torso orientation and the terrain slope as inputs. The corresponding desired joint angles are obtained via an inverse kinematics solver and tracked via a PID control law. Augmented Random Search, a model-free and gradient-free learning algorithm, is used to train this linear policy. Simulation results show that the resulting walking is robust to terrain slope variations and external pushes. This methodology is not only computationally lightweight but also uses minimal sensing and actuation capabilities in the robot, thereby justifying the approach.
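
A sketch of the idea under stated assumptions: the trajectory-shaping action is a linear function of the observation (assumed here to be torso roll, pitch, and terrain slope), and the weight matrix is trained with a basic Augmented Random Search update. The dimensions and the `simulate_step` stub are illustrative placeholders, not the paper's actual interface.

```python
import numpy as np

OBS_DIM, ACT_DIM = 3, 8            # assumed: [roll, pitch, slope] -> end-foot trajectory parameters

def simulate_step(action):
    """Dummy placeholder for one step of the physics simulator."""
    obs = 0.1 * np.random.randn(OBS_DIM)
    reward = 1.0 - 0.01 * float(np.linalg.norm(action))
    return obs, reward

def rollout(W, episode_len=400):
    """Total reward of one episode under the linear feedback policy a = W @ obs."""
    total, obs = 0.0, np.zeros(OBS_DIM)
    for _ in range(episode_len):
        obs, reward = simulate_step(W @ obs)
        total += reward
    return total

def ars_train(iterations=100, n_dirs=8, step=0.02, noise=0.03):
    W = np.zeros((ACT_DIM, OBS_DIM))
    for _ in range(iterations):
        deltas = [np.random.randn(ACT_DIM, OBS_DIM) for _ in range(n_dirs)]
        r_plus = [rollout(W + noise * d) for d in deltas]
        r_minus = [rollout(W - noise * d) for d in deltas]
        # gradient-free update: move along directions weighted by reward differences
        W += (step / n_dirs) * sum((rp - rm) * d
                                   for rp, rm, d in zip(r_plus, r_minus, deltas))
    return W

W_trained = ars_train(iterations=5)    # small run for illustration
```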

9 citations


Journal ArticleDOI
01 Jan 2020
TL;DR: Building on a successive over-relaxation (SOR)-based value iteration scheme from the literature, which speeds up computation of the optimal value function via a modified Bellman equation, an SOR Q-learning algorithm is proposed and proved to converge almost surely.
Abstract: In a discounted reward Markov decision process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation (SOR)-based value iteration scheme has been proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to reinforcement learning (RL) algorithms to obtain the optimal policy and value function. One such popular algorithm is Q-learning. In this letter, we propose SOR Q-learning. We first derive a modified fixed point iteration for SOR Q-values and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy. We then prove the almost sure convergence of SOR Q-learning to the SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning is faster than the standard Q-learning algorithm.
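
A minimal tabular sketch of how a relaxation parameter can enter the Q-learning update, assuming the relaxed target blends the standard bootstrapped target with the current state's own maximum Q-value through a parameter w (with w = 1 recovering ordinary Q-learning). This illustrates the successive over-relaxation idea and is not claimed to be the paper's exact update.

```python
import numpy as np

# Relaxation-style Q-learning update (sketch, assumption-based): w = 1 gives standard Q-learning.
def sor_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, w=1.2):
    standard_target = r + gamma * Q[s_next].max()
    relaxed_target = w * standard_target + (1.0 - w) * Q[s].max()
    Q[s, a] += alpha * (relaxed_target - Q[s, a])
    return Q

# usage on a toy table
Q = np.zeros((5, 2))
Q = sor_q_update(Q, s=0, a=1, r=1.0, s_next=3)
```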

7 citations


Proceedings ArticleDOI
01 Feb 2020
TL;DR: In this paper, a review of different PCC structures with modified sensing properties is presented; the reported work demonstrates that miniaturization of optical biosensors is possible with different PCC structures.
Abstract: This review paper provides a brief overview of photonic crystal cavity (PCC) structures and their applications in biosensing devices. The paper reviews different PCC structures with modified sensing properties. For several optical sensing applications, different PCC structures and their properties are described in detail. The reported work and results demonstrate that miniaturization of optical biosensors is possible with different PCC structures, which offer structural flexibility for different sensing applications.

6 citations


Journal ArticleDOI
30 Jan 2020
TL;DR: A generalization of the Speedy Q-learning algorithm, called GSQL-w, is derived using a successive-relaxation-based generalized Bellman operator to handle the slow convergence of Watkins' Q-learning.
Abstract: In this letter, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was proposed in the Reinforcement Learning (RL) literature to handle the slow convergence of Watkins' Q-learning. In most RL algorithms such as Q-learning, the Bellman equation and the Bellman operator play an important role. It is possible to generalize the Bellman operator using the technique of successive relaxation. We use the generalized Bellman operator to derive a simple and efficient family of algorithms called Generalized Speedy Q-learning (GSQL-w) and analyze its finite time performance. We show that GSQL-w has an improved finite time performance bound compared to SQL when the relaxation parameter w is greater than 1. This improvement is a consequence of the contraction factor of the generalized Bellman operator being less than that of the standard Bellman operator. Numerical experiments are provided to demonstrate the empirical performance of the GSQL-w algorithm.
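
For concreteness, one common successive-relaxation form of the Bellman operator is written below as an assumption, not taken verbatim from the paper; at w = 1 it reduces to the standard operator T, and the abstract's claim is that for suitable w > 1 the generalized operator has a smaller contraction factor.

\[
(T_w Q)(s,a) \;=\; w\Big( r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{b} Q(s',b) \Big) \;+\; (1-w)\,\max_{b} Q(s,b), \qquad T_1 = T.
\]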

5 citations


Journal ArticleDOI
TL;DR: In this paper, it is shown that after a large number of iterations, if the stochastic approximation process enters the domain of attraction of an attracting set, it gets locked into the attracting set with high probability.
Abstract: In this paper, we analyze the behavior of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee. We prove that after a large number of iterations, if the stochastic approximation process enters the domain of attraction of an attracting set, it gets locked into the attracting set with high probability. We demonstrate that this result is an effective instrument for analyzing stochastic approximation schemes in the absence of a stability guarantee: we use it to obtain an alternate criterion for convergence in the presence of a locally attracting set for the mean field, and to show that a feedback mechanism, which resets the iterates at regular time intervals, stabilizes the scheme when the mean field possesses a globally attracting set, thereby guaranteeing convergence. The results in this paper build on the works of Borkar, Andrieu et al., and Chen et al. by allowing for the presence of set-valued drift functions.

5 citations


Proceedings ArticleDOI
01 Aug 2020
TL;DR: A two-pronged approach is proposed to generate leg trajectories for continuously varying target linear and angular velocities in a stable manner, using a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory.
Abstract: With research into the development of quadruped robots picking up pace, learning-based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two-pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii. These policies are then augmented using a higher-level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations than standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.
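
A minimal behavior-cloning sketch of such a command filter. The input/output layout (target velocity, turning radius, and current command mapped to a smoothed command) and the network size are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

# Behavior-cloning sketch of the command filter: regress a small MLP onto expert
# (input, command) pairs. Dimensions and architecture are assumptions.
class CommandFilter(nn.Module):
    def __init__(self, in_dim=4, out_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def train_filter(expert_inputs, expert_commands, epochs=200, lr=1e-3):
    """expert_inputs: (N, 4) raw commands/state; expert_commands: (N, 2) expert outputs."""
    model = CommandFilter()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(expert_inputs), expert_commands)   # regress onto expert demonstrations
        loss.backward()
        opt.step()
    return model

# usage with synthetic stand-in data
x = torch.randn(256, 4)
y = torch.tanh(x[:, :2])            # placeholder "expert" mapping
filter_net = train_filter(x, y)
```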

4 citations



Proceedings ArticleDOI
01 Aug 2020
TL;DR: This work provides model-free reinforcement learning (RL)-based policies for slot-sharing between UE and IoT data and compares the performance of the RL-based policies with low-complexity heuristic-based slot-sharing schemes which either prioritise the UE data, account only for near-threshold aged UE data, or are oblivious to the amount of UE data.
Abstract: We consider an industrial internet-of-things (IIoT) system with multiple IoT devices, a user equipment (UE), and a base station (BS) that receives the UE and IoT data. To circumvent the issue of numerous IoT-to-BS connections and to conserve the IoT devices' energies, the UE serves as a relay to forward the IoT data to the BS. The UE employs frame-based uplink transmissions, wherein it shares a few slots of every frame to relay the IoT data. The IIoT system experiences a transmission failure, called an outage, when IoT data is not transmitted. The unsent UE data is stored in the UE's buffer and is discarded after the storage time exceeds the age threshold. As the UE and IoT devices share the transmission slots, trade-offs exist between system outages and aged UE data loss. To resolve this outage-versus-data-ageing challenge, we provide model-free reinforcement learning (RL)-based policies for slot-sharing between UE and IoT data. We compare the performance of the RL-based policies with low-complexity heuristic-based slot-sharing schemes which either prioritise the UE data, account only for near-threshold aged UE data, or are oblivious to the amount of UE data.

Posted Content
02 Sep 2020
TL;DR: A framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP) is proposed and it is observed that in each case the algorithm converges and finds the optimal policy.
Abstract: In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP). This result facilitates the application of off-the-shelf algorithms from Reinforcement Learning (RL) literature towards designing optimal control policies. Second, we model a set of benchmark examples of hybrid control design problem in the proposed MDP framework. Third, we adapt the recently proposed Proximal Policy Optimisation (PPO) algorithm for the hybrid action space and apply it to the above set of problems. It is observed that in each case the algorithm converges and finds the optimal policy.
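
A sketch of one way a policy network can be parameterised over a hybrid action space (a discrete mode plus continuous inputs), the kind of policy head needed when adapting PPO to such problems: a categorical head and a Gaussian head share a torso, and the factorised joint log-probability feeds the usual PPO ratio. Layer sizes and the factorised form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

# Hybrid-action policy head (sketch): discrete switching decision + continuous control input.
class HybridPolicy(nn.Module):
    def __init__(self, obs_dim=8, n_modes=3, cont_dim=2, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mode_logits = nn.Linear(hidden, n_modes)        # discrete mode
        self.cont_mean = nn.Linear(hidden, cont_dim)         # continuous input
        self.cont_logstd = nn.Parameter(torch.zeros(cont_dim))

    def forward(self, obs):
        h = self.torso(obs)
        mode_dist = Categorical(logits=self.mode_logits(h))
        cont_dist = Normal(self.cont_mean(h), self.cont_logstd.exp())
        return mode_dist, cont_dist

    def sample(self, obs):
        mode_dist, cont_dist = self(obs)
        mode = mode_dist.sample()
        u = cont_dist.sample()
        # factorised joint log-prob, usable in the PPO probability ratio
        logp = mode_dist.log_prob(mode) + cont_dist.log_prob(u).sum(-1)
        return mode, u, logp

policy = HybridPolicy()
mode, u, logp = policy.sample(torch.randn(1, 8))
```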

Proceedings ArticleDOI
26 Oct 2020
TL;DR: A novel approach that makes use of the independent learners Deep Q-learning algorithm is proposed to solve the problem of energy management in microgrid networks in the framework of stochastic games.
Abstract: We consider the problem of energy management in microgrid networks. A microgrid is capable of generating power from a renewable resource and is responsible for handling the demands of its dedicated customers. Owing to the variable nature of renewable generation and of the customers' demands, it becomes imperative that each microgrid optimally manages its energy. This involves intelligently scheduling the demands at the customer side, and selling power to (when there is a surplus) or buying power from (when there is a deficit) neighboring microgrids, depending on current and future needs. In this work, we formulate the problems of demand and battery scheduling, energy trading, and dynamic pricing (where we allow the microgrids to decide the price of a transaction depending on their current configuration of demand and renewable energy) in the framework of stochastic games. Subsequently, we propose a novel approach that makes use of the independent learners Deep Q-learning algorithm to solve this problem.
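
A minimal sketch of the independent-learners idea: each microgrid agent keeps its own Q-network and learns from its own transitions as if it were alone, treating the other agents as part of the environment. The observation and action sizes and the network are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Independent-learners DQN sketch: one Q-network per agent, per-agent updates only.
def make_qnet(obs_dim=6, n_actions=5, hidden=64):
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

class IndependentAgent:
    def __init__(self, obs_dim=6, n_actions=5, gamma=0.99, lr=1e-3):
        self.q = make_qnet(obs_dim, n_actions)
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.gamma = gamma

    def act(self, obs, eps=0.1):
        if torch.rand(1).item() < eps:
            return torch.randint(self.q[-1].out_features, (1,)).item()
        return int(self.q(obs).argmax().item())

    def learn(self, obs, action, reward, next_obs, done):
        # one-step DQN update on this agent's own transition only
        with torch.no_grad():
            target = reward + self.gamma * (1 - done) * self.q(next_obs).max()
        loss = (self.q(obs)[action] - target).pow(2)
        self.opt.zero_grad(); loss.backward(); self.opt.step()

agents = [IndependentAgent() for _ in range(3)]   # e.g. three microgrids
```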

Proceedings ArticleDOI
19 Jul 2020
TL;DR: A new deep reinforcement learning algorithm using the technique of successive over-relaxation (SOR) in Deep Q-networks (DQNs) achieves significant improvements over DQN on both synthetic and real datasets.
Abstract: We present a new deep reinforcement learning algorithm using the technique of successive over-relaxation (SOR) in Deep Q-networks (DQNs). The new algorithm, named SOR-DQN, uses modified targets in the DQN framework with the aim of accelerating training. This work is motivated by the problem of auto-scaling resources for cloud applications, for which existing algorithms suffer from issues such as slow convergence, poor performance during the training phase and non-scalability. For the above problem, SOR-DQN achieves significant improvements over DQN on both synthetic and real datasets. We also study the generalization ability of the algorithm to multiple tasks by using it to train agents playing Atari video games.
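
A sketch of how a relaxation parameter could enter the DQN target, assuming the same blending idea as in the tabular SOR Q-learning case; w = 1 recovers the standard DQN target. This is an illustration, not necessarily the exact modified target used in SOR-DQN.

```python
import torch

# Relaxation-modified DQN target (sketch, assumption-based); w = 1.0 gives the usual target.
def sor_dqn_targets(q_target_net, rewards, next_states, states, dones, gamma=0.99, w=1.2):
    with torch.no_grad():
        standard = rewards + gamma * (1 - dones) * q_target_net(next_states).max(dim=1).values
        relaxed = w * standard + (1.0 - w) * q_target_net(states).max(dim=1).values
    return relaxed
```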

Posted Content
TL;DR: A neural network based filter is developed that takes in target velocity, radius and transforms them into new commands that enable smooth transitions to the new trajectory, thereby ensuring stable manoeuvres regardless of the user’s experience.
Abstract: With research into the development of quadruped robots picking up pace, learning-based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two-pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii. These policies are then augmented using a higher-level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations than standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.

Journal ArticleDOI
03 Apr 2020
TL;DR: This work extends the hierarchical option-critic policy gradient theorem for the average reward criterion and proves that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one.
Abstract: Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.
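
For reference, the standard average-reward objective and differential action-value that replace discounting in this setting, written in generic form rather than copied from the paper:

\[
\rho(\theta) \;=\; \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t) \,\middle|\, \pi_\theta \right],
\qquad
\tilde{Q}^{\pi_\theta}(s,a) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \big(r(s_t,a_t) - \rho(\theta)\big) \,\middle|\, s_0 = s,\ a_0 = a,\ \pi_\theta \right].
\]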

Posted Content
TL;DR: This paper proposes a method to solve reinforcement learning tasks in sparse-reward environments with better sample efficiency, faster convergence, and an increased success rate, by incorporating Twin Delayed Deep Deterministic Policy Gradients into Hindsight Experience Replay.
Abstract: Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks in sparse-reward environments. However, due to its reduced sample efficiency and slower convergence, HER can fail to perform effectively. Natural gradients address these challenges by converging the model parameters better and avoiding bad actions that collapse the training performance; however, updating parameters in neural networks this way requires expensive computation and thus increases training time. Our proposed method addresses the above challenges with better sample efficiency and faster convergence, along with an increased success rate. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. We solve this issue by including Twin Delayed Deep Deterministic Policy Gradients (TD3) in HER. TD3 learns two Q-functions instead of one and adds noise to the target action, to make it harder for the policy to exploit Q-function errors. The experiments are done with the help of OpenAI's MuJoCo environments. Results on these environments show that our algorithm (TDHER+KFAC) performs better in most of the scenarios.
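
The two TD3 ingredients named above, shown in target-computation form: the minimum over twin target critics and clipped noise added to the target action. HER goal relabelling is assumed to have happened when the batch was sampled from the replay buffer; the network handles here are generic placeholders.

```python
import torch

# TD3 target computation (sketch): target policy smoothing + twin critics.
def td3_targets(q1_targ, q2_targ, actor_targ, rewards, next_states, dones,
                gamma=0.98, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        noise = (noise_std * torch.randn_like(actor_targ(next_states))).clamp(-noise_clip, noise_clip)
        a_next = (actor_targ(next_states) + noise).clamp(-1.0, 1.0)      # target policy smoothing
        q_next = torch.min(q1_targ(next_states, a_next),
                           q2_targ(next_states, a_next))                  # twin critics curb overestimation
        return rewards + gamma * (1 - dones) * q_next
```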