
Showing papers by "Shalabh Bhatnagar published in 2020"


Journal ArticleDOI
TL;DR: It is illustrated that the change point method detects change in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward.
Abstract: Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios, RL methods yield sub-optimal decisions. In this paper, we therefore consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward accrued when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect changes in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method detects changes in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
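
A minimal sketch of the overall pattern described above, not the paper's specific change-point algorithm: a CUSUM-style statistic on the observed rewards stands in for the change detector, and a flagged change simply resets the Q-learner's step size and exploration so it can re-adapt to the new environment model. The gym-like `env` interface, thresholds, and schedules are illustrative assumptions.

```python
import numpy as np

# Sketch only: a CUSUM-style test on the reward stream stands in for the paper's
# change detector; a detected change resets the Q-learner's step size and exploration.
class CusumDetector:
    def __init__(self, drift=0.05, threshold=5.0):
        self.drift, self.threshold = drift, threshold
        self.mean, self.n, self.stat = 0.0, 0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n                # running mean of the monitored statistic
        self.stat = max(0.0, self.stat + abs(x - self.mean) - self.drift)
        if self.stat > self.threshold:                        # change point flagged
            self.mean, self.n, self.stat = 0.0, 0, 0.0
            return True
        return False

def q_learning_with_change_detection(env, n_states, n_actions, steps=10_000, gamma=0.9):
    # `env` is a hypothetical gym-like MDP with reset()/step(a)
    Q = np.zeros((n_states, n_actions))
    detector, alpha, eps = CusumDetector(), 0.5, 1.0
    s = env.reset()
    for _ in range(steps):
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, done, _ = env.step(a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # standard Q-learning update
        if detector.update(r):
            alpha, eps = 0.5, 1.0                              # re-learn after a detected change
        else:
            alpha, eps = max(0.05, alpha * 0.999), max(0.05, eps * 0.999)
        s = env.reset() if done else s_next
    return Q
```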

23 citations


Journal ArticleDOI
TL;DR: It is shown that the gradient and/or Hessian estimates in the resulting algorithms with DPs are asymptotically unbiased, so that the algorithms are provably convergent; convergence rates are also derived to establish the superiority of the first-order and second-order algorithms.
Abstract: We introduce deterministic perturbation (DP) schemes for the recently proposed random directions stochastic approximation and propose new first-order and second-order algorithms. In the latter case, these are the first second-order algorithms to incorporate DPs. We show that the gradient and/or Hessian estimates in the resulting algorithms with DPs are asymptotically unbiased, so that the algorithms are provably convergent. Furthermore, we derive convergence rates for the first-order and second-order algorithms, for the special cases of convex and quadratic optimization problems respectively, to establish their superiority. Numerical experiments are used to validate the theoretical results.
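
For concreteness, here is a sketch of a two-sided directional gradient estimate with deterministic perturbations. The paper's specific DP construction is not reproduced; as a stand-in, the columns of a Hadamard matrix (entries ±1) are cycled through and the estimates over one full cycle are averaged, which makes the estimate exact for quadratics and unbiased in the limit under the usual smoothness assumptions.

```python
import numpy as np

# Deterministic-perturbation directional gradient estimate (sketch).
# Cycling through the +/-1 columns of a Hadamard matrix is used here as a
# stand-in for the paper's DP sequences; n must be a power of 2 in this sketch.

def hadamard(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def dp_gradient_estimate(f, x, delta=1e-3):
    n = x.size
    H = hadamard(n)
    g = np.zeros(n)
    for k in range(n):
        d = H[:, k]                      # deterministic +/-1 perturbation direction
        g += d * (f(x + delta * d) - f(x - delta * d)) / (2.0 * delta)
    return g / n                          # cycle-average of d d^T is the identity

# Example: gradient of a quadratic, compared with the exact gradient 2 A x.
A = np.diag([1.0, 2.0, 3.0, 4.0])
f = lambda x: float(x @ A @ x)
x0 = np.array([1.0, -1.0, 0.5, 2.0])
print(dp_gradient_estimate(f, x0), 2 * A @ x0)
```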

11 citations


Journal ArticleDOI
TL;DR: In this paper, the asymptotic behavior of a stochastic approximation scheme on two timescales with set-valued drift functions and in the presence of nonadditive iterate-dependent Markov noise is studied.
Abstract: In this paper, we study the asymptotic behavior of a stochastic approximation scheme on two timescales with set-valued drift functions and in the presence of nonadditive iterate-dependent Markov noise.

11 citations


Posted Content
TL;DR: A linear policy is used for realizing end-foot trajectories in the quadruped robot Stoch 2; it is not only computationally lightweight but also uses minimal sensing and actuation capabilities, thereby justifying the approach.
Abstract: In this paper, with a view toward fast deployment of locomotion gaits in low-cost hardware, we use a linear policy for realizing end-foot trajectories in the quadruped robot Stoch 2. In particular, the parameters of the end-foot trajectories are shaped via a linear feedback policy that takes the torso orientation and the terrain slope as inputs. The corresponding desired joint angles are obtained via an inverse kinematics solver and tracked via a PID control law. Augmented Random Search, a model-free and gradient-free learning algorithm, is used to train this linear policy. Simulation results show that the resulting walking is robust to terrain slope variations and external pushes. This methodology is not only computationally lightweight but also uses minimal sensing and actuation capabilities in the robot, thereby justifying the approach.
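
A sketch of the idea under stated assumptions: the trajectory-shaping action is a linear function of the observation (assumed here to be torso roll, pitch, and terrain slope), and the weight matrix is trained with a basic Augmented Random Search update. The dimensions and the `simulate_step` stub are illustrative placeholders, not the paper's actual interface.

```python
import numpy as np

OBS_DIM, ACT_DIM = 3, 8            # assumed: [roll, pitch, slope] -> end-foot trajectory parameters

def simulate_step(action):
    """Dummy placeholder for one step of the physics simulator."""
    obs = 0.1 * np.random.randn(OBS_DIM)
    reward = 1.0 - 0.01 * float(np.linalg.norm(action))
    return obs, reward

def rollout(W, episode_len=400):
    """Total reward of one episode under the linear feedback policy a = W @ obs."""
    total, obs = 0.0, np.zeros(OBS_DIM)
    for _ in range(episode_len):
        obs, reward = simulate_step(W @ obs)
        total += reward
    return total

def ars_train(iterations=100, n_dirs=8, step=0.02, noise=0.03):
    W = np.zeros((ACT_DIM, OBS_DIM))
    for _ in range(iterations):
        deltas = [np.random.randn(ACT_DIM, OBS_DIM) for _ in range(n_dirs)]
        r_plus = [rollout(W + noise * d) for d in deltas]
        r_minus = [rollout(W - noise * d) for d in deltas]
        # gradient-free update: move along directions weighted by reward differences
        W += (step / n_dirs) * sum((rp - rm) * d
                                   for rp, rm, d in zip(r_plus, r_minus, deltas))
    return W

W_trained = ars_train(iterations=5)    # small run for illustration
```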

9 citations


Journal ArticleDOI
01 Jan 2020
TL;DR: Building on a successive over-relaxation (SOR)-based value iteration scheme from the literature, which speeds up computation of the optimal value function via a modified Bellman equation, an SOR Q-learning algorithm is proposed and proved to converge almost surely.
Abstract: In a discounted reward Markov decision process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation (SOR)-based value iteration scheme has been proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to reinforcement learning (RL) algorithms to obtain the optimal policy and value function. One such popular algorithm is Q-learning. In this letter, we propose SOR Q-learning. We first derive a modified fixed point iteration for SOR Q-values and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy. We then prove the almost sure convergence of SOR Q-learning to the SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning is faster than the standard Q-learning algorithm.
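
A minimal tabular sketch of how a relaxation parameter can enter the Q-learning update, assuming the relaxed target blends the standard bootstrapped target with the current state's own maximum Q-value through a parameter w (with w = 1 recovering ordinary Q-learning). This illustrates the successive over-relaxation idea and is not claimed to be the paper's exact update.

```python
import numpy as np

# Relaxation-style Q-learning update (sketch, assumption-based): w = 1 gives standard Q-learning.
def sor_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, w=1.2):
    standard_target = r + gamma * Q[s_next].max()
    relaxed_target = w * standard_target + (1.0 - w) * Q[s].max()
    Q[s, a] += alpha * (relaxed_target - Q[s, a])
    return Q

# usage on a toy table
Q = np.zeros((5, 2))
Q = sor_q_update(Q, s=0, a=1, r=1.0, s_next=3)
```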

7 citations


Proceedings ArticleDOI
01 Feb 2020
TL;DR: In this paper, a review of different PCC structures with modified sensing properties is presented; the reported work demonstrates that miniaturization of optical biosensors is possible with different PCC structures.
Abstract: This review paper provides a brief overview of photonic crystal cavity (PCC) structures and their applications in biosensing devices. The paper reviews different PCC structures with modified sensing properties. For several optical sensing applications, different PCC structures and their properties are described in detail. The reported work and results demonstrate that miniaturization of optical biosensors is possible with different PCC structures, which offer structural flexibility for different sensing applications.

6 citations


Journal ArticleDOI
30 Jan 2020
TL;DR: A generalization of the Speedy Q-learning algorithm, called GSQL-w, is derived using a successive-relaxation-based generalized Bellman operator to handle the slow convergence of Watkins' Q-learning.
Abstract: In this letter, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was proposed in the Reinforcement Learning (RL) literature to handle the slow convergence of Watkins' Q-learning. In most RL algorithms such as Q-learning, the Bellman equation and the Bellman operator play an important role. It is possible to generalize the Bellman operator using the technique of successive relaxation. We use the generalized Bellman operator to derive a simple and efficient family of algorithms called Generalized Speedy Q-learning (GSQL-w) and analyze its finite time performance. We show that GSQL-w has an improved finite time performance bound compared to SQL when the relaxation parameter w is greater than 1. This improvement is a consequence of the contraction factor of the generalized Bellman operator being less than that of the standard Bellman operator. Numerical experiments are provided to demonstrate the empirical performance of the GSQL-w algorithm.
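
For concreteness, one common successive-relaxation form of the Bellman operator is written below as an assumption, not taken verbatim from the paper; at w = 1 it reduces to the standard operator T, and the abstract's claim is that for suitable w > 1 the generalized operator has a smaller contraction factor.

\[
(T_w Q)(s,a) \;=\; w\Big( r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{b} Q(s',b) \Big) \;+\; (1-w)\,\max_{b} Q(s,b), \qquad T_1 = T.
\]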

5 citations


Journal ArticleDOI
TL;DR: In this paper, it is shown that after a large number of iterations, if the stochastic approximation process enters the domain of attraction of an attracting set, it gets locked into the attracting set with high probability.
Abstract: In this paper, we analyze the behavior of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee. We prove that after a large number of iterations, if the stochastic approximation process enters the domain of attraction of an attracting set, it gets locked into the attracting set with high probability. We demonstrate that this result is an effective instrument for analyzing stochastic approximation schemes in the absence of a stability guarantee: we use it to obtain an alternate criterion for convergence in the presence of a locally attracting set for the mean field, and to show that a feedback mechanism, which resets the iterates at regular time intervals, stabilizes the scheme when the mean field possesses a globally attracting set, thereby guaranteeing convergence. The results in this paper build on the works of Borkar, Andrieu et al., and Chen et al. by allowing for the presence of set-valued drift functions.

5 citations


Proceedings ArticleDOI
01 Aug 2020
TL;DR: A two-pronged approach is proposed to generate leg trajectories for continuously varying target linear and angular velocities in a stable manner, using a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory.
Abstract: With research into the development of quadruped robots picking up pace, learning-based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two-pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii. These policies are then augmented using a higher-level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations than standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.
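
A minimal behavior-cloning sketch of such a command filter. The input/output layout (target velocity, turning radius, and current command mapped to a smoothed command) and the network size are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

# Behavior-cloning sketch of the command filter: regress a small MLP onto expert
# (input, command) pairs. Dimensions and architecture are assumptions.
class CommandFilter(nn.Module):
    def __init__(self, in_dim=4, out_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def train_filter(expert_inputs, expert_commands, epochs=200, lr=1e-3):
    """expert_inputs: (N, 4) raw commands/state; expert_commands: (N, 2) expert outputs."""
    model = CommandFilter()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(expert_inputs), expert_commands)   # regress onto expert demonstrations
        loss.backward()
        opt.step()
    return model

# usage with synthetic stand-in data
x = torch.randn(256, 4)
y = torch.tanh(x[:, :2])            # placeholder "expert" mapping
filter_net = train_filter(x, y)
```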

4 citations



Proceedings ArticleDOI
01 Aug 2020
TL;DR: This work provides model-free reinforcement learning (RL)-based policies for slot-sharing between UE and IoT data and compares the performance of the RL-based policies with low-complexity heuristic-based slot-sharing schemes which either prioritise the UE data, account only for near-threshold aged UE data, or are oblivious to the amount of UE data.
Abstract: We consider an industrial internet-of-things (IIoT) system with multiple IoT devices, a user equipment (UE), and a base station (BS) that receives the UE and IoT data. To circumvent the issue of numerous IoT-to-BS connections and to conserve the IoT devices' energies, the UE serves as a relay to forward the IoT data to the BS. The UE employs frame-based uplink transmissions, wherein it shares a few slots of every frame to relay the IoT data. The IIoT system experiences a transmission failure, called an outage, when IoT data is not transmitted. The unsent UE data is stored in the UE's buffer and is discarded after the storage time exceeds the age threshold. As the UE and IoT devices share the transmission slots, trade-offs exist between system outages and aged UE data loss. To resolve this outage-versus-data-ageing challenge, we provide model-free reinforcement learning (RL)-based policies for slot-sharing between UE and IoT data. We compare the performance of the RL-based policies with low-complexity heuristic-based slot-sharing schemes which either prioritise the UE data, account only for near-threshold aged UE data, or are oblivious to the amount of UE data.

Posted Content
02 Sep 2020
TL;DR: A framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP) is proposed and it is observed that in each case the algorithm converges and finds the optimal policy.
Abstract: In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP). This result facilitates the application of off-the-shelf algorithms from Reinforcement Learning (RL) literature towards designing optimal control policies. Second, we model a set of benchmark examples of hybrid control design problem in the proposed MDP framework. Third, we adapt the recently proposed Proximal Policy Optimisation (PPO) algorithm for the hybrid action space and apply it to the above set of problems. It is observed that in each case the algorithm converges and finds the optimal policy.
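
A sketch of one way a policy network can be parameterised over a hybrid action space (a discrete mode plus continuous inputs), the kind of policy head needed when adapting PPO to such problems: a categorical head and a Gaussian head share a torso, and the factorised joint log-probability feeds the usual PPO ratio. Layer sizes and the factorised form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

# Hybrid-action policy head (sketch): discrete switching decision + continuous control input.
class HybridPolicy(nn.Module):
    def __init__(self, obs_dim=8, n_modes=3, cont_dim=2, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mode_logits = nn.Linear(hidden, n_modes)        # discrete mode
        self.cont_mean = nn.Linear(hidden, cont_dim)         # continuous input
        self.cont_logstd = nn.Parameter(torch.zeros(cont_dim))

    def forward(self, obs):
        h = self.torso(obs)
        mode_dist = Categorical(logits=self.mode_logits(h))
        cont_dist = Normal(self.cont_mean(h), self.cont_logstd.exp())
        return mode_dist, cont_dist

    def sample(self, obs):
        mode_dist, cont_dist = self(obs)
        mode = mode_dist.sample()
        u = cont_dist.sample()
        # factorised joint log-prob, usable in the PPO probability ratio
        logp = mode_dist.log_prob(mode) + cont_dist.log_prob(u).sum(-1)
        return mode, u, logp

policy = HybridPolicy()
mode, u, logp = policy.sample(torch.randn(1, 8))
```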

Proceedings ArticleDOI
26 Oct 2020
TL;DR: A novel approach that makes use of the independent learners Deep Q-learning algorithm is proposed to solve the problem of energy management in microgrid networks in the framework of stochastic games.
Abstract: We consider the problem of energy management in microgrid networks. A microgrid is capable of generating power from a renewable resource and is responsible for handling the demands of its dedicated customers. Owing to the variable nature of renewable generation and of the customers' demands, it becomes imperative that each microgrid optimally manages its energy. This involves intelligently scheduling the demands at the customer side, and selling power to (when there is a surplus) or buying power from (when there is a deficit) neighboring microgrids, depending on current and future needs. In this work, we formulate the problems of demand and battery scheduling, energy trading, and dynamic pricing (where we allow the microgrids to decide the price of a transaction depending on their current configuration of demand and renewable energy) in the framework of stochastic games. Subsequently, we propose a novel approach that makes use of the independent learners Deep Q-learning algorithm to solve this problem.
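
A minimal sketch of the independent-learners idea: each microgrid agent keeps its own Q-network and learns from its own transitions as if it were alone, treating the other agents as part of the environment. The observation and action sizes and the network are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Independent-learners DQN sketch: one Q-network per agent, per-agent updates only.
def make_qnet(obs_dim=6, n_actions=5, hidden=64):
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

class IndependentAgent:
    def __init__(self, obs_dim=6, n_actions=5, gamma=0.99, lr=1e-3):
        self.q = make_qnet(obs_dim, n_actions)
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.gamma = gamma

    def act(self, obs, eps=0.1):
        if torch.rand(1).item() < eps:
            return torch.randint(self.q[-1].out_features, (1,)).item()
        return int(self.q(obs).argmax().item())

    def learn(self, obs, action, reward, next_obs, done):
        # one-step DQN update on this agent's own transition only
        with torch.no_grad():
            target = reward + self.gamma * (1 - done) * self.q(next_obs).max()
        loss = (self.q(obs)[action] - target).pow(2)
        self.opt.zero_grad(); loss.backward(); self.opt.step()

agents = [IndependentAgent() for _ in range(3)]   # e.g. three microgrids
```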

Proceedings ArticleDOI
19 Jul 2020
TL;DR: A new deep reinforcement learning algorithm using the technique of successive over-relaxation (SOR) in Deep Q-networks (DQNs) achieves significant improvements over DQN on both synthetic and real datasets.
Abstract: We present a new deep reinforcement learning algorithm using the technique of successive over-relaxation (SOR) in Deep Q-networks (DQNs). The new algorithm, named SOR-DQN, uses modified targets in the DQN framework with the aim of accelerating training. This work is motivated by the problem of auto-scaling resources for cloud applications, for which existing algorithms suffer from issues such as slow convergence, poor performance during the training phase and non-scalability. For the above problem, SOR-DQN achieves significant improvements over DQN on both synthetic and real datasets. We also study the generalization ability of the algorithm to multiple tasks by using it to train agents playing Atari video games.
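
A sketch of how a relaxation parameter could enter the DQN target, assuming the same blending idea as in the tabular SOR Q-learning case; w = 1 recovers the standard DQN target. This is an illustration, not necessarily the exact modified target used in SOR-DQN.

```python
import torch

# Relaxation-modified DQN target (sketch, assumption-based); w = 1.0 gives the usual target.
def sor_dqn_targets(q_target_net, rewards, next_states, states, dones, gamma=0.99, w=1.2):
    with torch.no_grad():
        standard = rewards + gamma * (1 - dones) * q_target_net(next_states).max(dim=1).values
        relaxed = w * standard + (1.0 - w) * q_target_net(states).max(dim=1).values
    return relaxed
```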

Posted Content
TL;DR: A neural network based filter is developed that takes in target velocity, radius and transforms them into new commands that enable smooth transitions to the new trajectory, thereby ensuring stable manoeuvres regardless of the user’s experience.
Abstract: With research into the development of quadruped robots picking up pace, learning-based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two-pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii. These policies are then augmented using a higher-level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations than standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.

Journal ArticleDOI
03 Apr 2020
TL;DR: This work extends the hierarchical option-critic policy gradient theorem for the average reward criterion and proves that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one.
Abstract: Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.
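
For reference, the standard average-reward objective and differential action-value that replace discounting in this setting, written in generic form rather than copied from the paper:

\[
\rho(\theta) \;=\; \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t) \,\middle|\, \pi_\theta \right],
\qquad
\tilde{Q}^{\pi_\theta}(s,a) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \big(r(s_t,a_t) - \rho(\theta)\big) \,\middle|\, s_0 = s,\ a_0 = a,\ \pi_\theta \right].
\]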

Posted Content
TL;DR: This paper proposes a method to solve reinforcement learning tasks in sparse-reward environments with better sample efficiency, faster convergence, and an increased success rate, by incorporating Twin Delayed Deep Deterministic Policy Gradients into Hindsight Experience Replay.
Abstract: Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks in sparse-reward environments. However, due to its reduced sample efficiency and slower convergence, HER can fail to perform effectively. Natural gradients address these challenges by converging the model parameters better and avoiding bad actions that collapse the training performance; however, updating parameters in neural networks this way requires expensive computation and thus increases training time. Our proposed method addresses the above challenges with better sample efficiency and faster convergence, along with an increased success rate. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. We solve this issue by including Twin Delayed Deep Deterministic Policy Gradients (TD3) in HER. TD3 learns two Q-functions instead of one and adds noise to the target action, to make it harder for the policy to exploit Q-function errors. The experiments are done with the help of OpenAI's MuJoCo environments. Results on these environments show that our algorithm (TDHER+KFAC) performs better in most of the scenarios.
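
The two TD3 ingredients named above, shown in target-computation form: the minimum over twin target critics and clipped noise added to the target action. HER goal relabelling is assumed to have happened when the batch was sampled from the replay buffer; the network handles here are generic placeholders.

```python
import torch

# TD3 target computation (sketch): target policy smoothing + twin critics.
def td3_targets(q1_targ, q2_targ, actor_targ, rewards, next_states, dones,
                gamma=0.98, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        noise = (noise_std * torch.randn_like(actor_targ(next_states))).clamp(-noise_clip, noise_clip)
        a_next = (actor_targ(next_states) + noise).clamp(-1.0, 1.0)      # target policy smoothing
        q_next = torch.min(q1_targ(next_states, a_next),
                           q2_targ(next_states, a_next))                  # twin critics curb overestimation
        return rewards + gamma * (1 - dones) * q_next
```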