Journal ArticleDOI

Reachability-Based Trajectory Safeguard (RTS): A Safe and Fast Reinforcement Learning Safety Layer for Continuous Control

04 Mar 2021, Vol. 6, Iss. 2, pp. 3663-3670
TL;DR: In this paper, a Reachability-based Trajectory Safeguard (RTS) algorithm is proposed to ensure safety during training and operation of a robot in a safety-critical environment.
Abstract: Reinforcement Learning (RL) algorithms have achieved remarkable performance in decision making and control tasks by reasoning about long-term, cumulative reward using trial and error. However, during RL training, applying this trial-and-error approach to real-world robots operating in safety-critical environments may lead to collisions. To address this challenge, this letter proposes a Reachability-based Trajectory Safeguard (RTS), which leverages reachability analysis to ensure safety during training and operation. Given a known (but uncertain) model of a robot, RTS precomputes a Forward Reachable Set (FRS) of the robot tracking a continuum of parameterized trajectories. At runtime, the RL agent selects from this continuum in a receding-horizon way to control the robot; the FRS is used to determine whether the agent's choice is safe, and to adjust unsafe choices. The efficacy of this method is illustrated in static environments on three nonlinear robot models, including a 12-D quadrotor drone, in simulation and in comparison with state-of-the-art safe motion planning methods.
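For intuition, here is a minimal Python sketch of the safety-layer logic the abstract describes: accept the agent's trajectory parameter if its precomputed reachable set is collision-free, otherwise adjust to the nearest safe parameter. The function names, the 1-D parameter space, and the discrete candidate list are illustrative assumptions, not the paper's implementation (RTS reasons over a continuum of parameters).

import numpy as np

def rts_safety_layer(k_rl, is_safe, candidate_params):
    # Accept the RL agent's parameter if its FRS is collision-free.
    if is_safe(k_rl):
        return k_rl
    # Otherwise adjust: choose the safe candidate closest to the proposal.
    safe = [k for k in candidate_params if is_safe(k)]
    if not safe:
        return None  # no safe trajectory this horizon: fall back to a fail-safe maneuver
    return min(safe, key=lambda k: np.linalg.norm(k - k_rl))

# Hypothetical demo: 1-D parameter, choices above 0.5 intersect an obstacle.
is_safe = lambda k: k[0] <= 0.5
candidates = [np.array([v]) for v in np.linspace(-1.0, 1.0, 21)]
print(rts_safety_layer(np.array([0.9]), is_safe, candidates))  # -> [0.5]
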
Citations
Posted Content
11 Dec 2020
TL;DR: The proposed hierarchical multi-rate control architecture maximizes the probability of satisfying the high-level specifications while guaranteeing state and input constraint satisfaction and is tested in simulations and experiments on examples inspired by the Mars exploration mission.
Abstract: In this paper we present a hierarchical multi-rate control architecture for nonlinear autonomous systems operating in partially observable environments. Control objectives are expressed using syntactically co-safe Linear Temporal Logic (LTL) specifications and the nonlinear system is subject to state and input constraints. At the highest level of abstraction, we model the system-environment interaction using a discrete Mixed Observable Markov Decision Problem (MOMDP), where the environment states are partially observed. The high-level control policy is used to update the constraint sets and cost function of a Model Predictive Controller (MPC), which plans a reference trajectory. Afterwards, the MPC-planned trajectory is fed to a low-level, high-frequency tracking controller, which leverages Control Barrier Functions (CBFs) to guarantee bounded tracking errors. Our strategy is based on model abstractions of increasing complexity and layers running at different frequencies. We show that the proposed hierarchical multi-rate control architecture maximizes the probability of satisfying the high-level specifications while guaranteeing state and input constraint satisfaction. Finally, we test the proposed strategy in simulations and experiments on examples inspired by the Mars exploration mission, where only partial environment observations are available.
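The multi-rate structure is easiest to see as nested loops running at different frequencies. The following Python sketch is a stand-in under assumed rates and trivial dynamics; high_level_policy, mpc_reference, and cbf_tracker are hypothetical stubs for the MOMDP policy, the MPC planner, and the CBF-filtered tracking controller.

import numpy as np

def high_level_policy(belief):
    # MOMDP policy stub: map the environment belief to a discrete mode
    return 0 if belief < 0.5 else 1

def mpc_reference(x, mode):
    # Mid-level planner stub: step the reference toward a mode-dependent goal
    goal = np.zeros(2) if mode == 0 else np.ones(2)
    return x + 0.1 * (goal - x)

def cbf_tracker(x, ref):
    # Low-level tracker stub, standing in for the CBF-based controller
    # that guarantees bounded tracking error
    return 5.0 * (ref - x)

x, belief, mode, ref = np.zeros(2), 0.8, 0, np.zeros(2)
dt = 0.001  # assume the tracking layer runs at 1 kHz
for k in range(3000):
    if k % 1000 == 0:            # ~1 Hz: update the discrete mode
        mode = high_level_policy(belief)
    if k % 100 == 0:             # ~10 Hz: replan the MPC reference
        ref = mpc_reference(x, mode)
    u = cbf_tracker(x, ref)      # 1 kHz: tracking control
    x = x + dt * u               # single-integrator stand-in dynamics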

17 citations

Journal ArticleDOI
TL;DR: In this article, the authors present a holistic perspective on the state of the art in the design of guidance, navigation, and control systems for autonomous multi-rotor small unmanned aerial systems (sUAS).

7 citations

Posted Content
TL;DR: The effectiveness of the proposed data-driven hierarchical control framework is demonstrated in a two-car collision avoidance scenario through simulations and experiments on a 1/10-scale autonomous car platform, where the strategy-guided approach outperforms a model predictive control baseline in both cases.
Abstract: We present a hierarchical control approach for maneuvering an autonomous vehicle (AV) in tightly-constrained environments where other moving AVs and/or human driven vehicles are present. A two-level hierarchy is proposed: a high-level data-driven strategy predictor and a lower-level model-based feedback controller. The strategy predictor maps an encoding of a dynamic environment to a set of high-level strategies via a neural network. Depending on the selected strategy, a set of time-varying hyperplanes in the AV's position space is generated online and the corresponding halfspace constraints are included in a lower-level model-based receding horizon controller. These strategy-dependent constraints drive the vehicle towards areas where it is likely to remain feasible. Moreover, the predicted strategy also informs switching between a discrete set of policies, which allows for more conservative behavior when prediction confidence is low. We demonstrate the effectiveness of the proposed data-driven hierarchical control framework in a two-car collision avoidance scenario through simulations and experiments on a 1/10 scale autonomous car platform where the strategy-guided approach outperforms a model predictive control baseline in both cases.
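To make the strategy-to-constraint mapping concrete, here is a small Python sketch: a predicted strategy label selects a halfspace a @ p <= b relative to an opposing vehicle, and that halfspace would then be included as a position constraint in the lower-level controller. The two strategy labels and the axis-aligned geometry are invented for illustration.

import numpy as np

def halfspace_from_strategy(strategy, opponent_pos):
    # "left" constrains the ego to stay on the +y side of the opponent,
    # "right" on the -y side; returns (a, b) for the constraint a @ p <= b.
    a = np.array([0.0, -1.0]) if strategy == "left" else np.array([0.0, 1.0])
    b = float(a @ opponent_pos)
    return a, b

def satisfies(p, a, b):
    return float(a @ p) <= b

a, b = halfspace_from_strategy("left", opponent_pos=np.array([2.0, 0.0]))
print(satisfies(np.array([2.0, 0.5]), a, b))   # True: ego passes on the left
print(satisfies(np.array([2.0, -0.5]), a, b))  # False: would cut to the right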

7 citations

Journal ArticleDOI
TL;DR: In this article, a path-planning algorithm for connected and non-connected automated road vehicles on multilane motorways is derived from the opportune formulation of an optimal control problem, in which the objective function to be minimized contains appropriate terms to reflect the goals of vehicle advancement, passenger comfort, and avoidance of collisions with other vehicles and of road departures.
Abstract: A path-planning algorithm for connected and non-connected automated road vehicles on multilane motorways is derived from the opportune formulation of an optimal control problem. In this framework, the objective function to be minimized contains appropriate respective terms to reflect: the goals of vehicle advancement; passenger comfort; and avoidance of collisions with other vehicles and of road departures. Connectivity implies, within the present work, that connected vehicles can exchange with each other (V2V) real-time information about their last generated short-term path. For the numerical solution of the optimal control problem, an efficient feasible direction algorithm (FDA) is used. To ensure high-quality local minima, a simplified Dynamic Programming (DP) algorithm is also conceived to deliver the initial guess trajectory for the start of the FDA iterations. Thanks to very low computation times, the approach is readily executable within a model predictive control (MPC) framework. The proposed MPC-based approach is embedded within the Aimsun microsimulation platform, which enables the evaluation of a plethora of realistic vehicle driving and advancement scenarios under different vehicle mixes. Results obtained on a multilane motorway stretch indicate higher efficiency of the optimally controlled vehicles in driving closer to their desired speed, compared to ordinary manually driven vehicles. Increased penetration rates of automated vehicles are found to increase the efficiency of the overall traffic flow, benefiting manual vehicles as well. Moreover, connected controlled vehicles appear to be more efficient in achieving their desired speed, compared also to the corresponding non-connected controlled vehicles, due to the improved real-time information and short-term prediction achieved via V2V communication.
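As a rough illustration of the four term families named above, the following Python cost function penalizes deviation from a desired speed (advancement), acceleration (comfort), proximity to other vehicles (collision), and leaving the lane bounds (road departure). The functional forms and weights are assumptions for the sketch, not the paper's objective.

import numpy as np

def path_cost(pos, vel, acc, v_des, obstacles, lane_y, w=(1.0, 0.1, 10.0, 10.0)):
    # pos, vel, acc: arrays of shape (T, 2); lane_y = (y_min, y_max)
    advancement = np.sum((np.linalg.norm(vel, axis=1) - v_des) ** 2)
    comfort = np.sum(acc ** 2)
    collision = sum(np.sum(np.exp(-np.linalg.norm(pos - o, axis=1) ** 2))
                    for o in obstacles)
    departure = np.sum(np.clip(pos[:, 1] - lane_y[1], 0.0, None) ** 2
                       + np.clip(lane_y[0] - pos[:, 1], 0.0, None) ** 2)
    return (w[0] * advancement + w[1] * comfort
            + w[2] * collision + w[3] * departure)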

6 citations

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, a shielding mechanism is proposed to ensure ISO-verified human safety while training and deploying RL algorithms on manipulators; the shield also improves RL performance by preventing episode-ending collisions.
Abstract: Deep reinforcement learning (RL) has shown promising results in the motion planning of manipulators. However, no existing method guarantees safety around highly dynamic obstacles, such as humans, in RL-based manipulator control. This lack of formal safety assurances prevents the application of RL for manipulators in real-world human environments. Therefore, we propose a shielding mechanism that ensures ISO-verified human safety while training and deploying RL algorithms on manipulators. We utilize a fast reachability analysis of humans and manipulators to guarantee that the manipulator comes to a complete stop before a human is within its range. Our proposed method guarantees safety and significantly improves RL performance by preventing episode-ending collisions. We demonstrate the performance of our proposed method in simulation using human motion capture data.
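The stopping-based safety argument can be sketched with ball over-approximations: the action is allowed only if the human's reachable set over the robot's stopping time cannot meet the robot's stopping envelope. This Python stand-in uses crude spheres rather than the paper's ISO-verified reachable sets; every parameter name is an assumption.

import numpy as np

def action_is_safe(robot_pos, human_pos, t_stop, v_human_max, r_stop_envelope):
    # Radius the human could cover while the robot brakes to a full stop.
    human_reach = v_human_max * t_stop
    gap = np.linalg.norm(np.asarray(robot_pos) - np.asarray(human_pos))
    # Safe only if the two over-approximated sets cannot intersect.
    return gap > human_reach + r_stop_envelope

# If the check fails, the shield overrides the RL action with a stop command.
if not action_is_safe([0.0, 0.0, 1.0], [1.0, 0.0, 1.0], 0.3, 2.0, 0.5):
    print("shield: execute fail-safe stop")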

5 citations

References
Posted Content
TL;DR: This work uses new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combines some of them to achieve state-of-the-art results: 43.5% AP on the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100.
Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch normalization and residual connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100. Source code is at this https URL

5,709 citations

Proceedings Article
06 Jul 2015
TL;DR: A method for optimizing control policies with guaranteed monotonic improvement; making several approximations to the theoretically justified scheme yields a practical algorithm called Trust Region Policy Optimization (TRPO).
Abstract: In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
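The core update is a KL-constrained maximization of a surrogate objective. A compact numpy sketch under assumed interfaces (grad_surrogate returns the surrogate gradient, fisher_vec_prod computes Fisher-vector products Fv): the step direction is F^{-1}g from conjugate gradients, scaled so the quadratic KL estimate equals delta; the full algorithm also backtracks via line search, which is omitted here.

import numpy as np

def conjugate_gradient(Avp, b, iters=10):
    # Approximately solve A x = b given only matrix-vector products Avp.
    x = np.zeros_like(b)
    r, p = b.copy(), b.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rr / (p @ Ap + 1e-8)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / (rr + 1e-8)) * p
        rr = rr_new
    return x

def trpo_step(theta, grad_surrogate, fisher_vec_prod, delta=0.01):
    g = grad_surrogate(theta)
    x = conjugate_gradient(fisher_vec_prod, g)           # x ~ F^{-1} g
    scale = np.sqrt(2.0 * delta / (x @ fisher_vec_prod(x) + 1e-8))
    return theta + scale * x  # TRPO then backtracks until the true KL <= delta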

3,479 citations

Book
01 Jan 2003
TL;DR: This article provides a comprehensive introduction to the field of robotic mapping, with a focus on indoor mapping, and describes and compares various probabilistic techniques as they are presently being applied to a vast array of mobile robot mapping problems.
Abstract: This article provides a comprehensive introduction to the field of robotic mapping, with a focus on indoor mapping. It describes and compares various probabilistic techniques, as they are presently being applied to a vast array of mobile robot mapping problems. The history of robotic mapping is also detailed, along with an extensive list of open research problems.

1,584 citations

Proceedings Article
03 Jul 2018
TL;DR: In this paper, the authors show that the overestimation bias persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic.
Abstract: In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
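The two mechanisms named above have a compact form: bootstrap from the minimum of two target critics (to limit overestimation) after smoothing the target action with clipped noise, and update the actor less frequently than the critics. A numpy sketch of the target computation, with hypothetical callables standing in for the target networks:

import numpy as np

def td3_target(r, s_next, done, q1_target, q2_target, actor_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # Target policy smoothing: perturb the target action with clipped noise.
    a_next = actor_target(s_next)
    noise = np.clip(np.random.normal(0.0, noise_std, np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = a_next + noise
    # Clipped double Q-learning: bootstrap from the minimum of two critics.
    q_min = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min

# Delayed policy updates: e.g., update the actor once per two critic updates.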

954 citations

Proceedings Article
06 Aug 2017
TL;DR: Constrained Policy Optimization (CPO), as discussed by the authors, is the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees of near-constraint satisfaction at each iteration.
Abstract: For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2016; Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.
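CPO's per-iteration subproblem linearizes the return and the constraint around the current policy inside a KL trust region. The following numpy sketch replaces CPO's analytic dual solution (and the Fisher metric) with a Euclidean trust-region step plus a projection onto the linearized constraint, so it only conveys the shape of the update; g, b, and c stand for the objective gradient, constraint gradient, and current constraint violation.

import numpy as np

def cpo_like_step(theta, g, b, c, delta=0.01):
    # Unconstrained trust-region step along the objective gradient.
    step = np.sqrt(2.0 * delta) * g / (np.linalg.norm(g) + 1e-8)
    # If the linearized constraint c + b @ step <= 0 is violated,
    # project the step onto the constraint boundary.
    viol = c + b @ step
    if viol > 0.0:
        step = step - (viol / (b @ b + 1e-8)) * b
    return theta + step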

768 citations