Author

Wenjie Shi

Bio: Wenjie Shi is an academic researcher from Tsinghua University. The author has contributed to research on topics including reinforcement learning and stability (learning theory). The author has an h-index of 6 and has co-authored 10 publications receiving 78 citations.

Papers
Journal ArticleDOI
TL;DR: In this article, a hybrid actors-critics architecture is proposed to achieve high tracking control accuracy and stable learning for AUVs, where multiple actors and critics are trained to learn a deterministic policy and an action-value function, respectively.
Abstract: This paper investigates the trajectory tracking problem for a class of underactuated autonomous underwater vehicles (AUVs) with unknown dynamics and constrained inputs. Different from existing policy gradient methods, which employ a single actor-critic but cannot realize satisfactory tracking control accuracy and stable learning, our proposed algorithm achieves high tracking control accuracy for AUVs and stable learning by applying a hybrid actors-critics architecture, where multiple actors and critics are trained to learn a deterministic policy and an action-value function, respectively. Specifically, for the critics, an expected-absolute-Bellman-error-based updating rule is used to choose the worst critic to update at each time step. Then, to calculate the loss function with a more accurate target value for the chosen critic, Pseudo Q-learning, which replaces the greedy policy in Q-learning with a sub-greedy policy, is developed for continuous action spaces, and Multi Pseudo Q-learning (MPQ) is proposed to reduce the overestimation of the action-value function and to stabilize learning. As for the actors, the deterministic policy gradient is applied to update the weights, and the final learned policy is defined as the average of all actors to avoid large but harmful updates. Moreover, a qualitative stability analysis of the learning is given. The effectiveness and generality of the proposed MPQ-based deterministic policy gradient (MPQ-DPG) algorithm are verified on an AUV with two different reference trajectories. The results demonstrate the high tracking control accuracy and stable learning of MPQ-DPG, and also validate that increasing the number of actors and critics further improves performance.
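To make the critic-selection and policy-averaging steps above concrete, the following is a minimal NumPy sketch; the critics, actors, batch layout, and discount factor are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of worst-critic selection via expected absolute Bellman error
# and of the averaged-actor policy described above. All names are illustrative.
import numpy as np

def select_worst_critic(critics, batch, gamma, target_value):
    """Pick the critic with the largest expected absolute Bellman error on the batch."""
    errors = []
    for q in critics:
        # |r + gamma * target - Q(s, a)| averaged over the sampled batch
        bellman_err = np.abs(batch["r"] + gamma * target_value - q(batch["s"], batch["a"]))
        errors.append(bellman_err.mean())
    return int(np.argmax(errors))      # only this critic is updated at this step

def averaged_policy(actors, state):
    """Final learned policy: average of all actors' deterministic actions."""
    return np.mean([mu(state) for mu in actors], axis=0)
```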

48 citations

Posted Content
TL;DR: The proposed MPQ-based deterministic policy gradient algorithm can achieve high tracking control accuracy for AUVs and stable learning by applying a hybrid actors-critics architecture, where multiple actors and critics are trained to learn a deterministic policy and an action-value function, respectively.
Abstract: This paper investigates the trajectory tracking problem for a class of underactuated autonomous underwater vehicles (AUVs) with unknown dynamics and constrained inputs. Different from existing policy gradient methods, which employ a single actor-critic but cannot realize satisfactory tracking control accuracy and stable learning, our proposed algorithm achieves high tracking control accuracy for AUVs and stable learning by applying a hybrid actors-critics architecture, where multiple actors and critics are trained to learn a deterministic policy and an action-value function, respectively. Specifically, for the critics, an expected-absolute-Bellman-error-based updating rule is used to choose the worst critic to update at each time step. Then, to calculate the loss function with a more accurate target value for the chosen critic, Pseudo Q-learning, which replaces the greedy policy in Q-learning with a sub-greedy policy, is developed for continuous action spaces, and Multi Pseudo Q-learning (MPQ) is proposed to reduce the overestimation of the action-value function and to stabilize learning. As for the actors, the deterministic policy gradient is applied to update the weights, and the final learned policy is defined as the average of all actors to avoid large but harmful updates. Moreover, a qualitative stability analysis of the learning is given. The effectiveness and generality of the proposed MPQ-based Deterministic Policy Gradient (MPQ-DPG) algorithm are verified on an AUV with two different reference trajectories. The results demonstrate the high tracking control accuracy and stable learning of MPQ-DPG, and also validate that increasing the number of actors and critics further improves performance.

23 citations

Proceedings ArticleDOI
01 Aug 2019
TL;DR: This paper presents an off-policy, model-free, maximum entropy actor-critic deep RL algorithm called deep soft policy gradient (DSPG), which combines the soft policy gradient with the soft Bellman equation and uses a double sampling approach to ensure stable learning while eliminating the need for two separate critics for soft value functions.
Abstract: Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality, such as 3D humanoid locomotion. In addition, the optimality of the desired Boltzmann policy set for a non-optimal soft value function is not sufficiently convincing. In this paper, we first derive the soft policy gradient based on an entropy-regularized expected reward objective for RL with continuous actions. Then, we present an off-policy, model-free, maximum entropy actor-critic deep RL algorithm called deep soft policy gradient (DSPG) by combining the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for soft value functions, we leverage a double sampling approach to make the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms prior off-policy methods.
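For reference, the entropy-regularized objective and soft Bellman backup that DSPG builds on can be written in the standard maximum-entropy form below; the notation (temperature $\alpha$, discount $\gamma$) is generic rather than taken verbatim from the paper.

```latex
% Standard entropy-regularized objective and soft Bellman backup underlying
% maximum entropy RL; notation is generic, not quoted from the paper.
\begin{align}
  J(\pi) &= \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}
            \Big[ r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big) \Big], \\
  Q^{\pi}(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'}\Big[
            \mathbb{E}_{a'\sim\pi}\big[ Q^{\pi}(s',a') - \alpha \log \pi(a'\mid s') \big] \Big].
\end{align}
```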

21 citations

Proceedings Article
Wenjie Shi, Shiji Song, Hui Wu, Yachu Hsu, Cheng Wu, Gao Huang
01 Jan 2019
TL;DR: This work proposes a general acceleration method for model-free, off-policy deep RL algorithms by drawing on the idea underlying regularized Anderson acceleration (RAA), an effective approach to accelerating the solution of fixed-point problems with perturbations.
Abstract: Model-free deep reinforcement learning (RL) algorithms have been widely used for a range of complex control tasks. However, slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous and high-dimensional state spaces. To tackle these problems, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing on the idea underlying regularized Anderson acceleration (RAA), an effective approach to accelerating the solution of fixed-point problems with perturbations. Specifically, we first explain how policy iteration can be applied directly with Anderson acceleration. Then we extend RAA to the case of deep RL by introducing a regularization term to control the impact of perturbation induced by function approximation errors. We further propose two strategies, progressive update and adaptive restart, to enhance the performance. The effectiveness of our method is evaluated on a variety of benchmark tasks, including Atari 2600 and MuJoCo. Experimental results show that our approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms.
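The core RAA step described above amounts to combining recent iterates of a fixed-point map with regularized least-squares weights. The sketch below illustrates this for a generic map F under assumed names (raa_step, lam); it is not the authors' code.

```python
# Minimal NumPy sketch of regularized Anderson acceleration for a generic
# fixed-point map F (x = F(x)). The regularization weight lam plays the role
# of the perturbation-control term mentioned above; all names are illustrative.
import numpy as np

def raa_step(iterates, images, lam=1e-3):
    """Combine the last m iterates x_i and their images F(x_i) into a new point.

    iterates: list of previous x_i (each a 1-D array)
    images:   list of F(x_i), same length
    """
    residuals = np.stack([f - x for x, f in zip(iterates, images)])  # shape (m, d)
    m = residuals.shape[0]
    # Solve min_alpha  alpha^T (R R^T + lam I) alpha  subject to  sum(alpha) = 1
    G = residuals @ residuals.T + lam * np.eye(m)
    alpha = np.linalg.solve(G, np.ones(m))
    alpha /= alpha.sum()
    # New iterate is the alpha-weighted combination of the images F(x_i)
    return alpha @ np.stack(images)
```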

19 citations

Journal ArticleDOI
Wenjie Shi, Gao Huang, Shiji Song, Zhuoyuan Wang, Tingyu Lin, Cheng Wu
TL;DR: A self-supervised interpretable framework is proposed, which can discover interpretable features to enable easy understanding of RL agents even for non-experts and provides valuable insight into the internal decision-making process of vision-based RL.
Abstract: Deep reinforcement learning (RL) has recently led to many breakthroughs on a range of complex control tasks. However, the decision-making process is generally not transparent, and the lack of interpretability hinders applicability in safety-critical scenarios. While several methods have attempted to interpret vision-based RL, most come without a detailed explanation of the agent's behaviour. In this paper, we propose a self-supervised interpretable framework, which can discover causal features to enable easy interpretation of RL even for non-experts. Specifically, a self-supervised interpretable network is employed to produce fine-grained masks that highlight task-relevant information, which constitutes most of the evidence for the agent's decisions. We verify and evaluate our method on several Atari 2600 games and on Duckietown, a challenging self-driving car simulator environment. The results show that our method renders causal explanations and empirical evidence about how the agent makes decisions and why it performs well or badly. Overall, our method provides valuable insight into the decision-making process of RL. In addition, our method does not use any external labelled data and thus demonstrates the possibility of learning high-quality masks in a self-supervised manner, which may shed light on new paradigms for label-free vision learning such as self-supervised segmentation and detection.
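A rough picture of the kind of training objective such a mask network could use is sketched below: the masked observation should leave the frozen agent's action distribution unchanged while the mask stays sparse. The loss form, weights, and network names are assumptions for illustration, not the paper's exact objective.

```python
# Illustrative self-supervised training loss for a mask network: keep the
# agent's decisions consistent under masking while penalizing large masks.
# agent and mask_net are assumed to be torch.nn.Module instances.
import torch
import torch.nn.functional as F

def mask_loss(agent, mask_net, obs, sparsity_weight=0.01):
    mask = torch.sigmoid(mask_net(obs))           # values in (0, 1), same shape as obs
    masked_obs = mask * obs
    with torch.no_grad():
        target_logits = agent(obs)                # frozen agent provides the supervision signal
    pred_logits = agent(masked_obs)
    # Decision consistency between masked and unmasked observations
    consistency = F.kl_div(F.log_softmax(pred_logits, dim=-1),
                           F.softmax(target_logits, dim=-1),
                           reduction="batchmean")
    sparsity = mask.mean()                        # encourage fine-grained, sparse masks
    return consistency + sparsity_weight * sparsity
```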

14 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Book ChapterDOI
01 Jan 1985
TL;DR: As discussed by the authors, the first group of results centres around Banach's fixed point theorem, a nice result since it imposes only one simple condition on the map F, is easy to prove, and nevertheless allows a variety of applications.
Abstract: Formally we have arrived at the middle of the book. So you may need a pause for recovering, a pause which we want to fill up by some fixed point theorems supplementing those which you already met or which you will meet in later chapters. The first group of results centres around Banach’s fixed point theorem. The latter is certainly a nice result since it contains only one simple condition on the map F, since it is so easy to prove and since it nevertheless allows a variety of applications. Therefore it is not astonishing that many mathematicians have been attracted by the question to which extent the conditions on F and the space Ω can be changed so that one still gets the existence of a unique or of at least one fixed point. The number of results produced this way is still finite, but of a statistical magnitude, suggesting at a first glance that only a random sample can be covered by a chapter or even a book of the present size. Fortunately (or unfortunately?) most of the modifications have not found applications up to now, so that there is no reason to write a cookery book about conditions but to write at least a short outline of some ideas indicating that this field can be as interesting as other chapters. A systematic account of more recent ideas and examples in fixed point theory should however be written by one of the true experts. Strange as it is, such a book does not seem to exist though so many people are puzzling out so many results.
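For reference, the single simple condition on F that the chapter alludes to is the contraction property; a standard statement of Banach's theorem is:

```latex
% Standard statement of Banach's fixed point theorem (assumes the usual
% \newtheorem{theorem} setup); notation is generic.
\begin{theorem}[Banach fixed point theorem]
Let $(\Omega,d)$ be a complete metric space and let $F:\Omega\to\Omega$ satisfy
\[
  d\bigl(F(x),F(y)\bigr) \le q\, d(x,y) \quad \text{for all } x,y\in\Omega
\]
for some constant $q\in[0,1)$. Then $F$ has a unique fixed point
$x^{\ast}\in\Omega$, and the iteration $x_{n+1}=F(x_n)$ converges to
$x^{\ast}$ from any starting point $x_0\in\Omega$.
\end{theorem}
```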

994 citations

Posted Content
TL;DR: This work provides a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective and expects this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.
Abstract: Following the remarkable success of the AlphaGO series, 2019 was a booming year that witnessed significant advances in multi-agent reinforcement learning (MARL) techniques. MARL corresponds to the learning problem in a multi-agent system in which multiple agents learn simultaneously. It is an interdisciplinary domain with a long history that includes game theory, machine learning, stochastic control, psychology, and optimisation. Although MARL has achieved considerable empirical success in solving real-world games, there is a lack of a self-contained overview in the literature that elaborates the game theoretical foundations of modern MARL methods and summarises the recent advances. In fact, the majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments in the research frontier. The goal of our monograph is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.

103 citations

Journal ArticleDOI
TL;DR: An intelligent proportional-integral controller based on a sliding-mode (SM) observer is proposed to mitigate the destructive impedance instabilities of nonideal, time-varying constant power loads (CPLs) in the ultralocal model sense.
Abstract: Nonlinearities and unmodeled dynamics inevitably degrade the quality and reliability of power conversion and, as a result, pose major challenges to high-performance voltage stabilization of dc–dc buck converters. The stability of such power electronic equipment is further threatened when feeding nonideal constant power loads (CPLs) because of the induced negative impedance specifications. In response to these challenges, advanced regulatory and technological mechanisms for the converters need to be developed to efficiently implement these interface systems in the microgrid configuration. This article presents an intelligent proportional-integral controller based on a sliding-mode (SM) observer to mitigate the destructive impedance instabilities of nonideal, time-varying CPLs in the ultralocal model sense. In particular, an auxiliary deep deterministic policy gradient (DDPG) controller is adaptively developed to decrease the observer estimation error and further improve the dynamic characteristics of dc–dc buck converters. The DDPG design consists of two parts: (i) an actor network, which generates the policy commands, and (ii) a critic network, which evaluates the quality of the policy commands generated by the actor. The suggested strategy uses the DDPG-based control to compensate for what the iPI-based SM observer cannot. In this application, the weight coefficients of the actor and critic networks are trained based on the reward feedback of the voltage error, using the gradient descent scheme. Finally, to investigate the merits and implementation feasibility of the suggested method, experimental results on a laboratory prototype of a dc–dc buck converter feeding a time-varying CPL are presented.
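As a point of reference, a generic DDPG update of the kind used for the auxiliary controller is sketched below; the reward is assumed to be a function of the voltage tracking error, and the networks, optimizers, and discount factor are placeholders rather than the article's design.

```python
# Generic DDPG actor-critic update step (illustrative, not the article's code).
# Networks are torch.nn.Modules; batch holds transitions sampled from a replay buffer.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic: regress Q(s, a) toward the bootstrapped target value
    with torch.no_grad():
        target_q = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient (ascend Q along the actor's action)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```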

62 citations

Journal ArticleDOI
01 Jun 2021
TL;DR: Simulation experiments prove that the novel BINN-based algorithm proposed in this paper can effectively and reasonably assign tasks among multiple AUVs and reduce the overall sailing distance.
Abstract: Task assignment and path planning for multi-AUV systems have attracted considerable attention and become a research hotspot. In this paper, a novel multi-AUV task assignment and path planning algorithm based on a Biologically Inspired Neural Network (BINN) map is proposed. First, a grid map is built by discretizing the three-dimensional underwater environment into many equal grids. Second, the activity values of all AUVs in the BINN map of each target are calculated. Then, the AUV with the highest activity value in a target's BINN map is selected as the winning AUV for that target. Finally, the winning AUV performs path planning according to the BINN strategy. Simulation experiments prove that the proposed BINN algorithm can effectively and reasonably distribute tasks among multiple AUVs and reduce the overall sailing distance.
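A toy sketch of the winner-selection step is given below, treating the BINN activity values as a precomputed matrix; the names and the one-AUV-per-target assignment structure are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of the winner-selection step: for each target, the AUV with the
# highest activity value in that target's BINN map is chosen. The activity
# computation itself is treated as a black box here.
import numpy as np

def assign_targets(activity, auv_ids, target_ids):
    """activity[i, j] = activity value of AUV i in the BINN map of target j."""
    assignment = {}
    for j, target in enumerate(target_ids):
        winner = int(np.argmax(activity[:, j]))
        assignment[target] = auv_ids[winner]   # the winning AUV then plans its path via BINN
    return assignment
```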

38 citations