
Showing papers on "Function approximation" published in 2020


Journal ArticleDOI
TL;DR: This contribution focuses on mechanical problems and analyzes the energetic format of the PDE, where the energy of a mechanical system appears to be the natural loss function for a machine learning method approaching a mechanical problem.

721 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a new composite neural network (NN) that can be trained on multi-fidelity data. It comprises three NNs: the first NN is trained on the low-fidelity data and coupled to two high-fidelity NNs, one with activation functions and one without, in order to discover and exploit nonlinear and linear correlations, respectively, between the low-fidelity and the high-fidelity data.

311 citations


Proceedings Article
15 Jul 2020
TL;DR: One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues, analogous to the global convergence guarantees of iterative value-function-based algorithms.
Abstract: Policy gradient (PG) methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regard to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. In the tabular setting, our main results are: 1) a convergence rate to the global optimum for direct parameterization and projected gradient ascent; 2) asymptotic convergence to the global optimum for softmax policy parameterization and PG, together with a convergence rate under additional entropy regularization; and 3) dimension-free convergence to the global optimum for softmax policy parameterization and the Natural Policy Gradient (NPG) method with exact gradients. In the function approximation setting, we further analyze NPG with exact as well as inexact gradients under certain smoothness assumptions on the policy parameterization and establish rates of convergence in terms of the quality of the initial state distribution. One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place PG methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value-function-based algorithms.
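
To make the objects in this analysis concrete, the sketch below runs a generic Monte Carlo policy gradient (REINFORCE) update with a tabular softmax parameterization on a toy two-state MDP. It is not one of the algorithms analyzed in the paper; the MDP, step size, and episode budget are illustrative assumptions.

```python
# Minimal REINFORCE with a tabular softmax policy on a toy 2-state MDP.
# Illustrative sketch only; the MDP, step size and episode budget are made up.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a] -> next-state distribution; R[s, a] -> reward (toy values).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

theta = np.zeros((n_states, n_actions))  # softmax logits

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def rollout(horizon=30):
    s, traj = 0, []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(s))
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

eta = 0.1
for episode in range(2000):
    traj = rollout()
    G, grad = 0.0, np.zeros_like(theta)
    # Accumulate discounted returns backwards; for softmax,
    # d/dtheta[s] log pi(a|s) = one_hot(a) - pi(.|s).
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        G = r + gamma * G
        pi = policy(s)
        grad[s] += (gamma ** t) * G * (np.eye(n_actions)[a] - pi)
    theta += eta * grad

print("learned policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```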

198 citations


Journal ArticleDOI
TL;DR: A reinforcement learning approach with value function approximation and feature learning is proposed for autonomous decision making of intelligent vehicles on highways and uses data-driven feature representation for value and policy approximation so that better learning efficiency can be achieved.
Abstract: Autonomous decision making is a critical and difficult task for intelligent vehicles in dynamic transportation environments. In this paper, a reinforcement learning approach with value function approximation and feature learning is proposed for autonomous decision making of intelligent vehicles on highways. In the proposed approach, the sequential decision making problem for lane changing and overtaking is modeled as a Markov decision process with multiple goals, including safety, speediness, smoothness, etc. In order to learn optimized policies for autonomous decision-making, a multiobjective approximate policy iteration (MO-API) algorithm is presented. The features for value function approximation are learned in a data-driven way, where sparse kernel-based features or manifold-based features can be constructed based on data samples. Compared with previous RL algorithms such as multiobjective Q-learning, the MO-API approach uses data-driven feature representation for value and policy approximation so that better learning efficiency can be achieved. A highway simulation environment using a 14 degree-of-freedom vehicle dynamics model was established to generate training data and test the performance of different decision-making methods for intelligent vehicles on highways. The results illustrate the advantages of the proposed MO-API method under different traffic conditions. Furthermore, we also tested the learned decision policy on a real autonomous vehicle to implement overtaking decision and control under normal traffic on highways. The experimental results also demonstrate the effectiveness of the proposed method.

116 citations


Journal ArticleDOI
TL;DR: The proposed ANN-PSO-IPS is implemented for four variants of TONMS-EFEs, and comparison with exact solutions revealed its robustness, correctness and effectiveness, which is further authenticated through statistical explorations.
Abstract: In this study, a novel neuro-swarming computing solver is developed for numerical treatment of the third-order nonlinear multi-singular Emden–Fowler equation (TONMS-EFE) by using the function approximation ability of artificial neural networks (ANNs) modeling and the global optimization mechanism of particle swarm optimization (PSO) integrated with the local search of an interior-point scheme (IPS), i.e., ANN-PSO-IPS. The inspiration for the design of the ANN-PSO-IPS-based numerical solver comes with the objective of presenting a reliable, accurate and viable structure that combines the strength of ANNs optimized with the integrated soft computing frameworks to deal with such challenging systems based on TONMS-EFE. The proposed ANN-PSO-IPS is implemented for four variants of TONMS-EFEs, and comparison with exact solutions revealed its robustness, correctness and effectiveness, which is further authenticated through statistical explorations.
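
For illustration, the sketch below trains a one-hidden-layer neural trial solution with a bare-bones particle swarm optimizer to minimize the residual of a singular ODE at collocation points. To stay short it uses the classical second-order Lane–Emden equation rather than the third-order equations treated in the paper and omits the interior-point refinement stage; the network size, swarm parameters and collocation grid are illustrative assumptions.

```python
# Toy sketch: neural trial solution + PSO for a singular ODE.
# Test problem (not the paper's): u'' + (2/x) u' + u^5 = 0, u(0)=1, u'(0)=0,
# with exact solution u(x) = (1 + x^2/3)^(-1/2). All settings are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_hidden = 8
dim = 3 * n_hidden               # per neuron: input weight, bias, output weight
xs = np.linspace(0.1, 1.0, 25)   # collocation points (avoid the singularity at 0)
h = 1e-3                         # finite-difference step for derivatives

def net(x, p):
    w, b, v = p[:n_hidden], p[n_hidden:2*n_hidden], p[2*n_hidden:]
    return np.tanh(np.outer(x, w) + b) @ v

def u_trial(x, p):
    # Trial form 1 + x^2 * N(x) enforces u(0)=1 and u'(0)=0 by construction.
    return 1.0 + x**2 * net(x, p)

def residual(p):
    u   = u_trial(xs, p)
    up  = (u_trial(xs + h, p) - u_trial(xs - h, p)) / (2 * h)
    upp = (u_trial(xs + h, p) - 2 * u + u_trial(xs - h, p)) / h**2
    return np.mean((upp + (2.0 / xs) * up + u**5) ** 2)

# Bare-bones global-best PSO over the flattened weight vector.
n_particles, n_iter = 40, 400
pos = rng.normal(0, 0.5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([residual(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([residual(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

exact = (1 + xs**2 / 3) ** -0.5
print("mean squared residual:", residual(gbest))
print("max |u - exact|:", np.abs(u_trial(xs, gbest) - exact).max())
```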

97 citations


Posted Content
TL;DR: The results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
Abstract: Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: (i) we have realizability in that the true value function of every policy is linear in a given set of features, and (ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

93 citations


Posted Content
TL;DR: An Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm, in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is the first provably efficient policy optimization algorithm for CMDPs with safe exploration.
Abstract: We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with function approximation, where the Markov transition kernels have a linear structure; we do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm where the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but only bandit feedback is received after each episode. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping, and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.

69 citations


Posted Content
TL;DR: This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover), and complements the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
Abstract: Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case $\ell_{\infty}$ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.

67 citations


Posted Content
TL;DR: This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale, and enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients.
Abstract: A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed. We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies. On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25-40% accuracy.
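
As a point of reference for the online function approximation problem described above, the snippet below projects a signal's history onto a truncated Legendre basis by naive batch least squares. It is not the HiPPO recurrence itself; the uniform measure, toy signal, and basis size are illustrative assumptions.

```python
# Batch projection of a signal's history onto Legendre polynomials (uniform
# measure). HiPPO derives an *online* ODE/recurrence for these coefficients;
# this naive least-squares version only illustrates the approximation target.
import numpy as np
from numpy.polynomial import legendre as L

order = 32                              # number of basis functions (illustrative)
t = np.linspace(0, 6, 1200)
f = np.sin(t) + 0.3 * np.sin(5 * t)     # toy signal

def project_history(t_hist, f_hist, order):
    # Rescale the history [0, t_now] to [-1, 1], the natural Legendre domain.
    x = 2 * t_hist / t_hist[-1] - 1
    return L.legfit(x, f_hist, order - 1)   # least-squares Legendre coefficients

def reconstruct(coeffs, t_hist):
    x = 2 * t_hist / t_hist[-1] - 1
    return L.legval(x, coeffs)

c = project_history(t, f, order)
err = np.max(np.abs(reconstruct(c, t) - f))
print(f"{order} coefficients, max reconstruction error {err:.2e}")
```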

66 citations


Journal ArticleDOI
TL;DR: This article proposes a novel DBN model based on adaptive sparse restricted Boltzmann machines (AS-RBM) and partial least square (PLS) regression fine-tuning, abbreviated as ARP-DBN, to obtain a more robust and accurate model than the existing ones.
Abstract: A deep belief network (DBN) is an efficient learning model for representing unknown data, especially nonlinear systems. However, it is extremely hard to design a satisfactory DBN with a robust structure because of the traditional dense representation. In addition, backpropagation-based fine-tuning tends to yield poor performance since it is easily trapped in local optima. In this article, we propose a novel DBN model based on adaptive sparse restricted Boltzmann machines (AS-RBM) and partial least square (PLS) regression fine-tuning, abbreviated as ARP-DBN, to obtain a more robust and accurate model than the existing ones. First, an adaptive learning step size is designed to accelerate the RBM training process, and two regularization terms are introduced into such a process to realize sparse representation. Second, the initial weights derived from AS-RBM are further optimized via layer-by-layer PLS modeling, proceeding from the output layer to the input one. Third, we present the convergence and stability analysis of the proposed method. Finally, our approach is tested on Mackey–Glass time-series prediction, 2-D function approximation, and unknown system identification. Simulation results demonstrate that it has higher learning accuracy and faster learning speed, and that it can be used to build a more robust model than the existing ones.

52 citations


Journal ArticleDOI
TL;DR: An image classification algorithm based on a stacked sparse coding deep learning model with optimized-kernel-function nonnegative sparse representation, which better addresses complex function approximation and weak classifiers, thus further improving image classification accuracy.
Abstract: Although traditional image classification methods have been widely applied to practical problems, they suffer from several issues in practice, such as unsatisfactory results, low classification accuracy, and weak adaptability; they separate image feature extraction and classification into two independent steps. Deep learning models have powerful learning ability and integrate feature extraction and classification into a single process, which can effectively improve image classification accuracy. However, this approach has two problems in practice: first, the complex functions in the deep learning model cannot be approximated effectively; second, the classifier that comes with the deep learning model has low accuracy. This paper therefore introduces the idea of sparse representation into the architecture of the deep learning network and combines the ability of sparse representation to linearly decompose multidimensional data with the deep structural advantages of multilayer nonlinear mapping to carry out the complex function approximation in the deep learning model. A sparse representation classification method based on an optimized kernel function is proposed to replace the classifier in the deep learning model, thereby improving the image classification performance. In summary, this paper proposes an image classification algorithm based on a stacked sparse coding deep learning model with optimized-kernel-function nonnegative sparse representation. The experimental results show that the proposed method not only has a higher average accuracy than other mainstream methods but is also well adapted to various image databases. Compared with other deep learning methods, it better addresses the problems of complex function approximation and weak classifiers, thus further improving image classification accuracy.

Journal ArticleDOI
TL;DR: The dynamic economic dispatch problem for smart grid is solved under the assumption that no knowledge of the mathematical formulation of the actual generation cost functions is available and a new distributed reinforcement learning optimization algorithm is proposed to address the lack of a priori knowledge.
Abstract: In this article, the dynamic economic dispatch (DED) problem for smart grid is solved under the assumption that no knowledge of the mathematical formulation of the actual generation cost functions is available. The objective of the DED problem is to find the optimal power output of each unit at each time so as to minimize the total generation cost. To address the lack of a priori knowledge, a new distributed reinforcement learning optimization algorithm is proposed. The algorithm combines the state-action-value function approximation with a distributed optimization based on multiplier splitting. Theoretical analysis of the proposed algorithm is provided to prove the feasibility of the algorithm, and several case studies are presented to demonstrate its effectiveness.

Proceedings Article
19 Jun 2020
TL;DR: An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.
Abstract: Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to compute a near-optimal policy. Jin et al. [2020] showed that in the tabular setting, the agent only needs to collect a polynomial number of samples (in terms of the number of states, the number of actions, and the planning horizon) for reward-free RL. However, in practice, the number of states and actions can be large, and thus function approximation schemes are required for generalization. In this work, we give both positive and negative results for reward-free RL with linear function approximation. We give an algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations. The sample complexity of our algorithm is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions. We further give an exponential lower bound for reward-free RL in the setting where only the optimal $Q$-function admits a linear representation. Our results imply several interesting exponential separations on the sample complexity of reward-free RL.

Journal ArticleDOI
TL;DR: In this paper, an approximate reinforcement learning (RL) methodology for bi-level power management of networked micro-grids in electric distribution systems is presented, where a cooperative agent performs function approximation to predict the behavior of entities under incomplete information of MG parametric models; while at the lower level, each MG provides power-flow-constrained optimal response to price signals.
Abstract: This paper presents an approximate Reinforcement Learning (RL) methodology for bi-level power management of networked Microgrids (MG) in electric distribution systems. In practice, the cooperative agent can have limited or no knowledge of the MG asset behavior and detailed models behind the Point of Common Coupling (PCC). This makes the distribution systems unobservable and impedes conventional optimization solutions for the constrained MG power management problem. To tackle this challenge, we have proposed a bi-level RL framework in a price-based environment. At the higher level, a cooperative agent performs function approximation to predict the behavior of entities under incomplete information of MG parametric models; while at the lower level, each MG provides power-flow-constrained optimal response to price signals. The function approximation scheme is then used within an adaptive RL framework to optimize the price signal as the system load and solar generation change over time. Numerical experiments have verified that, compared to previous works in the literature, the proposed privacy-preserving learning model has better adaptability and enhanced computational speed.

Posted Content
TL;DR: A computational framework for examining DNNs in practice is introduced, and a practical existence theorem is established, asserting the existence of a DNN architecture and training procedure that offers the same performance as best-in-class methods based on compressed sensing.
Abstract: Deep learning (DL) is transforming industry as decision-making processes are being automated by deep neural networks (DNNs) trained on real-world data. Driven partly by the rapidly expanding literature on DNN approximation theory showing they can approximate a rich variety of functions, such tools are increasingly being considered for problems in scientific computing. Yet, unlike traditional algorithms in this field, little is known about DNNs from the principles of numerical analysis, e.g., stability, accuracy, computational efficiency and sample complexity. In this paper we introduce a computational framework for examining DNNs in practice, and use it to study empirical performance with regard to these issues. We study performance of DNNs of different widths and depths on test functions in various dimensions, including smooth and piecewise smooth functions. We also compare DL against best-in-class methods for smooth function approximation based on compressed sensing (CS). Our main conclusion from these experiments is that there is a crucial gap between the approximation theory of DNNs and their practical performance, with trained DNNs performing relatively poorly on functions for which there are strong approximation results (e.g., smooth functions), yet performing well in comparison to best-in-class methods for other functions. To analyze this gap further, we provide some theoretical insights. We establish a practical existence theorem, asserting existence of a DNN architecture and training procedure that offers the same performance as CS. This establishes a key theoretical benchmark, showing the gap can be closed, albeit via a strategy guaranteed to perform as well as, but no better than, current best-in-class schemes. Nevertheless, it demonstrates the promise of practical DNN approximation by highlighting the potential for better schemes through careful design of DNN architectures and training strategies.
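
In the spirit of the empirical study described above, here is a tiny, self-contained experiment that fits a smooth one-dimensional test function with a small fully connected ReLU network and reports the maximum test error. The architecture, optimizer settings, target function and sample sizes are illustrative assumptions, not the paper's experimental setup.

```python
# Fit a smooth 1-D test function with a small ReLU network and report the
# maximum test error. All settings are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
f = lambda x: torch.exp(-x**2) * torch.sin(4 * x)   # smooth target function

x_train = torch.rand(512, 1) * 4 - 2                 # samples in [-2, 2]
y_train = f(x_train)
x_test = torch.linspace(-2, 2, 1000).unsqueeze(1)
y_test = f(x_test)

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_train), y_train)
    loss.backward()
    opt.step()

with torch.no_grad():
    err = (model(x_test) - y_test).abs().max().item()
print(f"max test error of the trained DNN: {err:.3e}")
```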

Posted Content
TL;DR: A novel gradient boosting framework is proposed where shallow neural networks are employed as "weak learners", and general loss functions are considered under this unified framework with specific examples presented for classification, regression, and learning to rank.
Abstract: A novel gradient boosting framework is proposed where shallow neural networks are employed as "weak learners". General loss functions are considered under this unified framework with specific examples presented for classification, regression, and learning to rank. A fully corrective step is incorporated to remedy the pitfall of greedy function approximation of classic gradient boosting decision trees. The proposed model outperformed state-of-the-art boosting methods in all three tasks on multiple datasets. An ablation study is performed to shed light on the effect of each model component and model hyperparameter.
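
A minimal sketch of the core idea, gradient boosting with shallow neural networks as weak learners for squared-error regression, is given below. It omits the fully corrective step described above and uses scikit-learn's MLPRegressor as the weak learner; the data, stage count and shrinkage are illustrative assumptions.

```python
# Gradient boosting with shallow neural networks as weak learners: each new
# network is fit to the residuals (negative gradients) of the current ensemble
# under squared-error loss. The fully corrective step is omitted; sizes and the
# shrinkage factor are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(800, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.05 * rng.normal(size=800)

n_stages, shrinkage = 20, 0.3
ensemble, pred = [], np.zeros_like(y)

for stage in range(n_stages):
    residual = y - pred                         # negative gradient of squared loss
    weak = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=stage)
    weak.fit(X, residual)
    ensemble.append(weak)
    pred += shrinkage * weak.predict(X)

print("training MSE after boosting:", np.mean((y - pred) ** 2))

def boosted_predict(X_new):
    # Ensemble prediction: shrunken sum of the weak learners' outputs.
    return shrinkage * sum(w.predict(X_new) for w in ensemble)
```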

Proceedings Article
12 Jul 2020
TL;DR: This paper proves that neural Q-learning finds the optimal policy with an $O(1/\sqrt{T})$ convergence rate if the neural function approximator is sufficiently overparameterized, where $T$ is the number of iterations.
Abstract: Q-learning with neural network function approximation (neural Q-learning for short) is among the most prevalent deep reinforcement learning algorithms. Despite its empirical success, the non-asymptotic convergence rate of neural Q-learning remains virtually unknown. In this paper, we present a finite-time analysis of a neural Q-learning algorithm, where the data are generated from a Markov decision process and the action-value function is approximated by a deep ReLU neural network. We prove that neural Q-learning finds the optimal policy with an $O(1/\sqrt{T})$ convergence rate if the neural function approximator is sufficiently overparameterized, where $T$ is the number of iterations. To the best of our knowledge, our result is the first finite-time analysis of neural Q-learning under a non-i.i.d. data assumption.
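
For readers who want to see the algorithm family being analyzed, the sketch below runs a bare-bones neural Q-learning loop on a toy random MDP with a small ReLU network and epsilon-greedy exploration. It is not the specific algorithm or assumptions of the analysis above; the MDP, network width and hyperparameters are illustrative.

```python
# Minimal neural Q-learning: a small ReLU network approximates Q(s, a), actions
# are epsilon-greedy, and targets use the usual bootstrapped max. Toy MDP and
# hyperparameters are illustrative only.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
torch.manual_seed(0)
n_states, n_actions, gamma = 5, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
R = rng.uniform(0, 1, size=(n_states, n_actions))                 # rewards

def one_hot(s):
    v = torch.zeros(n_states)
    v[s] = 1.0
    return v

qnet = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

s, eps = 0, 0.2
for _ in range(10000):
    with torch.no_grad():
        q = qnet(one_hot(s))
    a = rng.integers(n_actions) if rng.random() < eps else int(q.argmax())
    s_next = rng.choice(n_states, p=P[s, a])

    with torch.no_grad():
        target = R[s, a] + gamma * qnet(one_hot(s_next)).max()
    loss = (qnet(one_hot(s))[a] - target) ** 2   # squared Bellman error on one sample
    opt.zero_grad()
    loss.backward()
    opt.step()
    s = s_next

with torch.no_grad():
    greedy = [int(qnet(one_hot(s)).argmax()) for s in range(n_states)]
print("greedy action per state:", greedy)
```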

Journal ArticleDOI
TL;DR: The proposed method is able to calculate the Pareto front approximation of optimization problems with fewer objective function evaluations than other methods, which makes it appropriate for costly objectives.

Journal ArticleDOI
TL;DR: In this paper, a generalized Gaussian radial basis function (GGRBF) is proposed that is strictly positive definite and yields an exponential convergence rate when used for function approximation; it is shown, both analytically and numerically, that by suitably choosing the second auxiliary parameter, the ill-conditioning problems that usually occur with flat radial basis functions can be avoided.
Abstract: We introduce a new infinitely smooth generalized Gaussian radial basis function (GGRBF) involving two shape parameters: ψ(r; ϵ, ϵ0) = φ(r; ϵ) exp(φ(r; ϵ0) − 1), where φ(r; ϵ) is the Gaussian basis associated with the shape parameter ϵ and ϵ0 is an auxiliary shape parameter. A thorough theoretical analysis is performed proving that the proposed radial basis function is strictly positive definite and, when it is used for function approximation, yields an exponential convergence rate. In addition, we show, both analytically and numerically, that, by suitably choosing the second auxiliary parameter, the ill-conditioning problems that usually occur because of flat radial basis functions can be avoided. The reported numerical experiments highlight the very satisfactory performance of the novel radial basis function.
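
The following sketch instantiates the GGRBF ψ(r; ϵ, ϵ0) = φ(r; ϵ) exp(φ(r; ϵ0) − 1) and uses it for one-dimensional interpolation through the standard symmetric collocation system. The nodes, shape parameters and test function are illustrative assumptions rather than the paper's experiments.

```python
# Interpolation with the generalized Gaussian RBF
#   psi(r; eps, eps0) = phi(r; eps) * exp(phi(r; eps0) - 1),  phi(r; e) = exp(-(e*r)^2),
# via the standard collocation system A c = f(nodes). Settings are illustrative.
import numpy as np

phi = lambda r, e: np.exp(-(e * r) ** 2)                   # Gaussian basis
psi = lambda r, e, e0: phi(r, e) * np.exp(phi(r, e0) - 1)  # GGRBF

f = lambda x: np.tanh(3 * x) + 0.5 * np.sin(2 * np.pi * x)  # test function
nodes = np.linspace(-1, 1, 20)
eps, eps0 = 3.0, 1.5

# Interpolation matrix A[i, j] = psi(|x_i - x_j|) and coefficient solve.
A = psi(np.abs(nodes[:, None] - nodes[None, :]), eps, eps0)
coeffs = np.linalg.solve(A, f(nodes))

def interpolant(x):
    return psi(np.abs(x[:, None] - nodes[None, :]), eps, eps0) @ coeffs

x_eval = np.linspace(-1, 1, 400)
print("max interpolation error:", np.max(np.abs(interpolant(x_eval) - f(x_eval))))
```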

Posted Content
TL;DR: This paper formally shows that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy, and empirically demonstrates that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.
Abstract: Reinforcement learning with function approximation can be unstable and even divergent, especially when combined with off-policy learning and Bellman updates. In deep reinforcement learning, these issues have been dealt with empirically by adapting and regularizing the representation, in particular with auxiliary tasks. This suggests that representation learning may provide a means to guarantee stability. In this paper, we formally show that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy. We analyze representation learning schemes that are based on the transition matrix of a policy, such as proto-value functions, along three axes: approximation error, stability, and ease of estimation. In the most general case, we show that a Schur basis provides convergence guarantees, but is difficult to estimate from samples. For a fixed reward function, we find that an orthogonal basis of the corresponding Krylov subspace is an even better choice. We conclude by empirically demonstrating that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.
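
To illustrate the kind of representation discussed above, the sketch below builds an orthogonal basis of the Krylov subspace span{r, Pr, P²r, ...} for a toy Markov chain and runs on-policy linear TD(0) on those features; the chain, reward, step size and number of features are illustrative assumptions.

```python
# Build an orthogonal Krylov basis {r, Pr, P^2 r, ...} for a toy Markov chain
# and run on-policy linear TD(0) with those features. Settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, gamma, k = 6, 0.9, 3

P = rng.dirichlet(np.ones(n), size=n)       # toy transition matrix
r = rng.uniform(0, 1, size=n)               # fixed reward per state

# Krylov features: orthonormalize [r, Pr, P^2 r] via QR.
K = np.column_stack([np.linalg.matrix_power(P, i) @ r for i in range(k)])
Phi, _ = np.linalg.qr(K)                    # n x k orthonormal feature matrix

# Linear TD(0) along a single long on-policy trajectory.
w = np.zeros(k)
alpha, s = 0.05, 0
for _ in range(100000):
    s_next = rng.choice(n, p=P[s])
    td_error = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
    w += alpha * td_error * Phi[s]
    s = s_next

v_true = np.linalg.solve(np.eye(n) - gamma * P, r)   # exact value function
print("max |Phi w - v_true|:", np.max(np.abs(Phi @ w - v_true)))
```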

Journal ArticleDOI
TL;DR: A new multi-agent deep reinforcement learning algorithm framework named multi-agent time-delayed deep deterministic policy gradient is proposed, which reduces the overestimation error of neural network approximation and the variance of the estimation result using a dual-centered critic, group target network smoothing, and delayed policy updating.

Posted Content
09 Nov 2020
TL;DR: This work proposes the first provable RL algorithm with both polynomial runtime and sample complexity, without additional assumptions on the data-generating model, and proves that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret.
Abstract: Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with function approximation poses a fundamental challenge to developing algorithms with provable computational and statistical efficiency, due to the need to take into consideration both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate in statistical estimation. To address such a challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or an over-parametrized neural network, we propose the first provable RL algorithm with both polynomial runtime and sample complexity, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore even allow the number of states to diverge, which exhibits the benefit of function approximation.

Posted Content
TL;DR: Under the broader scope of policy optimization with nonlinear function approximation, it is proved for the first time that actor-critic with a deep neural network finds the globally optimal policy at a sublinear rate.
Abstract: We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings where both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove that actor-critic with deep neural network finds the globally optimal policy at a sublinear rate for the first time.
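
A toy version of the single-timescale scheme described above is sketched below: at every step the critic takes one TD(0) update and the actor takes one policy gradient step using the critic's estimate, with one-hot (a special case of linear) features. The MDP and step sizes are illustrative; this is not the paper's analysis setting.

```python
# Single-timescale actor-critic on a toy MDP: critic and actor are updated
# simultaneously at every step. One-hot features; all settings illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))   # actor: softmax logits
w = np.zeros((n_states, n_actions))       # critic: Q-value estimates

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

alpha_c, alpha_a, s = 0.1, 0.01, 0
for _ in range(100000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    pi_next = policy(s_next)

    # Critic: one TD(0) step toward the Bellman evaluation target for pi.
    target = R[s, a] + gamma * pi_next @ w[s_next]
    w[s, a] += alpha_c * (target - w[s, a])

    # Actor: policy gradient step using the critic's advantage estimate.
    advantage = w[s, a] - pi @ w[s]
    theta[s] += alpha_a * advantage * (np.eye(n_actions)[a] - pi)

    s = s_next

print("greedy policy:", [int(policy(s).argmax()) for s in range(n_states)])
```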

Posted Content
TL;DR: In this article, the authors give an algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations, and the sample complexity of their algorithm is polynomial in the feature dimension and the planning horizon, and is independent of the number of states and actions.
Abstract: Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to compute a near-optimal policy. Jin et al. [2020] showed that in the tabular setting, the agent only needs to collect a polynomial number of samples (in terms of the number of states, the number of actions, and the planning horizon) for reward-free RL. However, in practice, the number of states and actions can be large, and thus function approximation schemes are required for generalization. In this work, we give both positive and negative results for reward-free RL with linear function approximation. We give an algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations. The sample complexity of our algorithm is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions. We further give an exponential lower bound for reward-free RL in the setting where only the optimal $Q$-function admits a linear representation. Our results imply several interesting exponential separations on the sample complexity of reward-free RL.

Journal ArticleDOI
TL;DR: A highly regular and simply structured class of sparsely connected convolutional neural networks with rectifier activations that provide universal function approximation in a coarse-to-fine manner with an increasing number of layers.
Abstract: We construct a highly regular and simply structured class of sparsely connected convolutional neural networks with rectifier activations that provide universal function approximation in a coarse-to-fine manner with an increasing number of layers. The networks are localized in the sense that local changes in the function to be approximated only require local changes in the final layer of weights. At the core of the construction lies the fact that the characteristic function can be derived from a convolution of characteristic functions at the next coarser resolution via a rectifier passing. The latter refinement result holds for all higher-order univariate B-splines.

Journal ArticleDOI
TL;DR: A new computational framework to solve the partial differential equations (PDEs) governing the flow of the joint probability density functions (PDFs) in continuous-time stochastic nonlinear systems is developed and dualization along with an entropic regularization leads to a cone-preserving fixed point recursion that is proved to be contractive in Thompson metric.
Abstract: We develop a new computational framework to solve the partial differential equations (PDEs) governing the flow of the joint probability density functions (PDFs) in continuous-time stochastic nonlinear systems. The need for computing the transient joint PDFs subject to prior dynamics arises in uncertainty propagation, nonlinear filtering, and stochastic control. Our methodology breaks away from the traditional approach of spatial discretization or function approximation—both of which, in general, suffer from the “curse-of-dimensionality.” In the proposed framework, we discretize time but not the state space. We solve infinite dimensional proximal recursions in the manifold of joint PDFs, which in the small time-step limit, is theoretically equivalent to solving the underlying transport PDEs. The resulting computation has the geometric interpretation of gradient flow of certain free energy functional with respect to the Wasserstein metric arising from the theory of optimal mass transport. We show that dualization along with an entropic regularization, leads to a cone-preserving fixed point recursion that is proved to be contractive in Thompson metric. A block co-ordinate iteration scheme is proposed to solve the resulting nonlinear recursions with guaranteed convergence. This approach enables remarkably fast computation for nonparametric transient joint PDF propagation. Numerical examples and various extensions are provided to illustrate the scope and efficacy of the proposed approach.

Journal ArticleDOI
TL;DR: It is shown that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge.
Abstract: We show that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge. Thus, the blessing of compositionality mitigates the curse of dimensionality. On the other hand, a theorem called good propagation of errors allows one to "lift" theorems about shallow networks to those about deep networks with an appropriate choice of norms, smoothness, etc. We illustrate this in three contexts where each channel in the deep network calculates a spherical polynomial, a non-smooth ReLU network, or another zonal function network related closely with the ReLU network.

Posted Content
TL;DR: A provably stable variant of neural ordinary differential equations whose trajectories evolve on an energy functional parametrised by a neural network, leading to robustness against input perturbations and low computational burden for the numerical solver.
Abstract: We introduce a provably stable variant of neural ordinary differential equations (neural ODEs) whose trajectories evolve on an energy functional parametrised by a neural network. Stable neural flows provide an implicit guarantee on asymptotic stability of the depth-flows, leading to robustness against input perturbations and low computational burden for the numerical solver. The learning procedure is cast as an optimal control problem, and an approximate solution is proposed based on adjoint sensitivity analysis. We further introduce novel regularizers designed to ease the optimization process and speed up convergence. The proposed model class is evaluated on non-linear classification and function approximation tasks.

Journal ArticleDOI
03 Mar 2020
TL;DR: This letter provides an approximate online adaptive solution to the infinite-horizon optimal control problem for control-affine continuous-time nonlinear systems while formalizing system safety using barrier certificates using the use of a barrier function transform.
Abstract: This letter provides an approximate online adaptive solution to the infinite-horizon optimal control problem for control-affine continuous-time nonlinear systems while formalizing system safety using barrier certificates. The use of a barrier function transform provides safety certificates to formalize system behavior. Specifically, using a barrier function, the system is transformed to aid in developing a controller which maintains the system in a pre-defined constrained region. To aid in online learning of the value function, the state-space is segmented into a number of user-defined segments. Off-policy trajectories are selected in each segment, and sparse Bellman error extrapolation is performed within each respective segment to generate an optimal policy within each segment. A Lyapunov-like stability analysis is included which proves uniformly ultimately bounded regulation in the presence of the barrier function transform and discontinuities. Simulation results are provided for a two-state dynamical system to compare the performance of the developed method to existing methods.

Proceedings Article
12 Jul 2020
TL;DR: In this article, the authors presented the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation, where the emphasis critic is trained via Gradient Emphasis Learning (GEM).
Abstract: We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.