There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality.

Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

I and i

Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

Machine learning

Data Mining - Concepts and Techniques.

We present the first deep learning model to successfully learn control
policies directly from high-dimensional sensory input using reinforcement
learning. The model is a convolutional neural network, trained with a variant
of Q-learning, whose input is raw pixels and whose output is a value function
estimating future rewards. We apply our method to seven Atari 2600 games from
the Arcade Learning Environment, with no adjustment of the architecture or
learning algorithm. We find that it outperforms all previous approaches on six
of the games and surpasses a human expert on three of them.

/pdf/playing-atari-with-deep-reinforcement-learning-2f9cvplh8c.pdf

Playing Atari with Deep Reinforcement Learning

Convergence of Probability Measures. By P. Billingsley. Chichester, Sussex, Wiley, 1968. xii, 253 p. 9 1/4“. 117s.

Convergence of Probability Measures

Markov decision processes (MDPs) with large number of states are of high practical interest. However, conventional algorithms to solve MDP are computationally infeasible in this scenario. Approximate dynamic programming (ADP) methods tackle this issue by computing approximate solutions. A widely applied ADP method is approximate linear program (ALP) which makes use of linear function approximation and offers theoretical performance guarantees. Nevertheless, the ALP is difficult to solve due to the presence of a large number of constraints and in practice, a reduced linear program (RLP) is solved instead. The RLP has a tractable number of constraints sampled from the original constraints of the ALP. Though the RLP is known to perform well in experiments, theoretical guarantees are available only for a specific RLP obtained under idealized assumptions.

In this paper, we generalize the RLP to define a generalized reduced linear program (GRLP) which has a tractable number of constraints that are obtained as positive linear combinations of the original constraints of the ALP. The main contribution of this paper is the novel theoretical framework developed to obtain error bounds for any given GRLP. Central to our framework are two max-norm contraction operators. Our result theoretically justifies linear approximation of constraints. We discuss the implication of our results in the contexts of ADP and reinforcement learning. We also demonstrate via an example in the domain of controlled queues that the experiments conform to the theory.

A generalized reduced linear program for markov decision processes

In this paper, we analyze the behavior of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee. We prove that after a large number of iterations, if the stochastic approximation process enters the domain of attraction of an attracting set, it gets locked into the attracting set with high probability. We demonstrate that the above-mentioned result is an effective instrument for analyzing stochastic approximation schemes in the absence of a stability guarantee, by using it to obtain an alternate criterion for convergence in the presence of a locally attracting set for the mean field and by using it to show that a feedback mechanism, which involves resetting the iterates at regular time intervals, stabilizes the scheme when the mean field possesses a globally attracting set, thereby guaranteeing convergence. The results in this paper build on the works of Borkar, Andrieu et al., and Chen et al. , by allowing for the presence of set-valued drift functions.

/pdf/analysis-of-stochastic-approximation-schemes-with-set-valued-1w5j7ldy5r.pdf

Analysis of Stochastic Approximation Schemes With Set-Valued Maps in the Absence of a Stability Guarantee and Their Stabilization

In this chapter, we review the Finite Difference Stochastic Approximation (FDSA) algorithm, also known as Kiefer-Wolfowitz (K-W) algorithm, and some of its variants for finding a local minimum of an objective function. The K-W scheme is a version of the Robbins-Monro stochastic approximation algorithm and incorporates balanced two-sided estimates of the gradient using two objective function measurements for a scalar parameter. When the parameter is an N-dimensional vector, the number of function measurements using this algorithm scales up to 2N. A one-sided variant of this algorithm in the latter case requires N + 1 function measurements. We present the original K-W scheme, first for the case of a scalar parameter, and subsequently for a vector parameter of arbitrary dimension. Variants including the one-sided version are then presented. We only consider here the case when the objective function is a simple expectation over noisy cost samples and not when it has a long-run average form. The latter form of the cost objective would require multi-timescale stochastic approximation, the general case of which was discussed in Chapter 3. Stochastic algorithms for the long-run average cost objectives will be considered in later chapters.

Kiefer-Wolfowitz Algorithm

During initial iterations of training in most Reinforcement Learning (RL) algorithms, agents perform a significant number of random exploratory steps. In the real world, this can limit the practicality of these algorithms as it can lead to potentially dangerous behavior. Hence safe exploration is a critical issue in applying RL algorithms in the real world. This problem has been recently well studied under the Constrained Markov Decision Process (CMDP) Framework, where in addition to single-stage rewards, an agent receives single-stage costs or penalties as well depending on the state transitions. The prescribed cost functions are responsible for mapping undesirable behavior at any given time-step to a scalar value. The goal then is to find a feasible policy that maximizes reward returns while constraining the cost returns to be below a prescribed threshold during training as well as deployment. We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner as well as find a feasible optimal policy using the Lagrangian Relaxation-based Proximal Policy Optimization. We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning. We compare our approach with relevant model-free and model-based approaches in Constrained RL using the challenging Safe Reinforcement Learning benchmark - the Open AI Safety Gym. We demonstrate that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches. Further, our approach shows better reward performance than other constrained model-based approaches in the literature.

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

In this paper, we present two new stochastic approximation algorithms for the problem of quantile estimation. The algorithms uses the characterization of the quantile provided in terms of an optimization problem in [1]. The algorithms take the shape of a stochastic gradient descent which minimizes the optimization problem. Asymptotic convergence of the algorithms to the true quantile is proven using the ODE method. The theoretical results are also supplemented through empirical evidence. The algorithms are shown to provide significant improvement in terms of memory requirement and accuracy.

Shalabh Bhatnagar

Papers

A generalized reduced linear program for markov decision processes

Analysis of Stochastic Approximation Schemes With Set-Valued Maps in the Absence of a Stability Guarantee and Their Stabilization

Kiefer-Wolfowitz Algorithm

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

A Stochastic Approximation Algorithm for Quantile Estimation