Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to AdaBoost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156) but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
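As a rough illustration of the random feature selection described in the abstract above, the following Python sketch builds a tiny forest of one-split trees. Each tree is fit on a bootstrap sample and searches only a random subset of features, and the forest predicts by majority vote. The function names, the use of one-split trees, and the toy data are assumptions made for illustration; this is not the procedure evaluated in the paper.

```python
import random
from collections import Counter


def fit_stump(X, y, n_candidate_features, rng):
    """Fit a one-split tree, searching only a random subset of features."""
    features = rng.sample(range(len(X[0])), n_candidate_features)
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            # An empty right side falls back to the overall majority label.
            right_label = Counter(right).most_common(1)[0][0] if right else majority
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_candidate_features=1, seed=0):
    """Fit each stump on a bootstrap sample with its own random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        forest.append(
            fit_stump([X[i] for i in idx], [y[i] for i in idx], n_candidate_features, rng)
        )
    return forest


def predict(forest, row):
    """Majority vote over the stumps in the forest."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data (assumed): class 0 has small feature values, class 1 has large ones.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [0.9, 1.0]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [0.0, 0.0]), predict(forest, [2.0, 2.0]))
```

Raising `n_candidate_features` makes individual stumps stronger but also more alike, which is the strength versus correlation trade-off the abstract refers to.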

Random Forests

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

Gradient-based learning applied to document recognition

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.

/pdf/going-deeper-with-convolutions-1yobw2o2ds.pdf

Going deeper with convolutions

Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

Deep Learning

Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

/pdf/reinforcement-learning-an-introduction-rzxgej9p17.pdf

Reinforcement Learning: An Introduction

Ecosystem Informatics is the study of computational methods for advancing the ecosystem sciences and environmental policy. This talk will discuss the ways in which machine learning, in combination with novel sensors, can help transform the ecosystem sciences from small-scale hypothesis-driven science to global-scale data-driven science. Example challenge problems include optimal sensor placement, modeling errors and biases in data collection, automated recognition of species from acoustic and image data, automated data cleaning, fitting models to data (species distribution models and dynamical system models), and robust optimization of environmental policies. The talk will also discuss the recent development of The Evidence Tree Methodology for complex machine learning applications.

/pdf/machine-learning-and-ecosystem-informatics-challenges-and-3czakno7et.pdf

Machine Learning and Ecosystem Informatics: Challenges and Opportunities

A common heuristic for solving Partially Observable Markov Decision Problems (POMDPs) is to first solve the underlying Markov Decision Process (MDP) and then construct a POMDP policy by performing a fixed-depth lookahead search in the POMDP and evaluating the leaf nodes using the MDP value function. A problem with this approximation is that it does not account for the need to choose actions in order to gain information about the state of the world, particularly when those observation actions are needed at some point in the future. This paper proposes two heuristics that are better than the MDP approximation in POMDPs where there is a delayed need to observe. The first approximation, introduced in prior work, is the even-odd POMDP, in which the world is assumed to be fully observable every other time step. The even-odd POMDP can be converted into an equivalent MDP, the even-MDP, whose value function captures some of the sensing costs of the original POMDP. An online policy consisting of a lookahead search combined with the value function of the even-MDP gives an approximation to the POMDP's value function that is at least as good as the method based on the value function of the underlying MDP. The second POMDP approximation is applicable to a special kind of POMDP, which we call the Cost Observable Markov Decision Problem (COMDP). In a COMDP, the actions are partitioned into those that change the state of the world and those that are pure observation actions. For such problems we describe the chain-MDP algorithm, which in many cases is able to capture more of the sensing costs than the even-odd POMDP approximation. We prove that both heuristics compute value functions that are upper bounded by (i.e., better than) the value function of the underlying MDP and, in the case of the even-MDP, also lower bounded by the POMDP's optimal value function. We show cases where the chain-MDP online policy is better than, equal to, or worse than the even-MDP online policy.
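The baseline heuristic named at the start of this abstract (solve the underlying MDP, then score lookahead leaves with the MDP value function) can be sketched roughly as follows. The two-state "good"/"bad" model, the function names, and the reward-maximizing convention are assumptions made for illustration. The sketch covers only the MDP-based leaf evaluation, not the even-odd POMDP or chain-MDP heuristics the paper proposes.

```python
def value_iteration(states, actions, transition, reward, gamma=0.95, tol=1e-9):
    """Solve a fully observable MDP.

    transition[s][a] is a dict {next_state: probability};
    reward[s][a] is the immediate reward for taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: max(
                reward[s][a]
                + gamma * sum(p * V[s2] for s2, p in transition[s][a].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V


def mdp_leaf_value(belief, V):
    """Score a belief state as the expected MDP value, sum over s of b(s) * V(s)."""
    return sum(prob * V[s] for s, prob in belief.items())


# Hypothetical toy model: a machine that is either "good" or "bad".
states = ["good", "bad"]
actions = ["stay", "repair"]
transition = {
    "good": {"stay": {"good": 1.0}, "repair": {"good": 1.0}},
    "bad": {"stay": {"bad": 1.0}, "repair": {"good": 1.0}},
}
reward = {
    "good": {"stay": 1.0, "repair": 0.0},
    "bad": {"stay": 0.0, "repair": -0.5},
}

V = value_iteration(states, actions, transition, reward, gamma=0.5)
leaf = mdp_leaf_value({"good": 0.5, "bad": 0.5}, V)
print(V, leaf)
```

Because the leaf score averages values that each assume the state is known, it assigns no value to observation actions, which is the weakness the abstract describes.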

/pdf/two-heuristics-for-solving-pomdps-having-a-delayed-need-to-zk3ule00jo.pdf

Two heuristics for solving POMDPs having a delayed need to observe

Support vector machines introduced three important innovations to machine learning research: (a) the application of mathematical programming algorithms to solve optimization problems in machine learning, (b) the control of overfitting by maximizing the margin, and (c) the use of Mercer kernels to convert linear separators into non-linear decision boundaries in implicit spaces. Despite their attractiveness in classification and regression, support vector methods have not been applied to the problem of value function approximation in reinforcement learning. This paper presents three ways of combining linear programming with kernel methods to find value function approximations for reinforcement learning. One formulation is based on the standard approach to SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good actions have an advantage over bad actions. All formulations attempt to minimize the norm of the weight vector while fitting the data, which corresponds to maximizing the margin in standard SVM classification. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance. However, the third formulation is much more efficient to train and also converges more reliably. Unlike policy gradient and temporal difference methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function.

/pdf/support-vectors-for-reinforcement-learning-1tdgqkt5c3.pdf

Support Vectors for Reinforcement Learning

Consider a binary classification problem in which the learner is given a labeled training set, an unlabeled test set, and is restricted to choosing exactly $k$ test points to output as positive predictions. Problems of this kind, called transductive precision@$k$, arise in information retrieval, digital advertising, and reserve design for endangered species. Previous methods separate the training of the model from its use in scoring the test points. This paper introduces a new approach, Transductive Top K (TTK), that seeks to minimize the hinge loss over all training instances under the constraint that exactly $k$ test instances are predicted as positive. The paper presents two optimization methods for this challenging problem. Experiments and analysis confirm the importance of incorporating the knowledge of $k$ into the learning process. Experimental evaluations of the TTK approach show that the performance of TTK matches or exceeds existing state-of-the-art methods on 7 UCI datasets and 3 reserve design problem instances.
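To make the precision@$k$ setting concrete, here is a minimal Python sketch of the metric applied to the "score, then pick the top k" baseline that the abstract attributes to previous methods. The function name and the toy scores and labels are illustrative assumptions. The sketch does not implement TTK, which instead builds the exactly-$k$ constraint into training.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring test points.

    scores: model scores for the test points (higher means more likely positive).
    labels: ground-truth labels, 1 for positive and 0 for negative.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of test points")
    # Rank test-point indices by score, highest first, and keep exactly k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(labels[i] for i in top_k) / k


# Hypothetical scores and labels for four test points.
scores = [0.9, 0.1, 0.8, 0.4]
labels = [1, 0, 0, 1]
print(precision_at_k(scores, labels, k=2))
```

Only the ordering of the top $k$ scores matters here, so a model tuned for overall accuracy can still choose a poor set of $k$ points.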

Transductive Optimization of Top k Precision

Fire spread on forested landscapes depends on vegetation conditions across the landscape that affect the fire arrival probability and forest stand value. Landowners can control some forest characteristics that facilitate fire spread, and when a single landowner controls the entire landscape, a rational landowner accounts for spatial interactions when making management decisions. With multiple landowners, management activity by one may impact outcomes for the others. Various liability regulations have been proposed, and some enacted, to make landowners account for these impacts by changing the incentives they face. In this paper, the effects of two different types of liability regulations are examined – strict liability and negligence standards. We incorporate spatial information into a model of land manager decision-making about the timing and spatial location of timber harvest and fuel treatment. The problem is formulated as a dynamic game and solved via multi-agent approximate dynamic programming. We found that, in some cases, liability regulation can increase expected land values for individual land ownerships and for the landscape as a whole. But in other cases, it may create perverse incentives that reduce expected land value. We also showed that regulations may increase risk for individual landowners by increasing the variability of potential outcomes.

/pdf/evaluating-wildland-fire-liability-standards-does-regulation-y6vxftyf71.pdf

Thomas G. Dietterich

Papers

Machine Learning and Ecosystem Informatics: Challenges and Opportunities

Two heuristics for solving POMDPs having a delayed need to observe

Support Vectors for Reinforcement Learning

Transductive Optimization of Top k Precision

Evaluating wildland fire liability standards - does regulation incentivise good management?