Showing papers in "Machine Learning in 2002"

Finite-time Analysis of the Multiarmed Bandit Problem

TL;DR: In this article, a Support Vector Machine (SVM) method based on recursive feature elimination (RFE) was proposed to select a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays.

Abstract: DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
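
As a rough illustration of recursive feature elimination, the sketch below repeatedly trains a linear classifier and drops the feature with the smallest squared weight. The abstract names neither the ranking criterion nor the solver, so both are assumptions here: `train_linear_svm` is a hypothetical helper that runs plain subgradient descent on the regularized hinge loss, and the toy data is invented.

```python
def train_linear_svm(X, y, features, epochs=200, lr=0.01, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (stand-in solver)."""
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[f] * xi[f] for f in features) + b)
            violated = margin < 1.0
            for f in features:
                grad = lam * w[f] - (yi * xi[f] if violated else 0.0)
                w[f] -= lr * grad
            if violated:
                b += lr * yi
    return w


def rfe(X, y, n_keep):
    """Return the indices of the n_keep features that survive elimination."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        w = train_linear_svm(X, y, features)
        # Assumed ranking criterion: the smallest squared weight is dropped first.
        worst = min(features, key=lambda f: w[f] ** 2)
        features.remove(worst)
    return features


# Invented toy data: feature 0 carries the label, features 1 and 2 are noise.
X = [[2.0, 0.1, -0.1], [2.0, -0.1, 0.1], [-2.0, 0.1, 0.1], [-2.0, -0.1, -0.1]]
y = [1, 1, -1, -1]
selected = rfe(X, y, n_keep=1)
```

Retraining after every removal is what lets the ranking change as features disappear, which a one-shot correlation ranking cannot do.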

7,939 citations

Journal Article•DOI•

[...]

Peter Auer¹, Nicolò Cesa-Bianchi², Paul Fischer³•Institutions (3)

Graz University of Technology¹, University of Milan², Technical University of Dortmund³

01 May 2002-Machine Learning

TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
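
To make the exploration versus exploitation trade-off concrete, here is a minimal sketch of an upper-confidence-bound index policy for rewards in [0, 1]. The abstract does not spell out the policy, so the index used (empirical mean plus `sqrt(2 ln t / n_j)`) is an assumption, and `bernoulli_bandit` is a hypothetical test bed.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Play `horizon` rounds and return how often each arm was chosen."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # play every arm once to initialize its mean
        else:
            # Assumed index: empirical mean plus a confidence radius
            # that shrinks as the arm is played more often.
            arm = max(
                range(n_arms),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts


def bernoulli_bandit(means, seed=0):
    """Hypothetical test bed: arm j pays 1 with probability means[j]."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < means[arm] else 0.0


counts = ucb1(bernoulli_bandit([0.2, 0.5, 0.9]), n_arms=3, horizon=2000)
```

Because the confidence radius grows only logarithmically in the round number, each suboptimal arm is revisited on the order of `ln t` times, which is the growth rate the abstract refers to.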

6,361 citations

Journal Article•DOI•

Choosing Multiple Parameters for Support Vector Machines

[...]

Olivier Chapelle, Vladimir Vapnik¹, Olivier Bousquet², Sayan Mukherjee³•Institutions (3)

AT&T¹, École Polytechnique², Massachusetts Institute of Technology³

Near-Optimal Reinforcement Learning in Polynomial Time

TL;DR: The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVMs) is considered by minimizing some estimates of the generalization error of SVMs using a gradient descent algorithm over the set of parameters.

Abstract: The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVMs) is considered. This is done by minimizing some estimates of the generalization error of SVMs using a gradient descent algorithm over the set of parameters. Usual methods for choosing parameters, based on exhaustive search become intractable as soon as the number of parameters exceeds two. Some experimental results assess the feasibility of our approach for a large number of parameters (more than 100) and demonstrate an improvement of generalization performance.

2,323 citations

Journal Article•DOI•

[...]

Michael Kearns¹, Satinder Singh•Institutions (1)

University of Pennsylvania¹

Training Invariant Support Vector Machines

TL;DR: In this paper, the authors show that the number of actions required to approach the optimal return is lower bounded by the mixing time of the optimal policy (in the undiscounted case) or by the horizon time T in the discounted case.

Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

802 citations

Journal Article•DOI•

[...]

Dennis DeCoste¹, Bernhard Schölkopf²•Institutions (2)

California Institute of Technology¹, Max Planck Society²

Text Categorization with Support Vector Machines. How to Represent Texts in Input Space

TL;DR: This work reports the recent achievement of the lowest reported test error on the well-known MNIST digit recognition benchmark task, with SVM training times that are also significantly faster than previous SVM methods.

Abstract: Practical experience has shown that in order to obtain the best possible performance, prior knowledge about invariances of a classification problem at hand ought to be incorporated into the training procedure. We describe and review all known methods for doing so in support vector machines, provide experimental results, and discuss their respective merits. One of the significant new results reported in this work is our recent achievement of the lowest reported test error on the well-known MNIST digit recognition benchmark task, with SVM training times that are also significantly faster than previous SVM methods.

633 citations

Journal Article•DOI•

[...]

Edda Leopold¹, Jörg Kindermann¹•Institutions (1)

Center for Information Technology¹

Linear Programming Boosting via Column Generation

TL;DR: It is shown that in the case of text classification, term-frequency transformations have a larger impact on the performance of SVM than the kernel itself.

Abstract: The choice of the kernel function is crucial to most applications of support vector machines. In this paper, however, we show that in the case of text classification, term-frequency transformations have a larger impact on the performance of SVM than the kernel itself. We discuss the role of importance-weights (e.g. document frequency and redundancy), which is not yet fully understood in the light of model complexity and calculation cost, and we show that time consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language like German.

473 citations

Journal Article•DOI•

[...]

Ayhan Demiriz¹, Kristin P. Bennett¹, John Shawe-Taylor²•Institutions (2)

Rensselaer Polytechnic Institute¹, Royal Holloway, University of London²

A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes

TL;DR: It is proved that for classification, minimizing the 1-norm soft margin error function directly optimizes a generalization error bound and is competitive in quality and computational cost to AdaBoost.

Abstract: We examine linear program (LP) approaches to boosting and demonstrate their efficient solution using LPBoost, a column generation based simplex method. We formulate the problem as if all possible weak hypotheses had already been generated. The labels produced by the weak hypotheses become the new feature space of the problem. The boosting task becomes to construct a learning function in the label space that minimizes misclassification error and maximizes the soft margin. We prove that for classification, minimizing the 1-norm soft margin error function directly optimizes a generalization error bound. The equivalent linear program can be efficiently solved using column generation techniques developed for large-scale optimization problems. The resulting LPBoost algorithm can be used to solve any LP boosting formulation by iteratively optimizing the dual misclassification costs in a restricted LP and dynamically generating weak hypotheses to make new LP columns. We provide algorithms for soft margin classification, confidence-rated, and regression boosting problems. Unlike gradient boosting algorithms, which may converge in the limit only, LPBoost converges in a finite number of iterations to a global solution satisfying mathematically well-defined optimality conditions. The optimal solutions of LPBoost are very sparse in contrast with gradient based methods. Computationally, LPBoost is competitive in quality and computational cost to AdaBoost.

462 citations

Journal Article•DOI•

[...]

Michael Kearns¹, Yishay Mansour², Andrew Y. Ng³•Institutions (3)

University of Pennsylvania¹, Tel Aviv University², University of California, Berkeley³

Support Vector Machines for Classification in Nonstandard Situations

TL;DR: This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.

Abstract: A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration: rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng. Neural Information Processing Systems 13, to appear).
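
The sparse look-ahead tree can be sketched as a short recursion: from the current state, draw a fixed number of samples per action from the generative model, recurse to a fixed depth, and back up the averaged values. The function names, the `(next_state, reward)` return convention of `model`, and the toy model are assumptions made for this illustration.

```python
def sparse_sampling_q(model, state, actions, depth, width, gamma):
    """Estimate Q(state, a) for every action from a sampled look-ahead tree."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):
            # One call to the generative model (assumed interface).
            next_state, reward = model(state, a)
            next_q = sparse_sampling_q(
                model, next_state, actions, depth - 1, width, gamma
            )
            total += reward + gamma * max(next_q.values())
        q[a] = total / width
    return q


def choose_action(model, state, actions, depth, width, gamma):
    """Act greedily with respect to the sampled Q estimates."""
    q = sparse_sampling_q(model, state, actions, depth, width, gamma)
    return max(actions, key=lambda a: q[a])


# Hypothetical deterministic model: action 1 pays 1 and moves right,
# action 0 pays 0 and stays put.
def toy_model(state, action):
    return state + action, float(action)


q_values = sparse_sampling_q(toy_model, 0, [0, 1], depth=3, width=2, gamma=0.5)
```

The recursion makes `(width * len(actions)) ** depth` model calls per decision, so the cost is exponential in the depth but never touches the size of the state space.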

416 citations

Journal Article•DOI•

[...]

Yi Lin¹, Yoonkyung Lee¹, Grace Wahba¹•Institutions (1)

University of Wisconsin-Madison¹

Variable Resolution Discretization in Optimal Control

TL;DR: This paper explains why the standard support vector machine is not suitable for the nonstandard situation, and introduces a simple procedure for adapting the support vector machine methodology to the nonstandard situation.

Abstract: The majority of classification algorithms are developed for the standard situation in which it is assumed that the examples in the training set come from the same distribution as that of the target population, and that the costs of misclassification into different classes are the same. However, these assumptions are often violated in real world settings. For some classification methods, this can often be taken care of simply with a change of threshold; for others, additional effort is required. In this paper, we explain why the standard support vector machine is not suitable for the nonstandard situation, and introduce a simple procedure for adapting the support vector machine methodology to the nonstandard situation. Theoretical justification for the procedure is provided. A simulation study illustrates that the modified support vector machine significantly improves upon the standard support vector machine in the nonstandard situation. The computational load of the proposed procedure is the same as that of the standard support vector machine. The procedure reduces to the standard support vector machine in the standard situation.

385 citations

Journal Article•DOI•

[...]

Rémi Munos¹, Andrew W. Moore²•Institutions (2)

École Polytechnique¹, Carnegie Mellon University²

TL;DR: This paper evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

Abstract: The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently-calculable measure of the extent to which changes in some state affect the value function of some other states. Variance is an efficiently-calculable measure of how risky some state is in a Markov chain: a low variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum phase, and two dimensional problem of the “Car on the hill”. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

360 citations

Journal Article•DOI•

Kernel Matching Pursuit

[...]

Pascal Vincent¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

Technical Update: Least-Squares Temporal Difference Learning

TL;DR: This work shows how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems, while keeping control of the sparsity of the solution.

Abstract: Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems, while keeping control of the sparsity of the solution. We present a version of the algorithm that makes an optimal choice of both the next basis and the weights of all the previously chosen bases. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification are given, showing comparable results with typically much sparser models.
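
The basic greedy loop described in the first sentence can be sketched as follows, using Gaussian kernels centered on the training inputs as the dictionary. This covers only the plain squared-error version, not the non-squared-loss or weight-refitting variants the abstract mentions; the kernel choice, the function names, and the sine-curve data are assumptions for illustration.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def kernel_matching_pursuit(xs, ys, n_bases, sigma=1.0):
    """Greedy squared-error matching pursuit over kernels centered on xs."""
    # Dictionary entry i is the kernel centered at xs[i], evaluated on all xs.
    dictionary = [[gaussian_kernel(x, c, sigma) for x in xs] for c in xs]
    residual = list(ys)
    model = []  # (center, weight) pairs, in the order they were added
    history = [dot(residual, residual)]  # squared residual norm per step
    for _ in range(n_bases):
        # Pick the basis function most correlated with the current residual.
        best = max(
            range(len(xs)),
            key=lambda i: dot(dictionary[i], residual) ** 2
            / dot(dictionary[i], dictionary[i]),
        )
        g = dictionary[best]
        weight = dot(g, residual) / dot(g, g)
        residual = [r - weight * gi for r, gi in zip(residual, g)]
        model.append((xs[best], weight))
        history.append(dot(residual, residual))
    return model, history


def predict(model, x, sigma=1.0):
    return sum(w * gaussian_kernel(x, c, sigma) for c, w in model)


# Invented example: approximate a sine curve with four kernels.
xs = [0.5 * i for i in range(13)]
ys = [math.sin(x) for x in xs]
model, history = kernel_matching_pursuit(xs, ys, n_bases=4)
```

The number of basis functions is set directly through `n_bases`, which is how this family of methods keeps control over the sparsity of the solution.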

Journal Article•DOI•

[...]

Justin A. Boyan

A Simple Decomposition Method for Support Vector Machines

TL;DR: This paper updates Bradtke and Barto's work in three significant ways: first, it presents a simpler derivation of the LSTD algorithm; second, it generalizes from λ = 0 to arbitrary values of λ, and at the extreme of λ = 1 the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression; third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.

Abstract: TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
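
A minimal sketch of least-squares TD with eligibility traces may help show why no stepsize is needed: the data are accumulated into a matrix and a vector, and the weights come from one linear solve. The accumulation formulas below are the commonly used form and are an assumption relative to this abstract, as are the function names, the episode encoding, and the two-state chain.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lstd_lambda(episodes, features, n_features, lam, gamma=1.0):
    """episodes: lists of (state, reward, next_state); next_state None is terminal."""
    a = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for episode in episodes:
        z = [0.0] * n_features  # eligibility trace, reset at each episode start
        for state, reward, next_state in episode:
            phi = features(state)
            if next_state is None:
                phi_next = [0.0] * n_features
            else:
                phi_next = features(next_state)
            z = [gamma * lam * zi + pi for zi, pi in zip(z, phi)]
            for i in range(n_features):
                b[i] += z[i] * reward
                for j in range(n_features):
                    a[i][j] += z[i] * (phi[j] - gamma * phi_next[j])
    return solve(a, b)  # fails if the accumulated matrix is singular


# Invented chain: state 0 -> state 1 -> terminal, reward 1 on every step.
def one_hot(state):
    return [1.0 if state == i else 0.0 for i in range(2)]


episode = [(0, 1.0, 1), (1, 1.0, None)]
theta = lstd_lambda([episode], one_hot, n_features=2, lam=0.5)
```

On this deterministic chain the solve recovers the values 2 and 1 for the two states regardless of λ; on noisy data, λ changes how rewards are credited to earlier states.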

Journal Article•DOI•

[...]

Hsu Chih-Wei¹, Chih-Jen Lin¹•Institutions (1)

National Taiwan University¹

Convergence of a Generalized SMO Algorithm for SVM Classifier Design

TL;DR: Through the design of decomposition methods for bound-constrained SVM formulations, it is demonstrated that the working set selection is not a trivial task and a simple selection is proposed which leads to faster convergences for difficult cases.

Abstract: The decomposition method is currently one of the major methods for solving support vector machines. An important issue of this method is the selection of working sets. In this paper through the design of decomposition methods for bound-constrained SVM formulations we demonstrate that the working set selection is not a trivial task. Then from the experimental analysis we propose a simple selection of the working set which leads to faster convergences for difficult cases. Numerical experiments on different types of problems are conducted to demonstrate the viability of the proposed method.

Journal Article•DOI•

[...]

S. Sathiya Keerthi¹, Elmer G. Gilbert²•Institutions (2)

National University of Singapore¹, University of Michigan²

Efficient SVM Regression Training with SMO

TL;DR: Convergence of a generalized version of the modified SMO algorithms given by Keerthi et al. for SVM classifier design is proved, and the results are extended to modified SMO algorithms for solving ν-SVM classifier problems.

Abstract: Convergence of a generalized version of the modified SMO algorithms given by Keerthi et al. for SVM classifier design is proved. The convergence results are also extended to modified SMO algorithms for solving ν-SVM classifier problems.

Journal Article•DOI•

[...]

Gary W. Flake¹, Steve Lawrence¹•Institutions (1)

Princeton University¹

Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities

TL;DR: This work generalizes SMO so that it can handle regression problems, and addresses problems with several modifications that enable caching to be effectively used with SMO.

Abstract: The sequential minimal optimization algorithm (SMO) has been shown to be an effective method for training support vector machines (SVMs) on classification tasks defined on sparse data sets. SMO differs from most SVM algorithms in that it does not require a quadratic programming solver. In this work, we generalize SMO so that it can handle regression problems. However, one problem with SMO is that its rate of convergence slows down dramatically when data is non-sparse and when there are many support vectors in the solution—as is often the case in regression—because kernel function evaluations tend to dominate the runtime in this case. Moreover, caching kernel function outputs can easily degrade SMO's performance even more because SMO tends to access kernel function outputs in an unstructured manner. We address these problems with several modifications that enable caching to be effectively used with SMO. For regression problems, our modifications improve convergence time by over an order of magnitude.

Journal Article•DOI•

[...]

Peter Sollich¹•Institutions (1)

King's College London¹

TL;DR: A framework for interpreting Support Vector Machines as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors is described, which allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters and how to obtain predictive class probabilities.

Abstract: I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a ‘good’ SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters (the misclassification penalty C, and any parameters specifying the kernel) and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.

Journal Article•DOI•

Bayesian Treed Models

[...]

Hugh A. Chipman¹, Edward I. George², Robert E. McCulloch³•Institutions (3)

University of Waterloo¹, University of Pennsylvania², University of Chicago³

Boosting Methods for Regression

TL;DR: This paper proposes a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression, and illustrates the potential of this approach by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets.

Abstract: When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g. CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this paper, we propose a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets.

Journal Article•DOI•

[...]

Nigel Duffy¹, David P. Helmbold¹•Institutions (1)

University of California, Santa Cruz¹

01 May 2002-Machine Learning

TL;DR: This paper examines ensemble methods for regression that leverage or “boost” base regressors by iteratively calling them on modified samples and bound the complexity of the regression functions produced in order to derive PAC-style bounds on their generalization errors.

Abstract: In this paper we examine ensemble methods for regression that leverage or “boost” base regressors by iteratively calling them on modified samples. The most successful leveraging algorithm for classification is AdaBoost, an algorithm that requires only modest assumptions on the base learning method for its strong theoretical guarantees. We present several gradient descent leveraging algorithms for regression and prove AdaBoost-style bounds on their sample errors using intuitive assumptions on the base learners. We bound the complexity of the regression functions produced in order to derive PAC-style bounds on their generalization errors. Experiments validate our theoretical results.

Journal Article•DOI•

Bayesian Clustering by Dynamics

[...]

Marco F. Ramoni¹, Paola Sebastiani², Paul R. Cohen²•Institutions (2)

Harvard University¹, University of Massachusetts Amherst²

01 Apr 2002-Machine Learning

TL;DR: This paper introduces a Bayesian method for clustering dynamic processes that models dynamics as Markov chains and then applies an agglomerative clustering procedure to discover the most probable set of clusters capturing different dynamics.

Abstract: This paper introduces a Bayesian method for clustering dynamic processes. The method models dynamics as Markov chains and then applies an agglomerative clustering procedure to discover the most probable set of clusters capturing different dynamics. To increase efficiency, the method uses an entropy-based heuristic search strategy. A controlled experiment suggests that the method is very accurate when applied to artificial time series in a broad range of conditions and, when applied to clustering sensor data from mobile robots, it produces clusters that are meaningful in the domain of application.

Journal Article•DOI•

Optimal Ordered Problem Solver

[...]

Jürgen Schmidhuber¹•Institutions (1)

Dalle Molle Institute for Artificial Intelligence Research¹

31 Jul 2002-Machine Learning

TL;DR: An efficient, recursive, backtracking-based way of implementing OOPS on realistic computers with limited storage is introduced, and experiments illustrate how OOPS can greatly profit from metalearning or metasearching, that is, searching for faster search procedures.

Abstract: We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we extend the principles of optimal nonincremental universal search to build an incremental universal learner that is able to improve itself through experience. The initial bias is embodied by a task-dependent probability distribution on possible program prefixes. Prefixes are self-delimiting and executed in online fashion while being generated. They compute the probabilities of their own possible continuations. Let p^n denote a found prefix solving the first n tasks. It may exploit previously stored solutions p^i, i < n, by calling them as subprograms, or by copying them and editing the copies before applying them. We provide equal resources for two searches that run in parallel until p^{n+1} is discovered and stored. The first search is exhaustive; it systematically tests all possible prefixes on all tasks up to n+1. The second search is much more focused; it only searches for prefixes that start with p^n, and only tests them on task n+1, which is safe, because we already know that such prefixes solve all tasks up to n. Both searches are depth-first and bias-optimal: the branches of the search trees are program prefixes, and backtracking is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total search time so far. In illustrative experiments, our self-improver becomes the first general system that learns to solve all n disk Towers of Hanoi tasks (solution size 2^n-1) for n up to 30, profiting from previously solved, simpler tasks involving samples of a simple context free language.

Journal Article•DOI•

Feature Generation Using General Constructor Functions

[...]

Shaul Markovitch¹, Dan Rosenstein¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Oct 2002-Machine Learning

TL;DR: A generalized and flexible framework that is capable of generating features from any given set of constructor functions, and was applied to a variety of classification problems and was able to generate features that were strongly related to the underlying target concepts.

Abstract: Most classification algorithms receive as input a set of attributes of the classified objects. In many cases, however, the supplied set of attributes is not sufficient for creating an accurate, succinct and comprehensible representation of the target concept. To overcome this problem, researchers have proposed algorithms for automatic construction of features. The majority of these algorithms use a limited predefined set of operators for building new features. In this paper we propose a generalized and flexible framework that is capable of generating features from any given set of constructor functions. These can be domain-independent functions such as arithmetic and logic operators, or domain-dependent operators that rely on partial knowledge on the part of the user. The paper describes an algorithm which receives as input a set of classified objects, a set of attributes, and a specification for a set of constructor functions that contains their domains, ranges and properties. The algorithm produces as output a set of generated features that can be used by standard concept learners to create improved classifiers. The algorithm maintains a set of its best generated features and improves this set iteratively. During each iteration, the algorithm performs a beam search over its defined feature space and constructs new features by applying constructor functions to the members of its current feature set. The search is guided by general heuristic measures that are not confined to a specific feature representation. The algorithm was applied to a variety of classification problems and was able to generate features that were strongly related to the underlying target concepts. These features also significantly improved the accuracy achieved by standard concept learners, for a variety of classification problems.

Journal Article•DOI•

Estimating Generalization Error on Two-Class Datasets Using Out-of-Bag Estimates

[...]

Tom Bylander¹•Institutions (1)

University of Texas at San Antonio¹

A Probabilistic Framework for SVM Regression and Error Bar Estimation

TL;DR: For two-class datasets, a method is provided for estimating the generalization error of a bag using out-of-bag estimates; most of the bias is eliminated and accuracy is increased by incorporating a correction based on the distribution of the out-of-bag votes.

Abstract: For two-class datasets, we provide a method for estimating the generalization error of a bag using out-of-bag estimates. In bagging, each predictor (single hypothesis) is learned from a bootstrap sample of the training examples; the output of a bag (a set of predictors) on an example is determined by voting. The out-of-bag estimate is based on recording the votes of each predictor on those training examples omitted from its bootstrap sample. Because no additional predictors are generated, the out-of-bag estimate requires considerably less time than 10-fold cross-validation. We address the question of how to use the out-of-bag estimate to estimate generalization error on two-class datasets. Our experiments on several datasets show that the out-of-bag estimate and 10-fold cross-validation have similar performance, but are both biased. We can eliminate most of the bias in the out-of-bag estimate and increase accuracy by incorporating a correction based on the distribution of the out-of-bag votes.
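
The out-of-bag bookkeeping described in this abstract can be sketched as follows. This is a minimal illustration only, not the paper's method: the 1-D threshold base learner, the function names, and the tie-breaking rule are assumptions made for the example, and the vote-distribution correction mentioned in the abstract is not implemented.

```python
import random

def train_stump(xs, ys):
    """Fit a 1-D threshold classifier (hypothetical base learner for this sketch)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (1 if sign * (x - t) >= 0 else 0) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) >= 0 else 0

def out_of_bag_error(xs, ys, n_predictors=25, seed=0):
    """Return (uncorrected out-of-bag error, per-example vote counts)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # votes[i][c]: out-of-bag votes for class c
    for _ in range(n_predictors):
        # Bootstrap sample: n indices drawn with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        predict = train_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        # Each predictor votes only on the examples omitted from its sample.
        for i in range(n):
            if i not in in_bag:
                votes[i][predict(xs[i])] += 1
    scored = [(v, y) for v, y in zip(votes, ys) if v[0] + v[1] > 0]
    if not scored:
        raise ValueError("no example received an out-of-bag vote")
    # Majority vote per example; ties go to class 0 (an arbitrary choice here).
    wrong = sum(1 for v, y in scored if (1 if v[1] > v[0] else 0) != y)
    return wrong / len(scored), votes
```

Because the votes are collected while the bag is being built, no extra predictors are trained, which is the source of the time saving over 10-fold cross-validation noted in the abstract.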

Journal Article•DOI•

[...]

Junbin Gao¹, Steve R. Gunn¹, Chris Harris¹, M. Brown²•Institutions (2)

University of Southampton¹, IBM²

Continuous-Action Q-Learning

TL;DR: This paper concentrates on the derivation of the evidence and error bar approximation for regression problems; an error bar formula is derived based on the ε-insensitive loss function.

Abstract: In this paper, we elaborate on the well-known relationship between Gaussian Processes (GP) and Support Vector Machines (SVM) under some convex assumptions for the loss functions. This paper concentrates on the derivation of the evidence and error bar approximation for regression problems. An error bar formula is derived based on the ε-insensitive loss function.
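
For reference, the standard form of the ε-insensitive loss named in this abstract is given below. The symbols y (target), f(x) (model prediction) and ε (tolerance width) are notation chosen here, not taken from the listing, and the paper's own formulation may differ in detail.

```latex
L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\; |y - f(x)| - \varepsilon\bigr)
```

Residuals smaller than ε incur no penalty, and larger residuals are penalized linearly.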

Journal Article•DOI•

[...]

José del R. Millán, Daniele Posenato, Eric Dedieu

Model Selection for Small Sample Regression

TL;DR: Experimental results in robotics domains show the superiority of the proposed continuous-action Q-learning over the standard discrete-action version in terms of both asymptotic performance and speed of learning.

Abstract: This paper presents a Q-learning method that works in continuous domains. Other characteristics of our approach are the use of an incremental topology preserving map (ITPM) to partition the input space, and the incorporation of bias to initialize the learning process. A unit of the ITPM represents a limited region of the input space and maps it onto the Q-values of M possible discrete actions. The resulting continuous action is an average of the discrete actions of the “winning unit” weighted by their Q-values. Then, TD(λ) updates the Q-values of the discrete actions according to their contribution. Units are created incrementally and their associated Q-values are initialized by means of domain knowledge. Experimental results in robotics domains show the superiority of the proposed continuous-action Q-learning over the standard discrete-action version in terms of both asymptotic performance and speed of learning. The paper also reports a comparison of discounted-reward against average-reward Q-learning in an infinite horizon robotics task.
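
The action-selection step described in this abstract (a Q-value-weighted average of the winning unit's discrete actions) can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the Q-values are assumed non-negative and used directly as weights, and the fallback for all-zero Q-values is a choice made for the example. The paper's exact weighting scheme may differ.

```python
def continuous_action(actions, q_values):
    """Average the discrete actions of one unit, weighted by their Q-values.

    Assumes non-negative Q-values used directly as weights (an assumption
    of this sketch, not a detail taken from the paper).
    """
    if len(actions) != len(q_values) or not actions:
        raise ValueError("actions and q_values must be non-empty and equal in length")
    if any(q < 0 for q in q_values):
        raise ValueError("this sketch assumes non-negative Q-values")
    total = sum(q_values)
    if total == 0:
        # No preference yet: fall back to the unweighted mean of the actions.
        return sum(actions) / len(actions)
    return sum(a * q for a, q in zip(actions, q_values)) / total
```

For example, `continuous_action([-1.0, 0.0, 1.0], [0.0, 1.0, 1.0])` returns 0.5, halfway between the two actions that carry equal weight.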

Journal Article•DOI•

[...]

Olivier Chapelle, Vladimir Vapnik¹, Yoshua Bengio²•Institutions (2)

AT&T¹, Université de Montréal²

Structural Modelling with Sparse Kernels

TL;DR: This work presents a new penalization method for performing model selection for regression that is appropriate even for small samples, based on an accurate estimator of the ratio of the expected training error and the expected generalization error, in terms of the expected eigenvalues of the input covariance matrix.

Abstract: Model selection is an important ingredient of many machine learning algorithms, in particular when the sample size is small, in order to strike the right trade-off between overfitting and underfitting. Previous classical results for linear regression are based on an asymptotic analysis. We present a new penalization method for performing model selection for regression that is appropriate even for small samples. Our penalization is based on an accurate estimator of the ratio of the expected training error and the expected generalization error, in terms of the expected eigenvalues of the input covariance matrix.

Journal Article•DOI•

[...]

Steve R. Gunn¹, J.S. Kandola¹•Institutions (1)

University of Southampton¹

Large Scale Kernel Regression via Linear Programming

TL;DR: This work describes a transparent, advanced non-linear modelling approach that enables the constructed predictive models to be visualised, allowing model validation and assisting in interpretation, and it is shown to exhibit competitive generalisation performance together with improved interpretability.

Abstract: A widely acknowledged drawback of many statistical modelling techniques, commonly used in machine learning, is that the resulting model is extremely difficult to interpret. A number of new concepts and algorithms have been introduced by researchers to address this problem. They focus primarily on determining which inputs are relevant in predicting the output. This work describes a transparent, advanced non-linear modelling approach that enables the constructed predictive models to be visualised, allowing model validation and assisting in interpretation. The technique combines the representational advantage of a sparse ANOVA decomposition, with the good generalisation ability of a kernel machine. It achieves this by employing two forms of regularisation: a 1-norm based structural regulariser to enforce transparency, and a 2-norm based regulariser to control smoothness. The resulting model structure can be visualised showing the overall effects of different inputs, their interactions, and the strength of the interactions. The robustness of the technique is illustrated using a range of both artificial and “real world” datasets. The performance is compared to other modelling techniques, and it is shown to exhibit competitive generalisation performance together with improved interpretability.

Journal Article•DOI•

[...]

Olvi L. Mangasarian¹, David R. Musicant²•Institutions (2)

University of Wisconsin-Madison¹, Carleton College²

Structure in the Space of Value Functions

TL;DR: The proposed approach tolerates a small error, which is adjusted parametrically, while fitting the given data, which leads to improved fitting of noisy data (over ordinary least error solutions) as demonstrated computationally.

Abstract: The problem of tolerant data fitting by a nonlinear surface, induced by a kernel-based support vector machine, is formulated as a linear program with fewer variables than other linear programming formulations. A generalization of the linear programming chunking algorithm for arbitrary kernels is implemented for solving problems with very large datasets wherein chunking is performed on both data points and problem variables. The proposed approach tolerates a small error, which is adjusted parametrically, while fitting the given data. This leads to improved fitting of noisy data (over ordinary least error solutions) as demonstrated computationally. Comparative numerical results indicate an average time reduction as high as 26.0% over other formulations, with a maximal time reduction of 79.7%. Additionally, linear programs with as many as 16,000 data points and more than a billion nonzero matrix elements are solved.

Journal Article•DOI•

[...]

David J. Foster¹, Peter Dayan•Institutions (1)

University of Edinburgh¹

Sparse Regression Ensembles in Infinite and Finite Hypothesis Spaces

TL;DR: Evidence is presented that fragmentations found using unsupervised, mixture model, learning methods on data derived from optimal value functions for multiple tasks can be of use in a practical reinforcement learning context by facilitating online, actor-critic learning of multiple goals MDPs.

Abstract: Solving in an efficient manner many different optimal control tasks within the same underlying environment requires decomposing the environment into its computationally elemental fragments. We suggest how to find fragmentations using unsupervised, mixture model, learning methods on data derived from optimal value functions for multiple tasks, and show that these fragmentations are in accord with observable structure in the environments. Further, we present evidence that such fragments can be of use in a practical reinforcement learning context, by facilitating online, actor-critic learning of multiple goals MDPs.

Journal Article•DOI•

[...]

Gunnar Rätsch, Ayhan Demiriz¹, Kristin P. Bennett¹•Institutions (1)

Rensselaer Polytechnic Institute¹

On the Dual Formulation of Regularized Linear Systems with Convex Risks

TL;DR: There exists an optimal solution to the infinite hypothesis space problem consisting of a finite number of hypotheses, and two algorithms for solving the infinite and finite hypothesis problems are proposed.

Abstract: We examine methods for constructing regression ensembles based on a linear program (LP). The ensemble regression function consists of linear combinations of base hypotheses generated by some boosting-type base learning algorithm. Unlike the classification case, for regression the set of possible hypotheses producible by the base learning algorithm may be infinite. We explicitly tackle the issue of how to define and solve ensemble regression when the hypothesis space is infinite. Our approach is based on a semi-infinite linear program that has an infinite number of constraints and a finite number of variables. We show that the regression problem is well posed for infinite hypothesis spaces in both the primal and dual spaces. Most importantly, we prove there exists an optimal solution to the infinite hypothesis space problem consisting of a finite number of hypotheses. We propose two algorithms for solving the infinite and finite hypothesis problems. One uses a column generation simplex-type algorithm and the other adopts an exponential barrier approach. Furthermore, we give sufficient conditions for the base learning algorithm and the hypothesis set to be used for infinite regression ensembles. Computational results show that these methods are extremely promising.

Journal Article•DOI•

[...]

Tong Zhang¹•Institutions (1)

IBM¹