Showing papers in "Journal of Machine Learning Research in 2004"

PDF

Open Access

Journal Article•DOI•

RCV1: A New Benchmark Collection for Text Categorization Research

[...]

David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This work describes the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data.

...read moreread less

Abstract: Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection's properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.

...read moreread less

2,852 citations

Journal Article•DOI•

Non-negative Matrix Factorization with Sparseness Constraints

[...]

Patrik O. Hoyer

01 Dec 2004-Journal of Machine Learning Research

TL;DR: In this paper, the notion of sparseness is incorporated into NMF to improve the found decompositions, and the authors provide complete MATLAB code both for standard NMF and for their extension.

...read moreread less

Abstract: Non-negative matrix factorization (NMF) is a recently developed technique for finding parts-based, linear representations of non-negative data. Although it has successfully been applied in several applications, it does not always result in parts-based representations. In this paper, we show how explicitly incorporating the notion of 'sparseness' improves the found decompositions. Additionally, we provide complete MATLAB code both for standard NMF and for our extension. Our hope is that this will further the application of these methods to solving novel data-analysis problems.

...read moreread less

2,663 citations

Journal Article•DOI•

Learning the Kernel Matrix with Semidefinite Programming

[...]

Gert R. G. Lanckriet¹, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, Michael I. Jordan - Show less +1 more•Institutions (1)

University of California, Berkeley¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This paper shows how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques and leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

...read moreread less

Abstract: Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space---classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm---using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

...read moreread less

2,419 citations

Journal Article•

Efficient Feature Selection via Analysis of Relevance and Redundancy

[...]

Lei Yu¹, Huan Liu²•Institutions (2)

Arizona State University¹, Biodesign Institute²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: It is shown that feature relevance alone is insufficient for efficient feature selection of high-dimensional data, and a new framework is introduced that decouples relevance analysis and redundancy analysis.

...read moreread less

Abstract: Feature selection is applied to reduce the number of features in many applications where data has hundreds or thousands of features. Existing feature selection methods mainly focus on finding relevant features. In this paper, we show that feature relevance alone is insufficient for efficient feature selection of high-dimensional data. We define feature redundancy and propose to perform explicit redundancy analysis in feature selection. A new framework is introduced that decouples relevance analysis and redundancy analysis. We develop a correlation-based method for relevance and redundancy analysis, and conduct an empirical study of its efficiency and effectiveness comparing with representative methods.

...read moreread less

1,971 citations

Journal Article•DOI•

Probability Estimates for Multi-class Classification by Pairwise Coupling

[...]

Tingfan Wu, Chih-Jen Lin, Ruby C. Weng

01 Dec 2004-Journal of Machine Learning Research

TL;DR: In this paper, the authors present two approaches for obtaining class probabilities, which can be reduced to linear systems and are easy to implement, and show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).

...read moreread less

Abstract: Pairwise coupling is a popular multi-class classification method that combines all comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998)

...read moreread less

1,888 citations

Journal Article•

In Defense of One-Vs-All Classification

[...]

Ryan Rifkin, Aldebaro Klautau

01 Dec 2004-Journal of Machine Learning Research

TL;DR: It is argued that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines.

...read moreread less

Abstract: We consider the problem of multiclass classification. Our main thesis is that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines. This thesis is interesting in that it disagrees with a large body of recent published work on multiclass classification. We support our position by means of a critical review of the existing literature, a substantial collection of carefully controlled experimental work, and theoretical arguments.

...read moreread less

1,841 citations

Journal Article•

Fast Binary Feature Selection with Conditional Mutual Information

[...]

François Fleuret

01 Dec 2004-Journal of Machine Learning Research

TL;DR: It is shown that this feature selection method outperforms other classical algorithms, and that a naive Bayesian classifier built with features selected that way achieves error rates similar to those of state-of-the-art methods such as boosting or SVMs.

...read moreread less

Abstract: We propose in this paper a very fast feature selection technique based on conditional mutual information. By picking features which maximize their mutual information with the class to predict conditional to any feature already picked, it ensures the selection of features which are both individually informative and two-by-two weakly dependant. We show that this feature selection method outperforms other classical algorithms, and that a naive Bayesian classifier built with features selected that way achieves error rates similar to those of state-of-the-art methods such as boosting or SVMs. The implementation we propose selects 50 features among 40,000, based on a training set of 500 examples in a tenth of a second on a standard 1Ghz PC.

...read moreread less

1,018 citations

Journal Article•DOI•

Feature Selection for Unsupervised Learning

[...]

Jennifer G. Dy, Carla E. Brodley

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This paper explores the feature selection problem and issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood.

...read moreread less

Abstract: In this paper, we identify two issues involved in developing an automated feature subset selection algorithm for unlabeled data: the need for finding the number of clusters in conjunction with feature selection, and the need for normalizing the bias of feature selection criteria with respect to dimension. We explore the feature selection problem and these issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. We present proofs on the dimensionality biases of these feature criteria, and present a cross-projection normalization scheme that can be applied to any criterion to ameliorate these biases. Our experiments show the need for feature selection, the need for addressing these two issues, and the effectiveness of our proposed solutions.

...read moreread less

939 citations

Journal Article•

No Unbiased Estimator of the Variance of K-Fold Cross-Validation

[...]

Yoshua Bengio, Yves Grandvalet

01 Dec 2004-Journal of Machine Learning Research

TL;DR: There exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation, and the main theorem shows that this result is based on the eigen-decomposition of the covariance matrix of errors.

...read moreread less

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to be able to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator of generalization performance. The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation. The analysis that accompanies this result is based on the eigen-decomposition of the covariance matrix of errors, which has only three different eigenvalues corresponding to three degrees of freedom of the matrix and three components of the total variance. This analysis helps to better understand the nature of the problem and how it can make naive estimators (that don't take into account the error correlations due to the overlap between training and test sets) grossly underestimate variance. This is confirmed by numerical experiments in which the three components of the variance are compared when the difficulty of the learning problem and the number of folds are varied.

...read moreread less

869 citations

Journal Article•

The Entire Regularization Path for the Support Vector Machine

[...]

Trevor Hastie¹, Saharon Rosset², Robert Tibshirani¹, Ji Zhu•Institutions (2)

Stanford University¹, IBM²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: An algorithm is derived that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model.

...read moreread less

Abstract: The support vector machine (SVM) is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems a common practice is to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. We illustrate our algorithm on some examples, and use our representation to give further insight into the range of SVM solutions.

...read moreread less

699 citations

Journal Article•DOI•

Image Categorization by Learning and Reasoning with Regions

[...]

Yixin Chen¹, James Z. Wang²•Institutions (2)

University of New Orleans¹, Penn State College of Information Sciences and Technology²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This paper presents a new learning technique, which extends Multiple-Instance Learning (MIL), and its application to the problem of region-based image categorization, and provides experimental results on an image categorizing problem and a drug activity prediction problem.

...read moreread less

Abstract: Designing computer programs to automatically categorize images using low-level features is a challenging research topic in computer vision. In this paper, we present a new learning technique, which extends Multiple-Instance Learning (MIL), and its application to the problem of region-based image categorization. Images are viewed as bags, each of which contains a number of instances corresponding to regions obtained from image segmentation. The standard MIL problem assumes that a bag is labeled positive if at least one of its instances is positive; otherwise, the bag is negative. In the proposed MIL framework, DD-SVM, a bag label is determined by some number of instances satisfying various properties. DD-SVM first learns a collection of instance prototypes according to a Diverse Density (DD) function. Each instance prototype represents a class of instances that is more likely to appear in bags with the specific label than in the other bags. A nonlinear mapping is then defined using the instance prototypes and maps every bag to a point in a new feature space, named the bag feature space. Finally, standard support vector machines are trained in the bag feature space. We provide experimental results on an image categorization problem and a drug activity prediction problem.

...read moreread less

Journal Article•DOI•

Large-Sample Learning of Bayesian Networks is NP-Hard

[...]

David Maxwell Chickering, David Heckerman, Christopher A. Meek

01 Dec 2004-Journal of Machine Learning Research

TL;DR: In this paper, it was shown that identifying high-scoring structures is NP-hard, even when any combination of one or more of the following holds: the generative distribution is perfect with respect to some DAG containing hidden variables; we are given an independence oracle; we were given an inference oracle, and we were also given an information oracle.

...read moreread less

Abstract: In this paper, we provide new complexity results for algorithms that learn discrete-variable Bayesian networks from data. Our results apply whenever the learning algorithm uses a scoring criterion that favors the simplest structure for which the model is able to represent the generative distribution exactly. Our results therefore hold whenever the learning algorithm uses a consistent scoring criterion and is applied to a sufficiently large dataset. We show that identifying high-scoring structures is NP-hard, even when any combination of one or more of the following hold: the generative distribution is perfect with respect to some DAG containing hidden variables; we are given an independence oracle; we are given an inference oracle; we are given an information oracle; we restrict potential solutions to structures in which each node has at most k parents, for all k>=3.Our proof relies on a new technical result that we establish in the appendices. In particular, we provide a method for constructing the local distributions in a Bayesian network such that the resulting joint distribution is provably perfect with respect to the structure of the network.

...read moreread less

Journal Article•

Probability Product Kernels

[...]

Tony Jebara, Risi Kondor, Andrew Howard

01 Dec 2004-Journal of Machine Learning Research

TL;DR: The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions to exploit the properties, metrics and invariances of the generative models the authors infer from each datum.

...read moreread less

Abstract: The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions. In the probability product kernel, data points in the input space are mapped to distributions over the sample space and a general inner product is then evaluated as the integral of the product of pairs of distributions. The kernel is straightforward to evaluate for all exponential family models such as multinomials and Gaussians and yields interesting nonlinear kernels. Furthermore, the kernel is computable in closed form for latent distributions such as mixture models, hidden Markov models and linear dynamical systems. For intractable models, such as switching linear dynamical systems, structured mean-field approximations can be brought to bear on the kernel evaluation. For general distributions, even if an analytic expression for the kernel is not feasible, we show a straightforward sampling method to evaluate it. Thus, the kernel permits discriminative learning methods, including support vector machines, to exploit the properties, metrics and invariances of the generative models we infer from each datum. Experiments are shown using multinomial models for text, hidden Markov models for biological data sets and linear dynamical systems for time series data.

...read moreread less

Journal Article•

Exact Bayesian Structure Discovery in Bayesian Networks

[...]

Mikko Koivisto, Kismat Sood

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This work presents an algorithm that computes the exact posterior probability of a subnetwork, e.g., a directed edge, and shows that also in domains with a large number of variables, exact computation is feasible, given suitable a priori restrictions on the structures.

...read moreread less

Abstract: Learning a Bayesian network structure from data is a well-motivated but computationally hard task. We present an algorithm that computes the exact posterior probability of a subnetwork, e.g., a directed edge; a modified version of the algorithm finds one of the most probable network structures. This algorithm runs in time O(n 2n + nk+1C(m)), where n is the number of network variables, k is a constant maximum in-degree, and C(m) is the cost of computing a single local marginal conditional likelihood for m data instances. This is the first algorithm with less than super-exponential complexity with respect to n. Exact computation allows us to tackle complex cases where existing Monte Carlo methods and local search procedures potentially fail. We show that also in domains with a large number of variables, exact computation is feasible, given suitable a priori restrictions on the structures; combining exact and inexact methods is also possible. We demonstrate the applicability of the presented algorithm on four synthetic data sets with 17, 22, 37, and 100 variables.

...read moreread less

Journal Article•DOI•

The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

[...]

Shie Mannor¹, John N. Tsitsiklis¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This work considers the Multi-armed bandit problem under the PAC (“probably approximately correct”) model and generalizes the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the Arms are not.

...read moreread less

Abstract: We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O((n/e2)log(1/δ)) trials suffices in order to find an e-optimal arm with probability at least 1-δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not, is also discussed. For this case, we provide a lower bound of Θ((1/e2)(n+log(1/δ))) on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish a matching upper and lower bound of the form Θ((n/e2)log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.

...read moreread less

Journal Article•DOI•

Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

[...]

Evan Greensmith, Peter L. Bartlett, Jonathan Baxter

01 Dec 2004-Journal of Machine Learning Research

TL;DR: In this article, the authors consider variance reduction methods that were developed for Monte Carlo estimates of integrals and study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective.

...read moreread less

Abstract: Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.

...read moreread less

Journal Article•

Online Choice of Active Learning Algorithms

[...]

Yoram Baram¹, Ran El-Yaniv¹, Kobi Luz¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: Taking an ensemble containing two of the best known active learning algorithms and a new algorithm, the resulting new active learning master algorithm is empirically shown to consistently perform almost as well as and sometimes outperform the best algorithm in the ensemble on a range of classification problems.

...read moreread less

Abstract: This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in pool-based active learning. We develop an active-learning master algorithm, based on a known competitive algorithm for the multi-armed bandit problem. A major challenge in successfully choosing top performing active learners online is to reliably estimate their progress during the learning session. To this end we propose a simple maximum entropy criterion that provides effective estimates in realistic settings. We study the performance of the proposed master algorithm using an ensemble containing two of the best known active-learning algorithms as well as a new algorithm. The resulting active-learning master algorithm is empirically shown to consistently perform almost as well as and sometimes outperform the best algorithm in the ensemble on a range of classification problems.

...read moreread less

Journal Article•

Subgroup Discovery with CN2-SD

[...]

Nada Lavrač, Branko Kavšek, Peter A. Flach, Ljupčo Todorovski

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures, shows substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under ROC curve.

...read moreread less

Abstract: This paper investigates how to adapt standard classification rule learning approaches to subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The paper presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on 23 UCI data sets shows substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under ROC curve, when compared with the CN2 algorithm. Application of CN2-SD to a large traffic accident data set confirms these findings.

...read moreread less

Journal Article•

Boosting as a Regularized Path to a Maximum Margin Classifier

[...]

Saharon Rosset¹, Ji Zhu, Trevor Hastie²•Institutions (2)

IBM¹, Stanford University²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: It is built on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l1 constraint on the coefficient vector, and shows that as the constraint is relaxed the solution converges (in the separable case) to an "l1-optimal" separating hyper-plane.

...read moreread less

Abstract: In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l1 constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed---or equivalently as the boosting iterations proceed---the solution converges (in the separable case) to an "l1-optimal" separating hyper-plane. We prove that this l1-optimal separating hyper-plane has the property of maximizing the minimal l1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

...read moreread less

Journal Article•

Statistical Analysis of Some Multi-Category Large Margin Classification Methods

[...]

Tong Zhang¹•Institutions (1)

IBM¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: It is shown that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem, which can be useful for statistical inferencing tasks beyond classification.

...read moreread less

Abstract: The purpose of this paper is to investigate statistical properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions of binary large margin classification. We establish conditions that guarantee the consistency of classifiers obtained in the risk minimization framework with respect to the classification error. Examples are provided for four specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information can be useful for statistical inferencing tasks beyond classification.

...read moreread less

Journal Article•

A Fast Algorithm for Joint Diagonalization with Non-orthogonal Transformations and its Application to Blind Source Separation

[...]

Andreas Ziehe¹, Pavel Laskov¹, Guido Nolte², Klaus-Robert Müller³, Klaus-Robert Müller¹ - Show less +1 more•Institutions (3)

Fraunhofer Institute for Open Communication Systems¹, National Institutes of Health², University of Potsdam³

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A new efficient algorithm is presented for joint diagonalization of several matrices based on the Frobenius-norm formulation of the joint diagonalized problem, and addresses diagonalization with a general, non-orthogonal transformation.

...read moreread less

Abstract: A new efficient algorithm is presented for joint diagonalization of several matrices. The algorithm is based on the Frobenius-norm formulation of the joint diagonalization problem, and addresses diagonalization with a general, non-orthogonal transformation. The iterative scheme of the algorithm is based on a multiplicative update which ensures the invertibility of the diagonalizer. The algorithm's efficiency stems from the special approximation of the cost function resulting in a sparse, block-diagonal Hessian to be used in the computation of the quasi-Newton update step. Extensive numerical simulations illustrate the performance of the algorithm and provide a comparison to other leading diagonalization methods. The results of such comparison demonstrate that the proposed algorithm is a viable alternative to existing state-of-the-art joint diagonalization algorithms. The practical use of our algorithm is shown for blind source separation problems.

...read moreread less

Journal Article•DOI•

Bias-Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods

[...]

Giorgio Valentini¹, Thomas G. Dietterich²•Institutions (2)

University of Milan¹, Oregon State University²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: An extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels, shows that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected.

...read moreread less

Abstract: Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels. A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters, offering insights into the way SVMs learn. The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially in Gaussian and polynomial kernels. We show that the bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base learners, and we outline two directions for developing SVM ensembles, exploiting the SVM bias characteristics and the bias-variance dependence on the kernel param

...read moreread less

Journal Article•

Support Vector Machine Soft Margin Classifiers: Error Analysis

[...]

Di-Rong Chen¹, Qiang Wu, Yiming Ying, Ding-Xuan Zhou•Institutions (1)

Beihang University¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A projection operator is introduced, which leads to better sample error estimates especially for small complexity kernels, and the choice of the regularization parameter plays an important role in the analysis.

...read moreread less

Abstract: The purpose of this paper is to provide a PAC error analysis for the q-norm soft margin classifier, a support vector machine classification algorithm. It consists of two parts: regularization error and sample error. While many techniques are available for treating the sample error, much less is known for the regularization error and the corresponding approximation error for reproducing kernel Hilbert spaces. We are mainly concerned about the regularization error. It is estimated for general distributions by a K-functional in weighted Lq spaces. For weakly separable distributions (i.e., the margin may be zero) satisfactory convergence rates are provided by means of separating functions. A projection operator is introduced, which leads to better sample error estimates especially for small complexity kernels. The misclassification error is bounded by the V-risk associated with a general class of loss functions V. The difficulty of bounding the offset is overcome. Polynomial kernels and Gaussian kernels are used to demonstrate the main results. The choice of the regularization parameter plays an important role in our analysis.

...read moreread less

Journal Article•DOI•

Hierarchical Latent Class Models for Cluster Analysis

[...]

Nevin L. Zhang

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A search-based algorithm for learning hierarchical latent class models from data using a framework where the local dependence problem can be addressed in a principled manner is developed.

...read moreread less

Abstract: Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.

...read moreread less

Journal Article•

Rational Kernels: Theory and Algorithms

[...]

Corinna Cortes, Patrick Haffner¹, Mehryar Mohri¹•Institutions (1)

AT&T¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A general family of kernels based on weighted transducers or rational relations, rational kernels, that extend kernel methods to the analysis of variable-length sequences or more generally weighted automata and show that rational kernels are easy to design and implement and lead to substantial improvements of the classification accuracy.

...read moreread less

Abstract: Many classification algorithms were originally designed for fixed-size vectors. Recent applications in text and speech processing and computational biology require however the analysis of variable-length sequences and more generally weighted automata. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels , that extend kernel methods to the analysis of variable-length sequences or more generally weighted automata. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm. Not all rational kernels are positive definite and symmetric (PDS), or equivalently verify the Mercer condition, a condition that guarantees the convergence of training for discriminant classification algorithms such as SVMs. We present several theoretical results related to PDS rational kernels. We show that under some general conditions these kernels are closed under sum, product, or Kleene-closure and give a general method for constructing a PDS rational kernel from an arbitrary transducer defined on some non-idempotent semirings. We give the proof of several characterization results that can be used to guide the design of PDS rational kernels. We also show that some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler, and some string kernels used in the context of computational biology are specific instances of rational kernels. Our results include the proof that the edit-distance over a non-trivial alphabet is not negative definite, which, to the best of our knowledge, was never stated or proved before. Rational kernels can be combined with SVMs to form efficient and powerful techniques for a variety of classification tasks in text and speech processing, or computational biology. We describe examples of general families of PDS rational kernels that are useful in many of these applications and report the result of our experiments illustrating the use of rational kernels in several difficult large-vocabulary spoken-dialog classification tasks based on deployed spoken-dialog systems. Our results show that rational kernels are easy to design and implement and lead to substantial improvements of the classification accuracy.

...read moreread less

Journal Article•DOI•

Reinforcement Learning with Factored States and Actions

[...]

Brian Sallans, Geoffrey E. Hinton

01 Dec 2004-Journal of Machine Learning Research

TL;DR: A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces and shows that the product of experts approximation can be used to solve large problems.

...read moreread less

Abstract: A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and derivatives can be efficiently computed for a product of experts. Actions can be found even in large factored action spaces by the use of Markov chain Monte Carlo sampling. Simulation results show that the product of experts approximation can be used to solve large problems. In one simulation it is used to find actions in action spaces of size 240.

...read moreread less

Journal Article•

Fast String Kernels using Inexact Matching for Protein Sequences

[...]

Christina S. Leslie¹, Rui Kuang¹•Institutions (1)

Columbia University¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: In protein classification experiments on two benchmark SCOP data sets, it is shown that the new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.

...read moreread less

Abstract: We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels -- restricted gappy kernels, substitution kernels, and wildcard kernels -- are based on feature spaces indexed by k-length subsequences ("k-mers") from the string alphabet Σ. However, for all kernels we define here, the kernel value K(x,y) can be computed in O(cK(|x|+|y|)) time, where the constant cK depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant cK = km+1|Σ|m of the (k,m)-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.

...read moreread less

Journal Article•DOI•

Learning Ensembles from Bites: A Scalable and Accurate Approach

[...]

Nitesh V. Chawla, Lawrence O. Hall, Kevin W. Bowyer, W. Philip Kegelmeyer

01 Dec 2004-Journal of Machine Learning Research

TL;DR: Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging.

...read moreread less

Abstract: Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive data sets, because the size of the data set can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.

...read moreread less

Journal Article•

A Geometric Approach to Multi-Criterion Reinforcement Learning

[...]

Shie Mannor, Nahum Shimkin

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This work captures the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type using a stochastic game model, where the learning agent is facing an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued.

...read moreread less

Abstract: We consider the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type. The environment is initially unknown, and furthermore may be affected by the actions of other agents, actions that are observed but cannot be predicted beforehand. We capture this situation using a stochastic game model, where the learning agent is facing an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued. State recurrence conditions are imposed throughout. In our basic problem formulation, a desired target set is specified in the vector reward space, and the objective of the learning agent is to approach the target set, in the sense that the long-term average reward vector will belong to this set. We devise appropriate learning algorithms, that essentially use multiple reinforcement learning algorithms for the standard scalar reward problem, which are combined using the geometric insight from the theory of approachability for vector-valued stochastic games. We then address the more general and optimization-related problem, where a nested class of possible target sets is prescribed, and the goal of the learning agent is to approach the smallest possible target set (which will generally depend on the unknown system parameters). A particular case which falls into this framework is that of stochastic games with average reward constraints, and further specialization provides a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.

...read moreread less

Journal Article•DOI•

The Dynamics of AdaBoost: Cyclic Behavior and Convergence of Margins

[...]

Cynthia Rudin¹, Cynthia Rudin², Ingrid Daubechies¹, Robert E. Schapire•Institutions (2)

Princeton University¹, New York University²

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This work reduces AdaBoost to a nonlinear iterated map and studies the evolution of its weight vectors to understand AdaBoost's convergence properties completely, and shows that AdaBoost does not always converge to a maximum margin combined classifier, answering an open question.

...read moreread less

Abstract: In order to study the convergence properties of the AdaBoost algorithm, we reduce AdaBoost to a nonlinear iterated map and study the evolution of its weight vectors. This dynamical systems approach allows us to understand AdaBoost's convergence properties completely in certain cases; for these cases we find stable cycles, allowing us to explicitly solve for AdaBoost's output.Using this unusual technique, we are able to show that AdaBoost does not always converge to a maximum margin combined classifier, answering an open question. In addition, we show that "non-optimal" AdaBoost (where the weak learning algorithm does not necessarily choose the best weak classifier at each iteration) may fail to converge to a maximum margin classifier, even if "optimal" AdaBoost produces a maximum margin. Also, we show that if AdaBoost cycles, it cycles among "support vectors", i.e., examples that achieve the same smallest margin.

...read moreread less