Showing papers in "Journal of Machine Learning Research in 2008"

PDF

Open Access

Journal Article•

[...]

Laurens van der Maaten, Geoffrey E. Hinton

01 Jan 2008-Journal of Machine Learning Research

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

...read moreread less

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.

...read moreread less

30,124 citations

Journal Article•

LIBLINEAR: A Library for Large Linear Classification

[...]

Rong-En Fan¹, Kai-Wei Chang¹, Cho-Jui Hsieh¹, Xiang-Rui Wang¹, Chih-Jen Lin¹ - Show less +1 more•Institutions (1)

National Taiwan University¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.

...read moreread less

Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.

...read moreread less

7,848 citations

Journal Article•DOI•

Mixed Membership Stochastic Blockmodels

[...]

Edoardo M. Airoldi¹, David M. Blei, Stephen E. Fienberg, Eric P. Xing•Institutions (1)

Princeton University¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: In this article, the authors introduce a class of variance allocation models for pairwise measurements, called mixed membership stochastic blockmodels, which combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters (mixed membership), and develop a general variational inference algorithm for fast approximate posterior inference.

...read moreread less

Abstract: Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.

...read moreread less

1,803 citations

Journal Article•

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

[...]

Andreas Krause, Ajit P. Singh, Carlos Guestrin

01 Jun 2008-Journal of Machine Learning Research

TL;DR: It is proved that the problem of finding the configuration that maximizes mutual information is NP-complete, and a polynomial-time approximation is described that is within (1-1/e) of the optimum by exploiting the submodularity of mutual information.

...read moreread less

Abstract: When monitoring spatial phenomena, which can often be modeled as Gaussian processes (GPs), choosing sensor locations is a fundamental task. There are several common strategies to address this task, for example, geometry or disk models, placing sensors at the points of highest entropy (variance) in the GP model, and A-, D-, or E-optimal design. In this paper, we tackle the combinatorial optimization problem of maximizing the mutual information between the chosen locations and the locations which are not selected. We prove that the problem of finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1-1/e) of the optimum by exploiting the submodularity of mutual information. We also show how submodularity can be used to obtain online bounds, and design branch and bound search procedures. We then extend our algorithm to exploit lazy evaluations and local structure in the GP, yielding significant speedups. We also extend our approach to find placements which are robust against node failures and uncertainties in the model. These extensions are again associated with rigorous theoretical approximation guarantees, exploiting the submodularity of the objective function. We demonstrate the advantages of our approach towards optimizing mutual information in a very extensive empirical study on two real-world data sets.

...read moreread less

1,593 citations

Journal Article•

An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

[...]

Salvador García, Francisco Herrera

01 Jan 2008-Journal of Machine Learning Research

TL;DR: The paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method, but it does not deal with some advanced topics in depth.

...read moreread less

Abstract: In a recently published paper in JMLR, Demˇ sar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.

...read moreread less

1,312 citations

Journal Article•DOI•

Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

[...]

Onureena Banerjee¹, Laurent El Ghaoui¹, Alexandre d'Aspremont•Institutions (1)

University of California, Berkeley¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This work considers the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse, and presents two new algorithms for solving problems with at least a thousand nodes in the Gaussian case.

...read moreread less

Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added l1-norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive l1-norm penalized regression. Our second algorithm, based on Nesterov's first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data.

...read moreread less

1,189 citations

Journal Article•DOI•

Consistency of the Group Lasso and Multiple Kernel Learning

[...]

Francis Bach

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This paper derives necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, and proposes an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied.

...read moreread less

Abstract: We consider the least-square regression problem with regularization by a block l1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the l1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model mis specification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covar iance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied.

...read moreread less

687 citations

Journal Article•

A Tutorial on Conformal Prediction

[...]

Glenn Shafer, Vladimir Vovk¹•Institutions (1)

Royal Holloway, University of London¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples of how the model under which successive examples are sampled independently from the same distribution can be applied to any method for producing ŷ.

...read moreread less

Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability e, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 e. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 e of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005).

...read moreread less

648 citations

Journal Article•

Consistency of Random Forests and Other Averaging Classifiers

[...]

Gérard Biau, Luc Devroye, Gábor Lugosi

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A number of theorems are given that establish the universal consistency of averaging rules, and it is shown that some popular classifiers, including one suggested by Breiman, are not universally consistent.

...read moreread less

Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

...read moreread less

521 citations

Journal Article•

Finite-Time Bounds for Fitted Value Iteration

[...]

Rémi Munos, Csaba Szepesvári

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available.

...read moreread less

Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted Lp-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.

...read moreread less

441 citations

Journal Article•DOI•

Optimization Techniques for Semi-Supervised Support Vector Machines

[...]

Olivier Chapelle¹, Vikas Sindhwani, S. Sathiya Keerthi•Institutions (1)

Max Planck Society¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: The performance and behavior of various S3VMs algorithms is studied together, under a common experimental setting, to review key ideas in this literature on semi-supervised support Vector Machines.

...read moreread less

Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VMs algorithms is studied together, under a common experimental setting.

...read moreread less

Journal Article•DOI•

Classification with a Reject Option using a Hinge Loss

[...]

Peter L. Bartlett, Marten Wegkamp

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This work considers the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation and proposes a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs).

...read moreread less

Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate lossthe φ-riskalso minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y=1|X) is unlikely to be close to certain critical values.

...read moreread less

Journal Article•

Approximations for Binary Gaussian Process Classification

[...]

Hannes Nickisch¹, Carl Edward Rasmussen¹•Institutions (1)

Max Planck Society¹

16 Sep 2008-Journal of Machine Learning Research

TL;DR: A comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification and the relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results.

...read moreread less

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

...read moreread less

Journal Article•DOI•

Trust Region Newton Method for Logistic Regression

[...]

Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This paper applies a trust region Newton method to maximize the log-likelihood of the logistic regression model, and extends the proposed method to large-scale L2-loss linear support vector machines (SVM).

...read moreread less

Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM).

...read moreread less

Journal Article•DOI•

Optimal Solutions for Sparse Principal Component Analysis

[...]

Alexandre d'Aspremont¹, Francis Bach², Laurent El Ghaoui³•Institutions (3)

Princeton University¹, École Normale Supérieure², University of California, Berkeley³

01 Jun 2008-Journal of Machine Learning Research

TL;DR: In this paper, a new semidefinite relaxation is proposed to solve the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination.

...read moreread less

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3), per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases.

...read moreread less

Journal Article•

Accelerated Neural Evolution through Cooperatively Coevolved Synapses

[...]

Faustino Gomez¹, Jürgen Schmidhuber¹, Jürgen Schmidhuber², Risto Miikkulainen•Institutions (2)

University of Lugano¹, Technische Universität München²

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This paper compares a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large state spaces and hidden state.

...read moreread less

Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks.

...read moreread less

Journal Article•

Robust Submodular Observation Selection

[...]

Andreas Krause¹, H. Brendan McMahan, Carlos Guestrin, Anupam Gupta•Institutions (1)

Carnegie Mellon University¹

01 Jan 2008-Journal of Machine Learning Research

TL;DR: This paper presents the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property, and proves that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms.

...read moreread less

Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. c ©2008 Andreas Krause, H. Brendan McMahan, Carlos Guestrin and Anupam Gupta. KRAUSE, MCMAHAN, GUESTRIN AND GUPTA

...read moreread less

Journal Article•

Learning from Multiple Sources

[...]

Koby Crammer, Michael Kearns, Jennifer Wortman

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A general theory of which samples should be used to learn models for each source of "nearby" data is provided, applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family.

...read moreread less

Abstract: We consider the problem of learning accurate models from multiple sources of "nearby" data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations.

...read moreread less

Journal Article•

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

[...]

Kai-Wei Chang¹, Cho-Jui Hsieh¹, Chih-Jen Lin¹•Institutions (1)

National Taiwan University¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A novel coordinate descent algorithm for training linear SVM with the L2-loss function that is more efficient and stable than state of the art methods such as Pegasos and TRON.

...read moreread less

Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON.

...read moreread less

Journal Article•DOI•

Consistency of Trace Norm Minimization

[...]

Francis Bach

01 Jun 2008-Journal of Machine Learning Research

TL;DR: In this article, the rank consistency of trace norm minimization with the square loss was investigated and the necessary and sufficient conditions for rank consistency were provided for the non-adaptive version and the adaptive version.

...read moreread less

Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled.

...read moreread less

Journal Article•DOI•

LIBLINEAR: A Library for Large Linear Classification

[...]

FanRong-En, ChangKai-Wei, HsiehCho-Jui, WangXiang-Rui, LinChih-Jen - Show less +1 more

01 Jun 2008-Journal of Machine Learning Research

TL;DR: LIBLINEAR as discussed by the authors is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines, and provides easy-to-use command-line tools and library support.

...read moreread less

Journal Article•

Using Markov Blankets for Causal Structure Learning

[...]

Jean-Philippe Pellet¹, André Elisseeff¹•Institutions (1)

IBM¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: After running a series of comparative experiments on five artificial networks, it is argued that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provides higher structural accuracy.

...read moreread less

Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children's parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw, with an explicit backward feature-selection heuristics, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provides higher structural accuracy.

...read moreread less

Journal Article•

Complete Identification Methods for the Causal Hierarchy

[...]

Ilya Shpitser¹, Judea Pearl²•Institutions (2)

Johns Hopkins University¹, University of California, Los Angeles²

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This work completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provides algorithms that accomplish this computation.

...read moreread less

Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple "parallel worlds" and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy.

...read moreread less

Journal Article•DOI•

Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

[...]

Michael Collins¹, Amir Globerson¹, Terry Koo¹, Xavier Carreras¹, Peter L. Bartlett² - Show less +1 more•Institutions (2)

Massachusetts Institute of Technology¹, University of California, Berkeley²

01 Jun 2008-Journal of Machine Learning Research

TL;DR: Examining exponentiated gradient algorithms for training log-linear and maximum-margin models describes how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing.

...read moreread less

Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/e) EG updates are required to reach a given accuracy e in the dual; in contrast, for log-linear models only O(log(1/e)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.

...read moreread less

Journal Article•DOI•

Universal Multi-Task Kernels

[...]

Andrea Caponnetto, Charles A. Micchelli¹, Massimiliano Pontil², Yiming Ying³•Institutions (3)

State University of New York System¹, University College London², University of Bristol³

01 Jun 2008-Journal of Machine Learning Research

TL;DR: The primary goal here is to derive conditions which ensure that the kernel K is universal, which means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel.

...read moreread less

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y, an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y. Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results.

...read moreread less

Journal Article•

Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

[...]

Manfred K. Warmuth, Dima Kuzmin

01 Jan 2008-Journal of Machine Learning Research

TL;DR: The methodology in the expert setting of online learning is developed by giving an algorithm for learning as well as the best subset of experts of a certain size and then lifted to the matrix setting where the subsets of experts correspond to subspaces.

...read moreread less

Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2) per trial, where n is the dimension of the instances.

...read moreread less

Journal Article•

Active Learning of Causal Networks with Intervention Experiments and Optimal Designs

[...]

Yangbo He, Zhi Geng

01 Jan 2008-Journal of Machine Learning Research

TL;DR: It is shown theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph.

...read moreread less

Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph.

...read moreread less

Journal Article•

Evidence Contrary to the Statistical View of Boosting

[...]

David Mease¹, Abraham J. Wyner²•Institutions (2)

San Jose State University¹, University of Pennsylvania²

01 Jun 2008-Journal of Machine Learning Research

TL;DR: In this article, the authors present empirical evidence that raises questions about the statistical perspective on boosting algorithms and reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view.

...read moreread less

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences.

...read moreread less

Journal Article•DOI•

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

[...]

KrauseAndreas, SinghAjit, GuestrinCarlos

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This paper presents a meta-modelling architecture suitable for discrete-time decision-making that automates the very labor-intensive and therefore time-heavy and therefore expensive process of manually selecting sensor locations for spatial phenomena monitoring.

...read moreread less

Journal Article•DOI•

Regularization on Graphs with Function-adapted Diffusion Processes

[...]

Arthur Szlam¹, Mauro Maggioni², Ronald R. Coifman³•Institutions (3)

University of California, Los Angeles¹, Duke University², Yale University³

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods is presented.

...read moreread less

Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods.

...read moreread less