Showing papers in "arXiv: Machine Learning in 2009"

Open Access

Posted Content•

Online Learning for Matrix Factorization and Sparse Coding

Julien Mairal¹, Francis Bach¹, Jean Ponce¹, Guillermo Sapiro•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Aug 2009-arXiv: Machine Learning

TL;DR: A new online optimization algorithm is proposed, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems.

Abstract: Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.

2,256 citations

Posted Content•

The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

Han Liu¹, John Lafferty¹, Larry Wasserman¹•Institutions (1)

Carnegie Mellon University¹

03 Mar 2009-arXiv: Machine Learning

TL;DR: In this paper, a semiparametric Gaussian copula, the nonparanormal, is used for high-dimensional inference on sparse undirected graphs for real-valued data.

Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula--or "nonparanormal"--for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many examples.
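
The idea of transforming each variable by a smooth monotone function so that it looks Gaussian can be illustrated with a rank-based normal-scores transform. This is a minimal sketch and not the estimator derived in the paper: the function name `normal_scores` and the `rank / (n + 1)` scaling are assumptions chosen for illustration.

```python
from statistics import NormalDist

def normal_scores(column):
    """Map one variable to approximate Gaussian scores via its ranks."""
    n = len(column)
    # Rank each observation (1 = smallest); ties are broken by position.
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    std_normal = NormalDist()
    # rank / (n + 1) keeps the empirical CDF strictly inside (0, 1),
    # so the inverse normal CDF is always finite (illustrative choice).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

# A heavily skewed variable is mapped to symmetric scores around zero.
skewed = [0.1, 0.5, 2.0, 30.0, 400.0]
scores = normal_scores(skewed)
```

After such a transform is applied to every variable, a sparse Gaussian graph estimator could be run on the transformed data; that second step is left out of the sketch.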

541 citations

Posted Content•

Structured Variable Selection with Sparsity-Inducing Norms

Rodolphe Jenatton¹, Jean-Yves Audibert¹, Francis Bach¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

22 Apr 2009-arXiv: Machine Learning

TL;DR: In this paper, the authors consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms, defined as sums of Euclidean norms on certain subsets of variables.

Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.
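
The norm described in the abstract, a sum of Euclidean norms over subsets of variables that may overlap, can be written in a few lines. This is a minimal unweighted sketch; the name `structured_norm` and the example groups are assumptions for illustration only.

```python
import math

def structured_norm(w, groups):
    """Sum of Euclidean norms of w restricted to each (possibly overlapping) group."""
    return sum(math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)

w = [3.0, 4.0, 0.0]
l1_groups = [[0], [1], [2]]      # singleton groups recover the l1-norm
overlapping = [[0, 1], [1, 2]]   # variable 1 belongs to two groups

print(structured_norm(w, l1_groups))    # 7.0
print(structured_norm(w, overlapping))  # 9.0
```

With singleton groups the value is the $\ell_1$-norm of `w`; with disjoint groups it is the group $\ell_1$-norm; letting groups overlap gives the extension the abstract refers to.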

503 citations

Posted Content•

Learning Bayesian Networks with the bnlearn R Package

Marco Scutari

26 Aug 2009-arXiv: Machine Learning

TL;DR: bnlearn is an R package which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables.

Abstract: bnlearn is an R package which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package.

374 citations

Posted Content•

Under-determined reverberant audio source separation using a full-rank spatial covariance model

Ngoc Q. K. Duong¹, Emmanuel Vincent¹, Rémi Gribonval¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Dec 2009-arXiv: Machine Learning

TL;DR: This paper addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation by modeling the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source.

Abstract: This article addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source. We then consider four specific covariance models, including a full-rank unconstrained model. We derive a family of iterative expectation-maximization (EM) algorithms to estimate the parameters of each model and propose suitable procedures to initialize the parameters and to align the order of the estimated sources across all frequency bins based on their estimated directions of arrival (DOA). Experimental results over reverberant synthetic mixtures and live recordings of speech data show the effectiveness of the proposed approach.

354 citations

Posted Content•

Structured Sparse Principal Component Analysis

Rodolphe Jenatton¹, Guillaume Obozinski¹, Francis Bach¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

08 Sep 2009-arXiv: Machine Learning

TL;DR: In this article, an extension of sparse PCA is presented in which the sparsity patterns of all dictionary elements are structured and constrained to belong to a prespecified set of shapes, based on a structured regularization recently introduced by [1].

Abstract: We present an extension of sparse PCA, or sparse dictionary learning, where the sparsity patterns of all dictionary elements are structured and constrained to belong to a prespecified set of shapes. This structured sparse PCA is based on a structured regularization recently introduced by [1]. While classical sparse priors only deal with cardinality, the regularization we use encodes higher-order information about the data. We propose an efficient and simple optimization procedure to solve this problem. Experiments with two practical tasks, face recognition and the study of the dynamics of a protein complex, demonstrate the benefits of the proposed structured approach over unstructured approaches.

306 citations

Posted Content•

Distance Dependent Chinese Restaurant Processes

David M. Blei, Peter I. Frazier

06 Oct 2009-arXiv: Machine Learning

TL;DR: The distance dependent Chinese restaurant process (CRP) is a flexible class of distributions over partitions that allows for non-exchangeability and can be used to model many kinds of dependencies between data in infinite clustering models.

Abstract: We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for non-exchangeability. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies across time or space. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both observed and mixture settings. We study its performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data. We also show its alternative formulation of the traditional CRP leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation.
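
A prior draw from a distance dependent CRP can be sketched as follows, assuming the usual customer-link construction: each customer links to another customer with weight given by a decay function of their distance, or to itself with weight `alpha`, and clusters are the connected components of the links. The abstract does not spell out this construction, so the function name, the exponential decay, and the parameter values below are assumptions for illustration.

```python
import math
import random

def sample_ddcrp_partition(distances, alpha, decay, rng):
    """Draw one partition: each customer links to another customer or to itself."""
    n = len(distances)
    links = []
    for i in range(n):
        # Weight alpha for a self-link, decay(d_ij) for a link to customer j.
        weights = [alpha if j == i else decay(distances[i][j]) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])

    # Clusters are the connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Five observations at two well-separated groups of time stamps.
times = [0.0, 0.1, 0.2, 5.0, 5.1]
dist = [[abs(a - b) for b in times] for a in times]
labels = sample_ddcrp_partition(
    dist, alpha=1.0, decay=lambda d: math.exp(-d), rng=random.Random(0)
)
```

Because the link weights depend on pairwise distances, nearby observations are more likely to share a cluster, which is how the prior departs from exchangeability.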

293 citations

Posted Content•

Taking Advantage of Sparsity in Multi-Task Learning

Karim Lounici, Massimiliano Pontil¹, Alexandre B. Tsybakov², Sara van de Geer³•Institutions (3)

University College London¹, University of Paris², ETH Zurich³

09 Mar 2009-arXiv: Machine Learning

TL;DR: The Group Lasso is considered as a candidate estimation method and it is shown that this estimator enjoys nice sparsity oracle inequalities and variable selection properties and can be extended to more general noise distributions, of which it only requires the variance to be finite.

Abstract: We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning (Argyriou et al. [2008]), we assume that the regression vectors share the same sparsity pattern. This means that the set of relevant predictor variables is the same across the different equations. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work in Bickel et al. [2007] and Lounici [2008]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.
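
The shared sparsity pattern can be made concrete with the block soft-thresholding step that many Group Lasso solvers use. This is a generic sketch and is not taken from the paper: rows are assumed to hold one predictor's coefficients across all tasks, and the name `group_soft_threshold` is hypothetical.

```python
import math

def group_soft_threshold(rows, threshold):
    """Shrink each row (one predictor across all tasks) toward zero as a block."""
    shrunk = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # A row whose norm is below the threshold is zeroed for every task at once.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk

# Three predictors, two tasks: one strong row, one weak row, one empty row.
coefficients = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
result = group_soft_threshold(coefficients, threshold=1.0)
```

The strong predictor is kept (scaled by 0.8) in both tasks, while the weak one is removed from both, so the selected set of predictors is the same across the regression equations.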

241 citations

Posted Content•

Sparse Canonical Correlation Analysis

David R. Hardoon¹, John Shawe-Taylor²•Institutions (2)

Institute for Infocomm Research Singapore¹, University College London²

19 Aug 2009-arXiv: Machine Learning

TL;DR: In this article, a sparse convex framework for solving Canonical Correlation Analysis (CCA) was proposed, which minimizes the number of features used in both the primal and dual projections while maximising the correlation between the two views.

Abstract: We present a novel method for solving Canonical Correlation Analysis (CCA) in a sparse convex framework using a least squares approach. The presented method focuses on the scenario when one is interested in (or limited to) a primal representation for the first view while having a dual representation for the second view. Sparse CCA (SCCA) minimises the number of features used in both the primal and dual projections while maximising the correlation between the two views. The method is demonstrated on two paired corpuses of English-French and English-Spanish for mate-retrieval. We are able to observe, in the mate-retrieval, that when the number of the original features is large SCCA outperforms Kernel CCA (KCCA), learning the common semantic space from a sparse set of features.

213 citations

Posted Content•

Bayesian Agglomerative Clustering with Coalescents

Yee Whye Teh, Hal Daumé, Daniel M. Roy

04 Jul 2009-arXiv: Machine Learning

TL;DR: In this paper, a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent is introduced, which operates in a bottom-up agglomerative fashion.

Abstract: We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.

135 citations

Posted Content•

Composite Binary Losses

Mark D. Reid¹, Robert C. Williamson¹•Institutions (1)

Australian National University¹

17 Dec 2009-arXiv: Machine Learning

TL;DR: This paper characterised when margin losses can be proper composite losses, explicitly showed how to determine a symmetric loss in full from half of one of its partial losses, and gave a complete characterisation of the relationship between proper losses and "classification calibrated" losses.

Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and "classification calibrated" losses. We also consider the question of the "best" surrogate binary loss. We introduce a precise notion of "best" and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of "surrogate tuning". Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise.
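
A familiar instance of a margin loss that is also a proper composite loss is the logistic loss: it equals the log loss (a proper loss on probabilities) evaluated at the inverse logit link of the score. The sketch below checks this identity numerically; the function names are chosen for illustration and do not come from the paper.

```python
import math

def log_loss(y, p):
    """Proper loss for a class probability estimate p, with label y in {-1, +1}."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def inverse_logit(v):
    """Inverse link: maps a real-valued score v to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def composite_loss(y, v):
    """Proper loss composed with the inverse link."""
    return log_loss(y, inverse_logit(v))

def logistic_margin_loss(y, v):
    """The margin form, which depends on y and v only through the product y * v."""
    return math.log(1.0 + math.exp(-y * v))

# The two forms agree for every label and score tried here.
max_gap = max(
    abs(composite_loss(y, v) - logistic_margin_loss(y, v))
    for y in (-1, 1)
    for v in (-2.0, 0.0, 0.5, 3.0)
)
```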

Journal Article•DOI•

Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping

Seyoung Kim, Eric P. Xing

08 Sep 2009-arXiv: Machine Learning

TL;DR: A tree-guided group lasso is proposed for estimating structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree, and a systematic weighting scheme for the overlapping groups in the tree-penalty is described.

Abstract: We consider the problem of estimating a sparse multi-response regression function, with an application to expression quantitative trait locus (eQTL) mapping, where the goal is to discover genetic variations that influence gene-expression levels. In particular, we investigate a shrinkage technique capable of capturing a given hierarchical structure over the responses, such as a hierarchical clustering tree with leaf nodes for responses and internal nodes for clusters of related responses at multiple granularity, and we seek to leverage this structure to recover covariates relevant to each hierarchically-defined cluster of responses. We propose a tree-guided group lasso, or tree lasso, for estimating such structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree. We describe a systematic weighting scheme for the overlapping groups in the tree-penalty such that each regression coefficient is penalized in a balanced manner despite the inhomogeneous multiplicity of group memberships of the regression coefficients due to overlaps among groups. For efficient optimization, we employ a smoothing proximal gradient method that was originally developed for a general class of structured-sparsity-inducing penalties. Using simulated and yeast data sets, we demonstrate that our method shows a superior performance in terms of both prediction errors and recovery of true sparsity patterns, compared to other methods for learning a multivariate-response regression.

Posted Content•

A more robust boosting algorithm

Yoav Freund

13 May 2009-arXiv: Machine Learning

TL;DR: This work presents a new boosting algorithm, motivated by the large margins theory for boosting, that is significantly more robust against label noise than existing boosting algorithms.

Abstract: We present a new boosting algorithm, motivated by the large margins theory for boosting. We give experimental evidence that the new algorithm is significantly more robust against label noise than existing boosting algorithms.

Posted Content•

Empirical Bernstein Bounds and Sample Variance Penalization

Andreas Maurer¹, Massimiliano Pontil¹•Institutions (1)

University College London¹

21 Jul 2009-arXiv: Machine Learning

TL;DR: In this paper, the authors consider sample variance penalization, a learning method which takes into account the empirical variance of the loss function, and give conditions under which the method is effective.

Abstract: We give improved constants for data dependent and variance sensitive confidence bounds, called empirical Bernstein bounds, and extend these inequalities to hold uniformly over classes of functions whose growth function is polynomial in the sample size n. The bounds lead us to consider sample variance penalization, a novel learning method which takes into account the empirical variance of the loss function. We give conditions under which sample variance penalization is effective. In particular, we present a bound on the excess risk incurred by the method. Using this, we argue that there are situations in which the excess risk of our method is of order $1/n$, while the excess risk of empirical risk minimization is of order $1/\sqrt{n}$. We show some experimental results, which confirm the theory. Finally, we discuss the potential application of our results to sample compression schemes.
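
To see why a variance-sensitive bound can help, the sketch below compares a Hoeffding upper confidence bound with an empirical Bernstein bound for values in [0, 1]. The abstract does not state the inequality, so the form used here, mean + sqrt(2 V ln(2/delta) / n) + 7 ln(2/delta) / (3 (n - 1)) with V the sample variance, is an assumption recalled from the usual statement of the result and its constants should be checked against the paper.

```python
import math
from statistics import mean, variance

def empirical_bernstein_upper(sample, delta):
    """Upper confidence bound on the mean using the sample variance (assumed form)."""
    n = len(sample)
    log_term = math.log(2.0 / delta)
    v = variance(sample)  # unbiased sample variance
    return (
        mean(sample)
        + math.sqrt(2.0 * v * log_term / n)
        + 7.0 * log_term / (3.0 * (n - 1))
    )

def hoeffding_upper(sample, delta):
    """Variance-blind upper confidence bound for values in [0, 1]."""
    n = len(sample)
    return mean(sample) + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Losses that barely vary: the variance-sensitive bound is much tighter.
low_variance_losses = [0.49, 0.51] * 500
eb = empirical_bernstein_upper(low_variance_losses, delta=0.05)
hoeffding = hoeffding_upper(low_variance_losses, delta=0.05)
```

When the sample variance is small, the square-root term nearly vanishes and the remaining term decays like $1/n$, which matches the contrast between $1/n$ and $1/\sqrt{n}$ rates drawn in the abstract.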

Posted Content•

Visualizing Topics with Multi-Word Expressions

David M. Blei, John Lafferty

06 Jul 2009-arXiv: Machine Learning

TL;DR: A new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models, based on a language model of arbitrary length expressions, which outperforms the more standard use of $\chi^2$ and likelihood ratio tests.

Abstract: We describe a new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models. Our method finds significant $n$-grams related to a topic, which are then used to help understand and interpret the underlying distribution. Compared with the usual visualization, which simply lists the most probable topical terms, the multi-word expressions provide a better intuitive impression for what a topic is "about." Our approach is based on a language model of arbitrary length expressions, for which we develop a new methodology based on nested permutation tests to find significant phrases. We show that this method outperforms the more standard use of $\chi^2$ and likelihood ratio tests. We illustrate the topic presentations on corpora of scientific abstracts and news articles.

Posted Content•

Learning the Structure of Deep Sparse Graphical Models

Ryan P. Adams, Hanna Wallach, Zoubin Ghahramani

31 Dec 2009-arXiv: Machine Learning

TL;DR: The cascading Indian buffet process (CIBP) is introduced, which provides a prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference.

Abstract: Deep belief networks are a powerful way to model complex probability distributions. However, learning the structure of a belief network, particularly one with hidden units, is difficult. The Indian buffet process has been used as a nonparametric Bayesian prior on the directed structure of a belief network with a single infinitely wide hidden layer. In this paper, we introduce the cascading Indian buffet process (CIBP), which provides a nonparametric prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network so each unit can additionally vary its behavior between discrete and continuous representations. We provide Markov chain Monte Carlo algorithms for inference in these belief networks and explore the structures learned on several image data sets.

Posted Content•

The Benefit of Group Sparsity

Junzhou Huang, Tong Zhang¹•Institutions (1)

Rutgers University¹

20 Jan 2009-arXiv: Machine Learning

TL;DR: In this article, the authors developed a theory for group Lasso using a concept called strong group sparsity and showed that group lasso is superior to standard lasso for strongly group-sparse signals.

Abstract: This paper develops a theory for group Lasso using a concept called strong group sparsity. Our result shows that group Lasso is superior to standard Lasso for strongly group-sparse signals. This provides a convincing theoretical justification for using group sparse regularization when the underlying group structure is consistent with the data. Moreover, the theory predicts some limitations of the group Lasso formulation that are confirmed by simulation studies.

Posted Content•

Discrete Temporal Models of Social Networks

Steve Hanneke, Wenjie Fu, Eric P. Xing

09 Aug 2009-arXiv: Machine Learning

TL;DR: The authors propose a family of statistical models for social network evolution over time, which represent an extension of Exponential Random Graph Models (ERGMs) and give examples of their use for hypothesis testing and classification.

Abstract: We propose a family of statistical models for social network evolution over time, which represents an extension of Exponential Random Graph Models (ERGMs). Many of the methods for ERGMs are readily adapted for these models, including maximum likelihood estimation algorithms. We discuss models of this type and their properties, and give examples, as well as a demonstration of their use for hypothesis testing and classification. We believe our temporal ERG models represent a useful new framework for modeling time-evolving social networks, and rewiring networks from other domains such as gene regulation circuitry, and communication networks.

Journal Article•DOI•

Kullback-Leibler aggregation and misspecified generalized linear models

Philippe Rigollet

16 Nov 2009-arXiv: Machine Learning

TL;DR: In a regression setup with deterministic design, the pure aggregation problem is studied and it is shown that this problem can be solved by constrained and/or penalized likelihood maximization and the bounds are proved to be optimal in a minimax sense.

Abstract: In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense.

Posted Content•

An Iterative Algorithm for Fitting Nonconvex Penalized Generalized Linear Models with Grouped Predictors

Yiyuan She¹•Institutions (1)

Florida State University¹

29 Nov 2009-arXiv: Machine Learning

TL;DR: In this article, nonconvex penalized generalized linear models with grouped predictors are investigated, a simple-to-implement algorithm is proposed for computation, and a rigorous theoretical result guarantees its convergence and provides tight preliminary scaling.

Abstract: High-dimensional data pose challenges in statistical learning and modeling. Sometimes the predictors can be naturally grouped where pursuing the between-group sparsity is desired. Collinearity may occur in real-world high-dimensional applications where the popular $l_1$ technique suffers from both selection inconsistency and prediction inaccuracy. Moreover, the problems of interest often go beyond Gaussian models. To meet these challenges, nonconvex penalized generalized linear models with grouped predictors are investigated and a simple-to-implement algorithm is proposed for computation. A rigorous theoretical result guarantees its convergence and provides tight preliminary scaling. This framework allows for grouped predictors and nonconvex penalties, including the discrete $l_0$ and the '$l_0+l_2$' type penalties. Penalty design and parameter tuning for nonconvex penalties are examined. Applications of super-resolution spectrum estimation in signal processing and cancer classification with joint gene selection in bioinformatics show the performance improvement by nonconvex penalized estimation.

Journal Article•DOI•

Context tree selection and linguistic rhythm retrieval from written texts

Antonio Galves, Charlotte Galves, Jesús E. García, Nancy L. Garcia, Florencia Leonardi

20 Feb 2009-arXiv: Machine Learning

TL;DR: This study introduces a new criterion to select in a consistent way the probabilistic context tree generating a sample; the models it selects are compatible with the long-standing conjecture that European Portuguese and Brazilian Portuguese belong to different rhythmic classes.

Abstract: The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry on this approach, we compare texts from European and Brazilian Portuguese. These texts are previously encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion which is a constant free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also make a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.

Posted Content•

Maximum Entropy Discrimination Markov Networks

Jun Zhu, Eric P. Xing

18 Jan 2009-arXiv: Machine Learning

TL;DR: The Maximum Entropy Discrimination Markov Network (MaxEnDNet) framework integrates max-margin structured learning and Bayesian-style estimation, combining and extending their merits; the authors present it as the first successful attempt to combine Bayesian-style learning with structured maximum margin learning.

Abstract: In this paper, we present a novel and general framework called Maximum Entropy Discrimination Markov Networks (MaxEnDNet), which integrates the max-margin structured learning and Bayesian-style estimation and combines and extends their merits. Major innovations of this model include: 1) It generalizes the extant Markov network prediction rule based on a point estimator of weights to a Bayesian-style estimator that integrates over a learned distribution of the weights. 2) It extends the conventional max-entropy discrimination learning of classification rule to a new structural max-entropy discrimination paradigm of learning the distribution of Markov networks. 3) It subsumes the well-known and powerful Maximum Margin Markov network (M$^3$N) as a special case, and leads to a model similar to an $L_1$-regularized M$^3$N that is simultaneously primal and dual sparse, or other types of Markov network by plugging in different prior distributions of the weights. 4) It offers a simple inference algorithm that combines existing variational inference and convex-optimization based M$^3$N solvers as subroutines. 5) It offers a PAC-Bayesian style generalization bound. This work represents the first successful attempt to combine Bayesian-style learning (based on generative models) with structured maximum margin learning (based on a discriminative model), and outperforms a wide array of competing methods for structured input/output learning on both synthetic and real data sets.

Posted Content•

How the initialization affects the stability of the k-means algorithm

Sébastien Bubeck, Marina Meila¹, Ulrike von Luxburg²•Institutions (2)

University of Washington¹, Max Planck Society²

31 Jul 2009-arXiv: Machine Learning

TL;DR: This paper investigates the role of the initialization for the stability of the k-means clustering algorithm and analyzes when different initializations lead to the same local optimum, and when they lead to different local optima.

Abstract: We investigate the role of the initialization for the stability of the k-means clustering algorithm. As opposed to other papers, we consider the actual k-means algorithm and do not ignore its property of getting stuck in local optima. We are interested in the actual clustering, not only in the costs of the solution. We analyze when different initializations lead to the same local optimum, and when they lead to different local optima. This enables us to prove that it is reasonable to select the number of clusters based on stability scores.
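
The dependence on initialization is easy to reproduce with a small one-dimensional run of Lloyd's algorithm. This is a toy sketch, not the analysis in the paper: the data, the two starting configurations, and the function name `lloyd_1d` are assumptions for illustration.

```python
def lloyd_1d(points, centers, iterations=100):
    """Run Lloyd's k-means updates on 1-D data from the given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

points = [0.0, 1.0, 10.0, 11.0, 100.0, 101.0]

print(lloyd_1d(points, [0.0, 10.0, 100.0]))   # [0.5, 10.5, 100.5]
print(lloyd_1d(points, [0.0, 100.0, 101.0]))  # [5.5, 100.0, 101.0]
```

The first start places one center in each natural group and recovers all three; the second places two centers in the rightmost group and stays in a different local optimum that merges the two leftmost groups.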

Posted Content•

Information, Divergence and Risk for Binary Experiments

Mark D. Reid¹, Robert C. Williamson¹•Institutions (1)

Australian National University¹

05 Jan 2009-arXiv: Machine Learning

TL;DR: The authors unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC-curves and information, give a new derivation of Support Vector Machines in terms of divergences, and relate Maximum Mean Discrepancy to Fisher Linear Discriminants.

Abstract: We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC-curves and information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their primitives which all are related to cost-sensitive binary classification. As well as clarifying relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate loss bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating f-divergences.
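To make the link between f-divergences and variational divergence concrete, the sketch below evaluates a discrete f-divergence for two generator functions and checks the classical Pinsker inequality, KL(P, Q) >= V(P, Q)^2 / 2. This is the textbook inequality, not the generalised versions derived in the paper; the distributions `p` and `q` and all function names are illustrative assumptions.

```python
import math


def f_divergence(p, q, f):
    """I_f(P, Q) = sum_i q_i * f(p_i / q_i) for discrete P, Q with all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    # f(t) = t log t yields the Kullback-Leibler divergence (0 log 0 = 0).
    return t * math.log(t) if t > 0 else 0.0


def variational_generator(t):
    # f(t) = |t - 1| yields the variational (L1) divergence sum_i |p_i - q_i|.
    return abs(t - 1.0)


# Assumed example distributions on three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl = f_divergence(p, q, kl_generator)          # KL(P, Q) in nats
v = f_divergence(p, q, variational_generator)  # V(P, Q)

pinsker_holds = kl >= v ** 2 / 2.0
```

Swapping in another convex generator with f(1) = 0 gives other members of the f-divergence family without changing `f_divergence` itself.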

Posted Content•

Hilbert space embeddings and metrics on probability measures

Bharath K. Sriperumbudur¹, Arthur Gretton¹, Kenji Fukumizu¹, Bernhard Schölkopf¹, Gert R. G. Lanckriet²•Institutions (2)

Max Planck Society¹, University of California, San Diego²

30 Jul 2009-arXiv: Machine Learning

TL;DR: In this paper, the authors determine conditions on the kernel $k$ under which the pseudometric $\gamma_k$ between distribution embeddings is a metric, and show that a bounded continuous translation-invariant kernel on $\mathbb{R}^d$ is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$.

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $\gamma_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS. We present three theoretical properties of $\gamma_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $\gamma_k$ is a metric: such $k$ are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g. on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $\gamma_k$. Third, to understand the nature of the topology induced by $\gamma_k$, we relate $\gamma_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $\gamma_k$ metrizes the weak topology.
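The distance $\gamma_k$ between mean embeddings can be estimated from samples using kernel evaluations alone. The sketch below computes the biased plug-in estimate for one-dimensional samples with a Gaussian kernel, which is bounded, continuous and translation-invariant with a Fourier transform supported on the whole real line. The sample values, the bandwidth `sigma` and all function names are illustrative assumptions, not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line with bandwidth sigma (assumed value)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs drawn from the two samples."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def gamma_k(xs, ys, kernel=gaussian_kernel):
    """Biased plug-in estimate of the RKHS distance between mean embeddings."""
    squared = (mean_kernel(xs, xs, kernel)
               + mean_kernel(ys, ys, kernel)
               - 2.0 * mean_kernel(xs, ys, kernel))
    return math.sqrt(max(squared, 0.0))  # clip tiny negative rounding errors


# Assumed toy samples from two clearly different distributions.
sample_a = [-1.0, 0.0, 1.0]
sample_b = [2.0, 3.0, 4.0]

distance_same = gamma_k(sample_a, sample_a)
distance_diff = gamma_k(sample_a, sample_b)
```

The estimate is zero when both arguments are the same sample and positive for the two shifted samples, in line with the separation property the abstract attributes to characteristic kernels.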

Posted Content•

Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

Ery Arias-Castro¹•Institutions (1)

University of California, Los Angeles¹

12 Sep 2009-arXiv: Machine Learning

TL;DR: A generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes, and densities is considered and a lower bound on the spectral gap is provided to consistently choose the correct number of clusters in the spectral method.

Abstract: In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.
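The first method mentioned in this abstract, extracting connected components of a neighborhood graph, can be sketched in a few lines. The version below links two points when their Euclidean distance is at most a fixed `radius` and labels each component by a depth-first search. The fixed radius, the toy points and the function name are illustrative assumptions; the abstract does not specify how the paper builds the graph or sets its scale.

```python
import math


def neighborhood_components(points, radius):
    """Label points by connected component of the radius-neighborhood graph."""
    n = len(points)
    labels = [-1] * n  # -1 marks a point not yet visited
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search from an unvisited point collects one component.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Assumed toy data: a short line segment and a small two-dimensional blob.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 5.4), (5.3, 5.0)]

labels = neighborhood_components(points, radius=0.6)
```

This brute-force version compares all pairs of points, so it takes quadratic time; a spatial index would be the natural replacement for larger data sets.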

Posted Content•

Which graphical models are difficult to learn

José Bento, Andrea Montanari

30 Oct 2009-arXiv: Machine Learning

TL;DR: In this article, the authors consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples and show that low-complexity algorithms systematically fail when the Markov Random Field develops long-range correlations.

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).

Posted Content•

Condition Number Analysis of Kernel-based Density Ratio Estimation

Takafumi Kanamori¹, Taiji Suzuki, Masashi Sugiyama•Institutions (1)

Nagoya University¹

15 Dec 2009-arXiv: Machine Learning

TL;DR: This paper considers a kernelized variant of the least-squares method and investigates its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques--the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability.

Abstract: The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques--the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator which is shown to possess an even smaller condition number. We show that numerical studies meet our theoretical analysis.

Journal Article•DOI•

On landmark selection and sampling in high-dimensional data analysis

Mohamed-Ali Belabbas¹, Patrick J. Wolfe¹•Institutions (1)

Harvard University¹

24 Jun 2009-arXiv: Machine Learning

TL;DR: In this paper, the authors introduce spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets, and provide a quantitative framework for analysing landmark selection followed by the Nyström extension, which they use to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process.

Abstract: In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams.

Posted Content•

Sparsistent Estimation of Time-Varying Discrete Markov Random Fields

Mladen Kolar, Eric P. Xing

14 Jul 2009-arXiv: Machine Learning

TL;DR: Conditions under which the proposed method consistently recovers the structure of a time-varying network are established, providing sound theoretical guarantees for the proposed estimation procedure.

Abstract: Network models have been popular for modeling and representing complex relationships and dependencies between observed variables. When data comes from a dynamic stochastic process, a single static network model cannot adequately capture transient dependencies, such as gene regulatory dependencies throughout a developmental cycle of an organism. Kolar et al. (2010b) proposed a method based on kernel-smoothing l1-penalized logistic regression for estimating time-varying networks from nodal observations collected from a time-series of observational data. In this paper, we establish conditions under which the proposed method consistently recovers the structure of a time-varying network. This work complements previous empirical findings by providing sound theoretical guarantees for the proposed estimation procedure. For completeness, we include numerical simulations in the paper.
