Showing papers in "Journal of Machine Learning Research in 2011"

PDF

Open Access

Journal Article•

Scikit-learn: Machine Learning in Python

[...]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel¹, Peter Prettenhofer², Ron Weiss³, Vincent Dubourg, Jake Vanderplas⁴, Alexandre Passos⁵, David Cournapeau, Matthieu Brucher⁶, Matthieu Perrot, Edouard Duchesnay - Show less +12 more•Institutions (6)

Kobe University¹, Bauhaus University, Weimar², Google³, University of Washington⁴, University of Massachusetts Amherst⁵, Total S.A.⁶

01 Feb 2011-Journal of Machine Learning Research

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.

...read moreread less

Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

...read moreread less

47,974 citations

Journal Article•

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

[...]

John C. Duchi¹, Elad Hazan², Yoram Singer³•Institutions (3)

University of California, Berkeley¹, Princeton University², Google³

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This work describes and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.

...read moreread less

Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.

...read moreread less

6,984 citations

Journal Article•

Natural Language Processing (Almost) from Scratch

[...]

Ronan Collobert, Jason Weston¹, Léon Bottou, Michael Karlen, Koray Kavukcuoglu², Pavel P. Kuksa³ - Show less +2 more•Institutions (3)

Google¹, New York University², Rutgers University³

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling is proposed.

...read moreread less

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

...read moreread less

6,734 citations

Journal Article•DOI•

Scikit-learn: Machine Learning in Python

[...]

PedregosaFabian, VaroquauxGaël, GramfortAlexandre, MichelVincent, ThirionBertrand, GriselOlivier, BlondelMathieu, PrettenhoferPeter, WeissRon, DubourgVincent, VanderplasJake, PassosAlexandre, CournapeauDavid, BrucherMatthieu, PerrotMatthieu, DuchesnayÉdouard - Show less +12 more

01 Nov 2011-Journal of Machine Learning Research

TL;DR: Scikit-learn as mentioned in this paper is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.

...read moreread less

1,787 citations

Journal Article•

Multiple Kernel Learning Algorithms

[...]

Mehmet Gönen¹, Ethem Alpaydin¹•Institutions (1)

Boğaziçi University¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: Overall, using multiple kernels instead of a single one is useful and it is believed that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.

...read moreread less

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.

...read moreread less

1,762 citations

Journal Article•DOI•

Weisfeiler-Lehman Graph Kernels

[...]

Nino Shervashidze¹, Pascal Schweitzer¹, Erik Jan van Leeuwen², Kurt Mehlhorn¹, Karsten M. Borgwardt¹ - Show less +1 more•Institutions (2)

Max Planck Society¹, University of Bergen²

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A family of efficient kernels for large graphs with discrete node labels based on the Weisfeiler-Lehman test of isomorphism on graphs that outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime.

...read moreread less

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis.

...read moreread less

1,552 citations

Journal Article•DOI•

Differentially Private Empirical Risk Minimization

[...]

Kamalika Chaudhuri¹, Claire Monteleoni, Anand D. Sarwate•Institutions (1)

University of California, San Diego¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This work proposes a new method, objective perturbation, for privacy-preserving machine learning algorithm design, and shows that both theoretically and empirically, this method is superior to the previous state-of-the-art, output perturbations, in managing the inherent tradeoff between privacy and learning performance.

...read moreread less

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the e-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.

...read moreread less

1,057 citations

Journal Article•

A Simpler Approach to Matrix Completion

[...]

Benjamin Recht¹•Institutions (1)

University of Wisconsin-Madison¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix by minimizing the nuclear norm of the hidden matrix subject to agreement with the provided entries.

...read moreread less

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candes and Recht (2009), Candes and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory.

...read moreread less

867 citations

Journal Article•

MULAN: A Java Library for Multi-Label Learning

[...]

Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas

01 Feb 2011-Journal of Machine Learning Research

TL;DR: MULAN is a Java library for learning from multi-label data that offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms forlearning from hierarchically structured labels.

...read moreread less

Abstract: MULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures.

...read moreread less

709 citations

Journal Article•

Structured Variable Selection with Sparsity-Inducing Norms

[...]

Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This work considers the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms defined as sums of Euclidean norms on certain subsets of variables, and explores the relationship between groups defining the norm and the resulting nonzero patterns.

...read moreread less

Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual l1-norm and the group l1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.

...read moreread less

480 citations

Journal Article•DOI•

The Indian Buffet Process: An Introduction and Review

[...]

Thomas L. Griffiths, Zoubin Ghahramani

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A detailed derivation of this distribution is given, and its use as a prior in an infinite latent feature model in probabilistic models such as bipartite graphs in which the size of at least one class of nodes is unknown is unknown.

...read moreread less

Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes.

...read moreread less

Journal Article•DOI•

l p -Norm Multiple Kernel Learning

[...]

Marius Kloft¹, Ulf Brefeld², Sören Sonnenburg³, Alexander Zien•Institutions (3)

University of California, Berkeley¹, Yahoo!², Technical University of Berlin³

01 Feb 2011-Journal of Machine Learning Research

TL;DR: Empirical applications of lp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art, and two efficient interleaved optimization strategies for arbitrary norms are developed.

...read moreread less

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this l1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is lp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and l∞-norm MKL in various scenarios. Importantly, empirical applications of lp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/.

...read moreread less

Journal Article•DOI•

Convergence Rates of Efficient Global Optimization Algorithms

[...]

Adam D. Bull¹•Institutions (1)

University of Cambridge¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: In this article, a Gaussian process prior is used to determine the associated space of functions, its reproducing-kernel Hilbert space (RKHS), and the expected improvement is known to converge on the minimum of any function in its RKHS.

...read moreread less

Abstract: In the efficient global optimization problem, we minimize an unknown function f, using as few observations f(x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f. We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior.

...read moreread less

Journal Article•

Learning from Partial Labels

[...]

Timothee Cour, Ben Sapp¹, Ben Taskar¹•Institutions (1)

University of Pennsylvania¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This work proposes a convex learning formulation based on minimization of a loss function appropriate for the partial label setting, and analyzes the conditions under which this loss function is asymptotically consistent, as well as its generalization and transductive performance.

...read moreread less

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost.

...read moreread less

Journal Article•

Online variational inference for the hierarchical Dirichlet process

[...]

Chong Wang¹, John Paisley¹, David M. Blei¹•Institutions (1)

Princeton University¹

01 Jan 2011-Journal of Machine Learning Research

TL;DR: This work proposes an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data, and lets us analyze much larger data sets.

...read moreread less

Abstract: The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixed-membership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions of terms that reflect recurring patterns (or “topics”) in the collection. Given a document collection, posterior inference is used to determine the number of topics needed and to characterize their distributions. One limitation of HDP analysis is that existing posterior inference algorithms require multiple passes through all the data—these algorithms are intractable for very large scale applications. We propose an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data. Our algorithm is significantly faster than traditional inference algorithms for the HDP, and lets us analyze much larger data sets. We illustrate the approach on two large collections of text, showing improved performance over online LDA, the finite counterpart to the HDP topic model.

...read moreread less

Journal Article•

Proximal Methods for Hierarchical Sparse Coding

[...]

Rodolphe Jenatton¹, Julien Mairal¹, Guillaume Obozinski¹, Francis Bach¹•Institutions (1)

École Normale Supérieure¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: The procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the l1-norm.

...read moreread less

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the l1-norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.

...read moreread less

Journal Article•

Laplacian Support Vector Machines Trained in the Primal

[...]

Stefano Melacci, Mikhail Belkin

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This paper presents two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation, and presents an extensive experimental evaluation on real world data showing the benefits of the proposed approach.

...read moreread less

Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3) to O(kn2), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach.

...read moreread less

Journal Article•DOI•

Universality, Characteristic Kernels and RKHS Embedding of Measures

[...]

Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet¹•Institutions (1)

University of California, San Diego¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pD and integrally strictlypd kernels.

...read moreread less

Abstract: Over the last few years, two different notions of positive definite (pd) kernels---universal and characteristic---have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on ℜd, all these notions are shown to be equivalent.

...read moreread less

Journal Article•

Learning with Structured Sparsity

[...]

Junzhou Huang, Tong Zhang¹, Dimitris N. Metaxas¹•Institutions (1)

Rutgers University¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: In this article, a general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing.

...read moreread less

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications.

...read moreread less

Journal Article•DOI•

Distance Dependent Chinese Restaurant Processes

[...]

David M. Blei, Peter I. Frazier

01 Feb 2011-Journal of Machine Learning Research

TL;DR: The distance dependent Chinese restaurant process (DCP) as discussed by the authors is a flexible class of distributions over partitions that allows for dependencies between the elements, which can be used to model many kinds of dependencies between data in infinite clustering models including dependencies arising from time, space and network connectivity.

...read moreread less

Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings We study its empirical performance with three text corpora We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation

...read moreread less

Journal Article•

X -Armed Bandits

[...]

Sébastien Bubeck, Rémi Munos, Gilles Stoltz¹, Csaba Szepesvári•Institutions (1)

École Normale Supérieure¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker is constructed.

...read moreread less

Abstract: We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.

...read moreread less

Journal Article•DOI•

Computationally Efficient Convolved Multiple Output Gaussian Processes

[...]

Mauricio A. Álvarez¹, Neil D. Lawrence²•Institutions (2)

Technological University of Pereira¹, University of Sheffield²

01 Feb 2011-Journal of Machine Learning Research

TL;DR: This paper presents different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism, exploit the conditional independencies present naturally in the model and shows experimental results with synthetic and real data.

...read moreread less

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data.

...read moreread less

Journal Article•DOI•

Efficient Structure Learning of Bayesian Networks using Constraints

[...]

Cassio P. de Campos¹, Qiang Ji²•Institutions (2)

Association for Computing Machinery¹, Rensselaer Polytechnic Institute²

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality and the benefits of using the properties with state-of-the-art methods and with the new algorithm, able to handle larger data sets than before.

...read moreread less

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before.

...read moreread less

Journal Article•DOI•

Learning Latent Tree Graphical Models

[...]

Myung Jin Choi¹, Vincent Y. F. Tan², Animashree Anandkumar³, Alan S. Willsky¹•Institutions (3)

Massachusetts Institute of Technology¹, University of Wisconsin-Madison², University of California, Irvine³

01 Feb 2011-Journal of Machine Learning Research

TL;DR: In this article, the problem of learning a latent tree graphical model where samples are available only from a subset of variables has been studied and two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes, have been proposed.

...read moreread less

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set.

...read moreread less

Journal Article•DOI•

Natural Language Processing (Almost) from Scratch

[...]

CollobertRonan, WestonJason, BottouLéon, KarlenMichael, KavukcuogluKoray, KuksaPavel - Show less +2 more

01 Nov 2011-Journal of Machine Learning Research

...read moreread less

Journal Article•DOI•

DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model

[...]

Shohei Shimizu¹, Takanori Inazumi¹, Yasuhiro Sogawa¹, Aapo Hyvärinen², Yoshinobu Kawahara¹, Takashi Washio¹, Patrik O. Hoyer², Kenneth A. Bollen³ - Show less +4 more•Institutions (3)

Osaka University¹, University of Helsinki², University of North Carolina at Chapel Hill³

01 Feb 2011-Journal of Machine Learning Research

TL;DR: In this article, a non-Gaussianity-based method is proposed to estimate the causal ordering and connection strength of a linear acyclic model, which is guaranteed to converge to the right solution within a fixed number of steps if the data strictly follows the model.

...read moreread less

Abstract: Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables. Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods. However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps. In this paper, we propose a new direct method to estimate a causal ordering and connection strengths based on non-Gaussianity. In contrast to the previous methods, our algorithm requires no algorithmic parameters and is guaranteed to converge to the right solution within a small fixed number of steps if the data strictly follows the model, that is, if all the model assumptions are met and the sample size is infinite.

...read moreread less

Journal Article•

Locally Defined Principal Curves and Surfaces

[...]

Umut Ozertem, Deniz Erdogmus

01 Feb 2011-Journal of Machine Learning Research

TL;DR: A novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems are presented.

...read moreread less

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems.

...read moreread less

Journal Article•DOI•

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

[...]

DuchiJohn, HazanElad, SingerYoram

01 Jul 2011-Journal of Machine Learning Research

TL;DR: A new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning are presented.

...read moreread less

Journal Article•DOI•

Information, Divergence and Risk for Binary Experiments

[...]

Mark D. Reid¹, Robert C. Williamson¹•Institutions (1)

Australian National University¹

01 Feb 2011-Journal of Machine Learning Research

TL;DR: The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants.

...read moreread less

Abstract: We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants.

...read moreread less

Journal Article•DOI•

Information Rates of Nonparametric Gaussian Process Methods

[...]

Aad van der Vaart, Harry van Zanten

01 Feb 2011-Journal of Machine Learning Research

TL;DR: The results show that for good performance, the regularity of the GP prior should match the regularities of the unknown response function, and is expressible in a certain concentration function.

...read moreread less

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matern and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function.

...read moreread less

Collapse