
Showing papers by "Michael I. Jordan" published in 2006


Journal ArticleDOI
TL;DR: This work considers problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups, and considers a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process.
Abstract: We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes ...

3,755 citations
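
To make the hierarchical construction concrete, here is a minimal stick-breaking simulation of shared atoms across groups. This is an illustrative sketch, not the paper's inference algorithm; the truncation level K and all parameter values are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, num_atoms):
    """Truncated stick-breaking weights for a Dirichlet process."""
    betas = rng.beta(1.0, alpha, size=num_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

# Parent DP: a discrete base measure over K shared atoms.
K = 50
global_weights = stick_breaking(alpha=5.0, num_atoms=K)
shared_atoms = rng.normal(0.0, 3.0, size=K)   # e.g., Gaussian mixture means

def child_dp_weights(gamma, parent_weights):
    """Because the parent measure is discrete over K atoms, a child DP's
    weights over those same atoms are Dirichlet distributed."""
    return rng.dirichlet(gamma * parent_weights + 1e-8)

# Three groups: different mixing weights, identical (shared) atoms.
for g in range(3):
    w = child_dp_weights(10.0, global_weights)
    top = np.argsort(w)[-3:][::-1]
    print(f"group {g}: dominant shared atoms {shared_atoms[top].round(2)}")
```

Because the parent base measure is discrete, every group reweights the same atom set, which is exactly the component-sharing property the abstract describes.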


Journal ArticleDOI
TL;DR: A variational inference algorithm for DP mixtures is presented, along with experiments comparing the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and an application to a large-scale image analysis problem.
Abstract: Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Markov chain Monte Carlo (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.

1,471 citations
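
As a hedged practical illustration of this line of work, scikit-learn ships a truncated variational treatment of DP mixtures of Gaussians. The class below is the library's, not the paper's code, and the synthetic data and truncation level are invented for the demo:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated Gaussian clusters.
X = np.concatenate([
    rng.normal(-5, 1, size=(200, 1)),
    rng.normal(0, 1, size=(200, 1)),
    rng.normal(6, 1, size=(200, 1)),
])

# Truncated variational DP mixture: n_components is only an upper bound;
# the stick-breaking prior prunes components that are not needed.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,
    random_state=0,
).fit(X)

print("effective components:", int(np.sum(dpgmm.weights_ > 0.01)))
```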


Journal ArticleDOI
TL;DR: This work provides a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function, and shows that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function.
Abstract: Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function—that it satisfies a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the...

1,352 citations
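
The "simple variational transformation" mentioned above can be written out. Modulo a convexification step detailed in the paper, the transform and the resulting excess-risk bound are as follows (standard notation from this literature, with \(\eta = P(Y = 1 \mid X = x)\) and margin-based surrogate \(\phi\)):

```latex
% Optimal conditional phi-risk, and the optimal risk when the sign of
% the prediction is constrained to disagree with the Bayes rule:
\[
H(\eta) = \inf_{\alpha \in \mathbb{R}}
  \bigl( \eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha) \bigr),
\qquad
H^{-}(\eta) = \inf_{\alpha \,:\, \alpha(2\eta - 1) \le 0}
  \bigl( \eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha) \bigr).
\]
% The transform and the risk bound relating 0-1 excess risk to
% surrogate excess risk:
\[
\psi(\theta) = H^{-}\!\left(\tfrac{1+\theta}{2}\right)
             - H\!\left(\tfrac{1+\theta}{2}\right),
\qquad
\psi\bigl( R(f) - R^{*} \bigr) \;\le\; R_{\phi}(f) - R_{\phi}^{*}.
\]
% Worked example: the hinge loss phi(a) = max(0, 1 - a) gives
% psi(theta) = |theta|, so the excess 0-1 risk is bounded by the
% excess hinge risk directly.
```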


Journal ArticleDOI
TL;DR: New cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem are derived.
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram.

313 citations
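
For orientation, the spectral relaxation underlying this family of methods can be sketched in a few lines: embed the points via the top eigenvectors of the normalized similarity matrix, then round with k-means. This is the standard normalized-cut recipe, not the paper's new cost functions or its similarity-learning algorithm; the Gaussian-kernel similarity and all constants below are assumptions for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster via the spectral relaxation of normalized cut:
    top-k eigenvectors of D^{-1/2} W D^{-1/2}, then k-means."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ W @ D_inv_sqrt               # normalized similarity
    _, eigvecs = np.linalg.eigh(M)                # ascending eigenvalues
    U = eigvecs[:, -k:]                           # top-k eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Two noisy blobs with a Gaussian similarity kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / 0.5)
print(spectral_clustering(W, k=2))
```

The paper's contribution sits on top of this machinery: a cost measuring the gap between a candidate partition and the relaxation's solution, minimized either over partitions (new clustering algorithms) or over W (similarity learning).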


Proceedings Article
04 Dec 2006
TL;DR: A PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection is developed, based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.

214 citations
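
The core of the residual-subspace detector can be sketched centrally. The paper's actual contribution, the distributed filtering protocol with communication guarantees, is not shown here; the synthetic data and rank choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "traffic" matrix with low-rank structure plus noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Principal subspace of the centered data via SVD.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:3].T @ Vt[:3]        # projector onto the top-3 principal subspace

def spe(x):
    """Squared prediction error: squared norm of the projection onto the
    residual subspace. A large SPE flags an anomaly."""
    r = (x - mean) - (x - mean) @ P
    return float(r @ r)

normal_pt = rng.normal(size=3) @ mixing          # consistent with the model
anomaly_pt = normal_pt + 5.0 * rng.normal(size=10)  # off-subspace spike
print("normal SPE:  %.2f" % spe(normal_pt))
print("anomaly SPE: %.2f" % spe(anomaly_pt))
```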


Proceedings ArticleDOI
25 Jun 2006
TL;DR: A statistical approach to software debugging in the presence of multiple bugs is described and an iterative collective voting scheme for the program runs and predicates is proposed, taking inspiration from bi-clustering algorithms.
Abstract: We describe a statistical approach to software debugging in the presence of multiple bugs. Due to sparse sampling issues and complex interaction between program predicates, many generic off-the-shelf algorithms fail to select useful bug predictors. Taking inspiration from bi-clustering algorithms, we propose an iterative collective voting scheme for the program runs and predicates. We demonstrate successful debugging results on several real world programs and a large debugging benchmark suite.

169 citations
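
A toy version of iterative collective voting between runs and predicates, in the spirit of HITS-style bi-clustering. The update rules and the tiny run-by-predicate matrix are invented for illustration and differ from the paper's actual algorithm:

```python
import numpy as np

def collective_voting(A, iters=20):
    """Iterative voting between failing runs (rows) and predicates
    (columns). A[i, j] = 1 if predicate j was observed true in run i.
    Runs and predicates reinforce each other across iterations."""
    runs = np.ones(A.shape[0])
    preds = np.ones(A.shape[1])
    for _ in range(iters):
        preds = A.T @ runs            # predicates voted for by weighty runs
        preds /= np.linalg.norm(preds)
        runs = A @ preds              # runs voted for by weighty predicates
        runs /= np.linalg.norm(runs)
    return runs, preds

# 6 runs x 4 predicates; predicate 2 co-occurs with a cluster of failures.
A = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 1]], dtype=float)
_, pred_scores = collective_voting(A)
print("best bug predictor:", int(pred_scores.argmax()))
```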


Journal ArticleDOI
TL;DR: An adaptive nonparametric method for constructing smooth estimates of G0 is developed and combined with a technique for estimating α that is inspired by an existing characterization of its maximum-likelihood estimator, yielding a flexible empirical Bayes treatment of Dirichlet process mixtures.
Abstract: The Dirichlet process prior allows flexible nonparametric mixture modeling. The number of mixture components is not specified in advance and can grow as new data arrive. However, analyses based on the Dirichlet process prior are sensitive to the choice of the parameters, including an infinite-dimensional distributional parameter G0. Most previous applications have either fixed G0 as a member of a parametric family or treated G0 in a Bayesian fashion, using parametric prior specifications. In contrast, we have developed an adaptive nonparametric method for constructing smooth estimates of G0. We combine this method with a technique for estimating α, the other Dirichlet process parameter, that is inspired by an existing characterization of its maximum-likelihood estimator. Together, these estimation procedures yield a flexible empirical Bayes treatment of Dirichlet process mixtures. Such a treatment is useful in situations where smooth point estimates of G0 are of intrinsic interest, or where the structure of G0 cannot be conveniently modeled with the usual parametric prior families. Analysis of simulated and real-world datasets illustrates the robustness of this approach.

134 citations


Journal ArticleDOI
TL;DR: A simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models, based on the dual extragradient algorithm.
Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.

123 citations
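
The flavor of the method can be conveyed with the extragradient iteration on a tiny bilinear saddle-point problem. Here the structured polytopes are replaced by box constraints and the step size is an assumption, so this is a sketch of the projection-based machinery rather than the paper's dual extragradient algorithm for structured models:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_box(z, lo=-1.0, hi=1.0):
    """Euclidean projection onto a box (a stand-in for the structured
    polytopes the paper handles with dynamic programming or flow)."""
    return np.clip(z, lo, hi)

def extragradient(A, steps=4000, eta=0.05):
    """Extragradient iterations for min_x max_y x^T A y over boxes."""
    x = rng.normal(size=A.shape[0])
    y = rng.normal(size=A.shape[1])
    for _ in range(steps):
        # Prediction step at the current point ...
        xp = project_box(x - eta * (A @ y))
        yp = project_box(y + eta * (A.T @ x))
        # ... then a correction step using gradients at the predicted point.
        x = project_box(x - eta * (A @ yp))
        y = project_box(y + eta * (A.T @ xp))
    return x, y

A0 = rng.normal(size=(3, 3))
A = A0 @ A0.T + np.eye(3)   # well-conditioned: unique saddle at the origin
x, y = extragradient(A)
print("||x||, ||y|| after convergence:",
      round(float(np.linalg.norm(x)), 6), round(float(np.linalg.norm(y)), 6))
```

Plain gradient descent-ascent cycles or diverges on such bilinear problems; the lookahead step is what makes the iteration converge.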


Journal ArticleDOI
TL;DR: A novel method, applicable to discrete-valued Markov random fields on arbitrary graphs, for approximately solving the marginalization problem; the resulting log-determinant relaxation performs comparably or better than the widely used sum-product algorithm over a range of experimental conditions.
Abstract: Graphical models are well suited to capture the complex and non-Gaussian statistical dependencies that arise in many real-world signals. A fundamental problem common to any signal processing application of a graphical model is that of computing approximate marginal probabilities over subsets of nodes. This paper proposes a novel method, applicable to discrete-valued Markov random fields (MRFs) on arbitrary graphs, for approximately solving this marginalization problem. The foundation of our method is a reformulation of the marginalization problem as the solution of a low-dimensional convex optimization problem over the marginal polytope. Exactly solving this problem for general graphs is intractable; for binary Markov random fields, we describe how to relax it by using a Gaussian bound on the discrete entropy and a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved efficiently by interior point methods, thereby providing approximations to the exact marginals. We show how a slightly weakened log-determinant relaxation can be solved even more efficiently by a dual reformulation. When applied to denoising problems in a coupled mixture-of-Gaussian model defined on a binary MRF with cycles, we find that the performance of this log-determinant relaxation is comparable or superior to the widely used sum-product algorithm over a range of experimental conditions.

120 citations
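
Schematically, the relaxation replaces the exact variational principle with a tractable log-determinant problem. The display below suppresses details (in particular the additive correction to the moment matrix for binary variables); see the paper for the precise statement:

```latex
% Exact variational principle for an MRF
% p_theta(x) \propto exp <theta, phi(x)>:
\[
A(\theta) = \max_{\mu \in \mathcal{M}}
  \bigl( \langle \theta, \mu \rangle + H(\mu) \bigr),
\]
% where M is the marginal polytope and H(mu) the entropy of the
% moment-matched distribution. The relaxation replaces both
% intractable objects:
\[
A(\theta) \;\le\; \max_{\tau \in \mathrm{SDP}(G)}
  \Bigl( \langle \theta, \tau \rangle
         + \tfrac{1}{2} \log\det M(\tau) + c \Bigr),
\]
% with SDP(G) a semidefinite outer bound on M, M(tau) the second-order
% moment matrix built from tau, and c a constant arising from the
% Gaussian upper bound on the discrete entropy.
```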


Proceedings ArticleDOI
04 Jun 2006
TL;DR: This work addresses the limitations of the proposed bipartite matching model of Taskar et al. (2005) by enriching the model form, and gives estimation and inference algorithms for these enhancements.
Abstract: Recently, discriminative word alignment methods have achieved state-of-the-art accuracies by extending the range of information sources that can be easily incorporated into aligners. The chief advantage of a discriminative framework is the ability to score alignments based on arbitrary features of the matching word tokens, including orthographic form, predictions of other models, lexical context and so on. However, the proposed bipartite matching model of Taskar et al. (2005), despite being tractable and effective, has two important limitations. First, it is limited by the restriction that words have fertility of at most one. More importantly, first order correlations between consecutive words cannot be directly captured by the model. In this work, we address these limitations by enriching the model form. We give estimation and inference algorithms for these enhancements. Our best model achieves a relative AER reduction of 25% over the basic matching formulation, outperforming intersected IBM Model 4 without using any overly compute-intensive features. By including predictions of other models as features, we achieve AER of 3.8 on the standard Hansards dataset.

97 citations
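
The basic matching formulation being extended here is a maximum-weight bipartite matching, which is easy to sketch with an off-the-shelf solver. The score matrix below is made up for illustration; in the paper such scores come from learned features of the word pairs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy alignment scores s[i, j] for source word i and target word j.
scores = np.array([
    [2.0, 0.1, 0.0],   # "la"     -> {"the", "house", "blue"}
    [0.2, 1.5, 0.3],   # "maison" -> ...
    [0.0, 0.4, 1.8],   # "bleue"  -> ...
])

# Maximum-weight bipartite matching: each word has fertility at most
# one -- exactly the restriction the paper relaxes with fertility and
# first-order features. SciPy minimizes, so negate the scores.
rows, cols = linear_sum_assignment(-scores)
print(list(zip(rows.tolist(), cols.tolist())))   # [(0, 0), (1, 1), (2, 2)]
```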


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper captures cross-population structure using a nonparametric Bayesian prior known as the hierarchical Dirichlet process (HDP), conjoining this prior with a recently developed Bayesian methodology for haplotype phasing known as DP-Haplotyper and develops an efficient sampling algorithm based on a two-level nested Pólya urn scheme.
Abstract: Uncovering the haplotypes of single nucleotide polymorphisms and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far---including methods based on coalescence, finite and infinite mixtures, and maximal parsimony---ignore the underlying population structure in the genotype data. As noted by Pritchard (2001), different populations can share a certain portion of their genetic ancestors, as well as have their own genetic components through migration and diversification. In this paper, we address the problem of multi-population haplotype inference. We capture cross-population structure using a nonparametric Bayesian prior known as the hierarchical Dirichlet process (HDP) (Teh et al., 2006), conjoining this prior with a recently developed Bayesian methodology for haplotype phasing known as DP-Haplotyper (Xing et al., 2004). We also develop an efficient sampling algorithm for the HDP based on a two-level nested Pólya urn scheme. We show that our model outperforms extant algorithms on both simulated and real biological data.

Journal ArticleDOI
TL;DR: To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification, and a novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated.
Abstract: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
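
As a hedged illustration of the modeling step, a stock LDA implementation recovers per-document topic proportions, the points on the topic simplex from which the paper's document-similarity measure is built. The corpus below is a four-line stand-in, not the CGC bibliography, and the library defaults are scikit-learn's rather than the paper's estimation procedure:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus; the paper fits LDA to 5,225 CGC bibliography items.
docs = [
    "clk-2 mutants show extended life span in C. elegans",
    "daf-16 regulates longevity and stress response",
    "gene expression profiling of nematode development",
    "embryonic development requires maternal gene products",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic proportions: points on the topic simplex. A
# similarity measure between the rows of theta supports the kind of
# document "homolog" search the abstract describes.
theta = lda.transform(counts)
print(theta.round(2))
```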

01 Jan 2006
TL;DR: This work considers the problem of network anomaly detection given the data collected and processed over large distributed systems and proposes an analytical method based on stochastic matrix perturbation theory for balancing the tradeoff between the accuracy of approximate network anomaly detection and the amount of data communication over the network.
Abstract: We consider the problem of network anomaly detection given the data collected and processed over large distributed systems. Our algorithmic framework can be seen as an approximate, distributed version of the well-known Principal Component Analysis (PCA) method, which is concerned with continuously tracking the behavior of the data projected onto the residual subspace of the principal components within error bound guarantees. Our approach consists of a protocol for local processing at individual monitoring devices, and global decision-making and monitoring feedback at a coordinator. A key ingredient of our framework is an analytical method based on stochastic matrix perturbation theory for balancing the tradeoff between the accuracy of our approximate network anomaly detection, and the amount of data communication over the network.

Proceedings Article
16 Jun 2006
TL;DR: The practices of operators at a large-scale Internet service, Amazon.com, are studied, and a new set of tools for operators is proposed: one tool lets operators explore the health of system components and the dependencies between them, and another monitors operators' actions and automatically suggests solutions to recurring problems.
Abstract: Despite significant efforts in the field of Autonomic Computing, system operators will still play a critical role in administering Internet services for many years to come. However, very little is known about how system operators work, what tools they use, and how we can make them more efficient. In this paper we study the practices of operators at a large-scale Internet service, Amazon.com, and propose a new set of tools for operators. The first tool lets the operators explore the health of system components and dependencies between them; the other monitors the actions of operators and automatically suggests solutions to recurring problems.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: A simple statistical model of molecular function evolution for predicting protein function is presented, along with results from applying the model to three protein families; prediction results on the extant proteins are compared to other available protein function prediction methods.
Abstract: We present a simple statistical model of molecular function evolution to predict protein function. The model description encodes general knowledge of how molecular function evolves within a phylogenetic tree based on the proteins' sequence. Inputs are a phylogeny for a set of evolutionarily related protein sequences and any available function characterizations for those proteins. Posterior probabilities for each protein are used to predict the molecular function of that protein. We present results from applying our model to three protein families, and compare our prediction results on the extant proteins to other available protein function prediction methods. For the deaminase family, our method achieves 93.9% prediction accuracy, whereas the related methods BLAST, GOtcha, and Orthostrapper achieve 72.7%, 87.9%, and 72.7%, respectively.

Proceedings Article
13 Jul 2006
TL;DR: In this paper, the multi-class support vector machine (MSVM) is viewed as a MAP estimation procedure under an appropriate probabilistic interpretation of the classifier, and this interpretation is extended to a hierarchical Bayesian architecture and to a fully-Bayesian inference procedure for multi-class classification based on data augmentation.
Abstract: We show that the multi-class support vector machine (MSVM) proposed by Lee et al. (2004) can be viewed as a MAP estimation procedure under an appropriate probabilistic interpretation of the classifier. We also show that this interpretation can be extended to a hierarchical Bayesian architecture and to a fully-Bayesian inference procedure for multi-class classification based on data augmentation. We present empirical results that show that the advantages of the Bayesian formalism are obtained without a loss in classification accuracy.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: A negative answer to the question of whether optimal local decision functions for the Bayesian formulation of sequential decentralized detection can be found within the class of stationary rules is provided by exploiting an asymptotic approximation to the optimal cost of stationary quantization rules and the asymmetry of the Kullback-Leibler divergences.
Abstract: We consider the problem of sequential decentralized detection, a problem that entails the choice of a stopping rule (specifying the sample size), a global decision function (a choice between two competing hypotheses), and a set of quantization rules (the local decisions on the basis of which the global decision is made). The main result of this paper is to resolve an open problem posed by Veeravalli et al. (1993) concerning whether optimal local decision functions for the Bayesian formulation of sequential decentralized detection can be found within the class of stationary rules. We provide a negative answer to this question by exploiting an asymptotic approximation to the optimal cost of stationary quantization rules, and the asymmetry of the Kullback-Leibler divergences. In addition, we show that asymptotically optimal quantizers, when restricted to the space of blockwise stationary quantizers, are likelihood-based threshold rules.


Journal ArticleDOI
TL;DR: The support vector machine (SVM) has played an important role in bringing certain themes to the fore in computationally oriented statistics as discussed by the authors, however, it is important to place the SVM in context as but one part of a class of closely related algorithms for nonlinear classification.
Abstract: The support vector machine (SVM) has played an important role in bringing certain themes to the fore in computationally oriented statistics. However, it is important to place the SVM in context as but one member of a class of closely related algorithms for nonlinear classification. As we discuss, several of the "open problems" identified by the authors have in fact been the subject of a significant literature, a literature that may have been missed because it has been aimed not only at the SVM but at a broader family of algorithms. Keeping the broader class of algorithms in mind also helps to make clear that the SVM involves certain specific algorithmic choices, some of which have favorable consequences and others of which have unfavorable consequences, both in theory and in practice. The broader context helps to clarify the ties of the SVM to the surrounding statistical literature. We have at least two broader contexts in mind for the ...


Journal ArticleDOI
TL;DR: In this article, an asymptotic approximation to the optimal cost of stationary quantization rules is proposed, and the authors exploit this approximation to show that stationary quantizers are not optimal in a broad class of settings.
Abstract: We consider the design of systems for sequential decentralized detection, a problem that entails several interdependent choices: the choice of a stopping rule (specifying the sample size), a global decision function (a choice between two competing hypotheses), and a set of quantization rules (the local decisions on the basis of which the global decision is made). This paper addresses an open problem of whether in the Bayesian formulation of sequential decentralized detection, optimal local decision functions can be found within the class of stationary rules. We develop an asymptotic approximation to the optimal cost of stationary quantization rules and exploit this approximation to show that stationary quantizers are not optimal in a broad class of settings. We also consider the class of blockwise stationary quantizers, and show that asymptotically optimal quantizers are likelihood-based threshold rules.

Posted Content
22 Aug 2006
TL;DR: An asymptotic approximation to the optimal cost of stationary quantization rules is obtained and it is shown how this approximation yields a negative answer to the stationarity question.
Abstract: We consider the problem of sequential decentralized detection, a problem that entails several interdependent choices: the choice of a stopping rule (specifying the sample size), a global decision function (a choice between two competing hypotheses), and a set of quantization rules (the local decisions on the basis of which the global decision is made). In this paper we resolve an open problem concerning whether optimal local decision functions for the Bayesian formulation of sequential decentralized detection can be found within the class of stationary rules. We develop an asymptotic approximation to the optimal cost of stationary quantization rules and show how this approximation yields a negative answer to the stationarity question. We also consider the class of blockwise stationary quantizers and show that asymptotically optimal quantizers are likelihood-based threshold rules.

03 Jan 2006
TL;DR: In this article, the authors present the final conclusions of the research on decision-making under uncertainty conducted by the investigators at the University of California at Berkeley, Stanford University, and the University of California at Davis, under the aegis of the MURI on Decision-Making under Uncertainty.
Abstract: : This report presents the final conclusions of the research on decision-making under uncertainty conducted by the investigators at the University of California at Berkeley, Stanford University, and the University of California at Davis, under the aegis of the MURI on Decision-Making under Uncertainty.

01 Jan 2006
TL;DR: The t-test is as defined in the previous lecture; the Mann-Whitney test rejects if \(\frac{1}{nm}\sum_{i,j} I(X_i \le Y_j)\) is large.
Abstract: The t-test is as defined in the previous lecture. It has slope \(1/\sigma\). The Mann-Whitney test rejects if \(\frac{1}{nm}\sum_{i,j} I(X_i \le Y_j)\) is large. Note. The Mann-Whitney statistic has a relationship with the area under the ROC curve (AUC) for classification algorithms with a tunable parameter. The ROC plot has one axis for the proportion of false positives and one axis for the proportion of true positives; as we move to the right, the classifier becomes more sensitive to true positives, but misclassifies more negative points as positive. Nearly flat ROC curves are bad and strongly humped ROC curves are good (see Figure 1). The empirical ROC curve shows the classifier's performance on the training data and has the form of discrete stair-steps. If we consider the positive examples to be one population \(Y_1, \ldots, Y_m\) and the negative examples to be another population \(X_1, \ldots, X_n\), and consider
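
The noted relationship can be checked numerically: the Mann-Whitney U statistic for the positive sample, divided by nm, equals the area under the empirical ROC curve. A small sketch follows; the sample sizes and distributions are arbitrary, and it assumes a recent SciPy in which mannwhitneyu reports U for its first argument:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, size=100)   # X_1, ..., X_n: negative examples
pos = rng.normal(1.0, 1.0, size=80)    # Y_1, ..., Y_m: positive examples

# Mann-Whitney U for the positive sample, normalized by nm ...
U, _ = mannwhitneyu(pos, neg)
print(U / (len(pos) * len(neg)))

# ... equals the area under the empirical ROC curve.
scores = np.concatenate([neg, pos])
labels = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
print(roc_auc_score(labels, scores))
```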