
Showing papers by "Michael I. Jordan" published in 2009


Proceedings ArticleDOI
11 Oct 2009
TL;DR: In this article, a general methodology is proposed to mine console logs to automatically detect system runtime problems: source code analysis is combined with information retrieval to create composite features, which are then analyzed with machine learning to detect operational problems.
Abstract: Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
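
To make the detection step concrete: assuming message-count feature vectors have already been built from the parsed logs (the feature-construction step above requires source-code analysis and is omitted here), one standard recipe in this line of work flags traces whose residual energy outside the top principal subspace is unusually large. The sketch below, on synthetic data with hypothetical dimensions, illustrates that PCA-style detector; it is not the authors' implementation.

```python
import numpy as np

def pca_anomaly_scores(X, k=3):
    """Score each row of a message-count matrix by its squared residual
    outside the top-k principal subspace (large residual => likely anomaly)."""
    Xc = X - X.mean(axis=0)                       # center each message-type column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    normal_subspace = Vt[:k].T                    # d x k "normal" directions
    residual = Xc - Xc @ normal_subspace @ normal_subspace.T
    return np.linalg.norm(residual, axis=1) ** 2

# Hypothetical data: rows = execution traces, columns = log message types.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(1000, 20)).astype(float)
X[42, 7] += 80                                    # inject one unusual trace
scores = pca_anomaly_scores(X, k=3)
threshold = np.percentile(scores, 99.5)           # simple empirical cutoff
print("flagged traces:", np.where(scores > threshold)[0])
```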

777 citations


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work develops a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data, and develops two concrete instances of this framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP).
Abstract: Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n^3) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nystrom method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
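
A minimal sketch of the KASP idea as summarized above, assuming scikit-learn is available: k-means acts as the distortion-minimizing local transformation, spectral clustering is run only on the resulting representatives, and each original point inherits the cluster of its representative. This is an illustration of the framework, not the authors' code, and the parameter choices are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def kasp(X, n_clusters, n_representatives=200, random_state=0):
    """KASP-style approximate spectral clustering (sketch):
    1) k-means down to a small set of representative points,
    2) spectral clustering on the representatives,
    3) each original point inherits its representative's cluster."""
    km = KMeans(n_clusters=n_representatives, n_init=3,
                random_state=random_state).fit(X)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf",
                            random_state=random_state)
    rep_labels = sc.fit_predict(km.cluster_centers_)
    return rep_labels[km.labels_]          # map each point via its centroid

# Hypothetical usage on synthetic data with three well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(5000, 2))
               for c in ((0, 0), (3, 0), (0, 3))])
labels = kasp(X, n_clusters=3, n_representatives=200)
```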

507 citations


Proceedings Article
07 Dec 2009
TL;DR: This work pursues a similar approach with a richer kind of latent variable---latent features---using a Bayesian nonparametric approach to infer the number of features while simultaneously learning which entities have each feature, and combines these inferred features with known covariates in order to perform link prediction.
Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets.
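
The abstract does not spell out the likelihood, but latent-feature link models of this kind typically score each pair of entities by a weighted interaction of their binary features plus covariate effects. The sketch below shows only that (assumed) scoring step; the Bayesian nonparametric inference over the number of features is omitted, and all names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_probabilities(Z, W, Xcov=None, beta=None):
    """Probability of a link between every pair of entities given binary
    latent features Z (n x K), feature-interaction weights W (K x K), and
    optional pairwise covariates Xcov (n x n x p) with coefficients beta."""
    logits = Z @ W @ Z.T                   # latent-feature interaction term
    if Xcov is not None:
        logits = logits + np.tensordot(Xcov, beta, axes=([2], [0]))
    return sigmoid(logits)

# Hypothetical example: 5 entities, 3 latent features, 2 pairwise covariates.
rng = np.random.default_rng(1)
Z = rng.integers(0, 2, size=(5, 3))
W = rng.normal(size=(3, 3))
Xcov = rng.normal(size=(5, 5, 2))
beta = np.array([0.5, -1.0])
P = link_probabilities(Z, W, Xcov, beta)   # 5 x 5 matrix of link probabilities
```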

448 citations


Proceedings ArticleDOI
02 Aug 2009
TL;DR: A generative model is presented that simultaneously segments the text into utterances and maps each utterance to a meaning representation grounded in the world state and generalizes across three domains of increasing difficulty.
Abstract: A central problem in grounded language acquisition is learning the correspondences between a rich world state and a stream of text which references that world state. To deal with the high degree of ambiguity present in this setting, we present a generative model that simultaneously segments the text into utterances and maps each utterance to a meaning representation grounded in the world state. We show that our model generalizes across three domains of increasing difficulty---Robocup sportscasting, weather forecasts (a new domain), and NFL recaps.

348 citations


Proceedings ArticleDOI
29 Mar 2009
TL;DR: A system that uses machine learning to accurately predict the performance metrics of database queries whose execution times range from milliseconds to hours; it correctly identifies both short- and long-running queries to inform workload management and capacity planning.
Abstract: One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics --- their runtimes and resource usage --- can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries. When these long-running queries can be identified before they start, they can be rejected or scheduled when they will not cause extreme resource contention for the other queries in the system. Second, deciding whether a system can complete a given workload in a given time period (or whether a bigger system is necessary) depends on knowing the resource requirements of the queries in that workload. We have developed a system that uses machine learning to accurately predict the performance metrics of database queries whose execution times range from milliseconds to hours. For training and testing our system, we used both real customer queries and queries generated from an extended set of TPC-DS templates. The extensions mimic queries that caused customer problems. We used these queries to compare how accurately different techniques predict metrics such as elapsed time, records used, disk I/Os, and message bytes. The most promising technique was not only the most accurate, but also predicted these metrics simultaneously and using only information available prior to query execution. We validated the accuracy of this machine learning technique on a number of HP Neoview configurations. We were able to predict individual query elapsed time within 20% of its actual time for 85% of the test queries. Most importantly, we were able to correctly identify both the short and long-running (up to two hour) queries to inform workload management and capacity planning.
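
The abstract does not name the learning technique, so the sketch below merely illustrates the setup on synthetic data: several performance metrics are predicted jointly from features available before execution (here, hypothetical plan features), using a generic multi-output regressor from scikit-learn rather than the paper's method, and evaluated with the same within-20% criterion.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical pre-execution features (e.g. operator counts and optimizer
# cardinality estimates from the query plan) and the metrics to predict.
rng = np.random.default_rng(0)
n_queries = 2000
plan_features = rng.normal(size=(n_queries, 10))
metrics = np.column_stack([                       # e.g. elapsed time, disk I/Os
    np.exp(plan_features[:, 0] + 0.1 * rng.normal(size=n_queries)),
    np.exp(plan_features[:, 1] + 0.1 * rng.normal(size=n_queries)),
])

X_tr, X_te, y_tr, y_te = train_test_split(plan_features, metrics, test_size=0.2)
model = RandomForestRegressor(n_estimators=200).fit(X_tr, y_tr)
pred = model.predict(X_te)                        # predicts both metrics at once
within_20pct = np.mean(np.abs(pred[:, 0] - y_te[:, 0]) <= 0.2 * y_te[:, 0])
print(f"elapsed-time predictions within 20%: {within_20pct:.0%}")
```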

329 citations


Journal ArticleDOI
TL;DR: In this paper, a new methodology for sufficient dimension reduction (SDR) is presented, which derives directly from the formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$ given the projection of $X$ on the central subspace.
Abstract: We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace [cf. J. Amer. Statist. Assoc. 86 (1991) 316--342 and Regression Graphics (1998) Wiley]. We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an $M$-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.
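
For concreteness, here is a sketch of a trace-style criterion of the kind used in kernel dimension reduction, evaluated with Gaussian kernels on synthetic data: smaller values of Tr[K_Y (K_{XB} + n*eps*I)^(-1)] suggest that the projection XB captures more of the dependence of Y on X. The exact estimator, kernels, and regularization in the paper may differ, so treat this as an assumption-laden illustration rather than the paper's M-estimator.

```python
import numpy as np

def centered_gram(X, sigma):
    """Centered Gaussian RBF Gram matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    return H @ K @ H

def kdr_objective(B, X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Trace criterion Tr[ K_Y (K_{XB} + n*eps*I)^{-1} ]: smaller values
    suggest the projection X @ B captures more of the dependence on Y."""
    n = X.shape[0]
    Kz = centered_gram(X @ B, sigma_x)
    Ky = centered_gram(Y.reshape(n, -1), sigma_y)
    return float(np.trace(Ky @ np.linalg.inv(Kz + n * eps * np.eye(n))))

# Hypothetical example: Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)
relevant_B = np.eye(5)[:, :1]      # projects onto the informative direction
irrelevant_B = np.eye(5)[:, 4:5]   # projects onto a noise direction
# The relevant direction should typically attain the smaller criterion value.
print(kdr_objective(relevant_B, X, Y), kdr_objective(irrelevant_B, X, Y))
```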

270 citations


Journal ArticleDOI
TL;DR: In this paper, the conditional independence of the covariate X from the response Y, given the projection of X on the central subspace, is characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces.
Abstract: We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate X from the response Y , given the projection of X on the central subspace [cf. J. Amer. Statist. Assoc. 86 (1991) 316–342 and Regression Graphics (1998) Wiley]. We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an M-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.

236 citations


Proceedings Article
15 Jun 2009
TL;DR: Preliminary results running a Web 2.0 benchmark application driven by real workload traces on Amazon's EC2 cloud show that the imported modeling, control, and analysis techniques can effectively control the number of servers, even in the face of performance anomalies.
Abstract: Horizontally-scalable Internet services on clusters of commodity computers appear to be a great fit for automatic control: there is a target output (service-level agreement), observed output (actual latency), and gain controller (adjusting the number of servers). Yet few datacenters are automated this way in practice, due in part to well-founded skepticism about whether the simple models often used in the research literature can capture complex real-life workload/performance relationships and keep up with changing conditions that might invalidate the models. We argue that these shortcomings can be fixed by importing modeling, control, and analysis techniques from statistics and machine learning. In particular, we apply rich statistical models of the application's performance, simulation-based methods for finding an optimal control policy, and change-point methods to find abrupt changes in performance. Preliminary results running a Web 2.0 benchmark application driven by real workload traces on Amazon's EC2 cloud show that our method can effectively control the number of servers, even in the face of performance anomalies.
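
As a rough sketch of the control loop (not the authors' policy-search or change-point machinery): given a learned performance model that maps (workload, server count) to predicted latency, the controller picks the smallest number of servers whose prediction meets the SLA. The model and its interface here are hypothetical placeholders.

```python
def choose_server_count(perf_model, workload, sla_latency,
                        min_servers=1, max_servers=100):
    """Pick the smallest number of servers whose predicted latency meets
    the SLA for the current workload (hypothetical interface)."""
    for n in range(min_servers, max_servers + 1):
        if perf_model(workload, n) <= sla_latency:
            return n
    return max_servers                     # SLA unattainable: use the maximum

def perf_model(workload, n):
    """Hypothetical learned model: latency grows with per-server load."""
    return 50.0 + 400.0 * (workload / n)

print(choose_server_count(perf_model, workload=30.0, sla_latency=300.0))
```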

233 citations


Journal ArticleDOI
TL;DR: Quantitative guidelines for researchers wishing to make a limited number of SNPs available publicly without compromising subjects' privacy are provided.
Abstract: Recent studies have demonstrated that statistical methods can be used to detect the presence of a single individual within a study group based on summary data reported from genome-wide association studies (GWAS). We present an analytical and empirical study of the statistical power of such methods. We thereby aim to provide quantitative guidelines for researchers wishing to make a limited number of SNPs available publicly without compromising subjects' privacy.

231 citations


Proceedings ArticleDOI
06 Dec 2009
TL;DR: A novel application of data mining and statistical learning methods to automatically monitor and detect abnormal execution traces from console logs in an online setting; the approach not only achieves highly accurate and fast problem detection but also helps operators better understand execution patterns in their system.
Abstract: We describe a novel application of data mining and statistical learning methods to automatically monitor and detect abnormal execution traces from console logs in an online setting. Different from existing solutions, we use a two-stage detection system. The first stage uses frequent pattern mining and distribution estimation techniques to capture the dominant patterns (both frequent sequences and time duration). The second stage uses a principal component analysis based anomaly detection technique to identify actual problems. Using real system data from a 203-node Hadoop [1] cluster, we show that we can not only achieve highly accurate and fast problem detection, but also help operators better understand execution patterns in their system.
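
A toy sketch of the first stage as described: learn the dominant execution patterns from frequency counts on a training window, treat matching traces as normal online, and forward everything else to a second-stage detector (e.g. the PCA-based one). Thresholds, trace encodings, and function names are hypothetical.

```python
from collections import Counter

def learn_dominant_patterns(training_traces, min_support=0.01):
    """Stage 1 (sketch): treat an execution trace as a tuple of event types
    and keep those whose empirical frequency exceeds min_support."""
    counts = Counter(tuple(t) for t in training_traces)
    total = sum(counts.values())
    return {seq for seq, c in counts.items() if c / total >= min_support}

def filter_traces(new_traces, dominant):
    """Traces matching a dominant pattern are considered normal online;
    the rest are forwarded to a second-stage (e.g. PCA-based) detector."""
    return [t for t in new_traces if tuple(t) not in dominant]

# Hypothetical event traces (each string = one parsed log event type).
train = [("open", "write", "close")] * 980 + [("open", "error", "close")] * 20
dominant = learn_dominant_patterns(train, min_support=0.05)
suspects = filter_traces([("open", "write", "close"),
                          ("open", "error", "close")], dominant)
print(suspects)        # only the rare trace is passed on for further analysis
```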

178 citations


Proceedings Article
07 Dec 2009
TL;DR: This work develops an efficient Markov chain Monte Carlo inference method that is based on the Indian buffet process representation of the predictive distribution of the beta process, and uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities.
Abstract: We propose a Bayesian nonparametric approach to the problem of modeling related time series. Using a beta process prior, our approach is based on the discovery of a set of latent dynamical behaviors that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. We develop an efficient Markov chain Monte Carlo inference method that is based on the Indian buffet process representation of the predictive distribution of the beta process. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.

Proceedings ArticleDOI
14 Jun 2009
TL;DR: A Bayesian decision-theoretic framework is presented, which allows us to both integrate diverse measurements and choose new measurements to make, and a variational inference algorithm is used, which exploits exponential family duality.
Abstract: Given a model family and a set of unlabeled examples, one could either label specific examples or state general constraints---both provide information about the desired model. In general, what is the most cost-effective way to learn? To address this question, we introduce measurements, a general class of mechanisms for providing information about a target model. We present a Bayesian decision-theoretic framework, which allows us to both integrate diverse measurements and choose new measurements to make. We use a variational inference algorithm, which exploits exponential family duality. The merits of our approach are demonstrated on two sequence labeling tasks.

Journal ArticleDOI
TL;DR: In this article, the covariates are not available directly but are transformed by a dimensionality-reducing quantizer, and conditions on loss functions are presented such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated.
Abstract: The goal of binary classification is to estimate a discriminant function y from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or f-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93-102. Univ. California Press, Berkeley] for the 0-1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.

Posted Content
15 May 2009
TL;DR: In this article, a Bayesian nonparametric approach to speaker diarization is proposed, which builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al.
Abstract: We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566--1581]. Although the basic HDP-HMM tends to over-segment the audio data---creating redundant states and rapidly switching among them---we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
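
For reference, the augmented HDP-HMM alluded to here is commonly known as the sticky HDP-HMM, whose transition prior is usually written roughly as below (notation may differ slightly from the paper): a global stick-breaking measure beta is shared across rows, and an extra mass kappa >= 0 on self-transitions controls the switching rate.

```latex
% Sticky HDP-HMM transition prior (sketch; notation may differ from the paper):
\beta \mid \gamma \;\sim\; \mathrm{GEM}(\gamma), \qquad
\pi_j \mid \alpha, \kappa, \beta \;\sim\;
  \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha\beta + \kappa\,\delta_j}{\alpha + \kappa}\right), \qquad
z_t \mid z_{t-1} = j \;\sim\; \pi_j .
```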

Proceedings ArticleDOI
19 Jun 2009
TL;DR: This paper proposes to train the performance model of a Web 2.0 application using an exploration policy that quickly collects data from different performance regimes of the application, and shows that with this exploration policy the model can be trained in less than an hour and then immediately used in a resource allocation controller.
Abstract: Horizontally scalable Internet services present an opportunity to use automatic resource allocation strategies for system management in the datacenter. In most of the previous work, a controller employs a performance model of the system to make decisions about the optimal allocation of resources. However, these models are usually trained offline or on a small-scale deployment and will not accurately capture the performance of the controlled application. To achieve accurate control of the web application, the models need to be trained directly on the production system and adapted to changes in workload and performance of the application. In this paper we propose to train the performance model using an exploration policy that quickly collects data from different performance regimes of the application. The goal of our approach for managing the exploration process is to strike a balance between not violating the performance SLAs and the need to collect sufficient data to train an accurate performance model, which requires pushing the system close to its capacity. We show that by using our exploration policy, we can train a performance model of a Web 2.0 application in less than an hour and then immediately use the model in a resource allocation controller.

01 Jan 2009
TL;DR: This work first parses console logs by combining source code analysis with information retrieval to create composite features, and then analyzes these features using machine learning to automatically detect system runtime problems.
Abstract: Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software’s internals.

Journal ArticleDOI
TL;DR: A likelihood-based method using an interleaved hidden Markov model (HMM) is devised that can jointly estimate three parameters fundamental to recombination---the crossover rate, the gene conversion rate and the mean conversion tract length---and it is shown that modeling overlapping gene conversions is crucial for improving the joint estimation of the gene conversion rate and the mean conversion tract length.
Abstract: Motivation: Two known types of meiotic recombination are crossovers and gene conversions. Although they leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In particular, for a given population SNP dataset, the joint estimation of the crossover rate, the gene conversion rate and the mean conversion tract length is widely viewed as a very difficult problem. Results: In this article, we devise a likelihood-based method using an interleaved hidden Markov model (HMM) that can jointly estimate the aforementioned three parameters fundamental to recombination. Our method significantly improves upon a recently proposed method based on a factorial HMM. We show that modeling overlapping gene conversions is crucial for improving the joint estimation of the gene conversion rate and the mean conversion tract length. We test the performance of our method on simulated data. We then apply our method to analyze real biological data from the telomere of the X chromosome of Drosophila melanogaster, and show that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as previously claimed. Availability: A software implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/~yss/software.html. Contact: yss@eecs.berkeley.edu

Journal ArticleDOI
TL;DR: In this article, a Bayesian nonparametric approach to speaker diarization is proposed, which builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al.
Abstract: We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566--1581]. Although the basic HDP-HMM tends to over-segment the audio data---creating redundant states and rapidly switching among them---we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.

Book ChapterDOI
14 Jan 2009
TL;DR: The techniques are illustrated on the blind one-microphone speech separation problem, by casting the problem as one of segmentation of the spectrogram.
Abstract: Spectral clustering refers to a class of recent techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this chapter, we introduce the main concepts and algorithms together with recent advances in learning the similarity matrix from data. The techniques are illustrated on the blind one-microphone speech separation problem, by casting the problem as one of segmentation of the spectrogram.

Book ChapterDOI
27 Aug 2009
TL;DR: This paper uncovers a general relationship between regularized discriminant analysis and ridge regression; this relationship yields variations on conventional LDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator.
Abstract: Fisher linear discriminant analysis (LDA) and its kernel extension--kernel discriminant analysis (KDA)--are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squared error procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of LDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional LDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Experimental results on a collection of benchmark data sets demonstrate the effectiveness of our approach.
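
A minimal sketch of the LDA/ridge-regression connection discussed above: ridge-regressing a centered class-indicator target on centered features yields discriminant directions (up to scaling and rotation). The exact target coding and regularization used in the paper may differ; this is an illustration, not the authors' algorithm.

```python
import numpy as np

def lda_directions_via_ridge(X, y, reg=1e-2):
    """Discriminant directions from ridge regression of a centered
    class-indicator target on centered features (sketch of the LDA/ridge
    connection; the paper's exact target coding may differ)."""
    classes = np.unique(y)
    n, d = X.shape
    Y = (y[:, None] == classes[None, :]).astype(float)    # n x c indicators
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Ridge solution: (Xc^T Xc + reg * I)^{-1} Xc^T Yc
    return np.linalg.solve(Xc.T @ Xc + reg * np.eye(d), Xc.T @ Yc)

# Hypothetical usage: project the data onto the learned directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(100, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 100)
W = lda_directions_via_ridge(X, y)          # d x (number of classes)
Z = (X - X.mean(axis=0)) @ W                # reduced representation
```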

Proceedings Article
18 Jun 2009
TL;DR: It is shown that the hardness of computing the objective function and gradient of the mean field objective qualitatively depends on a simple graph property and a new algorithm based on the construction of an auxiliary exponential family can be used to make inference possible in this case.
Abstract: In intractable, undirected graphical models, an intuitive way of creating structured mean field approximations is to select an acyclic tractable subgraph. We show that the hardness of computing the objective function and gradient of the mean field objective qualitatively depends on a simple graph property. If the tractable subgraph has this property---we call such subgraphs v-acyclic---a very fast block coordinate ascent algorithm is possible. If not, optimization is harder, but we show a new algorithm based on the construction of an auxiliary exponential family that can be used to make inference possible in this case as well. We discuss the advantages and disadvantages of each regime and compare the algorithms empirically.

Proceedings Article
07 Dec 2009
TL;DR: This paper presents a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer.
Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning.

Proceedings Article
15 Apr 2009
TL;DR: A new majorization loss function, the coherence function, is proposed; it is closely related to the multinomial log-likelihood function, and its limit at zero temperature corresponds to a multicategory hinge loss function.
Abstract: Margin-based classification methods are typically devised based on a majorization-minimization procedure, which approximately solves an otherwise intractable minimization problem defined with the 0-1 loss. The extension of such methods from the binary classification setting to the more general multicategory setting turns out to be nontrivial. In this paper, our focus is to devise margin-based classification methods that can be seamlessly applied to both settings, with the binary setting simply as a special case. In particular, we propose a new majorization loss function that we call the coherence function, and then devise a new multicategory margin-based boosting algorithm based on the coherence function. Analogous to deterministic annealing, the coherence function is characterized by a temperature factor. It is closely related to the multinomial log-likelihood function and its limit at zero temperature corresponds to a multicategory hinge loss function.
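
For intuition only, here is one temperature-smoothed majorization of the multicategory hinge loss that matches the qualitative description above (a log-sum-exp form related to a multinomial log-likelihood, recovering a multicategory hinge as the temperature goes to zero); the paper's actual coherence function may be defined differently.

```python
import numpy as np

def smoothed_multicategory_hinge(f, y, T=1.0):
    """Temperature-smoothed surrogate for the multicategory hinge loss.
    f: (n x c) margin scores, y: integer labels, T: temperature.
    (Illustrative form only; not necessarily the paper's coherence function.)"""
    n, c = f.shape
    cost = 1.0 - np.eye(c)[y]               # cost 1 for wrong classes, 0 for true
    a = (cost + f - f[np.arange(n), y][:, None]) / T
    return T * np.log(np.sum(np.exp(a), axis=1))

# As T -> 0 this approaches max_c [1{c != y} + f_c - f_y], a multicategory hinge.
scores = np.array([[2.0, 0.5, -1.0]])
label = np.array([0])
print(smoothed_multicategory_hinge(scores, label, T=1.0),
      smoothed_multicategory_hinge(scores, label, T=0.01))
```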

Journal ArticleDOI
01 Jul 2009
TL;DR: An automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins.
Abstract: It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

Book
01 Jan 2009
TL;DR: This thesis shows how to extend the discriminative learning framework to exploit different types of structure: on one hand, the structure on outputs, such as the combinatorial structure in word alignment; on the other hand, a latent variable structure on inputs,such as in text document classification.
Abstract: Some of the best performing classifiers in modern machine learning have been designed using discriminative learning, as exemplified by Support Vector Machines. The ability of discriminative learning to use flexible features via the kernel trick has enlarged the possible set of applications for machine learning. With the expanded range of possible applications though, it has become apparent that real world data exhibits more structure than has been assumed by classical methods. In this thesis, we show how to extend the discriminative learning framework to exploit different types of structure: on one hand, the structure on outputs, such as the combinatorial structure in word alignment; on the other hand, a latent variable structure on inputs, such as in text document classification. In the context of structured output classification, we present a scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm of Nesterov. We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment. We then show how one can obtain state-of-the-art results for the word alignment task by formulating it as a quadratic assignment problem within our discriminative learning framework. In the context of latent variable models, we present DiscLDA, a discriminative variant of the Latent Dirichlet Allocation (LDA) model which has been popular to model collections of text documents or images. In DiscLDA, we introduce a class-dependent linear transformation on the topic mixture proportions of LDA and estimate it discriminatively by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. Our experiments on the 20 Newsgroups document classification task show how our model can identify shared topics across classes as well as discriminative class-dependent topics.


Journal ArticleDOI
TL;DR: This paper presents a nonparametric Bayesian approach to identifying an unknown number of persistent, smooth dynamical modes by utilizing a hierarchical Dirichlet process prior and employs automatic relevance determination to infer a sparse set of dynamic dependencies.


Proceedings ArticleDOI
04 Jan 2009
TL;DR: I will give some examples of how this blend of ideas leads to useful models in some applied problem domains, including natural language parsing, computational vision, statistical genetics and protein structural modeling.
Abstract: Computer Science has historically been strong on data structures and weak on inference from data, whereas Statistics has historically been weak on data structures and strong on inference from data. One way to draw on the strengths of both disciplines is to develop "inferential methods for data structures"; i.e., methods that update probability distributions on recursively-defined objects such as trees, graphs, grammars and function calls. This is the world of "nonparametric Bayes," where prior and posterior distributions are allowed to be general stochastic processes. Both statistical and computational considerations lead one to certain classes of stochastic processes, and these tend to have interesting connections to combinatorics. I will give some examples of how this blend of ideas leads to useful models in some applied problem domains, including natural language parsing, computational vision, statistical genetics and protein structural modeling.

Proceedings Article
15 Apr 2009
TL;DR: In this article, a probabilistic interpretation of principal coordinate analysis (PCO) is presented, which yields a maximum likelihood procedure for estimating the PCO parameters and an iterative expectation-maximization algorithm for obtaining maximum likelihood estimates.
Abstract: Principal coordinate analysis (PCO), a dual of principal component analysis (PCA), is a classical method for exploratory data analysis. In this paper we provide a probabilistic interpretation of PCO. We show that this interpretation yields a maximum likelihood procedure for estimating the PCO parameters and we also present an iterative expectation-maximization algorithm for obtaining maximum likelihood estimates. Finally, we show that our framework yields a probabilistic formulation of kernel PCA.
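
The probabilistic model itself is not given in the abstract, so the sketch below shows only the classical (deterministic) PCO procedure that the paper reinterprets probabilistically: double-center a matrix of squared pairwise distances and embed with the top eigenvectors.

```python
import numpy as np

def classical_pco(D, k=2):
    """Classical principal coordinate analysis: embed points in R^k from a
    matrix D of pairwise squared distances via double centering and an
    eigendecomposition (the deterministic procedure that the paper gives a
    probabilistic interpretation of)."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ D @ H                   # doubly centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]     # top-k eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Hypothetical usage: recover a 2-D configuration from pairwise distances.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))
D = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
coords = classical_pco(D, k=2)
```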