Showing papers by "Michael I. Jordan published in 2004"

PDF

Open Access

Journal Article•DOI•

Learning the Kernel Matrix with Semidefinite Programming

[...]

Gert R. G. Lanckriet¹, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, Michael I. Jordan - Show less +1 more•Institutions (1)

University of California, Berkeley¹

01 Dec 2004-Journal of Machine Learning Research

TL;DR: This paper shows how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques and leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

...read moreread less

Abstract: Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space---classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm---using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

...read moreread less

2,419 citations

Journal Article•DOI•

Kalman filtering with intermittent observations

[...]

Bruno Sinopoli¹, Luca Schenato¹, Massimo Franceschetti¹, Kameshwar Poolla¹, Michael I. Jordan¹, S. Shankar Sastry¹ - Show less +2 more•Institutions (1)

University of California, Berkeley¹

13 Sep 2004-IEEE Transactions on Automatic Control

TL;DR: This work addresses the problem of performing Kalman filtering with intermittent observations by showing the existence of a critical value for the arrival rate of the observations, beyond which a transition to an unbounded state error covariance occurs.

...read moreread less

Abstract: Motivated by navigation and tracking applications within sensor networks, we consider the problem of performing Kalman filtering with intermittent observations. When data travel along unreliable communication channels in a large, wireless, multihop sensor network, the effect of communication delays and loss of information in the control loop cannot be neglected. We address this problem starting from the discrete Kalman filtering formulation, and modeling the arrival of the observation as a random process. We study the statistical convergence properties of the estimation error covariance, showing the existence of a critical value for the arrival rate of the observations, beyond which a transition to an unbounded state error covariance occurs. We also give upper and lower bounds on this expected state error covariance.

...read moreread less

2,343 citations

Proceedings Article•DOI•

Multiple kernel learning, conic duality, and the SMO algorithm

[...]

Francis Bach¹, Gert R. G. Lanckriet¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

04 Jul 2004

TL;DR: Experimental results are presented that show that the proposed novel dual formulation of the QCQP as a second-order cone programming problem is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.

...read moreread less

Abstract: While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.

...read moreread less

1,625 citations

Report•DOI•

Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

[...]

Kenji Fukumizu, Francis Bach¹, Michael I. Jordan²•Institutions (2)

University of Chicago¹, University of California, Berkeley²

01 Dec 2004

TL;DR: A novel method of dimensionality reduction for supervised learning problems that requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y, and establishes a general nonparametric characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces.

...read moreread less

Abstract: We propose a novel method of dimensionality reduction for supervised learning problems. Given a regression or classification problem in which we wish to predict a response variable Y from an explanatory variable X, we treat the problem of dimensionality reduction as that of finding a low-dimensional "effective subspace" for X which retains the statistical relationship between X and Y. We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem we establish a general nonparametric characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces. This characterization allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods for dimensionality reduction in supervised learning, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y. We present experiments that compare the performance of the method with conventional methods.

...read moreread less

760 citations

Journal Article•DOI•

A statistical framework for genomic data fusion

[...]

Gert R. G. Lanckriet, Tijl De Bie¹, Nello Cristianini², Michael I. Jordan³, William Stafford Noble⁴ - Show less +1 more•Institutions (4)

Katholieke Universiteit Leuven¹, University of California, Davis², University of California, Berkeley³, University of Washington⁴

01 Nov 2004-Bioinformatics

TL;DR: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins.

...read moreread less

Abstract: Motivation: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. Results: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein--protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins---membrane proteins and ribosomal proteins---performs significantly better than the same algorithm trained on any single type of data. Availability: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm

...read moreread less

731 citations

Posted Content•

A direct formulation for sparse PCA using semidefinite programming

[...]

Alexandre d'Aspremont¹, Laurent El Ghaoui, Michael I. Jordan, Gert R. G. Lanckriet²•Institutions (2)

Princeton University¹, University of California, San Diego²

16 Jun 2004-arXiv: Computational Engineering, Finance, and Science

TL;DR: A modification of the classical variational representation of the largest eigenvalue of a symmetric matrix is used, where cardinality is constrained, and a semidefinite programming-based relaxation is derived for the sparse PCA problem.

...read moreread less

Abstract: We examine the problem of approximating, in the Frobenius-norm sense, a positive, semidefinite symmetric matrix by a rank-one matrix, with an upper bound on the cardinality of its eigenvector. The problem arises in the decomposition of a covariance matrix into sparse factors, and has wide applications ranging from biology to finance. We use a modification of the classical variational representation of the largest eigenvalue of a symmetric matrix, where cardinality is constrained, and derive a semidefinite programming based relaxation for our problem. We also discuss Nesterov's smooth minimization technique applied to the SDP arising in the direct sparse PCA method.

...read moreread less

572 citations

Journal Article•DOI•

Chemogenomic profiling: Identifying the functional interactions of small molecules in yeast

[...]

Guri Giaever¹, Patrick Flaherty, Jochen Kumm, Michael F. Proctor, Corey Nislow, Daniel F. Jaramillo, Angela M. Chu, Michael I. Jordan, Adam P. Arkin, Ronald W. Davis - Show less +6 more•Institutions (1)

Stanford University¹

20 Jan 2004-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: The efficacy of a genome-wide protocol in yeast is demonstrated that allows the identification of those gene products that functionally interact with small molecules and result in the inhibition of cellular proliferation and a chemical core structure shared among three therapeutically distinct compounds that inhibit the ERG24 heterozygous deletion strain is identified.

...read moreread less

Abstract: We demonstrate the efficacy of a genome-wide protocol in yeast that allows the identification of those gene products that functionally interact with small molecules and result in the inhibition of cellular proliferation. Here we present results from screening 10 diverse compounds in 80 genome-wide experiments against the complete collection of heterozygous yeast deletion strains. These compounds include anticancer and antifungal agents, statins, alverine citrate, and dyclonine. In several cases, we identified previously known interactions; furthermore, in each case, our analysis revealed novel cellular interactions, even when the relationship between a compound and its cellular target had been well established. In addition, we identified a chemical core structure shared among three therapeutically distinct compounds that inhibit the ERG24 heterozygous deletion strain, demonstrating that cells may respond similarly to compounds of related structure. The ability to identify on-and-off target effects in vivo is fundamental to understanding the cellular response to small-molecule perturbants.

...read moreread less

518 citations

Proceedings Article•

Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

[...]

Yee Whye Teh¹, Michael I. Jordan¹, Matthew J. Beal², David M. Blei¹•Institutions (2)

University of California, Berkeley¹, University of Toronto²

01 Dec 2004

TL;DR: The hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data, is proposed and experimental results are reported showing the effective and superior performance of the HDP over previous models.

...read moreread less

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

...read moreread less

474 citations

Proceedings Article•DOI•

Failure diagnosis using decision trees

[...]

Mike Y. Chen¹, A.X. Zheng¹, J. Lloyd¹, Michael I. Jordan¹, Eric Brewer¹ - Show less +1 more•Institutions (1)

University of California, Berkeley¹

17 May 2004

TL;DR: A decision tree learning approach to diagnosing failures in large Internet sites is presented, and it is found that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives.

...read moreread less

Abstract: We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay, and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We discuss some results in applying simplified decision trees on eBay's production site for several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our contributions include the statistical learning approach, the adaptation of decision trees to the context of failure diagnosis, and the deployment and evaluation of our tools on a high-volume production service.

...read moreread less

387 citations

Proceedings Article•

A Direct Formulation for Sparse PCA Using Semidefinite Programming

[...]

Alexandre d'Aspremont¹, Laurent El Ghaoui, Michael I. Jordan¹, Gert R. G. Lanckriet¹•Institutions (1)

University of California, Berkeley¹

01 Dec 2004

TL;DR: A modification of the classical variational representation of the largest eigenvalue of a symmetric matrix, where cardinality is constrained, is used and derived to derive a semidefinite programming based relaxation for the problem.

...read moreread less

323 citations

Proceedings Article•DOI•

Variational methods for the Dirichlet process

[...]

David M. Blei¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

04 Jul 2004

TL;DR: A mean-field variational approach to approximate inference for the Dirichlet process, where the approximate posterior is based on the truncated stick-breaking construction (Ishwaran & James, 2001).

...read moreread less

Abstract: Variational inference methods, including mean field methods and loopy belief propagation, have been widely used for approximate probabilistic inference in graphical models. While often less accurate than MCMC, variational methods provide a fast deterministic approximation to marginal and conditional probabilities. Such approximations can be particularly useful in high dimensional problems where sampling methods are too slow to be effective. A limitation of current methods, however, is that they are restricted to parametric probabilistic models. MCMC does not have such a limitation; indeed, MCMC samplers have been developed for the Dirichlet process (DP), a nonparametric distribution on distributions (Ferguson, 1973) that is the cornerstone of Bayesian nonparametric statistics (Escobar & West, 1995; Neal, 2000). In this paper, we develop a mean-field variational approach to approximate inference for the Dirichlet process, where the approximate posterior is based on the truncated stick-breaking construction (Ishwaran & James, 2001). We compare our approach to DP samplers for Gaussian DP mixture models.

...read moreread less

Proceedings Article•

Semi-supervised Learning via Gaussian Processes

[...]

Neil D. Lawrence¹, Michael I. Jordan²•Institutions (2)

University of Sheffield¹, University of California, Berkeley²

01 Dec 2004

TL;DR: A probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data using a "null category noise model" (NCNM) inspired by ordered categorical noise models.

...read moreread less

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a "null category noise model" (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits.

...read moreread less

Journal Article•DOI•

Learning graphical models for stationary time series

[...]

Francis Bach¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

01 Aug 2004-IEEE Transactions on Signal Processing

TL;DR: This paper describes an algorithm for efficient forecasting for stationary Gaussian time series whose spectral densities factorize in a graphical model and shows how to make use of Mercer kernels in this setting, allowing the ideas to be extended to nonlinear models.

...read moreread less

Abstract: Probabilistic graphical models can be extended to time series by considering probabilistic dependencies between entire time series For stationary Gaussian time series, the graphical model semantics can be expressed naturally in the frequency domain, leading to interesting families of structured time series models that are complementary to families defined in the time domain In this paper, we present an algorithm to learn the structure from data for directed graphical models for stationary Gaussian time series We describe an algorithm for efficient forecasting for stationary Gaussian time series whose spectral densities factorize in a graphical model We also explore the relationships between graphical model structure and sparsity, comparing and contrasting the notions of sparsity in the time domain and the frequency domain Finally, we show how to make use of Mercer kernels in this setting, allowing our ideas to be extended to nonlinear models

...read moreread less

Proceedings Article•

Computing regularization paths for learning multiple kernels

[...]

Francis Bach¹, Romain Thibaux¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

01 Dec 2004

TL;DR: Working in the setting of kernel linear regression and kernel logistic regression, it is shown empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1- norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.

...read moreread less

Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.

...read moreread less

Proceedings Article•

Blind One-microphone Speech Separation: A Spectral Learning Approach

[...]

Francis Bach¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

01 Dec 2004

TL;DR: This work forms the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets, and develops an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures.

...read moreread less

Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures.

...read moreread less

Probabilistic models of text and images

[...]

David M. Blei¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

01 Jan 2004

TL;DR: A suite of probabilistic models of information collections for which the above problems can be cast as statistical queries are described, and directed graphical models are used as a flexible, modular framework for describing appropriate modeling assumptions about the data.

...read moreread less

Abstract: Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, sounds, and genetic information have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. In this thesis, we describe a suite of probabilistic models of information collections for which the above problems can be cast as statistical queries. We use directed graphical models as a flexible, modular framework for describing appropriate modeling assumptions about the data. Fast approximate posterior inference algorithms based on variational methods free us from having to specify tractable models, and further allow us to take the Bayesian perspective, even in the face of large datasets. With this framework in hand, we describe latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits a finite index of hidden topics which describe the underlying documents. New documents are situated into the collection via approximate posterior inference of their associated index terms. Extensions to LDA can index a set of images, or multimedia collections of interrelated text and images. Finally, we describe nonparametric Bayesian methods for relaxing the assumption of a fixed number of topics, and develop models based on the natural assumption that the size of the index can grow with the collection. This idea is extended to trees, and to models which represent the hidden structure and content of a topic hierarchy that underlies a collection.

...read moreread less

Journal Article•DOI•

Logos: a modular bayesian model for de novo motif detection.

[...]

Eric P. Xing¹, Wei Wu², Michael I. Jordan¹, Richard M. Karp¹•Institutions (2)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory²

01 Mar 2004-Journal of Bioinformatics and Computational Biology

TL;DR: LOGOS is presented, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complexBiopolymer sequence analysis.

...read moreread less

Abstract: The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility.

...read moreread less

Journal Article•DOI•

Multiple-sequence functional annotation and the generalized hidden Markov phylogeny

[...]

Jon McAuliffe¹, Lior Pachter¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

12 Aug 2004-Bioinformatics

TL;DR: A formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods is developed and a generalized hidden Markov phylogeny (GHMP) is shown how GHMPs can be used to predict complete shared gene structures in multiple primate sequences.

...read moreread less

Abstract: Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human--mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. Availability: A Web server is available at http://bonaire.lbl.gov/shadower

...read moreread less

Proceedings Article•DOI•

Bayesian haplo-type inference via the dirichlet process

[...]

Eric P. Xing¹, Roded Sharan¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

04 Jul 2004

TL;DR: A Bayesian approach to the problem of inferring haplotypes from genotypes of single nucleotide polymorphisms based on a nonparametric prior known as the Dirichlet process is presented, which incorporates a likelihood that captures statistical errors in the haplotype/genotype relationship.

...read moreread less

Abstract: The problem of inferring haplotypes from genotypes of single nucleotide polymorphisms (SNPs) is essential for the understanding of genetic variation within and among populations, with important applications to the genetic analysis of disease propensities and other complex traits. The problem can be formulated as a mixture model, where the mixture components correspond to the pool of haplotypes in the population. The size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. Thus methods for fitting the genotype mixture must crucially address the problem of estimating a mixture with an unknown number of mixture components. In this paper we present a Bayesian approach to this problem based on a nonparametric prior known as the Dirichlet process. The model also incorporates a likelihood that captures statistical errors in the haplotype/genotype relationship. We apply our approach to the analysis of both simulated and real genotype data, and compare to extant methods.

...read moreread less

Book Chapter•DOI•

Extensions of the informative vector machine

[...]

Neil D. Lawrence¹, John Platt², Michael I. Jordan³•Institutions (3)

University of Sheffield¹, Microsoft², University of California, Berkeley³

07 Sep 2004

TL;DR: This paper proposes a novel noise model that allows the IVM to be applied to a mixture of labeled and unlabeled data, and uses IVM on a block-diagonal covariance matrix, for “learning to learn” from related tasks.

...read moreread less

Abstract: The informative vector machine (IVM) is a practical method for Gaussian process regression and classification. The IVM produces a sparse approximation to a Gaussian process by combining assumed density filtering with a heuristic for choosing points based on minimizing posterior entropy. This paper extends IVM in several ways. First, we propose a novel noise model that allows the IVM to be applied to a mixture of labeled and unlabeled data. Second, we use IVM on a block-diagonal covariance matrix, for “learning to learn” from related tasks. Third, we modify the IVM to incorporate prior knowledge from known invariances. All of these extensions are tested on artificial and real data.

...read moreread less

Sparse Gaussian Process Classification With Multiple Classes

[...]

Matthias Seeger, Michael I. Jordan

01 Jan 2004

TL;DR: This work shows how to generalize the binary classification informative vector machine (IVM) to multiple classes and is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter adaption via marginal likelihood maximization.

...read moreread less

Abstract: Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the binary classification informative vector machine (IVM) (Lawrence et.al., 2002) to multiple classes. In contrast to earlier efficient approaches to kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter adaption via marginal likelihood maximization. While most earlier proposals suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.

...read moreread less

Journal Article•DOI•

Robust sparse hyperplane classifiers: application to uncertain molecular profiling data.

[...]

Chiranjib Bhattacharyya¹, L. R. Grate, Michael I. Jordan, L. El Ghaoui, I. S. Mian - Show less +1 more•Institutions (1)

University of California, Berkeley¹

01 Dec 2004-Journal of Computational Biology

TL;DR: The task of learning a robust sparse hyperplane from such data is formulated as a second order cone program (SOCP).

...read moreread less

Abstract: Molecular profiling studies can generate abundance measurements for thousands of transcripts, proteins, metabolites, or other species in, for example, normal and tumor tissue samples. Treating such measurements as features and the samples as labeled data points, sparse hyperplanes provide a statistical methodology for classifying data points into one of two categories (classification and prediction) and defining a small subset of discriminatory features (relevant feature identification). However, this and other extant classification methods address only implicitly the issue of observed data being a combination of underlying signals and noise. Recently, robust optimization has emerged as a powerful framework for handling uncertain data explicitly. Here, ideas from this field are exploited to develop robust sparse hyperplanes, i.e., classification and relevant feature identification algorithms that are resilient to variation in the data. Specifically, each data point is associated with an explicit data uncertainty model in the form of an ellipsoid parameterized by a center and covariance matrix. The task of learning a robust sparse hyperplane from such data is formulated as a second order cone program (SOCP). Gaussian and distribution-free data uncertainty models are shown to yield SOCPs that are equivalent to the SCOP based on ellipsoidal uncertainty. The real-world utility of robust sparse hyperplanes is demonstrated via retrospective analysis of breast cancer related transcript profiles. Data-dependent heuristics are used to compute the parameters of each ellipsoidal data uncertainty model. The generalization performance of a specific implementation, designated "robust LIKNON," is better than its nominal counterpart. Finally, the strengths and limitations of robust sparse hyperplanes are discussed.

...read moreread less

Journal Article•DOI•

A Direct Formulation for Sparse Pca Using Semidefinite Programming

[...]

Alexandre d'Aspremont¹, Laurent El Ghaoui², Michael I. Jordan², Gert R. G. Lanckriet²•Institutions (2)

Princeton University¹, University of California, Berkeley²

01 Jun 2004-Social Science Research Network

TL;DR: In this article, the problem of approximating, in the Frobenius-norm sense, a positive, semidefinite symmetric matrix by a rank-one matrix, with an upper bound on the cardinality of its eigenvector was examined.

...read moreread less

Proceedings Article•DOI•

Graph partition strategies for generalized mean field inference

[...]

Eric P. Xing¹, Michael I. Jordan¹, Stuart Russell¹•Institutions (1)

University of California, Berkeley¹

07 Jul 2004

TL;DR: In this article, a combination of graph partitioning algorithms with a generalized mean field (GMF) inference algorithm is presented, which optimizes over disjoint clustering of variables and performs inference using those clusters.

...read moreread less

Abstract: An autonomous variational inference algorithm for arbitrary graphical models requires the ability to optimize variational approximations over the space of model parameters as well as over the choice of tractable families used for the variational approximation. In this paper, we present a novel combination of graph partitioning algorithms with a generalized mean field (GMF) inference algorithm. This combination optimizes over disjoint clustering of variables and performs inference using those clusters. We provide a formal analysis of the relationship between the graph cut and the GMF approximation, and explore several graph partition strategies empirically. Our empirical results provide rather clear support for a weighted version of MinCut as a useful clustering algorithm for GMF inference, which is consistent with the implications from the formal analysis.

...read moreread less

Book•

Dictionary of Gods And Goddesses

[...]

Michael I. Jordan

01 Aug 2004

TL;DR: The Dictionary of Gods and Goddesses, Second Edition as mentioned in this paper provides access to more than 2,500 of these religious figures, from ancient Sumerian gods through the modern Haitian deities.

...read moreread less

Abstract: An essential collection of more than 2,500 deities of the world. For more than 60,000 years, people have worshiped deities of the sun, sky, and sea, as well as creator gods, relying on the guidance of faith in the midst of the mysterious world around them. Dictionary of Gods and Goddesses. Second Edition provides access to more than 2,500 of these religious figures, from ancient Sumerian gods through the modern Haitian deities. Providing a plethora of information from cultures as diverse as the Aztecs, Celts, and Japanese, this dictionary discusses lesser-known divinities as well as the contemporary gods of the major monotheistic religions - Allah, God, and Yahweh, among others. New features, including cross-references and a comprehensive index, make this revised edition more accessible than ever. Dictionary of Gods and Goddesses, Second Edition is an indispensable resource perfect for general readers interested in mythology and religion, as well as scholars in religious studies, anthropology, history, and archaeology.

...read moreread less

Fast Kernel Learning using Sequential Minimal Optimization

[...]

Francis Bach, Gert R. G. Lanckriet, Michael I. Jordan

01 Jan 2004

...read moreread less

Proceedings Article•DOI•

Decentralized detection and classification using kernel methods

[...]

XuanLong Nguyen¹, Martin J. Wainwright¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

04 Jul 2004

TL;DR: This work proposes a novel algorithm using the framework of empirical risk minimization and marginalized kernels, and analyzes its computational and statistical properties both theoretically and empirically.

...read moreread less

Abstract: We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels, and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm, and demonstrate its performance on both simulated and real data sets.

...read moreread less

Probabilistic graphical models and algorithms for genomic analysis

[...]

Poe Xing¹, Richard M. Karp¹, Michael I. Jordan¹, Stuart Russell¹•Institutions (1)

University of California, Berkeley¹

01 Jan 2004

TL;DR: This thesis discusses two probabilistic modeling problems arising in metazoan genomic analysis: identifying motifs and cis-regulatory modules (CRMs) from transcriptional regulatory sequences, and inferring haplotypes from genotypes of single nucleotide polymorphisms.

...read moreread less

Abstract: In this thesis, I discuss two probabilistic modeling problems arising in metazoan genomic analysis: identifying motifs and cis-regulatory modules (CRMs) from transcriptional regulatory sequences, and inferring haplotypes from genotypes of single nucleotide polymorphisms. Motif and CRM identification is important for understanding the gene regulatory network underlying metazoan development and functioning. I discuss a modular Bayesian model that captures rich structural characteristics of the transcriptional regulatory sequences and supports a variety of motif detection tasks. Haplotype inference is essential for the understanding of genetic variation within and among populations, with important applications to the genetic analysis of disease propensities. I discuss a Bayesian model based on a prior distribution constructed from a Dirichlet process—a nonparametric prior which provides control over the size of the unknown pool of population haplotypes, and on a likelihood function that allows statistical errors in the haplotype/genotype relationship. Our models use the “probabilistic graphical model” formalism, a formalism that exploits the conjoined capabilities of graph theory and probability theory to build complex models out of simpler pieces. I discuss the mathematical underpinnings for the models, how they formally incorporate biological prior knowledge about the data, and I present a generalized mean field theory and a generic algorithm for approximate inference on such models.

...read moreread less

Journal Article•

A Framework for Genomic Data Fusion and its Application to Membrane Protein Prediction

[...]

Gert R. G. Lanckriet, Tijl De Bie, Nello Cristianini, Michael I. Jordan, W Stafford Noble - Show less +1 more

01 Jan 2004-Bioinformatics

TL;DR: A computational framework for integrating and drawing inferences from a collection of genome-wide measurements, which demonstrates the utility of predicting membrane proteins from heterogeneous data, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions.

...read moreread less

Abstract: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each data set is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion—recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in an optimal way. These methods formulate the problem of optimal kernel combination as a convex optimization problem that can be solved with semi-definite programming techniques. In this paper, we demonstrate the utility of these techniques by investigating the problem of predicting membrane proteins from heterogeneous data, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions. A statistical learning algorithm trained from all of these data performs significantly better than the same algorithm trained on any single type of data and better than existing algorithms for membrane protein classification.

...read moreread less

Fast Sparse Gaussian Process Classication With Multiple Classes

[...]

Matthias Seeger, Michael I. Jordan

01 Jan 2004

TL;DR: This work shows how to generalize the Informative Vector Machine (IVM) to a multi-way classication setting and yields valid uncertainty estimates and allows for hyperparameter adaption via optimization.

...read moreread less

Abstract: The favourable scaling behaviour of sparse approximations to Bayesian inference for Gaussian Process models makes them attractive for large-scale applications. We show how to generalize the Informative Vector Machine (IVM) [3] to a multi-way classication setting. While being a kernel-based approach, our method yields valid uncertainty estimates and allows for hyperparameter adaption via optimization. Our scheme operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.

...read moreread less