
Showing papers by "Michael I. Jordan published in 2008"


Book
16 Dec 2008
TL;DR: The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.
Abstract: The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.
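
The central variational representation referred to in this abstract can be stated compactly. In the notation standard for this line of work (a summary supplied here for orientation, not text from the monograph), for an exponential family with natural parameter θ and cumulant (log-partition) function A, conjugate duality with the entropy gives:

```latex
% Variational representation of the cumulant function A(\theta).
% \mathcal{M} is the set of realizable mean parameters and A^{*}(\mu) is the
% conjugate dual of A (the negative entropy); the approximate methods listed
% in the abstract arise from approximating \mathcal{M} and/or A^{*}.
A(\theta) \;=\; \sup_{\mu \in \mathcal{M}} \bigl\{ \langle \theta, \mu \rangle - A^{*}(\mu) \bigr\}.
```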

4,335 citations


Proceedings Article
08 Dec 2008
TL;DR: This paper presents DiscLDA, a discriminative variation on Latent Dirichlet Allocation in which a class-dependent linear transformation is introduced on the topic mixture proportions, and obtains a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification.
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
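
As a reading aid, here is a minimal numpy sketch of the core modeling idea stated above: a class-dependent linear transformation applied to the topic-mixture proportions before mixing the topic-word distributions. All names, dimensions, and parameter values are illustrative assumptions, not the paper's code or notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_transformed, n_classes, vocab_size = 10, 20, 2, 5000

# One linear map per class, taking K topic proportions to K' transformed proportions.
# Each column sums to 1, so a point on the simplex maps back onto the simplex.
T = rng.dirichlet(np.ones(n_transformed), size=(n_classes, n_topics))  # (C, K, K')
T = np.transpose(T, (0, 2, 1))                                         # (C, K', K)

beta = rng.dirichlet(np.ones(vocab_size), size=n_transformed)          # topic-word dists

def word_distribution(theta, y):
    """Class-dependent word distribution: mix topics with transformed proportions."""
    transformed_theta = T[y] @ theta     # class-dependent transformation of theta
    return transformed_theta @ beta      # distribution over the vocabulary

theta = rng.dirichlet(np.ones(n_topics))         # a document's topic proportions
p_words = word_distribution(theta, y=1)
print(p_words.shape, round(float(p_words.sum()), 6))   # (5000,) 1.0
```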

420 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: A sampling algorithm is developed that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates; experiments demonstrate the advantages of the sticky extension and the utility of the HDP-HMM in real-world applications.
Abstract: The hierarchical Dirichlet process hidden Markov model (HDP-HMM) is a flexible, nonparametric model which allows state spaces of unknown size to be learned from data. We demonstrate some limitations of the original HDP-HMM formulation (Teh et al., 2006), and propose a sticky extension which allows more robust learning of smoothly varying dynamics. Using DP mixtures, this formulation also allows learning of more complex, multimodal emission distributions. We further develop a sampling algorithm that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates. Via extensive experiments with synthetic data and the NIST speaker diarization database, we demonstrate the advantages of our sticky extension, and the utility of the HDP-HMM in real-world applications.
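
For context, the "sticky" modification described above can be summarized by the following transition prior, in notation standard for the HDP-HMM literature (a summary, not an excerpt from the paper):

```latex
% Sticky HDP-HMM transition prior: \kappa \ge 0 biases each state toward
% self-transitions, and \kappa = 0 recovers the original HDP-HMM.
\beta \mid \gamma \;\sim\; \mathrm{GEM}(\gamma), \qquad
\pi_j \mid \alpha, \kappa, \beta \;\sim\;
  \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha\beta + \kappa\,\delta_j}{\alpha + \kappa}\right), \qquad
z_t \mid z_{t-1} \;\sim\; \pi_{z_{t-1}}.
```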

313 citations


Journal ArticleDOI
TL;DR: The sparsity-overlap function ψ(B*) reveals that, if the design is uncorrelated on the active rows, ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to an ordinary Lasso approach and can yield substantial improvements in sample complexity when the regression vectors are suitably orthogonal.
Abstract: In multivariate regression, a $K$-dimensional response vector is regressed upon a common set of $p$ covariates, with a matrix $B^*\in\mathbb{R}^{p\times K}$ of regression coefficients. We study the behavior of the multivariate group Lasso, in which block regularization based on the $\ell_1/\ell_2$ norm is used for support union recovery, or recovery of the set of $s$ rows for which $B^*$ is nonzero. Under high-dimensional scaling, we show that the multivariate group Lasso exhibits a threshold for the recovery of the exact row pattern with high probability over the random design and noise that is specified by the sample complexity parameter $\theta(n,p,s):=n/[2\psi(B^*)\log(p-s)]$. Here $n$ is the sample size, and $\psi(B^*)$ is a sparsity-overlap function measuring a combination of the sparsities and overlaps of the $K$-regression coefficient vectors that constitute the model. We prove that the multivariate group Lasso succeeds for problem sequences $(n,p,s)$ such that $\theta(n,p,s)$ exceeds a critical level $\theta_u$, and fails for sequences such that $\theta(n,p,s)$ lies below a critical level $\theta_{\ell}$. For the special case of the standard Gaussian ensemble, we show that $\theta_{\ell}=\theta_u$ so that the characterization is sharp. The sparsity-overlap function $\psi(B^*)$ reveals that, if the design is uncorrelated on the active rows, $\ell_1/\ell_2$ regularization for multivariate regression never harms performance relative to an ordinary Lasso approach and can yield substantial improvements in sample complexity (up to a factor of $K$) when the coefficient vectors are suitably orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform the multivariate group Lasso. We complement our analysis with simulations that demonstrate the sharpness of our theoretical results, even for relatively small problems.
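
The estimator analyzed here is the ℓ1/ℓ2-penalized least-squares program. The following proximal-gradient sketch is provided only as a reference implementation of that program (an illustration under simplifying assumptions, not the authors' code or tuning):

```python
import numpy as np

def multivariate_group_lasso(X, Y, lam, n_iters=500):
    """Minimize (1/2n)||Y - XB||_F^2 + lam * sum_i ||B_i||_2 over B (p x K).

    Proximal gradient descent with row-wise block soft-thresholding; a simple
    reference sketch, not optimized and not the authors' implementation.
    """
    n, p = X.shape
    K = Y.shape[1]
    B = np.zeros((p, K))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)       # 1 / Lipschitz constant
    for _ in range(n_iters):
        grad = X.T @ (X @ B - Y) / n
        Z = B - step * grad
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
        B = shrink * Z                                  # rows with small norm are zeroed
    return B

# Estimated support union: rows of the estimate with nonzero l2 norm, e.g.
# support = np.nonzero(np.linalg.norm(B_hat, axis=1) > 1e-8)[0]
```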

287 citations


Journal ArticleDOI
TL;DR: The results show that currently available data for mammals allows predictions with both breadth and accuracy, and many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
Abstract: Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.

263 citations


Proceedings Article
08 Dec 2008
TL;DR: This work develops a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences in an unknown number of persistent, smooth dynamical modes.
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.

221 citations


Proceedings Article
08 Dec 2008
TL;DR: This work develops a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases, and uses Gaussian processes to discover spatially contiguous segments which respect image boundaries.
Abstract: We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman-Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes.
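
To illustrate the power-law behavior attributed to the Pitman-Yor process above, here is a small sketch of its two-parameter Chinese-restaurant predictive rule (a generic illustration with made-up parameter values, not the paper's image-segmentation model):

```python
import numpy as np

def pitman_yor_sample(n, d=0.8, theta=1.0, seed=0):
    """Draw cluster assignments from the two-parameter Chinese restaurant process.

    d in [0, 1) is the discount, theta > -d the concentration. Larger d gives
    heavier (power-law) tails in the cluster sizes, mirroring the segment-size
    and object-frequency statistics discussed in the abstract.
    """
    rng = np.random.default_rng(seed)
    counts = []                                   # customers per table
    labels = np.empty(n, dtype=int)
    for i in range(n):
        k = len(counts)
        probs = np.array([c - d for c in counts] + [theta + d * k], dtype=float)
        probs /= probs.sum()
        z = rng.choice(k + 1, p=probs)
        labels[i] = z
        if z == k:
            counts.append(1)
        else:
            counts[z] += 1
    return labels, counts

labels, counts = pitman_yor_sample(10_000)
print(len(counts), sorted(counts, reverse=True)[:5])   # many clusters, heavy-tailed sizes
```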

202 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: This paper presents a unified framework for studying parameter estimators, which allows their relative (statistical) efficiencies to be compared, and suggests that modeling more of the data tends to reduce variance, but at the cost of being more sensitive to model misspecification.
Abstract: Statistical and computational concerns have motivated parameter estimators based on various forms of likelihood, e.g., joint, conditional, and pseudolikelihood. In this paper, we present a unified framework for studying these estimators, which allows us to compare their relative (statistical) efficiencies. Our asymptotic analysis suggests that modeling more of the data tends to reduce variance, but at the cost of being more sensitive to model misspecification. We present experiments validating our analysis.
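
For concreteness, the estimation criteria being compared are typically written as follows, using generic notation supplied here (with z = (x, y) denoting the full data vector); the paper studies such criteria within a single unified family:

```latex
% Joint, conditional, and pseudolikelihood objectives for parameters \theta:
\ell_{\mathrm{joint}}(\theta)  = \log p_\theta(x, y), \qquad
\ell_{\mathrm{cond}}(\theta)   = \log p_\theta(y \mid x), \qquad
\ell_{\mathrm{pseudo}}(\theta) = \sum_i \log p_\theta\!\left(z_i \mid z_{\setminus i}\right).
```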

141 citations


Proceedings ArticleDOI
15 Dec 2008
TL;DR: This paper proposes several NMF inspired algorithms to solve different data mining problems, including multi-way normalized cut spectral clustering, graph matching of both undirected and directed graphs, and maximal clique finding on both graphs and bipartite graphs.
Abstract: Nonnegative matrix factorization (NMF) is a versatile model for data clustering. In this paper, we propose several NMF inspired algorithms to solve different data mining problems. They include (1) multi-way normalized cut spectral clustering, (2) graph matching of both undirected and directed graphs, and (3) maximal clique finding on both graphs and bipartite graphs. Key features of these algorithms are (a) they are extremely simple to implement; and (b) they are provably convergent. We conduct experiments to demonstrate the effectiveness of these new algorithms. We also derive a new spectral bound for the size of maximal edge bicliques as a byproduct of our approach.
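
As background for the simple, provably convergent multiplicative algorithms mentioned above, here is the classical Lee-Seung update for the basic Frobenius-loss NMF problem. The paper's algorithms for spectral clustering, graph matching, and clique finding are analogous variants that are not reproduced here.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iters=200, eps=1e-10, seed=0):
    """Classical Lee-Seung multiplicative updates for V ~ W @ H (Frobenius loss).

    Shown only to illustrate the style of monotone multiplicative updates;
    not the specialized updates derived in the paper.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```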

129 citations


Proceedings Article
11 Dec 2008
TL;DR: This paper combines log parsing and text mining with source code analysis to extract structure from the console logs and extracts features from the structured information in order to detect anomalous patterns in the logs using Principal Component Analysis (PCA).
Abstract: The console logs generated by an application contain messages that the application developers believed would be useful in debugging or monitoring the application. Despite the ubiquity and large size of these logs, they are rarely exploited in a systematic way for monitoring and debugging because they are not readily machine-parsable. In this paper, we propose a novel method for mining this rich source of information. First, we combine log parsing and text mining with source code analysis to extract structure from the console logs. Second, we extract features from the structured information in order to detect anomalous patterns in the logs using Principal Component Analysis (PCA). Finally, we use a decision tree to distill the results of PCA-based anomaly detection to a format readily understandable by domain experts (e.g. system operators) who need not be familiar with the anomaly detection algorithms. As a case study, we distill over one million lines of console logs from the Hadoop file system to a simple decision tree that a domain expert can readily understand; the process requires no operator intervention and we detect a large portion of runtime anomalies that are commonly overlooked.
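
A stripped-down sketch of the PCA step described above, operating on an already-constructed feature matrix (the paper's feature extraction from parsed logs and its threshold selection are more involved; the names below are illustrative):

```python
import numpy as np

def pca_anomaly_scores(X, k):
    """Squared prediction error (SPE) of each row of X against the top-k PCA subspace.

    X: (n_events, n_features) feature matrix, e.g. message-count vectors built
    from parsed console logs. Rows with large SPE are candidate anomalies. This
    is a simplified sketch; threshold selection is not shown.
    """
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                          # principal subspace basis (n_features, k)
    residual = Xc - Xc @ P @ P.T          # component in the residual subspace
    return np.sum(residual ** 2, axis=1)  # SPE per event

# Example: flag the 1% of events with the largest residual energy.
# scores = pca_anomaly_scores(X, k=3)
# anomalies = scores > np.quantile(scores, 0.99)
```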

120 citations


Journal ArticleDOI
TL;DR: This work focuses on methods for calibrating and combining independent predictions to obtain a set of probabilistic predictions that are consistent with the topology of the ontology, and finds that many apparently reasonable reconciliation methods yield reconciled probabilities with significantly lower precision than the original, unreconciled estimates.
Abstract: In predicting hierarchical protein function annotations, such as terms in the Gene Ontology (GO), the simplest approach makes predictions for each term independently. However, this approach has the unfortunate consequence that the predictor may assign to a single protein a set of terms that are inconsistent with one another; for example, the predictor may assign a specific GO term to a given protein ('purine nucleotide binding') but not assign the parent term ('nucleotide binding'). Such predictions are difficult to interpret. In this work, we focus on methods for calibrating and combining independent predictions to obtain a set of probabilistic predictions that are consistent with the topology of the ontology. We call this procedure 'reconciliation'. We begin with a baseline method for predicting GO terms from a collection of data types using an ensemble of discriminative classifiers. We apply the method to a previously described benchmark data set, and we demonstrate that the resulting predictions are frequently inconsistent with the topology of the GO. We then consider 11 distinct reconciliation methods: three heuristic methods; four variants of a Bayesian network; an extension of logistic regression to the structured case; and three novel projection methods - isotonic regression and two variants of a Kullback-Leibler projection method. We evaluate each method in three different modes - per term, per protein and joint - corresponding to three types of prediction tasks. Although the principal goal of reconciliation is interpretability, it is important to assess whether interpretability comes at a cost in terms of precision and recall. Indeed, we find that many apparently reasonable reconciliation methods yield reconciled probabilities with significantly lower precision than the original, unreconciled estimates. On the other hand, we find that isotonic regression usually performs better than the underlying, unreconciled method, and almost never performs worse; isotonic regression appears to be able to use the constraints from the GO network to its advantage. An exception to this rule is the high precision regime for joint evaluation, where Kullback-Leibler projection yields the best performance.
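
To make the notion of "reconciliation" concrete, here is one very simple heuristic of the kind the paper uses as baselines, forcing each ancestor's probability to be at least that of its descendants. This is not the isotonic-regression or Kullback-Leibler projection method; the names and structure are illustrative.

```python
def propagate_max_up(probs, parents):
    """Make per-term probabilities consistent with a GO-style hierarchy.

    probs:   dict term -> independently predicted probability
    parents: dict term -> list of parent terms (the ontology DAG)

    A simple heuristic in the spirit of the paper's baseline reconciliation
    methods: raise each ancestor's probability to at least the maximum of its
    descendants, so no child outscores its parent.
    """
    reconciled = dict(probs)
    changed = True
    # Repeatedly push child scores up to parents; values only increase and are
    # drawn from a finite set, so the loop terminates on a DAG.
    while changed:
        changed = False
        for term, ps in parents.items():
            for p in ps:
                if reconciled.get(p, 0.0) < reconciled[term]:
                    reconciled[p] = reconciled[term]
                    changed = True
    return reconciled

probs = {"nucleotide binding": 0.3, "purine nucleotide binding": 0.8}
parents = {"purine nucleotide binding": ["nucleotide binding"]}
print(propagate_max_up(probs, parents))   # parent raised to 0.8
```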

01 Jan 2008
TL;DR: The work in this article was supported in part by ATR Auditory and Visual Perception Research Laboratories, by a grant from Siemens Corporation, and by grant N00014-90-J-1942 awarded by the Office of Naval Research; Michael Jordan is an NSF Presidential Young Investigator.
Abstract: *I want to thank Elliot Saltzman, Steven Keele, and Herbert Heuer for helpful comments on the manuscript. Preparation of this paper was supported in part by a grant from ATR Auditory and Visual Perception Research Laboratories, by a grant from Siemens Corporation, by a grant from the Human Frontier Science Program, by a grant from the McDonnell-Pew Foundation, and by grant N00014-90-J-1942 awarded by the Office of Naval Research. Michael Jordan is an NSF Presidential Young Investigator.

Proceedings Article
08 Dec 2008
TL;DR: It is shown that the error of spectral clustering under perturbation is closely related to the perturbation of the eigenvectors of the Laplacian matrix; the resulting upper bound is empirically tight across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed to meet a specification of permitted loss in clustering performance.
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance.
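
For reference, the baseline spectral clustering pipeline whose eigenvectors are analyzed under perturbation looks roughly like this (a standard normalized-cut recipe using numpy/scikit-learn, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k, seed=0):
    """Standard normalized spectral clustering on an affinity matrix W.

    The paper studies how perturbing W (quantization, downsampling, ...) perturbs
    the Laplacian eigenvectors computed below, and hence the final clustering.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]                                       # k smallest eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```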

Posted Content
05 Aug 2008
TL;DR: In this article, the multivariate group Lasso is shown to recover the exact row pattern with high probability over the random design and noise once a sample complexity parameter exceeds a critical threshold, where the sparsity-overlap function appearing in that parameter measures a combination of the sparsities and overlaps of the regression coefficient vectors.
Abstract: In multivariate regression, a $K$-dimensional response vector is regressed upon a common set of $p$ covariates, with a matrix $B^*\in\mathbb{R}^{p\times K}$ of regression coefficients. We study the behavior of the multivariate group Lasso, in which block regularization based on the $\ell_1/\ell_2$ norm is used for support union recovery, or recovery of the set of $s$ rows for which $B^*$ is nonzero. Under high-dimensional scaling, we show that the multivariate group Lasso exhibits a threshold for the recovery of the exact row pattern with high probability over the random design and noise that is specified by the sample complexity parameter $\theta(n,p,s):=n/[2\psi(B^*)\log(p-s)]$. Here $n$ is the sample size, and $\psi(B^*)$ is a sparsity-overlap function measuring a combination of the sparsities and overlaps of the $K$-regression coefficient vectors that constitute the model. We prove that the multivariate group Lasso succeeds for problem sequences $(n,p,s)$ such that $\theta(n,p,s)$ exceeds a critical level $\theta_u$, and fails for sequences such that $\theta(n,p,s)$ lies below a critical level $\theta_{\ell}$. For the special case of the standard Gaussian ensemble, we show that $\theta_{\ell}=\theta_u$ so that the characterization is sharp. The sparsity-overlap function $\psi(B^*)$ reveals that, if the design is uncorrelated on the active rows, $\ell_1/\ell_2$ regularization for multivariate regression never harms performance relative to an ordinary Lasso approach and can yield substantial improvements in sample complexity (up to a factor of $K$) when the coefficient vectors are suitably orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform the multivariate group Lasso. We complement our analysis with simulations that demonstrate the sharpness of our theoretical results, even for relatively small problems.

Journal ArticleDOI
TL;DR: This work presents an augmented form of Markov models that can be used to predict historical recombination events and can model background linkage disequilibrium (LD) more accurately and study some of the computational issues that arise in using such Markovian models on realistic data sets.
Abstract: Inference of ancestral information in recently admixed populations, in which every individual is composed of a mixed ancestry (e.g., African Americans in the United States), is a challenging problem. Several previous model-based approaches to admixture have been based on hidden Markov models (HMMs) and Markov hidden Markov models (MHMMs). We present an augmented form of these models that can be used to predict historical recombination events and can model background linkage disequilibrium (LD) more accurately. We also study some of the computational issues that arise in using such Markovian models on realistic data sets. In particular, we present an effective initialization procedure that, when combined with expectation-maximization (EM) algorithms for parameter estimation, yields high accuracy at significantly decreased computational cost relative to the Markov chain Monte Carlo (MCMC) algorithms that have generally been used in earlier studies. We present experiments exploring these modeling and algorithmic issues in two scenarios-the inference of locus-specific ancestries in a population that is assumed to originate from two unknown ancestral populations, and the inference of allele frequencies in one ancestral population given those in another.

Proceedings Article
09 Jul 2008
TL;DR: In this paper, a non-exchangeable prior for a class of nonparametric latent feature models that is nearly as efficient computationally as its exchangeable counterpart is presented. The model is applicable to the general setting in which the dependencies between objects can be expressed using a tree, where edge lengths indicate the strength of relationships.
Abstract: Nonparametric Bayesian models are often based on the assumption that the objects being modeled are exchangeable. While appropriate in some applications (e.g., bag-of-words models for documents), exchangeability is sometimes assumed simply for computational reasons; non-exchangeable models might be a better choice for applications based on subject matter. Drawing on ideas from graphical models and phylogenetics, we describe a non-exchangeable prior for a class of nonparametric latent feature models that is nearly as efficient computationally as its exchangeable counterpart. Our model is applicable to the general setting in which the dependencies between objects can be expressed using a tree, where edge lengths indicate the strength of relationships. We demonstrate an application to modeling probabilistic choice.

01 Jan 2008
TL;DR: Taking the “zero-temperature limit” recovers a variational representation for MAP computation as a linear program (LP) over the marginal polytope, which clarifies the essential ingredients of known variational methods, and also suggests novel relaxations.
Abstract: Underlying a variety of techniques for approximate inference—among them mean field, sum-product, and cluster variational methods—is a classical variational principle from statistical physics, which involves a “free energy” optimization problem over the set of all distributions. Working within the framework of exponential families, we describe an alternative view, in which the optimization takes place over the (typically) much lower-dimensional space of mean parameters. The associated constraint set consists of all mean parameters that are globally realizable; for discrete random variables, we refer to this set as a marginal polytope. As opposed to the classical formulation, the representation given here clarifies that there are two distinct components to variational inference algorithms: (a) an approximation to the entropy function; and (b) an approximation to the marginal polytope. This viewpoint clarifies the essential ingredients of known variational methods, and also suggests novel relaxations. Taking the “zero-temperature limit” recovers a variational representation for MAP computation as a linear program (LP) over the marginal polytope. For trees, the max-product updates are a dual method for solving this LP, which provides a variational viewpoint that unifies the sum-product and max-product algorithms.
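
The "zero-temperature limit" mentioned at the end of this abstract can be written out explicitly, in standard notation summarized here:

```latex
% MAP computation as a linear program over the marginal polytope \mathcal{M},
% where \phi are the sufficient statistics of the discrete exponential family.
\max_{x} \,\langle \theta, \phi(x) \rangle
  \;=\; \max_{\mu \in \mathcal{M}} \,\langle \theta, \mu \rangle .
```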

Journal ArticleDOI
TL;DR: In this paper, a margin-based perspective on multiway spectral clustering is presented, which illuminates both the relaxation and rounding aspects of clustering, providing a unified analysis of existing algorithms and guiding the design of new algorithms.
Abstract: Spectral clustering is a broad class of clustering procedures in which an intractable combinatorial optimization formulation of clustering is “relaxed” into a tractable eigenvector problem, and in which the relaxed solution is subsequently “rounded” into an approximate discrete solution to the original problem. In this paper we present a novel margin-based perspective on multiway spectral clustering. We show that the margin-based perspective illuminates both the relaxation and rounding aspects of spectral clustering, providing a unified analysis of existing algorithms and guiding the design of new algorithms. We also present connections between spectral clustering and several other topics in statistics, specifically minimum-variance clustering, Procrustes analysis and Gaussian intrinsic autoregression.

01 Jan 2008
TL;DR: This paper identifies a new approach based on the synergy between virtual machines and statistical machine learning, observes that constrained energy conservation can improve hardware reliability, and gives initial results on a cluster that reduces energy costs, integrated circuit failures, and disk failures.
Abstract: Although there is prior work on energy conservation in datacenters, we identify a new approach based on the synergy between virtual machines and statistical machine learning, and we observe that constrained energy conservation can improve hardware reliability. We give initial results on a cluster that reduces energy costs by a factor of 5, reduces integrated circuit failures by a factor of 1.6, and disk failures by a factor of 5. We propose research milestones to generalize our results and compare them with recent related work.

1. Problem: Energy Efficiency vs. Performance

Power is ranked as the #5 top concern of IT executives [Sca06], with availability and performance being #10 and #11 respectively. [1] This is not surprising given that powering and cooling a datacenter now rivals the cost of the hardware: each $1 spent on servers in 2005 required an additional $0.48 to power and cool it, expected to rise to $0.71 by 2010 [Sca06]. Also, new problems arise as power becomes the constraining resource in datacenters:

1) Space limited by power budget: The University of Buffalo's $2.5M Dell datacenter caused a brownout when switched on, because the operators had failed to arrange for extra power circuits to handle the greater load [Cla05]. Their response was to underutilize the datacenter to avoid another brownout. Similarly, a datacenter at Ask.com is only 2/3 full due to limited power availability [Gil06].

2) Power emergencies: Large power consumers must handle on-demand "agile conservation", such as Pacific Gas & Electric ordered to prevent rolling blackouts during peak power demand [Bra06].

3) Thermal emergencies: Cooling systems can be overtaxed by "hot spots" in a datacenter, leading to downtime or ruined equipment.

At the same time, workload peaks can exceed average utilization by 5x, and datacenters are provisioned for these peaks because compliance with service-level agreements (SLAs) trumps average efficiency [Sca06]. Hence, if we could dynamically turn off underutilized equipment and guarantee no impact on meeting SLAs, we could save, say, 4/5's of the power used by datacenters. Our position is that statistical machine learning (SML) will be the key enabling technology for making policy decisions about turning equipment on and off, and virtual machine technology (VM) will be the enabling mechanism. In this paper we motivate this position, present initial proof-of-concept results, and propose research milestones. We also address the concern that power cycling equipment reduces reliability, explaining how careful power cycling decisions could improve component reliability.

[1] The top concerns are security, system management tools, virtualization, product road map, power consumption, ease of deployment, interoperability, scalability, features and functionality, availability, performance, and product portfolio breadth.

2. The Promise of SML for Making Policy

The promise of statistical machine learning rests in part on recent theoretical progress and in part on the fact that techniques languishing since the 1960's have become practical due to computers becoming 100,000 times faster. Moreover, cheaper computers make resources for monitoring and analysis affordable. This is fortunate, because we find four reasons for enthusiasm about SML.

First, SML techniques work well in dynamic environments where transients (e.g., resource reallocation times) impact performance in complex ways. For example, Tesauro et al. [TJDB06] found that Reinforcement Learning (RL) performed better than queuing theory for making dynamic server allocation decisions, in part because RL learned to "ride out" the short-term effects of transition behavior. Accomplishing the same result using queuing theory would have required avoiding sampling during transitions between steady states, yet performance during these transitions is critical. Queuing theory also cannot inform tradeoffs between power and performance.

Second, techniques such as RL can handle arbitrary nonlinear cost/reward functions. For example, to shape behavior of datacenter operators, Pacific Gas & Electric plans to pick (on short notice) a handful of days each year during which the cost of electricity will be many times higher than the average, and then slightly drop the cost the remaining days [Bra06]. Adaptive operators could achieve major savings if they could exploit these complex reward functions without sacrificing performance.

Third, SML techniques adapt as hardware/software configurations change. For example, eBay and Amazon push 100 software changes in a typical month [BFJ+06]. Whereas SML can automatically adapt to some of these changes, as demonstrated by the use of ensembles of SML models to capture changing system behavior [ZCM+05], queuing models must be redone by experts when the system configuration changes. The problem of managing a datacenter is too large for constant manual intervention; hence, we need techniques that can adapt to such changes, as SML can.

Lastly, because conserving energy is an optimization and not a guarantee, useful progress can be made even though SML models, algorithms and heuristics are imperfect, as long as performance is preserved. Indeed, we demonstrate this by simulation in section 5.

Journal ArticleDOI
TL;DR: A formal hypothesis is developed, in the form of a kinetic model, for the mechanism of action of this GPCR signal transduction system that predicts a synergistic region in the calcium peak height dose response that results when cells are simultaneously stimulated by C5a and UDP.
Abstract: Macrophage cells that are stimulated by two different ligands that bind to G-protein-coupled receptors (GPCRs) usually respond as if the stimulus effects are additive, but for a minority of ligand combinations the response is synergistic. The G-protein-coupled receptor system integrates signaling cues from the environment to actuate cell morphology, gene expression, ion homeostasis, and other physiological states. We analyze the effects of the two signaling molecules complement factors 5a (C5a) and uridine diphosphate (UDP) on the intracellular second messenger calcium to elucidate the principles that govern the processing of multiple signals by GPCRs. We have developed a formal hypothesis, in the form of a kinetic model, for the mechanism of action of this GPCR signal transduction system using data obtained from RAW264.7 macrophage cells. Bayesian statistical methods are employed to represent uncertainty in both data and model parameters and formally tie the model to experimental data. When the model is also used as a tool in the design of experiments, it predicts a synergistic region in the calcium peak height dose response that results when cells are simultaneously stimulated by C5a and UDP. An analysis of the model reveals a potential mechanism for crosstalk between the Gαi-coupled C5a receptor and the Gαq-coupled UDP receptor signaling systems that results in synergistic calcium release.


Proceedings ArticleDOI
01 Aug 2008
TL;DR: This sparsity-overlap function reveals that, if the design is uncorrelated on the active rows, block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to an ordinary Lasso approach, and can yield substantial improvements in sample complexity when the regression vectors are suitably orthogonal.
Abstract: In the problem of multivariate regression, a K-dimensional response vector is regressed upon a common set of p covariates, with a matrix B* ∈ ℝ^(p×K) of regression coefficients. We study the behavior of the group Lasso using ℓ1/ℓ2 regularization for the union support problem, meaning that the set of s rows for which B* is non-zero is recovered exactly. Studying this problem under high-dimensional scaling, we show that group Lasso recovers the exact row pattern with high probability over the random design and noise for scalings of (n, p, s) such that the sample complexity parameter given by θ(n, p, s) := n/[2ψ(B*) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the number of non-zero rows, and ψ(B*) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that, if the design is uncorrelated on the active rows, block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to an ordinary Lasso approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform the group Lasso.

Proceedings Article
08 Dec 2008
TL;DR: This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1-approach, and can yield substantial improvements in sample complexity when the regression vectors are suitably orthogonal relative to the design.
Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θ_{ℓ1/ℓ2}(n, p, s) := n/[2ψ(B*) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B*) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1-approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

01 Jan 2008
TL;DR: This thesis presents general techniques for inference in various nonparametric Bayesian models, furthers the understanding of the stochastic processes at the core of these models, and develops new models of data based on these findings.
Abstract: This thesis presents general techniques for inference in various nonparametric Bayesian models, furthers our understanding of the stochastic processes at the core of these models, and develops new models of data based on these findings. In particular, we develop new Monte Carlo algorithms for Dirichlet process mixtures based on a general framework. We extend the vocabulary of processes used for nonparametric Bayesian models by proving many properties of beta and gamma processes. In particular, we show how to perform probabilistic inference in hierarchies of beta and gamma processes, and how this naturally leads to improvements to the well-known naive Bayes algorithm. We demonstrate the robustness and speed of the resulting methods by applying them to a classification task with 1 million training samples and 40,000 classes.

Journal ArticleDOI
TL;DR: An asymptotic approximation to the optimal cost of stationary quantization rules is developed and exploited to show that stationary quantizers are not optimal in a broad class of settings.
Abstract: We consider the design of systems for sequential decentralized detection, a problem that entails several interdependent choices: the choice of a stopping rule (specifying the sample size), a global decision function (a choice between two competing hypotheses), and a set of quantization rules (the local decisions on the basis of which the global decision is made). This correspondence addresses an open problem of whether in the Bayesian formulation of sequential decentralized detection, optimal local decision functions can be found within the class of stationary rules. We develop an asymptotic approximation to the optimal cost of stationary quantization rules and exploit this approximation to show that stationary quantizers are not optimal in a broad class of settings. We also consider the class of blockwise-stationary quantizers, and show that asymptotically optimal quantizers are likelihood-based threshold rules.

Journal ArticleDOI
TL;DR: In this paper, a non-asymptotic variational characterization of divergences is proposed, which allows the problem of estimating divergences to be tackled via convex empirical risk optimization; the resulting estimators are simple to implement, requiring only the solution of standard convex programs.
Abstract: We develop and analyze $M$-estimation methods for divergence functionals and the likelihood ratios of two probability distributions. Our method is based on a non-asymptotic variational characterization of $f$-divergences, which allows the problem of estimating divergences to be tackled via convex empirical risk optimization. The resulting estimators are simple to implement, requiring only the solution of standard convex programs. We present an analysis of consistency and convergence for these estimators. Given conditions only on the ratios of densities, we show that our estimators can achieve optimal minimax rates for the likelihood ratio and the divergence functionals in certain regimes. We derive an efficient optimization algorithm for computing our estimates, and illustrate their convergence behavior and practical viability by simulations.
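
The variational characterization underlying these M-estimators can be summarized as follows, in a standard form for f-divergences with convex conjugate f* (a paraphrase supplied here, not an excerpt); restricting the supremum to a function class gives the lower bound that the estimator maximizes empirically:

```latex
% Variational characterization of an f-divergence (f convex, f^{*} its conjugate).
% The supremum is over all measurable g; restricting g to a class \mathcal{G}
% yields a lower bound whose empirical version defines the M-estimator.
D_f(\mathbb{P} \,\|\, \mathbb{Q})
  \;=\; \sup_{g} \left\{ \mathbb{E}_{\mathbb{P}}\!\left[g(X)\right]
        - \mathbb{E}_{\mathbb{Q}}\!\left[f^{*}(g(X))\right] \right\}.
```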

Proceedings Article
08 Dec 2008
TL;DR: This work proposes a new ancestry resampling procedure for inference in evolutionary trees and evaluates its method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and shows substantial improvement over the current state of the art.
Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art.


Journal ArticleDOI
TL;DR: A new genealogy-based approach, CAMP (coalescent-based association mapping), is suggested that takes into account the trade-off between the complexity of the genealogy and the power lost due to the additional multiple hypotheses.
Abstract: The central questions asked in whole-genome association studies are how to locate associated regions in the genome and how to estimate the significance of these findings. Researchers usually do this by testing each SNP separately for association and then applying a suitable correction for multiple-hypothesis testing. However, SNPs are correlated by the unobserved genealogy of the population, and a more powerful statistical methodology would attempt to take this genealogy into account. Leveraging the genealogy in association studies is challenging, however, because the inference of the genealogy from the genotypes is a computationally intensive task, in particular when recombination is modeled, as in ancestral recombination graphs. Furthermore, if large numbers of genealogies are imputed from the genotypes, the power of the study might decrease if these imputed genealogies create an additional multiple-hypothesis testing burden. Indeed, we show in this paper that several existing methods that aim to address this problem suffer either from low power or from a very high false-positive rate; their performance is generally not better than the standard approach of separate testing of SNPs. We suggest a new genealogy-based approach, CAMP (coalescent-based association mapping), that takes into account the trade-off between the complexity of the genealogy and the power lost due to the additional multiple hypotheses. Our experiments show that CAMP yields a significant increase in power relative to that of previous methods and that it can more accurately locate the associated region.

Proceedings Article
08 Dec 2008
TL;DR: A Bayesian interpretation of sparsity in the kernel setting is pursued by making use of a mixture of a point-mass distribution and prior that is referred to as "Silverman's g-prior", which provides a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior.
Abstract: Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as "Silverman's g-prior." We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.