Showing papers by "Michael I. Jordan" published in 2012


Posted Content
TL;DR: This work presents an alternative algorithm based on stochastic optimization that allows for direct optimization of the variational lower bound and demonstrates the approach on two non-conjugate models: logistic regression and an approximation to the HDP.
Abstract: Mean-field variational inference is a method for approximate Bayesian posterior inference. It approximates a full posterior distribution with a factorized set of distributions by maximizing a lower bound on the marginal likelihood. This requires the ability to integrate a sum of terms in the log joint likelihood using this factorized distribution. Often not all integrals are in closed form, which is typically handled by using a lower bound. We present an alternative algorithm based on stochastic optimization that allows for direct optimization of the variational lower bound. This method uses control variates to reduce the variance of the stochastic search gradient, in which existing lower bounds can play an important role. We demonstrate the approach on two non-conjugate models: logistic regression and an approximation to the HDP.

355 citations
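The variance-reduction idea in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the authors' construction: a one-dimensional variational distribution q = N(mu, 1), a softplus term standing in for an intractable log-likelihood term, and a second-order Taylor expansion around mu used as the control variate (the paper builds control variates from existing bounds). The names `gradient_samples`, `softplus`, and `sigmoid` are hypothetical.

```python
import math
import random


def softplus(x):
    # log(1 + exp(x)); stands in for a term with no closed-form expectation.
    return math.log1p(math.exp(x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradient_samples(mu, n_samples, seed=0):
    """Per-sample estimates of d/dmu E_q[softplus(x)] with q = N(mu, 1).

    Returns (naive, controlled): the plain score-function estimates and the
    estimates after subtracting a control variate with a known expectation.
    """
    rng = random.Random(seed)
    # Quadratic h(x) = c0 + c1*(x - mu) + c2*(x - mu)^2 (Taylor expansion at mu).
    c0 = softplus(mu)
    c1 = sigmoid(mu)
    c2 = 0.5 * c1 * (1.0 - c1)
    naive, controlled = [], []
    for _ in range(n_samples):
        x = rng.gauss(mu, 1.0)
        score = x - mu  # gradient of log q(x) with respect to mu
        h = c0 + c1 * (x - mu) + c2 * (x - mu) ** 2
        naive.append(softplus(x) * score)
        # E_q[h(x) * score] = c1 in closed form, so adding c1 back keeps the
        # estimator unbiased while removing most of the variance.
        controlled.append((softplus(x) - h) * score + c1)
    return naive, controlled


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)
```

At mu = 0 the exact gradient is 0.5 by symmetry, so comparing `mean` and `variance` of the two lists shows both estimators targeting the same value with very different spread.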


Proceedings Article
26 Jun 2012
TL;DR: In this article, the authors revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint and show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes a k-means-like clustering objective that includes a penalty for the number of clusters.
Abstract: Bayesian models offer great flexibility for clustering applications--Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.

326 citations
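A hard-clustering procedure of the kind this abstract describes, usually referred to as DP-means, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the penalty parameter name `lam`, the squared Euclidean distance, and the function names are all chosen here for the example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means(points, lam, max_iter=100):
    """k-means-like clustering where opening a new cluster costs `lam`."""
    centers = [mean(points)]  # start with a single global cluster
    assign = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
            if dists[j] > lam:
                # Too far from every center: open a new cluster at this point.
                centers.append(tuple(p))
                j = len(centers) - 1
            if assign[i] != j:
                assign[i] = j
                changed = True
        # Recompute each center as the mean of its members; drop empty clusters.
        members = {}
        for i, j in enumerate(assign):
            members.setdefault(j, []).append(points[i])
        old_ids = sorted(members)
        remap = {old: new for new, old in enumerate(old_ids)}
        centers = [mean(members[old]) for old in old_ids]
        assign = [remap[j] for j in assign]
        if not changed:
            break
    return centers, assign


def objective(points, centers, assign, lam):
    # Within-cluster squared distances plus a penalty on the number of clusters.
    fit = sum(sq_dist(p, centers[j]) for p, j in zip(points, assign))
    return fit + lam * len(centers)
```

A small `lam` yields many clusters and a large `lam` collapses everything into one, which is how the penalty replaces a fixed choice of k.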


Posted Content
TL;DR: This paper presents a class of generalized mean field (GMF) algorithms for approximate inference in complex exponential family models, which restrict the optimization to the class of cluster-factorizable distributions.
Abstract: Mean field methods, which approximate intractable probability distributions variationally with distributions from a tractable family, enjoy high efficiency and guaranteed convergence, and provide lower bounds on the true likelihood. But because they require model-specific derivation of the optimization equations, and their inference quality is unclear across models, they are not widely used as generic approximate inference algorithms. In this paper, we discuss a generalized mean field theory on variational approximation to a broad class of intractable distributions using a rich set of tractable distributions via constrained optimization over distribution spaces. We present a class of generalized mean field (GMF) algorithms for approximate inference in complex exponential family models, which entails limiting the optimization over the class of cluster-factorizable distributions. GMF is a generic method requiring no model-specific derivations. It factors a complex model into a set of disjoint variable clusters and uses a set of canonical fixed-point equations to iteratively update the cluster distributions, converging to locally optimal cluster marginals that preserve the original dependency structure within each cluster and hence fully decompose the overall inference problem. We empirically analyzed the effect of different tractable families (clusters of different granularity) on inference quality, and compared GMF with BP on several canonical models. Possible extension to higher-order MF approximation is also discussed.

196 citations


Proceedings Article
26 Jun 2012
TL;DR: In this article, an alternative algorithm based on stochastic optimization is presented that allows for direct optimization of the variational lower bound, using control variates to reduce the variance of the stochastic search gradient.
Abstract: Mean-field variational inference is a method for approximate Bayesian posterior inference. It approximates a full posterior distribution with a factorized set of distributions by maximizing a lower bound on the marginal likelihood. This requires the ability to integrate a sum of terms in the log joint likelihood using this factorized distribution. Often not all integrals are in closed form, which is typically handled by using a lower bound. We present an alternative algorithm based on stochastic optimization that allows for direct optimization of the variational lower bound. This method uses control variates to reduce the variance of the stochastic search gradient, in which existing lower bounds can play an important role. We demonstrate the approach on two non-conjugate models: logistic regression and an approximation to the HDP.

179 citations


Posted Content
TL;DR: The Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality, is presented.
Abstract: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, we present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. We provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

128 citations
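The BLB procedure summarized above can be sketched for the simplest case, the standard error of a sample mean. This is an illustrative sketch under assumptions: the function names, the default subset count, resample count, and exponent `gamma` are choices made here, not values taken from the paper.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, counts):
    total = sum(counts)
    return sum(v * c for v, c in zip(values, counts)) / total


def blb_standard_error(data, n_subsets=5, n_resamples=30, gamma=0.7, seed=0):
    """Estimate the standard error of the mean with a BLB-style scheme."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # size of each small subset
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)  # drawn without replacement
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from only b distinct values; storing counts
            # per value means the estimator touches b items, not n.
            picks = Counter(rng.choices(range(b), k=n))
            counts = [picks.get(i, 0) for i in range(b)]
            estimates.append(weighted_mean(subset, counts))
        # Quality assessment for this subset: spread of the resampled estimates.
        per_subset.append(statistics.stdev(estimates))
    # Average the per-subset assessments.
    return sum(per_subset) / n_subsets
```

Each subset can be processed independently, which is what makes the scheme a natural fit for parallel and distributed settings.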


Posted Content
TL;DR: A nonparametric link prediction algorithm for a sequence of graph snapshots over time that predicts links based on the features of its endpoints as well as those of the local neighborhood around the endpoints, and proves the consistency of the estimator, and gives a fast implementation based on locality-sensitive hashing.
Abstract: We propose a non-parametric link prediction algorithm for a sequence of graph snapshots over time. The model predicts links based on the features of its endpoints, as well as those of the local neighborhood around the endpoints. This allows for different types of neighborhoods in a graph, each with its own dynamics (e.g., growing or shrinking communities). We prove the consistency of our estimator, and give a fast implementation based on locality-sensitive hashing. Experiments with simulated as well as five real-world dynamic graphs show that we outperform the state of the art, especially when sharp fluctuations or non-linearities are present.

120 citations


Journal ArticleDOI
TL;DR: In this article, the authors derive a stick-breaking representation for the beta process directly from its characterization as a completely random measure, which motivates a three-parameter generalization of the beta process.
Abstract: The beta-Bernoulli process provides a Bayesian nonparametric prior for models involving collections of binary-valued features. A draw from the beta process yields an infinite collection of probabilities in the unit interval, and a draw from the Bernoulli process turns these into binary-valued features. Recent work has provided stick-breaking representations for the beta process analogous to the well-known stick-breaking representation for the Dirichlet process. We derive one such stick-breaking representation directly from the characterization of the beta process as a completely random measure. This approach motivates a three-parameter generalization of the beta process, and we study the power laws that can be obtained from this generalized beta process. We present a posterior inference algorithm for the beta-Bernoulli process that exploits the stick-breaking representation, and we present experimental results for a discrete factor-analysis model.

94 citations


Journal ArticleDOI
TL;DR: This paper develops an extension of classical SMC based on partially ordered sets and shows how to apply this framework, referred to as PosetSMC, to phylogenetic analysis; it provides a theoretical treatment and empirical results demonstrating that PosetSMC is a very promising alternative to MCMC.
Abstract: Bayesian inference provides an appealing general framework for phylogenetic analysis, able to incorporate a wide variety of modeling assumptions and to provide a coherent treatment of uncertainty. Existing computational approaches to Bayesian inference based on Markov chain Monte Carlo (MCMC) have not, however, kept pace with the scale of the data analysis problems in phylogenetics, and this has hindered the adoption of Bayesian methods. In this paper, we present an alternative to MCMC based on Sequential Monte Carlo (SMC). We develop an extension of classical SMC based on partially ordered sets and show how to apply this framework--which we refer to as PosetSMC--to phylogenetic analysis. We provide a theoretical treatment of PosetSMC and also present experimental evaluation of PosetSMC on both synthetic and real data. The empirical results demonstrate that PosetSMC is a very promising alternative to MCMC, providing up to two orders of magnitude faster convergence. We discuss other factors favorable to the adoption of PosetSMC in phylogenetics, including its ability to estimate marginal likelihoods, its ready implementability on parallel and distributed computing platforms, and the possibility of combining with MCMC in hybrid MCMC-SMC schemes. Software for PosetSMC is available at http://www.stat.ubc.ca/ bouchard/PosetSMC.

93 citations


Posted Content
TL;DR: In this article, the authors study statistical risk minimization problems under a privacy model in which the data is kept confidential even from the learner, and establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures.
Abstract: We study statistical risk minimization problems under a privacy model in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, as measured by convergence rate, of any statistical estimator or learning procedure.

76 citations


Proceedings Article
03 Dec 2012
TL;DR: This paper derives novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models.
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be achieved by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis.

73 citations


Proceedings Article
03 Dec 2012
TL;DR: In this article, the authors study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner, and establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures.
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.

Journal ArticleDOI
TL;DR: It is shown that as long as the source of randomness is suitably ergodic---it converges quickly enough to a stationary distribution---the method enjoys strong convergence guarantees, both in expectation and with high probability.
Abstract: We generalize stochastic subgradient descent methods to situations in which we do not receive independent samples from the distribution over which we optimize, instead receiving samples coupled over time. We show that as long as the source of randomness is suitably ergodic---it converges quickly enough to a stationary distribution---the method enjoys strong convergence guarantees, both in expectation and with high probability. This result has implications for stochastic optimization in high-dimensional spaces, peer-to-peer distributed optimization schemes, decision problems with dependent data, and stochastic optimization problems over combinatorial spaces.

Journal ArticleDOI
TL;DR: This paper defines such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG), a variant of generalized hyperbolic distributions, and shows that these algorithms bear an interesting resemblance to iteratively reweighted l2 or l1 methods.
Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted l2 or l1 methods. Finally, we present two extensions for grouped variable selection and logistic regression.

Proceedings ArticleDOI
12 Aug 2012
TL;DR: An active learning algorithm is proposed that incrementally measures only those similarities that are most likely to remove uncertainty in an intermediate clustering solution, and shows a significant improvement in performance compared to the alternatives.
Abstract: Spectral clustering is a widely used method for organizing data that only relies on pairwise similarity measurements. This makes its application to non-vectorial data straight-forward in principle, as long as all pairwise similarities are available. However, in recent years, numerous examples have emerged in which the cost of assessing similarities is substantial or prohibitive. We propose an active learning algorithm for spectral clustering that incrementally measures only those similarities that are most likely to remove uncertainty in an intermediate clustering solution. In many applications, similarities are not only costly to compute, but also noisy. We extend our algorithm to maintain running estimates of the true similarities, as well as estimates of their accuracy. Using this information, the algorithm updates only those estimates which are relatively inaccurate and whose update would most likely remove clustering uncertainty. We compare our methods on several datasets, including a realistic example where similarities are expensive and noisy. The results show a significant improvement in performance compared to the alternatives.

Proceedings Article
26 Jun 2012
TL;DR: The Bag of Little Bootstraps (BLB) is a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality.
Abstract: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, we present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. We provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

Proceedings Article
03 Dec 2012
TL;DR: This work presents a novel method in the family of particle MCMC methods that it refers to as particle Gibbs with ancestor sampling (PG-AS), and develops a truncation strategy of these models that is applicable in principle to any backward-simulation-based method, but which is particularly well suited to the PG-AS framework.
Abstract: We present a novel method in the family of particle MCMC methods that we refer to as particle Gibbs with ancestor sampling (PG-AS). Similarly to the existing PG with backward simulation (PG-BS) procedure, we use backward sampling to (considerably) improve the mixing of the PG kernel. Instead of using separate forward and backward sweeps as in PG-BS, however, we achieve the same effect in a single forward sweep. We apply the PG-AS framework to the challenging class of non-Markovian state-space models. We develop a truncation strategy of these models that is applicable in principle to any backward-simulation-based method, but which is particularly well suited to the PG-AS framework. In particular, as we show in a simulation study, PG-AS can yield an order-of-magnitude improved accuracy relative to PG-BS due to its robustness to the truncation error. Several application examples are discussed, including Rao-Blackwellized particle smoothing and inference in degenerate state-space models.

Proceedings Article
26 Jun 2012
TL;DR: In this article, the authors propose a nonparametric link prediction algorithm for a sequence of graph snapshots over time, which predicts links based on the features of its endpoints, as well as those of the local neighborhood around the endpoints.
Abstract: We propose a nonparametric link prediction algorithm for a sequence of graph snapshots over time. The model predicts links based on the features of its endpoints, as well as those of the local neighborhood around the endpoints. This allows for different types of neighborhoods in a graph, each with its own dynamics (e.g., growing or shrinking communities). We prove the consistency of our estimator, and give a fast implementation based on locality-sensitive hashing. Experiments with simulated as well as five real-world dynamic graphs show that we outperform the state of the art, especially when sharp fluctuations or nonlinearities are present.

Proceedings Article
21 Mar 2012
TL;DR: It is shown that the stick-breaking construction of the beta process due to Paisley et al. (2010) can be obtained from the characterization of the beta process as a Poisson process, and this underlying representation is used to derive error bounds on truncated beta processes that are tighter than those in the literature.
Abstract: We show that the stick-breaking construction of the beta process due to Paisley et al. (2010) can be obtained from the characterization of the beta process as a Poisson process. Specifically, we show that the mean measure of the underlying Poisson process is equal to that of the beta process. We use this underlying representation to derive error bounds on truncated beta processes that are tighter than those in the literature. We also develop a new MCMC inference algorithm for beta processes, based in part on our new Poisson process construction.

Posted Content
TL;DR: Two new active learning algorithms are presented to combine humans and algorithms together in a crowd-sourced database, based on the theory of non-parametric bootstrap, which makes their results applicable to a broad class of machine learning models.
Abstract: Crowd-sourcing has become a popular means of acquiring labeled data for a wide variety of tasks where humans are more accurate than computers, e.g., labeling images, matching objects, or analyzing sentiment. However, relying solely on the crowd is often impractical even for data sets with thousands of items, due to time and cost constraints of acquiring human input (which cost pennies and minutes per label). In this paper, we propose algorithms for integrating machine learning into crowd-sourced databases, with the goal of allowing crowd-sourcing applications to scale, i.e., to handle larger datasets at lower costs. The key observation is that, in many of the above tasks, humans and machine learning algorithms can be complementary, as humans are often more accurate but slow and expensive, while algorithms are usually less accurate, but faster and cheaper. Based on this observation, we present two new active learning algorithms to combine humans and algorithms together in a crowd-sourced database. Our algorithms are based on the theory of non-parametric bootstrap, which makes our results applicable to a broad class of machine learning models. Our results, on three real-life datasets collected with Amazon's Mechanical Turk, and on 15 well-known UCI data sets, show that our methods on average ask humans to label one to two orders of magnitude fewer items to achieve the same accuracy as a baseline that labels random images, and two to eight times fewer questions than previous active learning schemes.

Posted Content
TL;DR: In this article, a particle Gibbs with ancestor sampling (PG-AS) method was proposed to improve the mixing of the particle MCMC kernel in a single forward sweep instead of using separate forward and backward sweeps.
Abstract: We present a novel method in the family of particle MCMC methods that we refer to as particle Gibbs with ancestor sampling (PG-AS). Similarly to the existing PG with backward simulation (PG-BS) procedure, we use backward sampling to (considerably) improve the mixing of the PG kernel. Instead of using separate forward and backward sweeps as in PG-BS, however, we achieve the same effect in a single forward sweep. We apply the PG-AS framework to the challenging class of non-Markovian state-space models. We develop a truncation strategy of these models that is applicable in principle to any backward-simulation-based method, but which is particularly well suited to the PG-AS framework. In particular, as we show in a simulation study, PG-AS can yield an order-of-magnitude improved accuracy relative to PG-BS due to its robustness to the truncation error. Several application examples are discussed, including Rao-Blackwellized particle smoothing and inference in degenerate state-space models.

Journal ArticleDOI
TL;DR: This paper develops combinatorial stochastic process representations for the feature modeling problem analogous to those for Bayesian nonparametric clustering; these include the beta process and the Indian buffet process as well as new representations that provide insight into the connections between these processes.
Abstract: One of the focal points of the modern literature on Bayesian nonparametrics has been the problem of clustering, or partitioning, where each data point is modeled as being associated with one and only one of some collection of groups called clusters or partition blocks. Underlying these Bayesian nonparametric models are a set of interrelated stochastic processes, most notably the Dirichlet process and the Chinese restaurant process. In this paper we provide a formal development of an analogous problem, called feature modeling, for associating data points with arbitrary nonnegative integer numbers of groups, now called features or topics. We review the existing combinatorial stochastic process representations for the clustering problem and develop analogous representations for the feature modeling problem. These representations include the beta process and the Indian buffet process as well as new representations that provide insight into the connections between these processes. We thereby bring the same level of completeness to the treatment of Bayesian nonparametric feature modeling that has previously been achieved for Bayesian nonparametric clustering.

Proceedings Article
03 Dec 2012
TL;DR: This work considers derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzes their finite-sample convergence rates, and shows that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods.
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
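A gradient estimate built from a pair of function values and a random perturbation, as the abstract describes, can be sketched as below. This is a simplified illustration under assumptions: the objective is a noiseless quadratic, the perturbation is standard Gaussian, and the step size, smoothing parameter `u`, and function names are chosen here for the example rather than taken from the paper.

```python
import random


def two_point_gradient(f, x, u, rng):
    """Estimate the gradient of f at x from two function evaluations."""
    z = [rng.gauss(0.0, 1.0) for _ in x]  # random perturbation direction
    shifted = [xi + u * zi for xi, zi in zip(x, z)]
    # Finite difference along z; since E[z z^T] = I, scaling z by it gives a
    # vector whose expectation approximates the gradient for small u.
    scale = (f(shifted) - f(x)) / u
    return [scale * zi for zi in z]


def minimize(f, x0, step=0.02, u=1e-3, n_steps=2000, seed=0):
    """Gradient descent driven only by function values."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        g = two_point_gradient(f, x, u, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    # Toy objective with minimizer at (1, ..., 1).
    return sum((xi - 1.0) ** 2 for xi in x)
```

The second moment of this estimate grows with the dimension d, which gives some intuition for why a dimension-dependent factor appears in the convergence rate.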

Posted Content
TL;DR: This paper presents a generalization of independent component analysis (ICA) that, instead of looking for a linear transform that makes the data components independent, looks for a transform that makes them well fit by a tree-structured graphical model; the optimal transform is found by minimizing a contrast function based on mutual information.
Abstract: We present a generalization of independent component analysis (ICA), where instead of looking for a linear transform that makes the data components independent, we look for a transform that makes the data components well fit by a tree-structured graphical model. Treating the problem as a semiparametric statistical problem, we show that the optimal transform is found by minimizing a contrast function based on mutual information, a function that directly extends the contrast function used for classical ICA. We provide two approximations of this contrast function, one using kernel density estimation, and another using kernel generalized variance. This tree-dependent component analysis framework leads naturally to an efficient general multivariate density estimation technique where only bivariate density estimation needs to be performed.

Posted Content
TL;DR: This paper presents a novel combination of graph partitioning algorithms with a generalized mean field (GMF) inference algorithm that optimizes over disjoint clustering of variables and performs inference using those clusters.
Abstract: An autonomous variational inference algorithm for arbitrary graphical models requires the ability to optimize variational approximations over the space of model parameters as well as over the choice of tractable families used for the variational approximation. In this paper, we present a novel combination of graph partitioning algorithms with a generalized mean field (GMF) inference algorithm. This combination optimizes over disjoint clustering of variables and performs inference using those clusters. We provide a formal analysis of the relationship between the graph cut and the GMF approximation, and explore several graph partition strategies empirically. Our empirical results provide rather clear support for a weighted version of MinCut as a useful clustering algorithm for GMF inference, which is consistent with the implications from the formal analysis.

20 Nov 2012
TL;DR: This white paper proposes the Million Cancer Genome Warehouse, a repository and associated computational infrastructure for a million genomes, as an example of an information commons and a computing system that will bring about precision medicine, coupling established clinical pathological indexes with state-of-the-art molecular profiling to create diagnostic, prognostic, and therapeutic strategies precisely tailored to each patient's individual requirements.
Abstract: This white paper discusses the motivation and issues surrounding the development of a repository and associated computational infrastructure to house and process a million genomes to help battle cancer, which we call the Million Cancer Genome Warehouse. It is proposed as an example of an information commons and a computing system that will bring about precision medicine, coupling established clinical pathological indexes with state-of-the-art molecular profiling to create diagnostic, prognostic, and therapeutic strategies precisely tailored to each patient's individual requirements. The goal of the white paper is to stimulate discussion so as to help reach consensus about the need to construct a Million Cancer Genome Warehouse and what its nature should be. To try to anticipate concerns, including thorough cost estimates, it covers topics ranging from high-level health policy issues to low-level details about statistical analysis, data formats and structures, software design, and hardware construction and cost.

Journal ArticleDOI
TL;DR: In this paper, exponential concentration inequalities and polynomial moment inequalities for the spectral norm of a random matrix are derived via a matrix extension of the scalar concentration theory developed by Sourav Chatterjee using Stein's method of exchangeable pairs, yielding matrix generalizations of the classical inequalities due to Hoeffding, Bernstein, Khintchine and Rosenthal.
Abstract: This paper derives exponential concentration inequalities and polynomial moment inequalities for the spectral norm of a random matrix. The analysis requires a matrix extension of the scalar concentration theory developed by Sourav Chatterjee using Stein's method of exchangeable pairs. When applied to a sum of independent random matrices, this approach yields matrix generalizations of the classical inequalities due to Hoeffding, Bernstein, Khintchine and Rosenthal. The same technique delivers bounds for sums of dependent random matrices and more general matrix-valued functions of dependent random variables.

Journal ArticleDOI
TL;DR: This work presents a new approach to supervised ranking based on aggregation of partial preferences, and develops $U$-statistic-based empirical risk minimization procedures that yield consistency results that parallel those available for classification.
Abstract: We consider the predictive problem of supervised ranking, where the task is to rank sets of candidate items returned in response to queries. Although there exist statistical procedures that come with guarantees of consistency in this setting, these procedures require that individuals provide a complete ranking of all items, which is rarely feasible in practice. Instead, individuals routinely provide partial preference information, such as pairwise comparisons of items, and more practical approaches to ranking have aimed at modeling this partial preference data directly. As we show, however, such an approach raises serious theoretical challenges. Indeed, we demonstrate that many commonly used surrogate losses for pairwise comparison data do not yield consistency; surprisingly, we show inconsistency even in low-noise settings. With these negative results as motivation, we present a new approach to supervised ranking based on aggregation of partial preferences, and we develop $U$-statistic-based empirical risk minimization procedures. We present an asymptotic analysis of these new procedures, showing that they yield consistency results that parallel those available for classification. We complement our theoretical results with an experiment studying the new procedures in a large-scale web-ranking task.

Posted Content
TL;DR: In this article, a probabilistic model of events in continuous time is presented, in which each event triggers a Poisson process of successor events, and the ensemble of observed events is thereby modeled as a superposition of Poisson processes.
Abstract: We present a probabilistic model of events in continuous time in which each event triggers a Poisson process of successor events. The ensemble of observed events is thereby modeled as a superposition of Poisson processes. Efficient inference is feasible under this model with an EM algorithm. Moreover, the EM algorithm can be implemented as a distributed algorithm, permitting the model to be applied to very large datasets. We apply these techniques to the modeling of Twitter messages and the revision history of Wikipedia.
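
The generative side of the model described above can be sketched as a simple simulation. This is only an illustration under assumptions that the abstract does not state: background events arrive at a constant rate, each event triggers a Poisson-distributed number of successors, and successor delays are exponential. The function name `simulate_cascade` and all parameter names and values are hypothetical, and the EM inference step is not shown.

```python
import numpy as np

def simulate_cascade(base_rate, branching, delay_scale, horizon, seed=0):
    """Simulate events on [0, horizon); returns a list of (time, parent_index).

    Assumed for illustration only: a homogeneous background process and
    exponentially distributed delays between an event and its successors.
    """
    rng = np.random.default_rng(seed)
    # Background events: homogeneous Poisson process with rate base_rate.
    n_roots = rng.poisson(base_rate * horizon)
    events = [(float(t), -1) for t in rng.uniform(0.0, horizon, size=n_roots)]
    i = 0
    # Every event, background or triggered, spawns its own Poisson process of
    # successors. A mean number of successors below 1 keeps the cascade finite.
    while i < len(events):
        t, _ = events[i]
        n_children = rng.poisson(branching)
        for delay in rng.exponential(delay_scale, size=n_children):
            if t + delay < horizon:
                events.append((float(t + delay), i))
        i += 1
    return events

events = simulate_cascade(base_rate=2.0, branching=0.5, delay_scale=1.0, horizon=100.0)
n_roots = sum(1 for _, parent in events if parent == -1)
print(len(events), "events,", n_roots, "of them background events")
```

The observed event stream is the superposition of all these processes. In the inference problem the parent indices are hidden, which is what makes an EM formulation natural.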

Proceedings ArticleDOI
12 Aug 2012
TL;DR: A new procedure, the "bag of little bootstraps," is presented for obtaining confidence intervals on massive datasets; it inherits the favorable theoretical properties of the bootstrap while having a much more favorable computational profile.
Abstract: I present some recent work on statistical inference for Big Data. Divide-and-conquer is a natural computational paradigm for approaching Big Data problems, particularly given recent developments in distributed and parallel computing, but some interesting challenges arise when applying divide-and-conquer algorithms to statistical inference problems. One interesting issue is that of obtaining confidence intervals in massive datasets. The bootstrap principle suggests resampling data to obtain fluctuations in the values of estimators, and thereby confidence intervals, but this is infeasible with massive data. Subsampling the data yields fluctuations on the wrong scale, which have to be corrected to provide calibrated statistical inferences. I present a new procedure, the "bag of little bootstraps," which circumvents this problem, inheriting the favorable theoretical properties of the bootstrap but also having a much more favorable computational profile. Another issue that I discuss is the problem of large-scale matrix completion. Here divide-and-conquer is a natural heuristic that works well in practice, but new theoretical problems arise when attempting to characterize the statistical performance of divide-and-conquer algorithms. Here the theoretical support is provided by concentration theorems for random matrices, and I present a new approach to this problem based on Stein's method.
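
A minimal sketch of the bag of little bootstraps idea, applied to the standard error of a sample mean, may help make the scaling argument concrete. The abstract gives no algorithmic details, so the structure here follows the procedure as it is commonly described: small subsets of size b = n^gamma are drawn without replacement, and each is resampled up to nominal size n through multinomial counts. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def blb_standard_error(data, n_subsets=10, gamma=0.7, n_resamples=50, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the mean."""
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)  # size of each small subset (assumed choice of gamma)
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.choice(data, size=b, replace=False)
        # Each resample has nominal size n but is stored as counts over only
        # b distinct points, so no resample ever touches all n data points.
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_resamples)
        means = counts @ subset / n
        per_subset.append(means.std(ddof=1))
    # Average the per-subset assessments to get the final estimate.
    return float(np.mean(per_subset))

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=2.0, size=20000)
se_blb = blb_standard_error(data)
se_classical = data.std(ddof=1) / np.sqrt(len(data))
print(se_blb, se_classical)
```

Because the resamples have nominal size n, the fluctuations of the estimator are on the right scale without a correction factor, while each computation only handles b distinct values.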

Journal ArticleDOI
TL;DR: In this article, a particle filter is used to generate a sample state trajectory in a Markov chain Monte Carlo sampler, which has been shown to be efficient even when we use very few particles in the PF.