
Showing papers by "Edoardo M. Airoldi" published in 2017


Journal ArticleDOI
TL;DR: By estimating the contribution of transcript levels to orthogonal sources of variability, the authors found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for the across-tissues variability, suggesting extensive post-transcriptional regulation.
Abstract: Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability and, (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
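
As a concrete illustration of the two sources of variability distinguished above, the following sketch (simulated data, not the paper's pipeline) contrasts the correlation of per-protein means across proteins (mean-level variability) with the per-protein correlation of tissue-to-tissue deviations (across-tissues variability).

```python
# Sketch: contrast mean-level variability with across-tissues variability.
# Hypothetical layout: rows = proteins, columns = tissues, for matched
# protein and mRNA quantifications on a log scale. Not the paper's pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_proteins, n_tissues = 1000, 12

# Simulate: protein levels track mean mRNA levels well (mean-level variability)
# but have largely independent tissue-to-tissue deviations.
mrna_mean = rng.normal(0, 2, size=(n_proteins, 1))
mrna = mrna_mean + rng.normal(0, 0.5, size=(n_proteins, n_tissues))
protein = mrna_mean + rng.normal(0, 0.5, size=(n_proteins, n_tissues))

# Mean-level variability: correlate per-protein means across proteins.
r_mean_level = np.corrcoef(mrna.mean(axis=1), protein.mean(axis=1))[0, 1]

# Across-tissues variability: center each protein, then correlate the
# tissue-to-tissue deviations one protein at a time.
mrna_dev = mrna - mrna.mean(axis=1, keepdims=True)
prot_dev = protein - protein.mean(axis=1, keepdims=True)
r_across_tissues = np.array([
    np.corrcoef(mrna_dev[i], prot_dev[i])[0, 1] for i in range(n_proteins)
])

print(f"mean-level correlation:            {r_mean_level:.2f}")                 # high
print(f"median across-tissues correlation: {np.median(r_across_tissues):.2f}")  # near 0
```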

159 citations


Proceedings ArticleDOI
13 Aug 2017
TL;DR: A new experimental design is leveraged for testing whether SUTVA holds, without making any assumptions on how treatment effects may spill over between the treatment and the control group, and the proposed methodology can be applied to settings in which a network is not necessarily observed but, if available, can be used in the analysis.
Abstract: Randomized experiments, or A/B tests, are the standard approach for evaluating the causal effects of new product features, i.e., treatments. The validity of these tests rests on the "stable unit treatment value assumption" (SUTVA), which implies that the treatment only affects the behavior of treated users, and does not affect the behavior of their connections. Violations of SUTVA, common in features that exhibit network effects, result in inaccurate estimates of the causal effect of treatment. In this paper, we leverage a new experimental design for testing whether SUTVA holds, without making any assumptions on how treatment effects may spill over between the treatment and the control group. To achieve this, we simultaneously run both a completely randomized and a cluster-based randomized experiment, and then we compare the difference of the resulting estimates. We present a statistical test for measuring the significance of this difference and offer theoretical bounds on the Type I error rate. We provide practical guidelines for implementing our methodology on large-scale experimentation platforms. Importantly, the proposed methodology can be applied to settings in which a network is not necessarily observed but, if available, can be used in the analysis. Finally, we deploy this design to LinkedIn's experimentation platform and apply it to two online experiments, highlighting the presence of network effects and bias in standard A/B testing approaches in a real-world setting.
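
The following is a minimal sketch of the comparison this design enables, under simplifying assumptions: a plain normal-approximation test on the difference between the two difference-in-means estimates, rather than the paper's exact variance and Type I error bounds; the input vectors are hypothetical.

```python
# Sketch: compare the estimates from a completely randomized (CR) arm and a
# cluster-based randomized (CBR) arm run in parallel. Under SUTVA both
# estimators target the same quantity, so a large standardized difference is
# evidence of interference. This uses a plain normal approximation rather
# than the paper's exact bounds, and the CBR variance should really be
# computed at the cluster level (omitted here for brevity).
import numpy as np
from scipy import stats

def diff_in_means(y, w):
    """Difference-in-means estimate and a simple variance estimate."""
    y, w = np.asarray(y, float), np.asarray(w, int)
    yt, yc = y[w == 1], y[w == 0]
    return yt.mean() - yc.mean(), yt.var(ddof=1) / len(yt) + yc.var(ddof=1) / len(yc)

def sutva_test(y_cr, w_cr, y_cbr, w_cbr):
    """Two-sided z-test that the CR and CBR estimates agree (hypothetical inputs)."""
    est_cr, var_cr = diff_in_means(y_cr, w_cr)
    est_cbr, var_cbr = diff_in_means(y_cbr, w_cbr)
    z = (est_cr - est_cbr) / np.sqrt(var_cr + var_cbr)
    return z, 2 * stats.norm.sf(abs(z))
```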

77 citations


Journal ArticleDOI
TL;DR: In this article, the authors introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined, and provide a theoretical analysis of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators.
Abstract: Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood in theory, and in practice avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators in practice, for generalized linear models, Cox proportional hazards models, and M-estimators, and we perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
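
For intuition, the following sketch contrasts explicit and implicit SGD on least-squares regression, where the implicit update has a closed form that makes the shrinkage explicit; it is an illustration of the update rule only, not the paper's general algorithms for GLMs, Cox models, or M-estimators.

```python
# Sketch: explicit vs implicit SGD updates for least-squares regression.
# Explicit:  theta_n = theta_{n-1} + g_n * (y_n - x_n' theta_{n-1}) * x_n
# Implicit:  theta_n = theta_{n-1} + g_n * (y_n - x_n' theta_n)     * x_n
# For squared-error loss the implicit update has a closed form and amounts
# to shrinking the explicit step by a factor 1 / (1 + g_n * ||x_n||^2).
import numpy as np

def explicit_sgd_step(theta, x, y, gamma):
    return theta + gamma * (y - x @ theta) * x

def implicit_sgd_step(theta, x, y, gamma):
    shrink = 1.0 / (1.0 + gamma * (x @ x))
    return theta + gamma * shrink * (y - x @ theta) * x

rng = np.random.default_rng(1)
p, n = 5, 10_000
theta_true = rng.normal(size=p)
theta_ex, theta_im = np.zeros(p), np.zeros(p)
for t in range(1, n + 1):
    x = rng.normal(size=p)
    y = x @ theta_true + rng.normal()
    gamma = 1.0 / t                      # decaying learning rate
    theta_ex = explicit_sgd_step(theta_ex, x, y, gamma)
    theta_im = implicit_sgd_step(theta_im, x, y, gamma)

print("explicit SGD error:", np.linalg.norm(theta_ex - theta_true))
print("implicit SGD error:", np.linalg.norm(theta_im - theta_true))
```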

58 citations


Posted Content
TL;DR: In this paper, the authors develop elements of optimal estimation theory for causal effects leveraging an observed network, by assuming that the potential outcomes of an individual depend only on the individual's treatment and on the treatment of the neighbors.
Abstract: Randomized experiments in which the treatment of a unit can affect the outcomes of other units are becoming increasingly common in healthcare, economics, and in the social and information sciences. From a causal inference perspective, the typical assumption of no interference becomes untenable in such experiments. In many problems, however, the patterns of interference may be informed by the observation of network connections among the units of analysis. Here, we develop elements of optimal estimation theory for causal effects leveraging an observed network, by assuming that the potential outcomes of an individual depend only on the individual's treatment and on the treatment of the neighbors. We propose a collection of exclusion restrictions on the potential outcomes, and show how subsets of these restrictions lead to various parameterizations. Considering the class of linear unbiased estimators of the average direct treatment effect, we derive conditions on the design that lead to the existence of unbiased estimators, and offer analytical insights on the weights that lead to minimum integrated variance estimators. We illustrate the improved performance of these estimators when compared to more standard biased and unbiased estimators, using simulations.
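
The following sketch illustrates the setting with one simple estimator in this spirit, not the paper's minimum-integrated-variance construction: a Horvitz-Thompson contrast of treated versus control units whose neighbors are all untreated, assuming Bernoulli assignment on a known graph.

```python
# Sketch: a Horvitz-Thompson style estimate of the average direct effect
# under neighborhood interference, assuming Bernoulli(p) assignment on a
# known graph. It contrasts treated vs. control units with zero treated
# neighbors, weighting each unit by the probability of that exposure. This
# illustrates the setting, not the paper's optimal-weight estimators.
import numpy as np

def direct_effect_ht(y, z, adj, p):
    """y: outcomes, z: 0/1 treatments, adj: 0/1 numpy adjacency matrix, p: Prob(z=1)."""
    y, z = np.asarray(y, float), np.asarray(z, int)
    deg = adj.sum(axis=1)
    no_treated_nbrs = (adj @ z == 0)
    n = len(y)
    w_treated = z * no_treated_nbrs / (p * (1 - p) ** deg)
    w_control = (1 - z) * no_treated_nbrs / ((1 - p) ** (deg + 1))
    return (w_treated @ y - w_control @ y) / n
```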

54 citations


Posted Content
TL;DR: In this article, the authors show that even parametric structural assumptions that allow the existence of unbiased estimators cannot guarantee a decreasing variance even in the large sample limit of randomized experiments with interference.
Abstract: Randomized experiments on a network often involve interference between connected units; i.e., a situation in which an individual's treatment can affect the response of another individual. Current approaches to deal with interference, in theory and in practice, often make restrictive assumptions on its structure---for instance, assuming that interference is local---even when using otherwise nonparametric inference strategies. This reliance on explicit restrictions on the interference mechanism suggests a shared intuition that inference is impossible without any assumptions on the interference structure. In this paper, we begin by formalizing this intuition in the context of a classical nonparametric approach to inference, referred to as design-based inference of causal effects. Next, we show how, still in the context of design-based inference, even parametric structural assumptions that allow the existence of unbiased estimators cannot guarantee a decreasing variance, even in the large sample limit. This lack of concentration in large samples is often observed empirically, in randomized experiments in which interference of some form is expected to be present. This result has direct consequences for the design and analysis of large experiments---for instance, in online social platforms---where the belief is that large sample sizes automatically guarantee small variance. More broadly, our results suggest that although strategies for causal inference in the presence of interference borrow their formalism and main concepts from the traditional causal inference literature, much of the intuition from the no-interference case does not easily transfer to the interference setting.
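
The lack of concentration can be reproduced in a toy simulation (an assumed setup, not the paper's formal construction): with Bernoulli assignment on a ring whose degree grows slowly with the sample size, the saturated exposures become so rare that a Horvitz-Thompson estimator's variance does not shrink as n grows.

```python
# Toy illustration of non-concentration under interference (assumed setup).
# On a ring lattice where each unit has k neighbors per side and k grows
# slowly with n, the probability of the "unit and all neighbors treated"
# exposure decays exponentially in degree, so the Horvitz-Thompson estimator
# of the total effect does not concentrate even as n grows.
import numpy as np

def ht_estimate(n, k, rng):
    """One replication on a ring lattice with k neighbors on each side."""
    z = rng.integers(0, 2, size=n)
    offsets = np.concatenate([np.arange(1, k + 1), -np.arange(1, k + 1)])
    nbr_treated = z[(np.arange(n)[:, None] + offsets) % n].sum(axis=1)
    sat_treated = (z == 1) & (nbr_treated == 2 * k)   # unit and all neighbors treated
    sat_control = (z == 0) & (nbr_treated == 0)       # unit and all neighbors control
    y = z + 0.5 * nbr_treated / (2 * k) + rng.normal(0, 0.1, n)  # toy outcomes
    pi = 0.5 ** (2 * k + 1)                           # probability of each saturated exposure
    return ((sat_treated * y).sum() - (sat_control * y).sum()) / (n * pi)

rng = np.random.default_rng(2)
for n in [64, 256, 1024, 4096]:
    k = max(1, int(np.log2(n)) // 2)                  # degree grows slowly with n
    reps = [ht_estimate(n, k, rng) for _ in range(2000)]
    print(f"n={n:5d}  std of HT estimate over replications: {np.std(reps):.2f}")
```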

26 citations


Journal ArticleDOI
TL;DR: The authors developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards.
Abstract: Many proteoforms - arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes - have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code. HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/.
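
The following toy sketch conveys the kind of inference problem involved, under an assumed formulation rather than the published HIquant algorithm: peptide intensities are modeled as a peptide-specific bias times the summed abundance of the proteoforms containing that peptide, and relative proteoform levels are recovered by alternating least squares.

```python
# Toy sketch (assumed formulation, not the published HIquant algorithm):
# X[j, s] ~ a[j] * sum_i M[j, i] * P[i, s], where a[j] is an unknown
# peptide-specific bias, M maps peptides to proteoforms, and P holds
# proteoform levels per sample. Stoichiometries are recovered without
# external standards, up to a global scale, by alternating least squares.
import numpy as np

rng = np.random.default_rng(3)
# Two proteoforms (e.g., modified / unmodified), six peptides:
# two unique to each proteoform and two shared by both.
M = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])
n_pep, n_forms = M.shape
n_samples = 8
a_true = rng.uniform(0.5, 2.0, size=n_pep)            # peptide-specific biases
P_true = rng.uniform(1.0, 5.0, size=(n_forms, n_samples))
X = a_true[:, None] * (M @ P_true) * rng.normal(1, 0.02, size=(n_pep, n_samples))

# Alternating least squares for a (biases) and P (proteoform levels).
a, P = np.ones(n_pep), np.ones((n_forms, n_samples))
for _ in range(200):
    a = (X * (M @ P)).sum(axis=1) / ((M @ P) ** 2).sum(axis=1)
    P, *_ = np.linalg.lstsq(a[:, None] * M, X, rcond=None)
    P = np.clip(P, 1e-9, None)

# Proteoform fractions per sample are identifiable despite the scale ambiguity.
print("estimated fractions:", (P / P.sum(axis=0)).round(2))
print("true fractions:     ", (P_true / P_true.sum(axis=0)).round(2))
```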

22 citations


Journal ArticleDOI
TL;DR: In this paper, a method for adaptive nonlinear sequential modeling of time series data is proposed, where data are modeled as a nonlinear function of past values corrupted by noise, and the underlying nonlinear functions are assumed to be approximately expandable in a spline basis.
Abstract: We propose a method for adaptive nonlinear sequential modeling of time series data. Data are modeled as a nonlinear function of past values corrupted by noise, and the underlying nonlinear function is assumed to be approximately expandable in a spline basis. We cast the modeling of data as finding a good fit representation in the linear span of a multidimensional spline basis, and use a variant of $l_1$-penalty regularization in order to reduce the dimensionality of the representation. Using adaptive filtering techniques, we design our online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error. We demonstrate the generality and flexibility of the proposed approach on both synthetic and real-world datasets. Moreover, we analytically investigate the performance of our algorithm by obtaining both bounds on prediction errors and consistency in variable selection.
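
The following sketch conveys the flavor of the approach under simplifying assumptions, and is not the paper's algorithm or tuning rule: a scalar nonlinear autoregression is expanded in a truncated-power spline basis and fit online with a soft-thresholded ($l_1$-penalized) per-sample update.

```python
# Sketch: an online, l1-regularized fit of y_t = f(y_{t-1}) + noise, with f
# expanded in a truncated-power spline basis and coefficients updated by a
# per-sample proximal (soft-thresholded) LMS step. Illustrative only; the
# paper's algorithm and adaptive tuning rules differ in the details.
import numpy as np

def spline_basis(u, knots):
    """Truncated-power linear spline basis: [1, u, (u - k)_+ for each knot]."""
    return np.concatenate(([1.0, u], np.maximum(u - knots, 0.0)))

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
T = 5000
y = np.zeros(T)
for t in range(1, T):                       # simulate a nonlinear AR(1) series
    y[t] = np.sin(2 * y[t - 1]) + 0.1 * rng.normal()

knots = np.linspace(-1.5, 1.5, 12)
theta = np.zeros(2 + len(knots))
mu, lam = 0.02, 0.01                        # step size and l1 penalty
errs = []
for t in range(1, T):
    phi = spline_basis(y[t - 1], knots)
    pred = phi @ theta
    errs.append((y[t] - pred) ** 2)
    grad = -(y[t] - pred) * phi             # gradient of 0.5 * squared prediction error
    theta = soft_threshold(theta - mu * grad, mu * lam)

print("mean squared prediction error, last 1000 steps:", np.mean(errs[-1000:]))
print("nonzero coefficients:", np.count_nonzero(theta), "of", theta.size)
```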

16 citations


Journal ArticleDOI
TL;DR: A novel parameterization of distributions on hypergraphs based on the geometry of points in R^d, which leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space and generally offers greater control on the distribution of graph features than currently possible.
Abstract: We introduce a novel parameterization of distributions on hypergraphs based on the geometry of points in R^d. The idea is to induce distributions on hypergraphs by placing priors on point configurations via spatial processes. This specification is then used to infer conditional independence models, or Markov structure, for multivariate distributions. This approach results in a broader class of conditional independence models beyond standard graphical models. Factorizations that cannot be retrieved via a graph are possible. Inference of nondecomposable graphical models is possible without requiring decomposability, or the need of Gaussian assumptions. This approach leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space, generally offers greater control on the distribution of graph features than currently possible, and naturally extends to hypergraphs. We provide a comparative performance evaluation against state-of-the-art approaches, and...
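
As a toy illustration of inducing a hypergraph from a point configuration, the sketch below uses a generic distance-ball construction; the paper's spatial-process priors and the resulting distribution on hypergraphs are different.

```python
# Toy sketch: induce a hypergraph from random points in R^d by taking, for
# each point, the hyperedge of all points within distance r of it. This is a
# generic geometric construction illustrating how a prior on point
# configurations induces a distribution on hypergraphs; it is not the
# specification used in the paper.
import numpy as np

rng = np.random.default_rng(5)
d, n, r = 2, 20, 0.35
points = rng.uniform(0, 1, size=(n, d))                   # point configuration in R^d

dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
hyperedges = {frozenset(np.flatnonzero(dist[i] <= r)) for i in range(n)}
hyperedges = {e for e in hyperedges if len(e) > 1}        # drop singletons

print(f"{len(hyperedges)} hyperedges over {n} vertices")
print("example hyperedge sizes:", sorted(len(e) for e in hyperedges)[:10])
```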

16 citations


Posted Content
TL;DR: In this paper, the authors introduce an experimental design strategy for testing whether the assumption of no interference among users, under which the outcome of one user does not depend on the treatment assigned to other users, holds; this assumption is rarely tenable on such platforms.
Abstract: Experimentation platforms are essential to modern large technology companies, as they are used to carry out many randomized experiments daily. The classic assumption of no interference among users, under which the outcome of one user does not depend on the treatment assigned to other users, is rarely tenable on such platforms. Here, we introduce an experimental design strategy for testing whether this assumption holds. Our approach is in the spirit of the Durbin-Wu-Hausman test for endogeneity in econometrics, where multiple estimators return the same estimate if and only if the null hypothesis holds. The design that we introduce makes no assumptions on the interference model between units, nor on the network among the units, and has a sharp bound on the variance and an implied analytical bound on the type I error rate. We discuss how to apply the proposed design strategy to large experimentation platforms, and we illustrate it in the context of an experiment on the LinkedIn platform.

8 citations


Posted ContentDOI
26 Jul 2017-bioRxiv
TL;DR: This work characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code.
Abstract: Many proteoforms - arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes - have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code. HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/.

7 citations


Proceedings ArticleDOI
01 Dec 2017
TL;DR: A simulation framework is implemented to show the effectiveness of two latent homophily proxies, latent coordinates and community membership, in improving peer influence effect estimates on game downloads in a Japanese social network website.
Abstract: Peer influence on consumer actions plays a vital role in marketing efforts. However, peer influence effects are often confounded with latent homophily, i.e., unobserved commonalities that drive friendship. Understanding causality has become one of the pressing issues of current research. We present an approach to explicitly account for various causal influences. We implement a simulation framework to show the effectiveness of two latent homophily proxies, latent coordinates and community membership, in improving peer influence effect estimates on game downloads in a Japanese social network website. We demonstrate that latent homophily proxies yield no significant improvement in peer influence effect bias in the available website data.
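
The following simulation sketch illustrates the confounding mechanism and the community-membership proxy on synthetic data only; it is an assumed setup, not the paper's analysis of the Japanese social-network data, and its behavior need not match the paper's empirical finding.

```python
# Toy sketch: a latent community drives both friendship formation and
# downloads, which inflates a naive peer-influence regression even when the
# true peer effect is zero; a spectrally recovered community membership
# serves as the homophily proxy covariate. Simulated data, illustrative only.
import numpy as np

rng = np.random.default_rng(6)
n = 2000
community = np.repeat([0, 1], n // 2)                   # latent community drives behavior
p_in, p_out = 0.02, 0.002
prob = np.where(community[:, None] == community[None, :], p_in, p_out)
A = (rng.uniform(size=(n, n)) < prob).astype(float)
A = np.triu(A, 1); A = A + A.T                          # undirected friendship network

# Downloads at t=0 and t=1 depend on the community; the true peer effect is 0.
y0 = community + rng.normal(size=n)
y1 = community + rng.normal(size=n)
exposure = A @ y0 / np.maximum(A.sum(axis=1), 1)        # friends' average past behavior

def coef_on_first(X_cols, y):
    """OLS coefficient on the first regressor after the intercept."""
    X = np.column_stack([np.ones(n)] + X_cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Proxy for latent homophily: community membership recovered from the
# second-largest eigenvector of the adjacency matrix.
second_vec = np.linalg.eigh(A)[1][:, -2]
est_community = (second_vec > 0).astype(float)

print("naive peer-influence estimate:  ", round(coef_on_first([exposure], y1), 3))
print("with community-membership proxy:", round(coef_on_first([exposure, est_community], y1), 3))
```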

Proceedings ArticleDOI
01 Nov 2017
TL;DR: This work proposes a methodology to sequentially and adaptively model nonlinear multivariate time series data and designs an online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error.
Abstract: Given massive data that may be time dependent and multi-dimensional, how to efficiently explore the underlying functional relationships across different dimensions and time lags? In this work, we propose a methodology to sequentially and adaptively model nonlinear multivariate time series data. Data at each time step and dimension are modeled as a nonlinear function of past values corrupted by noise, and the underlying nonlinear function is assumed to be approximately expandable in a spline basis. We cast the modeling of data as finding a good fit representation in the linear span of a multi-dimensional spline basis, and use a variant of $l_1$-penalty regularization in order to reduce the dimensionality of the representation. Using adaptive filtering techniques, we design our online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error. We demonstrate the generality and flexibility of the proposed approach on both synthetic and real-world datasets. Moreover, we analytically investigate the performance of our algorithm by obtaining bounds on the prediction errors.