Open Access, Posted Content
Biwhitening Reveals the Rank of a Count Matrix.
TL;DR: In this paper, a simple procedure termed biwhitening is proposed to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure.
Abstract:
Estimating the rank of a corrupted data matrix is an important task in data science, most notably for choosing the number of components in principal component analysis. Significant progress on this task has been made using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, such as Poisson or binomial, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, focusing on a Poisson random matrix with independent entries, we propose a simple procedure termed biwhitening that makes it possible to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson distribution, we extend our biwhitening approach to other discrete distributions, such as the generalized Poisson, binomial, multinomial, and negative binomial. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate our approach on real single-cell RNA sequencing (scRNA-seq) data, where we show that our results agree with a slightly overdispersed generalized Poisson model.
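The procedure described in the abstract can be illustrated with a minimal Python sketch for the Poisson case. This is not the authors' code. It relies on assumptions that go beyond the abstract: that the count matrix itself serves as the variance estimate (for Poisson data the variance equals the mean), that the scaling targets are row sums of n and column sums of m for an m-by-n matrix (so the average noise variance is 1), and that the MP upper edge for the singular values is sqrt(m) + sqrt(n). The names `sinkhorn_knopp` and `biwhiten_rank` are hypothetical.

```python
import numpy as np


def sinkhorn_knopp(A, row_sums, col_sums, max_iter=10_000, tol=1e-9):
    """Find positive x, y so that diag(x) @ A @ diag(y) has the given sums.

    Assumes A is nonnegative with no all-zero row or column.
    """
    x = np.ones(A.shape[0])
    y = np.ones(A.shape[1])
    for _ in range(max_iter):
        x = row_sums / (A @ y)      # match the row sums
        y = col_sums / (A.T @ x)    # match the column sums
        # Column sums are exact after the y update, so check the rows only.
        if np.max(np.abs(x * (A @ y) - row_sums)) <= tol * np.max(row_sums):
            break
    return x, y


def biwhiten_rank(Y):
    """Estimate the rank of the Poisson parameter matrix behind counts Y.

    Sketch under stated assumptions: Y is used as the per-entry variance
    estimate, and the threshold is the MP upper edge sqrt(m) + sqrt(n).
    """
    Y = np.asarray(Y, dtype=float)
    m, n = Y.shape
    # Scale the variance estimate to row sums n and column sums m.
    x, y = sinkhorn_knopp(Y, np.full(m, float(n)), np.full(n, float(m)))
    # Scaling the data by square roots scales the variances by x_i * y_j.
    Y_scaled = np.sqrt(x)[:, None] * Y * np.sqrt(y)[None, :]
    singular_values = np.linalg.svd(Y_scaled, compute_uv=False)
    mp_edge = np.sqrt(m) + np.sqrt(n)
    rank = int(np.sum(singular_values > mp_edge))
    return rank, x, y
```

With a count matrix `Y` (for example, cells by genes), `rank, x, y = biwhiten_rank(Y)` returns the number of singular values above the threshold together with the row and column scaling factors. Because the largest noise singular value fluctuates around the MP edge in finite samples, the count may occasionally exceed the true rank by a small amount.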
Citations
Journal Article
HePPCAT: Probabilistic PCA for Data With Heteroscedastic Noise
TL;DR: This article proposes a probabilistic PCA approach that accounts for heteroscedastic noise in the statistical model, using an alternating maximization algorithm to jointly estimate both the underlying factors and the unknown noise variances.
References
Journal Article
The scree test for the number of factors
TL;DR: The scree test for the number of factors was first proposed in 1966 and has been used extensively in the field of behavioral analysis since then.
Book
Topics in Matrix Analysis
TL;DR: The book covers the field of values, singular value inequalities, matrix equations and Kronecker products, Hadamard products, and matrices and functions.
Journal Article
A rationale and test for the number of factors in factor analysis.
TL;DR: It is suggested that if Guttman's latent-root-one lower bound estimate for the rank of a correlation matrix is accepted as a psychometric upper bound, then the rank for a sample matrix should be estimated by subtracting out the component in the latent roots which can be attributed to sampling error.
Journal Article
Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets
Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, John J. Trombetta, David A. Weitz, Joshua R. Sanes, Alex K. Shalek, Aviv Regev, Steven A. McCarroll
TL;DR: Drop-seq profiles individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together; it will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution.
Journal Article
Massively parallel digital transcriptional profiling of single cells
Grace X.Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias Daniel Wheeler, Geoffrey P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, Luz Montesclaros, Jason G. Underwood, Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul Wyatt, Christopher Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S. Mikkelsen, Benjamin J. Hindson, Jason H. Bielas
TL;DR: A droplet-based system that enables 3′ mRNA counting of tens of thousands of single cells per sample is described and sequence variation in the transcriptome data is used to determine host and donor chimerism at single-cell resolution from bone marrow mononuclear cells isolated from transplant patients.