Showing papers by "David L. Donoho" published in 2022


10 Oct 2022
TL;DR: In this article, the authors study the spiked covariance model, with population covariance a low-rank perturbation of the identity, and derive optimal performance levels and eigenvalue shrinkage formulas for the spiked Wigner setting, which are of independent and fundamental interest.
Abstract: Recent studies of high-dimensional covariance estimation often assume the proportional growth asymptotic, where the sample size n and dimension p are comparable, with n, p → ∞ and γ_n ≡ p/n → γ > 0. Yet many datasets, perhaps most, have very different numbers of rows and columns. Consider instead disproportional growth, where n, p → ∞ and γ_n → 0 or γ_n → ∞. With far fewer dimensions than observations, the disproportional limit γ_n → 0 may seem similar to classical fixed-p asymptotics. In fact, either disproportional limit induces novel phenomena distinct from the proportional and fixed-p limits. We study the spiked covariance model, with population covariance a low-rank perturbation of the identity. For each of 15 different loss functions, and each disproportional limit, we exhibit in closed form new optimal shrinkage and thresholding rules; optimality takes the particularly strong form of unique asymptotic admissibility. Readers who initially view the disproportionate limit γ_n → 0 as similar to classical fixed-p asymptotics may expect, given the dominance in that setting of the sample covariance estimator, that there is no alternative here. On the contrary, although the sample covariance is consistent as γ_n → 0, our optimal procedures demand extensive eigenvalue shrinkage and offer substantial performance benefits. The sample covariance is similarly improvable in the disproportionate limit γ_n → ∞. Practitioners may worry how to choose between proportional and disproportional growth frameworks in practice. Conveniently, under the spiked covariance model there is no conflict between the two and no choice is needed; one unified set of closed forms (used with the aspect ratio γ_n of the practitioner's data) offers full asymptotic optimality in both regimes. At the heart of these phenomena is the spiked Wigner model, in which a low-rank matrix is perturbed by symmetric noise. The eigenvalue distributions of the spiked covariance under disproportionate growth (appropriately scaled) and the spiked Wigner model converge to a common limit, the semicircle law. Exploiting this connection, we derive optimal performance levels and eigenvalue shrinkage formulas for the spiked Wigner setting, of independent and fundamental interest. These formulas visibly correspond to our formulas for optimal shrinkage in covariance estimation.
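The semicircle limit mentioned in the abstract can be explored numerically. The following is a minimal sketch, not the authors' procedure. It assumes Gaussian data with known zero mean, a rank-one spike of size τ·√γ_n (an illustrative scaling), and the centering and scaling (λ − 1)/√γ_n. The function name `scaled_spectrum` and the values of n, p, and τ are hypothetical choices made for this example.

```python
import numpy as np


def scaled_spectrum(n, p, tau=0.0, seed=0):
    """Return the eigenvalues of (S - I) / sqrt(gamma_n), in ascending order.

    S is the sample covariance of n Gaussian observations in dimension p.
    The population covariance is I + tau * sqrt(gamma_n) * e1 e1^T, with
    gamma_n = p / n. This spike scaling is an assumption of the sketch.
    """
    rng = np.random.default_rng(seed)
    gamma_n = p / n

    # Per-coordinate standard deviations: only the first coordinate is spiked.
    scale = np.ones(p)
    scale[0] = np.sqrt(1.0 + tau * np.sqrt(gamma_n))

    # Rows are observations; the mean is assumed known and equal to zero.
    X = rng.standard_normal((n, p)) * scale
    S = X.T @ X / n

    # eigvalsh handles symmetric matrices and returns ascending eigenvalues.
    eigvals = np.linalg.eigvalsh(S)
    return (eigvals - 1.0) / np.sqrt(gamma_n)


# Small aspect ratio: gamma_n = 200 / 4000 = 0.05.
null = scaled_spectrum(n=4000, p=200)              # no spike
spiked = scaled_spectrum(n=4000, p=200, tau=3.0)   # one spike

print("bulk mean and variance:", null.mean(), null.var())
print("bulk range:", null.min(), null.max())
print("two largest scaled eigenvalues with a spike:", spiked[-2:])
```

With a small aspect ratio, the scaled bulk should have mean near 0 and variance near 1, and should lie close to the interval [-2, 2], as expected for a semicircle-shaped distribution. With the spike switched on, the largest scaled eigenvalue should separate from the bulk. A histogram of `null` is a quick visual check.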

2 citations


Journal Article
03 Apr 2022
TL;DR: In this discussion piece, the author salutes the authors of an article for their gift to the world of a new dataset and praises the article's enthusiasm for the data, which seems a departure from the pattern of many articles in statistics today.
Abstract: I salute the authors for their gift to the world of this new dataset! They have clearly invested plenty of time, effort, and IQ points in the study of the statistics literature as a bibliometric laboratory, and our field will grow and develop because of this dataset, as well as methodology the authors developed and/or fine-tuned with those data. Strikingly, the article also conveys a great deal of enthusiasm for the data! This seems such a departure from the pattern of many articles in statistics today. The enthusiastic spirit reminds me of some classic work by great figures in the history of statistics, who often were fascinated by new kinds of data which were just becoming available in their day, and who were inspired by the new data to invent fundamental new statistical tools and mathematical machinery. Francis Galton was interested in the relationships between father’s height and son’s height, himself compiling an extensive bivariate dataset of such heights, leading to the invention of the bivariate normal distribution and the correlation coefficient. Time and time again, new types of data came first, new types of models and methodology later. Indeed, this seems almost inevitable. As new technologies come onstream, new kinds of measurements become available, and new settings for data analysis and statistical inference emerge. This is plain to see in recent decades, where computational biology produced gene expression data, DNA sequence data, SNP data, and RNA-Seq data, each new data type leading to interesting methodological challenges and scientific progress. For me, each effort by a statistics researcher to understand a newly available type of data enlarges our field; it should be a primary part of the career of statisticians to cultivate an interest in cultivating new types of datasets, so that new methodology can be discovered and developed.