
Showing papers by "David L. Donoho" published in 2020


Journal ArticleDOI
TL;DR: The paper considers a now-standard training methodology, driving the cross-entropy loss to zero and continuing long after the classification error is already zero, and shows that this terminal phase induces a pervasive geometric structure (neural collapse), helping to explain an important component of the modern deep learning training paradigm.
Abstract: Modern practice for training classification deepnets involves a terminal phase of training (TPT), which begins at the epoch where training error first vanishes. During TPT, the training error stays effectively zero, while training loss is pushed toward zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call neural collapse (NC), involving four deeply interconnected phenomena. (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class means. (NC2) The class means collapse to the vertices of a simplex equiangular tight frame (ETF). (NC3) Up to rescaling, the last-layer classifiers collapse to the class means or in other words, to the simplex ETF (i.e., to a self-dual configuration). (NC4) For a given activation, the classifier’s decision collapses to simply choosing whichever class has the closest train class mean (i.e., the nearest class center [NCC] decision rule). The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability.
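The four NC phenomena can be probed numerically from the last-layer activations alone. The sketch below is illustrative only and is not the authors' measurement code; `features` (last-layer training activations, one row per example) and `labels` (integer class labels) are assumed placeholders.

```python
# Hypothetical sketch of simple neural-collapse probes, assuming `features`
# holds last-layer training activations (num_samples, dim) and `labels`
# holds integer class labels.
import numpy as np

def neural_collapse_probes(features, labels):
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # NC1 proxy: average within-class variation relative to between-class variation.
    within = np.mean([
        np.mean(np.sum((features[labels == c] - class_means[i]) ** 2, axis=1))
        for i, c in enumerate(classes)
    ])
    between = np.mean(np.sum((class_means - global_mean) ** 2, axis=1))
    nc1 = within / between  # expected to shrink toward 0 during TPT

    # NC2 proxy: centered class means should be equinorm and equiangular,
    # with pairwise cosines near -1/(K-1) for a simplex ETF with K classes.
    centered = class_means - global_mean
    norms = np.linalg.norm(centered, axis=1)
    cosines = (centered @ centered.T) / np.outer(norms, norms)
    K = len(classes)
    off_diag = cosines[~np.eye(K, dtype=bool)]
    nc2 = (norms.std() / norms.mean(), np.abs(off_diag + 1.0 / (K - 1)).mean())

    # NC4: nearest-class-center (NCC) predictions for the same activations.
    dists = np.linalg.norm(features[:, None, :] - class_means[None, :, :], axis=2)
    ncc_pred = classes[np.argmin(dists, axis=1)]
    return nc1, nc2, ncc_pred
```

Per the abstract, during TPT one would expect the NC1 ratio and both NC2 deviations to approach zero, and the NCC predictions to agree with the network's own classifier decisions (NC4).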

194 citations


Journal ArticleDOI
TL;DR: In image processing, speech and video processing, machine vision, natural language processing, and classic two-player games, the state-of-the-art has been rapidly pushed forward over the last decade, as a series of machine-learning performance records were achieved for publicly organized challenge problems.
Abstract: Scientists today have completely different ideas of what machines can learn to do than we had only 10 y ago. In image processing, speech and video processing, machine vision, natural language processing, and classic two-player games, in particular, the state-of-the-art has been rapidly pushed forward over the last decade, as a series of machine-learning performance records were achieved for publicly organized challenge problems. In many of these challenges, the records now meet or exceed human performance level. A contest in 2010 proved that the Go-playing computer software of the day could not beat a strong human Go player. Today, in 2020, no one believes that human Go players—including human world champion Lee Sedol—can beat AlphaGo, a system constructed over the last decade. These new performance records, and the way they were achieved, obliterate the expectations of 10 y ago. At that time, human-level performance seemed a long way off and, for many, it seemed that no technologies then available would be able to deliver such performance. Systems like AlphaGo benefited in this last decade from a completely unanticipated simultaneous expansion on several fronts. On the one hand, we saw the unprecedented availability of on-demand scalable computing power in the form of cloud computing, and on the other hand, a massive industrial investment in assembling human engineering teams from a globalized talent pool by some of the largest global technology players. These resources were steadily deployed over that decade to allow rapid expansions in challenge problem performance. The 2010s produced a true technology explosion, a one-time-only transition: The sudden public availability of massive image and text data. Billions of people posted trillions of images and documents on social media, as the phrase “Big Data” entered media awareness. Image processing and natural language processing were forever changed by this new data resource …

31 citations


Posted Content
TL;DR: A formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise is derived; although it nominally involves unobservables, it is shown how to apply it even when the noise covariance structure is not known a priori and cannot be independently estimated.
Abstract: We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise; although it nominally involves unobservables, we show how to apply it even where the noise covariance structure is not a priori known or is not independently estimable. The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's ever-popular but vague Scree Plot heuristic from 1966. ScreeNOT has a surprising oracle property: it typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery, on each given problem instance, i.e., the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure. Our results depend on the assumption that the singular values of the noise have a limiting empirical distribution of compact support; this model, which is standard in random matrix theory, is satisfied by many models exhibiting either cross-row correlation structure or cross-column correlation structure, and also by many situations where there is inter-element correlation structure. Simulations demonstrate the effectiveness of the method even at moderate matrix sizes. The paper is supplemented by ready-to-use software packages implementing the proposed algorithm.
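For context, hard thresholding of the SVD is itself a short computation once a threshold is fixed; the difficult step, which ScreeNOT addresses, is choosing that threshold from the data. The sketch below is a generic illustration under that framing and is not the released ScreeNOT package; the threshold `tau` is simply taken as given.

```python
# Minimal sketch of SVD hard thresholding at a user-supplied threshold `tau`
# (hypothetical helper, not the ScreeNOT package's API).
import numpy as np

def svd_hard_threshold(Y, tau):
    """Return the low-rank estimate obtained by discarding all singular
    values of Y at or below tau, together with the selected rank."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thresh = np.where(s > tau, s, 0.0)      # hard threshold on singular values
    rank = int(np.count_nonzero(s_thresh))    # retained rank (the "knee" of the scree plot)
    return (U * s_thresh) @ Vt, rank
```

Per the abstract, ScreeNOT's contribution is selecting `tau` from the observed singular values so that, in large finite samples, the resulting estimate typically attains the smallest achievable MSE among all possible hard-threshold choices for that dataset.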

7 citations


Posted Content
TL;DR: In this article, the authors adapt Higher Criticism (HC) to the comparison of two frequency tables that may, or may not, exhibit moderate differences in some unknown, relatively small subset out of a large number of categories.
Abstract: We adapt Higher Criticism (HC) to the comparison of two frequency tables which may -- or may not -- exhibit moderate differences between the tables in some unknown, relatively small subset out of a large number of categories. Our analysis of the power of the proposed HC test quantifies the rarity and size of assumed differences and applies moderate-deviations analysis to determine the asymptotic powerfulness/powerlessness of our proposed HC procedure. Our analysis considers the null hypothesis of no difference in underlying generative model against a rare/weak perturbation alternative, in which the frequencies of $N^{1-\beta}$ out of the $N$ categories are perturbed by $r(\log N)/2n$ in the Hellinger distance; here $n$ is the size of each sample. Our proposed Higher Criticism (HC) test for this setting uses P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between cases in which the counts in both tables are low, versus cases in which counts are high, corresponding to the cases of sparse and dense frequency tables. The phase transition curve of HC in the high-counts regime matches formally the curve delivered by HC in a two-sample normal means model.
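A minimal sketch of the procedure described above, written with assumed variable names (`x` and `y` as the two count vectors) rather than the authors' implementation: each category is scored by an exact binomial P-value for the split of its combined count between the two tables, and Higher Criticism is then applied to the sorted P-values.

```python
# Hedged sketch of the two-table Higher Criticism test (illustrative only;
# names and defaults are assumptions, not the paper's implementation).
import numpy as np
from scipy.stats import binomtest

def hc_two_tables(x, y, gamma=0.2):
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    n1, n2 = x.sum(), y.sum()

    # Exact binomial P-value per category: conditional on the combined count,
    # the first table's count is Binomial(x_i + y_i, n1/(n1+n2)) under the null.
    pvals = np.array([
        binomtest(int(xi), int(xi + yi), n1 / (n1 + n2)).pvalue if xi + yi > 0 else 1.0
        for xi, yi in zip(x, y)
    ])

    # Higher Criticism: maximal standardized deviation of the sorted P-values
    # from uniformity, scanned over the smallest gamma*N of them.
    p_sorted = np.sort(pvals)
    i = np.arange(1, N + 1)
    hc_scores = np.sqrt(N) * (i / N - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted) + 1e-12)
    return hc_scores[: max(1, int(gamma * N))].max()
```

The scan fraction `gamma=0.2` is a conventional default from the HC literature, not a value taken from this abstract; large values of the returned statistic indicate a departure from the null of identical underlying tables.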

4 citations


Posted Content
TL;DR: The asymptotic performance of the HC-based test is characterized in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$, and a region in the $(\beta,r)$-plane is derived where the test asymptotically has maximal power, while having asymptotically no power outside this region.
Abstract: Given two samples from possibly different discrete distributions over a common set of size $N$, consider the problem of testing whether these distributions are identical, vs. the following rare/weak perturbation alternative: the frequencies of $N^{1-\beta}$ elements are perturbed by $r(\log N)/2n$ in the Hellinger distance, where $n$ is the size of each sample. We adapt the Higher Criticism (HC) test to this setting using P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between the cases of dense ($N\ll n$) and sparse ($N\gg n$) contingency tables. In the dense case, the phase transition curve matches that of an analogous two-sample normal means model.
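For reference, with sorted P-values $p_{(1)} \le \cdots \le p_{(N)}$ from the $N$ exact binomial tests, the Higher Criticism statistic is conventionally defined as follows (a standard form from the HC literature, stated here as context rather than quoted from the paper):

$$
\mathrm{HC}_N^{*} \;=\; \max_{1 \le i \le \gamma_0 N} \; \sqrt{N}\,\frac{i/N - p_{(i)}}{\sqrt{p_{(i)}\,(1 - p_{(i)})}}, \qquad 0 < \gamma_0 < 1.
$$

The test rejects the null of identical distributions when $\mathrm{HC}_N^{*}$ is large, and the $(\beta,r)$ phase diagram described in the abstract records where this rejection is asymptotically guaranteed versus asymptotically impossible.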

4 citations