Showing papers by "David L. Donoho published in 2016"


Journal ArticleDOI
TL;DR: It is shown that this phenomenon can be characterized rigorously using techniques the authors developed for analyzing the Lasso estimator under high-dimensional asymptotics, and that the ‘extra Gaussian noise’ encountered in this problem is fundamentally similar to phenomena already studied for regularized least squares in the setting n < p.
Abstract: In a recent article, El Karoui et al. (Proc Natl Acad Sci 110(36):14557–14562, 2013) study the distribution of robust regression estimators in the regime in which the number of parameters p is of the same order as the number of samples n. Using numerical simulations and ‘highly plausible’ heuristic arguments, they unveil a striking new phenomenon. Namely, the regression coefficients contain an extra Gaussian noise component that is not explained by classical concepts such as the Fisher information matrix. We show here that this phenomenon can be characterized rigorously using techniques that were developed by the authors for analyzing the Lasso estimator under high-dimensional asymptotics. We introduce an approximate message passing (AMP) algorithm to compute M-estimators and deploy state evolution to evaluate the operating characteristics of AMP, and hence also of the M-estimates. Our analysis clarifies that the ‘extra Gaussian noise’ encountered in this problem is fundamentally similar to phenomena already studied for regularized least squares in the setting n < p.
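
The phenomenon is easy to reproduce numerically. The sketch below is our own illustration, not code from the paper: it computes the Huber M-estimator by standard iteratively reweighted least squares (IRLS) rather than the paper's AMP iteration, and compares the empirical variance of one coefficient across Monte Carlo replications with the classical Fisher-information-style prediction. All dimensions, tuning constants, and the Laplace error distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def huber_psi(r, k=1.345):
    """Huber score function psi = rho' (linear near zero, clipped at +/- k)."""
    return np.clip(r, -k, k)

def huber_irls(X, y, k=1.345, iters=100, tol=1e-8):
    """Huber M-estimator via IRLS (a standard solver; the paper uses AMP instead)."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]            # least-squares start
    for _ in range(iters):
        r = y - X @ theta
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))   # weights psi(r)/r
        Xw = X * w[:, None]
        theta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)     # weighted normal equations
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

# p/n = 0.5: the proportional regime studied by El Karoui et al. and this paper.
n, p, reps, k = 400, 200, 300, 1.345
theta1 = np.empty(reps)
for rep in range(reps):
    X = rng.standard_normal((n, p)) / np.sqrt(n)   # rows ~ N(0, I/n)
    y = rng.laplace(size=n)                        # true theta = 0, heavy-tailed errors
    theta1[rep] = huber_irls(X, y, k)[0]

# Classical (fixed-p) prediction: Var(theta_1) ~ (E psi^2 / (E psi')^2) * E[(X'X)^{-1}]_{11},
# where E[(X'X)^{-1}]_{11} = n / (n - p - 1) for this Gaussian design.
e = rng.laplace(size=500_000)
v_eff = np.mean(huber_psi(e, k) ** 2) / np.mean(np.abs(e) <= k) ** 2
print("classical prediction:", v_eff * n / (n - p - 1))
print("empirical variance  :", theta1.var())
```

With choices like these, the empirical spread of the coefficients should visibly exceed the classical prediction; that excess is the ‘extra Gaussian noise’ whose size the paper's state evolution characterizes exactly.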

221 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: Presents ClusterJob (CJ), an efficient computing environment that researchers have used to conduct and share million-CPU-hour experiments in a painless and reproducible way, together with a taxonomy of the desiderata that such experiment-management paradigms should offer.
Abstract: The increasing availability of large-scale computing clusters, for example via the cloud, is changing the way scientific research can be conducted, enabling experiments of a scale and scope that would have been inconceivable several years ago. An ambitious data scientist today can carry out projects involving several million CPU hours. In the near future, we anticipate that a typical Ph.D. in computational science may be expected, or even required, to offer findings based on at least 1 million CPU hours of computation. The massive scale of these soon-to-be-upon-us computational experiments demands that we change how we organize our experimental practices.

Traditionally, and still dominantly today, the end-to-end process of experiment design and execution involves a significant amount of manual intervention and situational tweaking, cutting and pasting, and the use of disparate, disconnected tools, much of which is undocumented and easily lost. This makes it difficult to detect and understand possible failure points in the computational workflow, and virtually impossible to correct the experiment, let alone simply rerun it. This is an amazing state of affairs, considering the ubiquity of error in scientific computation and in research generally. Such unstructured and undocumented research practices limit the ability of the researcher to exploit cluster- and cloud-based paradigms, as each increase in scale under the dominant paradigm is likely to lead to ever more errors and misunderstandings.

A better paradigm would integrate the design of large experiments seamlessly with job management, output harvesting, data analysis, reporting, and publication of code and data. In particular, such a paradigm would submerge the details of all the processing, harvesting, and management while transparently exposing the description of the discovery process itself, including details such as the parameter-space exploration. Reproducing any job would be a push-button affair, and creating a new experiment from a previous one might involve only changing a line or two of code, followed again by push-button execution and reporting. Even though such experiments would operate at a much greater scale than today's, under such a paradigm they would be easier to conduct, have a lower error rate, and offer a much greater opportunity for ‘outsiders’ to understand the results.

In this article, we discuss the challenges of massive computational experimentation and present a taxonomy of some of the desiderata that such paradigms should offer. We then present ClusterJob (CJ), an efficient computing environment that we and other researchers have used to conduct and share million-CPU-hour experiments in a painless and reproducible way.
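
As a purely hypothetical illustration of the paradigm the abstract describes (this is not ClusterJob's actual interface, which is a command-line tool documented by the authors), the toy Python fragment below treats an experiment as a declarative object: a script plus a parameter space, with deterministic job tags derived from both. Every name in it is our own invention.

```python
import hashlib
import itertools
import json
from dataclasses import dataclass

@dataclass
class Experiment:
    """Toy experiment description: a script plus a parameter space to sweep.

    Hypothetical sketch only. The point is that the whole discovery process
    is captured by this one declarative object, so rerunning or extending the
    experiment means editing a line or two here and resubmitting.
    """
    script: str    # path to the analysis code
    params: dict   # parameter name -> list of values to sweep

    def jobs(self):
        """Enumerate the parameter grid; each grid point is one cluster job."""
        names = sorted(self.params)
        for values in itertools.product(*(self.params[n] for n in names)):
            point = dict(zip(names, values))
            # Deterministic tag: same script + same point => same job identity,
            # which is what makes push-button reproduction and harvesting possible.
            tag = hashlib.sha256(
                (self.script + json.dumps(point, sort_keys=True)).encode()
            ).hexdigest()[:12]
            yield tag, point

exp = Experiment(
    script="solve_instance.py",
    params={"n": [1000, 2000], "p_over_n": [0.1, 0.3, 0.5], "seed": [0, 1, 2]},
)
for tag, point in exp.jobs():
    print(tag, point)   # in a real system: submit each job, later harvest by tag
```

Scaling the sweep up, say from 18 jobs to 18,000, changes only the lists in params; previously computed jobs keep their tags, so reruns and harvesting remain push-button, which is exactly the property asked of a better paradigm above.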

30 citations