Author

Shuxiao Chen

Bio: Shuxiao Chen is an academic researcher from the University of Pennsylvania. The author has contributed to research in topics including empirical risk minimization and minimax, has an h-index of 6, and has co-authored 16 publications receiving 126 citations.

Papers
Posted Content
TL;DR: It is shown that data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant, and that this averaging leads to variance reduction.
Abstract: Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).

111 citations
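
As a rough illustration of the orbit-averaging view described in the abstract above, the following sketch (a minimal example, not the authors' code; the sign-flip group, toy model, and loss are placeholder assumptions) compares ordinary per-sample losses with their averages over transformed copies of each sample, showing the variance-reduction effect.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical invariance group: sign flips of the input, a stand-in for the
# shifts/rotations used with real data. The toy data distribution below is
# invariant under this group.
group = [lambda x: x, lambda x: -x]

def loss(theta, x, y):
    # Squared loss of a toy linear predictor; purely illustrative.
    return (y - theta * x) ** 2

x = rng.normal(size=50_000)
y = np.abs(x) + 0.1 * rng.normal(size=50_000)

theta = 0.9
plain = loss(theta, x, y)
# Orbit averaging: replace each per-sample loss by its average over the
# transformed copies of that sample.
augmented = np.mean([loss(theta, g(x), y) for g in group], axis=0)

# Nearly identical means (no bias is introduced), but the orbit-averaged
# losses have smaller variance -- the variance reduction described above.
print(plain.mean(), augmented.mean())
print(plain.var(), augmented.var())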

Journal ArticleDOI
TL;DR: This article highlights the fact that the standard “detect-and-forget” OLS approach can lead to invalid inference and shows how recently developed tools in selective inference can be used to properly account for outlier detection and removal.
Abstract: Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the dat...

39 citations
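
The sketch below is a toy illustration (hypothetical variable names and thresholds, not the authors' procedure) of the "detect-and-forget" workflow the article warns about: outliers are flagged from the fitted residuals, dropped, and the model is refit on the surviving points, with the refitted p-values ignoring that the same data were used to choose which points to keep.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Toy data with no true relationship between x and y.
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)
X = sm.add_constant(x)

# Step 1: fit OLS and flag "outliers" from internally studentized residuals.
fit = sm.OLS(y, X).fit()
influence = fit.get_influence()
keep = np.abs(influence.resid_studentized_internal) < 2.0

# Step 2 ("detect and forget"): refit on the surviving points only.
refit = sm.OLS(y[keep], X[keep]).fit()

# The refitted p-value treats the trimmed sample as if it had been fixed in
# advance, ignoring the data-driven selection -- the invalid inference the
# article highlights; selective inference is what corrects for it.
print(fit.pvalues[1], refit.pvalues[1])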

Posted Content
25 Jul 2019
TL;DR: A theoretical framework that explains data augmentation as averaging over group orbits and sheds light on how it could be used in problems with symmetry where other approaches are prevalent, such as cryo-electron microscopy (cryo-EM).
Abstract: Many complex deep learning models have found success by exploiting symmetries in data. Convolutional neural networks (CNNs), for example, are ubiquitous in image classification due to their use of translation symmetry, as image identity is roughly invariant to translations. In addition, many other forms of symmetry such as rotation, scale, and color shift are commonly used via data augmentation: the transformed images are added to the training set. However, a clear framework for understanding data augmentation is not available. One may even say that it is somewhat mysterious: how can we increase performance by simply adding transforms of our data to the model? Can that be information theoretically possible? In this paper, we develop a theoretical framework to start to shed light on some of these problems. We explain data augmentation as averaging over the orbits of the group that keeps the data distribution invariant, and show that it leads to variance reduction. We study finite-sample and asymptotic empirical risk minimization (using results from stochastic convex optimization, Rademacher complexity, and asymptotic statistical theory). We work out as examples the variance reduction in exponential families, linear regression, and certain two-layer neural networks under shift invariance (using discrete Fourier analysis). We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).

36 citations
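
In symbols (notation chosen here for illustration, not taken from the paper), the augmented empirical risk replaces each per-sample loss by its average over the orbit of the invariance group, and the variance reduction follows from the law of total variance:
\[
\bar{R}_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{g \sim \mathbb{Q}}\,\ell\!\left(\theta;\, g \cdot X_i\right),
\qquad
\operatorname{Var}\!\left( \mathbb{E}_{g \sim \mathbb{Q}}\,\ell(\theta;\, g \cdot X) \right)
\;\le\;
\operatorname{Var}\!\left( \ell(\theta;\, X) \right),
\]
where $\mathbb{Q}$ is a probability measure on the group $G$ under which the distribution of $X$ is (approximately) invariant; when the invariance is exact, $g \cdot X$ has the same distribution as $X$, and the inequality is Jensen's inequality applied to the conditional expectation over the orbit.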

Posted Content
TL;DR: The present paper studies community detection in a stylized yet informative inhomogeneous multilayer network model and provides an efficient algorithm that is simultaneously asymptotically minimax optimal for both estimation tasks under mild conditions.
Abstract: In network applications, it has become increasingly common to obtain datasets in the form of multiple networks observed on the same set of subjects, where each network is obtained in a related but different experimental condition or application scenario. Such datasets can be modeled by multilayer networks, where each layer is a separate network itself while different layers are associated and share some common information. The present paper studies community detection in a stylized yet informative inhomogeneous multilayer network model. In our model, layers are generated by different stochastic block models, the community structures of which are (random) perturbations of a common global structure, while the connecting probabilities in different layers are not related. Focusing on the symmetric two-block case, we establish minimax rates for both global estimation of the common structure and individualized estimation of layer-wise community structures. Both minimax rates have sharp exponents. In addition, we provide an efficient algorithm that is simultaneously asymptotically minimax optimal for both estimation tasks under mild conditions. The optimal rates depend on the parity of the number of most informative layers, a phenomenon that is caused by inhomogeneity across layers.

17 citations
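
The following is a minimal generative sketch of the stylized model described above (all parameter values and variable names are illustrative assumptions): each layer draws its two-block community labels as a random perturbation of a shared global labeling and then uses its own, unrelated connection probabilities.

import numpy as np

rng = np.random.default_rng(2)

n, L, flip_prob = 100, 5, 0.05  # nodes, layers, label-perturbation rate (hypothetical)

# Common global structure: symmetric two-block labels in {+1, -1}.
global_labels = rng.choice([-1, 1], size=n)

adjacency, layer_labels = [], []
for layer in range(L):
    # Each layer's communities are a random perturbation of the global ones.
    flips = rng.random(n) < flip_prob
    labels = np.where(flips, -global_labels, global_labels)
    # Connection probabilities are layer-specific and unrelated across layers.
    p, q = sorted(rng.uniform(0.02, 0.2, size=2), reverse=True)  # within, between
    same_block = np.equal.outer(labels, labels)
    probs = np.where(same_block, p, q)
    A = (rng.random((n, n)) < probs).astype(int)
    A = np.triu(A, 1)
    A = A + A.T  # symmetric adjacency, no self-loops
    adjacency.append(A)
    layer_labels.append(labels)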

Proceedings Article
01 Jan 2020
TL;DR: A novel approach from the perspective of label-awareness to reduce the performance gap of the neural tangent kernel, showing that models trained with the proposed kernels better simulate NNs in terms of generalization ability and local elasticity.
Abstract: As a popular approach to modeling the dynamics of training overparametrized neural networks (NNs), the neural tangent kernels (NTK) are known to fall behind real-world NNs in generalization ability. This performance gap is in part due to the label-agnostic nature of the NTK, which renders the resulting kernel not as locally elastic as NNs (He et al., 2019). In this paper, we introduce a novel approach from the perspective of label-awareness to reduce this gap for the NTK. Specifically, we propose two label-aware kernels that are each a superimposition of a label-agnostic part and a hierarchy of label-aware parts with increasing complexity of label dependence, using the Hoeffding decomposition. Through both theoretical and empirical evidence, we show that the models trained with the proposed kernels better simulate NNs in terms of generalization ability and local elasticity.

16 citations
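
To make the "superimposition" idea concrete, here is a schematic stand-in (not the paper's construction, which builds a hierarchy of label-aware terms via the Hoeffding decomposition; the RBF kernel, the rank-one label term, and the mixing weight alpha are assumptions for illustration): a label-agnostic kernel is combined with a simple label-dependent component on the training set.

import numpy as np

rng = np.random.default_rng(3)

def rbf_kernel(X, Z, gamma=1.0):
    # A generic label-agnostic kernel, used here only as a stand-in for the NTK.
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def label_aware_kernel(X_train, y_train, alpha=0.1, gamma=1.0):
    # Schematic superimposition of a label-agnostic part and a crude
    # first-order label-dependent part; purely illustrative.
    K = rbf_kernel(X_train, X_train, gamma)
    K_label = np.outer(y_train, y_train)
    return K + alpha * K_label

X = rng.normal(size=(20, 5))
y = rng.choice([-1.0, 1.0], size=20)
K = label_aware_kernel(X, y)
print(K.shape)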


Cited by
Journal ArticleDOI
01 May 1981
TL;DR: This work covers detecting influential observations and outliers, detecting and assessing collinearity, and applications and remedies.
Abstract: 1. Introduction and Overview. 2. Detecting Influential Observations and Outliers. 3. Detecting and Assessing Collinearity. 4. Applications and Remedies. 5. Research Issues and Directions for Extensions. Bibliography. Author Index. Subject Index.

4,948 citations

01 Jan 2016
TL;DR: An introduction to graphical methods of diagnostic regression analysis, covering plots, transformations, and regression.

138 citations

Posted Content
TL;DR: The resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet, which is the best such result the authors know of for a classifier that is not a trained neural network.
Abstract: Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained. An exact algorithm to compute the CNTK (Arora et al., 2019) yielded the finding that the classification accuracy of the CNTK on CIFAR-10 is within 6-7% of that of the corresponding CNN architecture (the best figure being around 78%), which is interesting performance for a fixed kernel. Here we show how to significantly enhance the performance of these kernels using two ideas. (1) Modifying the kernel using a new operation called Local Average Pooling (LAP), which preserves efficient computability of the kernel and inherits the spirit of standard data augmentation using pixel shifts. Earlier papers were unable to incorporate naive data augmentation because of the quadratic training cost of kernel regression. This idea is inspired by Global Average Pooling (GAP), which we show for CNN-GP and CNTK is equivalent to full translation data augmentation. (2) Representing the input image using a pre-processing technique proposed by Coates et al. (2011), which uses a single convolutional layer composed of random image patches. On CIFAR-10, the resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet (Krizhevsky et al., 2012). Note that this is the best such result we know of for a classifier that is not a trained neural network. Similar improvements are obtained for Fashion-MNIST.

105 citations
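
The sketch below is only a schematic picture of the pooling idea, not the kernel recursion of Arora et al. (2019) or the paper's actual LAP operation (the positionwise kernel values, the 1-D layout, and the Gaussian weighting are all placeholder assumptions): GAP-style pooling averages the positionwise kernel uniformly over all pairs of positions, which the abstract relates to full translation augmentation, while an LAP-style variant down-weights distant positions, in the spirit of augmenting with small pixel shifts only.

import numpy as np

rng = np.random.default_rng(4)

P = 8  # number of spatial positions per image (hypothetical 1-D layout)
# K[p, q]: positionwise kernel value between position p of image A and
# position q of image B (placeholder values; in the real construction these
# come from the CNN-GP / CNTK recursion).
K = rng.random((P, P))

# GAP-style pooling: uniform average over all pairs of positions.
k_gap = K.mean()

# LAP-style pooling (schematic): give more weight to nearby position pairs.
width = 2.0
weights = np.exp(-np.subtract.outer(np.arange(P), np.arange(P)) ** 2 / (2 * width**2))
k_lap = (weights * K).sum() / weights.sum()

print(k_gap, k_lap)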

Posted Content
TL;DR: The contribution of weight decay and dropout to generalization is not only superfluous when sufficient implicit regularization is provided, but such techniques can also dramatically deteriorate performance if the hyperparameters are not carefully tuned for the architecture and data set.
Abstract: Contrary to most machine learning models, modern deep artificial neural networks typically include multiple components that contribute to regularization. Despite the fact that some (explicit) regularization techniques, such as weight decay and dropout, require costly fine-tuning of sensitive hyperparameters, the interplay between them and other elements that provide implicit regularization is not yet well understood. Shedding light upon these interactions is key to efficiently using computational resources and may contribute to solving the puzzle of generalization in deep learning. Here, we first provide formal definitions of explicit and implicit regularization that help understand essential differences between techniques. Second, we contrast data augmentation with weight decay and dropout. Our results show that visual object categorization models trained with data augmentation alone achieve the same performance as, or higher than, models also trained with weight decay and dropout, as is common practice. We conclude that the contribution of weight decay and dropout to generalization is not only superfluous when sufficient implicit regularization is provided, but also that such techniques can dramatically deteriorate performance if the hyperparameters are not carefully tuned for the architecture and data set. In contrast, data augmentation systematically provides large generalization gains and does not require hyperparameter re-tuning. In view of our results, we suggest optimizing neural networks without weight decay and dropout to save computational resources, and hence carbon emissions, and focusing more on data augmentation and other inductive biases to improve performance and robustness.

101 citations
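
The following minimal PyTorch-style sketch contrasts the two configurations compared in the study above (the model, transforms, and hyperparameter values are placeholders, not the paper's experimental setup): augmentation only, versus adding weight decay and dropout on top.

import torch
import torch.nn as nn
from torchvision import transforms

# Configuration A: implicit regularization via data augmentation only.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
model_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Linear(256, 10))
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)  # no weight decay

# Configuration B: the common practice of also adding explicit regularizers.
model_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Dropout(p=0.5), nn.Linear(256, 10))
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1, weight_decay=5e-4)

# The abstract's finding: with sufficient augmentation (A), the extra weight
# decay and dropout in (B) add little and can hurt if their hyperparameters
# are not re-tuned for the architecture and data set.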

Journal ArticleDOI
TL;DR: Experimental results show that EDL-COVID offers promising results for COVID-19 case detection, with an accuracy of 95%, better than COVID-Net's 93.3%, using a proposed weighted averaging ensembling method that is aware of the different sensitivities of deep learning models on different class types.
Abstract: Effective screening of COVID-19 cases has become extremely important for mitigating and stopping the quick spread of the disease during the current COVID-19 pandemic worldwide. In this article, we consider radiology examination using chest X-ray images, which is among the effective screening approaches for COVID-19 case detection. Given that deep learning is an effective tool and framework for image analysis, there have been many studies on COVID-19 case detection that train deep learning models with X-ray images. Although some of them report good prediction results, their proposed deep learning models might suffer from overfitting, high variance, and generalization errors caused by noise and a limited number of datasets. Considering that ensemble learning can overcome the shortcomings of deep learning by making predictions with multiple models instead of a single model, we propose EDL-COVID, an ensemble deep learning model employing deep learning and ensemble learning. The EDL-COVID model is generated by combining multiple snapshot models of COVID-Net, which pioneered an open-sourced COVID-19 case detection method that processes chest X-ray images with deep neural networks, by employing a proposed weighted averaging ensembling method that is aware of the different sensitivities of deep learning models on different class types. Experimental results show that EDL-COVID offers promising results for COVID-19 case detection, with an accuracy of 95%, better than COVID-Net's 93.3%.

94 citations
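
As a schematic of the class-aware weighted-averaging idea described above (the probabilities and weights below are made-up placeholders; in EDL-COVID the weights reflect each snapshot model's per-class sensitivity), the ensemble prediction is a per-class weighted combination of the snapshot models' softmax outputs rather than a plain mean.

import numpy as np

# Softmax outputs of three snapshot models for one chest X-ray, over three
# classes (e.g., normal, pneumonia, COVID-19) -- placeholder numbers.
probs = np.array([
    [0.20, 0.30, 0.50],
    [0.10, 0.25, 0.65],
    [0.30, 0.40, 0.30],
])

# Per-model, per-class weights (arbitrary illustrative values), normalized
# over the models for each class.
weights = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 2.0],
    [1.0, 1.0, 1.0],
])
weights = weights / weights.sum(axis=0, keepdims=True)

# Class-aware weighted averaging instead of a plain mean over the snapshots.
ensemble = (weights * probs).sum(axis=0)
ensemble = ensemble / ensemble.sum()
print(ensemble, ensemble.argmax())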