Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe1, Christian Szegedy1
06 Jul 2015-Vol. 1, pp 448-456
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
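The batch-normalizing transform described in the abstract is compact enough to sketch directly. Below is a minimal NumPy illustration of the training-time transform for a fully connected layer's activations; `gamma` and `beta` are the paper's learnable scale and shift, `eps` is a small numerical-stability constant, and the running statistics used at inference time are omitted for brevity.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a mini-batch.

    x     : activations, shape (batch_size, features)
    gamma : learnable scale, shape (features,)
    beta  : learnable shift, shape (features,)
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift so the layer can recover the identity

# Example: normalize a mini-batch of 32 activations with 4 features.
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))       # approximately 0 and 1 per feature
```

At inference time the paper replaces the mini-batch statistics with population estimates accumulated during training, which the sketch above leaves out.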


Citations
Journal ArticleDOI
07 Dec 2018-PLOS ONE
TL;DR: It is shown that the CDRP-N embedding is parametrically associated with increasing severity and can be used to better stratify patients' survival; the first application to survival prediction in High-Risk (HR) Neuroblastoma from transcriptomics data is presented.
Abstract: We introduce the CDRP (Concatenated Diagnostic-Relapse Prognostic) architecture for multi-task deep learning that incorporates a clinical algorithm, e.g., a risk stratification schema to improve prognostic profiling. We present the first application to survival prediction in High-Risk (HR) Neuroblastoma from transcriptomics data, a task that studies from the MAQC consortium have shown to remain the hardest among multiple diagnostic and prognostic endpoints predictable from the same dataset. To obtain a more accurate risk stratification needed for appropriate treatment strategies, CDRP combines a first component (CDRP-A) synthesizing a diagnostic task and a second component (CDRP-N) dedicated to one or more prognostic tasks. The approach leverages the advent of semi-supervised deep learning structures that can flexibly integrate multimodal data or internally create multiple processing paths. CDRP-A is an autoencoder trained on gene expression on the HR/non-HR risk stratification by the Children's Oncology Group, obtaining a 64-node representation in the bottleneck layer. CDRP-N is a multi-task classifier for two prognostic endpoints, i.e., Event-Free Survival (EFS) and Overall Survival (OS). CDRP-A provides the HR embedding input to the CDRP-N shared layer, from which two branches depart to model EFS and OS, respectively. To control for selection bias, CDRP is trained and evaluated using a Data Analysis Protocol (DAP) developed within the MAQC initiative. CDRP was applied on Illumina RNA-Seq of 498 Neuroblastoma patients (HR: 176) from the SEQC study (12,464 Entrez genes) and on Affymetrix Human Exon Array expression profiles (17,450 genes) of 247 primary diagnostic Neuroblastoma of the TARGET NBL cohort. On the SEQC HR patients, CDRP achieves Matthews Correlation Coefficient (MCC) 0.38 for EFS and MCC = 0.19 for OS in external validation, improving over published SEQC models. We show that a CDRP-N embedding is indeed parametrically associated to increasing severity and the embedding can be used to better stratify patients' survival.
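A rough PyTorch sketch of the CDRP wiring described above: an autoencoder (CDRP-A) whose 64-unit bottleneck provides the embedding, and a multi-task head (CDRP-N) with a shared layer from which two branches depart for EFS and OS. The 64-node bottleneck and the two prognostic branches come from the abstract; the hidden sizes and specific layers are illustrative guesses, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CDRP_A(nn.Module):
    """Autoencoder whose 64-unit bottleneck provides the HR embedding."""
    def __init__(self, n_genes, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, n_genes))

    def forward(self, x):
        z = self.encoder(x)              # embedding passed on to CDRP-N
        return self.decoder(z), z

class CDRP_N(nn.Module):
    """Multi-task head: shared layer, then separate EFS and OS branches."""
    def __init__(self, bottleneck=64, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(bottleneck, hidden), nn.ReLU())
        self.efs_branch = nn.Linear(hidden, 1)   # Event-Free Survival logit
        self.os_branch = nn.Linear(hidden, 1)    # Overall Survival logit

    def forward(self, z):
        h = self.shared(z)
        return self.efs_branch(h), self.os_branch(h)

# Illustrative forward pass: a mini-batch of 8 patients, 12,464 genes.
cdrp_a, cdrp_n = CDRP_A(n_genes=12464), CDRP_N()
x = torch.randn(8, 12464)
reconstruction, embedding = cdrp_a(x)
efs_logit, os_logit = cdrp_n(embedding)
```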

16 citations

Journal ArticleDOI
TL;DR: It is shown that relative depth can be an informative cue for metric depth estimation and can be easily obtained from vast stereo videos, and a new “relative depth in stereo” (RDIS) dataset densely labeled with relative depths is introduced.
Abstract: Most existing algorithms for depth estimation from single monocular images need large quantities of metric ground-truth depths for supervised learning. We show that relative depth can be an informative cue for metric depth estimation and can be easily obtained from vast stereo videos. Acquiring metric depths from stereo videos is sometimes impracticable due to the absence of camera parameters. In this paper, we propose to improve the performance of metric depth estimation with relative depths collected from stereo movie videos using an existing stereo matching algorithm. We introduce a new “relative depth in stereo” (RDIS) dataset densely labeled with relative depths. We first pretrain a ResNet model on our RDIS dataset. Then, we finetune the model on RGB-D datasets with metric ground-truth depths. During our finetuning, we formulate depth estimation as a classification task. This re-formulation scheme enables us to obtain the confidence of a depth prediction in the form of a probability distribution. With this confidence, we propose an information gain loss to make use of the predictions that are close to ground-truth. We evaluate our approach on both indoor and outdoor benchmark RGB-D datasets and achieve state-of-the-art performance.
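The re-formulation of depth regression as classification can be sketched as follows: the network predicts a distribution over discretized depth bins, the peak probability serves as the confidence, and (as a hedged reading of the "information gain loss") the loss rewards probability mass placed on bins near the ground-truth bin rather than only the exact bin. The bin count, Gaussian weighting, and its bandwidth below are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def information_gain_loss(logits, gt_bin, num_bins=80, sigma=2.0):
    """Cross-entropy against a soft target that spreads mass over bins
    close to the ground-truth depth bin (a sketch of an information-gain-style loss)."""
    bins = torch.arange(num_bins, dtype=torch.float32)            # (num_bins,)
    weights = torch.exp(-(bins - gt_bin.unsqueeze(1).float())**2 / (2 * sigma**2))
    target = weights / weights.sum(dim=1, keepdim=True)           # (batch, num_bins)
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()

# Confidence of a prediction comes directly from the softmax distribution.
logits = torch.randn(4, 80)                    # 4 pixels, 80 depth bins
prob = F.softmax(logits, dim=1)
confidence, predicted_bin = prob.max(dim=1)    # peak probability = confidence
loss = information_gain_loss(logits, gt_bin=torch.tensor([10, 20, 30, 40]))
```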

16 citations


Cites methods from "Batch Normalization: Accelerating D..."

  • ...Batch normalizations (BNs) [39] and ReLUs are applied before weight layers....

    [...]

Posted Content
TL;DR: This work evaluates the use of small-batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline-parallel training algorithm with significant hardware advantages, and introduces two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in this setting.
Abstract: New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms. In this work we evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm that has significant hardware advantages. We introduce two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in our setting. We show that appropriate normalization and small batch sizes can also aid training. With our methods, fine-grained Pipelined Backpropagation using a batch size of one can match the accuracy of SGD for multiple networks trained on CIFAR-10 and ImageNet. Simple scaling rules allow the use of existing hyperparameters for traditional training without additional tuning.
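The idea behind weight prediction can be illustrated with a minimal NumPy sketch: in an asynchronous pipeline the gradient for a micro-batch arrives several optimizer steps late, so the forward pass extrapolates the weights along the current momentum direction to approximate the weights that will exist when the gradient is applied. The extrapolation below is a simplified stand-in; the paper's exact formulation of Linear Weight Prediction (e.g., how the pipeline delay and momentum coefficient enter) may differ.

```python
import numpy as np

def predict_weights(w, velocity, lr, delay):
    """Linearly extrapolate weights `delay` optimizer steps ahead along the
    current momentum/velocity direction (simplified sketch, not the paper's exact rule)."""
    return w - lr * delay * velocity

# Toy pipeline stage: run the forward pass against predicted weights so that,
# by the time the corresponding gradient arrives `delay` steps later, the
# weights it updates are closer to the ones it was computed against.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
velocity = rng.standard_normal(10) * 0.01
w_forward = predict_weights(w, velocity, lr=0.1, delay=3)
```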

16 citations


Cites background or methods from "Batch Normalization: Accelerating D..."

  • ...Batch normalization (Ioffe & Szegedy, 2015) requires batch parallelism....

    [...]

  • ...To enable training at a batch size of one we replace batch normalization (Ioffe & Szegedy, 2015) with group normalization (Wu & He, 2018).... (see the sketch after this excerpt)

    [...]
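As a concrete illustration of the batch-size-one substitution mentioned in the excerpt above, the PyTorch snippet below swaps a BatchNorm2d layer for GroupNorm, which normalizes over channel groups within each sample and therefore needs no batch statistics. The group count of 32 follows the default suggested by Wu & He, not necessarily what the citing paper used.

```python
import torch
import torch.nn as nn

channels = 64
bn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU())       # needs batch statistics
gn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.GroupNorm(num_groups=32, num_channels=channels), nn.ReLU())

x = torch.randn(1, 3, 32, 32)   # batch size of one
y = gn_block(x)                 # well-defined: statistics are per-sample, per-group
```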

Journal ArticleDOI
TL;DR: In this paper, a multiscale feature fusion convolutional neural network (MFF-CNN) is proposed for the deep learning-based diagnosis of rotating machines; it extracts, modulates, and fuses the input samples' multiscale features so that the model focuses on health-state differences rather than noise disturbances and workload differences.
Abstract: Recently, the diagnosis of rotating machines based on deep learning models has achieved great success. Many of these intelligent diagnosis models are assumed that training and test data are subject to independent identical distributions (IIDs). Unfortunately, such an assumption is generally invalid in practical applications due to noise disturbances and changes in workload. To address the above problem, this article presents a high-stability diagnosis model named the multiscale feature fusion convolutional neural network (MFF-CNN). MFF-CNN does not rely on tedious data preprocessing and target domain information. It is composed of multiscale dilated convolution, self-adaptive weighting, and the new form of maxout (NFM) activation. It extracts, modulates, and fuses the input samples’ multiscale features so that the model focuses more on the health state difference rather than the noise disturbance and workload difference. Two diagnostic cases, including noisy cases and variable load cases, are used to verify the effectiveness of the present model. The results show that the present model has a strong health state identification capability and anti-interference capability for variable loads and noise disturbances.
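A hedged PyTorch sketch of the multiscale idea described above: parallel dilated convolutions extract features at several receptive-field sizes, and a learned (self-adaptive) weight modulates each branch before fusion. The dilation rates, channel counts, softmax weighting, and the use of plain ReLU in place of the paper's NFM activation are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Parallel dilated 1-D convolutions with learned per-branch weights."""
    def __init__(self, in_ch=1, out_ch=16, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations)
        self.branch_weights = nn.Parameter(torch.ones(len(dilations)))  # self-adaptive weights
        self.act = nn.ReLU()   # stand-in for the paper's NFM activation

    def forward(self, x):
        w = torch.softmax(self.branch_weights, dim=0)
        feats = [w[i] * self.act(branch(x)) for i, branch in enumerate(self.branches)]
        return torch.stack(feats, dim=0).sum(dim=0)   # fuse the modulated scales

# Vibration-signal-like input: a batch of 8 windows, 1 channel, 1024 samples.
y = MultiScaleFusion()(torch.randn(8, 1, 1024))
```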

16 citations

Journal ArticleDOI
TL;DR: A new deep learning-based construction equipment action recognition (CEAR) approach, a simplified temporal convolutional network (STCN) that combines a convolutional neural network with long short-term memory (LSTM, an artificial recurrent neural network), is proposed to recognize the actions of construction equipment such as excavators and dump trucks from video.
Abstract: In order to support smart construction, digital twin has been a well-recognized concept for virtually representing the physical facility. It is equally important to recognize human actions and the movement of construction equipment in virtual construction scenes. Compared to the extensive research on human action recognition (HAR) that can be applied to identify construction workers, research in the field of construction equipment action recognition (CEAR) is very limited, mainly due to the lack of available datasets with videos showing the actions of construction equipment. The contributions of this research are as follows: (1) the development of a comprehensive video dataset of 2,064 clips with five action types for excavators and dump trucks; (2) a new deep learning-based CEAR approach (known as a simplified temporal convolutional network or STCN) that combines a convolutional neural network (CNN) with long short-term memory (LSTM, an artificial recurrent neural network), where CNN is used to extract image features and LSTM is used to extract temporal features from video frame sequences; and (3) the comparison between this proposed new approach and a similar CEAR method and two of the best-performing HAR approaches, namely, three-dimensional (3D) convolutional networks (ConvNets) and two-stream ConvNets, to evaluate the performance of STCN and investigate the possibility of directly transferring HAR approaches to the field of CEAR.
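The CNN-plus-LSTM combination described above can be sketched in PyTorch as follows: a small convolutional backbone turns each video frame into a feature vector, an LSTM aggregates the per-frame features over time, and a linear layer predicts one of the five action types. The backbone and layer sizes are illustrative assumptions; the authors' STCN details differ.

```python
import torch
import torch.nn as nn

class CNNLSTMActionRecognizer(nn.Module):
    def __init__(self, num_actions=5, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # temporal features
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, clips):                          # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])                # logits over the action types

logits = CNNLSTMActionRecognizer()(torch.randn(2, 16, 3, 112, 112))
```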

16 citations


Cites methods from "Batch Normalization: Accelerating D..."

  • ...In order to ensure the stability of data distribution in each layer to improve the training efficiency, batch normalization is used in all convolutional layers [57]....

    [...]

References
Journal ArticleDOI
01 Jan 1998
TL;DR: This paper reviews gradient-based learning applied to document recognition, showing that convolutional neural networks can classify high-dimensional patterns such as handwritten characters with minimal preprocessing, and introduces graph transformer networks (GTNs) for globally training multi-module recognition systems with gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
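For concreteness, here is a minimal PyTorch sketch of a LeNet-style convolutional network for 32x32 handwritten-digit images, in the spirit of the architecture reviewed above. The layer sizes follow the commonly cited LeNet-5 layout, but details such as the original subsampling layers and RBF output layer are simplified.

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32x32 -> 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14x14 -> 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10))                                            # 10 digit classes

digits = torch.randn(4, 1, 32, 32)      # a mini-batch of 32x32 grayscale digits
print(lenet(digits).shape)              # torch.Size([4, 10])
```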

42,067 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14); one incarnation used in the submission is GoogLeNet, a 22-layer deep network.
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
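A hedged PyTorch sketch of the multi-scale building block behind GoogLeNet (the Inception module): parallel 1x1, 3x3, and 5x5 convolutions plus a pooled branch, with 1x1 convolutions used to keep the computational budget in check, concatenated along the channel dimension. The channel counts below are illustrative and do not correspond to a specific GoogLeNet stage.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)                                  # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),    # 1x1 reduce, then 3x3
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),    # 1x1 reduce, then 5x5
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),      # pooled branch + projection
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionModule(64, 32, 16, 32, 8, 16, 16)(torch.randn(1, 64, 28, 28))
print(y.shape)   # torch.Size([1, 96, 28, 28])
```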

40,257 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
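The train/test behaviour described above is easy to sketch in NumPy: during training each unit is kept with probability p and dropped otherwise, and at test time the full network is used with activations scaled by p to approximate the average over the thinned networks. This follows the original formulation summarized in the abstract; modern "inverted" dropout instead moves the scaling into the training pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p_keep=0.5):
    """Randomly zero units during training; each unit survives with probability p_keep."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask

def dropout_test(activations, p_keep=0.5):
    """Use the full network at test time, scaled so expected activations match training."""
    return activations * p_keep

h = rng.standard_normal((4, 8))    # hidden activations for a mini-batch of 4
h_train = dropout_train(h)         # one random "thinned" network
h_test = dropout_test(h)           # single unthinned network with smaller effective weights
```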

33,597 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings Article
21 Jun 2010
TL;DR: The binary stochastic hidden units of restricted Boltzmann machines can be approximated efficiently by noisy rectified linear units, which learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
Abstract: Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these "Stepped Sigmoid Units" are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
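The approximation described above can be made concrete with a short NumPy sketch: summing sigmoid copies whose biases are shifted by -0.5, -1.5, -2.5, ... closely tracks the softplus function log(1 + e^x), which is in turn well approximated by a rectified linear unit, and sampling such a "stepped sigmoid" unit can be approximated by adding Gaussian noise before rectifying. The specific noise scale used below (the sigmoid of the input) follows the commonly cited noisy-ReLU form and is an assumption about the paper's exact recipe.

```python
import numpy as np

def stepped_sigmoid_sum(x, n_copies=100):
    """Sum of sigmoid copies with progressively more negative biases."""
    offsets = np.arange(n_copies) + 0.5               # 0.5, 1.5, 2.5, ...
    return sum(1.0 / (1.0 + np.exp(-(x - o))) for o in offsets)

def softplus(x):
    return np.log1p(np.exp(x))

def noisy_relu(x, rng):
    """Noisy rectified linear unit: rectify after adding input-dependent Gaussian noise."""
    sigma = 1.0 / (1.0 + np.exp(-x))                  # assumed noise variance (sigmoid of input)
    return np.maximum(0.0, x + rng.normal(0.0, np.sqrt(sigma)))

x = np.linspace(-3, 3, 7)
print(stepped_sigmoid_sum(x))   # closely tracks softplus(x)
print(softplus(x))
print(noisy_relu(x, np.random.default_rng(0)))
```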

14,799 citations