Journal Article

diffGrad: An Optimization Method for Convolutional Neural Networks

29 Oct 2020-IEEE Transactions on Neural Networks (IEEE Trans Neural Netw Learn Syst)-Vol. 31, Iss: 11, pp 4500-4511
TL;DR: A novel optimizer, diffGrad, is proposed based on the difference between the present and the immediate past gradient; experiments show that it outperforms other optimizers and performs uniformly well for training CNNs with different activation functions.
Abstract: Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of the gradient behavior. Hence, an efficient way to optimize deep networks is to use an adaptive step size for each parameter. Recently, several attempts have been made to improve gradient descent, including AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients and thus do not take advantage of local changes in the gradients. In this article, a novel optimizer, diffGrad, is proposed based on the difference between the present and the immediate past gradient. In diffGrad, the step size is adjusted for each parameter so that parameters whose gradients change quickly take larger steps and parameters whose gradients change slowly take smaller steps. The convergence analysis is done using the regret bound approach of the online learning framework. A thorough analysis is carried out on three synthetic complex nonconvex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 data sets to compare diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual unit (ResNet)-based convolutional neural network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. Also, we show that diffGrad performs uniformly well for training CNNs with different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad .
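The per-parameter step-size idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes that the "difference between the present and the immediate past gradient" enters as a sigmoid of its absolute value (a factor between 0.5 and 1) that scales an otherwise Adam-style update. The function name `diffgrad_step` and the default values of `lr`, `beta1`, `beta2`, and `eps` (Adam's usual defaults) are assumptions for illustration, so the exact formulation should be checked against the paper or the linked source code.

```python
import math

def diffgrad_step(theta, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update over lists of scalar parameters.

    Assumption: the step is an Adam step multiplied by a "friction"
    factor sigmoid(|g_prev - g|), which lies in [0.5, 1).
    t is the 1-based step counter used for bias correction.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, g_prev, m_i, v_i in zip(theta, grad, prev_grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias-corrected moment estimates, as in Adam.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Larger local gradient change -> factor closer to 1 -> larger step.
        friction = 1.0 / (1.0 + math.exp(-abs(g_prev - g)))
        new_theta.append(p - lr * friction * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two parameters with the same current gradient but different gradient history.
theta, m, v = diffgrad_step(theta=[1.0, 1.0], grad=[0.5, 0.5],
                            prev_grad=[0.5, -3.5], m=[0.0, 0.0],
                            v=[0.0, 0.0], t=1)
print(theta)
```

In this example the first parameter's gradient did not change, so its step is roughly halved relative to Adam, while the second parameter's gradient changed sharply and its step stays close to the full Adam step.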
Citations
Posted Content
TL;DR: An extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers is performed, identifying a significantly reduced subset of specific algorithms and parameter choices that generally provided competitive results in the authors' experiments.
Abstract: Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than $50,000$ individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

85 citations


Additional excerpts

  • ..., 2019c) DiffGrad (Dubey et al., 2020) SADAM (Tong et al....

    [...]

Journal Article
TL;DR: This article presents an extensive review of artificial neural network (ANN)-based optimization algorithm techniques, together with some well-known optimization techniques, e.g., genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), and backtracking search algorithm (BSA).
Abstract: In the last few years, intensive research has been done to enhance artificial intelligence (AI) using optimization techniques. In this paper, we present an extensive review of artificial neural network (ANN)-based optimization algorithm techniques, together with some well-known optimization techniques, e.g., genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), and backtracking search algorithm (BSA), and some recently developed techniques, e.g., the lightning search algorithm (LSA) and whale optimization algorithm (WOA), and many more. All of these techniques are classified as population-based algorithms, in which the initial population is randomly created. Input parameters are initialized within the specified range, and they can provide optimal solutions. This paper emphasizes enhancing the neural network via optimization algorithms by manipulating its tuned parameters or training parameters to obtain the network structure that solves the problems in the best way. This paper includes some results for improving ANN performance with the PSO, GA, ABC, and BSA optimization techniques, each used to search for optimal parameters, e.g., the number of neurons in the hidden layers and the learning rate. The obtained neural net is used for solving energy management problems in the virtual power plant system.

70 citations

Journal Article
TL;DR: An end-to-end spectral–spatial squeeze-and-excitation (SE) residual bag-of-feature (S3EResBoF) learning framework for HSI classification that takes raw 3-D image cubes as input without engineering and builds a codebook representation of transform features by promoting the feature maps that facilitate classification and suppressing useless ones, based on patterns present in the feature maps.
Abstract: Convolutional neural networks (CNNs) have recently attracted great attention in hyperspectral image (HSI) classification, since deep CNNs exhibit commendable performance in computer vision-related areas. CNNs have already proved to be very effective feature extractors, especially for the classification of large data sets composed of 2-D images. However, due to the existence of noisy or correlated spectral bands in the spectral domain and nonuniform pixels in the spatial neighborhood, HSI classification results are often degraded and unacceptable. Elementary CNN models, however, often find an intrinsic representation of the patterns directly when employed to explore the HSI in the spectral–spatial domain. In this article, we design an end-to-end spectral–spatial squeeze-and-excitation (SE) residual bag-of-feature (S3EResBoF) learning framework for HSI classification that takes raw 3-D image cubes as input without engineering and builds a codebook representation of transform features by promoting the feature maps that facilitate classification and suppressing useless ones, based on patterns present in the feature maps. To boost the classification performance and learn the joint spatial–spectral features, every residual block is connected to every other 3-D convolutional layer through an identity mapping followed by an SE block, thereby facilitating rich gradients through backpropagation. Additionally, we introduce batch normalization on every convolutional layer (ConvBN) to regularize the convergence of the network and scale-invariant BoF quantization for the measure of classification. Experiments conducted on three well-known HSI data sets and compared with state-of-the-art classification methods reveal that S3EResBoF provides competitive performance in terms of both classification and computation time.

69 citations


Cites background from "diffGrad: An Optimization Method fo..."

  • ...The optimal learning rate [52] is chosen as 0....

    [...]

Posted Content
TL;DR: In this paper, a survey of state-of-the-art DL frameworks for hyperspectral imaging (HSI) classification is presented, organized into spectral-feature, spatial-feature, and joint spatial-spectral feature approaches.
Abstract: Hyperspectral Imaging (HSI) has been extensively utilized in many real-life applications because it benefits from the detailed spectral information contained in each pixel. Notably, the complex characteristics of HSI data, i.e., the nonlinear relation between the captured spectral information and the corresponding object, make accurate classification challenging for traditional methods. In the last few years, deep learning (DL) has been established as a powerful feature extractor that effectively addresses the nonlinear problems that appear in a number of computer vision tasks. This has prompted the deployment of DL for HSI classification (HSIC), which has shown good performance. This survey presents a systematic overview of DL for HSIC and compares state-of-the-art strategies on the topic. First, we summarize the main challenges of traditional machine learning for HSIC and then present the advantages of DL in addressing these problems. The survey breaks down the state-of-the-art DL frameworks into spectral-feature, spatial-feature, and joint spatial-spectral feature approaches to systematically analyze the achievements (and future directions) of these frameworks for HSIC. Moreover, we consider the fact that DL requires a large number of labeled training examples, whereas acquiring such a number for HSIC is challenging in terms of time and cost. Therefore, this survey discusses some strategies to improve the generalization performance of DL strategies, which can provide some future guidelines.

68 citations

Journal Article
TL;DR: In this article, a survey of state-of-the-art DL frameworks for hyperspectral imaging classification (HSIC) is presented. The authors also discuss some strategies to improve the generalization performance of DL strategies and provide some future guidelines.
Abstract: Hyperspectral imaging (HSI) has been extensively utilized in many real-life applications because it benefits from the detailed spectral information contained in each pixel. Notably, the complex characteristics of HSI data, i.e., the nonlinear relation between the captured spectral information and the corresponding object, make accurate classification challenging for traditional methods. In the last few years, deep learning (DL) has been established as a powerful feature extractor that effectively addresses the nonlinear problems that appear in a number of computer vision tasks. This has prompted the deployment of DL for HSI classification (HSIC), which has shown good performance. This survey presents a systematic overview of DL for HSIC and compares state-of-the-art strategies on the topic. First, we summarize the main challenges of traditional machine learning (TML) for HSIC and then present the advantages of DL in addressing these problems. This article breaks down the state-of-the-art DL frameworks into spectral-feature, spatial-feature, and joint spatial–spectral feature approaches to systematically analyze the achievements (and future research directions) of these frameworks for HSIC. Moreover, we consider the fact that DL requires a large number of labeled training examples, whereas acquiring such a number for HSIC is challenging in terms of time and cost. Therefore, this survey discusses some strategies to improve the generalization performance of DL strategies, which can provide some future guidelines.

63 citations

References
Proceedings Article
27 Jun 2016
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; an ensemble of these residual nets won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
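The phrase "learning residual functions with reference to the layer inputs" can be made concrete with a small sketch. It is a simplification under stated assumptions and not the architecture from the paper: plain fully-connected layers stand in for convolutions, batch normalization is omitted, and the helper names `relu`, `linear`, and `residual_block` are hypothetical. The block computes F(x) with two layers and outputs ReLU(F(x) + x), so the layers only need to learn the correction F(x) on top of the identity.

```python
def relu(x):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in x]

def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]

def residual_block(x, w1, w2):
    """Output = ReLU(F(x) + x), with F(x) = w2 · ReLU(w1 · x).

    The skip connection adds the input back, so F only has to model
    the residual between the desired output and the input.
    """
    f = linear(w2, relu(linear(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))        # [1.0, 2.0]
# With identity weights F(x) = x, so the block outputs 2x.
print(residual_block(x, identity, identity))  # [2.0, 4.0]
```

The zero-weight case hints at why such blocks are easier to optimize: a block that has nothing useful to add can fall back to the identity mapping by driving its weights toward zero.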

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
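The claim that Adam is invariant to rescaling of the gradients can be checked numerically with a short sketch. It tracks a single scalar parameter and uses the standard Adam moment estimates with bias correction. The function name `adam_steps`, the gradient values, and the scale factor of 1000 are illustrative assumptions. Because the first moment scales with the gradient and the square root of the second moment scales by the same factor, their ratio (and hence the step) is unchanged up to the small `eps` term.

```python
import math

def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam step sizes for a sequence of scalar gradients."""
    m = 0.0
    v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

grads = [0.3, -0.1, 0.25, 0.4]
scaled = [1000.0 * g for g in grads]

plain = adam_steps(grads)
rescaled = adam_steps(scaled)

# The two step sequences agree up to the effect of eps.
for a, b in zip(plain, rescaled):
    print(f"{a:.9f}  {b:.9f}")
```

Plain SGD would take steps 1000 times larger on the rescaled gradients, which is the contrast the abstract's invariance property refers to.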

111,197 citations


"diffGrad: An Optimization Method fo..." refers background or methods in this paper

  • ...Adam computes adaptive learning rates for each parameter [38] by utilizing both first and second moments....

    [...]

  • ...999, and learning rate α ∈ [10⁻², 10⁻⁴] is a good starting choice for many models [38]....

    [...]

  • ...The convergence property of Adam [38] is shown using the online learning framework proposed in [44]....

    [...]

  • ...Adam [38] is another widely used gradient descent optimization technique that computes the learning rates at each step based on two vectors known as the 1st and 2nd order moments (i....

    [...]

  • ...In this experiment, for both Adam [38] as well as the proposed diffGrad optimization methods, the following are the hyper-parameter settings: the decay rate for 1st moment (β₁) is 0....

    [...]

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved error rates considerably better than the previous state-of-the-art on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


"diffGrad: An Optimization Method fo..." refers background or methods in this paper

  • ...The popular CNN architectures for image categorization problems are AlexNet [2], VGGNet [4], GoogleNet [19], and ResNet [20]....

    [...]

  • ...Different CNN architectures have been proposed for image related problems such as AlexNet [2], VggNet [4], GoogLeNet [19], and ResNet [20] for image classification, R-CNN [21], Fast R-CNN [22], Faster R-CNN [23], and YOLO [24] for object detection, Mask R-CNN [25] and PANet [26] for instance segmentation, RCCNet [27] for colon cancer nuclei classification, etc....

    [...]

  • ...Due to the availability of GPU-based high-end computational facilities and the huge amount of data, deep learning based approaches generally outperform the traditional hand-designed approaches to solve research problems in Computer Vision [2], [3], [4], [5], Image Processing [6], [7], Signal Processing [8], [9], Robotics [10], Natural Language Processing [11], [12], and many other diverse areas of Artificial Intelligence....

    [...]

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations