Posted Content

Neural Architecture Search with Reinforcement Learning

Barret Zoph1, Quoc V. Le1
05 Nov 2016 - arXiv: Learning
TL;DR: This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
Abstract: Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.
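
A minimal sketch of the training signal the abstract describes: a recurrent controller samples architecture decisions token by token, the sampled network's validation accuracy is used as the reward, and the controller is updated with a REINFORCE-style policy-gradient step. Assumes PyTorch; the token vocabulary, the train_and_evaluate stand-in, and all sizes are illustrative assumptions, not the paper's actual search space.

    import torch
    import torch.nn as nn

    class Controller(nn.Module):
        """RNN that emits a sequence of architecture decisions (e.g. filter sizes)."""
        def __init__(self, num_choices=4, hidden=64, num_tokens=6):
            super().__init__()
            self.num_tokens = num_tokens
            self.embed = nn.Embedding(num_choices, hidden)
            self.rnn = nn.LSTMCell(hidden, hidden)
            self.head = nn.Linear(hidden, num_choices)

        def sample(self):
            h = c = torch.zeros(1, self.rnn.hidden_size)
            token = torch.zeros(1, dtype=torch.long)          # start token
            tokens, log_probs = [], []
            for _ in range(self.num_tokens):
                h, c = self.rnn(self.embed(token), (h, c))
                dist = torch.distributions.Categorical(logits=self.head(h))
                token = dist.sample()
                tokens.append(token.item())
                log_probs.append(dist.log_prob(token))
            return tokens, torch.stack(log_probs).sum()

    def train_and_evaluate(tokens):
        """Hypothetical stand-in: build the child network and return validation accuracy."""
        return torch.rand(1).item()

    controller = Controller()
    opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
    baseline = 0.0
    for step in range(10):
        tokens, log_prob = controller.sample()
        reward = train_and_evaluate(tokens)           # validation accuracy as reward
        baseline = 0.9 * baseline + 0.1 * reward      # moving-average baseline
        loss = -(reward - baseline) * log_prob        # REINFORCE objective
        opt.zero_grad()
        loss.backward()
        opt.step()
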
Citations
29 Oct 2018
TL;DR: Sherpa is a free open-source hyperparameter optimization library for machine learning models that aims to give the user flexibility over library, model, and hyperparameter optimization algorithm selection, with distributed execution and an interactive dashboard.
Abstract: Sherpa is a free open-source hyperparameter optimization library for machine learning models. It is designed for problems with computationally expensive iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of powerful and interchangeable algorithms. Additionally, the framework makes it easy to implement custom algorithms. Sherpa can be run on either a single machine or a cluster via a grid scheduler with minimal configuration. Finally, an interactive dashboard enables users to view the progress of models as they are trained, cancel trials, and explore which hyperparameter combinations are working best. Sherpa empowers machine learning researchers by automating the tedious aspects of model tuning and providing an extensible framework for developing automated hyperparameter-tuning strategies. Its source code and documentation are available at https://github.com/LarsHH/sherpa and https://parameter-sherpa.readthedocs.io/, respectively. A demo can be found at https://youtu.be/L95sasMLgP4.

1 Existing Hyperparameter Optimization Libraries

Hyperparameter optimization algorithms for machine learning models have previously been implemented in software packages such as Spearmint [15], HyperOpt [2], Auto-WEKA 2.0 [9], and Google Vizier [5], among others. Spearmint is a Python library based on Bayesian optimization using a Gaussian process. Hyperparameter exploration values are specified using the markup language YAML and run on a grid via SGE and MongoDB. Overall, it combines Bayesian optimization with the ability for distributed training. HyperOpt is a hyperparameter optimization framework that uses MongoDB to allow parallel computation. The user manually starts workers which receive tasks from the HyperOpt instance. It offers Random Search and Bayesian optimization based on a Tree of Parzen Estimators. Auto-WEKA 2.0 implements the SMAC [6] algorithm for automatic model selection and hyperparameter optimization within the WEKA machine learning framework. It provides a graphical user interface and supports parallel runs on a single machine. It is meant to be accessible for novice users and specifically targets the problem of choosing a model. Auto-WEKA is related to Auto-Sklearn [4] and Auto-Net [11], which specifically focus on tuning Scikit-Learn models and fully-connected neural networks in Lasagne, respectively. Auto-WEKA, Auto-Sklearn, and Auto-Net focus on an end-to-end automatic approach. This makes it easy for novice users, but restricts the user to the respective machine learning library and the models it implements. In contrast, our work aims to give the user more flexibility over library, model, and hyperparameter optimization algorithm selection. Google Vizier is a service provided by Google for its cloud machine learning platform. It incorporates recent innovations in Bayesian optimization, such as transfer learning, and provides visualizations via a dashboard. Google Vizier provides many key features of a current hyperparameter optimization tool to Google Cloud users and Google engineers, but is not available in an open-source version. A similar situation occurs with other cloud-based platforms such as Microsoft Azure Hyperparameter Tuning and Amazon SageMaker's Hyperparameter Optimization.

Table 1: Comparison to Existing Libraries

                      Spearmint   Auto-WEKA   HyperOpt   Google Vizier   Sherpa
    Early Stopping    No          No          No         Yes             Yes
    Dashboard/GUI     Yes         Yes         No         Yes             Yes
    Distributed       Yes         No          Yes        Yes             Yes
    Open Source       Yes         Yes         Yes        No              Yes
    # of Algorithms   2           1           2          3               5

2 Need for a new library

The field of machine learning has experienced massive growth over recent years. Access to open-source machine learning libraries such as Scikit-Learn [14], Keras [3], TensorFlow [1], PyTorch [13], and Caffe [8] allowed research in machine learning to be widely reproduced by the community, making it easy for practitioners to apply state-of-the-art methods to real-world problems. The field of hyperparameter optimization for machine learning has also seen many recent innovations, such as Hyperband [10], Population Based Training [7], Neural Architecture Search [17], and advances in Bayesian optimization such as [16]. While the basic implementation of some of these algorithms can be trivial, evaluating trials in a distributed fashion and keeping track of results becomes cumbersome, which makes it difficult for users to apply these algorithms to real problems. In short, Sherpa aims to curate implementations of these algorithms while providing the infrastructure to run them in a distributed way. The aim is for the platform to be scalable from usage on a laptop to a computation grid.
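
As a library-agnostic illustration of the trial loop such a tool automates (propose hyperparameters, train, record the objective, keep the best), here is a minimal random-search sketch. The build_and_train function and the parameter ranges are hypothetical placeholders; Sherpa's actual API, distributed scheduling, and dashboard are documented at the links above.

    import random

    def build_and_train(lr, num_units):
        """Hypothetical stand-in for training a model and returning a validation loss."""
        return random.random() + lr  # placeholder objective

    search_space = {
        "lr": lambda: 10 ** random.uniform(-4, -1),                # log-uniform learning rate
        "num_units": lambda: random.choice([32, 64, 128, 256]),    # discrete layer width
    }

    results = []
    for trial in range(20):                      # each trial could run on a separate worker
        params = {name: sample() for name, sample in search_space.items()}
        objective = build_and_train(**params)
        results.append((objective, params))

    best_loss, best_params = min(results, key=lambda r: r[0])
    print(best_loss, best_params)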

27 citations


Cites methods from "Neural Architecture Search with Rei..."

  • ...The field of hyperparameter optimization for machine learning has also seen many innovations recently such as Hyperband [10], Population Based Training [7], Neural Architecture Search [17], and innovation in Bayesian optimization such as [16]....


Proceedings ArticleDOI
09 Jul 2020
TL;DR: A one-shot neural architecture search method referred to as MergeNAS is proposed that merges different types of operations (e.g., convolutions) into one operation, which not only reduces the search cost but also alleviates over-fitting by reducing redundant parameters.
Abstract: Differentiable architecture search (DARTS) has been a promising one-shot architecture search approach for its mathematical formulation and competitive results. However, besides its high memory utilization and large computation requirements, many research works have shown that DARTS often suffers from notable over-fitting and thus does not work robustly for some new tasks. In this paper, we propose a one-shot neural architecture search method referred to as MergeNAS, which merges different types of operations (e.g., convolutions) into one operation. This merge-based approach not only reduces the search cost (about half a GPU day), but also alleviates over-fitting by reducing the redundant parameters. Extensive experiments on different search spaces and various datasets have been conducted to verify our approach, showing that MergeNAS can converge to a stable architecture and achieve better performance with fewer parameters and lower search cost. In terms of test accuracy and its stability, MergeNAS outperforms all NAS baseline methods implemented on NAS-Bench-201, including DARTS, ENAS, RS, BOHB, GDAS, and hand-crafted ResNet.
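
For context on the one-shot setting the abstract builds on, here is a minimal sketch of a DARTS-style mixed operation, where candidate operations on an edge are combined with softmax-weighted architecture parameters. MergeNAS goes further by merging the convolution candidates themselves into one shared operation, which this sketch does not implement. Assumes PyTorch; the candidate set and sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """Weighted sum of candidate operations on one edge of the search cell."""
        def __init__(self, channels):
            super().__init__()
            self.ops = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=1),   # 3x3 convolution
                nn.Conv2d(channels, channels, 5, padding=2),   # 5x5 convolution
                nn.Identity(),                                 # skip connection
            ])
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

        def forward(self, x):
            weights = F.softmax(self.alpha, dim=0)
            return sum(w * op(x) for w, op in zip(weights, self.ops))

    x = torch.randn(1, 16, 8, 8)
    print(MixedOp(16)(x).shape)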

27 citations

Posted Content
TL;DR: This paper integrates the tasks of architecture design and model compression into one unified framework, which enables the joint architecture search with quantization (compression) policies for neural networks.
Abstract: Designing neural architectures is a fundamental step in deep learning applications. As a partner technique, model compression on neural networks has been widely investigated to meet the need for deep learning algorithms to run with limited computation resources on mobile devices. Currently, both the tasks of architecture design and model compression require expert tricks and tedious trials. In this paper, we integrate these two tasks into one unified framework, which enables the joint architecture search with quantization (compression) policies for neural networks. This method is named JASQ. Our goal is to automatically find a compact neural network model with high performance that is suitable for mobile devices. Technically, a multi-objective evolutionary search algorithm is introduced to search for models under a balance between model size and performance accuracy. In experiments, we find that our approach outperforms methods that search only for architectures or only for quantization policies. 1) Specifically, given existing networks, our approach can provide them with learning-based quantization policies and outperforms their 2-bit, 4-bit, 8-bit, and 16-bit counterparts. It can yield higher accuracies than the float models, for example, over 1.02% higher accuracy on MobileNet-v1. 2) Furthermore, under the balance between model size and performance accuracy, two models are obtained with a joint search of architectures and quantization policies: a high-accuracy model and a small model, JASQNet and JASQNet-Small, which achieves a 2.97% error rate with 0.9 MB on CIFAR-10.
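
A minimal sketch of the kind of selection criterion such a multi-objective search might use: candidates pairing an architecture choice with a quantization bit width are scored by accuracy while penalizing model size. The scoring function, encoding, and numbers are illustrative assumptions, not the paper's actual evolutionary algorithm.

    import random

    def fitness(accuracy, model_size_mb, size_weight=0.05):
        """Higher is better: trade accuracy off against model size (MB)."""
        return accuracy - size_weight * model_size_mb

    # each candidate pairs an architecture choice with a quantization policy (bit width)
    population = [{"layers": random.randint(5, 20), "bits": random.choice([2, 4, 8, 16])}
                  for _ in range(8)]

    def evaluate(candidate):
        """Hypothetical stand-in for training/quantizing a model and measuring it."""
        accuracy = random.uniform(0.85, 0.95)
        size_mb = candidate["layers"] * candidate["bits"] * 0.05
        return fitness(accuracy, size_mb)

    # evolutionary selection step: keep the fittest half as parents for the next generation
    parents = sorted(population, key=evaluate, reverse=True)[:4]
    print(parents)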

27 citations


Cites methods from "Neural Architecture Search with Rei..."

  • ...NASNet [35]....


  • ...As shown in Table 2, JASQNet (float) and JASQNet-Small (float) are not better than NASNet [35] or AmoebaNet [27]....


  • ...In contrast to NAS that is considered at the topological level, model compression aims to refine the neural nodes of a given network with sparse connections or weighting-parameter quantization....


  • ...For neural architecture search space SA , we follow the NASNet search space [35]....


  • ...But the technique of NAS alone is far from real-world AI applications....


Posted Content
TL;DR: The genetic DCNN designer, an autonomous learning algorithm that can generate a DCNN architecture automatically based on the data available for a specific image classification problem, produces architectures whose performance is comparable to, if not better than, that of state-of-the-art DCNN models.
Abstract: Recent years have witnessed the breakthrough success of deep convolutional neural networks (DCNNs) in image classification and other vision applications. Although freeing users from troublesome handcrafted feature extraction by providing a uniform feature extraction-classification framework, DCNNs still require a handcrafted design of their architectures. In this paper, we propose the genetic DCNN designer, an autonomous learning algorithm that can generate a DCNN architecture automatically based on the data available for a specific image classification problem. We first partition a DCNN into multiple stacked meta convolutional blocks and fully connected blocks, each containing the operations of convolution, pooling, full connection, batch normalization, activation and dropout, and thus convert the architecture into an integer vector. Then, we use refined evolutionary operations, including selection, mutation and crossover, to evolve a population of DCNN architectures. Our results on the MNIST, Fashion-MNIST, EMNIST-Digit, EMNIST-Letter, CIFAR10 and CIFAR100 datasets suggest that the proposed genetic DCNN designer is able to automatically produce DCNN architectures whose performance is comparable to, if not better than, that of state-of-the-art DCNN models.
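
A minimal sketch of the encoding-and-evolution idea the abstract describes: an architecture is represented as an integer vector (here, one filter count per block) and evolved with mutation and crossover. The encoding and value ranges are illustrative assumptions, not the paper's exact scheme.

    import random

    def random_genome(num_blocks=5):
        """Integer vector: one entry per convolutional block, e.g. number of filters."""
        return [random.choice([16, 32, 64, 128, 256]) for _ in range(num_blocks)]

    def mutate(genome, rate=0.2):
        return [random.choice([16, 32, 64, 128, 256]) if random.random() < rate else g
                for g in genome]

    def crossover(parent_a, parent_b):
        point = random.randrange(1, len(parent_a))   # single-point crossover
        return parent_a[:point] + parent_b[point:]

    population = [random_genome() for _ in range(6)]
    child = mutate(crossover(population[0], population[1]))
    print(child)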

27 citations


Cites background from "Neural Architecture Search with Rei..."

  • ...Zoph and Le [36] employed a recurrent neural network trained by reinforcement learning to maximize the expected accuracy of the generated DCNN architecture on a validation set of images....


Journal ArticleDOI
TL;DR: Experimental evaluations on a wide range of DNNs show that the proposed NoC architecture enables 20%–80% reduction in communication latency with respect to state-of-the-art interconnect solutions.
Abstract: In-memory computing reduces latency and energy consumption of Deep Neural Networks (DNNs) by reducing the number of off-chip memory accesses. However, crossbar-based in-memory computing may significantly increase the volume of on-chip communication since the weights and activations are on-chip. State-of-the-art interconnect methodologies for in-memory computing deploy a bus-based network or mesh-based Network-on-Chip (NoC). Our experiments show that up to 90% of the total inference latency of a DNN hardware is spent on on-chip communication when the bus-based network is used. To reduce the communication latency, we propose a methodology to generate an NoC architecture along with a scheduling technique customized for different DNNs. We prove mathematically that the generated NoC architecture and corresponding schedules achieve the minimum possible communication latency for a given DNN. Furthermore, we generalize the proposed solution for edge computing and cloud computing. Experimental evaluations on a wide range of DNNs show that the proposed NoC architecture enables 20%–80% reduction in communication latency with respect to state-of-the-art interconnect solutions.

27 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
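
A minimal sketch of the residual reformulation the abstract describes: the block learns a residual function F(x) and outputs F(x) + x via an identity shortcut. Assumes PyTorch; the layer sizes are illustrative, not a specific ResNet configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))   # residual function F(x)
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)                  # identity shortcut: F(x) + x

    print(ResidualBlock(16)(torch.randn(1, 16, 32, 32)).shape)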

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
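
A minimal sketch of the update rule the abstract describes: exponential moving averages of the gradient and its square with bias correction, using the paper's default hyperparameters. Variable names are illustrative.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update based on adaptive estimates of the first and second moments."""
        m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
    print(theta)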

111,197 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
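
A minimal sketch of the design principle in the abstract: depth is increased by stacking very small 3x3 convolutions, with max pooling between stages. Assumes PyTorch; the channel counts are illustrative, not an exact VGG configuration.

    import torch.nn as nn

    def vgg_stage(in_ch, out_ch, num_convs):
        """A stage of num_convs 3x3 convolutions followed by 2x2 max pooling."""
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        return nn.Sequential(*layers)

    # stacking more such stages is how the paper pushes depth to 16-19 weight layers
    features = nn.Sequential(vgg_stage(3, 64, 2), vgg_stage(64, 128, 2), vgg_stage(128, 256, 3))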

55,235 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....


Journal ArticleDOI
01 Jan 1998
TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
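
A minimal sketch of the kind of convolutional network the paper evaluates for digit recognition: alternating convolution and subsampling layers followed by fully connected layers. Assumes PyTorch and 28x28 single-channel inputs; the layer sizes are illustrative, not the paper's exact LeNet-5 configuration.

    import torch
    import torch.nn as nn

    lenet_like = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(),   # feature maps
        nn.AvgPool2d(2),                                        # subsampling
        nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
        nn.AvgPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
        nn.Linear(120, 84), nn.Tanh(),
        nn.Linear(84, 10),                                      # 10 digit classes
    )

    print(lenet_like(torch.randn(1, 1, 28, 28)).shape)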

42,067 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
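
The descriptor the abstract describes (grids of orientation histograms with overlapping block normalization) is available in scikit-image; a minimal usage sketch, assuming scikit-image is installed, using the paper's standard 9-orientation, 8x8-cell, 2x2-block settings on a stand-in grayscale detection window.

    import numpy as np
    from skimage.feature import hog

    image = np.random.rand(128, 64)           # stand-in for a 128x64 grayscale detection window
    descriptor = hog(image,
                     orientations=9,          # fine orientation binning
                     pixels_per_cell=(8, 8),  # fine-scale spatial cells
                     cells_per_block=(2, 2),  # overlapping blocks for local contrast normalization
                     block_norm='L2-Hys')
    print(descriptor.shape)                   # flattened HOG feature vector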

31,952 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....
