Posted Content

Neural Architecture Search with Reinforcement Learning

Barret Zoph, Quoc V. Le
05 Nov 2016 - arXiv: Learning
TL;DR: This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
Abstract: Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.
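
The controller described in the abstract is trained with a policy-gradient method: architectures are sampled from the controller's distribution, each sampled architecture is trained and scored on a validation set, and that accuracy is fed back as the reward. Below is a minimal REINFORCE sketch of that loop, with a toy three-decision action space and a stand-in validation_accuracy reward in place of the paper's RNN controller and child-network training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action space: for each of 3 layers, choose one of 3 filter-size options.
NUM_STEPS, NUM_CHOICES = 3, 3
logits = np.zeros((NUM_STEPS, NUM_CHOICES))    # controller parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def validation_accuracy(arch):
    # Hypothetical stand-in for "train the child network described by arch,
    # then measure its accuracy on the validation set".
    return 0.5 + 0.1 * sum(arch) / (NUM_STEPS * (NUM_CHOICES - 1)) + 0.02 * rng.standard_normal()

baseline, lr = 0.0, 0.5
for step in range(200):
    arch, grads = [], np.zeros_like(logits)
    for t in range(NUM_STEPS):                 # sample one decision at a time
        p = softmax(logits[t])
        a = rng.choice(NUM_CHOICES, p=p)
        arch.append(a)
        grads[t] = np.eye(NUM_CHOICES)[a] - p  # d/dlogits of log pi(a)
    reward = validation_accuracy(arch)
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline reduces variance
    logits += lr * (reward - baseline) * grads # REINFORCE: maximize expected reward
print("most likely architecture:", logits.argmax(axis=1))
```
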
Citations
Proceedings ArticleDOI
05 Nov 2018
TL;DR: Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
Abstract: Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works optimize only for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, during inference. In this paper, we first introduce the problem of NAS and provide a survey of recent works. Then we take a deeper look at two recent advances that extend NAS into multiple-objective frameworks: MONAS [19] and DPP-Net [10]. Both MONAS and DPP-Net are capable of optimizing accuracy together with other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
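
The multi-objective comparison both works rely on reduces to Pareto dominance: one architecture dominates another if it is no worse on every objective and strictly better on at least one. A small sketch of that check with made-up (error %, latency ms) pairs, not results from MONAS or DPP-Net:

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better on one.
    All objectives are to be minimised (e.g. error rate, latency in ms)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical (error %, latency ms) pairs for candidate architectures.
candidates = [(3.6, 120.0), (4.1, 45.0), (3.9, 60.0), (4.5, 44.0), (3.8, 130.0)]
print(pareto_front(candidates))   # architectures not dominated on both objectives
```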

29 citations


Cites background from "Neural Architecture Search with Rei..."

  • ... neural architectures is usually a manual and time-consuming process that heavily relies on experience and expertise. Recently, neural architecture search (NAS) has been proposed to address this issue[3,27]. Models designed by NAS have achieved impressive performance that is close to or even outperforms the current state-of-the-art designed by domain experts in several challenging tasks[4,14], demonstra...

  • ..., ResNet[31] and DenseNet[32], proposed skip-connection and dense-connection, respectively, to create “branches” of the data flow in a neural network. Possibly inspired by these structures, Zoph et al.[3] proposed to design the search space including skip connections; this search space has been quickly adopted by other works[4,8,10,12]. Another recent trend is to design a search space that covers only...

  • ...mance is taken as the reward. Related literatures. In general, various RL-based approaches for NAS differ in (a) how the action space is designed, and (b) how the action policy is updated. Zoph et al.[3] first applied policy gradient to update the policy, and in their later work[4] changed to use proximal policy optimization; Baker et al.[6] used Q-learning to update the action policy. There are also ...

  • ...mparisons of Neural Architecture Search Approaches. Single-Objective Neural Architecture Search:
      Approach         Search Space  Algorithm  Acceleration Techniques  Search Cost (GPU Days)  Additional Objectives
      NAS[3]           Macro         RL         -                        22400                   -
      NasNet[4]        Micro         RL         -                        1800                    -
      Hierarchical[5]  Micro         EA/RS      -                        300                     -
      MetaQNN[6]       Macro         RL         -                        100                     -
      GeNet[7]         Macro         EA         -                        17                      -
      Large-Scale[8]   Macro         EA         Weight-Sharing           2500                    -
      Amoeba[9]        Micro         E...

  • ... search algorithms in the following sections. 2.2 Reinforcement-Learning-Based Approaches Reinforcement-learning-based approaches have been the mainstream methods for NAS, especially after Zoph et al.[3] demonstrated the impressive experimental results that outperform the state-of-the-art models designed by domain experts. NAS formulated as reinforcement learning (RL) There are three fundamental elem...

Journal ArticleDOI
TL;DR: It is found that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems, and the inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems.
Abstract: Genetic programming has found recent success as a tool for learning sets of features for regression and classification. Multidimensional genetic programming is a useful variant of genetic programming for this task because it represents candidate solutions as sets of programs. These sets of programs expose additional information that can be exploited for building block identification. In this work, we discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. We investigate methods for biasing the components of programs that are promoted in order to guide search towards useful and complementary feature spaces. We study two main approaches: 1) the introduction of new objectives and 2) the use of specialized semantic variation operators. We find that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems. The inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems in comparison to several state-of-the-art machine learning approaches and other genetic programming frameworks. Finally, we look at the collinearity and complexity of the data representations produced by different methods, in order to assess whether relevant, concise, and independent factors of variation can be produced in application.
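
A rough sketch of the stagewise-regression idea behind the semantic crossover, under the simplifying assumption that each parent's programs are represented only by their output vectors on the training data; the greedy selection below is illustrative, not the authors' implementation:

```python
import numpy as np

def stagewise_select(features, y, k):
    """Forward stagewise selection: repeatedly pick the feature column most
    aligned with the current residual, fit it, and subtract its contribution."""
    residual, chosen = y.astype(float).copy(), []
    for _ in range(k):
        score = np.abs(features.T @ residual)
        score[chosen] = -np.inf                 # do not re-pick a column
        j = int(np.argmax(score))
        chosen.append(j)
        beta = (features[:, j] @ residual) / (features[:, j] @ features[:, j] + 1e-12)
        residual -= beta * features[:, j]
    return chosen

def semantic_crossover(parent_a, parent_b, y, k=3):
    """Pool the two parents' program outputs and keep the k programs that
    stagewise regression selects; they form the offspring's feature set."""
    programs = list(parent_a) + list(parent_b)
    pooled = np.column_stack(programs)
    return [programs[i] for i in stagewise_select(pooled, y, k)]

rng = np.random.default_rng(0)
y = rng.standard_normal(50)
parent_a = [rng.standard_normal(50) for _ in range(4)]
parent_b = [rng.standard_normal(50) for _ in range(4)]
print(len(semantic_crossover(parent_a, parent_b, y, k=3)))   # 3 surviving programs
```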

28 citations

Proceedings ArticleDOI
19 Jul 2020
TL;DR: In this article, an efficient particle swarm optimisation method named EPSOCNN is proposed to evolve CNN architectures inspired by the idea of transfer learning, which successfully reduces the computation cost by minimising the search space to a single block and utilising a small subset of the training set to evaluate CNNs during the evolutionary process.
Abstract: Deep Convolutional Neural Networks (CNNs) have been widely used in image classification tasks, but designing CNN architectures is very complex, so Neural Architecture Search (NAS), which searches for optimal CNN architectures automatically, has attracted increasing research interest. However, the computational cost of NAS is often too high for real-life applications. In this paper, an efficient particle swarm optimisation method named EPSOCNN is proposed to evolve CNN architectures, inspired by the idea of transfer learning. EPSOCNN reduces the computation cost by restricting the search space to a single block and by using a small subset of the training set to evaluate candidate CNNs during the evolutionary process. Meanwhile, EPSOCNN retains very competitive classification accuracy by stacking the evolved block multiple times and fitting the resulting network to the whole training dataset. The proposed algorithm is evaluated on the CIFAR-10 dataset and compared with 13 peer competitors, including deep CNNs crafted by hand, learned by reinforcement learning methods, and evolved by evolutionary computation approaches. It shows very promising results with regard to classification accuracy, number of parameters, and computational cost. In addition, the evolved transferable block from CIFAR-10 is transferred to and evaluated on two other datasets, CIFAR-100 and SVHN, with promising results on both, demonstrating the transferability of the evolved block. All experiments were performed multiple times, and Student's t-test is used to compare the proposed method with its peer competitors from a statistical point of view.
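
A minimal sketch of the particle swarm loop such a method relies on, with a toy fitness function standing in for "decode the block, train briefly on a small subset of the data, return validation accuracy"; the encoding, bounds, and coefficients are illustrative assumptions, not EPSOCNN's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical continuous encoding of one block: [num_layers, growth_rate, kernel_choice].
LOW, HIGH = np.array([1.0, 8.0, 0.0]), np.array([6.0, 48.0, 2.0])

def fitness(position):
    # Stand-in for "decode the block, train briefly on a small subset, return accuracy".
    layers, growth, kernel = np.round(position)
    return -((layers - 4) ** 2 + (growth - 32) ** 2 / 100 + (kernel - 1) ** 2)  # toy objective

n_particles, dims, iters = 10, 3, 30
pos = rng.uniform(LOW, HIGH, size=(n_particles, dims))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and acceleration coefficients
for _ in range(iters):
    r1, r2 = rng.random((n_particles, dims)), rng.random((n_particles, dims))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, LOW, HIGH)
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()
print("best decoded block:", np.round(gbest))
```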

28 citations

Proceedings Article
03 May 2021
TL;DR: The authors proposed TE-NAS, which ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space, and showed that these two measurements imply the trainability and expressivity of a neural network and strongly correlate with the network's actual test accuracy.
Abstract: Neural Architecture Search (NAS) has been explosively studied to automate the discovery of top-performing neural networks. Current works require heavy training of a supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training and eliminate a drastic portion of the search cost? We provide an affirmative answer by proposing a novel framework called training-free neural architecture search (TE-NAS). TE-NAS ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. Both are motivated by recent theory advances in deep networks and can be computed without any training. We show that: (1) these two measurements imply the trainability and expressivity of a neural network; and (2) they strongly correlate with the network's actual test accuracy. Further on, we design a pruning-based NAS mechanism to achieve a more flexible and superior trade-off between trainability and expressivity during the search. In the NAS-Bench-201 and DARTS search spaces, TE-NAS completes a high-quality search at a cost of only 0.5 and 4 GPU hours with one 1080Ti on CIFAR-10 and ImageNet, respectively. We hope our work inspires more attempts at bridging the theoretical findings of deep networks and practical impacts in real NAS applications.
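
One of the two training-free measurements, the number of linear regions, can be approximated by counting distinct ReLU activation patterns produced by a batch of inputs pushed through an untrained network. A sketch for small random MLPs (not the authors' code, which operates on convolutional search spaces):

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_patterns(weights, biases, x):
    """Forward inputs through an untrained ReLU net and record which units fire;
    the number of distinct patterns lower-bounds the number of linear regions."""
    patterns, h = [], x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(patterns, axis=1)

def count_regions(layer_sizes, n_samples=3000):
    weights = [rng.standard_normal((m, n)) / np.sqrt(m)
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [0.1 * rng.standard_normal(n) for n in layer_sizes[1:]]
    x = rng.uniform(-1.0, 1.0, size=(n_samples, layer_sizes[0]))   # bounded input domain
    pats = activation_patterns(weights, biases, x)
    return len({row.tobytes() for row in pats})

# A deeper/wider candidate typically realises more distinct regions on the same inputs;
# TE-NAS uses this expressivity signal (alongside the NTK spectrum) to rank architectures.
print(count_regions([2, 4, 4, 1]), count_regions([2, 12, 12, 1]))
```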

28 citations

Journal ArticleDOI
TL;DR: This paper attempts to automatically optimize the hyperparameters of a CNN architecture for a speech recognition task using particle swarm optimization (PSO), a population-based stochastic optimization technique.
Abstract: The Convolutional Neural Network (CNN) is one of the most successful deep learning algorithms and has shown its effectiveness in a variety of vision tasks. The performance of this network depends directly on its hyperparameters. However, designing CNN architectures requires expert knowledge of their intrinsic structure or a lot of trial and error. To overcome these issues, there is a need to design the optimal architecture of a CNN automatically, without any human intervention. We therefore try to remove the constraints that traditional architectures place on the number and type of convolutional and pooling layers. Biologically inspired approaches have not been extensively exploited for this task. This paper attempts to automatically optimize the hyperparameters of a CNN architecture for a speech recognition task using particle swarm optimization (PSO), a population-based stochastic optimization technique. The proposed method is evaluated by designing a CNN architecture for a speech recognition task on a Hindi dataset. The experimental results show that the proposed method designs competitive CNN architectures that perform on par with other state-of-the-art methods.
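
A sketch of how a continuous particle position can be decoded into a CNN description in this kind of search; the layout (up to five convolutional layers, each with a filter count, kernel size, and pooling flag) is an illustrative assumption, not the encoding used in the cited paper:

```python
import numpy as np

MAX_LAYERS = 5   # hypothetical cap, chosen only for illustration

def decode(position):
    """Turn a continuous particle position into a CNN description:
    position[0] picks the depth, then each layer consumes three numbers."""
    position = [float(v) for v in position]
    n_layers = int(np.clip(round(position[0]), 1, MAX_LAYERS))
    layers = []
    for i in range(n_layers):
        f, k, p = position[1 + 3 * i: 4 + 3 * i]
        layers.append({
            "filters": int(np.clip(round(f), 8, 256)),
            "kernel": (3, 5, 7)[int(np.clip(round(k), 0, 2))],
            "pool": "max" if p > 0.5 else "none",
        })
    return layers

# In the PSO loop, each particle position would be decoded like this, the CNN trained
# briefly, and the resulting validation accuracy used as the particle's fitness.
print(decode([3.2, 60, 0.4, 0.7, 120, 1.6, 0.2, 30, 2.4, 0.9] + [0.0] * 6))
```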

28 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
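
The reformulation amounts to learning a residual function F(x) and adding it back to the input, y = F(x) + x, so the shortcut path carries the identity. A toy fully-connected version as a sketch (the actual blocks use convolutions and batch normalization as described in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # Learn a residual function F(x) and add it back to the input: y = F(x) + x.
    # The identity shortcut lets signal and gradients flow even if F is driven towards zero.
    return relu(relu(x @ W1) @ W2 + x)

d = 16
x = rng.standard_normal((4, d))
W1, W2 = 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d))
print(residual_block(x, W1, W2).shape)   # (4, 16): same shape, identity path preserved
```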

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
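
A minimal NumPy transcription of the update rule the abstract describes: exponential moving averages of the gradient and its elementwise square, bias-corrected for their zero initialisation, then a per-parameter step; the default hyperparameters below are the paper's suggested values.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count used for bias correction)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = ||theta||^2 as a toy check; the gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # driven towards [0, 0]
```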

111,197 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
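
The key design choice, stacks of very small 3x3 filters, can be quantified with a quick parameter count: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer parameters and an extra non-linearity in between (the channel width below is an arbitrary example):

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameters of a single k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C = 256  # example channel width
stacked_3x3 = 2 * conv_params(3, C, C)   # two 3x3 layers: 5x5 receptive field
single_5x5 = conv_params(5, C, C)        # one 5x5 layer:  same receptive field
print(stacked_3x3, single_5x5)           # 1,180,160 vs 1,638,656 parameters
```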

55,235 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....

Journal ArticleDOI
01 Jan 1998
TL;DR: Convolutional neural networks, designed to handle the variability of 2D shapes, are shown to outperform other techniques on handwritten character recognition, and a new learning paradigm, graph transformer networks (GTN), allows multimodule recognition systems to be trained globally with gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

42,067 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
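
A simplified sketch of the descriptor itself: per-pixel gradient magnitude and orientation, then an orientation histogram per cell; the overlapping block-wise contrast normalization the paper finds important is collapsed here into a single global normalization for brevity:

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Simplified HOG: gradient magnitude and unsigned orientation per pixel,
    then a 9-bin orientation histogram per cell, with a crude global normalisation."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180           # unsigned orientation, 0-180 deg
    h, w = image.shape
    H, W = h // cell, w // cell
    hist = np.zeros((H, W, bins))
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    for i in range(H):
        for j in range(W):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            for b in range(bins):
                hist[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    return hist / (np.linalg.norm(hist) + 1e-6)

img = np.random.default_rng(0).random((64, 64))
print(hog_cells(img).shape)   # (8, 8, 9): one 9-bin orientation histogram per 8x8 cell
```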

31,952 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....
