Posted Content

Neural Architecture Search with Reinforcement Learning

Barret Zoph, Quoc V. Le
05 Nov 2016 - arXiv: Learning
TL;DR: This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
Abstract: Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.
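
The controller described in the abstract is trained with a policy-gradient method: architectures are sampled from the controller's distribution, each sampled architecture is trained and scored on a validation set, and that accuracy is fed back as the reward. Below is a minimal REINFORCE sketch of that loop, with a toy three-decision action space and a stand-in validation_accuracy reward in place of the paper's RNN controller and child-network training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action space: for each of 3 layers, choose one of 3 filter-size options.
NUM_STEPS, NUM_CHOICES = 3, 3
logits = np.zeros((NUM_STEPS, NUM_CHOICES))    # controller parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def validation_accuracy(arch):
    # Hypothetical stand-in for "train the child network described by arch,
    # then measure its accuracy on the validation set".
    return 0.5 + 0.1 * sum(arch) / (NUM_STEPS * (NUM_CHOICES - 1)) + 0.02 * rng.standard_normal()

baseline, lr = 0.0, 0.5
for step in range(200):
    arch, grads = [], np.zeros_like(logits)
    for t in range(NUM_STEPS):                 # sample one decision at a time
        p = softmax(logits[t])
        a = rng.choice(NUM_CHOICES, p=p)
        arch.append(a)
        grads[t] = np.eye(NUM_CHOICES)[a] - p  # d/dlogits of log pi(a)
    reward = validation_accuracy(arch)
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline reduces variance
    logits += lr * (reward - baseline) * grads # REINFORCE: maximize expected reward
print("most likely architecture:", logits.argmax(axis=1))
```
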
Citations
Proceedings ArticleDOI
05 Nov 2018
TL;DR: Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
Abstract: Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works optimize only for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, during inference. In this paper, we first introduce the problem of NAS and provide a survey of recent works. Then we take a deeper look at two recent advances that extend NAS into multiple-objective frameworks: MONAS [19] and DPP-Net [10]. Both MONAS and DPP-Net are capable of optimizing accuracy together with other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
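
The multi-objective comparison both works rely on reduces to Pareto dominance: one architecture dominates another if it is no worse on every objective and strictly better on at least one. A small sketch of that check with made-up (error %, latency ms) pairs, not results from MONAS or DPP-Net:

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better on one.
    All objectives are to be minimised (e.g. error rate, latency in ms)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical (error %, latency ms) pairs for candidate architectures.
candidates = [(3.6, 120.0), (4.1, 45.0), (3.9, 60.0), (4.5, 44.0), (3.8, 130.0)]
print(pareto_front(candidates))   # architectures not dominated on both objectives
```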

29 citations


Cites background from "Neural Architecture Search with Rei..."

  • ... neural architectures is usually a manual and time-consuming process that heavily relies on experience and expertise. Recently, neural architecture search (NAS) has been proposed to address this issue[3,27]. Models designed by NAS have achieved impressive performance that is close to or even outperforms the current state-of-the-art designed by domain experts in several challenging tasks[4,14], demonstra...

  • ..., ResNet[31] and DenseNet[32], proposed skip-connection and dense-connection, respectively, to create “branches” of the data flow in a neural network. Possibly inspired by these structures, Zoph et al.[3] proposed to design the search space including skip connections; this search space has been quickly adopted by other works[4,8,10,12]. Another recent trend is to design a search space that covers only...

  • ...mance is taken as the reward. Related literatures. In general, various RL-based approaches for NAS differ in (a) how the action space is designed, and (b) how the action policy is updated. Zoph et al.[3] first applied policy gradient to update the policy, and in their later work[4] changed to use proximal policy optimization; Baker et al.[6] used Q-learning to update the action policy. There are also ...

  • ...mparisons of Neural Architecture Search Approaches. Single-Objective Neural Architecture Search:
      Approach         Search Space  Algorithm  Acceleration Techniques  Search Cost (GPU Days)  Additional Objectives
      NAS[3]           Macro         RL         -                        22400                   -
      NasNet[4]        Micro         RL         -                        1800                    -
      Hierarchical[5]  Micro         EA/RS      -                        300                     -
      MetaQNN[6]       Macro         RL         -                        100                     -
      GeNet[7]         Macro         EA         -                        17                      -
      Large-Scale[8]   Macro         EA         Weight-Sharing           2500                    -
      Amoeba[9]        Micro         E...

  • ... search algorithms in the following sections. 2.2 Reinforcement-Learning-Based Approaches Reinforcement-learning-based approaches have been the mainstream methods for NAS, especially after Zoph et al.[3] demonstrated the impressive experimental results that outperform the state-of-the-art models designed by domain experts. NAS formulated as reinforcement learning (RL) There are three fundamental elem...

Journal ArticleDOI
TL;DR: It is found that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems, and the inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems.
Abstract: Genetic programming has found recent success as a tool for learning sets of features for regression and classification. Multidimensional genetic programming is a useful variant of genetic programming for this task because it represents candidate solutions as sets of programs. These sets of programs expose additional information that can be exploited for building block identification. In this work, we discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. We investigate methods for biasing the components of programs that are promoted in order to guide search towards useful and complementary feature spaces. We study two main approaches: 1) the introduction of new objectives and 2) the use of specialized semantic variation operators. We find that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems. The inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems in comparison to several state-of-the-art machine learning approaches and other genetic programming frameworks. Finally, we look at the collinearity and complexity of the data representations produced by different methods, in order to assess whether relevant, concise, and independent factors of variation can be produced in application.
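
A rough sketch of the stagewise-regression idea behind the semantic crossover, under the simplifying assumption that each parent's programs are represented only by their output vectors on the training data; the greedy selection below is illustrative, not the authors' implementation:

```python
import numpy as np

def stagewise_select(features, y, k):
    """Forward stagewise selection: repeatedly pick the feature column most
    aligned with the current residual, fit it, and subtract its contribution."""
    residual, chosen = y.astype(float).copy(), []
    for _ in range(k):
        score = np.abs(features.T @ residual)
        score[chosen] = -np.inf                 # do not re-pick a column
        j = int(np.argmax(score))
        chosen.append(j)
        beta = (features[:, j] @ residual) / (features[:, j] @ features[:, j] + 1e-12)
        residual -= beta * features[:, j]
    return chosen

def semantic_crossover(parent_a, parent_b, y, k=3):
    """Pool the two parents' program outputs and keep the k programs that
    stagewise regression selects; they form the offspring's feature set."""
    programs = list(parent_a) + list(parent_b)
    pooled = np.column_stack(programs)
    return [programs[i] for i in stagewise_select(pooled, y, k)]

rng = np.random.default_rng(0)
y = rng.standard_normal(50)
parent_a = [rng.standard_normal(50) for _ in range(4)]
parent_b = [rng.standard_normal(50) for _ in range(4)]
print(len(semantic_crossover(parent_a, parent_b, y, k=3)))   # 3 surviving programs
```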

28 citations

Proceedings ArticleDOI
19 Jul 2020
TL;DR: In this article, an efficient particle swarm optimisation method named EPSOCNN is proposed to evolve CNN architectures inspired by the idea of transfer learning, which successfully reduces the computation cost by minimising the search space to a single block and utilising a small subset of the training set to evaluate CNNs during the evolutionary process.
Abstract: Deep Convolutional Neural Networks (CNNs) have been widely used in image classification tasks, but designing CNN architectures is very complex, so Neural Architecture Search (NAS), which searches for optimal CNN architectures automatically, has attracted increasing research interest. However, the computational cost of NAS is often too high for real-life applications. In this paper, an efficient particle swarm optimisation method named EPSOCNN is proposed to evolve CNN architectures, inspired by the idea of transfer learning. EPSOCNN reduces the computation cost by restricting the search space to a single block and by using a small subset of the training set to evaluate candidate CNNs during the evolutionary process. Meanwhile, EPSOCNN retains very competitive classification accuracy by stacking the evolved block multiple times and fitting the resulting network to the whole training dataset. The proposed algorithm is evaluated on the CIFAR-10 dataset and compared with 13 peer competitors, including deep CNNs crafted by hand, learned by reinforcement learning methods, and evolved by evolutionary computation approaches. It shows very promising results with regard to classification accuracy, number of parameters, and computational cost. In addition, the evolved transferable block from CIFAR-10 is transferred to and evaluated on two other datasets, CIFAR-100 and SVHN, with promising results on both, demonstrating the transferability of the evolved block. All experiments were performed multiple times, and Student's t-test is used to compare the proposed method with its peer competitors from a statistical point of view.
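
A minimal sketch of the particle swarm loop such a method relies on, with a toy fitness function standing in for "decode the block, train briefly on a small subset of the data, return validation accuracy"; the encoding, bounds, and coefficients are illustrative assumptions, not EPSOCNN's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical continuous encoding of one block: [num_layers, growth_rate, kernel_choice].
LOW, HIGH = np.array([1.0, 8.0, 0.0]), np.array([6.0, 48.0, 2.0])

def fitness(position):
    # Stand-in for "decode the block, train briefly on a small subset, return accuracy".
    layers, growth, kernel = np.round(position)
    return -((layers - 4) ** 2 + (growth - 32) ** 2 / 100 + (kernel - 1) ** 2)  # toy objective

n_particles, dims, iters = 10, 3, 30
pos = rng.uniform(LOW, HIGH, size=(n_particles, dims))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and acceleration coefficients
for _ in range(iters):
    r1, r2 = rng.random((n_particles, dims)), rng.random((n_particles, dims))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, LOW, HIGH)
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()
print("best decoded block:", np.round(gbest))
```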

28 citations

Proceedings Article
03 May 2021
TL;DR: The authors proposed TE-NAS, which ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space, and showed that these two measurements imply the trainability and expressivity of a neural network and strongly correlate with the network's actual test accuracy.
Abstract: Neural Architecture Search (NAS) has been explosively studied to automate the discovery of top-performing neural networks. Current works require heavy training of a supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training and eliminate a drastic portion of the search cost? We provide an affirmative answer by proposing a novel framework called training-free neural architecture search (TE-NAS). TE-NAS ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. Both are motivated by recent theory advances in deep networks and can be computed without any training. We show that: (1) these two measurements imply the trainability and expressivity of a neural network; and (2) they strongly correlate with the network's actual test accuracy. Further on, we design a pruning-based NAS mechanism to achieve a more flexible and superior trade-off between trainability and expressivity during the search. In the NAS-Bench-201 and DARTS search spaces, TE-NAS completes a high-quality search at a cost of only 0.5 and 4 GPU hours with one 1080Ti on CIFAR-10 and ImageNet, respectively. We hope our work inspires more attempts at bridging the theoretical findings of deep networks and practical impacts in real NAS applications.
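
One of the two training-free measurements, the number of linear regions, can be approximated by counting distinct ReLU activation patterns produced by a batch of inputs pushed through an untrained network. A sketch for small random MLPs (not the authors' code, which operates on convolutional search spaces):

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_patterns(weights, biases, x):
    """Forward inputs through an untrained ReLU net and record which units fire;
    the number of distinct patterns lower-bounds the number of linear regions."""
    patterns, h = [], x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(patterns, axis=1)

def count_regions(layer_sizes, n_samples=3000):
    weights = [rng.standard_normal((m, n)) / np.sqrt(m)
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [0.1 * rng.standard_normal(n) for n in layer_sizes[1:]]
    x = rng.uniform(-1.0, 1.0, size=(n_samples, layer_sizes[0]))   # bounded input domain
    pats = activation_patterns(weights, biases, x)
    return len({row.tobytes() for row in pats})

# A deeper/wider candidate typically realises more distinct regions on the same inputs;
# TE-NAS uses this expressivity signal (alongside the NTK spectrum) to rank architectures.
print(count_regions([2, 4, 4, 1]), count_regions([2, 12, 12, 1]))
```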

28 citations

Journal ArticleDOI
TL;DR: This paper attempts to automatically optimize the hyperparameters of a CNN architecture for a speech recognition task using particle swarm optimization (PSO), a population-based stochastic optimization technique.
Abstract: The Convolutional Neural Network (CNN) is one of the most successful deep learning algorithms and has shown its effectiveness in a variety of vision tasks. The performance of this network depends directly on its hyperparameters. However, designing CNN architectures requires expert knowledge of their intrinsic structure or a lot of trial and error. To overcome these issues, there is a need to design the optimal architecture of a CNN automatically, without any human intervention. We therefore try to remove the constraints that traditional architectures place on the number and type of convolutional and pooling layers. Biologically inspired approaches have not been extensively exploited for this task. This paper attempts to automatically optimize the hyperparameters of a CNN architecture for a speech recognition task using particle swarm optimization (PSO), a population-based stochastic optimization technique. The proposed method is evaluated by designing a CNN architecture for a speech recognition task on a Hindi dataset. The experimental results show that the proposed method designs competitive CNN architectures that perform on par with other state-of-the-art methods.
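
A sketch of how a continuous particle position can be decoded into a CNN description in this kind of search; the layout (up to five convolutional layers, each with a filter count, kernel size, and pooling flag) is an illustrative assumption, not the encoding used in the cited paper:

```python
import numpy as np

MAX_LAYERS = 5   # hypothetical cap, chosen only for illustration

def decode(position):
    """Turn a continuous particle position into a CNN description:
    position[0] picks the depth, then each layer consumes three numbers."""
    position = [float(v) for v in position]
    n_layers = int(np.clip(round(position[0]), 1, MAX_LAYERS))
    layers = []
    for i in range(n_layers):
        f, k, p = position[1 + 3 * i: 4 + 3 * i]
        layers.append({
            "filters": int(np.clip(round(f), 8, 256)),
            "kernel": (3, 5, 7)[int(np.clip(round(k), 0, 2))],
            "pool": "max" if p > 0.5 else "none",
        })
    return layers

# In the PSO loop, each particle position would be decoded like this, the CNN trained
# briefly, and the resulting validation accuracy used as the particle's fitness.
print(decode([3.2, 60, 0.4, 0.7, 120, 1.6, 0.2, 30, 2.4, 0.9] + [0.0] * 6))
```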

28 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
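
The reformulation amounts to learning a residual function F(x) and adding it back to the input, y = F(x) + x, so the shortcut path carries the identity. A toy fully-connected version as a sketch (the actual blocks use convolutions and batch normalization as described in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # Learn a residual function F(x) and add it back to the input: y = F(x) + x.
    # The identity shortcut lets signal and gradients flow even if F is driven towards zero.
    return relu(relu(x @ W1) @ W2 + x)

d = 16
x = rng.standard_normal((4, d))
W1, W2 = 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d))
print(residual_block(x, W1, W2).shape)   # (4, 16): same shape, identity path preserved
```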

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
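
A minimal NumPy transcription of the update rule the abstract describes: exponential moving averages of the gradient and its elementwise square, bias-corrected for their zero initialisation, then a per-parameter step; the default hyperparameters below are the paper's suggested values.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count used for bias correction)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = ||theta||^2 as a toy check; the gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # driven towards [0, 0]
```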

111,197 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
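
The key design choice, stacks of very small 3x3 filters, can be quantified with a quick parameter count: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer parameters and an extra non-linearity in between (the channel width below is an arbitrary example):

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameters of a single k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C = 256  # example channel width
stacked_3x3 = 2 * conv_params(3, C, C)   # two 3x3 layers: 5x5 receptive field
single_5x5 = conv_params(5, C, C)        # one 5x5 layer:  same receptive field
print(stacked_3x3, single_5x5)           # 1,180,160 vs 1,638,656 parameters
```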

55,235 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....

Journal ArticleDOI
01 Jan 1998
TL;DR: Convolutional neural networks, designed to handle the variability of 2D shapes, are shown to outperform other techniques on handwritten character recognition, and a new learning paradigm, graph transformer networks (GTN), allows multimodule recognition systems to be trained globally with gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

42,067 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
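
A simplified sketch of the descriptor itself: per-pixel gradient magnitude and orientation, then an orientation histogram per cell; the overlapping block-wise contrast normalization the paper finds important is collapsed here into a single global normalization for brevity:

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Simplified HOG: gradient magnitude and unsigned orientation per pixel,
    then a 9-bin orientation histogram per cell, with a crude global normalisation."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180           # unsigned orientation, 0-180 deg
    h, w = image.shape
    H, W = h // cell, w // cell
    hist = np.zeros((H, W, bins))
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    for i in range(H):
        for j in range(W):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            for b in range(bins):
                hist[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    return hist / (np.linalg.norm(hist) + 1e-6)

img = np.random.default_rng(0).random((64, 64))
print(hog_cells(img).shape)   # (8, 8, 9): one 9-bin orientation histogram per 8x8 cell
```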

31,952 citations


"Neural Architecture Search with Rei..." refers methods in this paper

  • ...Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999), and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a)....
