Very Deep Convolutional Networks for Large-Scale Image Recognition
TL;DR: This work investigates how convolutional network depth affects accuracy in large-scale image recognition. Using an architecture with very small (3x3) convolution filters, it shows that pushing the depth to 16-19 weight layers yields a significant improvement over prior-art configurations.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
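To make the appeal of very small filters concrete, the sketch below compares a stack of three 3x3 convolution layers with a single 7x7 layer. It is an illustration under stated assumptions, not code from the paper: stride 1, equal input and output channel counts, biases ignored, and a channel width of 64 chosen only for the example. The helper name `stacked_conv_stats` is hypothetical.

```python
def stacked_conv_stats(kernel_size, num_layers, channels):
    """Receptive field and weight count for a stack of stride-1 convolutions."""
    # Each stride-1 layer widens the receptive field by (kernel_size - 1).
    receptive_field = 1 + num_layers * (kernel_size - 1)
    # Weights per layer: k * k * C_in * C_out, assuming C_in == C_out (biases ignored).
    weights = num_layers * kernel_size * kernel_size * channels * channels
    return receptive_field, weights


# Assumed channel width of 64, chosen only for illustration.
three_3x3 = stacked_conv_stats(kernel_size=3, num_layers=3, channels=64)
one_7x7 = stacked_conv_stats(kernel_size=7, num_layers=1, channels=64)
print(three_3x3, one_7x7)
```

Under these assumptions both options cover the same 7x7 receptive field, but the stacked 3x3 layers use fewer weights and interleave more non-linearities, which is one reason depth can be increased at manageable cost.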
Cites background or result from "Very Deep Convolutional Networks fo..."
...Table 8 compares full MobileNet to the original GoogleNet and VGG16....
...The general trend has been to make deeper and more complicated networks in order to achieve higher accuracy [27, 31, 29, 8]....
...MobileNet is nearly as accurate as VGG16 while being 32 times smaller and 27 times less compute intensive....
Cites background or methods from "Very Deep Convolutional Networks fo..."
...If the output is downsampled by a factor of f, shift the input x pixels to the right and y pixels down, once for every (x, y) s.t. 0 ≤ x, y < f....
...Sliding window detection by Sermanet et al., semantic segmentation by Pinheiro and Collobert, and image restoration by Eigen et al. do fully convolutional inference....
...Next, we add skips between layers to fuse coarse, semantic and local, appearance information....
...Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension....
...Convnets are not only improving for whole-image classification [22, 34, 35], but also making progress on local tasks with structured output....
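The shift-and-stitch snippet quoted above can be illustrated with a toy sketch. It assumes a hypothetical stand-in for a network that downsamples by a factor of f (plain strided subsampling of a 2D grid) and a simplified shift convention; the function names `shift_offsets`, `subsample`, and `stitch` are invented for this example and do not come from the citing paper.

```python
def shift_offsets(f):
    """All (x, y) input shifts with 0 <= x, y < f, i.e. f * f shifts in total."""
    return [(x, y) for y in range(f) for x in range(f)]


def subsample(grid, f, x, y):
    """Toy stand-in for a net that downsamples by f:
    keep every f-th value, starting at column x and row y."""
    return [row[x::f] for row in grid[y::f]]


def stitch(coarse_outputs, f, height, width):
    """Interlace the f * f coarse outputs back into one dense height x width grid."""
    dense = [[None] * width for _ in range(height)]
    for (x, y), coarse in coarse_outputs.items():
        for i, row in enumerate(coarse):
            for j, value in enumerate(row):
                dense[y + i * f][x + j * f] = value
    return dense


f = 2
grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy input
coarse_outputs = {(x, y): subsample(grid, f, x, y) for (x, y) in shift_offsets(f)}
dense = stitch(coarse_outputs, f, height=4, width=4)
```

In this toy setting, running the coarse operation once per shift and interlacing the results recovers a value at every pixel, which is the idea behind obtaining dense predictions from a downsampling network.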