ImageNet Large Scale Visual Recognition Challenge

doi:10.1007/S11263-015-0816-Y

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹•Institutions (4)

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision (Springer Netherlands)-Vol. 115, Iss: 3, pp 211-252

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually since 2010 and has attracted participation from more than fifty institutions.

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

Citations

Proceedings Article•DOI•

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

Di Lin, Jifeng Dai¹, Jiaya Jia, Kaiming He¹, Jian Sun¹•Institutions (1)

Microsoft¹

01 Jun 2016

TL;DR: This paper proposes using scribbles to annotate images and develops an algorithm to train convolutional networks for semantic segmentation supervised by scribbles.

Abstract: Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure. We note that for the topic of interactive image segmentation, scribbles are very widely used in academic research and commercial software, and are recognized as one of the most user-friendly ways of interacting. In this paper, we propose to use scribbles to annotate images, and develop an algorithm to train convolutional networks for semantic segmentation supervised by scribbles. Our algorithm is based on a graphical model that jointly propagates information from scribbles to unmarked pixels and learns network parameters. We present competitive object semantic segmentation results on the PASCAL VOC dataset by using scribbles as annotations. Scribbles are also favored for annotating stuff (e.g., water, sky, grass) that has no well-defined shape, and our method shows excellent results on the PASCAL-CONTEXT dataset thanks to extra inexpensive scribble annotations. Our scribble annotations on PASCAL VOC are available at http://research.microsoft.com/en-us/um/people/jifdai/downloads/scribble_sup.

748 citations

Book Chapter•DOI•

The Visual Object Tracking VOT2016 Challenge Results

Matej Kristan¹, Ales Leonardis², Jiří Matas³, Michael Felsberg⁴, Roman Pflugfelder⁵, Luka Cehovin¹, Tomas Vojir³, Gustav Häger⁴, Alan Lukežič¹, Gustavo Fernandez⁵, Abhinav Gupta⁶, Alfredo Petrosino⁷, Alireza Memarmoghadam⁸, Alvaro Garcia-Martin⁹, Andres Solis Montero¹⁰, Andrea Vedaldi¹¹, Andreas Robinson⁴, Andy J. Ma¹², Anton Varfolomieiev¹³, A. Aydin Alatan¹⁴, Aykut Erdem¹⁵, Bernard Ghanem¹⁶, Bin Liu, Bohyung Han¹⁷, Brais Martinez¹⁸, Chang-Ming Chang¹⁹, Changsheng Xu²⁰, Chong Sun²¹, Daijin Kim¹⁷, Dapeng Chen²², Dawei Du²⁰, Deepak Mishra²³, Dit-Yan Yeung²⁴, Erhan Gundogdu²⁵, Erkut Erdem¹⁵, Fahad Shahbaz Khan⁴, Fatih Porikli²⁶, Fatih Porikli²⁷, Fei Zhao²⁰, Filiz Bunyak²⁸, Francesco Battistone⁷, Gao Zhu²⁶, Giorgio Roffo²⁹, Gorthi R. K. Sai Subrahmanyam²³, Guilherme Sousa Bastos³⁰, Guna Seetharaman³¹, Henry Medeiros³², Hongdong Li²⁶, Honggang Qi²⁰, Horst Bischof³³, Horst Possegger³³, Huchuan Lu²¹, Hyemin Lee¹⁷, Hyeonseob Nam³⁴, Hyung Jin Chang³⁵, Isabela Drummond³⁰, Jack Valmadre¹¹, Jae-chan Jeong³⁶, Jaeil Cho³⁶, Jae-Yeong Lee³⁶, Jianke Zhu³⁷, Jiayi Feng²⁰, Jin Gao²⁰, Jin-Young Choi, Jingjing Xiao², Ji-Wan Kim³⁶, Jiyeoup Jeong, João F. Henriques¹¹, Jochen Lang¹⁰, Jongwon Choi, José M. Martínez⁹, Junliang Xing²⁰, Junyu Gao²⁰, Kannappan Palaniappan²⁸, Karel Lebeda³⁸, Ke Gao²⁸, Krystian Mikolajczyk³⁵, Lei Qin²⁰, Lijun Wang²¹, Longyin Wen¹⁹, Luca Bertinetto¹¹, Madan Kumar Rapuru²³, Mahdieh Poostchi²⁸, Mario Edoardo Maresca⁷, Martin Danelljan⁴, Matthias Mueller¹⁶, Mengdan Zhang²⁰, Michael Arens, Michel Valstar¹⁸, Ming Tang²⁰, Mooyeol Baek¹⁷, Muhammad Haris Khan¹⁸, Naiyan Wang²⁴, Nana Fan³⁹, Noor M. Al-Shakarji²⁸, Ondrej Miksik¹¹, Osman Akin¹⁵, Payman Moallem⁸, Pedro Senna³⁰, Philip H. S. Torr¹¹, Pong C. Yuen¹², Qingming Huang³⁹, Qingming Huang²⁰, Rafael Martin-Nieto⁹, Rengarajan Pelapur²⁸, Richard Bowden³⁸, Robert Laganiere¹⁰, Rustam Stolkin², Ryan Walsh³², Sebastian B. Krah, Shengkun Li¹⁹, Shengping Zhang³⁹, Shizeng Yao²⁸, Simon Hadfield³⁸, Simone Melzi²⁹, Siwei Lyu¹⁹, Siyi Li²⁴, Stefan Becker, Stuart Golodetz¹¹, Sumithra Kakanuru²³, Sunglok Choi³⁶, Tao Hu²⁰, Thomas Mauthner³³, Tianzhu Zhang²⁰, Tony P. Pridmore¹⁸, Vincenzo Santopietro⁷, Weiming Hu²⁰, Wenbo Li⁴⁰, Wolfgang Hübner, Xiangyuan Lan¹², Xiaomeng Wang¹⁸, Xin Li³⁹, Yang Li³⁷, Yiannis Demiris³⁵, Yifan Wang²¹, Yuankai Qi³⁹, Zejian Yuan²², Zexiong Cai¹², Zhan Xu³⁷, Zhenyu He³⁹, Zhizhen Chi²¹•Institutions (40)

University of Ljubljana¹, University of Birmingham², Czech Technical University in Prague³, Linköping University⁴, Austrian Institute of Technology⁵, Carnegie Mellon University⁶, Parthenope University of Naples⁷, University of Isfahan⁸, Autonomous University of Madrid⁹, University of Ottawa¹⁰, University of Oxford¹¹, Hong Kong Baptist University¹², Kyiv Polytechnic Institute¹³, Middle East Technical University¹⁴, Hacettepe University¹⁵, King Abdullah University of Science and Technology¹⁶, Pohang University of Science and Technology¹⁷, University of Nottingham¹⁸, University at Albany, SUNY¹⁹, Chinese Academy of Sciences²⁰, Dalian University of Technology²¹, Xi'an Jiaotong University²², Indian Institute of Space Science and Technology²³, Hong Kong University of Science and Technology²⁴, ASELSAN²⁵, Australian National University²⁶, Commonwealth Scientific and Industrial Research Organisation²⁷, University of Missouri²⁸, University of Verona²⁹, Universidade Federal de Itajubá³⁰, United States Naval Research Laboratory³¹, Marquette University³², Graz University of Technology³³, Naver Corporation³⁴, Imperial College London³⁵, Electronics and Telecommunications Research Institute³⁶, Zhejiang University³⁷, University of Surrey³⁸, Harbin Institute of Technology³⁹, Lehigh University⁴⁰

08 Oct 2016

TL;DR: The Visual Object Tracking challenge VOT2016 goes beyond its predecessors by introducing a new semi-automatic ground truth bounding box annotation methodology and extending the evaluation system with the no-reset experiment.

Abstract: The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-the-art trackers makes the VOT 2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).

744 citations

Proceedings Article•DOI•

Environmental sound classification with convolutional neural networks

Karol J. Piczak¹•Institutions (1)

Warsaw University of Technology¹

12 Nov 2015

TL;DR: The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

Abstract: This paper evaluates the potential of convolutional neural networks in classifying short audio clips of environmental sounds. A deep model consisting of 2 convolutional layers with max-pooling and 2 fully connected layers is trained on a low level representation of audio data (segmented spectrograms) with deltas. The accuracy of the network is evaluated on 3 public datasets of environmental and urban recordings. The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

742 citations

Proceedings Article•DOI•

Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images

Shuran Song¹, Jianxiong Xiao¹•Institutions (1)

Princeton University¹

27 Jun 2016

TL;DR: This work proposes the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D.

Abstract: We focus on the task of amodal 3D object detection in RGB-D images, which aims to produce a 3D bounding box of an object in metric form at its full extent. We introduce Deep Sliding Shapes, a 3D ConvNet formulation that takes a 3D volumetric scene from a RGB-D image as input and outputs 3D object bounding boxes. In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D. In particular, we handle objects of various sizes by training an amodal RPN at two different scales and an ORN to regress 3D bounding boxes. Experiments show that our algorithm outperforms the state-of-the-art by 13.8 in mAP and is 200× faster than the original Sliding Shapes.

740 citations

Posted Content•

Deep Learning for Identifying Metastatic Breast Cancer

Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, Andrew H. Beck

18 Jun 2016-arXiv: Quantitative Methods

TL;DR: The power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses is demonstrated, by combining the deep learning system's predictions with the human pathologist's diagnoses.

Abstract: The International Symposium on Biomedical Imaging (ISBI) held a grand challenge to evaluate computational systems for the automated detection of metastatic breast cancer in whole slide images of sentinel lymph node biopsies. Our team won both competitions in the grand challenge, obtaining an area under the receiver operating curve (AUC) of 0.925 for the task of whole slide image classification and a score of 0.7051 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. Combining our deep learning system's predictions with the human pathologist's diagnoses increased the pathologist's AUC to 0.995, representing an approximately 85 percent reduction in human error rate. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.
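
The 85 percent figure quoted in this abstract can be checked with a short calculation. The sketch below assumes that "human error rate" means 1 minus the AUC, which the abstract does not state explicitly; the variable names are illustrative only.

```python
# Assumption: "error rate" is taken to be 1 - AUC (not stated in the abstract).
pathologist_auc = 0.966  # pathologist alone, as reported in the abstract
combined_auc = 0.995     # pathologist plus deep learning system, as reported

error_before = 1.0 - pathologist_auc
error_after = 1.0 - combined_auc

# Relative reduction in error after combining the two sets of predictions.
relative_reduction = (error_before - error_after) / error_before

print(f"error before: {error_before:.3f}")
print(f"error after:  {error_after:.3f}")
print(f"relative reduction: {relative_reduction:.1%}")
```

Under that assumption the error falls from 0.034 to 0.005, which is consistent with the "approximately 85 percent" reduction the abstract reports.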

739 citations

References

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art classification results on ImageNet.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
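
The top-1 and top-5 error rates quoted in this abstract follow the usual top-k definition: a prediction counts as correct when the true label is among the k highest-scoring classes. The following is a minimal sketch of that definition, not the official challenge evaluation code; the function name `top_k_error`, the input format, and the toy scores are all assumptions made for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes.

    scores: list of per-example lists of class scores (hypothetical input format)
    labels: list of true class indices, one per example
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example with 3 examples and 4 classes (made-up scores, not ImageNet data).
scores = [
    [0.1, 0.6, 0.2, 0.1],    # true class 1 is ranked first
    [0.4, 0.1, 0.3, 0.2],    # true class 2 is ranked second
    [0.5, 0.3, 0.15, 0.05],  # true class 3 is ranked last
]
labels = [1, 2, 3]

print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

In this toy case the error drops as k grows, because the second example is only recovered once two guesses are allowed; the same effect explains why the reported top-5 error is much lower than the top-1 error.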

73,978 citations

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
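
One commonly cited reason for preferring very small (3x3) filters is that a stack of them covers the same spatial extent as one larger filter while using fewer weights. The abstract does not spell this out, so the sketch below is only an illustration under stated assumptions: stride-1 convolutions, equal input and output channel counts, biases ignored, and an arbitrary channel width of 256. The helper names are hypothetical.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with equal input/output channels, biases ignored."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stride-1 conv layers stacked without pooling."""
    return 1 + num_layers * (kernel_size - 1)


channels = 256  # illustrative channel width, chosen only for this example

stack_3x3 = 3 * conv_weights(3, channels)  # three stacked 3x3 layers
single_7x7 = conv_weights(7, channels)     # one 7x7 layer

print(stacked_receptive_field(3, 3))  # spatial extent covered by the 3x3 stack
print(stack_3x3, single_7x7)
```

With C channels the stack uses 27C² weights against 49C² for the single 7x7 layer, and it also interleaves more non-linearities, which is one way to read the abstract's emphasis on depth.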

55,235 citations

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

49,914 citations

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Journal Article•DOI•

Distinctive Image Features from Scale-Invariant Keypoints

David G. Lowe¹•Institutions (1)

University of British Columbia¹

01 Nov 2004-International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations