Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009-pp 248-255
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
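For illustration, the WordNet hierarchy that ImageNet builds on can be explored with NLTK's WordNet interface (a minimal sketch, not part of the ImageNet pipeline; the choice of subtree is arbitrary). Each synset enumerated below is a node that ImageNet aims to populate with roughly 500-1000 cleanly labeled, full-resolution images:

# Minimal sketch (not the authors' pipeline): enumerate a WordNet noun
# subtree with NLTK to see how many synsets such a subtree contains.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

root = wn.synset('mammal.n.01')                      # example subtree root
subtree = set(root.closure(lambda s: s.hyponyms()))  # all descendant synsets
subtree.add(root)

print(f"{len(subtree)} synsets under {root.name()}")
for s in sorted(subtree, key=lambda s: s.name())[:5]:
    # each synset would hold its own set of annotated images in ImageNet
    print(s.name(), '-', s.definition())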


Citations
Journal ArticleDOI
TL;DR: This work proposes a new enhancement to Convolutional LSTM networks that supports accommodation of multiple convolutional kernels and layers, and proposes an attention-based mechanism that is specifically designed for the multi-kernel extension.
Abstract: Action recognition greatly benefits motion understanding in video analysis. Recurrent networks such as long short-term memory (LSTM) networks are a popular choice for motion-aware sequence learning tasks. Recently, a convolutional extension of LSTM was proposed, in which input-to-hidden and hidden-to-hidden transitions are modeled through convolution with a single kernel. This implies an unavoidable trade-off between effectiveness and efficiency. Herein, we propose a new enhancement to convolutional LSTM networks that supports accommodation of multiple convolutional kernels and layers. This resembles a Network-in-LSTM approach, which improves upon the aforementioned concern. In addition, we propose an attention-based mechanism that is specifically designed for our multi-kernel extension. We evaluated our proposed extensions in a supervised classification setting on the UCF-101 and Sports-1M datasets, with the findings showing that our enhancements improve accuracy. We also undertook qualitative analysis to reveal the characteristics of our system and the convolutional LSTM baseline.
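As a rough illustration of the multi-kernel idea described above, the following PyTorch sketch (illustrative class and parameter names, not the authors' implementation) defines a convolutional LSTM cell whose gate pre-activations are computed by several parallel convolution kernels:

# Illustrative sketch only (assumes PyTorch): a ConvLSTM cell in which the
# input-to-hidden and hidden-to-hidden transitions use several parallel
# convolution kernels instead of a single one.
import torch
import torch.nn as nn

class MultiKernelConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_sizes=(3, 5)):
        super().__init__()
        # one conv per kernel size; each maps [x, h] to the 4 gate pre-activations
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)
        # average the gate pre-activations across kernel sizes
        # (one simple way to combine them, chosen here for brevity)
        gates = sum(conv(z) for conv in self.convs) / len(self.convs)
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

# usage: x is (batch, channels, H, W); h and c are zero-initialized
cell = MultiKernelConvLSTMCell(in_ch=3, hid_ch=16)
x = torch.randn(2, 3, 32, 32)
h = c = torch.zeros(2, 16, 32, 32)
h, c = cell(x, (h, c))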

23 citations


Cites background from "ImageNet: A large-scale hierarchica..."

  • ...A notable insight of their work is the fact that convolutional networks based on optical flow features can be fine tuned from DCNs taught on RGB inputs, such as that used for image classification on the ImageNet dataset [11]....


Posted Content
TL;DR: LISA, named after Light-guided Instance Shadow-object Association, is an end-to-end framework designed to automatically predict the shadow and object instances, together with the shadow-object associations and light direction; its applicability is demonstrated on light direction estimation and photo editing.
Abstract: Instance shadow detection is a brand new problem, aiming to find shadow instances paired with object instances. To approach it, we first prepare a new dataset called SOBA, named after Shadow-OBject Association, with 3,623 pairs of shadow and object instances in 1,000 photos, each with individual labeled masks. Second, we design LISA, named after Light-guided Instance Shadow-object Association, an end-to-end framework to automatically predict the shadow and object instances, together with the shadow-object associations and light direction. Then, we pair up the predicted shadow and object instances, and match them with the predicted shadow-object associations to generate the final results. In our evaluations, we formulate a new metric named the shadow-object average precision to measure the performance of our results. Further, we conducted various experiments and demonstrate our method's applicability on light direction estimation and photo editing.

23 citations


Cites methods from "ImageNet: A large-scale hierarchica..."

  • ...Specifically, we adopt the weights of ResNeXt-101-FPN [27, 48] trained on ImageNet [7] to initialize the parameters of the backbone network, and train our framework on two GeForce GTX 1080 Ti GPUs (four images per GPU) for 40k training iterations....

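The excerpt above mentions initializing a backbone from ImageNet-pretrained weights before fine-tuning. A minimal sketch of that general practice, assuming a recent torchvision (not the paper's exact ResNeXt-101-FPN / Detectron-style setup):

# General-practice sketch (assumes torchvision >= 0.13): load a ResNeXt
# backbone with ImageNet-pretrained weights, drop the classification head,
# and attach a new head for the downstream task before fine-tuning.
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnext101_32x8d(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()   # remove the ImageNet classification head

# illustrative downstream head; 2048 is the backbone's feature dimension
head = nn.Linear(2048, 10)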

Journal ArticleDOI
TL;DR: This paper reviews the three main approaches most often used for cell image classification: numerical feature extraction, end-to-end classification with neural networks (NNs), and transport-based morphometry (TBM).
Abstract: Cell image classification methods are currently being used in numerous applications in cell biology and medicine. Applications include understanding the effects of genes and drugs in screening experiments, understanding the role and subcellular localization of different proteins, as well as diagnosis and prognosis of cancer from images acquired using cytological and histological techniques. The article also reviews the three main approaches most often used for cell image classification: numerical feature extraction, end-to-end classification with neural networks (NNs), and transport-based morphometry (TBM). In addition, we provide comparisons on four different cell imaging datasets to highlight the relative strength of each method. The results computed using four publicly available datasets show that numerical features tend to carry the best discriminative information for most of the classification tasks. Results also show that NN-based methods produce state-of-the-art results in the dataset that contains a relatively large number of training samples. Data augmentation or the choice of a more recently reported architecture does not necessarily improve the classification performance of NNs in datasets with a limited number of training samples. If understanding and visualization are desired aspects, TBM methods can offer the ability to invert classification functions, and thus can aid in the interpretation of results. These and other comparison outcomes are discussed with the aim of clarifying the advantages and disadvantages of each method. © 2020 International Society for Advancement of Cytometry.
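As a toy illustration of the first of the three approaches discussed above (numerical feature extraction followed by a classical classifier), with feature choices that are assumptions for the sake of the example rather than those of the reviewed studies:

# Toy sketch of the "numerical feature extraction" route: compute simple
# per-image statistics, then fit a classical classifier on them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cell_features(img):
    # img: 2-D grayscale array; features are simple intensity/area proxies
    mask = img > img.mean()
    return np.array([img.mean(), img.std(), mask.mean(), img.max() - img.min()])

def train(images, labels):
    # images: list of 2-D arrays; labels: class labels (e.g. cell phenotype)
    X = np.stack([cell_features(im) for im in images])
    return LogisticRegression(max_iter=1000).fit(X, labels)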

23 citations

Journal ArticleDOI
TL;DR: It is argued that multi-scale map representation, object-level simultaneous localization and mapping, and deep neural network-based simultaneous localization and mapping pipeline design could be effective solutions to image semantics-fused visual simultaneous localization and mapping.
Abstract: As one of the typical application-oriented solutions to robot autonomous navigation, visual simultaneous localization and mapping is essentially restricted to simplex environmental understanding ba...

23 citations

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This work focuses on improving modern image classification techniques by considering topological features as well, and shows that incorporating this information allows the models to improve the accuracy, precision and recall on test data, thus providing evidence that topological signatures can be leveraged for enhancing some of the state-of-the art applications in computer vision.
Abstract: Image classification has been a topic of interest for many years. With the advent of Deep Learning, impressive progress has been made on the task, resulting in quite accurate classification. Our work focuses on improving modern image classification techniques by considering topological features as well. We show that incorporating this information allows our models to improve the accuracy, precision and recall on test data, thus providing evidence that topological signatures can be leveraged for enhancing some of the state-of-the art applications in computer vision.
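One simple topological signature that can be appended to image features, offered here only as a hedged illustration and not as the authors' construction, is the number of connected components of thresholded super-level sets traced over a range of intensity thresholds:

# Illustrative sketch: a Betti-0-style curve counting connected components
# of the image's super-level sets across thresholds; the resulting vector
# can be concatenated with other features before a classifier.
import numpy as np
from scipy import ndimage

def betti0_curve(img, thresholds):
    # img: 2-D grayscale array; returns one component count per threshold
    counts = []
    for t in thresholds:
        _, n_components = ndimage.label(img >= t)
        counts.append(n_components)
    return np.array(counts)

# e.g. topo = betti0_curve(img, np.linspace(img.min(), img.max(), 16))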

23 citations

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
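A sketch of this matching pipeline using OpenCV (assumes an OpenCV build with SIFT available; RANSAC homography estimation stands in for the paper's Hough-transform clustering and least-squares pose verification):

# Illustrative OpenCV sketch: SIFT features, nearest-neighbor matching with
# Lowe's ratio test, then a robust geometric fit to verify the object.
import cv2
import numpy as np

def match_object(query_img, scene_img, ratio=0.75):
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(query_img, None)
    ks, ds = sift.detectAndCompute(scene_img, None)

    # nearest-neighbor matching with the ratio test
    matches = cv2.BFMatcher().knnMatch(dq, ds, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None

    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([ks[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # homography locating the query object in the scene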

46,906 citations


"ImageNet: A large-scale hierarchica..." refers methods in this paper

  • ...SIFT [15] descriptors are used in this experiment....


Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database is presented, covering nouns, modifiers, and verbs in a semantic network, together with applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet, George A. Miller; modifiers in WordNet, Katherine J. Miller; a semantic network of English verbs, Christiane Fellbaum; design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst; representing verb alternations in WordNet, Karen T. Kohl et al; the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3, Applications of WordNet: building semantic concordances, Shari Landes et al; performance and confidence in a semantic annotation task, Christiane Fellbaum et al; WordNet and class-based probabilities, Philip Resnik; combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow; using WordNet for text retrieval, Ellen M. Voorhees; lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge; temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman; COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet; knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan; appendix - obtaining and using WordNet.

13,049 citations


"ImageNet: A large-scale hierarchica..." refers background or methods in this paper

  • ...ImageNet uses the hierarchical structure of WordNet [9]....


  • ...The main asset of WordNet [9] lies in its semantic structure, i....


01 Oct 2008
TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, and exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background.
Abstract: Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life. The database exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background. In addition to describing the details of the database, we provide specific experimental paradigms for which the database is suitable. This is done in an effort to make research performed with the database as consistent and comparable as possible. We provide baseline results, including results of a state of the art face recognition system combined with a face alignment system. To facilitate experimentation on the database, we provide several parallel databases, including an aligned version.

5,742 citations


"ImageNet: A large-scale hierarchica..." refers methods in this paper

  • ...Special purpose datasets, such as FERET faces [19], Labeled faces in the Wild [13] and the Mammal Benchmark by Fink and Ullman [11] are not included....


01 Jan 1978
TL;DR: On those remote pages it is written that animals are divided into categories such as those that belong to the Emperor, embalmed ones, those that are trained, suckling pigs, and stray dogs.
Abstract: On those remote pages it is written that animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that from a distance resemble flies.

4,302 citations


"ImageNet: A large-scale hierarchica..." refers background in this paper

  • ...Rosch and Lloyd [20] have demonstrated that humans tend to label visual objects at an easily accessible semantic level termed as “basic level” (e.g.


Proceedings ArticleDOI
17 Jun 2006
TL;DR: A recognition scheme that scales efficiently to a large number of objects is presented; its vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which is shown experimentally to lead to a dramatic improvement in retrieval quality.
Abstract: A recognition scheme that scales efficiently to a large number of objects is presented. The efficiency and quality is exhibited in a live demonstration that recognizes CD-covers from a database of 40000 images of popular music CD’s. The scheme builds upon popular techniques of indexing descriptors extracted from local regions, and is robust to background clutter and occlusion. The local region descriptors are hierarchically quantized in a vocabulary tree. The vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which we show experimentally leads to a dramatic improvement in retrieval quality. The most significant property of the scheme is that the tree directly defines the quantization. The quantization and the indexing are therefore fully integrated, essentially being one and the same. The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images.
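A minimal sketch of the vocabulary-tree idea, hierarchical k-means over local descriptors, with an arbitrary branching factor and depth (illustrative only, using scikit-learn rather than the authors' implementation):

# Illustrative sketch: build a vocabulary tree by recursively clustering
# local descriptors, then quantize a descriptor by descending the tree.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branching=10, depth=3):
    # descriptors: (N, D) array of local-region descriptors (e.g. SIFT)
    if depth == 0 or len(descriptors) < branching:
        return None
    km = KMeans(n_clusters=branching, n_init=5).fit(descriptors)
    children = [
        build_tree(descriptors[km.labels_ == i], branching, depth - 1)
        for i in range(branching)
    ]
    return {"kmeans": km, "children": children}

def quantize(tree, d):
    # the path of child indices visited defines the visual word for d
    path = []
    while tree is not None:
        i = int(tree["kmeans"].predict(d.reshape(1, -1))[0])
        path.append(i)
        tree = tree["children"][i]
    return tuple(path)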

4,024 citations


Additional excerpts

  • ...[16, 17, 28, 18])....
