Showing papers by "Ross Girshick" published in 2016

Proceedings Article•DOI•

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon¹, Santosh K. Divvala², Ross Girshick³, Ali Farhadi²•Institutions (3)

University of Washington¹, Allen Institute for Artificial Intelligence², Facebook³

27 Jun 2016

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

27,256 citations

Posted Content•

Feature Pyramid Networks for Object Detection

Tsung-Yi Lin¹, Piotr Dollár², Ross Girshick², Kaiming He², Bharath Hariharan², Serge Belongie¹•Institutions (2)

Cornell University¹, Facebook²

09 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Feature Pyramid Networks (FPN) exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.

Abstract: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
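
The top-down pathway with lateral connections described in the abstract can be illustrated with a toy sketch. It assumes 1-D feature maps, nearest-neighbor 2x upsampling, and element-wise addition as the merge step; those choices, and the helper names `upsample2x` and `top_down_merge`, are illustrative assumptions and not taken from the abstract. Any convolutions a real network would apply are omitted.

```python
def upsample2x(feature):
    """Nearest-neighbor upsampling: repeat every value twice."""
    return [v for v in feature for _ in range(2)]


def top_down_merge(bottom_up):
    """Merge bottom-up maps into a pyramid.

    bottom_up: list of 1-D maps ordered finest to coarsest, each half
    the length of the previous one (an assumption of this sketch).
    Returns the merged maps in the same order.
    """
    merged = [bottom_up[-1]]  # the coarsest level starts the top-down pass
    for lateral in reversed(bottom_up[:-1]):
        upsampled = upsample2x(merged[0])
        # Lateral connection: add the same-resolution bottom-up map.
        merged.insert(0, [a + b for a, b in zip(lateral, upsampled)])
    return merged


levels = [[1] * 8, [2] * 4, [4] * 2]
pyramid = top_down_merge(levels)
print(pyramid)
```

In this sketch every output level receives information from all coarser levels, which is one reading of "high-level semantic feature maps at all scales".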

5,438 citations

Posted Content•

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He²•Institutions (2)

University of California, San Diego¹, Facebook²

16 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.

Abstract: We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
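
The idea of a building block that aggregates a set of same-topology transformations can be sketched as below. This is a toy illustration only: each branch is a hypothetical scalar scaling (`make_branch`), aggregation is assumed to be a sum, and the number of branches plays the role of "cardinality". None of these specifics come from the abstract.

```python
def make_branch(scale):
    """Build one transformation; every branch shares this topology."""
    def branch(x):
        return [scale * v for v in x]
    return branch


def aggregated_block(x, branches):
    """Sum the outputs of all branches applied to the same input."""
    out = [0.0] * len(x)
    for branch in branches:
        out = [o + b for o, b in zip(out, branch(x))]
    return out


# Cardinality = size of the set of transformations (4 in this sketch).
branches = [make_branch(s) for s in (0.5, 1.0, 1.5, 2.0)]
y = aggregated_block([1.0, 2.0], branches)
print(len(branches), y)
```

Changing the length of `branches` varies cardinality without touching the per-branch design, which is the sense in which it acts as a separate dimension from depth and width.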

2,760 citations

Journal Article•DOI•

Region-Based Convolutional Networks for Accurate Object Detection and Segmentation

Ross Girshick¹, Jeff Donahue², Trevor Darrell², Jitendra Malik²•Institutions (2)

Microsoft¹, University of California, Berkeley²

01 Jan 2016-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012, achieving a mAP of 62.4 percent.

Abstract: Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012, achieving a mAP of 62.4 percent. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

2,058 citations

Proceedings Article•

Unsupervised deep embedding for clustering analysis

Junyuan Xie¹, Ross Girshick², Ali Farhadi¹•Institutions (2)

University of Washington¹, Facebook²

19 Jun 2016

TL;DR: Deep Embedded Clustering (DEC) learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective.

Abstract: Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

1,776 citations

Proceedings Article•DOI•

Training Region-Based Object Detectors with Online Hard Example Mining

Abhinav Shrivastava¹, Abhinav Gupta¹, Ross Girshick²•Institutions (2)

Carnegie Mellon University¹, Facebook²

01 Jun 2016

TL;DR: In this article, the authors proposed an online hard example mining (OHEM) algorithm for training region-based ConvNet detectors and achieved state-of-the-art results.

Abstract: The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors. Our motivation is the same as it has always been – detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use. But more importantly, it yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness increases as datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012 respectively.
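
The automatic selection of hard examples mentioned in the abstract can be sketched as follows. This is a minimal illustration that assumes "hard" means "highest current loss" and that a fixed number of examples is kept per step; the function name `select_hard_examples` and the sample loss values are hypothetical.

```python
def select_hard_examples(losses, batch_size):
    """Return indices of the `batch_size` highest-loss examples.

    losses: per-example loss values from a forward pass.
    Ties keep their original order because Python's sort is stable.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:batch_size]


# Mostly easy examples (low loss) with a few hard ones mixed in.
losses = [0.05, 2.30, 0.10, 1.70, 0.02, 0.90]
hard = select_hard_examples(losses, batch_size=3)
print(hard)
```

Only the selected indices would then contribute to the parameter update, so the many easy examples do not dominate training.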

1,756 citations

Posted Content•

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson¹, Bharath Hariharan², Laurens van der Maaten², Li Fei-Fei¹, C. Lawrence Zitnick², Ross Girshick²•Institutions (2)

Stanford University¹, Facebook²

20 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents a diagnostic dataset that tests a range of visual reasoning abilities and uses this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

Abstract: When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

1,236 citations

Proceedings Article•DOI•

Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

Sean Bell, C. Lawrence Zitnick¹, Kavita Bala, Ross Girshick¹•Institutions (1)

Microsoft¹

01 Jun 2016

TL;DR: The Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest, provides strong evidence that context and multi-scale representations improve small object detection.

Abstract: It is well known that contextual and multi-scale representations are important for accurate visual recognition. In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest. Contextual information outside the region of interest is integrated using spatial recurrent neural networks. Inside, we use skip pooling to extract information at multiple scales and levels of abstraction. Through extensive experiments we evaluate the design space and provide readers with an overview of what tricks of the trade are important. ION improves state-of-the-art on PASCAL VOC 2012 object detection from 73.9% to 77.9% mAP. On the new and more challenging MS COCO dataset, we improve state-of-the-art from 19.7% to 33.1% mAP. In the 2015 MS COCO Detection Challenge, our ION model won "Best Student Entry" and finished 3rd place overall. As intuition suggests, our detection results provide strong evidence that context and multi-scale representations improve small object detection.

1,209 citations

Book Chapter•DOI•

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

Junyuan Xie¹, Ross Girshick¹, Ali Farhadi¹•Institutions (1)

University of Washington¹

08 Oct 2016

TL;DR: Deep3D uses deep neural networks to automatically convert 2D videos and images to a stereoscopic 3D format; it is trained end-to-end directly on stereo pairs extracted from existing 3D movies.

Abstract: As 3D movie viewing becomes mainstream and the Virtual Reality (VR) market emerges, the demand for 3D contents is growing rapidly. Producing 3D videos, however, remains challenging. In this paper we propose to use deep neural networks to automatically convert 2D videos and images to a stereoscopic 3D format. In contrast to previous automatic 2D-to-3D conversion algorithms, which have separate stages and need ground truth depth map as supervision, our approach is trained end-to-end directly on stereo pairs extracted from existing 3D movies. This novel training scheme makes it possible to exploit orders of magnitude more data and significantly increases performance. Indeed, Deep3D outperforms baselines in both quantitative and human subject evaluations.

435 citations

Posted Content•

Low-shot Visual Recognition by Shrinking and Hallucinating Features

Bharath Hariharan¹, Ross Girshick¹•Institutions (1)

Facebook¹

09 Jun 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Representation regularization and techniques for hallucinating additional training examples improve low-shot visual learning, raising one-shot accuracy on novel classes by 2.3x on the ImageNet dataset.

Abstract: Low-shot visual learning, the ability to recognize novel object categories from very few examples, is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. To make progress on this foundational problem, we present a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild. We then propose a) representation regularization techniques, and b) techniques to hallucinate additional training examples for data-starved classes. Together, our methods improve the effectiveness of convolutional networks in low-shot learning, improving the one-shot accuracy on novel classes by 2.3x on the challenging ImageNet dataset.

332 citations

Posted Content•

Training Region-based Object Detectors with Online Hard Example Mining

Abhinav Shrivastava¹, Abhinav Gupta¹, Ross Girshick²•Institutions (2)

Carnegie Mellon University¹, Facebook²

12 Apr 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors proposed an online hard example mining (OHEM) algorithm for training region-based ConvNet detectors and achieved state-of-the-art results.

Abstract: The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors. Our motivation is the same as it has always been -- detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use. But more importantly, it yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness increases as datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012 respectively.

Posted Content•

Visual Storytelling

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell

13 Apr 2016-arXiv: Computation and Language

TL;DR: Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

Proceedings Article•

Reducing Overfitting in Deep Networks by Decorrelating Representations

Michael Cogswell¹, Faruk Ahmed², Ross Girshick³, Larry Zitnick⁴, Dhruv Batra¹•Institutions (4)

Virginia Tech¹, Université de Montréal², Facebook³, Microsoft⁴

01 Jan 2016

TL;DR: DeCov is a regularizer that encourages diverse or non-redundant representations in deep neural networks by minimizing the cross-covariance of hidden activations, which leads to significantly reduced overfitting and better generalization.

Abstract: One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.
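
One way to penalize the cross-covariance of hidden activations is sketched below. It assumes the penalty is half the sum of squared off-diagonal entries of the batch covariance matrix; that exact form and scaling, and the function name `decov_penalty`, are assumptions of this illustration and are not stated in the abstract.

```python
def decov_penalty(activations):
    """Half the sum of squared off-diagonal batch covariances.

    activations: list of rows, one per example, each a list of hidden
    unit values. Diagonal (variance) terms are deliberately skipped so
    that only redundancy between different units is penalized.
    """
    n = len(activations)
    d = len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            cov = sum(
                (row[i] - means[i]) * (row[j] - means[j]) for row in activations
            ) / n
            penalty += cov * cov
    return 0.5 * penalty


redundant = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # two identical units
decorrelated = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(decov_penalty(redundant), decov_penalty(decorrelated))
```

The redundant batch receives a positive penalty while the decorrelated batch receives none, so adding this term to a training loss would push hidden units away from duplicating one another.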

Proceedings Article•DOI•

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

Ishan Misra¹, C. Lawrence Zitnick, Margaret Mitchell², Ross Girshick•Institutions (2)

Carnegie Mellon University¹, Microsoft²

01 Jun 2016

TL;DR: This paper learns visually correct image classifiers from noisy human-centric annotations, demonstrating that the noise in these annotations exhibits structure and can be modeled; the results are highly interpretable for reporting "what's in the image" versus "what's worth saying".

Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

Proceedings Article•DOI•

Visual Storytelling

Ting-Hao Kenneth Huang¹, Francis Ferraro², Nasrin Mostafazadeh³, Ishan Misra¹, Aishwarya Agrawal⁴, Jacob Devlin¹, Ross Girshick⁵, Xiaodong He⁶, Pushmeet Kohli⁶, Dhruv Batra⁴, C. Lawrence Zitnick⁶, Devi Parikh⁴, Lucy Vanderwende⁶, Michel Galley⁶, Margaret Mitchell⁶•Institutions (6)

Carnegie Mellon University¹, Johns Hopkins University², University of Rochester³, Virginia Tech⁴, Facebook⁵, Microsoft⁶

13 Jun 2016

Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

Patent•

Object detection and classification in images

Jian Sun¹, Ross Girshick¹, Shaoqing Ren¹, Kaiming He¹•Institutions (1)

Microsoft¹

20 Jan 2016

TL;DR: In this article, a computing device can receive an input image and generate a convolutional feature map, which can then be processed through a Region Proposal Network (RPN) to generate proposals for candidate objects in the image.

Abstract: Systems, methods, and computer-readable media for providing fast and accurate object detection and classification in images are described herein. In some examples, a computing device can receive an input image. The computing device can process the image, and generate a convolutional feature map. In some configurations, the convolutional feature map can be processed through a Region Proposal Network (RPN) to generate proposals for candidate objects in the image. In various examples, the computing device can process the convolutional feature map with the proposals through a Fast Region-Based Convolutional Neural Network (FRCN) proposal classifier to determine a class of each object in the image and a confidence score associated therewith. The computing device can then provide a requestor with an output including the object classification and/or confidence score.

Posted Content•

Learning Features by Watching Objects Move

Deepak Pathak¹, Ross Girshick², Piotr Dollár², Trevor Darrell¹, Bharath Hariharan²•Institutions (2)

University of California, Berkeley¹, Facebook²

19 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Inspired by the human visual system, low-level motion-based grouping cues can be used to learn an effective visual representation that significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Abstract: This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Journal Article•DOI•

The three R's of computer vision

Jitendra Malik¹, Pablo Arbeláez², Joao Carreira¹, Katerina Fragkiadaki¹, Ross Girshick³, Georgia Gkioxari¹, Saurabh Gupta¹, Bharath Hariharan³, Abhishek Kar¹, Shubham Tulsiani¹•Institutions (3)

University of California, Berkeley¹, University of Los Andes², Facebook³

01 Mar 2016-Pattern Recognition Letters

TL;DR: This work argues for the importance of the interaction between recognition, reconstruction and re-organization, and proposes that as a unifying framework for computer vision, with pipelined versions of two systems, one for RGB-D images, and another for RGB images, which produce rich 3D scene interpretations in this framework.

Posted Content•

Low-shot visual object recognition.

Bharath Hariharan, Ross Girshick

09 Jun 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel protocol to evaluate low-shot learning on complex images where the learner is permitted to first build a feature representation is presented, leading to a 2x reduction in the amount of training data required at equal accuracy rates on the challenging ImageNet dataset.

Abstract: Low-shot visual learning - the ability to recognize novel object categories from very few examples - is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. To make progress on this foundational problem, we present a novel protocol to evaluate low-shot learning on complex images where the learner is permitted to first build a feature representation. Then, we propose and evaluate representation regularization techniques that improve the effectiveness of convolutional networks at the task of low-shot learning, leading to a 2x reduction in the amount of training data required at equal accuracy rates on the challenging ImageNet dataset.

Posted Content•

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

Junyuan Xie¹, Ross Girshick¹, Ali Farhadi¹•Institutions (1)

University of Washington¹

13 Apr 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes to use deep neural networks to automatically convert 2D videos and images to a stereoscopic 3D format and shows that Deep3D outperforms baselines in both quantitative and human subject evaluations.

Abstract: As 3D movie viewing becomes mainstream and the Virtual Reality (VR) market emerges, the demand for 3D contents is growing rapidly. Producing 3D videos, however, remains challenging. In this paper we propose to use deep neural networks for automatically converting 2D videos and images to a stereoscopic 3D format. In contrast to previous automatic 2D-to-3D conversion algorithms, which have separate stages and need ground truth depth map as supervision, our approach is trained end-to-end directly on stereo pairs extracted from 3D movies. This novel training scheme makes it possible to exploit orders of magnitude more data and significantly increases performance. Indeed, Deep3D outperforms baselines in both quantitative and human subject evaluations.
