Top 22 papers published by Ross Girshick from Facebook in 2015

Posted Content•

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

[...]

Shaoqing Ren¹, Kaiming He², Ross Girshick³, Jian Sun²•Institutions (3)

University of Science and Technology of China¹, Microsoft², Facebook³

04 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-NN for detection.

...read moreread less

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

...read moreread less

23,183 citations

Proceedings Article•DOI•

Fast R-CNN

[...]

Ross Girshick¹•Institutions (1)

Microsoft¹

07 Dec 2015

TL;DR: Fast R-CNN as discussed by the authors proposes a Fast Region-based Convolutional Network method for object detection, which employs several innovations to improve training and testing speed while also increasing detection accuracy and achieves a higher mAP on PASCAL VOC 2012.

...read moreread less

Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

...read moreread less

14,824 citations

Posted Content•

Fast R-CNN

[...]

Ross Girshick¹•Institutions (1)

Microsoft¹

30 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.

...read moreread less

Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at this https URL.

...read moreread less

14,747 citations

Proceedings Article•

Faster R-CNN: towards real-time object detection with region proposal networks

[...]

Shaoqing Ren¹, Kaiming He¹, Ross Girshick¹, Jian Sun¹•Institutions (1)

Microsoft¹

07 Dec 2015

TL;DR: Ren et al. as discussed by the authors proposed a region proposal network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

...read moreread less

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at https://github.com/ShaoqingRen/faster_rcnn.

...read moreread less

13,674 citations

Proceedings Article•DOI•

Hypercolumns for object segmentation and fine-grained localization

[...]

Bharath Hariharan¹, Pablo Arbeláez², Ross Girshick³, Jitendra Malik¹•Institutions (3)

University of California, Berkeley¹, University of Los Andes², Microsoft³

07 Jun 2015

TL;DR: In this paper, the authors define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel, and use hypercolumns as pixel descriptors.

...read moreread less

Abstract: Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [22], where we improve state-of-the-art from 49.7 mean APr [22] to 60.0, keypoint localization, where we get a 3.3 point boost over [20], and part labeling, where we show a 6.6 point gain over a strong baseline.

...read moreread less

1,511 citations

Posted Content•

You Only Look Once: Unified, Real-Time Object Detection

[...]

Joseph Redmon¹, Santosh K. Divvala², Ross Girshick³, Ali Farhadi²•Institutions (3)

University of Washington¹, Allen Institute for Artificial Intelligence², Facebook³

08 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: YOLO as discussed by the authors predicts bounding boxes and class probabilities directly from full images in one evaluation, which can be optimized end-to-end directly on detection performance, and achieves state-of-the-art performance.

...read moreread less

Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

...read moreread less

390 citations

Proceedings Article•DOI•

Deformable part models are convolutional neural networks

[...]

Ross Girshick¹, Forrest Iandola², Trevor Darrell², Jitendra Malik²•Institutions (2)

Microsoft¹, University of California, Berkeley²

07 Jun 2015

TL;DR: This paper shows that a DPM can be formulated as a CNN, thus providing a synthesis of the two ideas and calls the resulting model a DeepPyramid DPM, which is found to significantly outperform DPMs based on histograms of oriented gradients features (HOG) and slightly outperforms a comparable version of the recently introduced R-CNN detection system, while running significantly faster.

...read moreread less

Abstract: Deformable part models (DPMs) and convolutional neural networks (CNNs) are two widely used tools for visual recognition. They are typically viewed as distinct approaches: DPMs are graphical models (Markov random fields), while CNNs are “black-box” non-linear classifiers. In this paper, we show that a DPM can be formulated as a CNN, thus providing a synthesis of the two ideas. Our construction involves unrolling the DPM inference algorithm and mapping each step to an equivalent CNN layer. From this perspective, it is natural to replace the standard image features used in DPMs with a learned feature extractor. We call the resulting model a DeepPyramid DPM and experimentally validate it on PASCAL VOC object detection. We find that DeepPyramid DPMs significantly outperform DPMs based on histograms of oriented gradients features (HOG) and slightly outperforms a comparable version of the recently introduced R-CNN detection system, while running significantly faster.

...read moreread less

389 citations

Posted Content•

Contextual Action Recognition with R*CNN

[...]

Georgia Gkioxari¹, Ross Girshick², Jitendra Malik¹•Institutions (2)

University of California, Berkeley¹, Microsoft²

05 May 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: R*CNN as mentioned in this paper exploits the simple observation that actions are accompanied by contextual cues to build a strong action recognition system and adapts RCNN to use more than one region for classification while still maintaining the ability to localize the action.

...read moreread less

Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic for jogging, but the scene (e.g. road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system. We adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action. We call our system R*CNN. The action-specific models and the feature maps are trained jointly, allowing for action specific representations to emerge. R*CNN achieves 90.2% mean AP on the PASAL VOC Action dataset, outperforming all other approaches in the field by a significant margin. Last, we show that R*CNN is not limited to action recognition. In particular, R*CNN can also be used to tackle fine-grained tasks such as attribute classification. We validate this claim by reporting state-of-the-art performance on the Berkeley Attributes of People dataset.

...read moreread less

341 citations

Proceedings Article•DOI•

Contextual Action Recognition with R*CNN

[...]

Georgia Gkioxari¹, Ross Girshick², Jitendra Malik¹•Institutions (2)

University of California, Berkeley¹, Microsoft²

07 Dec 2015

TL;DR: This work exploits the simple observation that actions are accompanied by contextual cues to build a strong action recognition system and adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action.

...read moreread less

Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic for jogging, but the scene (e.g. road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system. We adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action. We call our system R*CNN. The action-specific models and the feature maps are trained jointly, allowing for action specific representations to emerge. R*CNN achieves 90.2% mean AP on the PASAL VOC Action dataset, outperforming all other approaches in the field by a significant margin. Last, we show that R*CNN is not limited to action recognition. In particular, R*CNN can also be used to tackle fine-grained tasks such as attribute classification. We validate this claim by reporting state-of-the-art performance on the Berkeley Attributes of People dataset.

...read moreread less

306 citations

Proceedings Article•DOI•

Aligning 3D models to RGB-D images of cluttered scenes

[...]

Saurabh Gupta¹, Pablo Arbeláez², Ross Girshick³, Jitendra Malik¹•Institutions (3)

University of California, Berkeley¹, University of Los Andes², Microsoft³

07 Jun 2015

TL;DR: This work first detecting and segmenting object instances in the scene and then using a convolutional neural network to predict the pose of the object, which is trained using pixel surface normals in images containing renderings of synthetic objects.

...read moreread less

Abstract: The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel surface normals in images containing renderings of synthetic objects. When tested on real data, our method outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place into the scene the model that fits best. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art [34], while being an order of magnitude faster.

...read moreread less

271 citations

Journal Article•DOI•

Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation

[...]

Saurabh Gupta¹, Pablo Arbeláez¹, Ross Girshick¹, Jitendra Malik¹•Institutions (1)

University of California, Berkeley¹

01 Apr 2015-International Journal of Computer Vision

TL;DR: This paper addresses the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data, and proposes an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset.

...read moreread less

Abstract: In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset (Silberman et al., ECCV, 2012). We propose algorithms for object boundary detection and hierarchical segmentation that generalize the $$gPb-ucm$$gPb-ucm approach of Arbelaez et al. (TPAMI, 2011) by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by analyzing and computing histogram of oriented gradients on the depth image and using them with deformable part models (Felzenszwalb et al., TPAMI, 2010). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improves semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state-of-the-art.

...read moreread less

Posted Content•

Exploring Nearest Neighbor Approaches for Image Captioning

[...]

Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick - Show less +1 more

17 May 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A variety of nearest neighbor baseline approaches for image captioning find a set of nearest neighbour images in the training set from which a caption may be borrowed for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images.

...read moreread less

Abstract: We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

...read moreread less

Proceedings Article•DOI•

Actions and Attributes from Wholes and Parts

[...]

Georgia Gkioxari¹, Ross Girshick², Jitendra Malik¹•Institutions (2)

University of California, Berkeley¹, Microsoft²

07 Dec 2015

TL;DR: A part-based approach by leveraging convolutional network features inspired by recent advances in computer vision, which shows that adding parts leads to top-performing results for both tasks and observes that for deeper networks parts are less significant.

...read moreread less

Abstract: We investigate the importance of parts for the tasks of action and attribute classification. We develop a part-based approach by leveraging convolutional network features inspired by recent advances in computer vision. Our part detectors are a deep version of poselets and capture parts of the human body under a distinct set of poses. For the tasks of action and attribute classification, we train holistic convolutional neural networks and show that adding parts leads to top-performing results for both tasks. We observe that for deeper networks parts are less significant. In addition, we demonstrate the effectiveness of our approach when we replace an oracle person detector, as is the default in the current evaluation protocol for both tasks, with a state-of-the-art person detection system.

...read moreread less

Posted Content•

Object Detection Networks on Convolutional Feature Maps

[...]

Shaoqing Ren¹, Kaiming He², Ross Girshick³, Xiangyu Zhang⁴, Jian Sun² - Show less +1 more•Institutions (4)

University of Science and Technology of China¹, Microsoft², Facebook³, Xi'an Jiaotong University⁴

23 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown by experiments that despite the effective ResNets and Faster R-CNN systems, the design of NoCs is an essential element for the 1st-place winning entries in ImageNet and MS COCO challenges 2015.

...read moreread less

Abstract: Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep convolutional architectures. The object classifier, however, has not received much attention and many recent systems (like SPPnet and Fast/Faster R-CNN) use simple multi-layer perceptrons. This paper demonstrates that carefully designing deep networks for object classification is just as important. We experiment with region-wise classifier networks that use shared, region-independent convolutional features. We call them "Networks on Convolutional feature maps" (NoCs). We discover that aside from deep feature maps, a deep and convolutional per-region classifier is of particular importance for object detection, whereas latest superior image classification models (such as ResNets and GoogLeNets) do not directly lead to good detection accuracy without using such a per-region classifier. We show by experiments that despite the effective ResNets and Faster R-CNN systems, the design of NoCs is an essential element for the 1st-place winning entries in ImageNet and MS COCO challenges 2015.

...read moreread less

Posted Content•

Unsupervised Deep Embedding for Clustering Analysis

[...]

Junyuan Xie¹, Ross Girshick², Ali Farhadi¹•Institutions (2)

University of Washington¹, Facebook²

19 Nov 2015-arXiv: Learning

TL;DR: Deep Embedded Clustering (DEC) as mentioned in this paper learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective.

...read moreread less

Abstract: Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

...read moreread less

Posted Content•

Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

[...]

Sean Bell, C. Lawrence Zitnick¹, Kavita Bala, Ross Girshick¹•Institutions (1)

Microsoft¹

14 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Inside-Outside Network (ION) as mentioned in this paper uses skip pooling to extract information at multiple scales and levels of abstraction inside and outside the region of interest for small object detection.

...read moreread less

Abstract: It is well known that contextual and multi-scale representations are important for accurate visual recognition. In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest. Contextual information outside the region of interest is integrated using spatial recurrent neural networks. Inside, we use skip pooling to extract information at multiple scales and levels of abstraction. Through extensive experiments we evaluate the design space and provide readers with an overview of what tricks of the trade are important. ION improves state-of-the-art on PASCAL VOC 2012 object detection from 73.9% to 76.4% mAP. On the new and more challenging MS COCO dataset, we improve state-of-art-the from 19.7% to 33.1% mAP. In the 2015 MS COCO Detection Challenge, our ION model won the Best Student Entry and finished 3rd place overall. As intuition suggests, our detection results provide strong evidence that context and multi-scale representations improve small object detection.

...read moreread less

Posted Content•

Inferring 3D Object Pose in RGB-D Images

[...]

Saurabh Gupta, Pablo Arbeláez, Ross Girshick, Jitendra Malik

16 Feb 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: The goal of this work is to replace objects in an RGB-D scene with corresponding 3D models from a library by first detecting and segmenting object instances in the scene using the approach from Gupta et al.

...read moreread less

Abstract: The goal of this work is to replace objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene using the approach from Gupta et al. [13]. We use a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel normals in images containing rendered synthetic objects. When tested on real data, it outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place the model that fits the best into the scene. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art [33], while being an order of magnitude faster at the same time.

...read moreread less

Posted Content•

Reducing Overfitting in Deep Networks by Decorrelating Representations

[...]

Michael Cogswell¹, Faruk Ahmed², Ross Girshick³, Larry Zitnick⁴, Dhruv Batra¹ - Show less +1 more•Institutions (4)

Virginia Tech¹, Université de Montréal², Facebook³, Microsoft⁴

19 Nov 2015-arXiv: Learning

TL;DR: A new regularizer called DeCov is proposed which leads to significantly reduced overfitting, improved generalization performance, and better generalization in Deep Neural Networks.

...read moreread less

Abstract: One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.

...read moreread less

Journal Article•DOI•

Generalized Sparselet Models for Real-Time Multiclass Object Recognition

[...]

Hyun Oh Song¹, Ross Girshick¹, Stefan Zickler², Christopher Geyer, Pedro F. Felzenszwalb³, Trevor Darrell¹ - Show less +2 more•Institutions (3)

University of California, Berkeley¹, iRobot², Brown University³

01 May 2015-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A framework that simultaneously utilizes shared representation, reconstruction sparsity, and parallelism to enable real-time multiclass object detection with deformable part models at 5Hz on a laptop computer with almost no decrease in task performance is described.

...read moreread less

Abstract: The problem of real-time multiclass object recognition is of great practical importance in object recognition. In this paper, we describe a framework that simultaneously utilizes shared representation, reconstruction sparsity, and parallelism to enable real-time multiclass object detection with deformable part models at 5Hz on a laptop computer with almost no decrease in task performance. Our framework is trained in the standard structured output prediction formulation and is generically applicable for speeding up object recognition systems where the computational bottleneck is in multiclass, multi-convolutional inference. We experimentally demonstrate the efficiency and task performance of our method on PASCAL VOC, subset of ImageNet, Caltech101 and Caltech256 dataset.

...read moreread less

Posted Content•

Learning Visual Classifiers using Human-centric Annotations.

[...]

Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick

22 Dec 2015

TL;DR: This paper proposes an algorithm to decouple the human reporting bias from the correct visually grounded labels for learning image classifiers, and provides results that are highly interpretable for reporting “what’s in the image” versus “ what”s worth saying.

...read moreread less

Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

...read moreread less

Journal Article•DOI•

I1.4: Invited Paper: Indoor Scene Understanding from RGB-D Images

[...]

Saurabh Gupta¹, Ross Girshick², Pablo Arbeláez³, Jitendra Malik¹•Institutions (3)

University of California, Berkeley¹, Microsoft², University of Los Andes³

01 Jun 2015

TL;DR: This work aims to be able to align objects in an RGB-D image with 3D models from a library by detecting and segmenting objects and estimating coarse pose using a convolutional neural network, followed by inserting the rendered model in the scene.

...read moreread less

Abstract: Our goal is to be able to align objects in an RGB-D image with 3D models from a library. Our pipeline for this task involves detecting and segmenting objects and estimating coarse pose using a convolutional neural network, followed by inserting the rendered model in the scene.

...read moreread less

Posted Content•

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

[...]

Ishan Misra¹, C. Lawrence Zitnick, Margaret Mitchell², Ross Girshick•Institutions (2)

Carnegie Mellon University¹, Microsoft²

22 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors use human reporting bias to learn visually correct image classifiers and demonstrate that the noise in these annotations exhibits structure and can be modeled, which is highly interpretable for reporting "what's in the image" versus "What's worth saying".

...read moreread less

Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

...read moreread less

Showing papers by "Ross Girshick published in 2015"