
Showing papers by Ross Girshick published in 2014


Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN combines high-capacity CNNs with bottom-up region proposals to localize and segment objects, and shows that when labeled training data is scarce, supervised pre-training for an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
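As a reading aid, the minimal Python sketch below shows the structure of the pipeline the abstract describes: bottom-up region proposals, a fixed-length CNN feature per warped proposal, and per-class linear SVM scoring. It is illustrative only; propose_regions and extract_cnn_features are hypothetical placeholders standing in for selective search and a fine-tuned CNN, not the released R-CNN code.

```python
import numpy as np

def propose_regions(image):
    # Placeholder for a bottom-up proposal method (e.g., selective search):
    # returns a few (x0, y0, x1, y1) boxes.
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w, h)]

def extract_cnn_features(crop):
    # Placeholder for warping the crop to the CNN input size and reading a
    # fixed-length feature vector (e.g., fc7) from a pre-trained, fine-tuned net.
    return np.random.rand(4096)

def rcnn_detect(image, svm_weights, svm_bias):
    detections = []
    for box in propose_regions(image):
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]                     # crop the proposal
        feat = extract_cnn_features(crop)              # CNN feature for the region
        score = float(svm_weights @ feat + svm_bias)   # one per-class linear SVM
        detections.append((box, score))
    # The full system would apply class-specific non-maximum suppression here.
    return detections

image = np.zeros((400, 600, 3))
print(rcnn_detect(image, np.random.rand(4096), -1.0))
```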

21,729 citations


Posted Content
TL;DR: Caffe is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
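As a rough illustration of the deployment workflow described above, the hedged pycaffe snippet below loads a model definition and trained weights and runs one forward pass. The file paths and the 'data'/'prob' blob names are assumptions matching a typical image-classification deploy prototxt, not files shipped with this paper.

```python
import numpy as np
import caffe

caffe.set_mode_gpu()   # or caffe.set_mode_cpu() on machines without CUDA

# Placeholder paths: a deploy-time network definition and its trained weights.
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

# Fill the input blob with one preprocessed image in (N, C, H, W) layout.
image = np.random.rand(227, 227, 3).astype(np.float32)   # stand-in for a real image
net.blobs['data'].reshape(1, 3, 227, 227)
net.blobs['data'].data[...] = image.transpose(2, 0, 1)[np.newaxis, :]

out = net.forward()                 # run all layers
print(out['prob'].argmax())         # index of the top-scoring class
```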

12,531 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

10,161 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This paper proposes a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel, in addition to the horizontal disparity.
Abstract: In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
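A minimal numpy sketch of the three geocentric channels mentioned above (horizontal disparity, height above ground, angle with gravity) is given below. The camera intrinsics, stereo baseline, up direction, and the depth-gradient normal estimate are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def geocentric_embedding(depth_m, fx, fy, cx, cy, baseline_m,
                         up_dir=np.array([0.0, -1.0, 0.0]), floor_y_m=0.0):
    """Return an H x W x 3 encoding: disparity, height above ground, angle with gravity."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))

    # 1) Horizontal disparity: focal length * baseline / depth.
    disparity = fx * baseline_m / np.maximum(depth_m, 1e-6)

    # 2) Height above ground: back-project pixels to 3D camera coordinates and
    #    take the component along the (assumed) "up" direction.
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1)
    height = points @ up_dir - floor_y_m

    # 3) Angle with gravity: angle between a crude depth-gradient surface normal
    #    and the "up" direction (a simplification of proper normal estimation).
    dzdx = np.gradient(depth_m, axis=1)
    dzdy = np.gradient(depth_m, axis=0)
    normals = np.stack([-dzdx, -dzdy, np.ones_like(depth_m)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    angle_deg = np.degrees(np.arccos(np.clip(normals @ up_dir, -1.0, 1.0)))

    return np.stack([disparity, height, angle_deg], axis=-1)

depth = 1.0 + np.random.rand(480, 640)   # synthetic depth map in meters
hha = geocentric_embedding(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0, baseline_m=0.075)
print(hha.shape)
```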

1,414 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work builds on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN), introducing a novel architecture tailored for SDS, and uses category-specific, top-down figure-ground predictions to refine the bottom-up proposals.
Abstract: We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [16]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7 point boost (16% relative) over our baselines on SDS, a 5 point boost (10% relative) over state-of-the-art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and provide directions for future work.

1,276 citations


Posted Content
TL;DR: This work defines the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel and, using hypercolumns as pixel descriptors, shows results on three fine-grained localization tasks: simultaneous detection and segmentation, keypoint localization, and part labeling.
Abstract: Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as feature representation. However, the information in this layer may be too coarse to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [22], where we improve state-of-the-art from 49.7 [22] mean AP^r to 60.0, keypoint localization, where we get a 3.3 point boost over [20], and part labeling, where we show a 6.6 point gain over a strong baseline.
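The numpy sketch below shows the mechanics of a hypercolumn descriptor as described above: upsample feature maps from several layers to image resolution and concatenate the activations above each pixel. The layer shapes are made-up stand-ins for real CNN outputs, and nearest-neighbor upsampling is a simplification (bilinear interpolation would be more typical).

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    # Nearest-neighbor upsampling of a (channels, h, w) feature map.
    c, fh, fw = fmap.shape
    ys = np.arange(out_h) * fh // out_h
    xs = np.arange(out_w) * fw // out_w
    return fmap[:, ys[:, None], xs[None, :]]

def hypercolumns(feature_maps, out_h, out_w):
    # feature_maps: list of (channels, h, w) arrays taken from different CNN layers.
    upsampled = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(upsampled, axis=0)   # (total channels, out_h, out_w)

# Made-up shapes standing in for a coarse conv5-like map and a finer conv3-like map.
maps = [np.random.rand(256, 13, 13), np.random.rand(384, 27, 27)]
hc = hypercolumns(maps, 224, 224)
pixel_descriptor = hc[:, 100, 100]   # the hypercolumn at pixel (100, 100)
print(hc.shape, pixel_descriptor.shape)
```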

1,090 citations


Posted Content
TL;DR: A new geocentric embedding for depth images is proposed that encodes height above ground and angle with gravity for each pixel, in addition to the horizontal disparity, to facilitate the use of perception in fields like robotics.
Abstract: In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.

1,059 citations


Book ChapterDOI
06 Sep 2014
TL;DR: In this article, the authors propose a model for fine-grained categorization by leveraging deep convolutional features computed on bottom-up region proposals, which learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation.
Abstract: Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

1,035 citations


Posted Content
TL;DR: In this article, the authors present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

691 citations


Posted Content
TL;DR: DenseNet is presented, an open source system that computes dense, multiscale features from the convolutional layers of a CNN-based object classifier.
Abstract: Convolutional Neural Networks (CNNs) can provide accurate object classification. They can be extended to perform object detection by iterating over dense or selected proposed object regions. However, the runtime of such detectors scales as the total number and/or area of regions to examine per image, and training such detectors may be prohibitively slow. For some CNN classifier topologies, however, it is possible to share significant work among overlapping regions to be classified. This paper presents DenseNet, an open source system that computes dense, multiscale features from the convolutional layers of a CNN-based object classifier. Future work will involve training efficient object detectors with DenseNet feature descriptors.
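A toy sketch of the idea of computing dense, multiscale convolutional features over an image pyramid follows; conv_features is a hypothetical stand-in for the convolutional layers of a classifier, and the stride and channel counts are made up.

```python
import numpy as np

def conv_features(image):
    # Placeholder: pretend the conv layers reduce resolution 16x and emit 256 channels.
    h, w = image.shape[:2]
    return np.random.rand(256, max(h // 16, 1), max(w // 16, 1))

def dense_multiscale_features(image, scales=(0.5, 1.0, 2.0)):
    pyramid = {}
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        ys = np.arange(h) * image.shape[0] // h
        xs = np.arange(w) * image.shape[1] // w
        resized = image[ys[:, None], xs[None, :]]      # nearest-neighbor resize
        pyramid[s] = conv_features(resized)            # dense feature map at this scale
    return pyramid

feats = dense_multiscale_features(np.zeros((480, 640, 3)))
print({s: f.shape for s, f in feats.items()})
```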

613 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This paper experimentally probes several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.
Abstract: In the last two years, convolutional neural networks (CNNs) have achieved an impressive suite of results on standard recognition datasets and tasks. CNN-based features seem poised to quickly replace engineered representations, such as SIFT and HOG. However, compared to SIFT and HOG, we understand much less about the nature of the features learned by large CNNs. In this paper, we experimentally probe several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.

Posted Content
TL;DR: DeepPyramid DPM replaces the standard image features used in DPMs with a learned feature extractor and significantly outperforms DPMs based on histograms of oriented gradients (HOG) features.
Abstract: Deformable part models (DPMs) and convolutional neural networks (CNNs) are two widely used tools for visual recognition. They are typically viewed as distinct approaches: DPMs are graphical models (Markov random fields), while CNNs are "black-box" non-linear classifiers. In this paper, we show that a DPM can be formulated as a CNN, thus providing a novel synthesis of the two ideas. Our construction involves unrolling the DPM inference algorithm and mapping each step to an equivalent (and at times novel) CNN layer. From this perspective, it becomes natural to replace the standard image features used in DPM with a learned feature extractor. We call the resulting model DeepPyramid DPM and experimentally validate it on PASCAL VOC. DeepPyramid DPM significantly outperforms DPMs based on histograms of oriented gradients (HOG) features and slightly outperforms a comparable version of the recently introduced R-CNN detection system, while running an order of magnitude faster.
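To make the "DPM steps as CNN layers" idea concrete, the unoptimized numpy sketch below treats part-filter scoring as a 2-D correlation and the deformation step as a "distance transform" pooling that lets a part shift with a quadratic penalty. The single-channel feature map, filter sizes, and deformation weights are illustrative simplifications, not the paper's construction.

```python
import numpy as np

def correlate2d_valid(feature_map, filt):
    # Part-filter scoring: valid-mode 2-D cross-correlation (a convolution layer).
    fh, fw = filt.shape
    H, W = feature_map.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feature_map[y:y + fh, x:x + fw] * filt)
    return out

def distance_transform_pool(score_map, dx_cost=0.05, dy_cost=0.05, radius=4):
    # out[y, x] = max over displacements (u, v) of score[y + v, x + u] - quadratic cost.
    H, W = score_map.shape
    out = np.full((H, W), -np.inf)
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            shifted = np.full((H, W), -np.inf)
            ys = slice(max(0, -v), min(H, H - v))
            xs = slice(max(0, -u), min(W, W - u))
            shifted[ys, xs] = score_map[max(0, v):H + min(0, v), max(0, u):W + min(0, u)]
            out = np.maximum(out, shifted - (dx_cost * u * u + dy_cost * v * v))
    return out

feature_map = np.random.rand(30, 40)   # stand-in for one channel of a feature pyramid level
part_filter = np.random.rand(5, 5)     # stand-in for a learned part template
part_scores = correlate2d_valid(feature_map, part_filter)
print(distance_transform_pool(part_scores).shape)
```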

Posted Content
TL;DR: This paper proposes Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at lsda.berkeleyvision.org.

Posted Content
TL;DR: This paper proposes a new method that achieves this goal with only image-level labels of whether the objects are present or not, and combines a discriminative submodular cover problem for automatically discovering a set of positive object windows with a smoothed latent SVM formulation.
Abstract: Learning to localize objects with minimal supervision is an important problem in computer vision, since large fully annotated datasets are extremely costly to obtain. In this paper, we propose a new method that achieves this goal with only image-level labels of whether the objects are present or not. Our approach combines a discriminative submodular cover problem for automatically discovering a set of positive object windows with a smoothed latent SVM formulation. The latter allows us to leverage efficient quasi-Newton optimization techniques. Our experiments demonstrate that the proposed approach provides a 50% relative improvement in mean average precision over the current state-of-the-art on PASCAL VOC 2007 detection.
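The greedy loop below is an illustrative reading of the "discriminative submodular cover" step described above: select candidate windows so that every positive image ends up covered by at least one window similar to a selected one. The cosine-similarity coverage criterion, its threshold, and the budget are assumptions for the sketch; the smoothed latent SVM refinement is not shown.

```python
import numpy as np

def greedy_cover(window_feats, window_image_ids, image_ids, sim_thresh=0.8, budget=10):
    # window_feats: (m, d) L2-normalized features of candidate windows
    # window_image_ids: (m,) index of the image each window came from
    covered = {i: False for i in image_ids}
    selected = []
    sims = window_feats @ window_feats.T                 # cosine similarities
    while len(selected) < budget and not all(covered.values()):
        gains = []
        for j in range(len(window_feats)):
            neighbors = np.where(sims[j] >= sim_thresh)[0]
            newly = {window_image_ids[n] for n in neighbors if not covered[window_image_ids[n]]}
            gains.append(len(newly))                     # marginal coverage gain
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break
        selected.append(best)
        for n in np.where(sims[best] >= sim_thresh)[0]:
            covered[window_image_ids[n]] = True
    return selected

feats = np.random.randn(20, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
img_ids = np.repeat(np.arange(5), 4)                     # 4 candidate windows per image
print(greedy_cover(feats, img_ids, list(range(5))))
```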

Posted Content
TL;DR: Simultaneous detection and segmentation (SDS) detects all instances of a category in an image and, for each instance, marks the pixels that belong to it.
Abstract: We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [16]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7 point boost (16% relative) over our baselines on SDS, a 5 point boost (10% relative) over state-of-the-art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and provide directions for future work.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A k-poselet is a deformable part model with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations, which enables a unified approach to person detection and keypoint prediction.
Abstract: A k-poselet is a deformable part model (DPM) with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations. A separate template is used to learn the appearance of each part. The parts are allowed to move with respect to each other with a deformation cost that is learned at training time. This model is richer than both the traditional version of poselets and DPMs. It enables a unified approach to person detection and keypoint prediction which, barring contemporaneous approaches based on CNN features, achieves state-of-the-art keypoint prediction while maintaining competitive detection performance.
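A small sketch of how one k-poselet hypothesis could be scored, following the description above: sum the part appearance (template) scores and subtract a quadratic deformation penalty for each part's displacement from its anchor offset relative to the root. The numbers and the exact parameterization are illustrative, not the paper's trained model.

```python
import numpy as np

def score_hypothesis(part_scores, part_locs, root_loc, anchors, deform_w):
    # part_scores: (k,) appearance scores at the chosen part locations
    # part_locs:   (k, 2) chosen (x, y) locations of the k poselet parts
    # anchors:     (k, 2) ideal offsets of each part from the root
    # deform_w:    (k, 2) quadratic deformation weights per part and axis
    displacement = part_locs - (root_loc + anchors)
    deformation = np.sum(deform_w * displacement ** 2)
    return float(np.sum(part_scores) - deformation)

k = 3
print(score_hypothesis(np.array([1.2, 0.7, 0.9]),
                       np.array([[10, 12], [30, 14], [21, 40]], float),
                       np.array([8.0, 10.0]),
                       np.array([[2, 2], [20, 5], [12, 28]], float),
                       np.full((k, 2), 0.05)))
```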

Posted Content
TL;DR: This work proposes a model for fine-grained categorization that overcomes limitations by leveraging deep convolutional features computed on bottom-up region proposals, and learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine- grained category from a pose-normalized representation.
Abstract: Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

Proceedings Article
08 Dec 2014
TL;DR: The Large Scale Detection through Adaptation (LSDA) algorithm learns the difference between the classification and detection tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at lsda.berkeleyvision.org.
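A hedged sketch of the weight-adaptation idea summarized above: for a category with only classification data, approximate its detector weights by adding the average classifier-to-detector weight change learned for its most similar categories that do have detection data. The similarity measure, neighbor count, and dimensions are illustrative assumptions, not the paper's exact transfer procedure.

```python
import numpy as np

def adapt_to_detector(w_clf_target, w_clf_src, w_det_src, k=3):
    # w_clf_target: (d,) classifier weights for the category without box annotations
    # w_clf_src:    (n, d) classifier weights for categories with detection data
    # w_det_src:    (n, d) fine-tuned detector weights for those same categories
    sims = w_clf_src @ w_clf_target                  # similarity in classifier weight space
    nearest = np.argsort(-sims)[:k]                  # k most similar source categories
    delta = (w_det_src[nearest] - w_clf_src[nearest]).mean(axis=0)
    return w_clf_target + delta                      # transferred "detector" weights

d, n = 4096, 100
w_clf_src, w_det_src = np.random.randn(n, d), np.random.randn(n, d)
w_det_new = adapt_to_detector(np.random.randn(d), w_clf_src, w_det_src)
print(w_det_new.shape)
```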

Posted Content
TL;DR: This work presents convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images and gives state-of-the-art results for keypoint and action prediction.
Abstract: We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images. Our approach involves training an R-CNN detector with loss functions depending on the task being tackled. We evaluate our method on the challenging PASCAL VOC dataset and compare it to previous leading approaches. Our method gives state-of-the-art results for keypoint and action prediction. Additionally, we introduce a new dataset for action detection, the task of simultaneously localizing people and classifying their actions, and present results using our approach.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A dataset of 7,413 airplanes annotated in detail with parts and their attributes is introduced, leveraging images donated by airplane spotters and crowdsourcing both the design and collection of the detailed annotations; the authors provide insights that should help researchers interested in designing fine-grained datasets for other basic-level categories.
Abstract: We study the problem of understanding objects in detail, intended as recognizing a wide array of fine-grained object attributes. To this end, we introduce a dataset of 7,413 airplanes annotated in detail with parts and their attributes, leveraging images donated by airplane spotters and crowdsourcing both the design and collection of the detailed annotations. We provide a number of insights that should help researchers interested in designing fine-grained datasets for other basic-level categories. We show that the collected data can be used to study the relation between part detection and attribute prediction by diagnosing the performance of classifiers that pool information from different parts of an object. We note that the prediction of certain attributes can benefit substantially from accurate part detection. We also show that, differently from previous results in object detection, employing a large number of part templates can improve detection accuracy at the expense of detection speed. We finally propose a coarse-to-fine approach to speed up detection through a hierarchical cascade algorithm.

Proceedings Article
21 Jun 2014
TL;DR: In this article, a discriminative submodular cover problem for automatically discovering a set of positive object windows is combined with a smoothed latent SVM formulation, which leverages efficient quasi-Newton optimization techniques.
Abstract: Learning to localize objects with minimal supervision is an important problem in computer vision, since large fully annotated datasets are extremely costly to obtain. In this paper, we propose a new method that achieves this goal with only image-level labels of whether the objects are present or not. Our approach combines a discriminative submodular cover problem for automatically discovering a set of positive object windows with a smoothed latent SVM formulation. The latter allows us to leverage efficient quasi-Newton optimization techniques. Our experiments demonstrate that the proposed approach provides a 50% relative improvement in mean average precision over the current state-of-the-art on PASCAL VOC 2007 detection.

Posted Content
TL;DR: In this paper, the authors experimentally probe several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.
Abstract: In the last two years, convolutional neural networks (CNNs) have achieved an impressive suite of results on standard recognition datasets and tasks. CNN-based features seem poised to quickly replace engineered representations, such as SIFT and HOG. However, compared to SIFT and HOG, we understand much less about the nature of the features learned by large CNNs. In this paper, we experimentally probe several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.

Posted Content
05 Mar 2014
TL;DR: In this paper, a discriminative submodular cover problem is used to discover a set of positive object windows and is combined with a smoothed latent SVM formulation that can leverage efficient quasi-Newton optimization techniques.
Abstract: Learning to localize objects with minimal supervision is an important problem in computer vision, since large fully annotated datasets are extremely costly to obtain. In this paper, we propose a new method that achieves this goal with only image-level labels of whether the objects are present or not. Our approach combines a discriminative submodular cover problem for automatically discovering a set of positive object windows with a smoothed latent SVM formulation. The latter allows us to leverage efficient Quasi-Newton optimization techniques. Our experiments demonstrate that the proposed approach provides approximately 70% relative improvement in average precision over the current state of the art on standard benchmark datasets.

Posted Content
TL;DR: This article investigates the importance of parts for the tasks of action and attribute classification and develops a part-based approach by leveraging convolutional network features, inspired by recent advances in computer vision.
Abstract: We investigate the importance of parts for the tasks of action and attribute classification. We develop a part-based approach by leveraging convolutional network features inspired by recent advances in computer vision. Our part detectors are a deep version of poselets and capture parts of the human body under a distinct set of poses. For the tasks of action and attribute classification, we train holistic convolutional neural networks and show that adding parts leads to top-performing results for both tasks. In addition, we demonstrate the effectiveness of our approach when we replace an oracle person detector, as is the default in the current evaluation protocol for both tasks, with a state-of-the-art person detection system.

01 Jan 2014
TL;DR: This paper proposes a Deep Detection Adaptation (DDA) algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNN) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose a Deep Detection Adaptation (DDA) algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach.