
Showing papers by "Sergio Guadarrama" published in 2016


Proceedings ArticleDOI
TL;DR: This article proposes a policy gradient method to directly optimize a linear combination of SPICE and CIDEr (a combination they call SPIDEr), which yields image captions that are strongly preferred by human raters over captions generated by the same model trained to optimize MLE or the COCO metrics.
Abstract: Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr (a combination we call SPIDEr): the SPICE score ensures our captions are semantically faithful to the image, while the CIDEr score ensures our captions are syntactically fluent. The PG method we propose improves on the prior MIXER approach by using Monte Carlo rollouts instead of mixing MLE training with PG. We show empirically that our algorithm leads to easier optimization and improved results compared to MIXER. Finally, we show that using our PG method we can optimize any of the metrics, including the proposed SPIDEr metric, which results in image captions that are strongly preferred by human raters compared to captions generated by the same model but trained to optimize MLE or the COCO metrics.
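To make the training signal concrete, here is a minimal, self-contained Python sketch of a SPIDEr-style reward combined with a REINFORCE-style update. The scorers, gradients, and numbers are illustrative stand-ins, not the authors' implementation or the official SPICE/CIDEr evaluators.

```python
import numpy as np

def spider(candidate, refs, spice_fn, cider_fn, alpha=0.5):
    """SPIDEr reward: a linear combination of SPICE (semantic faithfulness)
    and CIDEr (fluency)."""
    return alpha * spice_fn(candidate, refs) + (1.0 - alpha) * cider_fn(candidate, refs)

def reinforce_update(log_prob_grads, sampled_reward, baseline):
    """REINFORCE estimator: scale the log-probability gradients of a sampled
    caption by its advantage; the baseline (e.g. estimated from Monte Carlo
    rollouts) reduces variance without biasing the gradient."""
    advantage = sampled_reward - baseline
    return [g * advantage for g in log_prob_grads]

# Toy usage with a stand-in word-overlap scorer in place of real SPICE/CIDEr.
overlap = lambda c, rs: max(len(set(c.split()) & set(r.split())) / len(set(r.split()))
                            for r in rs)
refs = ["a dog runs on the beach"]
reward = spider("a dog on the beach", refs, spice_fn=overlap, cider_fn=overlap)
grads = reinforce_update([np.ones(3)], sampled_reward=reward, baseline=0.4)
print(round(reward, 2), grads)
```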

271 citations


Posted Content
TL;DR: In this article, the authors investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems, and present a unified implementation of the Faster R-CNN, R-FCN and SSD systems, which they view as "meta-architectures".
Abstract: The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
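As a rough illustration of what "tracing out the speed/accuracy trade-off curve" means in practice, the sketch below computes a Pareto frontier over benchmark runs. The configurations and numbers are made-up placeholders, not results or code from the paper.

```python
# Each entry: (meta-architecture, feature extractor, image size, sec/image, COCO mAP).
# All numbers below are illustrative placeholders, not measurements from the paper.
runs = [
    ("ssd",         "mobilenet",        300, 0.03, 0.19),
    ("ssd",         "inception_v2",     300, 0.04, 0.22),
    ("r_fcn",       "resnet101",        600, 0.09, 0.30),
    ("faster_rcnn", "resnet101",        600, 0.15, 0.32),
    ("faster_rcnn", "inception_resnet", 600, 0.40, 0.35),
]

def pareto_frontier(runs):
    """Keep only configurations that are not dominated: nothing faster is also
    at least as accurate."""
    frontier = []
    for run in sorted(runs, key=lambda r: r[3]):          # fastest first
        if not frontier or run[4] > frontier[-1][4]:      # strictly better mAP
            frontier.append(run)
    return frontier

for meta, extractor, size, sec, mAP in pareto_frontier(runs):
    print(f"{meta:12s} {extractor:18s} {size}px  {sec * 1000:4.0f} ms/img  mAP={mAP:.2f}")
```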

158 citations


Posted Content
01 Dec 2016
TL;DR: A novel training procedure for image captioning models based on policy gradient methods is proposed, which allows direct optimization of the metrics of interest rather than just maximizing the likelihood of human-generated captions.
Abstract: In this paper, we propose a novel training procedure for image captioning models based on policy gradient methods. This allows us to directly optimize for the metrics of interest, rather than just maximizing the likelihood of human-generated captions. We show that by optimizing for standard metrics such as BLEU, CIDEr, METEOR and ROUGE, we can develop a system that improves on the metrics and ranks first on the MSCOCO image captioning leaderboard, even though our CNN-RNN model is much simpler than state-of-the-art models. We further show that by also optimizing for the recently introduced SPICE metric, which measures the semantic quality of captions, we can produce a system that significantly outperforms other methods as measured by human evaluation. Finally, we show how we can leverage extra sources of information, such as pre-trained image tagging models, to further improve quality.
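The quantity being optimized in such a procedure is the expected metric score of sampled captions. A standard REINFORCE-style gradient estimate, written here in its textbook form with a baseline b for variance reduction (not copied from the paper), is:

```latex
% R(w) is the chosen metric score (BLEU, CIDEr, METEOR, ROUGE, or SPICE) of a
% sampled caption w for image I; b is a baseline that does not depend on w.
\[
  \nabla_\theta \, \mathbb{E}_{w \sim p_\theta(\cdot \mid I)}\bigl[R(w)\bigr]
  \;=\;
  \mathbb{E}_{w \sim p_\theta(\cdot \mid I)}
  \Bigl[\bigl(R(w) - b\bigr)\,\nabla_\theta \log p_\theta(w \mid I)\Bigr]
\]
```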

85 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: This paper proposes a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance.
Abstract: In this paper, we propose a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance. Given labeled depth images for a handful of categories, we adapt an RGB object detector for a new category such that it can now use depth images in addition to RGB images at test time to produce more accurate detections. Our approach is built upon the observation that the lower layers of a CNN are largely task- and category-agnostic but domain-specific, whereas the higher layers are largely task- and category-specific but domain-agnostic. We operationalize this observation by proposing a mid-level fusion of RGB and depth CNNs. Experimental evaluation on the challenging NYUD2 dataset shows that our proposed adaptation technique results in an average 21% relative improvement in detection performance over an RGB-only baseline even when no depth training data is available for the particular category evaluated. We believe our proposed technique will extend advances made in computer vision to RGB-D data, leading to improvements in performance with little additional annotation effort.
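A compact sketch of what mid-level fusion can look like is given below, written with PyTorch for concreteness. The layer sizes, the three-channel depth encoding, and the classifier head are illustrative assumptions; this is not the paper's architecture or training setup.

```python
import torch
import torch.nn as nn

class MidFusionHead(nn.Module):
    """Illustrative mid-level fusion: separate lower conv towers for RGB and depth
    (domain-specific), concatenated mid-level features, shared higher layers
    (domain-agnostic) feeding a classifier."""
    def __init__(self, num_classes=20):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.rgb_tower = tower()    # in the paper's setting, initialized from an RGB detector
        self.depth_tower = tower()  # adapted copy for a 3-channel depth encoding (e.g. HHA)
        self.shared = nn.Sequential(
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_tower(rgb), self.depth_tower(depth)], dim=1)
        return self.classifier(self.shared(fused).flatten(1))

# Toy forward pass on a single 64x64 region crop.
model = MidFusionHead()
scores = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 20])
```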

84 citations


Journal Article
TL;DR: This work provides a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation.
Abstract: A major barrier towards scaling visual recognition systems is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) trained using 1.2M+ labeled images have emerged as clear winners on object classification benchmarks. Unfortunately, only a small fraction of those labels are available with bounding box localization for training the detection task, and even fewer pixel-level annotations are available for semantic segmentation. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect scene-centric images with precisely localized labels. We develop methods for learning large-scale recognition models which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. We provide a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation. We then show how to use our large-scale detectors to produce pixel-level annotations. Using our method, we produce a >7.6K-category detector and release code and models at lsda.berkeleyvision.org.
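The core of the joint weak/strong objective can be illustrated with a small example: when a bounding box is available the annotated region is supervised directly, and when only an image-level label is available the highest-scoring region stands in for it, multiple-instance-learning style. The sketch below is a simplified rendering under these assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax_xent(scores, label):
    """Cross-entropy of one region's class scores against an integer class label."""
    z = scores - scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def weak_strong_loss(region_scores, image_label, gt_region=None):
    """Strong (box) label: supervise the annotated region directly.
    Weak (image-level) label: treat the highest-scoring region for that class
    as the positive instance, MIL-style."""
    if gt_region is not None:
        return softmax_xent(region_scores[gt_region], image_label)
    best = int(np.argmax(region_scores[:, image_label]))  # most confident region
    return softmax_xent(region_scores[best], image_label)

# Toy example: 4 candidate regions, 5 classes, image-level label = class 2.
scores = np.random.default_rng(0).standard_normal((4, 5))
print(weak_strong_loss(scores, image_label=2))               # weak (image-level) case
print(weak_strong_loss(scores, image_label=2, gt_region=1))  # strong (box-level) case
```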

26 citations


Journal ArticleDOI
TL;DR: This work addresses the problem of retrieving and detecting objects based on open-vocabulary natural language queries by introducing a novel object retrieval method, and it proposes a method for handling open vocabularies, that is, words not contained in the training data.
Abstract: We address the problem of retrieving and detecting objects based on open-vocabulary natural language queries: given a phrase describing a specific object, for example "the corn flakes box", the task is to find the best match in a set of images containing candidate objects. When naming objects, humans tend to use natural language with rich semantics, including basic-level categories, fine-grained categories, and instance-level concepts such as brand names. Existing approaches to large-scale object recognition fail in this scenario, as they expect queries that map directly to a fixed set of pre-trained visual categories, for example ImageNet synset tags. We address this limitation by introducing a novel object retrieval method. Given a candidate object image, we first map it to a set of words that are likely to describe it, using several learned image-to-text projections. We also propose a method for handling open vocabularies, that is, words not contained in the training data. We then compare the natural language query to the sets of words predicted for each candidate and select the best match. Our method can combine category- and instance-level semantics in a common representation. We present extensive experimental results on several datasets using both instance-level and category-level matching and show that our approach can accurately retrieve objects based on extremely varied open-vocabulary queries. Furthermore, we show how to process queries referring to objects within scenes, using state-of-the-art adapted detectors. The source code of our approach will be publicly available together with pre-trained models at http://openvoc.berkeleyvision.org and could be directly used for robotics applications.
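To illustrate the matching step (this is not the released openvoc.berkeleyvision.org code), the sketch below scores each candidate by comparing the query words against the candidate's predicted (word, confidence) pairs, falling back to a word-embedding similarity for query words outside the training vocabulary; the embedding here is a tiny hand-made stand-in.

```python
import numpy as np

def best_match(query, candidates, embed):
    """Return the index of the candidate whose predicted words best cover the query.
    Each candidate is a list of (word, confidence) pairs from learned image-to-text
    projections; unseen (open-vocabulary) query words are compared via embedding
    cosine similarity instead of exact match."""
    def word_sim(q, w):
        if q == w:
            return 1.0
        a, b = embed(q), embed(w)
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    def score(word_scores):
        return sum(max(word_sim(q, w) * s for w, s in word_scores)
                   for q in query.lower().split())
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))

# Toy usage: two candidate objects with predicted words; tiny stand-in embedding
# vectors (a real system would use trained word embeddings).
vectors = {"cereal": np.array([1.0, 0.2]), "corn": np.array([0.9, 0.3]),
           "flakes": np.array([0.8, 0.4]), "box": np.array([0.1, 1.0]),
           "mug": np.array([-0.5, 0.6]), "cup": np.array([-0.4, 0.7])}
embed = lambda w: vectors.get(w, np.zeros(2))
candidates = [[("cereal", 0.9), ("box", 0.7)], [("mug", 0.8), ("cup", 0.6)]]
print(best_match("the corn flakes box", candidates, embed))  # -> 0 (the cereal box)
```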

16 citations