
Showing papers by "Sergio Guadarrama published in 2014"


Posted Content
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
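
A minimal sketch of how the model/implementation separation described above looks from Caffe's Python bindings (pycaffe); the file names and the use of random input data are placeholders for illustration, not artifacts distributed with the paper.

    import numpy as np
    import caffe  # Caffe's Python bindings (pycaffe)

    # The same network definition runs on CPU or GPU; only the mode changes.
    caffe.set_mode_gpu()          # or caffe.set_mode_cpu()

    # The architecture lives in a plaintext protobuf file and the learned
    # parameters in a separate binary file, so either can be swapped out
    # without touching the other (placeholder paths).
    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

    # Forward a dummy batch shaped like the network's input blob.
    net.blobs['data'].data[...] = np.random.rand(
        *net.blobs['data'].data.shape).astype(np.float32)
    out = net.forward()
    print({name: blob.shape for name, blob in out.items()})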

12,531 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx. 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
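
As a back-of-envelope check on the throughput figure quoted above, roughly 2 ms per image does work out to more than 40 million images per day on one GPU; the calculation below ignores I/O and batching overhead.

    # Rough sanity check of the "over 40 million images a day" figure.
    seconds_per_day = 24 * 60 * 60       # 86,400 s
    latency_s = 2e-3                     # ~2 ms per image (amortized, batched)
    print(f"{seconds_per_day / latency_s:,.0f} images/day")   # ~43,200,000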

10,161 citations


Posted Content
TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
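
The "doubly deep" structure described above can be sketched in a few lines. The snippet below is an illustration in PyTorch rather than the authors' Caffe-based implementation, and every layer size, frame count, and class count is an arbitrary placeholder.

    import torch
    import torch.nn as nn

    class RecurrentConvNet(nn.Module):
        """Minimal 'doubly deep' model: a small conv encoder applied per frame,
        followed by an LSTM over the resulting sequence of frame features."""
        def __init__(self, num_classes=10, feat_dim=64, hidden_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(             # spatial ("convolutional") depth
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # temporal depth
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, clips):                      # clips: (batch, time, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.encoder(clips.flatten(0, 1)).view(b, t, -1)
            outputs, _ = self.rnn(feats)               # one hidden state per frame
            return self.classifier(outputs[:, -1])     # predict from the last step

    model = RecurrentConvNet()
    logits = model(torch.randn(2, 8, 3, 64, 64))       # 2 clips of 8 frames each
    print(logits.shape)                                 # torch.Size([2, 10])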

3,935 citations


Posted Content
TL;DR: This paper proposes Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at lsda.berkeleyvision.org.
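
The sketch below illustrates the final-layer adaptation idea in the abstract: detector weights for a category without bounding boxes are approximated by adding the average classifier-to-detector weight change of similar categories that do have boxes. The category names, feature dimension, and nearest-neighbor choice are invented for illustration; the full LSDA method also adapts hidden layers and handles a background class.

    import numpy as np

    rng = np.random.default_rng(0)
    num_feats = 4096                      # e.g. an fc7-sized feature (illustrative)

    # Classifier weights exist for every category; detector weights were only
    # fine-tuned for the categories that have bounding-box annotations.
    w_cls = {c: rng.normal(size=num_feats) for c in ["dog", "cat", "apple", "piano"]}
    w_det = {c: w_cls[c] + rng.normal(scale=0.1, size=num_feats) for c in ["dog", "cat"]}

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def adapt(category, k=2):
        """Approximate a detector for a box-less category by adding the mean
        classifier-to-detector change of its k nearest annotated categories."""
        neighbors = sorted(w_det, key=lambda c: -cosine(w_cls[category], w_cls[c]))[:k]
        delta = np.mean([w_det[c] - w_cls[c] for c in neighbors], axis=0)
        return w_cls[category] + delta

    w_det["apple"] = adapt("apple")       # a detector without box supervision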

319 citations


Proceedings Article
23 Aug 2014
TL;DR: This paper proposes a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics, and uses state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video.
Abstract: This paper integrates techniques in natural language processing and computer vision to improve recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video. Our factor graph model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint detection of these latent, diverse sentence components and the detection of some individual components when compared to using the vision system alone, as well as over a previous n-gram language-modeling approach. The joint detection allows us to automatically generate more accurate, richer sentential descriptions of videos with a wide array of possible content.
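
A toy version of the combination step described above: visual confidences and corpus-mined statistics are blended to pick the most likely (subject, verb, object) tuple. All numbers are invented, the place component is omitted, and brute-force enumeration stands in for the paper's factor-graph inference.

    import itertools

    # Illustrative visual confidences from the recognition systems.
    subjects = {"person": 0.9, "dog": 0.4}
    verbs    = {"ride": 0.6, "walk": 0.5}
    objects  = {"bicycle": 0.7, "leash": 0.3}

    # Illustrative compatibility scores mined from a text corpus.
    language = {("person", "ride", "bicycle"): 0.8,
                ("person", "walk", "leash"):   0.1,
                ("dog",    "walk", "leash"):   0.6}

    def score(s, v, o, alpha=0.5):
        """Blend visual and language evidence; unseen tuples get a small floor."""
        visual = subjects[s] * verbs[v] * objects[o]
        return alpha * visual + (1 - alpha) * language.get((s, v, o), 1e-3)

    best = max(itertools.product(subjects, verbs, objects), key=lambda t: score(*t))
    print(best)   # ('person', 'ride', 'bicycle') with these made-up numbers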

216 citations


Proceedings Article
08 Dec 2014
TL;DR: The Large Scale Detection through Adaptation (LSDA) algorithm as discussed by the authors learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at lsda.berkeleyvision.org.

159 citations


Proceedings ArticleDOI
12 Jul 2014
TL;DR: This paper introduces a novel object retrieval method that can combine category- and instance-level semantics in a common representation and shows that the approach can accurately retrieve objects based on extremely varied open-vocabulary queries.
Abstract: In this paper, we address the problem of retrieving objects based on open-vocabulary natural language queries: Given a phrase describing a specific object, e.g., “the corn flakes box”, the task is to find the best match in a set of images containing candidate objects. When naming objects, humans tend to use natural language with rich semantics, including basic-level categories, fine-grained categories, and instance-level concepts such as brand names. Existing approaches to large-scale object recognition fail in this scenario, as they expect queries that map directly to a fixed set of pre-trained visual categories, e.g. ImageNet synset tags. We address this limitation by introducing a novel object retrieval method. Given a candidate object image, we first map it to a set of words that are likely to describe it, using several learned image-to-text projections. We also propose a method for handling open-vocabularies, i.e., words not contained in the training data. We then compare the natural language query to the sets of words predicted for each candidate and select the best match. Our method can combine category- and instance-level semantics in a common representation. We present extensive experimental results on several datasets using both instance-level and category-level matching and show that our approach can accurately retrieve objects based on extremely varied open-vocabulary queries. The source code of our approach will be publicly available together with pre-trained models at http://openvoc.berkeleyvision.org and could be directly used for robotics applications.
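
A toy illustration of the matching step described above: each candidate image is mapped to a weighted set of words, and the natural-language query is scored against those sets. The word predictions and scores are invented, and naive substring matching stands in for the learned image-to-text projections and the paper's open-vocabulary handling.

    # Hypothetical per-candidate word predictions with confidences.
    candidates = {
        "img_001": {"box": 0.9, "cereal": 0.8, "corn flakes": 0.7, "kitchen": 0.4},
        "img_002": {"mug": 0.9, "cup": 0.8, "coffee": 0.6},
    }

    def retrieve(query, candidates):
        """Score each candidate by summing the confidences of predicted words
        that appear in the query; words the model has never seen simply
        contribute nothing in this simplified version."""
        terms = query.lower()
        scores = {img: sum(conf for word, conf in words.items() if word in terms)
                  for img, words in candidates.items()}
        return max(scores, key=scores.get)

    print(retrieve("the corn flakes box", candidates))   # img_001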

86 citations


Posted Content
TL;DR: The Optimal Roundness Criterion (ORC) is proposed as a novel stopping criterion for Sparse Filtering; it is related to pre-processing procedures such as Statistical Whitening and can make image classification with Sparse Filtering considerably faster and more accurate.
Abstract: Sparse Filtering is a popular feature learning algorithm for image classification pipelines. In this paper, we connect the performance of Sparse Filtering with spectral properties of the corresponding feature matrices. This connection provides new insights into Sparse Filtering; in particular, it suggests early stopping of Sparse Filtering. We therefore introduce the Optimal Roundness Criterion (ORC), a novel stopping criterion for Sparse Filtering. We show that this stopping criterion is related to pre-processing procedures such as Statistical Whitening and demonstrate that it can make image classification with Sparse Filtering considerably faster and more accurate.
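
For reference, the snippet below sketches the standard Sparse Filtering objective (soft absolute value followed by row- and column-wise L2 normalization, minimizing the summed activations) with a singular-value ratio logged as a rough proxy for spectral "roundness". The ORC itself is defined in the paper and is not reproduced here; the dimensions, learning rate, and monitoring proxy are all assumptions.

    import torch

    # Toy data: 200 examples with 64 dimensions (illustrative sizes).
    torch.manual_seed(0)
    X = torch.randn(64, 200)
    W = torch.randn(32, 64, requires_grad=True)       # 32 learned features
    opt = torch.optim.Adam([W], lr=1e-2)

    for step in range(500):
        F = torch.sqrt((W @ X) ** 2 + 1e-8)           # soft absolute value
        F = F / F.norm(dim=1, keepdim=True)           # normalize each feature (row)
        F = F / F.norm(dim=0, keepdim=True)           # normalize each example (column)
        loss = F.sum()                                 # sparsity objective
        opt.zero_grad(); loss.backward(); opt.step()

        if step % 100 == 0:
            # Ratio of smallest to largest singular value of the feature matrix,
            # used here only as a crude stand-in for a spectral roundness measure.
            s = torch.linalg.svdvals(F.detach())
            print(step, round(loss.item(), 3), round((s[-1] / s[0]).item(), 4))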

5 citations


01 Jan 2014
TL;DR: This paper proposes a Deep Detection Adaptation (DDA) algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors.
Abstract: A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose a Deep Detection Adaptation (DDA) algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach.

3 citations