Search or ask a question

Showing papers by "Sergio Guadarrama published in 2013"

PDF

Open Access

Proceedings Article•DOI•

YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

[...]

Sergio Guadarrama¹, Niveda Krishnamoorthy², Girish Malkarnenkar², Subhashini Venugopalan², Raymond J. Mooney², Trevor Darrell¹, Kate Saenko³ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

01 Dec 2013

TL;DR: This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object, and uses a Web-scale language model to ``fill in'' novel verbs.

...read moreread less

Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities ``in-the-wild''. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from Web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects, we also use a Web-scale language model to ``fill in'' novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.

...read moreread less

555 citations

Proceedings Article•

Generating Natural-Language Video Descriptions Using Text-Mined Knowledge

[...]

Niveda Krishnamoorthy¹, Girish Malkarnenkar¹, Raymond J. Mooney¹, Kate Saenko², Sergio Guadarrama³ - Show less +1 more•Institutions (3)

University of Texas at Austin¹, University of Massachusetts Lowell², University of California, Berkeley³

14 Jul 2013

TL;DR: This work combines the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video, and shows that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification.

...read moreread less

Abstract: We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.

...read moreread less

233 citations

Proceedings Article•DOI•

Grounding spatial relations for human-robot interaction

[...]

Sergio Guadarrama¹, Lorenzo Riano¹, Dave Golland¹, Daniel Gouhring², Yangqing Jia¹, Dan Klein¹, Pieter Abbeel¹, Trevor Darrell¹ - Show less +4 more•Institutions (2)

University of California, Berkeley¹, Institute of Company Secretaries of India²

01 Nov 2013

TL;DR: A system for human-robot interaction that learns both models for spatial prepositions and for object recognition, and grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors to send an appropriate command to the PR2 or respond to spatial queries.

...read moreread less

Abstract: We propose a system for human-robot interaction that learns both models for spatial prepositions and for object recognition. Our system grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors in order to send an appropriate command to the PR2 or respond to spatial queries. To perform this grounding, the system recognizes the objects in the scene, determines which spatial relations hold between those objects, and semantically parses the input sentence. The proposed system uses the visual and spatial information in conjunction with the semantic parse to interpret statements that refer to objects (nouns), their spatial relationships (prepositions), and to execute commands (actions). The semantic parse is inherently compositional, allowing the robot to understand complex commands that refer to multiple objects and relations such as: “Move the cup close to the robot to the area in front of the plate and behind the tea box”. Our system correctly parses 94% of the 210 online test sentences, correctly interprets 91% of the correctly parsed sentences, and correctly executes 89% of the correctly interpreted sentences.

...read moreread less

145 citations

Proceedings Article•

Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013

[...]

Niveda Krishnamoorthy¹, Girish Malkarnenkar¹, Raymond J. Mooney¹, Kate Saenko², Sergio Guadarrama³ - Show less +1 more•Institutions (3)

University of Texas at Austin¹, University of Massachusetts Lowell², University of California, Berkeley³

01 Jan 2013

16 citations