Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

doi:10.1007/978-3-319-46448-0_31

Open Access, Book Chapter

Gunnar A. Sigurdsson, +7 more

- pp 510-526

TLDR

This work proposes Hollywood in Homes, a crowdsourced approach to data collection, and uses it to build a new dataset, Charades, in which hundreds of people record videos in their own homes while acting out casual everyday activities. It also provides baseline results for several tasks, including action recognition and automatic description generation.

Abstract:

Computer vision has great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. Because most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation, from script writing to video recording and annotation. Following this procedure, we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 s, showing activities of 267 people from three continents. Each video is annotated with multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks, including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.
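
The totals quoted in the abstract can be turned into rough per-video figures with simple arithmetic. The sketch below is only an illustration: the constants are the counts stated above, while the function and key names are hypothetical and not part of any Charades tooling. Because the 30 s video length is a reported average, the total footage is an approximation.

```python
# Counts quoted in the abstract; all names here are illustrative assumptions.
NUM_VIDEOS = 9_848
AVG_LENGTH_S = 30            # reported average length, so totals are approximate
NUM_PEOPLE = 267
NUM_DESCRIPTIONS = 27_847
NUM_INTERVALS = 66_500
NUM_OBJECT_LABELS = 41_104


def charades_summary() -> dict:
    """Derive approximate per-video and per-person averages from the totals."""
    return {
        "descriptions_per_video": NUM_DESCRIPTIONS / NUM_VIDEOS,
        "intervals_per_video": NUM_INTERVALS / NUM_VIDEOS,
        "object_labels_per_video": NUM_OBJECT_LABELS / NUM_VIDEOS,
        "videos_per_person": NUM_VIDEOS / NUM_PEOPLE,
        # Total footage in hours, assuming every video is exactly the average length.
        "approx_total_hours": NUM_VIDEOS * AVG_LENGTH_S / 3600,
    }


if __name__ == "__main__":
    for name, value in charades_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, each video carries on average just under three descriptions, close to seven temporally localized action intervals and a little over four object labels, and the dataset amounts to roughly 82 hours of footage.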
