Home
/
Authors
/
Carl Vondrick

Author

Carl Vondrick

Other affiliations: University of California, Irvine, Google, Massachusetts Institute of Technology

Bio: Carl Vondrick is an academic researcher from Columbia University. The author has contributed to research in topics: Computer science & Object detection. The author has an hindex of 37, co-authored 92 publications receiving 9054 citations. Previous affiliations of Carl Vondrick include University of California, Irvine & Google.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Generating Videos with Scene Dynamics

[...]

Carl Vondrick¹, Hamed Pirsiavash², Antonio Torralba¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

08 Sep 2016

TL;DR: A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed that can generate tiny videos up to a second at full frame rate better than simple baselines.

...read moreread less

Abstract: We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.

...read moreread less

998 citations

Proceedings Article•DOI•

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

[...]

Chunhui Gu¹, Chen Sun¹, David A. Ross¹, Carl Vondrick¹, Caroline Pantofaru¹, Yeqing Li¹, Sudheendra Vijayanarasimhan¹, George Toderici¹, Susanna Ricco¹, Rahul Sukthankar¹, Cordelia Schmid¹, Jitendra Malik², Jitendra Malik¹ - Show less +9 more•Institutions (2)

Google¹, Lawrence Berkeley National Laboratory²

18 Jun 2018

TL;DR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

...read moreread less

Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.8% mAP, underscoring the need for developing new approaches for video understanding.

...read moreread less

850 citations

Posted Content•

SoundNet: Learning Sound Representations from Unlabeled Video

[...]

Yusuf Aytar¹, Carl Vondrick¹, Antonio Torralba¹•Institutions (1)

Massachusetts Institute of Technology¹

27 Oct 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos and propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabelled video as a bridge.

...read moreread less

Abstract: We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels

...read moreread less

725 citations

Posted Content•

Generating Videos with Scene Dynamics

[...]

Carl Vondrick¹, Hamed Pirsiavash², Antonio Torralba¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

08 Sep 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: The authors proposed a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background, which can generate tiny videos up to a second at full frame rate better than simple baselines.

...read moreread less

670 citations

Proceedings Article•DOI•

A large-scale benchmark dataset for event recognition in surveillance video

[...]

Sangmin Oh¹, Anthony Hoogs², A. G. Amitha Perera², Naresh P. Cuntoor², Chia-Chih Chen³, Jong Taek Lee³, Saurajit Mukherjee, Jake K. Aggarwal³, Hyungtae Lee⁴, Larry S. Davis⁴, Eran Swears², Xioyang Wang, Qiang Ji⁵, Kishore K. Reddy⁶, Mubarak Shah⁶, Carl Vondrick⁷, Hamed Pirsiavash⁷, Deva Ramanan⁷, Jenny Yuen⁸, Antonio Torralba⁸, Bi Song⁹, Anesco Fong, Amit K. Roy-Chowdhury⁹, Mita Desai¹⁰ - Show less +20 more•Institutions (10)

Georgia Institute of Technology¹, Kitware², University of Texas at Austin³, University of Maryland, College Park⁴, Rensselaer Polytechnic Institute⁵, University of Central Florida⁶, University of California, Irvine⁷, Massachusetts Institute of Technology⁸, University of California, Riverside⁹, DARPA¹⁰

20 Jun 2011

TL;DR: A new large-scale video dataset designed to assess the performance of diverseVisual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage is introduced.

...read moreread less

Abstract: We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but, these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations which include both moving object tracks and event examples, which will provide solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance the CVER tasks in the years ahead.

...read moreread less

664 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

[...]

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.

...read moreread less

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

...read moreread less

30,811 citations

Book Chapter•DOI•

Microsoft COCO: Common Objects in Context

[...]

Tsung-Yi Lin¹, Michael Maire², Serge Belongie¹, James Hays, Pietro Perona², Deva Ramanan³, Piotr Dollár⁴, C. Lawrence Zitnick⁴ - Show less +4 more•Institutions (4)

Cornell University¹, California Institute of Technology², University of California, Irvine³, Microsoft⁴

06 Sep 2014

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

...read moreread less

Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

...read moreread less

30,462 citations

Proceedings Article•DOI•

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

[...]

Ross Girshick¹, Jeff Donahue¹, Trevor Darrell¹, Jitendra Malik¹•Institutions (1)

University of California, Berkeley¹

23 Jun 2014

TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

...read moreread less

Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

...read moreread less

21,729 citations

Proceedings Article•DOI•

Fast R-CNN

[...]

Ross Girshick¹•Institutions (1)

Microsoft¹

07 Dec 2015

TL;DR: Fast R-CNN as discussed by the authors proposes a Fast Region-based Convolutional Network method for object detection, which employs several innovations to improve training and testing speed while also increasing detection accuracy and achieves a higher mAP on PASCAL VOC 2012.

...read moreread less

Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

...read moreread less

14,824 citations

Posted Content•

Fast R-CNN

[...]

Ross Girshick¹•Institutions (1)

Microsoft¹

30 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.

...read moreread less

14,747 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse