Home
/
Authors
/
Santiago Manen

Author

Santiago Manen

Bio: Santiago Manen is an academic researcher from ETH Zurich. The author has contributed to research in topics: Video tracking & Object detection. The author has an hindex of 6, co-authored 7 publications receiving 531 citations.

Topics: Video tracking, Object detection, Supervised learning, Trajectory, Pixel ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Prime Object Proposals with Randomized Prim's Algorithm

[...]

Santiago Manen¹, Matthieu Guillaumin¹, Luc Van Gool¹•Institutions (1)

ETH Zurich¹

01 Dec 2013

TL;DR: A novel and very efficient method for generic object detection based on a randomized version of Prim's algorithm, using the connectivity graph of an image's super pixels, with weights modelling the probability that neighbouring super pixels belong to the same object.

...read moreread less

Abstract: Generic object detection is the challenging task of proposing windows that localize all the objects in an image, regardless of their classes. Such detectors have recently been shown to benefit many applications such as speeding-up class-specific object detection, weakly supervised learning of object detectors and object discovery. In this paper, we introduce a novel and very efficient method for generic object detection based on a randomized version of Prim's algorithm. Using the connectivity graph of an image's super pixels, with weights modelling the probability that neighbouring super pixels belong to the same object, the algorithm generates random partial spanning trees with large expected sum of edge weights. Object localizations are proposed as bounding-boxes of those partial trees. Our method has several benefits compared to the state-of-the-art. Thanks to the efficiency of Prim's algorithm, it samples proposals very quickly: 1000 proposals are obtained in about 0.7s. With proposals bound to super pixel boundaries yet diversified by randomization, it yields very high detection rates and windows that tightly fit objects. In extensive experiments on the challenging PASCAL VOC 2007 and 2012 and SUN2012 benchmark datasets, we show that our method improves over state-of-the-art competitors for a wide range of evaluation scenarios.

...read moreread less

340 citations

Proceedings Article•DOI•

Online Video SEEDS for Temporal Window Objectness

[...]

Michael Van den Bergh¹, Gemma Roig¹, Xavier Boix¹, Santiago Manen¹, Luc Van Gool¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

01 Dec 2013

TL;DR: An online, real-time video super pixel algorithm based on the recently proposed SEEDS super pixels is introduced and a new capability is incorporated which delivers multiple diverse samples (hypotheses) of super pixels in the same image or video sequence.

...read moreread less

Abstract: Super pixel and objectness algorithms are broadly used as a pre-processing step to generate support regions and to speed-up further computations. Recently, many algorithms have been extended to video in order to exploit the temporal consistency between frames. However, most methods are computationally too expensive for real-time applications. We introduce an online, real-time video super pixel algorithm based on the recently proposed SEEDS super pixels. A new capability is incorporated which delivers multiple diverse samples (hypotheses) of super pixels in the same image or video sequence. The multiple samples are shown to provide a strong cue to efficiently measure the objectness of image windows, and we introduce the novel concept of objectness in temporal windows. Experiments show that the video super pixels achieve comparable performance to state-of-the-art offline methods while running at 30 fps on a single 2.8 GHz i7 CPU. State-of-the-art performance on objectness is also demonstrated, yet orders of magnitude faster and extended to temporal windows in video.

...read moreread less

95 citations

Proceedings Article•DOI•

PathTrack: Fast Trajectory Annotation with Path Supervision

[...]

Santiago Manen¹, Michael Gygli¹, Dengxin Dai¹, Luc Van Gool¹•Institutions (1)

ETH Zurich¹

01 Oct 2017

TL;DR: In this article, a path supervision framework is proposed to annotate trajectories and use it to produce a MOT dataset of unprecedented size, with more than 15,000 person trajectories in 720 sequences.

...read moreread less

Abstract: Progress in Multiple Object Tracking (MOT) has been historically limited by the size of the available datasets. We present an efficient framework to annotate trajectories and use it to produce a MOT dataset of unprecedented size. In our novel path supervision the annotator loosely follows the object with the cursor while watching the video, providing a path annotation for each object in the sequence. Our approach is able to turn such weak annotations into dense box trajectories. Our experiments on existing datasets prove that our framework produces more accurate annotations than the state of the art, in a fraction of the time. We further validate our approach by crowdsourcing the PathTrack dataset, with more than 15,000 person trajectories in 720 sequences. Tracking approaches can benefit training on such large-scale datasets, as did object recognition. We prove this by re-training an off-the-shelf person matching network, originally trained on the MOT15 dataset, almost halving the misclassification rate. Additionally, training on our data consistently improves tracking results, both on our dataset and on MOT15. On the latter, we improve the top-performing tracker (NOMT) dropping the number of ID Switches by 18% and fragments by 5%.

...read moreread less

68 citations

Posted Content•

PathTrack: Fast Trajectory Annotation with Path Supervision

[...]

Santiago Manen¹, Michael Gygli¹, Dengxin Dai¹, Luc Van Gool¹•Institutions (1)

ETH Zurich¹

07 Mar 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents an efficient framework to annotate trajectories and uses it to produce a MOT dataset of unprecedented size, originally trained on the MOT15 dataset, and proves that this framework produces more accurate annotations than the state of the art, in a fraction of the time.

...read moreread less

33 citations

Proceedings Article•DOI•

Learning Audio-Video Modalities from Image Captions

[...]

Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid - Show less +3 more

01 Apr 2022

TL;DR: In this article , a video mining pipeline is proposed to transfer captions from image captioning datasets to video clips with no additional manual effort, which achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining.

...read moreread less

Abstract: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are in the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformed based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state of the art results for the task of audio retrieval.

...read moreread less

26 citations

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

[...]

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.

...read moreread less

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

...read moreread less

30,811 citations

The PASCAL Visual Object Classes Challenge

[...]

Jianguo Zhang

01 Jan 2006

3,012 citations

Book Chapter•DOI•

Edge Boxes: Locating Object Proposals from Edges

[...]

C. Lawrence Zitnick¹, Piotr Dollár¹•Institutions (1)

Microsoft¹

06 Sep 2014

TL;DR: A novel method for generating object bounding box proposals using edges is proposed, showing results that are significantly more accurate than the current state-of-the-art while being faster to compute.

...read moreread less

Abstract: The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box’s boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.

...read moreread less

2,892 citations

Journal Article•DOI•

Deep Learning for Generic Object Detection: A Survey

[...]

Li Liu¹, Li Liu², Wanli Ouyang³, Xiaogang Wang⁴, Paul Fieguth⁵, Jie Chen², Xinwang Liu¹, Matti Pietikäinen² - Show less +4 more•Institutions (5)

National University of Defense Technology¹, University of Oulu², University of Sydney³, The Chinese University of Hong Kong⁴, University of Waterloo⁵

01 Feb 2020-International Journal of Computer Vision

TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.

...read moreread less

Abstract: Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.

...read moreread less

1,897 citations

Proceedings Article•DOI•

BING: Binarized Normed Gradients for Objectness Estimation at 300fps

[...]

Ming-Ming Cheng¹, Ziming Zhang², Wen-Yan Lin, Philip H. S. Torr¹•Institutions (2)

University of Oxford¹, Boston University²

23 Jun 2014

TL;DR: It is observed that generic objects with well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows in to a small fixed size, so as to train a generic objectness measure.

...read moreread less

Abstract: Training a generic objectness measure to produce a small set of candidate object windows, has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows in to a small fixed size. Based on this observation and computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding 96.2% object detection rate (DR) with 1, 000 proposals. Increasing the numbers of proposals and color spaces for computing BING features, our performance can be further improved to 99.5% DR.

...read moreread less

1,034 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107

Collapse