Home
/
Authors
/
Vahid Kazemi

Author

Vahid Kazemi

Other affiliations: Google

Bio: Vahid Kazemi is an academic researcher from Royal Institute of Technology. The author has contributed to research in topics: Object detection & Pose. The author has an hindex of 6, co-authored 8 publications receiving 2370 citations. Previous affiliations of Vahid Kazemi include Google.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

One Millisecond Face Alignment with an Ensemble of Regression Trees

[...]

Vahid Kazemi¹, Josephine Sullivan¹•Institutions (1)

Royal Institute of Technology¹

23 Jun 2014

TL;DR: It is shown how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions.

...read moreread less

Abstract: This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and its importance to combat overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.

...read moreread less

2,545 citations

Posted Content•

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

[...]

Vahid Kazemi, Ali Elqursh

11 Apr 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both unbalanced and balanced VQA benchmark.

...read moreread less

Abstract: This paper presents a new baseline for visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both unbalanced and balanced VQA benchmark. On VQA 1.0 open ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over state of the art, and on newly released VQA 2.0, our model scores 59.7% on validation set outperforming best previously reported results by 0.5%. The results presented in this paper are especially interesting because very similar models have been tried before but significantly lower performance were reported. In light of the new results we hope to see more meaningful research on visual question answering in the future.

...read moreread less

168 citations

Proceedings Article•DOI•

Multi-view Body Part Recognition with Random Forests.

[...]

Vahid Kazemi¹, Magnus Burenius¹, Hossein Azizpour¹, Josephine Sullivan¹•Institutions (1)

Royal Institute of Technology¹

01 Jan 2013

TL;DR: This thesis presents practical approaches for tackling the corre-spondence estimation problem with an emphasis on deformable objects and a hybrid generative/discriminative approach is used to perform accurate correspondence estimation in real-time.

...read moreread less

Abstract: Many computer vision tasks such as object detection, pose estimation,and alignment are directly related to the estimation of correspondences overinstances of an object class. Other tasks such as image classification andverification if not completely solved can largely benefit from correspondenceestimation. This thesis presents practical approaches for tackling the corre-spondence estimation problem with an emphasis on deformable objects.Different methods presented in this thesis greatly vary in details but theyall use a combination of generative and discriminative modeling to estimatethe correspondences from input images in an efficient manner. While themethods described in this work are generic and can be applied to any object,two classes of objects of high importance namely human body and faces arethe subjects of our experimentations.When dealing with human body, we are mostly interested in estimating asparse set of landmarks – specifically we are interested in locating the bodyjoints. We use pictorial structures to model the articulation of the body partsgeneratively and learn efficient discriminative models to localize the parts inthe image. This is a common approach explored by many previous works. Wefurther extend this hybrid approach by introducing higher order terms to dealwith the double-counting problem and provide an algorithm for solving theresulting non-convex problem efficiently. In another work we explore the areaof multi-view pose estimation where we have multiple calibrated cameras andwe are interested in determining the pose of a person in 3D by aggregating2D information. This is done efficiently by discretizing the 3D search spaceand use the 3D pictorial structures model to perform the inference.In contrast to the human body, faces have a much more rigid structureand it is relatively easy to detect the major parts of the face such as eyes,nose and mouth, but performing dense correspondence estimation on facesunder various poses and lighting conditions is still challenging. In a first workwe deal with this variation by partitioning the face into multiple parts andlearning separate regressors for each part. In another work we take a fullydiscriminative approach and learn a global regressor from image to landmarksbut to deal with insufficiency of training data we augment it by a large numberof synthetic images. While we have shown great performance on the standardface datasets for performing correspondence estimation, in many scenariosthe RGB signal gets distorted as a result of poor lighting conditions andbecomes almost unusable. This problem is addressed in another work wherewe explore use of depth signal for dense correspondence estimation. Hereagain a hybrid generative/discriminative approach is used to perform accuratecorrespondence estimation in real-time.

...read moreread less

88 citations

Proceedings Article•DOI•

Real-Time Face Reconstruction from a Single Depth Image

[...]

Vahid Kazemi¹, Cem Keskin², Jonathan Taylor², Pushmeet Kohli², Shahram Izadi² - Show less +1 more•Institutions (2)

Royal Institute of Technology¹, Microsoft²

08 Dec 2014

TL;DR: A real time method for recovering facial shape and expression from a single depth image, using a discriminatively trained prediction pipeline that employs random forests to generate an initial dense but noisy correspondence field and a robust initial fit of the model.

...read moreread less

Abstract: This paper contributes a real time method for recovering facial shape and expression from a single depth image. The method also estimates an accurate and dense correspondence field between the input depth image and a generic face model. Both outputs are a result of minimizing the error in reconstructing the depth image, achieved by applying a set of identity and expression blend shapes to the model. Traditionally, such a generative approach has shown to be computationally expensive and non-robust because of the non-linear nature of the reconstruction error. To overcome this problem, we use a discriminatively trained prediction pipeline that employs random forests to generate an initial dense but noisy correspondence field. Our method then exploits a fast ICP-like approximation to update these correspondences, allowing us to quickly obtain a robust initial fit of our model. The model parameters are then fine tuned to minimize the true reconstruction error using a stochastic optimization technique. The correspondence field resulting from our hybrid generative-discriminative pipeline is accurate and useful for a variety of applications such as mesh deformation and retexturing. Our method works in real-time on a single depth image i.e. Without temporal tracking, is free from per-user calibration, and works in low-light conditions.

...read moreread less

35 citations

Proceedings Article•DOI•

Face Alignment with Part-Based Modeling

[...]

Vahid Kazemi¹, Josephine Sullivan¹•Institutions (1)

Royal Institute of Technology¹

01 Jan 2011

TL;DR: A new method for face alignment with part-based modeling is proposed that is competitive in terms of precision with existing methods such as Active Appearance Models, but is more robust and more robust than existing methods.

...read moreread less

Abstract: We propose a new method for face alignment with part-based modeling. This method is competitive in terms of precision with existing methods such as Active Appearance Models, but is more robust and ...

...read moreread less

19 citations

Cited by

PDF

Open Access

More filters

On robust estimation of the location parameter

[...]

Frederick R. Forst

01 Jan 1980

3,652 citations

Proceedings Article•DOI•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

18 Jun 2018

TL;DR: In this paper, a bottom-up and top-down attention mechanism was proposed to enable attention to be calculated at the level of objects and other salient image regions, which achieved state-of-the-art results on the MSCOCO test server.

...read moreread less

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

...read moreread less

2,904 citations

Proceedings Article•DOI•

One Millisecond Face Alignment with an Ensemble of Regression Trees

[...]

Vahid Kazemi¹, Josephine Sullivan¹•Institutions (1)

Royal Institute of Technology¹

23 Jun 2014

...read moreread less

2,545 citations

Posted Content•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

25 Jul 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.

...read moreread less

2,248 citations

Posted Content•

A Style-Based Generator Architecture for Generative Adversarial Networks

[...]

Tero Karras¹, Samuli Laine¹, Timo Aila¹•Institutions (1)

Nvidia¹

12 Dec 2018-arXiv: Neural and Evolutionary Computing

TL;DR: This article proposed an alternative generator architecture for GANs, borrowing from style transfer literature, which leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images.

...read moreread less

Abstract: We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.

...read moreread less

1,612 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse