scispace - formally typeset
Author

Umer Rafi

Other affiliations: University of Bonn
Bio: Umer Rafi is an academic researcher from RWTH Aachen University. The author has contributed to research in the topics of Pose and Deep learning, has an h-index of 6, and has co-authored 10 publications receiving 372 citations. Previous affiliations of Umer Rafi include the University of Bonn.

Papers
Book ChapterDOI
24 Jun 2015
TL;DR: How the SPENCER project advances the fields of detection and tracking of individuals and groups, recognition of human social relations and activities, normative human behavior learning, socially-aware task and motion planning, learning socially annotated maps, and conducting empirical experiments to assess socio-psychological effects of normative robot behaviors is described.
Abstract: We present an ample description of a socially compliant mobile robotic platform, which is developed in the EU-funded project SPENCER. The purpose of this robot is to assist, inform and guide passengers in large and busy airports. One particular aim is to bring travellers of connecting flights conveniently and efficiently from their arrival gate to the passport control. The uniqueness of the project stems from the strong demand of service robots for this application with a large potential impact for the aviation industry on one side, and on the other side from the scientific advancements in social robotics, brought forward and achieved in SPENCER. The main contributions of SPENCER are novel methods to perceive, learn, and model human social behavior and to use this knowledge to plan appropriate actions in real-time for mobile platforms. In this paper, we describe how the project advances the fields of detection and tracking of individuals and groups, recognition of human social relations and activities, normative human behavior learning, socially-aware task and motion planning, learning socially annotated maps, and conducting empirical experiments to assess socio-psychological effects of normative robot behaviors.

240 citations

Proceedings ArticleDOI
01 Jan 2016
TL;DR: An efficient deep network architecture is proposed that is trained efficiently with a transparent procedure and exploits the best available ingredients from deep learning with low computational budget and achieves impressive performance on popular benchmarks in human pose estimation.
Abstract: In recent years, human pose estimation has greatly benefited from deep learning, and huge gains in performance have been achieved. However, to push for maximum performance, recent approaches exploit computationally expensive deep network architectures, train on multiple datasets, apply additional post-processing, and provide limited details about the design choices used. This makes it hard not only to compare different methods but also to reproduce existing results. In this work, we propose an efficient deep network architecture that is trained efficiently with a transparent procedure and exploits the best available ingredients from deep learning with a low computational budget. The network is trained only on the target dataset without pre-training and achieves impressive performance on popular benchmarks in human pose estimation.

128 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A semantic occlusion model is introduced that is incorporated into a regression forest approach for human pose estimation from depth data and shows that it increases the joint estimation accuracy and outperforms the commercial Kinect 2 SDK for occluded joints.
Abstract: Human pose estimation from depth data has made significant progress in recent years and commercial sensors estimate human poses in real-time. However, state-of-the-art methods fail in many situations when the humans are partially occluded by objects. In this work, we introduce a semantic occlusion model that is incorporated into a regression forest approach for human pose estimation from depth data. The approach exploits the context information of occluding objects like a table to predict the locations of occluded joints. In our experiments on synthetic and real data, we show that our occlusion model increases the joint estimation accuracy and outperforms the commercial Kinect 2 SDK for occluded joints.

57 citations

Book ChapterDOI
TL;DR: This work proposes an approach that relies on keypoint correspondences for associating persons in videos and achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.
Abstract: Video annotation is expensive and time consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have sparser annotations compared to large scale image datasets for human pose estimation. This makes it challenging to learn deep learning based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions for the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on a large scale image dataset for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoint correspondences to (i) recover missed pose detections and to (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.

16 citations

Book ChapterDOI
23 Aug 2020
TL;DR: In this paper, the authors propose an approach that relies on keypoint correspondences for associating persons in videos, which is trained on a large scale image dataset for human pose estimation using self-supervision.
Abstract: Video annotation is expensive and time consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have more sparse annotations compared to large scale image datasets for human pose estimation. This makes it challenging to learn deep learning based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions for the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on a large scale image dataset for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoint correspondences to (i) recover missed pose detections and to (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and 2018 datasets.
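The cross-frame association step described in this abstract can be sketched as matching over a similarity matrix. Note this is an illustrative assumption, not the authors' implementation: the paper scores pairs via self-supervised keypoint correspondences, while the `greedy_associate` helper below simply matches greedily by descending similarity.

```python
import numpy as np

def greedy_associate(sim):
    """Greedily match rows (tracked poses in the previous frame) to columns
    (pose detections in the current frame) by descending similarity score.
    Each row and each column is used at most once."""
    sim = np.asarray(sim, dtype=float)
    matches, used_r, used_c = [], set(), set()
    # visit all (row, col) pairs from highest to lowest similarity
    for idx in np.argsort(-sim, axis=None):
        r, c = divmod(int(idx), sim.shape[1])
        if r not in used_r and c not in used_c:
            matches.append((r, c))
            used_r.add(r)
            used_c.add(c)
    return matches
```

A real tracker would additionally threshold low similarities and spawn new tracks for unmatched detections; those details are omitted here.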

11 citations


Cited by
Posted Content
TL;DR: It is shown that, for models trained from scratch as well as pretrained ones, using a variant of the triplet loss to perform end-to-end deep metric learning outperforms most other published methods by a large margin.
Abstract: In the past few years, the field of computer vision has gone through a revolution fueled mainly by the advent of large datasets and the adoption of deep convolutional neural networks for end-to-end learning. The person re-identification subfield is no exception to this. Unfortunately, a prevailing belief in the community seems to be that the triplet loss is inferior to using surrogate losses (classification, verification) followed by a separate metric learning step. We show that, for models trained from scratch as well as pretrained ones, using a variant of the triplet loss to perform end-to-end deep metric learning outperforms most other published methods by a large margin.
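For illustration, the triplet loss this abstract defends can be written down in a few lines. This is a minimal NumPy sketch of the standard formulation with Euclidean distances; the function name and margin value are assumptions, and the paper's batch-hard mining variant and the embedding network itself are not shown.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors: pull the anchor toward
    the positive (same identity) and push it away from the negative
    (different identity) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)    # hinge: zero once separated
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss is zero and the triplet contributes no gradient, which is why hard-triplet mining matters in practice.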

2,679 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: Stacked hourglass networks are adopted to generate attention maps from features at multiple resolutions with various semantics, and novel Hourglass Residual Units (HRUs) are designed to increase the receptive field of the network.
Abstract: In this paper, we propose to incorporate convolutional neural networks with a multi-context attention mechanism into an end-to-end framework for human pose estimation. We adopt stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. The Conditional Random Field (CRF) is utilized to model the correlations among neighboring regions in the attention map. We further combine the holistic attention model, which focuses on the global consistency of the full human body, and the body part attention model, which focuses on detailed descriptions for different body parts. Hence our model has the ability to focus on different granularity from local salient regions to global semantic consistent spaces. Additionally, we design novel Hourglass Residual Units (HRUs) to increase the receptive field of the network. These units are extensions of residual units with a side branch incorporating filters with larger receptive field, hence features with various scales are learned and combined within the HRUs. The effectiveness of the proposed multi-context attention mechanism and the hourglass residual units is evaluated on two widely used human pose estimation benchmarks. Our approach outperforms all existing methods on both benchmarks over all the body parts. Code has been made publicly available.

543 citations

Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the non-differentiable post-processing and quantization error of human pose estimation.
Abstract: State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable post-processing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
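The integral operation this abstract describes replaces the non-differentiable argmax over a heatmap with an expectation over its softmax-normalized values (often called soft-argmax). A minimal NumPy sketch for a single 2D heatmap follows; the function name is an assumption, and the paper's multi-joint, 3D, and end-to-end training details are omitted.

```python
import numpy as np

def soft_argmax(heatmap):
    """Differentiable joint localization: return the expected (x, y)
    location under the softmax-normalized heatmap, avoiding both the
    non-differentiable argmax and its quantization to integer pixels."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # stable softmax over all pixels
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    return float((p * xs).sum()), float((p * ys).sum())
```

Because the output is a weighted sum of coordinates, it can fall between pixels, which is what removes the quantization error mentioned in the abstract.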

536 citations

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, the authors explore 3D human pose estimation from a single RGB image, using a simple architecture that reasons through intermediate 2D pose predictions, and demonstrate that their approach outperforms almost all state-of-the-art 3D pose estimation systems.
Abstract: We explore 3D human pose estimation from a single RGB image. While many approaches try to directly predict 3D pose from image measurements, we explore a simple architecture that reasons through intermediate 2D pose predictions. Our approach is based on two key observations: (1) deep neural nets have revolutionized 2D pose estimation, producing accurate 2D predictions even for poses with self-occlusions; (2) big datasets of 3D mocap data are now readily available, making it tempting to lift predicted 2D poses to 3D through simple memorization (e.g., nearest neighbors). The resulting architecture is straightforward to implement with off-the-shelf 2D pose estimation systems and 3D mocap libraries. Importantly, we demonstrate that such methods outperform almost all state-of-the-art 3D pose estimation systems, most of which directly try to regress 3D pose from 2D measurements.
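The "simple memorization" idea the abstract alludes to can be sketched as a nearest-neighbor lookup into a mocap library: find the library pose whose 2D projection best matches the predicted 2D pose and return its 3D joints. This is an illustrative sketch under assumed array shapes, not the paper's learned lifting network.

```python
import numpy as np

def lift_2d_to_3d(pred_2d, mocap_2d, mocap_3d):
    """Lift a predicted 2D pose to 3D by nearest-neighbor lookup.

    pred_2d:  (J, 2) predicted 2D joint locations
    mocap_2d: (N, J, 2) 2D projections of N library poses
    mocap_3d: (N, J, 3) corresponding 3D poses
    Returns the 3D pose whose 2D projection is closest to pred_2d."""
    dists = np.linalg.norm(mocap_2d - pred_2d, axis=(1, 2))  # per-pose distance
    return mocap_3d[np.argmin(dists)]
```

In practice such a lookup would be preceded by normalization (root-centering, scale) so that 2D distances are comparable across poses and cameras.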

465 citations

Proceedings ArticleDOI
26 Feb 2018
TL;DR: It is shown that a single architecture can be used to solve the two problems in an efficient way and still achieves state-of-the-art results, and that optimization from end-to-end leads to significantly higher accuracy than separated learning.
Abstract: Action recognition and human pose estimation are closely related, but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separate learning. The proposed architecture can be trained seamlessly with data from different categories simultaneously. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.

455 citations