Showing papers on "Pose published in 2018"

PDF

Open Access

Proceedings Article•DOI•

[...]

Xiaolong Wang¹, Ross Girshick¹, Abhinav Gupta², Kaiming He¹•Institutions (2)

18 Jun 2018

TL;DR: In this article, the non-local operation computes the response at a position as a weighted sum of the features at all positions, which can be used to capture long-range dependencies.

...read moreread less

Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our nonlocal models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.

...read moreread less

8,059 citations

Journal Article•DOI•

Recent advances in convolutional neural networks

[...]

Jiuxiang Gu¹, Zhenhua Wang¹, Jason Kuen¹, Lianyang Ma¹, Amir Shahroudy¹, Bing Shuai¹, Ting Liu¹, Xingxing Wang¹, Gang Wang¹, Jianfei Cai¹, Tsuhan Chen¹ - Show less +7 more•Institutions (1)

Nanyang Technological University¹

01 May 2018-Pattern Recognition

TL;DR: A broad survey of the recent advances in convolutional neural networks can be found in this article, where the authors discuss the improvements of CNN on different aspects, namely, layer design, activation function, loss function, regularization, optimization and fast computation.

...read moreread less

3,125 citations

Proceedings Article•DOI•

VGGFace2: A Dataset for Recognising Faces across Pose and Age

[...]

Qiong Cao¹, Li Shen¹, Weidi Xie¹, Omkar M. Parkhi¹, Andrew Zisserman¹ - Show less +1 more•Institutions (1)

University of Oxford¹

15 May 2018

TL;DR: VGGFace2 as discussed by the authors is a large-scale face dataset with 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject.

...read moreread less

Abstract: In this paper, we introduce a new large-scale face dataset named VGGFace2. The dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession (e.g. actors, athletes, politicians). The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimise the label noise. We describe how the dataset was collected, in particular the automated and manual filtering stages to ensure a high accuracy for the images of each identity. To assess face recognition performance using the new dataset, we train ResNet-50 (with and without Squeeze-and-Excitation blocks) Convolutional Neural Networks on VGGFace2, on MS-Celeb-1M, and on their union, and show that training on VGGFace2 leads to improved recognition performance over pose and age. Finally, using the models trained on these datasets, we demonstrate state-of-the-art performance on the IJB-A and IJB-B face recognition benchmarks, exceeding the previous state-of-the-art by a large margin. The dataset and models are publicly available.

...read moreread less

2,365 citations

Journal Article•DOI•

DeepLabCut: markerless pose estimation of user-defined body parts with deep learning

[...]

Alexander Mathis¹, Pranav Mamidanna¹, Kevin M. Cury², Taiga Abe², Venkatesh N. Murthy³, Mackenzie W. Mathis³, Mackenzie W. Mathis¹, Matthias Bethge - Show less +4 more•Institutions (3)

University of Tübingen¹, Columbia University², Harvard University³

20 Aug 2018-Nature Neuroscience

TL;DR: Using a deep learning approach to track user-defined body parts during various behaviors across multiple species, the authors show that their toolbox, called DeepLabCut, can achieve human accuracy with only a few hundred frames of training data.

...read moreread less

Abstract: Quantifying behavior is crucial for many applications in neuroscience. Videography provides easy methods for the observation and recording of animal behavior in diverse settings, yet extracting particular aspects of a behavior for further analysis can be highly time consuming. In motor control studies, humans or other animals are often marked with reflective markers to assist with computer-based tracking, but markers are intrusive, and the number and location of the markers must be determined a priori. Here we present an efficient method for markerless pose estimation based on transfer learning with deep neural networks that achieves excellent results with minimal training data. We demonstrate the versatility of this framework by tracking various body parts in multiple species across a broad collection of behaviors. Remarkably, even when only a small number of frames are labeled (~200), the algorithm achieves excellent tracking performance on test frames that is comparable to human accuracy.

...read moreread less

2,303 citations

Journal Article•DOI•

Deep Learning for Computer Vision: A Brief Review.

[...]

Athanasios Voulodimos¹, Nikolaos Doulamis², Anastasios Doulamis², Eftychios Protopapadakis²•Institutions (2)

Technological Educational Institute of Athens¹, National Technical University of Athens²

01 Feb 2018-Computational Intelligence and Neuroscience

TL;DR: A brief overview of some of the most significant deep learning schemes used in computer vision problems, that is, Convolutional Neural Networks, Deep Boltzmann Machines and Deep Belief Networks, and Stacked Denoising Autoencoders are provided.

...read moreread less

Abstract: Over the last years deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases. This review paper provides a brief overview of some of the most significant deep learning schemes used in computer vision problems, that is, Convolutional Neural Networks, Deep Boltzmann Machines and Deep Belief Networks, and Stacked Denoising Autoencoders. A brief account of their history, structure, advantages, and limitations is given, followed by a description of their applications in various computer vision tasks, such as object detection, face recognition, action and activity recognition, and human pose estimation. Finally, a brief overview is given of future directions in designing deep learning schemes for computer vision problems and the challenges involved therein.

...read moreread less

1,970 citations

Proceedings Article•DOI•

Frustum PointNets for 3D Object Detection from RGB-D Data

[...]

Charles R. Qi¹, Wei Liu, Chenxia Wu, Hao Su², Leonidas J. Guibas¹ - Show less +1 more•Institutions (2)

Stanford University¹, University of California, San Diego²

18 Jun 2018

TL;DR: This work directly operates on raw point clouds by popping up RGBD scans and leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects.

...read moreread less

Abstract: In this work, we study 3D object detection from RGBD data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.

...read moreread less

1,947 citations

Book Chapter•DOI•

Simple Baselines for Human Pose Estimation and Tracking

[...]

Bin Xiao¹, Haiping Wu², Yichen Wei¹•Institutions (2)

Microsoft¹, University of Electronic Science and Technology of China²

08 Sep 2018

TL;DR: In this article, the authors provide simple and effective baseline methods for pose estimation, which are helpful for inspiring and evaluating new ideas for the field and achieve state-of-the-art results on challenging benchmarks.

...read moreread less

Abstract: There has been significant progress on pose estimation and increasing interests on pose tracking in recent years. At the same time, the overall algorithm and system complexity increases as well, making the algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch.

...read moreread less

1,434 citations

Proceedings Article•DOI•

Cascaded Pyramid Network for Multi-person Pose Estimation

[...]

Yilun Chen, Zhicheng Wang¹, Yuxiang Peng¹, Zhiqiang Zhang², Gang Yu³, Jian Sun⁴ - Show less +2 more•Institutions (4)

Tsinghua University¹, Huazhong University of Science and Technology², Beijing Institute of Technology³, Peking University⁴

01 Jun 2018

TL;DR: A novel network structure called Cascaded Pyramid Network (CPN) is presented which targets to relieve the problem from these "hard" keypoints, with state-of-art results on the COCO keypoint benchmark, with average precision at 73.0.

...read moreread less

Abstract: The topic of multi-person pose estimation has been largely improved recently, especially with the development of convolutional neural network. However, there still exist a lot of challenging cases, such as occluded keypoints, invisible keypoints and complex background, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN) which targets to relieve the problem from these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet tries explicitly handling the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-art results on the COCO keypoint benchmark, with average precision at 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19% relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code1 and the detection results for person used will be publicly available for further research.

...read moreread less

1,257 citations

Proceedings Article•DOI•

OpenFace 2.0: Facial Behavior Analysis Toolkit

[...]

Tadas Baltrusaitis¹, Amir Zadeh², Yao Chong Lim², Louis-Philippe Morency²•Institutions (2)

Microsoft¹, Carnegie Mellon University²

15 May 2018

TL;DR: OpenFace 2.0 is an extension of OpenFace toolkit and is capable of more accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation.

...read moreread less

Abstract: Over the past few years, there has been an increased interest in automatic facial behavior analysis and understanding. We present OpenFace 2.0 - a tool intended for computer vision and machine learning researchers, affective computing community and people interested in building interactive applications based on facial behavior analysis. OpenFace 2.0 is an extension of OpenFace toolkit and is capable of more accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. The computer vision algorithms which represent the core of OpenFace 2.0 demonstrate state-of-the-art results in all of the above mentioned tasks. Furthermore, our tool is capable of real-time performance and is able to run from a simple webcam without any specialist hardware. Finally, unlike a lot of modern approaches or toolkits, OpenFace 2.0 source code for training models and running them is freely available for research purposes.

...read moreread less

1,107 citations

Proceedings Article•DOI•

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

[...]

Yu Xiang¹, Tanner Schmidt², Venkatraman Narayanan³, Dieter Fox²•Institutions (3)

Nvidia¹, University of Washington², Carnegie Mellon University³

26 Jun 2018

TL;DR: PoseCNN as discussed by the authors estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera, and regresses to a quaternion representation.

...read moreread less

Abstract: Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at this https URL.

...read moreread less

1,041 citations

Proceedings Article•DOI•

DensePose: Dense Human Pose Estimation in the Wild

[...]

Riza Alp Guler¹, Natalia Neverova², Iasonas Kokkinos³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, Facebook², University College London³

01 Feb 2018

TL;DR: This work establishes dense correspondences between an RGB image and a surface-based representation of the human body, a task referred to as dense human pose estimation, and improves accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu.

...read moreread less

Abstract: In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.

...read moreread less

Posted Content•

OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

[...]

Zhe Cao¹, Gines Hidalgo², Tomas Simon³, Shih-En Wei³, Yaser Sheikh² - Show less +1 more•Institutions (3)

University of California, Berkeley¹, Carnegie Mellon University², Facebook³

18 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: OpenPose is released, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, and the first combined body and foot keypoint detector, based on an internal annotated foot dataset.

...read moreread less

Abstract: Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

...read moreread less

Proceedings Article•DOI•

LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain

[...]

Tixiao Shan¹, Brendan Englot¹•Institutions (1)

Stevens Institute of Technology¹

01 Oct 2018

TL;DR: A lightweight and ground-optimized lidar odometry and mapping method, LeGO-LOAM, for realtime six degree-of-freedom pose estimation with ground vehicles and integrated into a SLAM framework to eliminate the pose estimation error caused by drift is integrated.

...read moreread less

Abstract: We propose a lightweight and ground-optimized lidar odometry and mapping method, LeGO-LOAM, for realtime six degree-of-freedom pose estimation with ground vehicles. LeGO-LOAM is lightweight, as it can achieve realtime pose estimation on a low-power embedded system. LeGO-LOAM is ground-optimized, as it leverages the presence of a ground plane in its segmentation and optimization steps. We first apply point cloud segmentation to filter out noise, and feature extraction to obtain distinctive planar and edge features. A two-step Levenberg-Marquardt optimization method then uses the planar and edge features to solve different components of the six degree-of-freedom transformation across consecutive scans. We compare the performance of LeGO-LOAM with a state-of-the-art method, LOAM, using datasets gathered from variable-terrain environments with ground vehicles, and show that LeGO-LOAM achieves similar or better accuracy with reduced computational expense. We also integrate LeGO-LOAM into a SLAM framework to eliminate the pose estimation error caused by drift, which is tested using the KITTI dataset.

...read moreread less

Proceedings Article•DOI•

Real-Time Seamless Single Shot 6D Object Pose Prediction

[...]

Bugra Tekin¹, Sudipta N. Sinha², Pascal Fua¹•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Microsoft²

18 Jun 2018

TL;DR: A single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses is proposed, which substantially outperforms other recent CNN-based approaches when they are all used without postprocessing.

...read moreread less

Abstract: We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [10] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster - 50 fps on a Titan X (Pascal) GPU - and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [27, 28] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent CNN-based approaches [10, 25] when they are all used without postprocessing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.

...read moreread less

Proceedings Article•DOI•

Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions

[...]

Torsten Sattler¹, Will Maddern², Carl Toft³, Akihiko Torii⁴, Lars Hammarstrand³, Erik Stenborg³, Daniel Safari⁵, Daniel Safari⁴, Masatoshi Okutomi⁴, Marc Pollefeys¹, Marc Pollefeys⁶, Josef Sivic⁷, Fredrik Kahl⁸, Fredrik Kahl³, Tomas Pajdla⁷ - Show less +11 more•Institutions (8)

ETH Zurich¹, University of Oxford², Chalmers University of Technology³, Tokyo Institute of Technology⁴, Technical University of Denmark⁵, Microsoft⁶, Czech Technical University in Prague⁷, Lund University⁸

18 Jun 2018

TL;DR: This paper introduces the first benchmark datasets specifically designed for analyzing the impact of day-night changes, weather and seasonal variations, as well as sequence-based localization approaches and the need for better local features on visual localization.

...read moreread less

Abstract: Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds. Practical visual localization approaches need to be robust to a wide variety of viewing condition, including day-night changes, as well as weather and seasonal variations, while providing highly accurate 6 degree-of-freedom (6DOF) camera pose estimates. In this paper, we introduce the first benchmark datasets specifically designed for analyzing the impact of such factors on visual localization. Using carefully created ground truth poses for query images taken under a wide variety of conditions, we evaluate the impact of various factors on 6DOF camera pose estimation accuracy through extensive experiments with state-of-the-art localization approaches. Based on our results, we draw conclusions about the difficulty of different conditions, showing that long-term localization is far from solved, and propose promising avenues for future work, including sequence-based localization approaches and the need for better local features. Our benchmark is available at visuallocalization.net.

...read moreread less

Book Chapter•DOI•

Implicit 3D Orientation Learning for 6D Object Detection from RGB Images

[...]

Martin Sundermeyer¹, Zoltan-Csaba Marton¹, Maximilian Durner¹, Manuel Brucker¹, Rudolph Triebel¹ - Show less +1 more•Institutions (1)

German Aerospace Center¹

08 Sep 2018

TL;DR: This work proposes a real-time RGB-based pipeline for object detection and 6D pose estimation based on a variant of the Denoising Autoencoder trained on simulated views of a 3D model using Domain Randomization.

...read moreread less

Abstract: We propose a real-time RGB-based pipeline for object detection and 6D pose estimation. Our novel 3D orientation estimation is based on a variant of the Denoising Autoencoder that is trained on simulated views of a 3D model using Domain Randomization.

...read moreread less

Book Chapter•DOI•

Integral Human Pose Regression

[...]

Xiao Sun¹, Bin Xiao¹, Fangyin Wei², Shuang Liang³, Yichen Wei¹ - Show less +1 more•Institutions (3)

Microsoft¹, Peking University², Tongji University³

08 Sep 2018

TL;DR: In this paper, a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the non-differentiable post-processing and quantization error of human pose estimation.

...read moreread less

Abstract: State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable post-processing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.

...read moreread less

Proceedings Article•DOI•

Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation

[...]

Mohamed Omran¹, Christoph Lassner¹, Gerard Pons-Moll¹, Peter V. Gehler¹, Bernt Schiele¹ - Show less +1 more•Institutions (1)

Max Planck Society¹

17 Aug 2018

TL;DR: Neural Body Fitting (NBF) as discussed by the authors integrates a statistical body model as a layer within a CNN leveraging both reliable bottom-up body part segmentation and robust top-down body model constraints.

...read moreread less

Abstract: Direct prediction of 3D body pose and shape parameters remains a challenge even for highly parameterized, deep learning models. The representation of the prediction space is difficult to map to from the plain 2D image space, perspective ambiguities make the loss function noisy and training data is scarce. In this paper, we propose a novel approach (Neural Body Fitting (NBF)) that integrates a statistical body model as a layer within a CNN leveraging both reliable bottom-up body part segmentation and robust top-down body model constraints. NBF is fully differentiable and can be trained end-to-end from both 2D and 3D annotations. In detailed experiments we analyze how the components of our model improve model performance and present a robust, easy to use, end-to-end trainable framework for 3D human pose estimation from single 2D images.

...read moreread less

Journal Article•DOI•

Speeded up detection of squared fiducial markers

[...]

Francisco J. Romero-Ramirez¹, Rafael Muñoz-Salinas¹, Rafael Medina-Carnicer¹•Institutions (1)

University of Córdoba (Spain)¹

01 Aug 2018-Image and Vision Computing

TL;DR: This paper proposes a multi-scale strategy for speeding up marker detection in video sequences by wisely selecting the most appropriate scale for detection, identification and corner estimation.

...read moreread less

Book Chapter•DOI•

PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

[...]

George Papandreou¹, Tyler Zhu¹, Liang-Chieh Chen¹, Spyros Gidaris¹, Jonathan Tompson¹, Kevin Murphy¹ - Show less +2 more•Institutions (1)

Google¹

08 Sep 2018

TL;DR: In this article, a CNN is used to detect individual keypoints and predict their relative displacements, allowing them to group keypoints into person pose instances and then associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations.

...read moreread less

Abstract: We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.

...read moreread less

Proceedings Article•DOI•

GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB

[...]

Franziska Mueller¹, Florian Bernard¹, Oleksandr Sotnychenko¹, Dushyant Mehta¹, Srinath Sridhar², Dan Casas, Christian Theobalt¹ - Show less +3 more•Institutions (2)

Max Planck Society¹, Stanford University²

18 Jun 2018

TL;DR: This work proposes a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network, and uses a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images.

...read moreread less

Abstract: We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage.

...read moreread less

Proceedings Article•DOI•

Through-Wall Human Pose Estimation Using Radio Signals

[...]

Mingmin Zhao¹, Tianhong Li¹, Mohammad Abu Alsheikh¹, Yonglong Tian¹, Hang Zhao¹, Antonio Torralba¹, Dina Katabi¹ - Show less +3 more•Institutions (1)

Massachusetts Institute of Technology¹

18 Jun 2018

TL;DR: A deep neural network approach that parses wireless signals in the WiFi frequencies to estimate 2D poses through walls despite never trained on such scenarios, and shows that it is almost as accurate as the vision-based system used to train it.

...read moreread less

Abstract: This paper demonstrates accurate human pose estimation through walls and occlusions. We leverage the fact that wireless signals in the WiFi frequencies traverse walls and reflect off the human body. We introduce a deep neural network approach that parses such radio signals to estimate 2D poses. Since humans cannot annotate radio signals, we use state-of-the-art vision model to provide cross-modal supervision. Specifically, during training the system uses synchronized wireless and visual inputs, extracts pose information from the visual stream, and uses it to guide the training process. Once trained, the network uses only the wireless signal for pose estimation. We show that, when tested on visible scenes, the radio-based system is almost as accurate as the vision-based system used to train it. Yet, unlike vision-based pose estimation, the radio-based system can estimate 2D poses through walls despite never trained on such scenarios. Demo videos are available at our website.

...read moreread less

Proceedings Article•DOI•

Disentangled Person Image Generation

[...]

Liqian Ma¹, Qianru Sun², Stamatios Georgoulis¹, Luc Van Gool³, Bernt Schiele², Mario Fritz² - Show less +2 more•Institutions (3)

Katholieke Universiteit Leuven¹, Max Planck Society², ETH Zurich³

18 Jun 2018

TL;DR: A novel, two-stage reconstruction pipeline is proposed that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time and can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate targeted manipulations, that provide more control over the generation process.

...read moreread less

Abstract: Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor, respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on the Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task1.

...read moreread less

Proceedings Article•DOI•

2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning

[...]

Diogo C. Luvizon, David Picard¹, Hedi Tabia•Institutions (1)

Centre national de la recherche scientifique¹

26 Feb 2018

TL;DR: It is shown that a single architecture can be used to solve the two problems in an efficient way and still achieves state-of-the-art results, and that optimization from end-to-end leads to significantly higher accuracy than separated learning.

...read moreread less

Abstract: Action recognition and human pose estimation are closely related but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for jointly 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieves state-of-the-art results. Additionally, we demonstrate that optimization from end-to-end leads to significantly higher accuracy than separated learning. The proposed architecture can be trained with data from different categories simultaneously in a seamlessly way. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.

...read moreread less

Proceedings Article•DOI•

Attention-Aware Compositional Network for Person Re-identification

[...]

Jing Xu, Rui Zhao¹, Feng Zhu, Huaming Wang, Wanli Ouyang² - Show less +1 more•Institutions (2)

SenseTime¹, University of Sydney²

18 Jun 2018

TL;DR: Zhang et al. as mentioned in this paper proposed an Attention-Aware Compositional Network (AACN) for person ReID, which consists of two main components: Pose-guided Part Attention (PPA) and Attention-aware Feature Composition (AFC).

...read moreread less

Abstract: Person re-identification (ReID) is to identify pedestrians observed from different camera views based on visual appearance. It is a challenging task due to large pose variations, complex background clutters and severe occlusions. Recently, human pose estimation by predicting joint locations was largely improved in accuracy. It is reasonable to use pose estimation results for handling pose variations and background clutters, and such attempts have obtained great improvement in ReID performance. However, we argue that the pose information was not well utilized and hasn't yet been fully exploited for person ReID. In this work, we introduce a novel framework called Attention-Aware Compositional Network (AACN) for person ReID. AACN consists of two main components: Pose-guided Part Attention (PPA) and Attention-aware Feature Composition (AFC). PPA is learned and applied to mask out undesirable background features in pedestrian feature maps. Furthermore, pose-guided visibility scores are estimated for body parts to deal with part occlusion in the proposed AFC module. Extensive experiments with ablation analysis show the effectiveness of our method, and state-of-the-art results are achieved on several public datasets, including Market-1501, CUHK03, CUHK01, SenseReID, CUHK03-NP and DukeMTMC-reID.

...read moreread less

Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects

[...]

Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, Stan Birchfield - Show less +2 more

27 Sep 2018

TL;DR: This network is the first deep network trained only on synthetic data that is able to achieve state-of-the-art performance on 6-DoF object pose estimation and demonstrates a real-time system estimating object poses with sufficient accuracy for real-world semantic grasping of known household objects in clutter by a real robot.

...read moreread less

Abstract: Using synthetic data for training deep neural networks for robotic manipulation holds the promise of an almost unlimited amount of pre-labeled training data, generated safely out of harm's way. One of the key challenges of synthetic data, to date, has been to bridge the so-called reality gap, so that networks trained on synthetic data operate correctly when exposed to real-world data. We explore the reality gap in the context of 6-DoF pose estimation of known objects from a single RGB image. We show that for this problem the reality gap can be successfully spanned by a simple combination of domain randomized and photorealistic data. Using synthetic data generated in this manner, we introduce a one-shot deep neural network that is able to perform competitively against a state-of-the-art network trained on a combination of real and synthetic data. To our knowledge, this is the first deep network trained only on synthetic data that is able to achieve state-of-the-art performance on 6-DoF object pose estimation. Our network also generalizes better to novel environments including extreme lighting conditions, for which we show qualitative results. Using this network we demonstrate a real-time system estimating object poses with sufficient accuracy for real-world semantic grasping of known household objects in clutter by a real robot.

...read moreread less

Proceedings Article•DOI•

First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

[...]

Guillermo Garcia-Hernando¹, Shanxin Yuan¹, Seungryul Baek¹, Tae-Kyun Kim¹•Institutions (1)

Imperial College London¹

01 Jun 2018

TL;DR: This work collects RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations, and sees clear benefits of using hand pose as a cue for action recognition compared to other data modalities.

...read moreread less

Abstract: In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Towards this goal, we collected RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations. To obtain hand pose annotations, we used our own mo-cap system that automatically infers the 3D location of each of the 21 joints of a hand model via 6 magnetic sensors and inverse kinematics. Additionally, we recorded the 6D object poses and provide 3D object models for a subset of hand-object interaction sequences. To the best of our knowledge, this is the first benchmark that enables the study of first-person hand actions with the use of 3D hand poses. We present an extensive experimental evaluation of RGB-D and pose-based action recognition by 18 baselines/state-of-the-art approaches. The impact of using appearance features, poses, and their combinations are measured, and the different training/testing protocols are evaluated. Finally, we assess how ready the 3D hand pose estimation field is when hands are severely occluded by objects in egocentric views and its influence on action recognition. From the results, we see clear benefits of using hand pose as a cue for action recognition compared to other data modalities. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

...read moreread less

Proceedings Article•DOI•

PoseTrack: A Benchmark for Human Pose Estimation and Tracking

[...]

Mykhaylo Andriluka¹, Umar Iqbal², Eldar Insafutdinov, Leonid Pishchulin, Anton Milan³, Juergen Gall², Bernt Schiele - Show less +3 more•Institutions (3)

Google¹, University of Bonn², Amazon.com³

01 Jun 2018

TL;DR: PoseTrack is a new large-scale benchmark for video-based human pose estimation and articulated tracking that conducts an extensive experimental study on recent approaches to articulated pose tracking and provides analysis of the strengths and weaknesses of the state of the art.

...read moreread less

Abstract: Existing systems for video-based pose estimation and tracking struggle to perform well on realistic videos with multiple people and often fail to output body-pose trajectories consistent over time. To address this shortcoming this paper introduces PoseTrack which is a new large-scale benchmark for video-based human pose estimation and articulated tracking. Our new benchmark encompasses three tasks focusing on i) single-frame multi-person pose estimation, ii) multi-person pose estimation in videos, and iii) multi-person articulated tracking. To establish the benchmark, we collect, annotate and release a new dataset that features videos with multiple people labeled with person tracks and articulated pose. A public centralized evaluation server is provided to allow the research community to evaluate on a held-out test set. Furthermore, we conduct an extensive experimental study on recent approaches to articulated pose tracking and provide analysis of the strengths and weaknesses of the state of the art. We envision that the proposed benchmark will stimulate productive research both by providing a large and representative training dataset as well as providing a platform to objectively evaluate and compare the proposed methods. The benchmark is freely accessible at https://posetrack.net/.

...read moreread less

Book Chapter•DOI•

Part-Aligned Bilinear Representations for Person Re-Identification

[...]

Yumin Suh¹, Jingdong Wang², Siyu Tang³, Tao Mei, Kyoung Mu Lee¹ - Show less +1 more•Institutions (3)

Seoul National University¹, Microsoft², Max Planck Society³

08 Sep 2018

TL;DR: A novel network that learns a part-aligned representation for person re-identification that handles the body part misalignment problem, that is, body parts are misaligned across human detections due to pose/viewpoint change and unreliable detection.

...read moreread less

Abstract: Comparing the appearance of corresponding body parts is essential for person re-identification. As body parts are frequently misaligned between the detected human boxes, an image representation that can handle this misalignment is required. In this paper, we propose a network that learns a part-aligned representation for person re-identification. Our model consists of a two-stream network, which generates appearance and body part feature maps respectively, and a bilinear-pooling layer that fuses two feature maps to an image descriptor. We show that it results in a compact descriptor, where the image matching similarity is equivalent to an aggregation of the local appearance similarities of the corresponding body parts. Since the image similarity does not depend on the relative positions of parts, our approach significantly reduces the part misalignment problem. Training the network does not require any part annotation on the person re-identification dataset. Instead, we simply initialize the part sub-stream using a pre-trained sub-network of an existing pose estimation network and train the whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over the state-of-the-art methods on the standard benchmark datasets including Market-1501, CUHK03, CUHK01 and DukeMTMC, and standard video dataset MARS.

...read moreread less

Proceedings Article•DOI•

A Benchmark Comparison of Monocular Visual-Inertial Odometry Algorithms for Flying Robots

[...]

Jeffrey A. Delmerico¹, Davide Scaramuzza¹•Institutions (1)

University of Zurich¹

21 May 2018

TL;DR: This paper evaluates an array of publicly-available VIO pipelines on different hardware configurations, including several single-board computer systems that are typically found on flying robots, and considers the pose estimation accuracy, per-frame processing time, and CPU and memory load while processing the EuRoC datasets.

...read moreread less

Abstract: Flying robots require a combination of accuracy and low latency in their state estimation in order to achieve stable and robust flight. However, due to the power and payload constraints of aerial platforms, state estimation algorithms must provide these qualities under the computational constraints of embedded hardware. Cameras and inertial measurement units (IMUs) satisfy these power and payload constraints, so visual-inertial odometry (VIO) algorithms are popular choices for state estimation in these scenarios, in addition to their ability to operate without external localization from motion capture or global positioning systems. It is not clear from existing results in the literature, however, which VIO algorithms perform well under the accuracy, latency, and computational constraints of a flying robot with onboard state estimation. This paper evaluates an array of publicly-available VIO pipelines (MSCKF, OKVIS, ROVIO, VINS-Mono, SVO+MSF, and SVO+GTSAM) on different hardware configurations, including several single-board computer systems that are typically found on flying robots. The evaluation considers the pose estimation accuracy, per-frame processing time, and CPU and memory load while processing the EuRoC datasets, which contain six degree of freedom (6DoF) trajectories typical of flying robots. We present our complete results as a benchmark for the research community.

...read moreread less

Collapse