
Showing papers on "Pose" published in 2023


Journal ArticleDOI
TL;DR: VoxelTrack as mentioned in this paper employs a multi-branch network to jointly estimate 3D poses and re-ID features for all people in the environment, and can robustly estimate and track 3D poses even when people are severely occluded in some cameras.
Abstract: We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras which are separated by wide baselines. It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment. In contrast to previous efforts which require establishing cross-view correspondence based on noisy 2D pose estimates, it directly estimates and tracks 3D poses from a 3D voxel-based representation constructed from multi-view images. We first discretize the 3D space by regular voxels and compute a feature vector for each voxel by averaging the body joint heatmaps that are inversely projected from all views. We estimate 3D poses from the voxel representation by predicting whether each voxel contains a particular body joint. Similarly, a Re-ID feature is computed for each voxel which is used to track the estimated 3D poses over time. The main advantage of the approach is that it avoids making any hard decisions based on individual images. The approach can robustly estimate and track 3D poses even when people are severely occluded in some cameras. It outperforms the state-of-the-art methods by a large margin on four public datasets including Shelf, Campus, Human3.6M and CMU Panoptic.
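
The voxel-feature construction described above (per-view joint heatmaps inversely projected into a shared voxel grid and averaged) can be illustrated with a minimal sketch. This is not the authors' code; the camera conventions, sampling scheme, and names below are assumptions.

```python
import numpy as np

def build_voxel_features(heatmaps, projections, voxel_centers, img_size):
    """heatmaps: list of (J, H, W) joint heatmaps, one per view.
    projections: list of (3, 4) camera projection matrices.
    voxel_centers: (N, 3) voxel-center coordinates in world space.
    img_size: (width, height) of the original images.
    Returns (N, J) joint likelihoods per voxel, averaged over views."""
    num_joints = heatmaps[0].shape[0]
    feats = np.zeros((voxel_centers.shape[0], num_joints))
    hom = np.hstack([voxel_centers, np.ones((voxel_centers.shape[0], 1))])  # (N, 4)
    for hm, P in zip(heatmaps, projections):
        uvw = hom @ P.T                       # project voxel centers into this view
        uv = uvw[:, :2] / uvw[:, 2:3]         # perspective divide -> pixel coords
        J, H, W = hm.shape
        # rescale from pixel to heatmap coordinates and clamp to the grid
        u = np.clip(uv[:, 0] * W / img_size[0], 0, W - 1).astype(int)
        v = np.clip(uv[:, 1] * H / img_size[1], 0, H - 1).astype(int)
        feats += hm[:, v, u].T                # nearest-neighbour heatmap sampling
    return feats / len(heatmaps)
```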

14 citations


Journal ArticleDOI
TL;DR: In this article, an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions is proposed for egocentric 3D body pose estimation from monocular images captured from downward-looking fish-eye cameras installed on the rim of a head-mounted VR device.
Abstract: We present a solution to egocentric 3D body pose estimation from monocular images captured from downward-looking fish-eye cameras installed on the rim of a head-mounted VR device. This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions that result in drastic differences in resolution between lower and upper body. We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions. The quantitative evaluation, on synthetic and real-world datasets, shows that our strategy leads to substantial improvements in accuracy over state-of-the-art egocentric approaches. To tackle the lack of labelled data, we also introduce a large photo-realistic synthetic dataset, xR-EgoPose, which offers high-quality renderings of people with diverse skin tones, body shapes and clothing, performing a range of actions. Our experiments show that the high variability in our new synthetic training corpus leads to good generalization to real-world footage and to state-of-the-art results on real-world datasets with ground truth. Moreover, an evaluation on the Human3.6M benchmark shows that the performance of our method is on par with top performing approaches on the more classic problem of 3D human pose from a third-person viewpoint.

11 citations


Journal ArticleDOI
TL;DR: AlphaPose as mentioned in this paper is a system that can perform accurate whole-body pose estimation and tracking jointly while running in real time, using Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections, and Pose Aware Identity Embedding for joint pose estimation and tracking.
Abstract: Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this article, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in real time. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections, and Pose Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and track humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
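
SIKR builds on integral (soft-argmax) keypoint regression, in which joint coordinates are read off as the expectation over a softmax-normalised heatmap, keeping localization differentiable. A hedged sketch of that basic integral step (not the symmetric variant proposed in the paper) is shown below.

```python
import torch

def soft_argmax_2d(heatmaps):
    """heatmaps: (B, J, H, W) raw network outputs.
    Returns (B, J, 2) sub-pixel (x, y) coordinates."""
    b, j, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalise over rows, expected x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalise over columns, expected y
    return torch.stack([x, y], dim=-1)
```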

8 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed the Flow-based Dual Attention GAN (FDA-GAN) to apply occlusion- and deformation-aware feature fusion for higher generation quality.
Abstract: Human pose transfer aims at transferring the appearance of the source person to the target pose. Existing methods utilizing flow-based warping for non-rigid human image generation have achieved great success. However, they fail to preserve the appearance details in synthesized images since the spatial correlation between the source and target is not fully exploited. To this end, we propose the Flow-based Dual Attention GAN (FDA-GAN) to apply occlusion- and deformation-aware feature fusion for higher generation quality. Specifically, deformable local attention and flow similarity attention, constituting the dual attention mechanism, can derive the output features responsible for deformable- and occlusion-aware fusion, respectively. Besides, to maintain the pose and global position consistency in transferring, we design a pose normalization network for learning adaptive normalization from the target pose to the source person. Both qualitative and quantitative results show that our method outperforms state-of-the-art models on the public iPER and DeepFashion datasets.
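
The flow-based warping that FDA-GAN builds on can be sketched generically: source-person features are resampled toward the target pose using a predicted per-pixel flow field. This is an illustrative sketch under assumed tensor conventions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src_feat, flow):
    """src_feat: (B, C, H, W) source-person features.
    flow: (B, 2, H, W) per-pixel displacements in pixels (x, y).
    Returns the source features warped toward the target pose."""
    b, _, h, w = src_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(src_feat.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                                # displaced sample positions
    # normalise to [-1, 1] as expected by grid_sample (x first, then y)
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)
```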

7 citations


Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed a self-calibrated pose attention network (SCPAN) to achieve more robust and precise facial landmark detection in challenging scenarios, where a boundary-aware landmark intensity (BALI) field is proposed to model more effective facial shape constraints by fusing boundary and landmark intensity field information.
Abstract: Current fully supervised facial landmark detection methods have progressed rapidly and achieved remarkable performance. However, they still suffer when coping with faces under large poses and heavy occlusions, owing to inaccurate facial shape constraints and insufficient labeled training samples. In this article, we propose a semisupervised framework, that is, a self-calibrated pose attention network (SCPAN) to achieve more robust and precise facial landmark detection in challenging scenarios. To be specific, a boundary-aware landmark intensity (BALI) field is proposed to model more effective facial shape constraints by fusing boundary and landmark intensity field information. Moreover, a self-calibrated pose attention (SCPA) model is designed to provide a self-learned objective function that enforces intermediate supervision without label information by introducing a self-calibrated mechanism and a pose attention mask. We show that by integrating the BALI fields and SCPA model into a novel SCPAN, more facial prior knowledge can be learned and the detection accuracy and robustness of our method for faces with large poses and heavy occlusions have been improved. The experimental results obtained for challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.

7 citations


Journal ArticleDOI
TL;DR: The second Satellite Pose Estimation Competition (SPEC2021) as mentioned in this paper was designed to address the challenging problem of the domain gap in developing space-borne computer vision algorithms, which causes algorithms to overfit to features specific to the synthetic imagery.

6 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a unified framework based on cascade learning for simultaneous facial landmark detection and head pose estimation, as well as simultaneous eye center detection and gaze estimation.
Abstract: As a non-invasive method, vision-based driver monitoring aims to identify risky maneuvers for intelligent vehicles and has gained increasing interest in recent years. However, most existing methods tend to design models for specific tasks, such as head pose or gaze estimation, which results in redundant models hampering real-time applications. Besides, most driver facial monitoring methods ignore the correlation of different tasks. In this work, we propose a unified framework based on cascade learning for simultaneous facial landmark detection and head pose estimation, as well as simultaneous eye center detection and gaze estimation. In particular, built upon the key idea that facial landmark locations and 3D face model parameters are implicitly correlated, we introduce a cascade regression framework to achieve these two tasks simultaneously. After coarsely extracting the driver’s eye region from the detected facial landmarks, we perform cascade regression for simultaneous eye center detection and gaze estimation. Leveraging the power of cascade learning allows our method to alternately optimize facial landmark detection, head pose estimation, eye center localization, and gaze prediction. The comparison experiments conducted on the benchmark datasets of 300-W, GI4E, BU, MPIIGaze, and the SHRP2 driving dataset demonstrate that our proposed method can achieve state-of-the-art performance with robust effectiveness in real driver monitoring applications.
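
The cascade-regression idea used here can be summarised in a short, hedged sketch: starting from an initial estimate, each stage regresses an update from features indexed by the current estimate. The feature extractor and stage regressors below are placeholders, not the paper's components.

```python
import numpy as np

def cascaded_regression(image, mean_shape, regressors, extract_features):
    """mean_shape: (L, 2) initial landmark guess (e.g., the mean face shape).
    regressors: list of fitted stage regressors exposing .predict().
    extract_features: callable(image, shape) -> (D,) shape-indexed features."""
    shape = mean_shape.copy()
    for reg in regressors:
        feats = extract_features(image, shape)     # features around current landmarks
        delta = reg.predict(feats[None, :])[0]     # (2L,) predicted update
        shape = shape + delta.reshape(-1, 2)       # refine the landmark estimate
    return shape
```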

6 citations


Journal ArticleDOI
TL;DR: In this article, a Simultaneously-collected multimodal Lying Pose (SLP) dataset is introduced, which includes in-bed pose images from 109 participants captured using multiple imaging modalities including RGB, long-wave infrared (LWIR), depth, and pressure maps.
Abstract: The computer vision field has achieved great success in interpreting semantic meanings from images, yet its algorithms can be brittle for tasks with adverse vision conditions and those suffering from data/label pair limitations. Among these tasks is in-bed human pose monitoring with significant value in many healthcare applications. In-bed pose monitoring in natural settings involves pose estimation in complete darkness or full occlusion. The lack of publicly available in-bed pose datasets hinders the applicability of many successful human pose estimation algorithms for this task. In this paper, we introduce our Simultaneously-collected multimodal Lying Pose (SLP) dataset, which includes in-bed pose images from 109 participants captured using multiple imaging modalities including RGB, long wave infrared (LWIR), depth, and pressure map. We also present a physical hyperparameter tuning strategy for ground-truth pose label generation under adverse vision conditions. The SLP design is compatible with the mainstream human pose datasets; therefore, the state-of-the-art 2D pose estimation models can be trained effectively with the SLP data with promising performance as high as 95% at PCKh@0.5 on a single modality. The pose estimation performance of these models can be further improved by including additional modalities through the proposed collaborative scheme.

6 citations


Journal ArticleDOI
01 Feb 2023-Sensors
TL;DR: In this paper, a qualitative assessment of the pose estimation accuracy of a UAV equipped with a GNSS RTK receiver was performed with the use of statistical methods, and the results were verified against direct tachymetric measurements.
Abstract: The growing possibilities offered by unmanned aerial vehicles (UAV) in many areas of life, in particular in automatic data acquisition, spur the search for new methods to improve the accuracy and effectiveness of the acquired information. This study was undertaken on the assumption that modern navigation receivers equipped with real-time kinematic positioning software and integrated with UAVs can considerably improve the accuracy of photogrammetric measurements. The research hypothesis was verified during field measurements with the use of a popular Enterprise series drone. The problems associated with accurate UAV pose estimation were identified. The main aim of the study was to perform a qualitative assessment of the pose estimation accuracy of a UAV equipped with a GNSS RTK receiver. A test procedure comprising three field experiments was designed to achieve the above research goal: an analysis of the stability of absolute pose estimation when the UAV is hovering over a point, and analyses of UAV pose estimation during flight along a predefined trajectory and during continuous flight without waypoints. The tests were conducted in a designated research area. The results were verified based on direct tachymetric measurements. The qualitative assessment was performed with the use of statistical methods. The study demonstrated that in a state of apparent stability, horizontal deviations of around 0.02 m occurred at low altitudes and increased with a rise in altitude. Mission type significantly influences pose estimation accuracy over waypoints. The results were used to verify the accuracy of the UAV’s pose estimation and to identify factors that affect the pose estimation accuracy of a UAV equipped with a GNSS RTK receiver. The present findings provide valuable input for developing a new method to improve the accuracy of measurements performed with the use of UAVs.
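
As a small illustration of the kind of statistics reported above, the horizontal deviation of RTK-derived hover positions from a surveyed reference point can be summarised as follows; the data layout, units, and names are assumptions, not the study's processing chain.

```python
import numpy as np

def horizontal_deviation_stats(rtk_xy, reference_xy):
    """rtk_xy: (N, 2) easting/northing fixes from the UAV's RTK receiver [m].
    reference_xy: (2,) reference coordinates from the tachymetric survey [m].
    Returns summary statistics of the horizontal deviation."""
    dev = np.linalg.norm(rtk_xy - reference_xy, axis=1)   # per-fix horizontal deviation
    return {"mean": dev.mean(), "std": dev.std(), "max": dev.max()}
```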

4 citations


Journal ArticleDOI
TL;DR: In this paper, a spatial augmented reality system is proposed to assist workers in performing end-of-line quality inspection and individual handling of a high variety of products in the industrial domain of furniture production.

3 citations


Posted ContentDOI
28 Apr 2023-bioRxiv
TL;DR: In this article, a semi-supervised approach is proposed that leverages the spatio-temporal statistics of unlabeled videos in two different ways: first, unsupervised training objectives penalize the network whenever its predictions violate smoothness of physical motion or multiple-view geometry, or depart from a low-dimensional subspace of plausible body configurations; second, a new network architecture predicts pose for a given frame using temporal context from surrounding unlabeled frames.
Abstract: Pose estimation algorithms are shedding new light on animal behavior and intelligence. Most existing models are only trained with labeled frames (supervised learning). Although effective in many cases, the fully supervised approach requires extensive image labeling, struggles to generalize to new videos, and produces noisy outputs that hinder downstream analyses. We address each of these limitations with a semi-supervised approach that leverages the spatiotemporal statistics of unlabeled videos in two different ways. First, we introduce unsupervised training objectives that penalize the network whenever its predictions violate smoothness of physical motion, multiple-view geometry, or depart from a low-dimensional subspace of plausible body configurations. Second, we design a new network architecture that predicts pose for a given frame using temporal context from surrounding unlabeled frames. These context frames help resolve brief occlusions or ambiguities between nearby and similar-looking body parts. The resulting pose estimation networks achieve better performance with fewer labels, generalize better to unseen videos, and provide smoother and more reliable pose trajectories for downstream analysis; for example, these improved pose trajectories exhibit stronger correlations with neural activity. We also propose a Bayesian post-processing approach based on deep ensembling and Kalman smoothing that further improves tracking accuracy and robustness. We release a deep learning package that adheres to industry best practices, supporting easy model development and accelerated training and prediction. Our package is accompanied by a cloud application that allows users to annotate data, train networks, and predict new videos at scale, directly from the browser.
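
One of the unsupervised objectives described above penalises violations of temporal smoothness on unlabeled video. A minimal, assumed form of such a penalty (not the package's exact loss) is sketched below.

```python
import torch

def temporal_smoothness_loss(keypoints, epsilon=0.0):
    """keypoints: (T, J, 2) predicted coordinates for T consecutive frames.
    epsilon: tolerated per-frame displacement before a penalty is applied."""
    disp = torch.norm(keypoints[1:] - keypoints[:-1], dim=-1)  # (T-1, J) frame-to-frame motion
    return torch.clamp(disp - epsilon, min=0.0).mean()         # penalise implausibly large jumps
```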

Journal ArticleDOI
01 Feb 2023-Sensors
TL;DR: In this article, a new markerless 2D swimmer pose estimation approach based on the combined use of computer vision algorithms and fully convolutional neural networks is proposed, which is able to estimate the pose of a swimmer during exercise while guaranteeing adequate measurement accuracy.
Abstract: Professional swimming coaches make use of videos to evaluate their athletes’ performances. Specifically, the videos are manually analyzed in order to observe the movements of all parts of the swimmer’s body during the exercise and to give indications for improving swimming technique. This operation is time-consuming, laborious and error-prone. In recent years, alternative technologies have been introduced in the literature, but they still have severe limitations that make their correct and effective use impossible. In fact, the currently available techniques based on image analysis only apply to certain swimming styles; moreover, they are strongly influenced by disturbing elements (i.e., the presence of bubbles, splashes and reflections), resulting in poor measurement accuracy. The use of wearable sensors (accelerometers or photoplethysmographic sensors) or optical markers, although they can guarantee high reliability and accuracy, disturbs the performance of the athletes, who tend to dislike these solutions. In this work we introduce swimmerNET, a new marker-less 2D swimmer pose estimation approach based on the combined use of computer vision algorithms and fully convolutional neural networks. By using a single 8 Mpixel wide-angle camera, the proposed system is able to estimate the pose of a swimmer during exercise while guaranteeing adequate measurement accuracy. The method has been successfully tested on several athletes (i.e., with different physical characteristics and swimming techniques), obtaining an average error and a standard deviation (worst case scenario for the dataset analyzed) of approximately 1 mm and 10 mm, respectively.

Journal ArticleDOI
13 Jan 2023-PeerJ
TL;DR: In this article, the BlazePose model was used to localize the body joints of the yoga poses, and a detailed analysis of the body joint detection accuracy was proposed in the form of percentage of correct keypoints (PCK) and percentage of detected joints (PDJ) for individual body parts and individual body joints.
Abstract: Virtual motion and pose from images and video can be estimated by detecting body joints and their interconnection. The human body has diverse and complicated poses in yoga, making its classification challenging. This study estimates yoga poses from images using a neural network. Five different yoga poses, viz. downdog, tree, plank, warrior2, and goddess, in the form of RGB images, are used as the target inputs. The BlazePose model was used to localize the body joints of the yoga poses. It detected a maximum of 33 body joints, referred to as keypoints, covering almost all the body parts. Keypoints obtained from the model are considered the predicted joint locations. True keypoints, as the ground-truth body joints for individual yoga poses, are identified manually using the open-source image annotation tool named Makesense AI. A detailed analysis of the body joint detection accuracy is proposed in the form of the percentage of correct keypoints (PCK) and the percentage of detected joints (PDJ) for individual body parts and individual body joints, respectively. An algorithm is designed to measure PCK and PDJ in which the distance between the predicted joint location and the true joint location is calculated. The experimental evaluation suggests that the adopted model obtained 93.9% PCK for the goddess pose, which was the maximum PCK achieved. The PDJ evaluation was carried out in a staggered mode, where a maximum PDJ of 90% to 100% was obtained for almost all the body joints.
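
The PCK/PDJ evaluation described above amounts to counting predicted keypoints that fall within a threshold distance of the manually annotated ground truth. A short sketch is given below; the exact threshold convention used in the paper is an assumption here.

```python
import numpy as np

def pck_and_pdj(pred, gt, threshold):
    """pred, gt: (N, J, 2) predicted and true joint locations in pixels.
    threshold: (N,) per-image distance threshold (e.g., a fraction of the
    torso length or bounding-box diagonal; assumed convention)."""
    dist = np.linalg.norm(pred - gt, axis=-1)       # (N, J) joint-wise pixel errors
    correct = dist <= threshold[:, None]
    pck = 100.0 * correct.mean()                    # overall percentage of correct keypoints
    pdj = 100.0 * correct.mean(axis=0)              # per-joint detection rates (PDJ-style)
    return pck, pdj
```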

Journal ArticleDOI
TL;DR: In this paper, a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) is proposed to adaptively handle varying view numbers and video length without camera calibration in 3D human pose estimation.
Abstract: This article proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video length without camera calibration in 3D Human Pose Estimation (HPE). It consists of Feature Extractor, Multi-view Fusing Transformer (MFT), and Temporal Fusing Transformer (TFT). Feature Extractor estimates 2D pose from each image and fuses the prediction according to the confidence. It provides pose-focused feature embedding and makes subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts 3D pose via a transformer. It adaptively deals with video of arbitrary length and fully utilizes the temporal information. The migration of transformers enables our model to learn spatial geometry better and preserve robustness for varying application scenarios. We report quantitative and qualitative results on the Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods with camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.

Journal ArticleDOI
TL;DR: In this paper, an online measurement method for the assembly pose of the gear structure based on monocular vision is proposed, and the correction amount required to complete the internal and external teeth assembly is calculated based on the iterative update of the pose measurement method.
Abstract: The gear structure is an important part of the transmission device. At present, the assembly of the large internal gear is mostly completed by manual methods. Manual assembly is difficult and inefficient. Therefore, an online measurement method for the assembly pose of the gear structure based on monocular vision is proposed. After the critical features of the gear structure have been detected, a duality elimination method based on traversal mapping dots is proposed to obtain the correct solution for the spatial circle pose. Concurrently, the circle pose optimization model is established to enhance pose precision. Then, a new calibration board was designed to complete the hand-eye calibration of the parallel mechanism and camera. Finally, the correction amount required to complete the internal and external teeth assembly is calculated based on the iterative update of the pose measurement method. The experimental results show that the comprehensive accuracy of the pose measurement method exceeds 0.2 mm, the average assembly time is approximately 14 min and the assembly success rate is approximately 97%. It has been realized that simulated gear structure parts can be assembled automatically.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a Pose-Appearance Relational Network (PARNet) to model the correlation between human pose and image appearance, which combines the benefits of these two modalities to improve the robustness towards unconstrained real-world videos.
Abstract: Recent studies of video action recognition can be classified into two categories: the appearance-based methods and the pose-based methods. The appearance-based methods generally cannot model temporal dynamics of large motion well by virtue of optical flow estimation, while the pose-based methods ignore the visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance, and combines the benefits of these two modalities to improve the robustness towards unconstrained real-world videos. There are three network streams in our model, namely pose stream, appearance stream and relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain the dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect pose and appearance streams by modelling action-sensitive visual context information. Through jointly optimizing the three modules, PARNet achieves superior performances compared with the state of the art on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness towards complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even compared with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms the Kinetics pre-trained competitors on UCF101 and HMDB51. The better experimental results verify the potentials of our framework by integrating various modules.

Posted ContentDOI
TL;DR: PoseVocab as discussed by the authors constructs key poses and latent embeddings based on the training poses, samples key rotations in $so(3)$ of each joint rather than the global pose vectors, and assigns a pose embedding to each sampled key rotation.
Abstract: Creating pose-driven human avatars is about modeling the mapping from the low-frequency driving pose to high-frequency dynamic human appearances, so an effective pose encoding method that can encode high-fidelity human details is essential to human avatar modeling. To this end, we present PoseVocab, a novel pose encoding method that encourages the network to discover the optimal pose embeddings for learning the dynamic human appearance. Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses. To achieve pose generalization and temporal consistency, we sample key rotations in $so(3)$ of each joint rather than the global pose vectors, and assign a pose embedding to each sampled key rotation. These joint-structured pose embeddings not only encode the dynamic appearances under different key poses, but also factorize the global pose embedding into joint-structured ones to better learn the appearance variation related to the motion of each joint. To improve the representation ability of the pose embedding while maintaining memory efficiency, we introduce feature lines, a compact yet effective 3D representation, to model more fine-grained details of human appearances. Furthermore, given a query pose and a spatial position, a hierarchical query strategy is introduced to interpolate pose embeddings and acquire the conditional pose feature for dynamic human synthesis. Overall, PoseVocab effectively encodes the dynamic details of human appearance and enables realistic and generalized animation under novel poses. Experiments show that our method outperforms other state-of-the-art baselines both qualitatively and quantitatively in terms of synthesis quality. Code is available at https://github.com/lizhe00/PoseVocab.

Journal ArticleDOI
TL;DR: In this paper, a comprehensive analysis of recent progress in 3D vision for robot manipulation is presented, including 3D data acquisition and representation, robot-vision calibration, 3D object detection/recognition, 6-DOF pose estimation, grasping estimation, and motion planning.
Abstract: Robot manipulation, for example, pick-and-place manipulation, is broadly used for intelligent manufacturing with industrial robots, ocean engineering with underwater robots, service robots, or even healthcare with medical robots. Most traditional robot manipulations adopt 2-D vision systems with plane hypotheses and can only generate 3-DOF (degrees of freedom) pose accordingly. To mimic human intelligence and endow the robot with more flexible working capabilities, 3-D vision-based robot manipulation has been studied. However, this task is still challenging in the open world especially for general object recognition and pose estimation with occlusion in cluttered backgrounds and human-like flexible manipulation. In this article, we propose a comprehensive analysis of recent progress about the 3-D vision for robot manipulation, including 3-D data acquisition and representation, robot-vision calibration, 3-D object detection/recognition, 6-DOF pose estimation, grasping estimation, and motion planning. We then present some public datasets, evaluation criteria, comparisons, and challenges. Finally, the related application domains of robot manipulation are given, and some future directions and open problems are studied as well.

Journal ArticleDOI
TL;DR: In this article, a pose-only imaging geometry representation and algorithms are proposed to help address large-scale scenarios and computational robustness, which are great challenges facing the research community in visual navigation and 3D scene reconstruction.
Abstract: Visual navigation and three-dimensional (3D) scene reconstruction are essential for robotics to interact with the surrounding environment. Large-scale scenarios and computational robustness are great challenges facing the research community to achieve this goal. This paper raises a pose-only imaging geometry representation and algorithms that might help solve these challenges. The pose-only representation, equivalent to the classical multiple-view geometry, is discovered to be linearly related to camera global translations, which allows for efficient and robust camera motion estimation. As a result, the spatial feature coordinates can be analytically reconstructed and do not require nonlinear optimization. Comprehensive experiments demonstrate that the computational efficiency of recovering the scene and associated camera poses is significantly improved by 2-4 orders of magnitude.

Journal ArticleDOI
TL;DR: In this article, a 3D human pose recognition framework based on an ANN for learning error estimation is presented, and a workable laboratory-based multisensory testbed has been developed to verify the concept and validate the results.
Abstract: Human pose recognition is a new field of study that promises to have widespread practical applications. While there have been efforts to improve human position estimation with radio frequency identification (RFID), no major research has addressed the problem of predicting full-body poses. Therefore, a system that can determine the human pose by analyzing the entire human body, from the head to the toes, is required. This paper presents a 3D human pose recognition framework based on an ANN for learning error estimation. A workable laboratory-based multisensory testbed has been developed to verify the concept and validate the results. A case study was discussed to determine the conditions under which an acceptable estimation rate can be achieved in pose analysis. Using the Butterworth filtering technique, the signal is de-noised of environmental factors to reduce the system’s computational cost. The acquired signal is then segmented using an adaptive moving average technique to determine the beginning and ending points of an activity, and significant features are extracted to estimate the activity of each human pose. Experiments demonstrate that RFID transceiver-based solutions can be used effectively to estimate a person’s pose in real time using the proposed method.
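
The Butterworth de-noising step mentioned above can be sketched with SciPy's standard filter design routines; the filter order, cutoff, and variable names below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise_signal(signal, fs, cutoff_hz=5.0, order=4):
    """signal: 1-D array of raw RFID signal samples.
    fs: sampling rate in Hz; cutoff_hz and order are assumed, illustrative choices."""
    b, a = butter(order, cutoff_hz / (0.5 * fs), btype="low")  # normalised cutoff
    return filtfilt(b, a, signal)   # zero-phase (forward-backward) filtering
```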


Journal ArticleDOI
01 Jan 2023
TL;DR: In this article, a key point extraction with deep convolutional neural networks based pose estimation (KPE-DCNN) model was proposed for activity recognition, which is able to achieve good results compared with benchmark algorithms like CNN, DBN, SVM, STAL, T-CNN and so on.
Abstract: Human Action Recognition (HAR) and pose estimation from videos have gained significant attention among research communities due to their application in several areas, namely intelligent surveillance, human-robot interaction, robot vision, etc. Though considerable improvements have been made in recent years, the design of an effective and accurate action recognition model is still a difficult process owing to the existence of different obstacles such as variations in camera angle, occlusion, background, movement speed, and so on. From the literature, it is observed that the temporal dimension is hard to deal with in the action recognition process, and convolutional neural network (CNN) models are widely used to address this. With this motivation, this study designs a novel key point extraction with deep convolutional neural networks based pose estimation (KPE-DCNN) model for activity recognition. The KPE-DCNN technique initially converts the input video into a sequence of frames, followed by a three-stage process, namely key point extraction, hyperparameter tuning, and pose estimation. In the keypoint extraction process, an OpenPose model is designed to compute the accurate keypoints in the human pose. Then, an optimal DCNN model is developed to classify the human activity label based on the extracted key points. To improve the training process of the DCNN technique, the RMSProp optimizer is used to optimally adjust the hyperparameters such as learning rate, batch size, and epoch count. Experimental results on the benchmark UCF Sports dataset showed that the KPE-DCNN technique is able to achieve good results compared with benchmark algorithms like CNN, DBN, SVM, STAL, T-CNN and so on.

Proceedings ArticleDOI
01 Jan 2023
TL;DR: Cascaded Pose Refinement Transformers (CRT-6D) as discussed by the authors uses a sparse set of features sampled from the feature pyramid, where each element corresponds to an object keypoint, and employs lightweight deformable transformers chained together to iteratively refine proposed poses over the sampled OSKFs.
Abstract: Learning-based 6D object pose estimation methods rely on computing large intermediate pose representations and/or iteratively refining an initial estimation with a slow render-compare pipeline. This paper introduces a novel method we call Cascaded Pose Refinement Transformers, or CRT-6D. We replace the commonly used dense intermediate representation with a sparse set of features sampled from the feature pyramid we call OSKFs (Object Surface Keypoint Features), where each element corresponds to an object keypoint. We employ lightweight deformable transformers and chain them together to iteratively refine proposed poses over the sampled OSKFs. We achieve inference runtimes 2× faster than the closest real-time state-of-the-art methods while supporting up to 21 objects on a single model. We demonstrate the effectiveness of CRT-6D by performing extensive experiments on the LM-O and YCB-V datasets. Compared to real-time methods, we achieve state of the art on LM-O and YCB-V, falling slightly behind methods with inference runtimes one order of magnitude higher. The source code is available at: https://github.com/PedroCastro/CRT-6D

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a two-branch learning model, namely, the joint global-local network, for human pose estimation (HPE) using millimeter wave radar.
Abstract: This article proposes a two-branch learning model, namely, the joint global–local network, for human pose estimation (HPE) using millimeter wave radar. The aim of this work is to remediate the ill-posed problems in HPE arising from using the destructive observations with superimposed reflection signals. In the developed two-branch learning model, the global branch makes use of the superimposed signals from the whole human body to reconstruct the coarse pose estimation from a global perspective, and the local branch is responsible for refining the pose estimations with the decomposed signals from individual body parts in a complementary way. In doing this, the two branch learning processes are coordinated by the subsequent attention-based fusion module in terms of local and global consistency. It is remarkable that the learning driven by the decomposed signals is motivated by exploiting the spatial-temporal evolution patterns of individual body parts for inferring the corresponding movements, which plays a crucial yet complementary role in the collaboration with the learning driven by the superimposed signals. With the two-branch learning architecture, the proposed method is advantageous in incorporating the local motion constraints from individual body parts into the coarse global estimation from the whole human body, which contributes to reconstructing plausible yet accurate pose estimations with the local and global kinematic consistency. Extensive experiments are presented to demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed the integration of top-down and bottom-up approaches to exploit their strengths for multi-person pose estimation and applied a semi-supervised method to overcome the 3D ground truth data scarcity.
Abstract: Monocular 3D human pose estimation has made progress in recent years. Most of the methods focus on single persons, which estimate the poses in the person-centric coordinates, i.e., the coordinates based on the center of the target person. Hence, these methods are inapplicable for multi-person 3D pose estimation, where the absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection (i.e., top-down approach), and thus suffer from detection errors and cannot produce reliable pose estimation in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address all these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we perform optimization at test time, by refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Finally, we apply a semi-supervised method to overcome the 3D ground-truth data scarcity. Our evaluations demonstrate the effectiveness of the proposed method and its individual components. Our code and pretrained models are available publicly: https://github.com/3dpose/3D-Multi-Person-Pose.
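
The test-time refinement described above can be sketched as a small optimisation over the estimated 3D poses, combining a re-projection loss, a bone-length regulariser, and (here) a first-order temporal smoothness term. The camera projection, bone list, loss weighting, and names below are placeholders, not the authors' exact formulation.

```python
import torch

def refine_poses(poses3d, joints2d, project, bones, ref_lengths, steps=100):
    """poses3d: (T, J, 3) initial 3D estimates; joints2d: (T, J, 2) 2D detections.
    project: callable mapping (T, J, 3) -> (T, J, 2) with known camera parameters.
    bones: list of (parent, child) joint index pairs; ref_lengths: (B,) reference lengths."""
    x = poses3d.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=1e-2)
    for _ in range(steps):
        reproj = torch.norm(project(x) - joints2d, dim=-1).mean()        # re-projection loss
        lens = torch.stack([torch.norm(x[:, c] - x[:, p], dim=-1)
                            for p, c in bones], dim=-1)                  # (T, B) bone lengths
        bone = torch.abs(lens - ref_lengths).mean()                      # bone-length regulariser
        smooth = torch.norm(x[1:] - x[:-1], dim=-1).mean()               # temporal smoothness
        loss = reproj + bone + smooth
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```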


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a position awareness network (PANet) for spacecraft pose estimation, where the point cloud is first fed into a hierarchical embedding network to extract the key points and construct local structural descriptors.
Abstract: Spacecraft pose estimation plays a vital role in many on-orbit space missions, such as rendezvous and docking, debris removal, and on-orbit maintenance. At present, the mainstream descriptors-based pose estimation methods ignore the fact that a satellite point cloud contains many similar structures, generating numerous mismatched correspondence pairs, and leading to low pose estimation accuracy. This article proposes a position awareness network (PANet) for spacecraft pose estimation. Specifically, the point cloud is first fed into a hierarchical embedding network to extract the key points and construct local structural descriptors. We also build the relative position features by encoding the relative position between key points and reference points. The matching matrix between point clouds is then calculated by comprehensively considering the local structure descriptors and relative location features. In this way, the problem of ambiguous matching caused by similar local structures is avoided. Finally, weighted singular value decomposition (SVD) is utilized to solve the pose between the point clouds based on the correspondence pairs generated by the matching matrix. Besides, a large-scale satellite point cloud dataset is also constructed for training and testing pose estimation algorithms. Empirical experiments on the dataset demonstrate the effectiveness of the proposed PANet, which achieves 1.18$^\circ$ rotation error and 0.136 m translation error, surpassing state-of-the-art methods by a large margin.
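
The weighted-SVD step used to recover the relative pose from matched point pairs is the classical weighted Kabsch solution; a minimal sketch, with assumed variable conventions, is shown below.

```python
import numpy as np

def weighted_svd_pose(src, dst, w):
    """src, dst: (N, 3) corresponding points; w: (N,) match confidences.
    Returns rotation R (3, 3) and translation t (3,) with dst ≈ R @ src + t."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)                  # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))       # weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                 # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```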

Proceedings ArticleDOI
19 Apr 2023
TL;DR: In this paper, the authors explore the feasibility of estimating body pose using IMUs already in devices that many users own, such as smartphones, smartwatches, and earbuds.
Abstract: Tracking body pose on-the-go could have powerful uses in fitness, mobile gaming, context-aware virtual assistants, and rehabilitation. However, users are unlikely to buy and wear special suits or sensor arrays to achieve this end. Instead, in this work, we explore the feasibility of estimating body pose using IMUs already in devices that many users own — namely smartphones, smartwatches, and earbuds. This approach has several challenges, including noisy data from low-cost commodity IMUs, and the fact that the number of instrumentation points on a user’s body is both sparse and in flux. Our pipeline receives whatever subset of IMU data is available, potentially from just a single device, and produces a best-guess pose. To evaluate our model, we created the IMUPoser Dataset, collected from 10 participants wearing or holding off-the-shelf consumer devices and across a variety of activity contexts. We provide a comprehensive evaluation of our system, benchmarking it on both our own and existing IMU datasets.
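
One way to feed a variable, in-flux subset of devices to a fixed-size model, consistent with the pipeline described above, is to reserve a fixed input slot per device and zero-fill the slots of absent devices. This is an illustrative assumption, not the IMUPoser implementation; the device names and feature sizes below are hypothetical.

```python
import numpy as np

DEVICE_SLOTS = ["phone", "watch", "earbuds"]   # assumed device set
FEATS_PER_DEVICE = 12                          # e.g., flattened orientation (9) + acceleration (3)

def pack_imu_frame(available):
    """available: dict mapping device name -> (FEATS_PER_DEVICE,) IMU features.
    Returns a fixed-size input vector with zeros in the slots of missing devices."""
    out = np.zeros(len(DEVICE_SLOTS) * FEATS_PER_DEVICE, dtype=np.float32)
    for i, name in enumerate(DEVICE_SLOTS):
        if name in available:
            out[i * FEATS_PER_DEVICE:(i + 1) * FEATS_PER_DEVICE] = available[name]
    return out
```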

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a perceptual pose SIMilarity (PSIM) metric, by assuming that human perception is highly adapted to extracting structural information from a given signal, and presented a perceptually robust 3D pose estimation framework: Temporal Propagating Long Short-Term Memory networks.
Abstract: Predicting a 3D pose directly from a monocular image is a challenging problem. Most pose estimation methods proposed in recent years have shown ‘quantitatively’ good results (below $\sim$50 mm). However, these methods remain ‘perceptually’ flawed because their performance is only measured via a simple distance metric. Although this fact is well understood, the reliance on ‘quantitative’ information implies that the development of 3D pose estimation methods has been slowed down. To address this issue, we first propose a perceptual Pose SIMilarity (PSIM) metric, by assuming that human perception (HP) is highly adapted to extracting structural information from a given signal. Second, we present a perceptually robust 3D pose estimation framework: Temporal Propagating Long Short-Term Memory networks (TP-LSTMs). Toward this, we analyze the information-theory-based spatio-temporal posture correlations, including joint interdependency, temporal consistency, and HP. The experimental results clearly show that the proposed PSIM metric achieves a stronger correlation with users’ subjective opinions than conventional pose metrics. Furthermore, we demonstrate the significant quantitative and perceptual performance improvements of TP-LSTMs compared to existing state-of-the-art methods.

Proceedings ArticleDOI
04 Jun 2023
TL;DR: Li et al. as discussed by the authors proposed a novel neural module for enhancing existing fast and lightweight 2D human pose estimation CNNs, which is tasked to encode global spatial and semantic information and provide it to the stem network during inference.
Abstract: This paper presents a novel neural module for enhancing existing fast and lightweight 2D human pose estimation CNNs, in order to increase their accuracy. A baseline stem CNN is augmented by a collateral module, which is tasked to encode global spatial and semantic information and provide it to the stem network during inference. The latter one outputs the final 2D human pose estimations. Since global information encoding is an inherent subtask of 2D human pose estimation, this particular setup allows the stem network to better focus on the local details of the input image and on precisely localizing each body joint, thus increasing overall 2D human pose estimation accuracy. Furthermore, the collateral module is designed to be lightweight, adding negligible runtime computational cost, so that the unified architecture retains the fast execution property of the stem network. Evaluation of the proposed method on public 2D human pose estimation datasets shows that it increases the accuracy of different baseline stem CNNs, while outperforming all competing fast 2D human pose estimation methods.