Mimetics: Towards Understanding Human Actions Out of Context

doi:10.1007/S11263-021-01446-Y

Journal Article•DOI•

02 Mar 2021-International Journal of Computer Vision (Springer US)-Vol. 129, Iss: 5, pp 1675-1690

TL;DR: This paper proposes to benchmark action recognition methods in the absence of context, introduces Mimetics, a novel dataset of mimed actions for a subset of 50 classes from the Kinetics benchmark, and shows that a shallow neural network with a single temporal convolution over body pose features, transferred to the action recognition problem, performs surprisingly well.

Abstract: Recent methods for video action recognition have reached outstanding performance on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis field leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best examples of out-of-context actions are mimes, which people can typically recognize despite the missing objects and scenes. In this paper, we propose to benchmark action recognition methods in such absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions, and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that applying a shallow neural network with a single temporal convolution over body pose features transferred to the action recognition problem performs surprisingly well compared to 3D action recognition methods.
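
To make "a shallow neural network with a single temporal convolution over body pose features" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the feature dimension, number of filters, kernel size, ReLU, max-pooling over time and random initialisation are all assumptions chosen for illustration, and the function names (`temporal_conv`, `predict`, and so on) are hypothetical.

```python
import random

def temporal_conv(seq, weights, biases):
    """One 1D convolution over time (stride 1, no padding), followed by ReLU.

    seq:     T frames, each a list of D per-frame pose features.
    weights: C filters, each a list of K weight vectors of length D.
    Returns T - K + 1 frames of C activations.
    """
    k = len(weights[0])
    out = []
    for t in range(len(seq) - k + 1):
        frame = []
        for filt, b in zip(weights, biases):
            acc = b
            for dt in range(k):
                acc += sum(w * x for w, x in zip(filt[dt], seq[t + dt]))
            frame.append(max(0.0, acc))  # ReLU (assumed non-linearity)
        out.append(frame)
    return out

def max_over_time(frames):
    """Collapse the time axis by taking the per-channel maximum (assumed pooling)."""
    return [max(channel) for channel in zip(*frames)]

def linear(vec, weights, biases):
    """Fully connected layer producing one logit per class."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def init_params(d_in, channels, kernel, n_classes, seed=0):
    """Small random weights; a trained model would learn these instead."""
    rng = random.Random(seed)
    u = lambda: rng.uniform(-0.1, 0.1)
    return {
        "conv_w": [[[u() for _ in range(d_in)] for _ in range(kernel)]
                   for _ in range(channels)],
        "conv_b": [0.0] * channels,
        "fc_w": [[u() for _ in range(channels)] for _ in range(n_classes)],
        "fc_b": [0.0] * n_classes,
    }

def predict(seq, params):
    """Temporal convolution, pooling over time, then a linear classifier."""
    hidden = temporal_conv(seq, params["conv_w"], params["conv_b"])
    pooled = max_over_time(hidden)
    logits = linear(pooled, params["fc_w"], params["fc_b"])
    label = max(range(len(logits)), key=logits.__getitem__)
    return label, logits

# Usage: a 16-frame clip with 8 pose features per frame, 50 classes as in Mimetics.
rng = random.Random(1)
clip = [[rng.random() for _ in range(8)] for _ in range(16)]
params = init_params(d_in=8, channels=4, kernel=3, n_classes=50)
label, logits = predict(clip, params)
```

The point of the sketch is only the shape of the computation: the model never sees pixels, so scene and object context cannot influence the prediction.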

Citations

Proceedings Article•DOI•

Revisiting Skeleton-based Action Recognition

01 Jun 2022

TL;DR: PoseConv3D uses a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons; compared to GCN-based methods, it is more effective in learning spatiotemporal features, more robust against pose estimation noise, and generalizes better in cross-dataset settings.

Abstract: Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt GCNs to extract features on top of human skeletons. Despite the positive results shown in these attempts, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseConv3D, a new approach to skeleton-based action recognition. PoseConv3D relies on a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseConv3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseConv3D can handle multiple-person scenarios without additional computation costs. The hierarchical features can be easily integrated with other modalities at early fusion stages, providing a great design space to boost the performance. PoseConv3D achieves the state-of-the-art on five of six standard skeleton-based action recognition benchmarks. Once fused with other modalities, it achieves the state-of-the-art on all eight multi-modality action recognition benchmarks. Code has been made available at: https://github.com/kennymckormick/pyskl.
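
A "3D heatmap volume" can be pictured as one 2D heatmap per joint and per frame, stacked along time. The sketch below builds such a volume from 2D keypoints using Gaussian maps scaled by the keypoint confidence. It is an illustration under assumptions, not code from the PoseConv3D repository: the function names, the `(x, y, confidence)` keypoint layout and the default `sigma` are hypothetical choices.

```python
import math

def joint_heatmap(cx, cy, conf, height, width, sigma=1.0):
    """2D Gaussian centred on one keypoint, scaled by its confidence."""
    return [[conf * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def heatmap_volume(keypoints, height, width, sigma=1.0):
    """Stack per-joint heatmaps over time.

    keypoints[t][k] = (x, y, confidence) for frame t and joint k.
    Returns volume[k][t][y][x], i.e. shape K x T x H x W.
    """
    n_frames = len(keypoints)
    n_joints = len(keypoints[0])
    return [[joint_heatmap(*keypoints[t][k], height, width, sigma)
             for t in range(n_frames)]
            for k in range(n_joints)]

# Usage: 2 frames, 2 joints, on an 8 x 8 grid.
keypoints = [
    [(2, 3, 0.9), (5, 1, 0.5)],  # frame 0
    [(3, 3, 0.8), (5, 2, 0.4)],  # frame 1
]
volume = heatmap_volume(keypoints, height=8, width=8)
```

Because the result is a dense grid rather than a graph, it can be fed to an ordinary 3D convolutional network, and a noisy keypoint only shifts or dims a blob rather than corrupting a graph node.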

44 citations

Posted Content•

Quo Vadis, Skeleton Action Recognition?

Pranay Gupta¹, Anirudh Thatipelli¹, Aditya Aggarwal¹, Shubh Maheshwari¹, Neel Trivedi¹, Sourav Das¹, Ravi Kiran Sarvadevabhatla¹

International Institute of Information Technology, Hyderabad¹

04 Jul 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Benchmarking the top performers of NTU-120 on Skeletics-152 reveals the challenges and domain gap induced by actions 'in the wild', and the newly introduced datasets open new frontiers for human action recognition.

Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild'. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. Finally, as a new frontier for action recognition, we introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. It also provides an assessment of top-performing approaches across a spectrum of activity settings and via the introduced datasets, proposes new frontiers for human action recognition.

24 citations

Cites background from "Mimetics: Towards Understanding Hum..."

...HDM05 [34] | 100 | 1500 | Fixed | Lab/Prompted | Non-contextual | Not specified | High | Fixed
3D-Iconic dataset [39] | 20 | 1739 | Fixed | Lab/Prompted | Non-contextual | Not specified | High | Fixed
Florence-3D [40] | 9 | 215 | Fixed | Lab/Prompted | Non-contextual | Not specified | High | Fixed
NTU-60 [41] | 60 | 56880 | Fixed | Lab/Prompted | Non-contextual | 1-10 seconds | High | Fixed
Large-RGB+D [56] | 94 | 4953 | Fixed | Lab/Prompted | Non-contextual | Not specified | High | Fixed/Moving
Kinetics-skeleton [55] | 400 | 300,000 | Fixed | Wild | Contextual | 10 seconds | High | Fixed/Moving
NTU-120 [30] | 120 | 114,480 | Fixed | Lab/Prompted | Non-contextual | 1-10 seconds | High | Fixed
Mimetics [53] | 50 | 713 | Fixed | Wild | Non-contextual | 1-10 seconds | Moderate | Fixed
Skeletics-152 | 152 | 125,657 | Fixed | Wild | Contextual | 10 seconds | High | Fixed/Moving
Skeleton-Mimetics | 23 | 319 | Fixed | Wild | Non-contextual | 1-10 seconds | Moderate | Fixed
Metaphorics N....
[...]

Journal Article•DOI•

Quo Vadis, Skeleton Action Recognition?

Pranay Gupta¹, Anirudh Thatipelli¹, Aditya Aggarwal¹, Shubh Maheshwari¹, Neel Trivedi¹, Sourav Das¹, Ravi Kiran Sarvadevabhatla¹

International Institute of Information Technology, Hyderabad¹

05 May 2021-International Journal of Computer Vision

TL;DR: This paper introduces three datasets: Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700; Skeleton-Mimetics, derived from the Mimetics dataset; and Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances.

Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To study skeleton-action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide multi-layered assessment of the results. The results from benchmarking the top performers of NTU-120 on the newly introduced datasets reveal the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. Via the introduced datasets, our work enables new frontiers for human action recognition.

16 citations

Posted Content•

Pose and Joint-Aware Action Recognition

Anshul Shah¹, Shlok Kumar Mishra¹, Ankan Bansal², Jun-Cheng Chen², Rama Chellappa², Abhinav Shrivastava³

Johns Hopkins University¹, University of Maryland, College Park², Academia Sinica³

16 Oct 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new model for joint-based action recognition is presented, which first extracts motion features from each joint separately through a shared motion encoder before performing collective reasoning, and which outperforms the existing baseline on Mimetics, a dataset with out-of-context actions.

Abstract: Most human action recognition systems typically consider static appearances and motion as independent streams of information. In this paper, we consider the evolution of human pose and propose a method to better capture interdependence among skeleton joints. Our model extracts motion information from each joint independently, reweighs the information and finally performs inter-joint reasoning. The effectiveness of pose and joint-based representations is strengthened using a geometry-aware data augmentation technique which jitters pose heatmaps while retaining the dynamics of the action. Our best model gives an absolute improvement of 8.19% on JHMDB, 4.31% on HMDB and 1.55 mAP on Charades datasets over state-of-the-art methods using pose heat-maps alone. Fusing with RGB and flow streams leads to improvement over state-of-the-art. Our model also outperforms the baseline on Mimetics, a dataset with out-of-context videos by 1.14% while using only pose heatmaps. Further, to filter out clips irrelevant for action recognition, we re-purpose our model for clip selection guided by pose information and show improved performance using fewer clips.

10 citations

Cites background or methods or result from "Mimetics: Towards Understanding Hum..."

...Our model also obtains an improvement on Mimetics [55], a dataset with out-of-context actions using only pose heatmaps and without any tracking....
[...]
...But, studies [26, 27, 55] show that RGB and flow-based models capture a lot of dataset biases....
[...]
...But, these approaches have certain limitations like overfitting on small datasets [11], requirement for access to multiple pose modalities [59, 55], pose-tracking [55]....
[...]
...We also evaluate our method on the Mimetics [55] dataset....
[...]
...Mimetics [55] is a test set for fifty actions from the Kinetics dataset....
[...]

Proceedings Article•DOI•

Pose and Joint-Aware Action Recognition

01 Jan 2022

TL;DR: In this paper, a joint selector module re-weights the joint information to select the most discriminative joints for the task, and a joint-contrastive loss is proposed to pull together groups of joint features that convey the same action.

Abstract: Recent progress on action recognition has mainly focused on RGB and optical flow features. In this paper, we approach the problem of joint-based action recognition. Unlike other modalities, constellation of joints and their motion generate models with succinct human motion information for activity recognition. We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder before performing collective reasoning. Our joint selector module re-weights the joint information to select the most discriminative joints for the task. We also propose a novel joint-contrastive loss that pulls together groups of joint features which convey the same action. We strengthen the joint-based representations by using a geometry-aware data augmentation technique which jitters pose heatmaps while retaining the dynamics of the action. We show large improvements over the current state-of-the-art joint-based approaches on JHMDB, HMDB, Charades, AVA action recognition datasets. A late fusion with RGB and Flow-based approaches yields additional improvements. Our model also outperforms the existing baseline on Mimetics, a dataset with out-of-context actions.

9 citations

References

Proceedings Article•DOI•

Deep Residual Learning for Image Recognition

Kaiming He¹, Xiangyu Zhang¹, Shaoqing Ren¹, Jian Sun¹

Microsoft¹

27 Jun 2016

TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; an ensemble of these residual nets won 1st place on the ILSVRC 2015 classification task.

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
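
"Learning residual functions with reference to the layer inputs" means a block computes F(x) and adds the input back, so the output is F(x) + x. The toy sketch below shows this with two small fully connected layers standing in for the convolutional layers of the real networks; the layer sizes, the helper names and the placement of the final ReLU are assumptions made for illustration.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def matvec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with the residual function F(x) = W2 * relu(W1 * x)."""
    f = matvec(w2, relu(matvec(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])

# With all-zero weights F(x) = 0, so the block passes a non-negative input through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
unchanged = residual_block([1.0, 2.0], zeros, zeros)
doubled = residual_block([1.0, 2.0], identity, identity)
```

The zero-weight case is the intuition behind easier optimization: a block that has nothing useful to add only needs to drive F towards zero rather than learn an identity mapping from scratch.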

123,388 citations

Proceedings Article•DOI•

Going deeper with convolutions

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Posted Content•

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren¹, Kaiming He², Ross Girshick³, Jian Sun²

University of Science and Technology of China¹, Microsoft², Facebook³

04 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) that is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features; using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
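
Predicting "object bounds and objectness scores at each position" is done relative to a fixed set of reference boxes (anchors) placed at every feature-map location. The abstract does not spell this out, so the sketch below leans on the configuration usually associated with the paper: 3 scales times 3 aspect ratios, giving 9 anchors per position, with a feature stride of 16. The cell-centre offset and the function name `make_anchors` are assumptions for illustration.

```python
import math

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) reference boxes for every feature-map cell.

    Each anchor has area scale**2 and height/width equal to ratio.
    Centres sit in the middle of each stride x stride cell (assumed convention).
    """
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            cx = (ix + 0.5) * stride
            cy = (iy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

# Usage: a tiny 2 x 3 feature map yields 2 * 3 * 9 = 54 anchors.
anchors = make_anchors(2, 3)
x1, y1, x2, y2 = anchors[0]
first_area = (x2 - x1) * (y2 - y1)
```

For each anchor the network would output an objectness score and four box offsets; only the anchor layout is shown here.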

23,183 citations

Proceedings Article•DOI•

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He²

University of California, San Diego¹, Facebook²

21 Jul 2017

TL;DR: ResNeXt is a simple, highly modularized network architecture for image classification, constructed by repeating a building block that aggregates a set of transformations with the same topology.

Abstract: We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call cardinality (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
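
The phrase "under the restricted condition of maintaining complexity" can be checked with a small parameter count. The sketch below compares a plain bottleneck block with an aggregated one implemented as a grouped convolution. The channel widths (256-64-256 versus 256-128-256 with cardinality 32) are the commonly cited "32x4d" configuration and are an assumption here, since the abstract gives no numbers; biases and normalisation layers are ignored, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a kernel x kernel convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out

def bottleneck_params(c_io, width, cardinality=1):
    """1x1 reduce, 3x3 (optionally grouped), 1x1 expand."""
    return (conv_params(c_io, width, 1)
            + conv_params(width, width, 3, groups=cardinality)
            + conv_params(width, c_io, 1))

resnet_block = bottleneck_params(256, 64)                    # cardinality 1
resnext_block = bottleneck_params(256, 128, cardinality=32)  # 32 paths, 4 channels each
print(resnet_block, resnext_block)  # 69632 70144
```

Both blocks land at roughly 70k weights, so any accuracy difference between them can be attributed to cardinality rather than to extra capacity.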

7,183 citations

Proceedings Article•DOI•

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran¹˒², Lubomir Bourdev², Rob Fergus², Lorenzo Torresani¹, Manohar Paluri²

Dartmouth College¹, Facebook²

07 Dec 2015

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.

Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
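
A 3x3x3 kernel slides over time as well as over height and width, which is what lets a 3D ConvNet pick up motion. The naive single-channel sketch below shows the operation with stride 1 and no padding; real implementations add channels, padding and optimised kernels, and the function name `conv3d_valid` is a hypothetical choice. As in deep learning frameworks, the kernel is not flipped (cross-correlation).

```python
def conv3d_valid(video, kernel):
    """Single-channel 3D convolution, stride 1, no padding.

    video[t][y][x] and kernel[dt][dy][dx] are nested lists of floats.
    Output shape is (T - kt + 1) x (H - kh + 1) x (W - kw + 1).
    """
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    n_t, n_h, n_w = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(n_t - kt + 1):
        plane = []
        for y in range(n_h - kh + 1):
            row = []
            for x in range(n_w - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Usage: a 5-frame, 6 x 7 clip of ones with an all-ones 3x3x3 kernel.
clip = [[[1.0] * 7 for _ in range(6)] for _ in range(5)]
ones_kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
features = conv3d_valid(clip, ones_kernel)
```

Each output value here sums 27 input values, one per kernel weight, drawn from three consecutive frames.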

7,091 citations

"Mimetics: Towards Understanding Hum..." refers methods in this paper

...While improvements of this approach have been proposed [14], most state-of-the-art methods now use a 3D deep convolutional network [5, 43, 44, 50], optionally in combination with a two-stream architecture....
[...]
...Different strategies have been deployed to handle video processing with CNNs such as two-stream architectures [14, 38], Recurrent Neural Networks (RNNs) [9], or spatio-temporal 3D convolutions [5, 13, 43]....
[...]