Towards Understanding Action Recognition
Summary (4 min read)
1. Introduction
- Current computer vision algorithms fall far below human performance on activity recognition tasks.
- Many things might be limiting current methods: weak visual cues or a lack of high-level cues, for example.
- Higher-level pose features require knowledge of the joint locations but can be semantically interpreted.
- While their main focus is to analyze the potential impact of different cues, the dataset is also valuable for evaluating human pose estimation and human detection in videos.
- The authors' preliminary results show that pose features estimated from [33] perform much worse than ground-truth pose features, but they outperform low/mid-level features for action recognition on clips where the full body is visible.
3.1. Selection
- The HMDB51 database [14] contains more than 5,100 clips of 51 different human actions collected from movies or the Internet.
- Annotating this entire dataset is impractical, so J-HMDB is a subset with fewer categories.
- The authors excluded categories that contain mainly facial expressions like smiling, interactions with others such as shaking hands, and actions that can only be done in a specific way such as a cartwheel.
- For the remaining clips, the authors further crop them in time such that the first and last frame roughly correspond to the beginning and end of an action.
- In summary, there are 31,838 annotated frames in total.
3.2. Annotation
- For annotation, the authors use a 2D puppet model [36] in which the human body is represented as a set of 10 body parts connected by 13 joints (shoulder, elbow, wrist, hip, knee, ankle, neck) and two landmarks (face and belly).
- The authors built a graphical user interface to control the viewpoint and scale and in which the joints can be selected and moved in the image plane.
- The annotation involves adjusting the joint positions so that the contours of the puppet align with image information [36].
- The puppet mask (i.e. the region contained within the puppet) is also used to initialize GrabCut [23] to obtain a segmentation mask.
- Details about the annotation interface and the distribution of joint locations, viewpoints, and scales of the annotations are provided on the website.
3.3. Training and testing set generation
- For each action category, clips are randomly grouped into two sets with the constraint that the clips from the same video belong to the same set.
- The authors iterate the grouping until the ratio of the number of clips in the two sets and the ratio of the number of distinct video sources in the two sets are both close to 7:3.
- Three splits are randomly generated and the performance reported here is the average of the three splits.
- Note that the number of training/testing clips is similar across categories and the authors report the per-video accuracy, which does not differ much from the per-class accuracy.
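The split generation described above can be sketched as follows. The function name, the greedy video-by-video assignment, and the tolerance are illustrative assumptions, not the authors' exact implementation:

```python
import random
from collections import defaultdict

def generate_split(clips, target=0.7, tol=0.02, seed=0, max_iter=10000):
    """Randomly group clips into train/test sets, keeping all clips
    from the same source video in the same set, and re-shuffling until
    both the clip ratio and the video ratio are close to 7:3.

    `clips` is a list of (clip_id, video_id) pairs.
    """
    rng = random.Random(seed)
    videos = defaultdict(list)
    for clip_id, video_id in clips:
        videos[video_id].append(clip_id)
    video_ids = list(videos)
    total_clips = len(clips)
    for _ in range(max_iter):
        rng.shuffle(video_ids)
        train_videos, train_clips = set(), 0
        # Greedily add whole videos to the training set until ~70% of clips are covered.
        for vid in video_ids:
            if train_clips / total_clips < target:
                train_videos.add(vid)
                train_clips += len(videos[vid])
        clip_ratio = train_clips / total_clips
        video_ratio = len(train_videos) / len(video_ids)
        if abs(clip_ratio - target) <= tol and abs(video_ratio - target) <= tol:
            break
    train = [c for c, v in clips if v in train_videos]
    test = [c for c, v in clips if v not in train_videos]
    return train, test
```

In practice three such splits would be drawn with different seeds and the reported accuracy averaged over them.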
4. Study of low-level features
- The authors focus their evaluation on the Dense Trajectories (DT) algorithm [30] since it is currently the best performing method on the HMDB51 database [14] and because it relies on video feature descriptors that are also used by other methods.
- The authors first review DT in Sec. 4.1, and then they replace pieces of the algorithm with the ground truth data to provide low, mid, and high level information in Sec. 4.2, Sec. 5 and Sec. 6.2 respectively.
4.1. DT features
- The DT algorithm [30] represents video data by dense trajectories along with motion and shape features around the trajectories.
- Feature points are further pruned to keep the ones whose eigenvalues of the auto-correlation matrix are larger than some threshold.
- Motion boundary histograms [6] are computed separately for the horizontal and vertical gradients of the optical flow (giving two descriptors), also known as MBH.
- While computing features at a single (original) scale decreases the performance on their dataset by less than 1%, it is necessary to fairly evaluate the impact of flow accuracy using the puppet flow, which is generated at the original video scale.
- The multi-class classification is done by LIBSVM [4] using a one-vs-all approach.
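The MBH descriptor described above can be sketched per flow field as follows. The real descriptor is aggregated over spatio-temporal cells around each trajectory, so this single-field version, with an assumed helper name and bin count, is only a simplification:

```python
import numpy as np

def mbh_descriptor(flow, n_bins=8):
    """Motion boundary histograms for one flow field of shape (H, W, 2).
    Each flow component (horizontal u, vertical v) is treated as an image;
    its spatial gradients give the 'motion boundaries', which are binned by
    orientation and weighted by gradient magnitude, yielding two histograms
    (MBHx and MBHy)."""
    descriptors = []
    for c in range(2):  # u component, then v component
        comp = flow[..., c]
        gy, gx = np.gradient(comp)          # spatial gradients of the flow component
        mag = np.hypot(gx, gy)              # boundary strength
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
        hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        s = hist.sum()
        descriptors.append(hist / s if s > 0 else hist)
    return descriptors  # [MBHx, MBHy]
```

Because MBH differentiates the flow, any constant (camera-translation-like) motion component cancels out, which is one reason it is robust in realistic video.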
4.2. DT given puppet flow
- The authors cannot evaluate the gain of having perfect dense optical flow, and therefore perfect trajectories.
- Instead, the authors use the puppet flow as the ground-truth motion in the foreground, i.e. within the puppet mask.
- The authors also try to compute (5) with features from the whole frame.
- It is now clear that the flow-related descriptors, Traj, HOF and MBH have a large gain (6.2-16 pp) over the baseline.
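Substituting the ground-truth motion inside the puppet mask while keeping the estimated flow elsewhere can be sketched as follows (function name assumed):

```python
import numpy as np

def composite_flow(estimated_flow, puppet_flow, puppet_mask):
    """Use ground-truth (puppet) flow inside the foreground mask and the
    estimated flow everywhere else. Flows have shape (H, W, 2); the mask
    has shape (H, W)."""
    mask = puppet_mask.astype(bool)[..., None]  # broadcast over u/v channels
    return np.where(mask, puppet_flow, estimated_flow)
```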
5. Study of mid-level features
- Estimating the location and size of the human in action might be an easier task than estimating accurate pixel-wise flow.
- In the section below, the authors only use Farnebäck's optical flow.
5.1. DT given foreground mask
- The authors consider two types of regions of interest: the dilated puppet mask Dmask and bbox described above.
- The authors consider two ways of masking: one is in the feature space (F), i.e. compute flow/descriptors on the whole frame and then keep only those from within the mask; the other is to mask the frames before computing the features.
- In 50% of the images, the overlap between the predicted box and the ground truth box exceeds 50%.
- This suggests that the human detector in [1] is not accurate enough to help action recognition.
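The 50% overlap criterion for comparing a predicted box against the ground-truth box is conventionally computed as intersection-over-union; a minimal sketch (assuming IoU is the paper's overlap measure):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```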
5.2. DT given scale
- The authors resize all the frames as well as the corresponding Dmask such that all persons are around 200 pixels in height, and repeat the analysis in (10).
- Finally, combining kernels of features relying on different low/mid-level features results in a 12.4 pp gain over the baseline (Tab. 2 (13)).
- It is interesting to see that for many paired comparisons, such as (5) vs. (6), (1) vs. (7), and (10) vs. (11), the amount of performance change for an individual descriptor does not always result in a similar amount of overall performance change, indicating that the features are not very complementary but have different error characteristics.
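Combining kernels from multiple descriptor channels is commonly done by (weighted) averaging of the per-descriptor kernel matrices before training the SVM; this sketch assumes uniform weights and is not necessarily the authors' exact scheme:

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Average a list of precomputed (n x n) kernel matrices, one per
    descriptor channel, into a single kernel for a multi-channel SVM."""
    kernels = np.stack([np.asarray(k, float) for k in kernels])
    if weights is None:
        weights = np.ones(len(kernels))
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()  # normalize so the combination is an average
    return np.tensordot(weights, kernels, axes=1)
```

The combined matrix can then be fed to an SVM that accepts precomputed kernels, one one-vs-all classifier per action category.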
6. Study of high-level features
6.1. Pose features
- For action recognition with pose features, the authors use various types of descriptors derived from joint annotations.
- The joints are in the neutral puppet positions.
- Note that unlike Traj in Sec. 4.1, the authors treat the features along the x- and y-coordinates as separate descriptors, which results in better performance than treating them as one descriptor.
- With noise added to the joint positions, the performance drop is less than 2 pp.
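A minimal per-frame pose descriptor in the spirit described above, with the x- and y-coordinates kept as separate descriptors; the centering on the mean joint and the scale normalization are illustrative assumptions:

```python
import numpy as np

def joint_position_descriptors(joints):
    """Per-frame pose descriptor from 2D joint annotations.
    `joints` has shape (n_joints, 2). Positions are expressed relative to
    the mean joint and normalized by the largest offset, then returned as
    two separate descriptors: one over x, one over y."""
    joints = np.asarray(joints, float)
    center = joints.mean(axis=0)
    rel = joints - center
    scale = np.abs(rel).max()
    if scale > 0:
        rel = rel / scale   # make the descriptor roughly scale-invariant
    return rel[:, 0], rel[:, 1]  # x-descriptor, y-descriptor
```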
6.2. DT given joints
- The authors use a smaller codebook size (N = 100) because here there are only 15 trajectories per frame.
- The subset contains 316 clips distributed over 12 categories.
- A closer look at the performance of individual descriptors reveals that the texture-based HOG benefits more from low/mid-level than from high-level information, while the position-based Traj shows the opposite.
- Dense Trajectories given estimated joints results in a 3.8 pp gain over the baseline, and NTraj+ computed from the 15 estimated joint positions results in an 8.1 pp gain over the baseline (Tab. 3 (5)).
- This suggests that while the estimated joint positions are not accurate compared to the ground truth, the derived pose features already outperform low/mid level features for action recognition.
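Building a bag-of-words histogram over a small codebook (N = 100 above, since only about 15 trajectories are available per frame) can be sketched as follows; the codebook itself would normally come from k-means on training features, which is assumed given here:

```python
import numpy as np

def bag_of_words(features, codebook):
    """Quantize feature vectors against a codebook and build a normalized
    histogram of codeword assignments.
    `features` has shape (n_features, d), `codebook` shape (N, d)."""
    features = np.asarray(features, float)
    codebook = np.asarray(codebook, float)
    # Squared Euclidean distance from every feature to every codeword.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = d.argmin(axis=1)          # nearest codeword per feature
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

Each clip's histogram is then what the kernel (and ultimately the SVM) compares.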
6.3. Summary
- Table 4 summarizes the improvements to Dense Trajectories realized by providing low/mid-level and high-level features on the full dataset J-HMDB and the subset sub-J-HMDB.
- Overall, the two sets show a 12-17 pp improvement over the baseline with ground truth low/mid features and a 19-29 pp improvement with high-level features.
7. Discussion
- The authors have presented a complex, annotated video dataset in order to analyze action recognition algorithms.
- Starting with a state-of-the-art method [30] , the authors supply the algorithm with a range of low-to-high-level ground truth information.
- It is also surprising that, with a good bounding box, which is probably easier to achieve than estimating accurate flow, one can obtain a large improvement over the baseline.
- While this might not be surprising, their contribution here is threefold.
- Third, for sub-J-HMDB, where the full body is visible, a recent pose estimation algorithm computes poses that are more reliable than low/mid level features for action recognition of complex actions in realistic videos.