Journal ArticleDOI

Action Quality Assessment Using Siamese Network-Based Deep Metric Learning

TL;DR: This work proposes a new action scoring system, termed Reference Guided Regression (RGR), which comprises a Deep Metric Learning Module that learns the similarity between any two action videos based on the ground-truth scores given by the judges, and a Score Estimation Module that uses a video's resemblance to a reference video to produce the assessment score.
Abstract: Automated vision-based score estimation models can provide an alternate opinion to avoid judgment bias. Existing works have learned score estimation models by regressing the video representation to the ground-truth score provided by judges. However, such regression-based solutions lack interpretability in terms of giving reasons for the awarded score. One solution to make the scores more explicable is to compare the given action video with a reference video, which would capture the temporal variations vis-a-vis the reference video and map those variations to the final score. In this work, we propose a new action scoring system termed Reference Guided Regression (RGR), which comprises (1) a Deep Metric Learning Module that learns the similarity between any two action videos based on the ground-truth scores given by the judges, and (2) a Score Estimation Module that uses the first module to find the resemblance of a video to a reference video and produce the assessment score. The proposed scoring model is tested on Olympic diving and gymnastics vaults and outperforms the existing state-of-the-art scoring models.
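The reference-guided idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `l1_distance`, `rgr_score`, and the linear distance-to-score mapping are all hypothetical stand-ins for the paper's learned metric and estimation modules.

```python
# Hypothetical sketch of reference-guided scoring: a video's score is
# derived from its learned embedding distance to a reference video of
# known score (the closer the embedding, the closer the score).

def l1_distance(a, b):
    """L1 distance between two embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rgr_score(video_emb, ref_emb, ref_score, scale=1.0):
    """Map similarity to the reference video onto a score.

    A real system would learn this mapping; here it is a toy
    linear penalty on the embedding distance.
    """
    return ref_score - scale * l1_distance(video_emb, ref_emb)

ref = [0.2, 0.5, 0.1]                       # reference embedding
print(rgr_score(ref, ref, ref_score=9.0))   # identical video -> 9.0
print(rgr_score([0.2, 0.4, 0.1], ref, 9.0)) # slightly different -> ~8.9
```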
Citations
Journal ArticleDOI
TL;DR: In this article, a learning and fusion network of multiple hidden substages is proposed to assess athletic performance by segmenting videos into five substages by a temporal semantic segmentation, and a fully-connected-network-based hidden regression model is built to predict the score of each substage, fusing these scores into the overall score.
Abstract: Many of the existing methods for action quality assessment implement single-stage score regression networks that lack pertinence and rationality for the evaluation task. In this work, our target is to find a reasonable action quality assessment method for sports competitions that conforms to objective evaluation rules and field experience. To achieve this goal, three assessment scenarios, i.e., the overall-score-guided scenario, execution-score-guided scenario, and difficulty-level-based overall-score-guided scenario, are defined. A learning and fusion network of multiple hidden substages is proposed to assess athletic performance by segmenting videos into five substages by a temporal semantic segmentation. The feature of each video segment is extracted from the five feature backbone networks with shared weights, and a fully-connected-network-based hidden regression model is built to predict the score of each substage, fusing these scores into the overall score. We evaluate the proposed method on the UNLV-Diving dataset. The comparison results show that the proposed method based on objective evaluation rules of sports competitions outperforms the regression model directly trained on the overall score. The proposed multiple-substage network is more accurate than the single-stage score regression network and achieves state-of-the-art performance by leveraging objective evaluation rules and field experience that are beneficial for building an accurate and reasonable action quality assessment model.
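The substage pipeline described above (segment, score each substage, fuse) can be outlined as follows. This is a toy sketch under stated assumptions: `segment`, `substage_score`, and the additive fusion are hypothetical placeholders for the paper's temporal semantic segmentation and learned regression heads.

```python
# Hypothetical sketch of multi-substage assessment: split a video into
# contiguous substages, score each with a per-substage regressor, and
# fuse the substage scores into an overall score by summing.

def segment(frames, n_substages=5):
    """Split a frame list into n contiguous substages."""
    k = len(frames) // n_substages
    return [frames[i * k:(i + 1) * k] for i in range(n_substages)]

def substage_score(clip):
    """Stand-in for a learned per-substage regression head."""
    return len(clip) * 0.01  # placeholder scoring rule

def overall_score(frames):
    """Fuse the substage scores into the overall score."""
    return sum(substage_score(s) for s in segment(frames))

frames = list(range(100))        # a 100-frame toy "video"
print(overall_score(frames))     # five 20-frame substages, fused
```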

7 citations

Proceedings ArticleDOI
29 Jun 2021
TL;DR: In this paper, the same embeddings are also used to temporally align the sequences prior to quality assessment, which further increases the accuracy, provides robustness to variance in execution speed and enables to provide fine-grained interpretability of the assessment score.
Abstract: Action Quality Assessment (AQA) is a video understanding task aiming at the quantification of the execution quality of an action. One of the main challenges in relevant, deep learning-based approaches is the collection of training data annotated by experts. Current methods perform fine-tuning on pre-trained backbone models and aim to improve performance by modeling the subjects and the scene. In this work, we consider embeddings extracted using a self-supervised training method based on a differential cycle consistency loss between sequences of actions. These are shown to improve the state-of-the-art without the need for additional annotations or scene modeling. The same embeddings are also used to temporally align the sequences prior to quality assessment which further increases the accuracy, provides robustness to variance in execution speed and enables us to provide fine-grained interpretability of the assessment score. The experimental evaluation of the method on the MTL-AQA dataset demonstrates significant accuracy gain compared to the state-of-the-art baselines, which grows even more when the action execution sequences are not well aligned.
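The temporal-alignment step mentioned above can be illustrated with a nearest-neighbor matching over frame embeddings. This is a toy sketch: the 1-D "embeddings" and the `align` helper are hypothetical, far simpler than the paper's cycle-consistency embeddings, but they show how alignment compensates for execution speed.

```python
# Hypothetical sketch of embedding-based temporal alignment: match each
# frame of one sequence to its nearest-neighbor frame in the other, so
# that the same action executed at different speeds lines up.

def align(seq_a, seq_b):
    """For each embedding in seq_a, index of the closest one in seq_b."""
    return [min(range(len(seq_b)), key=lambda j: abs(a - seq_b[j]))
            for a in seq_a]

fast = [0.0, 0.5, 1.0]              # action executed quickly (3 frames)
slow = [0.0, 0.2, 0.5, 0.8, 1.0]    # same action, slower (5 frames)
print(align(fast, slow))            # [0, 2, 4]
```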

4 citations

Journal ArticleDOI
TL;DR: In this paper, a set of graph-based joint relations is learned for each type of action by means of trainable joint relation graphs built according to the human skeleton structure, and the learned joint relation graphs can visually interpret the assessment process.
Abstract: Action assessment, the process of evaluating how well an action is performed, is an important task in human action analysis. Action assessment has experienced considerable development based on visual cues; however, existing methods neglect to adaptively learn different architectures for varied types of actions and are therefore limited in achieving high-performance assessment for each type of action. In fact, every type of action has specific evaluation criteria, and human experts are trained for years to correctly evaluate a single type of action. Therefore, it is difficult for a single assessment architecture to achieve high performance for all types of actions. However, manually designing an assessment architecture for each specific type of action is very difficult and impractical. This work addresses this problem by adaptively designing different assessment architectures for different types of actions, and the proposed approach is therefore called adaptive action assessment. In order to facilitate adaptive action assessment by exploiting the specific joint interactions for each type of action, a set of graph-based joint relations is learned for each type of action by means of trainable joint relation graphs built according to the human skeleton structure, and the learned joint relation graphs can visually interpret the assessment process. In addition, we introduce a normalized mean squared error (N-MSE) loss and a Pearson loss that perform automatic score normalization to support adaptive assessment training. The experiments on four benchmarks for action assessment demonstrate the effectiveness and feasibility of the proposed method. We also demonstrate the visual interpretability of our model by visualizing the details of the assessment process.
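The Pearson loss mentioned above is a standard correlation-based objective; a minimal version is sketched below. The `pearson_loss` helper name is hypothetical, but the formula is the usual 1 - r, which is invariant to the scale and offset of the predicted scores and thus acts as an automatic score normalization.

```python
import math

# Sketch of a Pearson-correlation loss (1 - r): penalizes predictions
# that do not correlate linearly with the target scores, regardless of
# their absolute scale or offset.

def pearson_loss(pred, target):
    n = len(pred)
    mp = sum(pred) / n
    mt = sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return 1.0 - cov / (sp * st)

# Perfectly linearly correlated predictions give zero loss even though
# the predicted scores are on a different scale than the targets.
print(pearson_loss([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))  # ~0.0
```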

4 citations

TL;DR: Experiments show the proposed method outperforms SOTAs on all major metrics on the public Fis-V and the authors' FS1000 datasets, and an analysis applying the method to recent competitions from the Beijing 2022 Winter Olympic Games demonstrates its strong robustness.
Abstract: Figure skating scoring is a challenging task because it requires judging players' technical moves as well as coordination with the background music. Prior learning-based work cannot solve it well for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling will lose a lot of valuable information, especially in a video lasting 3-5 minutes, making extremely long-range representation learning necessary; 2) prior methods rarely considered the critical audio-visual relationship in their models. Thus, we introduce a multimodal MLP architecture, named Skating-Mixer. It extends the MLP-Mixer-based framework into a multimodal fashion and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we also collected a high-quality audio-visual FS1000 dataset, which contains over 1000 videos on 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show the proposed method outperforms SOTAs on all major metrics on the public Fis-V and our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions from the Beijing 2022 Winter Olympic Games, demonstrating that our method has strong robustness.

4 citations

Journal ArticleDOI
01 Sep 2022
TL;DR: Inspired by the temporal dependencies of action execution, this work proposes self-supervised learning on unlabeled videos by recovering the feature of a masked segment of an unlabeled video, and leverages adversarial learning to align the representation distributions of the labeled and unlabeled samples to close their gap in the sample space.
Abstract: Action Quality Assessment aims to evaluate how well an action is performed. Existing methods have achieved remarkable progress on fully-supervised action assessment. However, in real-world applications, it is not always feasible to manually label all samples, since labeling requires expert experience. Therefore, it is important to study the problem of semi-supervised action assessment with only a small number of annotated samples. A major challenge for semi-supervised action assessment is how to exploit the temporal pattern of unlabeled videos. Inspired by the temporal dependencies of action execution, we propose self-supervised learning on the unlabeled videos by recovering the feature of a masked segment of an unlabeled video. Furthermore, since unlabeled samples often come from unseen actions, we leverage adversarial learning to align the representation distributions of the labeled and unlabeled samples and close their gap in the sample space. Finally, we propose an adversarial self-supervised framework for semi-supervised action quality assessment. Extensive experimental results on the MTL-AQA and Rhythmic Gymnastics datasets demonstrate the effectiveness of our framework, which achieves state-of-the-art performance for semi-supervised action quality assessment.
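The masked-segment pretext task described above can be sketched as follows. Everything here is a hypothetical stand-in: the feature sequence is a toy list, `mask_segment` and the fixed "recovered" values replace the paper's learned reconstruction network, and only the mean squared error on the masked span is computed.

```python
# Hypothetical sketch of masked-segment self-supervision: hide one
# temporal segment of a feature sequence, predict its contents, and
# measure reconstruction error on the masked segment only.

def mask_segment(features, start, end):
    """Return a copy of the feature sequence with [start, end) zeroed."""
    masked = list(features)
    for i in range(start, end):
        masked[i] = 0.0
    return masked

def mse(pred, target):
    """Mean squared error over the masked segment."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

feats = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # per-segment video features
masked = mask_segment(feats, 2, 4)        # hide the segment [0.3, 0.4]
recovered = [0.35, 0.35]                  # stand-in model prediction
loss = mse(recovered, feats[2:4])         # supervise on the hidden part
print(masked, loss)
```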

3 citations

References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
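The gating described above can be shown with a single LSTM step in plain Python. This is a toy scalar-state sketch with hypothetical tiny weights, not the paper's formulation in full; its point is that the cell state c is updated additively (f * c + i * g), which is what keeps error flow roughly constant across time steps.

```python
import math

# One scalar LSTM step with hypothetical weights, illustrating the
# multiplicative gates and the additive cell-state update.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    i = sigmoid(w["i"] * x + w["ui"] * h)    # input gate
    f = sigmoid(w["f"] * x + w["uf"] * h)    # forget gate
    o = sigmoid(w["o"] * x + w["uo"] * h)    # output gate
    g = math.tanh(w["g"] * x + w["ug"] * h)  # candidate value
    c = f * c + i * g                        # additive cell update
    h = o * math.tanh(c)                     # gated hidden output
    return h, c

w = {"i": 0.5, "ui": 0.1, "f": 0.5, "uf": 0.1,
     "o": 0.5, "uo": 0.1, "g": 0.5, "ug": 0.1}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:     # a tiny input sequence
    h, c = lstm_step(x, h, c, w)
print(h, c)
```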

72,897 citations

Proceedings Article
08 Dec 2014
TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
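The source-reversal trick from the last sentence above is simple enough to show directly. The `prepare_pair` helper and the toy tokens are hypothetical; the point is that reversing the source (but not the target) puts the first source word next to the first target word, introducing short-term dependencies that ease optimization.

```python
# Sketch of the seq2seq source-reversal trick on toy data: the source
# sequence is reversed before encoding, the target is left unchanged.

def prepare_pair(source_tokens, target_tokens):
    """Return (reversed source, unchanged target) as fed to the encoder."""
    return list(reversed(source_tokens)), list(target_tokens)

src, tgt = prepare_pair(["je", "suis", "la"], ["i", "am", "here"])
print(src, tgt)  # ['la', 'suis', 'je'] ['i', 'am', 'here']
```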

12,299 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
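A small shape calculation shows why the homogeneous 3x3x3 design stacks well. The `conv3d_out_shape` helper and the 16-frame 112x112 clip size are illustrative assumptions; the formula is the standard convolution output-size rule applied in time, height, and width.

```python
# Output size of a cubic 3D convolution over a video clip: with a 3x3x3
# kernel, stride 1, and padding 1, the spatio-temporal shape is preserved,
# so such layers can be stacked homogeneously.

def conv3d_out_shape(t, h, w, k=3, stride=1, pad=1):
    """Apply the standard conv output-size rule to each dimension."""
    f = lambda n: (n + 2 * pad - k) // stride + 1
    return f(t), f(h), f(w)

print(conv3d_out_shape(16, 112, 112))  # (16, 112, 112)
```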

7,091 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: The idea is to learn a function that maps input patterns into a target space such that the L1 norm in the target space approximates the "semantic" distance in the input space.
Abstract: We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the L1 norm in the target space approximates the "semantic" distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.
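The discriminative objective above is a contrastive one; a minimal version is sketched below. The helper names and margin value are hypothetical, and the exact loss in the paper differs in detail, but the structure is the same: shrink the distance of genuine pairs, push impostor pairs beyond a margin.

```python
# Sketch of a contrastive objective over L1 embedding distances:
# genuine pairs are pulled together, impostor pairs are pushed apart
# until they exceed a margin.

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    d = l1(emb_a, emb_b)
    if same:
        return d * d                    # shrink genuine-pair distance
    return max(0.0, margin - d) ** 2    # push impostor pairs apart

print(contrastive_loss([0.1, 0.2], [0.1, 0.2], same=True))   # 0.0
print(contrastive_loss([0.1, 0.2], [0.1, 0.2], same=False))  # 1.0
```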

3,870 citations


"Action Quality Assessment Using Sia..." refers methods in this paper

  • ...Siamese network [30] has been widely used as a deep metric learning-based (DML) approach [22] to learn the similarity between two sequences....


Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
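The pooling/upsampling symmetry that gives the hourglass its name can be sketched in terms of feature-map resolutions. The helper and the 64-pixel starting size are illustrative assumptions; a real hourglass also carries skip connections that reinject features at each matching resolution on the way back up.

```python
# Resolution schedule of one hourglass stage: repeated pooling halves
# the resolution down to a bottleneck, then upsampling restores it,
# producing the symmetric "hourglass" profile.

def hourglass_resolutions(size, depth=4):
    down = [size >> i for i in range(depth + 1)]  # 64, 32, 16, 8, 4
    up = list(reversed(down[:-1]))                # 8, 16, 32, 64
    return down + up

print(hourglass_resolutions(64))  # [64, 32, 16, 8, 4, 8, 16, 32, 64]
```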

3,865 citations