Book Chapter (DOI)

Detecting Missed and Anomalous Action Segments Using Approximate String Matching Algorithm

16 Dec 2017-pp 101-111
TL;DR: An exemplar-based Approximate String Matching (ASM) technique is proposed for detecting anomalous and missing segments in action sequences, and it shows promising alignment and missed/anomalous notification results on the authors' Warm up exercise dataset.
Abstract: As amateur performers, we forget action steps and perform unwanted movements during our daily exercise routines, dance performances, etc. To improve our proficiency, it is important that we get feedback on our performances in terms of where we went wrong. In this paper, we propose a framework for analyzing and issuing reports of action segments that were missed or anomalously performed. This involves comparing the performed sequence with the standard action sequence and notifying when misalignments occur. We propose an exemplar-based Approximate String Matching (ASM) technique for detecting such anomalous and missing segments in action sequences. We compare the results with those obtained from the conventional Dynamic Time Warping (DTW) algorithm for sequence alignment. The alignment of the action sequences under conventional DTW fails in the presence of missed and anomalous segments due to its boundary condition constraints. The performance of the two techniques has been tested on a complex aperiodic human action dataset of Warm up exercise sequences that we developed from correct and incorrect executions by multiple people. The proposed ASM technique shows promising alignment and missed/anomalous notification results on this dataset.
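The abstract does not spell out the ASM formulation, so the following is only a minimal sketch of the general idea, assuming (as the citing works below suggest) that each performance has already been quantized into a string of pose-codebook symbols: an edit-distance alignment against the reference string, whose backtrace flags deleted reference symbols as missed steps and inserted or substituted performed symbols as anomalous ones. The function name and symbol alphabet are illustrative.

```python
# Minimal sketch (not the paper's exact ASM formulation): align a performed
# pose-symbol string against a reference string with edit-distance DP, then
# read missed reference steps (deletions) and anomalous performed steps
# (insertions/substitutions) from the backtrace.

def align_and_report(reference, performed):
    n, m = len(reference), len(performed)
    # dp[i][j] = edit distance between reference[:i] and performed[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == performed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # reference step missed
                           dp[i][j - 1] + 1,         # extra performed step
                           dp[i - 1][j - 1] + sub)   # match / wrong step
    missed, anomalous = [], []
    i, j = n, m
    while i > 0 or j > 0:                            # backtrace one optimal path
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if reference[i - 1] == performed[j - 1] else 1):
            if reference[i - 1] != performed[j - 1]: # substitution: report both
                missed.append((i - 1, reference[i - 1]))
                anomalous.append((j - 1, performed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1: # deletion: missed step
            missed.append((i - 1, reference[i - 1]))
            i -= 1
        else:                                        # insertion: anomalous step
            anomalous.append((j - 1, performed[j - 1]))
            j -= 1
    return dp[n][m], sorted(missed), sorted(anomalous)

# Illustrative codebook symbols: step 'C' is replaced by an unwanted 'X'.
dist, missed, anomalous = align_and_report("ABCDE", "ABXDE")
print(dist, missed, anomalous)
```

Unlike DTW, which must match both sequence endpoints, this kind of alignment charges missed or extra segments as explicit edit operations, which is what makes per-segment notification possible.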
Citations
Journal Article (DOI)
TL;DR: This work proposes a new action scoring system termed as Reference Guided Regression (RGR), which comprises a Deep Metric Learning Module that learns similarity between any two action videos based on their ground truth scores given by the judges, and a Score Estimation Module that uses the resemblance of a video with a reference video to give the assessment score.
Abstract: Automated vision-based score estimation models can be used to provide an alternate opinion to avoid judgment bias. Existing works have learned score estimation models by regressing the video representation to ground truth score provided by judges. However, such regression-based solutions lack interpretability in terms of giving reasons for the awarded score. One solution to make the scores more explicable is to compare the given action video with a reference video, which would capture the temporal variations vis-a-vis the reference video and map those variations to the final score. In this work, we propose a new action scoring system termed as Reference Guided Regression (RGR) , which comprises (1) a Deep Metric Learning Module that learns similarity between any two action videos based on their ground truth scores given by the judges, and (2) a Score Estimation Module that uses the first module to find the resemblance of a video with a reference video to give the assessment score. The proposed scoring model is tested for Olympics Diving and Gymnastic vaults and the model outperforms the existing state-of-the-art scoring models.
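The abstract names the two modules but not their internals; the sketch below is a hedged illustration of one way such a reference-guided scorer could be wired together, assuming precomputed video features. The network sizes, loss, and the linear distance-to-score mapping are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' implementation): a siamese
# embedding is trained so that the distance between two action-video features
# tracks the difference of their judge scores; an unseen video is then scored
# from its distance to a reference (top-scored) video.
import torch
import torch.nn as nn

class MetricNet(nn.Module):
    def __init__(self, feat_dim=512, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))

    def forward(self, x):
        return self.net(x)

def metric_loss(model, feats_a, feats_b, scores_a, scores_b):
    # Deep Metric Learning Module (sketch): embedding distance should match
    # the absolute difference of the judges' scores.
    d = torch.norm(model(feats_a) - model(feats_b), dim=1)
    return ((d - (scores_a - scores_b).abs()) ** 2).mean()

def estimate_score(model, feats, ref_feats, ref_score):
    # Score Estimation Module (sketch): the further a video lies from the
    # reference in the learned metric space, the lower its predicted score.
    with torch.no_grad():
        d = torch.norm(model(feats) - model(ref_feats), dim=1)
    return ref_score - d  # hypothetical linear mapping from distance to score
```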

29 citations


Cites background from "Detecting Missed and Anomalous Acti..."

  • ...Few early works [7], [6], [11], [10], [9] in the domain were hand crafted for specific actions and could not be generalised to different types of actions....


  • ...rehabilitation [14], exercise [6], [7] and actions of daily living [27]....


Journal Article (DOI)
17 Apr 2020
TL;DR: The aim of this study was to develop two novel methods of evaluating performance in the STS, one using a low-cost RGB camera and the other an instrumented chair containing load cells in the seat to detect center of pressure movements and ground reaction forces.
Abstract: The sit-to-stand test (STS) is a simple test of function in older people that can identify people at risk of falls. The aim of this study was to develop two novel methods of evaluating performance in the STS, one using a low-cost RGB camera and the other an instrumented chair containing load cells in the seat of the chair to detect center of pressure movements and ground reaction forces. The two systems were compared to a Kinect and a force plate. Twenty-one younger subjects were tested when performing two 5STS movements at self-selected slow and normal speeds while 16 older fallers were tested when performing one 5STS at a self-selected pace. All methods had acceptable limits of agreement with an expert for total STS time for younger subjects and older fallers, with smaller errors observed for the chair (−0.18 ± 0.17 s) and force plate (−0.19 ± 0.79 s) than for the RGB camera (−0.30 ± 0.51 s) and the Kinect (−0.38 ± 0.50 s) for older fallers. The chair had the smallest limits of agreement compared to the expert for both younger and older participants. The new device was also able to estimate movement velocity, which could be used to estimate muscle power during the STS movement. Subsequent studies will test the device against opto-electronic systems, incorporate additional sensors, and then develop predictive equations for measures of physical function.
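The ± figures above are agreement statistics against the expert's timing. As a point of reference, a Bland-Altman style computation of bias and limits of agreement from paired timings looks like the sketch below; whether the paper reports bias ± SD or full ±1.96 SD limits is not stated here, so the 1.96 factor is an assumption.

```python
# Minimal sketch of a Bland-Altman style agreement check between a device's
# STS times and an expert's reference times. Whether the cited numbers are
# bias +/- SD or full 1.96*SD limits of agreement is an assumption here.
import numpy as np

def limits_of_agreement(device_times, expert_times):
    diff = np.asarray(device_times) - np.asarray(expert_times)
    bias = diff.mean()                      # systematic offset (s)
    sd = diff.std(ddof=1)                   # spread of the differences (s)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired 5STS total times in seconds.
bias, (lo, hi) = limits_of_agreement([9.8, 12.1, 10.5, 11.0],
                                     [10.0, 12.4, 10.9, 11.3])
print(f"bias={bias:.2f}s, LoA=({lo:.2f}, {hi:.2f})s")
```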

12 citations


Cites methods from "Detecting Missed and Anomalous Acti..."

  • ...Poses estimated using this library are accurate at assessing human movement [25]....


Proceedings Article (DOI)
01 Oct 2018
TL;DR: This work presents a novel community detection-based human action segmentation algorithm that marks the existence of community structures in human action videos where the consecutive frames around the key poses group together to form communities similar to social networks.
Abstract: Temporal segmentation of complex human action videos into action primitives plays a pivotal role in building models for human action understanding. Studies in the past have introduced unsupervised frameworks for deriving a known number of motion primitives from action videos. Our work focuses on answering a question: given a set of videos with humans performing an activity, can the action primitives be derived from them without specifying any prior knowledge about the count of the constituting sub-action categories? To this end, we present a novel community detection-based human action segmentation algorithm. Our work marks the existence of community structures in human action videos, where the consecutive frames around the key poses group together to form communities similar to social networks. We test our proposed technique on the stitched Weizmann dataset and the MHAD101-s motion capture dataset, and our technique outperforms the state-of-the-art techniques for complex action segmentation without the count of actions being pre-specified.
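The abstract gives the idea but not the construction; the following is a hedged sketch of community-detection-based segmentation under assumptions: frames as graph nodes, edges between frames with similar pose features, and a modularity-based grouping (here NetworkX's greedy modularity routine) standing in for whatever detection algorithm the authors actually use.

```python
# Minimal sketch (assumptions, not the authors' exact pipeline): frames become
# graph nodes, edges link frames whose pose features are similar, and
# modularity-based community detection groups frames around key poses into
# candidate action primitives without a prior count of sub-actions.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def segment_frames(frame_features, sim_threshold=0.9):
    X = np.asarray(frame_features, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # cosine similarity
    sim = X @ X.T
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if sim[i, j] >= sim_threshold:
                G.add_edge(i, j, weight=float(sim[i, j]))
    communities = greedy_modularity_communities(G, weight="weight")
    # Map each frame index to a community id; contiguous runs of the same id
    # form the temporal segments (action primitives).
    label = {f: c for c, frames in enumerate(communities) for f in frames}
    return [label[f] for f in range(len(X))]
```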

5 citations


Cites methods from "Detecting Missed and Anomalous Acti..."

  • ...Bag-of-Features approach [1], Template matching based segmentation approach[2] and Hidden Markov Model (HMM) [3][4]; b) Unsupervised approaches that model the video sequences without their ground-truth labels and have an explicit training phase, e....


Proceedings Article (DOI)
01 Nov 2019
TL;DR: This work introduces a novel sequence-to-sequence autoencoder-based scoring model which learns the representation from only expert performances and judges an unknown performance based on how well it can be regenerated from the learned model.
Abstract: Developing a model for the task of assessing quality of human action is a key research area in computer vision. The quality assessment task has been posed as a supervised regression problem, where models have been trained to predict score, given action representation features. However, human proficiency levels can widely vary and so do their scores. Providing all such performance variations and their respective scores is an expensive solution as it requires a domain expert to annotate many videos. The question arises - Can we exploit the variations of the performances from that of expert and map the variations to their respective scores? To this end, we introduce a novel sequence-to-sequence autoencoder-based scoring model which learns the representation from only expert performances and judges an unknown performance based on how well it can be regenerated from the learned model. We evaluated our model in predicting scores of a complex Sun Salutation action sequence, and demonstrate that our model gives remarkable prediction accuracy compared to the baselines.
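The abstract describes the model only at a high level; below is a minimal sketch of a sequence-to-sequence autoencoder scorer under assumptions (LSTM encoder/decoder, 2D pose inputs, an exponential error-to-score mapping), not the authors' architecture.

```python
# Minimal sketch (assumptions, not the authors' architecture): an LSTM
# sequence-to-sequence autoencoder is trained only on expert pose sequences;
# an unseen performance is scored by how well the model regenerates it
# (low reconstruction error -> close to expert behaviour -> high score).
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, pose_dim=34, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, x):                      # x: (batch, time, pose_dim)
        _, (h, _) = self.encoder(x)            # summarize the sequence
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        y, _ = self.decoder(z)                 # regenerate frame by frame
        return self.out(y)

def score(model, x):
    # Hypothetical mapping from reconstruction error to a quality score.
    with torch.no_grad():
        err = torch.mean((model(x) - x) ** 2)
    return torch.exp(-err)                     # 1.0 = perfectly regenerated

# Training (on expert sequences only) would minimize nn.MSELoss() between
# model(x) and x.
```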

5 citations


Cites methods from "Detecting Missed and Anomalous Acti..."

  • ...Evaluation Metrics Baseline and Experiment Settings We compare our model with 3 baseline works - 1) Pose vs SVR [1], 2) C3D vs SVR, LSTM+SVR [3] 3) Expert Template Matching Approach [10] For Pose + SVR-based scoring[1], the pose sequences are pre-processed using DCT and DFT operations....


  • ...The technique is compared with the state-of-the-art regression-based action scoring techniques[1, 3] and template-based assessment technique[10]....


  • ...Following our previous work[10], we use the stacked hourglass networks[11] for human pose estimation....


  • ...For the template based approach[10], and our approach, the poses are converted to 7 codebook words considering 7 distinct poses....


Posted Content
TL;DR: In this article, the authors proposed a new action scoring system as a two-phase system: (1) a Deep Metric Learning Module that learns similarity between any two action videos based on their ground truth scores given by the judges; (2) Score Estimation Module that uses the first module to find the resemblance of a video to a reference video in order to give the assessment score.
Abstract: Automated vision-based score estimation models can be used as an alternate opinion to avoid judgment bias. In the past works the score estimation models were learned by regressing the video representations to the ground truth score provided by the judges. However such regression-based solutions lack interpretability in terms of giving reasons for the awarded score. One solution to make the scores more explicable is to compare the given action video with a reference video. This would capture the temporal variations w.r.t. the reference video and map those variations to the final score. In this work, we propose a new action scoring system as a two-phase system: (1) A Deep Metric Learning Module that learns similarity between any two action videos based on their ground truth scores given by the judges; (2) A Score Estimation Module that uses the first module to find the resemblance of a video to a reference video in order to give the assessment score. The proposed scoring model has been tested for Olympics Diving and Gymnastic vaults and the model outperforms the existing state-of-the-art scoring models.

4 citations

References
Journal Article (DOI)
H. Sakoe, S. Chiba
TL;DR: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition, in which the warping function slope is restricted so as to improve discrimination between words in different categories.
Abstract: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition. First, a general principle of time-normalization is given using a time-warping function. Then, two time-normalized distance definitions, called symmetric and asymmetric forms, are derived from the principle. These two forms are compared with each other through theoretical discussions and experimental studies. The superiority of the symmetric form algorithm is established. A new technique, called slope constraint, is successfully introduced, in which the warping function slope is restricted so as to improve discrimination between words in different categories. The effective slope constraint characteristic is qualitatively analyzed, and the optimum slope constraint condition is determined through experiments. The optimized algorithm is then extensively subjected to experimental comparison with various DP algorithms previously applied to spoken word recognition by different research groups. The experiment shows that the present algorithm gives no more than about two-thirds of the errors, even compared to the best conventional algorithm.
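For concreteness, the core of such a DP time-normalization is a DTW recursion; the sketch below uses a Sakoe-Chiba style band as the warping constraint. The window size and the simple symmetric step pattern are illustrative choices rather than the paper's exact settings.

```python
# Minimal DTW sketch in the spirit of this reference: a dynamic-programming
# alignment with a Sakoe-Chiba style band (window) that restricts how far the
# warping path may stray from the diagonal.
import numpy as np

def dtw_distance(a, b, window=3):
    n, m = len(a), len(b)
    window = max(window, abs(n - m))           # band must reach the corner
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch sequence a
                                 D[i, j - 1],      # stretch sequence b
                                 D[i - 1, j - 1])  # advance both (symmetric)
    return D[n, m]

# Example: the same shape at two different speeds aligns with low cost.
print(dtw_distance([0, 1, 2, 3, 2, 1, 0], [0, 0, 1, 2, 3, 3, 2, 1, 0, 0]))
```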

5,906 citations

Book Chapter (DOI)
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
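As an illustration of the pooling/upsampling recursion described above, here is a minimal, hedged sketch of a single hourglass module; the convolution blocks are heavily simplified relative to the reference implementation, and the stacking with intermediate supervision is only indicated in a comment.

```python
# Minimal sketch of the "stacked hourglass" idea (illustrative, not the
# reference implementation): each hourglass pools, recurses at a lower
# resolution, upsamples, and adds a full-resolution skip branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    def __init__(self, depth, channels):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, padding=1)
        self.inner = (Hourglass(depth - 1, channels) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                          # full-resolution branch
        y = F.max_pool2d(x, 2)                       # bottom-up: pool
        y = self.down(y)
        y = self.inner(y)                            # recurse at lower scale
        y = self.up(y)
        y = F.interpolate(y, scale_factor=2)         # top-down: upsample
        return y + skip                              # consolidate scales

# A stack of such modules, each emitting joint heatmaps for intermediate
# supervision, would follow the paper's full design.
```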

3,865 citations

Posted Content
TL;DR: Stacked hourglass networks are proposed for human pose estimation: features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body, and repeated bottom-up, top-down processing with intermediate supervision is critical to improving the performance of the network.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

2,369 citations

Book Chapter (DOI)
06 Sep 2014
TL;DR: A learning-based framework that takes steps towards assessing how well people perform actions in videos by training a regression model from spatiotemporal pose features to scores obtained from expert judges and can provide interpretable feedback on how people can improve their action.
Abstract: While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
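The regression pipeline can be pictured with a short sketch: spatiotemporal pose features (here, low-frequency DCT coefficients of joint trajectories, as the citing work's description of this baseline suggests) regressed to judge scores with an SVR. The exact feature construction and kernel choice are assumptions.

```python
# Minimal sketch of regression-based action scoring in the spirit of this
# reference: compress each joint trajectory with a DCT (low-frequency
# coefficients as spatiotemporal features) and regress judge scores with an
# SVR. Feature details are assumptions, not the paper's exact pipeline.
import numpy as np
from scipy.fft import dct
from sklearn.svm import SVR

def pose_features(pose_seq, k=8):
    # pose_seq: (time, joints*2) array of 2D joint coordinates per frame.
    coeffs = dct(np.asarray(pose_seq, dtype=float), axis=0, norm="ortho")
    return coeffs[:k].ravel()                 # keep k low-frequency terms

def train_scorer(pose_seqs, judge_scores):
    X = np.stack([pose_features(p) for p in pose_seqs])
    return SVR(kernel="rbf").fit(X, judge_scores)

# model.predict(pose_features(new_seq)[None]) then gives an estimated score.
```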

177 citations

Journal Article (DOI)
01 Sep 2014
TL;DR: This paper presents the development of a Kinect-based system for ensuring home-based rehabilitation using a Dynamic Time Warping (DTW) algorithm and fuzzy logic to assist patients in conducting safe and effective home- based rehabilitation without the immediate supervision of a physician.
Abstract: Most formal rehabilitation facilities are situated in a hospital or care center setting, which may not always be conveniently accessible for patients, especially those in geographically isolated areas. Home-based rehabilitation has the potential to offer greater accessibility and thus increase consistent uptake. In addition, the exercise performed in conventional rehabilitation contexts may be insufficient to ensure the patient's speedy recovery, with complementary rehabilitation exercises at home required to make a difference. The goal is to provide effective home-based rehabilitation offering outcomes similar to those obtained through hospital-based rehabilitation under the supervision of an occupational therapist. This paper presents the development of a Kinect-based system for ensuring home-based rehabilitation using a Dynamic Time Warping (DTW) algorithm and fuzzy logic. The ultimate goal is to assist patients in conducting safe and effective home-based rehabilitation without the immediate supervision of a physician.
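As a rough illustration of how a DTW distance and fuzzy logic can be combined into feedback, the sketch below maps a normalized DTW distance (e.g., between the patient's joint trajectory and a therapist's reference) to fuzzy quality grades with triangular membership functions; the grade names and breakpoints are hypothetical, not the paper's rule base.

```python
# Minimal sketch (assumptions, not the paper's rule base): a DTW distance
# between a patient's movement and a reference is mapped to fuzzy quality
# grades via simple triangular membership functions.
import numpy as np

def triangular(x, a, b, c):
    # Membership rises from a to a peak at b, then falls to c.
    return float(max(min((x - a) / (b - a + 1e-9),
                         (c - x) / (c - b + 1e-9)), 0.0))

def fuzzy_quality(dtw_dist):
    # Hypothetical grade breakpoints for the normalized DTW distance.
    grades = {
        "good":            triangular(dtw_dist, -0.1, 0.0, 0.5),
        "needs attention": triangular(dtw_dist, 0.3, 0.7, 1.1),
        "unsafe":          triangular(dtw_dist, 0.9, 1.5, 3.0),
    }
    return max(grades, key=grades.get), grades

label, memberships = fuzzy_quality(0.6)
print(label, memberships)   # which grade best describes this repetition
```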

117 citations