
Showing papers by Majid Mirmehdi published in 2021


Proceedings ArticleDOI
15 Jan 2021
TL;DR: Temporal-Relational CrossTransformers (TRX) as mentioned in this paper constructs class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches.
Abstract: We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
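The core matching step can be pictured in a few lines. The snippet below is an illustrative PyTorch reading of the idea, not the authors' implementation: it uses frame pairs only (TRX combines several tuple cardinalities), and all dimensions and projection names are hypothetical. It shows how query tuples attend over all support-set tuples of a candidate class to form a query-specific prototype, rather than a class average.

```python
# Illustrative sketch of tuple-based cross-attention matching (assumed
# shapes; frame pairs only, while TRX combines several tuple sizes).
import itertools
import torch
import torch.nn.functional as F

T, D, H = 8, 256, 128  # frames per video, feature dim, attention dim (assumed)
PAIRS = list(itertools.combinations(range(T), 2))  # ordered frame pairs

Wq = torch.randn(2 * D, H) / (2 * D) ** 0.5  # query, key, value projections
Wk = torch.randn(2 * D, H) / (2 * D) ** 0.5
Wv = torch.randn(2 * D, H) / (2 * D) ** 0.5

def tuple_reps(frame_feats):
    """(..., T, D) frame features -> (..., P, 2*D) ordered-pair features."""
    a = frame_feats[..., [i for i, _ in PAIRS], :]
    b = frame_feats[..., [j for _, j in PAIRS], :]
    return torch.cat([a, b], dim=-1)

def trx_distance(query, support):
    """query: (T, D); support: (K, T, D), K support videos of one class."""
    qt = tuple_reps(query)                         # (P, 2*D)
    st = tuple_reps(support).reshape(-1, 2 * D)    # (K*P, 2*D)
    attn = F.softmax((qt @ Wq) @ (st @ Wk).t() / H ** 0.5, dim=-1)
    prototype = attn @ (st @ Wv)   # query-specific class prototype, (P, H)
    return F.mse_loss(qt @ Wv, prototype)          # lower = better match

# 5-way 5-shot episode with random stand-in features:
query = torch.randn(T, D)
dists = torch.stack([trx_distance(query, torch.randn(5, T, D))
                     for _ in range(5)])
print(int(dists.argmin()))  # index of the predicted class
```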

88 citations


Journal ArticleDOI
16 Jun 2021 - Sensors
TL;DR: In this article, a multimodal deep learning approach for discriminating between people with and without Parkinson's disease (PD) is presented, which uses two data modalities, acquired from vision and accelerometer sensors in a home environment, to train variational autoencoder (VAE) models.
Abstract: Parkinson's disease (PD) is a chronic neurodegenerative condition that affects a patient's everyday life. It has been proposed that a machine learning and sensor-based approach that continuously monitors patients in naturalistic settings can provide constant evaluation of PD and objectively analyse its progression. In this paper, we make progress toward such PD evaluation by presenting a multimodal deep learning approach for discriminating between people with and without PD. Specifically, our proposed architecture, named MCPD-Net, uses two data modalities, acquired from vision and accelerometer sensors in a home environment, to train variational autoencoder (VAE) models. These are modality-specific VAEs that predict effective representations of human movements to be fused and given to a classification module. During our end-to-end training, we minimise the difference between the latent spaces corresponding to the two data modalities. This makes our method capable of dealing with missing modalities during inference. We show that our proposed multimodal method outperforms unimodal and other multimodal approaches by an average increase in F1-score of 0.25 and 0.09, respectively, on a data set with real patients. We also show that our method still outperforms other approaches by an average increase in F1-score of 0.17 when a modality is missing during inference, demonstrating the benefit of training on multiple modalities.
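As a rough illustration of the training objective described above (all module sizes and names below are assumptions, not the published MCPD-Net code), the sketch trains one VAE per modality, adds an alignment term that pulls the two latent spaces together, and classifies from the fused latents. The alignment term is what lets one modality stand in for a missing one at inference.

```python
# Minimal sketch, assuming toy encoders and feature sizes: two modality
# VAEs with latent alignment, so a missing modality can be substituted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityVAE(nn.Module):
    def __init__(self, in_dim, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return F.mse_loss(recon, x) + 1e-3 * kl

vision_vae, accel_vae = ModalityVAE(512), ModalityVAE(128)
classifier = nn.Linear(64, 2)  # fused latents -> PD / non-PD

def training_loss(xv, xa, y):
    rv, mv, lv = vision_vae(xv)
    ra, ma, la = accel_vae(xa)
    align = F.mse_loss(mv, ma)  # minimise the gap between latent spaces
    logits = classifier(torch.cat([mv, ma], dim=1))
    return (vae_loss(xv, rv, mv, lv) + vae_loss(xa, ra, ma, la)
            + align + F.cross_entropy(logits, y))

# If the accelerometer stream is missing at inference, reuse the vision
# latent in both slots (plausible because of the alignment term):
xv = torch.randn(4, 512)
mv = vision_vae.mu(torch.relu(vision_vae.enc(xv)))
print(classifier(torch.cat([mv, mv], dim=1)).argmax(dim=1))
```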

9 citations


Proceedings ArticleDOI
01 Jan 2021
TL;DR: This work improves the architecture of a prior silhouette-to-accelerometer matching approach and combines it with tracking functionality so that it can be deployed in real-world homes, and shows a novel first example of subject-tailored health monitoring by applying the methodology to a sit-to-stand detector to generate clinically relevant rehabilitation trends.
Abstract: The majority of Ambient Assisted Living (AAL) systems, designed for home or lab settings, monitor one participant at a time – this avoids the complexities of pre-fusion correspondence between different sensors, since carers, guests, and visitors may be involved in real-world scenarios. Previous work by Masullo et al. (2020) presented a solution to this problem that matches video sequences of silhouettes to accelerations from wearable sensors to identify members of a household while respecting their privacy. In this work, we take that approach to the next stage by improving its architecture and combining it with a tracking functionality that makes it deployable in real-world homes. We present experiments on a new dataset recorded in participants' own houses, which includes multiple participants visited by guests, and show an auROC score of 90.2%. We also show a novel first example of subject-tailored health monitoring by applying our methodology to a sit-to-stand detector to generate clinically relevant rehabilitation trends.
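The matching step can be pictured as a shared embedding space for the two signals. The sketch below is a hypothetical minimal version (the encoder choice, feature sizes, and window lengths are assumptions, not the paper's architecture): silhouette windows and accelerometer windows are embedded by separate encoders, and each tracked person is assigned to the wearable whose embedding matches best.

```python
# Hedged sketch of video-to-wearable matching in a shared embedding space
# (GRU encoders and all sizes are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEncoder(nn.Module):
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, out_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, time, in_dim)
        _, h = self.gru(x)
        return F.normalize(h[-1], dim=1)   # unit-norm embedding per window

video_enc = SeqEncoder(in_dim=128)  # e.g. per-frame silhouette features
accel_enc = SeqEncoder(in_dim=3)    # raw 3-axis accelerations

def assign_tracks(sil_windows, accel_windows):
    """sil_windows: (n_tracks, T, 128); accel_windows: (n_wearables, T, 3)."""
    sims = video_enc(sil_windows) @ accel_enc(accel_windows).t()
    return sims.argmax(dim=1)  # wearable index per tracked person

print(assign_tracks(torch.randn(3, 50, 128), torch.randn(2, 50, 3)))
```

In training, such encoders would typically be optimised with a contrastive objective so that time-synchronised silhouette and acceleration windows embed close together; the tracker then lets the assignment persist across windows.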

3 citations


Proceedings ArticleDOI
01 Jan 2021
TL;DR: In this paper, an end-to-end deep learning framework was proposed to measure PD severity in two important components, hand movement and gait, of the Unified Parkinson's Disease Rating Scale (UPDRS).
Abstract: Evaluating neurological disorders such as Parkinson's disease (PD) is a challenging task that requires the assessment of several motor and non-motor functions. In this paper, we present an end-to-end deep learning framework to measure PD severity in two important components, hand movement and gait, of the Unified Parkinson's Disease Rating Scale (UPDRS). Our method leverages an Inflated 3D CNN trained by a temporal segment framework to learn spatial and long temporal structure in video data. We also deploy a temporal attention mechanism to boost the performance of our model. Further, motion boundaries are explored as an extra input modality to assist in obfuscating the effects of camera motion for better movement assessment. We ablate the effects of different data modalities on the accuracy of the proposed network and compare with other popular architectures. We evaluate our proposed method on a dataset of 25 PD patients, obtaining 72.3% and 77.1% top-1 accuracy on hand movement and gait tasks, respectively.
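The two video-modelling ingredients named above, temporal-segment sampling and temporal attention pooling, can be sketched as follows. The backbone here is a trivial stand-in for the Inflated 3D CNN, and all sizes and the five-way output are illustrative assumptions.

```python
# Minimal sketch: sample one snippet per temporal segment of a long clip,
# then pool per-segment features with learned temporal attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_segments(video, n_seg=8, snip_len=16):
    """video: (frames, C, H, W) -> (n_seg, snip_len, C, H, W)."""
    seg = len(video) // n_seg
    starts = [i * seg + torch.randint(seg - snip_len + 1, (1,)).item()
              for i in range(n_seg)]
    return torch.stack([video[s:s + snip_len] for s in starts])

backbone = nn.Sequential(nn.Flatten(1), nn.LazyLinear(256))  # I3D stand-in
attn = nn.Linear(256, 1)
head = nn.Linear(256, 5)  # e.g. UPDRS-style scores 0-4 (assumed)

def forward(video):
    snippets = sample_segments(video)          # (S, L, C, H, W)
    feats = backbone(snippets)                 # (S, 256)
    w = F.softmax(attn(feats), dim=0)          # temporal attention weights
    return head((w * feats).sum(dim=0))        # severity logits

print(forward(torch.randn(200, 3, 32, 32)).shape)  # torch.Size([5])
```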

3 citations


Posted Content
TL;DR: In this paper, a holistic attention network based super-resolution approach and a custom-built altitude data exploitation network are integrated into standard recognition pipelines for animal detection in real-world settings.
Abstract: Visuals captured by high-flying aerial drones are increasingly used to assess biodiversity and animal population dynamics around the globe. Yet, challenging acquisition scenarios and tiny animal depictions in airborne imagery, despite ultra-high resolution cameras, have so far been limiting factors for applying computer vision detectors successfully with high confidence. In this paper, we address the problem for the first time by combining deep object detectors with super-resolution techniques and altitude data. In particular, we show that the integration of a holistic attention network based super-resolution approach and a custom-built altitude data exploitation network into standard recognition pipelines can considerably increase the detection efficacy in real-world settings. We evaluate the system on two public, large aerial-capture animal datasets, SAVMAP and AED. We find that the proposed approach can consistently improve over ablated baselines and the state-of-the-art performance for both datasets. In addition, we provide a systematic analysis of the relationship between animal resolution and detection performance. We conclude that super-resolution and altitude knowledge exploitation techniques can significantly increase benchmarks across settings and, thus, should be used routinely when detecting minutely resolved animals in aerial imagery.
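The integration point can be sketched as a pre-detection super-resolution stage plus altitude-conditioned feature modulation. Everything below is an assumption-laden stand-in: the paper uses a holistic attention SR network and a custom-built altitude network, whereas here the SR model is a single pixel-shuffle layer and the fusion is simple FiLM-style channel gating.

```python
# Hedged sketch of the pipeline: super-resolve the aerial tile, then let an
# altitude embedding gate the detector's feature channels (all stand-ins).
import torch
import torch.nn as nn

sr_net = nn.Sequential(                    # stand-in for the SR network
    nn.Conv2d(3, 3 * 4, 3, padding=1), nn.PixelShuffle(2))  # 2x upscale
alt_net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 32))
backbone = nn.Conv2d(3, 32, 3, stride=2, padding=1)
det_head = nn.Conv2d(32, 5, 1)             # 4 box params + objectness

def detect(tile, altitude_m):
    """tile: (1, 3, H, W) aerial crop; altitude_m: flight altitude."""
    x = sr_net(tile)                           # recover small-animal detail
    f = backbone(x)
    a = alt_net(torch.tensor([[altitude_m]]))  # altitude-aware channel gains
    f = f * a.view(1, -1, 1, 1).sigmoid()      # FiLM-style modulation
    return det_head(f)                         # dense detection map

print(detect(torch.randn(1, 3, 64, 64), 80.0).shape)
```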

1 citation


Posted Content
TL;DR: In this article, an unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data is presented.
Abstract: Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations on two downstream tasks. We perform comparative experiments that show improvements over the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. Finally, we carry out ablation studies to examine the contributions of the different components of our proposed network.
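The two self-supervision signals described above can be written down compactly. The sketch below uses toy 2D-pose inputs and an in-plane rotation as the augmentation; the encoder, joint count, and loss formulation are illustrative assumptions rather than the paper's architecture.

```python
# Sketch of the two training signals, under assumed shapes: representations
# of simultaneous frames from different views should agree (invariance), and
# rotating the input should rotate the predicted 3D pose (equivariance).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

J = 17  # number of joints (assumed)
enc = nn.Sequential(nn.Linear(2 * J, 128), nn.ReLU(), nn.Linear(128, 3 * J))

def rot_z(theta):
    c, s = float(torch.cos(theta)), float(torch.sin(theta))
    return torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

def losses(pose_view_a, pose_view_b):
    """Two (J, 2) detections of the same instant from different cameras."""
    za = enc(pose_view_a.reshape(-1)).reshape(J, 3)
    zb = enc(pose_view_b.reshape(-1)).reshape(J, 3)
    invariance = F.mse_loss(za, zb)  # same pose, different viewpoints

    theta = torch.rand(1) * math.pi
    R = rot_z(theta)
    rotated_in = pose_view_a @ R[:2, :2].t()      # rotate the 2D input
    z_rot = enc(rotated_in.reshape(-1)).reshape(J, 3)
    equivariance = F.mse_loss(z_rot, za @ R.t())  # representation rotates too
    return invariance, equivariance

print(losses(torch.randn(J, 2), torch.randn(J, 2)))
```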