
Showing papers by "Majid Mirmehdi published in 2022"


Proceedings ArticleDOI
02 Jan 2022
TL;DR: A novel Voting Evidence Module that locates temporal boundaries more accurately by accumulating temporal contextual evidence to predict frame-level probabilities of start and end action boundaries; the module is incorporated within a pipeline to calculate confidence scores and action classes.
Abstract: We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries more accurately, where temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to calculate confidence scores and action classes. We achieve an average mAP of 34.6% on ActivityNet-1.3, notably outperforming previous methods at the highest IoU threshold of 0.95. TVNet also achieves mAP of 56.0% when combined with PGCN and 59.1% with MUSES at 0.5 IoU on THUMOS14, outperforming prior work at all thresholds. Our code is available at https://github.com/hanielwang/TVNet.

7 citations
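
As a rough illustration of the voting idea described above (each frame accumulating local temporal context into frame-level start/end boundary probabilities), a minimal PyTorch sketch follows. This is a hypothetical stand-in, not code from the TVNet repository; all names, layer choices, and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class VotingEvidenceSketch(nn.Module):
        """Toy stand-in for a voting evidence module: each frame casts
        evidence for nearby start/end boundaries via a temporal convolution
        over its local context window."""
        def __init__(self, feat_dim=256, window=9):
            super().__init__()
            self.context = nn.Conv1d(feat_dim, 64, kernel_size=window, padding=window // 2)
            self.head = nn.Conv1d(64, 2, kernel_size=1)  # start/end logits per frame

        def forward(self, feats):                # feats: (B, T, C) frame features
            x = feats.transpose(1, 2)            # (B, C, T) for Conv1d
            x = torch.relu(self.context(x))      # accumulate local temporal evidence
            return torch.sigmoid(self.head(x))   # (B, 2, T) boundary probabilities

    probs = VotingEvidenceSketch()(torch.randn(1, 100, 256))
    print(probs.shape)  # torch.Size([1, 2, 100])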


Journal ArticleDOI
TL;DR: Wang et al. propose a novel deep learning approach, Feature Matching Auto-encoder (FeMA), which consists of two stages: predicting ischaemic stroke evolution at one week without voxel-wise annotation, and predicting stroke treatment outcome at 90 days from a baseline scan.

3 citations
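
Only the summary is available for this entry, but the feature-matching idea it names can be sketched loosely. Below is a hypothetical PyTorch illustration of an auto-encoder with a feature-matching term; the module names, 3D-convolutional layers, and loss are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeMASketch(nn.Module):
        """Illustrative auto-encoder with a feature-matching objective: the
        encoder embedding of a baseline scan is pushed towards the embedding
        of the follow-up scan while the decoder reconstructs the follow-up."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                                         nn.Conv3d(8, 16, 3, padding=1))
            self.decoder = nn.Sequential(nn.Conv3d(16, 8, 3, padding=1), nn.ReLU(),
                                         nn.Conv3d(8, 1, 3, padding=1))

        def forward(self, baseline, followup):
            z_base, z_follow = self.encoder(baseline), self.encoder(followup)
            recon = self.decoder(z_base)               # predicted follow-up scan
            match_loss = F.mse_loss(z_base, z_follow)  # feature-matching term
            return recon, match_loss

    recon, loss = FeMASketch()(torch.randn(1, 1, 8, 32, 32), torch.randn(1, 1, 8, 32, 32))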


Journal ArticleDOI
TL;DR: This paper proposes an end-to-end curriculum learning approach for sparsely labelled animal datasets that leverages large volumes of unlabelled data to improve supervised species detectors, exemplified on the task of finding great apes in camera trap footage taken in challenging real-world jungle environments.
Abstract: We propose a novel end-to-end curriculum learning approach for sparsely labelled animal datasets leveraging large volumes of unlabelled data to improve supervised species detectors. We exemplify the method in detail on the task of finding great apes in camera trap footage taken in challenging real-world jungle environments. In contrast to previous semi-supervised methods, our approach adjusts learning parameters dynamically over time and gradually improves detection quality by steering training towards virtuous self-reinforcement. To achieve this, we propose integrating pseudo-labelling with curriculum learning policies and show how learning collapse can be avoided. We discuss theoretical arguments, ablations, and significant performance improvements against various state-of-the-art systems when evaluating on the Extended PanAfrican Dataset, which holds approx. 1.8M frames. We also demonstrate that our method can outperform supervised baselines by significant margins on sparse-label versions of other animal datasets such as Bees and Snapshot Serengeti. We note that performance advantages are strongest for the smaller labelled ratios common in ecological applications. Finally, we show that our approach achieves competitive benchmarks for generic object detection on MS-COCO and PASCAL-VOC, indicating wider applicability of the dynamic learning concepts introduced. We publish all relevant source code, network weights, and data access details for full reproducibility.

3 citations
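
To make the pseudo-labelling-plus-curriculum recipe concrete, here is a deliberately simplified PyTorch sketch using a plain classifier in place of a detector. The threshold schedule, model, and data are hypothetical; the point is only the shape of the loop: supervised updates on sparse labels, mining of high-confidence pseudo-labels, and a policy that adjusts a learning parameter (here the confidence threshold) over rounds.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pseudo_label_curriculum(model, opt, x_lab, y_lab, x_unlab,
                                rounds=3, thresh=0.95, step=-0.05):
        """Toy pseudo-labelling curriculum on a classifier (detector stand-in)."""
        for _ in range(rounds):
            # supervised pass on the sparse ground-truth labels
            loss = F.cross_entropy(model(x_lab), y_lab)
            opt.zero_grad(); loss.backward(); opt.step()

            # mine confident pseudo-labels from the unlabelled pool
            with torch.no_grad():
                conf, pseudo_y = F.softmax(model(x_unlab), dim=1).max(dim=1)
                keep = conf > thresh
            if keep.any():  # self-training step on confident samples only
                loss = F.cross_entropy(model(x_unlab[keep]), pseudo_y[keep])
                opt.zero_grad(); loss.backward(); opt.step()
            thresh += step  # curriculum policy: adjust threshold each round

    model = nn.Linear(16, 4)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    pseudo_label_curriculum(model, opt, torch.randn(32, 16),
                            torch.randint(0, 4, (32,)), torch.randn(128, 16))

Gating the self-training step behind a high, only gradually relaxed confidence threshold is one simple way to avoid the learning collapse the abstract mentions, since early, unreliable predictions never dominate training.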


Journal ArticleDOI
TL;DR: A vision-based deep learning energy expenditure estimation system for a wide range of daily living activities can be calibrated to a specific person with footage and calorimeter data from 32 seconds of sweeping and 32 seconds of sitting.
Abstract: Background Calorimetry is both expensive and obtrusive but provides the only way to accurately measure energy expenditure in the daily living activities of any specific person, as different people can use different amounts of energy despite performing the same actions in the same manner. Deep learning video analysis techniques have traditionally required large amounts of data for training; however, recent advances in few-shot learning, where only a few training examples are necessary, have made it possible to develop personalized models without a calorimeter. Objective The primary aim of this study is to determine which activities are best suited to calibrating a vision-based personalized deep learning calorie estimation system for daily living activities. Methods The SPHERE (Sensor Platform for Healthcare in a Residential Environment) Calorie data set is used, which features 10 participants performing 11 daily living activities totaling 4.5 hours of footage. Calorimeter and video data are available for all recordings. A deep learning method is used to regress calorie predictions from video. Results Models are personalized with 32 seconds of footage from all 11 actions in the data set, and mean square error (MSE) is measured against a calorimeter ground truth. The best single action for calibration is wipe (1.40 MSE). The best pair of actions is sweep and sit (1.09 MSE). This compares favorably to using a whole 30-minute sequence containing all 11 actions for calibration (1.06 MSE). Conclusions A vision-based deep learning energy expenditure estimation system for a wide range of daily living activities can be calibrated to a specific person with footage and calorimeter data from 32 seconds of sweeping and 32 seconds of sitting.

1 citation
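
The calibration step described in the conclusions amounts to fine-tuning a pretrained video-to-calorie regressor on a short clip of the new person. The sketch below is a hypothetical PyTorch rendering of that idea; the model, tensor shapes, optimizer, and step count are assumptions, not the study's implementation.

    import torch
    import torch.nn as nn

    def personalize(model, clips, calorimeter_vals, steps=50, lr=1e-4):
        """Fine-tune a pretrained calorie regressor on ~32 s of a new
        person's footage paired with calorimeter readings."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()               # the study reports MSE against ground truth
        model.train()
        for _ in range(steps):
            pred = model(clips)              # clips: (N, C, T, H, W) video tensor
            loss = loss_fn(pred.squeeze(-1), calorimeter_vals)
            opt.zero_grad(); loss.backward(); opt.step()
        return model

    toy = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))  # placeholder regressor
    personalize(toy, torch.randn(8, 3, 16, 32, 32), torch.rand(8))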


Journal ArticleDOI
TL;DR: This work proposes a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) that takes advantage of learning using privileged information (LUPI), fusing the concept of modality hallucination with triplet learning to train a model on different modalities so that missing sensors can be handled at inference time.
Abstract: We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI). We address two major shortcomings of standard multimodal approaches: limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities that can handle missing sensors at inference time. We evaluate the proposed model on inertial data from a wearable accelerometer device, using RGB videos and skeletons as privileged modalities, and show an average accuracy improvement of 6.6% on the UTD-MHAD dataset and 5.5% on the Berkeley MHAD dataset, reaching a new state of the art for inertial-only classification accuracy on these datasets. We validate our framework through several ablation studies.
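
A compact way to picture the hallucination-plus-triplet idea: at training time the inertial branch is encouraged to mimic the embedding of a privileged branch (RGB or skeleton), so only the wearable stream is needed at test time. The PyTorch sketch below is purely illustrative; branch architectures, dimensions, and the detached privileged target are assumptions rather than the paper's design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HallucinationSketch(nn.Module):
        """Toy LUPI setup: the inertial branch 'hallucinates' the privileged
        embedding via a triplet objective."""
        def __init__(self, inert_dim=64, priv_dim=512, embed=128):
            super().__init__()
            self.inertial = nn.Sequential(nn.Linear(inert_dim, embed), nn.ReLU(),
                                          nn.Linear(embed, embed))
            self.privileged = nn.Sequential(nn.Linear(priv_dim, embed), nn.ReLU(),
                                            nn.Linear(embed, embed))

        def training_loss(self, inert, priv, inert_neg):
            anchor = self.inertial(inert)         # hallucinated embedding
            pos = self.privileged(priv).detach()  # privileged target
            neg = self.inertial(inert_neg)        # sample from a different class
            # pull the anchor towards the privileged feature, push negatives away
            return F.triplet_margin_loss(anchor, pos, neg, margin=1.0)

    m = HallucinationSketch()
    loss = m.training_loss(torch.randn(4, 64), torch.randn(4, 512), torch.randn(4, 64))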

Proceedings ArticleDOI
25 Oct 2022
Abstract: Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 and standard THUMOS14 action detection benchmarks, and achieve improvement on the ActivityNet-1.3 benchmark.
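
The abstract's key addition is an extra head that outputs refined boundaries together with a confidence score. A minimal hypothetical sketch in PyTorch follows; the layer choices and shapes are assumptions meant only to show where such a head would sit in a one-stage anchor-free pipeline.

    import torch
    import torch.nn as nn

    class BoundaryConfidenceHead(nn.Module):
        """Extra prediction head: refined start/end offsets plus a
        per-boundary confidence score for each temporal location."""
        def __init__(self, feat_dim=256):
            super().__init__()
            self.refine = nn.Conv1d(feat_dim, 2, kernel_size=3, padding=1)  # start/end offsets
            self.conf = nn.Conv1d(feat_dim, 2, kernel_size=3, padding=1)    # boundary confidences

        def forward(self, feats):                    # feats: (B, C, T) temporal features
            offsets = self.refine(feats)             # refined boundary regression
            confidence = torch.sigmoid(self.conf(feats))
            return offsets, confidence

    offsets, confidence = BoundaryConfidenceHead()(torch.randn(2, 256, 128))

At inference, boundaries predicted with low confidence could then be discarded or down-weighted when scoring candidate action segments.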

Proceedings ArticleDOI
17 Aug 2022
TL;DR: Video-TransUNet, a deep architecture for instance segmentation in medical CT videos built by integrating temporal feature blending into the TransUNet deep learning framework, is proposed; the results suggest that the model can indeed enhance the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin.
Abstract: We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos, constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a Dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that tracking the pharyngeal bolus accurately is a particularly important application in clinical practice, since it constitutes the primary method for diagnosing swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin. We publish key source code, network weights, and ground truth annotations for simplified performance reproduction.
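
To make the composition of components concrete (CNN backbone, temporal blending, attention, multi-head decoder), here is a heavily reduced PyTorch sketch. Every submodule is a simplified placeholder: a single convolution stands in for the ResNet backbone, a 3D convolution for the Temporal Context Module, and one multi-head attention layer for the Vision Transformer; dimensions are arbitrary.

    import torch
    import torch.nn as nn

    class VideoTransUNetSketch(nn.Module):
        """Reduced sketch of the described pipeline: per-frame features,
        temporal blending across frames, global attention, and multiple
        segmentation heads (e.g. bolus and pharynx/larynx)."""
        def __init__(self, n_heads=2, dim=64):
            super().__init__()
            self.backbone = nn.Conv2d(1, dim, 3, padding=1)                    # ResNet stand-in
            self.temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))  # TCM stand-in
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.heads = nn.ModuleList(nn.Conv2d(dim, 1, 1) for _ in range(n_heads))

        def forward(self, video):                     # video: (B, T, 1, H, W)
            B, T, _, H, W = video.shape
            f = self.backbone(video.flatten(0, 1))    # per-frame features (B*T, D, H, W)
            f = f.view(B, T, -1, H, W).transpose(1, 2)          # (B, D, T, H, W)
            f = self.temporal(f)[:, :, T // 2]        # blended centre frame (B, D, H, W)
            tokens = f.flatten(2).transpose(1, 2)     # (B, H*W, D) for attention
            tokens, _ = self.attn(tokens, tokens, tokens)
            f = tokens.transpose(1, 2).view(B, -1, H, W)
            return [torch.sigmoid(h(f)) for h in self.heads]    # one mask per target

    masks = VideoTransUNetSketch()(torch.randn(1, 5, 1, 32, 32))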