Proceedings ArticleDOI

Learning Multi-modal Attentional Consensus in Action Recognition for Elderly-Care Robots

TL;DR: Wang et al. propose a new mid-level feature fusion method for two-stream action recognition networks that leverages the whole feature map from each modality and achieves competitive performance across various experimental settings, especially when the domain changes.
Abstract: This paper addresses a practical action recognition method for elderly-care robots. Multi-stream models are one of the promising approaches for handling the complexity of real-world environments. While multi-modal action recognition has been actively studied, there is a lack of research on models that effectively combine features from different modalities. This paper proposes a new mid-level feature fusion method for two-stream action recognition networks. In multi-modal approaches, extracting complementary information between different modalities is an essential task. Our network model is designed to fuse features at an intermediate level of feature extraction, which leverages the whole feature map from each modality. A consensus feature map and a consensus attention mechanism are proposed as effective ways to extract information from two different modalities: RGB data and motion features. We also introduce ETRI-Activity3D-LivingLab, a real-world RGB-D dataset for robots to recognize daily activities of the elderly. It is the first 3D action recognition dataset collected in a variety of home environments where the elderly actually reside. We expect our new dataset, together with the previously released ETRI-Activity3D dataset, to contribute to practical studies of action recognition. To prove the effectiveness of the method, extensive experiments are performed on the NTU RGB+D, ETRI-Activity3D, and ETRI-Activity3D-LivingLab datasets. Our mid-level fusion method achieves competitive performance in various experimental settings, especially in domain-changing situations.
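A minimal PyTorch-style sketch of the kind of mid-level fusion the abstract describes: a consensus map is built from the concatenated RGB and motion feature maps, and per-modality attention derived from it reweights each stream. The module and variable names (ConsensusFusion, attn_rgb, attn_mot) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ConsensusFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv merges the concatenated RGB/motion maps into a consensus map
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # attention weights for each modality, derived from the shared consensus map
        self.attn_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.attn_mot = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_mot):
        consensus = torch.relu(self.merge(torch.cat([f_rgb, f_mot], dim=1)))
        # reweight each stream with attention computed from the consensus map
        return self.attn_rgb(consensus) * f_rgb + self.attn_mot(consensus) * f_mot

# usage: fuse intermediate feature maps taken from two backbone streams
fusion = ConsensusFusion(channels=256)
fused = fusion(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))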
Citations
Proceedings ArticleDOI
29 May 2023
TL;DR: In this paper, a novel deep learning human activity recognition and classification architecture is presented that autonomously identifies ADLs in home environments, enabling long-term deployment of socially assistive robots to aid older adults.
Abstract: Many older adults prefer to stay in their own homes and age-in-place. However, physical and cognitive limitations in independently completing activities of daily living (ADLs) require older adults to receive assistive support, often necessitating a transition to care centers. In this paper, we present the development of a novel deep learning human activity recognition and classification architecture capable of autonomously identifying ADLs in home environments to enable long-term deployment of socially assistive robots to aid older adults. Our deep learning architecture is the first to use multimodal inputs to create an embedding vector approach for classifying and monitoring multiple ADLs. It uses spatial mid-fusion to combine geometric, motion and semantic features of users, environments, and objects to classify and track ADLs. We leverage transfer learning to extract generic features using the early layers of deep networks trained on large datasets to apply our architecture to various ADLs. The embedding vector enables identification of unseen ADLs and determines intra-class variance for monitoring user ADL performance. Our proposed unique architecture can be used by socially assistive robots to promote reablement in the home via autonomously supporting the assistance of varying ADLs. Extensive experiments show improved classification accuracy compared to unimodal/dual-modal models, and the ADL embedding space also incorporates the ability to distinctly identify and track seen and unseen ADLs.
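A rough sketch of how a multimodal embedding-vector approach like the one described could look: modality features are concatenated (standing in for the spatial mid-fusion step), projected to a unit-norm embedding, and matched against class prototypes, with low similarity flagging an unseen ADL. All names, dimensions, and the threshold are illustrative assumptions, not this paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ADLEmbedder(nn.Module):
    def __init__(self, dims=(128, 128, 128), embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(sum(dims), embed_dim)

    def forward(self, geom, motion, semantic):
        fused = torch.cat([geom, motion, semantic], dim=-1)  # stand-in for mid-fusion
        return F.normalize(self.proj(fused), dim=-1)         # unit-norm ADL embedding

def classify(embedding, prototypes, threshold=0.5):
    # nearest-prototype classification; low similarity marks an unseen ADL (-1)
    sims = embedding @ prototypes.T
    best = sims.argmax(dim=-1)
    return torch.where(sims.max(dim=-1).values > threshold, best,
                       torch.full_like(best, -1))

embedder = ADLEmbedder()
e = embedder(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
labels = classify(e, F.normalize(torch.randn(10, 64), dim=-1))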
References
Journal ArticleDOI
18 Jun 2018
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ~25 percent. Models and code are available at https://github.com/hujie-frank/SENet .

14,807 citations
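The SE block summarized in the abstract fits in a few lines. This PyTorch sketch follows that description (global average pooling as the "squeeze", a bottleneck MLP with sigmoid gating as the "excitation"); the reduction ratio of 16 is a typical default, not a claim about this paper's exact configuration.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1) # excitation: per-channel gates in [0, 1]
        return x * w                             # channel-wise recalibration

se = SEBlock(64)
y = se(torch.randn(2, 64, 32, 32))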

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: they achieve 52.8% accuracy on the UCF101 dataset with only 10 dimensions and are also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

7,091 citations
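A minimal 3D-ConvNet stack in the spirit of C3D, illustrating the homogeneous 3x3x3 kernels the abstract describes. The depth, channel widths, pooling schedule, and class count below are simplified placeholders, not the published C3D configuration.

import torch
import torch.nn as nn

def conv3d_block(cin, cout, pool):
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),  # homogeneous 3x3x3 kernels
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool))

# input is a clip tensor of shape (N, 3, T, H, W), e.g. 16 frames of 112x112 RGB
model = nn.Sequential(
    conv3d_block(3, 64, pool=(1, 2, 2)),    # keep the temporal length early on
    conv3d_block(64, 128, pool=(2, 2, 2)),
    conv3d_block(128, 256, pool=(2, 2, 2)),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(256, 101))                    # e.g. 101 classes for UCF101

scores = model(torch.randn(2, 3, 16, 112, 112))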

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations
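A small sketch of the two-stream setup described above, assuming a ResNet-18 backbone for each stream (a stand-in, not the paper's ConvNet) and late fusion by averaging per-stream class scores. The 20-channel temporal input corresponds to a stack of 10 horizontal/vertical dense optical-flow fields.

import torch
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 101
spatial = resnet18(num_classes=num_classes)   # spatial stream: a single RGB frame
temporal = resnet18(num_classes=num_classes)  # temporal stream: stacked optical flow
# replace the stem so the temporal stream accepts 2*L = 20 flow channels
temporal.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)

def two_stream_predict(rgb, flow_stack):
    # late fusion: average the softmax scores of the two streams
    return (spatial(rgb).softmax(-1) + temporal(flow_stack).softmax(-1)) / 2

scores = two_stream_predict(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))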

Book ChapterDOI
08 Sep 2018
TL;DR: The Convolutional Block Attention Module (CBAM) is a simple yet effective attention module for feed-forward convolutional neural networks: given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement.
Abstract: We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

5,335 citations
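A compact PyTorch rendering of the CBAM recipe from the abstract: channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise average and max maps. The reduction ratio and 7x7 spatial kernel are common choices assumed here, not quotations from the paper.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention: shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # spatial attention: conv over the channel-wise average/max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                                    # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(x.size(0), -1, 1, 1)  # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                  # spatial attention

cbam = CBAM(64)
y = cbam(torch.randn(2, 64, 32, 32))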

Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.

5,073 citations
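The inflation idea behind I3D can be illustrated directly: repeat a pretrained 2D kernel along the temporal axis and rescale by 1/T so activations keep roughly the same magnitude. The helper below is an illustrative sketch under that reading of the abstract, not the authors' code.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    # build a 3D conv whose spatial shape matches the 2D conv, with a temporal extent
    kH, kW = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_kernel, kH, kW),
                       stride=(1, *conv2d.stride),
                       padding=(time_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # repeat the 2D kernel T times along the new temporal axis and rescale by 1/T
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# example: inflate a 7x7 image-classification stem conv into a 3x7x7 video stem
stem3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
out = stem3d(torch.randn(1, 3, 16, 112, 112))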