Proceedings ArticleDOI

Learning Multi-modal Attentional Consensus in Action Recognition for Elderly-Care Robots

TL;DR: Wang et al. propose a new mid-level feature fusion method for two-stream action recognition networks that leverages the whole feature map from each modality and achieves competitive performance across various experimental settings, especially when the domain changes.
Abstract: This paper addresses a practical action recognition method for elderly-care robots. Multi-stream models are one of the promising approaches for handling the complexity of real-world environments. While multi-modal action recognition has been actively studied, there is a lack of research on models that effectively combine features from different modalities. This paper proposes a new mid-level feature fusion method for two-stream action recognition networks. In multi-modal approaches, extracting complementary information between different modalities is an essential task. Our network model is designed to fuse features at an intermediate level of feature extraction, which leverages the whole feature map from each modality. A consensus feature map and a consensus attention mechanism are proposed as effective ways to extract information from two different modalities: RGB data and motion features. We also introduce ETRI-Activity3D-LivingLab, a real-world RGB-D dataset for robots to recognize daily activities of the elderly. It is the first 3D action recognition dataset collected in a variety of home environments where the elderly actually reside. We expect our new dataset, together with the previously released ETRI-Activity3D dataset, to contribute to practical studies of action recognition. To prove the effectiveness of the method, extensive experiments are performed on the NTU RGB+D, ETRI-Activity3D, and ETRI-Activity3D-LivingLab datasets. Our mid-level fusion method achieves competitive performance in various experimental settings, especially in domain-changing situations.
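A minimal PyTorch-style sketch of the kind of mid-level fusion the abstract describes: a consensus map is built from the concatenated RGB and motion feature maps, and per-modality attention derived from it reweights each stream. The module and variable names (ConsensusFusion, attn_rgb, attn_mot) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ConsensusFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv merges the concatenated RGB/motion maps into a consensus map
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # attention weights for each modality, derived from the shared consensus map
        self.attn_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.attn_mot = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_mot):
        consensus = torch.relu(self.merge(torch.cat([f_rgb, f_mot], dim=1)))
        # reweight each stream with attention computed from the consensus map
        return self.attn_rgb(consensus) * f_rgb + self.attn_mot(consensus) * f_mot

# usage: fuse intermediate feature maps taken from two backbone streams
fusion = ConsensusFusion(channels=256)
fused = fusion(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))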
Citations
Proceedings ArticleDOI
29 May 2023
TL;DR: In this paper, a novel deep learning human activity recognition and classification architecture is presented that autonomously identifies ADLs in home environments, enabling long-term deployment of socially assistive robots to aid older adults.
Abstract: Many older adults prefer to stay in their own homes and age-in-place. However, physical and cognitive limitations in independently completing activities of daily living (ADLs) require older adults to receive assistive support, often necessitating a transition to care centers. In this paper, we present the development of a novel deep learning human activity recognition and classification architecture capable of autonomously identifying ADLs in home environments to enable long-term deployment of socially assistive robots to aid older adults. Our deep learning architecture is the first to use multimodal inputs to create an embedding vector approach for classifying and monitoring multiple ADLs. It uses spatial mid-fusion to combine geometric, motion and semantic features of users, environments, and objects to classify and track ADLs. We leverage transfer learning to extract generic features using the early layers of deep networks trained on large datasets to apply our architecture to various ADLs. The embedding vector enables identification of unseen ADLs and determines intra-class variance for monitoring user ADL performance. Our proposed unique architecture can be used by socially assistive robots to promote reablement in the home via autonomously supporting the assistance of varying ADLs. Extensive experiments show improved classification accuracy compared to unimodal/dual-modal models, and the ADL embedding space also incorporates the ability to distinctly identify and track seen and unseen ADLs.
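A rough sketch of how a multimodal embedding-vector approach like the one described could look: modality features are concatenated (standing in for the spatial mid-fusion step), projected to a unit-norm embedding, and matched against class prototypes, with low similarity flagging an unseen ADL. All names, dimensions, and the threshold are illustrative assumptions, not this paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ADLEmbedder(nn.Module):
    def __init__(self, dims=(128, 128, 128), embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(sum(dims), embed_dim)

    def forward(self, geom, motion, semantic):
        fused = torch.cat([geom, motion, semantic], dim=-1)  # stand-in for mid-fusion
        return F.normalize(self.proj(fused), dim=-1)         # unit-norm ADL embedding

def classify(embedding, prototypes, threshold=0.5):
    # nearest-prototype classification; low similarity marks an unseen ADL (-1)
    sims = embedding @ prototypes.T
    best = sims.argmax(dim=-1)
    return torch.where(sims.max(dim=-1).values > threshold, best,
                       torch.full_like(best, -1))

embedder = ADLEmbedder()
e = embedder(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
labels = classify(e, F.normalize(torch.randn(10, 64), dim=-1))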
References
Journal ArticleDOI
18 Jun 2018
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ~25 percent. Models and code are available at https://github.com/hujie-frank/SENet .

14,807 citations
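The SE block summarized in the abstract fits in a few lines. This PyTorch sketch follows that description (global average pooling as the "squeeze", a bottleneck MLP with sigmoid gating as the "excitation"); the reduction ratio of 16 is a typical default, not a claim about this paper's exact configuration.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1) # excitation: per-channel gates in [0, 1]
        return x * w                             # channel-wise recalibration

se = SEBlock(64)
y = se(torch.randn(2, 64, 32, 32))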

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: they achieve 52.8% accuracy on the UCF101 dataset with only 10 dimensions and are also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

7,091 citations
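A minimal 3D-ConvNet stack in the spirit of C3D, illustrating the homogeneous 3x3x3 kernels the abstract describes. The depth, channel widths, pooling schedule, and class count below are simplified placeholders, not the published C3D configuration.

import torch
import torch.nn as nn

def conv3d_block(cin, cout, pool):
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),  # homogeneous 3x3x3 kernels
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool))

# input is a clip tensor of shape (N, 3, T, H, W), e.g. 16 frames of 112x112 RGB
model = nn.Sequential(
    conv3d_block(3, 64, pool=(1, 2, 2)),    # keep the temporal length early on
    conv3d_block(64, 128, pool=(2, 2, 2)),
    conv3d_block(128, 256, pool=(2, 2, 2)),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(256, 101))                    # e.g. 101 classes for UCF101

scores = model(torch.randn(2, 3, 16, 112, 112))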

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations
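A small sketch of the two-stream setup described above, assuming a ResNet-18 backbone for each stream (a stand-in, not the paper's ConvNet) and late fusion by averaging per-stream class scores. The 20-channel temporal input corresponds to a stack of 10 horizontal/vertical dense optical-flow fields.

import torch
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 101
spatial = resnet18(num_classes=num_classes)   # spatial stream: a single RGB frame
temporal = resnet18(num_classes=num_classes)  # temporal stream: stacked optical flow
# replace the stem so the temporal stream accepts 2*L = 20 flow channels
temporal.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)

def two_stream_predict(rgb, flow_stack):
    # late fusion: average the softmax scores of the two streams
    return (spatial(rgb).softmax(-1) + temporal(flow_stack).softmax(-1)) / 2

scores = two_stream_predict(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))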

Book ChapterDOI
08 Sep 2018
TL;DR: The Convolutional Block Attention Module (CBAM) is a simple yet effective attention module for feed-forward convolutional neural networks: given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement.
Abstract: We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

5,335 citations
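A compact PyTorch rendering of the CBAM recipe from the abstract: channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise average and max maps. The reduction ratio and 7x7 spatial kernel are common choices assumed here, not quotations from the paper.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention: shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # spatial attention: conv over the channel-wise average/max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                                    # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(x.size(0), -1, 1, 1)  # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                  # spatial attention

cbam = CBAM(64)
y = cbam(torch.randn(2, 64, 32, 32))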

Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.

5,073 citations
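The inflation idea behind I3D can be illustrated directly: repeat a pretrained 2D kernel along the temporal axis and rescale by 1/T so activations keep roughly the same magnitude. The helper below is an illustrative sketch under that reading of the abstract, not the authors' code.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    # build a 3D conv whose spatial shape matches the 2D conv, with a temporal extent
    kH, kW = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_kernel, kH, kW),
                       stride=(1, *conv2d.stride),
                       padding=(time_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # repeat the 2D kernel T times along the new temporal axis and rescale by 1/T
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# example: inflate a 7x7 image-classification stem conv into a 3x7x7 video stem
stem3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
out = stem3d(torch.randn(1, 3, 16, 112, 112))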