Author

Hemerson Tacon

Bio: Hemerson Tacon is an academic researcher from Universidade Federal de Juiz de Fora. The author has contributed to research in topics: Convolutional neural network & Deep learning. The author has an h-index of 3 and has co-authored 7 publications receiving 24 citations.

Papers
Proceedings ArticleDOI
01 Dec 2018
TL;DR: A multi-stream network is the architecture of choice to incorporate temporal information, since it may benefit from pre-trained deep networks for images and from handcrafted features for initialization, and its training cost is usually lower than that of video-based networks.
Abstract: Advances in digital technology have increased event recognition capabilities through the development of devices with high resolution, small physical dimensions and high sampling rates. The recognition of complex events in videos has several relevant applications, particularly due to the large availability of digital cameras in environments such as airports, banks, roads, among others. The large amount of data produced is the ideal scenario for the development of automatic methods based on deep learning. Despite the significant progress achieved through image-based deep networks, video understanding still faces challenges in modeling spatio-temporal relations. In this work, we address the problem of human action recognition in videos. A multi-stream network is our architecture of choice to incorporate temporal information, since it may benefit from pre-trained deep networks for images and from handcrafted features for initialization. Furthermore, its training cost is usually lower than video-based networks. We explore visual rhythm images since they encode longer-term information when compared to still frames and optical flow. We propose a novel method based on point tracking for deciding the best visual rhythm direction for each video. Experiments conducted on the challenging UCF101 and HMDB51 data sets indicate that our proposed stream improves network performance, achieving accuracy rates comparable to the state-of-the-art approaches.
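As a hedged illustration (not the authors' code), a visual rhythm of the kind described above is typically built by sampling one line of pixels per frame and stacking those lines over time; a minimal Python/NumPy sketch, with the sampling direction exposed as a parameter, might look like this:

```python
import numpy as np

def visual_rhythm(frames, direction="horizontal"):
    """Stack one line of pixels per frame into a 2D visual-rhythm image.

    frames: iterable of grayscale frames (H x W numpy arrays).
    direction: "horizontal" takes the central row of each frame,
               "vertical" takes the central column.
    Returns an array of shape (num_frames, W) or (num_frames, H).
    """
    lines = []
    for frame in frames:
        h, w = frame.shape[:2]
        if direction == "horizontal":
            lines.append(frame[h // 2, :])   # central row
        else:
            lines.append(frame[:, w // 2])   # central column
    return np.stack(lines, axis=0)
```

The paper additionally selects the best rhythm direction per video via point tracking; that selection step is omitted from this sketch.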

21 citations

Book ChapterDOI
01 Jul 2019
TL;DR: This work proposes the usage of multiple Visual Rhythm crops, symmetrically extended in time and separated by a fixed stride, which provide a 2D representation of the video volume matching the fixed input size of the 2D Convolutional Neural Network employed.
Abstract: Despite the expressive progress of deep learning models on the image classification task, they still need enhancement for efficient human action recognition. One way to achieve such gain is to augment the existing datasets. With this goal, we propose the usage of multiple Visual Rhythm crops, symmetrically extended in time and separated by a fixed stride. The symmetric extension preserves the video frame rate, which is crucial to not distort actions. The crops provide a 2D representation of the video volume matching the fixed input size of the 2D Convolutional Neural Network (CNN) employed. In addition, multiple crops with stride guarantee coverage of the entire video. Aiming to evaluate our method, a multi-stream strategy combining RGB and Optical Flow information is extended to include the Visual Rhythm. Accuracy rates fairly close to the state-of-the-art were obtained from the experiments with our method on the challenging UCF101 and HMDB51 datasets.
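A minimal sketch of the cropping scheme described above, assuming the rhythm is a NumPy array with one row per frame; the crop length and stride values are illustrative, not taken from the paper:

```python
import numpy as np

def rhythm_crops(rhythm, crop_len=224, stride=112):
    """Cut a visual-rhythm image (T x W) into fixed-size temporal crops."""
    while rhythm.shape[0] < crop_len:
        # symmetric extension: append the time-reversed rhythm, which
        # preserves the original frame rate instead of stretching the action
        rhythm = np.concatenate([rhythm, rhythm[::-1]], axis=0)
    t = rhythm.shape[0]
    starts = list(range(0, t - crop_len + 1, stride))
    if starts[-1] != t - crop_len:
        starts.append(t - crop_len)   # make sure the tail of the video is covered
    return [rhythm[s:s + crop_len] for s in starts]
```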

6 citations

Journal ArticleDOI
TL;DR: A multi-stream architecture based on the weighted voting of convolutional neural networks is proposed to deal with the problem of recognizing human actions in videos, introducing a new stream, the Optical Flow Rhythm, alongside other streams for diversity.
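The weighted voting mentioned in the TL;DR is a form of late fusion; a rough sketch under assumed stream names and weights (both illustrative, not the paper's values) could be:

```python
import numpy as np

def weighted_vote(stream_scores, weights):
    """Late fusion by weighted voting.

    stream_scores: dict mapping stream name -> (num_classes,) softmax scores
                   for one video (e.g. "rgb", "optical_flow", "flow_rhythm").
    weights:       dict mapping stream name -> scalar weight.
    Returns the predicted class index.
    """
    fused = sum(weights[name] * scores for name, scores in stream_scores.items())
    return int(np.argmax(fused))

# illustrative usage with made-up scores for a 3-class problem
scores = {
    "rgb":          np.array([0.2, 0.5, 0.3]),
    "optical_flow": np.array([0.1, 0.3, 0.6]),
    "flow_rhythm":  np.array([0.3, 0.3, 0.4]),
}
weights = {"rgb": 1.0, "optical_flow": 1.5, "flow_rhythm": 1.0}
print(weighted_vote(scores, weights))
```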

4 citations

Book ChapterDOI
01 Jan 2020
TL;DR: A different pre-training procedure is developed for the visual rhythm stream, using visual rhythm images extracted from a large and challenging video dataset, Kinetics, to better suit a representation that harshly deforms the silhouettes of actors and objects.
Abstract: Human action recognition aims to classify trimmed videos based on the action being performed by one or more agents. It can be applied to a large variety of tasks, such as surveillance systems, intelligent homes, health monitoring, and human-computer interaction. Despite the significant progress achieved through image-based deep networks, video understanding still faces challenges in modeling spatiotemporal relations. The inclusion of temporal information in the network may lead to significant growth in the training cost. To address this issue, we explore complementary handcrafted features to feed pre-trained two-dimensional (2D) networks in a multi-stream fashion. In addition to the commonly used RGB and optical flow streams, we propose the use of a stream based on visual rhythm images that encode long-term information. Previous works have shown that either RGB or optical flow streams may benefit from pre-training on ImageNet since they maintain a certain level of object shape. The visual rhythm, on the other hand, harshly deforms the silhouettes of the actors and objects. Therefore, we develop a different pre-training procedure for the latter stream using visual rhythm images extracted from a large and challenging video dataset, the Kinetics.
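A minimal sketch of such a pre-training stage in PyTorch, assuming a recent torchvision; the backbone choice, optimizer settings, and data loader are assumptions for illustration, not details from the paper:

```python
import torch
import torch.nn as nn
from torchvision import models

# assumed setup: visual-rhythm images rendered from Kinetics, 400 action classes
NUM_KINETICS_CLASSES = 400

model = models.resnet50(weights=None)   # backbone choice is illustrative
model.fc = nn.Linear(model.fc.in_features, NUM_KINETICS_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def pretrain_epoch(loader):
    """One epoch over a loader yielding (rhythm_image, label) batches."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# after pre-training, the same backbone would be fine-tuned on the target
# dataset (UCF101 / HMDB51) with a new classification head
```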

3 citations

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This work addresses the problem of human action recognition in videos through a multi-stream network that incorporates both spatial and temporal information, employing a deep network to extract features from the video frames at multiple depths in order to generate a Learnable Visual Rhythm.
Abstract: Recent deep learning techniques have achieved satisfactory results for various image-related problems. However, many research questions remain open in tasks involving video sequences. Several applications demand the understanding of complex events in videos, such as traffic monitoring, person re-identification, security and surveillance. In this work, we address the problem of human action recognition in videos through a multi-stream network that incorporates both spatial and temporal information. The main contribution of our work is a stream based on a new variant of the visual rhythm, called Learnable Visual Rhythm (LVR). We employ a deep network to extract features from the video frames in order to generate the rhythm. The features are collected at multiple depths of the network to enable the analysis of different abstraction levels. This strategy significantly outperforms the handcrafted version on the UCF101 and HMDB51 datasets. Experiments conducted on these datasets show that our final multi-stream network achieved competitive results compared to state-of-the-art approaches.
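A rough sketch of how per-frame features from multiple depths of a 2D backbone could be stacked over time into a rhythm-like image, assuming a recent torchvision; the chosen backbone, tapped layers, and pooling are illustrative, not the authors' exact design:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.eval()

# tap two depths of the backbone; which layers to use is an assumption here
taps = {"shallow": backbone.layer2, "deep": backbone.layer4}
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # global-average-pool the feature map into one descriptor per frame
        features[name] = output.mean(dim=(2, 3)).squeeze(0)
    return hook

for name, layer in taps.items():
    layer.register_forward_hook(make_hook(name))

@torch.no_grad()
def learnable_rhythm(frames):
    """frames: list of (3, H, W) tensors. Returns one 2D image per tapped depth,
    with one row of deep features per frame (a rhythm built from learned features)."""
    rows = {name: [] for name in taps}
    for frame in frames:
        backbone(frame.unsqueeze(0))
        for name in taps:
            rows[name].append(features[name])
    return {name: torch.stack(r, dim=0) for name, r in rows.items()}
```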

2 citations


Cited by
Journal ArticleDOI
TL;DR: Wang et al. propose a framework with three main phases for human action recognition, namely pre-training, preprocessing, and recognition, which achieves state-of-the-art performance.
Abstract: Human action recognition techniques have gained significant attention among next-generation technologies due to their specific features and high capability to inspect video sequences to understand human actions. As a result, many fields have benefited from human action recognition techniques. Deep learning techniques have played a primary role in many approaches to human action recognition, and transfer learning is spreading a new era of learning. Accordingly, this study's main objective is to propose a framework with three main phases for human action recognition: pre-training, preprocessing, and recognition. The framework presents a set of novel techniques that are three-fold, as follows: (i) in the pre-training phase, a standard convolutional neural network is trained on a generic dataset to adjust weights; (ii) this pre-trained model is then applied to the target dataset to perform the recognition process; and (iii) the recognition phase exploits convolutional neural network and long short-term memory models in five different architectures. Three architectures are stand-alone and single-stream, while the other two combine the first three in a two-stream style. Experimental results show that the first three architectures recorded accuracies of 83.24%, 90.72%, and 90.85%, respectively. The last two architectures achieved accuracies of 93.48% and 94.87%, respectively. Moreover, the recorded results outperform other state-of-the-art models in the same field.
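A minimal PyTorch sketch of one single-stream CNN + LSTM architecture of the kind described above, assuming a recent torchvision; the backbone and layer sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmClassifier(nn.Module):
    """Per-frame CNN features fed to an LSTM, one prediction per clip."""

    def __init__(self, num_classes, hidden=256):
        super().__init__()
        cnn = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop the fc head
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                            # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))       # (B*T, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)          # (B, T, 512)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                        # (B, num_classes)
```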

24 citations

Proceedings ArticleDOI
05 Jun 2019
TL;DR: This work proposes and evaluates a multi-stream learning model based on convolutional neural networks using high-level handcrafted features as input in order to cope with sporadic falls, and shows that this approach outperforms, in terms of accuracy and sensitivity rates, other similar methods found in the literature.
Abstract: Sporadic falls, due to the lack of balance and other factors, are among the complications that elderly people may experience more frequently than others. Since there is a high probability of these events causing major health casualties, such as broken bones or head clots, studies have been monitoring these falls in order to rapidly assist the victim. In this work, we propose and evaluate a multi-stream learning model based on convolutional neural networks that uses high-level handcrafted features as input in order to cope with this situation. Our approach consists of extracting high-level handcrafted features, for instance, human pose estimation and optical flow, and using each one as input for a distinct VGG-16 classifier. In addition, these experiments showcase which features can be used in fall detection. The results show that, by assembling our directed input learners, our approach outperforms, in terms of accuracy and sensitivity rates, other similar methods found in the literature.
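One of the handcrafted inputs mentioned above, optical flow, can be rendered as an image suitable for feeding a VGG-16 stream; a hedged OpenCV sketch with typical Farneback parameters (not the paper's settings) is:

```python
import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    """Dense optical flow (Farneback) rendered as a color image so it can be
    fed to an image classifier such as VGG-16; parameter values are common
    defaults, not those used in the paper."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                       # hue encodes direction
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```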

20 citations

Proceedings ArticleDOI
18 Jul 2022
TL;DR: Zhang et al. propose a graph-based framework for action recognition that models the spatio-temporal interactions among the entities in a video without any object-level supervision.
Abstract: Action recognition requires modelling the interactions between either human & human or human & objects. Recently, graph convolutional neural networks (GCNs) have been exploited to effectively capture the structure of an action by modelling the relationships among the entities present in a video. However, most approaches depend on the effectiveness of object detection frameworks to detect the entities. In this paper, we propose a graph-based framework for action recognition to model the spatio-temporal interactions among the entities in a video without any object-level supervision. First, we obtain the salient space-time interest points (STIPs), which contain rich information about the significant local variations in space and time, by using the Harris 3D detector. In order to incorporate the local appearance and motion information of the entities, either low-level or deep features are extracted around these STIPs. Next, we build a graph by considering the extracted STIPs as nodes, which are connected by spatial edges and temporal edges. These edges are determined based on a membership function that measures the similarity of the entities associated with the STIPs. Finally, a GCN is employed on the given graph to provide reasoning among the different entities present in a video. We evaluate our method on three widely used datasets, namely UCF-101, HMDB-51, and SSV2, to demonstrate the efficacy of the proposed approach.
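A rough sketch of the graph-construction step described above, with cosine similarity standing in for the paper's membership function (an assumption on my part) and NumPy arrays of STIP descriptors and frame indices as inputs:

```python
import numpy as np

def stip_graph(descriptors, times, sim_threshold=0.8):
    """Build an adjacency matrix over STIPs.

    descriptors: (N, D) appearance/motion descriptors extracted around each STIP.
    times:       (N,) frame index of each STIP.
    Nodes within the same frame are linked by spatial edges, nodes in
    neighbouring frames by temporal edges, whenever their cosine similarity
    exceeds `sim_threshold` (a stand-in for the paper's membership function).
    """
    norm = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T                                        # cosine similarity
    same_frame = times[:, None] == times[None, :]              # spatial candidates
    next_frame = np.abs(times[:, None] - times[None, :]) == 1  # temporal candidates
    adj = ((same_frame | next_frame) & (sim > sim_threshold)).astype(np.float32)
    np.fill_diagonal(adj, 0.0)                                 # no self-loops
    return adj
```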

15 citations


Journal ArticleDOI
TL;DR: Wang et al. propose a YOLO V3 + VGG 16 transfer learning network to realize the automatic recognition, monitoring, and analysis of small-sample data; the recognition accuracy of the proposed method is greater than 96%, and the average deviation of the action execution time is less than 1 s.

12 citations