
Showing papers by "Antonis A. Argyros published in 2021"


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, a real-time method is presented that estimates the 3D human pose directly in the Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images.
Abstract: We introduce a real-time method that estimates the 3D human pose directly in the popular Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) A novel and compact 2D pose representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance (70 fps in CPU-only execution).
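
A minimal sketch of the orientation-gated ensemble idea described above, in Python; all model objects (orientation_clf, pose_nets, ik_refine) are hypothetical stand-ins, not the authors' MocapNET code:

    import numpy as np

    def estimate_3d_pose(joints_2d, orientation_clf, pose_nets, ik_refine=None):
        """joints_2d: (J, 2) array of detected 2D body joints."""
        # 1. Compact 2D representation: pairwise joint distances,
        #    normalized by a reference pair to remove scale.
        dists = np.linalg.norm(joints_2d[:, None, :] - joints_2d[None, :, :], axis=-1)
        feats = (dists / (dists[0, 1] + 1e-8))[np.triu_indices(len(joints_2d), k=1)]

        # 2. Classify coarse body orientation and dispatch to the network
        #    pair tuned for it (upper/lower kinematic hierarchy).
        orientation = orientation_clf.predict(feats[None])[0]
        upper_net, lower_net = pose_nets[orientation]
        pose_bvh = np.concatenate([upper_net.predict(feats[None])[0],
                                   lower_net.predict(feats[None])[0]])

        # 3. Optional IK refinement so limb sizes match a known target person.
        if ik_refine is not None:
            pose_bvh = ik_refine(pose_bvh, joints_2d)
        return pose_bvh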

12 citations


Journal ArticleDOI
TL;DR: A distributed framework that enables a team of heterogeneous robots to dynamically generate actions from a common, user-defined goal specification is presented and the integration of various robotic capabilities into a common task allocation and planning formalism is discussed.

7 citations


Journal ArticleDOI
08 Apr 2021
TL;DR: In this article, a 1D-CNN-based prediction model was developed and trained with data from 4,800 trials recorded from 40 participants, achieving a prediction accuracy of more than 92% with less than 550 ms of IMU data.
Abstract: Most people touch their faces unconsciously, for instance to scratch an itch or to rest one's chin in their hands. To reduce the spread of the novel coronavirus (COVID-19), public health officials recommend against touching one's face, as the virus is transmitted through mucous membranes in the mouth, nose and eyes. Students, office workers, medical personnel and people on trains were found to touch their faces between 9 and 23 times per hour. This paper introduces FaceGuard, a system that utilizes deep learning to predict hand movements that result in touching the face, and provides sensory feedback to stop the user from touching the face. The system utilizes an inertial measurement unit (IMU) to obtain features that characterize hand movement involving face touching. Time-series data can be efficiently classified using 1D-Convolutional Neural Network (CNN) with minimal feature engineering; 1D-CNN filters automatically extract temporal features in IMU data. Thus, a 1D-CNN based prediction model is developed and trained with data from 4,800 trials recorded from 40 participants. Training data are collected for hand movements involving face touching during various everyday activities such as sitting, standing, or walking. Results showed that while the average time needed to touch the face is 1,200 ms, a prediction accuracy of more than 92% is achieved with less than 550 ms of IMU data. As for the sensory response, the paper presents a psychophysical experiment to compare the response time for three sensory feedback modalities, namely visual, auditory, and vibrotactile. Results demonstrate that the response time is significantly smaller for vibrotactile feedback (427.3 ms) compared to visual (561.70 ms) and auditory (520.97 ms). Furthermore, the success rate (to avoid face touching) is also statistically higher for vibrotactile and auditory feedback compared to visual feedback. These results demonstrate the feasibility of predicting a hand movement and providing timely sensory feedback within less than a second in order to avoid face touching.
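
A minimal 1D-CNN sketch in PyTorch for the "face-touch imminent" classifier; the six input channels (3-axis accelerometer plus gyroscope) and all layer sizes are illustrative assumptions, not FaceGuard's published architecture:

    import torch
    import torch.nn as nn

    class FaceTouchCNN(nn.Module):
        def __init__(self, channels=6):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),   # temporal features pooled per window
            )
            self.head = nn.Linear(64, 2)   # touch vs. no-touch

        def forward(self, x):              # x: (batch, channels, time)
            return self.head(self.features(x).squeeze(-1))

    model = FaceTouchCNN()
    logits = model(torch.randn(8, 6, 64))  # eight 64-sample IMU windows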

6 citations


Proceedings ArticleDOI
29 Jun 2021
TL;DR: In this paper, self-supervised embeddings are also used to temporally align the action sequences prior to quality assessment, which further increases the accuracy, provides robustness to variance in execution speed, and enables fine-grained interpretability of the assessment score.
Abstract: Action Quality Assessment (AQA) is a video understanding task aiming at the quantification of the execution quality of an action. One of the main challenges in relevant, deep learning-based approaches is the collection of training data annotated by experts. Current methods perform fine-tuning on pre-trained backbone models and aim to improve performance by modeling the subjects and the scene. In this work, we consider embeddings extracted using a self-supervised training method based on a differential cycle consistency loss between sequences of actions. These are shown to improve the state-of-the-art without the need for additional annotations or scene modeling. The same embeddings are also used to temporally align the sequences prior to quality assessment which further increases the accuracy, provides robustness to variance in execution speed and enables us to provide fine-grained interpretability of the assessment score. The experimental evaluation of the method on the MTL-AQA dataset demonstrates significant accuracy gain compared to the state-of-the-art baselines, which grows even more when the action execution sequences are not well aligned.
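
A sketch of the alignment step, assuming per-frame embeddings from a cycle-consistency-trained encoder (the encoder itself is not shown); the nearest-neighbor warp below is a simplification used only to illustrate the idea, not the paper's exact alignment:

    import numpy as np

    def align_to_reference(test_emb, ref_emb):
        """test_emb: (Tt, D), ref_emb: (Tr, D). For each reference frame,
        pick the closest test frame in embedding space (a simple warp)."""
        d = np.linalg.norm(ref_emb[:, None] - test_emb[None, :], axis=-1)
        warp = d.argmin(axis=1)
        return np.maximum.accumulate(warp)   # crude monotonicity constraint

    ref, test = np.random.randn(40, 128), np.random.randn(55, 128)
    aligned = test[align_to_reference(test, ref)]  # test resampled onto ref timeline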

4 citations


Journal ArticleDOI
TL;DR: In social and industrial facilities of the future such as hospitals, hotels, and warehouses, teams of robots will be deployed to assist humans in accomplishing everyday tasks like object handling, transportation, or pickup and delivery operations as discussed by the authors.
Abstract: In social and industrial facilities of the future such as hospitals, hotels, and warehouses, teams of robots will be deployed to assist humans in accomplishing everyday tasks like object handling, transportation, or pickup and delivery operations. In such a context, different robots (e.g., mobile platforms, static manipulators, or mobile manipulators) with different actuation, manipulation, and perception capabilities must be coordinated to achieve various complex tasks (e.g., cooperative parts assembly in the automotive industry or loading and unloading of palettes in warehouses) that require collaborative actions with each other and with human operators (Figure 1).

4 citations


Book ChapterDOI
22 Sep 2021
TL;DR: In this paper, an approach is proposed that builds time series representations of the performance of the humans and the objects involved; such a representation of an ongoing action is then compared to prototype actions.
Abstract: Action prediction is defined as the inference of an action label while the action is still ongoing. Such a capability is extremely useful for early response and further action planning. In this paper, we consider the problem of action prediction in scenarios involving humans interacting with objects. We formulate an approach that builds time series representations of the performance of the humans and the objects. Such a representation of an ongoing action is then compared to prototype actions. This is achieved by a Dynamic Time Warping (DTW)-based time series alignment framework which identifies the best match between the ongoing action and the prototype ones. Our approach is evaluated quantitatively on three standard benchmark datasets. Our experimental results reveal the importance of the fusion of human- and object-centered action representations in the accuracy of action prediction. Moreover, we demonstrate that the proposed approach achieves significantly higher action prediction accuracy compared to competitive methods.
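
A minimal sketch of prefix-based prediction with an open-end DTW: the whole partial observation is aligned against any prefix of each prototype, and the cheapest prototype wins. Feature fusion (human- plus object-centered descriptors) is assumed to happen upstream by per-frame concatenation; this is an illustration, not the authors' exact formulation:

    import numpy as np

    def open_end_dtw(partial, prototype):
        """Cost of aligning all of `partial` to some prefix of `prototype`."""
        Tp, Tq = len(partial), len(prototype)
        cost = np.linalg.norm(partial[:, None] - prototype[None, :], axis=-1)
        acc = np.full((Tp + 1, Tq + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, Tp + 1):
            for j in range(1, Tq + 1):
                acc[i, j] = cost[i-1, j-1] + min(acc[i-1, j],
                                                 acc[i, j-1],
                                                 acc[i-1, j-1])
        return acc[Tp, 1:].min() / Tp   # best endpoint anywhere in the prototype

    def predict_action(partial, prototypes):
        """prototypes: dict mapping action label -> (T, D) series."""
        return min(prototypes, key=lambda lbl: open_end_dtw(partial, prototypes[lbl]))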

3 citations


Proceedings ArticleDOI
18 Jul 2021
TL;DR: HandGAN (H-GAN), as discussed by the authors, is a cycle-consistent adversarial learning approach implementing multi-scale perceptual discriminators, designed to translate synthetic images of hands to the real domain by improving their appearance to approximate the statistical distribution underlying a collection of real images of hands.
Abstract: We present HandGAN (H-GAN), a cycle-consistent adversarial learning approach implementing multi-scale perceptual discriminators. It is designed to translate synthetic images of hands to the real domain. Synthetic hands provide complete ground-truth annotations, yet they are not representative of the target distribution of real-world data. We strive to provide the perfect blend of a realistic hand appearance with synthetic annotations. Relying on image-to-image translation, we improve the appearance of synthetic hands to approximate the statistical distribution underlying a collection of real images of hands. H-GAN tackles not only the cross-domain tone mapping but also structural differences in localized areas such as shading discontinuities. Results are evaluated on a qualitative and quantitative basis improving previous works. Furthermore, we relied on the hand classification task to claim our generated hands are statistically similar to the real domain of hands.
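
A sketch of the multi-scale discriminator idea in PyTorch: the same PatchGAN-style critic applied to progressively downsampled inputs. The layer configuration is illustrative, not H-GAN's exact design:

    import torch
    import torch.nn as nn

    def patch_critic(in_ch=3):
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),   # per-patch real/fake scores
        )

    class MultiScaleDiscriminator(nn.Module):
        def __init__(self, scales=3):
            super().__init__()
            self.critics = nn.ModuleList(patch_critic() for _ in range(scales))
            self.down = nn.AvgPool2d(3, stride=2, padding=1)

        def forward(self, x):
            outs = []
            for critic in self.critics:
                outs.append(critic(x))
                x = self.down(x)   # next critic judges half the resolution
            return outs            # one score map per scale

    score_maps = MultiScaleDiscriminator()(torch.randn(2, 3, 128, 128))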

2 citations


Journal ArticleDOI
TL;DR: The most important novelty presented in this work is the eye shape estimation and the generation of 3D point meshes, which have the potential to allow clinicians to perform measurements on 3D representations of the eye, instead of doing so in 2D images that contain distortions induced by the projection onto the image space.

2 citations


Posted Content
TL;DR: In this article, the authors present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators, one targeting spike timing dependent plasticity (STDP) and the other spike delivery.
Abstract: We present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators. The first one targets spike timing dependent plasticity (STDP). It combines lazy- with event-driven plasticity and efficiently facilitates the computation of pre- and post-synaptic spikes using bitfields and integer intrinsics. It offers higher bandwidth than event-driven plasticity alone and achieves a 1.5x-2x speedup over our closest competitor. The second optimization targets spike delivery. We partition our graph representation in a way that bounds the number of neurons that need be updated at any given time which allows us to perform said update in shared memory instead of global memory. This is 2x-2.5x faster than our closest competitor. Both optimizations represent the final evolutionary stages of years of iteration on STDP and spike delivery inside "Spice" (/spaIk/), our state of the art SNN simulator. The proposed optimizations are not exclusive to our graph representation or pipeline but are applicable to a multitude of simulator designs. We evaluate our performance on three well-established models and compare ourselves against three other state of the art simulators.
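
A toy illustration of the bitfield idea: each neuron's recent spike history lives in one machine word (one bit per simulation step), so pre/post coincidence counting for STDP reduces to shifts, masks, and a popcount. This is a sketch of the concept only, not Spice's implementation:

    HISTORY_BITS = 64
    MASK = (1 << HISTORY_BITS) - 1

    def record_spike(history, fired):
        """Shift the spike window one step; bit 0 means 'fired this step'."""
        return ((history << 1) | int(fired)) & MASK

    def coincidences(pre_hist, post_hist, window=16):
        """Steps within `window` where both neurons spiked; bin().count('1')
        stands in for the integer popcount intrinsic used on real hardware."""
        both = pre_hist & post_hist & ((1 << window) - 1)
        return bin(both).count("1")

    h = 0
    for fired in (1, 0, 1, 1):
        h = record_spike(h, fired)
    print(coincidences(h, 0b1011))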

1 citation


Posted Content
TL;DR: In this article, an SNN simulator is presented that scales to millions of neurons, billions of synapses, and 8 GPUs, made possible by a novel cache-aware spike transmission algorithm, a model-parallel multi-GPU distribution scheme, and a static yet effective load balancing strategy.
Abstract: We present a SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs. This is made possible by 1) a novel, cache-aware spike transmission algorithm 2) a model parallel multi-GPU distribution scheme and 3) a static, yet very effective load balancing strategy. The simulator further features an easy to use API and the ability to create custom models. We compare the proposed simulator against two state of the art ones on a series of benchmarks using three well-established models. We find that our simulator is faster, consumes less memory, and scales linearly with the number of GPUs.
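
A sketch of one way to do static load balancing of the kind described: assign neuron blocks to GPUs so that per-GPU synapse counts (the dominant cost) stay roughly equal, via a greedy longest-processing-time heuristic. This is an assumption-laden illustration, not the simulator's actual scheme:

    import heapq

    def balance(block_synapse_counts, n_gpus=8):
        """Returns gpu_of[i], the GPU assigned to neuron block i."""
        heap = [(0, gpu) for gpu in range(n_gpus)]   # (current load, gpu id)
        heapq.heapify(heap)
        order = sorted(range(len(block_synapse_counts)),
                       key=lambda i: -block_synapse_counts[i])
        gpu_of = [0] * len(block_synapse_counts)
        for i in order:                              # heaviest blocks first
            load, gpu = heapq.heappop(heap)
            gpu_of[i] = gpu
            heapq.heappush(heap, (load + block_synapse_counts[i], gpu))
        return gpu_of

    print(balance([900, 400, 400, 300, 250, 100], n_gpus=2))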

1 citation


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, a coarse-to-fine action hierarchy based on linguistic label associations is proposed, and the potential benefits and drawbacks of the hierarchical organization of action classes in different levels of granularity are investigated.
Abstract: Human activity recognition is a fundamental and challenging task in computer vision. Its solution can support multiple and diverse applications in areas including but not limited to smart homes, surveillance, daily living assistance, Human-Robot Collaboration (HRC), etc. In realistic conditions, the complexity of human activities ranges from simple coarse actions, such as siting or standing up, to more complex activities that consist of multiple actions with subtle variations in appearance and motion patterns. A large variety of existing datasets target specific action classes, with some of them being coarse and others being fine-grained. In all of them, a description of the action and its complexity is manifested in the action label sentence. As the action/activity complexity increases, so is the label sentence size and the amount of action-related semantic information contained in this description. In this paper, we propose an approach to exploit the information content of these action labels to formulate a coarse-to-fine action hierarchy based on linguistic label associations, and investigate the potential benefits and drawbacks. Moreover, in a series of quantitative and qualitative experiments, we show that the exploitation of this hierarchical organization of action classes in different levels of granularity improves the learning speed and overall performance of a range of baseline and mid-range deep architectures for human action recognition (HAR).
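
A minimal sketch of deriving a coarse-to-fine hierarchy from the label sentences themselves; the grouping rule (coarse class = leading verb) is a deliberately simple stand-in for the paper's linguistic label associations:

    from collections import defaultdict

    def build_hierarchy(labels):
        coarse = defaultdict(list)
        for label in labels:
            verb = label.split()[0].lower()   # e.g. "open" from "open fridge"
            coarse[verb].append(label)
        return dict(coarse)

    labels = ["open microwave door", "open fridge", "cut tomato", "cut bread"]
    hierarchy = build_hierarchy(labels)
    # {'open': ['open microwave door', 'open fridge'], 'cut': ['cut tomato', 'cut bread']}
    coarse_of = {fine: verb for verb, fines in hierarchy.items() for fine in fines}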

Proceedings ArticleDOI
18 Jul 2021
TL;DR: In this article, a cache-aware spike transmission algorithm and load balancing strategy are proposed to scale to millions of neurons, billions of synapses, and 8 GPUs, which is made possible by a novel, cacheaware, spike transmission and a model parallel multi-GPU distribution scheme.
Abstract: We present a SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs. This is made possible by 1) a novel, cache-aware spike transmission algorithm 2) a model parallel multi-GPU distribution scheme and 3) a static, yet very effective load balancing strategy. The simulator further features an easy to use API and the ability to create custom models. We compare the proposed simulator against two state of the art ones on a series of benchmarks using three well-established models. We find that our simulator is faster, consumes less memory, and scales linearly with the number of GPUs.

Posted Content
TL;DR: HandGAN (H-GAN) as discussed by the authors is a cycle-consistent adversarial learning approach implementing multi-scale perceptual discriminators, which is designed to translate synthetic images of hands to the real domain.
Abstract: We present HandGAN (H-GAN), a cycle-consistent adversarial learning approach implementing multi-scale perceptual discriminators. It is designed to translate synthetic images of hands to the real domain. Synthetic hands provide complete ground-truth annotations, yet they are not representative of the target distribution of real-world data. We strive to provide the perfect blend of a realistic hand appearance with synthetic annotations. Relying on image-to-image translation, we improve the appearance of synthetic hands to approximate the statistical distribution underlying a collection of real images of hands. H-GAN tackles not only the cross-domain tone mapping but also structural differences in localized areas such as shading discontinuities. Results are evaluated on a qualitative and quantitative basis improving previous works. Furthermore, we relied on the hand classification task to claim our generated hands are statistically similar to the real domain of hands.

Posted Content
TL;DR: In this paper, the authors employ a publicly available, multi-camera dataset of hands (InterHand2.6M), and perform effective image-based refinement to improve on the imperfect ground truth annotations, yielding a better dataset.
Abstract: The amount and quality of datasets and tools available in the research field of hand pose and shape estimation act as evidence to the significant progress that has been made. We find that there is still room for improvement in both fronts, and even beyond. Even the datasets of the highest quality, reported to date, have shortcomings in annotation. There are tools in the literature that can assist in that direction and yet they have not been considered, so far. To demonstrate how these gaps can be bridged, we employ such a publicly available, multi-camera dataset of hands (InterHand2.6M), and perform effective image-based refinement to improve on the imperfect ground truth annotations, yielding a better dataset. The image-based refinement is achieved through raytracing, a method that has not been employed so far to relevant problems and is hereby shown to be superior to the approximative alternatives that have been employed in the past. To tackle the lack of reliable ground truth, we resort to realistic synthetic data, to show that the improvement we induce is indeed significant, qualitatively, and quantitatively, too.
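
A toy sketch of the raytracing consistency check behind such refinement: cast camera rays through annotated 2D keypoints and test where they meet the hand surface, with misses flagging annotations to refine. It uses trimesh's ray casting on a stand-in sphere mesh; all data are toy values, and this is not the paper's pipeline:

    import numpy as np
    import trimesh

    mesh = trimesh.creation.icosphere(radius=0.1)     # stand-in for a hand mesh
    origins = np.tile([0.0, 0.0, -1.0], (2, 1))       # camera center, repeated per ray
    dirs = np.array([[0.0, 0.0, 1.0],                 # ray through a good keypoint
                     [0.3, 0.3, 1.0]])                # ray through a bad keypoint (misses)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    locations, ray_idx, _ = mesh.ray.intersects_location(origins, dirs)
    for i, p in zip(ray_idx, locations):
        print(f"ray {i} hits the mesh at {p}")        # absent rays indicate bad annotations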