
Activity Recognition for Natural Human Robot Interaction

TL;DR: This work presents a simple yet effective approach to modelling pose trajectories using the directions traversed by human joints over the duration of an activity, representing each action as a histogram of direction vectors.
Abstract: The ability to recognize human activities is necessary to facilitate natural interaction between humans and robots. While humans can distinguish between communicative actions and activities of daily living, robots cannot draw such inferences effectively. To allow intuitive human robot interaction, we propose the use of human-like stylized gestures as communicative actions and contrast them from conventional activities of daily living. We present a simple yet effective approach of modelling pose trajectories using directions traversed by human joints over the duration of an activity and represent the action as a histogram of direction vectors. The descriptor benefits from being computationally efficient as well as scale and speed invariant. In our evaluation, the descriptor returned state of the art classification accuracies using off the shelf classification algorithms on multiple datasets.

Summary (3 min read)

1 Introduction

  • As robots are employed to perform a wide range of tasks, especially in human environments, the need to facilitate natural interaction between humans and robots is becoming more pertinent.
  • For example, if a robot could recognize whether a person is drinking water, it could offer to pour more and react appropriately based on the person’s response.
  • The authors propose the use of human-like stylized gestures as communicative actions and contrast them from conventional activities of daily living.
  • In this work the authors introduce a novel activity descriptor: Histogram of Direction vectors (HODV) that transforms 3D spatio-temporal joint movements into unique directions; an approach that proves to be highly discriminative for activity recognition.
  • Further, the authors learn to distinguish communicative actions that instruct a robot from conventional activities of daily living, and obtain a descriptive labelling of the same.

1.1 Contributions and Outline

  • The contributions of this work are as follows: Firstly, the authors introduce the problem of communicative vs non-communicative actions.
  • The authors provide analysis of their algorithm on two public datasets and demonstrate how the algorithm could be used for both Communicative/Interactive and Non-Communicative/Non-Interactive activity recognition.
  • The rest of the paper is organized as follows.
  • Section 2 presents a brief literature review.
  • Section 3 explains their dataset, while sections 4 and 5 describe their algorithm and experimental results in detail, respectively.

3 OUR DATASET

  • Recent advances in pose estimation [11] and the cheap availability of RGBD cameras have led to many RGBD activity datasets [12, 14].
  • In addition, the activities were captured at various times of the day leading to varied lighting conditions.
  • A total of 5 participants were asked to perform 18 different activities, including 10 Communicative/Interactive activities and 8 Non-Interactive activities, each performed a total of three times with slight changes in viewpoint from the other instances.
  • ‘Catching the Robot’s attention’, ‘Pointing in a direction’, ‘Asking to stop’, ‘Expressing dissent’, ‘Chopping’, ‘Cleaning’, ‘Repeating’, ‘Beckoning’, ‘Asking to get phone’ and ‘Facepalm’ were the 10 Robot-Interactive activities.
  • The authors stress that their dataset is different from publicly available datasets as they represent a new mix of activities, more aligned with how humans would perform these in real life.

4 Action Representation

  • Activities usually consist of sequences of sub-activities and can be fundamentally described using two aspects: a) Motor Trajectory and b) Activity context.
  • For example, in a drinking activity, a subject picks a glass or a cup, brings it closer to his/her mouth and returns it.
  • There are numerous possibilities behind the context of the activity, as a glass could contain juice while a cup could contain coffee, thereby giving more meaning to the activity ‘drinking’ and answering the question: what is probably being drunk?
  • The motor trajectory followed by most people for a generic drinking activity would predominantly be similar.
  • The authors describe the 3D trajectory of each joint separately and construct the final descriptor by concatenating the direction vector histogram of each joint.

4.1 Direction vectors from skeletons

  • The algorithm takes RGBD images as input and uses the primesense skeleton tracker [1] to extract skeleton joints at each frame.
  • The joint locations are then normalized by transforming the origin to the human torso, thereby making them invariant to human translation.
  • Direction vectors are then calculated for each joint i by computing the difference between joint coordinates of frame f and frame f + τ , where τ is a fixed time duration (e.g., 0.1 seconds) in terms of frame counts.
  • Mathematically, direction vectors are estimated for each joint at every frame as: d^i_f = P^i_f − P^i_{f+τ}, ∀f ∈ [1, 2, . . . , f_max − τ] (1). The next section explains the construction of their action descriptor, the Histogram of Direction Vectors, and the final descriptor used to classify activities.
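A minimal sketch of this step (in Python with NumPy) is given below; the array layout, torso joint index, frame rate and value of τ are illustrative assumptions rather than values taken from the paper.

import numpy as np

def direction_vectors(joints, torso_idx=2, fps=30, tau_seconds=0.1):
    """Compute per-joint direction vectors d^i_f = P^i_f - P^i_{f+tau}.

    joints: array of shape (num_frames, num_joints, 3) with 3D joint positions.
    """
    # Normalize by moving the origin to the torso joint in every frame,
    # which makes the representation invariant to human translation.
    normalized = joints - joints[:, torso_idx:torso_idx + 1, :]

    # Convert the fixed time duration tau into a frame offset
    # (e.g., 0.1 s at 30 fps corresponds to 3 frames).
    tau = max(1, int(round(tau_seconds * fps)))

    # Difference between joint positions at frame f and frame f + tau,
    # for f = 1, ..., f_max - tau, as in equation (1).
    return normalized[:-tau] - normalized[tau:]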

4.2 Histogram of direction vectors

  • The authors chose 27 primary directions in the 3D space and represented the direction taken by a joint by the nearest primary direction in that grid.
  • The grid entries represent real world directions such as, up, down, up-left, down-right and so on; resulting in a total of 27 directions.
  • The goal is to find the specific direction index q∗ representing the primary direction at the minimum Euclidean distance from the direction vector.
  • To obtain the total number of times each direction was taken during an activity, the authors accumulate these indices into a count vector h∗; normalizing h∗ gives a histogram hi, representing the probability of occurrence of each direction for a particular joint i over the course of the activity.
  • Further, each histogram hi is concatenated to generate the final feature vector H = [h1, h2, . . . , hi]; namely the Histogram of direction Vectors.
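Continuing the sketch above, the histogram construction could look as follows; taking the 27 primary directions to be all combinations of {−1, 0, +1} offsets along the three axes is an assumption made for illustration, as is the joint count.

import itertools
import numpy as np

# The 27 primary directions: every combination of {-1, 0, +1} on x, y, z
# (including the all-zero "no movement" entry).
PRIMARY_DIRECTIONS = np.array(list(itertools.product([-1, 0, 1], repeat=3)),
                              dtype=float)

def hodv_descriptor(dirs):
    """Build the Histogram of Direction Vectors from per-joint direction vectors.

    dirs: array of shape (num_frames - tau, num_joints, 3), e.g. the output of
    direction_vectors() above.
    """
    histograms = []
    for i in range(dirs.shape[1]):
        # Map each direction vector of joint i onto the nearest primary
        # direction (minimum Euclidean distance) and count the occurrences.
        dists = np.linalg.norm(dirs[:, i, None, :] - PRIMARY_DIRECTIONS[None],
                               axis=-1)
        counts = np.bincount(np.argmin(dists, axis=-1), minlength=27)
        # Normalize into a probability histogram h_i for joint i.
        histograms.append(counts / max(counts.sum(), 1))
    # Concatenate the per-joint histograms; with 20 joints this yields a
    # 20 x 27 = 540-dimensional feature vector.
    return np.concatenate(histograms)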

5 Experimental Results

  • In this section the authors present detailed analysis of their experiments.
  • In addition to their dataset, the authors test their algorithm on two public datasets: The Cornell activity dataset (CAD-60) [12] and the UTKinect-Action Dataset [14].
  • The results reveal that the proposed approach performs comparably to state-of-the-art approaches, which, in general, are computationally expensive and involve complicated modelling.
  • The authors use an SVM as their classification algorithm along with histogram intersection as the kernel choice.
  • The authors optimize the cost parameter using cross validation.
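As a rough illustration of this classification setup, the sketch below uses scikit-learn (an assumption; the paper does not name a library) with a histogram intersection kernel and a cross-validated grid search over the cost parameter C.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def histogram_intersection(X, Y):
    """Histogram intersection kernel: K(x, y) = sum_k min(x_k, y_k)."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=-1)

# X would hold one HODV descriptor per sequence (e.g., 540 dimensions) and
# y the activity labels; the C grid below is illustrative.
classifier = GridSearchCV(SVC(kernel=histogram_intersection),
                          param_grid={"C": [0.1, 1, 10, 100]}, cv=5)
# classifier.fit(X_train, y_train); predictions = classifier.predict(X_test)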

5.1 Our Dataset

  • On their dataset, the authors ran experiments using three different settings.
  • All experiments were performed using 5-fold cross-subject cross-validation, such that, at a time, all instances of one subject were used for testing and the instances from the other subjects were used for training (a sketch of this protocol follows this list).
  • As in the previous setup, the algorithm accurately classified actions with distinct motion trajectories but confused actions with very similar motion, such as Repeat and Facepalm.
  • Interactive actions were classified with an accuracy of 92.67% while Non Interactive activities were classified correctly with an accuracy of 85%.
  • The descriptor also benefits from being computationally efficient, as the only calculations involved for each joint are: – calculation of direction vectors, which can be performed in constant time.
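The cross-subject protocol referred to above, where all instances of one subject are held out for testing in turn, can be sketched with scikit-learn's LeaveOneGroupOut; the estimator and variable names are illustrative.

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def cross_subject_accuracy(estimator, X, y, subjects):
    """Mean accuracy over folds, each holding out every sequence of one subject.

    X: HODV descriptors, y: activity labels, subjects: subject id per sequence.
    """
    scores = cross_val_score(estimator, X, y,
                             groups=subjects, cv=LeaveOneGroupOut())
    return scores.mean()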

5.2 Cornell Activity Dataset (CAD 60)

  • The dataset comprises 60 RGBD video sequences of humans performing 12 unique activities of daily living.
  • The activities have been recorded in five different environments: Office, Kitchen, Bedroom, Bathroom, and Living room; generating a total of 12 unique activities performed by four different people: two males and two females.
  • The authors used the same experimental setup (4 fold cross-subject cross validation) and compare precision-recall values for the ’New Person’ setting as described in [12].
  • Table 1 shows a comparison of their algorithm with other state of the art approaches.
  • Considering that the authors use only skeleton data, their approach still outperforms other algorithms.

5.3 UTKinect Action Dataset

  • The UTKinect Action Dataset [14] presents RGBD video sequences and skeleton information of humans performing various activities from different views.
  • For this dataset, the authors compare their approach with the state of the art methodology called histogram of 3D skeleton joint positions (HOJ3D)[14] using Leave one Sequence out Cross validation and cross subject validation as defined previously in this paper.
  • This dataset has activities which look very similar e.g., Sit down and Stand Up.
  • The overall accuracies attained on the dataset are shown in Table 2.
  • The performance drops slightly under the cross-subject cross-validation scheme.


Activity Recognition for Natural Human Robot Interaction

Addwiteey Chrungoo¹, SS Manimaran and Balaraman Ravindran²

¹ School of Engineering and Applied Science, University of Pennsylvania, Philadelphia PA 19104, USA
² Department of Computer Science, Indian Institute of Technology Madras, Chennai, India
Abstract. The ability to recognize human activities is necessary to facilitate natural interaction between humans and robots. While humans can distinguish between communicative actions and activities of daily living, robots cannot draw such inferences effectively. To allow intuitive human robot interaction, we propose the use of human-like stylized gestures as communicative actions and contrast them from conventional activities of daily living. We present a simple yet effective approach of modelling pose trajectories using directions traversed by human joints over the duration of an activity and represent the action as a histogram of direction vectors. The descriptor benefits from being computationally efficient as well as scale and speed invariant. In our evaluation, the descriptor returned state of the art classification accuracies using off the shelf classification algorithms on multiple datasets.
1 Introduction
As robots are employed to perform a wide range of tasks, especially in human envi-
ronments, the need to facilitate natural interaction between humans and robots
is becoming more pertinent. In many roles, such as indoor personal assistants,
robots must be able to infer human activities and decipher whether or not a
human needs assistance. For example, if a robot could recognize whether a person
is drinking water, it could offer to pour more and react appropriately based on
the person’s response. In such scenarios, in addition to recognizing the drinking
activity, the robot needs to be capable of recognizing communicative actions, so
as to infer whether it should pour more or stop. This is similar in principle to
how humans assist others, i.e., either they assist if assistance is sought or they
foresee the need for assistance based on perception and acquired knowledge.
Though past works [10] have focussed on estimating human intent to take such
decisions, this work is motivated by the need for interaction between the robot
and human as a factor in deciding on an appropriate behaviour. Incorporating
such natural interactions is not easy when robots work in highly cluttered en-
vironments where people carry out activities in different ways leading to high
variability [14, 7]. However, to best support humans, assistive robots need to

behave interactively like humans, making it imperative to correctly understand
the human actions involved.
As a result, we are particularly interested in developing a concise representa-
tion for a wide variety of actions; both communicative and conventional activities
of daily living. We propose the use of human-like stylized gestures as commu-
nicative actions and contrast them from conventional activities of daily living.
Stylized gestures are symbolic representations of activities and are widely used
by humans across cultures to communicate with each other when verbal commu-
nication is not possible. We hypothesize that such actions have distinct motion
intrinsics as compared to conventional activities of daily living and can hence
be used effectively to communicate with robots in the absence of verbal means.
[Fig. 1: The general framework of the proposed approach]
Before we can begin to develop a system for
activity recognition, we need an efficient rep-
resentation mechanism for human motion.
In this work we introduce a novel activ-
ity descriptor: Histogram of Direction vectors
(HODV) that transforms 3D spatio-temporal
joint movements into unique directions; an ap-
proach that proves to be highly discrimina-
tive for activity recognition. As shown in Fig-
ure 1, we represent skeletal joint movements
over time in a compact and efficient way that
models pose trajectories in terms of directions
traversed by human joints over the duration
of an activity. The issue we address in this
paper is as follows: Learn to recognise var-
ious human actions given a direction-vector
histogram representation using three dimen-
sional joint locations as raw data. Further,
learn to distinguish communicative actions to instruct a robot from conven-
tional activities of daily living and obtain a descriptive labelling of the same. We
show that our proposed approach is efficient in distinguishing Communicative
and Non Communicative activities in our novel RGBD dataset and also per-
forms equally well on two public datasets: Cornell Activity Dataset (CAD-60)
and UT-Kinect Dataset using off the shelf classification algorithms.
1.1 Contributions and Outline
The contributions of this work are as follows: Firstly, we introduce the prob-
lem of communicative vs non-communicative actions. Secondly, we propose a
novel and computationally efficient activity descriptor based on pose trajecto-
ries. We provide analysis of our algorithm on two public datasets and demon-
strate how the algorithm could be used for both Communicative/Interactive
and Non-Communicative/Non-Interactive activity recognition. We will also re-
lease an annotated RGBD Human Robot Interaction dataset consisting of 18

unique activities including 10 stylized gestures as well as 8 conventional activi-
ties of daily living (within the same dataset) along with full source code of our
algorithm.
The rest of the paper is organized as follows. Section 2 presents a brief liter-
ature review. Section 3 explains our dataset, while section 4 and 5 describe our
algorithm and experimental results in detail respectively. We conclude the paper
in section 6 and also present directions for future work.
2 Related Work
Human activity recognition has been widely studied by computer vision re-
searchers for over two decades. The field, owing to its ability to augment human
robot interaction, has recently started receiving a lot of attention in the robotics
community. In this section, we restrict ourselves largely to research relevant to
robotics, and for an in-depth review of the field, one can refer to recent survey
papers [2].
Earlier works focussed on using IMU data and hidden Markov models (HMMs)
for activity recognition. Authors in [18] proposed a model based on multi sen-
sor fusion from wearable IMUs. They first classified activities into three groups,
namely: Zero, Transitional and Strong displacement activities, followed by a finer
classification using HMMs. Their approach was however restricted to very few
activity classes and was computationally expensive. Mansur et al.[8] also used
HMMs as their classification framework and developed a novel physics based
model using joint torques as features; claimed to be more discriminative com-
pared to kinematic features [12]. Zhang et al.[17] followed a vision based approach
and proposed a 4D spatio-temporal feature that combined both intensity and
depth information by concatenating depth and intensity gradients within a 4D
hyper-cuboid. Their method was, however, dependent on the size of the hyper-
cuboid and could not deal with scale variations. Sung et al.[12] combined human
pose and motion, as well as image and point-cloud information in their model.
They designed a hierarchical maximum entropy Markov model, which considered
activities as a superset of sub-activities.
While most of these works focussed on generating different features, work
on improving robot perception, including recognizing objects and tracking ob-
jects [4] led to the incorporation of domain knowledge [13] within recognition
frameworks. Authors in [5] proposed a joint framework for activity recognition
combining intention, activity and motion within a single framework. Further, [7,
10] incorporated affordances to anticipate activities and plan ahead for reactive
responses. Pieropan et al.[9] on the other hand introduced the idea of learning
from human demonstration and stressed the importance of modelling interaction
of objects with hands such that robots observing humans could learn the role of
an object in an activity and classify it accordingly.
While past works excluded the possibility of interaction with the agent, this
work aims to understand activities when interaction between robots and humans
is possible and realistic, especially, in terms of the human providing possible

instructions to a robot while also performing conventional activities of daily
living. The focus of our work is to utilize distinctions in motion to differentiate
between communicative/instructive actions and conventional activities of daily
living. Having said this, we do not see motion information alone as a replacement,
but as a complement to existing sensory modalities, to be fused for particularly
robust activity recognition over wide ranges of conditions.
3 OUR DATASET
Recent advances in pose estimation [11] and cheap availability of RGBD cam-
eras have led to many RGBD activity datasets [12, 14]. However, since none
of the datasets involved communicative/interactive activities alongside conven-
tional activities of daily living, we collected a new RGBD dataset involving
interactive as well as non interactive actions. Specifically, our interactive actions
were between a robot and a human; where the human interacts with the robot us-
ing stylized gestures; an approach commonly used by humans for human-human
interaction.
The activities were captured using a Kinect camera mounted on a customized
Pioneer P3Dx mobile robot platform. The robot was placed in an environment
wherein appearance changed from time to time, i.e., the background and ob-
jects in the scene varied. In addition, the activities were captured at various
times of the day leading to varied lighting conditions. A total of 5 partici-
pants were asked to perform 18 different activities, including 10 Communica-
tive/Interactive activities and 8 Non-Interactive activities, each performed a to-
tal of three times with slight changes in viewpoint from the other instances.
‘Catching the Robot’s attention’, ‘Pointing in a direction’, ‘Asking to stop’, ‘Ex-
pressing dissent’, ‘Chopping’, ‘Cleaning’, ‘Repeating’, ‘Beckoning’, ‘Asking to
get phone’ and ‘facepalm’ were the 10 Robot-Interactive activities. In Robot-
Interactive activities like ‘Facepalm’, the human brings his/her hand up to his/her
head; similarly, the activity ‘chopping’ involved a human repeatedly hitting one
of his hands with the other hand, creating a stylized chopping action and so on.
The non interactive activities were more conventional activities of daily living
like ‘Drinking something’, ‘Wearing a backpack’, ‘Relaxing’, ‘Cutting’, ‘Feeling
hot’, ‘Washing face’, ‘Looking at time’ and ‘Talking on cellphone’.
We stress that our dataset is different from publicly available datasets as we
represent a new mix of activities, more aligned with how humans would perform
these in real life. In addition, the dataset involves wide variability in how the
activities were performed by different people as subjects used both left and right
hands along with variable time durations. For e.g., in the ‘Drinking something’
activity, some subjects took longer to drink water and brought the glass to their
mouth couple of times, while others took the glass to their mouth just once. The
wide variety and variability makes recognition challenging. We have made the
data available at: http://rise.cse.iitm.ac.in/activity-recognition/

4 Action Representation
Activities usually consist of sequences of sub-activities and can be fundamentally
described using two aspects: a) Motor Trajectory and b) Activity context. For
eg., in a drinking activity, a subject picks a glass or a cup, brings it closer to
his/her mouth and returns it. While there are numerous possibilities behind the
context of the activity, as a glass could contain juice while a cup could contain
coffee, thereby giving more meaning to the activity ‘drinking’ and answering
a question: What is probably being drunk? The motor trajectory followed by
most people for a generic drinking activity would predominantly be similar.
We aim to exploit this similarity and introduce a local motion based action
representation called Histogram of Direction Vectors, defined as the distribution
of directions taken by each skeleton joint during all skeleton pose transitions
during an activity.
The intuition behind the descriptor is that directions have a clear physical
significance and capturing motion intrinsics as a function of direction should be
discriminative across classes. We describe the 3D trajectory of each joint sepa-
rately and construct the final descriptor by concatenating the direction vector
histogram of each joint.
4.1 Direction vectors from skeletons
The algorithm takes RGBD images as input and uses the primesense skeleton tracker [1] to extract skeleton joints at each frame. For each joint i, P^i_f represents the 3D cartesian position of joint i at time frame f. The joint locations are then normalized by transforming the origin to the human torso, thereby making them invariant to human translation. Direction vectors are then calculated for each joint i by computing the difference between joint coordinates of frame f and frame f + τ, where τ is a fixed time duration (e.g., 0.1 seconds) in terms of frame counts. Mathematically, direction vectors are estimated for each joint at every frame as:

d^i_f = P^i_f − P^i_{f+τ}, ∀f ∈ [1, 2, . . . , f_max − τ]    (1)

The next section explains the construction of our action descriptor, the Histogram of direction vectors, and the final descriptor used to classify activities.
4.2 Histogram of direction vectors
At each frame f, the local region around a joint i is partitioned into a 3D spatial grid. We chose 27 primary directions in the 3D space and represented the direction taken by a joint by the nearest primary direction in that grid. The grid entries represent real world directions such as up, down, up-left, down-right and so on; resulting in a total of 27 directions. The direction vector corresponding to a joint i is mapped onto the index of one of the 27 directions, by estimating the 3D euclidean distance between grid coordinates σ_q and the direction vector

Citations
Journal ArticleDOI
TL;DR: A new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell.
Abstract: Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data. The proposed work extends this idea to spatial domain as well as temporal domain to better analyze the hidden sources of action-related information within the human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

436 citations

Journal ArticleDOI
TL;DR: A deep learning-based approach for temporal 3D pose recognition problems based on a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) recurrent network and a data augmentation method that has also been validated experimentally is proposed.

294 citations

Journal ArticleDOI
TL;DR: This paper constructs a 3D-based Deep Convolutional Neural Network to directly learn spatio-temporal features from raw depth sequences, then compute a joint based feature vector named JointVector for each sequence by taking into account the simple position and angle information between skeleton joints.

145 citations

Book
20 Dec 2016
TL;DR: A systematic survey of computational research in human-robot interaction (HRI) over the past decade is presented, categorized into eight topics, and promising future research areas are suggested.
Abstract: We present a systematic survey of computational research in human-robot interaction (HRI) over the past decade. Computational HRI is the subset of the field that is specifically concerned with the algorithms, techniques, models, and frameworks necessary to build robotic systems that engage in social interactions with humans. Within the field of robotics, HRI poses distinct computational challenges in each of the traditional core research areas: perception, manipulation, planning, task execution, navigation, and learning. These challenges are addressed by the research literature surveyed here. We surveyed twelve publication venues and include work that tackles computational HRI challenges, categorized into eight topics: (a) perceiving humans and their activities; (b) generating and understanding verbal expression; (c) generating and understanding non-verbal behaviors; (d) modeling, expressing, and understanding emotional states; (e) recognizing and conveying intentional action; (f) collaborating with humans; (g) navigating with and around humans; and (h) learning from humans in a social manner. For each topic, we suggest promising future research areas.

118 citations

Journal ArticleDOI
TL;DR: A novel, custom-designed multi-robot platform for research on AI, robotics, and especially human–robot interaction for service robots designed as a part of the Building-Wide Intelligence project at the University of Texas at Austin is introduced.
Abstract: Recent progress in both AI and robotics have enabled the development of general purpose robot platforms that are capable of executing a wide variety of complex, temporally extended service tasks in open environments. This article introduces a novel, custom-designed multi-robot platform for research on AI, robotics, and especially human–robot interaction for service robots. Called BWIBots, the robots were designed as a part of the Building-Wide Intelligence (BWI) project at the University of Texas at Austin. The article begins with a description of, and justification for, the hardware and software design decisions underlying the BWIBots, with the aim of informing the design of such platforms in the future. It then proceeds to present an overview of various research contributions that have enabled the BWIBots to better (a) execute action sequences to complete user requests, (b) efficiently ask questions to resolve user requests, (c) understand human commands given in natural language, and (d) understand hum...

104 citations

References
Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,579 citations

Journal ArticleDOI
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict human pose---the 3D positions of body joints---from a single depth image, without depending on information from preceding frames. Our approach is strongly rooted in current object recognition strategies. By designing an intermediate representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques exist. By using computer graphics to synthesize a very large dataset of training image pairs, one can train a classifier that estimates body part labels from test images invariant to pose, body shape, clothing, and other irrelevances. Finally, we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.The system runs in under 5ms on the Xbox 360. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,034 citations

Journal ArticleDOI
TL;DR: This article provides a detailed overview of various state-of-the-art research papers on human activity recognition, discussing both the methodologies developed for simple human actions and those for high-level activities.
Abstract: Human activity recognition is an important area of computer vision research. Its applications include surveillance systems, patient monitoring systems, and a variety of systems that involve interactions between persons and electronic devices such as human-computer interfaces. Most of these applications require an automated recognition of high-level activities, composed of multiple simple (or atomic) actions of persons. This article provides a detailed overview of various state-of-the-art research papers on human activity recognition. We discuss both the methodologies developed for simple human actions and those for high-level activities. An approach-based taxonomy is chosen that compares the advantages and limitations of each approach. Recognition methodologies for an analysis of the simple actions of a single person are first presented in the article. Space-time volume approaches and sequential approaches that represent and recognize activities directly from input images are discussed. Next, hierarchical recognition methodologies for high-level activities are presented and compared. Statistical approaches, syntactic approaches, and description-based approaches for hierarchical recognition are discussed in the article. In addition, we further discuss the papers on the recognition of human-object interactions and group activities. Public datasets designed for the evaluation of the recognition methodologies are illustrated in our article as well, comparing the methodologies' performances. This review will provide the impetus for future research in more productive areas.

2,084 citations


"Activity Recognition for Natural Hu..." refers background in this paper

  • ...In this section, we restrict ourselves largely to research relevant to robotics, and for an in-depth review of the field, one can refer to recent survey papers [2]....


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures and achieves superior results on the challenging 3D action dataset.
Abstract: In this paper, we present a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures. We extract the 3D skeletal joint locations from Kinect depth maps using Shotton et al.'s method [6]. The HOJ3D computed from the action depth sequences are reprojected using LDA and then clustered into k posture visual words, which represent the prototypical poses of actions. The temporal evolutions of those visual words are modeled by discrete hidden Markov models (HMMs). In addition, due to the design of our spherical coordinate system and the robust 3D skeleton estimation from Kinect, our method demonstrates significant view invariance on our 3D action dataset. Our dataset is composed of 200 3D sequences of 10 indoor activities performed by 10 individuals in varied views. Our method is real-time and achieves superior results on the challenging 3D action dataset. We also tested our algorithm on the MSR Action 3D dataset and our algorithm outperforms Li et al. [25] on most of the cases.

1,453 citations


"Activity Recognition for Natural Hu..." refers background or methods in this paper

  • ...For this dataset, we compare our approach with the state of the art methodology called histogram of 3D skeleton joint positions (HOJ3D)[14] using Leave one Sequence out Cross validation (LOOCV) and cross subject validation as defined previously in this paper....


  • ...Feature masking resulted in increased accuracy in not only our dataset (Figure 2) but also the CAD 60 and UTKinect Action Datasets. The average accuracy improved to 82.59%....


  • ...Incorporating such natural interactions is not easy when robots work in highly cluttered environments where people carry out activities in different ways leading to high variability [14, 7]....


  • ...Recent advances in pose estimation [11] and cheap availability of RGBD cameras, has lead to many RGBD activity datasets [12, 14]....


  • ...The UTKinect Action Dataset [14] presents RGBD video sequences and skeleton information of humans performing various activities from different views....


Journal ArticleDOI
TL;DR: In this paper, a structural support vector machine (SSVM) was used to extract a descriptive labeling of the sequence of sub-activities being performed by a human, and more importantly, their interactions with the objects in the form of associated affordances.
Abstract: Understanding human activities and object affordances are two very important skills, especially for personal robots which operate in human environments. In this work, we consider the problem of extracting a descriptive labeling of the sequence of sub-activities being performed by a human, and more importantly, of their interactions with the objects in the form of associated affordances. Given a RGB-D video, we jointly model the human activities and object affordances as a Markov random field where the nodes represent objects and sub-activities, and the edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. We formulate the learning problem using a structural support vector machine (SSVM) approach, where labelings over various alternate temporal segmentations are considered as latent variables. We tested our method on a challenging dataset comprising 120 activity videos collected from 4 subjects, and obtained an accuracy of 79.4% for affordance, 63.4% for sub-activity and 75.0% for high-level activity labeling. We then demonstrate the use of such descriptive labeling in performing assistive tasks by a PR2 robot.

666 citations

Frequently Asked Questions (13)
Q1. What have the authors contributed in "Activity recognition for natural human robot interaction" ?

To allow intuitive human robot interaction, the authors propose the use of human-like stylized gestures as communicative actions and contrast them from conventional activities of daily living. The authors present a simple yet effective approach of modelling pose trajectories using directions traversed by human joints over the duration of an activity and represent the action as a histogram of direction vectors. 

The focus of their work is to utilize distinctions in motion to differentiate between communicative/instructive actions and conventional activities of daily living. 

Since each skeleton is described by 20 joints, their feature vector is of dimensions 20 × 27, i.e., a total of 540 features were used for classification in this dataset. 

The intuition behind the descriptor is that directions have a clear physical significance and capturing motion intrinsics as a function of direction should be discriminative across classes. 

They first classified activities into three groups, namely: Zero, Transitional and Strong displacement activities, followed by a finer classification using HMMs. 

Recent advances in pose estimation [11] and cheap availability of RGBD cameras have led to many RGBD activity datasets [12, 14].

to best support humans, assistive robots need to behave interactively like humans, making it imperative to correctly understand the human actions involved.

10 subjects perform 10 different activities namely: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands. 

Interactive actions were classified with an accuracy of 92.67% while Non Interactive activities were classified correctly with an accuracy of 85%. 

direction vectors are estimated for each joint at every frame as: d^i_f = P^i_f − P^i_{f+τ}, ∀f ∈ [1, 2, . . . , f_max − τ]

In such scenarios, in addition to recognizing the drinking activity, the robot needs to be capable of recognizing communicative actions, so as to infer whether it should pour more or stop. 

All experiments were performed using 5 fold cross subject cross validation, such that, at a time, all instances of one subject were used for testing and the instances from the other subjects were used for training. 

As robots are employed to perform a wide range of tasks, especially in human environments, the need to facilitate natural interaction between humans and robots is becoming more pertinent.