Identification of Emergent Leaders in a Meeting Scenario
Using Multiple Kernel Learning
Cigdem Beyan (1), Francesca Capozzi (2), Cristina Becchio (3,4), Vittorio Murino (1,5)
(1) Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genova, Italy
(2) Department of Psychology, McGill University, Montreal, QC, Canada
(3) Department of Psychology, University of Turin, Italy
(4) Robotics, Brain and Cognitive Sciences, Istituto Italiano di Tecnologia, Genova, Italy
(5) Department of Computer Science, University of Verona, Verona, Italy
(cigdem.beyan,cristina.becchio,vittorio.murino)@iit.it, francesca.capozzi@mcgill.ca
ABSTRACT
In this paper, an effective framework for the detection of emergent leaders in small groups is presented. In this scope, a combination of different types of nonverbal visual features is utilized: visual focus of attention, head activity and body activity based features. Using them together yielded significant results. For the first time, multiple kernel learning (MKL) was applied for the identification of the most and the least emergent leaders. By taking advantage of MKL's capability to use different kernels corresponding to different feature subsets with different notions of similarity, significantly improved results compared to the state of the art methods were obtained. Additionally, high correlations were demonstrated between the majority of the features and the social psychology questionnaires designed to estimate leadership or dominance.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing
Keywords
Emergent leadership; head pose; head activity; body activity; multiple kernel learning; social signal processing
1. INTRODUCTION
Social interaction is a main facet of human life and also a fundamental research area for social psychology. Even though psychologists have been working on social interactions for a very long time, their automatic analysis is a relatively new problem.
Social interactions are based on verbal (the words spoken) and nonverbal (eye gaze, head and body activities, body gestures, facial expressions, speaking style, speaking time, interruptions, turns, etc.) communication.
Verbal communication is key in social interactions, but it is also known that a large amount of information is conveyed nonverbally [18]. Moreover, nonverbal communication allows estimating human personality and relationships [18].
Social signal processing (SSP) is the field which aims to analyze human interactions automatically using recent advances in machine analysis (e.g. speech processing, computer vision, machine learning). One topic among many others in this area concerns social interactions in small groups [25], such as meetings. Examples include detecting the group interest level during meetings [10], modeling dominance in small group conversations [2], identifying emergent leaders (ELs) in a meeting scenario [24, 6], and analyzing the social focus of attention [28].
In this paper, we investigate small groups in terms of emergent leadership. An emergent leader (EL) is the person who appears as the leader during a social interaction. His/her leading power is related to his/her dominance, influence, leadership and control rather than his/her role in the organizational hierarchy [27]. Automatically identifying ELs in a small group was first investigated in [24] and in the publications related to that work (such as [26, 27, 25]). Recently, in [6], the most and the least ELs in a meeting environment were detected using visual nonverbal features only.
To automatically identify the ELs, different supervised methods (Support Vector Machine (SVM) [6], ranking the scores of SVM [27, 24], collective classification [27, 24]) and unsupervised methods (a rule-based approach [27, 25, 24, 26], a rank-level fusion approach (called RLFA in this paper) [26, 27, 24]) have been used. To the best of our knowledge, none of the existing works applied multiple kernel learning (MKL) for the detection of ELs. However, given that different modalities and different types of features are being used, MKL can perform well thanks to its ability to use different kernels corresponding to different feature subsets having different notions of similarity.
In this work, we combine different types of visual nonverbal features to detect the most and the least ELs in a meeting environment. The performance of MKL is compared with the state of the art methods. The main contribution of this work is utilizing MKL (for the first time) to detect the ELs, which demonstrated significantly improved results. Additionally, for the first time, VFOA, head activity and body activity based features are used together for emergent leadership. Unlike the majority of the works in the emergent leadership literature, only visual features are utilized, and they showed good results. This is especially promising for analyzing social interactions in the absence of audio sensors [6, 14]. The presented results are also notable considering that existing studies reported better EL estimation using audio features than using video features or audio and video together. Furthermore, it is likely that fusion with audio features might produce even better results when the visual features are extracted using the methods given in this work.
The rest of this paper is organized as follows. Previous studies on leadership are discussed in Section 2. In Section 3, the data collection, including the questionnaires used and the annotation of the data, is described. In Section 4, the nonverbal features used are described and the methods used to extract them are presented. Section 5 presents the multiple kernel learning approach. The experimental analysis, including the EL detection results and the correlation analysis of each feature with the questionnaires, is reported in Section 6. Finally, we conclude the paper with discussions and future work in Section 7.
2. RELATED WORK
In this section, emergent leadership studies in SSP are reviewed, and the nonverbal features they used and the learning methods they applied are discussed. For interested readers, a detailed survey on emergent leadership in social psychology and social computing can be found in [24].
In [26], only speech activity was used to identify the emergent leader. As nonverbal features, speaking length, speaking turn duration and successful interruptions were used. To predict the EL, a rule-based estimator (based on the hypothesis that the EL is the one who has the highest value of a nonverbal feature) and RLFA (which tests whether any combination of features performs better than a single feature) were used. The evaluation of the nonverbal features was carried out using the variables perceived leadership (PLead), perceived dominance (PDom), perceived competence (PComp) and dominance ranking (RDom). The results showed that there is a significant relationship between PDom and emergent leadership and that the EL is the person who talks the most and has more turns and interruptions. Moreover, the audio nonverbal features were successful in identifying the EL even when used individually (60-80% accuracy).
The fusion of speaking activity based features (speaking length, speaking turns and speaking interruptions) with features based on VFOA (attention received, given attention, attention quotient and attention center) was investigated in [25]. In that paper [25], looking while speaking, looking while listening, being looked at while speaking, being the center of attention while speaking and the visual dominance ratio (the ratio of looking while speaking to looking while listening without speaking [14]) were extracted to identify the ELs. Similar to [26], a rule-based estimator was used. The evaluation of the features was carried out in terms of PLead, PDom, and RDom. The results showed that speaking activity based features are not only better than visual attention based features but also better than the fusion of these modalities for detecting the ELs. The main reason for the poor performance of the nonverbal visual features may be the insufficient performance of the method used to extract the VFOA automatically (42% frame-level accuracy on a subsample of the data). In [24] (only the method presented in Chapter 6), the emergent leadership detection performance of the same visual attention features was explored more deeply in terms of the variables PLead, PDom, PComp, perceived liking (PLike) and RDom. According to that analysis, the amount of attention received from participants was the most informative feature, while the attention center was the second best feature for emergent leadership. Additionally, strong correlations of attention received with PLead, PDom and RDom were detected. On the other hand, negative correlations of attention received and attention center with PLike were found.
In [27], emergent leadership was addressed using audio and video based nonverbal features, both individually and together. As nonverbal audio features, speaking turn features (such as total speaking length and total speaking turns) and prosodic features (such as energy and pitch variation) were extracted. As visual nonverbal features, head activity based features (see Section 4.2), body activity based features (see Section 4.3), and motion template based features (mean, median, quantile and entropy of body activity extracted from the motion energy image and the motion history image, which summarize the spatio-temporal activity of a person) were used. Differently from the feature extraction presented in Section 4.2, in [27] head tracking was performed using a particle filter. Unlike our study and [25], VFOA based features were not used; however, the motion template based features, which can be seen as redundant given that features based on head and body activities were already being used, were utilized. As learning and estimation methods, rule-based estimation, RLFA, ranking the scores of an SVM (with a linear kernel), and collective classification [21] were compared. The results showed that head and body activity based features perform better than the other visual features for PLead, PDom and RDom. For audio features only, RLFA was the best performing method, followed by the collective classification methods. Moreover, the speaking turn and energy features were the best features in terms of PLead, PDom and RDom accuracies. For audio and visual features together, RLFA performed best, while the collective classification methods were as good as it for PLead, PDom and RDom. Speaking turn based features, body activity based features and energy together were the best performing feature combination for inferring the ELs.
Unlike the studies [26, 25, 27], which utilized audio features or audio together with visual features, [6] presented the detection of the most and the least ELs in a meeting scenario using visual nonverbal features only. In that study [6], the head pose was used to approximate the gaze. Then, the VFOA was estimated from the head pose and novel VFOA features (see Section 4.1) were proposed. The estimation of the VFOA was performed by methods based on SVM (72% frame-level accuracy on a subsample of the data). The best performing method when all VFOA based features were used was SVM-cost [9] for the most EL detection, while the best performing method for detecting the least EL was RLFA.
In this paper, the VFOA, head and body activity based features are combined to identify the most and the least ELs in a meeting scenario. The combination of different feature types shows very encouraging performance. As the main contribution, for the first time, MKL was applied, demonstrating improved results compared to the best performing methods for emergent leadership: RLFA and SVM.

3. DATA COLLECTION
The leadership dataset used in this paper is the same dataset presented in [6]. It comprises 393 minutes in total, with the longest session lasting 30 minutes and the shortest 12 minutes. There are 16 meeting sessions in total, each composed of four unacquainted participants of the same gender (in total 44 females and 20 males, with an average age of 21.6 and a standard deviation of 2.24). In total five cameras were used: four frontal cameras (with a resolution of 1280x1024 pixels and a frame rate of 20 frames per second) to capture each participant individually, and a standard camera to record the whole scene (with a resolution of 1440x1080 pixels and a frame rate of 25 frames per second, used only for data annotation). Four wireless lapel microphones were utilized for audio acquisition (audio sample rate = 16 kHz). The usage of audio is out of the scope of this paper and will be investigated as future work. The participants performed either the “winter survival” or the “desert survival” task [16], as these are the most common tasks in small group studies of decision making, dominance and leadership. For more details see [6].
3.1 Questionnaires
In total, two different questionnaires were used for evaluation: i) the SYstematic method for the Multiple Level Observation of Groups (SYMLOG) [4, 19] and ii) the General Leader Impression Scale (GLIS) [20]. The SYMLOG is a tool to evaluate individuals in terms of dominance versus submissiveness, acceptance versus non-acceptance of the task orientation of established authority, and friendliness versus unfriendliness. On the other hand, the GLIS is an instrument used to evaluate the leadership attitude that each participant displays during a group interaction.
Both tools can be used as a self-assessment instrument and also as an instrument for the external observation of a group interaction. Interested readers can refer to [6] for information about how and why SYMLOG and GLIS were applied as self-assessment instruments. In this paper, the results obtained from external judges were used to evaluate the extracted nonverbal visual features in terms of emergent leadership. In detail, two independent judges observed each meeting and rated each participant using SYMLOG (called SYMLOG-Observers in this paper) and GLIS (called GLIS-Observers in this paper). The results showed that the dominance intra-class correlation (ICC), task orientation ICC and friendliness ICC were 0.866, 0.569 and 0.722, respectively, with p<0.001 for SYMLOG-Observers. For the leadership impression, in our analysis we only used the dominance sub-scale of SYMLOG. For GLIS-Observers, the ICC was 0.771 with p<0.001 for the leadership attitude that each member displays during a group interaction. Additionally, it was observed that the leadership impression obtained by GLIS-Observers and the dominance from SYMLOG-Observers tend to correlate with each other. The final scores of each questionnaire for each participant were calculated as the average of the ratings of the two observers.
3.2 Data Annotation
The 16 meeting sessions were divided into small segments, each lasting 5 minutes on average. Through this segmentation, in total 75 meeting segments were obtained. For the analysis presented in Section 6.1, these 75 meeting segments were used rather than the original full meetings. The reasons for such a segmentation were i) to obtain more training and testing data, similar to the approach in [15], and ii) to have more accurate ground truth annotations, since people are more precise and stay more focused when annotating shorter videos, as mentioned in [1].
In total, 50 observers participated in the annotation of the meeting segments. Each observer annotated either 12 or 13 meeting segments in total, while no more than one segment belonging to the same meeting session was annotated by the same observer. Each meeting segment was annotated by 8 annotators on average. Here, it is important to highlight that the psychology literature has already shown that human observers are able to identify the ELs in a meeting scenario [27]. During the annotations, audio was not used, in order to cope with any possible problem that might occur due to the level of understanding of the spoken language (similar to [17]); this also allowed us to utilize international observers.
Observers ranked the emergent leadership behavior that each participant exhibited in a meeting segment. In this paper, we used the annotations regarding the most and the least EL. The other rankings were considered as the same class (called the rest in Section 6.1). The analysis of the leadership annotations showed that annotating the least EL was more challenging than annotating the most EL. In detail, for the most EL annotation, there was full agreement in 26 out of 75 video segments and 73% agreement in 49 out of 75 video segments. On the other hand, for the least EL annotation, there was 100% agreement in 13 out of 75 video segments and 70% agreement in 62 out of 75 video segments. For each meeting segment, Krippendorff's α coefficient (75 in total) was also calculated using the annotations: the most EL, the least EL and the rest. The average Krippendorff's α was found to be 0.51 (reliability exists) with a standard deviation of 0.27, while 7 segments have an α smaller than 0.10 (low reliability) and 6 segments have an α equal to 1.00 (perfect reliability).
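For illustration, the per-segment reliability computation can be sketched as follows, assuming the third-party krippendorff Python package and a hypothetical nominal coding of the three classes (1 = most EL, 2 = rest, 3 = least EL); the coding and tooling actually used by the authors are not stated in the paper.

# A minimal sketch of the per-segment reliability computation, assuming the
# third-party "krippendorff" package and a hypothetical coding of the
# annotations (1 = most EL, 2 = rest, 3 = least EL).
import numpy as np
import krippendorff

# One row per annotator, one column per participant; np.nan would mark missing ratings.
segment_ratings = np.array([
    [1, 2, 2, 3],
    [1, 2, 3, 2],
    [1, 3, 2, 2],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=segment_ratings,
                           level_of_measurement="nominal")
print("Krippendorff's alpha for this segment: %.2f" % alpha)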
4. NONVERBAL FEATURE EXTRACTION
In this section, the extracted nonverbal visual features and the methods used to obtain them are described. These nonverbal visual features include i) the visual focus of attention (VFOA) based features, which were extracted using head pose estimation; ii) the head activity based features, which were extracted using face detection and optical flow; and iii) the body activity based features, which were obtained using image differencing. The feature extraction process for each type of feature is summarized in Figures 1, 2 and 3.
4.1 Visual Focus of Attention Based Features
To extract the VFOA based features, the approach given in [6] was utilized. This method includes facial landmark detection, head pose estimation, modeling the VFOA in a supervised way, and estimating the VFOA over the whole video to extract nonverbal features.
By using the Constrained Local Model (CLM) [8], the facial landmarks in 2D coordinates were converted to 3D coordinates, which were used to estimate the head pose (pan, tilt and roll). Later, the head pose representation (pan and tilt only) was used to find the VFOA. The VFOA of a participant is composed of four possibilities: left if the participant is looking at the participant on his/her left, right if the participant is looking at the participant on his/her right, front if the participant is looking at the participant in front of him/her, and no-one if the participant is not looking at any other participant but somewhere else.

Figure 1: Extraction of VFOA based features
Figure 2: Extraction of head activity based features
Figure 3: Extraction of body activity based features
For modeling and estimating a participant's VFOA for the entire video, the cost function [9] (SVM-cost), random under-sampling [31] (SVM-RUS), and SMOTE [7] (SVM-SMOTE) methods were combined with SVM (see [6] for more information). As the SVM model, the radial basis kernel function (RBF) with varying kernel parameters was used. After a participant's VFOA was obtained, it was smoothed (with a moving average of span 5) to remove noise. Finally, the following nonverbal features (referred to as VFOAFea in Section 6) were extracted as presented in [6]:
totWatcher_i: The total time that participant i is being watched by the other participants in the meeting.
totME_i: The total time that participant i is mutually looking at any other participant in the meeting (also called mutual engagement (ME)).
totWatcherNoME_i: The total time that participant i is being watched by any other participant in the meeting while there is no ME.
totNoLook_i: The total time labeled as no-one in the VFOA vector, meaning that participant i is not looking at any other participant in the meeting.
lookSomeOne_i: The total time that participant i looked at other participants in the meeting.
totInitiatorME_i: The total time that participant i initiates MEs with any other participant in the meeting.
stdInitiatorME_i: The standard deviation of the total time that participant i initiates MEs with any other participant in the meeting.
totInterCurrME_i: For participant i, the total time intercurrent between the initiations of ME with any other participant in the meeting.
stdInterCurrME_i: For participant i, the standard deviation of the total time intercurrent between the initiations of ME with any other participant in the meeting.
totWatchNoME_i: The total time that participant i is looking at any other participant in the meeting while there is no ME.
maxTwoWatcherWME_i: The maximum time that participant i is looked at by any other two participants while participant i can have a ME with either of the two participants.
minTwoWatcherWME_i: The minimum time that participant i is looked at by any other two participants while participant i can have a ME with either of the two participants.
maxTwoWatcherNoME_i: The maximum time that participant i is looked at by any other two participants while participant i has no ME with either of the two participants.
minTwoWatcherNoME_i: The minimum time that participant i is looked at by any other two participants while participant i has no ME with either of the two participants.
ratioWatcherLookSOne_i: The ratio between totWatcher_i and lookSomeOne_i.
In total, 15 features were extracted. All features (except ratioWatcherLookSOne) were divided by the length of the corresponding meeting since the lengths of the meetings are variable.
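To make the feature definitions concrete, the following sketch computes a subset of the VFOA features from per-frame VFOA label sequences. The per-participant label representation and the who_is_watched helper (which maps a participant's left/right/front label to the actual target participant, given the seating arrangement) are assumptions made for illustration only; they are not part of the paper.

# Sketch: a subset of VFOAFea from per-frame label sequences (hypothetical representation).
def vfoa_features(vfoa, who_is_watched, meeting_len):
    """vfoa: dict participant -> per-frame labels in {"left","right","front","no-one"}.
    who_is_watched(p, label): participant that p looks at, or None for "no-one"
    (hypothetical helper based on the seating arrangement).
    meeting_len: meeting length used for normalization."""
    feats = {}
    for p, labels in vfoa.items():
        watched = mutual = 0
        no_look = sum(lbl == "no-one" for lbl in labels)
        look_someone = len(labels) - no_look
        for t in range(len(labels)):
            # participants currently looking at p
            watchers = [q for q in vfoa if q != p
                        and who_is_watched(q, vfoa[q][t]) == p]
            watched += bool(watchers)
            # mutual engagement: p looks back at one of its watchers
            mutual += who_is_watched(p, labels[t]) in watchers
        feats[p] = {
            "totWatcher": watched / meeting_len,
            "totME": mutual / meeting_len,
            "totWatcherNoME": (watched - mutual) / meeting_len,
            "totNoLook": no_look / meeting_len,
            "lookSomeOne": look_someone / meeting_len,
            "ratioWatcherLookSOne": watched / max(look_someone, 1),
        }
    return feats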
4.2 Head Activity Based Features

The extraction of the head activity based features was adapted from [27], and has been used by many other works, such as [3] and [22], to identify different types of social interactions. Differently from the method presented in [27] to detect and track the faces, in this study we used the most well-known face detection algorithm, Viola-Jones [23], which is based on Haar-like features and AdaBoost. A trained face detector was used to detect the face of each participant. This detector returns rectangular bounding boxes which tightly surround the detected faces. The method was evaluated using 25600 randomly selected frames (400 frames for each frontal video, a number determined by a confidence level of 90% and a margin of error of 4%) and achieved 90% accuracy with a standard deviation of 0.12.
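A minimal version of this face detection step could look as follows, using OpenCV's stock Haar cascade; the trained detector and evaluation protocol used by the authors are not reproduced here.

# Sketch of the face detection step with OpenCV's stock Viola-Jones cascade.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(gray_frame):
    """Return the largest detected face as an (x, y, w, h) box, or None."""
    boxes = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])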
After the face area was detected, the optical flow vectors (using the Lucas-Kanade optical flow algorithm [5]) between two consecutive frames within the face area were computed. The optical flow vectors were used to obtain the average head motion in the x and y coordinates. This results in real-valued vectors representing a participant's head activity in 2 dimensions for a given meeting. These vectors were binarized using a threshold to distinguish significant head activities from less significant ones. The thresholds (one for each dimension) were defined as the sum of the mean and the standard deviation of the head motion per dimension. This results in two binary vectors, one for each dimension, where head activity values greater than the threshold represent significant activity for a given dimension and values smaller than or equal to the threshold represent insignificant head activity (small movements, noise, etc.). The obtained binary vectors were fused with an OR operation to obtain a final binary head activity vector.
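A rough sketch of this step is given below: the per-frame head motion is estimated as the mean Lucas-Kanade displacement of corner points inside the face box, and the magnitudes of the two resulting motion signals are thresholded and OR-fused. The parameter values and the use of magnitudes are illustrative assumptions, not the paper's exact settings.

# Rough sketch of the head activity computation (parameters are illustrative).
import cv2
import numpy as np

def mean_head_motion(prev_gray, curr_gray, face_box):
    """Mean optical-flow displacement (dx, dy) of corner points inside the face box."""
    x, y, w, h = face_box
    prev_roi = prev_gray[y:y + h, x:x + w]
    curr_roi = curr_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(prev_roi, maxCorners=50,
                                  qualityLevel=0.01, minDistance=3)
    if pts is None:
        return 0.0, 0.0
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_roi, curr_roi, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return 0.0, 0.0
    disp = (nxt[good] - pts[good]).reshape(-1, 2)
    return float(disp[:, 0].mean()), float(disp[:, 1].mean())

def binarize_head_activity(dx, dy):
    """Threshold each dimension at mean + std of its magnitude and OR-fuse."""
    dx, dy = np.abs(np.asarray(dx)), np.abs(np.asarray(dy))
    bx = dx > (dx.mean() + dx.std())
    by = dy > (dy.mean() + dy.std())
    return bx | by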
Instead of using the optical flow vectors, using the absolute displacement of the centers of the face bounding boxes in consecutive frames can be an alternative way to infer the head motion. Our analysis with this approach showed that using the absolute displacement is not significantly worse than using the optical flow vectors for detecting the most and the least ELs. But since the approach with optical flow vectors performed better in general, it was used for the analysis in Section 6.
Using the obtained real-valued head activity vectors and the binary head activity vector of each participant, the following features (referred to as HeadActFea in Section 6) were extracted:
THL_i: The total time that the head of participant i is moving.
THT_i: The number of head activity turns for participant i, where each turn represents a continuous head activity.
AHT_i: The average head activity turn duration for participant i.
stdHx_i and stdHy_i: The standard deviation of the head activity of participant i in the x and y dimensions, respectively.
In total, 5 features were extracted using the head activity. All features except stdHx and stdHy were calculated using the binary vector, whereas for stdHx and stdHy the real-valued vectors were used.
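As an illustration, the features above can be computed from the binary activity vector and the real-valued motion vectors roughly as follows; the 20 fps rate of the frontal cameras (Section 3) is assumed here for converting frame counts to time.

# Sketch of computing HeadActFea from the binary and real-valued vectors.
import numpy as np

def activity_turns(binary_vec):
    """Lengths (in frames) of contiguous runs of 1s in a binary activity vector."""
    runs, count = [], 0
    for v in binary_vec:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def head_activity_features(binary_vec, dx, dy, fps=20.0):
    turns = activity_turns(binary_vec)
    thl = sum(turns) / fps                      # THL: total time the head is moving (s)
    tht = len(turns)                            # THT: number of head activity turns
    aht = thl / tht if tht else 0.0             # AHT: average turn duration (s)
    return {"THL": thl, "THT": tht, "AHT": aht,
            "stdHx": float(np.std(dx)),         # std of real-valued motion in x
            "stdHy": float(np.std(dy))}         # std of real-valued motion in y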
4.3 Body Activity Based Features
The extraction of the body activity based features was applied as given in [27]. Since in all of the meeting sessions the background is stationary, image differencing was useful for detecting the moving pixels, which are assumed to belong to the participant in the frontal video. Here, it is possible to use different foreground detection algorithms, but our inference (after testing different algorithms) is that image differencing is more practical, as it is less sensitive to noise besides being the simplest foreground detection method. All the moving pixels outside the detected face area (obtained as described in Section 4.2) were considered as part of the body.
Before computing the difference image between consecutive frames, each frame was converted to a gray-scale image. The difference image was converted into moving and not-moving pixels using a threshold (taken as 30). Hence, if the difference between the gray-scale values of two pixels belonging to consecutive frames was greater than the threshold, that pixel was labeled as a moving pixel; otherwise it was labeled as a not-moving pixel. After the moving pixels were found, the total number of moving pixels in each frame was normalized by the size of the frame. As a result of these steps, a real-valued vector was obtained. This vector was binarized using another threshold (taken as 5%) to distinguish significant body activities from less significant ones.
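A compact sketch of the per-frame computation follows; the face box exclusion and the two thresholds (30 gray levels, 5% of moving pixels) match the description above, while everything else is illustrative.

# Sketch of the per-frame body activity computation.
import cv2
import numpy as np

def body_activity_ratio(prev_frame, curr_frame, face_box, diff_thresh=30):
    """Fraction of moving pixels outside the detected face area."""
    prev_g = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_g = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    moving = cv2.absdiff(curr_g, prev_g) > diff_thresh
    x, y, w, h = face_box
    moving[y:y + h, x:x + w] = False      # exclude the face area from the body
    return moving.sum() / moving.size     # normalize by the frame size

# The per-frame ratios form the real-valued vector, binarized with the 5% threshold,
# e.g.: binary_body = np.asarray(ratios) > 0.05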
Using the obtained real-valued vector and the binary vector of each participant, the following features (referred to as BodyActFea in Section 6) were extracted:
TBL_i: The total time that the body of participant i is moving.
TBT_i: The number of body activity turns for participant i, where each turn represents a continuous body activity.
ABT_i: The average body activity turn duration for participant i.
stdB_i: The standard deviation of the body activity of participant i.
In total, 4 features were extracted using the body activity. All features except stdB were calculated using the binary vector, while for stdB the real-valued vector was used.
5. MULTIPLE KERNEL LEARNING
Multiple Kernel Learning (MKL) methods use a set of kernels and learn the optimal combination of them in either a linear or a non-linear way. MKL methods are in general preferred due to i) their ability to find the optimal kernel combination from a large set of kernels instead of trying each kernel to see which works best, and ii) their ability to utilize different kernels which can correspond to different feature subsets coming from multiple sources that probably have different notions of similarity [12]. There has been extensive work on MKL in the literature; for a comprehensive survey on different MKL methods and their comparisons, interested readers can refer to [12].
The simplest way to combine different kernels is to use an unweighted sum of kernel functions, which gives equal preference to all kernels. A better strategy is to learn a weighted sum. Arbitrary kernel weights (linear combination), non-negative kernel weights (conic combination) and weights on a simplex (convex combination) are possible kernel combinations. Linear combinations of weights can be restrictive, whereas a nonlinear combination can be better.
In this study, we utilized Localized Multiple Kernel Learning (LMKL) [11, 12], which uses nonlinear combinations of kernel weights. LMKL is based on assigning different kernel weights to different regions of the feature space. Briefly, this method contains two components whose optimizations are performed jointly with a two-step procedure.
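For illustration only, the sketch below combines RBF kernels computed on the three feature subsets (VFOAFea, HeadActFea, BodyActFea) with a plain fixed-weight sum and trains a precomputed-kernel SVM. This is not the LMKL formulation used in the paper, whose gating model learns region-dependent kernel weights; the weights and kernel parameters here are placeholders.

# Illustration only: fixed-weight sum of per-subset RBF kernels fed to a
# precomputed-kernel SVM (not the LMKL formulation used in the paper).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def combined_kernel(subsets_a, subsets_b, gammas, weights):
    """Weighted sum of RBF kernels, one per feature subset (VFOA, head, body)."""
    K = np.zeros((subsets_a[0].shape[0], subsets_b[0].shape[0]))
    for Xa, Xb, g, w in zip(subsets_a, subsets_b, gammas, weights):
        K += w * rbf_kernel(Xa, Xb, gamma=g)
    return K

# Hypothetical usage, with rows corresponding to participants in meeting segments:
# train = [X_vfoa_tr, X_head_tr, X_body_tr]; test = [X_vfoa_te, X_head_te, X_body_te]
# gammas, weights = [0.1, 0.5, 0.5], [1/3, 1/3, 1/3]
# clf = SVC(kernel="precomputed").fit(combined_kernel(train, train, gammas, weights), y_tr)
# preds = clf.predict(combined_kernel(test, train, gammas, weights))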
