Identification of Emergent Leaders in a Meeting Scenario
Using Multiple Kernel Learning
Cigdem Beyan (1), Francesca Capozzi (2), Cristina Becchio (3,4), Vittorio Murino (1,5)
(1) Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genova, Italy
(2) Department of Psychology, McGill University, Montreal, QC, Canada
(3) Department of Psychology, University of Turin, Italy
(4) Robotics, Brain and Cognitive Sciences, Istituto Italiano di Tecnologia, Genova, Italy
(5) Department of Computer Science, University of Verona, Verona, Italy
(cigdem.beyan,cristina.becchio,vittorio.murino)@iit.it, francesca.capozzi@mcgill.ca
ABSTRACT
In this paper, an effective framework for the detection of emergent leaders in small groups is presented. In this scope, a combination of different types of nonverbal visual features is utilized: visual focus of attention, head activity and body activity based features. Using them together yielded significant results. For the first time, multiple kernel learning (MKL) was applied for the identification of the most and the least emergent leaders. By taking advantage of MKL's capability to use different kernels corresponding to different feature subsets with different notions of similarity, significantly improved results compared to the state of the art methods were obtained. Additionally, high correlations were demonstrated between the majority of the features and the social psychology questionnaires designed to estimate leadership or dominance.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing
Keywords
Emergent leadership; head pose; head activity; body activity; multiple kernel learning; social signal processing
1. INTRODUCTION
Social interaction is a main facet of human life and also a fundamental research area for social psychology. Even though psychologists have been working on social interactions for a very long time, their automatic analysis is a relatively new problem.
Social interactions are based on verbal (the words spoken) and nonverbal (eye gaze, head and body activities, body gestures, facial expressions, speaking style, speaking time, interruptions, turns, etc.) communication.
Verbal communication is key in social interactions, but it is also known that a large amount of information is conveyed nonverbally [18]. Moreover, nonverbal communication allows estimating human personality and relationships [18].
Social signal processing (SSP) is the field which aims to analyze human interactions automatically using recent advances in machine analysis (e.g. speech processing, computer vision, machine learning). One topic among many others in this area concerns social interactions in small groups [25], such as meetings. Examples include detecting the group interest level during meetings [10], modeling dominance in small group conversations [2], identifying emergent leaders (ELs) in a meeting scenario [24, 6], and analyzing the social focus of attention [28].
In this paper, we investigate small groups in terms of emergent leadership. An emergent leader (EL) is the person who appears as the leader during a social interaction. His/her leading power is related to his/her dominance, influence, leadership and control rather than his/her role in the organizational hierarchy [27]. Automatically identifying ELs in a small group was first investigated in [24] and in the publications related to that work (such as [26, 27, 25]). Recently, in [6], the most and the least ELs in a meeting environment were detected using visual nonverbal features only.
To automatically identify the ELs, different supervised methods (Support Vector Machine (SVM) [6], ranking the scores of SVM [27, 24], collective classification [27, 24]) and unsupervised methods (a rule-based approach [27, 25, 24, 26], a rank-level fusion approach (called RLFA in this paper) [26, 27, 24]) have been used. To the best of our knowledge, none of the existing works applied multiple kernel learning (MKL) for the detection of ELs. However, given that different modalities and different types of features are being used, MKL can perform well thanks to its ability to use different kernels corresponding to different feature subsets having different notions of similarity.
In this work, we combine different types of visual nonverbal features to detect the most and the least ELs in a meeting environment. The performance of MKL is compared with the state of the art methods. The main contribution of this work is utilizing MKL (for the first time) to detect the ELs, which demonstrated significantly improved results. Additionally, for the first time, VFOA, head activity and body activity based features are used together for emergent leadership. Unlike the majority of the works in the emergent leadership literature, only visual features are utilized, and they showed good results. This is especially promising for analyzing social interactions in the absence of audio sensors [6, 14]. The presented results are also notable considering that existing studies reported better EL estimation using audio features than using video features or audio and video together. Furthermore, it is likely that fusion with audio features might produce even better results when the visual features are extracted using the methods given in this work.
The rest of this paper is organized as follows. Previous studies on leadership are discussed in Section 2. In Section 3, the data collection, including the questionnaires used and the annotation of the data, is described. In Section 4, the nonverbal features used are described and the methods used to extract them are presented. Section 5 presents the multiple kernel learning approach. The experimental analysis, including the EL detection results and the correlation analysis of each feature with the questionnaires, is reported in Section 6. Finally, we conclude the paper with discussions and future work in Section 7.
2. RELATED WORK
In this section, emergent leadership studies in SSP are reviewed, and the nonverbal features they used and the learning methods they applied are discussed. For interested readers, a detailed survey on emergent leadership in social psychology and social computing can be found in [24].
In [26], only speech activity was used to identify the emergent leader. As nonverbal features, speaking length, speaking turn duration and successful interruptions were used. To predict the EL, a rule-based estimator (based on the hypothesis that the EL is the one who has the highest value of a nonverbal feature) and RLFA (which tests whether any combination of features performs better than a single feature) were used. The evaluation of the nonverbal features was carried out using the variables perceived leadership (PLead), perceived dominance (PDom), perceived competence (PComp) and dominance ranking (RDom). The results showed that there is a significant relationship between PDom and emergent leadership and that the EL is the person who talks the most and has more turns and interruptions. Moreover, the audio nonverbal features were successful in identifying the EL even when used individually (60-80% accuracy).
The fusion of speaking activity based features (speaking length, speaking turns and speaking interruptions) with features based on VFOA (attention received, given attention, attention quotient and attention center) was investigated in [25]. In that paper [25], looking while speaking, looking while listening, being looked at while speaking, being the center of attention while speaking and the visual dominance ratio (the ratio of looking while speaking to looking while listening without speaking [14]) were extracted to identify the ELs. Similar to [26], a rule-based estimator was used. The evaluation of the features was carried out in terms of PLead, PDom, and RDom. The results showed that speaking activity based features are not only better than visual attention based features but also better than the fusion of these modalities for detecting the ELs. The main reason for the poor performance of the nonverbal visual features may be the insufficient performance of the method used to extract the VFOA automatically (42% frame-level accuracy on a subsample of the data). In [24] (only the method presented in Chapter 6), the emergent leadership detection performance of the same visual attention features was explored more deeply in terms of the variables PLead, PDom, PComp, perceived liking (PLike) and RDom. According to that analysis, the amount of attention received from participants was the most informative feature, while the attention center was the second best feature for emergent leadership. Additionally, strong correlations of attention received with PLead, PDom and RDom were detected. On the other hand, negative correlations of attention received and attention center with PLike were found.
In [27], emergent leadership was addressed using audio and video based nonverbal features, both individually and together. As nonverbal audio features, speaking turn features (such as total speaking length and total speaking turns) and prosodic features (such as energy and pitch variation) were extracted. As visual nonverbal features, head activity based features (see Section 4.2), body activity based features (see Section 4.3), and motion template based features (mean, median, quantile and entropy of body activity extracted from the motion energy image and the motion history image, which summarize the spatio-temporal activity of a person) were used. Differently from the feature extraction presented in Section 4.2, in [27] head tracking was performed using a particle filter. Unlike our study and [25], VFOA based features were not used; however, the motion template based features, which can be seen as redundant given that features based on head and body activities were already being used, were utilized. As learning and estimation methods, rule-based estimation, RLFA, ranking the scores of an SVM (with a linear kernel), and collective classification [21] were compared. The results showed that head and body activity based features perform better than the other visual features for PLead, PDom and RDom. For audio features only, RLFA was the best performing method, followed by the collective classification methods. Moreover, the speaking turn and energy features were the best features in terms of PLead, PDom and RDom accuracies. For audio and visual features together, RLFA performed best, while the collective classification methods were as good as it for PLead, PDom and RDom. Speaking turn based features, body activity based features and energy together were the best performing feature combination for inferring the ELs.
Unlike the studies [26, 25, 27], which utilized audio features or audio together with visual features, [6] presented the detection of the most and the least ELs in a meeting scenario using visual nonverbal features only. In that study [6], the head pose was used to approximate the gaze. Then, the VFOA was estimated from the head pose and novel VFOA features (see Section 4.1) were proposed. The estimation of the VFOA was performed by methods based on SVM (72% frame-level accuracy on a subsample of the data). The best performing method when all VFOA based features were used was SVM-cost [9] for the most EL detection, while the best performing method for detecting the least EL was RLFA.
In this paper, the VFOA, head and body activity based features are combined to identify the most and the least ELs in a meeting scenario. The combination of different feature types shows very encouraging performance. As the main contribution, for the first time, MKL was applied, demonstrating improved results compared to the best performing methods for emergent leadership: RLFA and SVM.

3. DATA COLLECTION
The leadership dataset used in this paper is the same dataset presented in [6]. It comprises 393 minutes in total, with the longest session lasting 30 minutes and the shortest 12 minutes. There are 16 meeting sessions in total, each composed of four unacquainted participants of the same gender (in total 44 females and 20 males, with an average age of 21.6 and a standard deviation of 2.24). In total five cameras were used: four frontal cameras (with a resolution of 1280x1024 pixels and a frame rate of 20 frames per second) to capture each participant individually, and a standard camera to record the whole scene (with a resolution of 1440x1080 pixels and a frame rate of 25 frames per second, used only for data annotation). Four wireless lapel microphones were utilized for audio acquisition (audio sample rate = 16 kHz). The usage of audio is out of the scope of this paper and will be investigated as future work. The participants performed either the “winter survival” or the “desert survival” task [16], as these are the most common tasks in small group studies of decision making, dominance and leadership. For more details see [6].
3.1 Questionnaires
In total, two different questionnaires were used for evaluation: i) the SYstematic method for the Multiple Level Observation of Groups (SYMLOG) [4, 19] and ii) the General Leader Impression Scale (GLIS) [20]. The SYMLOG is a tool to evaluate individuals in terms of dominance versus submissiveness, acceptance versus non-acceptance of the task orientation of established authority, and friendliness versus unfriendliness. On the other hand, the GLIS is an instrument used to evaluate the leadership attitude that each participant displays during a group interaction.
Both tools can be used as a self-assessment instrument and also as an instrument for the external observation of a group interaction. Interested readers can refer to [6] for information about how and why SYMLOG and GLIS were applied as self-assessment instruments. In this paper, the results obtained from external judges were used to evaluate the extracted nonverbal visual features in terms of emergent leadership. In detail, two independent judges observed each meeting and rated each participant using SYMLOG (called SYMLOG-Observers in this paper) and GLIS (called GLIS-Observers in this paper). The results showed that the dominance intra-class correlation (ICC), task orientation ICC and friendliness ICC were 0.866, 0.569 and 0.722, respectively, with p<0.001 for SYMLOG-Observers. For the leadership impression, in our analysis we only used the dominance sub-scale of SYMLOG. For GLIS-Observers, the ICC was 0.771 with p<0.001 for the leadership attitude that each member displays during a group interaction. Additionally, it was observed that the leadership impression obtained by GLIS-Observers and the dominance from SYMLOG-Observers tend to correlate with each other. The final scores of each questionnaire for each participant were calculated as the average of the ratings of the two observers.
3.2 Data Annotation
The 16 meeting sessions were divided into small segments, each lasting 5 minutes on average. Through this segmentation, in total 75 meeting segments were obtained. For the analysis presented in Section 6.1, these 75 meeting segments were used rather than the original full meetings. The reasons for such a segmentation were i) to obtain more training and testing data, similar to the approach in [15], and ii) to have more accurate ground truth annotations, since people are more precise and stay more focused when annotating shorter videos, as mentioned in [1].
In total, 50 observers participated in the annotation of the meeting segments. Each observer annotated either 12 or 13 meeting segments in total, while no more than one segment belonging to the same meeting session was annotated by the same observer. Each meeting segment was annotated by 8 annotators on average. Here, it is important to highlight that the psychology literature has already shown that human observers are able to identify the ELs in a meeting scenario [27]. During the annotations, audio was not used, in order to cope with any possible problem that might occur due to the level of understanding of the spoken language (similar to [17]); this also allowed us to utilize international observers.
Observers ranked the emergent leadership behavior that each participant exhibited in a meeting segment. In this paper, we used the annotations regarding the most and the least EL. The other rankings were considered as the same class (called the rest in Section 6.1). The analysis of the leadership annotations showed that annotating the least EL was more challenging than annotating the most EL. In detail, for the most EL annotation, there was full agreement in 26 out of 75 video segments and 73% agreement in 49 out of 75 video segments. On the other hand, for the least EL annotation, there was 100% agreement in 13 out of 75 video segments and 70% agreement in 62 out of 75 video segments. For each meeting segment, Krippendorff's α coefficient (75 in total) was also calculated using the annotations: the most EL, the least EL and the rest. The average Krippendorff's α was found to be 0.51 (reliability exists) with a standard deviation of 0.27, while 7 segments have an α smaller than 0.10 (low reliability) and 6 segments have an α equal to 1.00 (perfect reliability).
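For illustration, the per-segment reliability computation can be sketched as follows, assuming the third-party krippendorff Python package and a hypothetical nominal coding of the three classes (1 = most EL, 2 = rest, 3 = least EL); the coding and tooling actually used by the authors are not stated in the paper.

# A minimal sketch of the per-segment reliability computation, assuming the
# third-party "krippendorff" package and a hypothetical coding of the
# annotations (1 = most EL, 2 = rest, 3 = least EL).
import numpy as np
import krippendorff

# One row per annotator, one column per participant; np.nan would mark missing ratings.
segment_ratings = np.array([
    [1, 2, 2, 3],
    [1, 2, 3, 2],
    [1, 3, 2, 2],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=segment_ratings,
                           level_of_measurement="nominal")
print("Krippendorff's alpha for this segment: %.2f" % alpha)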
4. NONVERBAL FEATURE EXTRACTION
In this section, the extracted nonverbal visual features and the methods used to obtain them are described. These nonverbal visual features include i) the visual focus of attention (VFOA) based features, which were extracted using head pose estimation; ii) the head activity based features, which were extracted using face detection and optical flow; and iii) the body activity based features, which were obtained using image differencing. The feature extraction process for each type of feature is summarized in Figures 1, 2 and 3.
4.1 Visual Focus of Attention Based Features
To extract the VFOA based features, the approach given in [6] was utilized. This method includes facial landmark detection, head pose estimation, modeling the VFOA in a supervised way, and estimating the VFOA over the whole video to extract nonverbal features.
By using the Constrained Local Model (CLM) [8], the facial landmarks in 2D coordinates were converted to 3D coordinates, which were used to estimate the head pose (pan, tilt and roll). Later, the head pose representation (pan and tilt only) was used to find the VFOA. The VFOA of a participant is composed of four possibilities: left if the participant is looking at the participant on his/her left, right if the participant is looking at the participant on his/her right, front if the participant is looking at the participant in front of him/her, and no-one if the participant is not looking at any other participant but somewhere else.

Figure 1: Extraction of VFOA based features
Figure 2: Extraction of head activity based features
Figure 3: Extraction of body activity based features
For modeling and estimating a participant's VFOA for the entire video, the cost function [9] (SVM-cost), random under-sampling [31] (SVM-RUS), and SMOTE [7] (SVM-SMOTE) methods were combined with SVM (see [6] for more information). As the SVM model, the radial basis kernel function (RBF) with varying kernel parameters was used. After a participant's VFOA was obtained, it was smoothed (with a moving average of span 5) to remove noise. Finally, the following nonverbal features (referred to as VFOAFea in Section 6) were extracted as presented in [6]:
totWatcher_i: The total time that participant i is being watched by the other participants in the meeting.
totME_i: The total time that participant i is mutually looking at any other participant in the meeting (also called mutual engagement (ME)).
totWatcherNoME_i: The total time that participant i is being watched by any other participant in the meeting while there is no ME.
totNoLook_i: The total time labeled as no-one in the VFOA vector, meaning that participant i is not looking at any other participant in the meeting.
lookSomeOne_i: The total time that participant i looked at other participants in the meeting.
totInitiatorME_i: The total time that participant i initiates MEs with any other participant in the meeting.
stdInitiatorME_i: The standard deviation of the total time that participant i initiates MEs with any other participant in the meeting.
totInterCurrME_i: For participant i, the total time intercurrent between the initiations of ME with any other participant in the meeting.
stdInterCurrME_i: For participant i, the standard deviation of the total time intercurrent between the initiations of ME with any other participant in the meeting.
totWatchNoME_i: The total time that participant i is looking at any other participant in the meeting while there is no ME.
maxTwoWatcherWME_i: The maximum time that participant i is looked at by any other two participants while participant i can have a ME with either of the two participants.
minTwoWatcherWME_i: The minimum time that participant i is looked at by any other two participants while participant i can have a ME with either of the two participants.
maxTwoWatcherNoME_i: The maximum time that participant i is looked at by any other two participants while participant i has no ME with either of the two participants.
minTwoWatcherNoME_i: The minimum time that participant i is looked at by any other two participants while participant i has no ME with either of the two participants.
ratioWatcherLookSOne_i: The ratio between totWatcher_i and lookSomeOne_i.
In total, 15 features were extracted. All features (except ratioWatcherLookSOne) were divided by the length of the corresponding meeting since the lengths of the meetings are variable.
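To make the feature definitions concrete, the following sketch computes a subset of the VFOA features from per-frame VFOA label sequences. The per-participant label representation and the who_is_watched helper (which maps a participant's left/right/front label to the actual target participant, given the seating arrangement) are assumptions made for illustration only; they are not part of the paper.

# Sketch: a subset of VFOAFea from per-frame label sequences (hypothetical representation).
def vfoa_features(vfoa, who_is_watched, meeting_len):
    """vfoa: dict participant -> per-frame labels in {"left","right","front","no-one"}.
    who_is_watched(p, label): participant that p looks at, or None for "no-one"
    (hypothetical helper based on the seating arrangement).
    meeting_len: meeting length used for normalization."""
    feats = {}
    for p, labels in vfoa.items():
        watched = mutual = 0
        no_look = sum(lbl == "no-one" for lbl in labels)
        look_someone = len(labels) - no_look
        for t in range(len(labels)):
            # participants currently looking at p
            watchers = [q for q in vfoa if q != p
                        and who_is_watched(q, vfoa[q][t]) == p]
            watched += bool(watchers)
            # mutual engagement: p looks back at one of its watchers
            mutual += who_is_watched(p, labels[t]) in watchers
        feats[p] = {
            "totWatcher": watched / meeting_len,
            "totME": mutual / meeting_len,
            "totWatcherNoME": (watched - mutual) / meeting_len,
            "totNoLook": no_look / meeting_len,
            "lookSomeOne": look_someone / meeting_len,
            "ratioWatcherLookSOne": watched / max(look_someone, 1),
        }
    return feats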
4.2 Head Activity Based Features

The extraction of the head activity based features was adapted from [27], and has been used by many other works, such as [3] and [22], to identify different types of social interactions. Differently from the method presented in [27] to detect and track the faces, in this study we used the most well-known face detection algorithm, Viola-Jones [23], which is based on Haar-like features and AdaBoost. A trained face detector was used to detect the face of each participant. This detector returns rectangular bounding boxes which tightly surround the detected faces. The method was evaluated using 25600 randomly selected frames (400 frames for each frontal video, a number determined by a confidence level of 90% and a margin of error of 4%) and achieved 90% accuracy with a standard deviation of 0.12.
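A minimal version of this face detection step could look as follows, using OpenCV's stock Haar cascade; the trained detector and evaluation protocol used by the authors are not reproduced here.

# Sketch of the face detection step with OpenCV's stock Viola-Jones cascade.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(gray_frame):
    """Return the largest detected face as an (x, y, w, h) box, or None."""
    boxes = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])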
After the face area was detected, the optical flow vectors (using the Lucas-Kanade optical flow algorithm [5]) between two consecutive frames within the face area were computed. The optical flow vectors were used to obtain the average head motion in the x and y coordinates. This results in real-valued vectors representing a participant's head activity in 2 dimensions for a given meeting. These vectors were binarized using a threshold to distinguish significant head activities from less significant ones. The thresholds (one for each dimension) were defined as the sum of the mean and the standard deviation of the head motion per dimension. This results in two binary vectors, one for each dimension, where head activity values greater than the threshold represent significant activity for a given dimension and values smaller than or equal to the threshold represent insignificant head activity (small movements, noise, etc.). The obtained binary vectors were fused with an OR operation to obtain a final binary head activity vector.
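A rough sketch of this step is given below: the per-frame head motion is estimated as the mean Lucas-Kanade displacement of corner points inside the face box, and the magnitudes of the two resulting motion signals are thresholded and OR-fused. The parameter values and the use of magnitudes are illustrative assumptions, not the paper's exact settings.

# Rough sketch of the head activity computation (parameters are illustrative).
import cv2
import numpy as np

def mean_head_motion(prev_gray, curr_gray, face_box):
    """Mean optical-flow displacement (dx, dy) of corner points inside the face box."""
    x, y, w, h = face_box
    prev_roi = prev_gray[y:y + h, x:x + w]
    curr_roi = curr_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(prev_roi, maxCorners=50,
                                  qualityLevel=0.01, minDistance=3)
    if pts is None:
        return 0.0, 0.0
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_roi, curr_roi, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return 0.0, 0.0
    disp = (nxt[good] - pts[good]).reshape(-1, 2)
    return float(disp[:, 0].mean()), float(disp[:, 1].mean())

def binarize_head_activity(dx, dy):
    """Threshold each dimension at mean + std of its magnitude and OR-fuse."""
    dx, dy = np.abs(np.asarray(dx)), np.abs(np.asarray(dy))
    bx = dx > (dx.mean() + dx.std())
    by = dy > (dy.mean() + dy.std())
    return bx | by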
Instead of using the optical flow vectors, using the absolute displacement of the centers of the face bounding boxes in consecutive frames can be an alternative way to infer the head motion. Our analysis with this approach showed that using the absolute displacement is not significantly worse than using the optical flow vectors for detecting the most and the least ELs. But since the approach with optical flow vectors performed better in general, it was used for the analysis in Section 6.
Using the obtained real-valued head activity vectors and the binary head activity vector of each participant, the following features (referred to as HeadActFea in Section 6) were extracted:
THL_i: The total time that the head of participant i is moving.
THT_i: The number of head activity turns for participant i, where each turn represents a continuous head activity.
AHT_i: The average head activity turn duration for participant i.
stdHx_i and stdHy_i: The standard deviation of the head activity of participant i in the x and y dimensions, respectively.
In total, 5 features were extracted using the head activity. All features except stdHx and stdHy were calculated using the binary vector, whereas for stdHx and stdHy the real-valued vectors were used.
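As an illustration, the features above can be computed from the binary activity vector and the real-valued motion vectors roughly as follows; the 20 fps rate of the frontal cameras (Section 3) is assumed here for converting frame counts to time.

# Sketch of computing HeadActFea from the binary and real-valued vectors.
import numpy as np

def activity_turns(binary_vec):
    """Lengths (in frames) of contiguous runs of 1s in a binary activity vector."""
    runs, count = [], 0
    for v in binary_vec:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def head_activity_features(binary_vec, dx, dy, fps=20.0):
    turns = activity_turns(binary_vec)
    thl = sum(turns) / fps                      # THL: total time the head is moving (s)
    tht = len(turns)                            # THT: number of head activity turns
    aht = thl / tht if tht else 0.0             # AHT: average turn duration (s)
    return {"THL": thl, "THT": tht, "AHT": aht,
            "stdHx": float(np.std(dx)),         # std of real-valued motion in x
            "stdHy": float(np.std(dy))}         # std of real-valued motion in y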
4.3 Body Activity Based Features
The extraction of the body activity based features was applied as given in [27]. Since in all of the meeting sessions the background is stationary, image differencing was useful for detecting the moving pixels, which are assumed to belong to the participant in the frontal video. Here, it is possible to use different foreground detection algorithms, but our inference (after testing different algorithms) is that image differencing is more practical, as it is less sensitive to noise besides being the simplest foreground detection method. All the moving pixels outside the detected face area (obtained as described in Section 4.2) were considered as part of the body.
Before computing the difference image between consecutive frames, each frame was converted to a gray-scale image. The difference image was converted into moving and not-moving pixels using a threshold (taken as 30). Hence, if the difference between the gray-scale values of two pixels belonging to consecutive frames was greater than the threshold, that pixel was labeled as a moving pixel; otherwise it was labeled as a not-moving pixel. After the moving pixels were found, the total number of moving pixels in each frame was normalized by the size of the frame. As a result of these steps, a real-valued vector was obtained. This vector was binarized using another threshold (taken as 5%) to distinguish significant body activities from less significant ones.
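A compact sketch of the per-frame computation follows; the face box exclusion and the two thresholds (30 gray levels, 5% of moving pixels) match the description above, while everything else is illustrative.

# Sketch of the per-frame body activity computation.
import cv2
import numpy as np

def body_activity_ratio(prev_frame, curr_frame, face_box, diff_thresh=30):
    """Fraction of moving pixels outside the detected face area."""
    prev_g = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_g = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    moving = cv2.absdiff(curr_g, prev_g) > diff_thresh
    x, y, w, h = face_box
    moving[y:y + h, x:x + w] = False      # exclude the face area from the body
    return moving.sum() / moving.size     # normalize by the frame size

# The per-frame ratios form the real-valued vector, binarized with the 5% threshold,
# e.g.: binary_body = np.asarray(ratios) > 0.05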
Using the obtained real-valued vector and the binary vector of each participant, the following features (referred to as BodyActFea in Section 6) were extracted:
TBL_i: The total time that the body of participant i is moving.
TBT_i: The number of body activity turns for participant i, where each turn represents a continuous body activity.
ABT_i: The average body activity turn duration for participant i.
stdB_i: The standard deviation of the body activity of participant i.
In total, 4 features were extracted using the body activity. All features except stdB were calculated using the binary vector, while for stdB the real-valued vector was used.
5. MULTIPLE KERNEL LEARNING
Multiple Kernel Learning (MKL) methods use a set of kernels and learn the optimal combination of them in either a linear or a non-linear way. MKL methods are in general preferred due to i) their ability to find the optimal kernel combination from a large set of kernels instead of trying each kernel to see which works best, and ii) their ability to utilize different kernels which can correspond to different feature subsets coming from multiple sources that probably have different notions of similarity [12]. There has been extensive work on MKL in the literature; for a comprehensive survey on different MKL methods and their comparisons, interested readers can refer to [12].
The simplest way to combine different kernels is to use an unweighted sum of kernel functions, which gives equal preference to all kernels. A better strategy is to learn a weighted sum. Arbitrary kernel weights (linear combination), non-negative kernel weights (conic combination) and weights on a simplex (convex combination) are possible kernel combinations. Linear combinations of weights can be restrictive, whereas a nonlinear combination can be better.
In this study, we utilized Localized Multiple Kernel Learning (LMKL) [11, 12], which uses nonlinear combinations of kernel weights. LMKL is based on assigning different kernel weights to different regions of the feature space. Briefly, this method contains two components whose optimizations are performed jointly with a two-step procedure.
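For illustration only, the sketch below combines RBF kernels computed on the three feature subsets (VFOAFea, HeadActFea, BodyActFea) with a plain fixed-weight sum and trains a precomputed-kernel SVM. This is not the LMKL formulation used in the paper, whose gating model learns region-dependent kernel weights; the weights and kernel parameters here are placeholders.

# Illustration only: fixed-weight sum of per-subset RBF kernels fed to a
# precomputed-kernel SVM (not the LMKL formulation used in the paper).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def combined_kernel(subsets_a, subsets_b, gammas, weights):
    """Weighted sum of RBF kernels, one per feature subset (VFOA, head, body)."""
    K = np.zeros((subsets_a[0].shape[0], subsets_b[0].shape[0]))
    for Xa, Xb, g, w in zip(subsets_a, subsets_b, gammas, weights):
        K += w * rbf_kernel(Xa, Xb, gamma=g)
    return K

# Hypothetical usage, with rows corresponding to participants in meeting segments:
# train = [X_vfoa_tr, X_head_tr, X_body_tr]; test = [X_vfoa_te, X_head_te, X_body_te]
# gammas, weights = [0.1, 0.5, 0.5], [1/3, 1/3, 1/3]
# clf = SVC(kernel="precomputed").fit(combined_kernel(train, train, gammas, weights), y_tr)
# preds = clf.predict(combined_kernel(test, train, gammas, weights))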
