
Overlapping speech, utterance duration and affective content in HHI and HCI — An comparison



Ingo Siegert, Ronald Böck, Andreas Wendemuth
Cognitive Systems Group
Otto von Guericke University Magdeburg, Germany
{firstname.lastname}@ovgu.de
Bogdan Vlasenko
Idiap Research Institute
Martigny, Switzerland
bogdan.vlasenko@idiap.ch
Kerstin Ohnemus
davero Dialog Gruppe
91058 Erlangen, Germany
kerstin.ohnemus@davero.de
Abstract: In human conversation, turn-taking is a critical
issue. Especially if only the speech channel is available (e.g.
telephone), correct timing as well as affective and verbal
signals are required. In cases of failure, overlapping speech
may occur, which is the focus of this paper. We investigate the
davero corpus, a large naturalistic spoken corpus of real
callcenter telephone conversations, and compare our findings to
results on the well-known SmartKom corpus consisting of
human-computer interaction. We first show that overlapping
speech occurs in different types of situational settings
(extending the well-known categories of cooperative and
competitive overlaps), all of which are frequent enough to be
analyzed. Furthermore, we present connections between the
occurrence of overlapping speech and the length of the previous
utterance, and show that overlapping speech occurs at dialog
instances where certain affective states are changing. Our
results allow the prediction of a forthcoming threat of
overlapping speech, and hence preventive measures, especially
in professional environments like call-centers with human or
automatic agents.
I. INTRODUCTION
Human communication consists of several information
layers, of which the factual layer is the most important. But
beyond the pure textual information, other relevant information
such as affective state, self-revelation, and appeal is
transmitted [1]. These different pieces of information are
provided to support human conversations and to increase the
likelihood of a fluent conversation.
One important requirement for a fluent and successful
conversation is efficient turn-taking, which has to be
organized by specific "underlying mechanisms", such as
intonation, semantic cues, facial expressions, eye contact,
breathing, and gestures [2], [3], [4]. Overlapping speech plays
a major role in the organization of turn-taking and in the
evaluation of the conversation. Based on the turn-taking model
by Sacks et al. [3], conversational partners aim to minimize
overlaps and gaps in their conversations. According to this
model, overlaps occur at places of possible turn ends, either as
"terminal overlaps" or as "simultaneous starts". Thus,
overlapping speech is explained as a result of turn-taking
principles. This explanation is extended to different
situational settings, e.g. by [5], [6], where short feedback
signals confirming the statement of the current speaker are seen
as "response token" overlaps. Furthermore, several studies also
analyze competitive overlaps, in which the conversational
partners compete for the turn [7], [8].
Many recent studies analyzed the phonetic structure of
overlapping speech and found that fundamental frequency,
intensity, speech rate and rhythm are important features
characterizing the overlaps as either cooperative or competitive
[9], [8], [10]. Most of these studies concentrate on local
analyses investigating the acoustic characteristics next to or
directly at the overlap. Only a few studies incorporate, for
example, information on the duration of turns [11]. But the
relation between the length of utterances and the situational
type of overlap is not analyzed.
Former studies on overlapping speech concentrate on
explaining how overlapping speech works. They do not analyze
which conditions lead to an overlap or which consequences the
overlap has for the progress of the interaction. In particular,
the analyses disregard the length of the utterances in which an
overlap occurs (a possible cause of an overlap) and the
influence the affective state could have. Notably, [12]
emphasizes that affective states influence turn-taking
behavior. Thus, problems in turn-taking can also be traced back
to changes in the affective state.
Furthermore, these studies are conducted only on
human-human interaction (HHI). Investigations on human-computer
interaction (HCI) have so far not considered overlapping speech
as an informative signal. But, to reach the target of a more
naturalistic interaction, future systems have to be adaptable to
the users' individual skills, preferences, and current emotional
state [13], [14]. A lot of progress has been made in the area
of affect detection from speech, facial expression and gesture
[15]. For a fully naturalistic HCI, it is necessary to capture as
many human abilities as possible. Thus, linguistic features also
gain considerable importance [16]. In an earlier study, we
could show that discourse particles, which are exchanged among
the interaction partners and used to signal the progress of the
dialogue, are also used in naturalistic HCI, although the system
was not able to react to them properly [17]. The usage of these
cues is influenced by the user's age and gender [18].
In this paper we extend our analysis of linguistic
cues to overlapping speech and analyze the meaningfulness
of overlapping speech regarding the utterance length and the
change of the user's affective state. We conducted a contrasting
study using HHI as well as HCI. This is a first step towards an
automatic evaluation of overlapping speech and could help future
"Cognitive Infocommunication" systems to understand the human
better [14]. Technical systems that use this extended
recognition of linguistic cues adapt to their users and thus
become their attendant and ultimately their companion [13], [19].

Based on these considerations, we investigate the following
three research questions in this paper.
Q1 Is overlapping speech occurring frequently enough in
our material to be analyzed in a dyadic conversation?
Q2 Is there any connection between overlapping speech
and the length of the previous utterance?
Q3 Is overlapping speech occurring at points where the
affective state is changing?
The remainder of the paper is structured as follows: In
Section II the utilized datasets are briefly described and
specific differences are emphasized. Afterwards, in Section III,
we describe the preparation of the data in terms of types of
overlap and affective annotation. In Section IV the results are
presented and discussed. Finally, in Section V, a conclusion of
our investigations and an outlook for further research are given.
II. UTILIZED DATASETS
A. Davero Corpus of Telephone-based Conversations
The dataset is described in detail in [20]. It was created
within a research project aiming to develop a technical system
that supports callcenter employees in responding appropriately
to the current affective state of the caller. It was recorded in
a callcenter and collects real and authentic phone calls in
German. The calls cover various topics, like information talks,
data change notifications, and complaints.
In order to allow a complete analysis of the conversation,
both agent and caller were recorded acoustically. To gain
realistic and high-quality recordings as well as to avoid
disturbing background noise, a separate recording place had been
set up. In total, 49 days and 7 hours have been recorded. Since
the recorded phone conversations are real customer interactions,
they had to be anonymized first, blanking out all personal
information. Furthermore, the start and end times of each
dialog and of the overlapping speech segments were marked, and
each utterance was assigned to its corresponding speaker (agent
or caller). To date, this dataset contains 1,600 dialogs with
27,000 individual utterances. The dialogs have an average length
of about 5 minutes with a standard deviation of 2 minutes.
B. SmartKom multi-modal Corpus
The SmartKom multi-modal corpus contains naturalistic
affects within HCI [21]. The system responses were generated
by a Wizard-of-Oz (WOZ) setup. For our evaluations we use
German dialogs concerning a technical scenario, recorded in
a public environment. The database contains multiple audio
channels and two video channels (face and body in profile
posture). The primary aim of this corpus was the empirical
study of HCI in a number of different tasks. It is structured
into several sessions. Each session contains one conversation
and is approximately 4.5 minutes long.
This corpus has several annotation levels. For our
investigation, we use the turn segmentation and an affective
annotation based on the acoustic channel [22]. The considered
set of the SmartKom corpus contains 438 emotionally labeled
dialogs with 12,076 utterances in total and 6,079 user
utterances. The utterances are labeled with seven broader
affective states: neutral, joy, anger, helplessness, pondering,
surprise and unidentifiable episodes. Unfortunately, the turn
segmentation is not time-aligned with the affective annotation.
III. PREPARATION OF DATASETS
A. Analysis of Overlapping Speech
We analyze overlapping speech as an additional pattern
of an interaction, as we assume that it provides a valuable
contribution to the assessment of interactions. Overlapping
speech refers to the case in which both speakers are talking
simultaneously.
1) Davero Corpus: By listening to examples of the Davero
corpus, four different situations (S) can be identified where
overlapping speech occurs:
S1 Short feedback, no interruption of the speaker
S2 Premature turn-taking at the end of the speaker’s turn
S3 Simultaneous starting after longer silence
S4 Barge-in, aiming to take the turn over
These situations are based on the descriptions of [11],
distinguishing response tokens (S1), terminal overlaps (S2),
simultaneous starts (S3) and competitive overlaps (S4). A
prototypical illustration is given in Figure 1.
In the first situation (S1), the listener just wants to give
feedback. Lacking other feedback methods (head nodding,
eye gaze), the listener has to give the feedback acoustically.
Thus, no real turn-taking occurs. The second situation (S2) can
be seen as a functional turn-taking. The listener knows that
the speaker's turn is due to end, but because of the missing
visual feedback the listener starts his turn a bit too early. In
this case just the alignment of the turn-taking is incomplete.
S3 is similar: both speakers start talking coincidentally after
a longer silence due to missing cues. S4 shows a disturbed
turn-taking. It describes the case where one speaker barges in
while the other is still speaking, to deliberately steal the
turn from that other speaker.
Figure 1. Prototypes of the four different situations (S1 to S4)
of overlapping speech between two speakers in HHI, shown over
time.
To evaluate the overlapping speech according to these
descriptions, we employed two labelers with psychological
background for the assessment. They could choose between
all four situations or describe a situation not covered by
the definitions. Of the currently available 27,000 utterances
in 1,600 dialogs, 5,100 utterances (18.9%, in 830 dialogs) are
marked as containing overlapping speech.
The final assessment is as follows: S1 has a share of
61.6%, S2 a share of 11.2%, S3 a share of 10.7%, and S4
a share of 16.6%. No additional situation was selected by the
annotators. As inter-rater reliability of the crosstalk
labelling we calculated a Krippendorff's alpha of 0.63, a
substantial reliability according to [23].
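As an illustration of the reliability measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, built from the coincidence matrix. The function name, the input layout (one list of labels per utterance) and the toy labels are assumptions made for this example; the paper does not describe its own implementation.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, one inner list per rated item (e.g. utterance),
    holding the labels assigned by the raters (hypothetical layout).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two ratings cannot be paired
        # every ordered pair of ratings within an item counts 1 / (m - 1)
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # undefined when only one category is used
    return 1.0 - (n - 1) * observed / expected


# Toy example: two labelers, four utterances, one disagreement.
toy = [["S1", "S1"], ["S1", "S1"], ["S4", "S4"], ["S4", "S1"]]
alpha = krippendorff_alpha_nominal(toy)
```

For the toy labels the observed disagreement is 2 of 8 pairable values, which gives alpha = 1 - 7 * 2 / 30, roughly 0.53.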

2) SmartKom Corpus: The SmartKom Corpus does not
have explicitly marked overlapping speech segments. By using
the segmentation annotation we could identify two types (T)
of overlapping speech (Figure 2):
T1 User interrupts the system.
T2 System interrupts the user.
Figure 2. Prototypes of the two different types (T1 and T2) of
overlapping speech between system and user in HCI.
In HCI the system is not seen as an equivalent dialog
partner [24], [25]. Therefore, the variety of types is not as
large as in HHI. From the 12,640 dialog acts within the SmartKom
corpus, we obtain 817 overlapping speech samples for T1 and
672 samples for T2.
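The identification of T1 and T2 from a turn segmentation can be sketched as follows. This is only an illustration under stated assumptions: segments are taken to be (start, end) pairs in seconds, and the party that starts later is treated as the interrupter. Neither the data layout nor the function name comes from the SmartKom annotation itself.

```python
def find_overlaps(user_segments, system_segments):
    """Return (type, start, end) for every user/system overlap.

    Assumption: the party whose segment starts later is the interrupter,
    so "T1" = user interrupts system, "T2" = system interrupts user.
    Equal start times fall under "T2" in this simplified sketch.
    """
    overlaps = []
    for u_start, u_end in user_segments:
        for s_start, s_end in system_segments:
            start = max(u_start, s_start)
            end = min(u_end, s_end)
            if start < end:  # the two segments share some time span
                kind = "T1" if u_start > s_start else "T2"
                overlaps.append((kind, start, end))
    return sorted(overlaps, key=lambda item: item[1])


# Toy segmentation (times in seconds, invented for illustration).
user = [(0.0, 2.0), (5.0, 6.0)]
system = [(1.5, 4.0), (4.5, 5.5)]
result = find_overlaps(user, system)
```

In the toy data the system starts at 1.5 s while the user is still speaking (T2), and the user starts at 5.0 s during a system prompt (T1).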
B. Evaluation of the Affective States
We analyzed the affective states in both corpora based on
the Geneva Emotion Wheel by K. Scherer [26]. This is an
empirically tested instrument for the assessment of affective
states including 16 “emotional families”, which are arranged
on a circle along the axes dominance and valence.
1) Davero Corpus: To conduct the affective assessment,
we first employed a few annotators to manually segment the
recordings into single dialogs including the speaker turns. We
asked four annotators, all of them with psychological
background, to assess the affective content of the single
utterances. We conducted several training rounds to make them
familiar with the used annotation scheme and the affective
assessment of acoustic data [27]. To support the annotation
process, the program ikannotate was used [28], [29]. This tool
supports the annotators by employing a three-step annotation
process:
1. The annotator decides if the dominance is high or low.
2. The annotator decides for positive or negative valence.
3. The resulting quadrant of the wheel is displayed with
the containing emotion families. The annotator selects
one family among them to indicate the perceived
emotion.
For the present investigation, we only consider the labels from
step 1 and step 2, as we are only interested in a general
affective change. For the inter-rater reliability of the
affective annotation we calculated a Krippendorff's alpha of
0.20 for dominance and of 0.35 for valence. Although these
numbers seem to be quite low compared to other reliability
values known from content analysis, they are in line with
results from other research groups on affective analyses [27].
Considering the annotation results, we observe a nearly balanced
distribution among the utterances. High dominance has a share of
59.0% and low dominance of 41.0%; positive valence has a share
of 53.5% and negative valence of 46.5%.
2) SmartKom Corpus: The SmartKom Corpus already has
an affective annotation [30]. Unfortunately, this annotation is
not on the same time-scale as the dialog act segments. Thus,
we have to perform an alignment of both annotation levels
by using the individual timing information of both annotation
levels. Unfortunately, the corpus authors only measured the
annotation correctness by comparing the results of different
annotation rounds rather than calculating an inter-rater
agreement measure like Krippendorff's alpha or Fleiss' kappa.
Their calculated correctness is 45.52% [31].
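One simple way to align two annotation levels by their timing information is to give each dialog act the affective label of the segment it overlaps longest. The paper does not state which alignment rule was used, so the maximum-overlap rule, the tuple layout and the function name below are assumptions chosen for illustration only.

```python
def align_labels(dialog_acts, affect_segments):
    """Assign one affective label (or None) to each dialog act.

    dialog_acts: list of (start, end) in seconds.
    affect_segments: list of (start, end, label) in seconds.
    Assumed rule: pick the label of the segment with the longest
    temporal overlap; dialog acts without any overlap get None.
    """
    aligned = []
    for a_start, a_end in dialog_acts:
        best_label, best_overlap = None, 0.0
        for s_start, s_end, label in affect_segments:
            overlap = min(a_end, s_end) - max(a_start, s_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        aligned.append(best_label)
    return aligned


# Toy timings, invented for illustration.
acts = [(0.0, 2.0), (2.0, 5.0), (10.0, 11.0)]
affect = [(0.0, 3.0, "neutral"), (3.0, 6.0, "anger")]
labels = align_labels(acts, affect)
```

The second dialog act overlaps "neutral" for 1 s and "anger" for 2 s and is therefore labeled "anger"; the third has no affective segment and stays unlabeled.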
Furthermore, in contrast to the Davero corpus, the affective
annotation in SmartKom is based on emotional categories.
Thus, we first have to deploy a mapping of the categories
used in SmartKom to our utilized dimensional categories of
dominance and valence. To conduct this mapping, we rely
on the Geneva Emotion Wheel [26], as it is also used for
the annotation of the Davero corpus. In analogy to similar
mappings [32], the assignment of the emotional categories to
the valence-dominance space is given in Table I. In contrast to
the Davero corpus, neutral is used as a category in SmartKom.
Table I. Mapping of SmartKom's emotional categories to dominance
and valence.

category         dominance   valence
neutral               0          0
joy                  +1         +1
anger                +1         -1
helplessness         -1         -1
pondering            -1         +1
surprise             -1         +1
unidentifiable        0          0
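The mapping of Table I can be written down as a small lookup table. The sketch below also computes the share of high versus low (or positive versus negative) labels among the segments that are not mapped to 0. Excluding the zero-valued segments from the shares is an assumption of this example, as are the function and variable names.

```python
# (dominance, valence) per SmartKom category, copied from Table I.
SMARTKOM_TO_DIMENSIONS = {
    "neutral": (0, 0),
    "joy": (+1, +1),
    "anger": (+1, -1),
    "helplessness": (-1, -1),
    "pondering": (-1, +1),
    "surprise": (-1, +1),
    "unidentifiable": (0, 0),
}

DOMINANCE, VALENCE = 0, 1  # index into the tuples above


def dimension_shares(categories, axis):
    """Return (share of +1, share of -1) among labels that are not 0.

    Assumption: segments mapped to 0 are left out of the shares.
    Returns None if no segment carries a non-zero value.
    """
    values = [SMARTKOM_TO_DIMENSIONS[c][axis] for c in categories]
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return None
    high = sum(1 for v in nonzero if v > 0) / len(nonzero)
    return high, 1.0 - high


# Toy label sequence, invented for illustration.
toy_labels = ["joy", "anger", "helplessness", "neutral"]
dom_high, dom_low = dimension_shares(toy_labels, DOMINANCE)
val_pos, val_neg = dimension_shares(toy_labels, VALENCE)
```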
In total we have 14,298 affective segments. Most of these
segments (59.7%) are neutral. The distribution on the dominance
dimension is 30.7% low and 69.3% high dominance. Positive
valence has a share of 39.0% and negative valence a share of
61%. Thus, the emotional content within this corpus is shifted
towards negative valence and high dominance.
IV. RESULTS
A. Q1: Occurrence of Overlapping Speech
To answer the first question, we calculated the ratio of
overlapping speech segments to the number of utterances. For the
Davero corpus we have 27,000 utterances, 5,100 of which
contain overlapping speech. Thus, we have a share of 18.9%
for overlapping speech segments. On the dialog level, we have
1,600 dialogs in total, of which 830 dialogs contain overlapping
speech segments. This results in a share of 51.9%.
The German part of the SmartKom Corpus has a total number
of 12,076 utterances with 1,489 occurrences of overlapping
speech. This results in a share of 12.3% overlapping speech
utterances for this HCI. Restricting the count to user
utterances, we have 6,347 user utterances, of which 817
contain overlapping speech.
Thus, we can conclude that overlapping speech occurs
frequently, and the first question, "Is overlapping speech
occurring frequently enough to be analyzed in a dyadic
conversation?", can be answered affirmatively.
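The shares reported for Q1 follow directly from the counts given in the text. The short check below recomputes them; the helper name is chosen for this example, while all counts are taken from the paper.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


davero_utterance_share = share(5100, 27000)   # utterances with overlap
davero_dialog_share = share(830, 1600)        # dialogs with overlap
smartkom_share = share(1489, 12076)           # SmartKom utterances with overlap

# The 1,489 SmartKom overlaps are the sum of the T1 and T2 samples.
smartkom_overlaps = 817 + 672
```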
B. Q2: Overlapping Speech and Utterance Lengths
This investigation is triggered by the assumption that
overlapping speech occurs because the current speaker is
talking too long and the listener wants to get the turn. To
investigate this assumption, we calculated the mean length of
the utterances where the overlapping speech occurs
($\mathrm{utt}_{\mathrm{overlap}}$) in relation to all other
utterances ($\mathrm{utt}_{\mathrm{remain}}$) of this speaker
within a dialog. Afterwards, we averaged over all dialogs and
calculated the difference between both averaged mean lengths:

$\Delta\mathrm{len} = \mathrm{len}_{\mathrm{utt}_{\mathrm{overlap}}} - \mathrm{len}_{\mathrm{utt}_{\mathrm{remain}}} \quad (1)$

This calculation is performed separately for each of the
previously identified situations and averaged afterwards
($\Delta\mathrm{len}$). Additionally, we used the non-parametric
Mann-Whitney U test to test the significance of the difference
between the utterance lengths. Stars denote the significance
level: ** p < 0.001.
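The computation of Eq. (1) and the significance test can be sketched as follows. The data layout (one pair of length lists per dialog) and both function names are assumptions of this example. The test is implemented with the normal approximation and tie correction, without continuity correction; the paper does not state which variant of the Mann-Whitney U test was used.

```python
import math
from statistics import mean


def delta_len(dialogs):
    """Eq. (1): averaged mean length of overlap utterances minus
    averaged mean length of the remaining utterances.

    dialogs: list of (overlap_lengths, remaining_lengths) per dialog,
    lengths in seconds (hypothetical layout).
    """
    usable = [(ov, rest) for ov, rest in dialogs if ov and rest]
    overlap_mean = mean(mean(ov) for ov, _rest in usable)
    remain_mean = mean(mean(rest) for _ov, rest in usable)
    return overlap_mean - remain_mean


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * (n1 + n2)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_x = sum(r for r, (_v, g) in zip(ranks, combined) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    sigma = math.sqrt(n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u, 1.0  # all values identical, no evidence of a difference
    z = (u - n1 * n2 / 2.0) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Toy data, invented for illustration.
toy_dialogs = [([4.0], [2.0, 2.0]), ([6.0], [3.0])]
difference = delta_len(toy_dialogs)
u_stat, p_value = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A positive difference means that utterances containing overlapping speech are longer on average than the speaker's other utterances.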
1) Davero Corpus: Figure 3 shows that in two situations the
length of the utterance with overlapping speech differs from
that of the other utterances of the same speaker. For S1,
$\mathrm{len}_{\mathrm{utt}_{\mathrm{overlap}}}$ is significantly
longer than for other utterances. In this situation one speaker
gives statements that are just confirmed by the listener without
interrupting the speaker. Thus, the speaker can continue his
turn. The presence of this type of overlapping speech does not
indicate a change in the progress of the dialog. The same
statement can be made for S2, where
$\mathrm{len}_{\mathrm{utt}_{\mathrm{overlap}}}$ is significantly
shorter than for other utterances. The overlapping speech in
both situations occurs only because just the acoustic channel
can be used to negotiate the turn-taking. Thus, the length of an
utterance together with the information of an occurring speech
overlap cannot be used as an indicator of a dysfunctional
conversation.
Figure 3. Difference of the utterance length for the four defined
overlapping speech situations and the average utterance length
in the Davero corpus (x-axis: overlapping speech situation S1 to
S4; y-axis: ∆len [s]). Stars indicate a significant difference.
2) SmartKom Corpus: Figure 4 shows that for type T1 the
system utterance with the overlapping speech segment
($\mathrm{len}_{\mathrm{utt}_{\mathrm{overlap}}}$) is
significantly longer than the other system utterances. Thus, it
can be assumed that for a positive interaction outcome and a
fluent conversation the system prompts should not be too long.
For the second type, where the system interrupts the user,
the length of the user utterances containing overlapping speech
is not significantly different from that of other user
utterances. Therefore, we assume that these interruptions by the
system are caused by operator errors of the WOZ system.
Regarding our second research question, we can state that
there is a significant correlation between overlapping speech
and the length of the previous utterance in both HHI and HCI.
Figure 4. Difference of the utterance length for the two types of
overlapping speech and the average utterance length in SmartKom
(x-axis: type of overlapping speech T1 and T2; y-axis: ∆len [s]).
The stars indicate a significant difference.
C. Q3: Overlapping Speech and Affective Changes
To investigate the affective change at the point where
overlapping speech occurs, we take into account the observed
affective states in the two preceding utterances and compare
them to the observed affective states in the two succeeding
utterances. We distinguish between high (+1) and low (-1)
dominance and positive (+1) and negative (-1) valence.
Utterances that do not have an affective label or are labeled as
neutral are assigned a 0. Thus, we can calculate the difference
between the affective states of the preceding utterances and the
succeeding utterances for an overlapping speech segment:

$\Delta\mathrm{Affect} = \mathrm{Affect}_{\mathrm{before\ overlap}} - \mathrm{Affect}_{\mathrm{after\ overlap}} \quad (2)$

Afterwards, we average over all segments ($\Delta\mathrm{Affect}$).
The significance of the affective change is tested using the
Mann-Whitney U test. Stars denote the significance level: *
p < 0.01 and ** p < 0.001.
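Eq. (2) can be sketched for a single overlapping speech segment as follows. The paper does not specify how the two preceding (or succeeding) utterances are combined into one value, so averaging them is an assumption of this example, as are the function name and the list-based layout.

```python
from statistics import mean


def delta_affect(labels, overlap_index, window=2):
    """Eq. (2) for one overlap: affect before minus affect after.

    labels: per-utterance values of one dimension (+1, -1, or 0 for
    neutral or unlabeled utterances), in dialog order.
    overlap_index: position of the utterance containing the overlap.
    Assumption: the `window` utterances on each side are averaged.
    Returns None if one side has no utterances.
    """
    before = labels[max(0, overlap_index - window):overlap_index]
    after = labels[overlap_index + 1:overlap_index + 1 + window]
    if not before or not after:
        return None
    return mean(before) - mean(after)


# Toy dominance labels around an overlap at position 2.
dominance = [+1, +1, 0, -1, -1]
change = delta_affect(dominance, 2)
```

With this sign convention a positive value means that the affective dimension drops after the overlap, and a negative value means that it rises.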
Figure 5. Dominance change at the point where overlapping speech
occurs (x-axis: overlapping speech situation S1 to S4; y-axis:
∆Dom). Stars indicate a significant affective state change.
1) Davero Corpus: Analyzing the change of affective states
in connection with overlapping speech, a significant change in
the affective state can be observed only in S3 (simultaneous
starting after longer silence), see Figure 5. A possible
interpretation of this observation is that the dominance level
is dropping. A deeper analysis of the data shows that the
dominance level of the interrupter is rising, while the
dominance of the speaker whose turn is interrupted is slightly
decreasing. In this case, the overlapping speech event could be
a good marker for identifying changes in dominance. For all
other situations of overlapping speech, the dominance of the
two speakers is not influenced by overlapping speech.
Figure 6. Valence change at the point where overlapping speech
occurs in the Davero corpus (x-axis: overlapping speech situation
S1 to S4; y-axis: ∆Val).
For the change of the speaker's valence, we can state
that there is no significant connection with the occurrence of
overlapping speech, cf. Figure 6. This could be expected, as
overlapping speech is related to the turn-taking behavior of
the speakers and the dominance of a speaker is seen as the
underlying mechanism to regulate the turn-taking [12].
Figure 7. Dominance change at the point where overlapping speech
occurs in the SmartKom Corpus (x-axis: type of overlapping speech
T1 and T2; y-axis: ∆Dom). The stars indicate a significant
affective state change.
2) SmartKom Corpus: Figure 7 shows that the dominance for
both types is significantly higher after the overlapping speech
segment than before. This is quite obvious for the case where
the user actively interrupts the system, but when the system
interrupts the user this seems quite unintuitive and can only be
explained in connection with the valence change. Regarding the
valence change (cf. Figure 8), we can state that after the
overlapping speech the user's valence is significantly shifted
towards negative values. This finding, in connection with a
higher dominance, suggests that the user is angrier after
overlapping speech, either because he wants to speak and
interrupts the system, or because he is annoyed that the system
interrupts him.
Figure 8. Valence change at the point where overlapping speech
occurs in the SmartKom Corpus (x-axis: type of overlapping speech
T1 and T2; y-axis: ∆Val). The stars indicate a significant
affective state change.
Regarding our third research question, "Is overlapping
speech occurring at points where the affective state is
changing?", we can conclude that in HHI only changes in the
dominance are related to overlapping speech, whereas in HCI
significant changes in both affective dimensions, dominance
and valence, can be observed.
V. CONCLUSION
In this paper, we present a first study investigating
overlapping speech effects in both HHI and HCI. The analyses are
conducted on a dataset of realistic HHI containing telephone
based conversations and the well-known SmartKom Corpus
of naturalistic HCI. We could show that in both datasets
overlapping speech occurs frequently enough to be analyzed,
with a share of 18.9% for HHI and 12.3% for HCI. For the
investigated HHI, this share is in line with the results of other
research groups [11], [33]. The amount of overlap in HCI is
a bit lower but still sufficient. To the best of our knowledge,
no comparable numbers from other researchers are reported in the
literature for HCI.
Based on the description of situational settings, we first
analyzed the correlation between the length of overlap-preceding
utterances and the occurrence of overlap. As a result of this
first analysis, we could expose significant relations of
overlapping speech to the length of the spoken utterances and to
changes in the affective state of the conversational partners.
In HHI we found a significant relation to the utterance length
for overlapping speech occurring as feedback and as premature
turn-taking. In HCI, a significant correlation is also found
between overlapping speech and the length of system utterances.
The users' utterance lengths did not show significant
correlations with the occurrence of overlaps. We assume that
these overlaps are just caused by operator malfunctions, as the
SmartKom data were recorded in a WOZ scenario and no pre-defined
design rules are given for the wizards on how to use overlap
[21].
Secondly, we analyzed the correlation of affective changes
in the vicinity of the overlap in both types of interaction.
For this investigation, we showed that overlapping speech goes
along with changes in the affective states of dominance and
valence in certain situations. In HHI, only the situation where
both speakers start simultaneously after a longer pause shows a
significant change of dominance. For the valence dimension no
significant correlation could be found. In the investigated HCI,
both affective dimensions, dominance and valence, show a
significant correlation to overlapping speech in both situation
types.
From these results, we are able to derive some rules for
the organization of interactions: In telephone based HHI the
utterances should not be too long and the listener should be
encouraged to give feedback. This avoids competitive barge-in
overlaps. For HCI, the system should not talk too long, as for
all overlapping speech segments an affective change to higher
dominance and negative valence of the speaker can be observed,
and this kind of affective change should be avoided.
To evaluate these statements for their generality, a broader
investigation including additional corpora has to be conducted.
A possible application of our investigations in HHI and
HCI is the identification of parts where the affective state
changes, based on the knowledge of overlapping speech and the
dialog course. For example, situation S3, where both speakers start
simultaneously, can be easily identified by duration analysis,
so this knowledge can be used to find affective material for further
emotional analyses.
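
As a rough illustration of such a duration analysis, the following sketch flags candidate S3 situations (both speakers starting almost simultaneously after a pause) from time-stamped speaker segments. The `Segment` structure, the function name and both thresholds (`min_pause`, `max_onset_gap`) are assumptions made for this example; the paper does not specify them.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def find_simultaneous_starts(
    segments: List[Segment],
    min_pause: float = 1.0,      # assumed minimum silence before the onset
    max_onset_gap: float = 0.2,  # assumed tolerance for "simultaneous"
) -> List[Tuple[Segment, Segment]]:
    """Return pairs of overlapping segments whose onsets nearly coincide
    and that follow a silence of at least `min_pause` seconds."""
    ordered = sorted(segments, key=lambda s: s.start)
    hits = []
    # The beginning of the recording is treated as preceded by silence.
    last_end = float("-inf")
    for idx, first in enumerate(ordered):
        if idx + 1 < len(ordered):
            second = ordered[idx + 1]
            pause = first.start - last_end
            if (
                second.speaker != first.speaker
                and second.start - first.start <= max_onset_gap
                and second.start < first.end  # the two segments overlap
                and pause >= min_pause
            ):
                hits.append((first, second))
        last_end = max(last_end, first.end)
    return hits


# Hypothetical dialog excerpt (times in seconds).
dialog = [
    Segment("agent", 0.0, 4.0),
    Segment("caller", 3.5, 5.0),  # overlap without a preceding pause
    Segment("agent", 7.0, 9.0),   # starts after 2.0 s of silence
    Segment("caller", 7.1, 8.0),  # starts 0.1 s later
]
s3_candidates = find_simultaneous_starts(dialog)
for a, b in s3_candidates:
    print(f"{a.speaker} and {b.speaker} start together at about {a.start:.1f} s")
```

Segments found this way could then be passed on to an affective annotation or recognition step; suitable threshold values would have to be tuned on the corpus at hand.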
In our future research, we will develop a robust
automatic identification of the different types of overlap.
Together with the recognition of the user’s affective state,
this brings us a step closer to future Cognitive Infocommunication
systems acting as companions to human users [13], [14].
VI. ACKNOWLEDGEMENTS
The work presented in this paper was done within the
Transregional Collaborative Research Centre SFB/TRR 62
‘Companion-Technology for Cognitive Technical Systems’
funded by the German Research Foundation (DFG).
REFERENCES
[1] F. Schulz von Thun, Miteinander reden 1 - Störungen und Klärungen.
Reinbek, Germany: Rowohlt, 1981.
[2] R. Ishii, K. Otsuka, S. Kumano, and J. Yamato, Analysis of respiration
for prediction of ”who will be next speaker and when?” in multi-party
meetings, in Proc. of the 16th International Conference on Multimodal
Interaction, ser. ICMI ’14, Istanbul, Turkey, 2014, pp. 18–25.
[3] H. Sacks, E. A. Schegloff, and G. Jefferson, A simplest systematics
for the organization of turn taking for conversation, Language, vol. 50,
pp. 696–735, 1974.

Citations
Proceedings ArticleDOI
14 Apr 2018
TL;DR: This paper proposes detection of overlap segments using a neural network architecture consisting of long-short term memory (LSTM) models that learns the presence of overlap in speech by identifying the spectrotemporal structure of overlapping speech segments.
Abstract: The detection of overlapping speech segments is of key importance in speech applications involving analysis of multi-party conversations. The detection problem is challenging because overlapping speech segments are typically captured as short speech utterances far-field microphone recordings. In this paper, we propose detection of overlap segments using a neural network architecture consisting of long-short term memory (LSTM) models. The neural network architecture learns the presence of overlap in speech by identifying the spectrotemporal structure of overlapping speech segments. In order to evaluate the model performance, we perform experiments on simulated overlapped speech generated from the TIMIT database, and natural multi-talker conversational speech in the augmented Multiparty Interaction (AMI) meeting corpus. The proposed approach yields improvements over a Gaussian mixture model based overlap detection system. Furthermore, as an application of overlap detection, integration of overlap detection into speaker diarization task is shown to give improvement in diarization error rate.

17 citations


Cites background from "Overlapping speech, utterance durat..."

  • ...As described in [1], overlapped speech can be demarcated into four types, namely, (a) short feedback, no interruption of the speaker, (b) premature turn-taking at the end of the speakers turn, (c) simultaneous starting after longer silence, and (d) barge-in, aiming to take the turn over....

    [...]

  • ...Interestingly, owing to the four kinds of overlaps in natural conversations [1], the overlapping segments can be associated with overlaps of voiced and unvoiced segments, and also speech and non-speech (such laughter) segments....

    [...]

Proceedings Article
01 May 2020
TL;DR: The provided dataset – Voice Assistant Conversations in the wild (VACW) – includes the transcripts of both visitors requests and Alexa answers, identified topics and sessions as well as acoustic characteristics automatically extractable from the visitors’ audio files.
Abstract: Datasets featuring modern voice assistants such as Alexa, Siri, Cortana and others allow an easy study of human-machine interactions. But data collections offering an unconstrained, unscripted public interaction are quite rare. Many studies so far have focused on private usage, short pre-defined task or specific domains. This contribution presents a dataset providing a large amount of unconstrained public interactions with a voice assistant. Up to now around 40 hours of device directed utterances were collected during a science exhibition touring through Germany. The data recording was part of an exhibit that engages visitors to interact with a commercial voice assistant system (Amazon’s ALEXA), but did not restrict them to a specific topic. A specifically developed quiz was starting point of the conversation, as the voice assistant was presented to the visitors as a possible joker for the quiz. But the visitors were not forced to solve the quiz with the help of the voice assistant and thus many visitors had an open conversation. The provided dataset – Voice Assistant Conversations in the wild (VACW) – includes the transcripts of both visitors requests and Alexa answers, identified topics and sessions as well as acoustic characteristics automatically extractable from the visitors’ audio files.

9 citations


Cites background from "Overlapping speech, utterance durat..."

  • ...…expressions that have semantic similarity but different meanings, are still based on the evaluation of pre-defined keywords/intents, and are still unable to interpret prosodic information as it is needed for an emotional/dispositional understanding (Schuller et al., 2011; Siegert et al., 2015)....

    [...]


Proceedings ArticleDOI
01 Oct 2017
TL;DR: The findings show that overlapping speech is a key feature for predicting aggression levels, that discriminating only severe cases of overlap is a sufficient feature and that automatically predicted overlap is improving aggression recognition as well.
Abstract: Automatic recognition of negative affect and aggression is key in many safety critical domains such as surveillance and health care. In this paper we explore the potential of overlapping speech for predicting aggression levels. As a first step we consider 3 categories of overlapping speech based on literature. Having an annotation of these overlap categories, we examine whether overlapping speech is a good feature for predicting aggression by using it in classification together with a set of acoustic features typically used for this purpose. Next, we explore if this fine categorization of overlap is necessary in predicting aggression levels or a more coarse representation is sufficient. Finally, we check the additive values of automatically predicted overlapping speech for aggression recognition. The experiments are performed on a dataset of dyadic interactions between professional aggression training actors (actors) and naive participants (students) interacting freely based on short role descriptions. Our findings show that overlapping speech is a key feature for predicting aggression levels, that discriminating only severe cases of overlap is a sufficient feature and that automatically predicted overlap is improving aggression recognition as well.

7 citations


Cites background or methods from "Overlapping speech, utterance durat..."

  • ...A fourth type of overlapping speech was suggested in [28], namely when after a pause the two speakers start speaking simultaneously, which occurred in their call-center corpus....

    [...]

  • ...Inspired from related work in turn taking organization and the summary of overlap types proposed in [28] together with observing the patterns of overlapping speech in the used dataset, three types of overlapping speech were considered and annotated....

    [...]

  • ...Inspired from previous work that examined turn taking behavior [23], [24] and the overlap categorization in [28], we consider three categories of overlapping speech: short feedback, premature turn-taking and competitive overlaps....

    [...]

  • ...Furthermore, it was found that overlapping speech occurs at moments where affective states are changing [28]....

    [...]

Book ChapterDOI
01 Jan 2019
TL;DR: This work argues that, in this context, big data alone is not purposeful, since important effects are obscured, and since high-quality annotation is too costly, and encourages the collection and use of enriched data.
Abstract: Contemporary technical devices obey the paradigm of naturalistic multimodal interaction and user-centric individualisation. Users expect devices to interact intelligently, to anticipate their needs, and to adapt to their behaviour. To do so, companion-like solutions have to take into account the affective and dispositional state of the user, and therefore to be trained and modified using interaction data and corpora. We argue that, in this context, big data alone is not purposeful, since important effects are obscured, and since high-quality annotation is too costly. We encourage the collection and use of enriched data. We report on recent trends in this field, presenting methodologies for collecting data with rich disposition variety and predictable classifications based on a careful design and standardised psychological assessments. Besides socio-demographic information and personality traits, we also use speech events to improve user state models. Furthermore, we present possibilities to increase the amount of enriched data in cross-corpus or intra-corpus way based on recent learning approaches. Finally, we highlight particular recent neural recognition approaches feasible for smaller datasets, and covering temporal aspects.

6 citations

Journal ArticleDOI
TL;DR: It is suggested that beliefs about agency affect how efficiently and how accurately older adults learn with technology, which has implications for computer mediated support in aging.

5 citations

References
Book ChapterDOI
09 Oct 2011
TL;DR: Experimental results show that the system's affective profile determines the rating of chatting enjoyment and user-system emotional connection to a large extent and self-reported emotional changes experienced by participants during an interaction with the system are strongly correlated with the type of applied profile.
Abstract: We describe the use of affective profiles in a dialog system and its effect on participants' perception of conversational partners and experienced emotional changes in an experimental setting, as well as the mechanisms for realising three different affective profiles and for steering task-oriented follow-up dialogs. Experimental results show that the system's affective profile determines the rating of chatting enjoyment and user-system emotional connection to a large extent. Self-reported emotional changes experienced by participants during an interaction with the system are also strongly correlated with the type of applied profile. Perception of core capabilities of the system, realism and coherence of dialog, are only influenced to a limited extent.

33 citations


"Overlapping speech, utterance durat..." refers background in this paper

  • ...In HCI the system is not seen as an equivalent dialog partner [24], [25]....

    [...]

Journal ArticleDOI
TL;DR: The presented investigations coherently support age-dependence of both expressiveness and problem-solving ability in communication with dialog systems, and induces design rules for future automatic designated “companion” systems.
Abstract: This paper addresses issues of automatically detecting significant dialog events (SDEs) in naturalistic HCI, and of deducing trait-specific conclusions relevant for the design of spoken dialog systems. We perform our investigations on the multimodal LAST MINUTE corpus with records from naturalistic interactions. First, we used textual transcripts to analyse interaction styles and discourse structures. We found indications that younger subjects prefer a more technical style in communication with dialog systems. Next, we model the subject’s internal success state with a hidden Markov model trained using the observed sequences of system feedback. This reveals that younger subjects interact significantly more successful with technical systems. Aiming on automatic detection of specific subjects’s reactions, we then semi-automatically annotate SDEs—phrases indicating an irregular, i.e. not-task-oriented subject behavior. We use both acoustic and linguistic features to build several trait-specific classifiers for dialog phases, which showed pronouncedly different accuracies for diverse age and gender groups. The presented investigations coherently support age-dependence of both expressiveness and problem-solving ability. This in turn induces design rules for future automatic designated “companion” systems.

22 citations


"Overlapping speech, utterance durat..." refers background in this paper

  • ...In HCI the system is not seen as an equivalent dialog partner [24], [25]....

    [...]

Book ChapterDOI
09 Oct 2011
TL;DR: The tool ikannotate is introduced, which allows the generation of a transcription of material directly annotated with prosodic features that can be emotionally labelled according to Basic Emotions, the Geneva Emotion Wheel, and Self Assessment Manikins.
Abstract: In speech recognition and emotion recognition from speech, qualitatively high transcription and annotation of given material is important. To analyse prosodic features, linguistics provides several transcription systems. Furthermore, in emotion labelling different methods are proposed and discussed. In this paper, we introduce the tool ikannotate, which combines prosodic information with emotion labelling. It allows the generation of a transcription of material directly annotated with prosodic features. Moreover, material can be emotionally labelled according to Basic Emotions, the Geneva Emotion Wheel, and Self Assessment Manikins. Finally, we present results of two usability tests observing the ability to identify emotions in labelling and comparing the transcription tool "Folker" with our application.

18 citations


"Overlapping speech, utterance durat..." refers methods in this paper

  • ...To support the annotation process, the program ikannotate was used [28], [29]....

    [...]

Book ChapterDOI
01 Jan 2009
TL;DR: This paper investigated the role of fundamental frequency (F0) as a resource for turn competition in overlapping speech and found that participants in talk-in-interaction systematically manipulate F0 height when competing for the turn.
Abstract: Overlapping talk is common in talk-in-interaction. Much of the previous research on this topic agrees that speaker overlaps can be either turn competitive or noncompetitive. An investigation of the differences in prosodic design between these two classes of overlaps can offer insight into how speakers use and orient to prosody as a resource for turn competition. In this paper, we investigate the role of fundamental frequency (F0) as a resource for turn competition in overlapping speech. Our methodological approach combines detailed conversation analysis of overlap instances with acoustic measurements of F0 in the overlapping sequence and in its local context. The analyses are based on a collection of overlap instances drawn from the ICSI Meeting corpus. We found that overlappers mark an overlapping incoming as competitive by raising F0 above their norm for turn beginnings, and retaining this higher F0 until the point of overlap resolution. Overlappees may respond to these competitive incomings by returning competition, in which case they raise their F0 too. Our results thus provide instrumental support for earlier claims made on impressionistic evidence, namely that participants in talk-in-interaction systematically manipulate F0 height when competing for the turn.

18 citations


"Overlapping speech, utterance durat..." refers background in this paper

  • ...Many recent studies analyzed the phonetic structure of overlapping speech and found that fundamental frequency, intensity, speech rate and rhythm are important features characterizing the overlaps as either being cooperative or competitive [9], [8], [10]....

    [...]

Book ChapterDOI
01 Jan 2014
TL;DR: For a successful speech-controlled human-computer interaction the pure textual information as well as individual skills, preferences, and affective states of the user have to be known.
Abstract: For a successful speech-controlled human-computer interaction (HCI) the pure textual information as well as individual skills, preferences, and affective states of the user have to be known. However, verbal human interaction consists of several information layers. Apart from pure textual information, further details regarding the speaker’s feelings, believes, and social relations are transmitted. The additional information is encoded through acoustics. Especially, the intonation reveals details about the speakers communicative relation and their attitude towards the ongoing dialogue.

14 citations


Additional excerpts

  • ...In an earlier study, we could show that discourse particles, exchanged among the interaction partners and used to signalize the progress of the dialogue, are also used in naturalistic HCI, although the system was not able to properly react to them [17]....

    [...]

Frequently Asked Questions (11)
Q1. What are the contributions mentioned in the paper "Overlapping speech, utterance duration and affective content in HHI and HCI – An comparison"?

In cases of failure, overlapping speech may occur which is in the focus of this paper. The authors investigate the davero corpus a large naturalistic spoken corpus of real callcenter telephone conversations and compare their findings to results on the well-known SmartKom corpus consisting of human-computer interaction. The authors first show that overlapping speech occurs in different types of situational settings – extending the well-known categories cooperative and competitive overlaps –, all of which are frequent enough to be analyzed. Furthermore, the authors present connections between the occurrence of overlapping speech and the length of the previous utterance, and show that overlapping speech occurs at dialog instances where certain affective states are changing. 

In their further research activities, the authors will develop a robust automatic identification of the different types of overlap. Together with the recognition of the user’s affective state, the authors are a step further to future Cognitive Infocommunication systems acting as a companion towards human users [13], [14].

The authors used the non-parametric Mann-Whitney U test to test the significance of the difference within the utterance lengths.

The considered set of the SmartKom corpus contains 438 emotionally labeled dialogs with 12,076 utterances in total and 6,079 user utterances. 

Of the currently available 27,000 utterances in 1,600 dialogs 5,100 utterances (18.9%, 830 dialogs) are marked to contain overlapping speech. 

For this investigation, the authors showed that overlapping speech goes along with changes in the affective states of dominance and valence in certain situations. 

This corpus has several annotation levels, of which for their investigation the turn segmentation and an affective annotation based on the acoustic channel is used [22]. 

A possible application of their investigations in HHI and HCI is the identification of parts where the affective state changes based on the knowledge of overlapping speech and the dialog course: 

The corpus authors only measured the annotation correctness by comparing the results of different annotation rounds rather than calculating an inter-rater agreement measure like Krippendorff’s alpha or Fleiss’ kappa.

As the authors only take into account the overlapping speech, the authors have 6,347 user utterances and 817 utterances contain overlapping speech. 

To conduct the affective assessment, the authors first employed a few annotators to manually segment the recordings into single dialogs including the speaker turns.