
Showing papers on "Utterance" published in 2012


Journal ArticleDOI
TL;DR: A framework for pragmatic analysis is proposed which treats discourse as a game, with context as a scoreboard organized around the questions under discussion by the interlocutors, and it is argued that the prosodic focus of an utterance canonically serves to reflect the question under discussion, placing additional constraints on felicity in context.
Abstract: A framework for pragmatic analysis is proposed which treats discourse as a game, with context as a scoreboard organized around the questions under discussion by the interlocutors. The framework is intended to be coordinated with a dynamic compositional semantics. Accordingly, the context of utterance is modeled as a tuple of different types of information, and the questions therein — modeled, as is usual in formal semantics, as alternative sets of propositions — constrain the felicitous flow of discourse. A requirement of Relevance is satisfied by an utterance (whether an assertion, a question or a suggestion) iff it addresses the question under discussion. Finally, it is argued that the prosodic focus of an utterance canonically serves to reflect the question under discussion (at least in English), placing additional constraints on felicity in context. http://dx.doi.org/10.3765/sp.5.6

979 citations
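
The framework's central definitions lend themselves to a small formalization. Below is a minimal sketch, not the paper's own formalism, that models a question under discussion as an alternative set of propositions (with propositions as sets of possible worlds) and checks a simplified version of the Relevance requirement; all names and the toy world space are illustrative.

```python
# A minimal sketch (not the paper's formalism) of questions under discussion
# as alternative sets of propositions, with propositions modeled as sets of
# possible worlds. The toy world space below is purely illustrative.

Worlds = frozenset  # a proposition = the set of worlds where it is true

def entails(p: Worlds, q: Worlds) -> bool:
    """p entails q iff every p-world is a q-world."""
    return p <= q

def addresses(assertion: Worlds, qud: set) -> bool:
    """Simplified Relevance: the assertion addresses the QUD if it entails
    at least one alternative (a partial or complete answer) or rules one out."""
    return any(entails(assertion, alt) or assertion.isdisjoint(alt)
               for alt in qud)

# Toy example: worlds differ in who came to the party.
w1, w2, w3 = "only-ann", "only-bob", "ann-and-bob"
qud_who_came = {frozenset({w1, w3}),   # "Ann came"
                frozenset({w2, w3})}   # "Bob came"

ann_came = frozenset({w1, w3})
it_rained = frozenset({w1, w2, w3})    # true in every world: says nothing about the QUD

print(addresses(ann_came, qud_who_came))   # True  -- answers (part of) the question
print(addresses(it_rained, qud_who_came))  # False -- a Relevance violation in this context
```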


Book Chapter
01 Jan 2012
TL;DR: Pragmatic studies of verbal communication start from the assumption that an essential feature of most human communication is the expression and recognition of intentions, and that the hearer infers the speaker’s intended meaning from evidence she has provided for this purpose.
Abstract: © Deirdre Wilson and Dan Sperber 2012. Introduction. Pragmatic studies of verbal communication start from the assumption (first defended in detail by the philosopher Paul Grice) that an essential feature of most human communication, both verbal and non-verbal, is the expression and recognition of intentions (Grice 1957, 1969, 1982, 1989). On this approach, pragmatic interpretation is ultimately an exercise in metapsychology, in which the hearer infers the speaker’s intended meaning from evidence she has provided for this purpose. An utterance is, of course, a linguistically coded piece of evidence, so that verbal comprehension involves an element of decoding. However, the decoded linguistic meaning is merely the starting point for an inferential process that results in the attribution of a speaker’s meaning. The central problem for pragmatics is that the linguistic meaning recovered by decoding vastly underdetermines the speaker’s meaning. There may be ambiguities and referential ambivalences to resolve, ellipses to interpret, and other indeterminacies of explicit content to deal with. There may be implicatures to identify, illocutionary indeterminacies to resolve, metaphors and ironies to interpret. All this requires an appropriate set of contextual assumptions, which the hearer must also supply. To illustrate, consider the examples in (1) and (2):

623 citations


Book ChapterDOI
01 Jan 2012
TL;DR: Wilson and Sperber as mentioned in this paper treat utterance interpretation as a two-phase process: a modular decoding phase is seen as providing input to a central inferential phase in which a linguistically encoded logical form is contextually enriched and used to construct a hypothesis about the speaker's informative intention.
Abstract: © Deirdre Wilson and Dan Sperber 2012. Introduction Our book Relevance (Sperber and Wilson 1986a) treats utterance interpretation as a two-phase process: a modular decoding phase is seen as providing input to a central inferential phase in which a linguistically encoded logical form is contextually enriched and used to construct a hypothesis about the speaker’s informative intention. Relevance was mainly concerned with the inferential phase of comprehension: we had to answer Fodor’s challenge that while decoding processes are quite well understood, inferential processes are not only not understood, but perhaps not even understandable (see Fodor 1983). Here we will look more closely at the decoding phase and consider what types of information may be linguistically encoded, and how the borderline between decoding and inference can be drawn. It might be that all linguistically encoded information is cut to a single pattern: all truth conditions, say, or all instructions for use. However, there is a robust intuition that two basic types of meaning can be found. This intuition surfaces in a variety of distinctions: between describing and indicating, stating and showing, saying and conventionally implicating, or between truth-conditional and non-truth-conditional, conceptual and procedural, or representational and computational meaning. In the literature, justifications for these distinctions have been developed in both strictly linguistic and more broadly cognitive terms.

417 citations


Patent
Andrej Ljolje
02 Jul 2012
TL;DR: In this article, a system and method for performing speech recognition is disclosed, which comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations.
Abstract: A system and method for performing speech recognition is disclosed. The method comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations and presenting a recognition result for the utterance. Recognition improvement is found by moving a pronunciation model from a dictionary to the language model.

159 citations
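
As a rough illustration of the idea in the abstract, the sketch below keys a toy bigram language model by unique (word, pronunciation) identifiers, so that pronunciation probabilities live in the language model rather than in the dictionary. The identifiers, probabilities, and backoff floor are hypothetical; this is not the patented method's actual representation.

```python
import math

# Bigram LM over unique (word, pronunciation) identifiers; probabilities are made up.
lm_logprob = {
    ("<s>", ("tomato", "t ah m ey t ow")): math.log(0.03),
    ("<s>", ("tomato", "t ah m aa t ow")): math.log(0.01),
    (("tomato", "t ah m ey t ow"), ("soup", "s uw p")): math.log(0.20),
}
FLOOR = math.log(1e-9)  # crude stand-in for backoff on unseen bigrams

def lm_score(id_sequence, start="<s>"):
    """Sum bigram log-probabilities over (word, pronunciation) identifiers."""
    prev, total = start, 0.0
    for wp in id_sequence:
        total += lm_logprob.get((prev, wp), FLOOR)
        prev = wp
    return total

# A full recognizer would add per-frame acoustic scores for each pronunciation;
# here we only show that alternative pronunciations of the same word receive
# different language-model scores.
print(lm_score([("tomato", "t ah m ey t ow"), ("soup", "s uw p")]))
print(lm_score([("tomato", "t ah m aa t ow"), ("soup", "s uw p")]))
```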


Journal ArticleDOI
TL;DR: It is demonstrated that the ToM network becomes active while a participant is understanding verbal irony, and that ToM activity is directly linked with language comprehension processes.

138 citations


Patent
19 Dec 2012
TL;DR: In this paper, features for processing a user utterance with respect to multiple subject matters or domains are disclosed for selecting a likely result from a particular domain with which to respond to the utterance or otherwise take action.
Abstract: Features are disclosed for processing a user utterance with respect to multiple subject matters or domains, and for selecting a likely result from a particular domain with which to respond to the utterance or otherwise take action. A user utterance may be transcribed by an automatic speech recognition (“ASR”) module, and the results may be provided to a multi-domain natural language understanding (“NLU”) engine. The multi-domain NLU engine may process the transcription(s) in multiple individual domains rather than in a single domain. In some cases, the transcription(s) may be processed in multiple individual domains in parallel or substantially simultaneously. In addition, hints may be generated based on previous user interactions and other data. The ASR module, multi-domain NLU engine, and other components of a spoken language processing system may use the hints to more efficiently process input or more accurately generate output.

127 citations
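
A minimal sketch of the processing pattern described above, assuming hypothetical per-domain NLU handlers: the transcription is sent to every domain in parallel, hints from earlier interactions can boost particular domains, and the highest-scoring interpretation wins. The handler names and scores are placeholders, not the disclosed system.

```python
# A minimal sketch of multi-domain NLU with parallel domain processing and hints.
from concurrent.futures import ThreadPoolExecutor

def music_nlu(text):    return {"domain": "music",    "intent": "play",    "score": 0.62}
def weather_nlu(text):  return {"domain": "weather",  "intent": "get",     "score": 0.35}
def shopping_nlu(text): return {"domain": "shopping", "intent": "reorder", "score": 0.18}

DOMAINS = [music_nlu, weather_nlu, shopping_nlu]

def interpret(transcription, hints=None):
    """Run every domain on the ASR transcription, then pick the best result,
    optionally boosting domains suggested by hints from earlier interactions."""
    hints = hints or {}
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda nlu: nlu(transcription), DOMAINS))
    for r in results:
        r["score"] += hints.get(r["domain"], 0.0)  # e.g. the user was just browsing music
    return max(results, key=lambda r: r["score"])

print(interpret("play something relaxing", hints={"music": 0.1}))
```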


Journal ArticleDOI
TL;DR: It is proposed that gamma band power reflects a temporal binding phenomenon concerning the coordination of neural assemblies involved in accessing meaning of long samples of speech, during the processing of native and foreign languages.
Abstract: Spoken sentence comprehension relies on rapid and effortless temporal integration of speech units displayed at different rates. Temporal integration refers to how chunks of information perceived at different time scales are linked together by the listener in mapping speech sounds onto meaning. The neural implementation of this integration remains unclear. This study explores the role of short and long windows of integration in accessing meaning from long samples of speech. In a cross-linguistic study, we explore the time course of oscillatory brain activity between 1 and 100 Hz, recorded using EEG, during the processing of native and foreign languages. We compare oscillatory responses in a group of Italian and Spanish native speakers while they attentively listen to Italian, Japanese, and Spanish utterances, played either forward or backward. The results show that both groups of participants display a significant increase in gamma band power (55-75 Hz) only when they listen to their native language played forward. The increase in gamma power starts around 1000 msec after the onset of the utterance and decreases by its end, resembling the time course of access to meaning during speech perception. In contrast, changes in low-frequency power show similar patterns for both native and foreign languages. We propose that gamma band power reflects a temporal binding phenomenon concerning the coordination of neural assemblies involved in accessing meaning of long samples of speech.

105 citations
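
For readers unfamiliar with the band-power measure the study relies on, here is a generic sketch (not the authors' pipeline) that band-passes a signal in the reported gamma range (55-75 Hz) and tracks its power envelope over time; the synthetic EEG channel and sampling rate are placeholders.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 500.0                                   # sampling rate (Hz), assumed
t = np.arange(0, 3.0, 1 / fs)
eeg = np.random.randn(t.size)                # stand-in for one EEG channel

def band_power_envelope(x, low, high, fs, order=4):
    """Band-pass the signal and return its instantaneous power in that band."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, x)
    return np.abs(hilbert(filtered)) ** 2

gamma_power = band_power_envelope(eeg, 55, 75, fs)
print(gamma_power.mean())                    # would be compared across conditions and time windows
```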


Journal ArticleDOI
TL;DR: Comprehension of IR sentences activates cortical motor areas reliably more than comprehension of sentences devoid of any implicit motor information, and this is true despite the fact that IR sentences contain no lexical reference to action.
Abstract: Research from the past decade has shown that understanding the meaning of words and utterances (i.e., abstracted symbols) engages the same systems we used to perceive and interact with the physical world in a content-specific manner. For example, understanding the word "grasp" elicits activation in the cortical motor network, that is, part of the neural substrate involved in planning and executing a grasping action. In the embodied literature, cortical motor activation during language comprehension is thought to reflect motor simulation underlying conceptual knowledge [note that outside the embodied framework, other explanations for the link between action and language are offered, e.g., Mahon, B. Z., & Caramazza, A. A critical look at the embodied cognition hypothesis and a new proposal for grounding conceptual content. Journal of Physiology, 102, 59-70, 2008; Hagoort, P. On Broca, brain, and binding: A new framework. Trends in Cognitive Sciences, 9, 416-423, 2005]. Previous research has supported the view that the coupling between language and action is flexible, and reading an action-related word form is not sufficient for cortical motor activation [Van Dam, W. O., van Dijk, M., Bekkering, H., & Rueschemeyer, S.-A. Flexibility in embodied lexical-semantic representations. Human Brain Mapping, doi: 10.1002/hbm.21365, 2011]. The current study goes one step further by addressing the necessity of action-related word forms for motor activation during language comprehension. Subjects listened to indirect requests (IRs) for action during an fMRI session. IRs for action are speech acts in which access to an action concept is required, although it is not explicitly encoded in the language. For example, the utterance "It is hot here!" in a room with a window is likely to be interpreted as a request to open the window. However, the same utterance in a desert will be interpreted as a statement. The results indicate (1) that comprehension of IR sentences activates cortical motor areas reliably more than comprehension of sentences devoid of any implicit motor information. This is true despite the fact that IR sentences contain no lexical reference to action. (2) Comprehension of IR sentences also reliably activates substantial portions of the theory of mind network, known to be involved in making inferences about mental states of others. The implications of these findings for embodied theories of language are discussed.

103 citations


Proceedings Article
01 Jan 2012
TL;DR: This work introduces an annotated and standardized corpus in the Spoken Dialog Systems (SDS) domain, intended as a standardized basis for classification and evaluation tasks regarding task success prediction, dialog quality estimation or emotion recognition, to foster comparability between different approaches in these fields.
Abstract: Standardized corpora are the foundation for spoken language research. In this work, we introduce an annotated and standardized corpus in the Spoken Dialog Systems (SDS) domain. Data from the Let's Go Bus Information System from Carnegie Mellon University in Pittsburgh, comprising 347 dialogs with 9,083 system-user exchanges, has been formatted, parameterized and annotated with quality, emotion, and task success labels. A total of 46 parameters have been derived automatically and semi-automatically from Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU) and Dialog Manager (DM) properties. Each spoken user utterance has been assigned an emotion label from the set {garbage, non-angry, slightly angry, very angry}. In addition, a manual annotation of Interaction Quality (IQ) on the exchange level has been performed by three raters, achieving a Kappa value of 0.54. The IQ score expresses the quality of the interaction up to each system-user exchange on a scale from 1 to 5. The presented corpus is intended as a standardized basis for classification and evaluation tasks regarding task success prediction, dialog quality estimation or emotion recognition, to foster comparability between different approaches in these fields.

90 citations
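
The abstract reports agreement of Kappa = 0.54 between three raters on the Interaction Quality labels. As an illustration of how such a figure can be computed for more than two raters, here is a minimal sketch of Fleiss' kappa; the rating matrix is made up, and the paper may have used a different kappa variant.

```python
import numpy as np

def fleiss_kappa(ratings, n_categories):
    """ratings: (n_items, n_raters) integer category labels starting at 0."""
    n_items, n_raters = ratings.shape
    # counts[i, j] = number of raters who assigned item i to category j
    counts = np.zeros((n_items, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    p_j = counts.sum(axis=0) / (n_items * n_raters)                        # category proportions
    P_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

iq_labels = np.array([[4, 4, 3],     # three raters, IQ scores shifted to the range 0-4
                      [2, 2, 2],
                      [0, 1, 0],
                      [4, 4, 4]])
print(fleiss_kappa(iq_labels, n_categories=5))
```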


Journal ArticleDOI
TL;DR: The authors make a case for the merits of an account of word type meaning in non-conceptual terms, in contrast to the widespread assumption that lexical meaning is conceptual, hence directly expressible.
Abstract: The concept expressed by the use of a word in a context often diverges from its lexically encoded context-independent meaning: it may be more specific or more general (or a combination of both) than the lexical meaning. Grasping the intended concept involves a pragmatic process of relevance-driven adjustment or modulation of the lexical meaning in interaction with the rest of the utterance and with contextual information. The issue addressed here is the nature of the input to the pragmatic process of meaning adjustment, that is, the nature of the standing (encoded) meaning of the word type. The widespread assumption that lexical meaning is conceptual, hence directly expressible, is challenged and a case made for the merits of an account of word type meaning in non-conceptual terms.

87 citations


Proceedings ArticleDOI
09 Jul 2012
TL;DR: To enable a robot to recover from a failure to understand a natural language utterance, an information-theoretic strategy for asking targeted clarifying questions and using information from the answer to disambiguate the language is described.
Abstract: Our goal is to build robots that can robustly interact with humans using natural language. This problem is challenging because human language is filled with ambiguity, and furthermore, due to limitations in sensing, the robot's perception of its environment might be much more limited than that of its human partner. To enable a robot to recover from a failure to understand a natural language utterance, this paper describes an information-theoretic strategy for asking targeted clarifying questions and using information from the answer to disambiguate the language. To identify good questions, we derive an estimate of the robot's uncertainty about the mapping between specific phrases in the language and aspects of the external world. This metric enables the robot to ask a targeted question about the parts of the language for which it is most uncertain. After receiving an answer, the robot fuses information from the command, the question, and the answer in a joint probabilistic graphical model in the G3 framework. When using answers to questions, we show the robot is able to infer mappings between parts of the language and concrete object groundings in the external world with higher accuracy than by using information from the command alone. Furthermore, we demonstrate that by effectively selecting which questions to ask, the robot is able to achieve significant performance gains while asking many fewer questions than baseline metrics.
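
A minimal sketch of the question-selection idea, under the simplifying assumption that the robot maintains a discrete grounding distribution per phrase: the phrase with the highest Shannon entropy is the one to ask about. The distributions and object names are invented placeholders, not the G3 model itself.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a discrete distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# P(grounding | phrase) for each phrase in "put the pallet near the truck"
grounding_beliefs = {
    "the pallet": {"pallet_1": 0.9, "pallet_2": 0.1},
    "the truck":  {"truck_1": 0.4, "truck_2": 0.35, "trailer_1": 0.25},
}

def phrase_to_ask_about(beliefs):
    """Ask a clarifying question about the phrase whose grounding is most uncertain."""
    return max(beliefs, key=lambda phrase: entropy(beliefs[phrase]))

print(phrase_to_ask_about(grounding_beliefs))   # -> "the truck"
```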

Journal ArticleDOI
TL;DR: The results from this experiment imply that the cortical regions are dynamically recruited in language comprehension as a function of the processing demands of a task.
Abstract: This study used fMRI to investigate the neural correlates of analogical mapping during metaphor comprehension, with a focus on dynamic configuration of neural networks with changing processing demands and individual abilities. Participants with varying vocabulary sizes and working memory capacities read 3-sentence passages ending in nominal critical utterances of the form “X is a Y.” Processing demands were manipulated by varying preceding contexts. Three figurative conditions manipulated difficulty by varying the extent to which preceding contexts mentioned relevant semantic features for relating the vehicle and topic of the critical utterance to one another. In the easy condition, supporting information was mentioned. In the neutral condition, no relevant information was mentioned. In the most difficult condition, opposite features were mentioned, resulting in an ironic interpretation of the critical utterance. A fourth, literal condition included context that supported a literal interpretation of the critical utterance. Activation in lateral and medial frontal regions increased with increasing contextual difficulty. Lower vocabulary readers also had greater activation across conditions in the right inferior frontal gyrus. In addition, volumetric analyses showed increased right temporo-parietal junction and superior medial frontal activation for all figurative conditions over the literal condition. The results from this experiment imply that the cortical regions are dynamically recruited in language comprehension as a function of the processing demands of a task. Individual differences in cognitive capacities were also associated with differences in recruitment and modulation of working memory and executive function regions, highlighting the overlapping computations in metaphor comprehension and general thinking and reasoning.

Journal ArticleDOI
TL;DR: It is argued that verbal irony is one emergent, strategic possibility given the interface between people’s ability to infer mental states, and use language, and that it arises from the same set of abilities that underlie a wide range of inferential communicative behaviors.
Abstract: The way we speak can reveal much about what we intend to communicate, but the words we use often only indirectly relate to the meanings we wish to convey. Verbal irony is a commonly studied form of indirect speech in which a speaker produces an explicit evaluative utterance that implicates an unstated, opposing evaluation. Producing and understanding ironic language, as well as many other types of indirect speech, requires the ability to recognize mental states in others, sometimes described as a capacity for metarepresentation. This article aims to connect common elements between the major theoretical approaches to verbal irony to recent psycholinguistic, developmental, and neuropsychological research demonstrating the necessity for metarepresentation in the effective use of verbal irony in social interaction. Here I will argue that verbal irony is one emergent, strategic possibility given the interface between people’s ability to infer mental states, and use language. Rather than think of ironic communication as a specialized cognitive ability, I will claim that it arises from the same set of abilities that underlie a wide range of inferential communicative behaviors.

Patent
30 Jul 2012
TL;DR: In this article, a spoken utterance of a plurality of characters can be received, and each selected known character sequence can be scored based on, at least in part, a weighting of individual characters that comprise the known sequence.
Abstract: A method of and a system for processing speech. A spoken utterance of a plurality of characters can be received. A plurality of known character sequences that potentially correspond to the spoken utterance can be selected. Each selected known character sequence can be scored based on, at least in part, a weighting of individual characters that comprise the known character sequence.
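
A hypothetical sketch of the scoring step described above: each known character sequence is compared to the recognized string, with per-character weights (here, lower weights for acoustically confusable letters) determining how much each match or mismatch contributes. The weights and candidate list are illustrative only.

```python
# Weights for characters that are easily confused acoustically (illustrative values).
CONFUSABLE = {"b": 0.5, "d": 0.5, "p": 0.5, "t": 0.5, "m": 0.6, "n": 0.6}

def score(recognized, candidate):
    """Higher is better; mismatches on easily confused characters cost less."""
    if len(recognized) != len(candidate):
        return float("-inf")
    total = 0.0
    for r, c in zip(recognized, candidate):
        weight = CONFUSABLE.get(c, 1.0)
        total += weight if r == c else -weight
    return total

known_sequences = ["abd123", "abt123", "xyz987"]   # e.g. stored account codes
recognized = "abd123"                              # output of the speech recognizer
best = max(known_sequences, key=lambda cand: score(recognized, cand))
print(best)
```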

Journal ArticleDOI
TL;DR: The results suggest that privileged knowledge does shape language use, but crucially, that the degree to which the addressee’s perspective is considered is shaped by the relevance of the addressee’s perspective to the utterance goals.
Abstract: We examined the extent to which speakers take into consideration the addressee’s perspective in language production. Previous research on this process had revealed clear deficits (Horton & Keysar, Cognition 59:91–117, 1996; Wardlow Lane & Ferreira, Journal of Experimental Psychology: Learning, Memory, and Cognition 34:1466–1481, 2008). Here, we evaluated a new hypothesis—that the relevance of the addressee’s perspective depends on the speaker’s goals. In two experiments, Korean speakers described a target object in situations in which the perspective status of a competitor object (e.g., a large plate when describing a smaller plate) was manipulated. In Experiment 1, we examined whether speakers would use scalar-modified expressions even when the competitor was hidden from the addressee. The results demonstrated that information from both the speaker’s and the addressee’s perspectives influenced production. In Experiment 2, we examined whether utterance goals modulate this process. The results indicated that when a speaker makes a request, the addressee’s perspective has a stronger influence than it does when the speaker informs the addressee. These results suggest that privileged knowledge does shape language use, but crucially, that the degree to which the addressee’s perspective is considered is shaped by the relevance of the addressee’s perspective to the utterance goals.

Book ChapterDOI
01 Jan 2012
TL;DR: For instance, van der Hulst et al. as discussed by the authors proposed the concept of word accent, which refers to the phonological marking of one most prominent position in a word.
Abstract: UC Berkeley Phonology Lab Annual Report (2012). Do All Languages Have Word Accent? Larry M. Hyman, University of California, Berkeley. Introduction. The purpose of this paper is to address the question: Do all languages have word accent? By word accent (henceforth, WA), I intend a concept a bit broader than the traditional notion of word-level stress-accent, as so extensively studied within the metrical literature. I will thus use the term as follows: Word accent refers to the phonological marking of one most prominent position in a word. The question in my title is thus intended to mean the following: Do all languages phonologically mark one most prominent position per word? As defined, WA is designed to be more descriptive and inclusive than word stress, which refers to a common type of prominence marking, typically analyzed as the headmost syllable of a metrical structure (cf. §2). Even so, the claim has been made that all languages have word stress, and thus necessarily WA: "A considerable number (probably the majority, and according to me: all) of the world's languages display a phenomenon known as word stress" (van der Hulst 2009: 1). On the other hand, a number of scholars have asserted that specific languages lack word stress, if not WA in general. This includes certain tone languages in Africa, but also languages without tone: "[In Bella Coola there is] ... no phonemically significant phenomena of stress or pitch associated with syllables or words.... When two or more syllabics occur in a word or sentence, one can clearly hear different degrees of articulatory force. But these relative stresses in a sequence of acoustic syllables do not remain constant in repetitions of the utterance" (Newman 1947: 132). In fact, many languages do not provide unambiguous evidence of WA, or even of words (Schiering, Bickel and Hildebrandt 2010). In many cases the interpretations have been theory-dependent and highly personal: some people see (or hear) stress where others don't. Given this ... [Footnotes: This paper, which will appear in Harry van der Hulst, Studies on Word Accent, is a second revision (April 2012) of an oral paper presented at the Conference on Word Accent: Theoretical and Typological Issues, University of Connecticut, April 30, 2010. I would like to thank Harry van der Hulst and two anonymous reviewers for their helpful comments on the earlier drafts. Harry van der Hulst has since indicated to me via personal communication that he meant "word accent" in the more general sense intended in this study.]

Journal ArticleDOI
TL;DR: This study examines the amount and type of exposure needed for 10-month-olds to recognize words, finding that the ability to rapidly recognize words in continuous utterances is clearly linked to future language development.
Abstract: Infants' ability to recognize words in continuous speech is vital for building a vocabulary. We here examined the amount and type of exposure needed for 10-month-olds to recognize words. Infants first heard a word, either embedded within an utterance or in isolation, then recognition was assessed by comparing event-related potentials to this word versus a word that they had not heard directly before. Although all 10-month-olds showed recognition responses to words first heard in isolation, not all infants showed such responses to words they had first heard within an utterance. Those that did succeed in the latter, harder, task, however, understood more words and utterances when re-tested at 12 months, and understood more words and produced more words at 24 months, compared with those who had shown no such recognition response at 10 months. The ability to rapidly recognize the words in continuous utterances is clearly linked to future language development.

Journal ArticleDOI
TL;DR: The authors focus on the pragmatic functions of these particles, the most important of which is the indication of the type of link created between a preceding and the current utterance after the latter has been fully produced and is thus manifest to both participants.

Patent
29 Aug 2012
TL;DR: In this article, a conversation management method and a system for implementing the same are provided to create a score about utterance inclination and to verify the utterance inclinations of a user.
Abstract: PURPOSE: A conversation management method and a system for implementing the same are provided to create a score for utterance inclination and to verify the utterance inclination of a user. CONSTITUTION: A conversation management system comprises a calculating part (101), a similarity calculating part (103), and an utterance inclination verifying part (105). The calculating part calculates the importance of utterance inclinations, the similarity between utterance inclinations, and the relative distance between utterance inclinations, using one utterance inclination among a plurality of utterance inclinations included in a corpus and the utterance inclination connected to it in sequence. The similarity calculating part compares the conversation flow obtained from the corpus with the conversation flow between a user and an education device, using the importance of the utterance inclinations and the similarity between them, and calculates the similarity between the conversation flows according to the comparison result. The utterance inclination verifying part evaluates the user's utterance according to the relative distance between the utterance inclinations and calculates an utterance inclination evaluation score.

Proceedings Article
01 May 2012
TL;DR: The paper describes the development of resources and free tools, consisting of acoustic models, phonetic dictionaries, and libraries and programs to deal with these data.
Abstract: SPPAS is a tool to produce automatic annotations which include utterance, word, syllabic and phonemic segmentations from a recorded speech sound and its transcription. SPPAS is distributed under the terms of the GNU Public License. It was successfully applied during the Evalita 2011 campaign, on Italian map-task dialogues. It can also deal with French, English and Chinese and there is an easy way to add other languages. The paper describes the development of resources and free tools, consisting of acoustic models, phonetic dictionaries, and libraries and programs to deal with these data. All of them are publicly available.

Journal Article
TL;DR: The most sophisticated model is not only able to handle specificity implicature but is also the first formal account of Horn implicatures that correctly predicts human behavior in signaling games with no prior conventions.

DOI
01 Nov 2012
TL;DR: This work proposes alternative terminologies ("communicative act" and "communicative act sequence") that are more adequate to describe the new realities of online communication and can usefully be applied to such diverse entities as weblog entries, tweets, status updates on social network sites, comments on other postings and to sequences of such entities.
Abstract: New forms of communication that have recently developed in the context of Web 2.0 make it necessary to reconsider some of the analytical tools of linguistic analysis. In the context of keyboard-to-screen communication (KSC), as we shall call it, a range of old dichotomies have become blurred or cease to be useful altogether, e. g. "asynchronous" versus "synchronous", "written" versus "spoken", "monologic" versus "dialogic", and in particular "text" versus "utterance". We propose alternative terminologies ("communicative act" and "communicative act sequence") that are more adequate to describe the new realities of online communication and can usefully be applied to such diverse entities as weblog entries, tweets, status updates on social network sites, comments on other postings and to sequences of such entities. Furthermore, in the context of social network sites, different forms of communication traditionally separated (i. e. blog, chat, email and so on) seem to converge. We illustrate and discuss these phenomena with data from Twitter and Facebook.

Journal ArticleDOI
TL;DR: The design, development and field evaluation of a machine translation system from Spanish to Spanish Sign Language (LSE: Lengua de Signos Española) and the main problems found and a discussion on how to solve them are detailed.
Abstract: This paper describes the design, development and field evaluation of a machine translation system from Spanish to Spanish Sign Language (LSE: Lengua de Signos Espanola). The developed system focuses on helping Deaf people when they want to renew their Driver’s License. The system is made up of a speech recognizer (for decoding the spoken utterance into a word sequence), a natural language translator (for converting a word sequence into a sequence of signs belonging to the sign language), and a 3D avatar animation module (for playing back the signs). For the natural language translator, three technological approaches have been implemented and evaluated: an example-based strategy, a rule-based translation method and a statistical translator. For the final version, the implemented language translator combines all the alternatives into a hierarchical structure. This paper includes a detailed description of the field evaluation. This evaluation was carried out in the Local Traffic Office in Toledo involving real government employees and Deaf people. The evaluation includes objective measurements from the system and subjective information from questionnaires. The paper details the main problems found and a discussion on how to solve them (some of them specific for LSE).
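
The abstract describes combining example-based, rule-based, and statistical translators into a hierarchical structure. The sketch below shows one plausible reading of such a combination, with each translator stubbed out and a confidence threshold deciding when to fall through to the next stage; none of it reflects the system's actual components.

```python
# A minimal sketch of a hierarchical fallback over three translators; the stubs,
# the toy translation memory, and the confidence threshold are all placeholders.
def example_based(sentence):
    memory = {"buenos días": ("BUENOS-DÍAS",)}       # toy translation memory
    signs = memory.get(sentence.lower())
    return (signs, 1.0) if signs else (None, 0.0)

def rule_based(sentence):
    return (tuple(w.upper() for w in sentence.split()), 0.6)   # stub

def statistical(sentence):
    return (tuple(w.upper() for w in sentence.split()), 0.4)   # stub

def translate(sentence, threshold=0.5):
    """Try each translator in order; accept the first result above the threshold."""
    for translator in (example_based, rule_based, statistical):
        signs, confidence = translator(sentence)
        if signs and confidence >= threshold:
            return signs
    return statistical(sentence)[0]   # last resort, even below threshold

print(translate("buenos días"))
print(translate("necesito renovar el carné de conducir"))
```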

Patent
22 Feb 2012
TL;DR: In this paper, a recipient computing device can receive a speech utterance to be processed by speech recognition and segment the utterance into two or more utterance segments, each of which can be sent to one of a plurality of available speech recognizers.
Abstract: A recipient computing device can receive a speech utterance to be processed by speech recognition and segment the speech utterance into two or more speech utterance segments, each of which can be sent to one of a plurality of available speech recognizers. A first one of the plurality of available speech recognizers can be implemented on a separate computing device accessible via a data network. A first segment can be processed by the first recognizer and the results of the processing returned to the recipient computing device, and a second segment can be processed by a second recognizer implemented at the recipient computing device.
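
A minimal, hypothetical sketch of the routing idea: the received utterance is segmented and each segment is dispatched to a different recognizer, for example a remote one reached over the network and a local one on the recipient device. The stubs and the fixed split point are placeholders.

```python
# Placeholder recognizers; a real remote recognizer would be an RPC over the network.
def remote_recognizer(audio_segment):
    return f"<remote:{len(audio_segment)} samples>"

def local_recognizer(audio_segment):
    return f"<local:{len(audio_segment)} samples>"

def recognize(utterance_audio, split_at):
    """Split the utterance and send each segment to a different recognizer."""
    segments = [utterance_audio[:split_at], utterance_audio[split_at:]]
    recognizers = [remote_recognizer, local_recognizer]
    return " ".join(rec(seg) for rec, seg in zip(recognizers, segments))

audio = [0.0] * 16000          # one second of placeholder samples
print(recognize(audio, split_at=4000))
```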

Proceedings ArticleDOI
05 Mar 2012
TL;DR: This paper explored the potential of non-linguistic utterances and found that utterance rhythm may be an influential independent factor, whilst the pitch contour of an utterance may have little importance.
Abstract: Vocal affective displays are vital for achieving engaging and effective Human-Robot Interaction. The same can be said for linguistic interaction also, however, while emphasis may be placed upon linguistic interaction, there are also inherent risks: users are bound to a single language, and breakdowns are frequent due to current technical limitations. This work explores the potential of non-linguistic utterances. A recent study is briefly outlined in which school children were asked to rate a variety of non-linguistic utterances on an affective level using a facial gesture tool. Results suggest, for example, that utterance rhythm may be an influential independent factor, whilst the pitch contour of an utterance may have little importance. Also evidence for categorical perception of emotions is presented, an issue that may impact important areas of HRI away from vocal displays of affect.

Proceedings Article
05 Jul 2012
TL;DR: This paper presented a system that adapts to a listener's acoustic understanding problems by pausing, repeating and possibly rephrasing problematic parts of an utterance, which was rated as significantly more natural than two systems representing the current state of the art that either ignore the interrupting event or just pause.
Abstract: Participants in a conversation are normally receptive to their surroundings and their interlocutors, even while they are speaking and can, if necessary, adapt their ongoing utterance. Typical dialogue systems are not receptive and cannot adapt while uttering. We present combinable components for incremental natural language generation and incremental speech synthesis and demonstrate the flexibility they can achieve with an example system that adapts to a listener's acoustic understanding problems by pausing, repeating and possibly rephrasing problematic parts of an utterance. In an evaluation, this system was rated as significantly more natural than two systems representing the current state of the art that either ignore the interrupting event or just pause; it also has a lower response time.
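
A toy sketch of the adaptation strategy, assuming the utterance is already divided into increments and that a noise event is reported per increment: the system pauses and then repeats the affected chunk. This illustrates the behaviour only; it is not the authors' incremental generation or synthesis architecture.

```python
def deliver(chunks, noisy_during):
    """chunks: utterance increments in order; noisy_during: indices hit by noise."""
    spoken = []
    for i, chunk in enumerate(chunks):
        spoken.append(chunk)
        if i in noisy_during:                    # listener signalled an understanding problem
            spoken.extend(["<pause>", chunk])    # pause, then repeat the problematic increment
    return " ".join(spoken)

chunks = ["turn left", "at the next junction,", "then follow the road", "for two kilometres"]
print(deliver(chunks, noisy_during={1}))
```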

Journal ArticleDOI
TL;DR: There is a belief, widespread in some quarters, that the syllable is usefully thought of as a cyclic or oscillatory system, and that the modulation of the amplitude envelope, arising in quasi-cyclic jaw wagging, contains syllabic information at a temporal rate matched to known theta oscillations; theta oscillation is, of course, found to be modulated in part by the amplitude envelope.
Abstract: There is a belief, widespread in some quarters, that the syllable is usefully thought of as a cyclic or oscillatory system. An oscillatory system allows the definition of phase (or "time" from the point of view of the system) and this in turn allows speculative accounts of entrainment between the bio-mechanical speech production of the speaker and the neurodynamics of the listener. A very explicit account of this nature is provided by Peelle and Davis (2012), and it in turn relies heavily upon a body of work by Poeppel and others, e.g., Luo and Poeppel (2007). These two articles may be taken as representative of a larger literature that shares a common approach to the syllable considered as an oscillatory system. A recent summary is provided in Giraud and Poeppel (2012). In this short note, a cautionary flag will be raised about such accounts from a phonetician's point of view. The syllable is a construct that is central to our understanding of speech. Very young speakers can be taught to introspect about the syllabic count of their utterances in many instances (Liberman et al., 1974). Some languages base their orthographic systems on the syllable. Musicians associate syllables with discrete notes (Patel and Daniele, 2003). Articulatory movements are made more readily interpretable if we posit the syllable as an organizing (or emergent) structure governing relative timing among discrete effectors (Browman and Goldstein, 1988). The apparent facility with which the syllable is employed in many accounts belies an important observation: syllables are not readily observable in the speech signal. Like their phonological cousins, the phonemes, they make a lot of sense to us as speakers and listeners, but it is not a simple matter to map from this intuition onto either the acoustic signal or the articulatory trace. Even competent adult English speakers may have difficulty counting syllables in a given utterance, and there may be no objective grounds upon which one can answer the difficult question of how many syllables are found in a specific production of an utterance. Just a few illustrative examples encountered by the author recently included such words and phrases as "zoologist" (found to be produced as 2, 3, or 4 syllables), "Carol" (1 or 2), "naturally" (2 and 3), "by his" (1 and 2), etc. Note: ambiguity obtains in attempting to identify the number of syllables in actually produced tokens, not in idealized, imagined ones. Such examples abound once one directs one's attention to specific utterances as spoken. They are not exceptions to a largely unproblematic majority. Poeppel, Giraud, Peelle, and others lean heavily on the "amplitude envelope" of the speech signal as a supposed carrier of information about syllabic phase, and the modulation of this envelope is supposed to arise from quasi-cyclic wagging of the jaw. Typical syllable rates are observed to lie approximately within the same range as theta oscillations, conventionally delimited to 4–8 Hz. And so the inference arises that modulation of the amplitude envelope, arising in quasi-cyclic jaw wagging, contains syllabic information at a temporal rate matched to known theta oscillations, and, of course, theta oscillation is found to be modulated in part by the amplitude envelope. But the wagging of the jaw is not a guide to the unfolding of syllables in sequence, and, even if it were, it is not typically possible to recover jaw position from the amplitude envelope (Beňus and Pouplier, 2011).
The amplitude envelope bears a fiercely complex relationship to the movement of all the articulators, not just the jaw (see Figure 1). It is substantially modulated by all kinds of tongue movement, by lip aperture, and by velar opening and closing. It certainly does not provide unambiguous or even nearly unambiguous information about syllables. Intuitions about syllabic regularity are thus potentially misleading, and a theoretical account that depends upon information about syllabic sequence being present in the amplitude envelope must at the very least demonstrate that that information is, in fact, present. It is certainly not sufficient to point out that a mean syllable rate of about 5 Hz seems to match the entirely conventional range of theta oscillation, nor to observe that although speech is not strictly periodic, it is at least quasi-periodic. Such untempered laxness, it seems to this phonetician, will not serve.
[Figure 1. From top: sound wave, amplitude envelope, approximate syllable boundaries, and first principal component of jaw movement in the mid-sagittal plane for one fairly rapid utterance of the Slovak sentence that might be represented canonically thus: but in ...]
Likewise, the observation that appropriate amplitude envelope modulation is a critical contributor to the intelligibility of speech is neither necessary nor sufficient to shore up a claim that the amplitude envelope provides information about syllables, or that it can serve as the basis for entrainment between speakers and listeners (Ghitza, 2012). Oscillation in the brain is uncontroversially present at a range of frequencies (Buzsaki and Draguhn, 2004), and furthermore, the temporal modulation of the amplitude envelope of the speech wave, or of a band pass filtered component thereof, may be causally linked to modulation of theta oscillation (Luo and Poeppel, 2007). None of this need be questioned to argue that it is the speech signal itself that is being mischaracterized in such accounts. The speech signal is not periodic in the sense required to support entrainment with the source of theta or gamma oscillations. Furthermore, caution is especially warranted as the term "rhythm" is used in fundamentally different ways within neuroscience – where it is treated as synonymous with "periodic" – and in our everyday talk of speech – where rhythm is more akin to musical rhythm, and much harder to define in an objective sense. An entrainment account based on the amplitude envelope (or the jaw) as the mediating signal that yokes two systems together is fundamentally incomplete. It is incomplete, not because speakers and listeners do not entrain – they do, and there is increasing evidence for coupling at every level – but because such an account omits the knowledge that speakers/listeners bring to bear on the exchange (Cummins, 2012). This is perhaps best illustrated by the ability of speakers to speak in extremely close synchrony with one another (Cummins, 2009). In the absence of any periodic grid to support mutual timing registration, speakers can, without effort, align their spoken utterances of a novel text. This is possible, not because the syllables are recoverable from the amplitude envelope.
Indeed, it was found that the amplitude envelope was neither necessary nor sufficient to facilitate synchronization among speakers (Cummins, 2009), and that synchronization depended upon a complex suite of interacting factors, among which intelligibility seemed to be the single most important (although intelligibility is not related to any single signal property). On the contrary, close synchronous speaking is possible because speakers share the knowledge of the severe spatio-temporal constraints that collectively define what it is to speak a specific language. The coupling exhibited here is, of course, between the neuro-bio-mechanics of one skilled speaker and the neuro-bio-mechanics of the other – like coupling with like. The coupling critically involves the whole of the two speakers, including their skill sets. There seems to be a need here for the development of formal models that can capture the reciprocal coupling of speaker and listener, taking into account their implicit but hugely constraining practical knowledge of what it is to speak. A mechanical model that treats syllable-producers as oscillators and syllable-hearers as entraining to those oscillations, seems, to this phonetician, to ignore much of the known complexity of speech as she is spoken and of speakers as they speak.
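
For concreteness, here is a generic sketch of the two quantities at issue in this note: the broadband amplitude envelope (via the Hilbert transform) and its modulation spectrum, whose peak often falls near the 4-8 Hz theta range. The synthetic signal is a stand-in for a recorded utterance; as argued above, a theta-range modulation peak does not by itself show that syllable information is recoverable from the envelope.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 2.0, 1 / fs)
# crude stand-in for speech: a carrier amplitude-modulated at roughly syllabic rate
speech = np.sin(2 * np.pi * 150 * t) * (1 + np.sin(2 * np.pi * 5 * t))

envelope = np.abs(hilbert(speech))                      # broadband amplitude envelope
spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(envelope.size, d=1 / fs)
dominant = freqs[np.argmax(spectrum[1:]) + 1]           # skip the DC bin
print(f"dominant envelope modulation: {dominant:.1f} Hz")
```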

Book Chapter
01 Jan 2012
TL;DR: The interpretation of linguistic utterances is determined by the words involved and the way they are combined, but not exclusively so.
Abstract: The interpretation of linguistic utterances is determined by the words involved and the way they are combined, but not exclusively so. Establishing the content that is communicated by an utterance is inextricably intertwined with the communicative context where the utterance is made, including the expectations of the interlocutors about each other. Clark and Marshall (1981) make a good case that even the reference of a definite description depends on the reasoning of the speaker and hearer about each other’s knowledge state. Likewise, computing the implicatures of a sentence requires reasoning about the knowledge states and intentions of the communication partners. To use a worn-out example, Grice (1975) points out that a sentence like (1b), if uttered to the owner of an immobilized car by a passerby, carries much more information than what is literally said.

01 Jan 2012
TL;DR: In this article, the authors introduce a meta-model of psychotherapy process, which claims that all therapies strive to create a joint observational stance for making sense of clients' problematic experiences.
Abstract: Dialogical sequence analysis (DSA) is a microanalytic method of analyzing utterances. Based on Mikhail Bakhtin's theory of utterance, it states that, when communicating, individuals simultaneously position themselves with regard to the referential object and the addressee. "About what" people are speaking and "to whom" they direct their words both affect the style and composition of their utterances. Such positioning is semiotic in the sense that the referential object is always construed by personal and historically formed meanings. The historicity of subjective construal applies to the addressee as well. Utterances are often complicated by the fact that there are often hidden or invisible addressees in addition to the ostensible interlocutor. DSA developed in the context of psychotherapy supervision and process research. The article introduces a meta-model of psychotherapy process, which claims that all therapies strive to create a joint observational stance for making sense of clients' problematic experiences. Hence, the psychotherapies provide a natural laboratory within which internal experiences become tangible through expressions and utterances. The fundamental unit of analyzing the double positioning in relation to the topic and the addressees is the semiotic position. Being a relational concept, it cannot be used to single out and categorize distinct units of speech. The way by which semiotic positions are identified in DSA will be illustrated by three excerpts from psychotherapy literature. Psychotherapy research is a disciplined reflection of therapeutic practices. Clients and therapists work jointly toward an understanding of the client's presenting problems and attempt to find productive solutions, alternative ways of action or more constructive ways of relating to the problem. The psychotherapy researcher is an outsider who observes and examines the recordings of therapeutic exchanges or the pre-controlled constructions of clients and therapists that have been generated through interviews, rating scales, or structured recalls. The researcher is trying to make sense, afterwards, of an extremely complex process of joint action and communication that are mediated by the participants' ways of understanding what they are doing together and what the problem at hand is.

Journal ArticleDOI
01 Jan 2012
TL;DR: In this paper, it is argued that the speaker's impolite utterance may carry different pragmatic effects when directed towards (yet not necessarily targeting), and interpreted by, his/her interlocutor(s) (an addressee or a third party), overhearers (a bystander or an eavesdropper), or when face-threatening to a nonparticipant.
Abstract: The primary objective of this paper is to elucidate the workings of (intentional) impoliteness in multi-party film talk. The departure point is the diversification of hearer types, coupled with the premise that film discourse operates on two communicative levels, namely the inter-character level and the recipient’s level, at which the audience interprets characters’ conversations. It is thus argued that the speaker’s impolite utterance may carry different pragmatic effects when directed towards (yet not necessarily targeting), and interpreted by, his/her interlocutor(s) (an addressee or a third party), overhearers (a bystander or an eavesdropper), or when face-threatening to a non-participant. Moreover, the recipient (i. e., the viewer), is yet another hearer category, for whose pleasure impoliteness is interactionally rendered at the characters’ level. The theoretical proposal to extend the dyadic model of impoliteness is illustrated with utterances produced by Gregory House, the main protagonist of the television