
Showing papers on "Utterance published in 2019"


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This paper presents a new data set of 57k annotated utterances in English, Spanish, and Thai and uses this data set to evaluate three different cross-lingual transfer methods, finding that given several hundred training examples in the target language, the latter two methods outperform translating the training data.
Abstract: One of the first steps in the utterance interpretation pipeline of many task-oriented conversational AI systems is to identify user intents and the corresponding slots. Since data collection for machine learning models for this task is time-consuming, it is desirable to make use of existing data in a high-resource language to train models in low-resource languages. However, development of such models has largely been hindered by the lack of multilingual training data. In this paper, we present a new data set of 57k annotated utterances in English (43k), Spanish (8.6k) and Thai (5k) across the domains weather, alarm, and reminder. We use this data set to evaluate three different cross-lingual transfer methods: (1) translating the training data, (2) using cross-lingual pre-trained embeddings, and (3) a novel method of using a multilingual machine translation encoder as contextual word representations. We find that given several hundred training examples in the target language, the latter two methods outperform translating the training data. Further, in very low-resource settings, multilingual contextual word representations give better results than using cross-lingual static embeddings. We also compare the cross-lingual methods to using monolingual resources in the form of contextual ELMo representations and find that given just small amounts of target language data, this method outperforms all cross-lingual methods, which highlights the need for more sophisticated cross-lingual methods.

238 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the context of a verbal exchange can be used to enhance neural decoder performance in real time, and that contextual integration of decoded question likelihoods significantly improves answer decoding.
Abstract: Natural communication often occurs in dialogue, differentially engaging auditory and sensorimotor brain regions during listening and speaking. However, previous attempts to decode speech directly from the human brain typically consider listening or speaking tasks in isolation. Here, human participants listened to questions and responded aloud with answers while we used high-density electrocorticography (ECoG) recordings to detect when they heard or said an utterance and to then decode the utterance’s identity. Because certain answers were only plausible responses to certain questions, we could dynamically update the prior probabilities of each answer using the decoded question likelihoods as context. We decode produced and perceived utterances with accuracy rates as high as 61% and 76%, respectively (chance is 7% and 20%). Contextual integration of decoded question likelihoods significantly improves answer decoding. These results demonstrate real-time decoding of speech in an interactive, conversational setting, which has important implications for patients who are unable to communicate.
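
Below is a minimal numpy sketch of the contextual-integration step the abstract describes, with a hypothetical question-answer compatibility table and illustrative probability values (none of the numbers come from the paper): decoded question likelihoods induce a prior over answers, which is combined with the answer decoder's likelihoods.

```python
import numpy as np

# Hypothetical setup: 3 candidate questions, 5 candidate answers.
# qa_compat[i, j] = 1 if answer j is a plausible response to question i.
qa_compat = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

# Likelihoods decoded from neural activity (illustrative values only).
p_question = np.array([0.70, 0.20, 0.10])                 # question decoder output
p_answer_lik = np.array([0.30, 0.25, 0.15, 0.20, 0.10])   # answer decoder output

# Context prior over answers: push the question likelihoods through the
# row-normalized question->answer compatibility table.
context_prior = p_question @ (qa_compat / qa_compat.sum(axis=1, keepdims=True))

# Combine the context prior with the answer likelihoods (Bayes rule up to a constant).
posterior = context_prior * p_answer_lik
posterior /= posterior.sum()

print("context prior:", np.round(context_prior, 3))
print("posterior    :", np.round(posterior, 3))
```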

145 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, a multilingual end-to-end (E2E) speech recognition model is proposed that uses a plurality of language-specific adaptor modules (300), including one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different from the particular language.
Abstract: A method (400) of transcribing speech using a multilingual end-to-end (E2E) speech recognition model (115) includes receiving audio data (110) for an utterance (106) spoken in a particular native language, obtaining a language vector (115) identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features (117) derived from the audio data to generate a transcription (120) for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules (300) that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.
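
As a hedged illustration of the adaptor-module idea, the PyTorch sketch below inserts a small residual bottleneck adaptor per language into a shared encoder layer and routes each utterance through the adaptor matching its language ID; the layer sizes, placement, and language set are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class LanguageAdaptor(nn.Module):
    """Residual bottleneck adaptor attached to a shared encoder layer."""
    def __init__(self, d_model: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

class AdaptedEncoderLayer(nn.Module):
    """Shared transformation plus one adaptor per supported language."""
    def __init__(self, d_model: int = 256, languages=("en", "es", "hi")):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)
        self.adaptors = nn.ModuleDict({l: LanguageAdaptor(d_model) for l in languages})

    def forward(self, feats, lang: str):
        h = torch.relu(self.shared(feats))
        return self.adaptors[lang](h)   # route through the language-specific adaptor

# Usage: a batch of acoustic feature frames for a Spanish utterance (illustrative shapes).
layer = AdaptedEncoderLayer()
frames = torch.randn(1, 120, 256)       # (batch, time, feature)
out = layer(frames, lang="es")
print(out.shape)                        # torch.Size([1, 120, 256])
```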

108 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: A framework to exploit acoustic information in tandem with lexical data, using two bi-directional long short-term memory (BLSTM) networks to obtain hidden representations of the utterance and an attention mechanism, referred to as multi-hop attention, which is trained to automatically infer the correlation between the modalities.
Abstract: In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs). The outputs from both the DNNs are then fused for classification. As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data. The proposed framework uses two bi-directional long short-term memory (BLSTM) networks to obtain hidden representations of the utterance. Furthermore, we propose an attention mechanism, referred to as multi-hop attention, which is trained to automatically infer the correlation between the modalities. The multi-hop attention first computes the relevant segments of the textual data corresponding to the audio signal. The relevant textual data is then used to attend to parts of the audio signal. To evaluate the performance of the proposed system, experiments are performed on the IEMOCAP dataset. Experimental results show that the proposed technique outperforms the state-of-the-art system by 6.5% relative improvement in terms of weighted accuracy.
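
The following simplified PyTorch sketch shows one cross-modal attention hop in the spirit of the description above (a text summary attends over audio frames, and the attended audio context then attends back over the text); the dimensions and the single-hop reduction are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def attend(query, keys):
    """Dot-product attention: query (B, D), keys (B, T, D) -> context (B, D)."""
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (B, T)
    weights = F.softmax(scores, dim=1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (B, D)

B, T_audio, T_text, D = 4, 200, 30, 128
audio_h = torch.randn(B, T_audio, D)   # BLSTM outputs over audio frames
text_h = torch.randn(B, T_text, D)     # BLSTM outputs over word embeddings

# Hop 1: summarize the text, then use it to select relevant audio frames.
text_summary = text_h.mean(dim=1)
audio_ctx = attend(text_summary, audio_h)

# Hop 2: use the attended audio context to re-weight the textual segments.
text_ctx = attend(audio_ctx, text_h)

# Fused utterance representation fed to an emotion classifier (not shown).
fused = torch.cat([audio_ctx, text_ctx], dim=1)
print(fused.shape)    # torch.Size([4, 256])
```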

105 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: The authors propose rewriting human utterances as a pre-processing step to help multi-turn dialogue modelling, collecting a new dataset with human annotations and introducing a Transformer-based utterance rewriter that uses a pointer network.
Abstract: Recent research has achieved impressive results in single-turn dialogue modelling. In the multi-turn setting, however, current models are still far from satisfactory. One major challenge is the frequently occurring coreference and information omission in our daily conversation, making it hard for machines to understand the real intention. In this paper, we propose rewriting the human utterance as a pre-process to help multi-turn dialogue modelling. Each utterance is first rewritten to recover all coreferred and omitted information. The next processing steps are then performed based on the rewritten utterance. To properly train the utterance rewriter, we collect a new dataset with human annotations and introduce a Transformer-based utterance rewriting architecture using the pointer network. We show the proposed architecture achieves remarkably good performance on the utterance rewriting task. The trained utterance rewriter can be easily integrated into online chatbots and brings general improvement over different domains.
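
A hedged sketch of the copy mechanism a pointer-network rewriter relies on: at each decoding step, the next word is chosen from the dialogue history or the current utterance via attention, mixed by a learned gate. The shapes and the gating formulation below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def copy_distribution(dec_state, history_h, utterance_h, gate_layer):
    """Mix two copy (attention) distributions: one over history tokens and one
    over the current utterance's tokens. Returns probabilities over all source tokens."""
    attn_hist = F.softmax(history_h @ dec_state, dim=0)       # (T_hist,)
    attn_utt = F.softmax(utterance_h @ dec_state, dim=0)      # (T_utt,)
    lam = torch.sigmoid(gate_layer(dec_state))                # gate in (0, 1)
    return torch.cat([lam * attn_hist, (1 - lam) * attn_utt])

D, T_hist, T_utt = 128, 25, 8
dec_state = torch.randn(D)
history_h = torch.randn(T_hist, D)      # encoder states of the dialogue history
utterance_h = torch.randn(T_utt, D)     # encoder states of the utterance to rewrite
gate = torch.nn.Linear(D, 1)

probs = copy_distribution(dec_state, history_h, utterance_h, gate)
print(probs.shape, float(probs.sum()))  # torch.Size([33]) 1.0 (up to rounding)
```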

97 citations


Journal ArticleDOI
TL;DR: A novel approach is presented that uses a recently introduced method for quantifying the semantic context of speech and relates it to a commonly used method for indexing low-level auditory encoding of speech, suggesting a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.
Abstract: Speech perception involves the integration of sensory input with expectations based on the context of that speech. Much debate surrounds the issue of whether or not prior knowledge feeds back to affect early auditory encoding in the lower levels of the speech processing hierarchy, or whether perception can be best explained as a purely feedforward process. Although there has been compelling evidence on both sides of this debate, experiments involving naturalistic speech stimuli to address these questions have been lacking. Here, we use a recently introduced method for quantifying the semantic context of speech and relate it to a commonly used method for indexing low-level auditory encoding of speech. The relationship between these measures is taken to be an indication of how semantic context leading up to a word influences how its low-level acoustic and phonetic features are processed. We record EEG from human participants (both male and female) listening to continuous natural speech and find that the early cortical tracking of a word's speech envelope is enhanced by its semantic similarity to its sentential context. Using a forward modeling approach, we find that prediction accuracy of the EEG signal also shows the same effect. Furthermore, this effect shows distinct temporal patterns of correlation depending on the type of speech input representation (acoustic or phonological) used for the model, implicating a top-down propagation of information through the processing hierarchy. These results suggest a mechanism that links top-down prior information with the early cortical entrainment of words in natural, continuous speech.SIGNIFICANCE STATEMENT During natural speech comprehension, we use semantic context when processing information about new incoming words. However, precisely how the neural processing of bottom-up sensory information is affected by top-down context-based predictions remains controversial. We address this discussion using a novel approach that indexes a word's similarity to context and how well a word's acoustic and phonetic features are processed by the brain at the time of its utterance. We relate these two measures and show that lower-level auditory tracking of speech improves for words that are more related to their preceding context. These results suggest a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.

91 citations


Posted Content
TL;DR: This survey reviews computational approaches for code-switched Speech and Natural Language Processing, including language processing tools and end-to-end systems, and concludes with future directions and open problems in the field.
Abstract: Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world. This survey reviews computational approaches for code-switched Speech and Natural Language Processing. We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. As code-switching data and resources are scarce, we list what is available in various code-switched language pairs with the language processing tasks they can be used for. We review code-switching research in various Speech and NLP applications, including language processing tools and end-to-end systems. We conclude with future directions and open problems in the field.

86 citations


Journal ArticleDOI
TL;DR: The experimental results on the AP17-OLR database demonstrate that the proposed end-to-end short-utterance-based speech language identification (SLI) approach can improve identification performance, especially on short utterances.
Abstract: Conversations in intelligent vehicles usually consist of short utterances. As the durations of the short utterances are small (e.g., less than 3 s), it is difficult to learn sufficient information to distinguish between languages. In this paper, we propose an end-to-end short-utterance-based speech language identification (SLI) approach, which is especially suitable for short-utterance-based language identification. This approach is implemented with a long short-term memory (LSTM) neural network, which is designed for the SLI application in intelligent vehicles. The features used for LSTM learning are generated by a transfer learning method. The bottleneck features of a deep neural network, which are obtained for a Mandarin acoustic-phonetic classifier, are used for the LSTM training. In order to improve the SLI accuracy with short utterances, a phase vocoder based time-scale modification method is utilized to reduce/increase the speech rate of the test utterance. By connecting the normal, speech rate reduced, and speech rate increased utterances, we can extend the length of the test utterances such that the performance of the SLI system is improved. The experimental results on the AP17-OLR database demonstrate that the proposed method can improve identification performance, especially on short utterances. The proposed SLI system also performs robustly in noisy vehicular environments.
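
A small sketch of the utterance-extension step, assuming librosa's phase-vocoder-based time stretching as a stand-in for the paper's time-scale modification (the file path and stretch rates are placeholders):

```python
import numpy as np
import librosa

# Load a short test utterance (path is a placeholder).
y, sr = librosa.load("test_utterance.wav", sr=16000)

# Time-scale-modified copies: slower (rate < 1) and faster (rate > 1) speech.
y_slow = librosa.effects.time_stretch(y, rate=0.8)
y_fast = librosa.effects.time_stretch(y, rate=1.2)

# Concatenate the normal, slowed, and sped-up versions so the language
# identifier sees a longer input than the original short utterance.
y_extended = np.concatenate([y, y_slow, y_fast])
print(len(y) / sr, "->", len(y_extended) / sr, "seconds")
```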

81 citations


Journal ArticleDOI
TL;DR: A usage-based computational model of language acquisition is presented which learns in a purely incremental fashion, through online processing based on chunking, and which offers broad, cross-linguistic coverage while uniting key aspects of comprehension and production within a single framework.
Abstract: While usage-based approaches to language development enjoy considerable support from computational studies, there have been few attempts to answer a key computational challenge posed by usage-based theory: the successful modeling of language learning as language use. We present a usage-based computational model of language acquisition which learns in a purely incremental fashion, through online processing based on chunking, and which offers broad, cross-linguistic coverage while uniting key aspects of comprehension and production within a single framework. The model's design reflects memory constraints imposed by the real-time nature of language processing, and is inspired by psycholinguistic evidence for children's sensitivity to the distributional properties of multiword sequences and for shallow language comprehension based on local information. It learns from corpora of child-directed speech, chunking incoming words together to incrementally build an item-based "shallow parse." When the model encounters an utterance made by the target child, it attempts to generate an identical utterance using the same chunks and statistics involved during comprehension. High performance is achieved on both comprehension- and production-related tasks: the model's shallow parsing is evaluated across 79 single-child corpora spanning English, French, and German, while its production performance is evaluated across over 200 single-child corpora representing 29 languages from the CHILDES database. The model also succeeds in capturing findings from children's production of complex sentence types. Together, our modeling results suggest that much of children's early linguistic behavior may be supported by item-based learning through online processing of simple distributional cues, consistent with the notion that acquisition can be understood as learning to process language. (PsycINFO Database Record (c) 2019 APA, all rights reserved).

80 citations


Journal ArticleDOI
TL;DR: In this paper, a corpus of WhatsApp chats written in Spanish was used to explore the functions of emoji in a rapport management framework, and the analysis showed that emoji are used across different domains in the corpus: they not only upgrade or downgrade different speech acts (illocutionary domain), but they also contribute to achieving a successful interaction by signaling closing sections or by helping to negotiate openings (discourse domain), as well as serving as a way to frame playful interactions.

68 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work proposes a multitask learning recipe, where a language identification task is explicitly learned in addition to the E2E speech recognition task, and introduces an efficient word vocabulary expansion method for language modeling to alleviate data sparsity issues under the code-switching scenario.
Abstract: Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances. In this work, we study end-to-end (E2E) approaches to the Mandarin-English code-switching speech recognition (CSSR) task. We first examine the effectiveness of using data augmentation and byte-pair encoding (BPE) subword units. More importantly, we propose a multitask learning recipe, where a language identification task is explicitly learned in addition to the E2E speech recognition task. Furthermore, we introduce an efficient word vocabulary expansion method for language modeling to alleviate data sparsity issues under the code-switching scenario. Experimental results on the SEAME data, a Mandarin-English CS corpus, demonstrate the effectiveness of the proposed methods.
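
A hedged PyTorch sketch of the multitask objective: the shared encoder output feeds both the ASR output head and a language-identification head, and the two losses are interpolated. The frame-level cross-entropy formulation, shapes, and weight are assumptions for illustration; the actual system's ASR loss and label granularity may differ.

```python
import torch
import torch.nn as nn

# Illustrative shapes: (batch, time, feature) encoder output.
B, T, D, vocab, n_langs = 8, 50, 256, 3000, 3   # e.g. Mandarin / English / other (assumed)
enc_out = torch.randn(B, T, D)

asr_head = nn.Linear(D, vocab)     # token posteriors for the E2E ASR objective
lid_head = nn.Linear(D, n_langs)   # per-frame language-identification posteriors

asr_logits = asr_head(enc_out)
lid_logits = lid_head(enc_out)

# Dummy targets for the sketch.
asr_targets = torch.randint(0, vocab, (B, T))
lid_targets = torch.randint(0, n_langs, (B, T))

ce = nn.CrossEntropyLoss()
loss_asr = ce(asr_logits.reshape(-1, vocab), asr_targets.reshape(-1))
loss_lid = ce(lid_logits.reshape(-1, n_langs), lid_targets.reshape(-1))

lam = 0.3  # interpolation weight (assumed; would be tuned)
loss = (1 - lam) * loss_asr + lam * loss_lid
loss.backward()
```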

Proceedings ArticleDOI
12 May 2019
TL;DR: An interaction-aware attention network (IAAN) that incorporates contextual information into the learned vocal representation through a novel attention mechanism is proposed; it achieves 66.3% accuracy (7.9% over baseline methods) in four-class emotion recognition, which is also the current state-of-the-art recognition rate on the benchmark database.
Abstract: Obtaining robust speech emotion recognition (SER) in scenarios of spoken interactions is critical to the development of next-generation human-machine interfaces. Previous research has largely focused on performing SER by modeling each utterance of the dialog in isolation without considering the transactional and dependent nature of human-human conversation. In this work, we propose an interaction-aware attention network (IAAN) that incorporates contextual information in the learned vocal representation through a novel attention mechanism. Our proposed method achieves 66.3% accuracy (7.9% over baseline methods) in four-class emotion recognition, which is also the current state-of-the-art recognition rate on the benchmark database.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A large-scale multi-turn dataset is collected and manually labeled with the explicit relation between an utterance and its context, and a "pick-and-combine" model is proposed to restore the incomplete utterance from its context.
Abstract: In multi-turn dialogue, utterances do not always take the full form of sentences. These incomplete utterances will greatly reduce the performance of open-domain dialogue systems. Restoring more incomplete utterances from context could potentially help the systems generate more relevant responses. To facilitate the study of incomplete utterance restoration for open-domain dialogue systems, a large-scale multi-turn dataset Restoration-200K is collected and manually labeled with the explicit relation between an utterance and its context. We also propose a “pick-and-combine” model to restore the incomplete utterance from its context. Experimental results demonstrate that the annotated dataset and the proposed approach significantly boost the response quality of both single-turn and multi-turn dialogue systems.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work focuses on cross-modal fusion techniques over deep learning models for emotion detection from spoken audio and corresponding transcripts, and investigates the use of a long short-term memory (LSTM) recurrent neural network with pre-trained word embeddings for text-based emotion recognition and a convolutional neural network (CNN) with utterance-level descriptors for emotion recognition from speech.
Abstract: In human perception and understanding, a number of different and complementary cues are adopted according to different modalities. Various emotional states in communication between humans reflect this variety of cues across modalities. Recent developments in multi-modal emotion recognition utilize deep learning techniques to achieve remarkable performances, with models based on different features suitable for text, audio and vision. This work focuses on cross-modal fusion techniques over deep learning models for emotion detection from spoken audio and corresponding transcripts. We investigate the use of a long short-term memory (LSTM) recurrent neural network (RNN) with pre-trained word embeddings for text-based emotion recognition and a convolutional neural network (CNN) with utterance-level descriptors for emotion recognition from speech. Various fusion strategies are adopted on these models to yield an overall score for each of the emotional categories. Intra-modality dynamics for each emotion are captured in the neural network designed for the specific modality. Fusion techniques are employed to obtain the inter-modality dynamics. Speaker- and session-independent experiments on the IEMOCAP multi-modal emotion detection dataset show the effectiveness of the proposed approaches. This method yields state-of-the-art results for utterance-level emotion recognition based on speech and text.
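
As an illustration of one of the fusion strategies mentioned above, the sketch below performs late (score-level) fusion of a text LSTM branch and an audio branch; both branches are drastically simplified placeholders, and the dimensions, descriptor size, and averaging weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_emotions = 4

class TextBranch(nn.Module):
    """LSTM over (pre-trained) word embeddings -> emotion scores."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, emb):                 # emb: (B, T_words, emb_dim)
        _, (h, _) = self.lstm(emb)
        return self.out(h[-1])

class AudioBranch(nn.Module):
    """Tiny stand-in for a CNN over utterance-level descriptors."""
    def __init__(self, feat_dim=88):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_emotions))

    def forward(self, feats):               # feats: (B, feat_dim)
        return self.net(feats)

text_scores = TextBranch()(torch.randn(2, 20, 300))
audio_scores = AudioBranch()(torch.randn(2, 88))

# Late fusion: average the per-modality posteriors, then pick the top class.
fused = 0.5 * F.softmax(text_scores, dim=1) + 0.5 * F.softmax(audio_scores, dim=1)
pred = fused.argmax(dim=1)
print(pred)
```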

Journal ArticleDOI
TL;DR: Focusing on the verb-DO noun relationship in simple spoken sentences, multivariate pattern analysis and computational semantic modeling are applied to source-localized electro/magnetoencephalographic data to map out the specific representational constraints that are constructed as each word is heard, and to determine how these constraints guide the interpretation of subsequent words in the utterance.
Abstract: Human speech comprehension is remarkable for its immediacy and rapidity. The listener interprets an incrementally delivered auditory input, millisecond by millisecond as it is heard, in terms of complex multilevel representations of relevant linguistic and nonlinguistic knowledge. Central to this process are the neural computations involved in semantic combination, whereby the meanings of words are combined into more complex representations, as in the combination of a verb and its following direct object (DO) noun (e.g., "eat the apple"). These combinatorial processes form the backbone for incremental interpretation, enabling listeners to integrate the meaning of each word as it is heard into their dynamic interpretation of the current utterance. Focusing on the verb-DO noun relationship in simple spoken sentences, we applied multivariate pattern analysis and computational semantic modeling to source-localized electro/magnetoencephalographic data to map out the specific representational constraints that are constructed as each word is heard, and to determine how these constraints guide the interpretation of subsequent words in the utterance. Comparing context-independent semantic models of the DO noun with contextually constrained noun models reflecting the semantic properties of the preceding verb, we found that only the contextually constrained model showed a significant fit to the brain data. Pattern-based measures of directed connectivity across the left hemisphere language network revealed a continuous information flow among temporal, inferior frontal, and inferior parietal regions, underpinning the verb's modification of the DO noun's activated semantics. These results provide a plausible neural substrate for seamless real-time incremental interpretation on the observed millisecond time scales.

Proceedings ArticleDOI
16 Sep 2019
TL;DR: This article proposed a collaborative memory network (CM-Net) based on the well-designed block, named CM-block, which first captures slot-specific and intent-specific features from memories in a collaborative manner, and then uses these enriched features to enhance local context representations, based on which the sequential information flow leads to more specific (slot and intent) global utterance representations.
Abstract: Spoken Language Understanding (SLU) mainly involves two tasks, intent detection and slot filling, which are generally modeled jointly in existing works. However, most existing models fail to fully utilize co-occurrence relations between slots and intents, which restricts their potential performance. To address this issue, in this paper we propose a novel Collaborative Memory Network (CM-Net) based on the well-designed block, named CM-block. The CM-block firstly captures slot-specific and intent-specific features from memories in a collaborative manner, and then uses these enriched features to enhance local context representations, based on which the sequential information flow leads to more specific (slot and intent) global utterance representations. Through stacking multiple CM-blocks, our CM-Net is able to alternately perform information exchange among specific memories, local contexts and the global utterance, thus incrementally enriching each other. We evaluate the CM-Net on two standard benchmarks (ATIS and SNIPS) and a self-collected corpus (CAIS). Experimental results show that the CM-Net achieves state-of-the-art results on ATIS and SNIPS on most criteria, and significantly outperforms the baseline models on CAIS. Additionally, we make the CAIS dataset publicly available for the research community.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a multiscale deep convolutional long short-term memory (LSTM) framework for spontaneous speech emotion recognition, where a deep CNN model was used to learn segment-level features on the basis of the created image-like three channels of spectrograms.
Abstract: Recently, emotion recognition in real sceneries such as in the wild has attracted extensive attention in affective computing, because spontaneous emotions in real sceneries are more challenging and difficult to identify than other emotions. Motivated by the diverse effects of different lengths of audio spectrograms on emotion identification, this paper proposes a multiscale deep convolutional long short-term memory (LSTM) framework for spontaneous speech emotion recognition. Initially, a deep convolutional neural network (CNN) model is used to learn deep segment-level features on the basis of the created image-like three-channel spectrograms. Then, a deep LSTM model is adopted on the basis of the learned segment-level CNN features to capture the temporal dependency among all divided segments in an utterance for utterance-level emotion recognition. Finally, different emotion recognition results, obtained by combining CNN with LSTM at multiple lengths of segment-level spectrograms, are integrated by using a score-level fusion strategy. Experimental results on two challenging spontaneous emotional datasets, i.e., the AFEW5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method, outperforming state-of-the-art methods.
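
A small numpy sketch of the score-level fusion step: utterance-level posteriors obtained at different segment lengths are combined by a weighted average. The number of scales, class count, and weights are illustrative assumptions.

```python
import numpy as np

# Posteriors over 6 emotion classes from three segment-length settings
# (illustrative values for one utterance).
scores = {
    "1.0s": np.array([0.10, 0.40, 0.20, 0.10, 0.10, 0.10]),
    "2.0s": np.array([0.05, 0.55, 0.15, 0.10, 0.10, 0.05]),
    "3.0s": np.array([0.10, 0.35, 0.30, 0.10, 0.05, 0.10]),
}
weights = {"1.0s": 0.3, "2.0s": 0.4, "3.0s": 0.3}   # assumed fusion weights

# Weighted average of the per-scale posteriors, renormalized.
fused = sum(weights[k] * scores[k] for k in scores)
fused /= fused.sum()
print("predicted class:", int(fused.argmax()))
```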

Book ChapterDOI
01 Jan 2019
TL;DR: It is the ability of deep neural network techniques to model complex correlations among speech signal features that enhances their performance over traditional approaches.
Abstract: Automatic language identification has always been a challenging issue and an important research area in speech signal processing. It is the process of identifying a language from a random spoken utterance. This era is dominated by artificial intelligence and, specifically, deep learning techniques. Prominent among the deep learning techniques are feed-forward deep neural networks, convolutional neural networks, long short-term memory recurrent neural networks, etc. The various types of deep neural network techniques that were recently introduced have overshadowed conventional methods such as Gaussian mixture models, hidden Markov models, etc. These techniques showed significant improvement in recognition performance over various parameters. It is the ability of deep neural network techniques to model complex correlations among speech signal features that enhances their performance over traditional approaches. This chapter provides in-depth concepts of various deep learning techniques for spoken language identification. It also explores and analyzes several works on speech recognition. Advantages and limitations of each of the techniques are reviewed. A summary of the future scope for spoken language identification is also provided.

Journal ArticleDOI
TL;DR: A new form of influence infants have over their ambient language in everyday learning environments is illustrated: by vocalizing, infants catalyze the production of simplified, more easily learnable language from caregivers.
Abstract: What is the function of babbling in language learning? We examined the structure of parental speech as a function of contingency on infants’ non-cry prelinguistic vocalizations. We analyzed several acoustic and linguistic measures of caregivers’ speech. Contingent speech was less lexically diverse and shorter in utterance length than non-contingent speech. We also found that the lexical diversity of contingent parental speech only predicted infant vocal maturity. These findings illustrate a new form of influence infants have over their ambient language in everyday learning environments. By vocalizing, infants catalyze the production of simplified, more easily learnable language from caregivers.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this paper, a cross-modal bilingual dictionary inferred from monolingual speech and text corpora is used to initialize a speech-to-text translation system; the dictionary maps every source speech segment corresponding to a spoken word to its target text translation.
Abstract: We present a framework for building speech-to-text translation (ST) systems using only monolingual speech and text corpora, in other words, speech utterances from a source language and independent text from a target language. As opposed to traditional cascaded systems and end-to-end architectures, our system does not require any labeled data (i.e., transcribed source audio or parallel source and target text corpora) during training, making it especially applicable to language pairs with very few or even zero bilingual resources. The framework initializes the ST system with a cross-modal bilingual dictionary inferred from the monolingual corpora, that maps every source speech segment corresponding to a spoken word to its target text translation. For unseen source speech utterances, the system first performs word-by-word translation on each speech segment in the utterance. The translation is improved by leveraging a language model and a sequence denoising autoencoder to provide prior knowledge about the target language. Experimental results show that our unsupervised system achieves comparable BLEU scores to supervised end-to-end models despite the lack of supervision. We also provide an ablation analysis to examine the utility of each component in our system.
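
The toy sketch below illustrates only the word-by-word translation stage with language-model scoring; the hand-written segment-to-word dictionary and bigram scores are hypothetical, and the real system infers the dictionary from monolingual corpora and additionally uses a denoising autoencoder, which this sketch omits.

```python
# Hypothetical cross-modal dictionary: speech-segment cluster id -> candidate target words.
dictionary = {
    "seg_cluster_17": ["the", "a"],
    "seg_cluster_03": ["cat", "dog"],
    "seg_cluster_42": ["sleeps", "sits"],
}

# Toy target-side bigram "language model" log-scores (assumed values).
bigram_score = {
    ("<s>", "the"): -0.5, ("<s>", "a"): -0.9,
    ("the", "cat"): -0.7, ("the", "dog"): -1.5,
    ("a", "cat"): -1.2, ("a", "dog"): -1.0,
    ("cat", "sleeps"): -0.6, ("cat", "sits"): -1.4,
    ("dog", "sleeps"): -1.1, ("dog", "sits"): -1.2,
}

def greedy_translate(segment_ids):
    """Pick, for each speech segment, the dictionary candidate best scored by the LM."""
    prev, output = "<s>", []
    for seg in segment_ids:
        candidates = dictionary[seg]
        best = max(candidates, key=lambda w: bigram_score.get((prev, w), -5.0))
        output.append(best)
        prev = best
    return " ".join(output)

print(greedy_translate(["seg_cluster_17", "seg_cluster_03", "seg_cluster_42"]))
# -> "the cat sleeps"
```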

Posted Content
TL;DR: A novel Collaborative Memory Network (CM-Net), based on a well-designed block named CM-block, achieves state-of-the-art results on ATIS and SNIPS on most criteria and significantly outperforms the baseline models on CAIS.
Abstract: Spoken Language Understanding (SLU) mainly involves two tasks, intent detection and slot filling, which are generally modeled jointly in existing works. However, most existing models fail to fully utilize co-occurrence relations between slots and intents, which restricts their potential performance. To address this issue, in this paper we propose a novel Collaborative Memory Network (CM-Net) based on the well-designed block, named CM-block. The CM-block firstly captures slot-specific and intent-specific features from memories in a collaborative manner, and then uses these enriched features to enhance local context representations, based on which the sequential information flow leads to more specific (slot and intent) global utterance representations. Through stacking multiple CM-blocks, our CM-Net is able to alternately perform information exchange among specific memories, local contexts and the global utterance, thus incrementally enriching each other. We evaluate the CM-Net on two standard benchmarks (ATIS and SNIPS) and a self-collected corpus (CAIS). Experimental results show that the CM-Net achieves state-of-the-art results on ATIS and SNIPS on most criteria, and significantly outperforms the baseline models on CAIS. Additionally, we make the CAIS dataset publicly available for the research community.


Journal ArticleDOI
Jenny Helin1
TL;DR: This paper developed the notion of "dialogical writing" by drawing on the literature on performative utterances and a collaborative fieldwork project where writing became an integrated part of the research process.
Abstract: The foundational view of discourse as a descriptive mode of representation and writing as a retrospective stabilizing tool has been criticized in organization and management research. The purpose of this paper is to inquire into a more emergent, unfinished, and relational writing used throughout the research processes. To that aim, I develop the notion of ‘dialogical writing’ by drawing on the literature on performative utterances and a collaborative fieldwork project where writing became an integrated part of the research process. I come to understand this form of writing as one in situ where addressivity, responsiveness, and unfinalizability are emphasized. This enables writing to be part of a conversation; writing as a response to that which has been said and in anticipation of the next possible utterance. I close with implications for writing in organization studies, such as the possibility of thinking of writing as an offering of the tentative.

Proceedings ArticleDOI
12 May 2019
TL;DR: A novel end-to-end automatic speech recognition (ASR) method that takes into consideration long-range sequential context information beyond utterance boundaries, and can explicitly utilize relationships between a current target utterance and all preceding utterances.
Abstract: This paper describes a novel end-to-end automatic speech recognition (ASR) method that takes into consideration long-range sequential context information beyond utterance boundaries. In spontaneous ASR tasks such as those for discourses and conversations, the input speech often comprises a series of utterances. Accordingly, the relationships between the utterances should be leveraged for transcribing the individual utterances. While most previous end-to-end ASR methods only focus on utterance-level ASR that handles single utterances independently, the proposed method (which we call "large-context end-to-end ASR") can explicitly utilize relationships between a current target utterance and all preceding utterances. The method is modeled by combining an attention-based encoder-decoder model, which is one of the most representative end-to-end ASR models, with hierarchical recurrent encoder-decoder models, which are effective language models for capturing long-range sequential contexts beyond the utterance boundaries. Experiments on Japanese discourse speech tasks demonstrate the proposed method yields significant ASR performance improvements compared with the conventional utterance-level end-to-end ASR system.
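
A hedged PyTorch sketch of the hierarchical idea: each utterance is summarized into a vector, and a higher-level recurrent state carries that summary across utterance boundaries so the decoder for the current utterance can condition on all preceding ones. Layer choices and dimensions are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HierarchicalContext(nn.Module):
    def __init__(self, d_enc=256, d_ctx=256):
        super().__init__()
        self.utt_rnn = nn.GRU(d_enc, d_enc, batch_first=True)   # within-utterance encoder
        self.ctx_rnn = nn.GRUCell(d_enc, d_ctx)                  # across-utterance encoder

    def forward(self, utterance_feats):
        """utterance_feats: list of (1, T_i, d_enc) tensors, one per utterance in order."""
        ctx = torch.zeros(1, self.ctx_rnn.hidden_size)
        contexts = []
        for feats in utterance_feats:
            contexts.append(ctx)                 # context available when decoding this utterance
            _, h = self.utt_rnn(feats)           # summarize the utterance
            ctx = self.ctx_rnn(h[-1], ctx)       # update the discourse-level state
        return contexts                          # fed to the attention-based decoder (not shown)

# Usage on a toy "discourse" of three utterances of different lengths.
model = HierarchicalContext()
discourse = [torch.randn(1, t, 256) for t in (40, 55, 30)]
ctxs = model(discourse)
print(len(ctxs), ctxs[0].shape)    # 3 torch.Size([1, 256])
```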

Journal ArticleDOI
TL;DR: A range of audience design effects is reviewed, organized by a novel cognitive framework for understanding audience design effects, whereby speakers independently generate communicatively relevant features to predict potential communicative effects.
Abstract: Audience design refers to the situation in which speakers fashion their utterances so as to cater to the needs of their addressees. In this article, a range of audience design effects are reviewed, organized by a novel cognitive framework for understanding audience design effects. Within this framework, feedforward (or one-shot) production is responsible for feedforward audience design effects, or effects based on already known properties of the addressee (e.g., child versus adult status) or the message (e.g., that it includes meanings that might be confusable). Then, a forward modeling approach is described, whereby speakers independently generate communicatively relevant features to predict potential communicative effects. This can explain recurrent processing audience design effects, or effects based on features of the produced utterance itself or on idiosyncratic features of the addressee or communicative situation. Predictions from the framework are delineated.

Journal ArticleDOI
TL;DR: This paper explores how the analysis of im/politeness can be tackled from a discursive pragmatics perspective and reveals the dynamic interaction among the three levels of discourse, i.e., the micro, macro, and meso levels.

Posted Content
TL;DR: The experiments show that by mapping the continuous dialogue into a causal utterance pair, constructed from the utterance and the reply utterance, models can better capture the emotions of the reply utterance.
Abstract: In this paper, we investigate the emotion recognition ability of the pre-trained language model BERT. Owing to the nature of the framework of BERT, a two-sentence structure, we adapt BERT to continuous dialogue emotion prediction tasks, which rely heavily on sentence-level context-aware understanding. The experiments show that by mapping the continuous dialogue into a causal utterance pair, constructed from the utterance and the reply utterance, models can better capture the emotions of the reply utterance. The present method achieves micro F1 scores of 0.815 and 0.885 on the test sets of Friends and EmotionPush, respectively.
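
A sketch of the causal utterance-pair framing using the HuggingFace transformers API; the checkpoint name, label count, and example utterances are placeholders, and whether this matches the paper's exact fine-tuning setup is an assumption.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=7)

# The preceding utterance and its reply form BERT's two-sentence input;
# the emotion label is predicted for the reply.
utterance = "I passed the exam!"
reply = "That's amazing, congratulations!"
inputs = tokenizer(utterance, reply, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```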

Journal ArticleDOI
TL;DR: This article proposed a memory augmented GCN for goal-oriented dialogues, which exploits the entity relation graph in a knowledge-base and the dependency graph associated with an utterance to compute richer representations for words and entities.
Abstract: Domain-specific goal-oriented dialogue systems typically require modeling three types of inputs, viz., (i) the knowledge-base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances, and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and in the sentences of the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling and document dating, we propose a memory-augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge-base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we could use the global word co-occurrence graph to enrich the representations of utterances. We experiment with 4 datasets, viz., (i) the modified DSTC2 dataset, (ii) recently released code-mixed versions of the DSTC2 dataset in four languages, (iii) the Wizard-of-Oz style CAM676 dataset and (iv) the Wizard-of-Oz style MultiWOZ dataset. On all 4 datasets our method outperforms existing methods on a wide range of evaluation metrics.
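
As a minimal illustration of the graph convolution the abstract describes, the sketch below applies one symmetrically normalized GCN layer over a toy dependency graph of an utterance; the propagation rule and dimensions are generic GCN choices, not a claim about the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):
        """x: (N, d_in) node features, adj: (N, N) adjacency with self-loops."""
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        norm_adj = adj * d_inv_sqrt.unsqueeze(0) * d_inv_sqrt.unsqueeze(1)  # D^-1/2 A D^-1/2
        return torch.relu(norm_adj @ self.lin(x))

# Toy dependency graph for a 4-word utterance (undirected edges plus self-loops).
adj = torch.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

words = torch.randn(4, 64)              # word representations
layer = GCNLayer(64, 64)
enriched = layer(words, adj)            # graph-enriched word representations
print(enriched.shape)                   # torch.Size([4, 64])
```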

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A pre-design Wizard of Oz study is described that offered insight into two factors contributing to perceived system intelligence: the system's ability to understand the analytic intent behind an input utterance and the ability to interpret an utterance contextually.
Abstract: Natural language can be a useful modality for creating and interacting with visualizations but users often have unrealistic expectations about the intelligence of natural language systems. The gulf between user expectations and system capabilities may lead to a disappointing user experience. So — if we want to engineer a natural language system, what are the requirements around system intelligence? This work takes a retrospective look at how we answered this question in the design of Ask Data, a natural language interaction feature for Tableau. We examine two factors contributing to perceived system intelligence: the system's ability to understand the analytic intent behind an input utterance and the ability to interpret an utterance contextually (i.e. taking into account the current visualization state and recent actions). Our aim was to understand the ways in which a system would need to support these two aspects of intelligence to enable a positive user experience. We first describe a pre-design Wizard of Oz study that offered insight into this question and narrowed the space of designs under consideration. We then reflect on the impact of this study on system development, examining how design implications from the study played out in practice. Our work contributes insights for the design of natural language interaction in visual analytics as well as a reflection on the value of pre-design empirical studies in the development of visual analytic systems.

Journal ArticleDOI
TL;DR: Just as for monolinguals, experience shapes bilingual toddlers' word knowledge, and with more robust representations, toddlers are better able to recognize words in diverse sentences.
Abstract: In bilingual language environments, infants and toddlers listen to two separate languages during the same key years that monolingual children listen to just one, and bilinguals rarely learn each of their two languages at the same rate. Learning to understand language requires them to cope with challenges not found in monolingual input, notably the use of two languages within the same utterance (e.g., Do you like the perro? or ¿Te gusta el doggy?). For bilinguals of all ages, switching between two languages can reduce the efficiency of real-time language processing. But language switching is a dynamic phenomenon in bilingual environments, presenting the young learner with many junctures where comprehension can be derailed or even supported. In this study, we tested 20 Spanish-English bilingual toddlers (18- to 30-months) who varied substantially in language dominance. Toddlers' eye movements were monitored as they looked at familiar objects and listened to single-language and mixed-language sentences in both of their languages. We found asymmetrical switch costs when toddlers were tested in their dominant versus non-dominant language, and critically, they benefited from hearing nouns produced in their dominant language, independent of switching. While bilingualism does present unique challenges, our results suggest a unified picture of early monolingual and bilingual learning. Just as for monolinguals, experience shapes bilingual toddlers' word knowledge, and with more robust representations, toddlers are better able to recognize words in diverse sentences.