
Showing papers on "Utterance" published in 2023


Journal ArticleDOI
TL;DR: In this paper, a Hindi-English code-mixed dataset, MaSaC, was developed for sarcasm detection and humor classification in conversational dialog, which to our knowledge is the first dataset of its kind.
Abstract: Sarcasm detection and humor classification are inherently subtle problems, primarily due to their dependence on the contextual and non-verbal information. Furthermore, existing studies in these two topics are usually constrained in non-English languages such as Hindi, due to the unavailability of qualitative annotated datasets. In this work, we make two major contributions considering the above limitations: (1) we develop a Hindi-English code-mixed dataset, MaSaC, for the multi-modal sarcasm detection and humor classification in conversational dialog, which to our knowledge is the first dataset of its kind; (2) we propose MSH-COMICS, a novel attention-rich neural architecture for the utterance classification. We learn efficient utterance representation utilizing a hierarchical attention mechanism that attends to a small portion of the input sentence at a time. Further, we incorporate dialog-level contextual attention mechanism to leverage the dialog history for the multi-modal classification. We perform extensive experiments for both the tasks by varying multi-modal inputs and various submodules of MSH-COMICS. We also conduct comparative analysis against existing approaches. We observe that MSH-COMICS attains superior performance over the existing models by >1 F1-score point for the sarcasm detection and 10 F1-score points in humor classification. We diagnose our model and perform thorough analysis of the results to understand the superiority and pitfalls.

13 citations


Journal ArticleDOI
TL;DR: In this paper, an end-to-end model based on generative adversarial networks (GANs) is proposed to translate spoken video to waveform without using any intermediate representation or separate waveform synthesis algorithm.
Abstract: Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on generative adversarial networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder–decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of the raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for Lip Reading in the Wild (LRW), featuring hundreds of speakers recorded entirely “in the wild.” We evaluate the generated samples in two different scenarios—seen and unseen speakers—using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.

12 citations


Journal ArticleDOI
TL;DR: The authors proposed a growing graph model for dialogue emotion detection based on local and global retrieval from the external knowledge atlas ATOMIC, which can effectively represent the dialogue as a process variable in a sequence, while the correlation among utterances can also be represented by the graph model.
Abstract: Dialogue emotion detection is always challenging due to human subjectivity and the randomness of dialogue content. In a conversation, the emotion of each person often develops via a cumulative process, which can be influenced by many elements of uncertainty. Much commonsense knowledge influences people's emotions imperceptibly, such as experiential or habitual knowledge. In the process of conversation, this commonsense knowledge can be used to enrich the semantic information of each utterance and improve the accuracy of emotion recognition. In this paper, we propose a growing graph model for dialogue emotion detection based on local and global retrieval from the external knowledge atlas ATOMIC, which can effectively represent the dialogue as a process variable in a sequence, while the correlation among utterances can also be represented by the graph model. In particular, 1) we introduce a commonsense knowledge graph for linking the commonsense knowledge retrieved from the external knowledge atlas ATOMIC, which can effectively add auxiliary information to improve each utterance's representation. 2) We propose a novel self-supervised learning method for extracting the latent topic of each dialogue. Based on this design, we also propose an effective optimization mechanism to make the representation (embedding) of the latent topic more distinctive for the next operation. 3) Finally, a cross-attention module is utilized to combine the utterances' features and the latent conversation topic information. The attention mechanism can effectively use topic information to supplement the representation of utterances and improve recognition performance. The model is tested on three popular datasets for dialogue emotion detection and is empirically demonstrated to outperform the state-of-the-art approaches. Meanwhile, to demonstrate the performance of our approach, we also build a long dialogue dataset in which the average length of each conversation is over 50 utterances. The final experimental results also demonstrate the superior performance of our approach.

4 citations


Journal ArticleDOI
TL;DR: The authors proposed a low-dimensional Supervised Cluster-level Contrastive Learning (SCCL) method, which first reduces the high-dimensional SCL space to a three-dimensional affect representation space Valence-Arousal-Dominance (VAD), then performs cluster-level contrastive learning to incorporate measurable emotion prototypes.
Abstract: A key challenge for Emotion Recognition in Conversations (ERC) is to distinguish semantically similar emotions. Some works utilise Supervised Contrastive Learning (SCL), which uses categorical emotion labels as supervision signals and contrasts in high-dimensional semantic space. However, categorical labels fail to provide quantitative information between emotions. ERC is also not equally dependent on all embedded features in the semantic space, which makes the high-dimensional SCL inefficient. To address these issues, we propose a novel low-dimensional Supervised Cluster-level Contrastive Learning (SCCL) method, which first reduces the high-dimensional SCL space to a three-dimensional affect representation space Valence-Arousal-Dominance (VAD), then performs cluster-level contrastive learning to incorporate measurable emotion prototypes. To help model the dialogue and enrich the context, we leverage pre-trained knowledge adapters to infuse linguistic and factual knowledge. Experiments show that our method achieves new state-of-the-art results with 69.81% on IEMOCAP, 65.7% on MELD, and 62.51% on DailyDialog datasets. The analysis also proves that the VAD space is not only suitable for ERC but also interpretable, with VAD prototypes enhancing its performance and stabilising the training of SCCL. In addition, the pre-trained knowledge adapters benefit the performance of the utterance encoder and SCCL. Our code is available at: https://github.com/SteveKGYang/SCCL
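For readers unfamiliar with the contrastive objective this builds on, the sketch below shows a standard supervised contrastive (SupCon) loss in PyTorch. SCCL's cluster-level variant in the low-dimensional VAD space would additionally use cluster prototypes, which is not shown here; the embedding size, batch size, and temperature are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    """Standard supervised contrastive loss over embeddings z (N, D) with labels (N,)."""
    z = F.normalize(z, dim=1)                     # contrast in cosine-similarity space
    sim = z @ z.t() / temperature                 # (N, N) similarity logits
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)        # exclude self-comparisons
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_per_anchor = pos_mask.sum(1)
    loss = -(log_prob * pos_mask).sum(1) / pos_per_anchor.clamp(min=1)
    return loss[pos_per_anchor > 0].mean()        # average over anchors with positives

z = torch.randn(8, 3)                             # e.g. low-dimensional VAD-like embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # emotion classes
print(supcon_loss(z, labels))
```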

3 citations


Journal ArticleDOI
TL;DR: The authors argue that second language (L2) pragmatic research needs to explore new avenues for integrating speech acts and interaction, by proposing a radically minimal, finite and interactional typology of speech acts.
Abstract: In this position paper, we argue that second language (L2) pragmatic research needs to explore new avenues for integrating speech acts and interaction, by proposing a radically minimal, finite and interactional typology of speech acts. While we will introduce what we mean by integrating speech acts and interaction in detail below, the following argument helps us to summarise the issue we consider in this study: When we describe language behaviour, we sometimes use terms such as ‘suggest’, ‘request’ and so on, which roughly indicate illocutionary values, and sometimes terms such as ‘agree’, ‘accept’, ‘contradict’, ‘turn down’, ‘refuse’, which are more indicative of the significance of the utterance relative to a preceding one. What we need to do is to distinguish between these two aspects of a communicative act – the illocutionary and the interactional. (Edmondson et al., 2023, pp. 25–26)

3 citations


Journal ArticleDOI
01 Feb 2023-iScience
TL;DR: In this paper, it is shown that the amplitude of phase-locked responses parametrically decreases with natural AV speech synchrony, a pattern that is consistent with predictive coding, and that the temporal statistics of AV speech affect large-scale oscillatory networks at multiple spatial and temporal resolutions.

2 citations


Journal ArticleDOI
TL;DR: In this paper, a general method was proposed to embed the triples in each graph into large-scalable models and thereby generate clinically correct responses based on the conversation history, using the recently released MedDialog(EN) dataset.
Abstract: Smart healthcare systems that make use of abundant health data can improve access to healthcare services, reduce medical costs and provide consistently high-quality patient care. Medical dialogue systems that generate medically appropriate and human-like conversations have been developed using various pre-trained language models and a large-scale medical knowledge base based on the Unified Medical Language System (UMLS). However, most knowledge-grounded dialogue models only use local structure in the observed triples, which suffers from knowledge graph incompleteness and hence cannot incorporate any information from dialogue history while creating entity embeddings. As a result, the performance of such models decreases significantly. To address this problem, we propose a general method to embed the triples in each graph into large-scalable models and thereby generate clinically correct responses based on the conversation history, using the recently released MedDialog(EN) dataset. Given a set of triples, we first mask the head entities from the triples overlapping with the patient's utterance and then compute the cross-entropy loss against the triples' respective tail entities while predicting the masked entity. This process results in a representation of the medical concepts from a graph capable of learning contextual information from dialogues, which ultimately aids in producing the gold response. We also fine-tune the proposed Masked Entity Dialogue (MED) model on a smaller corpus containing dialogues that focus only on the Covid-19 disease, named the Covid Dataset. In addition, since UMLS and other existing medical graphs lack data-specific medical information, we re-curate and perform plausible augmentation of knowledge graphs using our newly created Medical Entity Prediction (MEP) model. Empirical results on the MedDialog(EN) and Covid Dataset demonstrate that our proposed model outperforms the state-of-the-art methods in terms of both automatic and human evaluation metrics.

2 citations


Journal ArticleDOI
TL;DR: This paper analyzed the processing and understanding of ironic written sentences with or without quotation marks and asked whether and how these marks affect the subjects' reading of the sentences, finding that the marks increase the processing burden at first, independently of the meaning specification in a sentence, but then play a crucial and beneficial role in the processing and recognition of irony.
Abstract: Quotation marks are used for different purposes in language, one of which is to signal that something has to be interpreted in an ironic way, as in the utterance What a “nice” day! said on a rainy and cold day. The present contribution describes a reading time experiment in which we analyzed the processing and understanding of ironic written sentences with or without quotation marks and asked whether and how these marks affect the subjects’ reading of the sentences. Native speakers of English were exposed to two contexts and a subsequent target sentence. Semantically, context and target sentences were connected either ironically or literally or were entirely unrelated. Each of these three meaning conditions contained quotation marks or not. Within the target sentences, which were identical across the different conditions, we measured the reading time before the respective meaning (ironic, literal, unrelated) was revealed, at the phrase that made the scenario ironic, literal, or unrelated, and at the end of the sentence. Furthermore, having read the target sentence, subjects rated how well this sentence fit the preceding context, and the time they needed for their judgment was recorded as well. Results clearly show that quotation marks increase the processing burden first, independently of the meaning specification in a sentence, but then play a crucial and beneficial role in the processing and recognition of irony. We reflect upon these findings against the background of semantic and pragmatic theories of quotation.

2 citations


Journal ArticleDOI
TL;DR: In this article, the role of conventionality in the time course of indirect reply processing was investigated by comparing conventional and non-conventional indirect replies with direct replies; the results showed that for conventional indirect replies, the second phrase elicited a larger anterior negativity than in direct replies.

2 citations


Journal ArticleDOI
TL;DR: The authors introduced new expert-annotated utterance attributes to AnnoMI and described the entire data collection process in more detail, including dialogue source selection, transcription, annotation, and post-processing.
Abstract: Research on the analysis of counselling conversations through natural language processing methods has seen remarkable growth in recent years. However, the potential of this field is still greatly limited by the lack of access to publicly available therapy dialogues, especially those with expert annotations, but it has been alleviated thanks to the recent release of AnnoMI, the first publicly and freely available conversation dataset of 133 faithfully transcribed and expert-annotated demonstrations of high- and low-quality motivational interviewing (MI)—an effective therapy strategy that evokes client motivation for positive change. In this work, we introduce new expert-annotated utterance attributes to AnnoMI and describe the entire data collection process in more detail, including dialogue source selection, transcription, annotation, and post-processing. Based on the expert annotations on key MI aspects, we carry out thorough analyses of AnnoMI with respect to counselling-related properties on the utterance, conversation, and corpus levels. Furthermore, we introduce utterance-level prediction tasks with potential real-world impacts and build baseline models. Finally, we examine the performance of the models on dialogues of different topics and probe the generalisability of the models to unseen topics.

2 citations


MonographDOI
10 Mar 2023
TL;DR: This article reviewed grammatical encoding theories and evaluated them in light of relevant empirical evidence, including their claims about the scope of grammatical encoding prior to utterance onset and the degree to which planning scope is determined by linguistic structure or by cognitive factors.
Abstract: During the production of spoken sentences, the linearisation of a 'thought' is accomplished via the process of grammatical encoding, i.e., the building of a hierarchical syntactic frame that fixes the linear order of lexical concepts. While much research has demonstrated the independence of lexical and syntactic representations, exactly what is represented remains a matter of dispute. Moreover, theories differ in terms of whether words or syntax drive grammatical encoding. This debate is also central to theories of the time-course of grammatical encoding. Speaking is usually a rapid process in which articulation begins before an utterance has been entirely planned. Current theories of grammatical encoding make different claims about the scope of grammatical encoding prior to utterance onset, and the degree to which planning scope is determined by linguistic structure or by cognitive factors. The authors review current theories of grammatical encoding and evaluate them in light of relevant empirical evidence. This title is also available as Open Access on Cambridge Core.

Journal ArticleDOI
TL;DR: In this article, the authors used the RAVDESS dataset, consisting of speech utterances with two emotional intensities, strong and normal, which raises the recognition difficulty level, for the development of an efficient framework for a speech emotion recognition (SER) system.

Journal ArticleDOI
TL;DR: In this paper, the author redesigned the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements corpus so that models can be trained on entire articles, and evaluated bidirectional encoder representations from Transformers, the Long Document Transformer, and a Generative Pre-trained Transformer 2 model for utterance classification framed as a named entity recognition token classification problem.
Abstract: Current automatic writing feedback systems cannot distinguish between different discourse elements in students' writing. This is a problem because, without this ability, the guidance provided by these systems is too general for what students are trying to achieve. This is cause for concern because automated writing feedback systems are a great tool for combating the decline in student writing. According to the National Assessment of Educational Progress, less than 30 percent of high school graduates are gifted writers. If we can improve automatic writing feedback systems, we can improve the quality of student writing and stop the decline of skilled writers among students. Solutions to this problem have been proposed, the most popular being the fine-tuning of bidirectional encoder representations from Transformers models that recognize various utterance elements in students' written assignments. However, these methods have their drawbacks. For example, they do not compare the strengths and weaknesses of different models, and they encourage training models over sequences (sentences) rather than entire articles. In this article, I redesign the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements corpus so that models can be trained on entire articles, and I evaluate bidirectional encoder representations from Transformers, the Long Document Transformer, and a Generative Pre-trained Transformer 2 model for utterance classification framed as a named entity recognition token classification problem. Overall, the bidirectional encoder representations from Transformers model trained with my sequence-merging preprocessing method outperforms the standard model by 17% and 41% in overall accuracy. I also found that the Long Document Transformer model performed best in utterance classification, with an overall F1 score of 54%. However, the increase in validation loss from 0.54 to 0.79 indicates that the model is overfitting. Given this overfitting, some improvements can still be made, such as implementing early stopping techniques and providing further examples of rare utterance elements during training.
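As a rough illustration of "utterance classification framed as a named entity recognition token classification problem", the sketch below tags each sub-word token of an essay with a discourse-element label using Hugging Face's generic token-classification head. The encoder name, the label set, and the untrained classification head are placeholders, not the models or labels used in the paper.

```python
# Hedged sketch: discourse-element labelling cast as token classification (NER-style).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Claim", "I-Claim", "B-Evidence", "I-Evidence"]   # illustrative tag set
model_name = "distilbert-base-uncased"                             # placeholder encoder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

text = "Schools should start later because students need more sleep."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1)[0]   # one tag id per sub-word token
# The head is untrained here, so the tags are arbitrary until fine-tuning.
print([labels[int(i)] for i in pred])
```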

Journal ArticleDOI
02 Feb 2023-PLOS ONE
TL;DR: The authors investigated whether linguistic features that differentiate true and false utterances in English are also present in the Polish language and found that false statements are less complex in terms of vocabulary, are more concise and concrete, and have more positive words and fewer negative words.
Abstract: Lying appears in everyday oral and written communication. As a consequence, detecting it on the basis of linguistic analysis is particularly important. Our study aimed to verify whether the differences between true and false statements in terms of complexity and sentiment that were reported in previous studies can be confirmed using tools dedicated to measuring those factors. Further, we investigated whether linguistic features that differentiate true and false utterances in English—namely utterance length, concreteness, and particular parts-of-speech—are also present in the Polish language. We analyzed nearly 1,500 true and false statements, half of which were transcripts while the other half were written statements. Our results show that false statements are less complex in terms of vocabulary, are more concise and concrete, and have more positive words and fewer negative words. We found no significant differences between spoken and written lies. Using this data, we built classifiers to automatically distinguish true from false utterances, achieving an accuracy of 60%. Our results provide a significant contribution to previous conclusions regarding linguistic deception indicators.
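A minimal sketch of the final step described above: training a classifier on linguistic features to separate true from false statements. The feature columns and the random data are toy stand-ins; the study's actual features (utterance length, concreteness, sentiment word counts, parts of speech) would replace them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy stand-in features per statement: [length, type-token ratio, concreteness,
#                                       positive-word count, negative-word count]
X = rng.normal(size=(1500, 5))
y = rng.integers(0, 2, size=1500)            # 1 = true statement, 0 = false

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.2f}")   # ~0.5 on random toy data
```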

Proceedings ArticleDOI
20 Feb 2023
TL;DR: In this article, a self-supervised learning (SSL) based method was proposed for automatic fluency assessment, using wav2vec2.0 features and K-means clustering to assign a pseudo label (cluster index) to each frame.
Abstract: A typical fluency scoring system generally relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech for either the subsequent calculation of fluency-related features or directly modeling speech fluency with an end-to-end approach. This paper describes a novel ASR-free approach for automatic fluency assessment using self-supervised learning (SSL). Specifically, wav2vec2.0 is used to extract frame-level speech features, followed by K-means clustering to assign a pseudo label (cluster index) to each frame. A BLSTM-based model is trained to predict an utterance-level fluency score from frame-level SSL features and the corresponding cluster indexes. Neither speech transcription nor time stamp information is required in the proposed system. It is ASR-free and can potentially avoid the effect of ASR errors in practice. Experimental results carried out on non-native English databases show that the proposed approach significantly improves the performance in the "open response" scenario as compared to previous methods and matches the recently reported performance in the "read aloud" scenario.
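A toy sketch of the described pipeline: frame-level SSL features are clustered with K-means to obtain pseudo labels, and a BLSTM regresses an utterance-level fluency score from the features plus cluster indexes. Random tensors stand in for wav2vec2.0 outputs, and the layer sizes and cluster count are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

feats = torch.randn(400, 768)                    # stand-in for wav2vec2.0 frame features

# K-means pseudo labels (cluster indexes), one per frame
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats.numpy())
cluster_ids = torch.as_tensor(kmeans.labels_, dtype=torch.long)

class FluencyScorer(nn.Module):
    def __init__(self, feat_dim=768, n_clusters=50, emb_dim=32, hidden=128):
        super().__init__()
        self.cluster_emb = nn.Embedding(n_clusters, emb_dim)
        self.blstm = nn.LSTM(feat_dim + emb_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)     # utterance-level fluency score

    def forward(self, feats, cluster_ids):
        x = torch.cat([feats, self.cluster_emb(cluster_ids)], dim=-1)
        out, _ = self.blstm(x.unsqueeze(0))      # add a batch dimension
        return self.head(out.mean(dim=1)).squeeze(-1)

score = FluencyScorer()(feats, cluster_ids)
print(score.shape)                               # torch.Size([1])
```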

Proceedings ArticleDOI
03 Jul 2023
TL;DR: This work proposes a QUEry-Enhanced Network (QUEEN), which adopts a fast and effective edit operation scoring network to model the relation between two tokens and achieves state-of-the-art performance on several public datasets.
Abstract: Incomplete utterance rewriting has recently attracted wide attention. However, previous works do not consider the semantic structural information between the incomplete utterance and the rewritten utterance, or they model the semantic structure implicitly and insufficiently. To address this problem, we propose a QUEry-Enhanced Network (QUEEN). Firstly, our proposed query template explicitly brings guided semantic structural knowledge between the incomplete utterance and the rewritten utterance, making the model perceive where to refer back to or which omitted tokens to recover. Then, we adopt a fast and effective edit operation scoring network to model the relation between two tokens. Benefiting from this extra information and the well-designed network, QUEEN achieves state-of-the-art performance on several public datasets.
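The following sketch illustrates the general idea of an edit-operation scoring network: every (context token, utterance token) pair is scored over a small set of edit operations with a bilinear layer. It is an assumption-laden illustration of pairwise token scoring, not QUEEN's actual scoring network or operation inventory.

```python
import torch
import torch.nn as nn

class EditScorer(nn.Module):
    """Scores every (context token, utterance token) pair over a few edit operations."""
    def __init__(self, dim=256, n_ops=3):        # e.g. none / insert-before / replace
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, n_ops)

    def forward(self, ctx, utt):
        # ctx: (Lc, D), utt: (Lu, D) contextualised token embeddings
        Lc, Lu = ctx.size(0), utt.size(0)
        c = ctx.unsqueeze(1).expand(Lc, Lu, -1)
        u = utt.unsqueeze(0).expand(Lc, Lu, -1)
        scores = self.bilinear(c.reshape(-1, c.size(-1)),
                               u.reshape(-1, u.size(-1)))
        return scores.view(Lc, Lu, -1)           # edit-operation logits per token pair

scores = EditScorer()(torch.randn(12, 256), torch.randn(8, 256))
print(scores.shape)                              # torch.Size([12, 8, 3])
```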

Journal ArticleDOI
TL;DR: In this article, the authors propose to generate out-of-vocabulary (OOV) words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words.

Proceedings ArticleDOI
04 Jun 2023
TL;DR: In this paper, both hand-crafted feature-based and end-to-end raw waveform DNN approaches for modeling speech emotion information in such short segments were investigated, and the top-performing end-to-end approach was found to emphasize cepstral information instead of spectral information (such as flux and harmonicity).
Abstract: Conventionally, speech emotion recognition has been approached by utterance or turn-level modelling of input signals, either through extracting hand-crafted low-level descriptors, bag-of-audio-words features or by feeding long-duration signals directly to deep neural networks (DNNs). While this approach has been successful, there is a growing interest in modelling speech emotion information at the short segment level, at around 250ms-500ms (e.g. the 2021-22 MuSe Challenges). This paper investigates both hand-crafted feature-based and end-to-end raw waveform DNN approaches for modelling speech emotion information in such short segments. Through experimental studies on IEMOCAP corpus, we demonstrate that the end-to-end raw waveform modelling approach is more effective than using hand-crafted features for short-segment level modelling. Furthermore, through relevance signal-based analysis of the trained neural networks, we observe that the top performing end-to-end approach tends to emphasize cepstral information instead of spectral information (such as flux and harmonicity).
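To make the "end-to-end raw waveform" setting concrete, here is a minimal 1-D CNN that maps a roughly 500 ms waveform segment (8000 samples at 16 kHz) to emotion logits. The layer configuration and the four-class output are illustrative assumptions, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class RawWaveformSER(nn.Module):
    """Minimal 1-D CNN over a ~500 ms raw waveform segment (16 kHz -> 8000 samples)."""
    def __init__(self, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),             # pool over time
        )
        self.head = nn.Linear(128, n_emotions)

    def forward(self, wav):                      # wav: (batch, samples)
        h = self.conv(wav.unsqueeze(1))          # (batch, 128, 1)
        return self.head(h.squeeze(-1))          # emotion logits

logits = RawWaveformSER()(torch.randn(2, 8000))
print(logits.shape)                              # torch.Size([2, 4])
```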

Book ChapterDOI
01 Jan 2023
TL;DR: Wang et al. proposed a Customized Conversational Recommender System (CCRS), which customizes the CRS model for users from three perspectives; for example, to speak like a human, the system can adopt different speaking styles according to the current dialogue context.
Abstract: Conversational recommender systems (CRS) aim to capture user’s current intentions and provide recommendations through real-time multi-turn conversational interactions. As a human-machine interactive system, it is essential for CRS to improve the user experience. However, most CRS methods neglect the importance of user experience. In this paper, we propose two key points for CRS to improve the user experience: (1) Speaking like a human, human can speak with different styles according to the current dialogue context. (2) Identifying fine-grained intentions, even for the same utterance, different users have diverse fine-grained intentions, which are related to users’ inherent preference. Based on the observations, we propose a novel CRS model, coined Customized Conversational Recommender System (CCRS), which customizes CRS model for users from three perspectives. For human-like dialogue services, we propose multi-style dialogue response generator which selects context-aware speaking style for utterance generation. To provide personalized recommendations, we extract user’s current fine-grained intentions from dialogue context with the guidance of user’s inherent preferences. Finally, to customize the model parameters for each user, we train the model from the meta-learning perspective. Extensive experiments and a series of analyses have shown the superiority of our CCRS on both the recommendation and dialogue services.

Proceedings ArticleDOI
04 Jun 2023
TL;DR: In this article, a simple and effective method for modeling context and utterance information concurrently was proposed for action item detection in the ICASSP 2023 Signal Processing Grand Challenge, achieving remarkable improvements over baseline models.
Abstract: Action item detection aims at recognizing sentences containing information about actionable tasks, which can help people quickly grasp core tasks in the meeting without going through the redundant meeting contents. Therefore, in this paper, we thoroughly describe our carefully designed solution for the Action Item Detection Track of the General Meeting Understanding and Generation (MUG) challenge in the ICASSP 2023 Signal Processing Grand Challenge. Specifically, we systematically analyze the task instances provided by MUG and find that the key ingredient for successful action item detection is taking the dialogue context information into consideration. To this end, we design a simple and effective method for modelling context and utterance information concurrently. The experimental results show our method achieves remarkable improvements over baseline models, with an absolute increase of 0.62 in F1 score on the validation set. The stable generalizability of our method is further verified by our score on the final test set.
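One straightforward way to "model context and utterance information concurrently", sketched below, is to encode the dialogue context and the target utterance as a sentence pair and classify the pair. The encoder name and the example sentences are placeholders; this is not the challenge submission's actual model.

```python
# Hedged sketch: classify an utterance as action item or not, given its context.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"                 # placeholder encoder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

context = "Let's review the budget. We are over on travel."
utterance = "Alice will send the revised numbers by Friday."

# text / text_pair lets the encoder see context and utterance together,
# separated by [SEP], so the classifier can use both concurrently.
inputs = tok(context, utterance, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, 2): [not action item, action item]
print(logits.softmax(-1))                        # untrained head, so probabilities are arbitrary
```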

Journal ArticleDOI
TL;DR: This paper showed that participants often interpret OVS sentences nonliterally, and the probability of nonliteral interpretations depended on the Levenshtein distance between the perceived sentence and the (potentially intended) SVO version of the sentence.
Abstract: Under the noisy-channel framework of language comprehension, comprehenders infer the speaker's intended meaning by integrating the perceived utterance with their knowledge of the language, the world, and the kinds of errors that can occur in communication. Previous research has shown that, when sentences are improbable under the meaning prior (implausible sentences), participants often interpret them nonliterally. The rate of nonliteral interpretation is higher when the errors that could have transformed the intended utterance into the perceived utterance are more likely. However, previous experiments on noisy channel processing mostly relied on implausible sentences, and it is unclear whether participants' nonliteral interpretations were evidence of noisy channel processing or the result of trying to conform to the experimenter's expectations in an experiment with nonsensical sentences. In the current study, we used the unique properties of Russian, an understudied language in the psycholinguistics literature, to test noisy-channel comprehension using only simple plausible sentences. The prior plausibility of sentences was tied only to their word order; subject-verb-object (SVO) sentences were more probable under the structural prior than object-verb-subject (OVS) sentences. In two experiments, we show that participants often interpret OVS sentences nonliterally, and the probability of nonliteral interpretations depended on the Levenshtein distance between the perceived sentence and the (potentially intended) SVO version of the sentence. The results show that the structural prior guides people's final interpretation, independent of the presence of semantic implausibility. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
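Since the analysis keys on the Levenshtein distance between the perceived OVS sentence and its potentially intended SVO counterpart, here is a small word-level edit-distance function with an illustrative English stand-in for the Russian stimuli (swapping subject and object yields a distance of 2).

```python
def levenshtein(a, b):
    """Word-level edit distance between two sentences."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (wa != wb))  # substitution
    return dp[-1]

# perceived OVS-like sentence vs. its (potentially intended) SVO counterpart
print(levenshtein("the ball kicked the boy", "the boy kicked the ball"))  # 2
```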

Journal ArticleDOI
TL;DR: In this article, the authors explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions: passengers give instructions to AMIE, and the agent should parse such commands properly and trigger the appropriate functionality of the AV system.
Abstract: Understanding passenger intents and extracting relevant slots are important building blocks towards developing contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our current explorations, we focused on AMIE scenarios describing usages around setting or changing the destination and route, updating driving behavior or speed, finishing the trip and other use-cases to support various natural commands. We collected a multi-modal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game activity. After exploring various recent Recurrent Neural Networks (RNN) based techniques, we introduced our own hierarchical joint models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1 scores of 0.91 for utterance-level intent detection and 0.96 for slot filling tasks. In addition, we conducted initial speech-to-text explorations by comparing intent/slot models trained and tested on human transcriptions versus noisy Automatic Speech Recognition (ASR) outputs. Finally, we compared the results with single passenger rides versus the rides with multiple passengers.
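A minimal sketch of a joint intent-detection and slot-filling model of the general kind described above: a shared encoder with an utterance-level intent head and a per-token slot head. The vocabulary size, label counts, and BiLSTM encoder are assumptions, not the authors' hierarchical joint architecture.

```python
import torch
import torch.nn as nn

class JointIntentSlot(nn.Module):
    """Shared encoder with an utterance-level intent head and a token-level slot head."""
    def __init__(self, vocab=5000, dim=128, n_intents=10, n_slots=20):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, n_intents)
        self.slot_head = nn.Linear(2 * dim, n_slots)

    def forward(self, token_ids):                # (batch, seq)
        h, _ = self.lstm(self.emb(token_ids))    # (batch, seq, 2*dim)
        intent_logits = self.intent_head(h.mean(dim=1))   # one intent per utterance
        slot_logits = self.slot_head(h)                   # one slot label per token
        return intent_logits, slot_logits

intent, slots = JointIntentSlot()(torch.randint(0, 5000, (2, 12)))
print(intent.shape, slots.shape)                 # (2, 10) (2, 12, 20)
```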

Journal ArticleDOI
TL;DR: In this article, the utility of speaker embeddings, representations extracted from a trained speaker recognition network, as robust features for detecting emotions was investigated, motivated by the fact that the datasets used for developing emotion recognition systems remain significantly smaller than those used for other speech systems.
Abstract: The robustness of an acoustic emotion recognition system hinges on first having access to features that represent an acoustic input signal. These representations should abstract extraneous low-level variations present in acoustic signals and only capture speaker characteristics relevant for emotion recognition. Previous research has demonstrated that, in other classification tasks, when large labeled datasets are available, neural networks trained on these data learn to extract robust features from the input signal. However, the datasets used for developing emotion recognition systems remain significantly smaller than those used for developing other speech systems. Thus, acoustic emotion recognition systems remain in need of robust feature representations. In this work, we study the utility of speaker embeddings, representations extracted from a trained speaker recognition network, as robust features for detecting emotions. We first study the relationship between emotions and speaker embeddings, and demonstrate how speaker embeddings highlight the differences that exist between neutral speech and emotionally expressive speech. We quantify the modulations that variations in emotional expression incur on speaker embeddings, and show how these modulations are greater than those incurred from lexical variations in an utterance. Finally, we demonstrate how speaker embeddings can be used as a replacement for traditional low-level acoustic features for emotion recognition.

Journal ArticleDOI
01 Feb 2023-Sensors
TL;DR: In this paper, the authors proposed two architectures of speaker identification systems based on a combination of diarization and identification methods, which operate on the basis of segment-level or group-level classification.
Abstract: Diarization is an important task when working with audio data, as it solves the problem of dividing one analyzed call recording into several speech recordings, each of which belongs to one speaker. Diarization systems segment audio recordings by defining the time boundaries of utterances, and typically use unsupervised methods to group utterances belonging to individual speakers, but do not answer the question “who is speaking?” On the other hand, there are biometric systems that identify individuals on the basis of their voices, but such systems are designed with the prerequisite that only one speaker is present in the analyzed audio recording. However, some applications involve the need to identify multiple speakers that interact freely in an audio recording. This paper proposes two architectures of speaker identification systems based on a combination of diarization and identification methods, which operate on the basis of segment-level or group-level classification. The open-source PyAnnote framework was used to develop the system. The performance of the speaker identification system was verified through the application of the AMI Corpus open-source audio database, which contains 100 h of annotated and transcribed audio and video data. The research method consisted of four experiments to select the best-performing supervised diarization algorithms on the basis of PyAnnote. The first experiment was designed to investigate how the selection of the distance function between vector embeddings affects the reliability of identification of a speaker’s utterance in a segment-level classification architecture. The second experiment examines the architecture of cluster-centroid (group-level) classification, i.e., the selection of the best clustering and classification methods. The third experiment investigates the impact of different segmentation algorithms on the accuracy of identifying speaker utterances, and the fourth examines embedding window sizes. Experimental results demonstrated that the group-level approach offered better identification results compared to the segment-level approach, while the latter had the advantage of real-time processing.
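The difference between the two architectures can be sketched with plain NumPy: segment-level classification identifies every diarized segment independently against enrolled speaker embeddings, while group-level classification identifies each diarization cluster once via its centroid. The random embeddings and cosine scoring are illustrative stand-ins; the paper's systems are built on PyAnnote.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# enrolled speaker voiceprints (one embedding per known speaker)
enrolled = {"alice": rng.normal(size=192), "bob": rng.normal(size=192)}

# embeddings of diarized segments, grouped by diarization cluster
clusters = {0: rng.normal(size=(5, 192)), 1: rng.normal(size=(7, 192))}

# segment-level: identify every segment independently
segment_ids = {
    c: [max(enrolled, key=lambda s: cosine(seg, enrolled[s])) for seg in segs]
    for c, segs in clusters.items()
}

# group-level: identify each cluster once via its centroid
group_ids = {
    c: max(enrolled, key=lambda s: cosine(segs.mean(axis=0), enrolled[s]))
    for c, segs in clusters.items()
}
print(segment_ids, group_ids)
```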

Journal ArticleDOI
TL;DR: This paper investigated how variations in the language-related gene FOXP2 and executive function-related genes COMT, BDNF, and Kibra/WWC1 affect bilingual language control during two phases of speech production, namely the language schema phase and lexical response phase.
Abstract: Previous studies have debated whether the ability for bilinguals to mentally control their languages is a consequence of their experiences switching between languages or whether it is a specific, yet highly‐adaptive, cognitive ability. The current study investigates how variations in the language‐related gene FOXP2 and executive function‐related genes COMT, BDNF, and Kibra/WWC1 affect bilingual language control during two phases of speech production, namely the language schema phase (i.e., the selection of one language or another) and lexical response phase (i.e., utterance of the target). Chinese–English bilinguals (N = 119) participated in a picture‐naming task involving cued language switches. Statistical analyses showed that both genes significantly influenced language control on neural coding and behavioral performance. Specifically, FOXP2 rs1456031 showed a wide‐ranging effect on language control, including RTs, F(2, 113) = 4.00, FDR p = .036, and neural coding across three‐time phases (N2a: F(2, 113) = 4.96, FDR p = .014; N2b: F(2, 113) = 4.30, FDR p = .028, LPC: F(2, 113) = 2.82, FDR p = .060), while the COMT rs4818 (ts >2.69, FDR ps < .05), BDNF rs6265 (Fs >5.31, FDR ps < .05), and Kibra/WWC1 rs17070145 (ts > −3.29, FDR ps < .05) polymorphisms influenced two‐time phases (N2a and N2b). Time‐resolved correlation analyses revealed that the relationship between neural coding and cognitive performance is modulated by genetic variations in all four genes. In all, these findings suggest that bilingual language control is shaped by an individual's experience switching between languages and their inherent genome.

Journal ArticleDOI
TL;DR: This paper proposed a VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a target utterance reconstruction task based on the Variational Autoencoder and then disentangles three affect representations, Valence-Arousal-Dominance (VAD), from the latent space.
Abstract: In Emotion Recognition in Conversations (ERC), the emotions of target utterances are closely dependent on their context. Therefore, existing works train the model to generate the response of the target utterance, which aims to recognise emotions leveraging contextual information. However, adjacent response generation ignores long-range dependencies and provides limited affective information in many cases. In addition, most ERC models learn a unified distributed representation for each utterance, which lacks interpretability and robustness. To address these issues, we propose a VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a target utterance reconstruction task based on Variational Autoencoder, then disentangles three affect representations Valence-Arousal-Dominance (VAD) from the latent space. We also enhance the disentangled representations by introducing VAD supervision signals from a sentiment lexicon and minimising the mutual information between VAD distributions. Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets. Further analysis proves the effectiveness of each proposed module and the quality of disentangled VAD representations. The code is available at https://github.com/SteveKGYang/VAD-VAE.
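A toy sketch of the core idea of supervising part of a VAE latent space with VAD scores: the first three latent dimensions are pushed toward Valence-Arousal-Dominance targets while the usual reconstruction and KL terms are kept. This omits the paper's utterance-text reconstruction and mutual-information minimisation, and all dimensions and weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VADVAESketch(nn.Module):
    """Toy VAE whose first three latent dimensions are supervised with VAD scores."""
    def __init__(self, in_dim=768, z_dim=16):
        super().__init__()
        self.enc_mu = nn.Linear(in_dim, z_dim)
        self.enc_logvar = nn.Linear(in_dim, z_dim)
        self.dec = nn.Linear(z_dim, in_dim)

    def forward(self, x, vad_target):
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterisation
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        vad_loss = F.mse_loss(mu[:, :3], vad_target)             # VAD supervision signal
        return F.mse_loss(recon, x) + kl + vad_loss

loss = VADVAESketch()(torch.randn(4, 768), torch.rand(4, 3))
print(loss)
```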

Proceedings ArticleDOI
27 Mar 2023
TL;DR: In this article, a Transformer is used to synthesize the sound from voice articulation, rhythm, and intonation to create explosion sounds, a kind of sound effect that has various representations.
Abstract: Sound creators use knowledge, techniques, and experience to create sound effects for media works, ensuring that these sound effects are suitable for different situations and dramatic presentations. This is a challenging task for inexperienced creators and beginners, but it is relatively easy for anyone to imagine desired sounds and express them as onomatopoeic utterances. Therefore, we propose a novel technique to easily create a desired sound effect by synthesizing the sound from voice articulation, rhythm, and intonation. In this research, we focus on explosion sounds, a kind of sound effect that has various representations. The proposed technique uses a Transformer, which is trained to convert speech into explosion sounds. This paper describes the Transformer-based synthesizer model, the datasets used to train it, and some current results.

Journal ArticleDOI
TL;DR: Zhang et al. proposed a novel Multimodal Adversarial Learning Network (MALN) to exploit both the commonality and diversity of unimodal features for emotion classification.
Abstract: Multimodal emotion recognition in conversations (ERC) aims to identify the emotional state of constituent utterances expressed by multiple speakers in dialogue from multimodal data. Existing multimodal ERC approaches focus on modeling the global context of the dialogue and neglect to mine the characteristic information from the corresponding utterances expressed by the same speaker. Additionally, information from different modalities exhibits commonality and diversity for emotional expression. The commonality and diversity of multimodal information compensate for each other but have not been effectively exploited in previous multimodal ERC works. To tackle these issues, we propose a novel Multimodal Adversarial Learning Network (MALN). MALN first mines the speaker’s characteristics from context sequences and then incorporates them with the unimodal features. Afterward, we design a novel adversarial module, AMDM, to exploit both commonality and diversity from the unimodal features. Finally, AMDM fuses the different modalities to generate refined utterance representations for emotion classification. Extensive experiments are conducted on two public multimodal ERC datasets, IEMOCAP and MELD. Through the experiments, MALN shows its superiority over the state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this paper, a time-domain adaptive attention network (TAANet) with local and global attention networks is proposed for single-channel speech separation, building on self-attention based networks.
Abstract: Recent years have witnessed great progress in single-channel speech separation by applying self-attention based networks. Despite the excellent performance in mining relevant long-sequence contextual information, self-attention networks cannot perfectly focus on subtle details in speech signals, such as temporal or spectral continuity, spectral structure, and timbre. To tackle this problem, we propose a time-domain adaptive attention network (TAANet) with local and global attention networks. Channel and spatial attention are introduced in the local attention networks to focus on subtle details of the speech signals (frame-level features). In the global attention networks, a self-attention mechanism is used to explore the global associations of the speech contexts (utterance-level features). Moreover, we model the speech signal serially using multiple local and global attention blocks. This cascade structure enables our model to focus on local and global features adaptively, compared with other speech separation feature extraction methods, further boosting the separation performance. Extensive experiments on benchmark datasets demonstrate that our approach obtains results superior to other end-to-end speech separation methods (20.7 dB SI-SNRi and 20.9 dB SDRi on WSJ0-2mix).
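To illustrate the "channel attention" component of the local attention network, here is a squeeze-and-excitation style channel-attention block over frame-level features; the channel count and reduction ratio are assumptions, and this is only one ingredient of TAANet, not the full model.

```python
import torch
import torch.nn as nn

class ChannelAttention1d(nn.Module):
    """Squeeze-and-excitation style channel attention over (batch, C, T) features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, T)
        w = self.fc(x.mean(dim=-1))              # squeeze over time -> per-channel weights (B, C)
        return x * w.unsqueeze(-1)               # re-weight channels

out = ChannelAttention1d(64)(torch.randn(2, 64, 100))
print(out.shape)                                 # torch.Size([2, 64, 100])
```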

Journal ArticleDOI
12 Jan 2023-Mind
TL;DR: In this paper, deniability is analysed as an epistemic phenomenon: a speaker has deniability if she can make it epistemically irrational for her audience to reason in certain ways, and it is distinguished from a practical correlate called untouchability.
Abstract: Communication can be risky. Like other kinds of actions, it comes with potential costs. For instance, an utterance can be embarrassing, offensive, or downright illegal. In the face of such risks, speakers tend to act strategically and seek ‘plausible deniability’. In this paper, we propose an account of the notion of deniability at issue. On our account, deniability is an epistemic phenomenon. A speaker has deniability if she can make it epistemically irrational for her audience to reason in certain ways. To avoid predictable confusion, we distinguish deniability from a practical correlate we call ‘untouchability’. Roughly, a speaker has untouchability if she can make it practically irrational for her audience to act in certain ways. These accounts shed light on the nature of strategic speech and suggest countermeasures against strategic speech.