
Showing papers in "Speech Communication in 2019"


Journal ArticleDOI
TL;DR: Results presented in this work show that the performance of an MMSE approach to speech enhancement significantly increases when utilising deep learning, and MMSE approaches utilising the proposed a priori SNR estimator are able to achieve higher enhanced speech quality and intelligibility scores than recent masking- and mapping-based deep learning approaches.

94 citations
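For readers unfamiliar with how an a priori SNR estimate drives an MMSE-style enhancer, here is a minimal NumPy sketch. It assumes the deep-learning a priori SNR estimator is available as a black box (the random xi_hat_db below is only a placeholder) and uses a simple Wiener gain rather than the paper's exact MMSE estimator.

```python
import numpy as np

def mmse_style_enhance(noisy_stft, xi_db):
    """Apply a Wiener-style gain derived from an a priori SNR estimate.

    noisy_stft : complex STFT of the noisy speech, shape (freq, frames)
    xi_db      : a priori SNR estimate in dB (assumed here to come from a
                 trained DNN, as in the paper; any estimator could be used)
    """
    xi = 10.0 ** (xi_db / 10.0)          # dB -> linear a priori SNR
    gain = xi / (1.0 + xi)               # Wiener gain; an MMSE-STSA gain would differ
    return gain * noisy_stft             # enhanced STFT (invert with an ISTFT)

# Toy usage with random placeholders instead of a real DNN estimate.
noisy = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
xi_hat_db = np.random.uniform(-10, 20, size=noisy.shape)   # stand-in for the DNN output
enhanced = mmse_style_enhance(noisy, xi_hat_db)
```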


Journal ArticleDOI
TL;DR: This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes as input raw speech signal and estimates the HMM states class conditional probabilities at the output.

91 citations
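A toy PyTorch sketch of the general idea: raw waveform in, HMM-state class-conditional probabilities out. The layer sizes, filter lengths and number of states are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Maps a window of raw samples to log HMM-state posteriors (toy sizes)."""
    def __init__(self, n_states=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=10), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_states)

    def forward(self, x):                 # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)  # (batch, 64)
        return torch.log_softmax(self.classifier(h), dim=-1)

# One 250 ms window at 16 kHz -> log state posteriors for each utterance window.
posteriors = RawWaveformCNN()(torch.randn(8, 1, 4000))
print(posteriors.shape)                   # torch.Size([8, 500])
```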


Journal ArticleDOI
TL;DR: It is shown that this method can effectively reduce the confusion between emotions, thus improving the speech emotion recognition rate.

74 citations


Journal ArticleDOI
TL;DR: A global approach for a speech emotion recognition (SER) system using empirical mode decomposition (EMD) is proposed; a combination of all features extracted from the IMFs enhances the performance of the SER system, achieving a 91.16% recognition rate.

63 citations
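A minimal sketch of EMD-based feature extraction, assuming the PyEMD package and using generic per-IMF statistics (energy and zero-crossing rate); the paper's actual feature set and classifier are not reproduced.

```python
import numpy as np
from PyEMD import EMD   # assumes the PyEMD package (pip install EMD-signal)

def emd_features(signal, max_imfs=5):
    """Decompose a speech signal into IMFs and pool simple statistics per IMF."""
    imfs = EMD().emd(signal)[:max_imfs]
    feats = []
    for imf in imfs:
        energy = float(np.mean(imf ** 2))                            # per-IMF energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(imf))) > 0))      # rough zero-crossing rate
        feats.extend([energy, zcr])
    return np.array(feats)

features = emd_features(np.random.randn(16000))   # 1 s of placeholder audio at 16 kHz
```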


Journal ArticleDOI
TL;DR: In this article, a dynamic Bayesian network (DBN) is proposed to bridge the gap between rule-based and data-driven approaches, where a discrete variable is added to constrain the behaviors on the underlying constraint.

48 citations


Journal ArticleDOI
TL;DR: The experiments show that the new data augmentation approaches yield performance improvements under all noisy conditions, including additive noise, channel distortion and reverberation, and that a relative 6% to 14% WER reduction can be obtained with an advanced acoustic model.

40 citations
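A NumPy sketch of two of the three augmentation types mentioned above: reverberation via RIR convolution and additive noise at a target SNR. Channel distortion is omitted, and the synthetic RIR below is only a placeholder.

```python
import numpy as np

def augment(clean, noise, rir, snr_db):
    """Create one augmented utterance: reverberate, then add noise at snr_db."""
    # Reverberation via convolution with a (measured or simulated) room impulse response.
    reverbed = np.convolve(clean, rir)[: len(clean)]

    # Loop/trim the noise to the utterance length.
    reps = int(np.ceil(len(reverbed) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverbed)]

    # Scale the noise to reach the requested SNR.
    eps = 1e-12
    speech_pow = np.mean(reverbed ** 2) + eps
    noise_pow = np.mean(noise ** 2) + eps
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return reverbed + scale * noise

aug = augment(np.random.randn(16000),
              np.random.randn(8000),
              rir=np.exp(-np.arange(2000) / 300.0) * np.random.randn(2000),  # toy RIR
              snr_db=10)
```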


Journal ArticleDOI
TL;DR: Although spoken-word recognition in the presence of background noise is harder in a non-native language than in one's native language, this difference can be explained by differences in language exposure, which influences the uptake and use of phonetic and contextual information in the speech signal for spoken-word recognition.

36 citations


Journal ArticleDOI
TL;DR: This study uses a three-layer model composed of acoustic features, semantic primitives, and emotion dimensions to map acoustics into emotion dimensions, and classifies the continuous emotion-dimensional values into basic categories using logistic model trees.

32 citations
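A rough scikit-learn sketch of the three-layer idea: acoustic features are mapped to semantic primitives, primitives to continuous emotion dimensions, and the dimensions are then classified into basic categories. The data are random placeholders, and a plain decision tree stands in for the logistic model trees used in the paper, since scikit-learn does not provide them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder training data: acoustic features -> semantic primitives -> dimensions -> category.
X_acoustic = rng.normal(size=(200, 20))        # e.g. F0 / energy / spectral statistics
Y_primitives = rng.normal(size=(200, 5))       # e.g. listener ratings of semantic primitives
Y_dimensions = rng.normal(size=(200, 2))       # valence, arousal
y_category = rng.integers(0, 4, size=200)      # basic emotion labels

# Layer 1: acoustic features -> semantic primitives.
to_primitives = LinearRegression().fit(X_acoustic, Y_primitives)
# Layer 2: semantic primitives -> continuous emotion dimensions.
to_dimensions = LinearRegression().fit(to_primitives.predict(X_acoustic), Y_dimensions)
# Layer 3: emotion dimensions -> basic categories (decision tree standing in
# for the paper's logistic model trees).
to_category = DecisionTreeClassifier(max_depth=4).fit(
    to_dimensions.predict(to_primitives.predict(X_acoustic)), y_category)

pred = to_category.predict(
    to_dimensions.predict(to_primitives.predict(rng.normal(size=(1, 20)))))
```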


Journal ArticleDOI
TL;DR: The results showed that the glottal features in combination with the openSMILE-based acoustic features resulted in improved classification accuracies, which validates the complementary nature of glottal features.

31 citations


Journal ArticleDOI
TL;DR: Golden Speaker Builder is presented, a tool that allows learners to generate a personalized “golden-speaker” voice: one that mirrors their own voice but with a native accent.

30 citations


Journal ArticleDOI
TL;DR: This work proposes a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal, and demonstrates the applicability of the approach for more generalized speech enhancement, where it has to regenerate voices from whispered signals.
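A heavily reduced PyTorch sketch of the adversarial setup on raw signals: a generator maps corrupted (e.g. whispered) waveforms to clean ones, while a discriminator scores realism. The tiny networks and the least-squares GAN plus L1 losses below are assumptions, not the paper's full-scale models.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(              # corrupted waveform -> "clean" waveform
    nn.Conv1d(1, 16, 31, padding=15), nn.PReLU(),
    nn.Conv1d(16, 1, 31, padding=15), nn.Tanh(),
)
discriminator = nn.Sequential(          # waveform -> real/fake score
    nn.Conv1d(1, 16, 31, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(16, 1, 31, stride=4), nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)

corrupted = torch.randn(4, 1, 16000)    # placeholder corrupted/whispered segments
clean = torch.randn(4, 1, 16000)        # placeholder clean targets

enhanced = generator(corrupted)
# Least-squares GAN losses plus an L1 term pulling the output toward the target.
d_loss = ((discriminator(clean) - 1) ** 2).mean() + (discriminator(enhanced.detach()) ** 2).mean()
g_loss = ((discriminator(enhanced) - 1) ** 2).mean() + 100 * (enhanced - clean).abs().mean()
```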

Journal ArticleDOI
TL;DR: A neural-network-based ideal ratio mask estimator, learned from a multi-condition data set, is adopted to incorporate prior information obtained from speech/noise interactions and the long acoustic context into CGMM-based beamforming, yielding beamformed speech with a higher signal-to-noise ratio (SNR) than the original noisy speech signal.
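For context, the ideal ratio mask that such an estimator is trained to predict can be computed from parallel speech and noise as in the NumPy sketch below; the CGMM beamformer itself and the network are not shown.

```python
import numpy as np

def ideal_ratio_mask(speech_stft, noise_stft):
    """Oracle IRM used as the training target for a mask estimator."""
    s_pow = np.abs(speech_stft) ** 2
    n_pow = np.abs(noise_stft) ** 2
    return s_pow / (s_pow + n_pow + 1e-12)

# At test time a network predicts the mask from the noisy spectrogram; the masked
# (or beamformed) output then has a higher SNR than the input mixture.
speech = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
noise = 0.5 * (np.random.randn(257, 100) + 1j * np.random.randn(257, 100))
mask = ideal_ratio_mask(speech, noise)
enhanced = mask * (speech + noise)
```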

Journal ArticleDOI
TL;DR: The results showed that the second experimental group (CAPT) performed better than the other groups in developing speaking skills, which has pedagogical implications for curriculum designers, interpreter training programs, and all who are involved in language study and pedagogy.

Journal ArticleDOI
TL;DR: This work presents a neural architecture that serves as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by LSTM-based recurrent neural networks, and shows that this novel architecture is indeed a better alternative.

Journal ArticleDOI
TL;DR: In this article, a bimodal recurrent neural network (BRNN) framework was proposed for speech activity detection in audiovisual speech processing systems, where acoustic and visual features are directly learned from the raw data during training.

Journal ArticleDOI
TL;DR: A new, extended version of the voiceHome corpus for distant-microphone speech processing in domestic environments, which includes short reverberated, noisy utterances spoken in French by 12 native French talkers in diverse realistic acoustic conditions and recorded by an 8-microphone device at various angles and distances and in various noise conditions.

Journal ArticleDOI
TL;DR: This study introduces a new environment, called OPENGLOT, for the evaluation of glottal inverse filtering (GIF), which is versatile and open and can be used by anyone who wants to evaluate his or her new GIF method and compare it objectively to previously developed benchmark techniques.

Journal ArticleDOI
TL;DR: It is proposed as a hypothesis for further study, that gender mediates more complex interactions between sociocultural norms, conversation context, and other factors.

Journal ArticleDOI
TL;DR: This work examines different performance measures for estimating the word error rates of simulated behind-the-ear hearing aid signals and for detecting the azimuth angle of the target source in 180-degree spatial scenes.

Journal ArticleDOI
TL;DR: In this paper, the authors performed an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field.

Journal ArticleDOI
TL;DR: This work describes the collection of a Hinglish (Hindi-English) code-switching database at the Indian Institute of Technology Guwahati (IITG) which is referred to as the IITG-HingCoS corpus, and elaborates the sources and the protocol used for collecting the corpus.

Journal ArticleDOI
TL;DR: Results suggest alignments between articulatory movements and pitch trajectories, with downward or upward head and eyebrow movements following the dipping and rising tone trajectories respectively, lip closing movement being associated with the falling tone, and minimal movements for the level tone.

Journal ArticleDOI
TL;DR: This study examines both manually and automatically labeled speech disfluencies features, demonstrating that detailed disfluency analysis leads to considerable gains, of up to 100% in absolute depression classification accuracy, especially with affective considerations, when compared with the affect-agnostic acoustic baseline.

Journal ArticleDOI
TL;DR: A Weighted-Correlation Principal Component Analysis (WCR-PCA) for efficient transformation of speech features in speaker recognition is introduced, and extensions to improve the extraction of MFCC and LPCC features of speech are proposed.
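A NumPy sketch of the general idea of PCA driven by a weighted correlation matrix of frame-level features; the specific weighting scheme of WCR-PCA and the proposed MFCC/LPCC extraction extensions are not reproduced here, and uniform weights reduce the sketch to ordinary correlation-based PCA.

```python
import numpy as np

def weighted_correlation_pca(X, weights, n_components=10):
    """Project features using eigenvectors of a weighted correlation matrix.

    X       : (n_frames, n_features) matrix, e.g. MFCCs
    weights : per-frame weights (the paper's specific weighting is not reproduced)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = w @ X                                   # weighted mean
    Xc = (X - mu) * np.sqrt(w)[:, None]          # weighted centering
    std = np.sqrt((Xc ** 2).sum(axis=0)) + 1e-12
    R = (Xc / std).T @ (Xc / std)                # weighted correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]
    return (X - mu) @ eigvecs[:, order]          # transformed features

mfcc = np.random.randn(500, 20)                  # placeholder frames x coefficients
proj = weighted_correlation_pca(mfcc, weights=np.ones(500), n_components=10)
```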

Journal ArticleDOI
TL;DR: It is shown that the proposed binaural speech separation system outperforms the baseline systems in improving the intelligibility and quality of separated speech signals in reverberant and noisy conditions.

Journal ArticleDOI
TL;DR: This work proposes a novel strategy for training neural network acoustic models based on adversarial training that makes use of environment labels during training, and provides a motivating study on the mechanism by which a deep network learns environmental invariance.
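A PyTorch sketch of the standard gradient-reversal recipe for this kind of environment-adversarial training; the feature dimensions, label counts and network sizes are placeholders, not the paper's acoustic model.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU())   # shared acoustic encoder
senone_head = nn.Linear(256, 500)                        # ASR (senone) targets
env_head = nn.Linear(256, 4)                             # environment labels

feats = torch.randn(32, 40)
senones = torch.randint(0, 500, (32,))
envs = torch.randint(0, 4, (32,))

h = encoder(feats)
asr_loss = nn.functional.cross_entropy(senone_head(h), senones)
# The environment classifier is trained normally, but its gradient is reversed before
# reaching the encoder, pushing the encoder toward environment-invariant representations.
env_loss = nn.functional.cross_entropy(env_head(GradReverse.apply(h, 1.0)), envs)
(asr_loss + env_loss).backward()
```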

Journal ArticleDOI
TL;DR: This architecture improves on previous privacy-preserving ASV by using (probabilistic) embeddings (i-vectors) and by additionally protecting the vendor’s model and shows that privacy of subject and vendor data can be preserved in ASV while retaining practical verification times.

Journal ArticleDOI
TL;DR: Findings of the first systematic acoustic analysis of focus prosody in Hijazi Arabic, an under-researched Arabic dialect, show that focused words have significantly expanded excursion size, higher maximum F0 and longer duration and show evidence of prosodic differences between contrastive focus and information focus.

Journal ArticleDOI
TL;DR: A freely available system for word count estimation (WCE) that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data is presented, based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts to the corresponding word count estimates.
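A minimal NumPy sketch of the language-dependent mapping stage, assuming a least-squares linear fit from syllable counts to word counts on a small transcribed adaptation set; the numbers are invented, and the paper's syllabifier and exact mapping are not reproduced.

```python
import numpy as np

# Hypothetical per-utterance data: syllable counts from the language-independent
# syllabifier, paired with word counts from a small transcribed adaptation set.
syllables = np.array([12, 30, 7, 22, 15, 40, 9], dtype=float)
words = np.array([8, 19, 5, 14, 10, 26, 6], dtype=float)

# Language-dependent mapping: a least-squares line from syllable to word counts.
A = np.stack([syllables, np.ones_like(syllables)], axis=1)
slope, intercept = np.linalg.lstsq(A, words, rcond=None)[0]

# Estimate the word count of new (untranscribed) recordings from their syllable counts.
estimated_words = slope * np.array([18.0, 33.0]) + intercept
```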

Journal ArticleDOI
TL;DR: A low-complexity permutation alignment method based on the inter-frequency dependence of the signal power ratio is proposed, and a clustering algorithm with centroids is adopted to achieve fine global optimization over the full band in only a few iterations.
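A NumPy sketch of centroid-based permutation alignment over frequency bins using power-envelope correlation; the paper's specific signal power ratio measure and its low-complexity refinements are not reproduced.

```python
import numpy as np
from itertools import permutations

def align_permutations(power, n_iter=3):
    """Align per-frequency source permutations via power-envelope correlation.

    power : array (freq, sources, frames) of separated-source power envelopes
    Returns the permuted array and the permutation chosen for each bin.
    """
    n_freq, n_src, _ = power.shape
    perms = list(permutations(range(n_src)))
    assign = [tuple(range(n_src))] * n_freq
    for _ in range(n_iter):
        # Centroids: mean normalised envelope of each source over all bins.
        aligned = np.stack([power[f, list(p), :] for f, p in enumerate(assign)])
        centroids = aligned.mean(axis=0)
        centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True) + 1e-12
        for f in range(n_freq):
            env = power[f] / (np.linalg.norm(power[f], axis=-1, keepdims=True) + 1e-12)
            # Pick the permutation whose envelopes best correlate with the centroids.
            scores = [np.sum(env[list(p)] * centroids) for p in perms]
            assign[f] = perms[int(np.argmax(scores))]
    return np.stack([power[f, list(p), :] for f, p in enumerate(assign)]), assign

aligned, perm_per_bin = align_permutations(np.random.rand(129, 2, 200))
```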