
Showing papers in "Speech Communication in 2019"


Journal ArticleDOI
TL;DR: Results presented in this work show that the performance of an MMSE approach to speech enhancement significantly increases when utilising deep learning, and MMSE approaches utilising the proposed a priori SNR estimator are able to achieve higher enhanced speech quality and intelligibility scores than recent masking- and mapping-based deep learning approaches.

94 citations
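For readers unfamiliar with how an a priori SNR estimate drives an MMSE-style enhancer, here is a minimal NumPy sketch. It assumes the deep-learning a priori SNR estimator is available as a black box (the random xi_hat_db below is only a placeholder) and uses a simple Wiener gain rather than the paper's exact MMSE estimator.

```python
import numpy as np

def mmse_style_enhance(noisy_stft, xi_db):
    """Apply a Wiener-style gain derived from an a priori SNR estimate.

    noisy_stft : complex STFT of the noisy speech, shape (freq, frames)
    xi_db      : a priori SNR estimate in dB (assumed here to come from a
                 trained DNN, as in the paper; any estimator could be used)
    """
    xi = 10.0 ** (xi_db / 10.0)          # dB -> linear a priori SNR
    gain = xi / (1.0 + xi)               # Wiener gain; an MMSE-STSA gain would differ
    return gain * noisy_stft             # enhanced STFT (invert with an ISTFT)

# Toy usage with random placeholders instead of a real DNN estimate.
noisy = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
xi_hat_db = np.random.uniform(-10, 20, size=noisy.shape)   # stand-in for the DNN output
enhanced = mmse_style_enhance(noisy, xi_hat_db)
```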


Journal ArticleDOI
TL;DR: This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes as input raw speech signal and estimates the HMM states class conditional probabilities at the output.

91 citations
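A toy PyTorch sketch of the general idea: raw waveform in, HMM-state class-conditional probabilities out. The layer sizes, filter lengths and number of states are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Maps a window of raw samples to log HMM-state posteriors (toy sizes)."""
    def __init__(self, n_states=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=10), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_states)

    def forward(self, x):                 # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)  # (batch, 64)
        return torch.log_softmax(self.classifier(h), dim=-1)

# One 250 ms window at 16 kHz -> log state posteriors for each utterance window.
posteriors = RawWaveformCNN()(torch.randn(8, 1, 4000))
print(posteriors.shape)                   # torch.Size([8, 500])
```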


Journal ArticleDOI
TL;DR: It is shown that this method can effectively reduce the confusion between emotions, thus improving the speech emotion recognition rate.

74 citations


Journal ArticleDOI
TL;DR: A global approach for a speech emotion recognition (SER) system using empirical mode decomposition (EMD) is proposed; a combination of all features extracted from the IMFs enhances the performance of the SER system, achieving a 91.16% recognition rate.

63 citations
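A minimal sketch of EMD-based feature extraction, assuming the PyEMD package and using generic per-IMF statistics (energy and zero-crossing rate); the paper's actual feature set and classifier are not reproduced.

```python
import numpy as np
from PyEMD import EMD   # assumes the PyEMD package (pip install EMD-signal)

def emd_features(signal, max_imfs=5):
    """Decompose a speech signal into IMFs and pool simple statistics per IMF."""
    imfs = EMD().emd(signal)[:max_imfs]
    feats = []
    for imf in imfs:
        energy = float(np.mean(imf ** 2))                            # per-IMF energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(imf))) > 0))      # rough zero-crossing rate
        feats.extend([energy, zcr])
    return np.array(feats)

features = emd_features(np.random.randn(16000))   # 1 s of placeholder audio at 16 kHz
```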


Journal ArticleDOI
TL;DR: In this article, a dynamic Bayesian network (DBN) is proposed to bridge the gap between rule-based and data-driven approaches, where a discrete variable is added to constrain the behaviors on the underlying constraint.

48 citations


Journal ArticleDOI
TL;DR: The experiments show that the new data augmentation approaches yield performance improvements under all noisy conditions, including additive noise, channel distortion and reverberation, and that a relative 6% to 14% WER reduction can be obtained with an advanced acoustic model.

40 citations
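A NumPy sketch of two of the three augmentation types mentioned above: reverberation via RIR convolution and additive noise at a target SNR. Channel distortion is omitted, and the synthetic RIR below is only a placeholder.

```python
import numpy as np

def augment(clean, noise, rir, snr_db):
    """Create one augmented utterance: reverberate, then add noise at snr_db."""
    # Reverberation via convolution with a (measured or simulated) room impulse response.
    reverbed = np.convolve(clean, rir)[: len(clean)]

    # Loop/trim the noise to the utterance length.
    reps = int(np.ceil(len(reverbed) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverbed)]

    # Scale the noise to reach the requested SNR.
    eps = 1e-12
    speech_pow = np.mean(reverbed ** 2) + eps
    noise_pow = np.mean(noise ** 2) + eps
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return reverbed + scale * noise

aug = augment(np.random.randn(16000),
              np.random.randn(8000),
              rir=np.exp(-np.arange(2000) / 300.0) * np.random.randn(2000),  # toy RIR
              snr_db=10)
```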


Journal ArticleDOI
TL;DR: Although spoken-word recognition in the presence of background noise is harder in a non-native language than in one's native language, this difference can be explained by differences in language exposure, which influences the uptake and use of phonetic and contextual information in the speech signal for spoken-word recognition.

36 citations


Journal ArticleDOI
TL;DR: This study uses a three-layer model composed of acoustic features, semantic primitives, and emotion dimensions to map acoustics into emotion dimensions, and classifies the continuous emotion-dimensional values into basic categories using logistic model trees.

32 citations
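A rough scikit-learn sketch of the three-layer idea: acoustic features are mapped to semantic primitives, primitives to continuous emotion dimensions, and the dimensions are then classified into basic categories. The data are random placeholders, and a plain decision tree stands in for the logistic model trees used in the paper, since scikit-learn does not provide them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder training data: acoustic features -> semantic primitives -> dimensions -> category.
X_acoustic = rng.normal(size=(200, 20))        # e.g. F0 / energy / spectral statistics
Y_primitives = rng.normal(size=(200, 5))       # e.g. listener ratings of semantic primitives
Y_dimensions = rng.normal(size=(200, 2))       # valence, arousal
y_category = rng.integers(0, 4, size=200)      # basic emotion labels

# Layer 1: acoustic features -> semantic primitives.
to_primitives = LinearRegression().fit(X_acoustic, Y_primitives)
# Layer 2: semantic primitives -> continuous emotion dimensions.
to_dimensions = LinearRegression().fit(to_primitives.predict(X_acoustic), Y_dimensions)
# Layer 3: emotion dimensions -> basic categories (decision tree standing in
# for the paper's logistic model trees).
to_category = DecisionTreeClassifier(max_depth=4).fit(
    to_dimensions.predict(to_primitives.predict(X_acoustic)), y_category)

pred = to_category.predict(
    to_dimensions.predict(to_primitives.predict(rng.normal(size=(1, 20)))))
```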


Journal ArticleDOI
TL;DR: The results showed that the glottal features in combination with the openSMILE-based acoustic features resulted in improved classification accuracies, which validates the complementary nature of glottal features.

31 citations


Journal ArticleDOI
TL;DR: Golden Speaker Builder is presented, a tool that allows learners to generate a personalized “golden-speaker” voice: one that mirrors their own voice but with a native accent.

30 citations


Journal ArticleDOI
TL;DR: This work proposes a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal, and demonstrates the applicability of the approach for more generalized speech enhancement, where it has to regenerate voices from whispered signals.
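A heavily reduced PyTorch sketch of the adversarial setup on raw signals: a generator maps corrupted (e.g. whispered) waveforms to clean ones, while a discriminator scores realism. The tiny networks and the least-squares GAN plus L1 losses below are assumptions, not the paper's full-scale models.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(              # corrupted waveform -> "clean" waveform
    nn.Conv1d(1, 16, 31, padding=15), nn.PReLU(),
    nn.Conv1d(16, 1, 31, padding=15), nn.Tanh(),
)
discriminator = nn.Sequential(          # waveform -> real/fake score
    nn.Conv1d(1, 16, 31, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(16, 1, 31, stride=4), nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)

corrupted = torch.randn(4, 1, 16000)    # placeholder corrupted/whispered segments
clean = torch.randn(4, 1, 16000)        # placeholder clean targets

enhanced = generator(corrupted)
# Least-squares GAN losses plus an L1 term pulling the output toward the target.
d_loss = ((discriminator(clean) - 1) ** 2).mean() + (discriminator(enhanced.detach()) ** 2).mean()
g_loss = ((discriminator(enhanced) - 1) ** 2).mean() + 100 * (enhanced - clean).abs().mean()
```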

Journal ArticleDOI
TL;DR: A neural-network-based ideal ratio mask estimator, learned from a multi-condition data set, is adopted to incorporate prior information obtained from speech/noise interactions and the long acoustic context into CGMM-based beamforming, yielding beamformed speech with a higher signal-to-noise ratio (SNR) than the original noisy speech signal.
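For context, the ideal ratio mask that such an estimator is trained to predict can be computed from parallel speech and noise as in the NumPy sketch below; the CGMM beamformer itself and the network are not shown.

```python
import numpy as np

def ideal_ratio_mask(speech_stft, noise_stft):
    """Oracle IRM used as the training target for a mask estimator."""
    s_pow = np.abs(speech_stft) ** 2
    n_pow = np.abs(noise_stft) ** 2
    return s_pow / (s_pow + n_pow + 1e-12)

# At test time a network predicts the mask from the noisy spectrogram; the masked
# (or beamformed) output then has a higher SNR than the input mixture.
speech = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
noise = 0.5 * (np.random.randn(257, 100) + 1j * np.random.randn(257, 100))
mask = ideal_ratio_mask(speech, noise)
enhanced = mask * (speech + noise)
```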

Journal ArticleDOI
TL;DR: The results showed that the second experimental group (CAPT) performed better than the other groups in developing speaking skills, which has pedagogical implications for curriculum designers, interpreter training programs, and all who are involved in language study and pedagogy.

Journal ArticleDOI
TL;DR: This work presents a neural architecture that serves as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by LSTM-based recurrent neural networks, and shows that this novel architecture is indeed a better alternative.

Journal ArticleDOI
TL;DR: In this article, a bimodal recurrent neural network (BRNN) framework was proposed for speech activity detection in audiovisual speech processing systems, where acoustic and visual features are directly learned from the raw data during training.

Journal ArticleDOI
TL;DR: A new, extended version of the voiceHome corpus for distant-microphone speech processing in domestic environments, which includes short reverberated, noisy utterances spoken in French by 12 native French talkers in diverse realistic acoustic conditions and recorded by an 8-microphone device at various angles and distances and in various noise conditions.

Journal ArticleDOI
TL;DR: This study introduces a new environment, called OPENGLOT, for the evaluation of glottal inverse filtering (GIF), which is versatile and open and can be used by anyone who wants to evaluate his or her new GIF method and compare it objectively to previously developed benchmark techniques.

Journal ArticleDOI
TL;DR: It is proposed as a hypothesis for further study, that gender mediates more complex interactions between sociocultural norms, conversation context, and other factors.

Journal ArticleDOI
TL;DR: This work examines different performance measures for estimating the word error rates of simulated behind-the-ear hearing aid signals and for detecting the azimuth angle of the target source in 180-degree spatial scenes.

Journal ArticleDOI
TL;DR: In this paper, the authors performed an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field.

Journal ArticleDOI
TL;DR: This work describes the collection of a Hinglish (Hindi-English) code-switching database at the Indian Institute of Technology Guwahati (IITG) which is referred to as the IITG-HingCoS corpus, and elaborates the sources and the protocol used for collecting the corpus.

Journal ArticleDOI
TL;DR: Results suggest alignments between articulatory movements and pitch trajectories, with downward or upward head and eyebrow movements following the dipping and rising tone trajectories respectively, lip closing movement being associated with the falling tone, and minimal movements for the level tone.

Journal ArticleDOI
TL;DR: This study examines both manually and automatically labeled speech disfluencies features, demonstrating that detailed disfluency analysis leads to considerable gains, of up to 100% in absolute depression classification accuracy, especially with affective considerations, when compared with the affect-agnostic acoustic baseline.

Journal ArticleDOI
TL;DR: A Weighted-Correlation Principal Component Analysis (WCR-PCA) for efficient transformation of speech features in speaker recognition is introduced, and extensions to improve the extraction of MFCC and LPCC features of speech are proposed.
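A NumPy sketch of the general idea of PCA driven by a weighted correlation matrix of frame-level features; the specific weighting scheme of WCR-PCA and the proposed MFCC/LPCC extraction extensions are not reproduced here, and uniform weights reduce the sketch to ordinary correlation-based PCA.

```python
import numpy as np

def weighted_correlation_pca(X, weights, n_components=10):
    """Project features using eigenvectors of a weighted correlation matrix.

    X       : (n_frames, n_features) matrix, e.g. MFCCs
    weights : per-frame weights (the paper's specific weighting is not reproduced)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = w @ X                                   # weighted mean
    Xc = (X - mu) * np.sqrt(w)[:, None]          # weighted centering
    std = np.sqrt((Xc ** 2).sum(axis=0)) + 1e-12
    R = (Xc / std).T @ (Xc / std)                # weighted correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]
    return (X - mu) @ eigvecs[:, order]          # transformed features

mfcc = np.random.randn(500, 20)                  # placeholder frames x coefficients
proj = weighted_correlation_pca(mfcc, weights=np.ones(500), n_components=10)
```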

Journal ArticleDOI
TL;DR: It is shown that the proposed binaural speech separation system outperforms the baseline systems in improving the intelligibility and quality of separated speech signals in reverberant and noisy conditions.

Journal ArticleDOI
TL;DR: This work proposes a novel strategy for training neural network acoustic models based on adversarial training that makes use of environment labels during training, and provides a motivating study on the mechanism by which a deep network learns environmental invariance.
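A PyTorch sketch of the standard gradient-reversal recipe for this kind of environment-adversarial training; the feature dimensions, label counts and network sizes are placeholders, not the paper's acoustic model.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU())   # shared acoustic encoder
senone_head = nn.Linear(256, 500)                        # ASR (senone) targets
env_head = nn.Linear(256, 4)                             # environment labels

feats = torch.randn(32, 40)
senones = torch.randint(0, 500, (32,))
envs = torch.randint(0, 4, (32,))

h = encoder(feats)
asr_loss = nn.functional.cross_entropy(senone_head(h), senones)
# The environment classifier is trained normally, but its gradient is reversed before
# reaching the encoder, pushing the encoder toward environment-invariant representations.
env_loss = nn.functional.cross_entropy(env_head(GradReverse.apply(h, 1.0)), envs)
(asr_loss + env_loss).backward()
```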

Journal ArticleDOI
TL;DR: This architecture improves on previous privacy-preserving ASV by using (probabilistic) embeddings (i-vectors) and by additionally protecting the vendor’s model and shows that privacy of subject and vendor data can be preserved in ASV while retaining practical verification times.

Journal ArticleDOI
TL;DR: Findings of the first systematic acoustic analysis of focus prosody in Hijazi Arabic, an under-researched Arabic dialect, show that focused words have significantly expanded excursion size, higher maximum F0 and longer duration and show evidence of prosodic differences between contrastive focus and information focus.

Journal ArticleDOI
TL;DR: A freely available system for word count estimation (WCE) that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data is presented, based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts to the corresponding word count estimates.
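A minimal NumPy sketch of the language-dependent mapping stage, assuming a least-squares linear fit from syllable counts to word counts on a small transcribed adaptation set; the numbers are invented, and the paper's syllabifier and exact mapping are not reproduced.

```python
import numpy as np

# Hypothetical per-utterance data: syllable counts from the language-independent
# syllabifier, paired with word counts from a small transcribed adaptation set.
syllables = np.array([12, 30, 7, 22, 15, 40, 9], dtype=float)
words = np.array([8, 19, 5, 14, 10, 26, 6], dtype=float)

# Language-dependent mapping: a least-squares line from syllable to word counts.
A = np.stack([syllables, np.ones_like(syllables)], axis=1)
slope, intercept = np.linalg.lstsq(A, words, rcond=None)[0]

# Estimate the word count of new (untranscribed) recordings from their syllable counts.
estimated_words = slope * np.array([18.0, 33.0]) + intercept
```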

Journal ArticleDOI
TL;DR: A low-complexity permutation alignment method based on the inter-frequency dependence of the signal power ratio is proposed, and a clustering algorithm with centroids is adopted to achieve fine global optimization over the full band in only a few iterations.
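A NumPy sketch of centroid-based permutation alignment over frequency bins using power-envelope correlation; the paper's specific signal power ratio measure and its low-complexity refinements are not reproduced.

```python
import numpy as np
from itertools import permutations

def align_permutations(power, n_iter=3):
    """Align per-frequency source permutations via power-envelope correlation.

    power : array (freq, sources, frames) of separated-source power envelopes
    Returns the permuted array and the permutation chosen for each bin.
    """
    n_freq, n_src, _ = power.shape
    perms = list(permutations(range(n_src)))
    assign = [tuple(range(n_src))] * n_freq
    for _ in range(n_iter):
        # Centroids: mean normalised envelope of each source over all bins.
        aligned = np.stack([power[f, list(p), :] for f, p in enumerate(assign)])
        centroids = aligned.mean(axis=0)
        centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True) + 1e-12
        for f in range(n_freq):
            env = power[f] / (np.linalg.norm(power[f], axis=-1, keepdims=True) + 1e-12)
            # Pick the permutation whose envelopes best correlate with the centroids.
            scores = [np.sum(env[list(p)] * centroids) for p in perms]
            assign[f] = perms[int(np.argmax(scores))]
    return np.stack([power[f, list(p), :] for f, p in enumerate(assign)]), assign

aligned, perm_per_bin = align_permutations(np.random.rand(129, 2, 200))
```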