Proceedings ArticleDOI
Synthesizing Dysarthric Speech Using Multi-Speaker Tts For Dysarthric Speech Recognition
Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry +3 more
- pp. 7382–7386
TL;DR: This paper improves multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR, adding dysarthria severity level and pause insertion mechanisms alongside other control parameters such as pitch, energy, and duration.
Abstract: Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. A robust dysarthria-specific ASR requires sufficient training speech, which is not readily available. Recent advances in multi-speaker end-to-end text-to-speech (TTS) synthesis suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR. In the synthesized speech, we add dysarthria severity level and pause insertion mechanisms to other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
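The pause-insertion control described in the abstract can be illustrated with a minimal sketch: pause tokens are inserted into a phoneme sequence with a probability that grows with the dysarthria severity level, so that more severe synthetic speech contains more (and more disruptive) pauses. This is a hypothetical illustration, not the authors' implementation; the token names, the 0–3 severity scale, and the probability scaling are all assumptions.

```python
import random

# Hypothetical sketch of a severity-conditioned pause-insertion mechanism
# (not the paper's actual code). A pause token is inserted after each
# word-boundary marker "|" with a probability scaled by the severity level.

PAUSE = "<pause>"


def insert_pauses(phonemes, severity, base_prob=0.1, rng=None):
    """Insert PAUSE tokens after word boundaries ("|") with a probability
    that grows with severity (assumed scale: 0 = mild ... 3 = severe)."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    prob = min(1.0, base_prob * (1 + severity))
    out = []
    for p in phonemes:
        out.append(p)
        if p == "|" and rng.random() < prob:
            out.append(PAUSE)
    return out
```

In a full TTS pipeline, the augmented phoneme sequence would then be fed to the synthesizer together with the other control inputs (pitch, energy, duration, severity embedding); here only the sequence-level pause control is sketched.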
Citations
Journal ArticleDOI
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition
TL;DR: The proposed GAN-based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute on the TORGO and DementiaBank data, respectively.
Journal ArticleDOI
Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech
Xin Zhang, Iván Vallés-Pérez, Andreas Stolcke, Chengzhu Yu, Jasha Droppo, Olabanji Y. Shonibare, Roberto Barra-Chicote, Venkatesh Ravichandran +7 more
TL;DR: The authors proposed Stutter-TTS, a neural text-to-speech model capable of synthesizing diverse types of stuttering utterances, where additional tokens are introduced into the source text during training to represent specific stuttering characteristics.
Journal ArticleDOI
Dysarthria severity assessment using squeeze-and-excitation networks
Amlu Anna Joshy, Rajeev Rajan +1 more
TL;DR: In this article, the authors explored the potency of squeeze-and-excitation (SE) networks for dysarthria severity level classification using mel spectrograms, and compared them with a shallow CNN and a convolutional recurrent neural network built using a bidirectional long short-term memory network.
Journal ArticleDOI
Use of Speech Impairment Severity for Dysarthric Speech Recognition
Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Tianwei Yu, Xurong Xie, Xunying Liu +9 more
TL;DR: In this paper, a set of techniques is proposed to use both severity and speaker identity in dysarthric speech recognition, such as multitask training incorporating severity prediction error, speaker-severity aware auxiliary feature adaptation, and structured LHUC transforms separately conditioned on speaker identity and severity.
Journal ArticleDOI
A comprehensive survey of automatic dysarthric speech recognition
TL;DR: A comprehensive survey of the recent advances in automatic dysarthric speech recognition (DSR) using machine learning (ML) and deep learning (DL) paradigms is presented in this paper.
References
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely +12 more
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
Jonathan Shen, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu +12 more
TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text, composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Proceedings ArticleDOI
Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi.
TL;DR: The Montreal Forced Aligner (MFA) is an update to the Prosodylab-Aligner, and maintains its key functionality of trainability on new data, as well as incorporating improved architecture (triphone acoustic models and speaker adaptation), and other features.
Posted Content
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Robert A. J. Clark, Rif A. Saurous +13 more
TL;DR: Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Posted Content
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
TL;DR: FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with the ground-truth target instead of the simplified output from the teacher, and by introducing more variation information of speech as conditional inputs.