Proceedings ArticleDOI

Synthesizing Dysarthric Speech Using Multi-Speaker TTS For Dysarthric Speech Recognition

TLDR
This paper aims to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR, adding dysarthria severity level and pause insertion mechanisms to other control parameters such as pitch, energy, and duration.
Abstract
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. Robust dysarthria-specific ASR requires sufficient training speech, which is not readily available. Recent advances in multi-speaker end-to-end Text-To-Speech (TTS) synthesis suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR. In the synthesized speech, we add dysarthria severity level and pause insertion mechanisms alongside other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
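As a rough illustration of the two added controls (this is not the paper's actual implementation; all names, token choices, and dimensions below are assumptions), a pause-insertion step and a severity embedding in a FastSpeech 2-style pipeline might look like this:

```python
# Illustrative sketch only: (a) insert pause tokens into the input sequence
# to mimic dysarthric pausing, (b) condition encoder states on a discrete
# dysarthria severity level. Names and dimensions are assumptions.
import random
import torch
import torch.nn as nn

def insert_pauses(phonemes, pause_token="sp", prob=0.2, seed=None):
    """Randomly insert short-pause tokens between phonemes."""
    rng = random.Random(seed)
    out = []
    for p in phonemes:
        out.append(p)
        if rng.random() < prob:
            out.append(pause_token)
    return out

class SeverityConditioner(nn.Module):
    """Adds a learned severity embedding to the encoder states."""
    def __init__(self, hidden_dim=256, num_levels=4):  # e.g. mild..severe
        super().__init__()
        self.emb = nn.Embedding(num_levels, hidden_dim)

    def forward(self, encoder_states, severity):
        # encoder_states: (batch, seq_len, hidden_dim); severity: (batch,)
        return encoder_states + self.emb(severity).unsqueeze(1)

# Usage on dummy data:
states = torch.randn(2, 10, 256)
conditioned = SeverityConditioner()(states, torch.tensor([1, 3]))
print(insert_pauses(["HH", "AH", "L", "OW"], prob=0.5, seed=0))
```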



Citations
Journal ArticleDOI

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

TL;DR: The proposed GAN-based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute on the TORGO and DementiaBank data, respectively.
Journal ArticleDOI

Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech

TL;DR: The authors proposed Stutter-TTS, a neural text-to-speech model capable of synthesizing diverse types of stuttering utterances, where additional tokens are introduced into the source text during training to represent specific stuttering characteristics.
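A schematic sketch of the token-annotation idea described above (the actual Stutter-TTS token inventory and placement policy are not reproduced here; the token names below are hypothetical):

```python
# Schematic only: source text is annotated with disfluency control tokens
# that the TTS model learns to realize acoustically. Token names invented.
import random

DISFLUENCY_TOKENS = ["[repetition]", "[block]", "[prolongation]"]

def annotate_with_disfluencies(words, rate=0.15, seed=0):
    rng = random.Random(seed)
    annotated = []
    for w in words:
        if rng.random() < rate:
            annotated.append(rng.choice(DISFLUENCY_TOKENS))
        annotated.append(w)
    return " ".join(annotated)

print(annotate_with_disfluencies("please call stella".split()))
```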
Journal ArticleDOI

Dysarthria severity assessment using squeeze-and-excitation networks

TL;DR: In this article, the authors explored the potency of squeeze-and-excitation (SE) networks for dysarthria severity level classification using mel spectrograms and compared them with a shallow CNN and a convolutional recurrent neural network built using a bidirectional long short-term memory network.
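For reference, a standard squeeze-and-excitation block, the building block this classifier relies on, looks roughly like the following; the full severity classifier is not shown and the channel sizes are placeholders:

```python
# A standard squeeze-and-excitation (SE) block applied to 2-D feature maps
# (e.g. CNN features over a mel spectrogram). Channel counts are placeholders.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # global average pool per channel
        self.excite = nn.Sequential(             # bottleneck -> channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gates = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * gates                         # per-channel reweighting

x = torch.randn(1, 32, 80, 100)                  # e.g. 80 mel bins, 100 frames
print(SEBlock(32)(x).shape)
```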
Journal ArticleDOI

Use of Speech Impairment Severity for Dysarthric Speech Recognition

TL;DR: In this paper, a set of techniques for using both severity and speaker identity in dysarthric speech recognition is proposed, including multitask training incorporating severity prediction error, speaker-severity-aware auxiliary feature adaptation, and structured LHUC transforms separately conditioned on speaker identity and severity.
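As a minimal sketch of the LHUC idea mentioned above (the paper's structured, separately conditioned transforms are more elaborate; condition counts and dimensions are illustrative):

```python
# Rough LHUC-style sketch: one learned rescaling vector per condition
# (a speaker, or a speaker-severity pairing). Illustrative, not the paper's
# structured variant.
import torch
import torch.nn as nn

class LHUC(nn.Module):
    def __init__(self, hidden_dim, num_conditions):
        super().__init__()
        self.params = nn.Embedding(num_conditions, hidden_dim)

    def forward(self, hidden, condition_id):
        # 2*sigmoid constrains each unit's gain to (0, 2), the usual LHUC range.
        gain = 2.0 * torch.sigmoid(self.params(condition_id))
        return hidden * gain.unsqueeze(1)        # hidden: (B, T, hidden_dim)

h = torch.randn(2, 5, 64)
print(LHUC(64, num_conditions=10)(h, torch.tensor([0, 7])).shape)
```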
Journal ArticleDOI

A comprehensive survey of automatic dysarthric speech recognition

TL;DR: A comprehensive survey of recent advances in automatic dysarthric speech recognition (DSR) using machine learning (ML) and deep learning (DL) paradigms is presented in this paper.
References
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi, a free, open-source toolkit for speech recognition research, is described; it provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

TL;DR: Tacotron 2 is described: a neural network architecture for speech synthesis directly from text, composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Proceedings ArticleDOI

Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi.

TL;DR: The Montreal Forced Aligner (MFA) is an update to the Prosodylab-Aligner that maintains its key functionality of trainability on new data while incorporating an improved architecture (triphone acoustic models and speaker adaptation) and other features.
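A typical invocation, with placeholder paths, is shown below; MFA's command-line aligner takes a corpus directory, pronunciation dictionary, acoustic model, and output directory, and writes TextGrids with word and phone timings:

```python
# Placeholder paths; invokes MFA's command-line aligner from Python for
# consistency with the other sketches. Equivalent shell usage:
#   mfa align <corpus_dir> <dictionary> <acoustic_model> <output_dir>
import subprocess

subprocess.run(
    ["mfa", "align", "corpus/", "english.dict", "english.zip", "aligned/"],
    check=True,
)
```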
Posted Content

Tacotron: Towards End-to-End Speech Synthesis

TL;DR: Tacotron is presented, an end-to-end generative text- to-speech model that synthesizes speech directly from characters that achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Posted Content

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

TL;DR: FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth targets instead of the simplified output from a teacher, and by introducing more variation information of speech as conditional inputs.
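To make "variation information as conditional inputs" concrete, here is a stripped-down sketch of the variance-adaptor idea, duration prediction plus length regulation; the real model also conditions on pitch and energy and trains the predictors against ground-truth targets:

```python
# Stripped-down sketch of FastSpeech 2's variance-adaptor idea: predict a
# per-phoneme quantity (here, duration) from encoder states, then expand the
# states accordingly. Dimensions are placeholders; real FastSpeech 2 also
# has pitch and energy predictors.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x):                       # x: (T, dim), one utterance
        return self.net(x).squeeze(-1)          # (T,) predicted log-durations

def length_regulate(states, durations):
    """Repeat each phoneme state by its rounded predicted duration (frames)."""
    reps = durations.round().clamp(min=1).long()
    return torch.repeat_interleave(states, reps, dim=0)

states = torch.randn(4, 256)                    # 4 phonemes
dur = torch.exp(DurationPredictor()(states))    # durations in frames
print(length_regulate(states, dur).shape)       # (sum of repeats, 256)
```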