Proceedings ArticleDOI

Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition

TLDR
In this article, the authors explore the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech.
Abstract
State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, performance on impaired speech remains an issue. The current study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult as several aspects of speech, such as articulation, prosody, and phonation, can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than filter-banks (Fbank) or models trained on a single language. Improvements were observed in English speakers with dysarthria caused by cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus), and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora, respectively.
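The gains above are reported in word error rate (WER). As a quick reference, a minimal WER computation (the standard word-level edit distance; a generic sketch, not code from the paper) looks like:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

A relative reduction such as the reported 22.0% on PC-GITA would then be `100 * (wer_fbank - wer_xlsr) / wer_fbank`.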


Citations
Proceedings ArticleDOI

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

TL;DR: In this paper, a series of approaches to integrating domain-adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition is explored.

Hierarchical Multi-Class Classification of Voice Disorders Using Self-Supervised Models and Glottal Features

TL;DR: In this paper, a hierarchical classifier was used to detect laryngeal voice disorders; the best performance was achieved by using features from wav2vec 2.0 LARGE together with hierarchical classification.
Journal ArticleDOI

Automatic Severity Assessment of Dysarthric speech by using Self-supervised Model with Multi-task Learning

TL;DR: Presents a novel automatic severity assessment method for dysarthric speech that uses a self-supervised model in conjunction with multi-task learning, and analyzes how multi-task learning affects severity classification performance through the latent representations and its regularization effect.
Journal ArticleDOI

Use of Speech Impairment Severity for Dysarthric Speech Recognition

TL;DR: In this paper, a set of techniques for using both severity and speaker identity in dysarthric speech recognition is proposed, such as multi-task training incorporating severity prediction error, speaker-severity-aware auxiliary feature adaptation, and structured LHUC transforms separately conditioned on speaker identity and severity.
Journal ArticleDOI

Benefits of pre-trained mono- and cross-lingual speech representations for spoken language understanding of Dutch dysarthric speech

TL;DR: In this paper, the authors compare different mono- and cross-lingual pre-training (supervised and unsupervised) methodologies for spoken language understanding (SLU) tasks on Dutch dysarthric speech.
References
Journal Article

Visualizing Data using t-SNE

TL;DR: Presents t-SNE, a technique that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. A variation of Stochastic Neighbor Embedding, it is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
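As a minimal illustration of the technique (toy clustered data standing in for high-dimensional speech embeddings, not the paper's data), using scikit-learn's implementation:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in 50 dimensions
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)),
               rng.normal(5.0, 1.0, (30, 50))])

# Map each 50-dim point to a location in a 2-D map
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
# emb has shape (60, 2): one 2-D location per datapoint
```

In the visualization, points from the two clusters end up in two well-separated groups of 2-D locations.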
Journal ArticleDOI

Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results.

Christopher G. Goetz, +87 more
15 Nov 2008
TL;DR: The combined clinimetric results of this study support the validity of the MDS-UPDRS for rating PD.
Proceedings Article

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

TL;DR: It is shown for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Posted Content

Conformer: Convolution-augmented Transformer for Speech Recognition

TL;DR: This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.
Journal ArticleDOI

Differential Diagnostic Patterns of Dysarthria

TL;DR: Thirty-second speech samples from at least 30 patients in each of 7 discrete neurologic groups, each patient unequivocally diagnosed as a representative of his diagnostic group, were studied to identify differential diagnostic patterns of dysarthria.