Open Access · Posted Content

Unsupervised Speech Recognition

TLDR
In this paper, the authors leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training, achieving state-of-the-art performance on the TIMIT benchmark.
Abstract
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
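The segmentation step the abstract describes, grouping adjacent frames of self-supervised representations into phoneme-like units, can be illustrated in plain Python. This is a minimal sketch, not the paper's implementation: it assumes each frame has already been assigned a cluster ID (e.g. by k-means over the representations) and places a segment boundary wherever that ID changes; the function name and toy data are hypothetical.

```python
def segment_by_cluster_ids(frames, cluster_ids):
    """Collapse runs of frames that share a cluster ID into one segment,
    represented by the element-wise mean of its frame vectors."""
    segments, current, prev = [], [], None
    for frame, cid in zip(frames, cluster_ids):
        # Start a new segment whenever the cluster ID changes.
        if prev is not None and cid != prev:
            segments.append(current)
            current = []
        current.append(frame)
        prev = cid
    segments.append(current)
    # Mean-pool the frame vectors within each segment.
    return [[sum(vals) / len(seg) for vals in zip(*seg)] for seg in segments]

# Toy example: six 2-D "frames" falling into three runs of cluster IDs.
frames = [[0, 0], [2, 2], [4, 4], [6, 6], [8, 8], [10, 10]]
ids = [1, 1, 2, 2, 2, 3]
print(segment_by_cluster_ids(frames, ids))
# → [[1.0, 1.0], [6.0, 6.0], [10.0, 10.0]]
```

The pooled segment vectors are the kind of input the adversarially trained generator would then map to phoneme sequences.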


Citations
Posted Content

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

TL;DR: DistilHuBERT is a multi-task learning framework that distills hidden representations directly from a Hidden-unit BERT (HuBERT) model, reducing HuBERT's size by 75% and speeding it up by 73% while retaining most of its performance across ten different tasks.
Posted Content

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition.

TL;DR: WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, 22,400+ hours in total.
Posted Content

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

TL;DR: SpeechT5 is a unified-modal encoder-decoder pre-training framework that aligns textual and speech information in a shared semantic space, using a cross-modal vector quantization method with random mixing-up to bridge speech and text.
Posted Content

Unsupervised Automatic Speech Recognition: A Review.

TL;DR: The authors identify the limitations of what can be learned from speech data alone and the minimum requirements for speech recognition, arguing that understanding these limitations would help optimize the resources and efforts in ASR development for low-resource languages.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-German and English-to-French translation.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Trending Questions
How do state-of-the-art works perform unsupervised speech recognition?

State-of-the-art unsupervised systems such as wav2vec-U first segment unlabeled audio using self-supervised speech representations and then learn a mapping from those representations to phonemes via adversarial training. This reduces the phoneme error rate on the TIMIT benchmark from 26.1 (the best previous unsupervised result) to 11.3, and achieves a word error rate of 5.9 on Librispeech test-other.
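The error-rate metrics used above, phoneme error rate (PER) and word error rate (WER), are both the Levenshtein edit distance between the hypothesis and the reference token sequence, divided by the reference length. A minimal self-contained sketch (helper names are illustrative, not from any particular toolkit):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """PER/WER: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One substitution in a 5-token reference gives a 20% error rate.
print(error_rate("sil ah b aw t".split(), "sil ah b aw d".split()))  # → 0.2
```

The same function computes PER when the tokens are phonemes and WER when they are words; only the tokenization of the sequences changes.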