Open Access · Posted Content

Unsupervised Speech Recognition

TLDR
In this paper, the authors leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training, achieving state-of-the-art performance on the TIMIT benchmark.
Abstract
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
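The segmentation step the abstract describes, grouping adjacent frames of self-supervised representations into phoneme-like units, can be illustrated in plain Python. This is a minimal sketch, not the paper's implementation: it assumes each frame has already been assigned a cluster ID (e.g. by k-means over the representations) and places a segment boundary wherever that ID changes; the function name and toy data are hypothetical.

```python
def segment_by_cluster_ids(frames, cluster_ids):
    """Collapse runs of frames that share a cluster ID into one segment,
    represented by the element-wise mean of its frame vectors."""
    segments, current, prev = [], [], None
    for frame, cid in zip(frames, cluster_ids):
        # Start a new segment whenever the cluster ID changes.
        if prev is not None and cid != prev:
            segments.append(current)
            current = []
        current.append(frame)
        prev = cid
    segments.append(current)
    # Mean-pool the frame vectors within each segment.
    return [[sum(vals) / len(seg) for vals in zip(*seg)] for seg in segments]

# Toy example: six 2-D "frames" falling into three runs of cluster IDs.
frames = [[0, 0], [2, 2], [4, 4], [6, 6], [8, 8], [10, 10]]
ids = [1, 1, 2, 2, 2, 3]
print(segment_by_cluster_ids(frames, ids))
# → [[1.0, 1.0], [6.0, 6.0], [10.0, 10.0]]
```

The pooled segment vectors are the kind of input the adversarially trained generator would then map to phoneme sequences.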


Citations
Posted Content

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

TL;DR: DistilHuBERT is a multi-task learning framework that distills hidden representations directly from a Hidden-unit BERT (HuBERT) model, reducing HuBERT's size by 75% and speeding it up by 73% while retaining most of its performance across ten different tasks.
Posted Content

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition.

TL;DR: WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, 22,400+ hours in total.
Posted Content

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

TL;DR: SpeechT5 is a unified-modal encoder-decoder pre-training framework that aligns textual and speech information in a shared semantic space, using a cross-modal vector quantization method with random mixing-up to bridge speech and text.
Posted Content

Unsupervised Automatic Speech Recognition: A Review.

TL;DR: The authors identify the limitations of what can be learned from speech data alone and the minimum requirements for speech recognition, arguing that understanding these limitations would help optimize the resources and efforts in ASR development for low-resource languages.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-German and English-to-French translation.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Trending Questions
How do state-of-the-art works perform unsupervised speech recognition?

State-of-the-art unsupervised systems such as wav2vec-U first segment unlabeled audio using self-supervised speech representations and then learn a mapping from those representations to phonemes via adversarial training. This reduces the phoneme error rate on the TIMIT benchmark from 26.1 (the best previous unsupervised result) to 11.3, and achieves a word error rate of 5.9 on Librispeech test-other.
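The error-rate metrics used above, phoneme error rate (PER) and word error rate (WER), are both the Levenshtein edit distance between the hypothesis and the reference token sequence, divided by the reference length. A minimal self-contained sketch (helper names are illustrative, not from any particular toolkit):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """PER/WER: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One substitution in a 5-token reference gives a 20% error rate.
print(error_rate("sil ah b aw t".split(), "sil ah b aw d".split()))  # → 0.2
```

The same function computes PER when the tokens are phonemes and WER when they are words; only the tokenization of the sequences changes.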