Open Access · Posted Content
Unsupervised Speech Recognition
TLDR
In this paper, the authors leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training, achieving state-of-the-art performance on the TIMIT benchmark.

Abstract
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
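The adversarial setup described in the abstract can be sketched at a toy level: a generator maps segment representations to phoneme distributions, and a discriminator tries to tell those apart from real phoneme sequences derived from unpaired text. Everything below (names, dimensions, the linear models) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PHONES = 16, 8  # illustrative sizes, not the paper's

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Generator: maps each audio-segment representation to a phoneme distribution.
W_g = rng.normal(scale=0.1, size=(DIM, N_PHONES))
def generator(segments):              # (T, DIM) -> (T, N_PHONES)
    return softmax(segments @ W_g)

# Discriminator: scores a phoneme sequence (soft or one-hot) as real/fake.
w_d = rng.normal(scale=0.1, size=N_PHONES)
def discriminator(phone_seq):         # (T, N_PHONES) -> scalar in (0, 1)
    return 1.0 / (1.0 + np.exp(-phone_seq.mean(axis=0) @ w_d))

# "Fake" sample: phoneme distributions predicted from unlabeled audio segments.
segments = rng.normal(size=(20, DIM))
fake = generator(segments)

# "Real" sample: a one-hot phoneme sequence drawn from unpaired text.
real = np.eye(N_PHONES)[rng.integers(0, N_PHONES, size=20)]

# Adversarial objective: the discriminator maximizes this, while the generator
# minimizes the fake term (gradient updates are omitted in this sketch).
d_loss = -np.log(discriminator(real)) - np.log(1 - discriminator(fake))
```

In training, the two models would alternate gradient steps on this objective until the generator's phoneme sequences are statistically indistinguishable from real ones.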
Citations
Posted Content
DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
TL;DR: DistilHuBERT is a multi-task learning framework that distills hidden representations directly from a HuBERT model, reducing HuBERT's size by 75% and making it 73% faster while retaining most of its performance across ten different tasks.
Posted Content
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition.
Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng +11 more
TL;DR: WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total.
Posted Content
SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
TL;DR: SpeechT5 is a unified-modal encoder-decoder framework that aligns textual and speech information in a shared semantic space, using a cross-modal vector quantization method with random mixing-up to bridge speech and text.
Posted Content
Unsupervised Automatic Speech Recognition: A Review.
TL;DR: In this review, the authors identify the limitations of what can be learned from speech data alone and the minimum requirements for speech recognition; understanding these limitations would help optimize the resources and effort invested in ASR development for low-resource languages.
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
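The update rule summarized above can be sketched directly. This is a minimal numpy illustration of the Adam step (the hyperparameter defaults are the paper's published values; the toy objective f(x) = x² is an assumption for demonstration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of first and second moments."""
    m = b1 * m + (1 - b1) * grad        # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - b1**t)             # bias correction for initialization at 0
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 starting from x = 1.0.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    grad = 2 * x                        # gradient of x^2
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
```

After a few thousand steps, x should sit near the minimum at 0; the per-parameter scaling by the second-moment estimate is what makes the effective step size adaptive.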
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
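The core operation of that architecture, scaled dot-product attention, can be sketched in a few lines of numpy (matrix sizes here are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) query/key similarities
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights

# Toy usage: 2 queries attending over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
```

Each output row is a weighted average of the value vectors, with weights determined by how well the query matches each key; the 1/sqrt(d_k) scaling keeps the dot products from saturating the softmax.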
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
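The technique is simple enough to sketch. Below is a minimal "inverted dropout" variant in numpy; note this rescale-at-train-time formulation is the common modern convention, an assumption rather than the paper's original test-time-scaling formulation:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale the
    survivors by 1/(1-p) so the expected activation is unchanged.
    At test time this is the identity."""
    if not train or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

# Toy usage: roughly half the units are zeroed, the rest are doubled,
# so the mean stays near 1 in expectation.
x = np.ones(1000)
y = dropout(x, p=0.5, rng=np.random.default_rng(0))
```

Randomly dropping units during training prevents co-adaptation of features, which is the regularization effect the paper demonstrates across vision, speech, and text tasks.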
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
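The masked-input construction behind this pre-training objective can be sketched as follows. This is a simplification: real BERT replaces masked positions with [MASK] only 80% of the time (using random or unchanged tokens otherwise), and the token ids here are illustrative assumptions:

```python
import numpy as np

def mask_tokens(tokens, mask_id, p=0.15, rng=None):
    """BERT-style masked-LM input: replace ~p of the tokens with [MASK].
    The model is then trained to predict the original tokens at the masked
    positions, conditioning on context from both directions."""
    if rng is None:
        rng = np.random.default_rng()
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < p
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, mask              # mask marks positions that incur loss

# Toy usage: ids 5..99 are "words"; id 0 stands in for the [MASK] token.
rng = np.random.default_rng(0)
toks = rng.integers(5, 100, size=50)
corrupted, positions = mask_tokens(toks, mask_id=0, rng=rng)
```

Because the loss is only computed at the masked positions, the model can attend to the full sentence on both sides of each blank, which is what makes the learned representations bidirectional.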
Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
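The negative-sampling objective mentioned in this summary can be sketched as a toy numpy computation. Uniform negative sampling here is a simplification of the paper's unigram^(3/4) noise distribution, and all sizes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 100, 16, 5                        # vocab size, embedding dim, negatives

W_in = rng.normal(scale=0.1, size=(V, D))   # input (word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context) vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(word, context):
    """Negated objective: log sigma(v_c . v_w) + sum_k log sigma(-v_k . v_w).
    The true (word, context) pair is pushed up; K random 'negative' words
    are pushed down, avoiding a full softmax over the vocabulary."""
    v_w = W_in[word]
    pos = np.log(sigmoid(W_out[context] @ v_w))
    negs = rng.integers(0, V, size=K)       # uniform draw; paper uses unigram^0.75
    neg = np.log(sigmoid(-(W_out[negs] @ v_w))).sum()
    return -(pos + neg)

# Toy usage: loss for one (word, context) pair with hypothetical ids.
loss = neg_sampling_loss(word=3, context=7)
```

Training would minimize this loss over observed pairs, updating only K + 1 output vectors per example instead of all V, which is what makes the method scale to millions of phrases.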