Unsupervised speech representation learning using WaveNet autoencoders
TLDR
We consider the unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. Because the learned representation is tuned to contain only phonetic content, a high-capacity WaveNet decoder is used to infer the information discarded by the encoder from previous samples.
Abstract
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
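As a sketch of the discrete bottleneck compared in the abstract: a VQ-VAE snaps each encoder output vector to the nearest entry of a learned codebook, turning continuous latents into discrete tokens. The illustrative NumPy function below (not the paper's implementation) shows inference-time quantization only; training additionally uses a straight-through gradient estimator and codebook/commitment losses, which are omitted here.

```python
import numpy as np

def vq_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (T, D) array of encoder outputs (one D-dim vector per frame)
    codebook: (K, D) array of K learned code vectors
    Returns (z_q, indices): quantized vectors and their discrete indices.
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # one discrete token per frame
    z_q = codebook[indices]         # (T, D) quantized latents
    # During training, gradients would flow via the straight-through trick:
    # z_q = z_e + stop_gradient(z_q - z_e); omitted in this inference sketch.
    return z_q, indices

# Tiny example: 3 frames, 2-dim latents, 4-entry codebook.
codebook = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
z_e = np.array([[0.1, 0.1], [0.9, 0.2], [0.4, 0.9]])
z_q, idx = vq_quantize(z_e, codebook)
```

The discrete `idx` sequence is what the paper's phoneme-mapping analysis operates on: each utterance becomes a string of codebook indices.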
Citations
Proceedings ArticleDOI
Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition
TL;DR: A semi-supervised automatic speech recognition (ASR) system that exploits a large amount of unlabeled audio data via representation learning, in which a temporal slice of filterbank features is reconstructed from past and future context frames.
Journal ArticleDOI
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
TL;DR: WavLM jointly learns masked speech prediction and denoising during pre-training to solve full-stack downstream speech tasks, achieving state-of-the-art performance on the SUPERB benchmark.
Journal ArticleDOI
Self-Supervised Speech Representation Learning: A Review
Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe +11 more
TL;DR: This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Posted Content
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends.
TL;DR: This paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition, Speaker Recognition (SR), and Speaker Emotion recognition (SER).
Posted Content
The Zero Resource Speech Challenge 2019: TTS without T
Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux +12 more
TL;DR: The Zero Resource Speech Challenge 2019 proposed to build a speech synthesizer without any text or phonetic labels, hence TTS without T (text-to-speech without text). Participants were provided raw audio for a target voice in an unknown language (the Voice dataset), but no alignments, text, or labels.
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba +1 more
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
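The Adam update summarized above maintains exponential moving averages of the gradient and its square (the "lower-order moments"), with bias correction for their zero initialization. A minimal single-parameter sketch, with illustrative names and default hyperparameters from the paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m, v are running first/second-moment estimates; t is the 1-based step count.
    Returns the updated (theta, m, v).
    """
    m = b1 * m + (1 - b1) * grad       # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2  # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

After bias correction, the very first step moves the parameter by approximately `lr` regardless of the gradient's scale, which is part of why Adam needs little learning-rate tuning across problems.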
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A deep convolutional neural network consisting of five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art classification performance on ImageNet.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich +8 more
TL;DR: Inception, a deep convolutional neural network architecture, achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
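The TL;DR above focuses on results; mechanically, (inverted) dropout zeroes each activation with probability p during training and rescales the survivors by 1/(1-p), so no change is needed at test time. A minimal framework-free sketch (the function name and list-based formulation are illustrative):

```python
import random

def dropout(x, p=0.5, training=True, rng=random):
    """Inverted dropout over a list of activations.

    During training, each activation is kept with probability 1 - p and
    scaled by 1 / (1 - p); at test time the input passes through unchanged.
    """
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]
```

The 1/(1-p) rescaling keeps the expected value of each activation the same in training and inference, which is what lets the full network be used directly at test time.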