Open Access · Journal Article (DOI)

Unsupervised speech representation learning using WaveNet autoencoders

TLDR
This article considers unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. Because the learned representation is tuned to contain only phonetic content, a high-capacity WaveNet decoder is used to infer the information discarded by the encoder from previous samples.
Abstract
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
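To make the VQ-VAE variant concrete: its bottleneck replaces each continuous encoder output with the nearest vector from a learned codebook, yielding a discrete token per frame. The sketch below is a minimal NumPy illustration of that nearest-neighbor quantization step only, not the paper's implementation; the array shapes and codebook size are illustrative assumptions.

```python
import numpy as np

def vq_bottleneck(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (T, D) array of continuous encoder outputs (one per frame)
    codebook: (K, D) array of K learned discrete code vectors
    Returns the quantized vectors z_q and the chosen code indices.
    """
    # Squared Euclidean distance from every frame to every code: (T, K)
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # one discrete token per frame
    z_q = codebook[indices]          # (T, D) quantized representation
    return z_q, indices

# Illustrative sizes: 100 frames, 64-dim latents, 512 codes
rng = np.random.default_rng(0)
codes = rng.normal(size=(512, 64))
z_q, idx = vq_bottleneck(rng.normal(size=(100, 64)), codes)
```

In training, gradients are passed through this non-differentiable lookup with a straight-through estimator, and the codebook is updated to track the encoder outputs; both are omitted here for brevity.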


Citations
Proceedings ArticleDOI

Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

TL;DR: This article proposes a semi-supervised automatic speech recognition (ASR) system that exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames.
Journal ArticleDOI

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

TL;DR: WavLM jointly learns masked speech prediction and denoising during pre-training to solve full-stack downstream speech tasks, and achieves state-of-the-art performance on the SUPERB benchmark.
Journal ArticleDOI

Self-Supervised Speech Representation Learning: A Review

TL;DR: This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Posted Content

Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends.

TL;DR: This paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition, Speaker Recognition (SR), and Speaker Emotion recognition (SER).
Posted Content

The Zero Resource Speech Challenge 2019: TTS without T

TL;DR: The Zero Resource Speech Challenge 2019 proposed building a speech synthesizer without any text or phonetic labels, hence TTS without T (text-to-speech without text). Participants were provided raw audio for a target voice in an unknown language (the Voice dataset), but no alignments, text, or labels.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
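The "adaptive estimates of lower-order moments" in the Adam summary above refer to exponential moving averages of the gradient and its square, with a bias correction for early steps. A minimal NumPy sketch of one update, following the standard published update rule (the default hyperparameters shown are the commonly used ones, not taken from this page):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment (uncentered var) estimate
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = theta**2 from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * theta                         # gradient of theta**2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Note how the effective step size is roughly bounded by `lr` regardless of the raw gradient scale, which is what makes the method robust across problems.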
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A large deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers ending in a 1000-way softmax, achieved state-of-the-art classification performance on ImageNet.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
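Dropout, as summarized above, randomly zeroes units during training. A common way to implement it is "inverted" dropout, which rescales the surviving units at training time so that inference needs no change; the NumPy sketch below illustrates that convention (an implementation detail choice, not something stated on this page):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale
    survivors by 1/(1-p), so expected activations match inference,
    where the function is simply the identity."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p          # keep-mask, True with prob 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 4))
train_out = dropout(x, p=0.5, rng=np.random.default_rng(1))
eval_out = dropout(x, training=False)        # identity at inference time
```

With `p=0.5` each surviving unit is doubled, so entries of `train_out` are either 0 or 2 while their expectation stays 1.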