Open Access · Journal Article (DOI)

Unsupervised speech representation learning using WaveNet autoencoders

TLDR
This article considers unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. Because the learned representation is tuned to contain only phonetic content, a high-capacity WaveNet decoder is used to infer the information discarded by the encoder from previous samples.
Abstract
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
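To make the VQ-VAE variant concrete: its bottleneck replaces each continuous encoder output with the nearest vector from a learned codebook, yielding a discrete token per frame. The sketch below is a minimal NumPy illustration of that nearest-neighbor quantization step only, not the paper's implementation; the array shapes and codebook size are illustrative assumptions.

```python
import numpy as np

def vq_bottleneck(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (T, D) array of continuous encoder outputs (one per frame)
    codebook: (K, D) array of K learned discrete code vectors
    Returns the quantized vectors z_q and the chosen code indices.
    """
    # Squared Euclidean distance from every frame to every code: (T, K)
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # one discrete token per frame
    z_q = codebook[indices]          # (T, D) quantized representation
    return z_q, indices

# Illustrative sizes: 100 frames, 64-dim latents, 512 codes
rng = np.random.default_rng(0)
codes = rng.normal(size=(512, 64))
z_q, idx = vq_bottleneck(rng.normal(size=(100, 64)), codes)
```

In training, gradients are passed through this non-differentiable lookup with a straight-through estimator, and the codebook is updated to track the encoder outputs; both are omitted here for brevity.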


Citations
Proceedings ArticleDOI

Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

TL;DR: This article proposes a semi-supervised automatic speech recognition (ASR) system that exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames.
Journal ArticleDOI

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

TL;DR: WavLM jointly learns masked speech prediction and denoising during pre-training to solve full-stack downstream speech tasks, and achieves state-of-the-art performance on the SUPERB benchmark.
Journal ArticleDOI

Self-Supervised Speech Representation Learning: A Review

TL;DR: This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Posted Content

Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends.

TL;DR: This paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition, Speaker Recognition (SR), and Speaker Emotion recognition (SER).
Posted Content

The Zero Resource Speech Challenge 2019: TTS without T

TL;DR: The Zero Resource Speech Challenge 2019 proposed building a speech synthesizer without any text or phonetic labels, hence TTS without T (text-to-speech without text). Participants were provided raw audio for a target voice in an unknown language (the Voice dataset), but no alignments, text, or labels.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
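The "adaptive estimates of lower-order moments" in the Adam summary above refer to exponential moving averages of the gradient and its square, with a bias correction for early steps. A minimal NumPy sketch of one update, following the standard published update rule (the default hyperparameters shown are the commonly used ones, not taken from this page):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment (uncentered var) estimate
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = theta**2 from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * theta                         # gradient of theta**2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Note how the effective step size is roughly bounded by `lr` regardless of the raw gradient scale, which is what makes the method robust across problems.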
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A large deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers ending in a 1000-way softmax, achieved state-of-the-art classification performance on ImageNet.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
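Dropout, as summarized above, randomly zeroes units during training. A common way to implement it is "inverted" dropout, which rescales the surviving units at training time so that inference needs no change; the NumPy sketch below illustrates that convention (an implementation detail choice, not something stated on this page):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale
    survivors by 1/(1-p), so expected activations match inference,
    where the function is simply the identity."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p          # keep-mask, True with prob 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 4))
train_out = dropout(x, p=0.5, rng=np.random.default_rng(1))
eval_out = dropout(x, training=False)        # identity at inference time
```

With `p=0.5` each surviving unit is doubled, so entries of `train_out` are either 0 or 2 while their expectation stays 1.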