scispace - formally typeset
Open AccessPosted Content

Scalable and Efficient Neural Speech Coding.

Reads0
Chats0
TLDR
In this article, a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine, where quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process.
Abstract
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs linear predictive coding (LPC) module as its first autoencoder. We redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to the other autoregressive decoder-based neural speech coders, our decoder has significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity in low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves comparable subjective scores against AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.

read more

References
More filters
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings ArticleDOI

Xception: Deep Learning with Depthwise Separable Convolutions

TL;DR: This work proposes a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions, and shows that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset, and significantly outperforms it on a larger image classification dataset.
Proceedings Article

Multi-Scale Context Aggregation by Dilated Convolutions

TL;DR: This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.
Posted Content

Auto-Encoding Variational Bayes

TL;DR: In this paper, a stochastic variational inference and learning algorithm was proposed for directed probabilistic models with intractable posterior distributions and large datasets, which scales to large datasets.
Related Papers (5)