In this paper, we propose the factorized hidden layer (FHL) approach to adapt deep neural network (DNN) acoustic models for automatic speech recognition (ASR). FHL models speaker-dependent (SD) hidden layers by representing an SD affine transformation as a linear combination of bases. The combination weights are low-dimensional speaker parameters that can be initialized using speaker representations such as i-vectors and then reliably refined through unsupervised adaptation. Our method therefore provides an efficient way to perform both adaptive training and (test-time) adaptation. Experimental results show that FHL adaptation improves ASR performance significantly compared to standard DNN models, as well as to other state-of-the-art DNN adaptation approaches, such as training with speaker-normalized CMLLR features, speaker-aware training using i-vectors, and learning hidden unit contributions (LHUC). For Aurora 4, FHL achieves 3.8% and 2.3% absolute improvements over the standard DNNs trained on the LDA + STC and CMLLR features, respectively. It also achieves a 1.7% absolute improvement over a system that combines i-vector adaptive training with LHUC adaptation. For the AMI dataset, FHL achieves 1.4% and 1.9% absolute improvements over the sequence-trained CMLLR baseline systems for the IHM and SDM tasks, respectively.

Factorized Hidden Layer Adaptation for Deep Neural Network Based Acoustic Modeling
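
The central idea of the FHL abstract, an SD affine transformation expressed as a linear combination of bases, can be illustrated with a small sketch. This is not the authors' implementation: the function name `speaker_weight_matrix`, the additive shared term, the variable names, and the toy 2x2 dimensions are all assumptions made for illustration.

```python
# Hypothetical sketch of a speaker-dependent (SD) weight matrix built from bases.
# Assumption: W_s = W_shared + sum_k d_k * B_k, where d holds the low-dimensional
# speaker parameters (which the abstract says can be initialized from i-vectors).

def speaker_weight_matrix(shared, bases, speaker_params):
    """Return shared + sum_k speaker_params[k] * bases[k] (matrices as nested lists)."""
    if len(bases) != len(speaker_params):
        raise ValueError("need exactly one speaker parameter per basis")
    rows, cols = len(shared), len(shared[0])
    result = [row[:] for row in shared]  # copy so the shared weights stay untouched
    for weight, basis in zip(speaker_params, bases):
        for i in range(rows):
            for j in range(cols):
                result[i][j] += weight * basis[i][j]
    return result

# Toy values, chosen only to make the arithmetic easy to follow.
shared = [[1.0, 0.0], [0.0, 1.0]]
bases = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
speaker_params = [0.5, -0.25]

w_s = speaker_weight_matrix(shared, bases, speaker_params)
print(w_s)
```

In this sketch only `speaker_params` would change from speaker to speaker, which is what keeps the number of adapted parameters small.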

This paper studies methods for emotional statistical parametric speech synthesis (SPSS) using recurrent neural networks (RNNs) with long short-term memory (LSTM) units. Two modeling approaches, i.e., emotion-dependent modeling and unified modeling with emotion codes, are implemented and compared experimentally. In the first approach, LSTM-RNN-based acoustic models are built separately for each emotion type. A speaker-independent acoustic model estimated from multi-speaker speech data is adopted to initialize the emotion-dependent LSTM-RNNs. Inspired by the speaker code techniques developed for speech recognition and speech synthesis, the second approach builds a unified LSTM-RNN-based acoustic model using training data from a variety of emotion types. In the unified LSTM-RNN model, an emotion code vector is input to all model layers to indicate the emotion characteristics of the current utterance. Experimental results on an emotional speech synthesis database with four emotion types (neutral style, happiness, anger, and sadness) show that both approaches achieve significantly better naturalness of synthetic speech than HMM-based emotion-dependent modeling. The emotion-dependent modeling approach outperforms the unified modeling approach and the HMM-based emotion-dependent modeling in terms of the subjective emotion classification rates for synthetic speech. Furthermore, the emotion codes used by the unified modeling approach are capable of controlling the emotion type and intensity of synthetic speech effectively by interpolating and extrapolating the codes in the training set.

Emotional statistical parametric speech synthesis using LSTM-RNNs
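
The interpolation and extrapolation of emotion codes mentioned in the abstract above can be pictured as a linear blend of two code vectors. The sketch below is only an assumption about how such a blend might look: the function name `blend_codes`, the three-dimensional codes, and the specific values are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: blending two emotion code vectors.
# alpha = 0 returns the source code, alpha = 1 returns the target code,
# 0 < alpha < 1 interpolates, and alpha > 1 extrapolates beyond the target.

def blend_codes(source, target, alpha):
    """Return (1 - alpha) * source + alpha * target, element by element."""
    if len(source) != len(target):
        raise ValueError("emotion codes must have the same dimension")
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(source, target)]

# Toy codes, assumed here purely for illustration.
neutral = [0.0, 0.0, 1.0]
happy = [1.0, 0.0, 0.0]

half_happy = blend_codes(neutral, happy, 0.5)   # interpolation
extra_happy = blend_codes(neutral, happy, 1.5)  # extrapolation

print(half_happy, extra_happy)
```

Under this reading, the blended vector would be fed to the unified model in place of a code seen during training, with `alpha` acting as an intensity control.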

In recent years, deep bidirectional recurrent neural networks (DBRNN) and DBRNN with long short-term memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. Building on these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. First, we present new empirical evidence of the superiority of recurrent neural network (RNN)-based confidence classifiers, evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks. Second, we show new results on speaker-adapted confidence measures in a multitask framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts. Last, we describe an unsupervised adaptation method for the acoustic DBLSTM model, based on confidence measures, that results in better automatic speech recognition performance.

Speaker-Adapted Confidence Measures for ASR Using Deep Bidirectional Recurrent Neural Networks

This paper proposes a novel regularized adaptation method for long short-term memory (LSTM) recurrent neural network (RNN) based acoustic models trained with the connectionist temporal classification (CTC) loss function (LSTM-RNN-CTC), with the aim of improving multi-accent Mandarin speech recognition. In general, directly adjusting the network parameters with a small adaptation set may lead to over-fitting. To avoid this problem, we add a regularization term to the original training criterion. It forces the conditional probability distribution over initial and final (I/F) sequences estimated from the adapted model to stay close to that of the accent-independent (AI) model. Moreover, the hidden layers of the LSTM-RNN are not adjusted; only the accent-specific output layer is fine-tuned using this adaptation method. Experiments on the RASC863 and CASIA regional accent speech corpora show that the proposed method obtains a clear improvement over the LSTM-RNN-CTC baseline model.

CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition

This paper investigates speaker adaptation techniques for bidirectional long short-term memory (BLSTM) recurrent neural network based acoustic models (AMs) trained with the connectionist temporal classification (CTC) objective function. BLSTM-CTC AMs play an important role in end-to-end automatic speech recognition systems. However, there is a lack of research on speaker adaptation algorithms for these models. We explore three different feature-space adaptation approaches for CTC AMs: feature-space maximum likelihood linear regression, i-vector based adaptation, and maximum a posteriori adaptation using GMM-derived features. Experimental results on the TED-LIUM corpus demonstrate that speaker adaptation, applied in combination with data augmentation techniques in an unsupervised adaptation mode, provides relative word error rate reductions of up to 11-20% across the different test sets over the baseline model built on raw filter-bank features. In addition, the adaptation behavior is compared for BLSTM-CTC AMs and time-delay neural network AMs trained with the cross-entropy criterion.

https://hal.archives-ouvertes.fr/hal-01728526/document

Evaluation of Feature-Space Speaker Adaptation for End-to-End Acoustic Models
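
The abstracts in this listing report both absolute improvements (for example the FHL results) and relative reductions (as in the abstract above), which are easy to confuse. The following sketch shows the two computations side by side. The function names and the error rates of 20.0% and 17.0% are hypothetical values chosen for illustration; they are not results from any of the papers listed here.

```python
# Hypothetical sketch: absolute vs. relative word error rate (WER) reduction.
# Both error rates are given in percent.

def absolute_reduction(baseline_wer, adapted_wer):
    """Difference in percentage points between baseline and adapted WER."""
    return baseline_wer - adapted_wer

def relative_reduction(baseline_wer, adapted_wer):
    """Reduction expressed as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# Made-up numbers for illustration only.
baseline_wer = 20.0
adapted_wer = 17.0

abs_gain = absolute_reduction(baseline_wer, adapted_wer)
rel_gain = relative_reduction(baseline_wer, adapted_wer)
print(f"absolute: {abs_gain:.1f} points, relative: {rel_gain:.1f}%")
```

With these made-up numbers, a 3-point absolute gain corresponds to a 15% relative reduction, so the same improvement looks larger when it is quoted in relative terms.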

Recently, the recurrent neural network with bidirectional Long Short-Term Memory (RNN-BLSTM) acoustic model has been shown to give strong performance on TIMIT [1] and other speech recognition tasks. Meanwhile, speaker code based adaptation has been demonstrated to be a valid adaptation method for Deep Neural Network (DNN) acoustic models [2]. However, to the best of our knowledge, whether speaker code based adaptation is also valid for RNN-BLSTM has not been reported. In this paper, we study how to conduct effective speaker code based speaker adaptation on RNN-BLSTM and demonstrate that it is also a valid adaptation method for RNN-BLSTM. Experimental results on TIMIT show that adaptation of the RNN-BLSTM can achieve over 10% relative reduction in phone error rate (PER) compared to no adaptation. A set of comparative experiments is then carried out to analyze the different contributions of adaptation on the cell input and on each gate activation function of the BLSTM. We find that adaptation on the cell input activation function is more effective than adaptation on each gate activation function.

Speaker adaptation of RNN-BLSTM for speech recognition based on speaker code

Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling work has inevitable errors, which hurt the prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) are dependent on each other and produce the natural prosody together; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances the prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given the word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.

Prosospeech: Enhancing Prosody with Quantized Vector Pre-Training in Text-To-Speech
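
The ProsoSpeech abstract describes quantizing prosody information into latent vectors but does not spell out the quantizer. As a generic illustration of vector quantization, and not a description of the paper's encoder, the sketch below maps a continuous vector to its nearest codebook entry. The function name `quantize`, the squared Euclidean distance, and the toy codebook are all assumptions.

```python
# Hypothetical sketch: nearest-neighbour vector quantization.
# A continuous vector is replaced by the closest entry of a fixed codebook.

def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`."""
    if not codebook:
        raise ValueError("codebook must not be empty")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]

# Toy two-dimensional codebook, assumed here purely for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, entry = quantize([0.9, 1.2], codebook)
print(index, entry)
```

In a trained system the codebook would be learned rather than fixed by hand. This sketch only shows the lookup step.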

We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, so our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This gives our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We evaluate our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

PolyVoice: Language Models for Speech to Speech Translation

Recently, speaker code based adaptation has been successfully extended to recurrent neural networks using bidirectional Long Short-Term Memory (BLSTM-RNN) [1]. Experiments on the small-scale TIMIT task have demonstrated that speaker code based adaptation is also valid for BLSTM-RNN. In this paper, we evaluate this method on a large-scale task and introduce an error normalization method to balance the back-propagation errors derived from different layers for speaker codes. In addition, we use singular value decomposition (SVD) to perform model compression. Results show that speaker code based adaptation with SVD gives better recognition performance than i-vector based speaker adaptation of the same dimension. Experimental results on the Switchboard task show that speaker code based adaptation on the hybrid BLSTM-DNN topology can achieve more than 9% relative reduction in word error rate (WER) compared to the speaker-independent (SI) baseline.

Unsupervised speaker adaptation of BLSTM-RNN for LVCSR based on speaker code
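
The abstract above uses SVD for model compression without giving layer sizes. As a rough illustration of why a truncated SVD compresses a layer, the sketch below compares the parameter count of a full weight matrix with that of its rank-k factorization into two smaller matrices. The function name `low_rank_param_counts` and the 2048 x 2048 layer with rank 256 are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: parameter savings from a rank-k factorization.
# A (rows x cols) weight matrix W is approximated by A @ B,
# with A of shape (rows x rank) and B of shape (rank x cols),
# e.g. obtained by keeping the top-k singular values of W.

def low_rank_param_counts(rows, cols, rank):
    """Return (full_count, factored_count) for a rank-`rank` factorization."""
    if not 0 < rank <= min(rows, cols):
        raise ValueError("rank must be between 1 and min(rows, cols)")
    full_count = rows * cols
    factored_count = rank * (rows + cols)
    return full_count, factored_count

# Layer size and rank are made-up values for illustration only.
full, factored = low_rank_param_counts(2048, 2048, 256)
print(full, factored, factored / full)
```

The factorization only saves parameters when `rank` is smaller than `rows * cols / (rows + cols)`, which is 1024 for this made-up layer.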

Recently, several fast speaker adaptation methods have been proposed for hybrid DNN-HMM models based on the so-called discriminative speaker codes (SC) [1-3] and applied to unsupervised speaker adaptation in speech recognition [4]. It has been demonstrated that SC based methods are quite effective in adapting DNNs even when only a very small amount of adaptation data is available. However, this approach requires estimating a speaker code for each new speaker through an updating process and obtaining the final results through two-pass decoding. In this paper, we propose an alternative d-code extraction method to replace SC, based on modeling speaker information with a BLSTM-RNN, which makes one-pass decoding possible. A speaker clustering approach is then introduced to decrease the number of targets of the speaker-BLSTM, which accelerates training and improves ASR performance at the same time. In addition, an interpolation method is provided for making use of d-codes from the training set to improve recognition accuracy, especially when adaptation data is limited. Experimental results on the Switchboard task show that the proposed methods lead to a relative reduction in WER (about 9%) comparable to that of the standard SC based adaptation method, without the need for two-pass decoding.

Zhiying Huang

Papers

Speaker adaptation of RNN-BLSTM for speech recognition based on speaker code

Prosospeech: Enhancing Prosody with Quantized Vector Pre-Training in Text-To-Speech

PolyVoice: Language Models for Speech to Speech Translation

Unsupervised speaker adaptation of BLSTM-RNN for LVCSR based on speaker code

Rapid speaker adaptation based on D-code extracted from BLSTM-RNN in LVCSR