Peng Liu
Researcher at Tencent
Publications - 13
Citations - 362
Peng Liu is an academic researcher from Tencent. The author has contributed to research on topics including speech synthesis and word error rate. The author has an h-index of 7 and has co-authored 13 publications receiving 254 citations. Previous affiliations of Peng Liu include Tsinghua University.
Papers
Posted Content
DurIAN: Duration Informed Attention Network For Multimodal Synthesis.
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
TL;DR: DurIAN is shown to generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while avoiding the word skipping/repeating errors those systems exhibit.
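The key idea behind a duration-informed network is to replace learned attention alignment with explicit phone durations: each phone-level encoding is repeated for its predicted number of acoustic frames. A minimal sketch of that expansion step (the function name and data are hypothetical, not from the paper):

```python
import numpy as np

def expand_by_duration(phone_states, durations):
    """Repeat each phone-level state for its predicted number of
    acoustic frames, producing a frame-level sequence. This explicit
    alignment avoids the skipped/repeated words that attention-based
    alignment can produce."""
    return np.repeat(phone_states, durations, axis=0)

phones = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # 3 phones, dim 2
durs = np.array([2, 1, 3])                               # frames per phone
frames = expand_by_duration(phones, durs)
print(frames.shape)  # (6, 2): total frames = sum of durations
```

Because every phone is guaranteed exactly its predicted span of frames, no input symbol can be dropped or duplicated during decoding.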
Proceedings ArticleDOI
A deep recurrent approach for acoustic-to-articulatory inversion
TL;DR: Experimental results indicate that a recurrent model can produce more accurate predictions for acoustic-to-articulatory inversion than a deep neural network with a fixed-length context window.
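The contrast between the two architectures is that a recurrent model carries state across the whole utterance, whereas a fixed-window DNN only ever sees a bounded slice of acoustic context. A minimal Elman-style recurrence illustrating this (weights and dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_h = 4, 8
W_xh = rng.standard_normal((dim_in, dim_h)) * 0.1
W_hh = rng.standard_normal((dim_h, dim_h)) * 0.1

def rnn_forward(x_seq):
    """Elman-style recurrence: each hidden state depends on the entire
    acoustic history via W_hh, unlike a feed-forward DNN limited to a
    fixed-length context window around the current frame."""
    h = np.zeros(dim_h)
    outputs = []
    for x_t in x_seq:
        h = np.tanh(x_t @ W_xh + h @ W_hh)  # history folded into h
        outputs.append(h)
    return np.stack(outputs)

acoustic = rng.standard_normal((10, dim_in))  # 10 frames of features
hidden = rnn_forward(acoustic)
print(hidden.shape)  # (10, 8): one hidden state per input frame
```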
Proceedings ArticleDOI
DurIAN: Duration Informed Attention Network for Speech Synthesis.
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
TL;DR: The proposed DurIAN system is shown to generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while remaining robust and stable.
Proceedings ArticleDOI
Voice activity detection using visual information
Peng Liu, Zuoying Wang +1 more
TL;DR: The experiments show that visual-information-based VAD achieves a substantial reduction in frame error rate, and that the audio-visual stream can be segmented into sentences for recognition much more precisely than with the frame-energy-based approach in the clean-audio case.
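The metric the summary refers to, frame error rate, is simply the fraction of frames whose speech/non-speech decision disagrees with the reference segmentation. A minimal sketch of that computation (the function and labels are illustrative, not the paper's code):

```python
def frame_error_rate(pred, ref):
    """Fraction of frames whose speech(1)/non-speech(0) label
    disagrees with the reference segmentation."""
    assert len(pred) == len(ref)
    return sum(p != r for p, r in zip(pred, ref)) / len(ref)

# One frame out of four is mislabeled:
print(frame_error_rate([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.25
```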
Proceedings ArticleDOI
Learning cross-lingual information with multilingual BLSTM for speech synthesis of low-resource languages
TL;DR: A multilingual BLSTM that shares hidden layers across different languages is proposed, together with a training approach that best utilizes training data from both the auxiliary and target languages.
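The shared-hidden-layer idea can be sketched as a common trunk whose parameters are trained on all languages, with a small language-specific output projection per language. A minimal feed-forward illustration of that parameter-sharing pattern (dimensions and names are hypothetical; the paper uses a BLSTM, not this toy network):

```python
import numpy as np

rng = np.random.default_rng(1)
dim_in, dim_h, dim_out = 5, 16, 3

# Hidden layer shared across every language; only the output
# projection ("head") is language-specific.
W_shared = rng.standard_normal((dim_in, dim_h)) * 0.1
heads = {lang: rng.standard_normal((dim_h, dim_out)) * 0.1
         for lang in ("target", "auxiliary")}

def forward(x, lang):
    h = np.tanh(x @ W_shared)   # cross-lingual representation
    return h @ heads[lang]      # language-specific acoustic output

x = rng.standard_normal(dim_in)
y = forward(x, "target")
print(y.shape)  # (3,)
```

Because W_shared receives gradients from both languages during training, the data-rich auxiliary language helps shape representations that the low-resource target language can reuse.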