Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes
Citations
NAUTILUS: A Versatile Voice Cloning System
A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation
References
Speaker Verification Using Adapted Gaussian Mixture Models
Bayesian Speaker Verification with Heavy-Tailed Priors
The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Frequently Asked Questions (10)
Q2. What have the authors stated as future work in "Unsupervised speaker adaptation for DNN-based speech synthesis using input codes"?
Their future work includes evaluation of the proposed technique using MP3 or AMR codec speech and speech recorded under real conditions as adaptation data.
Q3. What acoustic features were extracted for the speech-synthesis model?
For the speech-synthesis model, WORLD analysis [20], [21] was used to extract 259-dimensional acoustic feature vectors every 5 ms (each feature comprising 59-dimensional mel-spectral coefficients, a linearly interpolated fundamental frequency on the mel scale, and 25-dimensional band aperiodicities, along with their delta and delta-delta features).
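The frame-wise feature layout described above (static coefficients plus delta and delta-delta dynamics) can be sketched in NumPy. The [-0.5, 0, 0.5] regression window below is a common choice and an assumption here, since the exact delta windows used in the paper are not quoted on this page:

```python
import numpy as np

def add_dynamic_features(static):
    """Append delta and delta-delta features to frame-wise static features.

    static: (T, D) array of per-frame acoustic features.
    Returns a (T, 3*D) array [static, delta, delta-delta].
    Uses the common [-0.5, 0, 0.5] window (assumption; the paper's
    exact windows may differ).
    """
    def delta(x):
        # Pad first/last frames so the output keeps T frames.
        padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
        return 0.5 * (padded[2:] - padded[:-2])

    d1 = delta(static)
    d2 = delta(d1)
    return np.concatenate([static, d1, d2], axis=1)

# A D-dimensional static vector per frame becomes 3*D dimensions
# once the delta and delta-delta streams are appended.
feats = add_dynamic_features(np.random.randn(200, 85))
```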
Q4. How many speakers were used to train the speaker-verification models?
The speech from 56 males and 56 females was used to train the speaker-verification models and the multi-speaker speech-synthesis models.
Q5. How many speakers and utterances were used as adaptation materials?
For the adaptation experiments, either 10, 50, or 100 utterances from each of the 23 speakers not included in the training set were used as adaptation materials.
Q6. What is the procedure for training a multi-speaker speech-synthesis model?
1) First, text-independent speaker-verification models are constructed for each of the training speakers included in a speech database, which is also used for training the multi-speaker speech-synthesis model.
Q7. How was the speech data used for the adaptation experiments?
For their experiments, the Japanese Voice Bank corpus, containing studio-quality native Japanese speech uttered by 65 males and 70 females aged between 10 and 89, was used.
Q8. How many speakers were used in the experiments?
The same utterances and speakers used in the experiments with only studio-quality speech data were used for training and adaptation, although 100 utterances from each of the target speakers were used as adaptation data.
Q9. How was the SNR of low quality speech used for training the speaker-verification models?
The signal-to-noise ratio (SNR) of the low-quality speech used for training the speaker-verification models was adjusted using α in Eq. (2).
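Since Eq. (2) is not reproduced on this page, the snippet below is only a generic sketch of how a scalar α can scale additive noise to reach a target SNR; the actual definition of α in the paper may differ:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale additive noise by alpha so that speech + alpha * noise
    has the requested SNR in dB.

    Generic sketch; the paper's Eq. (2) may define alpha differently.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_speech / (alpha^2 * p_noise) = 10^(snr_db / 10) for alpha.
    alpha = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + alpha * noise, alpha
```

With this scaling, lowering `snr_db` increases α, i.e. the noise component in the training speech grows, which is how noisier training conditions would be simulated.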
Q10. Why is the speaker-verification model unable to output the appropriate speaker-similarity vector?
This is because F0 extraction from a low-quality speech waveform was unreliable, so the speaker-verification models using F0 features could not output an appropriate speaker-similarity vector for speaker adaptation.