Citation (peer-reviewed author manuscript): Takaki, S., Nishimura, Y. & Yamagishi, J. (2019). Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018, IEEE, Honolulu, Hawaii, USA, pp. 649-658. https://doi.org/10.23919/APSIPA.2018.8659621

Unsupervised Speaker Adaptation for
DNN-based Speech Synthesis using Input Codes
Shinji Takaki, Yoshikazu Nishimura, Junichi Yamagishi
National Institute of Informatics, Tokyo, Japan
alt Inc., Tokyo, Japan
Abstract—A new speaker-adaptation technique for deep neural
network (DNN)-based speech synthesis which requires only
speech data without orthographic transcriptions is proposed.
This technique is based on a DNN-based speech-synthesis model
that takes speaker, gender, and age into consideration as addi-
tional inputs and outputs acoustic parameters of corresponding
voices from text in order to construct a multi-speaker model
and perform speaker adaptation. It uses a new input code that
represents acoustic similarity to each of the training speakers in terms of probability. The new input code, called “speaker-similarity vector,” is obtained by concatenating posterior probabilities
calculated from each model of the training speakers. GMM-UBM
or i-vector/PLDA, which are widely used in text-independent
speaker verification, are used to represent the speaker models,
since they can be used without text information. Text and the
speaker-similarity vectors of the training speakers are used
as input to first train a multi-speaker speech-synthesis model,
which outputs acoustic parameters of the training speakers.
A new speaker-similarity vector is then estimated by using a
small amount of speech data uttered by an unknown target
speaker on the basis of the separately trained speaker models.
It is expected that inputting the estimated speaker-similarity
vector into the multi-speaker speech-synthesis model can generate
synthetic speech that resembles the target speaker’s voice. In
objective and subjective experiments, adaptation performance of
the proposed technique was evaluated using not only studio-
quality adaptation data but also low-quality (i.e., noisy and
reverberant) data. The results of the experiments indicate that
the proposed technique makes it possible to rapidly construct a
voice for the target speaker in DNN-based speech synthesis.
I. INTRODUCTION
The flexibility and controllability of speech-synthesis sys-
tems are as important as naturalness of speech in some appli-
cations; hence, constructing such a flexible speech-synthesis
system is an interesting research topic in the field of DNN-
based speech synthesis. A variety of multi-speaker modeling
and speaker-adaptation techniques for DNN-based speech syn-
thesis have been proposed recently. Multi-speaker modeling is
a technique for synthesizing voices of various speakers by
using a common model, and speaker adaptation is a technique
for estimating a new acoustic model by using a small amount
of speech data uttered by a new target speaker or in a new
speaking style (e.g., a different emotion). To give a few
examples of the multi-speaker modeling in the field of DNN-
based speech synthesis, using speaker codes that represent
a speaker’s identity for multi-speaker modeling, in which
additional inputs are used to distinguish speakers, has been
proposed [1], [2], [3]. Speaker-adaptation techniques using i-vectors as an additional input, an adaptation method for speech recognition called “learning hidden-unit contributions” [4], linear transforms defined by Gaussian mixture models (GMMs), and combinations of those methods have also been proposed [5]. In another study, it was assumed that the
output layer in a DNN captures most speaker differences,
and under that assumption, it was attempted to estimate a
speaker-dependent output layer by using individual speaker’s
data while keeping the hidden network layers shared across
all speakers [6].
Prior to the present study, a DNN-based acoustic model us-
ing auxiliary features referred to as input codes was proposed
[7]. In that model, to more effectively retain speaker voice
characteristics and allow speaker adaptation, a speaker’s iden-
tity, gender, and age classes were additionally used. Speaker
adaptation was performed by estimating a new speaker code
based on back-propagation (BP) using a small amount of the
target speaker’s speech data and associated linguistic features
obtained from text. Almost all other adaptation techniques
proposed for DNN synthesis are based on BP [1], [5], [6], [8],
[9]; hence, not only speech data but also linguistic features are
always required.
In this study, a new speaker-adaptation technique for DNN-
based speech synthesis which requires only speech data with-
out orthographic transcriptions is proposed. This technique is
traditionally called unsupervised speaker adaptation for speech
synthesis [10]. A naive approach is to obtain transcriptions using external automatic speech recognition and then apply conventional speaker adaptation based on the speech and the automatically generated transcriptions [11]. However, this procedure may cause problems when the speech-recognition outputs contain severe errors.
The proposed technique uses a new input code designed for
speaker adaptation without using text. The new input code,
called “speaker-similarity vector”, represents acoustic similar-
ity of a target speaker to each of several training speakers
in terms of probability, and it is obtained by concatenating
posterior probabilities calculated from each of the models
of the training speakers. Intuitively, this process may be
viewed as replacing a conventional binary hard speaker code
with continuous soft codes according to speaker similarity.
Therefore, if a multi-speaker speech-synthesis model is trained
using text, and speaker-similarity vectors are used as input, it
can be expected that speaker characteristics of synthetic speech
generated from the trained multi-speaker model will depend
on the speaker-similarity vectors, and the synthetic speech will

vary if the speaker similarity vectors change. Furthermore,
if a new speaker-similarity vector is estimated by using a
small amount of speech data uttered by an unknown target
speaker (on the basis of separately trained speaker models),
and the estimated speaker-similarity vector is input into a
multi-speaker speech synthesis model, the resulting synthetic
speech will probably resemble the voice of the new target
speaker. More importantly, the speaker-similarity vector can be
computed by using widely used text-independent automatic-
speaker-verification models, such as the Gaussian Mixture
Model-Universal Background Model (GMM-UBM) [12] and
i-vector probabilistic linear discriminant analysis (PLDA) [13],
without the need for text information; hence, unsupervised
speaker adaptation can be achieved.
In this paper, we also train robust speaker-verification models to calculate appropriate posterior probabilities from low-quality (i.e., noisy and reverberant) speech, so that the target speaker's voice can be synthesized even if low-quality speech is given as adaptation data. An issue to be addressed is
mismatch between recording conditions for the training data
and adaptation data fed into speaker-verification models, in
which posterior probabilities are calculated from low-quality
speech via speaker-verification models trained by using studio-
quality speech data. To alleviate this conditional mismatch
between training data and adaptation data, low-quality speech
data are artificially created, and speaker-verification models are
trained using the created data instead of studio-quality speech
data.
The remainder of the paper is organized as follows. Section 2 de-
scribes multi-speaker modeling and the conventional speaker-
adaptation technique using input codes. Section 3 explains the
proposed speaker-adaptation technique that does not require
text information. Section 4 describes how to artificially create
noisy and reverberant speech. In Section 5, the proposed
approaches are evaluated by using studio-quality speech data,
and in Section 6, they are evaluated by using low-quality
speech data. Section 7 concludes the paper.
II. MULTI-SPEAKER SPEECH SYNTHESIS AND SPEAKER
ADAPTATION USING INPUT CODES
The previously proposed multi-speaker speech-synthesis
model [7] was trained using simply pooled data of multiple
speakers. Identity, gender, and age of the speaker are rep-
resented using one-hot vectors, binary values (0: Female; 1:
Male), and raw age values, respectively, and added as a part
of the input to the neural network.
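For illustration, such an input code could be assembled as in the following sketch (ours; the ordering of the elements and the lack of any scaling of the age value are assumptions, not details given in the paper).

```python
import numpy as np

def build_input_code(speaker_idx, n_speakers, is_male, age):
    """Concatenate speaker one-hot, gender bit (0: female, 1: male), and raw age."""
    speaker_code = np.zeros(n_speakers)
    speaker_code[speaker_idx] = 1.0          # one-hot speaker identity
    return np.concatenate([speaker_code, [float(is_male)], [float(age)]])

# e.g. build_input_code(speaker_idx=3, n_speakers=112, is_male=0, age=24)
```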
To adapt the above multi-speaker speech-synthesis models
to a new speaker, the BP algorithm is used to minimize the
mean-square prediction error over a small amount of data
uttered by the target speaker according to the study by Bridle
and Cox [14]. Note that the BP algorithm only updates the
speaker codes, without changing the DNN weights, in contrast
to algorithms developed in other studies, e.g., [1], that used
fixed codes but added new weights. The BP algorithm starts from the average speaker code and iterates until a stopping criterion is satisfied, yielding a new speaker code for the target speaker.
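To make the BP-based update concrete, the following PyTorch-style sketch (ours, not the authors' implementation; the model interface, learning rate, and number of steps are placeholder assumptions, and the gender/age codes are omitted for brevity) freezes the network weights and back-propagates the mean-square prediction error into the speaker code only.

```python
import torch

def adapt_speaker_code(model, linguistic_feats, target_acoustics,
                       avg_speaker_code, n_steps=100, lr=0.01):
    """Estimate a new speaker code by back-propagation (BP).

    model            : trained multi-speaker DNN (weights are kept fixed)
    linguistic_feats : [T, D_ling] linguistic features of the adaptation data
    target_acoustics : [T, D_out]  acoustic features of the target speaker
    avg_speaker_code : [D_code]    initialization (average speaker code)
    """
    for p in model.parameters():           # freeze all network weights
        p.requires_grad_(False)

    code = avg_speaker_code.clone().requires_grad_(True)   # only the code is trainable
    optimizer = torch.optim.SGD([code], lr=lr)

    for _ in range(n_steps):                # until a stopping criterion is met
        optimizer.zero_grad()
        inputs = torch.cat([linguistic_feats,
                            code.expand(linguistic_feats.size(0), -1)], dim=1)
        pred = model(inputs)
        loss = torch.nn.functional.mse_loss(pred, target_acoustics)
        loss.backward()                     # gradients flow only into `code`
        optimizer.step()

    return code.detach()
```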
III. UNSUPERVISED SPEAKER ADAPTATION USING A
SPEAKER-SIMILARITY VECTOR
A. Flow of the proposed unsupervised speaker-adaptation
technique
This subsection explains the proposed unsupervised speaker-adaptation technique using speaker-similarity vectors. The procedure for training a multi-speaker model and performing speaker adaptation on the basis of the speaker-similarity vectors consists of the following steps (a minimal code sketch of Steps 2 and 4 is given after the list).
1) First, text-independent speaker verification models are
constructed for each of the training speakers included
in a speech database, which is also used for training
the multi-speaker speech synthesis model. GMM-UBM
[12] or i-vector/PLDA [13] is used as a text-independent
speaker verification model.
2) Then, the posterior probability of each training speaker
given by one of the multiple text-independent speaker
verification models in Step 1 is computed. The ob-
tained posterior probabilities are concatenated to form
a speaker similarity vector for each of the training
speakers. 112-dimensional speaker similarity vectors are
obtained (since the number of training speakers was
112).
3) Next, the speaker similarity vectors computed in Step
2 are used to replace the one-hot-vector based speaker
code, and a DNN-based multi-speaker speech synthesis
model is constructed. Linguistic features, gender, and age codes are the same as those used in the systems described in Section 2. (The same technique could also be used to estimate age and gender codes; however, this option is not explored in this paper due to space limitations.)
4) Speaker adaptation is performed as follows. A speaker-
similarity vector of an unknown target speaker is esti-
mated in a similar way as in Step 2: the posterior prob-
abilities of the target speaker given by the multiple text-
independent speaker-verification models are computed,
and the obtained posterior probabilities are concatenated
to form a speaker-similarity vector.
5) The estimated speaker-similarity vector of the target
speaker is used as a new speaker code of the above
multi-speaker speech-synthesis model, thereby changing
the speaker characteristics of synthetic speech.
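As a concrete illustration of Steps 2 and 4, the sketch below (ours; it uses scikit-learn GaussianMixture models as simplified stand-ins for the MAP-adapted GMM-UBM speaker models) normalizes per-speaker average log-likelihoods of an utterance with a softmax to obtain a posterior-like speaker-similarity vector.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speaker_similarity_vector(feats, speaker_gmms):
    """Form a speaker-similarity vector from per-speaker GMM scores (Steps 2 and 4).

    feats        : [T, D] acoustic features (e.g., MFCCs) of one speaker's speech
    speaker_gmms : list of fitted GaussianMixture models, one per training speaker
    returns      : [N_speakers] vector of probability-like similarities (sums to one)
    """
    # Average per-frame log-likelihood of the speech under each speaker model
    scores = np.array([gmm.score(feats) for gmm in speaker_gmms])
    # Softmax over speakers to obtain a probability-like similarity vector
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Simplified speaker models (the paper uses MAP-adapted GMM-UBMs instead):
# speaker_gmms = [GaussianMixture(n_components=64).fit(f) for f in per_speaker_feats]
```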
B. Speaker-verification models
For text-independent speaker verification, the GMM-UBM
[12] and i-vector/PLDA [13] approaches are widely used.
The proposed technique also uses these approaches. As for
the GMM-UBM approach, a speaker model is obtained by
adapting parameters of GMMs trained using speech data of
many speakers [12]. As for i-vector/PLDA, i-vectors are first
computed from sufficient statistics of speech data and are
regarded as observations for a Gaussian PLDA model given
as
w
u
=
¯
w + Φβ + Γα
u
+
u
, (1)
1
The same technique may also be used to estimate age and gender codes.
However, this option is not explored in this paper due to space limitation.

where
¯
w is a speaker-independent supervector. Φ and Γ rep-
resent eigenvoice matrices for speaker- and channel-dependent
components, respectively. Speaker and channel factors, β and
α
u
, are assumed to have a standard Gaussian distribution as a
prior distribution. In this study, the third term in Eq. (1) was
not used.
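The same normalization idea applies in the i-vector/PLDA case once per-speaker scores are available. The sketch below (ours) assumes that PLDA log-likelihood-ratio scores of the adaptation speech against each training speaker have already been computed (e.g., with a toolkit such as SIDEKIT, which is used in Section V) and simply maps them to a probability-like speaker-similarity vector; the temperature parameter is an illustrative extra, not part of the paper.

```python
import numpy as np

def similarity_from_plda_scores(scores, temperature=1.0):
    """Map PLDA scores (one per training speaker) to a speaker-similarity vector.

    scores      : [N_speakers] log-likelihood-ratio scores of the adaptation speech
    temperature : softens (>1) or sharpens (<1) the resulting distribution
    """
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()
```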
C. Advantage of proposed framework
As mentioned earlier, several techniques for speaker adapta-
tion using i-vectors [5] or d-vectors [15] have been developed.
As for the former, i-vectors are directly used as inputs for
DNN-based speech synthesis. On the other hand, as for the
proposed framework, GMM-UBM or i-vector/PLDA is used
only to calculate posterior probabilities for each training
speaker. That is, the proposed multi-speaker speech-synthesis
model does not depend on any acoustic parameterization
or dimensions of i-vectors and has weaker dependency on
acoustic features.
An unsupervised speaker-adaptation technique using a
bottle-neck layer of a DNN-based speaker-recognition model
for DNN-based speech synthesis was proposed by Doddipatla
et al. [15]. In that technique, PCA is applied to the bottleneck features of the DNN-based speaker-recognition model, and the first eigenvector is interpolated on the basis of the posterior probabilities of the speaker-recognition model. In comparison, we argue that the proposed technique is much simpler and more intuitive for constructing a flexible multi-speaker speech-synthesis model.
IV. SPEAKER-VERIFICATION MODELS ROBUST AGAINST
LOW-QUALITY ADAPTATION SPEECH DATA
Speech used as adaptation data is usually low quality
because recording studio-quality speech incurs high cost. In
this study, robust speaker verification models are trained to
perform the proposed unsupervised speaker adaptation without
significant degradation of speech quality even if low-quality
speech is given as adaptation data. To train speaker-verification
models robust against low-quality speech data, the mismatch
between recording conditions for training and adaptation
speech data fed into the models needs to be alleviated. Hence,
low-quality speech data is artificially created by adding noise
and reverberation to studio-quality speech, and the created data
is used for training the speaker-verification models.
The low-quality speech data was artificially created by using
the Demand noise database [16] and the ACE Challenge
reverberant database [17]. The low-quality speech was created
by adding noise from the Demand database and reverberation
from the ACE Challenge database to studio-quality speech
waveforms. It is assumed that adaptation speech data used
for speech synthesis is recorded in indoor rooms, so noise and
room impulse responses recorded in an office or meeting room
were used. The first channel of office-and-meeting-room noise
recordings at 48 kHz sampling frequency was selected from
the Demand database, and room impulse responses recorded
in an office or meeting room (Office 1 and Meeting Room 1)
were selected from the ACE database. Noise and reverberation
were added to studio-quality speech in the same way as used
in [18] as follows,
y = x ∗ h_1 + α(n ∗ h_2),    (2)

where x and n represent a studio-quality speech waveform and a noise waveform, respectively, h_1 and h_2 represent the room impulse responses, ∗ is a convolution operator, and α is used for adjusting the signal-to-noise ratio (SNR). The room impulse responses h_1 and h_2 were recorded using microphones located at positions 1 and 2.
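A minimal numpy/scipy sketch of Eq. (2) is given below (ours; the use of fftconvolve, the trimming of the convolved signals, and the way α is solved from a target SNR are assumptions rather than the authors' exact pipeline). α is chosen so that the reverberant speech and the scaled reverberant noise reach the desired signal-to-noise ratio.

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(x, n, h1, h2, snr_db):
    """Create low-quality speech: y = x*h1 + alpha*(n*h2), as in Eq. (2).

    x, n   : studio-quality speech and noise waveforms (same sampling rate;
             n is assumed to be at least as long as x)
    h1, h2 : room impulse responses for the speech and noise positions
    snr_db : desired signal-to-noise ratio in dB
    """
    xr = fftconvolve(x, h1)[: len(x)]       # reverberant speech
    nr = fftconvolve(n, h2)[: len(x)]       # reverberant noise, trimmed to length
    p_x = np.mean(xr ** 2)
    p_n = np.mean(nr ** 2) + 1e-12
    # Solve 10*log10(p_x / (alpha^2 * p_n)) = snr_db for alpha
    alpha = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return xr + alpha * nr
```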
In our experiments, the adaptation speech data was also artificially degraded in the same way. Using speech waveforms recorded under real conditions as adaptation data is left for future work.
V. EXPERIMENTS USING STUDIO-QUALITY SPEECH DATA
The proposed technique for unsupervised speaker adapta-
tion using studio-quality speech data as adaptation data was
evaluated as described below.
A. Experimental conditions
Speech database: For our experiments, the Japanese Voice
Bank corpus, containing studio-quality native Japanese speech
uttered by 65 males and 70 females aged between 10 and
89, was used. The speech from 56 males and 56 females
was used to train the speaker-verification models and the
multi-speaker speech-synthesis models. The speech from the
remaining speakers (9 males and 14 females) was saved for
speaker adaptation. With approximately 100 utterances per
speaker, this dataset yielded a total of 11,154 training-data
utterances. For the adaptation experiments, either 10, 50, or
100 utterances from each of the 23 speakers not included in the
training set were used as adaptation materials. The sampling
frequency of the speech-signal waveform was 48 kHz. Speaker
adaptation was evaluated by using 10 utterances per speaker
not included in either the training or adaptation sets.
Speaker-verification models: To train the speaker verification
models, an open-source toolkit called SIDEKIT [19] was used.
The acoustic features used for training these models are listed
in Table I. Since spectral features (20-dimensional MFCCs) and the fundamental frequency/F0 (1-dimensional) differ significantly in dimensionality, 20-dimensional F0 features were also obtained by applying a discrete cosine transform (DCT) to the fundamental-frequency values of the current frame and the previous and next 32 frames. Moreover, instead of the standard MFCC,
spectral features used for speech synthesis models, referred to
as MGC, were investigated. Then, GMMs with 64 mixtures
were trained to extract 400-dimensional i-vectors. The size of
the eigenvoice matrices for the speaker dependent components
in Gaussian PLDA was 20.
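The 20-dimensional F0 features can be sketched as follows (our reading of the description: a DCT over a window consisting of the current frame and the previous and next 32 frames, keeping the first 20 coefficients; the edge padding and prior interpolation of unvoiced frames are our assumptions).

```python
import numpy as np
from scipy.fft import dct

def f0_dct_features(f0, context=32, n_coef=20):
    """Per-frame F0 features from a window of the current frame +/- `context` frames.

    f0 : [T] fundamental-frequency contour (unvoiced frames already interpolated)
    """
    T = len(f0)
    padded = np.pad(f0, context, mode="edge")     # replicate edge values
    feats = np.empty((T, n_coef))
    for t in range(T):
        window = padded[t : t + 2 * context + 1]  # current frame +/- 32 frames
        feats[t] = dct(window, norm="ortho")[:n_coef]
    return feats
```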
Speech-synthesis models: For extracting the acoustic features
for the speech-synthesis model, WORLD analysis [20], [21]
was used to obtain 259-dimensional acoustic feature vectors
every 5 ms (each feature comprising 59-dimensional mel-
spectral coefficients, a linearly interpolated fundamental fre-
quency on the mel scale, and 25-dimensional band aperiodici-
ties, along with their delta and delta-delta). The 259th feature

TABLE I
ACOUSTIC FEATURES USED FOR SPEAKER-VERIFICATION MODELS.
MFCC: 19-dim MFCCs (plus energy), Δ, Δ²
MGC:  19-dim WORLD mel-cepstrum (plus 0th), Δ, Δ²
F0:   20-dim features derived from F0, Δ, Δ²
TABLE II
FOUR MULTI-SPEAKER SPEECH-SYNTHESIS MODELS USED FOR SPEAKER-ADAPTATION EXPERIMENTS. g AND i DENOTE GMM AND i-vector, RESPECTIVELY.
System           | Multi-speaker model     | Adaptation
averaged         | one-hot vector          | —
supervised       | one-hot vector          | vector estimated by BP
unsupervised (g) | speaker-similarity vec. | obtained from GMM-UBM
unsupervised (i) | speaker-similarity vec. | obtained from i-vector/PLDA
The 259th feature was a binary voiced/unvoiced flag. 389-dimensional linguistic
features were used as an input vector. This input vector was
augmented with speaker, gender, and age codes. The oracle
duration was used since it makes it possible to easily compute
objective measures such as mel-cepstral distortion. All multi-
speaker speech synthesis models were feedforward DNNs with
five hidden layers of 1024 nodes each. Sigmoid activation
functions were used for all units in the hidden and output
layers. The models were initialized randomly and trained to
minimize the mean square error by stochastic gradient descent.
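A compact sketch of such a network is shown below (ours; the input split into 389 linguistic dimensions, a 112-dimensional speaker code, and gender/age codes, the five sigmoid hidden layers of 1024 units, the 259-dimensional sigmoid output, and MSE training follow the text, while everything else, including the class name, is an assumption).

```python
import torch
import torch.nn as nn

class MultiSpeakerDNN(nn.Module):
    """Feedforward acoustic model: linguistic features + input codes -> acoustic features."""

    def __init__(self, ling_dim=389, spk_dim=112, extra_dim=2,   # gender + age codes
                 hidden=1024, n_hidden=5, out_dim=259):
        super().__init__()
        layers, in_dim = [], ling_dim + spk_dim + extra_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        layers += [nn.Linear(in_dim, out_dim), nn.Sigmoid()]      # sigmoid output, per the text
        self.net = nn.Sequential(*layers)

    def forward(self, ling, spk_code, extra):
        # spk_code is a one-hot vector or a speaker-similarity vector
        return self.net(torch.cat([ling, spk_code, extra], dim=-1))

# Training follows the paper: random initialization, MSE loss, stochastic gradient descent, e.g.
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
```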
Speaker adaptation: The proposed unsupervised speaker-
adaptation technique was compared with a supervised speaker-
adaptation technique using speaker codes. Systems constructed
for the experiments are listed in Table II. The averaged system is a reference system that uses one-hot vectors to train a multi-speaker model and replaces all one-hot-vector elements with their average value during synthesis; it can thus be viewed as an average-voice system. In the supervised system, the multi-speaker model is the same as that used in the averaged system, but the speaker code for the target speaker is estimated on the basis of BP. The unsupervised systems (GMM and i-vector) are the proposed unsupervised speaker-adaptation systems, in which speaker-similarity vectors are estimated by using GMM-UBM or i-vector/PLDA, respectively.
B. Objective evaluation of multi-speaker modeling
Performance of multi-speaker modeling using the proposed
technique was evaluated. Speaker codes used in training the
multi-speaker speech synthesis model were used to synthesize
voices of training speakers. Objective results in terms of mel-
cepstrum distortion and root mean square error (RMSE) of log
F0 (in short, LF0 RMSE) are shown in Fig. 1. The number
of mixtures for the unsupervised system (GMM) was 8, 16,
32, 64 or 128. Only MFCCs were used as features to train the
speaker verification models.
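For reference, the two objective measures can be computed roughly as follows (our sketch; the exact constant, the exclusion of the 0th coefficient, and the voiced-frame masking are common conventions rather than details given in the paper).

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Average MCD in dB between reference and synthesized mel-cepstra [T, D] (0th excluded)."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def lf0_rmse(lf0_ref, lf0_syn, voiced):
    """RMSE of log-F0 over frames voiced in both reference and synthesis.

    The reported units depend on the scale of the log-F0 values supplied.
    """
    d = lf0_ref[voiced] - lf0_syn[voiced]
    return np.sqrt(np.mean(d ** 2))
```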
It can be seen from Fig. 1 that the supervised and unsupervised systems were all significantly more accurate than the averaged system. This result indicates that the multi-speaker speech-synthesis model using the proposed speaker-similarity vectors, like the one using one-hot vectors, was successful at approximating
the many speakers in the training corpus. Next, as for the supervised and the proposed unsupervised (GMM and i-vector) systems, their performances do not differ significantly.

[Fig. 1. Objective results (mel-cepstrum distortion and LF0 RMSE) of the multi-speaker speech-synthesis models, plotted against the number of GMM mixtures (8, 16, 32, 64, 128) for the averaged, supervised, unsupervised (i), and unsupervised (g) systems.]

[Fig. 2. Objective results (mel-cepstrum distortion and LF0 RMSE) of the supervised and proposed unsupervised adaptation techniques. The number of mixtures for GMM was 8, 16, 32, 64, or 128. The numbers in the labels represent the number of adaptation utterances (10, 50, or 100).]
C. Objective evaluation of speaker-adaptation performance
Supervised and unsupervised adaptation: Objective results
of supervised and proposed unsupervised speaker-adaptation
systems based on GMM-UBM are shown in Fig. 2 in terms
of mel-cepstrum distortion and LF0 RMSE. Objective results
of the averaged system are also shown. First, it can be seen
that the unsupervised system based on GMM-UBM (GMM)
produces smaller errors than the averaged system, showing
that the proposed technique successfully performed speaker
adaptation. It can also be seen that the results of the unsuper-
vised system (GMM) are worse than those of the supervised
systems, as expected.
Second, in terms of the number of mixtures for the unsu-
pervised system (GMM), it can be seen that the lowest mel-
cepstrum distortion and LF0 RMSE are obtained by using
GMMs with 32 and 64 mixtures, respectively. The number
of mixtures of GMMs may be smaller than the number of
mixtures generally used in speaker-verification tasks. Since the
final aim is to perform speaker adaptation rather than verifi-

References (partial)
- Speaker Verification Using Adapted Gaussian Mixture Models (journal article).
- P. Kenny, Bayesian Speaker Verification with Heavy-Tailed Priors.
- The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings (journal article).
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech (proceedings article).