Citation (peer-reviewed author manuscript): Takaki, S., Nishimura, Y. & Yamagishi, J. (2019). Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018, IEEE, Honolulu, Hawaii, USA, pp. 649-658. https://doi.org/10.23919/APSIPA.2018.8659621

Unsupervised Speaker Adaptation for
DNN-based Speech Synthesis using Input Codes
Shinji Takaki, Yoshikazu Nishimura, Junichi Yamagishi
National Institute of Informatics, Tokyo, Japan
alt Inc., Tokyo, Japan
Abstract—A new speaker-adaptation technique for deep neural
network (DNN)-based speech synthesis which requires only
speech data without orthographic transcriptions is proposed.
This technique is based on a DNN-based speech-synthesis model
that takes speaker, gender, and age into consideration as addi-
tional inputs and outputs acoustic parameters of corresponding
voices from text in order to construct a multi-speaker model
and perform speaker adaptation. It uses a new input code that
represents acoustic similarity to each of the training speakers in terms of probability. The new input code, called “speaker-similarity vector,” is obtained by concatenating posterior probabilities
calculated from each model of the training speakers. GMM-UBM
or i-vector/PLDA, which are widely used in text-independent
speaker verification, are used to represent the speaker models,
since they can be used without text information. Text and the
speaker-similarity vectors of the training speakers are used
as input to first train a multi-speaker speech-synthesis model,
which outputs acoustic parameters of the training speakers.
A new speaker-similarity vector is then estimated by using a
small amount of speech data uttered by an unknown target
speaker on the basis of the separately trained speaker models.
It is expected that inputting the estimated speaker-similarity
vector into the multi-speaker speech-synthesis model can generate
synthetic speech that resembles the target speaker’s voice. In
objective and subjective experiments, adaptation performance of
the proposed technique was evaluated using not only studio-
quality adaptation data but also low-quality (i.e., noisy and
reverberant) data. The results of the experiments indicate that
the proposed technique makes it possible to rapidly construct a
voice for the target speaker in DNN-based speech synthesis.
I. INTRODUCTION
The flexibility and controllability of speech-synthesis sys-
tems are as important as naturalness of speech in some appli-
cations; hence, constructing such a flexible speech-synthesis
system is an interesting research topic in the field of DNN-
based speech synthesis. A variety of multi-speaker modeling
and speaker-adaptation techniques for DNN-based speech syn-
thesis have been proposed recently. Multi-speaker modeling is
a technique for synthesizing voices of various speakers by
using a common model, and speaker adaptation is a technique
for estimating a new acoustic model by using a small amount
of speech data uttered by a new target speaker or in a new
speaking style (e.g., a different emotion). To give a few
examples of the multi-speaker modeling in the field of DNN-
based speech synthesis, using speaker codes that represent
a speaker’s identity for multi-speaker modeling, in which
additional inputs are used to distinguish speakers, has been
proposed [1], [2], [3]. Speaker-adaptation techniques using i-vectors as an additional input, an adaptation method for speech recognition called “learning hidden-unit contributions” [4], linear transforms defined by Gaussian mixture models (GMMs), and combinations of those methods have also been proposed [5]. In another study, it was assumed that the
output layer in a DNN captures most speaker differences,
and under that assumption, it was attempted to estimate a
speaker-dependent output layer by using individual speaker’s
data while keeping the hidden network layers shared across
all speakers [6].
Prior to the present study, a DNN-based acoustic model us-
ing auxiliary features referred to as input codes was proposed
[7]. In that model, to more effectively retain speaker voice
characteristics and allow speaker adaptation, a speaker’s iden-
tity, gender, and age classes were additionally used. Speaker
adaptation was performed by estimating a new speaker code
based on back-propagation (BP) using a small amount of the
target speaker’s speech data and associated linguistic features
obtained from text. Almost all other adaptation techniques
proposed for DNN synthesis are based on BP [1], [5], [6], [8],
[9]; hence, not only speech data but also linguistic features are
always required.
In this study, a new speaker-adaptation technique for DNN-
based speech synthesis which requires only speech data with-
out orthographic transcriptions is proposed. This technique is
traditionally called unsupervised speaker adaptation for speech
synthesis [10]. A naive approach is to obtain transcriptions using external automatic speech recognition and then apply conventional speaker adaptation based on the speech and the automatically generated transcriptions [11]. However, this procedure may cause problems when the speech-recognition outputs contain severe errors.
The proposed technique uses a new input code designed for
speaker adaptation without using text. The new input code,
called “speaker-similarity vector”, represents acoustic similar-
ity of a target speaker to each of several training speakers
in terms of probability, and it is obtained by concatenating
posterior probabilities calculated from each of the models
of the training speakers. Intuitively, this process may be
viewed as replacing a conventional binary hard speaker code
with continuous soft codes according to speaker similarity.
Therefore, if a multi-speaker speech-synthesis model is trained
using text, and speaker-similarity vectors are used as input, it
can be expected that speaker characteristics of synthetic speech
generated from the trained multi-speaker model will depend
on the speaker-similarity vectors, and the synthetic speech will

vary if the speaker similarity vectors change. Furthermore,
if a new speaker-similarity vector is estimated by using a
small amount of speech data uttered by an unknown target
speaker (on the basis of separately trained speaker models),
and the estimated speaker-similarity vector is input into a
multi-speaker speech synthesis model, the resulting synthetic
speech will probably resemble the voice of the new target
speaker. More importantly, the speaker-similarity vector can be
computed by using widely used text-independent automatic-
speaker-verification models, such as the Gaussian Mixture
Model-Universal Background Model (GMM-UBM) [12] and
i-vector probabilistic linear discriminant analysis (PLDA) [13],
without the need for text information; hence, unsupervised
speaker adaptation can be achieved.
In this paper, we also train robust speaker-verification models to calculate appropriate posterior probabilities from low-quality (i.e., noisy and reverberant) speech, so that the target speaker's voice can be synthesized even if low-quality speech is given as adaptation data. An issue to be addressed is
mismatch between recording conditions for the training data
and adaptation data fed into speaker-verification models, in
which posterior probabilities are calculated from low-quality
speech via speaker-verification models trained by using studio-
quality speech data. To alleviate this conditional mismatch
between training data and adaptation data, low-quality speech
data are artificially created, and speaker-verification models are
trained using the created data instead of studio-quality speech
data.
The remainder of the paper is organized as follows. Section 2 de-
scribes multi-speaker modeling and the conventional speaker-
adaptation technique using input codes. Section 3 explains the
proposed speaker-adaptation technique that does not require
text information. Section 4 describes how to artificially create
noisy and reverberant speech. In Section 5, the proposed
approaches are evaluated by using studio-quality speech data,
and in Section 6, they are evaluated by using low-quality
speech data. Section 7 concludes the paper.
II. MULTI-SPEAKER SPEECH SYNTHESIS AND SPEAKER
ADAPTATION USING INPUT CODES
The previously proposed multi-speaker speech-synthesis
model [7] was trained using simply pooled data of multiple
speakers. Identity, gender, and age of the speaker are rep-
resented using one-hot vectors, binary values (0: Female; 1:
Male), and raw age values, respectively, and added as a part
of the input to the neural network.
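For illustration, such an input code could be assembled as in the following sketch (ours; the ordering of the elements and the lack of any scaling of the age value are assumptions, not details given in the paper).

```python
import numpy as np

def build_input_code(speaker_idx, n_speakers, is_male, age):
    """Concatenate speaker one-hot, gender bit (0: female, 1: male), and raw age."""
    speaker_code = np.zeros(n_speakers)
    speaker_code[speaker_idx] = 1.0          # one-hot speaker identity
    return np.concatenate([speaker_code, [float(is_male)], [float(age)]])

# e.g. build_input_code(speaker_idx=3, n_speakers=112, is_male=0, age=24)
```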
To adapt the above multi-speaker speech-synthesis models
to a new speaker, the BP algorithm is used to minimize the
mean-square prediction error over a small amount of data
uttered by the target speaker according to the study by Bridle
and Cox [14]. Note that the BP algorithm only updates the
speaker codes, without changing the DNN weights, in contrast
to algorithms developed in other studies, e.g., [1], that used
fixed codes but added new weights. The BP algorithm starts from the average speaker code and iterates until a stopping criterion is satisfied, yielding a new speaker code for the target speaker.
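To make the BP-based update concrete, the following PyTorch-style sketch (ours, not the authors' implementation; the model interface, learning rate, and number of steps are placeholder assumptions, and the gender/age codes are omitted for brevity) freezes the network weights and back-propagates the mean-square prediction error into the speaker code only.

```python
import torch

def adapt_speaker_code(model, linguistic_feats, target_acoustics,
                       avg_speaker_code, n_steps=100, lr=0.01):
    """Estimate a new speaker code by back-propagation (BP).

    model            : trained multi-speaker DNN (weights are kept fixed)
    linguistic_feats : [T, D_ling] linguistic features of the adaptation data
    target_acoustics : [T, D_out]  acoustic features of the target speaker
    avg_speaker_code : [D_code]    initialization (average speaker code)
    """
    for p in model.parameters():           # freeze all network weights
        p.requires_grad_(False)

    code = avg_speaker_code.clone().requires_grad_(True)   # only the code is trainable
    optimizer = torch.optim.SGD([code], lr=lr)

    for _ in range(n_steps):                # until a stopping criterion is met
        optimizer.zero_grad()
        inputs = torch.cat([linguistic_feats,
                            code.expand(linguistic_feats.size(0), -1)], dim=1)
        pred = model(inputs)
        loss = torch.nn.functional.mse_loss(pred, target_acoustics)
        loss.backward()                     # gradients flow only into `code`
        optimizer.step()

    return code.detach()
```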
III. UNSUPERVISED SPEAKER ADAPTATION USING A
SPEAKER-SIMILARITY VECTOR
A. Flow of the proposed unsupervised speaker-adaptation
technique
This subsection explains the proposed unsupervised speaker-adaptation technique using speaker-similarity vectors. The procedure for training a multi-speaker model and performing speaker adaptation on the basis of the speaker-similarity vectors consists of the following steps (a minimal code sketch of Steps 2 and 4 is given after the list).
1) First, text-independent speaker verification models are
constructed for each of the training speakers included
in a speech database, which is also used for training
the multi-speaker speech synthesis model. GMM-UBM
[12] or i-vector/PLDA [13] is used as a text-independent
speaker verification model.
2) Then, the posterior probability of each training speaker
given by one of the multiple text-independent speaker
verification models in Step 1 is computed. The ob-
tained posterior probabilities are concatenated to form
a speaker similarity vector for each of the training
speakers. 112-dimensional speaker similarity vectors are
obtained (since the number of training speakers was
112).
3) Next, the speaker similarity vectors computed in Step
2 are used to replace the one-hot-vector based speaker
code, and a DNN-based multi-speaker speech synthesis
model is constructed. Linguistic features, gender, and age codes are the same as those used in the systems described in Section 2. (The same technique could also be used to estimate age and gender codes; however, this option is not explored in this paper due to space limitations.)
4) Speaker adaptation is performed as follows. A speaker-
similarity vector of an unknown target speaker is esti-
mated in a similar way as in Step 2: the posterior prob-
abilities of the target speaker given by the multiple text-
independent speaker-verification models are computed,
and the obtained posterior probabilities are concatenated
to form a speaker-similarity vector.
5) The estimated speaker-similarity vector of the target
speaker is used as a new speaker code of the above
multi-speaker speech-synthesis model, thereby changing
the speaker characteristics of synthetic speech.
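As a concrete illustration of Steps 2 and 4, the sketch below (ours; it uses scikit-learn GaussianMixture models as simplified stand-ins for the MAP-adapted GMM-UBM speaker models) normalizes per-speaker average log-likelihoods of an utterance with a softmax to obtain a posterior-like speaker-similarity vector.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speaker_similarity_vector(feats, speaker_gmms):
    """Form a speaker-similarity vector from per-speaker GMM scores (Steps 2 and 4).

    feats        : [T, D] acoustic features (e.g., MFCCs) of one speaker's speech
    speaker_gmms : list of fitted GaussianMixture models, one per training speaker
    returns      : [N_speakers] vector of probability-like similarities (sums to one)
    """
    # Average per-frame log-likelihood of the speech under each speaker model
    scores = np.array([gmm.score(feats) for gmm in speaker_gmms])
    # Softmax over speakers to obtain a probability-like similarity vector
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Simplified speaker models (the paper uses MAP-adapted GMM-UBMs instead):
# speaker_gmms = [GaussianMixture(n_components=64).fit(f) for f in per_speaker_feats]
```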
B. Speaker-verification models
For text-independent speaker verification, the GMM-UBM
[12] and i-vector/PLDA [13] approaches are widely used.
The proposed technique also uses these approaches. As for
the GMM-UBM approach, a speaker model is obtained by
adapting parameters of GMMs trained using speech data of
many speakers [12]. As for i-vector/PLDA, i-vectors are first
computed from sufficient statistics of speech data and are
regarded as observations for a Gaussian PLDA model given
as
w
u
=
¯
w + Φβ + Γα
u
+
u
, (1)
1
The same technique may also be used to estimate age and gender codes.
However, this option is not explored in this paper due to space limitation.

where
¯
w is a speaker-independent supervector. Φ and Γ rep-
resent eigenvoice matrices for speaker- and channel-dependent
components, respectively. Speaker and channel factors, β and
α
u
, are assumed to have a standard Gaussian distribution as a
prior distribution. In this study, the third term in Eq. (1) was
not used.
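The same normalization idea applies in the i-vector/PLDA case once per-speaker scores are available. The sketch below (ours) assumes that PLDA log-likelihood-ratio scores of the adaptation speech against each training speaker have already been computed (e.g., with a toolkit such as SIDEKIT, which is used in Section V) and simply maps them to a probability-like speaker-similarity vector; the temperature parameter is an illustrative extra, not part of the paper.

```python
import numpy as np

def similarity_from_plda_scores(scores, temperature=1.0):
    """Map PLDA scores (one per training speaker) to a speaker-similarity vector.

    scores      : [N_speakers] log-likelihood-ratio scores of the adaptation speech
    temperature : softens (>1) or sharpens (<1) the resulting distribution
    """
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()
```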
C. Advantage of proposed framework
As mentioned earlier, several techniques for speaker adapta-
tion using i-vectors [5] or d-vectors [15] have been developed.
As for the former, i-vectors are directly used as inputs for
DNN-based speech synthesis. On the other hand, as for the
proposed framework, GMM-UBM or i-vector/PLDA is used
only to calculate posterior probabilities for each training
speaker. That is, the proposed multi-speaker speech-synthesis
model does not depend on any acoustic parameterization
or dimensions of i-vectors and has weaker dependency on
acoustic features.
An unsupervised speaker-adaptation technique using a
bottle-neck layer of a DNN-based speaker-recognition model
for DNN-based speech synthesis was proposed by Doddipatla
et al. [15]. In that technique, PCA is applied to the bottleneck features of the DNN-based speaker-recognition model, and the first eigenvector is interpolated on the basis of the posterior probabilities of the speaker-recognition model. In comparison, we argue that the proposed technique is much simpler and more intuitive for constructing a flexible multi-speaker speech-synthesis model.
IV. SPEAKER-VERIFICATION MODELS ROBUST AGAINST
LOW-QUALITY ADAPTATION SPEECH DATA
Speech used as adaptation data is usually low quality
because recording studio-quality speech incurs high cost. In
this study, robust speaker verification models are trained to
perform the proposed unsupervised speaker adaptation without
significant degradation of speech quality even if low-quality
speech is given as adaptation data. To train speaker-verification
models robust against low-quality speech data, the mismatch
between recording conditions for training and adaptation
speech data fed into the models needs to be alleviated. Hence,
low-quality speech data is artificially created by adding noise
and reverberation to studio-quality speech, and the created data
is used for training the speaker-verification models.
The low-quality speech data was artificially created by using
the Demand noise database [16] and the ACE Challenge
reverberant database [17]. The low-quality speech was created
by adding noise from the Demand database and reverberation
from the ACE Challenge database to studio-quality speech
waveforms. It is assumed that adaptation speech data used
for speech synthesis is recorded in indoor rooms, so noise and
room impulse responses recorded in an office or meeting room
were used. The first channel of office-and-meeting-room noise
recordings at 48 kHz sampling frequency was selected from
the Demand database, and room impulse responses recorded
in an office or meeting room (Office 1 and Meeting Room 1)
were selected from the ACE database. Noise and reverberation
were added to studio-quality speech in the same way as used
in [18] as follows,
y = x ∗ h_1 + α(n ∗ h_2),    (2)

where x and n represent a studio-quality speech waveform and a noise waveform, respectively, h_1 and h_2 represent the room impulse responses, ∗ is a convolution operator, and α is used for adjusting the signal-to-noise ratio (SNR). The room impulse responses h_1 and h_2 were recorded using microphones located at positions 1 and 2.
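A minimal numpy/scipy sketch of Eq. (2) is given below (ours; the use of fftconvolve, the trimming of the convolved signals, and the way α is solved from a target SNR are assumptions rather than the authors' exact pipeline). α is chosen so that the reverberant speech and the scaled reverberant noise reach the desired signal-to-noise ratio.

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(x, n, h1, h2, snr_db):
    """Create low-quality speech: y = x*h1 + alpha*(n*h2), as in Eq. (2).

    x, n   : studio-quality speech and noise waveforms (same sampling rate;
             n is assumed to be at least as long as x)
    h1, h2 : room impulse responses for the speech and noise positions
    snr_db : desired signal-to-noise ratio in dB
    """
    xr = fftconvolve(x, h1)[: len(x)]       # reverberant speech
    nr = fftconvolve(n, h2)[: len(x)]       # reverberant noise, trimmed to length
    p_x = np.mean(xr ** 2)
    p_n = np.mean(nr ** 2) + 1e-12
    # Solve 10*log10(p_x / (alpha^2 * p_n)) = snr_db for alpha
    alpha = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return xr + alpha * nr
```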
In our experiments, the adaptation speech data was also artificially degraded in the same way. Using speech waveforms recorded under real conditions as adaptation data is left for future work.
V. EXPERIMENTS USING STUDIO-QUALITY SPEECH DATA
The proposed technique for unsupervised speaker adapta-
tion using studio-quality speech data as adaptation data was
evaluated as described below.
A. Experimental conditions
Speech database: For our experiments, the Japanese Voice
Bank corpus, containing studio-quality native Japanese speech
uttered by 65 males and 70 females aged between 10 and
89, was used. The speech from 56 males and 56 females
was used to train the speaker-verification models and the
multi-speaker speech-synthesis models. The speech from the
remaining speakers (9 males and 14 females) was saved for
speaker adaptation. With approximately 100 utterances per
speaker, this dataset yielded a total of 11,154 training-data
utterances. For the adaptation experiments, either 10, 50, or
100 utterances from each of the 23 speakers not included in the
training set were used as adaptation materials. The sampling
frequency of the speech-signal waveform was 48 kHz. Speaker
adaptation was evaluated by using 10 utterances per speaker
not included in either the training or adaptation sets.
Speaker-verification models: To train the speaker verification
models, an open-source toolkit called SIDEKIT [19] was used.
The acoustic features used for training these models are listed
in Table I. Since spectral features (20-dimensional MFCCs) and the fundamental frequency/F0 (1-dimensional) differ significantly in dimensionality, 20-dimensional F0 features were also obtained by applying a discrete cosine transform (DCT) to the fundamental-frequency values of the current frame and the previous and next 32 frames. Moreover, instead of the standard MFCC,
spectral features used for speech synthesis models, referred to
as MGC, were investigated. Then, GMMs with 64 mixtures
were trained to extract 400-dimensional i-vectors. The size of
the eigenvoice matrices for the speaker dependent components
in Gaussian PLDA was 20.
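The 20-dimensional F0 features can be sketched as follows (our reading of the description: a DCT over a window consisting of the current frame and the previous and next 32 frames, keeping the first 20 coefficients; the edge padding and prior interpolation of unvoiced frames are our assumptions).

```python
import numpy as np
from scipy.fft import dct

def f0_dct_features(f0, context=32, n_coef=20):
    """Per-frame F0 features from a window of the current frame +/- `context` frames.

    f0 : [T] fundamental-frequency contour (unvoiced frames already interpolated)
    """
    T = len(f0)
    padded = np.pad(f0, context, mode="edge")     # replicate edge values
    feats = np.empty((T, n_coef))
    for t in range(T):
        window = padded[t : t + 2 * context + 1]  # current frame +/- 32 frames
        feats[t] = dct(window, norm="ortho")[:n_coef]
    return feats
```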
Speech-synthesis models: For extracting the acoustic features
for the speech-synthesis model, WORLD analysis [20], [21]
was used to obtain 259-dimensional acoustic feature vectors
every 5 ms (each feature comprising 59-dimensional mel-
spectral coefficients, a linearly interpolated fundamental fre-
quency on the mel scale, and 25-dimensional band aperiodici-
ties, along with their delta and delta-delta). The 259th feature

TABLE I
ACOUSTIC FEATURES USED FOR SPEAKER-VERIFICATION MODELS.
MFCC: 19-dim MFCCs (plus energy), Δ, Δ²
MGC:  19-dim WORLD mel-cepstrum (plus 0th), Δ, Δ²
F0:   20-dim features derived from F0, Δ, Δ²
TABLE II
FOUR MULTI-SPEAKER SPEECH-SYNTHESIS MODELS USED FOR SPEAKER-ADAPTATION EXPERIMENTS. g AND i DENOTE GMM AND i-vector, RESPECTIVELY.
System           | Multi-speaker model     | Adaptation
averaged         | one-hot vector          | —
supervised       | one-hot vector          | vector estimated by BP
unsupervised (g) | speaker-similarity vec. | obtained from GMM-UBM
unsupervised (i) | speaker-similarity vec. | obtained from i-vector/PLDA
The 259th feature was a binary voiced/unvoiced flag. 389-dimensional linguistic
features were used as an input vector. This input vector was
augmented with speaker, gender, and age codes. The oracle
duration was used since it makes it possible to easily compute
objective measures such as mel-cepstral distortion. All multi-
speaker speech synthesis models were feedforward DNNs with
five hidden layers of 1024 nodes each. Sigmoid activation
functions were used for all units in the hidden and output
layers. The models were initialized randomly and trained to
minimize the mean square error by stochastic gradient descent.
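A compact sketch of such a network is shown below (ours; the input split into 389 linguistic dimensions, a 112-dimensional speaker code, and gender/age codes, the five sigmoid hidden layers of 1024 units, the 259-dimensional sigmoid output, and MSE training follow the text, while everything else, including the class name, is an assumption).

```python
import torch
import torch.nn as nn

class MultiSpeakerDNN(nn.Module):
    """Feedforward acoustic model: linguistic features + input codes -> acoustic features."""

    def __init__(self, ling_dim=389, spk_dim=112, extra_dim=2,   # gender + age codes
                 hidden=1024, n_hidden=5, out_dim=259):
        super().__init__()
        layers, in_dim = [], ling_dim + spk_dim + extra_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        layers += [nn.Linear(in_dim, out_dim), nn.Sigmoid()]      # sigmoid output, per the text
        self.net = nn.Sequential(*layers)

    def forward(self, ling, spk_code, extra):
        # spk_code is a one-hot vector or a speaker-similarity vector
        return self.net(torch.cat([ling, spk_code, extra], dim=-1))

# Training follows the paper: random initialization, MSE loss, stochastic gradient descent, e.g.
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
```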
Speaker adaptation: The proposed unsupervised speaker-
adaptation technique was compared with a supervised speaker-
adaptation technique using speaker codes. Systems constructed
for the experiments are listed in Table II. The averaged system is a reference system that uses one-hot vectors to train a multi-speaker model and replaces all one-hot-vector elements with their average value during synthesis; it can thus be viewed as an average-voice system. In the supervised system, the multi-speaker model is the same as that used in the averaged system, but the speaker code for the target speaker is estimated on the basis of BP. The unsupervised systems (GMM and i-vector) are the proposed unsupervised speaker-adaptation systems, in which speaker-similarity vectors are estimated by using GMM-UBM or i-vector/PLDA, respectively.
B. Objective evaluation of multi-speaker modeling
Performance of multi-speaker modeling using the proposed
technique was evaluated. Speaker codes used in training the
multi-speaker speech synthesis model were used to synthesize
voices of training speakers. Objective results in terms of mel-
cepstrum distortion and root mean square error (RMSE) of log
F0 (in short, LF0 RMSE) are shown in Fig. 1. The number
of mixtures for the unsupervised system (GMM) was 8, 16,
32, 64 or 128. Only MFCCs were used as features to train the
speaker verification models.
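For reference, the two objective measures can be computed roughly as follows (our sketch; the exact constant, the exclusion of the 0th coefficient, and the voiced-frame masking are common conventions rather than details given in the paper).

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Average MCD in dB between reference and synthesized mel-cepstra [T, D] (0th excluded)."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def lf0_rmse(lf0_ref, lf0_syn, voiced):
    """RMSE of log-F0 over frames voiced in both reference and synthesis.

    The reported units depend on the scale of the log-F0 values supplied.
    """
    d = lf0_ref[voiced] - lf0_syn[voiced]
    return np.sqrt(np.mean(d ** 2))
```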
It can be seen from Fig. 1 that the supervised and unsupervised systems were all significantly more accurate than the averaged system. This result indicates that the multi-speaker speech-synthesis model using the proposed speaker-similarity vectors, like the one using one-hot vectors, was successful at approximating
the many speakers in the training corpus. Next, as for the supervised and the proposed unsupervised (GMM and i-vector) systems, their performances do not differ significantly.

[Fig. 1. Objective results (mel-cepstrum distortion and LF0 RMSE) of the multi-speaker speech-synthesis models, plotted against the number of GMM mixtures (8, 16, 32, 64, 128) for the averaged, supervised, unsupervised (i), and unsupervised (g) systems.]

[Fig. 2. Objective results (mel-cepstrum distortion and LF0 RMSE) of the supervised and proposed unsupervised adaptation techniques. The number of mixtures for GMM was 8, 16, 32, 64, or 128. The numbers in the labels represent the number of adaptation utterances (10, 50, or 100).]
C. Objective evaluation of speaker-adaptation performance
Supervised and unsupervised adaptation: Objective results
of supervised and proposed unsupervised speaker-adaptation
systems based on GMM-UBM are shown in Fig. 2 in terms
of mel-cepstrum distortion and LF0 RMSE. Objective results
of the averaged system are also shown. First, it can be seen
that the unsupervised system based on GMM-UBM (GMM)
produces smaller errors than the averaged system, showing
that the proposed technique successfully performed speaker
adaptation. It can also be seen that the results of the unsuper-
vised system (GMM) are worse than those of the supervised
systems, as expected.
Second, in terms of the number of mixtures for the unsu-
pervised system (GMM), it can be seen that the lowest mel-
cepstrum distortion and LF0 RMSE are obtained by using
GMMs with 32 and 64 mixtures, respectively. The number
of mixtures of GMMs may be smaller than the number of
mixtures generally used in speaker-verification tasks. Since the
final aim is to perform speaker adaptation rather than verifi-

References (partial)
- Speaker Verification Using Adapted Gaussian Mixture Models (journal article).
- P. Kenny, Bayesian Speaker Verification with Heavy-Tailed Priors.
- The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings (journal article).
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech (proceedings article).