Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction
Matthew Gibson and William Byrne
Abstract—Hidden Markov model (HMM)-based speech syn-
thesis systems possess several advantages over concatenative
synthesis systems. One such advantage is the relative ease with
which HMM-based systems are adapted to speakers not present
in the training dataset. Speaker adaptation methods used in
the field of HMM-based automatic speech recognition (ASR)
are adopted for this task. In the case of unsupervised speaker
adaptation, previous work has used a supplementary set of
acoustic models to estimate the transcription of the adaptation
data. This paper firstly presents an approach to the unsuper-
vised speaker adaptation task for HMM-based speech synthesis
models which avoids the need for such supplementary acoustic
models. This is achieved by defining a mapping between HMM-
based synthesis models and ASR-style models, via a two-pass
decision tree construction process. Secondly, it is shown that this
mapping also enables unsupervised adaptation of HMM-based
speech synthesis models without the need to perform linguistic
analysis of the estimated transcription of the adaptation data.
Thirdly, this paper demonstrates how this technique lends itself
to the task of unsupervised cross-lingual adaptation of HMM-
based speech synthesis models, and explains the advantages of
such an approach. Finally, listener evaluations reveal that the
proposed unsupervised adaptation methods deliver performance
approaching that of supervised adaptation.
Index Terms—HMM-based speech synthesis, unsupervised
speaker adaptation, cross-lingual.
I. INTRODUCTION
Hidden Markov model-based systems have delivered
synthetic speech of similar quality to that of concate-
native (or unit selection) synthesis systems [1]. Additionally,
HMM-based systems possess several advantages over unit
selection systems. These advantages include simple ways to
interpolate between speakers and synthesise speech of varying
styles or emotions [2; 3]. Perhaps the most significant advan-
tage is the ability to adapt to new speakers using a relatively
small amount of adaptation data [4].
Most research into speaker adaptation for HMM-based
speech synthesis (or text-to-speech, TTS) has focussed upon
the supervised scenario, where transcribed adaptation data
is available. More recent work has tackled the challenge of
adaptation of HMM-based synthesis models using unlabelled
adaptation data [5]. As will be explained in due course, unsu-
pervised adaptation of HMM-based synthesis models is prob-
lematic for two reasons. Firstly, the modelling of supraseg-
mental contextual information renders the synthesis models
unsuitable for ASR purposes. Therefore a supplementary set
of triphone acoustic models are typically used to estimate
a transcription of the unlabelled adaptation data [5]. Sec-
ondly, linguistic analysis is required to transform word-level
transcriptions into transcriptions containing suprasegmental
contextual information. In the case of unsupervised adaptation,
it is plausible that such linguistic analysis exacerbates errors
present in the estimated word-level transcription.
This paper presents an alternative to the unsupervised adap-
tation approach described in [5]. In [5], adaptation transforms
estimated using triphone acoustic models are applied to the
more detailed acoustic models typically used in HMM-based
synthesis. While this technique avoids the need for linguistic
analysis of the estimated transcription of the adaptation data,
a separately-estimated triphone acoustic model set is still
required.
In this paper, a two-stage decision tree construction method
is introduced, which enables a single set of acoustic model
components to be used for both ASR and TTS. This method
is then used to circumvent the need for supplementary ASR
acoustic models and linguistic analysis of estimated transcrip-
tions during unsupervised adaptation. The application of the
two-stage decision tree construction method is then extended
to the task of unsupervised cross-lingual speaker adaptation.
Cross-lingual (or interlingual) speaker adaptation is defined
as the adaptation of acoustic models associated with one
language, the target language, using adaptation data uttered
in a different language, the source language.
A large amount of research has been performed on the
cross-lingual adaptation task for ASR acoustic models. The
task typically arises in cases where a relatively small amount
of data is available to train an ASR acoustic model in a
particular target language. Bootstrapping the target language
acoustic models ([6]) based upon an explicit mapping from
source to target language phonemes has been explored, as well
as interpolation of the source and target language acoustic
models (also [6]). Later work ([7]) has successfully applied
the maximum a-posteriori (MAP) adaptation method to the
cross-lingual adaptation task, demonstrating the usefulness of
the prior knowledge contained within the source language.
Recent work [8; 9] has addressed the task of supervised
cross-lingual adaptation for HMM-based speech synthesis.
This work used TTS models of both source and target lan-
guages, and defined a phoneme or state-level mapping between
the source and target language acoustic models. This mapping
was then deployed to translate the source language transcrip-
tion of the adaptation data to a target language phoneme
or state sequence. The target language TTS models were
subsequently adapted using the source language acoustic data
and the corresponding mapped target language phoneme or
state sequence.
Techniques similar to those described above rely upon the
availability of both source and target language TTS models,
and the mapping mechanism between these models must be
established prior to adaptation. An alternative approach based
upon the two-stage decision tree construction technique is pro-
posed in this paper. As will be explained later, this alternative
approach is appealing because it requires no knowledge of the
source language acoustic model (or even the source language)
or its relationship to the target language acoustic model.
This paper evaluates the proposed unsupervised adaptation
schemes in both a standard adaptation scenario and a speaker
adaptive training (SAT) framework. The performance of these
techniques is compared with standard approaches to super-
vised and unsupervised speaker adaptation of HMM-based
synthesis models in both the intralingual (within-language)
and cross-lingual scenarios. In the cross-lingual case, parallel
translated adaptation datasets recorded by the same speaker
are used to compare the performance of intralingual and cross-
lingual adaptation in a controlled manner. Listener evaluations
reveal that the proposed unsupervised adaptation techniques
deliver performance approaching that of supervised intralin-
gual adaptation.
The paper is structured as follows. Section II provides a
brief introduction to HMM-based speech synthesis models
and explains why unsupervised adaptation of such models is
problematic. Section III explains the two-pass decision tree
construction technique, and how this enables unsupervised
adaptation of HMM-based synthesis models. Sections IV
and V respectively introduce the unsupervised intralingual
and cross-lingual approaches used in this work. Section VI
discusses the use of SAT in the context of HMM-based speech
synthesis. The proposed approaches to intralingual and cross-
lingual speaker adaptation are evaluated in Sections VII and
VIII respectively. Lastly, Section IX summarises the contribu-
tions of this work and highlights areas of future research.
II. UNSUPERVISED ADAPTATION AND HMM-BASED
SPEECH SYNTHESIS
In the domain of ASR, unsupervised adaptation is usually
conducted by firstly estimating a transcription of the adaptation
data using a speech recogniser. This speech recogniser usually
deploys the same models which are subsequently adapted.
In the domain of HMM-based synthesis, use of the same
unsupervised adaptation framework is problematic because
the acoustic models typically used in HMM-based speech
synthesis are not easily integrated into the ASR search proce-
dure. This, in turn, is because the context-dependent acoustic
models used in HMM-based speech synthesis [10] represent
suprasegmental information (e.g. syllabic stress, total number
of syllables in utterance) in addition to segmental informa-
tion (e.g. context-dependent phoneme label). These models
are henceforth referred to as full context models. Although it
is theoretically possible to recognise unlabelled data using full
context models, doing so requires information which relates to
complete hypotheses (e.g. the total number of words in an
utterance) when constructing a recognition network. When
using e.g. triphone acoustic models, such information may be
ignored to simplify the recognition network and to facilitate
dynamic network construction. The presence of suprasegmen-
tal contextual information in full context models therefore
adds a prohibitive amount of complexity to the construction
of recognition networks.
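For illustration, the contrast can be sketched as follows; this is a simplified, hypothetical representation of the contextual factors involved, not the exact label format used by the authors.

```python
# Illustrative only: a simplified view of triphone versus full context factors.
triphone_context = {"left": "s", "centre": "ih", "right": "t"}

full_context = {
    # segmental factors (shared with the triphone context)
    "left": "s", "centre": "ih", "right": "t",
    # suprasegmental factors: examples of information that depends on the
    # complete hypothesis, which is what complicates ASR network construction
    "syllable_stressed": True,
    "syllables_in_word": 2,
    "syllables_in_utterance": 14,
    "words_in_utterance": 5,
}
```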
A simple solution to this problem is to use a separately-
estimated ASR-compliant acoustic model to obtain a tran-
scription of the adaptation data, followed by adaptation of
the TTS model using this transcription [5]. However this
solution involves estimation of a separate ASR model, and
such model estimation is often a lengthy procedure. Further,
use of different models during the recognition and adaptation
stages precludes the use of efficient online adaptation strategies
[11]. For these reasons, alternative techniques which enable
TTS models to be deployed for ASR have been explored [12].
The two-pass decision tree construction technique [13] is one
such technique, as will be explained in the following section.
III. TWO-PASS DECISION TREE CONSTRUCTION
As is the case for ASR acoustic modelling, decision tree
clustering of the full contexts is used to enable robust esti-
mation of the model parameters. The minimum description
length (MDL) criterion [14] is used when constructing the
decision tree, which in turn uses questions pertaining to both
segmental and suprasegmental context. By performing this
decision tree construction in two stages, where the initial stage
uses questions relating to triphone contextual information, and
the second stage uses questions relating to all contextual infor-
mation, a well-defined mapping between full context models
and triphone models may be established. This constrained
decision tree construction process is illustrated in Figure 1.
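As background, a commonly used form of the MDL-based splitting rule (following Shinoda and Watanabe, stated here as a hedged summary rather than a detail taken from this paper) accepts a split of leaf node m by question q only if the log likelihood gain outweighs the increase in description length:

$$\mathcal{L}\bigl(m_q^{+}\bigr) + \mathcal{L}\bigl(m_q^{-}\bigr) - \mathcal{L}(m) \;>\; \alpha\, K \log \Gamma ,$$

where \(\mathcal{L}(\cdot)\) is the log likelihood of the frames assigned to a node under a single Gaussian, \(K\) is the number of parameters added by the split, \(\Gamma\) is the total state occupancy, and \(\alpha\) is a weighting factor (commonly 1). Because the penalty grows with the amount of data and the model size, tree growth terminates without a manually tuned likelihood threshold.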
The first stage, indicated as Pass 1 in Figure 1, uses
only questions relating to left, right and central phonemes to
construct a phonetic decision tree. This decision tree is used
to generate a set of tied triphone contexts, which are easily
integrated into the ASR search. No state output distributions
are estimated at this stage. Pass 2 extends the decision tree
constructed in Pass 1 by introducing additional questions
relating to suprasegmental information. The output of Pass 2
is an extended decision tree which defines a set of tied full
contexts. The MDL criterion is used for both Pass 1 and Pass
2.
After this two-pass decision tree construction, single com-
ponent Gaussian state output distributions are estimated to
model the tied full contexts associated with each leaf node
of the extended decision tree. These models are then used for
speech synthesis.
A mapping from the single component full context models
to multiple component triphone models is defined as follows.
Each set of Gaussian components associated with the same
‘triphone ancestor’ is grouped to form a multiple component
mixture distribution to model the triphone context defined
by the ‘triphone ancestor’. The derived triphone models are
illustrated at the bottom of Figure 1. The mixture weight of
each component is calculated from the occupancies associated
with the Pass 2 leaf node contexts.
[Figure 1 depicts the two-pass construction: Pass 1 builds a tree from phonetic questions such as 'C-Vowel?', 'C-Nasal?' and 'L-Vowel?'; Pass 2 extends its leaves with suprasegmental questions such as 'R-stressed?' and '2 syllables in utt?', yielding numbered single-component full context leaf nodes, which the model mapping groups into multi-component triphone models and the inverse mapping recovers.]
Fig. 1. Two-pass decision tree construction. Mapping functions permit
sharing of components between full context models for TTS and triphone
models for ASR.
The inverse mapping from triphone models to full context
models is obtained by associating each Gaussian component
with its original full context. This is achieved by assigning a
unique full context identifier to each component as illustrated
in Figure 1.
Mapping full context models to triphone models enables
ASR compatible acoustic models to be derived from TTS
acoustic models, thus avoiding the need for a separately-
estimated ASR model. Sections IV and V explain how these
mappings between full context and triphone models can be
exploited to perform unsupervised intralingual and cross-
lingual adaptation of full context models.
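To make the two mappings concrete, the following is a minimal sketch of the grouping and inverse mapping described above; the data structures and function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def full_to_triphone(full_context_leaves):
    """Group single-Gaussian full context leaves by their Pass 1 'triphone
    ancestor' to form multi-component triphone mixture models, with mixture
    weights derived from the leaf occupancies."""
    by_ancestor = defaultdict(list)
    for leaf in full_context_leaves:
        # each leaf: {"id", "triphone_ancestor", "gaussian", "occupancy"}
        by_ancestor[leaf["triphone_ancestor"]].append(leaf)

    triphone_models = {}
    for triphone, leaves in by_ancestor.items():
        total_occ = sum(l["occupancy"] for l in leaves)
        triphone_models[triphone] = [
            {
                "weight": l["occupancy"] / total_occ,  # occupancy-based mixture weight
                "gaussian": l["gaussian"],
                "full_context_id": l["id"],            # retained for the inverse mapping
            }
            for l in leaves
        ]
    return triphone_models

def triphone_to_full(triphone_models):
    """Inverse mapping: each (possibly adapted) component is returned to the
    full context leaf it originated from, via its unique identifier."""
    return {
        component["full_context_id"]: component["gaussian"]
        for components in triphone_models.values()
        for component in components
    }
```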
IV. UNSUPERVISED INTRALINGUAL ADAPTATION
As illustrated in Figure 2, triphone models derived from
estimated full context models (as described in Section III)
are used to transcribe unlabelled adaptation data. One ques-
tion remains, however. How is ASR output, e.g. a word,
phoneme or triphone sequence, used to adapt full context
models? One method, labelled as ‘full adaptation’ in Figure 2,
firstly performs linguistic analysis of the estimated word-level
transcription to produce an estimated full context labelling of
the adaptation data. The full context models are then adapted
directly using this labelling.
By defining an inverse mapping between full context and tri-
phone models, the two-pass decision tree construction method
introduces an alternative to the ‘full adaptation’ technique. As
illustrated in Figure 2, the estimated triphone transcription may
be used to adapt the triphone models. The adapted triphone
models are then subsequently mapped back to full context
models using the inverse mapping. This is labelled as ‘triphone
adaptation’ in Figure 2.
[Figure 2 depicts the pipeline: full context models are trained from the training data and its full context transcription, and mapped to triphone models; the adaptation data is recognised with the triphone models to give estimated word and triphone transcriptions, which drive either (1) linguistic analysis followed by full adaptation of the full context models, or (2) triphone adaptation followed by the inverse mapping back to adapted full context models.]
Fig. 2. Unsupervised adaptation of full context models via (1) full adaptation
or (2) triphone adaptation.
Once word and triphone-level transcriptions of the adapta-
tion data are available, the full context models may be adapted
in these two different ways. Note that linguistic analysis may
exacerbate errors present in the estimated word-level tran-
scription. It is therefore plausible that the triphone adaptation
technique is more robust than full context adaptation in the
unsupervised case. This hypothesis is tested in the experiments
of Section VII.
V. UNSUPERVISED CROSS-LINGUAL ADAPTATION
Consider now the task of unsupervised cross-lingual speaker
adaptation, as defined in Section I, in the case of full context
acoustic models. To transcribe the adaptation data one could
deploy an ASR system tailored to the source language, i.e. a
source language lexicon, as well as source language acoustic
and language models. This estimated transcription may then
be subsequently mapped to the target language. This mapping
may be defined at the phone level [8] or the state level [9].
The mapped transcription may then be used to adapt the target
language full context models.
The above approach deploys a large amount of source
language specific knowledge, as well as knowledge of the
relationship between source and target languages. Acquisition
of such knowledge typically depends upon a large amount
of transcribed acoustic data from the source language. Such
a database is certainly not available for all languages, and
is expensive to obtain. Further, if the source language is
unknown, clearly the approach described above cannot be
applied. For these reasons, an alternative method is explored
in this work.
The cross-lingual adaptation technique used in this work
treats the source language adaptation data as if it were uttered
in the target language. Target language acoustic models and a
phoneme loop grammar are used to recognise the adaptation
data, thus mapping it onto a phoneme sequence in the target
language. Subsequently, the estimated triphone sequence is
used as the reference sequence, and the triphone adaptation
method of Figure 2 is used. This process is almost identical to
the triphone adaptation approach to unsupervised intralingual
adaptation. The sole difference is that, in order to avoid
language specific constraints, no dictionary or language model
is used during recognition. This method was first introduced
and evaluated in [15].
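As an illustration of the recognition constraint, a phoneme loop grammar simply allows any target language phoneme to follow any other, so no source language lexicon or language model enters the process. The sketch below uses a made-up phoneme subset and uniform transition probabilities.

```python
# Illustrative phoneme-loop "grammar": every phoneme can follow every phoneme
# with equal probability, so decoding is constrained only by the target
# language acoustic models. The phoneme inventory here is a made-up subset.
target_phonemes = ["sil", "aa", "ae", "b", "d", "ih", "n", "s", "t"]
loop_probability = 1.0 / len(target_phonemes)
phoneme_loop = {
    current: {nxt: loop_probability for nxt in target_phonemes}
    for current in target_phonemes
}
```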
The approach described in the previous paragraph uses no
source language ASR or TTS system. Further, no previously
learned mapping between source and target language acoustic
models is necessary. Indeed, no source language knowledge
whatsoever is used, so the technique may be applied even
when the source language is unknown.
By comparing the performance of unsupervised intralingual
and cross-lingual adaptation, the impact of source language
knowledge may be measured. This comparison is reported in
Section VIII.
VI. SPEAKER ADAPTIVE TRAINING
Speaker adaptive training (SAT, [16]) attempts to decouple
inter-speaker and intra-speaker variance when estimating a
speaker independent (SI) acoustic model. The SAT framework
simultaneously estimates sets of speaker dependent transforms
of the acoustic models (one set of transforms for each speaker
in the training set) and a speaker independent ‘canonical’
model. The transforms are designed to capture much of the
inter-speaker acoustic variance and consequently the canonical
model displays less variance than a standard SI system.
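In outline (a hedged summary of the standard SAT formulation rather than the authors' exact notation), the canonical model \(\lambda\) and one transform set per training speaker are estimated jointly by maximising the likelihood of each speaker's data under the transformed model, typically with alternating updates of the transforms and the canonical model:

$$\bigl(\hat{\lambda},\, \{\hat{\mathcal{W}}^{(s)}\}\bigr) \;=\; \arg\max_{\lambda,\, \{\mathcal{W}^{(s)}\}} \; \prod_{s=1}^{S} p\bigl(\mathbf{O}^{(s)} \mid \lambda,\ \mathcal{W}^{(s)}\bigr),$$

where \(\mathbf{O}^{(s)}\) is the training data of speaker \(s\) and \(\mathcal{W}^{(s)}\) the corresponding set of transforms.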
Both SAT-estimated and standard SI full context models are
used in the experimental work of this paper. Figure 3 illustrates
the procedure used to estimate these models. Monophone models
are first estimated within the SAT framework, then cloned to full
context models, which are SAT-estimated using one global transform
per state/stream combination per speaker. The statistics from
these untied full context models are then used to cluster the
full context models. Subsequent to full-context clustering, tied
models are re-estimated to create both SAT-estimated and
standard SI tied full context models.
There is evidence [17] that SAT-estimated models are
superior to standard SI-estimated models for HMM-based
speech synthesis. The evaluation of Section VIII revisits this
comparison to determine if the same conclusions hold in
the case of models generated using two-pass decision tree
construction. The performance of SAT-estimated and standard
SI models is compared both prior to and after adaptation.
VII. EVALUATION: INTRALINGUAL SPEAKER ADAPTATION
The evaluation described in this section is designed to
address the following questions regarding unsupervised in-
tralingual speaker adaptation of HMM-based synthesis models.
1) Does the constrained two-pass decision tree construction
process affect the naturalness of the resulting speech?
2) How does the proposed approach to unsupervised in-
tralingual adaptation compare with supervised intralin-
gual adaptation?
[Figure 3 depicts the training procedure: the training data is used for SAT monophone estimation; the monophone models are cloned to full context models and SAT-estimated; full context clustering is applied; and the tied models are then re-estimated to give SAT tied full context models and, via standard SI training, standard SI tied full context models.]
Fig. 3. Estimation of speaker independent (SI) full context models using
speaker adaptive training (SAT) and standard model estimation.
3) How does the performance of triphone adaptation (as
described in Section IV) compare with that of full
adaptation?
A. Background information
The synthesis models used in this evaluation deploy the fol-
lowing acoustic features: STRAIGHT-analysed Mel-cepstral
coefficients [18] (40 dimensions), fundamental frequency
(F0), and measurements which quantify the aperiodicity of
the speech (5 dimensions). The first and second order temporal
derivatives of all of these coefficients are appended to yield
a feature vector of dimension 138. The feature vector is split
into five streams: cepstral coefficients, aperiodicity measures,
F0, first derivative of F0, and second derivative of F0. Multi-
space probability distributions are used to model observations
of varying dimension, namely the F0 observation [19]. Ex-
plicit duration models (hidden semi-Markov models, [20]) are
integrated to improve the quality of synthesised speech. One
decision tree per state and stream combination (where all three
F0 streams are combined for the purposes of clustering) is
used, with an additional decision tree to cluster contexts of
the duration model. A speech utterance is generated from full
context models via feature sequence generation with global
variance consideration [21; 22]. Synthesis of the waveform
from the feature sequence is performed by the STRAIGHT
vocoder [18].
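The arithmetic behind the 138-dimensional vector and the five-stream split can be checked as follows (an illustrative calculation, not code from the paper):

```python
# Static feature dimensions described in the text.
static_dims = {"mel_cepstrum": 40, "aperiodicity": 5, "f0": 1}
assert 3 * sum(static_dims.values()) == 138  # statics + deltas + delta-deltas

# The five streams used for modelling and clustering.
streams = {
    "mel_cepstrum": 3 * 40,   # static + delta + delta-delta = 120
    "aperiodicity": 3 * 5,    # 15
    "f0": 1,                  # modelled with a multi-space distribution
    "delta_f0": 1,
    "delta_delta_f0": 1,
}
assert sum(streams.values()) == 138
```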
B. Systems
To address the questions posed at the start of this section,
the systems detailed in Table I are evaluated. Standard SI full
context models are estimated using the Wall Street Journal
(WSJ) SI84 training dataset (3,586 male and 3,552 female
utterances; 7,136 in total) and maximum likelihood estima-
tion. Note that such databases have proven useful for HMM-
based speech synthesis ([23]).
TABLE I: EVALUATED SYSTEMS (INTRALINGUAL ONLY).

System   Clustering   Adaptation method   Supervised?
A        Standard     -                   -
B        Two-pass     -                   -
C        Two-pass     Full                Y
D        Two-pass     Full                N
E        Two-pass     Triphone            Y
F        Two-pass     Triphone            N
G        -            -                   -
Average voice models corresponding to standard, uncon-
strained decision tree construction (system A of Table I) are
estimated for comparison with those corresponding to two-
pass decision tree construction (system B). Note that only Mel-
cepstral, F0, and aperiodicity models are adapted in this work,
so only those models are clustered using the two-pass decision
tree construction method. Duration models are clustered using
standard clustering methods and are identical in systems A
and B.
Adapted systems are derived from System B using either
the triphone or full adaptation method described in Section IV.
Constrained maximum likelihood linear regression (CMLLR,
[24]) adaptation is used, and the adaptation data corresponds
to spoke 4 of the 1993 ARPA evaluation (40 utterances for
speaker 440M). The adaptation techniques are evaluated in the
supervised and unsupervised cases, resulting in four adapted
model sets corresponding to systems C through F in Table I.
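For reference (standard CMLLR background rather than anything specific to this paper), the feature-space form of CMLLR evaluates Gaussian component \(m\) for speaker \(s\) as

$$p\bigl(\mathbf{o}_t \mid m, s\bigr) \;=\; \bigl|\mathbf{A}^{(s)}\bigr|\; \mathcal{N}\bigl(\mathbf{A}^{(s)}\mathbf{o}_t + \mathbf{b}^{(s)};\ \boldsymbol{\mu}_m,\ \boldsymbol{\Sigma}_m\bigr),$$

so a single affine transform per speaker (or one per regression class) is shared by the means and covariances of all components it covers.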
System G corresponds to vocoded natural speech, analysed
and resynthesised using the STRAIGHT technique [18]. This
system is included in the evaluation to establish an upper
bound on the performance of the synthesised speech.
In the case of unsupervised adaptation, triphone models
derived from the estimated full context average voice models
are used for the recognition step, in conjunction with the
closed vocabulary 20k bigram language model provided with
the WSJ0 corpus. A set of state transition probabilities is
estimated from the SI84 dataset for use with the triphone
models during recognition. A phoneme error rate of 47.1%
is observed for the unsupervised transcriptions.
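For completeness, phoneme error rate can be computed from a standard Levenshtein alignment of the reference and recognised phoneme sequences; the function below is an illustrative utility, not the scoring tool used in the paper.

```python
def phoneme_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / len(ref), via Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / n

# Example: one substitution and one insertion against a three-phoneme reference.
print(phoneme_error_rate("s ih t".split(), "s ah t d".split()))  # 0.666...
```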
C. Analysis of two-pass decision tree construction
Table II displays the number of leaf nodes created using different decision tree construction methods, and for the different streams. In all cases, the number of leaf nodes generated after two-pass decision tree construction exceeds that of standard tree construction. This demonstrates that constraining the tree structure to satisfy the requirements of the two-pass construction method, defined in Section III, leads to less compact trees.

TABLE II: NUMBER OF LEAF NODES CREATED USING DIFFERENT DECISION TREE CONSTRUCTION METHODS.

            Mel-cepstral      F0     Aperiodicity
Pass 1          2208         6756        1644
Pass 2          2889        34849        2639
Standard        2621        30581        2160
D. Evaluation details
Two different evaluation methods were used to measure the
performance of the two-pass intralingual adaptation technique:
an opinion score evaluation, described in Section VII-D1, and
a paired comparison of several pairs of systems, described
in Section VII-D2. The opinion score evaluation provides a
performance measure and overall ranking of the systems stud-
ied, while the paired comparison more effectively discovers
significant differences between system pairs.
1) Opinion score evaluation: The seven systems (A through
G) were evaluated by listeners, who listened to synthesised utterances via
a web browser interface closely resembling that used in
the Blizzard Challenge 2007. The evaluation comprised two
sections. In the first section, listeners judged the naturalness of
an initial set of synthesised utterances. In the second section,
listeners judged the similarity of a second set of synthesised
utterances to a target speaker’s (speaker 440M) speech. Four
of the target speaker’s natural utterances were available for
comparison. No utterances from the initial set were present
in the second set. Each synthetic utterance was judged using
a five point Likert-type psychometric response scale [25],
where ‘5’ is the most favourable response and ‘1’ is the least
favourable.
Twenty-two native English speakers conducted the evalua-
tion. A Latin square experimental design was used to define
the order in which systems were judged (a different square for
each section of the evaluation). Each listener was assigned a
row of each Latin square, and judged seven different utterances
per section, each synthesised by a different system. The
synthesised utterances are a subset of the 1992 ARPA speaker
independent read 5k test dataset with no verbal punctuation.
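A cyclic Latin square is one simple way to realise such a design: each listener (row) hears every system exactly once, and each system appears once in every presentation position (column). The sketch below is illustrative; the actual squares used in the evaluation are not specified here.

```python
# Cyclic 7x7 Latin square over the seven systems of Table I.
systems = ["A", "B", "C", "D", "E", "F", "G"]
n = len(systems)
latin_square = [[systems[(row + col) % n] for col in range(n)] for row in range(n)]
for row in latin_square:
    print(" ".join(row))  # one listener's presentation order per row
```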
Throughout this paper, significant differences between sys-
tems evaluated using the opinion score evaluation are de-
tected using a pairwise Wilcoxon signed rank test which is
Bonferroni-corrected for multiple comparisons [26]. A differ-
ence is deemed significant if this test discovers significance at
the 95% confidence level.
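A minimal sketch of this testing procedure, assuming paired per-listener scores for each system, is given below; the scores and the set of compared systems are illustrative, and real evaluations use many more listeners.

```python
from itertools import combinations
from scipy.stats import wilcoxon

# Illustrative paired opinion scores (one entry per listener, same listeners throughout).
scores = {
    "C": [4, 3, 5, 5, 3, 4, 4, 5],
    "D": [3, 2, 4, 4, 2, 3, 4, 4],
    "F": [3, 3, 4, 3, 2, 4, 3, 4],
}
pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction for the number of pairwise tests
for a, b in pairs:
    statistic, p_value = wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: p = {p_value:.3f}, significant at 95%: {p_value < alpha}")
```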
2) Paired comparison evaluation: Three pairs of systems
are selected and a preference test conducted in order to address
the questions stated at the start of this section. Each judge was
presented with pairs of synthesised utterances, one generated
from each system in the comparison. For each pair, the judge
was forced to select his preferred system, according to either
naturalness or similarity to a target speaker. In the case of
similarity, four of the target speaker’s natural utterances were
available to inform the judgement. The synthesised utterances
are a subset of the 1992 ARPA speaker independent read 5k
test dataset with no verbal punctuation.
The following pairs of systems were compared. Unadapted
standard (system A) and unadapted two-pass (system B)
were compared in terms of naturalness. Supervised triphone-

References

R. Likert, "A technique for the measurement of attitudes."
P. Koehn, "Europarl: a parallel corpus for statistical machine translation."
"Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds."
"Speech parameter generation algorithms for HMM-based speech synthesis."
"A compact model for speaker-adaptive training."