Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction
Matthew Gibson and William Byrne
Abstract—Hidden Markov model (HMM)-based speech syn-
thesis systems possess several advantages over concatenative
synthesis systems. One such advantage is the relative ease with
which HMM-based systems are adapted to speakers not present
in the training dataset. Speaker adaptation methods used in
the field of HMM-based automatic speech recognition (ASR)
are adopted for this task. In the case of unsupervised speaker
adaptation, previous work has used a supplementary set of
acoustic models to estimate the transcription of the adaptation
data. This paper firstly presents an approach to the unsuper-
vised speaker adaptation task for HMM-based speech synthesis
models which avoids the need for such supplementary acoustic
models. This is achieved by defining a mapping between HMM-
based synthesis models and ASR-style models, via a two-pass
decision tree construction process. Secondly, it is shown that this
mapping also enables unsupervised adaptation of HMM-based
speech synthesis models without the need to perform linguistic
analysis of the estimated transcription of the adaptation data.
Thirdly, this paper demonstrates how this technique lends itself
to the task of unsupervised cross-lingual adaptation of HMM-
based speech synthesis models, and explains the advantages of
such an approach. Finally, listener evaluations reveal that the
proposed unsupervised adaptation methods deliver performance
approaching that of supervised adaptation.
Index Terms—HMM-based speech synthesis, unsupervised
speaker adaptation, cross-lingual.
I. INTRODUCTION
Hidden Markov model-based systems have delivered
synthetic speech of similar quality to that of concate-
native (or unit selection) synthesis systems [1]. Additionally,
HMM-based systems possess several advantages over unit
selection systems. These advantages include simple ways to
interpolate between speakers and synthesise speech of varying
styles or emotions [2; 3]. Perhaps the most significant advan-
tage is the ability to adapt to new speakers using a relatively
small amount of adaptation data [4].
Most research into speaker adaptation for HMM-based
speech synthesis (or text-to-speech, TTS) has focussed upon
the supervised scenario, where transcribed adaptation data
is available. More recent work has tackled the challenge of
adaptation of HMM-based synthesis models using unlabelled
adaptation data [5]. As will be explained in due course, unsu-
pervised adaptation of HMM-based synthesis models is prob-
lematic for two reasons. Firstly, the modelling of supraseg-
mental contextual information renders the synthesis models
unsuitable for ASR purposes. Therefore a supplementary set
of triphone acoustic models are typically used to estimate
a transcription of the unlabelled adaptation data [5]. Sec-
ondly, linguistic analysis is required to transform word-level
transcriptions into transcriptions containing suprasegmental
contextual information. In the case of unsupervised adaptation,
it is plausible that such linguistic analysis exacerbates errors
present in the estimated word-level transcription.
This paper presents an alternative to the unsupervised adap-
tation approach described in [5]. In [5], adaptation transforms
estimated using triphone acoustic models are applied to the
more detailed acoustic models typically used in HMM-based
synthesis. While this technique avoids the need for linguistic
analysis of the estimated transcription of the adaptation data,
a separately-estimated triphone acoustic model set is still
required.
In this paper, a two-stage decision tree construction method
is introduced, which enables a single set of acoustic model
components to be used for both ASR and TTS. This method
is then used to circumvent the need for supplementary ASR
acoustic models and linguistic analysis of estimated transcrip-
tions during unsupervised adaptation. The application of the
two-stage decision tree construction method is then extended
to the task of unsupervised cross-lingual speaker adaptation.
Cross-lingual (or interlingual) speaker adaptation is defined
as the adaptation of acoustic models associated with one
language, the target language, using adaptation data uttered
in a different language, the source language.
A large amount of research has been performed on the
cross-lingual adaptation task for ASR acoustic models. The
task typically arises in cases where a relatively small amount
of data is available to train an ASR acoustic model in a
particular target language. Bootstrapping the target language
acoustic models ([6]) based upon an explicit mapping from
source to target language phonemes has been explored, as well
as interpolation of the source and target language acoustic
models (also [6]). Later work ([7]) has successfully applied
the maximum a-posteriori (MAP) adaptation method to the
cross-lingual adaptation task, demonstrating the usefulness of
the prior knowledge contained within the source language.
Recent work [8; 9] has addressed the task of supervised
cross-lingual adaptation for HMM-based speech synthesis.
This work used TTS models of both source and target lan-
guages, and defined a phoneme or state-level mapping between
the source and target language acoustic models. This mapping
was then deployed to translate the source language transcrip-
tion of the adaptation data to a target language phoneme
or state sequence. The target language TTS models were
subsequently adapted using the source language acoustic data
and the corresponding mapped target language phoneme or
state sequence.
Techniques similar to those described above rely upon the
availability of both source and target language TTS models,
and the mapping mechanism between these models must be
established prior to adaptation. An alternative approach based
upon the two-stage decision tree construction technique is pro-
posed in this paper. As will be explained later, this alternative
approach is appealing because it requires no knowledge of the
source language acoustic model (or even the source language)
or its relationship to the target language acoustic model.
This paper evaluates the proposed unsupervised adaptation
schemes in both a standard adaptation scenario and a speaker
adaptive training (SAT) framework. The performance of these
techniques is compared with standard approaches to super-
vised and unsupervised speaker adaptation of HMM-based
synthesis models in both the intralingual (within-language)
and cross-lingual scenarios. In the cross-lingual case, parallel
translated adaptation datasets recorded by the same speaker
are used to compare the performance of intralingual and cross-
lingual adaptation in a controlled manner. Listener evaluations
reveal that the proposed unsupervised adaptation techniques
deliver performance approaching that of supervised intralin-
gual adaptation.
The paper is structured as follows. Section II provides a
brief introduction to HMM-based speech synthesis models
and explains why unsupervised adaptation of such models is
problematic. Section III explains the two-pass decision tree
construction technique, and how this enables unsupervised
adaptation of HMM-based synthesis models. Sections IV
and V respectively introduce the unsupervised intralingual
and cross-lingual approaches used in this work. Section VI
discusses the use of SAT in the context of HMM-based speech
synthesis. The proposed approaches to intralingual and cross-
lingual speaker adaptation are evaluated in Sections VII and
VIII respectively. Lastly, Section IX summarises the contribu-
tions of this work and highlights areas of future research.
II. UNSUPERVISED ADAPTATION AND HMM-BASED
SPEECH SYNTHESIS
In the domain of ASR, unsupervised adaptation is usually
conducted by firstly estimating a transcription of the adaptation
data using a speech recogniser. This speech recogniser usually
deploys the same models which are subsequently adapted.
In the domain of HMM-based synthesis, use of the same
unsupervised adaptation framework is problematic because
the acoustic models typically used in HMM-based speech
synthesis are not easily integrated into the ASR search proce-
dure. This, in turn, is because the context-dependent acoustic
models used in HMM-based speech synthesis [10] represent
suprasegmental information (e.g. syllabic stress, total number
of syllables in utterance) in addition to segmental informa-
tion (e.g. context-dependent phoneme label). These models
are henceforth referred to as full context models. Although it
is theoretically possible to recognise unlabelled data using full
context models, doing so requires information which relates to
complete hypotheses (e.g. the total number of words in an
utterance) when constructing a recognition network. When
using e.g. triphone acoustic models, such information may be
ignored to simplify the recognition network and to facilitate
dynamic network construction. The presence of suprasegmen-
tal contextual information in full context models therefore
adds a prohibitive amount of complexity to the construction
of recognition networks.
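For illustration, the contrast can be sketched as follows; this is a simplified, hypothetical representation of the contextual factors involved, not the exact label format used by the authors.

```python
# Illustrative only: a simplified view of triphone versus full context factors.
triphone_context = {"left": "s", "centre": "ih", "right": "t"}

full_context = {
    # segmental factors (shared with the triphone context)
    "left": "s", "centre": "ih", "right": "t",
    # suprasegmental factors: examples of information that depends on the
    # complete hypothesis, which is what complicates ASR network construction
    "syllable_stressed": True,
    "syllables_in_word": 2,
    "syllables_in_utterance": 14,
    "words_in_utterance": 5,
}
```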
A simple solution to this problem is to use a separately-
estimated ASR-compliant acoustic model to obtain a tran-
scription of the adaptation data, followed by adaptation of
the TTS model using this transcription [5]. However this
solution involves estimation of a separate ASR model, and
such model estimation is often a lengthy procedure. Further,
use of different models during the recognition and adaptation
stages precludes the use of efficient online adaptation strategies
[11]. For these reasons, alternative techniques which enable
TTS models to be deployed for ASR have been explored [12].
The two-pass decision tree construction technique [13] is one
such technique, as will be explained in the following section.
III. TWO-PASS DECISION TREE CONSTRUCTION
As is the case for ASR acoustic modelling, decision tree
clustering of the full contexts is used to enable robust esti-
mation of the model parameters. The minimum description
length (MDL) criterion [14] is used when constructing the
decision tree, which in turn uses questions pertaining to both
segmental and suprasegmental context. By performing this
decision tree construction in two stages, where the initial stage
uses questions relating to triphone contextual information, and
the second stage uses questions relating to all contextual infor-
mation, a well-defined mapping between full context models
and triphone models may be established. This constrained
decision tree construction process is illustrated in Figure 1.
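As background, a commonly used form of the MDL-based splitting rule (following Shinoda and Watanabe, stated here as a hedged summary rather than a detail taken from this paper) accepts a split of leaf node m by question q only if the log likelihood gain outweighs the increase in description length:

$$\mathcal{L}\bigl(m_q^{+}\bigr) + \mathcal{L}\bigl(m_q^{-}\bigr) - \mathcal{L}(m) \;>\; \alpha\, K \log \Gamma ,$$

where \(\mathcal{L}(\cdot)\) is the log likelihood of the frames assigned to a node under a single Gaussian, \(K\) is the number of parameters added by the split, \(\Gamma\) is the total state occupancy, and \(\alpha\) is a weighting factor (commonly 1). Because the penalty grows with the amount of data and the model size, tree growth terminates without a manually tuned likelihood threshold.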
The first stage, indicated as Pass 1 in Figure 1, uses
only questions relating to left, right and central phonemes to
construct a phonetic decision tree. This decision tree is used
to generate a set of tied triphone contexts, which are easily
integrated into the ASR search. No state output distributions
are estimated at this stage. Pass 2 extends the decision tree
constructed in Pass 1 by introducing additional questions
relating to suprasegmental information. The output of Pass 2
is an extended decision tree which defines a set of tied full
contexts. The MDL criterion is used for both Pass 1 and Pass
2.
After this two-pass decision tree construction, single com-
ponent Gaussian state output distributions are estimated to
model the tied full contexts associated with each leaf node
of the extended decision tree. These models are then used for
speech synthesis.
A mapping from the single component full context models
to multiple component triphone models is defined as follows.
Each set of Gaussian components associated with the same
‘triphone ancestor’ is grouped to form a multiple component
mixture distribution to model the triphone context defined
by the ‘triphone ancestor’. The derived triphone models are
illustrated at the bottom of Figure 1. The mixture weight of
each component is calculated from the occupancies associated
with the Pass 2 leaf node contexts.
[Figure 1 depicts the two-pass construction: Pass 1 builds a tree from phonetic questions such as 'C-Vowel?', 'C-Nasal?' and 'L-Vowel?'; Pass 2 extends its leaves with suprasegmental questions such as 'R-stressed?' and '2 syllables in utt?', yielding numbered single-component full context leaf nodes, which the model mapping groups into multi-component triphone models and the inverse mapping recovers.]
Fig. 1. Two-pass decision tree construction. Mapping functions permit
sharing of components between full context models for TTS and triphone
models for ASR.
The inverse mapping from triphone models to full context
models is obtained by associating each Gaussian component
with its original full context. This is achieved by assigning a
unique full context identifier to each component as illustrated
in Figure 1.
Mapping full context models to triphone models enables
ASR compatible acoustic models to be derived from TTS
acoustic models, thus avoiding the need for a separately-
estimated ASR model. Sections IV and V explain how these
mappings between full context and triphone models can be
exploited to perform unsupervised intralingual and cross-
lingual adaptation of full context models.
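To make the two mappings concrete, the following is a minimal sketch of the grouping and inverse mapping described above; the data structures and function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def full_to_triphone(full_context_leaves):
    """Group single-Gaussian full context leaves by their Pass 1 'triphone
    ancestor' to form multi-component triphone mixture models, with mixture
    weights derived from the leaf occupancies."""
    by_ancestor = defaultdict(list)
    for leaf in full_context_leaves:
        # each leaf: {"id", "triphone_ancestor", "gaussian", "occupancy"}
        by_ancestor[leaf["triphone_ancestor"]].append(leaf)

    triphone_models = {}
    for triphone, leaves in by_ancestor.items():
        total_occ = sum(l["occupancy"] for l in leaves)
        triphone_models[triphone] = [
            {
                "weight": l["occupancy"] / total_occ,  # occupancy-based mixture weight
                "gaussian": l["gaussian"],
                "full_context_id": l["id"],            # retained for the inverse mapping
            }
            for l in leaves
        ]
    return triphone_models

def triphone_to_full(triphone_models):
    """Inverse mapping: each (possibly adapted) component is returned to the
    full context leaf it originated from, via its unique identifier."""
    return {
        component["full_context_id"]: component["gaussian"]
        for components in triphone_models.values()
        for component in components
    }
```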
IV. UNSUPERVISED INTRALINGUAL ADAPTATION
As illustrated in Figure 2, triphone models derived from
estimated full context models (as described in Section III)
are used to transcribe unlabelled adaptation data. One ques-
tion remains, however. How is ASR output, e.g. a word,
phoneme or triphone sequence, used to adapt full context
models? One method, labelled as ‘full adaptation’ in Figure 2,
firstly performs linguistic analysis of the estimated word-level
transcription to produce an estimated full context labelling of
the adaptation data. The full context models are then adapted
directly using this labelling.
By defining an inverse mapping between full context and tri-
phone models, the two-pass decision tree construction method
introduces an alternative to the ‘full adaptation’ technique. As
illustrated in Figure 2, the estimated triphone transcription may
be used to adapt the triphone models. The adapted triphone
models are then subsequently mapped back to full context
models using the inverse mapping. This is labelled as ‘triphone
adaptation’ in Figure 2.
[Figure 2 depicts the pipeline: full context models are trained from the training data and its full context transcription, and mapped to triphone models; the adaptation data is recognised with the triphone models to give estimated word and triphone transcriptions, which drive either (1) linguistic analysis followed by full adaptation of the full context models, or (2) triphone adaptation followed by the inverse mapping back to adapted full context models.]
Fig. 2. Unsupervised adaptation of full context models via (1) full adaptation
or (2) triphone adaptation.
Once word and triphone-level transcriptions of the adapta-
tion data are available, the full context models may be adapted
in these two different ways. Note that linguistic analysis may
exacerbate errors present in the estimated word-level tran-
scription. It is therefore plausible that the triphone adaptation
technique is more robust than full context adaptation in the
unsupervised case. This hypothesis is tested in the experiments
of Section VII.
V. UNSUPERVISED CROSS-LINGUAL ADAPTATION
Consider now the task of unsupervised cross-lingual speaker
adaptation, as defined in Section I, in the case of full context
acoustic models. To transcribe the adaptation data one could
deploy an ASR system tailored to the source language, i.e. a
source language lexicon, as well as source language acoustic
and language models. This estimated transcription may then
be subsequently mapped to the target language. This mapping
may be defined at the phone level [8] or the state level [9].
The mapped transcription may then be used to adapt the target
language full context models.
The above approach deploys a large amount of source
language specific knowledge, as well as knowledge of the
relationship between source and target languages. Acquisition
of such knowledge typically depends upon a large amount
of transcribed acoustic data from the source language. Such
a database is certainly not available for all languages, and
is expensive to obtain. Further, if the source language is
unknown, clearly the approach described above cannot be
applied. For these reasons, an alternative method is explored
in this work.
The cross-lingual adaptation technique used in this work
treats the source language adaptation data as if it were uttered
in the target language. Target language acoustic models and a
phoneme loop grammar are used to recognise the adaptation
data, thus mapping it onto a phoneme sequence in the target
language. Subsequently, the estimated triphone sequence is
used as the reference sequence, and the triphone adaptation
method of Figure 2 is used. This process is almost identical to
the triphone adaptation approach to unsupervised intralingual
adaptation. The sole difference is that, in order to avoid
language specific constraints, no dictionary or language model
is used during recognition. This method was first introduced
and evaluated in [15].
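As an illustration of the recognition constraint, a phoneme loop grammar simply allows any target language phoneme to follow any other, so no source language lexicon or language model enters the process. The sketch below uses a made-up phoneme subset and uniform transition probabilities.

```python
# Illustrative phoneme-loop "grammar": every phoneme can follow every phoneme
# with equal probability, so decoding is constrained only by the target
# language acoustic models. The phoneme inventory here is a made-up subset.
target_phonemes = ["sil", "aa", "ae", "b", "d", "ih", "n", "s", "t"]
loop_probability = 1.0 / len(target_phonemes)
phoneme_loop = {
    current: {nxt: loop_probability for nxt in target_phonemes}
    for current in target_phonemes
}
```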
The approach described in the previous paragraph uses no
source language ASR or TTS system. Further, no previously
learned mapping between source and target language acoustic
models is necessary. Indeed, no source language knowledge
whatsoever is used, so the technique may be applied even
when the source language is unknown.
By comparing the performance of unsupervised intralingual
and cross-lingual adaptation, the impact of source language
knowledge may be measured. This comparison is reported in
Section VIII.
VI. SPEAKER ADAPTIVE TRAINING
Speaker adaptive training (SAT, [16]) attempts to decouple
inter-speaker and intra-speaker variance when estimating a
speaker independent (SI) acoustic model. The SAT framework
simultaneously estimates sets of speaker dependent transforms
of the acoustic models (one set of transforms for each speaker
in the training set) and a speaker independent ‘canonical’
model. The transforms are designed to capture much of the
inter-speaker acoustic variance and consequently the canonical
model displays less variance than a standard SI system.
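In outline (a hedged summary of the standard SAT formulation rather than the authors' exact notation), the canonical model \(\lambda\) and one transform set per training speaker are estimated jointly by maximising the likelihood of each speaker's data under the transformed model, typically with alternating updates of the transforms and the canonical model:

$$\bigl(\hat{\lambda},\, \{\hat{\mathcal{W}}^{(s)}\}\bigr) \;=\; \arg\max_{\lambda,\, \{\mathcal{W}^{(s)}\}} \; \prod_{s=1}^{S} p\bigl(\mathbf{O}^{(s)} \mid \lambda,\ \mathcal{W}^{(s)}\bigr),$$

where \(\mathbf{O}^{(s)}\) is the training data of speaker \(s\) and \(\mathcal{W}^{(s)}\) the corresponding set of transforms.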
Both SAT-estimated and standard SI full context models are
used in the experimental work of this paper. Figure 3 illustrates
the procedure used to estimate these models. Monophone models
are first estimated within the SAT framework, then cloned to full
context models, which are SAT-estimated using one global transform
per state/stream combination per speaker. The statistics from
these untied full context models are then used to cluster the
full context models. Subsequent to full-context clustering, tied
models are re-estimated to create both SAT-estimated and
standard SI tied full context models.
There is evidence [17] that SAT-estimated models are
superior to standard SI-estimated models for HMM-based
speech synthesis. The evaluation of Section VIII revisits this
comparison to determine if the same conclusions hold in
the case of models generated using two-pass decision tree
construction. The performance of SAT-estimated and standard
SI models is compared both prior to and after adaptation.
VII. EVALUATION: INTRALINGUAL SPEAKER ADAPTATION
The evaluation described in this section is designed to
address the following questions regarding unsupervised in-
tralingual speaker adaptation of HMM-based synthesis models.
1) Does the constrained two-pass decision tree construction
process affect the naturalness of the resulting speech?
2) How does the proposed approach to unsupervised in-
tralingual adaptation compare with supervised intralin-
gual adaptation?
[Figure 3 depicts the training procedure: the training data is used for SAT monophone estimation; the monophone models are cloned to full context models and SAT-estimated; full context clustering is applied; and the tied models are then re-estimated to give SAT tied full context models and, via standard SI training, standard SI tied full context models.]
Fig. 3. Estimation of speaker independent (SI) full context models using
speaker adaptive training (SAT) and standard model estimation.
3) How does the performance of triphone adaptation (as
described in Section IV) compare with that of full
adaptation?
A. Background information
The synthesis models used in this evaluation deploy the fol-
lowing acoustic features: STRAIGHT-analysed Mel-cepstral
coefficients [18] (40 dimensions), fundamental frequency
(F0), and measurements which quantify the aperiodicity of
the speech (5 dimensions). The first and second order temporal
derivatives of all of these coefficients are appended to yield
a feature vector of dimension 138. The feature vector is split
into five streams: cepstral coefficients, aperiodicity measures,
F0, first derivative of F0, and second derivative of F0. Multi-
space probability distributions are used to model observations
of varying dimension, namely the F0 observation [19]. Ex-
plicit duration models (hidden semi-Markov models, [20]) are
integrated to improve the quality of synthesised speech. One
decision tree per state and stream combination (where all three
F0 streams are combined for the purposes of clustering) is
used, with an additional decision tree to cluster contexts of
the duration model. A speech utterance is generated from full
context models via feature sequence generation with global
variance consideration [21; 22]. Synthesis of the waveform
from the feature sequence is performed by the STRAIGHT
vocoder [18].
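The arithmetic behind the 138-dimensional vector and the five-stream split can be checked as follows (an illustrative calculation, not code from the paper):

```python
# Static feature dimensions described in the text.
static_dims = {"mel_cepstrum": 40, "aperiodicity": 5, "f0": 1}
assert 3 * sum(static_dims.values()) == 138  # statics + deltas + delta-deltas

# The five streams used for modelling and clustering.
streams = {
    "mel_cepstrum": 3 * 40,   # static + delta + delta-delta = 120
    "aperiodicity": 3 * 5,    # 15
    "f0": 1,                  # modelled with a multi-space distribution
    "delta_f0": 1,
    "delta_delta_f0": 1,
}
assert sum(streams.values()) == 138
```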
B. Systems
To address the questions posed at the start of this section,
the systems detailed in Table I are evaluated. Standard SI full
context models are estimated using the Wall Street Journal
(WSJ) SI84 training dataset (3,586 male and 3,552 female
utterances; 7,136 in total) and maximum likelihood estima-
tion. Note that such databases have proven useful for HMM-
based speech synthesis ([23]).
TABLE I: EVALUATED SYSTEMS (INTRALINGUAL ONLY).

System   Clustering   Adaptation method   Supervised?
A        Standard     -                   -
B        Two-pass     -                   -
C        Two-pass     Full                Y
D        Two-pass     Full                N
E        Two-pass     Triphone            Y
F        Two-pass     Triphone            N
G        -            -                   -
Average voice models corresponding to standard, uncon-
strained decision tree construction (system A of Table I) are
estimated for comparison with those corresponding to two-
pass decision tree construction (system B). Note that only Mel-
cepstral, F0, and aperiodicity models are adapted in this work,
so only those models are clustered using the two-pass decision
tree construction method. Duration models are clustered using
standard clustering methods and are identical in systems A
and B.
Adapted systems are derived from System B using either
the triphone or full adaptation method described in Section IV.
Constrained maximum likelihood linear regression (CMLLR,
[24]) adaptation is used, and the adaptation data corresponds
to spoke 4 of the 1993 ARPA evaluation (40 utterances for
speaker 440M). The adaptation techniques are evaluated in the
supervised and unsupervised cases, resulting in four adapted
model sets corresponding to systems C through F in Table I.
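For reference (standard CMLLR background rather than anything specific to this paper), the feature-space form of CMLLR evaluates Gaussian component \(m\) for speaker \(s\) as

$$p\bigl(\mathbf{o}_t \mid m, s\bigr) \;=\; \bigl|\mathbf{A}^{(s)}\bigr|\; \mathcal{N}\bigl(\mathbf{A}^{(s)}\mathbf{o}_t + \mathbf{b}^{(s)};\ \boldsymbol{\mu}_m,\ \boldsymbol{\Sigma}_m\bigr),$$

so a single affine transform per speaker (or one per regression class) is shared by the means and covariances of all components it covers.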
System G corresponds to vocoded natural speech, analysed
and resynthesised using the STRAIGHT technique [18]. This
system is included in the evaluation to establish an upper
bound on the performance of the synthesised speech.
In the case of unsupervised adaptation, triphone models
derived from the estimated full context average voice models
are used for the recognition step, in conjunction with the
closed vocabulary 20k bigram language model provided with
the WSJ0 corpus. A set of state transition probabilities is
estimated from the SI84 dataset for use with the triphone
models during recognition. A phoneme error rate of 47.1%
is observed for the unsupervised transcriptions.
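For completeness, phoneme error rate can be computed from a standard Levenshtein alignment of the reference and recognised phoneme sequences; the function below is an illustrative utility, not the scoring tool used in the paper.

```python
def phoneme_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / len(ref), via Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / n

# Example: one substitution and one insertion against a three-phoneme reference.
print(phoneme_error_rate("s ih t".split(), "s ah t d".split()))  # 0.666...
```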
C. Analysis of two-pass decision tree construction
Table II displays the number of leaf nodes created using different decision tree construction methods, and for the different streams. In all cases, the number of leaf nodes generated after two-pass decision tree construction exceeds that of standard tree construction. This demonstrates that constraining the tree structure to satisfy the requirements of the two-pass construction method, defined in Section III, leads to less compact trees.

TABLE II: NUMBER OF LEAF NODES CREATED USING DIFFERENT DECISION TREE CONSTRUCTION METHODS.

            Mel-cepstral      F0     Aperiodicity
Pass 1          2208         6756        1644
Pass 2          2889        34849        2639
Standard        2621        30581        2160
D. Evaluation details
Two different evaluation methods were used to measure the
performance of the two-pass intralingual adaptation technique:
an opinion score evaluation, described in Section VII-D1, and
a paired comparison of several pairs of systems, described
in Section VII-D2. The opinion score evaluation provides a
performance measure and overall ranking of the systems stud-
ied, while the paired comparison more effectively discovers
significant differences between system pairs.
1) Opinion score evaluation: The seven systems (A through
G) were evaluated by listeners, who listened to synthesised utterances via
a web browser interface closely resembling that used in
the Blizzard Challenge 2007. The evaluation comprised two
sections. In the first section, listeners judged the naturalness of
an initial set of synthesised utterances. In the second section,
listeners judged the similarity of a second set of synthesised
utterances to a target speaker’s (speaker 440M) speech. Four
of the target speaker’s natural utterances were available for
comparison. No utterances from the initial set were present
in the second set. Each synthetic utterance was judged using
a five point Likert-type psychometric response scale [25],
where ‘5’ is the most favourable response and ‘1’ is the least
favourable.
Twenty-two native English speakers conducted the evalua-
tion. A Latin square experimental design was used to define
the order in which systems were judged (a different square for
each section of the evaluation). Each listener was assigned a
row of each Latin square, and judged seven different utterances
per section, each synthesised by a different system. The
synthesised utterances are a subset of the 1992 ARPA speaker
independent read 5k test dataset with no verbal punctuation.
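A cyclic Latin square is one simple way to realise such a design: each listener (row) hears every system exactly once, and each system appears once in every presentation position (column). The sketch below is illustrative; the actual squares used in the evaluation are not specified here.

```python
# Cyclic 7x7 Latin square over the seven systems of Table I.
systems = ["A", "B", "C", "D", "E", "F", "G"]
n = len(systems)
latin_square = [[systems[(row + col) % n] for col in range(n)] for row in range(n)]
for row in latin_square:
    print(" ".join(row))  # one listener's presentation order per row
```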
Throughout this paper, significant differences between sys-
tems evaluated using the opinion score evaluation are de-
tected using a pairwise Wilcoxon signed rank test which is
Bonferroni-corrected for multiple comparisons [26]. A differ-
ence is deemed significant if this test discovers significance at
the 95% confidence level.
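A minimal sketch of this testing procedure, assuming paired per-listener scores for each system, is given below; the scores and the set of compared systems are illustrative, and real evaluations use many more listeners.

```python
from itertools import combinations
from scipy.stats import wilcoxon

# Illustrative paired opinion scores (one entry per listener, same listeners throughout).
scores = {
    "C": [4, 3, 5, 5, 3, 4, 4, 5],
    "D": [3, 2, 4, 4, 2, 3, 4, 4],
    "F": [3, 3, 4, 3, 2, 4, 3, 4],
}
pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction for the number of pairwise tests
for a, b in pairs:
    statistic, p_value = wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: p = {p_value:.3f}, significant at 95%: {p_value < alpha}")
```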
2) Paired comparison evaluation: Three pairs of systems
are selected and a preference test conducted in order to address
the questions stated at the start of this section. Each judge was
presented with pairs of synthesised utterances, one generated
from each system in the comparison. For each pair, the judge
was forced to select his preferred system, according to either
naturalness or similarity to a target speaker. In the case of
similarity, four of the target speaker’s natural utterances were
available to inform the judgement. The synthesised utterances
are a subset of the 1992 ARPA speaker independent read 5k
test dataset with no verbal punctuation.
The following pairs of systems were compared. Unadapted
standard (system A) and unadapted two-pass (system B)
were compared in terms of naturalness. Supervised triphone-

References

R. Likert, "A technique for the measurement of attitudes."
P. Koehn, "Europarl: a parallel corpus for statistical machine translation."
"Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds."
"Speech parameter generation algorithms for HMM-based speech synthesis."
"A compact model for speaker-adaptive training."