Proceedings ArticleDOI

VTLN-based voice conversion

14 Dec 2003, pp. 556-559
TL;DR: After applying several conventional VTLN warping functions, the piecewise linear function is extended to several segments, allowing a more detailed warping of the source spectrum.
Abstract: In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker's voice into that of a target speaker, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the piecewise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on voice conversion are performed on three corpora of two languages and both speaker genders.

Summary (1 min read)

Introduction

  • Vocal tract length normalization [1] tries to compensate for the effect of speaker dependent vocal tract lengths by warping the frequency axis of the amplitude spectrum.
  • Subsequently, in Section 3, the authors apply this training procedure to conventional warping functions depending on only one parameter.
  • All these considerations are discussed in Section 4.
  • For unvoiced signal parts, pseudo periods are used.
  • This class mapping is the basis for arbitrary statistical voice conversion parameter training.

2.1. Statistical Voice Conversion Parameter Training

  • Let $X_1^I = X_1, \ldots, X_I$ be the spectra belonging to source class $k_S$ and $Y_1^J$ those of the mapped class $\hat{k}_T(k_S)$; the authors estimate the parameter vector $\vartheta$ by minimizing the sum of the Euclidean distances between all target class spectra and transformed source class spectra.
  • Here, the authors utilize the spectral conversion function $F_{\vartheta'}$ depending on the parameter vector $\vartheta'$.
  • In conjunction with a suitable smoothing technique, the authors can often neglect the variety of the classes' observation spectra by introducing a mean approximation without an essential effect on the voice conversion parameters.
  • In speech recognition, several VTLN warping functions have been proposed whose parameters are usually limited to one variable, the warping factor $\alpha$.

4.1. Piece-Wise Linear Warping with Several Segments

  • One adverse property of the conventional warping functions with one parameter is that the whole frequency axis is always warped in the same direction, either to lower or to higher frequencies.
  • These functions are not able to model spectral conversions where certain parts of the axis move to higher frequencies and other parts to lower frequencies, or vice versa.
  • Such functions would require at least one inflection point and would cross the $\tilde\omega = \omega$ diagonal.
  • Applying the VTLN technique to voice conversion, the authors want to use more exact models than in speech recognition, i.e. warping functions with several parameters, for a better description of the individual characteristics of the speakers' vocal tracts.
  • The corresponding $\tilde\omega_s$ are the parameters of the warping function.

5.1. Iterative Integrating Smoothing

  • The basis of the voice conversion technique delineated in this paper is the automatic class segmentation and mapping described in Section 2.
  • To avoid the class-dependent voice conversion parameters jumping at the class boundaries and causing distinctly audible artifacts, the authors introduce an integrating parameter smoothing; with infinitely many iterations it yields a constant function over time representing the mean parameter vector.

5.2. Deviation Penalty

  • Viewing Figures 1 and 2, the authors note that for certain classes the obtained parameter values deviate strongly from the mean.
  • In Table 1, the authors show results for warping functions with one parameter (cf. Section 3).


VTLN-BASED VOICE CONVERSION
David Sündermann and Hermann Ney
RWTH Aachen University of Technology
Computer Science Department
Ahornstr. 55, 52056 Aachen, Germany
{suendermann,ney}@cs.rwth-aachen.de
ABSTRACT

In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker's voice into that of a target speaker, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on voice conversion are performed on three corpora of two languages and both speaker genders.
1. INTRODUCTION

Vocal tract length normalization [1] tries to compensate for the effect of speaker-dependent vocal tract lengths by warping the frequency axis of the amplitude spectrum. In speech recognition, VTLN aims at the normalization of a speaker's voice in order to remove individual speaker characteristics.

A similar task is voice conversion. It describes the modification of a source speaker's voice such that it is perceived to be spoken by a target speaker [2]. In this paper, we show how VTLN can be applied to this task.

In Section 2, we delineate a method to find corresponding speech segments, or artificial phonetic classes, in the training material of the source and the target speaker. These corresponding classes are used to estimate the parameters of class-dependent VTLN warping functions. Subsequently, in Section 3, we apply this training procedure to conventional warping functions depending on only one parameter.

Often, these conventional functions do not sufficiently model the speakers' characteristics. Therefore, we introduce a piece-wise linear warping function consisting of several linear segments. The greater the number of parameters, the more carefully we must deal with their practical estimation. All these considerations are discussed in Section 4.

Since the parameter estimation for classes with only few observations can be very inaccurate and, besides, we do not want the parameters to change abruptly from one class to another, in Section 5 we introduce two parameter smoothing methods. Finally, in Section 6, we present experimental results on three German and English corpora.
2. AUTOMATIC SEGMENTATION AND MAPPING

Most of the training procedures of state-of-the-art voice conversion techniques require training data containing the same utterances of both source and target speaker [3]. Besides, these utterances should feature a high degree of natural time alignment and a similar pitch contour [4].

However, in several voice conversion applications (e.g. spontaneous speaker adaptation or speech-to-speech translation) we do not possess corresponding time frames of source and target speaker. In [5], we address this problem as follows.

At first, we subdivide the speech material of speakers S and T into $K_S$ and $K_T$ artificial phonetic classes, respectively. This is done by clustering the frequency spectra of period-synchronous frames obtained by a pitch tracker. For unvoiced signal parts, pseudo periods are used. Now, for each source class $k_S$ we determine the most similar target class $\hat{k}_T(k_S)$. This class mapping is the basis for arbitrary statistical voice conversion parameter training.
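To make this pipeline concrete, the following sketch clusters frame spectra into artificial phonetic classes and maps each source class to its nearest target class by mean-spectrum distance. It is a minimal reconstruction under assumptions: plain k-means stands in for the paper's unspecified clustering method, and a Euclidean distance between class mean spectra stands in for its class similarity measure.

```python
import numpy as np

def cluster_spectra(spectra, num_classes, iters=20, seed=0):
    """Naive k-means over frame spectra (one row per pitch-synchronous frame)."""
    rng = np.random.default_rng(seed)
    means = spectra[rng.choice(len(spectra), num_classes, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to the nearest class mean
        dists = ((spectra[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(num_classes):
            if np.any(labels == k):          # keep the old mean for empty classes
                means[k] = spectra[labels == k].mean(axis=0)
    return labels, means

def map_classes(source_means, target_means):
    """For every source class k_S, return the most similar target class."""
    d = ((source_means[:, None, :] - target_means[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# toy usage: random "spectra" with K_S = K_T = 8 classes, 64 bins per frame
src = np.abs(np.random.default_rng(1).normal(size=(200, 64)))
tgt = np.abs(np.random.default_rng(2).normal(size=(300, 64)))
_, src_means = cluster_spectra(src, 8)
_, tgt_means = cluster_spectra(tgt, 8)
print(map_classes(src_means, tgt_means))     # k_T_hat(k_S) for k_S = 0..7
```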
2.1. Statistical Voice Conversion Parameter Training

Let $X_1^I = X_1, \ldots, X_I$ be the spectra belonging to source class $k_S$ and $Y_1^J$ those of the mapped class $\hat{k}_T(k_S)$. We generally estimate the parameter vector $\vartheta$ by minimizing the sum of the Euclidean distances between all target class spectra and transformed source class spectra. Here, we utilize the spectral conversion function $F_{\vartheta'}$ depending on the parameter vector $\vartheta'$:

$$\vartheta = \arg\min_{\vartheta'} \sum_{i=1}^{I} \sum_{j=1}^{J} \int_{\omega=0}^{\pi} \left| Y_j(\omega) - F_{\vartheta'}(X_i, \omega) \right|^2 d\omega \qquad (1)$$

In conjunction with a suitable smoothing technique, we can often neglect the variety of the classes' observation spectra by introducing a mean approximation without an essential effect on the voice conversion parameters:

$$\vartheta = \arg\min_{\vartheta'} \int_{\omega=0}^{\pi} \left| \bar{Y}(\omega) - F_{\vartheta'}(\bar{X}, \omega) \right|^2 d\omega \qquad (2)$$

Here, $\bar{X}$ and $\bar{Y}$ are the source and target classes' average spectra.
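A minimal numerical sketch of the mean approximation, Eq. 2: the integral is discretized on an equidistant $\omega$ grid and the parameter is found by exhaustive search over a candidate set. The grid size, the candidate range, and the toy gain-only conversion function are assumptions chosen purely for illustration.

```python
import numpy as np

def distance(Y_bar, X_bar, convert, theta, omega):
    """Discretized Eq. 2 objective: the integral over [0, pi] of
    |Y_bar(w) - F_theta(X_bar, w)|^2, approximated on the omega grid."""
    diff = Y_bar - convert(X_bar, omega, theta)
    return np.sum(np.abs(diff) ** 2) * (omega[1] - omega[0])

def estimate_parameter(Y_bar, X_bar, convert, candidates, omega):
    """Pick the candidate parameter vector minimizing the Eq. 2 distance."""
    return min(candidates,
               key=lambda theta: distance(Y_bar, X_bar, convert, theta, omega))

# toy usage: F_theta(X, w) = theta * X(w); the true factor 1.5 is recovered
omega = np.linspace(0, np.pi, 257)
X_bar = np.hanning(257)
Y_bar = 1.5 * X_bar
gain = lambda X, w, theta: theta * X
print(estimate_parameter(Y_bar, X_bar, gain, np.arange(0.5, 2.5, 0.05), omega))
```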
3. WARPING FUNCTIONS WITH ONE PARAMETER

In speech recognition, several VTLN warping functions have been proposed whose parameters are usually limited to one variable, the warping factor $\alpha$. Established warping functions are

the symmetric piece-wise linear function with two segments [6]

$$\tilde{\omega}_\alpha(\omega) = \begin{cases} \alpha\omega & : \omega \le \omega_0 \\ \alpha\omega_0 + \dfrac{\pi - \alpha\omega_0}{\pi - \omega_0}\,(\omega - \omega_0) & : \omega > \omega_0 \end{cases} \qquad (3)$$

$$\omega_0 = \begin{cases} \frac{7}{8}\,\pi & : \alpha \le 1 \\ \frac{7}{8\alpha}\,\pi & : \alpha > 1 \end{cases}$$

the power function [7]

$$\tilde{\omega}_\alpha(\omega) = \pi \left( \frac{\omega}{\pi} \right)^{\alpha}$$

the quadratic function [8]

$$\tilde{\omega}_\alpha(\omega) = \omega + \alpha \left( \frac{\omega}{\pi} - \left( \frac{\omega}{\pi} \right)^2 \right)$$

and the bilinear function [9]

$$\tilde{z}_\alpha(z) = \frac{z - \alpha}{1 - \alpha z} \quad \text{with } z = e^{i\omega}. \qquad (4)$$

In order to estimate the class-dependent warping factor $\alpha$, we use Eqs. 1 or 2, where

$$F_\alpha(X, \omega) = X(\tilde{\omega}_\alpha(\omega)). \qquad (5)$$
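These four functions are straightforward to implement on a discrete frequency grid; a sketch follows. Two details are assumptions of the example rather than statements of the paper: reading the source spectrum at the warped, generally non-integer positions in Eq. 5 is done by linear interpolation, and the bilinear warping is evaluated as the phase of the transformed unit-circle point.

```python
import numpy as np

def piecewise_linear(omega, alpha):
    """Symmetric two-segment piece-wise linear warping, Eq. 3."""
    w0 = 7 / 8 * np.pi if alpha <= 1 else 7 / (8 * alpha) * np.pi
    upper = alpha * w0 + (np.pi - alpha * w0) / (np.pi - w0) * (omega - w0)
    return np.where(omega <= w0, alpha * omega, upper)

def power(omega, alpha):
    """Power warping function [7]."""
    return np.pi * (omega / np.pi) ** alpha

def quadratic(omega, alpha):
    """Quadratic warping function [8]."""
    return omega + alpha * (omega / np.pi - (omega / np.pi) ** 2)

def bilinear(omega, alpha):
    """Bilinear warping function [9], Eq. 4, evaluated as a phase."""
    z = np.exp(1j * omega)
    return np.angle((z - alpha) / (1 - alpha * z))

def convert(X, omega, alpha, warp=piecewise_linear):
    """Eq. 5: F_alpha(X, w) = X(w_tilde_alpha(w)); linear interpolation assumed."""
    return np.interp(warp(omega, alpha), omega, X)
```

Any of these warp functions can be plugged into the grid search sketched after Eq. 2 to estimate a class-dependent $\alpha$.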
4. WARPING FUNCTIONS WITH SEVERAL PARAMETERS

4.1. Piece-Wise Linear Warping with Several Segments

One of the adverse properties of the conventional warping functions with one parameter is that the whole frequency axis is always warped in the same direction, either to lower or to higher frequencies. Consequently, these functions are not able to model spectral conversions where certain parts of the axis move to higher frequencies and other parts to lower frequencies, or vice versa. Such functions would require at least one inflection point and would cross the $\tilde{\omega} = \omega$ diagonal.

Applying the VTLN technique to voice conversion, we want to use more exact models than in speech recognition, i.e. warping functions with several parameters, for a better description of the individual characteristics of the speakers' vocal tracts.
Assuming there is an ideal warping function for a given class pair $(k_S, \hat{k}_T)$, an obvious model is given by the interpolation of this function by several linear segments, as a consequence of the simple two-segment linear warping, vide Eq. 3:

$$\tilde{\omega}_{\tilde{\omega}_1^S}(\omega) = \begin{cases} \tilde{\omega}_{0,\tilde{\omega}_1}(\omega) & \text{for } 0 \le \omega \le \frac{1}{S+1}\,\pi \\ \quad\vdots \\ \tilde{\omega}_{\tilde{\omega}_s,\tilde{\omega}_{s+1}}(\omega) & \text{for } \frac{s}{S+1}\,\pi \le \omega \le \frac{s+1}{S+1}\,\pi \\ \quad\vdots \\ \tilde{\omega}_{\tilde{\omega}_S,\pi}(\omega) & \text{for } \frac{S}{S+1}\,\pi \le \omega \le \pi \end{cases} \qquad (6)$$

$$\tilde{\omega}_{\tilde{\omega}',\tilde{\omega}''}(\omega) = \tilde{\omega}' + \left( \frac{S+1}{\pi}\,\omega - s \right) (\tilde{\omega}'' - \tilde{\omega}'), \qquad 0 \le \tilde{\omega}_1 \le \cdots \le \tilde{\omega}_S \le \pi. \qquad (7)$$

This formula describes a piece-wise linear function $\tilde{\omega}(\omega)$ starting at $(0, 0)$, ending at $(\pi, \pi)$, and connecting $S$ points whose $\omega$ values are equidistantly distributed. The corresponding $\tilde{\omega}_s$ are the parameters of the warping function. The resulting function is monotonic according to Eq. 7, as we do not want parts of the frequency axis to be exchanged.
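Eqs. 6 and 7 amount to linear interpolation between equidistant knots on the $\omega$ axis, fixed at $(0, 0)$ and $(\pi, \pi)$, with the $S$ interior ordinates as free parameters. A minimal sketch based on that observation (the assertion enforces the monotonicity constraint of Eq. 7):

```python
import numpy as np

def multi_segment_warp(omega, w_params):
    """Piece-wise linear warping of Eqs. 6/7. The knots sit at
    s * pi / (S + 1) for s = 0 .. S+1 with ordinates 0, w_1, ..., w_S, pi."""
    S = len(w_params)
    knots_y = np.concatenate(([0.0], np.asarray(w_params, float), [np.pi]))
    assert np.all(np.diff(knots_y) >= 0), "Eq. 7: 0 <= w_1 <= ... <= w_S <= pi"
    knots_x = np.linspace(0.0, np.pi, S + 2)   # equidistant omega values
    return np.interp(omega, knots_x, knots_y)

# example: S = 3; the low band is compressed, the high band expanded
omega = np.linspace(0, np.pi, 513)
warped = multi_segment_warp(omega, [0.6, 1.7, 2.6])
```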
4.2. Practical Parameter Estimation

In general, augmenting the number of parameters confronts us with an increasing need for computation time. This is particularly the case if the minimization of Eqs. 1 or 2 is performed by calculating the distances for all possible parameter combinations at a certain resolution: this estimation method results in computing time that grows exponentially with the number of considered parameters.

Viewing the definition of the piece-wise linear warping function with several segments, cf. Eq. 6, we note that the integrals used in Eqs. 1 and 2 can be rewritten as (also cp. Eq. 5)

$$d_{\tilde{\omega}_1^S} = \int_{\omega=0}^{\pi} \left| Y(\omega) - X(\tilde{\omega}_{\tilde{\omega}_1^S}(\omega)) \right|^2 d\omega = \sum_{s=0}^{S} \int_{\omega=\frac{s}{S+1}\pi}^{\frac{s+1}{S+1}\pi} \left| Y(\omega) - X(\tilde{\omega}_{\tilde{\omega}_s,\tilde{\omega}_{s+1}}(\omega)) \right|^2 d\omega.$$

This enables us to use dynamic programming to search for the minimum distance and therewith the optimal parameter vector $\tilde{\omega}_1^S$.
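The sketch below illustrates the resulting dynamic program under stated assumptions: real-valued mean (magnitude) spectra sampled on an equidistant grid, candidate ordinates restricted to a value grid, and per-segment integrals approximated with a fixed number of samples. Its cost grows like $S$ times the square of the grid size, instead of exponentially in $S$.

```python
import numpy as np

def optimal_warp_dp(X_bar, Y_bar, omega, S, grid_size=48):
    """DP search for the parameters w_1..w_S of the multi-segment warping,
    exploiting that the total distance is a sum of per-segment costs."""
    grid = np.linspace(0.0, np.pi, grid_size)   # candidate ordinate values
    b = np.linspace(0.0, np.pi, S + 2)          # segment boundaries on omega

    def seg_cost(s, w_lo, w_hi):
        """Distance contribution of segment s, warped linearly w_lo -> w_hi."""
        om = np.linspace(b[s], b[s + 1], 32)
        warped = w_lo + (om - b[s]) / (b[s + 1] - b[s]) * (w_hi - w_lo)
        diff = np.interp(om, omega, Y_bar) - np.interp(warped, omega, X_bar)
        return np.sum(diff ** 2) * (om[1] - om[0])

    # D[g]: best cost of all segments so far, given the current ordinate grid[g]
    D = np.array([seg_cost(0, 0.0, w) for w in grid])
    back = np.zeros((S, grid_size), dtype=int)
    for s in range(1, S):
        D_new = np.empty(grid_size)
        for g, w in enumerate(grid):
            # monotonicity: the previous ordinate may not exceed grid[g]
            cand = [D[p] + seg_cost(s, grid[p], w) for p in range(g + 1)]
            back[s, g] = int(np.argmin(cand))
            D_new[g] = cand[back[s, g]]
        D = D_new
    total = D + np.array([seg_cost(S, w, np.pi) for w in grid])  # last segment

    g_best = int(np.argmin(total))              # backtrack the best parameters
    params, g = [grid[g_best]], g_best
    for s in range(S - 1, 0, -1):
        g = back[s, g]
        params.append(grid[g])
    return np.array(params[::-1]), float(total[g_best])
```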
5. PARAMETER SMOOTHING

5.1. Iterative Integrating Smoothing

The basis of the voice conversion technique delineated in this paper is the automatic class segmentation and mapping described in Section 2. In Figure 1, we show the time course of the word "Arizona" and the corresponding classes for $K_S = 8$.

To avoid the class-dependent voice conversion parameters jumping at the class boundaries and causing distinctly audible artifacts in the converted speech, we introduce an integrating parameter smoothing which iteratively adapts a parameter vector by adding a weighted mean of the chronologically neighbored vectors. Figure 2 shows the effect of this smoothing technique for 5, 50 and 5000 iterations using the symmetric piece-wise warping function described in Eq. 3. If the number of iterations approaches infinity, we obtain a constant function over time representing the mean parameter vector.
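The paper does not spell out the exact update rule, so the following is a hedged reading: in each iteration every parameter vector is blended with the mean of its chronological neighbors, with an assumed blending weight mu. Few iterations smooth locally; with many iterations the trajectory flattens towards the mean parameter vector, matching the behavior described for Figure 2.

```python
import numpy as np

def integrating_smoothing(params, iterations, mu=0.5):
    """Iteratively add a weighted mean of the chronologically neighbored
    parameter vectors. params has shape (T, P): one vector per time step."""
    p = np.asarray(params, dtype=float).copy()
    for _ in range(iterations):
        nbr = 0.5 * (np.roll(p, 1, axis=0) + np.roll(p, -1, axis=0))
        nbr[0], nbr[-1] = p[1], p[-2]        # simple boundary treatment
        p = (p + mu * nbr) / (1 + mu)        # blend towards the neighbors
    return p

# a 1-D alpha trajectory: 5 iterations smooth it, 5000 make it nearly constant
alphas = np.array([[1.2], [1.4], [0.8], [1.5], [1.3]])
print(integrating_smoothing(alphas, 5).ravel())
print(integrating_smoothing(alphas, 5000).ravel())
```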
5.2. Deviation Penalty

Viewing Figures 1 and 2, we note that for certain classes the obtained parameter values deviate strongly from the mean. E.g., for $k_S = 7$ we obtain an $\alpha$ less than 1, whereas this particular voice conversion (female-to-male) should result in values greater than 1. Considering the mean of $\bar{\alpha} = 1.3$, the parameter values are to be controlled and, if necessary, corrected towards the mean.

This is performed by applying the minimization of Eqs. 1 or 2 a second time, having added a penalty term to the enclosed integral. Both addends are normalized by their maximum and then weighted utilizing the real value $0 \le \lambda \le 1$ to adjust the penalty strength. Hence, $\lambda = 1$ does not influence the class parameters at all, whereas $\lambda = 0$ forces all parameters to be equal to their mean $\bar{\vartheta}$. An equilibrium between both terms lies around $\lambda = 0.5$.

In the following, we assume $X$ and $Y$ to have unity energy $E_0$ in order to remove the dependence of the distances on the signal loudness.
$$d_\vartheta = \lambda \, \frac{\displaystyle\int_{\omega=0}^{\pi} \left| Y(\omega) - X(\tilde{\omega}_\vartheta(\omega)) \right|^2 d\omega}{\displaystyle\max_{X', Y'} \int_{\omega=0}^{\pi} \left| Y'(\omega) - X'(\omega) \right|^2 d\omega} \;+\; (1 - \lambda) \, \frac{\displaystyle\int_{\omega=0}^{\pi} \left( \tilde{\omega}_{\bar{\vartheta}}(\omega) - \tilde{\omega}_\vartheta(\omega) \right)^2 d\omega}{\displaystyle\max_{\bar{\vartheta}', \vartheta'} \int_{\omega=0}^{\pi} \left( \tilde{\omega}_{\bar{\vartheta}'}(\omega) - \tilde{\omega}_{\vartheta'}(\omega) \right)^2 d\omega}$$

Fig. 1. Automatic Class Segmentation for the Word "Arizona".

Fig. 2. Iterative Integrating Smoothing for Warping Functions with One Parameter.
After calculating the maximal distance between arbitrary complex spectra $X'$ and $Y'$ and between arbitrary real warping functions $\tilde{\omega}_{\bar{\vartheta}'}$ and $\tilde{\omega}_{\vartheta'}$, respectively, we obtain

$$d_\vartheta = \int_{\omega=0}^{\pi} \left\{ \frac{\lambda}{4 E_0} \left| Y(\omega) - X(\tilde{\omega}_\vartheta(\omega)) \right|^2 + \frac{1 - \lambda}{\pi^3} \left( \tilde{\omega}_{\bar{\vartheta}}(\omega) - \tilde{\omega}_\vartheta(\omega) \right)^2 \right\} d\omega.$$
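A discretized sketch of this final distance; as before, reading X at warped frequencies by linear interpolation is an assumption of the example. Re-running the minimization of Section 3 or 4 with this distance in place of the plain spectral distance implements the deviation penalty.

```python
import numpy as np

def penalized_distance(Y, X, warped, warped_mean, omega, lam, E0=1.0):
    """Penalty distance d_theta of Section 5.2 on a discrete omega grid.
    Y, X        : unity-energy spectra sampled on `omega`
    warped      : w_tilde_theta(omega) for the candidate parameters
    warped_mean : w_tilde for the mean parameter vector theta_bar
    lam         : penalty weight; 1.0 gives the plain spectral distance,
                  0.0 forces the parameters towards the mean."""
    spec = np.abs(Y - np.interp(warped, omega, X)) ** 2   # |Y - X(w~)|^2
    dev = (warped_mean - warped) ** 2                     # warping deviation
    integrand = lam / (4 * E0) * spec + (1 - lam) / np.pi ** 3 * dev
    return float(np.sum(integrand) * (omega[1] - omega[0]))
```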
6. EXPERIMENTS
Several experiments have been performed to investigate the
properties of VTLN voice conversion with respect to the
warping functions discussed in this paper.

Three corpora of different languages and genders have
been applied:
[A] 3 English sentences of a female speaker,
[B] 10 German sentences of a male speaker (poems),
[C] 3 German sentences of a male speaker (news).
In the following, we report results for three combinations of
these corpora:
F2M: female [A] is converted to male [B],
M2F: male [B] is converted to female [A],
M2M: male [C] is converted to male [B].
As error measure, we use the normalized class average distance

$$d_{\mathrm{cad}} = \frac{\displaystyle\sum_{k=1}^{K_S} \int_{\omega=0}^{\pi} \left| \bar{Y}_k(\omega) - \bar{X}_k(\tilde{\omega}_{\vartheta_k}(\omega)) \right|^2 d\omega}{4 \, K_S \, E_0}.$$

Again, $\bar{X}$ and $\bar{Y}$ are spectra with unity energy $E_0$; consequently, we have $0 \le d_{\mathrm{cad}} \le 1$ (cp. Section 5.2).

In Table 1, we show results for warping functions with one parameter (cf. Section 3). In the third row, the results for the trivial solution $\tilde{\omega} = \omega$, i.e. no warping at all, are displayed to assess the absolute $d_{\mathrm{cad}}$ values.
Table 1. Error Measure for Warping Functions with One Parameter

                        class average distance [%]
  warping function        F2M     M2F     M2M
  no warping              8.3    13.2     7.3
  piece-wise linear       6.0     6.4     6.2
  power                   5.2     6.4     6.2
  quadratic               5.4     7.8     6.2
  bilinear                5.5     6.5     6.2
We note that the presented warping techniques do not differ essentially; nevertheless, in our experiments, the power function consistently produced the best outcomes. The most significant effect was achieved for male-to-female voice conversion, which is due to the large differences of the vocal tracts. Concerning the above results, the opposite case is more complicated. This statement is also supported by our next experiments dealing with the piece-wise warping with several segments, vide Table 2.

This table conspicuously demonstrates how the number of free parameters affects the warping precision. If $S$ approaches the number of spectral lines of the compared spectra, the method passes into a variant of dynamic frequency warping with certain constraints. Nevertheless, subjective tests have shown that excessively increasing the number of free parameters results in an overfitting between source and target spectra and therewith disturbs the naturalness of the output speech.
Table 2. Error Measure for the Piece-Wise Warping Function with Several Segments

          class average distance [%]
   S        F2M     M2F     M2M
   1        6.7     7.6     6.3
   2        6.0     6.1     5.7
   4        5.4     5.0     5.1
   8        4.9     4.1     4.7
  16        4.5     3.4     4.0
  32        4.2     2.3     3.0
  64        4.1     1.4     2.3
Future experiments are to investigate the consistency of the above results on other corpora. Furthermore, the overfitting effect is to be demonstrated using an adequate objective error criterion.
7. REFERENCES

[1] T. Kamm, G. Andreou, and J. Cohen, "Vocal tract normalization in speech recognition: Compensating for systematic speaker variability," in Proc. of the 15th Annual Speech Research Symposium, Baltimore, USA, 1995.

[2] E. Moulines and Y. Sagisaka, "Voice conversion: State of the art and perspectives," Speech Communication, 16(2), 1995.

[3] O. Türk, "New methods for voice conversion," PhD thesis, Boğaziçi University, Istanbul, Turkey, 2003.

[4] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. of the ICASSP'98, Seattle, USA, 1998.

[5] D. Sündermann and H. Ney, "An automatic segmentation and mapping approach for voice conversion parameter training," in Proc. of the AST'03, Maribor, Slovenia, 2003.

[6] L. F. Uebel and P. C. Woodland, "An investigation into vocal tract length normalization," in Proc. of the EUROSPEECH'99, Budapest, Hungary, 1999.

[7] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. of the ICASSP'96, Atlanta, USA, 1996.

[8] M. Pitz, S. Molau, R. Schlüter, and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," in Proc. of the EUROSPEECH'01, Aalborg, Denmark, 2001.

[9] A. Acero and R. M. Stern, "Robust speech recognition by normalization of the acoustic space," in Proc. of the ICASSP'91, Toronto, Canada, 1991.
Citations
Journal ArticleDOI
TL;DR: A survey of past work and priority research directions for the future is provided, showing that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.

433 citations


Cites methods from "VTLN-based voice conversion"

  • ...As alternatives to data-driven statistical conversion methods, frequency warping based approaches to voice conversion were introduced in Toda et al. (2001), Sundermann and Ney (2003), Erro et al. (2010), Godoy et al. (2012) and Erro et al. (2013)....


01 Jan 2014
TL;DR: In this paper, the authors provide a survey of spoofing countermeasures for automatic speaker verification, highlighting the need for more effort in the future to ensure adequate protection against spoofing attacks.
Abstract: While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.

371 citations

Journal ArticleDOI
TL;DR: This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.
Abstract: Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this article, we provide a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discuss their promise and limitations. We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.

187 citations


Cites methods from "VTLN-based voice conversion"

  • ...In recent literature, the warping function is either realized by a single parameter, such as VTLN-based approaches [26], [134]–[137], or represented as a...


Journal ArticleDOI
TL;DR: Compared to standard probabilistic systems, Weighted Frequency Warping results in a significant increase in quality scores, whereas the conversion scores remain almost unaltered.
Abstract: Any modification applied to speech signals has an impact on their perceptual quality. In particular, voice conversion to modify a source voice so that it is perceived as a specific target voice involves prosodic and spectral transformations that produce significant quality degradation. Choosing among the current voice conversion methods represents a trade-off between the similarity of the converted voice to the target voice and the quality of the resulting converted speech, both rated by listeners. This paper presents a new voice conversion method termed Weighted Frequency Warping that has a good balance between similarity and quality. This method uses a time-varying piecewise-linear frequency warping function and an energy correction filter, and it combines typical probabilistic techniques and frequency warping transformations. Compared to standard probabilistic systems, Weighted Frequency Warping results in a significant increase in quality scores, whereas the conversion scores remain almost unaltered. This paper carefully discusses the theoretical aspects of the method and the details of its implementation, and the results of an international evaluation of the new system are also included.

185 citations


Cites background from "VTLN-based voice conversion"

  • ...Further improvements related to frequency-warping were presented in [8]–[10]....


Journal ArticleDOI
TL;DR: The subjective listening tests indicate that the naturalness of the converted speech by the proposed method is comparable with that by the ML-GMM method with global variance constraint, and the results show the superiority of the method over PLS-based methods.
Abstract: We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly without dimensionality reduction to maintain spectral details. In addition, a spectral compression factor and a residual compensation technique are included in the framework to enhance the conversion performances. We conducted experiments on the VOICES database to compare the proposed method with a large set of state-of-the-art baseline methods, including the maximum likelihood Gaussian mixture model (ML-GMM) with dynamic feature constraint and the partial least squares (PLS) regression based methods. The experimental results show that the objective spectral distortion of ML-GMM is reduced from 5.19 dB to 4.92 dB, and both the subjective mean opinion score and the speaker identification rate are increased from 2.49 and 73.50% to 3.15 and 79.50%, respectively, by the proposed method. The results also show the superiority of our method over PLS-based methods. In addition, the subjective listening tests indicate that the naturalness of the converted speech by our proposed method is comparable with that by the ML-GMM method with global variance constraint.

179 citations


Cites background from "VTLN-based voice conversion"

  • ...A large number of statistical parametric approaches have attempted to achieve a robust spectral mapping....


References
Proceedings ArticleDOI
07 May 1996
TL;DR: A parametric method of normalisation is described which counteracts the effect of varied vocal tract length and is shown to be effective across a wide range of recognition systems and paradigms, but is particularly helpful in the case of a small amount of training data.
Abstract: Differences in vocal tract size among individual speakers contribute to the variability of speech waveforms. The first-order effect of a difference in vocal tract length is a scaling of the frequency axis; a female speaker, for example, exhibits formants roughly 20% higher than the formants of a male speaker, with the differences most severe in open vocal tract configurations. We describe a parametric method of normalisation which counteracts the effect of varied vocal tract length. The method is shown to be effective across a wide range of recognition systems and paradigms, but is particularly helpful in the case of a small amount of training data.

328 citations

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Several algorithms are presented that increase the robustness of SPHINX, the CMU (Carnegie Mellon University) continuous-speech speaker-independent recognition systems, by normalizing the acoustic space via minimization of the overall VQ distortion.
Abstract: Several algorithms are presented that increase the robustness of SPHINX, the CMU (Carnegie Mellon University) continuous-speech speaker-independent recognition system, by normalizing the acoustic space via minimization of the overall VQ distortion. The authors propose an affine transformation of the cepstrum in which a matrix multiplication performs frequency normalization and a vector addition attempts environment normalization. The algorithms for environment normalization are efficient and improve the recognition accuracy when the system is tested on a microphone other than the one on which it was trained. The frequency normalization algorithm applies a different warping on the frequency axis to different speakers and it achieves a 10% decrease in error rate.

229 citations

Journal ArticleDOI
TL;DR: In this paper, the Jacobian determinant of the transformation matrix is computed analytically for three typical warping functions and it is shown that the matrices are diagonal dominant and thus can be approximated by quindiagonal matrices.
Abstract: Vocal tract normalization (VTN) is a widely used speaker normalization technique which reduces the effect of different lengths of the human vocal tract and results in an improved recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain, which so far have been considered as independent approaches of speaker normalization. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalization for automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results that improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions the transformation matrix is calculated analytically and we show that the matrices are diagonal dominant and thus can be approximated by quindiagonal matrices.

217 citations

Journal ArticleDOI
TL;DR: In the SWITCHBOARD corpus as mentioned in this paper, an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers by warping the spectrum of each speaker linearly over a 20% range, and finding the maximum a posteriori probability of the data given the warp.
Abstract: The performance of speech recognition systems is often improved by accounting explicitly for sources of variability in the data. In the SWITCHBOARD corpus, studied during the 1994 CAIP workshop [Frontiers in Speech Processing Workshop II, CAIP (August 1994)], an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers. The method found a maximum probability parameter for each speaker which mapped an acoustic model to the mean of the models taken from a homogeneous speaker population. The underlying acoustic model was that of a straight tube, and the parameter estimation was accomplished by warping the spectrum of each speaker linearly over a 20% range (actually accomplished by digitally resampling the data), and finding the maximum a posteriori probability of the data given the warp. The technique produces statistically significant improvements in accuracy on a speech transcription task using each of four different speech recognition systems. The best parametrizations were later found to correlate well with vocal tract estimates computed manually from spectrograms.

103 citations


"VTLN-based voice conversion" refers background in this paper

  • ...INTRODUCTION Vocal tract length normalization [1] tries to compensate for the effect of speaker dependent vocal tract lengths by warping the frequency axis of the amplitude spectrum....


Proceedings Article
16 Sep 1999
TL;DR: It was found that if multiple iterations of constrained MLLR are used there is no additional advantage to also using VTLN, and that, as previously reported, the effects of VTLN and unconstrained MLLR are largely additive.
Abstract: This paper investigates several different methods for performing vocal tract length normalisation (VTLN) which are either completely linear or piece-wise linear. Furthermore, the combination of VTLN with either standard unconstrained maximum likelihood linear regression (MLLR) or constrained MLLR is considered. Results on the Switchboard corpus show that there is little difference in performance between the different forms of VTLN, and that, as previously reported, the effects of VTLN and unconstrained MLLR are largely additive. However, it was found that if multiple iterations of constrained MLLR are used there is no additional advantage to also using VTLN.

78 citations

Frequently Asked Questions (1)
Q1. What are the contributions in "VTLN-based voice conversion"?

In speech recognition, vocal tract length normalization ( VTLN ) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker ’ s voice into that of a target speaker, the authors want to investigate whether VTLN is an appropriate method to adapt the voice characteristics.