Q: What are the contributions in "Vtln-based voice conversion" ?

In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker’s voice into that of a target speaker, the authors want to investigate whether VTLN is an appropriate method to adapt the voice characteristics.

(Open Access) VTLN-based voice conversion (2003) | David Sündermann

VTLN-BASED VOICE CONVERSION

David Sündermann and Hermann Ney

RWTH Aachen – University of Technology

Computer Science Department

Ahornstr. 55, 52056 Aachen, Germany

{suendermann,ney}@cs.rwth-aachen.de

ABSTRACT

In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As voice conversion aims at the transformation of a source speaker’s voice into that of a target speaker, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on voice conversion are performed on three corpora of two languages and both speaker genders.

1. INTRODUCTION

Vocal tract length normalization [1] tries to compensate for the effect of speaker-dependent vocal tract lengths by warping the frequency axis of the amplitude spectrum. In speech recognition, VTLN aims at the normalization of a speaker’s voice in order to remove individual speaker characteristics.

A similar task is voice conversion. It describes the modification of a source speaker’s voice such that it is perceived to be spoken by a target speaker [2]. In this paper, we show how VTLN can be applied to this task.

In Section 2, we delineate a method to find corresponding speech segments, i.e. artificial phonetic classes, in the training material of the source and the target speaker. These corresponding classes are used to estimate the parameters of class-dependent VTLN warping functions. Subsequently, in Section 3, we apply this training procedure to conventional warping functions depending on only one parameter.

Often, these conventional functions do not sufficiently model the speakers’ characteristics. Therefore, we introduce a piece-wise linear warping function consisting of several linear segments. The greater the number of parameters, the more carefully we must deal with their practical estimation. All these considerations are discussed in Section 4.

Since the parameter estimation for classes with only few observations can be very inaccurate and, besides, we do not want the parameters to change abruptly from one class to another, we introduce two parameter smoothing methods in Section 5. Finally, in Section 6, we present experimental results on three German and English corpora.

2. AUTOMATIC SEGMENTATION AND MAPPING

Most of the training procedures of state-of-the-art voice conversion techniques require training data containing the same utterances of both source and target speaker [3]. Besides, these utterances should feature a high degree of natural time alignment and similar pitch contour [4].

However, in several voice conversion applications (e.g. spontaneous speaker adaptation or speech-to-speech translation) we do not possess corresponding time frames of source and target speaker. In [5], we address this problem as follows.

At first, we subdivide the speech material of speakers S and T into $K_S$ and $K_T$ artificial phonetic classes, respectively. This is done by clustering the frequency spectra of period-synchronous frames obtained by a pitch tracker. For unvoiced signal parts, pseudo periods are used. Now, for each source class $k_S$ we determine the most similar target class $\hat{k}_T(k_S)$. This class mapping is the basis for an arbitrary statistical voice conversion parameter training.

2.1. Statistical Voice Conversion Parameter Training

Let $X_1^I = X_1, \ldots, X_I$ be the spectra belonging to source class $k_S$ and $Y_1^J$ those of the mapped class $\hat{k}_T(k_S)$. We generally estimate the parameter vector $\vartheta$ by minimizing the sum of the Euclidean distances between all target class spectra and transformed source class spectra. Here, we utilize the spectral conversion function $F_{\vartheta'}$ depending on the parameter vector $\vartheta'$:

$$\vartheta = \arg\min_{\vartheta'} \sum_{i=1}^{I} \sum_{j=1}^{J} \int_{\omega=0}^{\pi} \left| Y_j(\omega) - F_{\vartheta'}(X_i, \omega) \right|^2 d\omega \tag{1}$$

In conjunction with a suitable smoothing technique, we can often neglect the variety of the classes’ observation spectra by introducing a mean approximation without an essential effect on the voice conversion parameters:

$$\vartheta = \arg\min_{\vartheta'} \int_{\omega=0}^{\pi} \left| \bar{Y}(\omega) - F_{\vartheta'}(\bar{X}, \omega) \right|^2 d\omega \tag{2}$$

Here, $\bar{X}$ and $\bar{Y}$ are the source and target classes’ average spectra.

3. WARPING FUNCTIONS WITH ONE PARAMETER

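In speech recognition, several VTLN warping functions have been proposed whose parameters are usually limited to one variable, the warping factor $\alpha$. Established warping functions are

• the symmetric piece-wise linear function with two segments [6]

$$\tilde\omega_\alpha(\omega) = \begin{cases} \alpha\omega & : \omega \le \omega_0 \\ \alpha\omega_0 + \dfrac{\pi - \alpha\omega_0}{\pi - \omega_0}\,(\omega - \omega_0) & : \omega \ge \omega_0 \end{cases} \tag{3}$$

$$\omega_0 = \begin{cases} \dfrac{7}{8}\,\pi & : \alpha \le 1 \\ \dfrac{7}{8\alpha}\,\pi & : \alpha \ge 1 \end{cases}$$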

• the power function [7]

$$\tilde\omega_\alpha(\omega) = \pi \left( \frac{\omega}{\pi} \right)^{\alpha}$$

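• the quadratic function [8]

$$\tilde\omega_\alpha(\omega) = \omega + \alpha \left( \frac{\omega}{\pi} - \left( \frac{\omega}{\pi} \right)^2 \right)$$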

• the bilinear function [9]

$$\tilde z_\alpha(z) = \frac{z - \alpha}{1 - \alpha z} \quad \text{with } z = e^{i\omega} \tag{4}$$

In order to estimate the class-dependent warping factor $\alpha$, we use Eqs. 1 or 2, where

$$F_\alpha(X, \omega) = X(\tilde\omega_\alpha(\omega)). \tag{5}$$
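As a rough illustration of how Eq. 2 and Eq. 5 combine for a one-parameter warping function, the following Python sketch estimates $\alpha$ by an exhaustive search over a list of candidates, using the bilinear function of Eq. 4. The uniform sampling of the spectra on [0, π], the linear interpolation, the Riemann-sum integration and all function names are assumptions made for this example; they are not details taken from the paper.

```python
import cmath
import math


def warp_bilinear(omega, alpha):
    """Bilinear warping (Eq. 4): phase of (z - alpha) / (1 - alpha * z), |alpha| < 1."""
    if omega <= 0.0:
        return 0.0
    if omega >= math.pi:
        return math.pi
    z = cmath.exp(1j * omega)
    # For 0 < omega < pi the phase is positive; abs() only guards against rounding.
    return abs(cmath.phase((z - alpha) / (1.0 - alpha * z)))


def sample(spectrum, omega):
    """Linearly interpolate a spectrum sampled uniformly on [0, pi] (assumed layout)."""
    omega = min(max(omega, 0.0), math.pi)
    pos = omega / math.pi * (len(spectrum) - 1)
    low = min(int(pos), len(spectrum) - 2)
    frac = pos - low
    return (1.0 - frac) * spectrum[low] + frac * spectrum[low + 1]


def distance(x_mean, y_mean, alpha, warp=warp_bilinear):
    """Riemann-sum approximation of the integral in Eq. 2 with F taken from Eq. 5."""
    n = len(y_mean)
    step = math.pi / (n - 1)
    total = 0.0
    for i in range(n):
        diff = y_mean[i] - sample(x_mean, warp(i * step, alpha))
        total += abs(diff) ** 2 * step
    return total


def estimate_alpha(x_mean, y_mean, candidates, warp=warp_bilinear):
    """Return the candidate warping factor with the smallest distance."""
    return min(candidates, key=lambda a: distance(x_mean, y_mean, a, warp))
```

Any of the other one-parameter functions can be passed as `warp` as long as it maps [0, π] onto [0, π]. The exhaustive search is affordable here because only one parameter is searched.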

4. WARPING FUNCTIONS WITH SEVERAL PARAMETERS

4.1. Piece-Wise Linear Warping with Several Segments

One of the adverse properties of the conventional warping functions with one parameter is that the whole frequency axis is always warped in the same direction, either to lower or to higher frequencies. Consequently, these functions are not able to model spectral conversions where certain parts of the axis move to higher frequencies and other parts to lower frequencies, or vice versa. Such functions would require at least one inflection point and would cross the $\tilde\omega = \omega$ diagonal.

Applying the VTLN technique to voice conversion, we want to use more exact models than in speech recognition, i.e. warping functions with several parameters, for a better description of the individual characteristics of the speakers’ vocal tracts.

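Assuming there is an ideal warping function for a given class pair $(k_S, \hat{k}_T)$, an obvious model is given by the interpolation of this function by several linear segments, as a natural extension of the simple two-segment linear warping, cf. Eq. 3:

$$\tilde\omega_{\tilde\omega_1^S}(\omega) = \begin{cases} \tilde\omega_{0,\tilde\omega_1}(\omega) & \text{for } 0 \le \omega \le \frac{1}{S+1} \cdot \pi \\ \quad\vdots \\ \tilde\omega_{\tilde\omega_s,\tilde\omega_{s+1}}(\omega) & \text{for } \frac{s}{S+1} \cdot \pi \le \omega \le \frac{s+1}{S+1} \cdot \pi \\ \quad\vdots \\ \tilde\omega_{\tilde\omega_S,\pi}(\omega) & \text{for } \frac{S}{S+1} \cdot \pi \le \omega \le \pi \end{cases} \tag{6}$$

with

$$\tilde\omega_{\tilde\omega',\tilde\omega''}(\omega) = \tilde\omega' + \left( \frac{S+1}{\pi} \cdot \omega - s \right) \cdot (\tilde\omega'' - \tilde\omega'),$$

$$0 \le \tilde\omega_1 \le \cdots \le \tilde\omega_S \le \pi. \tag{7}$$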

This formula describes a piece-wise linear function $\tilde\omega(\omega)$ starting at (0, 0), ending at (π, π), and connecting S points whose ω values are equidistantly distributed. The corresponding $\tilde\omega_s$ are the parameters of the warping function. The resulting function is monotonic according to Eq. 7, as we do not want parts of the frequency axis to be exchanged.
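The multi-segment function of Eqs. 6 and 7 can be sketched in a few lines of Python. The function name `warp_multi` and the choice to raise an error for non-monotonic parameters are assumptions made for this example.

```python
import math


def warp_multi(omega, params):
    """Piece-wise linear warping with S = len(params) interior points (Eqs. 6 and 7).

    params[s - 1] is the warped frequency assigned to omega = s * pi / (S + 1).
    """
    num_params = len(params)
    knots = [0.0] + list(params) + [math.pi]
    if any(a > b for a, b in zip(knots, knots[1:])):
        raise ValueError("parameters must satisfy 0 <= w_1 <= ... <= w_S <= pi")
    omega = min(max(omega, 0.0), math.pi)
    pos = (num_params + 1) / math.pi * omega      # position measured in segments
    s = min(int(pos), num_params)                 # index of the segment containing omega
    return knots[s] + (pos - s) * (knots[s + 1] - knots[s])
```

With `params = [s * math.pi / (S + 1) for s in range(1, S + 1)]` the function reduces to the identity, i.e. no warping. Values above and below this diagonal can be mixed freely, which is exactly what the one-parameter functions cannot express.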

4.2. Practical Parameter Estimation

In general, increasing the number of parameters confronts us with an increasing need for computation time. This is particularly the case if the minimization of Eqs. 1 or 2 is performed by calculating the distances for all possible parameter combinations at a certain resolution. With this estimation method, the computing time grows exponentially with the number of parameters considered.

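Viewing the definition of the piece-wise linear warping function with several segments, cf. Eq. 6, we note that the integrals used in Eqs. 1 and 2 can be rewritten as (also cf. Eq. 5)

$$\int_{\omega=0}^{\pi} \left| Y(\omega) - X(\tilde\omega_{\tilde\omega_1^S}(\omega)) \right|^2 d\omega = \sum_{s=0}^{S} \int_{\omega=\frac{s}{S+1} \cdot \pi}^{\frac{s+1}{S+1} \cdot \pi} \left| Y(\omega) - X(\tilde\omega_{\tilde\omega_s,\tilde\omega_{s+1}}(\omega)) \right|^2 d\omega,$$

where $\tilde\omega_0 = 0$ and $\tilde\omega_{S+1} = \pi$. Each summand depends on only two neighboring parameters. This enables us to use dynamic programming for searching the minimum distance and thereby the optimal parameter vector $\tilde\omega_1^S$.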
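The paper does not spell out the dynamic programming recursion, so the following Python sketch is only one plausible reading of it. It assumes that the candidate values of each parameter are restricted to a uniform grid of `levels` values in [0, π], that spectra are sampled uniformly on [0, π] and interpolated linearly, and that each segment integral is approximated by a midpoint rule. All names are hypothetical.

```python
import math


def sample(spectrum, omega):
    """Linearly interpolate a spectrum sampled uniformly on [0, pi] (assumed layout)."""
    omega = min(max(omega, 0.0), math.pi)
    pos = omega / math.pi * (len(spectrum) - 1)
    low = min(int(pos), len(spectrum) - 2)
    frac = pos - low
    return (1.0 - frac) * spectrum[low] + frac * spectrum[low + 1]


def segment_cost(x, y, s, num_params, start, end, points=16):
    """Midpoint-rule integral of |Y(w) - X(warp(w))|^2 over segment s,
    where the warp rises linearly from `start` to `end` on that segment."""
    width = math.pi / (num_params + 1)
    total = 0.0
    for p in range(points):
        t = (p + 0.5) / points
        diff = sample(y, (s + t) * width) - sample(x, start + t * (end - start))
        total += abs(diff) ** 2 * width / points
    return total


def best_warp(x, y, num_params, levels):
    """Find the monotone parameter vector on the value grid with minimum total cost."""
    grid = [q * math.pi / (levels - 1) for q in range(levels)]
    best = {0: (0.0, [])}  # grid index at the current knot -> (cost, chosen values)
    for s in range(num_params + 1):
        # The knot after the last segment is fixed at pi.
        targets = range(levels) if s < num_params else [levels - 1]
        nxt = {}
        for q in targets:
            for p, (cost, path) in best.items():
                if p > q:
                    continue  # monotonicity constraint of Eq. 7
                total = cost + segment_cost(x, y, s, num_params, grid[p], grid[q])
                if q not in nxt or total < nxt[q][0]:
                    nxt[q] = (total, path + [grid[q]])
        best = nxt
    cost, path = best[levels - 1]
    return cost, path[:-1]  # drop the fixed end point pi
```

Under these assumptions the number of segment integrals grows with the number of segments times the square of `levels`, instead of exponentially in the number of parameters as in the exhaustive search.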

5. PARAMETER SMOOTHING

5.1. Iterative Integrating Smoothing

The basis of the voice conversion technique delineated in this paper is the automatic class segmentation and mapping described in Section 2. In Figure 1, we show the time course of the word “Arizona” and the corresponding classes for $K_S = 8$.

To prevent the class-dependent voice conversion parameters from jumping at the class boundaries, which causes distinctly audible artifacts in the converted speech, we introduce an integrating parameter smoothing which iteratively adapts a parameter vector by adding a weighted mean of the chronologically neighboring vectors. Figure 2 shows the effect of this smoothing technique for 5, 50 and 5000 iterations using the symmetric piece-wise warping function described in Eq. 3. If the number of iterations approaches infinity, we obtain a function that is constant over time and represents the mean parameter vector.
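The smoothing is described only verbally, so the following Python sketch fixes details the text leaves open. It assumes one scalar parameter per frame, an update that mixes each value with the plain average of its two temporal neighbors, a mixing weight of 0.5, and boundary frames that use themselves in place of the missing neighbor. These choices are assumptions for illustration; they were picked because they preserve the mean and converge to it, matching the limit behavior stated above.

```python
def smooth(params, iterations, weight=0.5):
    """Iteratively pull each frame's parameter towards its temporal neighbors.

    params: one parameter value per frame, in chronological order.
    weight: share of the neighbor average in each update (assumed value).
    """
    values = list(params)
    last = len(values) - 1
    for _ in range(iterations):
        values = [
            (1.0 - weight) * values[t]
            + weight * 0.5 * (values[max(t - 1, 0)] + values[min(t + 1, last)])
            for t in range(len(values))
        ]
    return values
```

A small number of iterations rounds off the jumps at class boundaries while keeping the class-specific levels visible; a very large number flattens the track to its mean.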

5.2. Deviation Penalty

Viewing Figures 1 and 2, we note that for certain classes the obtained parameter values deviate strongly from the mean. E.g. for $k_S = 7$ we obtain an α less than 1, whereas the particular voice conversion (female–male) should result in values greater than 1. Considering the mean of $\bar\alpha = 1.3$, the parameter values are to be controlled and, if necessary, corrected towards the mean.

This is performed by applying the minimization of Eqs. 1 or 2 a second time, having added a penalty term to the enclosed integral. Both addends are normalized by their maximum and then weighted utilizing the real value 0 ≤ λ ≤ 1 to adjust the penalty strength. Hence, λ = 1 does not influence the class parameters at all, whereas λ = 0 forces all parameters to be equal to their mean $\bar\vartheta$. An equilibrium between both terms is expected around λ = 0.5.

In the following, we assume X and Y to have the unity energy $E_0$ in order to remove the dependence of the distances on the signal loudness.

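$$\vartheta = \arg\min_{\vartheta'} \left\{ \lambda\, \frac{\int_{\omega=0}^{\pi} \left| Y(\omega) - X(\tilde\omega_{\vartheta'}(\omega)) \right|^2 d\omega}{\max_{X',Y'} \int_{\omega=0}^{\pi} \left| Y'(\omega) - X'(\omega) \right|^2 d\omega} + (1 - \lambda)\, \frac{\int_{\omega=0}^{\pi} \left( \tilde\omega_{\bar\vartheta}(\omega) - \tilde\omega_{\vartheta'}(\omega) \right)^2 d\omega}{\max_{\vartheta',\vartheta''} \int_{\omega=0}^{\pi} \left( \tilde\omega_{\vartheta'}(\omega) - \tilde\omega_{\vartheta''}(\omega) \right)^2 d\omega} \right\}$$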

Fig. 1. Automatic Class Segmentation for the Word “Arizona”.

Fig. 2. Iterative Integrating Smoothing for Warping Functions with One Parameter.

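After calculating the maximal distance between arbitrary complex spectra $X'$ and $Y'$ and between arbitrary real warping functions $\tilde\omega_{\vartheta'}$ and $\tilde\omega_{\vartheta''}$, respectively, we obtain

$$\vartheta = \arg\min_{\vartheta'} \int_{\omega=0}^{\pi} \left\{ \lambda\, \frac{\left| Y(\omega) - X(\tilde\omega_{\vartheta'}(\omega)) \right|^2}{4 E_0} + (1 - \lambda)\, \frac{\left( \tilde\omega_{\bar\vartheta}(\omega) - \tilde\omega_{\vartheta'}(\omega) \right)^2}{\pi^3} \right\} d\omega.$$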
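To make the role of λ concrete, the following Python sketch evaluates the penalized criterion for one candidate parameter. It assumes that the energy $E_0$ is the integral of the squared spectrum magnitude over [0, π], so that the spectral distance is at most $4E_0$, and that the squared distance between two monotonic warping functions on [0, π] is at most $\pi^3$. Both normalizers, the midpoint-rule integration and all names are assumptions of this example. The spectral distance is passed in as a function so that the sketch stays independent of how the spectra are stored.

```python
import math


def warp_deviation(warp, theta, theta_mean, points=200):
    """Midpoint-rule integral of (warp(w, theta_mean) - warp(w, theta))^2 over [0, pi]."""
    step = math.pi / points
    total = 0.0
    for p in range(points):
        omega = (p + 0.5) * step
        total += (warp(omega, theta_mean) - warp(omega, theta)) ** 2 * step
    return total


def penalized_cost(spectral_distance, warp, theta, theta_mean, lam, energy=1.0):
    """Weighted sum of normalized spectral distance and deviation penalty.

    spectral_distance: function theta -> integral of |Y - X(warp)|^2 (user supplied).
    lam: 1.0 ignores the penalty, 0.0 forces theta towards theta_mean.
    """
    spectral_term = spectral_distance(theta) / (4.0 * energy)   # assumed maximum 4 * E_0
    penalty_term = warp_deviation(warp, theta, theta_mean) / math.pi ** 3  # assumed maximum pi^3
    return lam * spectral_term + (1.0 - lam) * penalty_term
```

Minimizing `penalized_cost` over a candidate list with `lam=1.0` reproduces the unpenalized estimate, while `lam=0.0` selects the candidate closest to the mean parameter.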

6. EXPERIMENTS

Several experiments have been performed to investigate the properties of VTLN voice conversion with respect to the warping functions discussed in this paper.

Three corpora of different languages and genders have been used:

[A] 3 English sentences of a female speaker,

[B] 10 German sentences of a male speaker (poems),

[C] 3 German sentences of a male speaker (news).

In the following, we report results for three combinations of these corpora:

• F2M: female [A] is converted to male [B],

• M2F: male [B] is converted to female [A],

• M2M: male [C] is converted to male [B].

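As error measure, we use the normalized class average distance

$$d_{cad} = \frac{1}{4 E_0 K_S} \sum_{k=1}^{K_S} \int_{\omega=0}^{\pi} \left| \bar{Y}_k(\omega) - \bar{X}_k(\tilde\omega_{\vartheta_k}(\omega)) \right|^2 d\omega.$$

Again, $\bar{X}$ and $\bar{Y}$ are spectra with unity energy $E_0$; consequently, we have $0 \le d_{cad} \le 1$ (cf. Section 5.2).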

In Table 1, we show results for warping functions with one parameter (cf. Section 3). The row labeled “no warping” displays the results for the trivial solution $\tilde\omega = \omega$, i.e. no warping at all, to help assess the absolute $d_{cad}$ values.

Table 1. Error Measure for Warping Functions with One Parameter

                     class average distance [%]
warping function     F2M     M2F     M2M
no warping           8.3     13.2    7.3
piece-wise linear    6.0     6.4     6.2
power                5.2     6.4     6.2
quadratic            5.4     7.8     6.2
bilinear             5.5     6.5     6.2

We note that the presented warping techniques do not differ essentially; nevertheless, in our experiments, the power function consistently produced the best outcomes. The most significant effect was achieved for male-to-female voice conversion, which is due to the large differences of the vocal tract. Concerning the above results, the opposite case is more complicated. This statement is also supported by our next experiments dealing with the piece-wise warping with several segments, cf. Table 2.

This table clearly demonstrates how the number of free parameters affects the warping precision. If S reaches the number of spectral lines of the compared spectra, the method passes into a variant of dynamic frequency warping with certain constraints. Nevertheless, subjective tests have shown that excessively increasing the number of free parameters results in an overfitting between source and target spectra and thereby disturbs the naturalness of the output speech.

Table 2. Error Measure for the Piece-Wise Warping Function with Several Segments

        class average distance [%]
S       F2M     M2F     M2M
1       6.7     7.6     6.3
2       6.0     6.1     5.7
4       5.4     5.0     5.1
8       4.9     4.1     4.7
16      4.5     3.4     4.0
32      4.2     2.3     3.0
64      4.1     1.4     2.3

Future experiments are to investigate the consistency of the above results on other corpora. Furthermore, the overfitting effect is to be demonstrated using an adequate objective error criterion.

7. REFERENCES

[1] T. Kamm, G. Andreou, and J. Cohen, “Vocal tract normalization in speech recognition: Compensating for systematic speaker variability,” in Proc. of the 15th Annual Speech Research Symposium, Baltimore, USA, 1995.

[2] E. Moulines and Y. Sagisaka, “Voice conversion: State of the art and perspectives,” in Speech Communication, 16(2), 1995.

[3] O. Türk, “New methods for voice conversion,” in PhD Thesis, Boğaziçi University, Istanbul, Turkey, 2003.

[4] A. Kain and M. W. Macon, “Spectral voice transformations for text-to-speech synthesis,” in Proc. of the ICASSP’98, Sydney, Australia, 1998.

[5] D. Sündermann and H. Ney, “An automatic segmentation and mapping approach for voice conversion parameter training,” in Proc. of the AST’03, Maribor, Slovenia, 2003.

[6] L. F. Uebel and P. C. Woodland, “An investigation into vocal tract length normalization,” in Proc. of the EUROSPEECH’99, Budapest, Hungary, 1999.

[7] E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in Proc. of the ICASSP’96, Atlanta, USA, 1996.

[8] M. Pitz, S. Molau, R. Schlüter, and H. Ney, “Vocal tract normalization equals linear transformation in cepstral space,” in Proc. of the EUROSPEECH’01, Aalborg, Denmark, 2001.

[9] A. Acero and R. M. Stern, “Robust speech recognition by normalization of the acoustic space,” in Proc. of the ICASSP’91, Toronto, Canada, 1991.
