Journal ArticleDOI

Transformation of formants for voice conversion using artificial neural networks

01 Feb 1995-Speech Communication (Elsevier Science Publishers B. V.)-Vol. 16, Iss: 2, pp 207-216
TL;DR: A scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker using formants and a formant vocoder is proposed.
About: This article is published in Speech Communication.The article was published on 1995-02-01 and is currently open access. It has received 207 citations till now. The article focuses on the topics: Formant & Voice analysis.

Summary (2 min read)

1. Introduction

  • Speech signal possesses mainly two kinds of information, namely the speech message part and the speaker identity part.
  • Extracting the message part of the information is the focus of research in the area of speech recognition (Rabiner and Juang, 1993) .
  • Both these involve identification of speaker characteristics, and extraction of these characteristics from the speech signal.
  • In particular, the authors address the issue of transforming the characteristics of the vocal tract system in the speech signal of the source speaker to that of the target speaker.

2. Speaker characteristics for voice conversion

  • In this section the authors first identify parameters which characterize inter-speaker variations and then develop methods for transforming them across speakers.
  • In general these factors are not known precisely.
  • These factors are the acoustic level characterization of the speaker.
  • At the segment level, the vocal tract system and the source characteristics of the speaker contribute to the speaker characteristics.
  • From a transformation point of view, it is convenient to represent the system with articulatory parameters.

3. Voice transformation studies

  • As mentioned before, the authors focus their attention on the transformation of formants and average pitch of the target speaker in voice conversion.
  • First the authors study how the formants and average pitch of two speakers differ.
  • The authors collected speech data for isolated utterances of vowels /i/, /e/, /a/, /o/ and /u/ from each of these five pairs of speakers.
  • The first three formants are extracted using a method based on minimum phase group delay functions (Murthy and Yegnanarayana, 1991) .
  • Moreover, the plots of the three scale factors (corresponding to the three formants) with respect to the various prototype vowels show a similar trend across different sets of male and female speakers.


  • A notable deviation from uniform scaling is observed: the scale factor for the second formant is high for the front vowels /i/ and /e/.
  • These observations are consistent with a similar study conducted by Fant (Fant et al., 1991) .
  • This shows that the vocal tract shape transformation between two speakers is not linear.
  • During the training phase the network is trained with a discrete set of points on the mapping function.
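The per-vowel scale factors discussed above can be computed with a short sketch. The formant values below are illustrative placeholders (roughly vowel-like numbers), not the paper's measurements:

```python
import numpy as np

# Hypothetical first-three-formant values (Hz) for the five vowels,
# for one male (source) and one female (target) speaker.
vowels = ["i", "e", "a", "o", "u"]
source_formants = np.array([  # rows: vowels; cols: F1, F2, F3
    [270, 2290, 3010],
    [390, 1990, 2550],
    [730, 1090, 2440],
    [570,  840, 2410],
    [300,  870, 2240],
])
target_formants = np.array([
    [310, 2790, 3310],
    [470, 2330, 2990],
    [850, 1220, 2810],
    [590,  920, 2710],
    [370,  950, 2670],
])

# Per-vowel, per-formant scale factor.  Uniform scaling between the two
# vocal tracts would make each column a constant; front vowels deviate.
scale = target_formants / source_formants
for v, row in zip(vowels, scale):
    print(f"/{v}/  F1 x{row[0]:.2f}  F2 x{row[1]:.2f}  F3 x{row[2]:.2f}")
```

With these placeholder numbers, the F2 factor for /i/ comes out larger than for /a/, mirroring the non-uniform-scaling observation in the text.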

Moreover, this mapping function can faithfully transform the input parameters.

  • In continuous speech the vocal tract system characteristics change rapidly across segments.
  • Hence if the transformation involves codebook mapping (Abe et al., 1988; Savic and Nam, 1991), then, for a faithful transformation, a very large codebook would be required.

The following steps are repeated for each set of formant data:

  • The formant values (F1-F3) corresponding to the source speaker (male) are given as the input.
  • The network is trained using the back propagation algorithm to capture the transformation between the formants (McClelland et al., 1986).
  • The pitch frequency for each segment is computed using the SIFT algorithm (Markel, 1972).
  • The authors can observe a direct relationship between the height of the vowel and the inherent F0 for both male and female speakers.
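The training loop above can be sketched as a small feedforward network trained with plain batch backpropagation. The formant pairs (in kHz), the hidden-layer size and the learning rate below are all illustrative assumptions, not the paper's data or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative source/target formant pairs (F1-F3 in kHz) for five vowels.
X = np.array([[0.27, 2.29, 3.01], [0.39, 1.99, 2.55], [0.73, 1.09, 2.44],
              [0.57, 0.84, 2.41], [0.30, 0.87, 2.24]])
Y = np.array([[0.31, 2.79, 3.31], [0.47, 2.33, 2.99], [0.85, 1.22, 2.81],
              [0.59, 0.92, 2.71], [0.37, 0.95, 2.67]])

# One hidden layer of tanh units, linear outputs (sizes are assumptions).
W1 = rng.normal(0.0, 0.3, (3, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.3, (8, 3)); b2 = np.zeros(3)

lr = 0.005
for _ in range(30000):
    H = np.tanh(X @ W1 + b1)          # forward pass
    P = H @ W2 + b2
    E = P - Y                         # output error
    dH = (E @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
    W2 -= lr * (H.T @ E);  b2 -= lr * E.sum(0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(0)

pred = np.tanh(X @ W1 + b1) @ W2 + b2
print("mean abs error (kHz):", float(np.abs(pred - Y).mean()))
```

After training, the network maps each source formant vector close to its target counterpart, which is all the scheme requires of the learned transformation function.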

4. Synthesis from transformed parameters

  • Fig. 6 shows the tasks involved in the synthesis phase.
  • Formant transformation is quite straightforward if the authors have a neural network which has learned the transformation.
  • The gain contour extracted from the speech of the source speaker was used directly without any modification for synthesis.
  • Transformed speech was obtained for three cases, including: (a) average pitch transformation: speech with the original system characteristics and the source characteristics modified by the average pitch; (b) formant transformation: speech with the original source characteristics and the transformed formants.
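The average pitch transformation in case (a) amounts to scaling the source pitch contour by the ratio of the target speaker's average pitch to the source speaker's average pitch. A minimal sketch, with an invented contour and target value:

```python
import numpy as np

# Hypothetical frame-wise pitch contour of the source speaker (Hz);
# zeros mark unvoiced frames, as a SIFT-style pitch tracker would output.
src_f0 = np.array([0.0, 118.0, 122.0, 125.0, 0.0, 130.0, 127.0, 0.0])
target_avg_f0 = 210.0          # assumed average pitch of the target speaker

voiced = src_f0 > 0
factor = target_avg_f0 / src_f0[voiced].mean()   # average pitch modification factor

out_f0 = src_f0.copy()
out_f0[voiced] *= factor       # shift the whole contour; its shape is retained
print(round(factor, 3), out_f0[voiced].mean())
```

Only the average is matched; the contour shape (and hence the source speaker's intonation) is retained, which is why the paper suggests pitch contour transformation as a further refinement.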

5. Summary and conclusion

  • In this paper the authors have described a general scheme for voice conversion.
  • The authors have discussed the studies performed on interspeaker variation (gender differences) in the locations of formants and inherent pitch.
  • The authors have demonstrated that a feedforward neural network trained using the backpropagation algorithm can capture a function which could transform the formants of the source speaker to that of the target speaker.
  • Pitch was modified using an average pitch modification factor.
  • The quality of the transformation can be improved by using glottal pulse shape transformation at the segmental level and pitch contour transformation at the prosodic level, in addition to the proposed formant and average pitch transformations.


Citations
More filters
Book
30 Aug 2004
TL;DR: Artificial neural networks.
Abstract: Artificial neural networks (Agricultural Information Technology and Dissemination Center).

2,254 citations

Journal ArticleDOI
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

1,741 citations


Cites background from "Transformation of formants for voic..."

  • ...Various sophisticated methods have been proposed (McAulay and Quatieri, 1986; Stylianou et al., 1995; Narendranath et al., 1995; Veldhuis and He, 1996) but their ̄exibility and resultant speech quality have been limited....

    [...]

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
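The frame-by-frame minimum-mean-square-error conversion described in this abstract reduces, for a single mixture component, to the linear conditional-mean formula E[y|x] = mu_y + S_yx S_xx^-1 (x - mu_x). A sketch of that single-Gaussian special case on synthetic data (the full GMM method mixes several such terms weighted by posterior probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic joint data: target features y are a noisy linear function of
# source features x, so the joint density is approximately Gaussian.
A_true = np.array([[1.2, 0.1], [0.0, 0.9]])
x = rng.normal(0.0, 1.0, (500, 2))
y = x @ A_true.T + 0.05 * rng.normal(0.0, 1.0, (500, 2))

z = np.hstack([x, y])                  # joint vectors [x; y]
mu = z.mean(0)
S = np.cov(z, rowvar=False)
Sxx, Sxy = S[:2, :2], S[:2, 2:]

# MMSE estimate for one Gaussian component:
#   E[y | x] = mu_y + S_yx S_xx^{-1} (x - mu_x)
def convert(x_new):
    return mu[2:] + (x_new - mu[:2]) @ np.linalg.solve(Sxx, Sxy)

x_test = np.array([1.0, -0.5])
print(convert(x_test))   # close to A_true @ x_test
```

This illustrates the "frame-based" nature the abstract criticizes: each frame is converted independently, with no constraint on spectral movement across frames.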

914 citations

Proceedings ArticleDOI
22 Sep 2008
TL;DR: INTERSPEECH 2008, the 9th Annual Conference of the International Speech Communication Association, was held September 22-26, 2008, in Brisbane, Australia.
Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations

Proceedings ArticleDOI
12 May 1998
TL;DR: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.
Abstract: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.

692 citations


Cites background from "Transformation of formants for voic..."

  • ...vector quantization with mapping codebooks [1], dynamic frequency warping [10], and neural networks [6]....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: It is rigorously established that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.

18,794 citations


"Transformation of formants for voic..." refers background in this paper

  • ...This is based on the property that a multilayered feedforward neural network using nonlinear processing elements can capture any arbitrary input-output mapping (Hornik et al., 1989)....

    [...]

Book
01 Jan 1993
TL;DR: This book presents the fundamentals of speech recognition, covering speech signal analysis, pattern comparison techniques, hidden Markov models and large vocabulary continuous speech recognition.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


"Transformation of formants for voic..." refers background in this paper

  • ...Extracting the message part of the information is the focus of research in the area of speech recognition (Rabiner and Juang, 1993)....

    [...]

  • ...The term speaker characteristics or voice is used to refer to those factors in the spoken utterance which carry information about the speaker, i.e., those factors which are used by listeners to identify the speaker of an utterance....

    [...]

Journal ArticleDOI
TL;DR: Perceptual validation of the relative importance of acoustic cues for signaling a breathy voice quality has been accomplished using a new voicing source model for synthesis of more natural male and female voices.
Abstract: Voice quality variations include a set of voicing sound source modifications ranging from laryngealized to normal to breathy phonation. Analysis of reiterant imitations of two sentences by ten female and six male talkers has shown that the potential acoustic cues to this type of voice quality variation include: (1) increases to the relative amplitude of the fundamental frequency component as open quotient increases; (2) increases to the amount of aspiration noise that replaces higher frequency harmonics as the arytenoids become more separated; (3) increases to lower formant bandwidths; and (4) introduction of extra pole zeros in the vocal-tract transfer function associated with tracheal coupling. Perceptual validation of the relative importance of these cues for signaling a breathy voice quality has been accomplished using a new voicing source model for synthesis of more natural male and female voices. The new formant synthesizer, KLSYN88, is fully documented here. Results of the perception study indicate that, contrary to previous research which emphasizes the importance of increased amplitude of the fundamental component, aspiration noise is perceptually most important. Without its presence, increases to the fundamental component may induce the sensation of nasality in a high-pitched voice. Further results of the acoustic analysis include the observations that: (1) over the course of a sentence, the acoustic manifestations of breathiness vary considerably--tending to increase for unstressed syllables, in utterance-final syllables, and at the margins of voiceless consonants; (2) on average, females are more breathy than males, but there are very large differences between subjects within each gender; (3) many utterances appear to end in a "breathy-laryngealized" type of vibration; and (4) diplophonic irregularities in the timing of glottal periods occur frequently, especially at the end of an utterance. 
Diplophonia and other deviations from perfect periodicity may be important aspects of naturalness in synthesis.

1,656 citations


"Transformation of formants for voic..." refers background in this paper

  • ...…variations and factors affecting voice quality have revealed that there are various parameters in the speech signal, both at the segmental and at the suprasegmental level, which contribute to the interspeaker variability (Klatt and Klatt, 1990; Fant et al., 1991; Childers and Lee, 1991)....

    [...]

Journal ArticleDOI
TL;DR: Application of this method for efficient transmission and storage of speech signals as well as procedures for determining other speechcharacteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
Abstract: A method of representing the speech signal by time‐varying parameters relating to the shape of the vocal tract and the glottal‐excitation function is described. The speech signal is first analyzed and then synthesized by representing it as the output of a discrete linear time‐varying filter, which is excited by a suitable combination of a quasiperiodic pulse train and white noise. The output of the linear filter at any sampling instant is a linear combination of the past output samples and the input. The optimum linear combination is obtained by minimizing the mean‐squared error between the actual values of the speech samples and their predicted values based on a fixed number of preceding samples. A 10th‐order linear predictor was found to represent the speech signal band‐limited to 5kHz with sufficient accuracy. The 10 coefficients of the predictor are shown to determine both the frequencies and bandwidths of the formants. Two parameters relating to the glottal‐excitation function and the pitch period are determined from the prediction error signal. Speech samples synthesized by this method will be demonstrated.
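The 10th-order linear predictor described in this abstract is conventionally estimated from the frame's autocorrelation sequence via the Levinson-Durbin recursion. A minimal sketch (the abstract does not specify this particular solver, so treat it as one standard realization), verified here on a synthetic 2nd-order autoregressive signal:

```python
import numpy as np

def lpc(frame, order):
    """Linear-prediction coefficients via the Levinson-Durbin recursion
    on the frame's (biased) autocorrelation sequence."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                               # prediction-error energy
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(2) signal s[n] = 1.0*s[n-1] - 0.5*s[n-2] + e[n]; the
# recovered polynomial should be close to A(z) = 1 - 1.0 z^-1 + 0.5 z^-2.
rng = np.random.default_rng(2)
e = rng.normal(0.0, 1.0, 4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 1.0 * s[n - 1] - 0.5 * s[n - 2] + e[n]
a, err = lpc(s, 2)
print(np.round(a, 2))
```

The roots of the recovered polynomial A(z) give the formant frequencies and bandwidths the abstract mentions; for a 5 kHz-band speech frame one would call `lpc(frame, 10)`.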

1,124 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The authors propose a new voice conversion technique through vector quantization and spectrum mapping which makes it possible to precisely control voice individuality.
Abstract: The authors propose a new voice conversion technique through vector quantization and spectrum mapping. The basic idea of this technique is to make mapping codebooks which represent the correspondence between different speakers' codebooks. The mapping codebooks for spectrum parameters, power values and pitch frequencies are separately generated using training utterances. This technique makes it possible to precisely control voice individuality. To evaluate the performance of this technique, hearing tests are carried out on two kinds of voice conversions. One is a conversion between male and female speakers, the other is a conversion between male speakers. In the male-to-female conversion experiment, all converted utterances are judged as female, and in the male-to-male conversion, 65% of them are identified as the target speaker. >
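The mapping-codebook idea in this abstract can be sketched in a few lines: quantize each source frame to its nearest source codeword, then emit the target codeword that training aligned with it. The codebooks below are tiny 2-D placeholders, not trained VQ codebooks:

```python
import numpy as np

# Toy codebooks: rows are spectral codewords (2-D here for brevity).
# In the VQ scheme these come from training on each speaker's data;
# the numbers below are purely illustrative.
src_codebook = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# mapping_codebook[i] is the target-speaker codeword that training
# put in correspondence with source codeword i.
mapping_codebook = np.array([[1.4, 0.2], [0.1, 1.5], [1.3, 1.6]])

def convert_frame(x):
    # Quantize the source frame to its nearest source codeword ...
    i = np.argmin(((src_codebook - x) ** 2).sum(axis=1))
    # ... then replace it by the corresponding target codeword.
    return mapping_codebook[i]

frame = np.array([0.9, 0.1])         # closest to source codeword 0
print(convert_frame(frame))          # -> mapping_codebook[0]
```

Because the output is restricted to a discrete set of codewords, rapidly varying continuous speech needs a large codebook for a faithful transformation, which is the limitation the main paper's neural-network mapping is meant to avoid.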

554 citations


"Transformation of formants for voic..." refers background or methods in this paper

  • ...Abe et al. (1988) have developed a technique for voice conversion through vector quantization and spectral mapping....

    [...]

  • ...Hence if the transformation involves codebook mapping (Abe et al., 1988; Savic and Nam, 1991), then, for a faithful transformation... For each set of formant data: the formant values (F1-F3) corresponding to the source speaker (male) are given as the input....

    [...]

  • ...Since articulatory parameters are difficult to extract from the speech signal, as a compromise, formants are proposed for representing the vocal tract system information....

    [...]

Frequently Asked Questions (8)
Q1. What contributions have the authors mentioned in the paper "Transformation of formants for voice conversion using artificial neural networks" ?

In this paper the authors propose a scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker. The scheme consists of a formant analysis phase, followed by a learning phase in which the implicit formant transformation is captured by a neural network. 

In this paper the authors train a neural network to learn a transformation function which can transform the speaker dependent parameters extracted from the speech of the source speaker to match with that of the target speaker. 

But in continuous speech, since the vocal tract changes its shape continuously, the extracted formants will have many transitions. 

Fant’s model (Fant, 1986) was used to excite the formant synthesizer for voiced frames and random noise for the case of unvoiced frames. 

The first three formants from these two corresponding steady voiced regions are used as a pair of input and output formant vectors to a neural network. 

Prosodic modifications were incorporated in the excitation signal using the PSOLA (Pitch Synchronous Overlap-Add) technique, and speech was synthesized using the transformed spectral parameters. 

In the present study suprasegmental features of the source speaker are retained, while using the transformed vocal tract parameters for synthesis. 

They are (1) identification of speaker characteristics or acquisition of speaker dependent knowledge in the analysis phase and (2) incorporation of the speaker specific knowledge while synthesis during the transformation phase. 
