Journal ArticleDOI

Transformation of formants for voice conversion using artificial neural networks

01 Feb 1995-Speech Communication (Elsevier Science Publishers B. V.)-Vol. 16, Iss: 2, pp 207-216
TL;DR: A scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker using formants and a formant vocoder is proposed.
About: This article is published in Speech Communication.The article was published on 1995-02-01 and is currently open access. It has received 207 citations till now. The article focuses on the topics: Formant & Voice analysis.

Summary (2 min read)

1. Introduction

  • Speech signal possesses mainly two kinds of information, namely the speech message part and the speaker identity part.
  • Extracting the message part of the information is the focus of research in the area of speech recognition (Rabiner and Juang, 1993) .
  • Both these involve identification of speaker characteristics, and extraction of these characteristics from the speech signal.
  • In particular, the authors address the issue of transforming the characteristics of the vocal tract system in the speech signal of the source speaker to that of the target speaker.

2. Speaker characteristics for voice conversion

  • In this section the authors first identify parameters which characterize inter-speaker variations and then develop methods for transforming them across speakers.
  • In general these factors are not known precisely.
  • These factors are the acoustic level characterization of the speaker.
  • At the segment level, the vocal tract system and the source characteristics of the speaker contribute to the speaker characteristics.
  • From a transformation point of view, it is convenient to represent the system with articulatory parameters.

3. Voice transformation studies

  • As mentioned before, the authors focus their attention on the transformation of formants and average pitch of the target speaker in voice conversion.
  • First the authors study how the formants and average pitch of two speakers differ.
  • The authors collected speech data for isolated utterances of vowels /i/, /e/, /a/, /o/ and /u/ from each of these five pairs of speakers.
  • The first three formants are extracted using a method based on minimum phase group delay functions (Murthy and Yegnanarayana, 1991) .
  • Moreover, the plots of the three scale factors (corresponding to the three formants) with respect to the various prototype vowels show a similar trend across different sets of male and female speakers.


  • A notable deviation from uniform scaling is observed: the scale factor for the second formant is high for the front vowels /i/ and /e/.
  • These observations are consistent with a similar study conducted by Fant (Fant et al., 1991) .
  • This shows that the vocal tract shape transformation between two speakers is not linear.
  • During the training phase the network is trained with a discrete set of points on the mapping function.
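The per-vowel scale factors discussed above can be computed with a short sketch. The formant values below are illustrative placeholders (roughly vowel-like numbers), not the paper's measurements:

```python
import numpy as np

# Hypothetical first-three-formant values (Hz) for the five vowels,
# for one male (source) and one female (target) speaker.
vowels = ["i", "e", "a", "o", "u"]
source_formants = np.array([  # rows: vowels; cols: F1, F2, F3
    [270, 2290, 3010],
    [390, 1990, 2550],
    [730, 1090, 2440],
    [570,  840, 2410],
    [300,  870, 2240],
])
target_formants = np.array([
    [310, 2790, 3310],
    [470, 2330, 2990],
    [850, 1220, 2810],
    [590,  920, 2710],
    [370,  950, 2670],
])

# Per-vowel, per-formant scale factor.  Uniform scaling between the two
# vocal tracts would make each column a constant; front vowels deviate.
scale = target_formants / source_formants
for v, row in zip(vowels, scale):
    print(f"/{v}/  F1 x{row[0]:.2f}  F2 x{row[1]:.2f}  F3 x{row[2]:.2f}")
```

With these placeholder numbers, the F2 factor for /i/ comes out larger than for /a/, mirroring the non-uniform-scaling observation in the text.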

Moreover, this mapping function can faithfully transform the input parameters.

  • In continuous speech the vocal tract system characteristics change rapidly across segments.
  • Hence if the transformation involves codebook mapping (Abe et al., 1988; Savic and Nam, 1991), then, for a faithful transformation, a very large codebook would be required.

The following steps are repeated for each set of formant data:

  • The formant values (F1-F3) corresponding to the source speaker (male) are given as the input.
  • The network is trained using the back propagation algorithm to capture the transformation between the formants (McClelland et al., 1986).
  • The pitch frequency for each segment is computed using the SIFT algorithm (Markel, 1972).
  • The authors can observe a direct relationship between the height of the vowel and the inherent F0 for both male and female speakers.
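The training loop above can be sketched as a small feedforward network trained with plain batch backpropagation. The formant pairs (in kHz), the hidden-layer size and the learning rate below are all illustrative assumptions, not the paper's data or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative source/target formant pairs (F1-F3 in kHz) for five vowels.
X = np.array([[0.27, 2.29, 3.01], [0.39, 1.99, 2.55], [0.73, 1.09, 2.44],
              [0.57, 0.84, 2.41], [0.30, 0.87, 2.24]])
Y = np.array([[0.31, 2.79, 3.31], [0.47, 2.33, 2.99], [0.85, 1.22, 2.81],
              [0.59, 0.92, 2.71], [0.37, 0.95, 2.67]])

# One hidden layer of tanh units, linear outputs (sizes are assumptions).
W1 = rng.normal(0.0, 0.3, (3, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.3, (8, 3)); b2 = np.zeros(3)

lr = 0.005
for _ in range(30000):
    H = np.tanh(X @ W1 + b1)          # forward pass
    P = H @ W2 + b2
    E = P - Y                         # output error
    dH = (E @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
    W2 -= lr * (H.T @ E);  b2 -= lr * E.sum(0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(0)

pred = np.tanh(X @ W1 + b1) @ W2 + b2
print("mean abs error (kHz):", float(np.abs(pred - Y).mean()))
```

After training, the network maps each source formant vector close to its target counterpart, which is all the scheme requires of the learned transformation function.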

4. Synthesis from transformed parameters

  • Fig. 6 shows the tasks involved in the synthesis phase.
  • Formant transformation is quite straightforward if the authors have a neural network which has learned the transformation.
  • The gain contour extracted from the speech of the source speaker was used directly without any modification for synthesis.
  • Transformed speech was obtained for three cases, including: (a) average pitch transformation: speech with the original system characteristics and the source characteristics modified by the average pitch; (b) formant transformation: speech with the original source characteristics and the transformed formants.
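The average pitch transformation in case (a) amounts to scaling the source pitch contour by the ratio of the target speaker's average pitch to the source speaker's average pitch. A minimal sketch, with an invented contour and target value:

```python
import numpy as np

# Hypothetical frame-wise pitch contour of the source speaker (Hz);
# zeros mark unvoiced frames, as a SIFT-style pitch tracker would output.
src_f0 = np.array([0.0, 118.0, 122.0, 125.0, 0.0, 130.0, 127.0, 0.0])
target_avg_f0 = 210.0          # assumed average pitch of the target speaker

voiced = src_f0 > 0
factor = target_avg_f0 / src_f0[voiced].mean()   # average pitch modification factor

out_f0 = src_f0.copy()
out_f0[voiced] *= factor       # shift the whole contour; its shape is retained
print(round(factor, 3), out_f0[voiced].mean())
```

Only the average is matched; the contour shape (and hence the source speaker's intonation) is retained, which is why the paper suggests pitch contour transformation as a further refinement.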

5. Summary and conclusion

  • In this paper the authors have described a general scheme for voice conversion.
  • The authors have discussed the studies performed on interspeaker variation (gender differences) in the locations of formants and inherent pitch.
  • The authors have demonstrated that a feedforward neural network trained using the backpropagation algorithm can capture a function which could transform the formants of the source speaker to that of the target speaker.
  • Pitch was modified using an average pitch modification factor.
  • The quality of the transformation can be improved by using glottal pulse shape transformation at the segmental level and pitch contour transformation at the prosodic level, in addition to the proposed formant and average pitch transformations.


Citations
More filters
Book
30 Aug 2004
TL;DR: Artificial neural networks.
Abstract: Artificial neural networks (Agricultural Information Technology and Dissemination Center).

2,254 citations

Journal ArticleDOI
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

1,741 citations


Cites background from "Transformation of formants for voic..."

  • ...Various sophisticated methods have been proposed (McAulay and Quatieri, 1986; Stylianou et al., 1995; Narendranath et al., 1995; Veldhuis and He, 1996) but their ̄exibility and resultant speech quality have been limited....

    [...]

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
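The frame-by-frame minimum-mean-square-error conversion described in this abstract reduces, for a single mixture component, to the linear conditional-mean formula E[y|x] = mu_y + S_yx S_xx^-1 (x - mu_x). A sketch of that single-Gaussian special case on synthetic data (the full GMM method mixes several such terms weighted by posterior probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic joint data: target features y are a noisy linear function of
# source features x, so the joint density is approximately Gaussian.
A_true = np.array([[1.2, 0.1], [0.0, 0.9]])
x = rng.normal(0.0, 1.0, (500, 2))
y = x @ A_true.T + 0.05 * rng.normal(0.0, 1.0, (500, 2))

z = np.hstack([x, y])                  # joint vectors [x; y]
mu = z.mean(0)
S = np.cov(z, rowvar=False)
Sxx, Sxy = S[:2, :2], S[:2, 2:]

# MMSE estimate for one Gaussian component:
#   E[y | x] = mu_y + S_yx S_xx^{-1} (x - mu_x)
def convert(x_new):
    return mu[2:] + (x_new - mu[:2]) @ np.linalg.solve(Sxx, Sxy)

x_test = np.array([1.0, -0.5])
print(convert(x_test))   # close to A_true @ x_test
```

This illustrates the "frame-based" nature the abstract criticizes: each frame is converted independently, with no constraint on spectral movement across frames.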

914 citations

Proceedings ArticleDOI
22 Sep 2008
TL;DR: INTERSPEECH 2008, the 9th Annual Conference of the International Speech Communication Association, was held September 22-26, 2008, in Brisbane, Australia.
Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations

Proceedings ArticleDOI
12 May 1998
TL;DR: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.
Abstract: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.

692 citations


Cites background from "Transformation of formants for voic..."

  • ...vector quantization with mapping codebooks [1], dynamic frequency warping [10], and neural networks [6]....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: It is rigorously established that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.

18,794 citations


"Transformation of formants for voic..." refers background in this paper

  • ...This is based on the property that a multilayered feedforward neural network using nonlinear processing elements can capture any arbitrary input-output mapping (Hornik et al., 1989)....

    [...]

Book
01 Jan 1993
TL;DR: This book presents the fundamentals of speech recognition, covering speech signal analysis, pattern comparison techniques, hidden Markov models and large vocabulary continuous speech recognition.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


"Transformation of formants for voic..." refers background in this paper

  • ...Extracting the message part of the information is the focus of research in the area of speech recognition (Rabiner and Juang, 1993)....

    [...]

  • ...The term speaker characteristics or voice is used to refer to those factors in the spoken utterance which carry information about the speaker, i.e., those factors which are used by listeners to identify the speaker of an utterance....

    [...]

Journal ArticleDOI
TL;DR: Perceptual validation of the relative importance of acoustic cues for signaling a breathy voice quality has been accomplished using a new voicing source model for synthesis of more natural male and female voices.
Abstract: Voice quality variations include a set of voicing sound source modifications ranging from laryngealized to normal to breathy phonation. Analysis of reiterant imitations of two sentences by ten female and six male talkers has shown that the potential acoustic cues to this type of voice quality variation include: (1) increases to the relative amplitude of the fundamental frequency component as open quotient increases; (2) increases to the amount of aspiration noise that replaces higher frequency harmonics as the arytenoids become more separated; (3) increases to lower formant bandwidths; and (4) introduction of extra pole zeros in the vocal-tract transfer function associated with tracheal coupling. Perceptual validation of the relative importance of these cues for signaling a breathy voice quality has been accomplished using a new voicing source model for synthesis of more natural male and female voices. The new formant synthesizer, KLSYN88, is fully documented here. Results of the perception study indicate that, contrary to previous research which emphasizes the importance of increased amplitude of the fundamental component, aspiration noise is perceptually most important. Without its presence, increases to the fundamental component may induce the sensation of nasality in a high-pitched voice. Further results of the acoustic analysis include the observations that: (1) over the course of a sentence, the acoustic manifestations of breathiness vary considerably--tending to increase for unstressed syllables, in utterance-final syllables, and at the margins of voiceless consonants; (2) on average, females are more breathy than males, but there are very large differences between subjects within each gender; (3) many utterances appear to end in a "breathy-laryngealized" type of vibration; and (4) diplophonic irregularities in the timing of glottal periods occur frequently, especially at the end of an utterance. 
Diplophonia and other deviations from perfect periodicity may be important aspects of naturalness in synthesis.

1,656 citations


"Transformation of formants for voic..." refers background in this paper

  • ...…variations and factors affecting voice quality have revealed that there are various parameters in the speech signal, both at the segmental and at the suprasegmental level, which contribute to the interspeaker variability (Klatt and Klatt, 1990; Fant et al., 1991; Childers and Lee, 1991)....

    [...]

Journal ArticleDOI
TL;DR: Application of this method for efficient transmission and storage of speech signals as well as procedures for determining other speechcharacteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
Abstract: A method of representing the speech signal by time‐varying parameters relating to the shape of the vocal tract and the glottal‐excitation function is described. The speech signal is first analyzed and then synthesized by representing it as the output of a discrete linear time‐varying filter, which is excited by a suitable combination of a quasiperiodic pulse train and white noise. The output of the linear filter at any sampling instant is a linear combination of the past output samples and the input. The optimum linear combination is obtained by minimizing the mean‐squared error between the actual values of the speech samples and their predicted values based on a fixed number of preceding samples. A 10th‐order linear predictor was found to represent the speech signal band‐limited to 5kHz with sufficient accuracy. The 10 coefficients of the predictor are shown to determine both the frequencies and bandwidths of the formants. Two parameters relating to the glottal‐excitation function and the pitch period are determined from the prediction error signal. Speech samples synthesized by this method will be demonstrated.
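The 10th-order linear predictor described in this abstract is conventionally estimated from the frame's autocorrelation sequence via the Levinson-Durbin recursion. A minimal sketch (the abstract does not specify this particular solver, so treat it as one standard realization), verified here on a synthetic 2nd-order autoregressive signal:

```python
import numpy as np

def lpc(frame, order):
    """Linear-prediction coefficients via the Levinson-Durbin recursion
    on the frame's (biased) autocorrelation sequence."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                               # prediction-error energy
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(2) signal s[n] = 1.0*s[n-1] - 0.5*s[n-2] + e[n]; the
# recovered polynomial should be close to A(z) = 1 - 1.0 z^-1 + 0.5 z^-2.
rng = np.random.default_rng(2)
e = rng.normal(0.0, 1.0, 4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 1.0 * s[n - 1] - 0.5 * s[n - 2] + e[n]
a, err = lpc(s, 2)
print(np.round(a, 2))
```

The roots of the recovered polynomial A(z) give the formant frequencies and bandwidths the abstract mentions; for a 5 kHz-band speech frame one would call `lpc(frame, 10)`.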

1,124 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The authors propose a new voice conversion technique through vector quantization and spectrum mapping which makes it possible to precisely control voice individuality.
Abstract: The authors propose a new voice conversion technique through vector quantization and spectrum mapping. The basic idea of this technique is to make mapping codebooks which represent the correspondence between different speakers' codebooks. The mapping codebooks for spectrum parameters, power values and pitch frequencies are separately generated using training utterances. This technique makes it possible to precisely control voice individuality. To evaluate the performance of this technique, hearing tests are carried out on two kinds of voice conversions. One is a conversion between male and female speakers, the other is a conversion between male speakers. In the male-to-female conversion experiment, all converted utterances are judged as female, and in the male-to-male conversion, 65% of them are identified as the target speaker. >
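The mapping-codebook idea in this abstract can be sketched in a few lines: quantize each source frame to its nearest source codeword, then emit the target codeword that training aligned with it. The codebooks below are tiny 2-D placeholders, not trained VQ codebooks:

```python
import numpy as np

# Toy codebooks: rows are spectral codewords (2-D here for brevity).
# In the VQ scheme these come from training on each speaker's data;
# the numbers below are purely illustrative.
src_codebook = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# mapping_codebook[i] is the target-speaker codeword that training
# put in correspondence with source codeword i.
mapping_codebook = np.array([[1.4, 0.2], [0.1, 1.5], [1.3, 1.6]])

def convert_frame(x):
    # Quantize the source frame to its nearest source codeword ...
    i = np.argmin(((src_codebook - x) ** 2).sum(axis=1))
    # ... then replace it by the corresponding target codeword.
    return mapping_codebook[i]

frame = np.array([0.9, 0.1])         # closest to source codeword 0
print(convert_frame(frame))          # -> mapping_codebook[0]
```

Because the output is restricted to a discrete set of codewords, rapidly varying continuous speech needs a large codebook for a faithful transformation, which is the limitation the main paper's neural-network mapping is meant to avoid.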

554 citations


"Transformation of formants for voic..." refers background or methods in this paper

  • ...Abe et al. (1988) have developed a technique for voice conversion through vector quantization and spectral mapping....

    [...]

  • ...Hence if the transformation involves codebook mapping (Abe et al., 1988; Savic and Nam, 1991), then, for a faithful transformation... For each set of formant data: the formant values (F1-F3) corresponding to the source speaker (male) are given as the input....

    [...]

  • ...Since articulatory parameters are difficult to extract from the speech signal, as a compromise, formants are proposed for representing the vocal tract system information....

    [...]

Frequently Asked Questions (8)
Q1. What contributions have the authors mentioned in the paper "Transformation of formants for voice conversion using artificial neural networks" ?

In this paper the authors propose a scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker. The scheme consists of a formant analysis phase, followed by a learning phase in which the implicit formant transformation is captured by a neural network. 

In this paper the authors train a neural network to learn a transformation function which can transform the speaker dependent parameters extracted from the speech of the source speaker to match with that of the target speaker. 

But in continuous speech, since the vocal tract changes its shape continuously, the extracted formants will have many transitions. 

Fant’s model (Fant, 1986) was used to excite the formant synthesizer for voiced frames and random noise for the case of unvoiced frames. 

The first three formants from these two corresponding steady voiced regions are used as a pair of input and output formant vectors to a neural network. 

Prosodic modifications were incorporated in the excitation signal using the PSOLA (Pitch Synchronous Overlap-Add) technique, and speech was synthesized using the transformed spectral parameters. 

In the present study suprasegmental features of the source speaker are retained, while using the transformed vocal tract parameters for synthesis. 

They are (1) identification of speaker characteristics or acquisition of speaker dependent knowledge in the analysis phase and (2) incorporation of the speaker specific knowledge while synthesis during the transformation phase. 
