Voice transformation using PSOLA technique

doi:10.1016/0167-6393(92)90012-V

Home
/
Papers
/
Voice transformation using PSOLA technique

Journal Article•DOI•

Voice transformation using PSOLA technique

H. Valbret¹, Eric Moulines¹, J. P. Tubach¹•Institutions (1)

Télécom ParisTech¹

01 Jun 1992-Vol. 11, Iss: 2, pp 175-187

TL;DR: A new system for voice conversion is described that combines a PSOLA (Pitch Synchronous Overlap and Add)-derived synthesizer and a module for spectral transformation, which produces a satisfyingly natural “transformed” voice.

read less

Abstract: In this contribution, a new system for voice conversion is described. The proposed architecture combines a PSOLA (Pitch Synchronous Overlap and Add)-derived synthesizer and a module for spectral transformation. The synthesizer based on the classical source-filter decomposition allows prosodic and spectral transformations to be performed independently. Prosodic modifications are applied on the excitation signal using the TD-PSOLA scheme; converted speech is then synthesized using the transformed spectral parameters. Two different approaches to derive spectral transformations, borrowed from the speech-recognition domain, are compared: Linear Multivariate Regression (LMR) and Dynamic Frequency Warping (DFW). Vector-quantization is carried out as a preliminary stage to render the spectral transformations dependent of the acoustical realization of sounds. A formal listening test shows that the synthesizer produces a satisfyingly natural “transformed” voice. LMR proves yet to allow a slightly better conversion than DFW. Still there is room for improvement in the spectral transformation stage.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Continuous probabilistic transform for voice conversion

[...]

Yannis Stylianou¹, Olivier Cappé², Eric Moulines²•Institutions (2)

Bell Labs¹, École Normale Supérieure²

01 Mar 1998-IEEE Transactions on Speech and Audio Processing

TL;DR: The design of a new methodology for representing the relationship between two sets of spectral envelopes and the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previous proposed conversion methods.

...read moreread less

Abstract: Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic+noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previous proposed conversion methods.

...read moreread less

1,109 citations

Cites background from "Voice transformation using PSOLA te..."

...Other recent works suggest that a possible way to improve the quality of the converted speech consists of modifying only some specific aspects of the spectral envelope, such as the location of its formants [28], [29], [48]....
[...]

Journal Article•DOI•

Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory

[...]

Tomoki Toda¹, Alan W. Black², Keiichi Tokuda³•Institutions (3)

Nara Institute of Science and Technology¹, Carnegie Mellon University², Nagoya Institute of Technology³

01 Nov 2007-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.

...read moreread less

Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.

...read moreread less

914 citations

Proceedings Article•DOI•

Low-delay voice conversion based on maximum likelihood estimation of spectral parameter trajectory.

[...]

Takashi Muramatsu¹, Yamato Ohtani¹, Tomoki Toda¹, Hiroshi Saruwatari¹, Kiyohiro Shikano¹ - Show less +1 more•Institutions (1)

Nara Institute of Science and Technology¹

22 Sep 2008

TL;DR: The 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia as discussed by the authors, was held at the University of Queensland, Queensland, Australia.

...read moreread less

Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

...read moreread less

796 citations

Proceedings Article•DOI•

Spectral voice conversion for text-to-speech synthesis

[...]

Alexander Kain¹, Michael W. Macon¹•Institutions (1)

Oregon Health & Science University¹

12 May 1998

TL;DR: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.

...read moreread less

Abstract: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speakers average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.

...read moreread less

692 citations

Cites background from "Voice transformation using PSOLA te..."

...vector quantization with mapping codebooks [1], dynamic frequency warping [10], and neural networks [6]....
[...]

Journal Article•DOI•

Non-parametric techniques for pitch-scale and time-scale modification of speech

[...]

Eric Moulines¹, Jean Laroche¹•Institutions (1)

Télécom ParisTech¹

01 Feb 1995-Speech Communication

TL;DR: This contribution reviews frequency-domain algorithms (phase-vocoder) and time- domain algorithms (Time-Domain Pitch-Synchronous Overlap/Add and the like) in the same framework and presents more recent variations of these schemes.

...read moreread less

363 citations

Cites background from "Voice transformation using PSOLA te..."

...This point is particularly relevant in the context of speaker-characteristics modifications (Valbret et al., 1992)....
[...]
...Details are given in (Moulines and Charpentier, 1990; Valbret et al., 1992)....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

An Algorithm for Vector Quantizer Design

[...]

Y. Linde¹, A. Buzo², Robert M. Gray³•Institutions (3)

Codex Corporation¹, National Autonomous University of Mexico², Stanford University³

01 Jan 1980-IEEE Transactions on Communications

TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.

...read moreread less

Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.

...read moreread less

7,935 citations

Book•

Linear Prediction of Speech

[...]

John E. Markel, A. Gray

02 Dec 2011

TL;DR: Speech Analysis and Synthesis Models: Basic Physical Principles, Speech Synthesis Structures, and Considerations in Choice of Analysis.

...read moreread less

Abstract: 1. Introduction.- 1.1 Basic Physical Principles.- 1.2 Acoustical Waveform Examples.- 1.3 Speech Analysis and Synthesis Models.- 1.4 The Linear Prediction Model.- 1.5 Organization of Book.- 2. Formulations.- 2.1 Historical Perspective.- 2.2 Maximum Likelihood.- 2.3 Minimum Variance.- 2.4 Prony's Method.- 2.5 Correlation Matching.- 2.6 PARCOR (Partial Correlation).- 2.6.1 Inner Products and an Orthogonality Principle.- 2.6.2 The PARCOR Lattice Structure.- 3. Solutions and Properties.- 3.1 Introduction.- 3.2 Vector Spaces and Inner Products.- 3.2.1 Filter or Polynomial Norms.- 3.2.2 Properties of Inner Products.- 3.2.3 Orthogonality Relations.- 3.3 Solution Algorithms.- 3.3.1 Correlation Matrix.- 3.3.2 Initialization.- 3.3.3 Gram-Schmidt Orthogonalization.- 3.3.4 Levinson Recursion.- 3.3.5 Updating Am(z).- 3.3.6 A Test Example.- 3.4 Matrix Forms.- 4. Acoustic Tube Modeling.- 4.1 Introduction.- 4.2 Acoustic Tube Derivation.- 4.2.1 Single Section Derivation.- 4.2.2 Continuity Conditions.- 4.2.3 Boundary Conditions.- 4.3 Relationship between Acoustic Tube and Linear Prediction.- 4.4 An Algorithm, Examples, and Evaluation.- 4.4.1 An Algorithm.- 4.4.2 Examples.- 4.4.3 Evaluation of the Procedure.- 4.5 Estimation of Lip Impedance.- 4.5.1 Lip Impedance Derivation.- 4.6 Further Topics.- 4.6.1 Losses in the Acoustic Tube Model.- 4.6.2 Acoustic Tube Stability.- 5. Speech Synthesis Structures.- 5.1 Introduction.- 5.2 Stability.- 5.2.1 Step-up Procedure.- 5.2.2 Step-down Procedure.- 5.2.3 Polynomial Properties.- 5.2.4 A Bound on |Fm(z)|.- 5.2.5 Necessary and Sufficient Stability Conditions.- 5.2.6 Application of Results.- 5.3 Recursive Parameter Evaluation.- 5.3.1 Inner Product Properties.- 5.3.2 Equation Summary with Program.- 5.4 A General Synthesis Structure.- 5.5 Specific Speech Synthesis Structures.- 5.5.1 The Direct Form.- 5.5.2 Two-Multiplier Lattice Model.- 5.5.3 Kelly-Lochbaum Model.- 5.5.4 One-Multiplier Models.- 5.5.5 Normalized Filter Model.- 5.5.6 A Test Example.- 6. Spectral Analysis.- 6.1 Introduction.- 6.2 Spectral Properties.- 6.2.1 Zero Mean All-Pole Model.- 6.2.2 Gain Factor for Spectral Matching.- 6.2.3 Limiting Spectral Match.- 6.2.4 Non-uniform Spectral Weighting.- 6.2.5 Minimax Spectral Matching.- 6.3 A Spectral Flatness Model.- 6.3.1 A Spectral Flatness Measure.- 6.3.2 Spectral Flatness Transformations.- 6.3.3 Numerical Evaluation.- 6.3.4 Experimental Results.- 6.3.5 Driving Function Models.- 6.4 Selective Linear Prediction.- 6.4.1 Selective Linear Prediction (SLP) Algorithm.- 6.4.2 A Selective Linear Prediction Program.- 6.4.3 Computational Considerations.- 6.5 Considerations in Choice of Analysis Conditions.- 6.5.1 Choice of Method.- 6.5.2 Sampling Rates.- 6.5.3 Order of Filter.- 6.5.4 Choice of Analysis Interval.- 6.5.5 Windowing.- 6.5.6 Pre-emphasis.- 6.6 Spectral Evaluation Techniques.- 6.7 Pole Enhancement.- 7. Automatic Formant Trajectory Estimation.- 7.1 Introduction.- 7.2 Formant Trajectory Estimation Procedure.- 7.2.1 Introduction.- 7.2.2 Raw Data from A(z).- 7.2.3 Examples of Raw Data.- 7.3 Comparison of Raw Data from Linear Prediction and Cepstral Smoothing.- 7.4 Algorithm 1.- 7.5 Algorithm 2.- 7.5.1 Definition of Anchor Points.- 7.5.2 Processing of Each Voiced Segment.- 7.5.3 Final Smoothing.- 7.5.4 Results and Discussion.- 7.6 Formant Estimation Accuracy.- 7.6.1 An Example of Synthetic Speech Analysis.- 7.6.2 An Example of Real Speech Analysis.- 7.6.3 Influence of Voice Periodicity.- 8. Fundamental Frequency Estimation.- 8.1 Introduction.- 8.2 Preprocessing by Spectral Flattening.- 8.2.1 Analysis of Voiced Speech with Spectral Regularity.- 8.2.2 Analysis of Voiced Speech with Spectral Irregularities.- 8.2.3 The STREAK Algorithm.- 8.3 Correlation Techniques.- 8.3.1 Autocorrelation Analysis.- 8.3.2 Modified Autocorrelation Analysis.- 8.3.3 Filtered Error Signal Autocorrelation Analysis.- 8.3.4 Practical Considerations.- 8.3.5 The SIFT Algorithm.- 9. Computational Considerations in Analysis.- 9.1 Introduction.- 9.2 Ill-Conditioning.- 9.2.1 A Measure of Ill-Conditioning.- 9.2.2 Pre-emphasis of Speech Data.- 9.2.3 Prefiltering before Sampling.- 9.3 Implementing Linear Prediction Analysis.- 9.3.1 Autocorrelation Method.- 9.3.2 Covariance Method.- 9.3.3 Computational Comparison.- 9.4 Finite Word Length Considerations.- 9.4.1 Finite Word Length Coefficient Computation.- 9.4.2 Finite Word Length Solution of Equations.- 9.4.3 Overall Finite Word Length Implementation.- 10. Vocoders.- 10.1 Introduction.- 10.2 Techniques.- 10.2.1 Coefficient Transformations.- 10.2.2 Encoding and Decoding.- 10.2.3 Variable Frame Rate Transmission.- 10.2.4 Excitation and Synthesis Gain Matching.- 10.2.5 A Linear Prediction Synthesizer Program.- 10.3 Low Bit Rate Pitch Excited Vocoders.- 10.3.1 Maximum Likelihood and PARCOR Vocoders.- 10.3.2 Autocorrelation Method Vocoders.- 10.3.3 Covariance Method Vocoders.- 10.4 Base-Band Excited Vocoders.- 11. Further Topics.- 11.1 Speaker Identification and Verification.- 11.2 Isolated Word Recognition.- 11.3 Acoustical Detection of Laryngeal Pathology.- 11.4 Pole-Zero Estimation.- 11.5 Summary and Future Directions.- References.

...read moreread less

1,945 citations

Journal Article•DOI•

Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones

[...]

Eric Moulines¹, Francis Charpentier¹•Institutions (1)

Centre national d'études des télécommunications¹

01 Dec 1990-Speech Communication

TL;DR: In a common framework several algorithms that have been proposed recently, in order to improve the voice quality of a text-to-speech synthesis based on acoustical units concatenation based on pitch-synchronous overlap-add approach are reviewed.

...read moreread less

1,438 citations

Proceedings Article•DOI•

Voice conversion through vector quantization

[...]

Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, Hisao Kuwabara

11 Apr 1988

TL;DR: The authors propose a new voice conversion technique through vector quantization and spectrum mapping which makes it possible to precisely control voice individuality.

...read moreread less

Abstract: The authors propose a new voice conversion technique through vector quantization and spectrum mapping. The basic idea of this technique is to make mapping codebooks which represent the correspondence between different speakers' codebooks. The mapping codebooks for spectrum parameters, power values and pitch frequencies are separately generated using training utterances. This technique makes it possible to precisely control voice individuality. To evaluate the performance of this technique, hearing tests are carried out on two kinds of voice conversions. One is a conversion between male and female speakers, the other is a conversion between male speakers. In the male-to-female conversion experiment, all converted utterances are judged as female, and in the male-to-male conversion, 65% of them are identified as the target speaker. >

...read moreread less

554 citations

Journal Article•

Speaker recognition

[...]

Douglas O'Shaughnessy¹•Institutions (1)

Institut national de la recherche scientifique¹

01 Oct 1986-IEEE Assp Magazine

305 citations