Vocal tract normalization equals linear transformation in cepstral space

doi:10.1109/TSA.2005.848881

Home
/
Papers
/
Vocal tract normalization equals linear transformation in cepstral space

Journal Article•DOI•

Vocal tract normalization equals linear transformation in cepstral space

Michael Pitz¹, Hermann Ney¹•Institutions (1)

RWTH Aachen University¹

15 Aug 2005-IEEE Transactions on Speech and Audio Processing (IEEE)-Vol. 13, Iss: 5, pp 930-944

TL;DR: In this paper, the Jacobian determinant of the transformation matrix is computed analytically for three typical warping functions and it is shown that the matrices are diagonal dominant and thus can be approximated by quindiagonal matrices.

read less

Abstract: Vocal tract normalization (VTN) is a widely used speaker normalization technique which reduces the effect of different lengths of the human vocal tract and results in an improved recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain, which so far have been considered as independent approaches of speaker normalization. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalization for automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results that improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions the transformation matrix is calculated analytically and we show that the matrices are diagonal dominant and thus can be approximated by quindiagonal matrices.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

An overview of voice conversion systems

[...]

Seyed Hamidreza Mohammadi¹, Alexander Kain¹•Institutions (1)

Oregon Health & Science University¹

01 Apr 2017-Speech Communication

TL;DR: An overview of real-world applications of VC systems, extensively study existing systems proposed in the literature, and discuss remaining challenges are provided.

...read moreread less

232 citations

Proceedings Article•DOI•

Text-Independent Voice Conversion Based on Unit Selection

[...]

D. Sundermann¹, Harald Höge¹, Antonio Bonafonte², Hermann Ney, Alan W. Black³, Shrikanth S. Narayanan² - Show less +2 more•Institutions (3)

Siemens¹, University of Southern California², Carnegie Mellon University³

14 May 2006

TL;DR: A new approach is presented that applies unit selection to find corresponding time frames in source and target speech to achieve the same performance as the conventional text-dependent training.

...read moreread less

Abstract: So far, most of the voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of source and target speaker. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, over the last two years, training techniques that use non-parallel data were proposed. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. By means of a subjective experiment it is shown that this technique achieves the same performance as the conventional text-dependent training.

...read moreread less

129 citations

Journal Article•DOI•

On Acoustic Emotion Recognition: Compensating for Covariate Shift

[...]

Ali Hassan¹, Robert I. Damper¹, Mahesan Niranjan¹•Institutions (1)

University of Southampton¹

01 Jul 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this paper, the authors employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift.

...read moreread less

Abstract: Pattern recognition tasks often face the situation that training data are not fully representative of test data. This problem is well-recognized in speech recognition, where methods like cepstral mean normalization (CMN), vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) are used to compensate for channel and speaker differences. Speech emotion recognition (SER) is an important emerging field in human-computer interaction and faces the same data shift problems, a fact which has been generally overlooked in this domain. In this paper, we show that compensating for channel and speaker differences can give significant improvements in SER by modelling these differences as a covariate shift. We employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift. We test these methods on the FAU Aibo Emotion Corpus, which was used in the Interspeech 2009 Emotion Challenge. It consists of two separate parts recorded independently at different schools; hence the two parts exhibit covariate shift. Results show that the IW methods outperform combined CMN and VTLN and significantly improve on the baseline performance of the Challenge. The best of the three methods also improves significantly on the winning contribution to the Challenge.

...read moreread less

108 citations

Journal Article•DOI•

Automatic recognition of lyrics in singing

[...]

Annamaria Mesaros¹, Tuomas Virtanen¹•Institutions (1)

Tampere University of Technology¹

01 Jan 2010-Eurasip Journal on Audio, Speech, and Music Processing

TL;DR: The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer and finds global adaptation to singing to improve singing recognition performance.

...read moreread less

Abstract: The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted to both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained using speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.

...read moreread less

97 citations

Proceedings Article•DOI•

VTLN-based cross-language voice conversion

[...]

David Sündermann¹, Hermann Ney¹, Harald Höge²•Institutions (2)

RWTH Aachen University¹, Siemens²

30 Nov 2003

TL;DR: After applying several conventional VTLN warping functions, the conventional piece-wise linear function is extended to several segments, allowing a more detailed warping of the source spectrum.

...read moreread less

Abstract: In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As cross-language voice conversion aims at the transformation of a source speaker's voice into that of a target speaker using a different language, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora of two languages and both speaker genders.

...read moreread less

93 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

[...]

S. Davis, Paul Mermelstein¹•Institutions (1)

bell northern research¹

01 Aug 1980-IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.

...read moreread less

Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.

...read moreread less

4,822 citations

Journal Article•DOI•

Perceptual linear predictive (PLP) analysis of speech

[...]

Hynek Hermansky¹•Institutions (1)

Panasonic¹

01 Apr 1990-Journal of the Acoustical Society of America

TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, which uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum, and yields a low-dimensional representation of speech.

...read moreread less

Abstract: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.

...read moreread less

2,969 citations

Journal Article•DOI•

Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models

[...]

C. J. Leggetter¹, Philip C. Woodland¹•Institutions (1)

University of Cambridge¹

01 Apr 1995-Computer Speech & Language

TL;DR: An important feature of the method is that arbitrary adaptation data can be used—no special enrolment sentences are needed and that as more data is used the adaptation performance improves.

...read moreread less

2,504 citations

Journal Article•DOI•

Maximum likelihood linear transformations for HMM-based speech recognition

[...]

Mark J. F. Gales¹•Institutions (1)

University of Cambridge¹

01 Apr 1998-Computer Speech & Language

TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.

...read moreread less

1,755 citations

Book•DOI•

Acoustical and environmental robustness in automatic speech recognition

[...]

Alex Acero

01 May 1991

TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including the SNR-Dependent Cepstral Normalization, (SDCN) and the Codeword-Dependent Cep stral normalization (CDCN).

...read moreread less

Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy. Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation. Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, that uses the cepstrum as its feature vector. Therefore, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs. The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra that are produced by the feature extraction module in SPHINX. In this dissertation we describe several algorithms including the SNR-Dependent Cepstral Normalization, (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same obtained when the system is trained and tested on speech from the desk-top microphone. An algorithm for frequency normalization has also been proposed in which the parameter of the bilinear transformation that is used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.

...read moreread less

474 citations