Journal Article

Vocal tract normalization equals linear transformation in cepstral space

15 Aug 2005 · IEEE Transactions on Speech and Audio Processing (IEEE) · Vol. 13, Iss. 5, pp. 930-944
TL;DR: In this paper, the Jacobian determinant of the transformation matrix is computed analytically for three typical warping functions, and it is shown that the matrices are diagonally dominant and can thus be approximated by quindiagonal matrices.
Abstract: Vocal tract normalization (VTN) is a widely used speaker normalization technique which reduces the effect of differing human vocal tract lengths and improves the recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain; frequency warping and linear cepstral transformation, which have so far been considered independent approaches to speaker normalization, are therefore closely related. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker normalization for automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results showing that improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions the transformation matrix is calculated analytically, and we show that the matrices are diagonally dominant and can thus be approximated by quindiagonal matrices.
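The linearity result can be illustrated numerically. Below is a minimal sketch (my own illustration, not code from the paper) that builds the cepstral transformation matrix for a piecewise-linear warping function by numerical integration, assuming the common convention that the log spectrum is a cosine series in the cepstral coefficients; the function names and the 0.875*pi inflection point are assumptions. For alpha = 1 the warp is the identity and the matrix reduces to the identity, a quick correctness check; for alpha != 1 the printout shows the diagonal dominance noted above.

import numpy as np

def pw_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    # Piecewise-linear VTN warp: slope alpha up to the inflection point
    # omega0, then a second segment chosen so that pi maps onto pi.
    return np.where(
        omega <= omega0,
        alpha * omega,
        alpha * omega0 + (np.pi - alpha * omega0) * (omega - omega0) / (np.pi - omega0),
    )

def cepstral_transform_matrix(alpha, M=12, K=8192):
    # A[n-1, m-1] ~= (2/pi) * integral_0^pi cos(n w) cos(m phi_inv(w)) dw,
    # the linear map from cepstra c_m (m >= 1) to warped cepstra
    # (c_0 needs a separate scaling). Plain Riemann sum over a dense grid.
    w = np.linspace(0.0, np.pi, K)
    phi_inv = np.interp(w, pw_linear_warp(w, alpha), w)  # invert the monotone warp
    idx = np.arange(1, M + 1)
    cos_n = np.cos(np.outer(idx, w))        # (M, K)
    cos_m = np.cos(np.outer(idx, phi_inv))  # (M, K)
    dw = w[1] - w[0]
    return (2.0 / np.pi) * dw * (cos_n @ cos_m.T)

A = cepstral_transform_matrix(alpha=1.1)
print(np.round(np.diag(A), 3))                 # diagonal entries near 1
print(np.abs(A - np.diag(np.diag(A))).max())   # off-diagonal entries are small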


Citations
Journal Article
TL;DR: An overview of real-world applications of VC systems is provided, existing systems proposed in the literature are studied extensively, and remaining challenges are discussed.

232 citations

Proceedings Article
14 May 2006
TL;DR: A new approach is presented that applies unit selection to find corresponding time frames in source and target speech; a subjective experiment shows that it achieves the same performance as conventional text-dependent training.
Abstract: So far, most voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of source and target speaker. Since several applications (e.g., speech-to-speech translation or dubbing) require text-independent training, training techniques that use non-parallel data have been proposed over the last two years. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. By means of a subjective experiment it is shown that this technique achieves the same performance as the conventional text-dependent training.
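As a rough illustration of the frame-pairing idea (a sketch under my own simplifying assumptions, not the paper's actual unit-selection costs), one can pair each source frame with the target frame that minimizes a spectral distance plus a crude continuity penalty:

import numpy as np

def select_corresponding_frames(src, tgt, w_cont=0.3):
    # src, tgt: (T, D) arrays of cepstral frames. For each source frame,
    # greedily pick the target frame minimizing spectral distance plus a
    # penalty on jumps in target time (a stand-in for a concatenation cost).
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    t_idx = np.arange(len(tgt))
    picks = np.empty(len(src), dtype=int)
    picks[0] = dist[0].argmin()
    for i in range(1, len(src)):
        picks[i] = (dist[i] + w_cont * np.abs(t_idx - picks[i - 1])).argmin()
    return picks  # picks[i]: index of the target frame paired with source frame i

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 13))  # placeholder source cepstral frames
tgt = rng.normal(size=(250, 13))  # placeholder target cepstral frames
pairs = select_corresponding_frames(src, tgt)
# the paired frames then play the role of parallel data for conversion training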

129 citations

Journal Article
TL;DR: In this paper, the authors employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift.
Abstract: Pattern recognition tasks often face the situation that training data are not fully representative of test data. This problem is well-recognized in speech recognition, where methods like cepstral mean normalization (CMN), vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) are used to compensate for channel and speaker differences. Speech emotion recognition (SER) is an important emerging field in human-computer interaction and faces the same data shift problems, a fact which has been generally overlooked in this domain. In this paper, we show that compensating for channel and speaker differences can give significant improvements in SER by modelling these differences as a covariate shift. We employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift. We test these methods on the FAU Aibo Emotion Corpus, which was used in the Interspeech 2009 Emotion Challenge. It consists of two separate parts recorded independently at different schools; hence the two parts exhibit covariate shift. Results show that the IW methods outperform combined CMN and VTLN and significantly improve on the baseline performance of the Challenge. The best of the three methods also improves significantly on the winning contribution to the Challenge.
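The paper evaluates three specific importance-weighting algorithms; as a simplified stand-in, the sketch below estimates w(x) = p_test(x)/p_train(x) with a logistic-regression domain classifier (my substitution, not one of the paper's three estimators) and feeds the weights to a support vector machine through scikit-learn's per-sample weighting:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def importance_weights(X_train, X_test):
    # Train a classifier to separate train from test inputs; its odds
    # ratio estimates the density ratio p_test(x) / p_train(x).
    X = np.vstack([X_train, X_test])
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X_train)[:, 1]
    w = (p / (1.0 - p)) * (len(X_train) / len(X_test))
    return np.clip(w, 0.0, 10.0)  # clipping keeps the weight variance bounded

def train_iw_svm(X_train, y_train, X_test):
    # The weights scale each sample's hinge-loss contribution in the SVM.
    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(X_train, y_train, sample_weight=importance_weights(X_train, X_test))
    return svm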

108 citations

Journal Article
TL;DR: The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer and finds that global adaptation to singing improves singing recognition performance.
Abstract: The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted to both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained using speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.
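The "linearly adapted to singing" step is MLLR-style adaptation of the acoustic model. A minimal sketch of one global mean transform, assuming identity covariances and hard frame-to-Gaussian alignments (full MLLR weights the estimate by occupation probabilities and inverse variances):

import numpy as np

def global_mllr_mean_transform(frames, means):
    # frames[i]: adaptation (singing) feature vector aligned to Gaussian
    # mean means[i]. Least-squares fit of one affine transform W so that
    # W @ [1; mu_i] approximates the aligned frames.
    xi = np.hstack([np.ones((len(means), 1)), means])  # extended means, (N, D+1)
    W_T, *_ = np.linalg.lstsq(xi, frames, rcond=None)  # (D+1, D)
    return W_T.T                                       # W is (D, D+1)

def adapt_means(means, W):
    xi = np.hstack([np.ones((len(means), 1)), means])
    return xi @ W.T  # speech-trained means moved toward singing

Gender-specific or multiple-base-class adaptation as described above then amounts to estimating one such transform per base class.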

97 citations

Proceedings Article
30 Nov 2003
TL;DR: After evaluating several conventional VTLN warping functions, the conventional piecewise-linear function is extended to several segments, allowing a more detailed warping of the source spectrum.
Abstract: In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. Since cross-language voice conversion aims at transforming a source speaker's voice into that of a target speaker who uses a different language, we investigate whether VTLN is an appropriate method for adapting the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piecewise-linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora covering two languages and both speaker genders.
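A sketch of the extension from the conventional two-segment function to several segments (the knot positions below are made-up values for illustration, not the paper's):

import numpy as np

def piecewise_linear_warp(omega, knots_in, knots_out):
    # Piecewise-linear frequency warp mapping knots_in[i] -> knots_out[i];
    # both knot lists must increase from 0 to pi.
    return np.interp(omega, knots_in, knots_out)

w = np.linspace(0.0, np.pi, 257)
# conventional VTLN: a single interior knot, slope 1.1 below it
two_segment = piecewise_linear_warp(
    w, [0.0, 0.875 * np.pi, np.pi], [0.0, 1.1 * 0.875 * np.pi, np.pi])
# extended warping: several interior knots allow a different slope per band
multi_segment = piecewise_linear_warp(
    w,
    [0.0, 0.25 * np.pi, 0.50 * np.pi, 0.875 * np.pi, np.pi],
    [0.0, 0.27 * np.pi, 0.55 * np.pi, 0.90 * np.pi, np.pi])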

93 citations

References
Journal Article
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words; therefore, the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time-registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
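The construction described above, in compact form: power spectrum, triangular filters spaced evenly on the mel scale, log compression, and a cosine transform. The 20-filter/10-coefficient sizes echo the abstract; the FFT size, window, and filter details are my assumptions.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, sr=16000, n_filters=20, n_ceps=10):
    # One frame: windowed power spectrum -> mel energies -> log -> DCT-II.
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    logmel = np.log(np.maximum(mel_filterbank(n_filters, n_fft, sr) @ spec, 1e-10))
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return dct @ logmel

coeffs = mfcc(np.random.default_rng(0).normal(size=512))  # stand-in speech frame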

4,822 citations

Journal Article
Hynek Hermansky
TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented; it uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.
Abstract: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
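A condensed sketch of the three psychophysical steps followed by the 5th-order all-pole fit. The rectangular 1-Bark integration windows are a simplification of the paper's masking-curve shapes, and the Bark formula used is one common approximation:

import numpy as np

def levinson(r, order):
    # Levinson-Durbin recursion: all-pole coefficients from autocorrelation.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i + 1] += k * np.r_[a[i - 1:0:-1], 1.0]
        err *= 1.0 - k * k
    return a, err

def plp_allpole(power_spec, sr, order=5):
    f = np.linspace(0.0, sr / 2.0, len(power_spec))
    bark = 6.0 * np.arcsinh(f / 600.0)                 # Hz -> Bark
    centers = np.arange(1.0, bark[-1], 1.0)            # ~1-Bark band centers
    # (1) critical-band integration (rectangular windows, simplified)
    cb = np.array([power_spec[np.abs(bark - c) <= 0.5].sum() for c in centers])
    # (2) equal-loudness weighting at the band centers (Hermansky's curve)
    w2 = (2.0 * np.pi * 600.0 * np.sinh(centers / 6.0)) ** 2
    eql = ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))
    # (3) intensity-loudness power law (cube-root compression)
    loud = (eql * cb) ** 0.33
    # autocorrelation of the symmetrized auditory spectrum -> all-pole model
    sym = np.r_[loud, loud[-2:0:-1]]
    r = np.fft.ifft(sym).real[:order + 1]
    a, _ = levinson(r, order)
    return a  # 5th-order all-pole model of the auditory spectrum

spec = np.abs(np.fft.rfft(np.random.default_rng(1).normal(size=512))) ** 2
a = plp_allpole(spec, sr=16000)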

2,969 citations

Journal Article
TL;DR: An important feature of the method is that arbitrary adaptation data can be used (no special enrolment sentences are needed) and that adaptation performance improves as more data is used.

2,504 citations

Journal Article
TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.
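A small numerical illustration of the distinction (my own example, not from the paper): in the constrained case, adapting the Gaussian parameters with one shared transform is equivalent, up to a log-Jacobian term, to transforming the features instead; this is the same determinant bookkeeping that appears in the VTN paper above.

import numpy as np

rng = np.random.default_rng(1)
mu = rng.normal(size=3)
Sigma = np.diag([1.0, 2.0, 0.5])
A = rng.normal(size=(3, 3))
b = rng.normal(size=3)
H = rng.normal(size=(3, 3))

# unconstrained: mean and variance transforms chosen independently
mu_u, Sigma_u = A @ mu + b, H @ Sigma @ H.T
# constrained: the variance transform has the same form as the mean transform
mu_c, Sigma_c = A @ mu + b, A @ Sigma @ A.T

def loglik(x, m, S):
    d = x - m
    return -0.5 * (d @ np.linalg.solve(S, d) + np.log(np.linalg.det(2.0 * np.pi * S)))

# the constrained-adapted model scores x exactly like the original model
# scores the affinely transformed feature, minus the log-Jacobian of A
x = rng.normal(size=3)
x_hat = np.linalg.solve(A, x - b)
assert np.isclose(loglik(x, mu_c, Sigma_c),
                  loglik(x_hat, mu, Sigma) - np.log(abs(np.linalg.det(A))))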

1,755 citations

Book
01 May 1991
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN).
Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy. Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation. Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, which uses the cepstrum as its feature vector; therefore, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on the instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs. The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra produced by the feature extraction module in SPHINX. In this dissertation we describe several algorithms, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same as that obtained when the system is trained and tested on speech from the desk-top microphone. An algorithm for frequency normalization has also been proposed, in which the parameter of the bilinear transformation used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.
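The bilinear-transform frequency warping and the distortion-based parameter choice, sketched in code. The warping formula is the standard first-order all-pass phase response; the grid range and the warp_features helper are hypothetical placeholders for the dissertation's actual signal-processing stage:

import numpy as np

def bilinear_warp(omega, alpha):
    # Frequency mapping of the first-order all-pass (bilinear) transform;
    # alpha in (-1, 1) stretches or compresses the frequency axis.
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def pick_alpha(frames, codebook, warp_features,
               alphas=np.linspace(-0.2, 0.2, 21)):
    # Choose the warping parameter that minimizes VQ distortion between the
    # warped frames and a codebook trained on the standard environment.
    # warp_features(frames, alpha) is a hypothetical helper that re-computes
    # cepstral frames on the frequency axis given by bilinear_warp.
    def distortion(X):
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=-1)
        return d.min(axis=1).mean()
    return min(alphas, key=lambda a: distortion(warp_features(frames, a)))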

474 citations