Vocal tract normalization equals linear transformation in cepstral space

doi:10.1109/TSA.2005.848881

Journal Article

Michael Pitz, +1 more

- 15 Aug 2005 -

IEEE Transactions on Speech and Audio Pr...

- Vol. 13, Iss: 5, pp 930-944

TLDR

This paper shows that vocal tract normalization amounts to a linear transformation in the cepstral domain. The transformation matrix is calculated analytically for three typical warping functions, and the matrices are shown to be diagonally dominant, so they can be approximated by quindiagonal matrices.

Abstract:

Vocal tract normalization (VTN) is a widely used speaker normalization technique that reduces the effect of differing human vocal tract lengths and improves the recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain; VTN and such linear transformations have so far been considered independent approaches to speaker normalization. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker normalization for automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results showing that the improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions, the transformation matrix is calculated analytically, and we show that the matrices are diagonally dominant and can thus be approximated by quindiagonal matrices.
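The abstract's central claim can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's derivation: it assumes the log spectrum is S(w) = c_0 + 2 * sum_k c_k cos(k w), so that c_n = (1/pi) * integral of S(w) cos(n w) over [0, pi], and it uses a hypothetical piecewise-linear warp g with break point 0.85 * pi. Under these conventions the cepstra of the warped spectrum S(g(w)) equal A c, with A[n, k] = (s_k / pi) * integral of cos(n w) cos(k g(w)), where s_0 = 1 and s_k = 2 otherwise. All function names and parameter values are illustrative.

```python
import numpy as np


def midpoint_grid(num_points):
    """Midpoints of num_points equal sub-intervals of [0, pi]."""
    return (np.arange(num_points) + 0.5) * np.pi / num_points


def piecewise_linear_warp(omega, alpha, omega0=0.85 * np.pi):
    """Hypothetical warp g(omega): slope alpha below omega0, then a line to (pi, pi)."""
    upper = alpha * omega0 + (np.pi - alpha * omega0) * (omega - omega0) / (np.pi - omega0)
    return np.where(omega <= omega0, alpha * omega, upper)


def log_spectrum(ceps, omega):
    """S(w) = c_0 + 2 * sum_{k>=1} c_k cos(k w), evaluated at the given frequencies."""
    k = np.arange(ceps.size)
    weights = np.where(k == 0, 1.0, 2.0) * ceps
    return weights @ np.cos(np.outer(k, omega))


def cepstra_by_integration(spectrum_values, num_ceps):
    """c_n = (1/pi) * integral of S(w) cos(n w), midpoint rule on the grid above."""
    omega = midpoint_grid(spectrum_values.size)
    basis = np.cos(np.outer(np.arange(num_ceps), omega))
    return basis @ spectrum_values / spectrum_values.size


def vtn_matrix(alpha, num_ceps=8, num_points=4096, warp=piecewise_linear_warp):
    """A[n, k] = (s_k / pi) * integral of cos(n w) cos(k g(w)), midpoint rule."""
    omega = midpoint_grid(num_points)
    idx = np.arange(num_ceps)
    basis = np.cos(np.outer(idx, omega))                 # rows: cos(n w)
    warped = np.cos(np.outer(idx, warp(omega, alpha)))   # rows: cos(k g(w))
    scale = np.where(idx == 0, 1.0, 2.0)                 # s_k, applied per column
    return (basis @ warped.T) * scale / num_points


alpha = 1.05                                   # illustrative warping factor
A = vtn_matrix(alpha)
c = np.random.default_rng(0).normal(size=8)    # arbitrary test cepstra

# Warp the spectrum directly, then recompute the cepstra by integration.
warped_freqs = piecewise_linear_warp(midpoint_grid(4096), alpha)
direct = cepstra_by_integration(log_spectrum(c, warped_freqs), 8)

print(np.allclose(A @ c, direct))              # matrix route vs. direct route
sign, logabsdet = np.linalg.slogdet(A)         # log |det A| of the truncated matrix
print(logabsdet)
```

The log-determinant computed at the end is the kind of Jacobian term the abstract refers to for normalizing the probability distributions. Here it is evaluated on a truncated 8 x 8 matrix, so it should be read as an approximation.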

Citations

Proceedings Article

An Analysis on the Two-phase Test Sample Sparse Representation Method and an Improved Method

Yuwu Lu

TL;DR: Wang et al., as mentioned in this paper, proposed a two-phase test sample sparse representation (TPTSR) method, which uses a global search algorithm to determine the M "optimal" nearest neighbors of the test sample.

Proceedings Article

Experimental investigation on the efficacy of Affine-DTW in the quality of voice conversion

Gaku Kotani, +3 more

TL;DR: Experimental results show that Affine-DTW obtains appropriate alignments, and subjective assessments show improved naturalness of converted speech for models trained on these alignments.

Proceedings Article

Improved prediction of the accent gap between speakers of English for individual-based clustering of World Englishes

Fumiya Shiozawa, +2 more

TL;DR: By controlling the degree of invariance, this work attempts to improve accent gap prediction by testing DNN-based model-free estimation of divergence and multi-stream speech structures, which shows improvement in correlation between reference accent gaps and the predicted and quantified gaps.

Proceedings Article

Neural VTLN for Speaker Adaptation in TTS

Bastian Schnell, +1 more

TL;DR: Experimental results show that the DNN is capable of predicting phone-dependent warpings on artificial data, and that such warpings improve the quality of an acoustic model on real data in subjective listening tests.

Proceedings Article

Invariant integration features combined with speaker-adaptation methods.

Florian Müller, +1 more

TL;DR: It is shown that the integration features benefit from adaptation methods and significantly outperform MFCCs in matched as well as mismatched training-test conditions.

References

Journal Article

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

S. Davis, +1 more

- 01 Aug 1980 -

IEEE Transactions on Acoustics, Speech, ...

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.

Journal Article

Perceptual linear predictive (PLP) analysis of speech

Hynek Hermansky

- 01 Apr 1990 -

Journal of the Acoustical Society of Ame...

TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented; it uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.

Journal Article

Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models

C. J. Leggetter, +1 more

- 01 Apr 1995 -

Computer Speech & Language

TL;DR: An important feature of the method is that arbitrary adaptation data can be used (no special enrolment sentences are needed) and that adaptation performance improves as more data is used.

Journal Article

Maximum likelihood linear transformations for HMM-based speech recognition

Mark J. F. Gales

- 01 Apr 1998 -

Computer Speech & Language

TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.

Book

Acoustical and environmental robustness in automatic speech recognition

Alex Acero

TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including the SNR-Dependent Cepstral Normalization (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN).

Related Papers (5)

A frequency warping approach to speaker normalization

Maximum likelihood linear transformations for HMM-based speech recognition

A parametric approach to vocal tract length normalization

Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models

Speaker normalization using efficient frequency warping procedures