Journal ArticleDOI

Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC

01 Jan 2009-Computer Speech & Language (Academic Press Ltd.)-Vol. 23, Iss: 1, pp 42-64
TL;DR: When the MLS criterion was used for FW estimation, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, and it is shown that the approximations involved do not lead to any performance degradation.
About: This article was published in Computer Speech & Language on 2009-01-01. It has received 46 citations to date.
Citations
Patent
Ioannis Agiomyrgiannakis
01 Nov 2013
TL;DR: In this paper, a text-to-speech (TTS) synthesis system may include hidden Markov model (HMM) based speech modeling for synthesizing output speech.
Abstract: A method and system is disclosed for non-parametric speech conversion. A text-to-speech (TTS) synthesis system may include hidden Markov model (HMM) based speech modeling for synthesizing output speech. A converted HMM may be initially set to a source HMM trained with a voice of a source speaker. A parametric representation of speech may be extracted from speech of a target speaker to generate a set of target-speaker vectors. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each HMM state of the source HMM to a target-speaker vector. The HMM states of the converted HMM may be replaced with the matched target-speaker vectors. Transforms may be applied to further adapt the converted HMM to the voice of the target speaker. The converted HMM may be used to synthesize speech with voice characteristics of the target speaker.
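The matching procedure in this abstract pairs each source HMM state with its closest target-speaker vector after compensating for speaker differences. A minimal sketch of that idea, assuming a simple affine compensation (A, b) and Euclidean nearest-neighbour matching; the function and variable names are illustrative, not the patent's:

```python
import numpy as np

def match_and_replace(state_means, target_vectors, A, b):
    """Match each source HMM state mean to its nearest target-speaker vector
    under an assumed affine compensation x -> A @ x + b, then return the
    matched target vectors as replacements (one per state).

    state_means    : (m, d) array, one spectral mean vector per HMM state
    target_vectors : (n, d) array of vectors extracted from the target speaker
    """
    compensated = state_means @ A.T + b  # map states toward the target space
    # Squared Euclidean distance between every compensated state and target vector
    d2 = ((compensated[:, None, :] - target_vectors[None, :, :]) ** 2).sum(axis=-1)
    return target_vectors[d2.argmin(axis=1)]  # (m, d) replacement vectors
```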

101 citations

Journal ArticleDOI
TL;DR: The acoustic feature and classifier methods developed here have excellent potential for individual animal recognition and can be easily applied to other species.

59 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: A modification of the conventional training process for VC is proposed that allows it to perform as an AC transform: source and target vectors are paired based not on their ordering within a parallel corpus, but on their linguistic similarity.
Abstract: Voice-conversion (VC) techniques aim to transform utterances from a source speaker to sound as if they had been produced by a target speaker. This includes not only organic properties (i.e., voice quality) but also linguistic cues (i.e., regional accents) of the target speaker. For this reason, VC is generally ill-suited for accent-conversion (AC) purposes, where the goal is to capture the voice quality of the target speaker but the regional accent of the source speaker. In this paper, we propose a modification of the conventional training process for VC that allows it to perform as an AC transform. The approach consists of pairing source and target vectors based not on their ordering within a parallel corpus, as is commonly done in VC, but based on their linguistic similarity. We validate the AC approach on a corpus containing native-accented and Spanish-accented utterances, and compare it against conventional VC through a series of perceptual listening tests. We also analyze the extent to which phonological differences between the two languages (Spanish and American English) help predict the relative performance of the two methods.

48 citations

Proceedings ArticleDOI
15 Apr 2018
TL;DR: An approach is proposed that matches frames between the two speakers based on their phonetic (rather than acoustic) similarity, improving the ratings of acoustic quality and native accent while retaining the voice identity of the non-native speaker.
Abstract: Accent conversion (AC) aims to transform non-native speech to sound as if the speaker had a native accent. This can be achieved by mapping source spectra from a native speaker into the acoustic space of the non-native speaker. In prior work, we proposed an AC approach that matches frames between the two speakers based on their acoustic similarity after compensating for differences in vocal tract length. In this paper, we propose an approach that matches frames between the two speakers based on their phonetic (rather than acoustic) similarity. Namely, we map frames from the two speakers into a phonetic posteriorgram using speaker-independent acoustic models trained on native speech. We evaluate the proposed algorithm on a corpus containing multiple native and non-native speakers. Compared to the previous AC algorithm, the proposed algorithm improves the ratings of acoustic quality (20% increase in mean opinion score) and native accent (69% preference) while retaining the voice identity of the non-native speaker.
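A rough sketch of the frame-matching step described above, under stated assumptions: both speakers' frames are mapped to phone posteriorgrams by some speaker-independent classifier exposing predict_proba (a stand-in for the paper's acoustic models), and frames are paired by symmetric KL divergence, one plausible posterior distance rather than the paper's exact choice:

```python
import numpy as np

def posteriorgram(model, frames):
    """Map acoustic frames to phone posterior distributions using any
    speaker-independent classifier with a predict_proba method."""
    return model.predict_proba(frames)  # shape (n_frames, n_phones)

def match_frames(P_src, P_tgt, eps=1e-10):
    """Pair each source frame with the target frame whose posterior
    distribution is closest under symmetric KL divergence."""
    P, Q = P_src + eps, P_tgt + eps  # avoid log(0)
    kl_pq = (P[:, None, :] * np.log(P[:, None, :] / Q[None, :, :])).sum(-1)
    kl_qp = (Q[None, :, :] * np.log(Q[None, :, :] / P[:, None, :])).sum(-1)
    return (kl_pq + kl_qp).argmin(axis=1)  # matched target index per source frame
```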

47 citations


Cites methods from "Frequency warping for VTLN and spea..."


  • ...[22] S. Panchapagesan and A. Alwan, "Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC," Computer Speech and Language, vol. 23, no. 1, pp. 42-64, Jan 2009....


  • ...Following Panchapagesan and Alwan [22], we then learn a linear transform between the MFCCs of both speakers using ridge regression: $T^{*} = \arg\min_{T} \lVert Y - TX \rVert^{2} + \lambda \lVert T \rVert^{2}$ (6), where $X$ and $Y$ are vectors of MFCCs from the native and L2 speakers, respectively, and $T^{*}$ is the VTLN transform.... (see the sketch below)

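As a companion to the excerpt above, a minimal sketch of the ridge-regression estimate of the linear MFCC transform; the closed-form solution below is the textbook one for this objective, and the time-aligned MFCC matrices and regularization weight lam are assumptions of this sketch, not details from the cited paper:

```python
import numpy as np

def estimate_linear_transform(X, Y, lam=1e-2):
    """Closed-form ridge solution of T* = argmin_T ||Y - T X||^2 + lam ||T||^2.

    X   : (d, n) MFCC vectors of the native speaker, one column per frame
    Y   : (d, n) time-aligned MFCC vectors of the L2 speaker
    lam : ridge penalty weight (illustrative; tune on held-out data)
    """
    d = X.shape[0]
    # Normal equations of the ridge objective: T* = Y X^T (X X^T + lam I)^-1
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

# Tiny synthetic check: recover a known transform from noisy aligned frames
# (in practice X and Y would come from, e.g., DTW-aligned parallel utterances).
rng = np.random.default_rng(0)
X = rng.normal(size=(13, 500))
T_true = np.eye(13) + 0.1 * rng.normal(size=(13, 13))
Y = T_true @ X + 0.01 * rng.normal(size=(13, 500))
T = estimate_linear_transform(X, Y)
print(np.allclose(T, T_true, atol=0.05))  # True
```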

Patent
13 Nov 2014
TL;DR: In this paper, a method and system for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions is described.
Abstract: A method and system is disclosed for building a speech database for a text-to-speech (TTS) synthesis system from multiple speakers recorded under diverse conditions. For a plurality of utterances of a reference speaker, a set of reference-speaker vectors may be extracted, and for each of a plurality of utterances of a colloquial speaker, a respective set of colloquial-speaker vectors may be extracted. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each colloquial-speaker vector to a reference-speaker vector. The colloquial-speaker vector may be replaced with the matched reference-speaker vector. The matching-and-replacing can be carried out separately for each set of colloquial-speaker vectors. A conditioned set of speaker vectors can then be constructed by aggregating all the replaced speaker vectors. The conditioned set of speaker vectors can be used to train the TTS system.

41 citations

References
Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
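As a present-day illustration of the winning feature set, a minimal sketch using librosa (an assumption of this sketch, not the paper's tooling); ten coefficients with a hop of roughly 6.4 ms at 16 kHz echo the paper's configuration only loosely:

```python
import librosa

# Compute 10 mel-frequency cepstrum coefficients per frame. A hop of
# 102 samples at 16 kHz is ~6.4 ms, loosely mirroring the paper's frame
# rate; the file name and all values here are illustrative.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10, hop_length=102)
print(mfcc.shape)  # (10, n_frames)
```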

4,822 citations

Journal ArticleDOI
TL;DR: An important feature of the method is that arbitrary adaptation data can be used (no special enrolment sentences are needed) and that adaptation performance improves as more data is used.

2,504 citations


"Frequency warping for VTLN and spea..." refers methods in this paper

  • ...After estimating the LT (see Section 5 below), a bias vector b and an unconstrained variance transform matrix H may be estimated according to the Maximum Likelihood Linear Regression (MLLR) technique (Leggetter and Woodland, 1995; Gales, 1996)....


Journal ArticleDOI
TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.

1,755 citations


"Frequency warping for VTLN and spea..." refers methods in this paper

  • ...This objective function is identical to the one used for MLLR and CMLLR (constrained MLLR, (Gales, 1998)), except the linear transformation to be estimated is constrained by the FW parametrization....


  • ...Also, even if the Jacobian determinant term were neglected, the accumulator based approach (Gales, 1998) for efficient optimization of the EM auxiliary function with CLTFW cannot be used with regular VTLN....


  • ...Different CLTFW transforms can also be estimated for different classes of distributions similar to CMLLR, without much increase in computations, since it is seen from Eq....

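For orientation, the distinction these excerpts draw between unconstrained and constrained transforms can be summarized with the standard formulations (a summary of well-known forms, not a quotation from the paper):

```latex
% Unconstrained (MLLR-style): the mean transform is estimated on its own,
% with a separately estimated variance transform also permitted
% (Leggetter and Woodland, 1995; Gales, 1996).
\hat{\mu} = A\mu + b
% Constrained (CMLLR): the same matrix must act on mean and covariance
% (Gales, 1998), which is equivalent, up to a Jacobian term in the
% likelihood, to an affine transform of the feature vectors:
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A\Sigma A^{\top}
\quad\Longleftrightarrow\quad
\hat{x} = A^{-1}(x - b)
```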

Journal ArticleDOI
TL;DR: The issue of speech recognizer training from a broad perspective with root in the classical Bayes decision theory is discussed, and the superiority of the minimum classification error (MCE) method over the distribution estimation method is shown by providing the results of several key speech recognition experiments.
Abstract: A critical component in the pattern matching approach to speech recognition is the training algorithm, which aims at producing typical (reference) patterns or models for accurate pattern comparison. In this paper, we discuss the issue of speech recognizer training from a broad perspective with root in the classical Bayes decision theory. We differentiate the method of classifier design by way of distribution estimation and the discriminative method of minimizing classification error rate based on the fact that in many realistic applications, such as speech recognition, the real signal distribution form is rarely known precisely. We argue that traditional methods relying on distribution estimation are suboptimal when the assumed distribution form is not the true one, and that "optimality" in distribution estimation does not automatically translate into "optimality" in classifier design. We compare the two different methods in the context of hidden Markov modeling for speech recognition. We show the superiority of the minimum classification error (MCE) method over the distribution estimation method by providing the results of several key speech recognition experiments. In general, the MCE method provides a significant reduction of recognition error rate.
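A toy rendering of the MCE criterion summarized above, assuming per-class discriminant scores (e.g., HMM log-likelihoods): a smoothed misclassification measure is pushed through a sigmoid so that minimizing the loss directly reduces classification errors. The smoothing constants and function names are illustrative:

```python
import numpy as np

def mce_loss(g, label, eta=2.0, gamma=1.0):
    """Minimum classification error loss for one training token.

    g     : (M,) discriminant scores, one per class
    label : index of the correct class
    eta   : smoothing of the competing-class average (illustrative value)
    gamma : slope of the sigmoid loss (illustrative value)
    """
    competitors = np.delete(g, label)
    # Smoothed maximum over the competing classes
    anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = -g[label] + anti                      # misclassification measure
    return 1.0 / (1.0 + np.exp(-gamma * d))   # sigmoid loss in (0, 1)
```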

728 citations