scispace - formally typeset
Author

P. T. Akhil

Bio: P. T. Akhil is an academic researcher. The author has contributed to research on the topics of warp factor estimation and the expectation–maximization algorithm. The author has an h-index of 1 and has co-authored 1 publication receiving 31 citations.

Papers
Proceedings Article
01 Jan 2008
TL;DR: This paper develops a computationally efficient approach to warp factor estimation in Vocal Tract Length Normalization (VTLN) whose recognition performance is comparable to that of conventional VTLN at a lower computational cost.
Abstract: In this paper, we develop a computationally efficient approach for warp factor estimation in Vocal Tract Length Normalization (VTLN). Recently we have shown that warped features can be obtained by a linear transformation of the unwarped features. Using the warp matrices, we show that warp factor estimation can be performed efficiently in an EM framework. This is done by collecting sufficient statistics from a single alignment of the unwarped utterances. The likelihoods of the warped features, which are needed for warp factor estimation, are computed by appropriately modifying the sufficient statistics using the warp matrices. Using the OGI, TIDIGITS and RM tasks, we show that this approach has recognition performance comparable to conventional VTLN and yet is computationally more efficient.
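The scheme described in the abstract, scoring candidate warp matrices against sufficient statistics collected from a single alignment of the unwarped utterances, can be sketched roughly as below. This is a minimal illustration assuming diagonal-covariance Gaussians; the function names, the statistics layout, and the simple grid search are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def warp_loglik(A, stats, gaussians):
    """Score one warp matrix using cached sufficient statistics.

    stats: per-Gaussian (count n, first-order s = sum of x_t,
    second-order S = sum of x_t x_t^T), collected once from an
    alignment of the *unwarped* features.
    gaussians: per-Gaussian (mean mu, diagonal variance var).
    The log-likelihood of the warped features A @ x_t is obtained
    by transforming the statistics instead of re-aligning the data.
    """
    total = 0.0
    logdetA = np.linalg.slogdet(A)[1]  # Jacobian term for the transform
    for (n, s, S), (mu, var) in zip(stats, gaussians):
        # E[(A x - mu)^2] per dimension, from the transformed statistics
        quad = np.diag(A @ S @ A.T) - 2.0 * mu * (A @ s) + n * mu**2
        total += (-0.5 * np.sum(quad / var)
                  - 0.5 * n * np.sum(np.log(2 * np.pi * var))
                  + n * logdetA)
    return total

def estimate_warp(warp_matrices, stats, gaussians):
    # Grid search: pick the warp whose matrix maximizes the likelihood
    scores = [warp_loglik(A, stats, gaussians) for A in warp_matrices]
    return int(np.argmax(scores))
```

The saving comes from the alignment being done once: each candidate warp factor is then scored by matrix algebra on the cached statistics rather than by re-decoding the utterance.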

32 citations


Cited by
Proceedings Article
30 Sep 2010
TL;DR: A count-smoothing framework for incorporating prior information is extended to allow the use of different forms of dynamic prior and to improve the robustness of transform estimation on small amounts of data.
Abstract: Rapidly adapting a speech recognition system to new speakers using a small amount of adaptation data is important to improve initial user experience. In this paper, a count-smoothing framework for incorporating prior information is extended to allow for the use of different forms of dynamic prior and improve the robustness of transform estimation on small amounts of data. Prior information is obtained from existing rapid adaptation techniques like VTLN and PCMLLR. Results using VTLN as a dynamic prior for CMLLR estimation show that transforms estimated on just one utterance can yield relative gains of 15% and 46% over a baseline gender independent model on two tasks. Index Terms: automatic speech recognition, speaker adaptation, VTLN, prior knowledge
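The count-smoothing idea in this abstract, blending data statistics with prior statistics derived from a dynamic prior such as VTLN before estimating the CMLLR transform, can be sketched as follows. The interpolation form and the names `G_data`/`G_prior` are illustrative assumptions; the paper's exact accumulator definitions may differ.

```python
import numpy as np

def smoothed_stats(G_data, k_data, G_prior, k_prior, tau, n):
    """Blend data statistics with prior statistics (count smoothing).

    G_*, k_*: second- and first-order accumulators of the kind used in
    CMLLR row estimation; n is the observed frame count and tau the
    prior weight in 'virtual frames'. The prior statistics would be
    generated from a dynamic prior transform, e.g. a VTLN transform
    estimated on the same utterance.
    """
    w = tau / (n + tau)  # more smoothing when little data is observed
    G = (1 - w) * G_data + w * G_prior
    k = (1 - w) * k_data + w * k_prior
    return G, k
```

With only one utterance, `n` is small, so the estimate is pulled strongly toward the prior transform; as adaptation data accumulates, the data statistics dominate.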

27 citations

Journal ArticleDOI
TL;DR: This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis.
Abstract: Vocal tract length normalization (VTLN) has been successfully used in automatic speech recognition for improved performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis. Jacobian normalization, high-dimensionality features and truncation of the transformation matrix are a few challenges presented with the appropriate solutions. Detailed evaluations are performed to estimate the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is also not an easy task since the technique does not work equally well for all speakers. Speakers have been selected based on different objective and subjective criteria to demonstrate the difference between systems. The best method for implementing VTLN is confirmed to be use of the lower order features for estimating warping factors.

22 citations

Journal ArticleDOI
TL;DR: A method to analytically obtain a linear-transformation on the conventional Mel frequency cepstral coefficients (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying the VTLN processing.
Abstract: In this paper, we propose a method to analytically obtain a linear transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying the VTLN processing. There have been many attempts to obtain such a linear transformation, but in all the previously proposed approaches, either the signal processing is modified (and therefore the features are not conventional MFCC), or the linear transformation does not correspond to conventional VTLN warping, or the matrices being estimated are data dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh proposed the idea of using band-limited interpolation to perform VTLN warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear transformation to perform VTLN warping on conventional MFCC. However, in their approach, VTLN warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach that also draws inspiration from the work of Umesh, and which we believe for the first time performs conventional VTLN as a linear transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear transformation to perform VTLN allows us to use the VTLN matrices in a transform-based adaptation framework, with its associated advantages, while requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data.
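The core idea of expressing frequency warping as a single linear map on cepstral features can be illustrated with a deliberately simplified construction: treat the log spectrum as a cosine series of the cepstra, resample it at warped frequencies, and project back. This sketch omits the Mel filter bank handling that conventional VTLN-on-MFCC requires, and uses numerical quadrature rather than band-limited interpolation, so it shows only the linear-map idea, not the paper's actual derivation.

```python
import numpy as np

def warp_matrix(n_cep, alpha, n_points=512):
    """Build a matrix T so that warped_cepstra = T @ cepstra.

    A hypothetical simplification: the log spectrum L(w) = sum_k c_k
    cos(k w) is evaluated at warped frequencies and an inverse cosine
    transform recovers the warped cepstra, making the whole warp a
    single matrix that can be precomputed per warp factor alpha.
    """
    omega = np.linspace(0.0, np.pi, n_points)
    # simple piecewise-linear warping function, clipped at pi
    omega_w = np.minimum(alpha * omega, np.pi)
    # rows index frequency, columns index cepstral order
    C_w = np.cos(np.outer(omega_w, np.arange(n_cep)))
    # inverse cosine transform back to cepstra (numerical quadrature)
    C = np.cos(np.outer(np.arange(n_cep), omega))
    scale = np.full(n_cep, 2.0)
    scale[0] = 1.0
    return (scale[:, None] / n_points) * (C @ C_w)
```

For `alpha = 1.0` the matrix is (up to quadrature error) the identity, which is the sanity check one would expect of any such linear warp.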

16 citations

01 Jan 2010
TL;DR: The EM formulation helps to embed the feature normalization in the HMM training and enables the use of multiple (appropriate) warping factors for different state clusters of the same speaker.
Abstract: Vocal tract length normalization is an important feature normalization technique that can be used to perform speaker adaptation when very little adaptation data is available. It was shown earlier that VTLN can be applied to statistical speech synthesis and gives additive improvements over CMLLR. This paper presents an EM optimization for estimating more accurate warping factors. The EM formulation helps to embed the feature normalization in the HMM training. This allows the warping factors to be estimated more efficiently and enables the use of multiple (appropriate) warping factors for different state clusters of the same speaker.

15 citations

Proceedings Article
01 Jan 2009
TL;DR: This paper shows that, in this framework of VTLN and using the idea of a regression class tree, one can obtain separate VTLN warping for different acoustic classes, and reports the recognition performance of the proposed acoustic-class-specific warp factors.
Abstract: In this paper, we study the use of different frequency warp factors for different acoustic classes in a computationally efficient framework of Vocal Tract Length Normalization (VTLN). This is motivated by the fact that not all acoustic classes exhibit similar spectral variations as a result of physiological differences in the vocal tract, and therefore the use of a single frequency warp for the entire utterance may not be appropriate. We have recently proposed a VTLN method that implements VTLN warping through a linear transformation (LT) of the conventional MFCC features and efficiently estimates the warp factor using the same sufficient statistics as those used in CMLLR adaptation. In this paper, we show that, in this framework of VTLN and using the idea of a regression class tree, we can obtain separate VTLN warping for different acoustic classes. The use of a regression class tree ensures that a warp factor is estimated for each class even when very little data is available for that class. The acoustic classes, in general, can be any collection of the Gaussian components in the acoustic model. We build acoustic classes both using a data-driven approach and using phonetic knowledge. Using the WSJ database, we report the recognition performance of the proposed acoustic-class-specific warp factors for both the data-driven and the phonetic-knowledge-based regression class tree definitions, and compare it with the single-warp-factor case. Index Terms: VTLN, Acoustic-Class Specific Warping, Regression Class Tree, Linear Transform
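The regression-class-tree mechanism described in the abstract, estimating a class-specific warp factor wherever enough data is available and backing off to the parent class otherwise, can be sketched as below. The node structure and the `occupancy`/`estimate` callbacks are hypothetical placeholders for whatever statistics the acoustic model actually provides.

```python
class WarpNode:
    """Node of a regression class tree over Gaussian components."""
    def __init__(self, children=None):
        self.children = children or []

def assign_warps(node, occupancy, estimate, min_count, parent_warp=None):
    """Recursively pick per-class warp factors with parent backoff.

    occupancy(node) -> frame count accumulated by the node's Gaussians;
    estimate(node)  -> warp factor estimated from the node's statistics.
    Returns a {leaf: warp} mapping for all leaves under `node`; a leaf
    with too little data inherits the warp of its nearest well-trained
    ancestor, so every class always receives some warp factor.
    """
    if occupancy(node) >= min_count:
        warp = estimate(node)      # enough data: class-specific warp
    else:
        warp = parent_warp         # back off to the parent's estimate
    if not node.children:
        return {node: warp}
    out = {}
    for child in node.children:
        out.update(assign_warps(child, occupancy, estimate,
                                min_count, parent_warp=warp))
    return out
```

With `min_count` set high enough, this degenerates to the single-warp-factor case (every leaf inherits the root's warp), which is the baseline the paper compares against.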

13 citations