We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state automata (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

/pdf/the-kaldi-speech-recognition-toolkit-3g790z933t.pdf

The Kaldi Speech Recognition Toolkit

1. Basic Concepts. 2. Nonparametric Methods. 3. Parametric Methods for Rational Spectra. 4. Parametric Methods for Line Spectra. 5. Filter Bank Methods. 6. Spatial Methods. Appendix A: Linear Algebra and Matrix Analysis Tools. Appendix B: Cramer-Rao Bound Tools. Appendix C: Model Order Selection Tools. Appendix D: Answers to Selected Exercises. Bibliography. References Grouped by Subject. Subject Index.

/pdf/spectral-analysis-of-signals-4dbupf9m4n.pdf

Spectral analysis of signals

Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.

Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication.

The aim of this review is first to present the core architecture of an HMM-based LVCSR system and then describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. The review concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described.

/pdf/application-of-hidden-markov-models-in-speech-recognition-4ic3ad3d5g.pdf

Application of Hidden Markov Models in Speech Recognition

Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

An overview of automatic speaker diarization systems

https://hal.archives-ouvertes.fr/hal-00499180/document

Automatic speech recognition and speech variability: A review

We show that there are many qualitatively different equations, each with few parameters, that fit the experimentally obtained Mel scale. We investigate the often-made remark that there are two regions to the Mel scale, the first region (< ~1000 Hz) being linear and the upper region being logarithmic. We show that there is no evidence, based on the experimental data points, that there are two qualitatively different regions or that the lower region is linear and the upper region logarithmic. In fact, $F_M = f/(af+b)$, where $F_M$ and $f$ are the Mel and physical frequency respectively, fits better than a line in the linear region or a logarithm in the "log" region.

Fitting the Mel scale
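The two-parameter form F_M = f/(af+b) quoted in the abstract above can be fitted with a straight-line regression, because rearranging it gives f/F_M = af + b. The sketch below is only an illustration of that idea: the experimental Mel data points are not reproduced here, so the targets are generated from the widely used approximation 2595 log10(1 + f/700), which is an assumption of this example and not the paper's data. The function names are hypothetical.

```python
import math

def mel_log(f_hz):
    """Common logarithmic Mel approximation (stand-in target, not the paper's data)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def fit_rational(freqs, mels):
    """Fit F_M = f / (a*f + b) by ordinary least squares.

    Rearranging gives f / F_M = a*f + b, which is linear in f, so a and b
    are the slope and intercept of a straight-line fit.
    """
    ys = [f / m for f, m in zip(freqs, mels)]
    n = len(freqs)
    mean_x = sum(freqs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in freqs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(freqs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mel_rational(f_hz, a, b):
    """Evaluate the fitted rational form F_M = f / (a*f + b)."""
    return f_hz / (a * f_hz + b)

# Assumed frequency grid: 100 Hz to 8000 Hz in 100 Hz steps.
freqs = [100.0 * k for k in range(1, 81)]
mels = [mel_log(f) for f in freqs]
a, b = fit_rational(freqs, mels)

# Print the two curves side by side at a few frequencies.
for f in (250.0, 1000.0, 4000.0):
    print(f, round(mel_log(f), 1), round(mel_rational(f, a, b), 1))
```

Fitting in the f/F_M domain weights errors differently from fitting F_M directly, so the resulting a and b should be read as a starting point, not as the values reported in the paper.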

We present fast maximum likelihood (FML) estimation of parameters of multiple exponentially damped sinusoids. The FML algorithm was motivated by the desire to analyze data that have many closely spaced components, such as the NMR spectroscopy data of human blood plasma. The computational efficiency of FML lies in reducing the multidimensional search involved in ML estimation into multiple 1-D searches. This is achieved by using our knowledge of the shape of the compressed likelihood function (CLF) in the parameter space. The proposed FML algorithm is an iterative method that decomposes the original data into its constituent signal components and estimates the parameters of the individual components efficiently using our knowledge of the shape of the CLF. The other striking features of the proposed algorithm are that it provides procedures for initialization, has a fast converging iteration stage, and makes use of the information extracted in preliminary iterations to segment the data suitably to increase the effective signal-to-noise ratio (SNR). The computational complexity and the performance of the proposed algorithm are compared with other existing methods such as those based on linear prediction, KiSS/IQML, alternating projections (AP), and expectation-maximization (EM).

Estimation of parameters of exponentially damped sinusoids using fast maximum likelihood estimation with application to NMR spectroscopy data

In this paper, we study the scale transform of the spectral envelope of speech utterances by different speakers. This study is motivated by the hypothesis that the formant frequencies between different speakers are approximately related by a scaling constant for a given vowel. The scale transform has the fundamental property that the magnitudes of the scale transform of a function $X(f)$ and of its scaled version $\sqrt{\alpha}\,X(\alpha f)$ are the same. The methods presented here are useful in reducing variations in acoustic features. We show that the F-ratio tests indicate better separability of vowels by using scale-transform based features than mel-transform based features. The data used in the comparison of the different features consist of 200 utterances of four vowels that are extracted from the TIMIT database.

Scale transform in speech analysis
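The invariance property stated in the abstract above can be checked in a few lines. The derivation below assumes the usual definition of the scale transform as a Mellin-type integral over positive frequencies; the symbols $D_X(c)$ and $Y(f)$ are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align*}
D_X(c) &= \frac{1}{\sqrt{2\pi}} \int_0^\infty X(f)\, f^{-jc - 1/2}\, df,
\qquad Y(f) = \sqrt{\alpha}\, X(\alpha f), \quad \alpha > 0 \\
D_Y(c) &= \frac{1}{\sqrt{2\pi}} \int_0^\infty \sqrt{\alpha}\, X(\alpha f)\, f^{-jc - 1/2}\, df \\
       &= \frac{\sqrt{\alpha}}{\sqrt{2\pi}} \int_0^\infty X(u)
          \left(\frac{u}{\alpha}\right)^{-jc - 1/2} \frac{du}{\alpha}
          \qquad (u = \alpha f) \\
       &= \alpha^{1/2}\, \alpha^{jc + 1/2}\, \alpha^{-1}\, D_X(c)
        = \alpha^{jc}\, D_X(c) \\
|D_Y(c)| &= |\alpha^{jc}|\, |D_X(c)| = |D_X(c)|
\end{align*}
```

Since $\alpha^{jc} = e^{jc \ln \alpha}$ has unit magnitude, a speaker-dependent scaling of the frequency axis changes only the phase of the transform, which is why magnitude-based features are insensitive to it.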

Vocal tract length normalisation (VTLN) is a commonly used speaker normalisation approach. It is attractive compared to many normalisation schemes as it is typically dependent on only a single parameter, allowing the warp factors to be robustly calculated on little data. However, the scheme normally requires explicitly coding the data at multiple warp factors. Furthermore, it is only possible to approximate the Jacobian associated with the VTLN transformation. A new, simple, linear approximation to VTLN is described in this paper. This linear approximation allows the Jacobian to be exactly computed. It can also be highly efficient in terms of warp factor estimation and application of the warp factors. Both the linear and standard CUED VTLN schemes were evaluated in the 2003 BNE evaluation framework and found to yield similar performance. When used in system combination both VTLN schemes yielded slight gains over the baseline system.

Using VTLN for broadcast news transcription.
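To make the "single parameter" and "multiple warp factors" remarks in the abstract above concrete, here is a minimal sketch of a piecewise-linear frequency warp, one commonly used form of VTLN. It is not the CUED scheme or the linear approximation proposed in the paper, whose details are not given in the abstract. The function name, the breakpoint ratio, the Nyquist frequency, the warp-factor grid, and the convention that alpha multiplies frequency are all assumptions of this example.

```python
def warp_frequency(f_hz, alpha, f_nyquist=8000.0, cut_ratio=0.8):
    """Piecewise-linear frequency warp controlled by a single factor alpha.

    Below the breakpoint the axis is scaled by alpha; above it a second
    linear segment maps the Nyquist frequency onto itself, so the warped
    axis still spans [0, f_nyquist].
    """
    f_cut = cut_ratio * f_nyquist
    if alpha * f_cut >= f_nyquist:
        raise ValueError("alpha too large for this breakpoint")
    if f_hz <= f_cut:
        return alpha * f_hz
    slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    return alpha * f_cut + slope * (f_hz - f_cut)

# Assumed grid of candidate warp factors. A conventional scheme would code
# the data once per candidate and keep the factor with the best likelihood.
warp_grid = [round(0.88 + 0.02 * k, 2) for k in range(13)]

# Show how three frequencies move under compression, identity, and expansion.
for alpha in (0.9, 1.0, 1.1):
    print(alpha, [round(warp_frequency(f, alpha), 1) for f in (1000.0, 4000.0, 7000.0)])
```

Re-coding the data for every entry in the grid is the cost the abstract refers to; a linear-transform approximation avoids it by applying the warp in the feature domain.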

Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise robust speech recognition. The performance of CMVN is known to degrade for short utterances, due to insufficient data for parameter estimation and loss of discriminable information as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of mean and variance in CMVN, instead of the maximum likelihood estimates. This Bayesian approach, in addition to providing a robust estimate of parameters, is also shown to preserve discriminable information without increase in computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach w.r.t. Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% with the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% with the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% with the Aurora4 database.
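The contrast between maximum-likelihood and posterior estimates described above can be sketched as follows. The abstract does not give the estimator, so this example assumes a simple shrinkage form in which the utterance statistics are blended with prior statistics using a prior weight `tau`. The function name, the parameter names, and the simplified variance update are all assumptions and should not be read as the paper's method.

```python
import math

def cmvn(frames, prior_mean=None, prior_var=None, tau=0.0):
    """Normalise each feature dimension of one utterance.

    frames: list of equal-length feature vectors (lists of floats).
    tau = 0 gives ordinary maximum-likelihood CMVN. tau > 0 shrinks the
    utterance mean and variance toward prior_mean / prior_var (assumed to
    come from training data), which matters most when there are few frames.
    """
    n = len(frames)
    dim = len(frames[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        col = [fr[d] for fr in frames]
        ml_mean = sum(col) / n
        ml_var = sum((x - ml_mean) ** 2 for x in col) / n
        if tau > 0.0:
            # Assumed shrinkage estimates; a simplified stand-in for a posterior.
            mean = (tau * prior_mean[d] + n * ml_mean) / (tau + n)
            var = (tau * prior_var[d] + n * ml_var) / (tau + n)
        else:
            mean, var = ml_mean, ml_var
        std = math.sqrt(max(var, 1e-10))  # floor guards against zero variance
        for t in range(n):
            out[t][d] = (col[t] - mean) / std
    return out

# A very short "utterance" of two 2-dimensional frames.
short = [[1.0, 10.0], [3.0, 14.0]]
ml_out = cmvn(short)
map_out = cmvn(short, prior_mean=[0.0, 0.0], prior_var=[1.0, 1.0], tau=8.0)
print(ml_out)
print(map_out)
```

With `tau = 0` the two frames are forced to zero mean and unit variance, whatever their original offset. With the prior switched on, the output keeps part of that offset, which is the kind of discriminable information the abstract says is lost by plain CMVN on short utterances.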

Srinivasan Umesh

Papers

Fitting the Mel scale

Estimation of parameters of exponentially damped sinusoids using fast maximum likelihood estimation with application to NMR spectroscopy data

Scale transform in speech analysis

Using VTLN for broadcast news transcription.

Improved cepstral mean and variance normalization using Bayesian framework