scispace - formally typeset

Cepstral Mean and Variance Normalization

About: Cepstral Mean and Variance Normalization is a research topic. Over its lifetime, 65 publications have been published on this topic, receiving 7,068 citations.
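The technique named in the topic title is simple to state: per utterance, subtract the mean and divide by the standard deviation of each cepstral coefficient across frames, which removes stationary convolutional (channel) effects and equalizes dynamic range. A minimal sketch in Python with NumPy; the frame matrix, coefficient count, and epsilon guard below are illustrative assumptions, not taken from any specific paper on this page:

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    features: (num_frames, num_coeffs) matrix of cepstral coefficients.
    Returns features shifted to zero mean and scaled to unit variance
    per coefficient; eps guards against division by zero.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize a synthetic 100-frame, 13-coefficient utterance.
utt = np.random.randn(100, 13) * 3.0 + 5.0
norm = cmvn(utt)
```

Dropping the variance step (dividing) recovers plain cepstral mean subtraction, which several of the papers below discuss.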


Papers
Journal ArticleDOI
TL;DR: The theoretical and experimental foundations of the RASTA method are reviewed, the relationship with human auditory perception is discussed, the original method is extended to combinations of additive noise and convolutional noise, and an application is shown to speech enhancement.
Abstract: Performance of even the best current stochastic recognizers severely degrades in an unexpected communications environment. In some cases, the environmental effect can be modeled by a set of simple transformations and, in particular, by convolution with an environmental impulse response and the addition of some environmental noise. Often, the temporal properties of these environmental effects are quite different from the temporal properties of speech. We have been experimenting with filtering approaches that attempt to exploit these differences to produce robust representations for speech recognition and enhancement and have called this class of representations relative spectra (RASTA). In this paper, we review the theoretical and experimental foundations of the method, discuss the relationship with human auditory perception, and extend the original method to combinations of additive noise and convolutional noise. We discuss the relationship between RASTA features and the nature of the recognition models that are required and the relationship of these features to delta features and to cepstral mean subtraction. Finally, we show an application of the RASTA technique to speech enhancement.

2,002 citations
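The filtering approach the RASTA abstract describes amounts to band-pass filtering each temporal trajectory of the log-spectral (or cepstral) parameters, so that slowly varying convolutional offsets and very fast frame-to-frame fluctuations are both suppressed. A sketch under stated assumptions: the commonly cited transfer function H(z) = 0.1(2 + z⁻¹ - z⁻³ - 2z⁻⁴)/(1 - p·z⁻¹) is used, and the pole value p = 0.94 is a tuning assumption, not a definitive choice:

```python
import numpy as np

def rasta_filter(traj, pole=0.94):
    """RASTA-style band-pass filtering of one temporal trajectory
    (e.g. one log-spectral band sampled once per frame).

    FIR numerator 0.1*[2, 1, 0, -1, -2] (a ramp, sum zero, so DC is
    rejected) combined with a single slow pole on the feedback path.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    y = np.zeros_like(traj)
    for n in range(len(traj)):
        acc = sum(b[k] * traj[n - k] for k in range(5) if n - k >= 0)
        if n > 0:
            acc += pole * y[n - 1]  # IIR feedback term
        y[n] = acc
    return y
```

Because the numerator coefficients sum to zero, a constant additive offset in the log-spectral domain (i.e. a fixed convolutional channel) decays out of the filtered trajectory, which is the core of the robustness argument above.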

Proceedings Article
01 Jan 2000
TL;DR: A database designed to evaluate the performance of speech recognition algorithms in noisy conditions is described, and recognition results are presented for the first standard DSR feature extraction scheme, which is based on a cepstral analysis.
Abstract: This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions. The database may be used either for the evaluation of front-end feature extraction algorithms using a defined HMM recognition back-end or for complete recognition systems. The source speech for this database is TIdigits, a connected-digits task spoken by American English talkers (downsampled to 8 kHz). A selection of 8 different real-world noises has been added to the speech over a range of signal-to-noise ratios, and special care has been taken to control the filtering of both the speech and noise. The framework was prepared as a contribution to the ETSI STQ-AURORA DSR Working Group [1]. Aurora is developing standards for Distributed Speech Recognition (DSR), where the speech analysis is done in the telecommunication terminal and the recognition at a central location in the telecom network. The framework is currently being used to evaluate alternative proposals for front-end feature extraction. The database has been made publicly available through ELRA so that other speech researchers can evaluate and compare the performance of noise-robust algorithms. Recognition results are presented for the first standard DSR feature extraction scheme, which is based on a cepstral analysis.

1,909 citations

Journal ArticleDOI
TL;DR: In this paper, cepstrum coefficients are extracted by LPC analysis successively throughout a fixed, sentence-long utterance to form time functions, and frequency-response distortions introduced by transmission systems are removed.
Abstract: This paper describes new techniques for automatic speaker verification using telephone speech. The operation of the system is based on a set of functions of time obtained from acoustic analysis of a fixed, sentence-long utterance. Cepstrum coefficients are extracted by means of LPC analysis successively throughout an utterance to form time functions, and frequency response distortions introduced by transmission systems are removed. The time functions are expanded by orthogonal polynomial representations and, after a feature selection procedure, brought into time registration with stored reference functions to calculate the overall distance. This is accomplished by a new time warping method using a dynamic programming technique. A decision is made to accept or reject an identity claim, based on the overall distance. Reference functions and decision thresholds are updated for each customer. Several sets of experimental utterances were used for the evaluation of the system, including male and female utterances recorded over a conventional telephone connection. Male utterances processed by ADPCM and LPC coding systems were used together with unprocessed utterances. Results of the experiment indicate that a verification error rate of one percent or less can be obtained even if the reference and test utterances are subjected to different transmission conditions.

1,187 citations
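The "time warping method using a dynamic programming technique" in the abstract above is what is now usually called dynamic time warping (DTW): align two feature sequences of different lengths and accumulate frame-to-frame distances along the best path. A minimal sketch; the Euclidean frame distance and the basic symmetric step pattern are simplified illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def dtw_distance(ref, test):
    """Overall distance between two feature-vector sequences after
    dynamic-programming time alignment (basic DTW).

    ref, test: (frames, coeffs) arrays. Returns the accumulated
    Euclidean frame distance along the minimum-cost warping path.
    """
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch ref
                                 cost[i, j - 1],      # stretch test
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]
```

A time-stretched copy of a sequence aligns at near-zero cost, while a genuinely different sequence does not, which is exactly the property the accept/reject decision relies on.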

Journal ArticleDOI
TL;DR: The cepstrum was found to be the most effective, providing an identification accuracy of 70% for speech 50 msec in duration, which increased to more than 98% for a duration of 0.5 sec.
Abstract: Several different parametric representations of speech derived from the linear prediction model are examined for their effectiveness for automatic recognition of speakers from their voices. Twelve predictor coefficients were determined approximately once every 50 msec from speech sampled at 10 kHz. The predictor coefficients and other speech parameters derived from them, such as the impulse response function, the autocorrelation function, the area function, and the cepstrum function were used as input to an automatic speaker‐recognition system. The speech data consisted of 60 utterances, consisting of six repetitions of the same sentence spoken by 10 speakers. The identification decision was based on the distance of the test sample vector from the reference vector for different speakers in the population; the speaker corresponding to the reference vector with the smallest distance was judged to be the unknown speaker. In verification, the speaker was verified if the distance between the test sample vector and the reference vector for the claimed speaker was less than a fixed threshold. Among all the parameters investigated, the cepstrum was found to be the most effective, providing an identification accuracy of 70% for speech 50 msec in duration, which increased to more than 98% for a duration of 0.5 sec. Using the same speech data, the verification accuracy was found to be approximately 83% for a duration of 50 msec, increasing to 98% for a duration of 1 sec. In a separate study to determine the feasibility of text‐independent speaker identification, an identification accuracy of 93% was achieved for speech 2 sec in duration even though the texts of the test and reference samples were different.

984 citations
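The two decision rules described above, identification by the nearest reference vector and verification by a fixed distance threshold, can be sketched directly; the two-dimensional feature vectors, speaker names, and threshold value below are illustrative assumptions:

```python
import numpy as np

def identify(test_vec, references):
    """Minimum-distance identification: return the speaker whose
    reference vector is closest to the test sample."""
    return min(references,
               key=lambda spk: np.linalg.norm(test_vec - references[spk]))

def verify(test_vec, references, claimed, threshold):
    """Accept the identity claim iff the distance to the claimed
    speaker's reference vector is below a fixed threshold."""
    return np.linalg.norm(test_vec - references[claimed]) < threshold

# Toy reference vectors standing in for averaged cepstral features.
refs = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
sample = np.array([0.9, 0.1])
```

In the paper's setup the vectors are derived from cepstral analysis and the thresholds are updated per customer; here they are fixed for brevity.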

Journal ArticleDOI
Olli Viikki, Kari Laurila
TL;DR: A segmental feature vector normalization technique is proposed which makes an automatic speech recognition system more robust to environmental changes by normalizing the output of the signal-processing front-end to have similar segmental parameter statistics in all noise conditions.

405 citations
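The segmental normalization in the TL;DR above can be approximated by estimating mean and variance over a sliding window of neighbouring frames rather than the whole utterance, so the statistics track changing noise conditions. A sketch under that assumption; the window length is an illustrative choice, not the paper's parameter:

```python
import numpy as np

def segmental_cmvn(features, window=100, eps=1e-10):
    """Normalize each frame with mean/variance estimated over a sliding
    window of frames centred on it, so the normalization adapts as the
    acoustic environment changes (contrast with per-utterance CMVN,
    which uses one global estimate).
    """
    out = np.empty_like(features)
    half = window // 2
    for t in range(len(features)):
        seg = features[max(0, t - half): t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + eps)
    return out
```

Because every noise condition is mapped to similar local parameter statistics, the recognizer sees a more stable front-end output, which is the robustness claim of the paper.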

Network Information
Related Topics (5)
Speech processing: 24.2K papers, 637K citations (82% related)
Hidden Markov model: 28.3K papers, 725.3K citations (79% related)
Facial recognition system: 38.7K papers, 883.4K citations (73% related)
Feature vector: 48.8K papers, 954.4K citations (73% related)
Feature (machine learning): 33.9K papers, 798.7K citations (73% related)
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2021  1
2020  6
2019  3
2018  1
2017  8
2016  3