Journal ArticleDOI

Robust speaker recognition: a feature-based approach

01 Jan 1996-IEEE Signal Processing Magazine (IEEE)-Vol. 13, Iss: 5, pp 58-71
TL;DR: Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described, including the affine transform, a feature transformation approach that models the mismatch to combat both channel and noise distortion simultaneously.
Abstract: The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions. This is known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates the mismatch model to simultaneously combat both channel and noise distortion.
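The LP-to-cepstrum conversion the abstract refers to has a well-known recursion: given the predictor coefficients of an all-pole model, the LP cepstrum follows directly without a Fourier transform. A minimal sketch of that recursion (function and variable names are mine, not from the article):

```python
# Hypothetical sketch: converting LP predictor coefficients to LP-derived
# cepstral coefficients via the standard recursion. Assumes the all-pole
# model H(z) = G / (1 - sum_k a[k] z^-k) with unit gain.

def lpc_to_cepstrum(a, n_ceps):
    """a[0..p-1] hold predictor coefficients a_1..a_p; returns c_1..c_n."""
    p = len(a)
    c = [0.0] * n_ceps
    for n in range(1, n_ceps + 1):
        # c_n = a_n + sum_{k=max(1, n-p)}^{n-1} (k/n) c_k a_{n-k}
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a single-pole model H(z) = 1/(1 - a z^-1) this reproduces the closed form c(n) = a^n / n, which is a quick sanity check on the recursion.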
Citations
Journal ArticleDOI
01 Sep 1997
TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented, and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Abstract: A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.

1,686 citations


Cites background from "Robust speaker recognition: a featu..."

  • ...Speaker-recognition systems can be made somewhat robust against noise and channel variations [33], [49], ordinary human changes (e.g., time-of-day voice changes and minor head colds), and mimicry by humans and tape recorders [22]....


Journal ArticleDOI
01 Oct 1980

1,565 citations

Journal ArticleDOI
TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, and then elaborates on advanced computational techniques to address robustness and session variability.

1,433 citations


Cites background or methods from "Robust speaker recognition: a featu..."

  • ...The graph structure was motivated by invariance against the affine feature distortion model for cepstral features (e.g. Mak and Tsang, 2004; Mammone et al., 1996)....


  • ...It is more desirable to design a feature extractor which is itself robust (Mammone et al., 1996), or to normalize the features before feeding them into the modeling or matching algorithms....


  • ...Linear prediction (LP) (Makhoul, 1975; Mammone et al., 1996) is an alternative spectrum estimation method to DFT that has good intuitive interpretation both in time domain (adjacent samples are correlated) and frequency domain (all-pole spectrum corresponding to the resonance structure)....

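The autocorrelation method mentioned in the snippet above solves the LP normal equations with the Levinson-Durbin recursion. A minimal sketch under that assumption (names are mine; the prediction convention is x_n ≈ sum_j a_j x_{n-j}):

```python
# Hypothetical sketch: Levinson-Durbin recursion for autocorrelation-method
# LP analysis. r[0..p] is the autocorrelation sequence of one frame.

def levinson_durbin(r, p):
    """Returns (a, err): predictor coefficients a_1..a_p and residual energy."""
    a = [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / err
        new_a = a[:]
        new_a[i - 1] = k
        for j in range(i - 1):
            new_a[j] = a[j] - k * a[i - 2 - j]   # order-update of lower coeffs
        a = new_a
        err *= (1.0 - k * k)                     # prediction error shrinks
    return a, err
```

Feeding it the autocorrelation of an AR(1) process (r = [1, 0.5, 0.25]) recovers the single true pole at 0.5 with the second coefficient at zero, which is the expected behavior of the recursion.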

Journal ArticleDOI
TL;DR: In this study the optic disc, blood vessels, and fovea were accurately detected and the identification of the normal components of the retinal image will aid the future detection of diseases in these regions.
Abstract: Aim—To recognise automatically the main components of the fundus on digital colour images. Methods—The main features of a fundus retinal image were defined as the optic disc, fovea, and blood vessels. Methods are described for their automatic recognition and location. 112 retinal images were preprocessed via adaptive, local, contrast enhancement. The optic discs were located by identifying the area with the highest variation in intensity of adjacent pixels. Blood vessels were identified by means of a multilayer perceptron neural net, for which the inputs were derived from a principal component analysis (PCA) of the image and edge detection of the first component of PCA. The foveas were identified using matching correlation together with characteristics typical of a fovea—for example, darkest area in the neighbourhood of the optic disc. The main components of the image were identified by an experienced ophthalmologist for comparison with computerised methods. Results—The sensitivity and specificity of the recognition of each retinal main component was as follows: 99.1% and 99.1% for the optic disc; 83.3% and 91.0% for blood vessels; 80.4% and 99.1% for the fovea. Conclusions—In this study the optic disc, blood vessels, and fovea were accurately detected. The identification of the normal components of the retinal image will aid the future detection of diseases in these regions. In diabetic retinopathy, for example, an image could be analysed for retinopathy with reference to sight threatening complications such as disc neovascularisation, vascular changes, or foveal exudation. (Br J Ophthalmol 1999;83:902-910)
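The optic-disc localisation idea in this abstract (pick the region with the highest variation in intensity of adjacent pixels) reduces to a sliding-window variance search. A minimal sketch of that idea only (not the paper's implementation; names are mine):

```python
# Hypothetical sketch: locate the w x w window with the highest intensity
# variance, as a stand-in for the "highest variation" optic-disc criterion.

def highest_variance_window(img, w):
    """img: 2D list of grey levels; returns (row, col) of the best window."""
    best, best_pos = -1.0, (0, 0)
    rows, cols = len(img), len(img[0])
    for r in range(rows - w + 1):
        for c in range(cols - w + 1):
            vals = [img[r + i][c + j] for i in range(w) for j in range(w)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            if var > best:
                best, best_pos = var, (r, c)
    return best_pos
```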

846 citations

Journal ArticleDOI
TL;DR: This review of speaker recognition by machines and humans concludes with a comparative study of human versus machine speaker recognition, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition, with ever-improving performance, to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts: the first part involves forensic speaker-recognition methods, and the second illustrates how a naïve listener performs this task from a neuroscience perspective.
We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.

554 citations

References
Book
01 Jan 1993
TL;DR: This book covers the fundamentals of speech recognition, from speech production and perception through signal processing and pattern comparison techniques to hidden Markov models and large-vocabulary continuous speech recognition.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


"Robust speaker recognition: a featu..." refers background or methods in this paper

  • ...The first derivative of the cepstrum (also known as the delta cepstrum) is defined as [17]...


  • ...w(n) = n for n = 1, 2, ..., L and w(n) = 0 otherwise, and bandpass liftering (BPL) [17, 26] where...


  • ...The autocorrelation method and the covariance method are two standard methods of solving for the predictor coefficients [12, 17]....


  • ...the unit circle and the gain is 1, the causal LP cepstrum cg(n) of H(z) is given by [17, 23, 24]....

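The delta cepstrum quoted above is, in common practice, computed as a regression over a window of neighbouring frames rather than a raw frame difference. A minimal sketch of that regression form (the +/-K window and edge replication are my assumptions, not quoted from the paper):

```python
# Hypothetical sketch: regression-based delta cepstrum. For each frame t and
# coefficient i, fit the local slope over frames t-K .. t+K.

def delta(ceps, K=2):
    """ceps: list of frames, each a list of cepstral coefficients."""
    denom = 2 * sum(k * k for k in range(1, K + 1))
    T = len(ceps)
    out = []
    for t in range(T):
        frame = []
        for i in range(len(ceps[0])):
            # Edge frames are replicated via min/max clamping.
            num = sum(k * (ceps[min(T - 1, t + k)][i] - ceps[max(0, t - k)][i])
                      for k in range(1, K + 1))
            frame.append(num / denom)
        out.append(frame)
    return out
```

On a perfectly linear cepstral trajectory the interior deltas equal the true slope, which is an easy way to check the normalisation constant.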

Journal ArticleDOI
S. Boll1
TL;DR: A stand-alone noise suppression algorithm is presented that resynthesizes a speech waveform and can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
Abstract: A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital wave-form. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
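The core spectral-subtraction step described in this abstract (subtract a noise bias, estimated during non-speech activity, from each frame's spectrum) can be sketched in a few lines. This is a minimal magnitude-domain illustration with half-wave rectification, not Boll's full algorithm (which adds residual-noise attenuation and resynthesis); the DFT helper and names are mine:

```python
# Hypothetical sketch of magnitude spectral subtraction: subtract a noise
# magnitude estimate from each frame's spectrum and clamp negatives to zero.
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def spectral_subtract(frame, noise_mag):
    """Return the denoised magnitude spectrum of one time-domain frame."""
    spec = dft(frame)
    # Half-wave rectification: negative magnitudes are not physical.
    return [max(abs(X) - Nk, 0.0) for X, Nk in zip(spec, noise_mag)]
```

A full implementation would keep the noisy phase and inverse-transform the result to resynthesize the waveform, as the abstract notes.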

4,862 citations

Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent general speaker-dependent spectral shapes that are effective for modeling speaker identity, and the GMM is shown to outperform the other speaker-modeling techniques on an identical 16-speaker telephone-speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
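The identification rule behind the GMM approach is simple: score the test frames against each enrolled speaker's mixture and pick the speaker with the highest total log-likelihood. A minimal one-dimensional sketch of that decision rule (training via EM is omitted; all names and the toy models are mine):

```python
# Hypothetical sketch of GMM-based speaker identification: sum per-frame
# log-likelihoods under each speaker's mixture and take the argmax.
import math

def gmm_loglik(x, gmm):
    """gmm: list of (weight, mean, var) tuples for 1-D Gaussian components."""
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in gmm)
    return math.log(p)

def identify(frames, speaker_gmms):
    """frames: list of 1-D features; speaker_gmms: dict name -> GMM."""
    scores = {spk: sum(gmm_loglik(x, g) for x in frames)
              for spk, g in speaker_gmms.items()}
    return max(scores, key=scores.get)
```

In practice the features are multi-dimensional cepstral vectors and the mixtures use diagonal covariances, but the argmax-of-summed-log-likelihoods decision is the same.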

3,134 citations


"Robust speaker recognition: a featu..." refers methods in this paper

  • ...Methods such as probabilistic optimum filtering [6], Gaussian mixture model (GMM) [7], and hidden Markov model (HMM) adaptation [8] all fall into this category....


Book
01 Jan 1960

3,119 citations

Book
05 Sep 1978
TL;DR: This book covers digital speech processing for man-machine communication by voice, from digital models of the speech signal through short-time Fourier analysis and homomorphic processing to linear predictive coding.
Abstract: 1. Introduction. 2. Fundamentals of Digital Speech Processing. 3. Digital Models for the Speech Signal. 4. Time-Domain Models for Speech Processing. 5. Digital Representation of the Speech Waveform. 6. Short-Time Fourier Analysis. 7. Homomorphic Speech Processing. 8. Linear Predictive Coding of Speech. 9. Digital Speech Processing for Man-Machine Communication by Voice.

3,103 citations