Author

Xiaoyu Zhang

Bio: Xiaoyu Zhang is an academic researcher from Rutgers University. The author has contributed to research in topics: Speaker recognition & Speech processing. The author has an h-index of 6, co-authored 10 publications receiving 615 citations.

Papers
Journal ArticleDOI
TL;DR: Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described, including the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
Abstract: The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions. This is known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
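The LP-to-cepstrum chain the abstract outlines can be sketched in a few lines. The following is a minimal illustration, assuming a single pre-emphasized, windowed frame; the function names, model order, and the affine parameters A and b are placeholders, not the authors' implementation.

```python
import numpy as np

def lp_coefficients(frame, order=12):
    """LP analysis via the autocorrelation method (Levinson-Durbin).
    Returns a with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k                  # residual prediction error
    return a

def lp_cepstrum(a, n_ceps=12):
    """Standard recursion from LP coefficients to LP cepstral coefficients."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def affine_compensate(c, A, b):
    """Affine feature transform c' = A c + b; estimating A and b from
    mismatched (noise- or channel-corrupted) data is not shown here."""
    return A @ c + b
```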

344 citations

Patent
07 Jun 1995
TL;DR: In this article, a pattern recognition system which uses data fusion to combine data from a plurality of extracted features and a plurality of classifiers is presented; discriminant-based and distortion-based classifiers are combined to accurately verify speaker patterns.
Abstract: The present invention relates to a pattern recognition system which uses data fusion to combine data from a plurality of extracted features and a plurality of classifiers. Speaker patterns can be accurately verified with the combination of discriminant-based and distortion-based classifiers. A novel approach using a training set of "leave one out" data can be used for training the system with a reduced data set. Extracted features can be improved with a pole-filtered method for reducing channel effects and an affine transformation for improving the correlation between training and testing data.
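A minimal sketch of the score-level fusion idea, assuming one discriminant-based score (higher is better) and one distortion-based score (lower is better); the fusion weight and normalization statistics are placeholders that a real system would estimate on held-out data, for example with the "leave one out" resampling the abstract mentions.

```python
def fuse(discriminant_score, distortion, w=0.5,
         d_stats=(0.0, 1.0), q_stats=(0.0, 1.0)):
    """Combine a discriminant-based classifier score (higher = more
    speaker-like) with a distortion-based score (lower = more
    speaker-like) after z-normalization. d_stats and q_stats are
    (mean, std) pairs estimated on development data."""
    s_disc = (discriminant_score - d_stats[0]) / d_stats[1]
    s_dist = -(distortion - q_stats[0]) / q_stats[1]  # flip: higher = better
    return w * s_disc + (1.0 - w) * s_dist

# Accept the identity claim only if the fused evidence clears a
# threshold tuned on development data.
accept = fuse(discriminant_score=2.1, distortion=0.4) > 0.0
```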

76 citations

Patent
08 Jan 2002
TL;DR: In this article, a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language is presented.
Abstract: The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units without any linguistic knowledge of the password. Subword modeling is performed using multiple classifiers. The system also takes advantage of such concepts as multiple classifier fusion and data resampling to boost performance. Key word/key phrase spotting is used to optimally locate the password phrase. Numerous adaptation techniques increase the flexibility of the base system, and include: channel adaptation, fusion adaptation, model adaptation and threshold adaptation.
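One simple stand-in for the blind segmentation step, assuming a frames-by-dimensions feature matrix: place subword boundaries at the largest frame-to-frame spectral changes, with no linguistic knowledge of the password. This is an illustrative heuristic, not the patented algorithm.

```python
import numpy as np

def blind_segment(features, n_units=4, min_len=3):
    """Split a (frames x dims) feature matrix into subword units at the
    biggest spectral-change points. A real system would refine these
    boundaries, e.g. by iterative resegmentation against subword models."""
    # Euclidean distance between consecutive frames as a change function.
    change = np.linalg.norm(np.diff(features, axis=0), axis=1)
    order = np.argsort(change)[::-1]          # largest changes first
    boundaries = []
    for idx in order:
        b = int(idx) + 1
        # Enforce a minimum segment length so units are not degenerate.
        if all(abs(b - x) >= min_len for x in boundaries + [0, len(features)]):
            boundaries.append(b)
        if len(boundaries) == n_units - 1:
            break
    return np.split(features, sorted(boundaries))
```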

60 citations

Patent
21 Nov 1997
TL;DR: In this article, a text-dependent automatic speaker verification voiceprint system embodies a capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language.
Abstract: The subword-based, text-dependent automatic speaker verification voiceprint system embodies a capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units (210) without any linguistic knowledge of the password. Subword modeling is performed using multiple classifiers (240, 250). The system also takes advantage of such concepts as multiple classifier fusion (260) and data resampling to successfully boost the performance. Key word/key phrase spotting (200) is used to optimally locate the password phrase. Numerous adaptation techniques increase the flexibility of the base system, and include: channel adaptation (180), fusion adaptation (290), model adaptation (220, 230) and threshold adaptation (295).

57 citations

PatentDOI
TL;DR: The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language.
Abstract: The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units without any linguistic knowledge of the password. Subword modeling is performed using multiple classifiers. The system also takes advantage of such concepts as multiple classifier fusion and data resampling to boost performance. Key word/key phrase spotting is used to optimally locate the password phrase. Numerous adaptation techniques increase the flexibility of the base system, and include: channel adaptation, fusion adaptation, model adaptation and threshold adaptation.

41 citations


Cited by
Journal ArticleDOI
01 Sep 1997
TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented, and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Abstract: A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown, and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
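The two operating modes the tutorial distinguishes reduce to an argmax versus a threshold test; a toy sketch, assuming per-speaker model scores are already computed (names and the threshold are placeholders):

```python
import numpy as np

def identify(scores):
    """Identification: pick the enrolled speaker whose model scores highest."""
    return int(np.argmax(scores))

def verify(claimed_score, threshold):
    """Verification: accept or reject a claimed identity by thresholding
    the score of the claimed speaker's model."""
    return claimed_score >= threshold
```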

1,686 citations

Journal ArticleDOI
01 Oct 1980

1,565 citations

Journal ArticleDOI
TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, and elaborates advanced computational techniques to address robustness and session variability.

1,433 citations

Journal ArticleDOI
TL;DR: In this study, the optic disc, blood vessels, and fovea were accurately detected, and the identification of the normal components of the retinal image will aid the future detection of diseases in these regions.
Abstract: Aim: To recognise automatically the main components of the fundus on digital colour images. Methods: The main features of a fundus retinal image were defined as the optic disc, fovea, and blood vessels. Methods are described for their automatic recognition and location. 112 retinal images were preprocessed via adaptive, local contrast enhancement. The optic discs were located by identifying the area with the highest variation in intensity of adjacent pixels. Blood vessels were identified by means of a multilayer perceptron neural net, for which the inputs were derived from a principal component analysis (PCA) of the image and edge detection of the first component of PCA. The foveas were identified using matching correlation together with characteristics typical of a fovea, for example, the darkest area in the neighbourhood of the optic disc. The main components of the image were identified by an experienced ophthalmologist for comparison with the computerised methods. Results: The sensitivity and specificity of the recognition of each retinal main component were as follows: 99.1% and 99.1% for the optic disc; 83.3% and 91.0% for blood vessels; 80.4% and 99.1% for the fovea. Conclusions: In this study the optic disc, blood vessels, and fovea were accurately detected. The identification of the normal components of the retinal image will aid the future detection of diseases in these regions. In diabetic retinopathy, for example, an image could be analysed for retinopathy with reference to sight-threatening complications such as disc neovascularisation, vascular changes, or foveal exudation. (Br J Ophthalmol 1999;83:902-910)
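A minimal sketch of the disc-localization criterion, assuming a 2-D grayscale image array; the window size and stride are guesses, and the paper's preprocessing (adaptive local contrast enhancement) is omitted.

```python
import numpy as np

def locate_optic_disc(gray, win=40):
    """Return the centre of the window with the highest intensity variance,
    echoing the paper's 'highest variation in intensity of adjacent
    pixels' criterion for optic-disc candidates."""
    best_var, best_pos = -1.0, (0, 0)
    rows, cols = gray.shape
    step = win // 2                          # half-window overlap
    for r in range(0, rows - win + 1, step):
        for c in range(0, cols - win + 1, step):
            v = gray[r:r + win, c:c + win].var()
            if v > best_var:
                best_var, best_pos = v, (r + step, c + step)
    return best_pos                          # (row, col) of candidate disc
```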

846 citations

Journal ArticleDOI
TL;DR: This review of speaker recognition by machines and humans concludes with a comparative study of human versus machine performance, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition, with ever-improving performance, to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts: the first part involves forensic speaker-recognition methods, and the second illustrates how a naïve listener performs this task from a neuroscience perspective. We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out the strengths and weaknesses of each.
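The performance metrics the review mentions are commonly summarized by the equal error rate (EER); a small sketch of how it can be estimated from genuine and impostor trial scores (the score arrays are placeholders):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold over the pooled scores and return the
    operating point where false-accept and false-reject rates are closest."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best_gap, eer = np.inf, 0.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```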

554 citations