
Showing papers on "Speaker recognition published in 2007"


Journal ArticleDOI
TL;DR: It is shown how the two approaches to the problem of session variability in Gaussian mixture model (GMM)-based speaker verification, eigenchannels and joint factor analysis, can be implemented using essentially the same software at all stages except for the enrollment of target speakers.
Abstract: We compare two approaches to the problem of session variability in Gaussian mixture model (GMM)-based speaker verification, eigenchannels and joint factor analysis, on the National Institute of Standards and Technology (NIST) 2005 speaker recognition evaluation data. We show how the two approaches can be implemented using essentially the same software at all stages except for the enrollment of target speakers. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. We found that factor analysis was far more effective than eigenchannel modeling. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.
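As a concrete illustration of the zt-norm score normalization mentioned in the abstract, here is a minimal numpy sketch; the array names and shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def zt_norm(raw_score, z_imp_scores, t_scores, t_imp_matrix):
    """ZT-norm a raw target-vs-test score (illustrative sketch).

    raw_score    : score of the test utterance against the target model
    z_imp_scores : scores of impostor utterances against the target model (Z-norm)
    t_scores     : scores of the test utterance against the T-norm models
    t_imp_matrix : [n_tnorm_models x n_imp_utts] scores used to Z-norm the T-norm models
    """
    # Z-norm: normalise the raw score by the target model's impostor distribution
    z_score = (raw_score - z_imp_scores.mean()) / z_imp_scores.std()
    # Z-norm each T-norm model's score on this test utterance the same way
    zt_scores = (t_scores - t_imp_matrix.mean(axis=1)) / t_imp_matrix.std(axis=1)
    # T-norm: normalise by the distribution of the (z-normed) T-norm scores
    return (z_score - zt_scores.mean()) / zt_scores.std()
```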

773 citations


Journal ArticleDOI
TL;DR: Current advances in automatic speech recognition (ASR) and spoken language systems are outlined, together with their deficiencies in dealing with the variation naturally present in speech.

507 citations


Journal ArticleDOI
TL;DR: The use of classic acoustic beamforming techniques together with several novel algorithms is proposed to create a complete frontend for speaker diarization in the meeting room domain, and improvements are also shown in a speech recognition task.
Abstract: When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
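Time delay of arrival estimation of the kind used in such beamforming frontends is commonly done with GCC-PHAT; a minimal numpy sketch, with parameters that are illustrative rather than the paper's system:

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the time delay of arrival between two channels with GCC-PHAT."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds
```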

444 citations


01 Jan 2007
TL;DR: A comparative evaluation of the presented MFCC implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
Abstract: Making no claim of being exhaustive, a review of the most popular MFCC (Mel Frequency Cepstral Coefficients) implementations is made. These differ mainly in the particular approximation of the nonlinear pitch perception of humans, the filter bank design, and the compression of the filter bank output. Then, a comparative evaluation of the presented implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.
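For orientation, a bare-bones single-frame MFCC pipeline showing the three stages the review compares: mel-scale approximation, triangular filter bank, and log compression followed by a DCT. All constants here are illustrative; real implementations differ in exactly these choices:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):                 # one popular mel approximation; variants differ
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=24, n_ceps=13, n_fft=512):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log compression of the filter-bank outputs, then DCT decorrelation
    logE = np.log(fbank @ spec + 1e-12)
    return dct(logE, type=2, norm='ortho')[:n_ceps]
```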

333 citations


Book
01 Jan 2007
TL;DR: This volume collects biometrics papers spanning face, fingerprint, iris, gait, signature, keystroke, palmprint, and speaker recognition, together with work on biometric template protection, liveness detection, and multimodal fusion.
Abstract: Face Recognition.- Super-Resolved Faces for Improved Face Recognition from Surveillance Video.- Face Detection Based on Multi-Block LBP Representation.- Color Face Tensor Factorization and Slicing for Illumination-Robust Recognition.- Robust Real-Time Face Detection Using Face Certainty Map.- Poster I.- Motion Compensation for Face Recognition Based on Active Differential Imaging.- Face Recognition with Local Gabor Textons.- Speaker Verification with Adaptive Spectral Subband Centroids.- Similarity Rank Correlation for Face Recognition Under Unenrolled Pose.- Feature Correlation Filter for Face Recognition.- Face Recognition by Discriminant Analysis with Gabor Tensor Representation.- Fingerprint Enhancement Based on Discrete Cosine Transform.- Biometric Template Classification: A Case Study in Iris Textures.- Protecting Biometric Templates with Image Watermarking Techniques.- Factorial Hidden Markov Models for Gait Recognition.- A Robust Fingerprint Matching Approach: Growing and Fusing of Local Structures.- Automatic Facial Pose Determination of 3D Range Data for Face Model and Expression Identification.- SVDD-Based Illumination Compensation for Face Recognition.- Keypoint Identification and Feature-Based 3D Face Recognition.- Fusion of Near Infrared Face and Iris Biometrics.- Multi-Eigenspace Learning for Video-Based Face Recognition.- Error-Rate Based Biometrics Fusion.- Online Text-Independent Writer Identification Based on Stroke's Probability Distribution Function.- Arm Swing Identification Method with Template Update for Long Term Stability.- Walker Recognition Without Gait Cycle Estimation.- Comparison of Compression Algorithms' Impact on Iris Recognition Accuracy.- Standardization of Face Image Sample Quality.- Blinking-Based Live Face Detection Using Conditional Random Fields.- Singular Points Analysis in Fingerprints Based on Topological Structure and Orientation Field.- Robust 3D Face Recognition from Expression Categorisation.- Fingerprint Recognition Based on Combined Features.- MQI Based Face Recognition Under Uneven Illumination.- Learning Kernel Subspace Classifier.- A New Approach to Fake Finger Detection Based on Skin Elasticity Analysis.- An Algorithm for Biometric Authentication Based on the Model of Non-Stationary Random Processes.- Identity Verification by Using Handprint.- Reducing the Effect of Noise on Human Contour in Gait Recognition.- Partitioning Gait Cycles Adaptive to Fluctuating Periods and Bad Silhouettes.- Repudiation Detection in Handwritten Documents.- A New Forgery Scenario Based on Regaining Dynamics of Signature.- Curvewise DET Confidence Regions and Pointwise EER Confidence Intervals Using Radial Sweep Methodology.- Bayesian Hill-Climbing Attack and Its Application to Signature Verification.- Wolf Attack Probability: A New Security Measure in Biometric Authentication Systems.- Evaluating the Biometric Sample Quality of Handwritten Signatures.- Outdoor Face Recognition Using Enhanced Near Infrared Imaging.- Latent Identity Variables: Biometric Matching Without Explicit Identity Estimation.- Poster II.- 2^N Discretisation of BioPhasor in Cancellable Biometrics.- Probabilistic Random Projections and Speaker Verification.- On Improving Interoperability of Fingerprint Recognition Using Resolution Compensation Based on Sensor Evaluation.- Demographic Classification with Local Binary Patterns.- Distance Measures for Gabor Jets-Based Face Authentication: A Comparative Evaluation.- Fingerprint Matching with an Evolutionary Approach.- Stability Analysis of Constrained Nonlinear Phase Portrait Models of Fingerprint Orientation Images.- Effectiveness of Pen Pressure, Azimuth, and Altitude Features for Online Signature Verification.- Tracking and Recognition of Multiple Faces at Distances.- Face Matching Between Near Infrared and Visible Light Images.- User Classification for Keystroke Dynamics Authentication.- Statistical Texture Analysis-Based Approach for Fake Iris Detection Using Support Vector Machines.- A Novel Null Space-Based Kernel Discriminant Analysis for Face Recognition.- Changeable Face Representations Suitable for Human Recognition.- "3D Face": Biometric Template Protection for 3D Face Recognition.- Quantitative Evaluation of Normalization Techniques of Matching Scores in Multimodal Biometric Systems.- Keystroke Dynamics in a General Setting.- A New Approach to Signature-Based Authentication.- Biometric Fuzzy Extractors Made Practical: A Proposal Based on FingerCodes.- On the Use of Log-Likelihood Ratio Based Model-Specific Score Normalisation in Biometric Authentication.- Predicting Biometric Authentication System Performance Across Different Application Conditions: A Bootstrap Enhanced Parametric Approach.- Selection of Distinguish Points for Class Distribution Preserving Transform for Biometric Template Protection.- Minimizing Spatial Deformation Method for Online Signature Matching.- Pan-Tilt-Zoom Based Iris Image Capturing System for Unconstrained User Environments at a Distance.- Fingerprint Matching with Minutiae Quality Score.- Uniprojective Features for Gait Recognition.- Cascade MR-ASM for Locating Facial Feature Points.- Reconstructing a Whole Face Image from a Partially Damaged or Occluded Image by Multiple Matching.- Robust Hiding of Fingerprint-Biometric Data into Audio Signals.- Correlation-Based Fingerprint Matching with Orientation Field Alignment.- Vitality Detection from Fingerprint Images: A Critical Survey.- Optimum Detection of Multiplicative-Multibit Watermarking for Fingerprint Images.- Fake Finger Detection Based on Thin-Plate Spline Distortion Model.- Robust Extraction of Secret Bits from Minutiae.- Fuzzy Extractors for Minutiae-Based Fingerprint Authentication.- Coarse Iris Classification by Learned Visual Dictionary.- Nonlinear Iris Deformation Correction Based on Gaussian Model.- Shape Analysis of Stroma for Iris Recognition.- Biometric Key Binding: Fuzzy Vault Based on Iris Images.- Multi-scale Local Binary Pattern Histograms for Face Recognition.- Histogram Equalization in SVM Multimodal Person Verification.- Learning Multi-scale Block Local Binary Patterns for Face Recognition.- Horizontal and Vertical 2DPCA Based Discriminant Analysis for Face Verification Using the FRGC Version 2 Database.- Video-Based Face Tracking and Recognition on Updating Twin GMMs.- Poster III.- Fast Algorithm for Iris Detection.- Pyramid Based Interpolation for Face-Video Playback in Audio Visual Recognition.- Face Authentication with Salient Local Features and Static Bayesian Network.- Fake Finger Detection by Finger Color Change Analysis.- Feeling Is Believing: A Secure Template Exchange Protocol.- SVM-Based Selection of Colour Space Experts for Face Authentication.- An Efficient Iris Coding Based on Gauss-Laguerre Wavelets.- Hardening Fingerprint Fuzzy Vault Using Password.- GPU Accelerated 3D Face Registration / Recognition.- Frontal Face Synthesis Based on Multiple Pose-Variant Images for Face Recognition.- Optimal Decision Fusion for a Face Verification System.- Robust 3D Head Tracking and Its Applications.- Multiple Faces Tracking Using Motion Prediction and IPCA in Particle Filters.- An Improved Iris Recognition System Using Feature Extraction Based on Wavelet Maxima Moment Invariants.- Color-Based Iris Verification.- Real-Time Face Detection and Recognition on LEGO Mindstorms NXT Robot.- Speaker and Digit Recognition by Audio-Visual Lip Biometrics.- Modelling Combined Handwriting and Speech Modalities.- A Palmprint Cryptosystem.- On Some Performance Indices for Biometric Identification System.- Automatic Online Signature Verification Using HMMs with User-Dependent Structure.- A Complete Fisher Discriminant Analysis for Based Image Matrix and Its Application to Face Biometrics.- SVM Speaker Verification Using Session Variability Modelling and GMM Supervectors.- 3D Model-Based Face Recognition in Video.- Robust Point-Based Feature Fingerprint Segmentation Algorithm.- Automatic Fingerprints Image Generation Using Evolutionary Algorithm.- Audio Visual Person Authentication by Multiple Nearest Neighbor Classifiers.- Improving Classification with Class-Independent Quality Measures: Q-stack in Face Verification.- Biometric Hashing Based on Genetic Selection and Its Application to On-Line Signatures.- Biometrics Based on Multispectral Skin Texture.- Application of New Qualitative Voicing Time-Frequency Features for Speaker Recognition.- Palmprint Recognition Based on Directional Features and Graph Matching.- Tongue-Print: A Novel Biometrics Pattern.- Embedded Palmprint Recognition System on Mobile Devices.- Template Co-update in Multimodal Biometric Systems.- Continual Retraining of Keystroke Dynamics Based Authenticator.

314 citations


Journal ArticleDOI
TL;DR: This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics, and the method is found to achieve lower error rates than baseline systems.
Abstract: This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Multicondition training is conducted using simulated noisy data with limited noise variation, providing a "coarse" compensation for the noise, and missing-feature theory is applied to refine the compensation by ignoring noise variation outside the given training conditions, thereby reducing the training and testing mismatch. This paper is focused on several issues relating to the implementation of the new model for real-world applications. These include the generation of multicondition training data to model noisy speech, the combination of different training data to optimize the recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.
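The missing-feature part of such a scheme is easy to sketch: with diagonal-covariance GMMs, unreliable feature dimensions can be marginalized out of the frame likelihood simply by omitting their terms. A hedged numpy sketch, where the reliability mask and variable names are assumptions rather than the paper's code:

```python
import numpy as np

def gmm_loglik_reliable(x, reliable, weights, means, variances):
    """Frame log-likelihood under a diagonal GMM, marginalising unreliable dims.

    x        : feature vector for one frame
    reliable : boolean mask of dimensions judged reliable; unreliable dims are
               integrated out, which for diagonal covariances means skipping them
    """
    xr, mr, vr = x[reliable], means[:, reliable], variances[:, reliable]
    log_norm = -0.5 * (np.log(2 * np.pi * vr) + (xr - mr) ** 2 / vr).sum(axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm)
```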

277 citations


Journal ArticleDOI
TL;DR: The STBU speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE), was a combination of three main kinds of subsystems.
Abstract: This paper describes and discusses the "STBU" speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (Stellenbosch, South Africa), TNO (Soesterberg, The Netherlands), BUT (Brno, Czech Republic), and the University of Stellenbosch (Stellenbosch, South Africa). The STBU system was a combination of three main kinds of subsystems: 1) GMM, with short-time Mel frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features, 2) Gaussian mixture model-support vector machine (GMM-SVM), using GMM mean supervectors as input to an SVM, and 3) maximum-likelihood linear regression-support vector machine (MLLR-SVM), using MLLR speaker adaptation coefficients derived from an English large vocabulary continuous speech recognition (LVCSR) system. All subsystems made use of supervector subspace channel compensation methods: either eigenchannel adaptation or nuisance attribute projection. We document the design and performance of all subsystems, as well as their fusion and calibration via logistic regression. Finally, we also present a cross-site fusion that was done with several additional systems from other NIST SRE-2006 participants.
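The GMM-SVM subsystem family works by MAP-adapting UBM means to each utterance and stacking them into a supervector that feeds an SVM. A simplified sketch of that idea, with the relevance factor and names chosen for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def map_adapt_means(ubm_means, posteriors, frames, r=16.0):
    """Relevance-MAP adaptation of UBM means to one utterance (means only)."""
    n_k = posteriors.sum(axis=0)                       # soft counts per Gaussian
    f_k = posteriors.T @ frames                        # first-order statistics
    alpha = (n_k / (n_k + r))[:, None]                 # adaptation coefficients
    return alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1 - alpha) * ubm_means

def supervector(adapted_means):
    """Stack the adapted means into one long vector: the SVM input feature."""
    return adapted_means.ravel()

# one supervector per utterance, then a standard linear SVM per target speaker:
# svm = SVC(kernel='linear').fit(train_supervectors, labels)
```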

271 citations


Journal ArticleDOI
TL;DR: A corpus-based approach to speaker verification in which maximum-likelihood II criteria are used to train a large-scale generative model of speaker and session variability which is called joint factor analysis is presented.
Abstract: We present a corpus-based approach to speaker verification in which maximum-likelihood II criteria are used to train a large-scale generative model of speaker and session variability which we call joint factor analysis. Enrolling a target speaker consists in calculating the posterior distribution of the hidden variables in the factor analysis model, and verification tests are conducted using a new type of likelihood II ratio statistic. Using the NIST 1999 and 2000 speaker recognition evaluation data sets, we show that the effectiveness of this approach depends on the availability of a training corpus which is well matched with the evaluation set used for testing. Experiments on the NIST 1999 evaluation set using a mismatched corpus to train factor analysis models did not result in any improvement over standard methods, but we found that, even with this type of mismatch, feature warping performs extremely well in conjunction with the factor analysis model, and this enabled us to obtain very good results (equal error rates of about 6.2%).
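For orientation, the joint factor analysis model referred to here decomposes a speaker- and session-dependent mean supervector in the standard way:

```latex
M = m + Vy + Ux + Dz, \qquad y, x, z \sim \mathcal{N}(0, I)
```

where m is the speaker-independent (UBM) supervector, V and y are the speaker subspace and speaker factors, U and x are the session/channel subspace and channel factors, and D and z form a diagonal residual term; enrollment then amounts to computing the posterior of the hidden variables (y, x, z) given the enrollment data.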

268 citations


PatentDOI
Kazuo Sumita
TL;DR: In this article, a speech recognition system includes a first-candidate selecting unit that selects a recognition result of a first speech from first recognition candidates based on the likelihood of the first recognition candidates.
Abstract: A speech recognition apparatus includes a first-candidate selecting unit that selects a recognition result of a first speech from first recognition candidates based on the likelihood of the first recognition candidates; a second-candidate selecting unit that extracts recognition candidates of an object word contained in the first speech and recognition candidates of a clue word from second recognition candidates, acquires the relevance ratio associated with the semantic relation between the extracted recognition candidates of the object word and the extracted recognition candidates of the clue word, and selects a recognition result of the second speech based on the acquired relevance ratio; a correction-portion identifying unit that identifies the portion corresponding to the object word in the first speech; and a correcting unit that corrects the word in the identified portion.

249 citations


Patent
20 Mar 2007
TL;DR: In this paper, the authors present a client-server security system, which includes a client system receiving first biometric data and having a first level security authorization procedure, and a server system is provided for receiving second Biometric data.
Abstract: The present invention includes a client-server security system. The client-server security system includes a client system receiving first biometric data and having a first level security authorization procedure. In one embodiment, the first biometric data is speech data and the first level security authorization procedure includes a first speaker recognition algorithm. A server system is provided for receiving second biometric data. The server system includes a second level security authorization procedure. In one embodiment, the second biometric data is speech data and the second level security authorization procedure includes a second speaker recognition algorithm. In one embodiment, the first level security authorization procedure and the second level security authorization procedure comprise distinct biometric algorithms.

225 citations


Patent
29 Oct 2007
TL;DR: In this paper, a method of speech recognition is described for use with mobile devices, where a portion of an initial speech recognition result is presented on the mobile device including a set of general alternate recognition hypotheses associated with the portion of the speech recognition results.
Abstract: A method of speech recognition is described for use with mobile devices. A portion of an initial speech recognition result is presented on the mobile device including a set of general alternate recognition hypotheses associated with the portion of the speech recognition result. A key input representative of one or more associated letters is received from the user. The user is provided with a set of restricted alternate recognition hypotheses starting with the one or more letters associated with the key input. Then a user selection is accepted of one of the restricted alternate recognition hypotheses to represent a corrected speech recognition result.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: In this article, a method that integrates phase information into a speaker recognition method was proposed, reducing the speaker identification error rate by about 44%.
Abstract: In the conventional speaker recognition method based on MFCC, phase information has been ignored. In this paper, we propose a method that integrates phase information into a speaker recognition method. The speaker identification experiments were performed using the NTT database, which consists of sentences uttered at normal speed by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances (about 20 seconds in total). Using the phase information, the speaker recognition error rate was reduced by about 44%. Index Terms: speaker identification, MFCC, phase information, GMM, combination method
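The combination method reported here amounts to fusing per-speaker scores from an MFCC-based GMM and a phase-based GMM. A minimal sketch of such late fusion; the weight alpha and function names are assumptions, to be tuned on development data:

```python
import numpy as np

def fused_scores(ll_mfcc, ll_phase, alpha=0.5):
    """Late fusion of per-speaker log-likelihoods from an MFCC GMM and a
    phase-feature GMM; alpha is a combination weight tuned on held-out data."""
    return (1.0 - alpha) * np.asarray(ll_mfcc) + alpha * np.asarray(ll_phase)

def identify_speaker(ll_mfcc, ll_phase, alpha=0.5):
    # closed-set identification: pick the speaker with the highest fused score
    return int(np.argmax(fused_scores(ll_mfcc, ll_phase, alpha)))
```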

Patent
Hisayuki Nagashima
11 Sep 2007
TL;DR: In this article, the authors present a voice recognition system that includes a first voice recognition processing unit for executing processing of assigning weights of a first ratio to a sound score and a language score calculated for the input voice and recognizing the voice using the obtained scores to determine a type of a domain representing the control object based on a result of the processing.
Abstract: A voice recognition device, a voice recognition method, and a voice recognition program capable of appropriately restricting recognition objects based on voice input from a user, so as to recognize the input voice with accuracy, are provided. The voice recognition device includes a first voice recognition processing unit that assigns weights in a first ratio to a sound score and a language score calculated for the input voice, recognizes the voice using the weighted scores, and determines the type of domain representing the control object based on the result of that processing. A second voice recognition processing unit, using the domain of the determined type as the recognition object, assigns weights in a second ratio to the sound score and the language score, the weight on the sound score being greater in the second ratio than in the first, and recognizes the voice using the weighted scores to determine the control content of the control object based on the result of that processing.

Book ChapterDOI
01 Feb 2007
TL;DR: This chapter introduces an evaluation measure, Cllr, that can properly evaluate the discrimination abilities of log-likelihood-ratio scores, as well as the quality of the calibration of a speaker detector.
Abstract: In the evaluation of speaker recognition systems (an important part of speaker classification [1]), the trade-off between missed speakers and false alarms has always been an important diagnostic tool. NIST has defined the task of speaker detection with the associated Detection Cost Function (DCF) to evaluate performance, and introduced the DET-plot [2] as a diagnostic tool. Since the first evaluation in 1996, these evaluation tools have been embraced by the research community. Although it is an excellent measure, the DCF has the limitation that its parameters imply a particular application of the speaker detection technology. In this chapter we introduce an evaluation measure that instead averages detection performance over application types. This metric, Cllr, was first introduced in 2004 by one of the authors [3]. Here we introduce the subject with a minimum of mathematical detail, concentrating on the various interpretations of Cllr and its practical application. We emphasize the difference between the discrimination abilities of a speaker detector ('the position/shape of the DET-curve') and the calibration of the detector ('how well was the threshold set'). If speaker detectors can be built to output well-calibrated log-likelihood-ratio scores, such detectors can be said to have an application-independent calibration. The proposed metric can properly evaluate the discrimination abilities of the log-likelihood-ratio scores, as well as the quality of the calibration.
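The metric itself is compact enough to state in code; a sketch of Cllr computed from target and nontarget log-likelihood-ratio scores (natural-log LLRs, cost in bits):

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr: average calibration-sensitive cost of LLR scores, in bits.
    0 is perfect; 1 matches a useless detector that always outputs LLR = 0."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2.0)   # cost on target trials
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2.0)    # cost on nontarget trials
    return 0.5 * (c_tar + c_non)
```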

Journal ArticleDOI
TL;DR: A new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features is proposed.
Abstract: It is well-known that early integration (also called data fusion) is effective when the modalities are correlated, and late integration (also called decision or opinion fusion) is optimal when modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high precision synchronization of the speech and lip features using CCA prior to the proposed fusion. Experimental results show that i) the proposed fusion strategy yields the best equal error rates (EER), which are used to quantify the performance of the fusion strategy for open-set speaker identification, and ii) precise synchronization prior to fusion improves the EER; hence, the best EER is obtained when the proposed synchronization scheme is employed together with the proposed fusion strategy. We note that the proposed fusion strategy outperforms others because the features used in the late integration are truly uncorrelated, since they are output of the CCA analysis.
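A sketch of the CCA step that precedes the proposed fusion, using scikit-learn; the feature shapes are illustrative and the paper's exact feature extraction is not reproduced here:

```python
from sklearn.cross_decomposition import CCA

# speech_feats: [n_frames x d_audio], lip_feats: [n_frames x d_video]
def cca_project(speech_feats, lip_feats, n_components=10):
    """Project both streams onto maximally correlated directions before fusion."""
    cca = CCA(n_components=n_components)
    speech_c, lip_c = cca.fit_transform(speech_feats, lip_feats)
    # early fusion: concatenate speech_c and lip_c; late fusion: score separately
    return speech_c, lip_c
```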

Proceedings ArticleDOI
15 Apr 2007
TL;DR: Experimental results on an emotional speech database demonstrate that the GMM supervector based SVM outperforms standard GMM on speech emotion recognition.
Abstract: Speech emotion recognition is a challenging yet important speech technology. In this paper, the GMM supervector based SVM is applied to this field with spectral features. A GMM is trained for each emotional utterance, and the corresponding GMM supervector is used as the input feature for SVM. Experimental results on an emotional speech database demonstrate that the GMM supervector based SVM outperforms standard GMM on speech emotion recognition.

Journal ArticleDOI
TL;DR: The group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and to pitch periodicity effects; it is modified to overcome these effects, yielding the modified group delay feature (MODGDF).
Abstract: Spectral representation of speech is complete when both the Fourier transform magnitude and phase spectra are specified. In conventional speech recognition systems, features are generally derived from the short-time magnitude spectrum. Although the importance of Fourier transform phase in speech perception has been realized, few attempts have been made to extract features from it. This is primarily because the resonances of the speech signal, which manifest as transitions in the phase spectrum, are completely masked by the wrapping of the phase spectrum. Hence, an alternative to processing the Fourier transform phase, for extracting speech features, is to process the group delay function, which can be directly computed from the speech signal. The group delay function has been used in earlier efforts to extract pitch and formant information from the speech signal. In all these efforts, no attempt was made to extract features from the speech signal and use them for speech recognition applications. This is primarily because the group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and also due to pitch periodicity effects. In this paper, the group delay function is modified to overcome these effects. Cepstral features are extracted from the modified group delay function and are called the modified group delay feature (MODGDF). The MODGDF is used for three speech recognition tasks, namely, speaker, language, and continuous-speech recognition. Based on the results of feature and performance evaluation, the significance of the MODGDF as a new feature for speech recognition is discussed.
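A numpy sketch of the group delay function computed directly from the signal, plus a MODGDF-style modification in which the denominator is a cepstrally smoothed spectrum; the gamma, alpha, and lifter values are illustrative, not the tuned values from the paper:

```python
import numpy as np

def group_delay(x, n_fft=512):
    """tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, with Y = FFT of n*x[n]."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

def modified_group_delay(x, n_fft=512, gamma=0.9, alpha=0.4, lifter=8):
    """MODGDF-style sketch: replace |X|^2 by a cepstrally smoothed spectrum
    (suppressing zeros near the unit circle and pitch effects), then compress."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    ceps = np.fft.irfft(np.log(np.abs(X) + 1e-12), n_fft)
    ceps[lifter:-lifter] = 0.0               # keep low quefrencies (symmetric lifter)
    S = np.exp(np.fft.rfft(ceps, n_fft).real)  # smoothed magnitude spectrum
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-12)
    return np.sign(tau) * np.abs(tau) ** alpha  # root compression
```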

Proceedings ArticleDOI
27 Aug 2007
TL;DR: Paper presented at the 8th Annual Conference of the International Speech Communication Association, held in Antwerp, Belgium, 27-31 August 2007.
Abstract: Paper presented at the 8th Annual Conference of the International Speech Communication Association, held in Antwerp, Belgium, 27-31 August 2007.

Patent
01 Oct 2007
TL;DR: In this paper, a post-recognition processor coupled with an interface is used to compare recognized speech data generated by the speech recognition engine to contextual information retained in a memory, and transmits the modified recognition data to a parsing component.
Abstract: A system improves speech recognition includes an interface linked to a speech recognition engine. A post-recognition processor coupled to the interface compares recognized speech data generated by the speech recognition engine to contextual information retained in a memory, generates a modified recognized speech data, and transmits the modified recognized speech data to a parsing component.

Journal ArticleDOI
TL;DR: It is shown how the evaluation of DNA evidence, which is based on a probabilistic similarity-typicality metric in the form of likelihood ratios (LR), can also be generalized to continuous LR estimation, thus providing a common framework for phonetic-linguistic methods and automatic systems.
Abstract: Forensic DNA profiling is acknowledged as the model for a scientifically defensible approach in forensic identification science, as it meets the most stringent court admissibility requirements demanding transparency in scientific evaluation of evidence and testability of systems and protocols. In this paper, we propose a unified approach to forensic speaker recognition (FSR) oriented to fulfil these admissibility requirements within a framework which is transparent, testable, and understandable, both for scientists and fact-finders. We show how the evaluation of DNA evidence, which is based on a probabilistic similarity-typicality metric in the form of likelihood ratios (LR), can also be generalized to continuous LR estimation, thus providing a common framework for phonetic-linguistic methods and automatic systems. We highlight the importance of calibration, and we exemplify with LRs from diphthongal F-pattern, and LRs in NIST-SRE06 tasks. The application of the proposed approach in daily casework remains a sensitive issue, and special caution is enjoined. Our objective is to show how traditional and automatic FSR methodologies can be transparent and testable, but simultaneously remain conscious of the present limitations. We conclude with a discussion on the combined use of traditional and automatic approaches and current challenges for the admissibility of speech evidence.

Journal ArticleDOI
TL;DR: The development of a gender-independent laugh detector is described, with the aim of enabling automatic emotion recognition. Acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of the durations of unvoiced to voiced portions, indicating that these prosodic features are indeed useful for discriminating between laughter and speech.

Journal ArticleDOI
TL;DR: This paper deals with eigenchannel adaptation in detail, including its theoretical background and implementation issues, and its post-evaluation analysis undermines the common myth that the more boxes in the scheme, the better the system.
Abstract: In this paper, several feature extraction and channel compensation techniques found in state-of-the-art speaker verification systems are analyzed and discussed. For the NIST SRE 2006 submission, cepstral mean subtraction, feature warping, RelAtive SpecTrAl (RASTA) filtering, heteroscedastic linear discriminant analysis (HLDA), feature mapping, and eigenchannel adaptation were incrementally added to minimize the system's error rate. This paper deals with eigenchannel adaptation in more detail and includes its theoretical background and implementation issues. The key part of the paper is, however, the post-evaluation analysis, undermining a common myth that "the more boxes in the scheme, the better the system." All results are presented on NIST Speaker Recognition Evaluation (SRE) 2005 and 2006 data.
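Of the feature-domain steps listed, feature warping is simple to sketch: each cepstral coefficient is replaced by the standard-normal value matching its rank within a sliding window. The window length and names below are typical choices, not necessarily the authors':

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=300):
    """Short-time feature warping of a [n_frames, dim] feature matrix."""
    feats = np.asarray(feats, dtype=float)
    warped = np.empty_like(feats)
    half = win // 2
    n = len(feats)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = feats[lo:hi]
        rank = (window < feats[t]).sum(axis=0) + 0.5   # rank of the centre frame
        warped[t] = norm.ppf(rank / (hi - lo))         # map rank to N(0, 1)
    return warped
```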

Proceedings ArticleDOI
27 Aug 2007
TL;DR: This work presents an approach in which the system developed is able to recognize sentences of continuous sign language independent of the speaker, focusing on feature and model combination techniques applied in ASR and on the usage of pronunciation and language models (LM) in sign language.
Abstract: One of the most significant differences between automatic sign language recognition (ASLR) and automatic speech recognition (ASR) is due to computer vision problems, whereas the corresponding problems in speech signal processing have been solved due to intensive research in the last 30 years. We present our approach, in which we start from a large vocabulary speech recognition system to profit from the insights that have been obtained in ASR research. The system developed is able to recognize sentences of continuous sign language independent of the speaker. The features used are obtained from standard video cameras without any special data acquisition devices. In particular, we focus on feature and model combination techniques applied in ASR, and the usage of pronunciation and language models (LM) in sign language. These techniques can be used for all kinds of sign language recognition systems, and for many video analysis problems where the temporal context is important, e.g., for action or gesture recognition. On a publicly available benchmark database consisting of 201 sentences and 3 signers, we achieve a 17% WER.

Journal ArticleDOI
TL;DR: The use of continuous prosodic features for speaker recognition is introduced, and it is shown how they can be modeled using joint factor analysis with a standard Gaussian mixture model.
Abstract: In this paper, we introduce the use of continuous prosodic features for speaker recognition, and we show how they can be modeled using joint factor analysis. Similar features have been successfully used in language identification. These prosodic features are pitch and energy contours spanning a syllable-like unit. They are extracted using a basis consisting of Legendre polynomials. Since the feature vectors are continuous (rather than discrete), they can be modeled using a standard Gaussian mixture model (GMM). Furthermore, speaker and session variability effects can be modeled in the same way as in conventional joint factor analysis. We find that the best results are obtained when we use the information about the pitch, energy, and the duration of the unit all together. Testing on the core condition of NIST 2006 speaker recognition evaluation data gives an equal error rate of 16.6% and 14.6%, with prosodic features alone, for all trials and English-only trials, respectively. When the prosodic system is fused with a state-of-the-art cepstral joint factor analysis system, we obtain a relative improvement of 8% (all trials) and 12% (English only) compared to the cepstral system alone.
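Extracting such contour features is straightforward with a Legendre basis; a minimal numpy sketch, where the segment extraction and polynomial order are illustrative assumptions:

```python
import numpy as np

def contour_coeffs(contour, order=5):
    """Fit a pitch or energy contour over a syllable-like unit with Legendre
    polynomials; the low-order coefficients become the prosodic feature vector."""
    t = np.linspace(-1.0, 1.0, len(contour))          # Legendre domain [-1, 1]
    return np.polynomial.legendre.legfit(t, contour, order)

# e.g. per unit: features = np.concatenate([contour_coeffs(f0_segment),
#                                           contour_coeffs(energy_segment),
#                                           [len(f0_segment)]])   # plus duration
```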

Proceedings ArticleDOI
15 Apr 2007
TL;DR: The major aspects of emotion recognition are addressed in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performance: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence.
Abstract: As automatic emotion recognition based on speech matures, new challenges can be faced. We therefore address the major aspects in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performances: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence. Three different data-sets are used: the Berlin Emotional Speech Database, the Danish Emotional Speech Database, and the spontaneous AIBO Emotion Corpus. By using different feature types such as word- or turn-based statistics, manual versus forced alignment, and optimization techniques we show how to best cope with this demanding task and how noise addition or different microphone positions affect emotion recognition.

Journal ArticleDOI
TL;DR: Two techniques to separate out the speech signal of the speaker of interest from a mixture of speech signals are presented and can result in significant enhancement of individual speakers in mixed recordings, consistently achieving better performance than that obtained with hard binary masks.
Abstract: The problem of single-channel speaker separation attempts to extract a speech signal uttered by the speaker of interest from a signal containing a mixture of acoustic signals. Most algorithms that deal with this problem are based on masking, wherein unreliable frequency components from the mixed signal spectrogram are suppressed, and the reliable components are inverted to obtain the speech signal from speaker of interest. Most current techniques estimate this mask in a binary fashion, resulting in a hard mask. In this paper, we present two techniques to separate out the speech signal of the speaker of interest from a mixture of speech signals. One technique estimates all the spectral components of the desired speaker. The second technique estimates a soft mask that weights the frequency subbands of the mixed signal. In both cases, the speech signal of the speaker of interest is reconstructed from the complete spectral descriptions obtained. In their native form, these algorithms are computationally expensive. We also present fast factored approximations to the algorithms. Experiments reveal that the proposed algorithms can result in significant enhancement of individual speakers in mixed recordings, consistently achieving better performance than that obtained with hard binary masks.
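A minimal sketch of the soft-mask resynthesis step, given a mask already estimated for the target speaker; the STFT parameters are illustrative, and the papers' mask estimation itself (the hard part) is not shown:

```python
from scipy.signal import stft, istft

def separate_with_soft_mask(mixture, mask, fs=16000, nperseg=512):
    """Weight each time-frequency cell of the mixture by a soft (0..1) mask and
    resynthesise with the mixture phase, in contrast to a hard binary mask."""
    f, t, Z = stft(mixture, fs=fs, nperseg=nperseg)
    assert mask.shape == Z.shape          # mask estimated for the target speaker
    _, x_hat = istft(mask * Z, fs=fs, nperseg=nperseg)
    return x_hat
```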

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper investigates the classification of different emotional states using prosodic and voice quality information to exploit the usage of different phonation types within the production of emotions.

Abstract: This paper investigates the classification of different emotional states using prosodic and voice quality information. We want to exploit the usage of different phonation types within the production of emotions. Therefore, as features we use prosodic features, voice quality parameters, and different combinations of both types. We study how prosodic and voice quality features overlap or complement each other in the application of emotion recognition. The classification is speaker independent and uses a reduced subset of 8 features and a Bayesian classifier.

Patent
21 Mar 2007
TL;DR: In this article, a method for efficient use of resources of a speech recognition system includes determining a recognition rate, corresponding to either recognition of instances of a word or recognition of instances of various words among a set of words, and determining an accuracy range of the recognition rate.

Abstract: A method for efficient use of resources of a speech recognition system includes determining a recognition rate, corresponding to either recognition of instances of a word or recognition of instances of various words among a set of words, and determining an accuracy range of the recognition rate. The method may further include adjusting adaptation of a model for the word, or various models for the various words, based on a comparison of at least one value in the accuracy range with a recognition rate threshold. An apparatus for efficient use of resources of a speech recognition system includes a processor adapted to determine a recognition rate, corresponding to either recognition of instances of a word or recognition of instances of various words among a set of words, and an accuracy range of the recognition rate. The apparatus may further include a controller adapted to adjust adaptation of a model for the word, or various models for the various words, based on a comparison of at least one value in the accuracy range with a recognition rate threshold.
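One plausible reading of the claimed "accuracy range" is a confidence interval on the observed recognition rate; a sketch using the Wilson score interval, which is an assumption for illustration and not necessarily the patent's formula:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% range for an observed recognition rate (Wilson score interval)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return (centre - half, centre + half)

# adapt the word's model only while recognition is demonstrably poor, e.g.:
# low, high = wilson_interval(hits, total); adapt = high < threshold
```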

Proceedings ArticleDOI
27 Aug 2007
TL;DR: This is the first study to collectively consider the five speech modes (whispered, soft, neutral, loud, and shouted); its analysis can provide improved speech/speaker modeling information, as well as classified vocal mode knowledge to improve speech and language technology in real scenarios.
Abstract: Variation in vocal effort represents one of the most challenging problems in maintaining speech system performance for coding, speech recognition, and speaker recognition. Changes in vocal effort (or mode) result in a fundamental change in speech production which is not simply a change in volume. This is the first study to collectively consider the five speech modes: whispered, soft, neutral, loud, and shouted. After corpus development, analysis is performed for i) sound intensity level, ii) duration and silence percentage, iii) frame energy distribution, and iv) spectral tilt. The analysis shows vocal effort dependent traits which are used to investigate speaker recognition. Matched vocal mode conditions result in a closed-set speaker ID rate of 97.62%, with mismatched vocal conditions producing 54.02%. Finally, a speech mode classification system is developed, with classification rates ranging from 44.5% to 98.5%, where confusions occur mainly between adjacent vocal modes. These advancements can provide improved speech/speaker modeling information, as well as classified vocal mode knowledge to improve speech and language technology in real scenarios.
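Of the four analysis dimensions, spectral tilt is easy to make concrete; a numpy sketch measuring tilt as the slope of a straight line fit to the log-magnitude spectrum of a frame (frame and FFT sizes are illustrative):

```python
import numpy as np

def spectral_tilt(frame, fs, n_fft=512):
    """Spectral tilt as the slope (dB/Hz) of a line fit to the log-magnitude
    spectrum; one of the cues separating whispered/soft from loud/shouted speech."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    db = 20.0 * np.log10(spec + 1e-12)
    slope, _ = np.polyfit(freqs[1:], db[1:], 1)   # skip the DC bin
    return slope
```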

Journal ArticleDOI
TL;DR: This paper compares channel variability modeling in the usual Gaussian mixture model domain with the proposed feature-domain compensation technique, and shows that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data at a reduced computation cost.
Abstract: The variability of the channel and environment is one of the most important factors affecting the performance of text-independent speaker verification systems. The best techniques for channel compensation are model based. Most of them have been proposed for Gaussian mixture models, while in the feature domain blind channel compensation is usually performed. The aim of this work is to explore techniques that allow more accurate intersession compensation in the feature domain. Compensating the features rather than the models has the advantage that the transformed parameters can be used with models of a different nature and complexity and for different tasks. In this paper, we evaluate the effects of the compensation of the intersession variability obtained by means of the channel factors approach. In particular, we compare channel variability modeling in the usual Gaussian mixture model domain, and our proposed feature domain compensation technique. We show that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data with a reduced computation cost. We also report the results of a system, based on the intersession compensation technique in the feature space that was among the best participants in the NIST 2006 Speaker Recognition Evaluation. Moreover, we show how we obtained significant performance improvement in language recognition by estimating and compensating, in the feature domain, the distortions due to interspeaker variability within the same language.
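A hedged sketch of feature-domain channel compensation in the spirit described: the offset implied by estimated channel factors is subtracted from each frame, weighted by Gaussian posteriors. Shapes and names are assumptions, and the estimation of U and x is not shown:

```python
import numpy as np

def compensate_features(frames, posteriors, U, x):
    """Subtract the session/channel offset from each frame in the feature domain.

    frames     : [n_frames, dim] acoustic features
    posteriors : [n_frames, n_gauss] UBM occupation probabilities per frame
    U          : [n_gauss, dim, n_factors] channel subspace (assumed given)
    x          : [n_factors] channel factors estimated for this utterance
    """
    offsets = U @ x                        # per-Gaussian offset, [n_gauss, dim]
    return frames - posteriors @ offsets   # expected offset per frame, subtracted
```

Because the compensation lives in the feature domain, the transformed features can then feed models of any nature, which is the advantage the abstract emphasizes.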