
Showing papers on "Speaker recognition" published in 2003


Proceedings ArticleDOI
06 Jul 2003
TL;DR: The paper addresses the design of working recognition engines and the results achieved with the two alternatives, and describes a speech corpus consisting of acted and spontaneous emotion samples in German and English.
Abstract: In this contribution we introduce speech emotion recognition by use of continuous hidden Markov models. Two methods are propagated and compared throughout the paper. Within the first method a global statistics framework of an utterance is classified by Gaussian mixture models using derived features of the raw pitch and energy contour of the speech signal. A second method introduces increased temporal complexity applying continuous hidden Markov models considering several states using low-level instantaneous features instead of global statistics. The paper addresses the design of working recognition engines and results achieved with respect to the alluded alternatives. A speech corpus consisting of acted and spontaneous emotion samples in German and English language is described in detail. Both engines have been tested and trained using this equivalent speech corpus. Results in recognition of seven discrete emotions exceeded 86% recognition rate. As a basis of comparison the similar judgment of human deciders classifying the same corpus at 79.8% recognition rate was analyzed.

599 citations
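
As a rough illustration of the first method above (global utterance statistics classified with Gaussian mixture models), the sketch below derives a fixed-length statistics vector from pitch and energy contours and trains one GMM per emotion; the feature set, mixture size and library choice are illustrative assumptions, not the paper's configuration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def global_stats(pitch, energy):
        """Fixed-length statistics vector derived from raw pitch/energy contours."""
        feats = []
        for contour in (np.asarray(pitch, float), np.asarray(energy, float)):
            d = np.diff(contour)                     # first-order dynamics
            feats += [contour.mean(), contour.std(), contour.min(), contour.max(),
                      d.mean(), d.std()]
        return np.array(feats)

    class GMMEmotionClassifier:
        """One Gaussian mixture model per emotion over global utterance statistics."""

        def __init__(self, n_components=2):          # keep small: one vector per utterance
            self.n_components = n_components
            self.models = {}

        def fit(self, utterances, labels):
            """utterances: list of (pitch_contour, energy_contour); labels: emotion tags."""
            X = np.array([global_stats(p, e) for p, e in utterances])
            y = np.array(labels)
            for emotion in set(labels):
                # needs at least n_components utterances per emotion
                self.models[emotion] = GaussianMixture(
                    self.n_components, covariance_type="diag",
                    reg_covar=1e-3).fit(X[y == emotion])
            return self

        def predict(self, pitch, energy):
            x = global_stats(pitch, energy)[None, :]
            # choose the emotion whose GMM assigns the highest log-likelihood
            return max(self.models, key=lambda emo: self.models[emo].score(x))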


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID).
Abstract: At a cocktail party, one can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel, supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, the notion of an "ideal" time-frequency binary mask is suggested, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. It is observed that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, pattern classification is performed in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that the model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

382 citations
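
The "ideal" time-frequency binary mask has a very concrete definition: keep a T-F unit when the target is locally stronger than the interference. The sketch below assumes the target and interference signals are available separately (the oracle condition that defines the ideal mask); the STFT settings are arbitrary.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(target, interference, fs, lc_db=0.0,
                          nperseg=512, noverlap=384):
        """Keep T-F units where the target exceeds the interference by lc_db
        (0 dB gives the classic ideal binary mask)."""
        _, _, T = stft(target, fs, nperseg=nperseg, noverlap=noverlap)
        _, _, N = stft(interference, fs, nperseg=nperseg, noverlap=noverlap)
        local_snr_db = 10.0 * np.log10((np.abs(T) ** 2 + 1e-12) /
                                       (np.abs(N) ** 2 + 1e-12))
        return (local_snr_db > lc_db).astype(float)

    def apply_mask(mixture, mask, fs, nperseg=512, noverlap=384):
        """Mask the mixture spectrogram and resynthesize a waveform."""
        _, _, M = stft(mixture, fs, nperseg=nperseg, noverlap=noverlap)
        _, x = istft(M * mask, fs, nperseg=nperseg, noverlap=noverlap)
        return x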



01 Jan 2003
TL;DR: This paper describes an HMM-based speech synthesis system (HTS), in which speech waveform is generated from HMMs themselves, and applies it to English speech synthesis using the general speech synthesis architecture of Festival.
Abstract: This paper describes an HMM-based speech synthesis system (HTS), in which the speech waveform is generated from the HMMs themselves, and applies it to English speech synthesis using the general speech synthesis architecture of Festival. Similarly to other data-driven speech synthesis approaches, HTS has a compact language-dependent module: a list of contextual factors. Thus, it could easily be extended to other languages, though the first version of HTS was implemented for Japanese. The resulting run-time engine of HTS has the advantage of being small: less than 1 Mbyte, excluding the text analysis part. Furthermore, HTS can easily change voice characteristics of synthesized speech by using a speaker adaptation technique developed for speech recognition. The relation between the HMM-based approach and other unit selection approaches is also discussed.

314 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: The SuperSID project as mentioned in this paper used prosodic dynamics, pitch and duration features, phone streams, and conversational interactions to improve the accuracy of automatic speaker recognition using a defined NIST evaluation corpus and task.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide-ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

256 citations
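
The fusion step mentioned above can be pictured as combining per-system scores with weights learned on held-out trials. The sketch below uses z-normalized scores and logistic regression as one common recipe; it is not the specific fusion backend used in the SuperSID project.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_fusion(scores, labels):
        """scores: (n_trials, n_systems) raw scores from each subsystem;
        labels: 1 for target trials, 0 for impostor trials (both kinds required)."""
        mu, sigma = scores.mean(axis=0), scores.std(axis=0) + 1e-9
        fuser = LogisticRegression().fit((scores - mu) / sigma, labels)
        return {"mu": mu, "sigma": sigma, "fuser": fuser}

    def fuse(scores, fusion):
        """Return one fused score per trial (larger = more target-like)."""
        z = (scores - fusion["mu"]) / fusion["sigma"]
        return fusion["fuser"].decision_function(z)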


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new feature mapping technique is presented that maps feature vectors into a channel-independent space, learning mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation.
Abstract: In speaker recognition applications, channel variability is a major cause of errors. Techniques in the feature, model and score domains have been applied to mitigate channel effects. In this paper we present a new feature mapping technique that maps feature vectors into a channel independent space. The feature mapping learns mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation. The technique is developed primarily for speaker verification, but can be applied for feature normalization in speech recognition applications. Results are presented on NIST landline and cellular telephone speech corpora where it is shown that feature mapping provides significant performance improvements over baseline systems and similar performance to Hnorm and speaker-model-synthesis (SMS).

255 citations
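
A simplified sketch of the feature-mapping idea: each frame is scored against the detected channel-dependent GMM, and the top-scoring component's mean and standard deviation are used to shift and scale the frame toward the corresponding component of the channel-independent model. Diagonal covariances and the one-to-one component correspondence given by MAP adaptation are assumed.

    import numpy as np

    def map_features(frames, cd_means, cd_stds, ci_means, ci_stds, weights):
        """Map channel-dependent frames toward a channel-independent space.
        cd_* come from the detected channel's GMM (MAP-adapted from the
        channel-independent GMM given by ci_*), so component i in one model
        corresponds to component i in the other. Shapes: (n_mix, dim) for the
        model parameters, (n_mix,) for weights, (n_frames, dim) for frames."""
        mapped = np.empty_like(frames, dtype=float)
        for t, x in enumerate(frames):
            # log-likelihood of x under each diagonal Gaussian of the CD model
            ll = (np.log(weights)
                  - 0.5 * np.sum(np.log(2 * np.pi * cd_stds ** 2), axis=1)
                  - 0.5 * np.sum(((x - cd_means) / cd_stds) ** 2, axis=1))
            i = np.argmax(ll)                      # top-scoring CD component
            # shift/scale the frame from the CD Gaussian to its CI counterpart
            mapped[t] = (x - cd_means[i]) / cd_stds[i] * ci_stds[i] + ci_means[i]
        return mapped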


Proceedings ArticleDOI
06 Apr 2003
TL;DR: Two approaches are proposed that use the fundamental frequency and energy trajectories to capture long-term information, achieving a 77% relative improvement over a system based on short-term pitch and energy features alone.
Abstract: Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a predefined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.

212 citations
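
A toy version of the first approach above: frame-to-frame F0 dynamics are quantized into a small symbol alphabet (fall/level/rise) and a smoothed per-speaker bigram model is estimated; test contours are scored by average log-likelihood. The alphabet, tolerance and smoothing are illustrative choices rather than the paper's.

    import numpy as np

    def quantize_f0(f0, tol=1.0):
        """Turn an F0 contour (Hz, unvoiced frames removed) into symbols
        0=fall, 1=level, 2=rise based on frame-to-frame deltas."""
        d = np.diff(np.asarray(f0, dtype=float))
        return np.where(d > tol, 2, np.where(d < -tol, 0, 1))

    def train_bigram(contours, n_symbols=3, alpha=1.0):
        """Estimate a smoothed bigram transition matrix for one speaker."""
        counts = np.full((n_symbols, n_symbols), alpha)   # add-alpha smoothing
        for f0 in contours:
            s = quantize_f0(f0)
            for a, b in zip(s[:-1], s[1:]):
                counts[a, b] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def score_contour(f0, bigram):
        """Average log-likelihood of a test contour under a speaker's bigram."""
        s = quantize_f0(f0)
        return np.mean([np.log(bigram[a, b]) for a, b in zip(s[:-1], s[1:])])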


PatentDOI
TL;DR: In this paper, an apparatus and method for selective distributed speech recognition includes a dialog manager that is capable of receiving a grammar type indicator (170), which can be coupled to an external speech recognition engine (108), which may be disposed on a communication network.
Abstract: An apparatus and method for selective distributed speech recognition includes a dialog manager (104) that is capable of receiving a grammar type indicator (170). The dialog manager (104) is capable of being coupled to an external speech recognition engine (108), which may be disposed on a communication network (142). The apparatus and method further includes an audio receiver (102) coupled to the dialog manager (104) wherein the audio receiver (104) receives a speech input (110) and provides an encoded audio input (112) to the dialog manager (104). The method and apparatus also includes an embedded speech recognition engine (106) coupled to the dialog manager (104), such that the dialog manager (104) selects to distribute the encoded audio input (112) to either the embedded speech recognition engine (106) or the external speech recognition engine (108) based on the corresponding grammar type indicator (170).

187 citations


PatentDOI
TL;DR: Embodiments of the present invention include a speech recognition method that includes receiving from an external system first recognition information to recognize a first plurality of words in a first system.
Abstract: Embodiments of the present invention include a speech recognition method. In one embodiment, the method includes receiving from an external system first recognition information to recognize a first plurality of words in a first system, programming the first system with the first recognition information to recognize the first plurality of words, generating first recognition results in response to receiving at least one of the first plurality of words in the first system, receiving from the external system second recognition information to recognize a second plurality of words, wherein the second recognition information is selected based on the first recognition results, and programming the first system with the second recognition information to recognize a second plurality of words.

183 citations


Journal ArticleDOI
TL;DR: The most frequently used approach, based on a modified Hidden Markov Model (HMM) phonetic recognizer, is analyzed; a general framework for the local refinement of boundaries is proposed, and the performance of several pattern classification approaches is compared within this framework.
Abstract: This paper presents the results and conclusions of a thorough study on automatic phonetic segmentation. It starts with a review of the state of the art in this field. Then, it analyzes the most frequently used approach, based on a modified Hidden Markov Model (HMM) phonetic recognizer. For this approach, a statistical correction procedure is proposed to compensate for the systematic errors produced by context-dependent HMMs, and the use of speaker adaptation techniques is considered to increase segmentation precision. Finally, this paper explores the possibility of locally refining the boundaries obtained with the former techniques. A general framework is proposed for the local refinement of boundaries, and the performance of several pattern classification approaches (fuzzy logic, neural networks and Gaussian mixture models) is compared within this framework. The resulting phonetic segmentation scheme was able to increase the performance of a baseline HMM segmentation tool from 27.12%, 79.27%, and 97.75% of automatic boundary marks with errors smaller than 5, 20, and 50 ms, respectively, to 65.86%, 96.01%, and 99.31% in speaker-dependent mode, which is a reasonably good approximation to manual segmentation.

181 citations
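
The statistical correction procedure can be pictured as follows: boundaries of a given class are consistently placed early or late by the context-dependent HMMs, so a per-class mean signed error estimated on hand-labelled data is subtracted at test time. The boundary-class key below (e.g. the pair of phone classes around the boundary) is an assumption for illustration, not the paper's exact definition.

    from collections import defaultdict
    import numpy as np

    def learn_corrections(dev_boundaries):
        """dev_boundaries: iterable of (boundary_class, auto_time, manual_time)
        in seconds. Returns the mean signed error per boundary class."""
        errors = defaultdict(list)
        for cls, auto_t, manual_t in dev_boundaries:
            errors[cls].append(auto_t - manual_t)
        return {cls: float(np.mean(e)) for cls, e in errors.items()}

    def correct(boundaries, corrections):
        """Shift each automatic boundary (boundary_class, time) by its class's mean error."""
        return [(cls, t - corrections.get(cls, 0.0)) for cls, t in boundaries]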


Proceedings ArticleDOI
06 Apr 2003
TL;DR: Using non-native data from German speakers, it is investigated how bilingual models, speaker adaptation, acoustic model interpolation and polyphone decision tree specialization methods can help to improve the recognizer performance.
Abstract: The performance of speech recognition systems is consistently poor on non-native speech. The challenge for non-native speech recognition is to maximize the recognition performance with a small amount of available non-native data. We report on acoustic modeling adaptation for the recognition of non-native speech. Using non-native data from German speakers, we investigate how bilingual models, speaker adaptation, acoustic model interpolation and polyphone decision tree specialization methods can help to improve the recognizer performance. Results obtained from the experiments demonstrate the feasibility of these methods.
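
Of the methods listed above, acoustic model interpolation is the easiest to sketch: corresponding Gaussian parameters of a native and a non-native model are combined with a weight tuned on held-out non-native data. The single global weight and the dict layout below are illustrative simplifications, not the paper's formulation.

    import numpy as np

    def interpolate_gaussians(native, nonnative, lam=0.7):
        """Interpolate the Gaussian parameters of two acoustic models that share
        the same topology (same states and mixtures). `native` and `nonnative`
        are dicts with 'means' and 'vars' arrays of shape (n_gaussians, dim);
        lam weights the native model and would be tuned on held-out data."""
        means = lam * native["means"] + (1.0 - lam) * nonnative["means"]
        # interpolate second moments so the combined variance stays consistent
        second = (lam * (native["vars"] + native["means"] ** 2)
                  + (1.0 - lam) * (nonnative["vars"] + nonnative["means"] ** 2))
        return {"means": means, "vars": second - means ** 2}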

Patent
03 Dec 2003
TL;DR: A fast on-line automatic speaker/environment adaptation system, method and computer program product suitable for speech/speaker recognition is presented, which consists of a computer system including a processor, a memory coupled with the processor, an input coupled with the processor for receiving acoustic signals, and an output coupled with the processor for outputting recognized words or sounds.
Abstract: A fast on-line automatic speaker/environment adaptation suitable for speech/speaker recognition system, method and computer program product are presented. The system comprises a computer system including a processor, a memory coupled with the processor, an input coupled with the processor for receiving acoustic signals, and an output coupled with the processor for outputting recognized words or sounds. The system includes a model-adaptation system and a recognition system, configured to accurately and efficiently recognize on-line distorted sounds or words spoken with different accents, in the presence of randomly changing environmental conditions. The model-adaptation system quickly adapts standard acoustic training models, available on audio recognition systems, by incorporating distortion parameters representative of the changing environmental conditions or the speaker's accent. By adapting models already available to the new environment, the system does not need separate adaptation training data.

Proceedings Article
09 Dec 2003
TL;DR: A new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches is introduced and a new kernel based upon a linearization of likelihood ratio scoring is derived.
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
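
In the spirit of the approach above, phone-recognizer output can be treated as a "document" of phone n-grams, vectorized by term frequency and fed to a linear SVM trained as target speaker versus background. The sketch below uses a stock TF-IDF vectorizer and linear SVM as stand-ins; the likelihood-ratio kernel derived in the paper is not reproduced.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    def phone_doc(phone_sequence):
        """Join decoded phone labels into a whitespace-separated 'document'."""
        return " ".join(phone_sequence)

    def train_speaker_svm(target_seqs, background_seqs):
        """Train a per-speaker SVM on phone unigram/bigram term frequencies."""
        docs = [phone_doc(s) for s in target_seqs + background_seqs]
        labels = [1] * len(target_seqs) + [0] * len(background_seqs)
        vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                              token_pattern=r"\S+")
        X = vec.fit_transform(docs)
        clf = LinearSVC().fit(X, labels)
        return vec, clf

    def score_utterance(vec, clf, phone_sequence):
        """Signed distance to the SVM hyperplane: larger = more target-like."""
        return clf.decision_function(vec.transform([phone_doc(phone_sequence)]))[0]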

Proceedings ArticleDOI
05 Nov 2003
TL;DR: The Georgia Tech Gesture Toolkit GT2k is introduced which leverages Cambridge University's speech recognition toolkit, HTK, to provide tools that support gesture recognition research and four ongoing projects that utilize the toolkit in a variety of domains are presented.
Abstract: Gesture recognition is becoming a more common interaction tool in the fields of ubiquitous and wearable computing. Designing a system to perform gesture recognition, however, can be a cumbersome task. Hidden Markov models (HMMs), a pattern recognition technique commonly used in speech recognition, can be used for recognizing certain classes of gestures. Existing HMM toolkits for speech recognition can be adapted to perform gesture recognition, but doing so requires significant knowledge of the speech recognition literature and its relation to gesture recognition. This paper introduces the Georgia Tech Gesture Toolkit GT2k which leverages Cambridge University's speech recognition toolkit, HTK, to provide tools that support gesture recognition research. GT2k provides capabilities for training models and allows for both real-time and off-line recognition. This paper presents four ongoing projects that utilize the toolkit in a variety of domains.

Proceedings ArticleDOI
15 Oct 2003
TL;DR: A simple novel technique for preparing reliable reference templates is presented that improves the recognition rate; it produces templates called crosswords reference templates (CWRTs) and can be adapted to any DTW-based speech recognition system to improve its performance.
Abstract: One of the main problems in dynamic time-warping (DTW) based speech recognition systems is the preparation of reliable reference templates for the set of words to be recognised. This paper presents a simple novel technique for preparing reliable reference templates to improve the recognition rate score. The developed technique produces templates called crosswords reference templates (CWRTs). It extracts the reference template from a set of examples rather than one example. This technique can be adapted to any DTW-based speech recognition system to improve its performance. The speaker-dependent recognition rate, as tested on the English digits, is improved from 85.3% using the traditional technique to 99% using the developed technique.
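
One way to picture building a reference template from several examples rather than one: DTW-align each training example to a seed example and average the aligned frames. The plain DTW below and the choice of the first example as the time axis are illustrative simplifications of the CWRT idea, not the paper's exact procedure.

    import numpy as np

    def dtw_path(a, b):
        """Plain DTW between two feature sequences (frames x dims); returns the
        optimal alignment path as a list of (i, j) index pairs."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        # backtrack from the end of both sequences
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def crossword_template(examples):
        """Average several feature sequences (e.g. MFCCs) of the same word into
        one reference template, using the first example as the time axis."""
        seed = np.asarray(examples[0], dtype=float)
        acc = seed.copy()
        counts = np.ones(len(seed))
        for ex in examples[1:]:
            ex = np.asarray(ex, dtype=float)
            for i, j in dtw_path(seed, ex):
                acc[i] += ex[j]
                counts[i] += 1
        return acc / counts[:, None]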

Patent
18 Mar 2003
TL;DR: In this article, the authors present a method, program product and system for speech recognition for use with a base speech recognition process, but which does not affect scoring models in the base speech classification process.
Abstract: A method, program product and system for speech recognition for use with a base speech recognition process, but which does not affect scoring models in the base speech recognition process, the method comprising in one embodiment: obtaining an output hypothesis from a base speech recognition process that uses a first set of scoring models; obtaining a set of alternative hypotheses; scoring the set of alternative hypotheses based on a second set of different scoring models that is separate from and external to the base speech recognition process and does not affect the scoring models thereof; and selecting a hypothesis with a best score.

01 Jan 2003
TL;DR: This thesis attempts to see the feature extraction as a whole, starting from understanding the speech production process, what is known about speaker individuality, and then going to the methods adopted directly from the speech recognition task.
Abstract: The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features. Several feature extraction methods have been proposed, and successfully exploited in the speaker recognition task. However, almost exclusively, the methods are adopted directly from the speech recognition task. This is somewhat ironic, considering the opposite nature of the two tasks. In speech recognition, speaker variability is one of the major error sources, whereas in speaker recognition it is the information that we wish to extract. Mel-frequency cepstral coefficients (MFCC) are the most evident example of a feature set that is extensively used in speaker recognition, but originally developed for speech recognition purposes. When an MFCC front-end is used in a speaker recognition system, one makes an implicit assumption that the human hearing mechanism is the optimal speaker recognizer. However, this has not been confirmed, and in fact opposite results exist. Although several methods adopted from speech recognition have been shown to work well in practice, they are often used as “black boxes” with fixed parameters. It is not understood what kind of information the features capture from the speech signal. Understanding the features at some level requires experience from specific areas such as speech physiology, acoustic phonetics, digital signal processing and statistical pattern recognition. According to the author’s general impression of the literature, it seems more and more that currently, at best, we are guessing what the code in the signal is that carries our individuality. This thesis has two main purposes. On the one hand, we attempt to see feature extraction as a whole, starting from understanding the speech production process, what is known about speaker individuality, and then going

Proceedings Article
01 Jan 2003
TL;DR: Two approaches for extracting speaker traits are investigated: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker, showing that voice signatures are of practical interest in real-world applications.
Abstract: Most current spoken-dialog systems only extract sequences of words from a speaker's voice. This largely ignores other useful information that can be inferred from speech such as gender, age, dialect, or emotion. These characteristics of a speaker's voice, voice signatures, whether static or dynamic, can be useful for speech mining applications or for the design of a natural spoken-dialog system. This paper explores the problem of extracting automatically and accurately voice signatures from a speaker's voice. We investigate two approaches for extracting speaker traits: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker. In the first approach, we show that standard speech/nonspeech HMM, conditioned on speaker traits and evaluated on cepstral and pitch features, achieve accuracies well above chance for all examined traits. The second approach, using support vector machines with rational kernels applied to speech recognition lattices, attains an accuracy of about 8.1 % in the task of binary classification of emotion. Our results are based on a corpus of speech data collected from a deployed customer-care application (HMIHY 0300). While still preliminary, our results are significant and show that voice signatures are of practical interest in real-world applications.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: It is found that the combination of feature warping and T-norm gives the best results on the NIST 2002 test data (for the one-speaker detection task) and approaches the state-of-the-art performance level obtained for speaker verification with land-line telephone speech.
Abstract: This paper presents some experiments with feature and score normalization for text-independent speaker verification of cellular data. The speaker verification system is based on cepstral features and Gaussian mixture models with 1024 components. The following methods, which have been proposed for feature and score normalization, are reviewed and evaluated on cellular data: cepstral mean subtraction (CMS), variance normalization, feature warping, T-norm, Z-norm and the cohort method. We found that the combination of feature warping and T-norm gives the best results on the NIST 2002 test data (for the one-speaker detection task). Compared to a baseline system using both CMS and variance normalization and achieving a 0.410 minimal decision cost function (DCF), feature warping and T-norm respectively bring 8% and 12% relative reductions, whereas the combination of both techniques yields a 22% relative reduction, reaching a DCF of 0.320. This result approaches the state-of-the-art performance level obtained for speaker verification with land-line telephone speech.
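
Two of the techniques compared above are easy to sketch under simplifying assumptions: feature warping maps each cepstral dimension's rank within a sliding window onto a standard normal, and T-norm standardizes a trial score with the mean and spread of the same utterance's scores against a cohort of impostor models. Window length and the mid-rank convention below are illustrative choices.

    import numpy as np
    from scipy.stats import norm

    def feature_warp(features, win=300):
        """Warp each feature dimension to a standard normal distribution over a
        sliding window of `win` frames (features: n_frames x dim)."""
        n, d = features.shape
        warped = np.empty_like(features, dtype=float)
        half = win // 2
        for t in range(n):
            lo, hi = max(0, t - half), min(n, t + half + 1)
            for k in range(d):
                window = features[lo:hi, k]
                rank = np.sum(window < features[t, k]) + 0.5   # mid-rank
                # rank/len(window) stays strictly inside (0, 1), so ppf is finite
                warped[t, k] = norm.ppf(rank / len(window))
        return warped

    def t_norm(raw_score, cohort_scores):
        """Test-normalization: standardize the claimant score against scores of
        the same test utterance on a cohort of impostor models."""
        cohort_scores = np.asarray(cohort_scores, dtype=float)
        return (raw_score - cohort_scores.mean()) / (cohort_scores.std() + 1e-9)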

Journal ArticleDOI
TL;DR: The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline, and the SGMM-SBM shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
Abstract: We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for purposes of achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions in different levels of resolution. For each target speaker, a SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During test, only a small subset of Gaussian mixture components are scored for each feature vector in order to reduce the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for final decision. Different configurations are compared in the experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.

Proceedings Article
01 Jan 2003
TL;DR: It is shown how novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%—a 71% relative reduction in error over the previous state of the art.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level acoustics that convey speaker information. Recently published works have demonstrated that such high-level information can be used successfully in automatic speaker recognition systems by improving accuracy and potentially increasing robustness. Wide-ranging high-level-feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%, a 71% relative reduction in error over the previous state of the art.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: It was found that, for many children, recognition results were as good as for adults; however, a higher variability in phone recognition accuracy across speakers was observed for children than for adults.
Abstract: Recognition of children's speech was investigated by considering a phone recognition task. Two baseline systems were trained, one for children and one for adults, by exploiting two Italian speech databases. Under matching conditions, for training and recognition performed with data from the same population group, the phone recognition accuracy was 77.30% and 79.43% for children and adults, respectively. It was found that, for many children, recognition results were as good as for adults. However, a higher variability in phone recognition accuracy across speakers was observed for children than for adults. Vocal tract length normalization, under matched and mismatched training and testing conditions, was also investigated. For both adults and children, a performance improvement, with respect to the baseline systems, was observed.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A set of new algorithms is described that performs speaker clustering in an online fashion, enabling low-latency incremental speaker adaptation in online speech-to-text systems and giving a speaker tracking and indexing system the ability to label speakers with cluster IDs on the fly.
Abstract: This paper describes a set of new algorithms that perform speaker clustering in an online fashion. Unlike typical clustering approaches, the proposed method does not require the presence of all the data before performing clustering. The clustering decision is made as soon as an audio segment is received. Being causal, this method enables low-latency incremental speaker adaptation in online speech-to-text systems. It also gives a speaker tracking and indexing system the ability to label speakers with cluster ID on the fly. We show that the new online speaker clustering method yields better performance compared to the traditional hierarchical speaker clustering. Evaluation metrics for speaker clustering are also discussed.
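
A minimal illustration of the causal clustering decision: each incoming segment is summarized by sufficient statistics, compared with existing clusters using a cheap Gaussian divergence, and either merged into the closest cluster or started as a new one. The distance, threshold and update rule below are stand-ins, not the algorithms proposed in the paper.

    import numpy as np

    class OnlineSpeakerClusterer:
        """Causal clustering: decide a cluster ID as soon as a segment arrives."""

        def __init__(self, threshold=25.0):
            self.threshold = threshold      # would be tuned on development data
            self.clusters = []              # per-cluster sufficient statistics

        @staticmethod
        def _stats(frames):
            return {"n": len(frames), "sum": frames.sum(axis=0),
                    "sumsq": (frames ** 2).sum(axis=0)}

        @staticmethod
        def _mean_var(st):
            m = st["sum"] / st["n"]
            v = st["sumsq"] / st["n"] - m ** 2 + 1e-6
            return m, v

        def _distance(self, c, s):
            # symmetric diagonal-Gaussian divergence; a cheap stand-in for GLR/BIC
            m1, v1 = self._mean_var(c)
            m2, v2 = self._mean_var(s)
            return float(np.sum((m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)
                                + v1 / v2 + v2 / v1 - 2.0))

        def add_segment(self, frames):
            s = self._stats(np.asarray(frames, dtype=float))
            if self.clusters:
                dists = [self._distance(c, s) for c in self.clusters]
                best = int(np.argmin(dists))
                if dists[best] < self.threshold:     # merge into closest cluster
                    for key in ("sum", "sumsq"):
                        self.clusters[best][key] = self.clusters[best][key] + s[key]
                    self.clusters[best]["n"] += s["n"]
                    return best
            self.clusters.append(s)                  # otherwise open a new cluster
            return len(self.clusters) - 1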

Proceedings ArticleDOI
17 Sep 2003
TL;DR: In particular, fingerprint- and speech-based systems serve as an illustration of a mobile authentication application, and a novel signal-adaptive supervisor, based on the input biometric signal quality, is evaluated.
Abstract: The elements of multimodal authentication along with system models are presented. These include the machine experts as well as machine supervisors. In particular, fingerprint- and speech-based systems serve as an illustration of a mobile authentication application. A novel signal-adaptive supervisor, based on the input biometric signal quality, is evaluated. Experimental results on data collected from mobile telephones are reported; they demonstrate the benefits of the proposed scheme.

Proceedings ArticleDOI
30 Nov 2003
TL;DR: After applying several conventional VTLN warping functions, the conventional piece-wise linear function is extended to several segments, allowing a more detailed warping of the source spectrum.
Abstract: In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As cross-language voice conversion aims at the transformation of a source speaker's voice into that of a target speaker using a different language, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora of two languages and both speaker genders.
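
A piece-wise linear warping with several segments can be written as interpolation through a set of matching frequency knots. The sketch below applies such a map to a single magnitude spectrum frame; the knot positions and warping factors are hypothetical, chosen only to illustrate the shape of the mapping.

    import numpy as np

    def piecewise_warp(freqs, knots_in, knots_out):
        """Map frequencies through a piece-wise linear warping function defined
        by matching knot positions (both in Hz, both increasing, same length,
        starting at 0 and ending at the Nyquist frequency)."""
        return np.interp(freqs, knots_in, knots_out)

    def warp_spectrum(spectrum, fs, knots_in, knots_out):
        """Resample one magnitude spectrum frame so that energy at warped
        frequencies lands on the original frequency grid."""
        n = len(spectrum)
        grid = np.linspace(0.0, fs / 2.0, n)
        # for each output bin, read the source spectrum at the inverse-warped
        # frequency (the inverse of a piece-wise linear map is piece-wise linear)
        source = piecewise_warp(grid, knots_out, knots_in)
        return np.interp(source, grid, spectrum)

    # Example: a 3-segment warp compressing frequencies below 3 kHz by 10 %
    # knots_in  = [0.0, 1000.0, 3000.0, 8000.0]
    # knots_out = [0.0,  900.0, 2700.0, 8000.0]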

Proceedings Article
01 Jan 2003
TL;DR: A battery of measures of consistency and confusability, based on forced-alignment, is presented that can be used to predict recogniser performance, and it is shown how these measures perform and how they are presented to the clinicians who are the users of the system.
Abstract: We describe an unusual ASR application: recognition of command words from severely dysarthric speakers, who have poor control of their articulators. The goal is to allow these clients to control assistive technology by voice. While this is a small vocabulary, speaker-dependent, isolated-word application, the speech material is more variable than normal, and only a small amount of data is available for training. After training a CDHMM recogniser, it is necessary to predict its likely performance without using an independent test set, so that confusable words can be replaced by alternatives. We present a battery of measures of consistency and confusability, based on forced-alignment, which can be used to predict recogniser performance. We show how these measures perform, and how they are presented to the clinicians who are the users of the system.

Proceedings Article
01 Jan 2003
TL;DR: It is argued that including LSTF streams provides another step towards human-like speech recognition, drawing on evidence for (spectro-)temporal processing in the auditory system.
Abstract: Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview of the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness and their statistical properties are analyzed.

1. Getting auditory ... again?

The question whether knowledge about the (human) auditory system provides valuable contributions to the design of ASR systems is as old as the field itself. The topic has been discussed extensively elsewhere (e.g. [1]). After all these years, a major argument still holds, namely the large gap in performance between normal-hearing native listeners and state-of-the-art ASR systems. Consistently, humans outperform machines by at least an order of magnitude [2]. Human listeners recognize speech even in very adverse acoustical environments with strong reverberation and interfering sound sources. However, this discrepancy between human and machine performance is not restricted to robustness alone. It is observed also in undisturbed conditions and very small context-independent corpora, where higher-level constraints (cognitive aspects, language model) do not play a role. Arguably this hints towards an insufficient feature extraction in machine recognition systems. It is argued here that including LSTF streams provides another step towards human-like speech recognition.

2. Evidence for (spectro-)temporal processing in the auditory system

Speech is characterized by its fluctuations across time and frequency. The latter reflect the characteristics of the human vocal cords and tract and are commonly exploited in ASR by using short-term spectral representations such as cepstral coefficients. The temporal properties of speech are targeted in ASR by dynamic (delta and delta-delta) features and temporal filtering and feature extraction techniques like RASTA [3] and TRAPS [4]. Nevertheless, speech clearly exhibits combined spectro-temporal modulations. This is due to intonation, co-articulation and the succession of several phonetic elements, e.g., in a syllable. Formant transitions, for example, result in diagonal features in a spectrogram representation of speech. This kind of pattern is captured by LSTF and explicitly targeted by the Gabor feature extraction.
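
A small sketch of Gabor-type spectro-temporal filtering: a 2-D Gabor kernel (a sinusoidal carrier under a Gaussian envelope) is convolved with a log-mel spectrogram, giving a response tuned to one spectro-temporal modulation, e.g. a formant transition. The kernel parameters are illustrative and not the optimized feature set discussed above.

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(omega_t, omega_f, sigma_t, sigma_f, size=(15, 15)):
        """2-D Gabor filter: temporal/spectral modulation frequencies omega_*
        (cycles per sample) under a Gaussian envelope of widths sigma_* (samples)."""
        t = np.arange(size[1]) - size[1] // 2
        f = np.arange(size[0]) - size[0] // 2
        F, T = np.meshgrid(f, t, indexing="ij")
        envelope = np.exp(-(T ** 2) / (2 * sigma_t ** 2)
                          - (F ** 2) / (2 * sigma_f ** 2))
        carrier = np.cos(2 * np.pi * (omega_t * T + omega_f * F))
        kernel = envelope * carrier
        return kernel - kernel.mean()        # remove the DC response

    def gabor_features(log_mel_spec, kernels):
        """log_mel_spec: (n_mel_bands, n_frames). Returns one filtered
        spectrogram per kernel, stacked along a new first axis."""
        return np.stack([convolve2d(log_mel_spec, k, mode="same")
                         for k in kernels])

    # Example: a kernel tuned to diagonal (sweeping) spectro-temporal patterns
    # diag = gabor_kernel(omega_t=0.1, omega_f=0.2, sigma_t=4.0, sigma_f=3.0)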

Patent
09 Jul 2003
TL;DR: In this article, a speech data mining system for use in generating a rich transcription having utility in call center management includes a speech differentiation module differentiating between speech of interacting speakers, and a speech recognition module improving automatic recognition of speech of one speaker based on interaction with another speaker employed as a reference speaker.
Abstract: A speech data mining system for use in generating a rich transcription having utility in call center management includes a speech differentiation module differentiating between speech of interacting speakers, and a speech recognition module improving automatic recognition of speech of one speaker based on interaction with another speaker employed as a reference speaker. A transcript generation module generates a rich transcript based on recognized speech of the speakers. Focused, interactive language models improve recognition of a customer on a low quality channel using context extracted from speech of a call center operator on a high quality channel with a speech model adapted to the operator. Mined speech data includes number of interaction turns, customer frustration phrases, operator polity, interruptions, and/or contexts extracted from speech recognition results, such as topics, complaints, solutions, and resolutions. Mined speech data is useful in call center and/or product or service quality management.

Proceedings ArticleDOI
Tong Zhang
06 Jul 2003
TL;DR: A system for automatic singer identification is developed which recognizes the singer of a song by analyzing the music signal, following the framework of common speaker identification systems.
Abstract: The singer's information is essential in organizing, browsing and retrieving music collections. In this paper, a system for automatic singer identification is developed which recognizes the singer of a song by analyzing the music signal. Meanwhile, songs which are similar in terms of singer's voice are clustered. The proposed scheme follows the framework of common speaker identification systems, but special efforts are made to distinguish the singing voice from instrumental sounds in a song. A statistical model is trained for each singer's voice with typical song(s) of the singer. Then, for a song to be identified, the starting point of singing voice is detected and a portion of the song is excerpted from that point. Audio features are extracted and matched with singers' voice models in the database. The song is assigned to the model having the best match. Promising results are obtained on a small set of samples, and accuracy rates of around 80% are achieved.

Proceedings Article
01 Jan 2003
TL;DR: There is no scientific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice, according to the two groups herein.
Abstract: Because of recent events and as members of the scientific community working in the field of speech processing, we feel compelled to publicize our views concerning the possibility of identifying or authenticating a person from his or her voice. The need for a clear and common message was indeed shown by the diversity of information that has been circulating on this matter in the media and general public over the past year. In a press release initiated by the AFCP and further elaborated in collaboration with the SpLC ISCA-SIG, the two groups herein discuss and present a summary of the current state of scientific knowledge and technological development in the field of speaker recognition, in accessible wording for nonspecialists. Our main conclusion is that, despite the existence of technological solutions to some constrained applications, at the present time, there is no scientific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice.