
Showing papers on "Speaker recognition" published in 2001


Book
01 Jan 2001
TL;DR: This book develops a discrete-time framework for speech signal processing, covering speech production and perception, analysis/synthesis methods such as linear prediction, homomorphic filtering, the short-time Fourier transform, and sinusoidal modeling, and applications including coding, enhancement, and speaker recognition.
Abstract: (NOTE: Each chapter begins with an introduction and concludes with a Summary, Exercises and Bibliography.) 1. Introduction. Discrete-Time Speech Signal Processing. The Speech Communication Pathway. Analysis/Synthesis Based on Speech Production and Perception. Applications. Outline of Book. 2. A Discrete-Time Signal Processing Framework. Discrete-Time Signals. Discrete-Time Systems. Discrete-Time Fourier Transform. Uncertainty Principle. z-Transform. LTI Systems in the Frequency Domain. Properties of LTI Systems. Time-Varying Systems. Discrete Fourier Transform. Conversion of Continuous Signals and Systems to Discrete Time. 3. Production and Classification of Speech Sounds. Anatomy and Physiology of Speech Production. Spectrographic Analysis of Speech. Categorization of Speech Sounds. Prosody: The Melody of Speech. Speech Perception. 4. Acoustics of Speech Production. Physics of Sound. Uniform Tube Model. A Discrete-Time Model Based on Tube Concatenation. Vocal Fold/Vocal Tract Interaction. 5. Analysis and Synthesis of Pole-Zero Speech Models. Time-Dependent Processing. All-Pole Modeling of Deterministic Signals. Linear Prediction Analysis of Stochastic Speech Sounds. Criterion of "Goodness". Synthesis Based on All-Pole Modeling. Pole-Zero Estimation. Decomposition of the Glottal Flow Derivative. Appendix 5.A: Properties of Stochastic Processes. Random Processes. Ensemble Averages. Stationary Random Process. Time Averages. Power Density Spectrum. Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis. 6. Homomorphic Signal Processing. Concept. Homomorphic Systems for Convolution. Complex Cepstrum of Speech-Like Sequences. Spectral Root Homomorphic Filtering. Short-Time Homomorphic Analysis of Periodic Sequences. Short-Time Speech Analysis. Analysis/Synthesis Structures. Contrasting Linear Prediction and Homomorphic Filtering. 7. Short-Time Fourier Transform Analysis and Synthesis. Short-Time Analysis. Short-Time Synthesis. Short-Time Fourier Transform Magnitude. Signal Estimation from the Modified STFT or STFTM. Time-Scale Modification and Enhancement of Speech. Appendix 7.A: FBS Method with Multiplicative Modification. 8. Filter-Bank Analysis/Synthesis. Revisiting the FBS Method. Phase Vocoder. Phase Coherence in the Phase Vocoder. Constant-Q Analysis/Synthesis. Auditory Modeling. 9. Sinusoidal Analysis/Synthesis. Sinusoidal Speech Model. Estimation of Sinewave Parameters. Synthesis. Source/Filter Phase Model. Additive Deterministic-Stochastic Model. Appendix 9.A: Derivation of the Sinewave Model. Appendix 9.B: Derivation of Optimal Cubic Phase Parameters. 10. Frequency-Domain Pitch Estimation. A Correlation-Based Pitch Estimator. Pitch Estimation Based on a "Comb Filter". Pitch Estimation Based on a Harmonic Sinewave Model. Glottal Pulse Onset Estimation. Multi-Band Pitch and Voicing Estimation. 11. Nonlinear Measurement and Modeling Techniques. The STFT and Wavelet Transform Revisited. Bilinear Time-Frequency Distributions. Aeroacoustic Flow in the Vocal Tract. Instantaneous Teager Energy Operator. 12. Speech Coding. Statistical Models of Speech. Scalar Quantization. Vector Quantization (VQ). Frequency-Domain Coding. Model-Based Coding. LPC Residual Coding. 13. Speech Enhancement. Introduction. Preliminaries. Wiener Filtering. Model-Based Processing. Enhancement Based on Auditory Masking. Appendix 13.A: Stochastic-Theoretic Parameter Estimation. 14. Speaker Recognition. Introduction. Spectral Features for Speaker Recognition.
Speaker Recognition Algorithms. Non-Spectral Features in Speaker Recognition. Signal Enhancement for the Mismatched Condition. Speaker Recognition from Coded Speech. Appendix 14.A: Expectation-Maximization (EM) Estimation. Glossary. Speech Signal Processing. Units. Databases. Index. About the Author.

984 citations


Proceedings Article
01 Jan 2001
TL;DR: This paper introduces a first approach to emotion recognition using RAMSES, the UPC’s speech recognition system, based on standard speech recognition technology using hidden semi-continuous Markov models.
Abstract: This paper introduces a first approach to emotion recognition using RAMSES, the UPC's speech recognition system. The approach is based on standard speech recognition technology using hidden semi-continuous Markov models. Both the selection of low-level features and the design of the recognition system are addressed. Results are given on speaker-dependent emotion recognition using the Spanish corpus of the INTERFACE Emotional Speech Synthesis Database. The accuracy in recognising seven different emotions (the six defined in MPEG-4 plus neutral style) exceeds 80% using the best combination of low-level features and HMM structure. This result is very similar to that obtained with the same database in subjective evaluation by human judges. Dealing with the speaker's emotion is one of the latest challenges in speech technologies. Three different aspects can be easily identified: speech recognition in the presence of emotional speech, synthesis of emotional speech, and emotion recognition. In this last case, the objective is to determine the emotional state of the speaker from the speech samples. Possible applications range from aids to psychiatric diagnosis to intelligent toys, and the topic is a subject of recent but rapidly growing interest [1]. This paper describes the TALP researchers' first approach to emotion recognition. The work is carried out within the scope of the INTERFACE project [2]. The objective of this European Commission sponsored project is "to define new models and implement advanced tools for audio-video analysis, synthesis and representation in order to provide essential technologies for the implementation of large-scale virtual and augmented environments. The work is oriented to make man-machine interaction as natural as possible, based on everyday human communication by speech, facial expressions and body gestures". In the field of emotion recognition from speech, the main goal of the INTERFACE project will be the construction of a real-time multi-lingual speaker-independent emotion recogniser. For this purpose, large speech databases with recordings from many speakers and languages are needed. As these resources are not available yet, a reduced problem is addressed first: emotion recognition in multi-speaker, language-dependent conditions. Namely, this paper deals with the recognition of emotion for two Spanish speakers using standard hidden Markov model technology.
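
As an illustration of the modeling setup described above (one generative model per emotion class, selected by maximum likelihood), here is a minimal, hypothetical sketch using the hmmlearn library and synthetic feature vectors; it is not the authors' RAMSES implementation.

```python
# Illustrative sketch only (hmmlearn, synthetic features): one Gaussian HMM per
# emotion, with classification by the highest log-likelihood.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
emotions = ["neutral", "anger", "joy"]
models = {}
for shift, emo in enumerate(emotions):
    # Fake low-level feature sequences (e.g. energy/pitch-like frames) per emotion.
    train = rng.normal(loc=shift, scale=1.0, size=(300, 4))
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20, random_state=0)
    m.fit(train, lengths=[100, 100, 100])
    models[emo] = m

test = rng.normal(loc=1.0, scale=1.0, size=(80, 4))  # synthetic utterance, closest to "anger"
scores = {emo: m.score(test) for emo, m in models.items()}
print(max(scores, key=scores.get))
```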

641 citations


Patent
02 May 2001
TL;DR: In this paper, new techniques and systems may be implemented to improve error correction in speech recognition systems, which may be used in a standard desktop environment, in a mobile environment, or in any other type of environment that can receive and/or present recognized speech.
Abstract: New techniques and systems may be implemented to improve error correction in speech recognition. These new techniques and systems may be used in a standard desktop environment, in a mobile environment, or in any other type of environment that can receive and/or present recognized speech.

423 citations


Proceedings Article
01 Jan 2001
TL;DR: These initial experiments strongly suggest that further exploration of “familiar” speaker characteristics will likely be an extremely interesting and valuable research direction for recognition of speakers in conversational speech.
Abstract: “Familiar” speaker information is explored using non-acoustic features in NIST’s new “extended data” speaker detection task.[1] Word unigrams and bigrams, used in a traditional target/background likelihood ratio framework, are shown to give surprisingly good performance. Performance continues to improve with additional training and/or test data. Bigram performance is also found to be a function of target/model sex and age difference. These initial experiments strongly suggest that further exploration of “familiar” speaker characteristics will likely be an extremely interesting and valuable research direction for recognition of speakers in conversational speech.
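
To make the scoring framework concrete, the sketch below shows a length-normalized target-versus-background likelihood ratio over word unigrams with add-one smoothing; the function names, smoothing, and data are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a target/background unigram likelihood-ratio score,
# assuming word-level transcripts are available.
from collections import Counter
import math

def unigram_log_likelihood(words, counts, total, vocab_size, alpha=1.0):
    """Log probability of a word sequence under an add-alpha smoothed unigram model."""
    return sum(
        math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
        for w in words
    )

def llr_score(test_words, target_words, background_words):
    """Likelihood ratio: target unigram model vs. background unigram model."""
    vocab = set(target_words) | set(background_words) | set(test_words)
    tgt, bkg = Counter(target_words), Counter(background_words)
    ll_tgt = unigram_log_likelihood(test_words, tgt, len(target_words), len(vocab))
    ll_bkg = unigram_log_likelihood(test_words, bkg, len(background_words), len(vocab))
    return (ll_tgt - ll_bkg) / max(len(test_words), 1)  # length-normalized LLR

# Higher scores suggest the test side matches the target speaker's word usage.
print(llr_score(["you", "know", "like"], ["you", "know"] * 50, ["well", "so"] * 50))
```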

285 citations


PatentDOI
TL;DR: In this article, a system and method for the control of color-based lighting through voice control or speech recognition as well as a syntax for use with such a system is presented. But this approach is limited to the use of spoken voice (in any language) without having to learn the myriad manipulation required of some complex controller interfaces.
Abstract: A system and method for the control of color-based lighting through voice control or speech recognition, as well as a syntax for use with such a system. In this approach, the spoken voice (in any language) can be used to more naturally control effects without having to learn the myriad manipulations required by some complex controller interfaces. A simple control language based upon spoken words consisting of commands and values is constructed and used to provide a common base for lighting and system control.
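
For illustration only, here is a toy parser for a command/value control language of the kind the abstract describes; the vocabulary is hypothetical, not taken from the patent.

```python
# Hypothetical command/value vocabulary for spoken lighting control.
COMMANDS = {"color": {"red", "green", "blue"}, "brightness": {"low", "medium", "high"}}

def parse_utterance(words):
    """Return (command, value) pairs found in a recognized word sequence."""
    pairs = []
    for i, w in enumerate(words[:-1]):
        if w in COMMANDS and words[i + 1] in COMMANDS[w]:
            pairs.append((w, words[i + 1]))
    return pairs

print(parse_utterance("set color red and brightness high".split()))
# -> [('color', 'red'), ('brightness', 'high')]
```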

260 citations


PatentDOI
Steven G. Woodward
TL;DR: In this paper, a method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model.
Abstract: A method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model. The speech-to-text conversion can produce speech recognized text that can be presented through a user interface. A user-initiated misrecognition error notification can be detected. The audio input and a reference to the active language model can be provided to a speech recognition system training process associated with the embedded speech recognition system.

186 citations


Patent
Richard Rose, Bojana Gajic
12 Oct 2001
TL;DR: In this paper, a dynamic re-configurable speech recognition model is proposed for small devices such as mobile phones and personal digital assistants as well as environments such as office, home or vehicle while maintaining the accuracy of the speech recognition.
Abstract: Speech recognition models are dynamically re-configurable based on user information, background information such as background noise and transducer information such as transducer response characteristics to provide users with alternate input modes to keyboard text entry (Fig. 5). The techniques of dynamic re-configurable speech recognition provide for deployment of speech recognition on small devices such as mobile phones and personal digital assistants, as well as in environments such as the office, home or vehicle, while maintaining the accuracy of the speech recognition.

179 citations


Journal ArticleDOI
TL;DR: A structural maximum a posteriori (SMAP) approach improves the MAP estimates obtained when the amount of adaptation data is small; recognition results from unsupervised adaptation experiments showed that SMAP estimation was effective even when only one utterance from a new speaker was used for adaptation.
Abstract: Maximum a posteriori (MAP) estimation has been successfully applied to speaker adaptation in speech recognition systems using hidden Markov models. When the amount of data is sufficiently large, MAP estimation yields recognition performance as good as that obtained using maximum-likelihood (ML) estimation. This paper describes a structural maximum a posteriori (SMAP) approach to improve the MAP estimates obtained when the amount of adaptation data is small. A hierarchical structure in the model parameter space is assumed and the probability density functions for model parameters at one level are used as priors for those of the parameters at adjacent levels. Results of supervised adaptation experiments using nonnative speakers' utterances showed that SMAP estimation reduced error rates by 61% when ten utterances were used for adaptation and that it yielded the same accuracy as MAP and ML estimation when the amount of data was sufficiently large. Furthermore, the recognition results obtained in unsupervised adaptation experiments showed that SMAP estimation was effective even when only one utterance from a new speaker was used for adaptation. An effective way to combine rapid supervised adaptation and on-line unsupervised adaptation was also investigated.
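
The core MAP idea above can be illustrated with the standard conjugate-prior update for a Gaussian mean, where a relevance factor controls the balance between the prior and the adaptation data; the sketch below uses assumed values and is not the paper's SMAP tree-based estimator, which additionally shares priors across levels of a hierarchy.

```python
# Minimal sketch of MAP mean adaptation for one Gaussian, assuming a relevance factor tau.
import numpy as np

def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP estimate of a Gaussian mean with a conjugate prior centered at prior_mean."""
    data = np.atleast_2d(data)
    n = data.shape[0]
    sample_mean = data.mean(axis=0) if n > 0 else prior_mean
    # With little data the estimate stays near the prior; with much data it follows the data.
    return (tau * prior_mean + n * sample_mean) / (tau + n)

prior = np.zeros(3)
adaptation_data = np.random.default_rng(0).normal(1.0, 0.5, size=(5, 3))
print(map_adapt_mean(prior, adaptation_data))  # shrinks toward the prior when n is small
```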

172 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features, and synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
Abstract: Describes a technique for synthesizing speech with arbitrary speaker characteristics using speaker independent speech units, which we call "average voice" units. The technique is based on an HMM-based text-to-speech (TTS) system and maximum likelihood linear regression (MLLR) adaptation algorithm. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs which can model spectrum and pitch simultaneously in a unified framework. We derive an extension of the MLLR algorithm to apply it to MSD-HMMs. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
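
As a rough illustration of the adaptation step, the sketch below applies a single MLLR mean transform W = [b A] to a set of mean vectors; estimating W by maximum likelihood from the target speaker's sentences, and the paper's extension to MSD-HMMs, are omitted, and all names and values are assumptions.

```python
# Hedged sketch of applying one MLLR mean transform to HMM mean vectors.
import numpy as np

def apply_mllr(means, W):
    """means: (num_gaussians, dim); W: (dim, dim + 1) transform [bias | A]."""
    extended = np.hstack([np.ones((means.shape[0], 1)), means])  # xi = [1, mu]
    return extended @ W.T  # adapted mean = A mu + b for every Gaussian

dim = 4
means = np.random.default_rng(1).normal(size=(10, dim))
W = np.hstack([np.zeros((dim, 1)), np.eye(dim)])  # identity transform as a placeholder
print(np.allclose(apply_mllr(means, W), means))   # True: identity leaves means unchanged
```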

158 citations


Book
11 Sep 2001
TL;DR: SAUSI (Semi-Automatic Speaker Identification System), as discussed by the authors, is a semi-automated speaker identification system that combines aural-perceptual and machine-based approaches.
Abstract: 1. Introduction 2. History 3. Earwitness Lineups 4. Aural-Perceptual Approaches 5. Use of Professionals 6. Voiceprints 7. Machine Approaches 8. SAUSI (Semi-Automatic Speaker Identification System) 9. The Future

141 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech, however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.

Proceedings ArticleDOI
09 Dec 2001
TL;DR: Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
Abstract: As part of a project into speech recognition in meeting environments, we have collected a corpus of multichannel meeting recordings. We expected the identification of speaker activity to be straightforward given that the participants had individual microphones, but simple approaches yielded unacceptably erroneous labelings, mainly due to crosstalk between nearby speakers and wide variations in channel characteristics. Therefore, we have developed a more sophisticated approach for multichannel speech activity detection using a simple hidden Markov model (HMM). A baseline HMM speech activity detector has been extended to use mixtures of Gaussians to achieve robustness for different speakers under different conditions. Feature normalization and crosscorrelation processing are used to increase the channel independence and to detect crosstalk. The use of both energy normalization and crosscorrelation based postprocessing results in a 35% relative reduction of the frame error rate. Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
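
Not the paper's exact features, but as an illustration of crosstalk detection by cross-channel correlation, the sketch below computes the peak normalized cross-correlation between two personal-microphone frames; high values on a channel whose wearer is silent suggest leakage from a nearby speaker. All signals and parameters are synthetic assumptions.

```python
# Frame-level normalized cross-correlation between two channels of a meeting recording.
import numpy as np

def max_normalized_xcorr(frame_a, frame_b, max_lag=160):
    """Peak normalized cross-correlation of two frames over a small lag range."""
    a = frame_a - frame_a.mean()
    b = frame_b - frame_b.mean()
    denom = np.sqrt(np.sum(a**2) * np.sum(b**2)) + 1e-12
    lags = range(-max_lag, max_lag + 1)
    return max(np.sum(a * np.roll(b, lag)) / denom for lag in lags)

rng = np.random.default_rng(2)
speech = rng.normal(size=4000)
near_mic = speech + 0.05 * rng.normal(size=4000)
far_mic = 0.3 * np.roll(speech, 40) + 0.5 * rng.normal(size=4000)  # attenuated, delayed leak
print(max_normalized_xcorr(near_mic, far_mic))  # well above the value for independent noise
```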

Proceedings ArticleDOI
07 May 2001
TL;DR: A hybrid system which appropriately combines the advantages of both the generative and discriminant model paradigms is described and experimentally evaluated on a text-independent speaker recognition task in matched and mismatched training and test conditions, and the results show that the combination is beneficial in terms of performance and practical in terms of computation.
Abstract: Proposes a classification scheme that incorporates statistical models and support vector machines. A hybrid system which appropriately combines the advantages of both the generative and discriminant model paradigms is described and experimentally evaluated on a text-independent speaker recognition task in matched and mismatched training and test conditions. Our results prove that the combination is beneficial in terms of performance and practical in terms of computation. We report relative improvements of up to 25% reduction in identification error rate compared to the baseline statistical model.
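
One simple way to picture such a generative/discriminative combination (an assumed construction, not necessarily the authors' formulation) is to feed per-utterance GMM scores to an SVM, as in the scikit-learn sketch below with synthetic features.

```python
# Per-utterance GMM log-likelihoods used as features for an SVM decision.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(3)
target_feats = rng.normal(0.0, 1.0, size=(200, 12))    # frames from the target speaker
impostor_feats = rng.normal(0.5, 1.2, size=(200, 12))  # frames from other speakers

target_gmm = GaussianMixture(n_components=4, random_state=0).fit(target_feats)
background_gmm = GaussianMixture(n_components=4, random_state=0).fit(impostor_feats)

def utterance_score(frames):
    # Average frame log-likelihoods under both models form a 2-D score vector.
    return np.array([target_gmm.score(frames), background_gmm.score(frames)])

train_utts = [target_feats[i:i + 20] for i in range(0, 200, 20)] + \
             [impostor_feats[i:i + 20] for i in range(0, 200, 20)]
labels = [1] * 10 + [0] * 10
svm = SVC(kernel="linear").fit([utterance_score(u) for u in train_utts], labels)
print(svm.predict([utterance_score(rng.normal(0.0, 1.0, size=(20, 12)))]))
```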

01 Jan 2001
TL;DR: The application of microphone arrays to speaker recognition applications is discussed, and an experimental evaluation of a hands-free speaker verification application in noisy conditions is presented.
Abstract: This paper investigates the use of microphone arrays in hands-free speaker recognition systems. Hands-free operation is preferable in many potential speaker recognition applications; however, obtaining acceptable performance with a single distant microphone is problematic in real noise conditions. A possible solution to this problem is the use of microphone arrays, which have the capacity to enhance a signal based purely on knowledge of its direction of arrival. The use of microphone arrays for improving the robustness of speech recognition systems has been studied in recent times; however, little research has been conducted in the area of speaker recognition. This paper discusses the application of microphone arrays to speaker recognition applications, and presents an experimental evaluation of a hands-free speaker verification application in noisy conditions.
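
The simplest array technique that exploits direction of arrival is delay-and-sum beamforming; a minimal far-field sketch follows, with microphone geometry, sampling rate, and look direction chosen purely for illustration (real systems add filtering and calibration).

```python
# Minimal delay-and-sum beamformer for a linear array with far-field arrival.
import numpy as np

def delay_and_sum(channels, mic_positions, angle_rad, fs, c=343.0):
    """channels: (num_mics, num_samples); mic_positions: metres along a line."""
    delays = mic_positions * np.sin(angle_rad) / c        # seconds per microphone
    sample_shifts = np.round(delays * fs).astype(int)
    aligned = [np.roll(ch, -shift) for ch, shift in zip(channels, sample_shifts)]
    return np.mean(aligned, axis=0)                        # coherent sum boosts the look direction

fs = 16000
mics = np.array([0.0, 0.05, 0.10, 0.15])                   # 4-element array, 5 cm spacing
signals = np.random.default_rng(4).normal(size=(4, fs))    # placeholder multichannel data
enhanced = delay_and_sum(signals, mics, np.deg2rad(30), fs)
print(enhanced.shape)
```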

Proceedings ArticleDOI
07 May 2001
TL;DR: The anchor modeling algorithm is refined by pruning the number of models needed and it is shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers.
Abstract: Introduces the technique of anchor modeling in the applications of speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem where its performance is shown to fall short of the state-of-the-art Gaussian mixture model with universal background model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers. Here, excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and high accuracy of GMM-UBM recognition.
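
The anchor-modeling idea can be pictured as projecting each utterance onto a vector of log-likelihoods against a fixed set of anchor speaker models and comparing utterances in that space; the code below is an assumed toy construction with scikit-learn GMMs, not the paper's system.

```python
# Toy anchor-model projection: likelihoods against fixed anchor GMMs, cosine similarity.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
anchors = [GaussianMixture(n_components=2, random_state=i).fit(rng.normal(i, 1.0, size=(100, 8)))
           for i in range(5)]

def anchor_vector(frames):
    v = np.array([gmm.score(frames) for gmm in anchors])  # average log-likelihood per anchor
    return v / np.linalg.norm(v)

def similarity(frames_a, frames_b):
    return float(anchor_vector(frames_a) @ anchor_vector(frames_b))  # cosine similarity

utt1 = rng.normal(1.0, 1.0, size=(50, 8))
utt2 = rng.normal(1.0, 1.0, size=(50, 8))
print(similarity(utt1, utt2))  # same "speaker" distribution, so the vectors should align
```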

PatentDOI
TL;DR: In this article, an interactive voice response unit which provides beneficial operation by including means to handle unconstrained input such as natural speech and to allow barge-in includes a prompter, a recognizer of speech signals, a meaningful phrase detector and classifier, and a turn-taking module.
Abstract: An interactive voice response unit which provides beneficial operation by including means to handle unconstrained input such as natural speech and to allow barge-in includes a prompter, a recognizer of speech signals, a meaningful phrase detector and classifier, and a turn-taking module, all under control of a dialog manager. In the course of listening to user input while outputting a voiced message, the voice response unit processes the received signal and ascertains whether it is receiving an utterance that is intended to interrupt the prompt, or merely noise or an utterance that is not meant to be used by the arrangement. The unit is sensitive to the speed and context of the speech provided by the user and is thus able to distinguish between a situation where a speaker is merely pausing and a situation where a speaker is done speaking.

Proceedings Article
01 Jan 2001
TL;DR: A model-based compensation method is applied to cancel the effect of the additive noise in Automatic Speech Recognition systems so that the compensation procedure does not constrain real-time speech recognition systems and is compatible with emerging technologies based on distributed speech recognition.
Abstract: In this paper we apply a model-based compensation method to cancel the effect of additive noise in Automatic Speech Recognition systems. The method is formulated in a statistical framework in order to perform the optimal compensation of the noise effect given the observed noisy speech, a model describing the statistics of the speech recorded in a clean reference environment, and an estimate of the noise in the noisy recognition environment. The noise is estimated using the first frames of the sentence to be recognized and a frame-by-frame noise compensation algorithm is performed, so that the compensation procedure does not constrain real-time speech recognition systems and is compatible with emerging technologies based on distributed speech recognition. We have performed recognition experiments under noise conditions using the AURORA II database for the recognition tasks developed for this database as a standard reference. Experiments have been carried out including both clean and multicondition training approaches. The experimental results show the improvements in recognition performance when the proposed model-based compensation method is applied.
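
As a much-simplified stand-in for the frame-by-frame compensation described above (plain magnitude-domain subtraction rather than the paper's statistical model-based method), assuming the leading frames of the utterance contain only noise:

```python
# Estimate noise from the first frames, then compensate every frame of the utterance.
import numpy as np

def compensate(frames_stft_mag, num_noise_frames=10, floor=0.01):
    """frames_stft_mag: (num_frames, num_bins) magnitude spectra of the noisy utterance."""
    noise_est = frames_stft_mag[:num_noise_frames].mean(axis=0)  # leading frames ~ noise only
    cleaned = frames_stft_mag - noise_est                         # frame-by-frame subtraction
    return np.maximum(cleaned, floor * frames_stft_mag)           # spectral floor avoids negatives

noisy = np.abs(np.random.default_rng(6).normal(size=(100, 129))) + 0.5
print(compensate(noisy).shape)
```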

Proceedings ArticleDOI
I. Viikki, Imre Kiss, Jilei Tian
07 May 2001
TL;DR: This work proposes an architecture for embedded multilingual speech recognition systems and investigates the technical challenges that are faced when making a transition from the speaker-dependent to speaker-independent speech recognition technology in mobile communication devices.
Abstract: We investigate the technical challenges that are faced when making a transition from the speaker-dependent to speaker-independent speech recognition technology in mobile communication devices. Due to globalization as well as the international nature of the markets and the future applications, speaker independence implies the development and use of language-independent automatic speech recognition (ASR) to avoid logistic difficulties. We propose an architecture for embedded multilingual speech recognition systems. Multilingual acoustic modeling, automatic language identification, and on-line pronunciation modeling are the key features which enable the creation of truly language- and speaker-independent ASR applications with dynamic vocabularies and sparse implementation resources. Our experimental results confirm the viability of the proposed architecture. While the use of multilingual acoustic models degrades the recognition rates only marginally, a recognition accuracy decrease of approximately 4% is observed due to sub-optimal on-line text-to-phoneme mapping and automatic language identification. This performance loss can nevertheless be compensated by applying acoustic model adaptation techniques.

Patent
26 Oct 2001
TL;DR: In this paper, a method to correct incorrect text associated with recognition errors in computer-implemented speech recognition is described, which includes the step of performing speech recognition on an utterance to produce a recognition result for the utterance.
Abstract: A method (1400, 1435) is described that corrects incorrect text associated with recognition errors in computer-implemented speech recognition. The method includes the step of performing speech recognition on an utterance to produce a recognition result (1405) for the utterance. The command includes a word and a phrase (1500). The method includes determining if a word closely corresponds to a portion of the phrase (1505). A speech recognition result is produced if the word closely corresponds to a portion of the phrase (1520, 1525).

Proceedings Article
07 Sep 2001
TL;DR: A combined system for punctuation generation and speech recognition that incorporates prosodic information with acoustic and language model information is discussed, which can improve the F-measure of punctuation recognition by 19% relative.
Abstract: In this paper, we discuss a combined system for punctuation generation and speech recognition. This system incorporates prosodic information with acoustic and language model information. Experiments are conducted for both the reference transcriptions and speech recogniser outputs. For the reference transcription case, prosodic information is shown to be more useful than language model information. When these information sources are combined, we can obtain an F-measure of up to 0.7830 for punctuation recognition. A few straightforward modifications of a conventional speech recogniser allow the system to produce punctuation and speech recognition hypotheses simultaneously. The multiple hypotheses are produced by the automatic speech recogniser and are re-scored by prosodic information. When prosodic information is incorporated, the F-measure can be improved by 19% relative. At the same time, small reductions in word error rate are obtained.
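
For reference, the F-measure quoted above is the harmonic mean of precision and recall; the numbers in the snippet are illustrative, not the paper's.

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.80, 0.77), 4))  # illustrative values only
```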

Proceedings ArticleDOI
07 May 2001
TL;DR: The study demonstrates the complementary nature of the two components, which are derived using linear prediction analysis of short segments of speech and captured implicitly by a feedforward autoassociative neural network.
Abstract: We study the effectiveness of the features extracted from the source and system components of the speech production process for the purpose of speaker recognition. The source and system components are derived using linear prediction (LP) analysis of short segments of speech. The source component is the LP residual derived from the signal, and the system component is a set of weighted linear prediction cepstral coefficients. The features are captured implicitly by a feedforward autoassociative neural network (AANN). Two separate speaker models are derived by training two AANN models using feature vectors corresponding to source and system components. A speaker recognition system for 20 speakers is built and tested using both the models to evaluate the performance of source and system features. The study demonstrates the complementary nature of the two components.
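
A minimal sketch of extracting the two components named above, assuming librosa and SciPy are available: the LP residual as the source feature and cepstra derived from the LP coefficients as the system feature (the cepstral weighting and the AANN speaker models are omitted).

```python
# LP residual (source) and LP-derived cepstra (system) from one analysis frame.
import numpy as np
import librosa
import scipy.signal

def source_system_features(frame, order=12, n_cep=12):
    a = librosa.lpc(frame, order=order)                # a[0] = 1, prediction polynomial
    residual = scipy.signal.lfilter(a, [1.0], frame)   # inverse filtering gives the LP residual
    # Standard LPC-to-cepstrum recursion for the "system" features.
    cep = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = -a[n] if n <= order else 0.0
        for k in range(1, n):
            acc -= (k / n) * cep[k - 1] * (a[n - k] if n - k <= order else 0.0)
        cep[n - 1] = acc
    return residual, cep

frame = np.random.default_rng(7).normal(size=400)       # placeholder for a short speech segment
res, lpcc = source_system_features(frame)
print(res.shape, lpcc.shape)
```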

Patent
TL;DR: In this paper, a method of improving the recognition accuracy of an in-vehicle speech recognition system is presented. But, the method of the present invention selectively adapts the system's speech engine to a speaker's voice characteristics using an N-best matching technique.
Abstract: Disclosed herein is a method of improving the recognition accuracy of an in-vehicle speech recognition system. The method of the present invention selectively adapts the system's speech engine to a speaker's voice characteristics using an N-best matching technique. In this method, the speech recognition system receives and processes a spoken utterance relating to a car command and having particular speaker-dependent speech characteristics so as to select a set of N-best voice commands matching the spoken utterance. Upon receiving a training mode input from the speaker, the system outputs the N-best command set to the speaker who selects the correct car command. The system then adapts the speech engine to recognize a spoken utterance having the received speech characteristics as the user-selected car command.


Patent
05 Jun 2001
TL;DR: In this paper, the authors present a client-server security system that includes a client system receiving first biometric data and having a first level security authorization procedure in one embodiment, and a server system providing second-level security authorization procedures in another embodiment.
Abstract: The present invention includes a client-server security system. The client-server security system includes a client system receiving first biometric data and having a first level security authorization procedure. In one embodiment, the first biometric data is speech data and the first level security authorization procedure includes a first speaker recognition algorithm. A server system is provided for receiving second biometric data. The server system includes a second level security authorization procedure. In one embodiment, the second biometric data is speech data and the second level security authorization procedure includes a second speaker recognition algorithm. In one embodiment, the first level security authorization procedure and the second level security authorization procedure comprise distinct biometric algorithms.

Patent
Robert Beach
17 Sep 2001
TL;DR: In this paper, a mobile device is arranged to receive first voice commands to be interpreted by a digital signal processor in a device having a limited vocabulary voice recognition program and to receive second voice commands which are converted to voice representative data signals to be sent by a WLAN to a remote computer for interpretation using a large vocabulary VRS program.
Abstract: A mobile device is arranged to receive first voice commands to be interpreted by a digital signal processor in said device having a limited vocabulary voice recognition program and to receive second voice commands which are converted to voice representative data signals to be sent by a WLAN to a remote computer for interpretation using a large vocabulary voice recognition program. The mobile device provides voice control of the remote computer and can also provide voice activated voice communications.

PatentDOI
TL;DR: In this paper, a control unit including a recognition result receiver, recognition result association unit having associations of results with recognition engines, and recognition engine activator able to activate the recognition engine associated with the recognition result is described.
Abstract: Described is a control unit including a recognition result receiver able to receive a recognition result, a recognition result association unit having associations of results with recognition engines, and a recognition engine activator able to activate the recognition engine associated with the recognition result. Also described is a device including a microphone, an analog to digital converter able to convert input received by the microphone, a first speech recognition engine adapted to perform a first type of recognition on an output of the analog to digital converter, and a second recognition engine adapted to perform a second type of recognition on the output.

Patent
Kenichi Fujii, Ikeda Yuji, Takaya Ueda, Fumiaki Ito, Tomoyuki Shimizu
10 Oct 2001
TL;DR: In this paper, a speech recognition system is proposed that provides a plurality of usable speech recognition means to the client and allows the client to explicitly switch among and use the plurality of speech recognition means connected to the network.
Abstract: This invention has as its object to provide a speech recognition system to which a client and a device that provides a speech recognition process are connected, which provides a plurality of usable speech recognition means to the client, and which allows the client to explicitly switch and use the plurality of speech recognition means connected to the network. To achieve this object, a speech recognition system of this invention has speech input means for inputting speech at the client, designation means for designating one of the plurality of usable speech recognition means, and processing means for making the speech recognition means designated by the designation means recognize speech input from the speech input means.

Proceedings Article
01 Jan 2001
TL;DR: Experimental results show that the false acceptance rate for synthetic speech was reduced drastically without significant increase of the false acceptance and rejection rates for natural speech.
Abstract: This paper describes a text-prompted speaker verification system which is robust to imposture using synthetic speech generated by an HMM-based speech synthesis system. In the verification system, text and speaker are verified separately. Text verification is based on phoneme recognition using HMM, and speaker verification is based on GMM. To discriminate synthetic speech from natural speech, an average of inter-frame difference of the log likelihood is calculated, and input speech is judged to be synthetic when this value is smaller than a decision threshold. Experimental results show that the false acceptance rate for synthetic speech was reduced drastically without significant increase of the false acceptance and rejection rates for natural speech.
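
A hedged sketch of the anti-spoofing score described above: the average inter-frame variation of frame log-likelihoods under the claimed speaker's GMM, with overly smooth synthetic speech expected to fall below a decision threshold. Model sizes and data here are assumptions, not the paper's setup.

```python
# Average absolute inter-frame difference of frame log-likelihoods under a speaker GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def interframe_likelihood_variation(frames, speaker_gmm):
    ll = speaker_gmm.score_samples(frames)      # per-frame log-likelihoods
    return float(np.mean(np.abs(np.diff(ll))))  # small values -> suspiciously smooth input

rng = np.random.default_rng(8)
gmm = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(size=(500, 13)))
natural = rng.normal(size=(100, 13))
synthetic_like = np.repeat(rng.normal(size=(10, 13)), 10, axis=0)  # artificially smooth trajectory
print(interframe_likelihood_variation(natural, gmm) >
      interframe_likelihood_variation(synthetic_like, gmm))        # expected: True
```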

01 Jan 2001
TL;DR: The results of evaluations of the recognition performance produced by multiple participating research organizations, the FBI's initial Forensic Automatic Speaker Recognition (FASR) program, and a confidence measurement method to indicate the probabilistic certainty level of correctness of each recognition decision are described.
Abstract: Automatic speaker recognition technology appears to have reached a sufficient level of maturity for realistic application in the field of forensic science. However, there are key issues to be solved before the forensic community will accept its use as an investigative assistant or as evidence in actual criminal cases. To assess the state of the technology, the Federal Bureau of Investigation (FBI) built a speech corpus that included multiple levels of increasing difficulty based on text-independence, channel independence, speaking mode, and speech duration. An evaluation of multiple automatic speaker recognition programs indicated that a large GMM model-based recognition algorithm operating with features that are robust with respect to channel variations had the best performance. In this paper we describe (1) the results of evaluations of the recognition performance produced by multiple participating research organizations, (2) the FBI's initial Forensic Automatic Speaker Recognition (FASR) program based on these concepts, and (3) a confidence measurement method to indicate the probabilistic certainty level of correctness of each recognition decision. We will also discuss the need and justification for input speech screening and pre-processing to improve the recognition performance of the FASR as applied in a real forensic environment.

PatentDOI
TL;DR: A method of speech recognition including receiving speech signals into a front-end processor and storing at least some resources used for speech recognition in a network-attached server.
Abstract: A method of speech recognition including receiving speech signals into a front-end processor and storing at least some resources used for speech recognition in a network-attached server. The front-end processor is coupled to the network-attached server to perform the speech recognition.