
Showing papers on "Speaker recognition published in 1994"


PatentDOI
TL;DR: In this article, a distributed voice recognition system includes a digital signal processor (DSP), a nonvolatile storage medium (108), and a microprocessor (106); the DSP is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor.
Abstract: A distributed voice recognition system includes a digital signal processor (DSP)(104), a nonvolatile storage medium (108), and a microprocessor (106). The DSP (104) is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor (106). The nonvolatile storage medium contains a database of speech templates. The microprocessor is configured to read the contents of the nonvolatile storage medium (108), compare the parameters with the contents, and select a speech template based upon the comparison. The nonvolatile storage medium may be a flash memory. The DSP (104) may be a vocoder. If the DSP (104) is a vocoder, the parameters may be diagnostic data generated by the vocoder. The distributed voice recognition system may reside on an application specific integrated circuit (ASIC).

361 citations


Journal ArticleDOI
TL;DR: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification, and it is shown that performance differences between the basic features are small, and that the major gains are due to the channel compensation techniques.
Abstract: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification. The goal is to keep all processing and classification steps constant and to vary only the features and compensations used to allow a controlled comparison. A general, maximum-likelihood classifier based on Gaussian mixture densities is used as the classifier, and experiments are conducted on the King speech database, a conversational, telephone-speech database. The features examined are mel-frequency and linear-frequency filterbank cepstral coefficients, linear prediction cepstral coefficients, and perceptual linear prediction (PLP) cepstral coefficients. The channel compensation techniques examined are cepstral mean removal, RASTA processing, and a quadratic trend removal technique. It is shown for this database that performance differences between the basic features are small, and that the major gains are due to the channel compensation techniques. The best "across-the-divide" recognition accuracy of 92% is obtained for both high-order LPC features and band-limited filterbank features.

336 citations
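
As a concrete anchor for the comparison above, here is a minimal sketch of a GMM speaker-identification back end with cepstral mean removal, one of the compensation techniques evaluated. It is an illustration, not the paper's code: scikit-learn's GaussianMixture stands in for the maximum-likelihood Gaussian mixture classifier, and the feature dictionaries (`train_feats`, `test_feats`) are assumed inputs.

```python
# Sketch of GMM-based speaker identification with cepstral mean removal.
# train_feats / test_feats are assumed: dicts mapping a speaker id to a
# (frames x cepstral-dim) numpy array of cepstral coefficients.
import numpy as np
from sklearn.mixture import GaussianMixture

def cepstral_mean_removal(feats):
    """Subtract the per-utterance cepstral mean from every frame."""
    return feats - feats.mean(axis=0, keepdims=True)

def train_models(train_feats, n_mix=32):
    models = {}
    for spk, feats in train_feats.items():
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag')
        gmm.fit(cepstral_mean_removal(feats))
        models[spk] = gmm
    return models

def identify(models, test_feats):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    feats = cepstral_mean_removal(test_feats)
    return max(models, key=lambda spk: models[spk].score(feats))
```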


Journal ArticleDOI
TL;DR: The modified neural tree network (MNTN) is a hierarchical classifier that combines the properties of decision trees and feedforward neural networks; it is found to perform better than full-search VQ classifiers for both of these applications.
Abstract: An evaluation of various classifiers for text-independent speaker recognition is presented. In addition, a new classifier is examined for this application. The new classifier is called the modified neural tree network (MNTN). The MNTN is a hierarchical classifier that combines the properties of decision trees and feedforward neural networks. The MNTN differs from the standard NTN in both the new learning rule used and the pruning criteria. The MNTN is evaluated for several speaker recognition experiments. These include closed- and open-set speaker identification and speaker verification. The database used is a subset of the TIMIT database consisting of 38 speakers from the same dialect region. The MNTN is compared with nearest neighbor classifiers, full-search and tree-structured vector quantization (VQ) classifiers, multilayer perceptrons (MLPs), and decision trees. For closed-set speaker identification experiments, the full-search VQ classifier and MNTN demonstrate comparable performance. Both methods perform significantly better than the other classifiers for this task. The MNTN and full-search VQ classifiers are also compared for several speaker verification and open-set speaker-identification experiments. The MNTN is found to perform better than full-search VQ classifiers for both of these applications. In addition to matching or exceeding the performance of the VQ classifier for these applications, the MNTN also provides a logarithmic saving for retrieval.

295 citations
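
For reference, a sketch of the full-search VQ baseline the MNTN is measured against: one codebook per speaker, scored by average quantization distortion. KMeans stands in for codebook design (e.g. the LBG algorithm), and the `train_feats` dictionary is an assumed input.

```python
# Sketch of a full-search VQ speaker classifier: one codebook per
# speaker, closed-set identification by minimum average distortion.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(train_feats, codebook_size=64):
    # train_feats: dict speaker -> (frames x dim) array (assumed)
    return {spk: KMeans(n_clusters=codebook_size).fit(f).cluster_centers_
            for spk, f in train_feats.items()}

def distortion(codebook, feats):
    """Average distance from each frame to its nearest codeword (full search)."""
    d = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d.min(axis=1)).mean()

def identify(codebooks, feats):
    # Closed-set identification: the lowest-distortion codebook wins.
    return min(codebooks, key=lambda spk: distortion(codebooks[spk], feats))
```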


Journal ArticleDOI
TL;DR: It is demonstrated how approaches to language identification based on acoustic modeling and language modeling are similar to algorithms used in speaker-independent continuous speech recognition.
Abstract: The Oregon Graduate Institute Multi-language Telephone Speech Corpus (OGI-TS) was designed specifically for language identification research. It currently consists of spontaneous and fixed-vocabulary utterances in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. These utterances were produced by 90 native speakers in each language over real telephone lines. Language identification is related to speaker-independent speech recognition and speaker identification in several interesting ways. It is therefore not surprising that many of the recent developments in language identification can be related to developments in those two fields. We review some of the more important recent approaches to language identification against the background of successes in speaker and speech recognition. In particular, we demonstrate how approaches to language identification based on acoustic modeling and language modeling, respectively, are similar to algorithms used in speaker-independent continuous speech recognition. Thereafter, prosodic and duration-based information sources are studied. We then review an approach to language identification that draws heavily on speaker identification. Finally, the performance of some representative algorithms is reported.

280 citations


Journal ArticleDOI
Yunxin Zhao
TL;DR: Experiments of speaker adaptation on the TIMIT database using short calibration speech have shown significant performance improvement over the baseline speaker-independent continuous speech recognition, where the recognition system uses Gaussian mixture density based hidden Markov models of phone units.
Abstract: A new speaker adaptation technique is proposed for improving speaker-independent continuous speech recognition based on a decomposition of spectral variation sources. In this technique, the spectral variations are separated into two categories, one acoustic and the other phone-specific, where each variation source is modeled by a linear transformation system. The technique consists of two sequential steps: first, acoustic normalization is performed, and second, phone model parameters are adapted. Experiments of speaker adaptation on the TIMIT database using short calibration speech (5 s per speaker) have shown significant performance improvement over the baseline speaker-independent continuous speech recognition, where the recognition system uses Gaussian mixture density based hidden Markov models of phone units. For a vocabulary size of 853 and test set perplexity of 104, the recognition word accuracy has been improved from 86.9% for the baseline system to 90.5% after adaptation, corresponding to an error reduction of 27.5%. On a more difficult test set that contains an additional variation source due to recording channel mismatch, a more significant performance improvement has been obtained: for the same vocabulary and a test set perplexity of 101, the recognition word accuracy has been improved from 65.4% for the baseline to 86.0% after adaptation, corresponding to an error reduction of 59.5%.

212 citations
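
The acoustic-normalization step can be pictured as fitting a linear transformation between feature spaces. The sketch below estimates a least-squares affine map from paired frames; the pairing of speaker frames `X` with reference frames `Y` (e.g. via alignment of the 5 s calibration speech) is assumed, and this illustrates the general idea rather than the paper's exact two-step procedure.

```python
# Sketch of acoustic normalization as an affine feature-space transform.
# X (speaker frames) and Y (paired reference-space frames) are assumed.
import numpy as np

def fit_affine(X, Y):
    """Solve min ||X_aug @ W - Y||^2 for W = [A; b] (affine map)."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return W

def normalize(W, feats):
    """Apply the fitted affine map to new frames from the same speaker."""
    feats_aug = np.hstack([feats, np.ones((len(feats), 1))])
    return feats_aug @ W
```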


Journal ArticleDOI
Jerome R. Bellegarda, P.V. de Souza, A. Nadas, David Nahamoo, Michael Picheny, Lalit R. Bahl
TL;DR: Results show that the metamorphic algorithm can substantially reduce the word error rate when only a limited amount of enrolment data is available, and can also be used for tracking spectral evolution over time, thus providing a possible means for robust speaker self-adaptation.
Abstract: Large vocabulary speaker-dependent speech recognition systems adjust to the acoustic peculiarities of each new speaker based on some enrolment data provided by this speaker. As the amount of data required increases with the sophistication of the underlying acoustic models, the enrolment may get lengthy. To streamline it, it is therefore desirable to make use of previously acquired speech data. The authors describe a data augmentation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker. This speaker-normalizing mapping is used to transform the previously acquired data of the reference speaker onto the space of the new speaker. The performance of the resulting procedure, dubbed the metamorphic algorithm, is illustrated on an isolated utterance speech recognition task with a vocabulary of 20000 words. Results show that the metamorphic algorithm can substantially reduce the word error rate when only a limited amount of enrolment data is available. Alternatively, it leads to a level of performance comparable to that obtained when a much greater amount of enrolment data is required from the new speaker. In addition, it can also be used for tracking spectral evolution over time, thus providing a possible means for robust speaker self-adaptation.

184 citations
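
A piecewise linear mapping of this kind can be sketched by partitioning the feature space and fitting one affine transform per region. Everything below is illustrative: the k-means partition, the paired frame arrays `ref_frames`/`new_frames`, and the choice of eight pieces are assumptions, not details from the paper.

```python
# Sketch of a piecewise linear speaker-normalizing map: partition the
# reference speaker's feature space, fit one affine transform per piece
# from paired frames, then map reference data onto the new speaker's
# space to augment the enrolment set. Input/output dims assumed equal.
import numpy as np
from sklearn.cluster import KMeans

def fit_piecewise(ref_frames, new_frames, n_pieces=8):
    km = KMeans(n_clusters=n_pieces).fit(ref_frames)
    maps = []
    for k in range(n_pieces):
        idx = km.labels_ == k
        X = np.hstack([ref_frames[idx], np.ones((idx.sum(), 1))])
        W, *_ = np.linalg.lstsq(X, new_frames[idx], rcond=None)
        maps.append(W)
    return km, maps

def transform(km, maps, frames):
    labels = km.predict(frames)
    out = np.empty_like(frames)
    for k, W in enumerate(maps):
        idx = labels == k
        out[idx] = np.hstack([frames[idx], np.ones((idx.sum(), 1))]) @ W
    return out
```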


Patent
21 Jan 1994
Abstract: A signal processing arrangement uses a codebook of first vector quantized speech feature signals formed responsive to a large collection of speech feature signals. The codebook is altered by combining the first speech feature signals of the codebook with second speech feature signals generated responsive to later input speech patterns during normal speech processing. A speaker recognition template can be updated in this fashion to take account of changes that may occur in the voice and speaking characteristics of a known speaker.

175 citations
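
A minimal sketch of the claimed update idea, assuming a numpy codebook and a batch of newly observed feature vectors; the blend weight `alpha` is a free parameter, not something specified by the patent.

```python
# Sketch of template updating: blend existing codewords with feature
# vectors from newly observed speech so the speaker model tracks
# gradual voice change.
import numpy as np

def update_codebook(codebook, new_feats, alpha=0.1):
    """Move each codeword toward the mean of the new frames it quantizes."""
    d = ((new_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    updated = codebook.copy()
    for k in range(len(codebook)):
        frames = new_feats[nearest == k]
        if len(frames):
            updated[k] = (1 - alpha) * codebook[k] + alpha * frames.mean(axis=0)
    return updated
```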


Proceedings Article
01 Jan 1994
TL;DR: The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences, and the system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.
Abstract: This paper presents ongoing work on a speaker independent visual speech recognition system. The work presented here builds on previous research efforts in this area and explores the potential use of simple hidden Markov models for limited vocabulary, speaker independent visual speech recognition. The task at hand is recognition of the first four English digits, a task with possible applications in car-phone dialing. The images were modeled as mixtures of independent Gaussian distributions, and the temporal dependencies were captured with standard left-to-right hidden Markov models. The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences. The system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.

168 citations
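
A sketch of the kind of model described, using the hmmlearn package: one left-to-right Gaussian HMM per digit, trained on stacked feature sequences and scored by log-likelihood at recognition time. The image-to-feature step, the state count, and the data layout are assumptions.

```python
# Sketch of left-to-right Gaussian HMM word models for a small digit
# vocabulary, using hmmlearn. Feature extraction from the mouth-image
# sequences is assumed to have happened already.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def make_left_to_right(n_states):
    model = GaussianHMM(n_components=n_states, covariance_type='diag',
                        init_params='mc', params='stmc', n_iter=20)
    model.startprob_ = np.zeros(n_states); model.startprob_[0] = 1.0
    # Self-loop or advance one state; zero entries stay zero during EM.
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    model.transmat_ = trans
    return model

def train_word_models(data, n_states=5):
    # data: dict word -> list of (frames x dim) sequences (assumed)
    models = {}
    for word, seqs in data.items():
        m = make_left_to_right(n_states)
        m.fit(np.vstack(seqs), [len(s) for s in seqs])
        models[word] = m
    return models

def recognize(models, seq):
    """Pick the word whose HMM assigns the sequence the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(seq))
```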


Proceedings ArticleDOI
19 Apr 1994
TL;DR: A new method of creating speaker-specific phoneme models is proposed: it uses speaker-independent (universal) phoneme models consisting of tied-mixture HMMs and adapts the feature space of the tied mixtures to that of the speaker through phoneme-dependent/independent iterative training.
Abstract: Speaker adaptation methods for tied-mixture-based phoneme models are investigated for text-prompted speaker recognition. For this type of speaker recognition, speaker-specific phoneme models are essential for verifying both the key text and the speaker. This paper proposes a new method of creating speaker-specific phoneme models. This method uses speaker-independent (universal) phoneme models consisting of tied-mixture HMMs and adapts the feature space of the tied mixtures to that of the speaker through phoneme-dependent/independent iterative training. Therefore, it can adapt even models of phonemes that have only a small amount of training data to the speaker. The proposed method was tested using 15 speakers' voices recorded over 10 months and achieved a speaker and text verification rate of 99.4% even when both the voices of different speakers and different texts uttered by the true speaker were to be rejected.

160 citations


PatentDOI
TL;DR: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems.
Abstract: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems. An input speech data set, whose bandwidth is higher than telephone bandwidth, is provided to a speech recognition training processor. The process performs a series of alterations to the input speech data set to obtain a modified speech data set. The modified speech data set enables the speech recognition processor to perform speech recognition on voice signals from a telephone system.

159 citations
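
One common way to realize such an alteration is to band-limit wideband training speech to the telephone passband and resample it to 8 kHz; the sketch below does this with scipy. The filter order and cutoff choices are illustrative, not taken from the patent.

```python
# Sketch of telephone-channel simulation: band-limit wideband speech
# to roughly 300-3400 Hz, then resample to the 8 kHz telephone rate.
from scipy.signal import butter, sosfiltfilt, resample_poly

def telephone_channel(speech, fs=16000):
    # 6th-order Butterworth bandpass over the nominal telephone band.
    sos = butter(6, [300, 3400], btype='bandpass', fs=fs, output='sos')
    band_limited = sosfiltfilt(sos, speech)
    return resample_poly(band_limited, up=8000, down=fs)  # down to 8 kHz
```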


Patent
13 Oct 1994
TL;DR: In this article, a system and method are presented for implementing purchasing and other transactions in an integrated multimedia communication system by utilizing an intelligent peripheral of an advanced intelligent telephone network; the peripheral includes a voice recognition module for providing control capability based on received voice signals, and a voice verification module with a storage database for individualized voice authentication templates.
Abstract: Disclosed is a system and method of implementing purchasing and other transactions in an integrated multimedia communication system by utilizing an intelligent peripheral of an advanced intelligent telephone network. The peripheral includes a voice recognition module for providing control capability based on received voice signals, and a voice verification module which includes a storage database for individualized voice authentication templates. The peripheral's processor also includes a transaction manager module for controlling interaction with external money management devices such as automated teller machines (ATMs). The service control point of the network maintains a separate database for identifying a specific voice identification template on the basis of recognition and identification of an incoming signal as associated with a specific subscriber for a terminal and/or a telephone station. The peripheral also includes an internal data communications system carrying information between the processor, the voice verification module, the transaction monitor and the signaling communications interface.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper describes techniques for segmentation of conversational speech based on speaker identity using Viterbi decoding on a hidden Markov model network consisting of interconnected speaker sub-networks.
Abstract: This paper describes techniques for segmentation of conversational speech based on speaker identity. Speaker segmentation is performed using Viterbi decoding on a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker sub-networks are initialized using Baum-Welch training on data labeled by speaker, and are iteratively retrained based on the previous segmentation. If data labeled by speaker is not available, agglomerative clustering is used to approximately segment the conversational speech according to speaker prior to Baum-Welch training. The distance measure for the clustering is a likelihood ratio in which speakers are modeled by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches accuracy using initialization with speaker labeled data.
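
The likelihood-ratio distance with single-Gaussian segment models can be written down compactly; the sketch below is a generic generalized-likelihood-ratio (GLR) distance under that assumption, without the paper's duration-model bias.

```python
# Sketch of a GLR distance for agglomerative speaker clustering: model
# each segment, and their union, with one full-covariance Gaussian and
# compare log-likelihoods. x and y are (frames x dim) arrays (assumed).
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of frames under the ML Gaussian fit to them."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    diff = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi)).sum()

def glr_distance(x, y):
    """Large when x and y are better explained by two speakers than one."""
    return gaussian_loglik(x) + gaussian_loglik(y) - gaussian_loglik(np.vstack([x, y]))
```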

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper compares the performance achievable by different hidden Markov model (HMM) based wordspotting techniques when their parameters are tuned to optimize recognition and rejection rates; after optimization, the proposed version performs at least as well as the other methods.
Abstract: This paper compares the performance that can be achieved by different hidden Markov model (HMM) based wordspotting techniques when their parameters are tuned to optimize recognition and rejection rates. An alternative approach which does not attempt to explicitly model extraneous speech or non-speech noise is also proposed. After optimization of each of these approaches, the proposed version performs at least as well as the other methods, with the advantages of simplicity and the possibility of use in hybrid models combining HMMs with a multilayer perceptron (MLP). Test results are reported on a speaker-independent telephone database containing 10 keywords, as well as on the speaker-independent ARPA resource management database, in which between 10 and 250 keywords were defined.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: The ranking of phonemes is found to be largely unaffected by changes in experimental parameters such as the model size, the feature type and the speaker population.
Abstract: The aim of the study described in this paper is to provide a thorough and quantitative assessment of the relative speaker discriminating properties of phonemes. A VQ codebook based approach to speaker modeling is used in conjunction with a phonetically hand-labelled database to produce phoneme rankings based on speaker verification scores. In broad groupings the nasals and vowels are found to provide the best speaker recognition performance, followed by the fricatives, affricates and approximants, with the stops providing the worst performance of all. A comparison at the individual phoneme level produces a more detailed ranking and of particular interest is the surprisingly good performance of the unvoiced fricative /s/. The ranking of phonemes is found to be largely unaffected by changes in experimental parameters such as the model size, the feature type and the speaker population.
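
A hypothetical sketch of the scoring loop behind such a ranking: score each phoneme class separately against claimant and background VQ codebooks, then rank phonemes by how well those scores separate true speakers from impostors. The `frames_by_phone` mapping and the `distortion` helper (as in the earlier VQ sketch) are assumed.

```python
# Sketch of per-phoneme speaker verification scoring. frames_by_phone
# maps a phoneme label to that utterance's frames for the phoneme;
# distortion(codebook, frames) is full-search VQ distortion (assumed).
def phoneme_scores(claimant_cb, background_cb, frames_by_phone, distortion):
    scores = {}
    for phone, frames in frames_by_phone.items():
        # Lower claimant distortion relative to background => better match.
        scores[phone] = (distortion(background_cb, frames)
                         - distortion(claimant_cb, frames))
    # Phonemes whose scores best separate true speakers from impostors
    # across many trials would rank highest.
    return scores
```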

Journal ArticleDOI
TL;DR: This paper compares a VQ (vector quantization)-distortion-based speaker recognition method and discrete/continuous ergodic HMM (hidden Markov model)-based ones, especially from the viewpoint of robustness against utterance variations, and shows that a continuous ergodic HMM is as robust as a VQ-distortion method when enough data is available.
Abstract: This paper compares a VQ (vector quantization)-distortion-based speaker recognition method and discrete/continuous ergodic HMM (hidden Markov model)-based ones, especially from the viewpoint of robustness against utterance variations. The authors show that a continuous ergodic HMM is as robust as a VQ-distortion method when enough data is available and that a continuous ergodic HMM is far superior to a discrete ergodic HMM. They also show that the information on transitions between different states is ineffective for text-independent speaker recognition. Therefore, the speaker recognition rates using a continuous ergodic HMM are strongly correlated with the total number of mixtures irrespective of the number of states.

Book
01 Nov 1994
TL;DR: In the field of speech recognition, a qualitative change in the state of the art has emerged that promises to bring speech recognition capabilities within the reach of anyone with access to a workstation.
Abstract: In the past decade, tremendous advances in the state of the art of automatic speech recognition by machine have taken place. A reduction in the word error rate by more than a factor of 5 and an increase in recognition speeds by several orders of magnitude (brought about by a combination of faster recognition search algorithms and more powerful computers) have combined to make high-accuracy, speaker-independent, continuous speech recognition for large vocabularies possible in real time, on off-the-shelf workstations, without the aid of special hardware. These advances promise to make speech recognition technology readily available to the general public. This paper focuses on the speech recognition advances made through better speech modeling techniques, chiefly through more accurate mathematical modeling of speech sounds.

More and more, speech recognition technology is making its way from the laboratory to real-world applications. Recently, a qualitative change in the state of the art has emerged that promises to bring speech recognition capabilities within the reach of anyone with access to a workstation. High-accuracy, real-time, speaker-independent, continuous speech recognition for medium-sized vocabularies (a few thousand words) is now possible in software on off-the-shelf workstations. Users will be able to tailor recognition capabilities to their own applications. Such software-based, real-time solutions usher in a whole new era in the development and utility of speech recognition technology. As is often the case in technology, a paradigm shift occurs when several developments converge to make a new capability possible. In the case of continuous speech recognition, the following advances have converged to make the new technology possible:

* higher-accuracy continuous speech recognition, based on better speech modeling techniques;
* better recognition search strategies that reduce the time needed for high-accuracy recognition; and
* the increased power of audio-capable, off-the-shelf workstations.

The paradigm shift is taking place in the way we view and use speech recognition. Rather than being mostly a laboratory endeavor, speech recognition is fast becoming a technology that is pervasive and will have a profound influence on the way humans communicate with machines and with each other.

This paper focuses on speech modeling advances in continuous speech recognition, with an exposition of hidden Markov models (HMMs), the mathematical backbone behind these advances. While knowledge of the properties of the speech signal and of speech perception has always played a role, recent improvements have relied largely on solid mathematical and probabilistic modeling methods, especially the use of HMMs for modeling speech sounds. These methods are capable of modeling time and spectral variability simultaneously, and the model parameters can be estimated automatically from given training speech data. The traditional processes of segmentation and labeling of speech sounds are now merged into a single probabilistic process that can optimize recognition accuracy.

This paper describes the speech recognition process and provides typical recognition accuracy figures obtained in laboratory tests as a function of vocabulary, speaker dependence, grammar complexity, and the amount of speech used in training the system. As a result of modeling advances, recognition error rates have dropped severalfold. Important to these improvements have been the availability of common speech corpora for training and testing purposes and the adoption of standard testing procedures. We will argue that future advances in speech recognition must continue to rely on finding better ways to incorporate our speech knowledge into advanced mathematical models, with an emphasis on methods that are robust to speaker variability, noise, and other acoustic distortions.

THE SPEECH RECOGNITION PROBLEM

Automatic speech recognition can be viewed as a mapping from a continuous-time signal, the speech signal, to a sequence of discrete entities, for example phonemes (or speech sounds), words, and sentences. The major obstacle to high-accuracy recognition is the large variability in the speech signal characteristics. This variability has three main components: linguistic variability, speaker variability, and channel variability. Linguistic variability includes the effects of phonetics, phonology, syntax, semantics, and discourse on the speech signal. Speaker variability includes intra- and interspeaker variability, including the effects of coarticulation, that is, the effects of neighboring sounds on the acoustic realization of a particular phoneme due to continuity and motion constraints on the human articulatory apparatus. Channel variability includes the effects of background noise and the transmission channel (e.g., microphone, telephone, reverberation). All these variabilities tend to shroud the intended message with layers of uncertainty, which must be unraveled by the recognition process. This paper will focus on modeling linguistic and speaker variabilities for the speech recognition problem.

Units of Speech. To gain an appreciation of what modeling is required to perform recognition, we shall use as an example the phrase "grey whales," whose speech signal is shown at the bottom of Fig. 1 with the corresponding spectrogram (or voice print) shown immediately above. The spectrogram shows the result of a frequency analysis of the speech, with the dark bands representing resonances of the vocal tract. At the top of Fig. 1 are the two words "grey" and "whales," which are the desired output of the recognition system. The first thing to note is that the speech signal and the spectrogram show no separation …
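
Since HMMs are the mathematical backbone of this exposition, a textbook sketch of the forward algorithm, which computes the likelihood an HMM assigns to an observation sequence, may be a useful anchor; it is standard material, not code from the paper.

```python
# Textbook forward algorithm in the log domain for numerical stability.
import numpy as np

def forward_loglik(startprob, transmat, frame_loglik):
    """log P(observations | model) by the forward recursion.

    startprob: (S,) initial state probabilities
    transmat:  (S, S) state transition probabilities
    frame_loglik: (T, S) per-frame, per-state emission log-likelihoods
    """
    log_alpha = np.log(np.clip(startprob, 1e-300, None)) + frame_loglik[0]
    for t in range(1, len(frame_loglik)):
        m = log_alpha.max()  # log-sum-exp over predecessor states
        probs = np.exp(log_alpha - m) @ transmat
        log_alpha = m + np.log(np.clip(probs, 1e-300, None)) + frame_loglik[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```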

Proceedings ArticleDOI
19 Apr 1994
TL;DR: A systematic study of the relative effectiveness of different methods for seeding and training HMMs in a new language, using transfer from English to Japanese for small vocabulary speaker independent continuous speech recognition as a test case, found that cross-language adaptation produced better models than alternative approaches with relatively little effort.
Abstract: The feasibility of cross-language transfer of speech technology is of increasing concern as the demand for recognition systems in multiple languages grows. The paper presents a systematic study of the relative effectiveness of different methods for seeding and training HMMs in a new language, using transfer from English to Japanese for small vocabulary speaker independent continuous speech recognition as a test case. Effects of limited training data are also explored. The study found that cross-language adaptation produced better models than alternative approaches with relatively little effort, and that the number of speakers is more critical than the number of utterances for small training data sets.

Proceedings ArticleDOI
31 Oct 1994
TL;DR: A continuous optical automatic speech recognizer is described that uses optical information from the oral-cavity shadow of a speaker; it achieves a 25.3 percent recognition rate on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides.
Abstract: We describe a continuous optical automatic speech recognizer (OASR) that uses optical information from the oral-cavity shadow of a speaker. The system achieves a 25.3 percent recognition rate on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides. We introduce 13 mostly dynamic oral-cavity features used for optical recognition, present phones that appear optically similar (visemes) for our speaker, and present the recognition results for our hidden Markov models (HMMs) using visemes, trisemes, and generalized trisemes. We conclude that future research is warranted for optical recognition, especially when combined with other input modalities.


Patent
12 Apr 1994
TL;DR: In this paper, a method is presented for segmenting audio data comprising speech from a plurality of individual speakers, according to speaker; it provides individual HMMs for each speaker, each including at least one state, and constructs a speaker network HMM by connecting the individual HMMs in parallel.
Abstract: A method for segmenting audio data, comprising speech from a plurality of individual speakers, according to speaker is provided. The method comprises providing individual HMMs for each individual speaker, each individual HMM including at least one state, and constructing a speaker network HMM by connecting the individual HMMs in parallel. The audio data is then divided into segments by determining a most likely sequence of states through the speaker network HMM, each of the segments being associated with one of the individual HMMs. Afterward, the speaker of each of the segments is identified. The segmented data may be used to form an index into the audio data according to speaker.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: A tree-structured speaker clustering algorithm is proposed that employs successive branch selection in the speaker clustering tree rather than parameter training, and hence achieves fast adaptation using only a small amount of training data.
Abstract: The paper proposes a tree-structured speaker clustering algorithm and discusses its application to fast speaker adaptation. By tracing the clustering tree from top to bottom, adaptation is performed step-by-step from global to local individuality of speech. This adaptation method employs successive branch selection in the speaker clustering tree rather than parameter training and hence achieves fast adaptation using only a small amount of training data. This speaker adaptation method was applied to a hidden Markov network (HMnet) and evaluated in Japanese phoneme and phrase recognition experiments, in which it significantly outperformed speaker-independent recognition methods. In the phrase recognition experiments, the method reduced the error rate by 26.6% using three phrase utterances (approximately 2.7 seconds).
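
The branch-selection idea reduces to a short traversal. The sketch below assumes a hypothetical node structure (`children`, `model.score`) and simply descends from the root toward the child cluster whose model best fits the adaptation data.

```python
# Sketch of adaptation by branch selection in a speaker clustering
# tree. Node layout and the score() interface are assumptions.
def select_cluster(root, adaptation_feats):
    node = root
    while node.children:  # each node holds a model for its speaker cluster
        # Descend toward the child whose cluster model best explains
        # the small amount of adaptation data.
        node = max(node.children,
                   key=lambda child: child.model.score(adaptation_feats))
    return node.model  # most specific cluster consistent with the data
```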

Proceedings ArticleDOI
Stephan Euler, J. Zinke
19 Apr 1994
TL;DR: The authors use a Gaussian classifier to estimate the coding condition of a test utterance; the combination of this classifier and coder-specific word models yields high overall recognition performance.
Abstract: This paper examines the influence of different coders in the range from 64 kbit/sec to 4.8 kbit/sec on both a speaker-independent isolated word recognizer and a speaker verification system. Applying systems trained on 64 kbit/sec speech to, for example, the 4.8 kbit/sec data increases the error rate of the word recognizer by a factor of three. For rates below 13 kbit/sec, the speaker verification is more affected than the word recognition. The performance improves significantly if word models are provided for the individual coding conditions. Therefore, the authors use a Gaussian classifier to estimate the coding condition of a test utterance. The combination of this classifier and coder-specific word models yields a high overall recognition performance.
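
The coding-condition detector can be sketched as one Gaussian per coder over utterance-level features, with a maximum-likelihood decision; the feature representation and the `feats_by_coder` training dictionary are assumptions.

```python
# Sketch of a Gaussian coding-condition classifier: fit one Gaussian
# per coder condition, then pick the condition with the highest
# log-likelihood so the matching coder-specific word models can be used.
import numpy as np
from scipy.stats import multivariate_normal

def train_coder_classifier(feats_by_coder):
    # feats_by_coder: dict coder -> (examples x dim) array (assumed)
    return {c: multivariate_normal(mean=f.mean(axis=0),
                                   cov=np.cov(f, rowvar=False))
            for c, f in feats_by_coder.items()}

def detect_coder(classifier, utterance_feat):
    return max(classifier, key=lambda c: classifier[c].logpdf(utterance_feat))
```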

Patent
24 Jan 1994
TL;DR: In this article, a computerized method is provided for enrollment in a computerized system and for verification of the location of employees; using a predetermined set of instructions, each employee telephones a computer which, under the control of a program and a voice recognition and speech identification device, derives a voice print of the employee.
Abstract: A computerized method is provided for enrollment in a computerized system and for verification of the location of employees. Each employee, using a predetermined set of instructions, telephones a computer which, under the control of a program and a voice recognition and speech identification device, derives a voice print of the employee. When the employee is sent to a location, the ANI and voice print, which are in the computer's database, are verified. If the correct telephone is used and the voice print matches, the time and place of the telephone call and the employee are recorded for later use by the employer.

Journal ArticleDOI
TL;DR: The authors have found that using more relaxed decoding constraints in preparing N-best hypotheses yields better recognition results, and a new frame-level loss function is minimized to improve the separation between the correct and incorrect hypotheses.
Abstract: The authors propose an N-best candidates-based discriminative training procedure for constructing high-performance HMM speech recognizers. The algorithm has two distinct features: N-best hypotheses are used for training discriminative models, and a new frame-level loss function is minimized to improve the separation between the correct and incorrect hypotheses. The N-best candidates are decoded based on their recently proposed tree-trellis fast search algorithm. The new frame-level loss function, which is defined as a halfwave-rectified log-likelihood difference between the correct and competing hypotheses, is minimized over all training tokens. The minimization is carried out by adjusting the HMM parameters along a gradient descent direction. Two speech recognition applications have been tested: speaker-independent, small-vocabulary (ten Mandarin Chinese digits) continuous speech recognition, and speaker-trained, large-vocabulary (5000 commonly used Chinese words) isolated word recognition. Significant performance improvement over the traditional maximum-likelihood trained HMMs has been obtained. In the connected Chinese digit recognition experiment, the string error rate is reduced from 17.0 to 10.8% for unknown-length decoding and from 8.2 to 5.2% for known-length decoding. In the large vocabulary, isolated word recognition experiment, the recognition error rate is reduced from 7.2 to 3.8%. Additionally, the authors have found that using more relaxed decoding constraints in preparing N-best hypotheses yields better recognition results.
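
The frame-level loss described above has a direct one-line form: a half-wave rectified difference between competing and correct frame log-likelihoods, summed over frames. The sketch assumes aligned per-frame log-likelihood arrays.

```python
# Sketch of the half-wave rectified frame-level loss: only frames where
# a competing hypothesis beats the correct one contribute to the loss.
import numpy as np

def frame_level_loss(loglik_correct, loglik_competing):
    """Sum over frames of max(0, competing - correct) log-likelihoods."""
    return np.maximum(0.0, loglik_competing - loglik_correct).sum()
```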

Proceedings ArticleDOI
19 Apr 1994
TL;DR: The experimental results show the effectiveness of using speaker-specific information in the higher frequency band to improve speaker recognition performance, and a new distance measure is proposed that combines the lower and higher frequency bands.
Abstract: In previous studies, the speaker-specific information in the higher frequency band has been almost neglected. Therefore, we investigate the speaker characteristics in the higher frequency band and the effects of using them for speaker recognition. This paper presents the results of speaker identification experiments performed in a text-dependent mode using various frequency bands. Moreover, we propose a new distance measure combining the lower and the higher frequency bands. The experimental results show the effectiveness of using the speaker-specific information in the higher frequency band to improve speaker recognition performance.

Journal ArticleDOI
TL;DR: A method to classify unvoiced sounds using adaptive wavelets, which would help in developing a unified algorithm to classify phonemes (speech sounds), and a method to identify speakers using very short speech data (one pitch period) are described.
Abstract: Our objective is to demonstrate the applicability of adaptive wavelets for speech applications. In particular, we discuss two applications, namely, classification of unvoiced sounds and speaker identification. First, a method to classify unvoiced sounds using adaptive wavelets, which would help in developing a unified algorithm to classify phonemes (speech sounds), is described. Next, the applicability of adaptive wavelets to identify speakers using very short speech data (one pitch period) is exhibited. The described text-independent phoneme based speaker identification algorithm identifies a speaker by first modeling phonemes and then by clustering all the phonemes belonging to the same speaker into one class. For both applications, we use feed-forward neural network architecture. We demonstrate the performance of both unvoiced sounds classifier and speaker identification algorithms by using representative real speech examples.

Proceedings Article
01 Jan 1994
TL;DR: An algorithm is presented for constructing models that attempt to capture the variation that occurs in the pronunciations of words in spontaneous speech; it improves the performance of both the speech recognition and speech understanding components of the BeRP system.
Abstract: One of the sources of difficulty in speech recognition and understanding is the variability due to alternate pronunciations of words. To address this issue we have investigated the use of multiple-pronunciation models (MPMs) in the decoding stage of a speaker-independent speech understanding system. In this paper we address three important issues regarding MPMs: (a) Model construction: How can MPMs be built from available data without human intervention? (b) Model embedding: How should MPM construction interact with the training of the sub-word unit models on which they are based? (c) Utility: Do they help in speech recognition? Automatic, data-driven MPM construction is accomplished using a structural HMM induction algorithm. The resulting MPMs are trained jointly with a multi-layer perceptron functioning as a phonetic likelihood estimator. The experiments reported here demonstrate that MPMs can significantly improve speech recognition results over standard single-pronunciation models.

Proceedings ArticleDOI
31 Oct 1994
TL;DR: An algorithm for modeling the shape of the mouth, and extracting meaningful dimensions for use by automatic lipreading systems is described, and the recognition system achieved 85% accuracy on a two word discrimination task.
Abstract: In this paper, we describe an algorithm for modeling the shape of the mouth, and extracting meaningful dimensions for use by automatic lipreading systems. One advantage of this technique lies in the ability to normalize the model to compensate for scale and rotation. An error function is defined which relates the model to the image, and minimization of the error yields the best fit model. This is similar to deformable templates, but we attempt to perform the minimization in closed form. Visual only recognition was performed with features extracted from the model, and the recognition system achieved 85% accuracy on a two word discrimination task.

PatentDOI
TL;DR: In this article, an individual's speech sample, obtained for the purpose of speaker verification, is used to create a "protected" model of the speech, which is stored in a database in association with a personal identifier for that individual.
Abstract: An individual's speech sample, obtained for the purpose of speaker verification, is used to create a "protected" model of the speech. The protected model, which is stored in a database in association with a personal identifier for that individual, is arranged so that the characteristics of the individual's speech cannot be ascertained from the protected model, without access to an encryption key or other secured information stored in the system.

Journal ArticleDOI
TL;DR: This approach differs from the traditional maximum likelihood based approach in that the objective of the recognition error rate minimization is established through a specially designed loss function, and is not based on the assumptions made about the speech generation process.
Abstract: In this paper, a minimum error rate pattern recognition approach to speech recognition is studied with particular emphasis on the speech recognizer designs based on hidden Markov models (HMMs) and Viterbi decoding. This approach differs from the traditional maximum likelihood based approach in that the objective of the recognition error rate minimization is established through a specially designed loss function, and is not based on the assumptions made about the speech generation process. Various theoretical and practical issues concerning this minimum error rate pattern recognition approach in speech recognition are investigated. The formulation and the algorithmic structures of several minimum error rate training algorithms for an HMM-based speech recognizer are discussed. The tree-trellis based N-best decoding method and a robust speech recognition scheme based on the combined string models are described. This approach can be applied to large vocabulary, continuous speech recognition tasks and to speech recognizers using word or subword based speech recognition units. Various experimental results have shown that significant error rate reduction can be achieved through the proposed approach.
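
A commonly cited form of this loss-function construction, from the minimum classification error literature this work belongs to (the notation here is illustrative, not copied from the paper), defines for class or string model $i$ a discriminant $g_i(x;\Lambda)$ over the HMM parameters $\Lambda$, a misclassification measure, and a smoothed zero-one loss:

```latex
% Misclassification measure: positive when some competing class wins.
\[
  d_i(x;\Lambda) \;=\; -\,g_i(x;\Lambda)
  \;+\; \frac{1}{\eta}\,
  \log\!\Bigg[\frac{1}{N-1}\sum_{j \neq i} e^{\,\eta\, g_j(x;\Lambda)}\Bigg]
\]
% Smooth 0-1 loss: a sigmoid of the misclassification measure.
\[
  \ell_i(x;\Lambda) \;=\; \frac{1}{1 + e^{-\gamma\, d_i(x;\Lambda)}}
\]
```

Minimizing the expected value of $\ell_i$ over the training set by gradient descent on $\Lambda$ approximates direct minimization of the recognition error rate, which is the sense in which the objective is established through a specially designed loss function rather than through likelihood assumptions.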