
Showing papers on "Word error rate published in 1998"


ReportDOI
01 Dec 1998
TL;DR: An SVM-based face recognition algorithm that learns a similarity metric between faces from examples of differences between faces, and is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database.
Abstract: Face recognition is a K class problem, where K is the number of known individuals, and support vector machines (SVMs) are a binary classification method. By reformulating the face recognition problem and reinterpreting the output of the SVM classifier, we developed an SVM-based face recognition algorithm. The face recognition problem is formulated as a problem in difference space, which models dissimilarities between two facial images. In difference space we formulate face recognition as a two-class problem. The classes are: dissimilarities between faces of the same person, and dissimilarities between faces of different people. By modifying the interpretation of the decision surface generated by the SVM, we generated a similarity metric between faces that is learned from examples of differences between faces. The SVM-based algorithm is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database. Performance was measured for both verification and identification scenarios. The identification performance for SVM is 77-78% versus 54% for PCA. For verification, the equal error rate is 7% for SVM and 13% for PCA.
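
As a rough illustration of the difference-space reformulation, here is a minimal sketch using scikit-learn. The data, feature vectors, and pairing scheme are invented stand-ins rather than the paper's FERET pipeline; only the structure (difference vectors, a binary SVM, the decision value reused as a similarity score) follows the description above.

```python
# Hedged sketch: difference-space face verification with a binary SVM.
# Faces are toy feature vectors here; a real system would use image
# features. Same-person differences are class +1, different-person
# differences are class -1.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_pairs(faces_by_person):
    X, y = [], []
    people = list(faces_by_person)
    for p in people:
        imgs = faces_by_person[p]
        for i in range(len(imgs) - 1):
            X.append(imgs[i] - imgs[i + 1])   # within-person difference
            y.append(+1)
    for i in range(len(people) - 1):
        a = faces_by_person[people[i]][0]
        b = faces_by_person[people[i + 1]][0]
        X.append(a - b)                       # between-person difference
        y.append(-1)
    return np.array(X), np.array(y)

# Toy data: 10 "people", 4 images each, 64-dimensional features.
faces = {p: [rng.normal(p, 1.0, 64) for _ in range(4)] for p in range(10)}
X, y = make_pairs(faces)
svm = SVC(kernel="rbf").fit(X, y)

# Verification: the signed decision value on a difference vector acts as
# a learned similarity score between two faces (threshold it to accept).
probe, gallery = faces[3][0], faces[3][1]
print("similarity score:", svm.decision_function([probe - gallery])[0])
```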

412 citations


Journal ArticleDOI
Olli Viikki, Kari Laurila
TL;DR: A segmental feature vector normalization technique is proposed which makes an automatic speech recognition system more robust to environmental changes by normalizing the output of the signal-processing front-end to have similar segmental parameter statistics in all noise conditions.

405 citations


Journal ArticleDOI
TL;DR: An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results have shown that frequency warping is consistently able to reduce word error rate by 20% even for very short utterances.
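
A minimal sketch of the filterbank-side warping mechanism described above, assuming a standard triangular mel filterbank. The sample rate, filter count, and warping values are illustrative, and the maximum-likelihood search over warping factors is only indicated in a comment.

```python
# Illustrative sketch: linear frequency warping folded into the mel
# filterbank, in the spirit of the paper. The likelihood search over
# warping factors and the recognizer itself are omitted.
import numpy as np

def mel(f):      return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_filterbank(n_filters=24, n_fft=512, sr=8000, alpha=1.0):
    """Triangular mel filterbank whose linear-frequency edges are scaled
    by the warping factor alpha (alpha > 1 stretches the spectrum)."""
    edges_mel = np.linspace(mel(0), mel(sr / 2), n_filters + 2)
    edges_hz = np.clip(inv_mel(edges_mel) * alpha, 0, sr / 2)  # warp here
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

# A speaker-specific alpha would be chosen by maximizing the acoustic
# likelihood of that speaker's data over a small grid, e.g. 0.88..1.12.
for alpha in (0.9, 1.0, 1.1):
    print(alpha, warped_filterbank(alpha=alpha).shape)
```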

338 citations


Journal ArticleDOI
TL;DR: It is shown that a Markov approximation for the block error process is a good model for a broad range of parameters, and the relationship between the marginal error rate and the transition probability is largely insensitive to parameters such as block length, degree of forward error correction and modulation format.
Abstract: We investigate the behavior of block errors which arise in data transmission on fading channels. Our approach takes into account the details of the specific coding/modulation scheme and tracks the fading process symbol by symbol. It is shown that a Markov approximation for the block error process (possibly degenerating into an independent and identically distributed (i.i.d.) process for sufficiently fast fading) is a good model for a broad range of parameters. Also, it is observed that the relationship between the marginal error rate and the transition probability is largely insensitive to parameters such as block length, degree of forward error correction and modulation format, and depends essentially on an appropriately normalized version of the Doppler frequency. This relationship can therefore be computed in the simple case of a threshold model and then used more generally as an accurate approximation. This observation leads to a unified approach for the channel modeling, and to a simplified performance analysis of upper layer protocols.
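
The proposed Markov approximation is easy to state concretely: a two-state chain whose stationary error probability matches the marginal block error rate. A small sketch with invented parameters (the paper derives them from the fading statistics):

```python
# Two-state Markov model for the block error process: state 1 = block
# in error, state 0 = block correct. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)

eps = 0.1    # marginal block error rate P(error)
p11 = 0.5    # P(error | previous block in error) -- burstiness knob
# Stationarity fixes the other transition probability:
# eps = eps*p11 + (1-eps)*p01  =>  p01 = eps*(1-p11)/(1-eps)
p01 = eps * (1 - p11) / (1 - eps)

def simulate(n):
    errs = np.zeros(n, dtype=int)
    for t in range(1, n):
        p = p11 if errs[t - 1] else p01
        errs[t] = rng.random() < p
    return errs

e = simulate(200_000)
print("empirical error rate:", e.mean())                  # ~ eps
print("P(err | prev err):  ", e[1:][e[:-1] == 1].mean())  # ~ p11
```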

277 citations


Proceedings Article
01 Jan 1998
TL;DR: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments that uses the spectral entropy to identify the speech segments accurately.
Abstract: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments. Instead of using the conventional energy-based features, the spectral entropy is developed to identify the speech segments accurately. Experimental results show that this algorithm outperforms the energy-based algorithms in both detection accuracy and recognition performance under noisy environments, with an average error rate reduction of more than 16%.
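
A toy sketch of the spectral entropy feature, assuming frame-wise FFT magnitudes normalized into a probability distribution per frame. The synthetic signals and the threshold rule are illustrative, not the paper's algorithm in detail.

```python
# Sketch: speech frames have structured (low-entropy) spectra, noise
# frames have flat (high-entropy) spectra; endpointing thresholds the
# per-frame spectral entropy. Signals and threshold are invented.
import numpy as np

def spectral_entropy(frames, n_fft=256, eps=1e-12):
    """frames: (n_frames, frame_len) array of windowed samples."""
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    p = spec / (spec.sum(axis=1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(size=(50, 200))                    # flat spectrum
t = np.arange(200)
tone = np.sin(0.3 * t) + 0.5 * np.sin(0.9 * t)        # harmonic-ish
speechy = np.tile(tone, (50, 1)) + 0.1 * rng.normal(size=(50, 200))

h_noise, h_speech = spectral_entropy(noise), spectral_entropy(speechy)
print("mean entropy, noise :", h_noise.mean())
print("mean entropy, speech:", h_speech.mean())
# Flag frames whose entropy falls below a noise-derived threshold:
thresh = h_noise.mean() - 0.5 * h_noise.std()
print("frames flagged as speech:", int((h_speech < thresh).sum()))
```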

221 citations


Patent
Stefan Ott
22 May 1998
TL;DR: In this article, the authors proposed a dynamic error correction system for a bi-directional digital data transmission system, where a receiver receives the signal and decodes the information encoded thereon.
Abstract: A dynamic error correction system for a bi-directional digital data transmission system. The transmission system of the present invention includes a transmitter adapted to encode information into a signal. A receiver receives the signal and decodes the information encoded thereon. The signal is transmitted from the transmitter to the receiver via a communications channel. A signal quality/error rate detector is coupled to the receiver and is adapted to detect a signal quality and/or an error rate in the information transmitted from the transmitter. The receiver is adapted to implement at least a first and second error correction process, depending upon the detected signal quality/error rate. The first error correction process is more robust and more capable than the second error correction process. The receiver coordinates the implemented error correction process with the transmitter via a feedback channel. The receiver dynamically selects the first or second error correction process for implementation in response to the detected signal quality/error rate and coordinates the selection with the transmitter such that error correction employed by the receiver and transmitter is tailored to the condition of the communications channel.
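
A toy rendering of the selection logic in the claim: the receiver tracks an error-rate estimate and, with a little hysteresis, chooses which of the two error correction processes to coordinate with the transmitter over the feedback channel. All thresholds and names here are invented for illustration.

```python
# Hedged sketch of dynamic FEC selection with hysteresis; thresholds
# are illustrative, not from the patent.
class FecController:
    ROBUST, LIGHT = "robust", "light"

    def __init__(self, up=0.02, down=0.005):
        self.up, self.down = up, down   # hysteresis thresholds
        self.mode = self.LIGHT

    def update(self, error_rate):
        if self.mode == self.LIGHT and error_rate > self.up:
            self.mode = self.ROBUST     # channel degraded: step up coding
        elif self.mode == self.ROBUST and error_rate < self.down:
            self.mode = self.LIGHT      # channel recovered: step down
        return self.mode                # sent back on the feedback channel

ctl = FecController()
for er in (0.001, 0.03, 0.01, 0.004):
    print(f"error rate {er:.3f} -> {ctl.update(er)}")
```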

160 citations


Journal ArticleDOI
TL;DR: This work addresses the problem of determining what size test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate, providing a statistical analysis showing that if, for example, the expected character error rate is around 1 percent, then a test set of at least 10,000 statistically independent handwritten characters guarantees statistically significant results.
Abstract: We address the problem of determining what size test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate. We provide a statistical analysis showing that if, for example, the expected character error rate is around 1 percent, then, with a test set of at least 10,000 statistically independent handwritten characters (which could be obtained by taking 100 characters from each of 100 different writers), we guarantee, with 95 percent confidence, that: (1) the expected value of the character error rate is not worse than 1.25 E, where E is the empirical character error rate of the best recognizer, calculated on the test set; and (2) a difference of 0.3 E between the error rates of two recognizers is significant. We developed this framework with character recognition applications in mind, but it applies as well to speech recognition and to other pattern recognition problems.
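
The flavor of the analysis can be checked with a back-of-the-envelope normal approximation to the binomial; the numbers below are a rough consistency check, not the paper's exact derivation.

```python
# Rough check: with N = 10,000 characters and a true error rate near
# p = 1%, the 95% interval on the empirical rate is about p +/- 0.2%,
# consistent in magnitude with the paper's significance claims.
import math

p, N, z = 0.01, 10_000, 1.96        # error rate, test size, 95% z-value
half_width = z * math.sqrt(p * (1 - p) / N)
print(f"95% interval: {p:.4f} +/- {half_width:.4f}")  # ~ 0.0100 +/- 0.0020
print(f"relative slack: {half_width / p:.2f} * E")    # ~ 0.20 * E
```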

157 citations


Journal ArticleDOI
TL;DR: The two-stage CHAM model (Computer-elicited Hyperarticulate Adaptation Model) is proposed to account for changes in users' speech during interactive error resolution.

142 citations


Proceedings ArticleDOI
Peter Beyerlein
12 May 1998
TL;DR: Experimental results show that the accuracy of a large vocabulary continuous speech recognition system can be increased by a discriminative model combination, due to a better exploitation of the given acoustic and language models.
Abstract: Discriminative model combination is a new approach in the field of automatic speech recognition, which aims at an optimal integration of all given (acoustic and language) models into one log-linear posterior probability distribution. As opposed to the maximum entropy approach, the coefficients of the log-linear combination are optimized on training samples using discriminative methods to obtain an optimal classifier. Three methods are discussed to find coefficients which minimize the empirical word error rate on given training data: the well-known generalised probabilistic descent (GPD) based minimum error rate training leading to an iterative optimization scheme; a minimization of the mean distance between the discriminant function of the log-linear posterior probability distribution and an "ideal" discriminant function; and a minimization of a smoothed error count measure, where the smoothing function is a parabola. The latter two methods lead to closed-form solutions for the coefficients of the model combination. Experimental results show that the accuracy of a large vocabulary continuous speech recognition system can be increased by a discriminative model combination, due to a better exploitation of the given acoustic and language models.
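
The log-linear combination itself is compact enough to sketch; the scores and weights below are invented, and the discriminative training of the weights (the heart of the paper) is omitted.

```python
# Sketch of the log-linear combination: each knowledge source
# contributes a log-score, and the class posterior is a weighted
# softmax. The weights lambda would be trained to minimize word error;
# here they are fixed for display.
import numpy as np

def log_linear_posterior(log_scores, lam):
    """log_scores: (n_models, n_classes); lam: (n_models,) weights."""
    combined = lam @ log_scores          # weighted sum of log-scores
    combined -= combined.max()           # numerical stabilization
    post = np.exp(combined)
    return post / post.sum()

# Two models (e.g. acoustic and language) scoring three hypotheses:
log_scores = np.array([[-12.0, -10.5, -11.0],   # acoustic log-likelihoods
                       [ -3.2,  -4.0,  -2.9]])  # language model log-probs
for lam in (np.array([1.0, 1.0]), np.array([1.0, 3.0])):
    print(lam, log_linear_posterior(log_scores, lam))
```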

128 citations


Proceedings ArticleDOI
01 Jan 1998
TL;DR: This work compares the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists.
Abstract: Including information distributed over intervals of syllabic duration (100-250 ms) may greatly improve the performance of automatic speech recognition (ASR) systems. ASR systems primarily use representations and recognition units covering phonetic durations (40-100 ms). Humans certainly use information at phonetic time scales, but results from psychoacoustics and psycholinguistics highlight the crucial role of the syllable, and syllable-length intervals, in speech perception. We compare the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists. Using the combined recognition system, we observed an improvement in word error rate for telephone-bandwidth, continuous numbers from 6.8% to 5.5% on a clean test set, and from 27.8% to 19.6% on a reverberant test set, over the baseline phone-based system.

128 citations


Proceedings ArticleDOI
12 May 1998
TL;DR: An approach that estimates the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the speaker utterance, computed as the sum of the probabilities of all word hypotheses that represent the occurrence of the same word in more or less the same segment of time.
Abstract: Estimates of confidence for the output of a speech recognition system can be used in many practical applications of speech recognition technology. They can be employed for detecting possible errors and can help to avoid undesirable verification turns in automatic inquiry systems. We propose to estimate the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the speaker utterance. The basic idea of our approach is to estimate the posterior word probabilities as the sum of all word hypothesis probabilities which represent the occurrence of the same word in more or less the same segment of time. The word hypothesis probabilities are approximated by paths in a wordgraph and are computed using a simplified forward-backward algorithm. We present experimental results on the North American Business (NAB'94) and the German Verbmobil recognition task.
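
A toy version of the posterior computation, assuming we already have scored word hypotheses with time spans (as a word graph or N-best list would provide). The forward-backward computation over a real word graph is reduced here to a direct sum over a handful of invented hypotheses.

```python
# Sketch: the confidence of a word is the summed, normalized probability
# of all hypotheses of that word occupying roughly the same time span.
import math
from collections import defaultdict

# (word, start_frame, end_frame, log_prob) -- illustrative numbers only.
hyps = [("yes", 0, 30, -5.0), ("yes", 2, 31, -5.5),
        ("yeah", 0, 29, -6.5), ("no", 0, 28, -9.0)]

total = sum(math.exp(h[3]) for h in hyps)    # normalizer over the graph

def overlaps(a, b):
    return min(a[2], b[2]) - max(a[1], b[1]) > 0.5 * (a[2] - a[1])

conf = defaultdict(float)
for h in hyps:
    # sum probabilities of same-word hypotheses in roughly the same span
    conf[h[0]] = sum(math.exp(g[3]) for g in hyps
                     if g[0] == h[0] and overlaps(h, g)) / total

for w, c in sorted(conf.items(), key=lambda kv: -kv[1]):
    print(f"{w:5s} posterior ~ {c:.3f}")
```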

Proceedings ArticleDOI
O. Viikki, D.K. Bye, K. Laurila
12 May 1998
TL;DR: This work proposes a more efficient implementation approach for feature vector normalization in which the normalization coefficients are computed recursively; the recursive approach achieves an overall error rate reduction of over 60%, versus approximately 50% for the segmental method and 14% for parallel model combination.
Abstract: The acoustic mismatch between testing and training conditions is known to severely degrade the performance of speech recognition systems. Segmental feature vector normalization was found to improve the noise robustness of mel-frequency cepstral coefficient (MFCC) feature vectors and to outperform other state-of-the-art noise compensation techniques in speaker-dependent recognition. The objective of feature vector normalization is to provide environment-independent parameter statistics in all noise conditions. We propose a more efficient implementation approach for feature vector normalization in which the normalization coefficients are computed recursively. Speaker-dependent recognition experiments show overall error rate reductions of over 60% for the recursive normalization approach, approximately 50% for the segmental method, and 14% for parallel model combination. Moreover, in the recursive case, this performance gain is obtained with the smallest implementation costs. Also, in speaker-independent connected digit recognition, over a 16% error rate reduction is obtained with the proposed feature vector normalization approach.
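
A minimal sketch of the recursive normalization idea: running mean and variance estimates per feature component, updated with a forgetting factor and applied frame by frame. The update constant is illustrative, not the value tuned in the paper.

```python
# Hedged sketch: online mean/variance normalization of feature vectors.
import numpy as np

class RecursiveNormalizer:
    def __init__(self, dim, alpha=0.995):
        self.alpha = alpha              # forgetting factor (illustrative)
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)

    def __call__(self, x):
        a = self.alpha
        self.mean = a * self.mean + (1 - a) * x
        self.var = a * self.var + (1 - a) * (x - self.mean) ** 2
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

rng = np.random.default_rng(0)
norm = RecursiveNormalizer(dim=13)
# Feed frames whose statistics drift, as under changing noise conditions:
for t in range(2000):
    frame = rng.normal(loc=t / 500.0, scale=2.0, size=13)  # drifting mean
    out = norm(frame)
print("last frame after normalization (per-dim mean):", out.mean().round(2))
```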

Proceedings Article
01 Jan 1998
TL;DR: In context-independent classification and context-dependent recognition on the TIMIT core test set using 39 classes, the system achieved error rates that are the lowest the authors have seen reported on these tasks.
Abstract: This paper addresses the problem of acoustic phonetic modeling. First, heterogeneous acoustic measurements are chosen in order to maximize the acoustic-phonetic information extracted from the speech signal in preprocessing. Second, classifier systems are presented for successfully utilizing high-dimensional acoustic measurement spaces. The techniques used for achieving these two goals can be broadly categorized as hierarchical, committee-based, or a hybrid of these two. This paper presents committee-based and hybrid approaches. In context-independent classification and context-dependent recognition on the TIMIT core test set using 39 classes, the system achieved error rates of 18.3% and 24.4%, respectively. These error rates are the lowest we have seen reported on these tasks. In addition, experiments with a telephone-based weather information word recognition task led to word error rate reductions of 10–16%.

Journal ArticleDOI
TL;DR: The presented approach produces reliable estimates of formant frequencies across a wide range of sounds and speakers, and the estimated formant frequencies were used in a number of variants for recognition.
Abstract: This paper presents a new method for estimating formant frequencies. The formant model is based on a digital resonator. Each resonator represents a segment of the short-time power spectrum. The complete spectrum is modeled by a set of digital resonators connected in parallel. An algorithm based on dynamic programming produces both the model parameters and the segment boundaries that optimally match the spectrum. We used this method in experimental tests that were carried out on the TI digit string data base. The main results of the experimental tests are: (1) the presented approach produces reliable estimates of formant frequencies across a wide range of sounds and speakers; and (2) the estimated formant frequencies were used in a number of variants for recognition. The best set-up resulted in a string error rate of 4.2% on the adult corpus of the TI digit string data base.
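
The dynamic-programming segmentation can be illustrated with a simplified stand-in: partition a spectrum into contiguous segments and fit one trivial model per segment (a constant level, instead of the paper's digital resonator), choosing boundaries that minimize the total fit error.

```python
# Hedged sketch of DP segmentation; the per-segment model is a constant
# level rather than a resonator, and the data are synthetic.
import numpy as np

def segment_spectrum(power, n_seg):
    n = len(power)
    cum = np.concatenate([[0.0], np.cumsum(power)])
    cum2 = np.concatenate([[0.0], np.cumsum(power ** 2)])

    def sse(i, j):        # fit error of a constant over power[i:j]
        s, s2, m = cum[j] - cum[i], cum2[j] - cum2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = np.full((n_seg + 1, n + 1), INF)
    back = np.zeros((n_seg + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_seg + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i
    bounds, j = [], n                 # recover segment boundaries
    for k in range(n_seg, 0, -1):
        bounds.append(j)
        j = back[k, j]
    return sorted(bounds)

rng = np.random.default_rng(0)
spec = np.concatenate([np.full(20, 1.0), np.full(30, 5.0), np.full(14, 2.0)])
spec += 0.1 * rng.normal(size=spec.size)
print(segment_spectrum(spec, n_seg=3))   # boundaries near 20 and 50
```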

01 Jan 1998
TL;DR: A hidden Markov model is used to extract information from broadcast news, with the encouraging result that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.
Abstract: We report results using a hidden Markov model to extract information from broadcast news. IdentiFinder™ was trained on the broadcast news corpus and tested on both the 1996 HUB-4 development test data and the 1997 HUB-4 evaluation test data with respect to the named entity (NE) task: extracting

• names of locations, persons, and organizations;
• dates and times;
• monetary amounts and percentages.

Evaluation is based on automatic word alignment of the speech recognition output (the NIST algorithm) followed by the MUC-6/MUC-7 scorer for NE on text, since MUC scoring assumes identical text in the system output and in the answer key. Additionally, we used the experimental MITRE scoring metric (Burger, et al., 1998). The most encouraging result is that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.

1. MOTIVATING FACTORS

One of the reasons behind this effort is to go beyond speech transcription (e.g., beyond the dictation problem) to address (at least) shallow understanding of speech. As a result of this effort, we believe that evaluating named entity (NE) extraction from speech offers a measure complementary to word error rate (WER) and represents a measure of understanding. The scores for NE from speech seem to track the quality of speech recognition proportionally, i.e., NE performance degrades at worst linearly with word error rate. A second motivation is the fact that NE is the first information extraction task from text showing success, with error rates on newswire of less than 10%. The named entity problem has generated much interest, as evidenced by its inclusion as an understanding task to be evaluated in both the Sixth and Seventh Message Understanding Conferences (MUC-6 and MUC-7), in the First and Second Multilingual Entity Task evaluations (MET-1 and MET-2), and as a planned track in the next broadcast news evaluation. Furthermore, at least one commercial product has emerged: NameTag™ from IsoQuest. NE is defined by a set of annotation guidelines, an evaluation metric, and example data (Chinchor, 1997).

2. THE NAMED ENTITY PROBLEM FOR SPEECH

The named entity task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages. Though this sounds clear, enough special cases arise to require lengthy guidelines, e.g., when is The Wall Street Journal an artifact, and when is it an organization? When is White House an organization, and when a location? Are branch offices of a bank an organization? Is a street name a location? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? For human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7 (Chinchor, 1997). In training data, the boundaries of an expression and its type must be marked via SGML. Various GUIs support manual preparation of training data and reference answers. Though the problem is relatively easy in mixed-case English prose, it is not solvable solely by recognizing capitalization in English. Though capitalization does indicate proper nouns in English, the type of the entity (person, organization, location, or none of those) must be identified. Many proper noun categories are not to be marked, e.g., nationalities, product names, and book titles.
Named entity recognition is a challenge where case does not signal proper nouns, e.g., in Chinese, Japanese, or German, or in non-text modalities (e.g., speech). Since the task was generalized to other languages in the Multilingual Entity Task (MET), the task definition is no longer dependent on the use of mixed case in English. Broadcast news presents significant challenges, as illustrated in Table 1. Not having mixed case removes information useful for recognizing names in English. Automatically transcribed speech, even with no recognition errors, is harder due to the lack of punctuation, the spelling out of numbers as words, and upper case in SNOR (Speech Normalized Orthographic Representation) format.

3. OVERVIEW OF HMM IN IDENTIFINDER™

A full description of our HMM for named entity extraction appears in Bikel et al., 1997. By definition of the task, only a single label can be assigned to a word in context. Therefore, to every word, the HMM will assign either one of the desired classes (e.g., person, organization, etc.) or the label NOT-A-NAME (to represent "none of the desired classes"). We organize the states into regions, one region for each desired class plus one for NOT-A-NAME. See Figure 1. The HMM will have a model of each desired class and of the other text. The implementation is not confined to the seven classes of NE; in fact, it determines the set of classes by the SGML labels in the training data. Additionally, there are two special states, the START-OF-SENTENCE and END-OF-SENTENCE states. Within each of the regions, we use a statistical bigram language model, and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size, V. The generation of words and name-classes proceeds in the following steps:

1. Select a name-class NC, conditioning on the previous name-class and the previous word.
2. Generate the first word inside that name-class, conditioning on the current and previous name-classes.
3. Generate all subsequent words inside the current name-class, where each subsequent word is conditioned on its immediate predecessor.
4. If not at the end of a sentence, go to 1.

Using the Viterbi algorithm, we search the entire space of all possible name-class assignments, maximizing Pr(W, NC). This model allows each type of "name" to have its own language, with separate bigram probabilities for generating its words. This reflects our intuition that:

• There is generally predictive internal evidence regarding the class of a desired entity. Consider the following evidence: organization names tend to be stereotypical for airlines, utilities, law firms, insurance companies, other corporations, and government organizations. Organizations tend to select names to suggest the purpose or type of the organization. For person names, first names are stereotypical in many cultures; in Chinese, family names are stereotypical. In Chinese and Japanese, special characters are used to transliterate foreign names. Monetary amounts typically include a unit term, e.g., Taiwan dollars, yen, German marks, etc.

• Local evidence often suggests the boundaries and class of one of the desired expressions. Titles signal beginnings of person names. Closed-class words, such as determiners, pronouns, and prepositions, often signal a boundary. Corporate designators (Inc., Ltd., Corp., etc.) often end a corporation name.
While the number of word-states within each name-class is equal to V, this "interior" bigram language model is ergodic.

Table 1: Illustration of difficulties presented by speech recognition output (SNOR).

Mixed Case:
The crash was the second of a 757 in less than two months. On Dec. 20, an American Airlines jet crashed in the mountains near Cali, Colombia, killing 160 of the 164 people on board. The cause of that crash is still under investigation.

UPPER CASE:
THE CRASH WAS THE SECOND OF A 757 IN LESS THAN TWO MONTHS. ON DEC. 20, AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI, COLOMBIA, KILLING 160 OF THE 164 PEOPLE ON BOARD. THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION.

SNOR:
THE CRASH WAS THE SECOND OF A SEVEN FIFTY SEVEN IN LESS THAN TWO MONTHS ON DECEMBER TWENTY AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI COLOMBIA KILLING ONE HUNDRED SIXTY OF THE ONE HUNDRED SIXTY FOUR PEOPLE ON BOARD THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION
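
The decoding step described above (a Viterbi search over name-class assignments, each class scoring words with its own model) can be sketched compactly. The probability tables below are tiny invented stand-ins; the real system uses per-class bigram models trained from SGML-annotated data, with class transitions conditioned on the previous word as well.

```python
# Toy Viterbi over name-class assignments; all probabilities invented.
import math

classes = ["PERSON", "ORG", "OTHER"]
# Stand-in P(class | previous class):
trans = {c: {d: (0.6 if c == d else 0.2) for d in classes} for c in classes}
# Per-class word model, collapsed to a unigram stand-in:
emit = {"PERSON": {"mr": 0.3, "smith": 0.3},
        "ORG":    {"acme": 0.4, "corp": 0.4},
        "OTHER":  {"said": 0.2, "the": 0.3}}

def logp(d, k, floor=1e-4):
    return math.log(d.get(k, floor))

def viterbi(words):
    V = [{c: logp(emit[c], words[0]) for c in classes}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for c in classes:
            best = max(classes, key=lambda p: V[-1][p] + logp(trans[p], c))
            ptr[c] = best
            row[c] = V[-1][best] + logp(trans[best], c) + logp(emit[c], w)
        V.append(row)
        back.append(ptr)
    c = max(classes, key=lambda c: V[-1][c])
    path = [c]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["mr", "smith", "said", "acme", "corp"]))
```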

Journal ArticleDOI
TL;DR: A new fast kNN classification algorithm is presented for texture and pattern recognition that identifies the k closest vectors in the design set of a kNN classifier for each input vector by performing the partial distance search in the wavelet domain.
Abstract: A new fast kNN classification algorithm is presented for texture and pattern recognition. The algorithm identifies the k closest vectors in the design set of a kNN classifier for each input vector by performing the partial distance search in the wavelet domain. Simulation results show that, without increasing the classification error rate, the algorithm requires only 12.94% of the computational time of the original kNN technique.
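
A sketch of the partial-distance idea in plain Python: accumulate the squared distance coordinate by coordinate and abandon a candidate as soon as the partial sum exceeds the current k-th best distance. Applying this to wavelet coefficients (coarse coefficients first) is what the paper adds; plain vectors are used here for brevity.

```python
# Exact kNN with early abandoning via partial distances.
import numpy as np

def knn_partial_distance(query, design_set, k=3):
    best = []                               # list of (distance, index)
    worst = np.inf
    for i, v in enumerate(design_set):
        d, rejected = 0.0, False
        for a, b in zip(query, v):
            d += (a - b) ** 2
            if d >= worst:                  # partial sum already too big
                rejected = True
                break
        if not rejected:
            best.append((d, i))
            best.sort()
            best = best[:k]
            if len(best) == k:
                worst = best[-1][0]
    return best

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))
q = rng.normal(size=64)
print(knn_partial_distance(q, data))
```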

Patent
15 Jul 1998
TL;DR: In this article, the adaptive control of a forward error correction code for transmission between a terrestrial cell/packet switch at a first terminal and a satellite/wireless network connecting to a second terminal is discussed.
Abstract: A method for the adaptive control of a forward error correction code for transmission between a terrestrial cell/packet switch at a first terminal and a satellite/wireless network connecting to a second terminal, including the steps of: calculating a byte error rate associated with communication signals received by the first terminal (500), determining a forward error correction code length based on the byte error rate (510), and transmitting the forward error correction code length to the second terminal (520).
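
An illustrative sketch of the claimed control loop: map a measured byte error rate to a forward error correction code length and report it to the peer. The thresholds and parity lengths are invented for the example.

```python
# Hedged sketch; thresholds and code lengths are not from the patent.
def fec_parity_bytes(byte_error_rate):
    """Map a measured byte error rate to parity bytes per 255-byte block."""
    if byte_error_rate < 1e-5:
        return 4      # clean channel: light coding overhead
    if byte_error_rate < 1e-3:
        return 16
    return 32         # degraded channel: strongest correction

for ber in (1e-6, 5e-4, 1e-2):
    print(f"BER {ber:.0e} -> {fec_parity_bytes(ber)} parity bytes/block")
```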

Journal ArticleDOI
TL;DR: This paper proposes a method to transform acoustic models that have been trained with a certain group of speakers for use on different speech in hidden Markov model based (HMM-based) automatic speech recognition.
Abstract: This paper proposes a method to transform acoustic models that have been trained with a certain group of speakers for use on different speech in hidden Markov model based (HMM-based) automatic speech recognition. Features are transformed on the basis of assumptions regarding the difference in vocal tract length between the groups of speakers. First, the vocal tract length (VTL) of these groups has been estimated based on the average third formant F/sub 3/. Second, the linear acoustic theory of speech production has been applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of subsequent nonlinear submappings. By locally linearizing it and comparing results in the output, a linear approximation for the exact mapping was obtained which is accurate as long as the warping is reasonably small. The feature vector, which is computed from a speech frame, consists of the mel scale cepstral coefficients (MFCC) along with delta and delta/sup 2/-cepstra as well as delta and delta/sup 2/ energy. The method has been tested for TI digits data base, containing adult and children speech, consisting of isolated digits and digit strings of different length. The word error rate when trained on adults and tested on children with transformed adult models is decreased by more than a factor of two compared to the nontransformed case.
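
A tiny sketch of the normalization premise, assuming the warping factor is taken as the ratio of average third formants and applied as a linear rescaling of the frequency axis. The paper actually transforms the model features through a locally linearized mapping; the F3 values and spectrum here are illustrative.

```python
# Hedged sketch: VTL warping factor from average F3, applied to a toy
# spectrum by linear frequency rescaling.
import numpy as np

f3_adult, f3_child = 2500.0, 3200.0   # average F3 in Hz (illustrative)
alpha = f3_adult / f3_child           # < 1: compress toward child scale

freqs = np.linspace(0, 4000, 257)                   # analysis frequencies
spectrum = np.exp(-((freqs - 500) / 300.0) ** 2)    # toy spectral shape

# Evaluate the warped spectrum S(alpha * f) by interpolation:
warped = np.interp(alpha * freqs, freqs, spectrum)
print("peak moved from", freqs[spectrum.argmax()],
      "Hz to", freqs[warped.argmax()], "Hz")
```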

Journal ArticleDOI
TL;DR: A way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system and shows that when the “correct” move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops.
Abstract: This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 “move types,” such as “acknowledge,” “query-yes/no” or “instruct.” Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the “correct” move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops.Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of m...

Proceedings ArticleDOI
Subhro Das, D. Nix, M. Picheny
12 May 1998
TL;DR: Comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system are described.
Abstract: There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not conform to adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.

Proceedings ArticleDOI
12 May 1998
TL;DR: This paper compares various category-based language models when used in conjunction with a word-based trigram by means of linear interpolation to find the largest improvement with a model using automatically determined categories.
Abstract: This paper compares various category-based language models when used in conjunction with a word-based trigram by means of linear interpolation. Categories corresponding to parts-of-speech as well as automatically clustered groupings are considered. The category-based model employs variable-length n-grams and permits each word to belong to multiple categories. Relative word error rate reductions of between 2 and 7% over the baseline are achieved in N-best rescoring experiments on the Wall Street Journal corpus. The largest improvement is obtained with a model using automatically determined categories. Perplexities continue to decrease as the number of different categories is increased, but improvements in the word error rate reach an optimum.
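
The interpolation itself is a one-liner; a sketch with toy probability tables follows (lambda would be tuned on held-out data, and the paper's category model additionally uses variable-length n-grams and multiple categories per word).

```python
# Hedged sketch of linear interpolation between a word-based model and
# a category-based model; all probabilities are invented stand-ins.
def interp_prob(word, history, p_word, p_cat, word2cat, lam=0.7):
    """P(w|h) = lam * P_word(w|h) + (1-lam) * P_cat(cat(w)|h) * P(w|cat)."""
    pw = p_word.get((history, word), 1e-6)
    cat, p_w_given_cat = word2cat[word]
    pc = p_cat.get((history, cat), 1e-6)
    return lam * pw + (1 - lam) * pc * p_w_given_cat

p_word = {(("the",), "cat"): 0.002}      # word trigram/bigram table stub
p_cat = {(("the",), "NOUN"): 0.5}        # category model stub
word2cat = {"cat": ("NOUN", 0.001)}      # category and P(word|category)
print(interp_prob("cat", ("the",), p_word, p_cat, word2cat))
```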

Patent
13 Nov 1998
TL;DR: In this article, a method and system of performing confidence measure in a speech recognition system includes receiving an utterance of input speech and creating a near miss pattern or a near-miss list of possible word entries for the utterance.
Abstract: A method and system of performing confidence measure in a speech recognition system includes receiving an utterance of input speech and creating a near-miss pattern or a near-miss list of possible word entries for the utterance. Each word entry includes an associated value of probability that the utterance corresponds to the word entry. The near-miss list of possible word entries is compared with corresponding stored near-miss confidence templates. Each word in the vocabulary (or keyword list) has a stored near-miss confidence template, which includes a list of word entries, and each word entry in each list includes an associated value. Confidence measure for a particular hypothesis word is performed based on the comparison of the values in the near-miss list of possible word entries with the values of the corresponding near-miss confidence template.

Journal ArticleDOI
TL;DR: This paper presents the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms which modify incoming cepstral features, along with the STAR (STAtistical Reestimation)family of algorithms, which modify the internal statistics of the classifier.

Patent
31 Jul 1998
TL;DR: In this paper, the authors use the rapidly available speech recognition results to provide intelligent barge-in for voice-response systems and to count words and output sub-sequences, enabling paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.
Abstract: Speech recognition technology has attained such maturity that the most likely speech recognition result has been reached and is available before an energy-based termination of speech has been made. The present invention innovatively uses the rapidly available speech recognition results to provide intelligent barge-in for voice-response systems and to count words and output sub-sequences, providing paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.

Proceedings ArticleDOI
12 May 1998
TL;DR: A new feature-based method for estimating the speaking rate by detecting vowels in continuous speech, using the modified loudness and the zero-crossing rate, which are both calculated in the standard preprocessing unit of the speech recognition system.
Abstract: We present a new feature-based method for estimating the speaking rate by detecting vowels in continuous speech. The features used are the modified loudness and the zero-crossing rate, which are both calculated in the standard preprocessing unit of our speech recognition system. As vowels in general correspond to syllable nuclei, the feature-based vowel rate is comparable to an estimate of the lexically-based syllable rate. The vowel detector presented is tested on the spontaneously spoken German Verbmobil task and is evaluated using manually transcribed data. The lowest vowel error rate (including insertions) on the defined test set is 22.72% on average over all vowels. Additionally, correlation coefficients between our estimates and reference rates are calculated. These coefficients reach up to 0.796 and are therefore comparable to those for lexically-based measures (like the phone rate) on other tasks. The accuracy is sufficient to use our measurement for speaking rate adaptation.
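
A minimal sketch of the detection idea, assuming frames that are loud but have a low zero-crossing rate are vocalic, and counting contiguous vocalic stretches per second. Thresholds and the synthetic test signal are invented; the paper uses a modified loudness rather than raw frame energy.

```python
# Hedged sketch: vowel-rate estimate from energy and zero-crossing rate.
import numpy as np

def vowel_rate(signal, sr=8000, frame=200):
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    loud = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.signbit(frames).astype(int), axis=1) != 0).mean(axis=1)
    vocalic = (loud > loud.mean()) & (zcr < zcr.mean())
    # rising edges of the mask = number of contiguous vowel stretches
    n_vowels = int(np.sum(np.diff(vocalic.astype(int)) == 1)) + int(vocalic[0])
    return n_vowels / (len(signal) / sr)

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
sig = np.where((t % 0.5) < 0.2,
               np.sin(2 * np.pi * 150 * t),     # voiced stretches
               0.02 * rng.normal(size=t.size))  # background noise
print(vowel_rate(sig), "vowel-like stretches per second")
```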

Patent
18 Nov 1998
TL;DR: In this article, a method and apparatus for extracting key terms from a data set, including the steps of identifying a first set of one or more word groups of words that occur more than once in the data set and removing from this first set a second set of word groups that are sub-strings of longer word groups in the first set.
Abstract: A method and apparatus is provided for extracting key terms from a data set, the method including the steps of identifying a first set of one or more word groups of one or more words that occur more than once in the data set, and removing from this first set a second set of word groups that are sub-strings of longer word groups in the first set. The remaining word groups are key terms. Each word group is weighted according to its frequency of occurrence within the data set. The weighting of any word group may be increased by the frequency of any sub-string of words occurring in the second set; each weighting is then divided by the number of words in the word group. This weighting process operates to determine the order of occurrence of the word groups. Prefixes and suffixes are also removed from each word in the data set. This produces a neutral form of each word so that the weighting values are prefix and suffix independent.
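
A rough sketch of the described pipeline: collect repeated word groups, drop those that are sub-strings of longer repeated groups, and weight the survivors by frequency divided by group length. The suffix stripping is reduced to a toy rule, and the patent's weighting refinements are omitted.

```python
# Hedged sketch of key-term extraction; stemming and weighting are
# simplified stand-ins for the patented method.
from collections import Counter

def stem(w):
    for suf in ("ing", "es", "s"):           # toy suffix stripping
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def key_terms(text, max_n=3):
    words = [stem(w) for w in text.lower().split()]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    repeated = {g for g, c in counts.items() if c > 1}

    def is_sub(g):                           # g occurs inside a longer group?
        return any(g != h and len(h) > len(g) and
                   any(h[i:i + len(g)] == g
                       for i in range(len(h) - len(g) + 1))
                   for h in repeated)

    kept = [g for g in repeated if not is_sub(g)]
    return sorted(kept, key=lambda g: -counts[g] / len(g))

doc = ("error rate reduction matters; the word error rate and the word "
       "error rate reduction were both reported")
for term in key_terms(doc):
    print(" ".join(term))
```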

Proceedings Article
01 Jan 1998
TL;DR: There is a substantial mismatch between real speech and the combination of the authors' acoustic models and the pronunciations in their recognition dictionary, and the use of simulation appears to be a promising tool in the efforts to understand and reduce the size of this mismatch.
Abstract: We present a study of data simulated using acoustic models trained on Switchboard data, and then recognized using various Switchboard-trained acoustic models. When we recognize real Switchboard conversations, simple development models give a word error rate (WER) of about 47 percent. If instead we simulate the speech data using word transcriptions of the conversation, obtaining the pronunciations for the words from our recognition dictionary, the WER drops by a factor of five to ten. In a third type of experiment, we use human-generated phonetic transcripts to fabricate data that more realistically represents conversational speech, and obtain WERs in the low 40’s, rates that are fairly similar to those seen in actual speech data. Taken as a whole, these and other experiments we describe in the paper suggest that there is a substantial mismatch between real speech and the combination of our acoustic models and the pronunciations in our recognition dictionary. The use of simulation appears to be a promising tool in our efforts to understand and reduce the size of this mismatch, and may prove to be a generally valuable diagnostic in speech recognition research.

20 Oct 1998
TL;DR: This paper documents the use of Broadcast News test materials in DARPA-sponsored Automatic Speech Recognition (ASR) Benchmark Tests conducted late in 1998, and results are reported on non-English language Broadcast News materials in Spanish and Mandarin.
Abstract: This paper documents the use of Broadcast News test materials in DARPA-sponsored Automatic Speech Recognition (ASR) Benchmark Tests conducted late in 1998. As in last year’s tests [1], statistical selection procedures were used in selecting test materials. Two test epochs were used, each yielding (nominally) one and one-half hours of test material. One of the test sets was drawn from the same test epoch as was used for last year’s tests, and the other was drawn from a more recent period. Results are reported for two types of systems: one (the “Hub”, or “baseline” systems) for which there were no limits on computational resources, and another (the “less than 10X realtime spoke” systems) for systems that ran in less than 10 times real-time. The lowest word error rate reported this year for the “Hub” systems was 13.5%, contrasting with last year’s lowest word error rate of 16.2%. For the “less than 10X real-time spoke” systems, the lowest reported word error rate was 16.1%. Results are also reported, for the second year, on non-English language Broadcast News materials in Spanish and Mandarin.

Proceedings Article
01 Jan 1998
TL;DR: Experimental results will indeed show that the use of an appropriate duration normalization is very important to obtain good estimates of the phone and word confidences, and that (as one could expect) confidence measures at the word level perform better than those at the phone level.
Abstract: In this paper we define and investigate a set of confidence measures based on hybrid Hidden Markov Model/Artificial Neural Network (HMM/ANN) acoustic models. All these measures are using the neural network to estimate the local phone posterior probabilities, which are then combined and normalized in different ways. Experimental results will indeed show that the use of an appropriate duration normalization is very important to obtain good estimates of the phone and word confidences. The different measures are evaluated at the phone and word levels on both an isolated word task (PHONEBOOK) and a continuous speech recognition task (BREF). It will be shown that one of those confidence measures is well suited for utterance verification, and that (as one could expect) confidence measures at the word level perform better than those at the phone level. Finally, using the resulting approach on PHONEBOOK to rescore the N-best list is shown to yield a 34% decrease in word error rate.
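
The duration normalization the paper emphasizes amounts to averaging, rather than summing, framewise log posteriors over a segment. A short illustration with invented numbers:

```python
# Duration-normalized confidence: mean framewise log posterior, i.e.
# (1/T) * sum_t log p_t, so a segment is not penalized for its length.
import numpy as np

def segment_confidence(frame_log_posteriors):
    return float(np.mean(frame_log_posteriors))

short = [-0.1, -0.2, -0.15]                       # 3-frame segment
long_ = [-0.1, -0.2, -0.15, -0.12, -0.18, -0.1]   # 6 frames, same quality
print("sum :", np.sum(short), np.sum(long_))      # raw sum penalizes length
print("mean:", segment_confidence(short), segment_confidence(long_))
```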

Journal ArticleDOI
TL;DR: It is shown that when imaged at a distance of up to about one metre, the population entropy of iris patterns is roughly 3.4 bits per square millimetre, and their complexity spans about 266 independent degrees of freedom, which significantly exceed the degrees of randomness and complexity available in other identifying biometric patterns.