
Showing papers on "Perplexity published in 1997"


Proceedings ArticleDOI
21 Apr 1997
TL;DR: Two language model adaptation techniques, a topic-based mixture of language models and a cache component in which words' recurrence probabilities decay exponentially over time, both yield a significant reduction in perplexity when faced with a multi-domain test text.
Abstract: Presents two techniques for language model adaptation. The first is based on the use of mixtures of language models: the training text is partitioned according to topic, a language model is constructed for each component and, at recognition time, appropriate weightings are assigned to each component to model the observed style of language. The second technique is based on augmenting the standard trigram model with a cache component in which the words' recurrence probabilities decay exponentially over time. Both techniques yield a significant reduction in perplexity over the baseline trigram language model when faced with a multi-domain test text, the mixture-based model giving a 24% reduction and the cache-based model giving a 14% reduction. The two techniques attack the problem of adaptation at different scales, and as a result can be used in parallel to give a total perplexity reduction of 30%.
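The cache component lends itself to a compact illustration. The sketch below is ours, not the authors' implementation: a word's cache score decays exponentially with the number of words since its last occurrence, and the normalized cache distribution is interpolated with a static trigram model (the decay rate and interpolation weight are illustrative assumptions).

import math

def cache_probability(word, position, last_seen, decay=0.0005):
    """Unnormalized cache score: exponential decay with distance since last occurrence."""
    if word not in last_seen:
        return 0.0
    distance = position - last_seen[word]
    return math.exp(-decay * distance)

def adapted_probability(word, history, position, last_seen, trigram_prob, lam=0.1):
    """Interpolate the static trigram model with the (normalized) cache distribution."""
    cache_scores = {w: cache_probability(w, position, last_seen) for w in last_seen}
    total = sum(cache_scores.values())
    p_cache = cache_scores.get(word, 0.0) / total if total > 0 else 0.0
    return (1.0 - lam) * trigram_prob(word, history) + lam * p_cache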

217 citations


Proceedings Article
01 Jun 1997
TL;DR: This work considers language models whose size and accuracy are intermediate between different order n-gram models, and examines smoothing procedures in which these models are interposed between different order n-grams.
Abstract: We consider the use of language models whose size and accuracy are intermediate between different order n-gram models. Two types of models are studied in particular. Aggregate Markov models are class-based bigram models in which the mapping from words to classes is probabilistic. Mixed-order Markov models combine bigram models whose predictions are conditioned on different words. Both types of models are trained by Expectation-Maximization (EM) algorithms for maximum likelihood estimation. We examine smoothing procedures in which these models are interposed between different order n-grams. This is found to significantly reduce the perplexity of unseen word combinations.
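As a quick illustration of the aggregate Markov model (our sketch, with a toy vocabulary): because the word-to-class mapping is probabilistic, the bigram probability marginalizes over the hidden class, P(w2 | w1) = sum_c P(c | w1) P(w2 | c).

import numpy as np

def aggregate_bigram_prob(w1, w2, p_class_given_word, p_word_given_class):
    """p_class_given_word: array [V, C]; p_word_given_class: array [C, V]."""
    return float(np.dot(p_class_given_word[w1], p_word_given_class[:, w2]))

# Toy example: 4-word vocabulary, 2 classes, randomly initialized and normalized.
rng = np.random.default_rng(0)
V, C = 4, 2
p_c_w = rng.random((V, C)); p_c_w /= p_c_w.sum(axis=1, keepdims=True)
p_w_c = rng.random((C, V)); p_w_c /= p_w_c.sum(axis=1, keepdims=True)
print(aggregate_bigram_prob(0, 3, p_c_w, p_w_c))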

192 citations


Proceedings Article
01 Jan 1997
TL;DR: A language model adaptation scheme that takes a piece of text, chooses the most similar topic clusters from a set of over 5000 elemental topics, and uses topic specific language models built from the topic clusters to rescore N-best lists, allowing for adaptation to unique, previously unseen combinations of subjects.
Abstract: The subject matter of any conversation or document can typically be described as some combination of elemental topics. We have developed a language model adaptation scheme that takes a piece of text, chooses the most similar topic clusters from a set of over 5000 elemental topics, and uses topic specific language models built from the topic clusters to rescore N-best lists. We are able to achieve a 15% reduction in perplexity and a small improvement in WER by using this adaptation. We also investigate the use of a topic tree, where the amount of training data for a specific topic can be judiciously increased in cases where the elemental topic cluster has too few word tokens to build a reliably smoothed and representative language model. Our system is able to fine-tune topic adaptation by interpolating models chosen from thousands of topics, allowing for adaptation to unique, previously unseen combinations of subjects.
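A hypothetical sketch of the selection-and-rescoring idea described above; the similarity measure (cosine over word-count vectors), the function names, and the fixed top-k cutoff are our assumptions, not the paper's specification.

import numpy as np

def select_topics(text_vector, topic_matrix, k=5):
    """topic_matrix: [n_topics, V] word-count vectors; returns indices of the k most similar topics."""
    def normalize(m):
        norms = np.linalg.norm(m, axis=-1, keepdims=True)
        return m / np.maximum(norms, 1e-12)
    sims = normalize(topic_matrix) @ normalize(text_vector[None, :]).ravel()
    return np.argsort(sims)[::-1][:k]

def interpolated_prob(word, history, topic_lms, weights):
    """Mix the chosen topic-specific LMs; the mixed model then rescores N-best lists."""
    return sum(w * lm(word, history) for w, lm in zip(weights, topic_lms))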

126 citations


Proceedings Article
01 Jan 1997
TL;DR: A new method is presented to quickly adapt a given language model to local text characteristics by choosing the adaptive models as close as possible to the background estimates while constraining them to respect the locally estimated unigram probabilities.
Abstract: A new method is presented to quickly adapt a given language model to local text characteristics. The basic approach is to choose the adaptive models as close as possible to the background estimates while constraining them to respect the locally estimated unigram probabilities. Several means are investigated to speed up the calculations. We measure both perplexity and word error rate to gauge the quality of our model.
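A rough sketch of the underlying idea, under a common closed-form approximation (scale each background probability by the ratio of local to background unigram probability and renormalize per history); the paper's exact estimation procedure and the speed-ups it investigates are not reproduced here.

def adapt_distribution(p_bg_given_h, p_local_unigram, p_bg_unigram):
    """All arguments are dicts mapping word -> probability, for a single history h.
    Stays close to the background model while respecting local unigram estimates."""
    scaled = {w: p * (p_local_unigram.get(w, 1e-12) / max(p_bg_unigram.get(w, 1e-12), 1e-12))
              for w, p in p_bg_given_h.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}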

114 citations


Posted Content
Lillian Lee1
TL;DR: The clustering method, which uses deterministic annealing, represents (to the authors' knowledge) the first application of soft clustering to problems in natural language processing; several nearest-neighbor approaches are also compared on a word sense disambiguation task, where, as a whole, their performance is far superior to that of standard methods.
Abstract: Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering. Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction.
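The similarity measure at the core of the thesis is the Kullback-Leibler divergence between the distributions associated with two events (for example, the distributions of words co-occurring with two different nouns). A direct implementation, with a small smoothing constant of our own to keep the divergence finite:

import math

def kl_divergence(p, q, vocab, eps=1e-9):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); q is eps-smoothed to avoid division by zero."""
    total = 0.0
    for x in vocab:
        px = p.get(x, 0.0)
        if px > 0.0:
            total += px * math.log(px / max(q.get(x, 0.0), eps))
    return total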

99 citations


Proceedings Article
01 Jan 1997
TL;DR: A quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French; some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard.
Abstract: In this paper we present a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition we can measure different lexical coverages and language model perplexities, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization to create language models for use in the recognition experiments with read newspaper texts was based on these findings. Our best system configuration obtained an 11.2% word error rate in the AUPELF ‘French-speaking’ speech recognizer evaluation test held in February 1997.
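The lexical-coverage measurement that drives the comparison can be sketched as follows (our illustration, assuming a lexicon built from the K most frequent normalized forms):

from collections import Counter

def lexical_coverage(train_tokens, test_tokens, lexicon_size=65000):
    """Fraction of running test words covered by the top-K training lexicon."""
    lexicon = {w for w, _ in Counter(train_tokens).most_common(lexicon_size)}
    covered = sum(1 for w in test_tokens if w in lexicon)
    return covered / len(test_tokens)   # e.g., 0.985 corresponds to a 1.5% OOV rate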

57 citations


Proceedings ArticleDOI
Reinhard Kneser1, J. Peters1
21 Apr 1997
TL;DR: This paper introduces adaptation techniques such as the adaptive linear interpolation and an approximation to the minimum discriminant estimation and shows how to use the automatically derived semantic structure in order to allow a fast adaptation to some special topic or style.
Abstract: In this paper we present efficient clustering algorithms for two novel class-based approaches to adaptive language modeling. In contrast to bigram and trigram class models, the proposed classes are related to the distribution and co-occurrence of words within complete text units and are thus mostly of a semantic nature. We introduce adaptation techniques such as the adaptive linear interpolation and an approximation to the minimum discriminant estimation and show how to use the automatically derived semantic structure in order to allow a fast adaptation to some special topic or style. In experiments performed on the Wall-Street-Journal corpus, intuitively convincing semantic classes were obtained. The resulting adaptive language models were significantly better than a standard cache model. Compared to a static model a reduction in perplexity of up to 31% could be achieved.
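Adaptive linear interpolation, as it is usually formulated, re-estimates the component interpolation weights on recent text with an EM step so that the mixture tracks the current topic or style. The sketch below is the generic EM update for mixture weights, not the authors' exact procedure:

def em_update_weights(weights, component_probs_per_word):
    """component_probs_per_word: list over words t of [P_1(w_t|h_t), ..., P_K(w_t|h_t)].
    Returns re-estimated interpolation weights (average posterior responsibilities)."""
    K = len(weights)
    counts = [0.0] * K
    for probs in component_probs_per_word:
        denom = sum(w * p for w, p in zip(weights, probs))
        for j in range(K):
            counts[j] += weights[j] * probs[j] / max(denom, 1e-12)
    n = len(component_probs_per_word)
    return [c / n for c in counts]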

52 citations


Journal ArticleDOI
TL;DR: A new approach to modeling phoneme-based speech units is proposed, which represents the acoustic observations of a phoneme as clusters of trajectories in a parameter space, and suggests that the stochastic trajectory model provides a more in-depth modeling of continuous speech signals.
Abstract: The paper first points out a defect in hidden Markov modeling (HMM) of continuous speech, referred to as the trajectory folding phenomenon. A new approach to modeling phoneme-based speech units is then proposed, which represents the acoustic observations of a phoneme as clusters of trajectories in a parameter space. The trajectories are modeled by a mixture of probability density functions of a random sequence of states. Each state is associated with a multivariate Gaussian density function, optimized at the state sequence level. Conditional trajectory duration probability is integrated in the modeling. An efficient sentence search procedure based on trajectory modeling is also formulated. Experiments with a speaker-dependent, 2010-word continuous speech recognition application with a word-pair perplexity of 50, using vocabulary-independent acoustic training and monophone models trained with 80 sentences per speaker, gave a word error rate of about 1%. The new models were experimentally compared to continuous density mixture HMM (CDHMM) on the same recognition task, and gave significantly smaller word error rates. These results suggest that the stochastic trajectory model provides a more in-depth modeling of continuous speech signals.

42 citations


Proceedings ArticleDOI
14 Dec 1997
TL;DR: Alternatives to perplexity for predicting language model performance are studied, including other global features as well as a new approach that predicts, with a high correlation (0.96), performance differences associated with localized changes in language models, given a recognition system.
Abstract: Statistical n-gram language models are traditionally developed using perplexity as a measure of goodness. However, perplexity often demonstrates a poor correlation with recognition improvements, mainly because it fails to account for the acoustic confusability between words and for search errors in a recognizer. In this paper, we study alternatives to perplexity for predicting language model performance, including other global features as well as a new approach that predicts, with a high correlation (0.96), performance differences associated with localized changes in language models, given a recognition system. Experiments focus on the problem of augmenting in-domain Switchboard text with out-of-domain text from the Wall Street Journal and broadcast news that differ in both style and content from the in-domain data.
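For reference, the conventional perplexity that the paper argues is a weak predictor on its own: the test-set probability converted to a per-word figure. It depends only on the language model, which is why it cannot reflect acoustic confusability or search errors. A minimal sketch:

import math

def perplexity(lm_prob, test_sentences):
    """lm_prob(word, history) -> P(word | history); returns 2 ** (average negative log2 probability)."""
    log_sum, n_words = 0.0, 0
    for sentence in test_sentences:
        history = []
        for word in sentence:
            log_sum += -math.log2(max(lm_prob(word, tuple(history)), 1e-12))
            history.append(word)
            n_words += 1
    return 2.0 ** (log_sum / n_words)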

42 citations


Posted Content
TL;DR: A statistical language model is proposed that jointly segments a speaker's turn into intonational phrases, resolves speech repairs, identifies discourse markers, and does POS tagging, and that can be used as the language model of a speech recognizer; accounting for the interactions between these tasks improves performance on each of them, as well as POS tagging accuracy and perplexity.
Abstract: To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that by accounting for the interactions between these tasks, the performance on each task improves, as do POS tagging accuracy and perplexity.

32 citations


Proceedings ArticleDOI
07 Jul 1997
TL;DR: A statistical language model is put forward that resolves phrase segmentation, speech repairs, and discourse marker identification jointly, does POS tagging, and can be used as the language model of a speech recognizer; accounting for the interactions between these tasks improves performance on each of them, as well as POS tagging accuracy and perplexity.
Abstract: To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that by accounting for the interactions between these tasks, the performance on each task improves, as do POS tagging accuracy and perplexity.

Journal ArticleDOI
TL;DR: This work studies three efficient methods for variable order stochastic language modeling in the context of the Stochastic pattern recognition problem and demonstrates that the best performance is achieved by extending one of the previous techniques using elements from the newly developed method.

ReportDOI
01 Jun 1997
TL;DR: A language model adaptation scheme that takes a piece of text, chooses the most similar topic clusters from a set of over 5000 elemental topics, and uses topic-specific language models built from the topic clusters to rescore N-best lists, achieving a 15% reduction in perplexity and a small improvement in word error rate.
Abstract: The subject matter of any conversation or document can typically be described as some combination of elemental topics. We have developed a language model adaptation scheme that takes a piece of text, chooses the most similar topic clusters from a set of over 5000 elemental topics, and uses topic specific language models built from the topic clusters to rescore N-best lists. We are able to achieve a 15% reduction in perplexity and a small improvement in word error rate by using this adaptation. We also investigate the use of a topic tree, where the amount of training data for a specific topic can be judiciously increased in cases where the elemental topic cluster has too few word tokens to build a reliably smoothed and representative language model. Our system is able to fine-tune topic adaptation by interpolating models chosen from thousands of topics, allowing for adaptation to unique, previously unseen combinations of subjects.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: A method for handling unseen events in the maximum entropy approach is described, achieved by discounting the frequencies of observed events, and the effect of this discounting operation on the convergence of the GIS algorithm is studied.
Abstract: Applies the maximum entropy approach to so-called distant bigram language modelling. In addition to the usual unigram and bigram dependencies, we use distant bigram dependencies, where the immediate predecessor word of the word position under consideration is skipped. We analyze the computational complexity of the resulting training algorithm, i.e. the generalized iterative scaling (GIS) algorithm, and study the details of its implementation. We describe a method for handling unseen events in the maximum entropy approach; this is achieved by discounting the frequencies of observed events. We study the effect of this discounting operation on the convergence of the GIS algorithm. We give experimental perplexity results for a corpus from the Wall Street Journal (WSJ) task. By using the maximum entropy approach and the distant bigram dependencies, we are able to reduce the perplexity from 205.4 for our best conventional bigram model to 169.5.
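A sketch of the model form being trained (our simplification): the probability of a word given the history (..., u, v) is the normalized exponential of a sum of unigram, bigram, and distant (skip-1) bigram feature weights; the weights themselves would be fit with GIS, whose basic update is lambda_i += (1/C) * log(E_empirical[f_i] / E_model[f_i]). The discounting of observed frequencies studied in the paper is not shown.

import math

def maxent_prob(w, u, v, vocab, lam_uni, lam_bi, lam_dist):
    """P(w | ..., u, v) under a maximum entropy model with distant bigram features."""
    def score(x):
        return (lam_uni.get(x, 0.0)
                + lam_bi.get((v, x), 0.0)      # regular bigram feature (immediate predecessor v)
                + lam_dist.get((u, x), 0.0))   # distant bigram feature (skips v, conditions on u)
    z = sum(math.exp(score(x)) for x in vocab)
    return math.exp(score(w)) / z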

Proceedings ArticleDOI
21 Apr 1997
TL;DR: A new technique for modelling word occurrence correlations within a word-category based language model captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers).
Abstract: A new technique for modelling word occurrence correlations within a word-category based language model is presented. Empirical observations indicate that the conditional probability of a word given its category, rather than maintaining the constant value normally assumed, exhibits an exponential decay towards a constant as a function of an appropriately defined measure of separation between the correlated words. Consequently, a functional dependence of the probability upon this separation is postulated, and methods for determining both the related word pairs as well as the function parameters are developed. Experiments using the LOB, Switchboard and Wall Street Journal corpora indicate that this formulation captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers), and the reductions increase with the size of the training corpus.
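The postulated functional dependence can be written down compactly; the parameter names below are ours, and the actual parameter-estimation methods developed in the paper are not reproduced:

import math

def triggered_word_given_category(d, p_background, p_peak, decay_rate):
    """P(word | category) at separation d from the triggering word: starts at p_peak
    and decays exponentially toward the constant background value p_background."""
    return p_background + (p_peak - p_background) * math.exp(-decay_rate * d)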

Proceedings ArticleDOI
Jerome R. Bellegarda1
14 Dec 1997
TL;DR: A new framework is proposed to integrate the various constraints, both local and global, that are present in language, resulting in several families of multi-span language models for large-vocabulary speech recognition.
Abstract: A new framework is proposed to integrate the various constraints, both local and global, that are present in language. Local constraints are captured via n-gram language modeling, while global constraints are taken into account through the use of latent semantic analysis. An integrative formulation is derived for the combination of these two paradigms, resulting in several families of multi-span language models for large-vocabulary speech recognition. Because of the inherent complementarity in the two types of constraints, the performance of the integrated language models, as measured by perplexity, compares favorably with the corresponding n-gram performance.
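One simple way to combine the two spans of constraints (a generic combination for illustration, not necessarily the paper's integrative formulation): multiply the local n-gram probability by a global, LSA-derived probability of the word given the current document history, and renormalize over the vocabulary.

def multi_span_prob(w, history, doc_vector, vocab, ngram_prob, lsa_prob):
    """Combine local n-gram and global LSA evidence, then renormalize over the vocabulary."""
    scores = {x: ngram_prob(x, history) * lsa_prob(x, doc_vector) for x in vocab}
    z = sum(scores.values())
    return scores[w] / z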

Proceedings ArticleDOI
21 Apr 1997
TL;DR: A method is suggested to optimize the vocabulary for a given task using the perplexity criterion; in the morpheme case, the vocabulary was reduced to half the size of the original word vocabulary.
Abstract: We suggest a method to optimize the vocabulary for a given task using the perplexity criterion. The optimization allows us to reduce the size of the vocabulary at the same perplexity as the original word-based vocabulary, or to reduce perplexity at the same vocabulary size. This new approach is an alternative to a phoneme n-gram language model in the speech recognition search stage. We show the convergence of our approach on the Korean training corpus. This method may provide an optimized speech recognizer for a given task. We used phonemes, syllables, and morphemes as the basic units for the optimization, and reduced the size of the vocabulary to half of the original word vocabulary size in the morpheme case.

Proceedings ArticleDOI
01 Jan 1997
TL;DR: In this article, a statistical language model is proposed that jointly resolves phrase segmentation, speech repairs, and discourse markers, does POS tagging, and can be used as the language model of a speech recognizer, improving performance on each task as well as POS tagging accuracy and perplexity.
Abstract: To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that by accounting for the interactions between these tasks, the performance on each task improves, as do POS tagging accuracy and perplexity.

Proceedings Article
01 Jan 1997
TL;DR: An N-best rescoring algorithm is presented that removes the effect of segmentation mismatch between training and test conditions and shows that explicit language modeling of hidden linguistic segment boundaries is improved by including turn-boundary events in the model.
Abstract: Language modeling, especially for spontaneous speech, often suffers from a mismatch of utterance segmentations between training and test conditions. In particular, training often uses linguistically-based segments, whereas testing occurs on acoustically determined segments, resulting in degraded performance. We present an N-best rescoring algorithm that removes the effect of segmentation mismatch. Furthermore, we show that explicit language modeling of hidden linguistic segment boundaries is improved by including turn-boundary events in the model.

Posted Content
TL;DR: The authors consider the use of language models whose size and accuracy are intermediate between different order n-gram models, and examine smoothing procedures in which these models are interposed between different order n-grams, which significantly reduces the perplexity of unseen word combinations.
Abstract: We consider the use of language models whose size and accuracy are intermediate between different order n-gram models. Two types of models are studied in particular. Aggregate Markov models are class-based bigram models in which the mapping from words to classes is probabilistic. Mixed-order Markov models combine bigram models whose predictions are conditioned on different words. Both types of models are trained by Expectation-Maximization (EM) algorithms for maximum likelihood estimation. We examine smoothing procedures in which these models are interposed between different order n-grams. This is found to significantly reduce the perplexity of unseen word combinations.

Book ChapterDOI
TL;DR: An elegant method for evaluating the discriminant power of features in the framework of an HMM-based word recognition system that employs statistical indicators, entropy and perplexity, to quantify the capability of each feature to discriminate between classes without resorting to the result of the recognition phase.
Abstract: This paper describes an elegant method for evaluating the discriminant power of features in the framework of an HMM-based word recognition system. This method employs statistical indicators, entropy and perplexity, to quantify the capability of each feature to discriminate between classes without resorting to the result of the recognition phase. The HMMs and the Viterbi algorithm are used as powerful tools to automatically deduce the probabilities required to compute the above mentioned quantities.
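The two indicators are easy to state: for each feature, estimate a distribution over classes (the paper obtains the required probabilities from the HMMs via the Viterbi algorithm) and compute its entropy H and perplexity 2**H; a feature with lower perplexity concentrates its probability mass on fewer classes and so discriminates better. A minimal sketch:

import math

def entropy_and_perplexity(class_probs):
    """class_probs: list of P(class | feature) values summing to 1."""
    h = -sum(p * math.log2(p) for p in class_probs if p > 0.0)
    return h, 2.0 ** h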

Proceedings ArticleDOI
21 Apr 1997
TL;DR: An alternative scheme for perplexity estimation is developed, based on a gambling approach to the next word in a truncated sentence; entropy bounds proposed by Shannon, based on the rank of the correct answer, are used to estimate a perplexity interval for non-probabilistic language models.
Abstract: Language models are usually evaluated on test texts using the perplexity derived directly from the model likelihood function. In order to use this measure in the framework of a comparative evaluation campaign, we have developed an alternative scheme for perplexity estimation. The method is derived from the Shannon (1951) game and based on a gambling approach on the next word to come in a truncated sentence. We also use entropy bounds proposed by Shannon and based on the rank of the correct answer, in order to estimate a perplexity interval for non-probabilistic language models. The relevance of the approach is assessed on an example.
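A sketch of the rank-based estimate, assuming rank statistics gathered from the gambling game; the bound formulas below follow Shannon's 1951 construction as we recall it and should be treated as illustrative rather than as the paper's exact procedure.

import math

def perplexity_interval(rank_freqs):
    """rank_freqs[i] = relative frequency that the correct word was the gambler's (i+1)-th guess.
    Returns (2**lower_entropy_bound, 2**upper_entropy_bound); assumes nonincreasing rank frequencies."""
    q = list(rank_freqs) + [0.0]                      # q_{N+1} = 0
    upper_h = -sum(p * math.log2(p) for p in rank_freqs if p > 0.0)
    lower_h = sum((i + 1) * (q[i] - q[i + 1]) * math.log2(i + 1)
                  for i in range(len(rank_freqs)))
    return 2.0 ** lower_h, 2.0 ** upper_h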

01 Jan 1997
TL;DR: New methods of rejecting errors and estimating confidence for telephone speech based on phonetic word models are presented; these are shown to perform as well as the best of the other methods examined, despite the data reduction involved.
Abstract: Automatic speech recognition (ASR) is performed imperfectly by computers. For some designated part (e.g., word or phrase) of the ASR output, rejection is deciding (yes or no) whether it is correct, and confidence is the probability (0.0 to 1.0) of it being correct. This thesis presents new methods of rejecting errors and estimating confidence for telephone speech. These are also called word or utterance verification and can be used in wordspotting or voice-response systems. Open-set or out-of-vocabulary situations are a primary focus. Language models are not considered. In vocabulary-dependent rejection all words in the target vocabulary are known in advance and a strategy can be developed for confirming each word. A word-specific artificial neural network (ANN) is shown to discriminate well, and scores from such ANNs are shown on a closed-set recognition task to reorder the N-best hypothesis list (N=3) for improved recognition performance. Segment-based duration and perceptual linear prediction (PLP) features are shown to perform well for such ANNs. The majority of the thesis concerns vocabulary- and task-independent confidence and rejection based on phonetic word models. These can be computed for words even when no training examples of those words have been seen. New techniques are developed using phoneme ranks instead of probabilities in each frame. These are shown to perform as well as the best other methods examined despite the data reduction involved. Certain new weighted averaging schemes are studied but found to give no performance benefit. Hierarchical averaging is shown to improve performance significantly: frame scores combine to make segment (phoneme state) scores, which combine to make phoneme scores, which combine to make word scores. Use of intermediate syllable scores is shown to not affect performance. Normalizing frame scores by an average of the top probabilities in each frame is shown to improve performance significantly. Perplexity of the wrong-word set is shown to be an important factor in computing the impostor probability used in the likelihood ratio. Bootstrap parameter estimation techniques are used to assess the significance of performance differences.

08 Feb 1997
TL;DR: It is shown that compression may be used as an alternative to perplexity for language model evaluation, and that the information processing techniques employed by the system may reflect what happens in the human brain.
Abstract: Statistical data compression requires a stochastic language model which must rapidly adapt to new data as it is encountered. A grammatical inference engine is introduced which satisfies this requirement; it is able to discover structure in arbitrary data using nothing more than the predictions of a simple trigram model. We show that compression may be used as an alternative to perplexity for language model evaluation, and that the information processing techniques employed by our system may reflect what happens in the human brain.
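The link that makes compression usable as an evaluation measure: an ideal arithmetic coder driven by the language model spends -log2 P(w | history) bits on each word, so the compressed size per word is the cross-entropy, and perplexity is simply 2 raised to that figure. A minimal sketch of the correspondence:

import math

def bits_and_perplexity(lm_prob, tokens):
    """Returns (bits per word under an ideal coder driven by lm_prob, equivalent perplexity)."""
    bits = sum(-math.log2(max(lm_prob(w, tuple(tokens[:i])), 1e-12))
               for i, w in enumerate(tokens))
    bits_per_word = bits / len(tokens)
    return bits_per_word, 2.0 ** bits_per_word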

Proceedings Article
01 Jan 1997
TL;DR: The method yields a large reduction in the number of iterations needed to build a classification tree, and thus in the CPU time for building the model, as well as a reduction in both perplexity and word error rate.
Abstract: In this paper, a new method to cluster words into classes is proposed in order to define a statistical language model. The purpose of this algorithm is to decrease the computational cost of the clustering task while not degrading speech recognition performance. The algorithm provides a bottom-up hierarchical clustering using the reciprocal neighbours method. This technique consists in merging several pairs of classes within a single iteration. Experiments on a spontaneous speech corpus are presented. Results are given both in terms of perplexity and word recognition error rate. We obtain a large reduction in the number of iterations necessary to build a classification tree and thus a CPU time reduction in building the model as well as a reduction in both perplexity and word error rate.
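A sketch of the reciprocal-neighbours step that permits several merges per iteration (the distance function is left abstract; the paper's actual merge criterion, e.g. likelihood or perplexity loss, is not reproduced): classes a and b are merged when each is the other's nearest neighbour.

def reciprocal_neighbour_pairs(classes, distance):
    """Return all class pairs that are mutual nearest neighbours under the given distance."""
    nearest = {c: min((d for d in classes if d != c), key=lambda d: distance(c, d))
               for c in classes}
    pairs, used = [], set()
    for c in classes:
        d = nearest[c]
        if nearest[d] == c and c not in used and d not in used:
            pairs.append((c, d))
            used.update((c, d))
    return pairs   # all these pairs can be merged within a single iteration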

Journal ArticleDOI
TL;DR: In this paper, corpus-based, statistically oriented Chinese word classification is regarded as a fundamental step for automatic or non-automatic monolingual natural language processing systems; word classes help address data sparseness and require far fewer parameters.

01 Jan 1997
TL;DR: A software agent is described which is able to take a seed (reference) corpus specified by the user, search the Internet for documents which are sufficiently similar to the seed corpus (as defined by a set of similarity metrics operating at a number of levels in the text), and augment the seed Corpus with these documents.
Abstract: A software agent is described which is able to take a seed (reference) corpus specified by the user, search the Internet for documents which are sufficiently similar to the seed corpus (as defined by a set of similarity metrics operating at a number of levels in the text), and augment the seed corpus with these documents. The size of the corpus and, hopefully, the quality of the derived language model, are thus progressively increased. The seed corpus may be quite a small collection of transcripts from the application domain, such as may be collected with minimal effort. Preliminary results are given for the perplexity of language models constructed using this approach. Potentially, our method has applications well beyond speech recognition, in corpus-based language processing in general, and document retrieval.