SciSpace (formerly Typeset)

Showing papers on "Perplexity published in 2007"


Journal ArticleDOI
TL;DR: Current advances in automatic speech recognition (ASR) and spoken language systems are outlined, together with their deficiencies in dealing with the variation naturally present in speech.

507 citations


Journal ArticleDOI
TL;DR: The Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions from the direction-sensitive messages sent between entities, is presented; results provide evidence not only that clearly relevant topics are discovered, but also that the ART model better predicts people's roles and gives lower perplexity on previously unseen messages.
Abstract: Previous work in social network analysis (SNA) has modeled the existence of links from one entity to another, but not the attributes such as language content or topics on those links. We present the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. The model builds on Latent Dirichlet Allocation (LDA) and the Author-Topic (AT) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. We give results on both the Enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but also that the ART model better predicts people's roles and gives lower perplexity on previously unseen messages. We also present the Role-Author-Recipient-Topic (RART) model, an extension to ART that explicitly represents people's roles.

484 citations


Proceedings Article
03 Dec 2007
TL;DR: Using five real-world text corpora, it is shown that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning.
Abstract: We investigate the problem of learning a widely-used latent-variable model - the Latent Dirichlet Allocation (LDA) or "topic" model - using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

264 citations
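
As a rough illustration of the first (approximate) scheme summarized above, the following minimal sketch runs collapsed Gibbs sampling locally on each processor's shard and merges the topic-word counts after every sweep. It is an illustrative simplification, not the authors' implementation: the corpus format (documents as lists of integer word ids, pre-split into shards), the hyperparameters, and the merge rule are all assumptions.

import numpy as np

def local_gibbs_sweep(docs, z, nw, nd, nwsum, alpha, beta, V, K, rng):
    """One sweep of collapsed Gibbs sampling over one processor's documents."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove the current assignment from the (local copy of the) counts
            nw[w, k] -= 1; nd[d, k] -= 1; nwsum[k] -= 1
            # full conditional p(z = k | everything else)
            p = (nw[w] + beta) / (nwsum + V * beta) * (nd[d] + alpha)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            nw[w, k] += 1; nd[d, k] += 1; nwsum[k] += 1

def distributed_lda(shards, V, K=10, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """shards: list of P shards, each a list of documents (lists of word ids < V)."""
    rng = np.random.default_rng(seed)
    global_nw = np.zeros((V, K))            # global topic-word counts
    states = []                             # per-processor assignments and doc-topic counts
    for docs in shards:
        z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
        nd = np.zeros((len(docs), K))
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                global_nw[w, z[d][i]] += 1
                nd[d, z[d][i]] += 1
        states.append((docs, z, nd))
    for _ in range(sweeps):
        deltas = []
        for docs, z, nd in states:          # in a real system this loop runs in parallel
            nw = global_nw.copy()           # stale local copy of the global counts
            nwsum = nw.sum(axis=0)
            local_gibbs_sweep(docs, z, nw, nd, nwsum, alpha, beta, V, K, rng)
            deltas.append(nw - global_nw)   # what this shard changed
        global_nw += sum(deltas)            # periodic merge of the counts
    phi = (global_nw + beta) / (global_nw.sum(axis=0) + V * beta)
    return phi.T                            # K x V topic-word distributions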


Journal ArticleDOI
TL;DR: A new smoothing technique based on randomly grown decision trees (DTs) is developed and applied to language modeling; the resulting random forest (RF) language models are superior to the best known smoothing technique, interpolated Kneser–Ney smoothing, in reducing both perplexity and word error rate in large-vocabulary state-of-the-art speech recognition systems.

69 citations


Proceedings ArticleDOI
12 Aug 2007
TL;DR: A new probabilistic graphical model is proposed that employs non-homogeneous Poisson processes to model the generation of word counts and models the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales.
Abstract: Modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. In this work, we propose a new probabilistic graphical model to address this issue. The new model, which we call the Multiscale Topic Tomography Model (MTTM), employs non-homogeneous Poisson processes to model the generation of word counts. The evolution of topics is modeled through a multi-scale analysis using Haar wavelets. One of the new features of the model is that it models the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. Our experiments on Science data using the new model uncover some interesting patterns in topics. The new model is also comparable to LDA in predicting unseen data, as demonstrated by our perplexity experiments.

62 citations


Proceedings ArticleDOI
27 Aug 2007
TL;DR: An unsupervised algorithm is presented for the discovery of words and word-like fragments from the speech signal, without a predefined lexicon or acoustic phone models, based on a combination of acoustic pattern discovery, clustering, and temporal sequence learning.
Abstract: We present an unsupervised algorithm for the discovery of words and word-like fragments from the speech signal, without using a predefined lexicon or acoustic phone models. The algorithm is based on a combination of acoustic pattern discovery, clustering, and temporal sequence learning. It exploits the acoustic similarity between multiple acoustic tokens of the same words or word-like fragments. In its current form, the algorithm is able to discover words in speech with low perplexity (connected digits). Although its performance still falls short of mainstream ASR approaches, the value of the algorithm is its potential to serve as a computational model in two research directions. First, the algorithm may lead to an approach for speech recognition that is fundamentally liberated from the modelling constraints in conventional ASR. Second, the proposed algorithm can be interpreted as a computational model of language acquisition that takes actual speech as input and is able to find words as 'emergent' properties from raw input.

54 citations


Journal ArticleDOI
Imed Zitouni
TL;DR: The results suggest that the largest gains in performance are obtained when the test set contains a large number of unseen events, and that the proposed backoff hierarchical class n-gram language models outperform backoff n-gram language models.

41 citations


Journal ArticleDOI
Angela Brew
TL;DR: In this article, the editorial team has been working to develop a strategic plan for this journal and this has raised some interesting issues. In the journal we aim to present the best available re...
Abstract: In recent months the editorial team has been working to develop a strategic plan for this journal and this has raised some interesting issues. In the journal we aim to present the best available re...

38 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper investigates the use of several language model adaptation techniques applied to the task of machine translation from Arabic broadcast speech and finds that unsupervised and discriminative approaches slightly outperform the traditional perplexity-based optimization technique.
Abstract: This paper investigates the use of several language model adaptation techniques applied to the task of machine translation from Arabic broadcast speech. Unsupervised and discriminative approaches slightly outperform the traditional perplexity-based optimization technique. Language model adaptation, when used for n-best rescoring, improves machine translation performance by 0.3-0.4 BLEU and reduces translation edit rate (TER) by 0.2-0.5% compared to an unadapted LM.

34 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper presents People-LDA, a new graphical model that tightly couples images and captions through a modern face recognizer, and shows how topics can be refined to be more closely related to a single person rather than describing groups of people in a related area.
Abstract: Topic models have recently emerged as powerful tools for modeling topical trends in documents. Often the resulting topics are broad and generic, associating large groups of people and issues that are loosely related. In many cases, it may be desirable to influence the direction in which topic models develop. In this paper, we explore the idea of centering topics around people. In particular, given a large corpus of images featuring collections of people and associated captions, it seems natural to extract topics specifically focussed on each person. What words are most associated with George Bush? Which with Condoleezza Rice? Since people play such an important role in life, it is natural to anchor one topic to each person. In this paper, we present People-LDA, which uses the coherence of face images in news captions to guide the development of topics. In particular, we show how topics can be refined to be more closely related to a single person (like George Bush) rather than describing groups of people in a related area (like politics). To do this we introduce a new graphical model that tightly couples images and captions through a modern face recognizer. In addition to producing topics that are people-specific (using images as a guiding force), the model also performs excellent soft clustering of face images, using the language model to boost performance. We present a variety of experiments comparing our method to recent developments in topic modeling and joint image-language modeling, showing that our model has lower perplexity for face identification than competing models and produces more refined topics.

33 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: Experimental results on NIST RT06s evaluation meeting data verify that HPYLM is a competitive and promising language modeling technique, which consistently performs better than interpolated Kneser-Ney and modified Kneser-Ney n-gram LMs in terms of both perplexity and word error rate.
Abstract: In this paper we investigate the application of a hierarchical Bayesian language model (LM) based on the Pitman-Yor process for automatic speech recognition (ASR) of multiparty meetings. The hierarchical Pitman-Yor language model (HPYLM) provides a Bayesian interpretation of LM smoothing. An approximation to the HPYLM recovers the exact formulation of the interpolated Kneser-Ney smoothing method in n-gram models. This paper focuses on the application and scalability of HPYLM on a practical large vocabulary ASR system. Experimental results on NIST RT06s evaluation meeting data verify that HPYLM is a competitive and promising language modeling technique, which consistently performs better than interpolated Kneser-Ney and modified Kneser-Ney n-gram LMs in terms of both perplexity and word error rate.
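
For reference, the interpolated Kneser-Ney estimate that the HPYLM approximation recovers has the familiar absolute-discounting form (notation assumed here: c(h,w) is the count of word w after history h, d the discount, N_{1+}(h\,\bullet) the number of distinct word types observed after h, and h' the shortened history):

P_{\mathrm{KN}}(w \mid h) = \frac{\max\bigl(c(h,w) - d,\ 0\bigr)}{c(h)} + \frac{d\, N_{1+}(h\,\bullet)}{c(h)}\, P_{\mathrm{KN}}(w \mid h')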

Proceedings Article
01 Jun 2007
TL;DR: A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training, and the approach consistently improved machine translation quality on both speech- and text-based adaptation.
Abstract: We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework, crosslingual LM adaptation can be performed by first inferring the topic posterior distribution of the source text and then applying the inferred distribution to the target-language N-gram LM via marginal adaptation. The proposed framework also enables rapid bootstrapping of LSA models for new languages based on a source LSA model from another language. On Chinese-to-English speech and text translation the proposed bLSA framework successfully reduced word perplexity of the English LM by over 27% for a unigram LM and up to 13.6% for a 4-gram LM. Furthermore, the proposed approach consistently improved machine translation quality on both speech- and text-based adaptation.
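
The marginal adaptation step mentioned above is commonly realized as unigram rescaling of the background n-gram LM; a plausible form (an assumption for illustration, not necessarily the paper's exact formulation) is

P_{\mathrm{adapt}}(w \mid h) = \frac{\alpha(w)}{Z(h)}\, P_{\mathrm{bg}}(w \mid h), \qquad \alpha(w) = \Bigl(\tfrac{P_{\mathrm{LSA}}(w)}{P_{\mathrm{bg}}(w)}\Bigr)^{\beta}, \qquad Z(h) = \sum_{v} \alpha(v)\, P_{\mathrm{bg}}(v \mid h),

where P_{\mathrm{LSA}} is the unigram distribution inferred from the topic posterior and \beta is a tuning exponent.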

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A word topical mixture model (TMM) is proposed to explore the co-occurrence relationship between words, as well as the long-span latent topical information, for language model adaptation for Mandarin broadcast news recognition.
Abstract: This paper considers dynamic language model adaptation for Mandarin broadcast news recognition. A word topical mixture model (TMM) is proposed to explore the co-occurrence relationship between words, as well as the long-span latent topical information, for language model adaptation. The search history is modeled as a composite word TMM model for predicting the decoded word. The underlying characteristics and different kinds of model structures were extensively investigated, while the performance of word TMM was analyzed and verified by comparison with the conventional probabilistic latent semantic analysis-based language model (PLSALM) and trigger-based language model (TBLM) adaptation approaches. The large vocabulary continuous speech recognition (LVCSR) experiments were conducted on the Mandarin broadcast news collected in Taiwan. Very promising results in perplexity as well as character error rate reductions were initially obtained.

Journal ArticleDOI
TL;DR: A phonetic-feature-based prediction model is presented where phones are represented by a vector of symbolic features that can be on, off, unspecified or unused, and experiments show that feature-based models benefit from prosody cues, but not text, and that phone-based models do not benefit from any of the high-level cues explored here.

Proceedings ArticleDOI
19 Oct 2007
TL;DR: This short paper outlines the design of a recommendation process that is based on an implicit social network where the relevancy and meaning of information can be negotiated not only with the recommender system but also with other users.
Abstract: In this short paper, we describe our RSS recommender system, KeepUP. Too often recommender systems are seen as black box systems, resulting in general perplexity and dissatisfaction from users who are treated as passive, isolated consumers. Recent literature observes that recommendations rarely occur within such isolation and that there may be potential within more socially-orientated approaches. With KeepUP, we outline the design of a recommendation process that is based on an implicit social network where the relevancy and meaning of information can be negotiated not only with the recommender system but also with other users. Our overall goal is to support the formation and development of online communities of interest.

Proceedings ArticleDOI
Ruhi Sarikaya, Mohamed Afify, Yuqing Gao
15 Apr 2007
TL;DR: A new language modeling method is presented that takes advantage of Arabic morphology by combining morphological segments with the underlying lexical items and additional available information sources regarding morphological segments and lexical items within a single joint model.
Abstract: Language modeling for inflected languages such as Arabic poses new challenges for speech recognition due to rich morphology. The rich morphology results in large increases in perplexity and out-of-vocabulary (OOV) rate. In this study, we present a new language modeling method that takes advantage of Arabic morphology by combining morphological segments with the underlying lexical items and additional available information sources with regard to morphological segments and lexical items within a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates. Preliminary experiments detailed in this paper show satisfactory improvements over word- and morpheme-based trigram language models and their interpolations.

01 Jan 2007
TL;DR: In this paper, the dialogue state is defined by the set of parameters contained in the system prompt, so that a separate language model can be constructed for each state; robust models are obtained through linear interpolation of all dialogue-state dependent language models and an automatic text clustering algorithm.
Abstract: Dialogue-state dependent language models in automatic inquiry systems can be employed to improve speech recognition and understanding. In this paper, the dialogue state is defined by the set of parameters contained in the system prompt. Using this knowledge, a separate language model for each state can be constructed. In order to obtain robust language models we study the linear interpolation of all dialogue-state dependent language models and an automatic text clustering algorithm. In particular, we extend the clustering algorithm so as to automatically determine the optimal number of clusters. These clusters are then combined with linear interpolation. We present experimental results on a Dutch corpus which has been recorded in the Netherlands with a train timetable information system in the framework of the ARISE project [1]. The perplexity, the word error rate, and the attribute error rate can be reduced significantly with all of these methods.
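
A minimal sketch of the linear interpolation used for such component language models, with weights estimated by EM on held-out data (equivalently, by minimizing held-out perplexity). The component-LM interface and the data format are assumptions for illustration, not the paper's implementation.

import math

def em_interpolation_weights(component_lms, heldout, iters=20):
    """component_lms: objects exposing .prob(word, history) -> float
       heldout: list of (history, word) pairs."""
    M = len(component_lms)
    lam = [1.0 / M] * M                              # start from uniform weights
    for _ in range(iters):
        expected = [0.0] * M
        for history, word in heldout:
            p = [lam[m] * component_lms[m].prob(word, history) for m in range(M)]
            total = sum(p) or 1e-12
            for m in range(M):                       # E-step: component posteriors
                expected[m] += p[m] / total
        lam = [e / len(heldout) for e in expected]   # M-step: renormalize
    return lam

def interpolated_perplexity(component_lms, lam, data):
    logp = sum(math.log(sum(l * lm.prob(w, h) for l, lm in zip(lam, component_lms)) or 1e-12)
               for h, w in data)
    return math.exp(-logp / len(data))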

Book ChapterDOI
28 Jun 2007
TL;DR: Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce perplexity and also yields marginal reductions in word error rate.
Abstract: Prosody has been actively studied as an important knowledge source for speech recognition and understanding. In this paper, we are concerned with the question of exploiting prosody in language models to aid automatic speech recognition in the context of meetings. Using an automatic syllable detection algorithm, the syllable-based prosodic features are extracted to form the prosodic representation for each word. Two modeling approaches are then investigated. One is based on a factored language model, which directly uses the prosodic representation and treats it as a 'word'. Instead of direct association, the second approach provides a richer probabilistic structure within a hierarchical Bayesian framework by introducing an intermediate latent variable to represent similar prosodic patterns shared by groups of words. Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce perplexity and also yields marginal reductions in word error rate.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: A topic detection approach based on a probabilistic framework is proposed to realize topic adaptation of speech recognition systems for long speech archives such as meetings, demonstrating significant reductions in perplexity and out-of-vocabulary rates as well as robustness against ASR errors.
Abstract: A topic detection approach based on a probabilistic framework is proposed to realize topic adaptation of speech recognition systems for long speech archives such as meetings. Since topics in such speech are not clearly defined, unlike news stories, we adopt a probabilistic representation of topics based on probabilistic latent semantic analysis (PLSA). A topical sub-space is constructed by PLSA, and speech segments are projected to the subspace; each segment is then represented by a vector of topic probabilities obtained by the projection. Topic detection is performed by clustering these vectors, and topic adaptation is done by collecting relevant texts based on similarity in this probabilistic representation. In experimental evaluations, the proposed approach demonstrated significant reductions in perplexity and out-of-vocabulary rates as well as robustness against ASR errors.
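
A rough sketch of the projection-and-clustering step described above. NMF is used here as a convenient stand-in for PLSA (the two are closely related), and the bag-of-words segment representation, topic count, and clustering choices are assumptions rather than the authors' setup.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def detect_topics(segments, n_topics=8, n_clusters=4):
    """segments: list of strings, one per speech segment (e.g., ASR transcripts)."""
    X = CountVectorizer().fit_transform(segments)             # segment-term counts
    W = NMF(n_components=n_topics, init="nndsvda", max_iter=500).fit_transform(X)
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # normalize to topic proportions
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(W)
    return W, labels    # per-segment topic vectors and their topic-cluster ids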

Proceedings ArticleDOI
01 Dec 2007
TL;DR: A novel language model is presented that can incorporate various types of linguistic information encoded in the form of a token, a (word, label)-tuple; using tokens as hidden states, the model produces sequences of words with trivial output distributions.
Abstract: We present a novel language model capable of incorporating various types of linguistic information as encoded in the form of a token, a (word, label)-tuple. Using tokens as hidden states, our model is effectively a hidden Markov model (HMM) producing sequences of words with trivial output distributions. The transition probabilities, however, are computed using a maximum entropy model to take advantage of potentially overlapping features. We investigated different types of labels with a wide range of linguistic implications. These models outperform Kneser-Ney smoothed n-gram models both in terms of perplexity on standard datasets and in terms of word error rate for a large vocabulary speech recognition system.

Journal ArticleDOI
01 May 2007
TL;DR: The proposed modeling and estimation methods for the mixture language model (LM) led to a 21% reduction of perplexity on test sets of five doctors, which translated into improvements in captioning accuracy.
Abstract: We are developing an automatic captioning system for teleconsultation video teleconferencing (TC-VTC) in telemedicine, based on large vocabulary conversational speech recognition. In TC-VTC, doctors' speech contains a large number of infrequently used medical terms in spontaneous styles. Due to insufficiency of data, we adopted mixture language modeling, with models trained from several datasets of medical and nonmedical domains. This paper proposes novel modeling and estimation methods for the mixture language model (LM). Component LMs are trained from individual datasets, with class n-gram LMs trained from in-domain datasets and word n-gram LMs trained from out-of-domain datasets, and they are interpolated into a mixture LM. For class LMs, semantic categories are used for class definition on medical terms, names, and digits. The interpolation weights of a mixture LM are estimated by a greedy algorithm of forward weight adjustment (FWA). The proposed mixing of in-domain class LMs and out-of-domain word LMs, the semantic definitions of word classes, as well as the weight-estimation algorithm of FWA are effective on the TC-VTC task. As compared with using mixtures of word LMs with weights estimated by the conventional expectation-maximization algorithm, the proposed methods led to a 21% reduction of perplexity on test sets of five doctors, which translated into improvements in captioning accuracy.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: A new bigram topic model, the bigram PLSA model, is presented, along with a modified training strategy that unevenly assigns latent topics to context words according to an estimate of their latent semantic complexities.
Abstract: As an important component of many speech and language processing applications, the statistical language model has been widely investigated. The bigram topic model, which combines advantages of both the traditional n-gram model and the topic model, has turned out to be a promising language modeling approach. However, the original bigram topic model assigns the same number of topics to each context word, ignoring the fact that the latent semantics of different context words vary in complexity. We therefore present a new bigram topic model, the bigram PLSA model, and propose a modified training strategy that unevenly assigns latent topics to context words according to an estimate of their latent semantic complexities. As a consequence, a refined bigram PLSA model is reached. Experiments on HUB4 Mandarin test transcriptions reveal its superiority over existing models, and further performance improvements in perplexity are achieved through the use of the refined bigram PLSA model.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper investigates several methods that combine a POS-based model or integrate POS information in the ME (maximum entropy) scheme, which achieve significant reductions in perplexity and WER in a meeting transcription task.
Abstract: For language modeling of spontaneous speech, we propose a novel approach, based on the statistical machine translation framework, which transforms a document-style model to the spoken style. For better coverage and more reliable estimation, incorporation of POS (part-of-speech) information is explored in addition to lexical information. In this paper, we investigate several methods that combine a POS-based model or integrate POS information in the ME (maximum entropy) scheme. They achieve significant reductions in perplexity and WER in a meeting transcription task. Moreover, the model is applied to different domains or committee meetings of different topics. As a result, an even larger perplexity reduction is achieved than in the in-domain case. The result demonstrates the generality and portability of the model.

Book ChapterDOI
01 Jan 2007
TL;DR: This chapter overviews techniques for evaluating speech and speaker recognition systems, describes principles of recognition methods, and specifies types of systems as well as their applications.
Abstract: This chapter overviews techniques for evaluating speech and speaker recognition systems. The chapter first describes principles of recognition methods, and specifies types of systems as well as their applications. The evaluation methods can be classified into subjective and objective methods, among which the chapter focuses on the latter methods. In order to compare/normalize performances of different speech recognition systems, test set perplexity is introduced as a measure of the difficulty of each task. Objective evaluation methods of spoken dialogue and transcription systems are respectively described. Speaker recognition can be classified into speaker identification and verification, and most of the application systems fall into the speaker verification category. Since variation of speech features over time is a serious problem in speaker recognition, normalization and adaptation techniques are also described. Speaker verification performance is typically measured by equal error rate, detection error trade-off (DET) curves, and a weighted cost value. The chapter concludes by summarizing various issues for future research.
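
As the chapter notes, test set perplexity quantifies the difficulty of a recognition task; for a test word sequence w_1 ... w_N it is conventionally defined as

\mathrm{PP} = P(w_1, \ldots, w_N)^{-1/N} = \exp\!\Bigl(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1, \ldots, w_{i-1})\Bigr),

so a lower perplexity indicates a more predictable, and hence easier, task for the recognizer.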

Proceedings ArticleDOI
22 Apr 2007
TL;DR: By applying a two-step language model adaptation process based on notes and agenda items, this work was able to reduce perplexity by 9% and word error rate by 4% relative on a set of ten meetings recorded in-house.
Abstract: We describe the use of meeting metadata, acquired using a computerized meeting organization and note-taking system, to improve automatic transcription of meetings. By applying a two-step language model adaptation process based on notes and agenda items, we were able to reduce perplexity by 9% and word error rate by 4% relative on a set of ten meetings recorded in-house. This approach can be used to leverage other types of metadata.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: Conditional random fields (CRFs) are applied to train the language model and classify documents, and significant improvement in dialect classification is achieved using the CRF-based classifier, especially on small documents.
Abstract: Studies have shown that dialect variation has a significant impact on speech recognition performance, and therefore it is important to be able to perform effective dialect classification to improve speech systems. Dialects differ at the acoustic, grammar, and vocabulary levels. In this study, topic-specific printed text dialect data are collected from the ten major newspapers in Australia, the United Kingdom, and the United States. An n-gram language model is trained for each topic in each country/dialect. The perplexity measure is applied to classify the dialect-dependent documents. In addition to the n-gram information, further features can be extracted from text structure. A conditional random field (CRF) is such a model: it can extract different levels of features while remaining mathematically tractable. The CRF is applied to train the language model and classify documents. Significant improvement in dialect classification is achieved using the CRF-based classifier, especially on small documents (10% to 22% relative error reduction). Text classification based on variable-size documents is explored, and a document with several hundred words is shown to be sufficient for dialect classification. The vocabulary differences among the text documents from different countries are explored, and the dialect difference is shown to be closely connected with the vocabulary difference. Five document topics are evaluated and performance for cross-topic dialect classification is explored.
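
A minimal sketch of the perplexity-based classification step described above: one bigram LM is trained per dialect, and a document is labeled with the dialect whose LM assigns it the lowest perplexity. Add-one smoothing and the tokenization are stand-in assumptions, not the smoothing actually used in the paper.

import math
from collections import Counter

class BigramLM:
    def __init__(self, sentences):                   # sentences: lists of tokens
        self.uni, self.bi = Counter(), Counter()
        for toks in sentences:
            toks = ["<s>"] + toks + ["</s>"]
            self.uni.update(toks[:-1])
            self.bi.update(zip(toks[:-1], toks[1:]))
        self.V = len(self.uni) + 1                   # +1 for unseen words

    def logprob(self, prev, word):
        # add-one smoothed bigram probability
        return math.log((self.bi[(prev, word)] + 1) / (self.uni[prev] + self.V))

    def perplexity(self, sentences):
        logp, n = 0.0, 0
        for toks in sentences:
            toks = ["<s>"] + toks + ["</s>"]
            for prev, word in zip(toks[:-1], toks[1:]):
                logp += self.logprob(prev, word)
                n += 1
        return math.exp(-logp / n)

def classify(document, dialect_lms):
    """Return the dialect whose LM gives the document (a list of token lists) the lowest perplexity."""
    return min(dialect_lms, key=lambda d: dialect_lms[d].perplexity(document))

# usage: lms = {"AU": BigramLM(au_sents), "UK": BigramLM(uk_sents), "US": BigramLM(us_sents)}
#        print(classify(test_doc_sentences, lms))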

Journal IssueDOI
TL;DR: A generative text model using Dirichlet Mixtures as a distribution for parameters of a multinomial distribution, whose compound distribution is Polya Mixtures, is proposed and it is shown that the model exhibits high performance in application to statistical language models.
Abstract: We propose a generative text model using Dirichlet Mixtures as a distribution for parameters of a multinomial distribution, whose compound distribution is Polya Mixtures, and show that the model exhibits high performance in application to statistical language models. In this paper, we discuss some methods for estimating parameters of Dirichlet Mixtures and for estimating the expectation values of the a posteriori distribution needed for adaptation, and then compare them with two previous text models. The first conventional model is the Mixture of Unigrams, which is often used for incorporating topics into statistical language models. The second one is LDA (Latent Dirichlet Allocation), a typical generative text model. In an experiment using document probabilities and dynamic adaptation of n-gram models for newspaper articles, we show that the proposed model, in comparison with the two previous models, can achieve a lower perplexity at low mixture numbers. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(12): 76–85, 2007; Published online in Wiley InterScience. DOI 10.1002/scj.20629

Proceedings ArticleDOI
01 Dec 2007
TL;DR: The proposed minimum Bayes risk (MBR) based approach provides a flexible framework for unsupervised LM adaptation and generalizes to a variety of forms of recognition and translation error metrics.
Abstract: This paper investigates unsupervised test-time adaptation of language models (LM) using discriminative methods for a Mandarin broadcast speech transcription and translation task. A standard approach to adapting interpolated language models is to optimize the component weights by minimizing the perplexity on supervision data. This is a widely made approximation for language modeling in automatic speech recognition (ASR) systems. For speech translation tasks, it is unclear whether a strong correlation still exists between perplexity and various forms of error cost functions in the recognition and translation stages. The proposed minimum Bayes risk (MBR) based approach provides a flexible framework for unsupervised LM adaptation. It generalizes to a variety of forms of recognition and translation error metrics. LM adaptation is performed at the audio document level using either the character error rate (CER) or the translation edit rate (TER) as the cost function. An efficient parameter estimation scheme using the extended Baum-Welch (EBW) algorithm is proposed. Experimental results on a state-of-the-art speech recognition and translation system are presented. The MBR-adapted language models gave the best recognition and translation performance and reduced the TER score by up to 0.54% absolute.

Proceedings ArticleDOI
17 Sep 2007
TL;DR: A novel method is presented that adjusts the improperly assigned probabilities of unseen n-grams by taking advantage of the agglutinative characteristics of the Korean language, to prevent grammatically improper n-grams from achieving relatively higher probability and to assign more probability mass to proper n-grams.
Abstract: Smoothing for an n-gram language model is an algorithm that can assign a non-zero probability to an unseen n-gram. Smoothing is an essential technique for an n-gram language model due to the data sparseness problem. However, in some circumstances it assigns an improper amount of probability to unseen n-grams. In this paper, we present a novel method that adjusts the improperly assigned probabilities of unseen n-grams by taking advantage of the agglutinative characteristics of the Korean language. In Korean, the grammatically proper class of a morpheme can be predicted by knowing the previous morpheme. By using this characteristic, we try to prevent grammatically improper n-grams from achieving relatively higher probability and to assign more probability mass to proper n-grams. Experimental results show that the proposed method can achieve 8.6%-12.5% perplexity reductions for the Katz backoff algorithm and 4.9%-7.0% perplexity reductions for Kneser-Ney smoothing.

Proceedings ArticleDOI
01 Feb 2007
TL;DR: It is shown that it is possible to use n-gram models considering histories different from those used during training, called crossing context models, which achieve an improvement in terms of word error rate on the data used for the francophone evaluation campaign ESTER.
Abstract: This study examines an original way to take advantage of distant information in statistical language models. We show that it is possible to use n-gram models considering histories different from those used during training. These models are called crossing context models. Our study deals with classical and distant n-gram models. A mixture of four models is proposed and evaluated. A bigram linear mixture achieves an improvement of 14% in terms of perplexity. Moreover, the trigram mixture outperforms the standard trigram by 5.6%. These improvements have been obtained without increasing the complexity of standard n-gram models. The resulting mixture language model has been integrated into a speech recognition system. Its evaluation shows a slight improvement in terms of word error rate on the data used for the francophone evaluation campaign ESTER [1]. Finally, the impact of the proposed crossing context language models on performance is analyzed for various speakers.
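
One plausible form of the four-component mixture described above combines classical and distant (gapped) histories linearly; the exact component set and notation below are assumptions rather than the authors' formulation:

P_{\mathrm{mix}}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \lambda_1\, P(w_i \mid w_{i-1}) + \lambda_2\, P_{d}(w_i \mid w_{i-2}) + \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_4\, P_{d}(w_i \mid w_{i-3}, w_{i-2}), \qquad \textstyle\sum_{m}\lambda_m = 1,

where P_{d} denotes a distant n-gram model whose history skips the word(s) immediately preceding w_i.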