
Mingjing Li

Researcher at Microsoft

Publications: 6
Citations: 314

Mingjing Li is an academic researcher from Microsoft. The author has contributed to research in topics including language modeling and perplexity, has an h-index of 6, and has co-authored 6 publications receiving 307 citations.

Papers
Journal Article · DOI

Toward a unified approach to statistical language modeling for Chinese

TL;DR: This article presents a unified approach to Chinese statistical language modeling, which automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all using the maximum likelihood principle, which is consistent with trigram model training.
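The segmentation step is the easiest part of this pipeline to make concrete. Below is a minimal, illustrative sketch of lexicon-driven, maximum-likelihood word segmentation of the kind the paper describes; the toy lexicon, its probabilities, and the unknown-character floor are invented for the example and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (illustrative values;
# a real system would estimate these from the Web-gathered training data).
LEXICON = {
    "北京": 0.004, "大学": 0.006, "北京大学": 0.003,
    "生": 0.002, "学生": 0.005,
}
UNKNOWN_LOGPROB = math.log(1e-8)  # assumed floor for out-of-lexicon characters

def ml_segment(text):
    """Segment `text` into lexicon words maximizing total log-probability
    (Viterbi dynamic programming over all candidate segmentations)."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i] = best score of text[:i]
    back = [0] * (n + 1)              # backpointer to start of last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):  # cap candidate word length at 8
            word = text[j:i]
            if word in LEXICON:
                lp = math.log(LEXICON[word])
            elif i - j == 1:
                lp = UNKNOWN_LOGPROB
            else:
                continue
            if best[j] + lp > best[i]:
                best[i], back[i] = best[j] + lp, j
    # Recover the segmentation from the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(ml_segment("北京大学生"))  # ['北京大学', '生'] under these toy numbers
```

Dynamic programming finds the globally best-scoring segmentation rather than a greedy longest match, which is what makes the segmentation consistent with a probability-based model.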
Proceedings Article

Discriminative training on language model.

TL;DR: This paper proposes a discriminative training method that minimizes the recognizer's error rate rather than estimating the distribution of the training data, achieving approximately 5%-25% recognition error reduction through discriminative training of the language model.
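The TL;DR does not spell out the update rule, so the following only illustrates the contrast it draws: instead of estimating probabilities from counts (maximum likelihood), discriminative training adjusts language model scores so the recognizer makes fewer errors. The perceptron-style update and all names below are assumptions for illustration, not the paper's algorithm.

```python
from collections import defaultdict

weights = defaultdict(float)  # n-gram -> learned discriminative adjustment

def ngrams(sentence, n=3):
    """Trigrams of a whitespace-tokenized sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def lm_score(sentence, base_logprob):
    """Base trigram log-probability plus the discriminative corrections."""
    return base_logprob + sum(weights[ng] for ng in ngrams(sentence))

def perceptron_update(reference, best_hypothesis, lr=0.1):
    """If the recognizer preferred an erroneous hypothesis, shift n-gram
    scores toward the reference transcript and away from the error."""
    if best_hypothesis == reference:
        return
    for ng in ngrams(reference):
        weights[ng] += lr
    for ng in ngrams(best_hypothesis):
        weights[ng] -= lr

# Illustrative update: the recognizer confused "two" with "to".
perceptron_update("i want to go home", "i want two go home")
```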
Patent

A system and method for joint optimization of language model performance and size

TL;DR: In this article, a method for joint optimization of language model performance and size is presented, comprising developing a language model from a tuning set of information, segmenting at least a subset of a received textual corpus, calculating a perplexity value for each segment, and refining the language model with one or more segments of the received corpus based, at least in part, on the calculated perplexity values.
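As a sketch of how such a perplexity criterion might operate, the snippet below computes per-word perplexity for each candidate corpus segment under an existing trigram model and keeps segments scoring under a threshold. The helper name trigram_logprob and the thresholding policy are assumptions for illustration; the patent does not commit to this exact selection rule.

```python
import math

def perplexity(segment_words, trigram_logprob):
    """Per-word perplexity of one corpus segment. `trigram_logprob(w, hist)`
    is an assumed callable returning log P(w | hist) under the current model."""
    total, hist = 0.0, ("<s>", "<s>")
    for w in segment_words:
        total += trigram_logprob(w, hist)
        hist = (hist[1], w)
    return math.exp(-total / max(len(segment_words), 1))

def refine_training_set(segments, trigram_logprob, threshold):
    """Keep segments whose perplexity falls below a threshold -- one plausible
    reading of refining the model 'based on the calculated perplexity values'."""
    return [s for s in segments if perplexity(s, trigram_logprob) < threshold]

# Stub model assigning a uniform log-probability to every word:
stub = lambda w, hist: math.log(1e-4)
print(perplexity("the model is refined".split(), stub))  # ~10000.0
```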
Proceedings Article · DOI

A unified approach to statistical language modeling for Chinese

TL;DR: The paper presents a unified approach to Chinese statistical language modeling, which automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, and segments the training data using this lexicon, all using a maximum likelihood principle that is consistent with trigram training.

Lexicon Optimization for Chinese Language Modeling

TL;DR: The method is an iterative procedure consisting of two phases, lexicon generation and lexicon pruning; the pruning phase reduces the lexicon to a preset memory limit using a perplexity-minimization criterion.
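A greedy version of the pruning phase is easy to sketch: repeatedly remove the lexicon entry whose removal hurts held-out perplexity the least, until the lexicon fits the memory budget. The helper perplexity_fn (rebuild the model on the reduced lexicon and score a held-out corpus) is assumed; the paper's actual procedure may batch removals or use approximations that this brute-force loop does not capture.

```python
def prune_lexicon(lexicon, max_size, held_out, perplexity_fn):
    """Greedy perplexity-driven pruning sketch. `perplexity_fn(lex, corpus)`
    is an assumed helper that re-segments and re-scores `corpus` using only
    the words in `lex`. Note the loop is O(|lexicon|^2) model rebuilds, so a
    real system would approximate the per-word perplexity impact instead."""
    lexicon = set(lexicon)
    while len(lexicon) > max_size:
        best_word, best_ppl = None, float("inf")
        for w in lexicon:
            ppl = perplexity_fn(lexicon - {w}, held_out)
            if ppl < best_ppl:
                best_word, best_ppl = w, ppl
        lexicon.discard(best_word)  # drop the least damaging word
    return lexicon
```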