Journal Article

n-Gram Statistics for Natural Language Understanding and Text Processing

Ching Y. Suen
01 Feb 1979, Vol. 1, Iss. 2, pp. 164-172
TLDR
The positional distributions of n-grams obtained in the present study are discussed, and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.
Abstract
n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
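
As a rough illustration of the two kinds of statistics the abstract mentions (n-gram frequencies and their positional distributions), the following is a minimal sketch, not the author's procedure. It assumes that n-grams are runs of n letters within a single word and that position means the 1-based starting index inside the word. The function names `ngram_counts` and `positional_counts` and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count letter n-grams that lie entirely inside a word (assumed definition)."""
    counts = Counter()
    for word in words:
        w = word.lower()
        # Words shorter than n contribute nothing because the range is empty.
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts


def positional_counts(words, n):
    """Count each n-gram keyed by its 1-based starting position in the word."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            counts[(w[i:i + n], i + 1)] += 1
    return counts


# Toy input; the study itself used a corpus of about 1 million words.
sample = "the cat sat on the mat".split()
bigrams = ngram_counts(sample, 2)
positions = positional_counts(sample, 2)

print(bigrams.most_common(3))
print(positions[("at", 2)])
```

Setting `n` from 1 to 5 covers the range named in the abstract; dividing each count by the total gives relative frequencies.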


Citations

N-gram-based text categorization

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Journal Article

Gauging Similarity with n-Grams: Language-Independent Categorization of Text.

Marc Damashek
10 Feb 1995
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Journal Article

Twenty years of document image analysis in PAMI

TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
Book

Survey of Text Mining: Clustering, Classification, and Retrieval

TL;DR: Survey of Text Mining II offers a broad selection of state-of-the-art algorithms and software for text mining from both academic and industrial perspectives, to generate interest and insight into the state of the field.
Journal Article

Online recognition of Chinese characters: the state-of-the-art

TL;DR: This paper reviews the advances in online Chinese character recognition (OLCCR), with emphasis on the research works from the 1990s, in terms of pattern representation, character classification, learning/adaptation, and contextual processing.
References
Journal Article

The Viterbi algorithm

TL;DR: This paper gives a tutorial exposition of the Viterbi algorithm and of how it is implemented and analyzed, and increasing use of the algorithm in a widening variety of areas is foreseen.
Journal Article

The String-to-String Correction Problem

TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.
Journal Article

Prediction and entropy of printed English

TL;DR: A new method of estimating the entropy and redundancy of a language is described, which exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known.