n-Gram Statistics for Natural Language Understanding and Text Processing

doi:10.1109/TPAMI.1979.4766902

Journal Article

Ching Y. Suen

- 01 Feb 1979 -

IEEE Transactions on Pattern Analysis an...

- Vol. 1, Iss: 2, pp 164-172

TLDR

The positional distributions of n-grams obtained in the present study are discussed, and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.

Abstract:

n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
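
The kinds of statistics named in the abstract (n-gram frequencies, positional distributions, and word lengths) can be illustrated with a minimal sketch. It assumes letter n-grams taken inside individual words and a simple lowercase alphabetic tokenizer; the function names and the sample sentence are hypothetical and are not taken from the paper, whose exact procedure is not described here.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic words (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_statistics(text, n):
    """Count letter n-grams within words and record where each one starts.

    Returns (freq, positions):
      freq[gram]           -> total occurrences of the n-gram
      positions[gram][pos] -> occurrences starting at 0-based offset pos in a word
    """
    freq = Counter()
    positions = defaultdict(Counter)
    for word in tokenize(text):
        # Words shorter than n contribute no n-grams.
        for pos in range(len(word) - n + 1):
            gram = word[pos:pos + n]
            freq[gram] += 1
            positions[gram][pos] += 1
    return freq, positions


def word_length_distribution(text):
    """Map each word length to the number of words of that length."""
    return Counter(len(word) for word in tokenize(text))


if __name__ == "__main__":
    sample = "the cat sat on the mat"  # hypothetical toy input
    for n in range(1, 6):  # n = 1 to 5, as in the abstract
        freq, _ = ngram_statistics(sample, n)
        print(n, freq.most_common(3))
    print(word_length_distribution(sample))
```

Keeping the start offset alongside each count is what allows a positional distribution to be read off later, for example how often a given bigram begins a word versus ends one.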

Citations

N-gram-based text categorization

W.B. Cavnar, +1 more

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.

Journal Article

Gauging Similarity with n-Grams: Language-Independent Categorization of Text.

Marc Damashek

- 10 Feb 1995 -

Science

TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.

Journal Article

Twenty years of document image analysis in PAMI

George Nagy

- 01 Jan 2000 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.

Book

Survey of Text Mining: Clustering, Classification, and Retrieval

Michael W. Berry, +1 more

TL;DR: Survey of Text Mining II offers a broad selection of state-of-the-art algorithms and software for text mining from both academic and industrial perspectives, to generate interest and insight into the state of the field.

Journal Article

Online recognition of Chinese characters: the state-of-the-art

Cheng-Lin Liu, +2 more

- 01 Jan 2004 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This paper reviews the advances in online Chinese character recognition (OLCCR), with emphasis on the research works from the 1990s, in terms of pattern representation, character classification, learning/adaptation, and contextual processing.

References

Journal Article

Binary codes capable of correcting deletions, insertions and reversals

V.I. Levenshtein

- 01 Jan 1965 -

Proceedings of the USSR Academy of Scien...

Journal Article

Binary codes capable of correcting deletions, insertions, and reversals

V.I. Levenshtein

- 01 Jan 1966 -

Soviet physics. Doklady

Journal Article

The Viterbi algorithm

G.D. Forney, Jr.

TL;DR: This paper gives a tutorial exposition of the Viterbi algorithm and of how it is implemented and analyzed, and increasing use of the algorithm in a widening variety of areas is foreseen.

Journal Article

The String-to-String Correction Problem

Robert A. Wagner, +1 more

- 01 Jan 1974 -

Journal of the ACM

TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.

Journal Article

Prediction and entropy of printed English

Claude E. Shannon

- 01 Jan 1951 -

Bell System Technical Journal

TL;DR: A new method of estimating the entropy and redundancy of a language is described, which exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known.

Related Papers (5)

N-gram-based text categorization

Gauging Similarity with n-Grams: Language-Independent Categorization of Text.

Prediction and entropy of printed English

Introduction to Modern Information Retrieval

A vector space model for automatic indexing