Journal Article
n-Gram Statistics for Natural Language Understanding and Text Processing
TL;DR: The positional distributions of n-grams obtained in the present study are discussed, and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.
Abstract:
n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
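As a rough illustration of the kind of statistics the abstract describes, the following minimal sketch counts n-gram frequencies and their positional distributions. It assumes letter n-grams taken within individual words and a 0-based starting position. The function names and the toy sample are hypothetical and are not taken from the article.

```python
from collections import Counter

def char_ngrams(word, n):
    """Return (position, n-gram) pairs for one word; positions are 0-based."""
    return [(i, word[i:i + n]) for i in range(len(word) - n + 1)]

def ngram_statistics(words, n):
    """Count n-gram frequencies overall and by starting position in the word."""
    totals = Counter()
    by_position = Counter()
    for word in words:
        for pos, gram in char_ngrams(word.lower(), n):
            totals[gram] += 1
            by_position[(pos, gram)] += 1
    return totals, by_position

# Toy input (an assumption for illustration, not the corpus used in the study).
sample = "the theory of the thing".split()
totals, by_position = ngram_statistics(sample, 2)
print(totals.most_common(3))
print(by_position[(0, "th")])
```

Keeping a separate counter keyed by `(position, n-gram)` makes it possible to ask how often a given n-gram begins a word versus appears later in it, which is one simple reading of "positional distribution".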
Citations
N-gram-based text categorization
W.B. Cavnar, John M. Trenkle
TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Journal Article
Gauging Similarity with n-Grams: Language-Independent Categorization of Text.
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Journal Article
Twenty years of document image analysis in PAMI
TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
Book
Survey of Text Mining: Clustering, Classification, and Retrieval
TL;DR: Survey of Text Mining II offers a broad selection of state-of-the-art algorithms and software for text mining from both academic and industrial perspectives, to generate interest and insight into the state of the field.
Journal Article
Online recognition of Chinese characters: the state-of-the-art
TL;DR: This paper reviews the advances in online Chinese character recognition (OLCCR), with emphasis on the research works from the 1990s, in terms of pattern representation, character classification, learning/adaptation, and contextual processing.
References
Journal Article
Binary codes capable of correcting deletions, insertions, and reversals
Journal Article
The Viterbi algorithm
TL;DR: This paper gives a tutorial exposition of the Viterbi algorithm and of how it is implemented and analyzed, and increasing use of the algorithm in a widening variety of areas is foreseen.
Journal Article
The String-to-String Correction Problem
TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.
Journal Article
Prediction and entropy of printed English
TL;DR: A new method of estimating the entropy and redundancy of a language is described, which exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known.
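For readers unfamiliar with the string-to-string correction problem listed among the references above, the following sketch shows the standard dynamic-programming formulation, whose running time is proportional to the product of the two string lengths. It assumes unit costs for insertion, deletion, and substitution. The function name is hypothetical, and the code is not taken from the referenced paper.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` while the number of cell updates remains `len(a) * len(b)`.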