Open Access · Journal Article (DOI)

Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus

Mikio Yamamoto, Kenneth W. Church
01 Mar 2001 · Computational Linguistics, Vol. 27, Iss. 1, pp. 1-30
TL;DR
The authors used suffix arrays to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora: an English corpus of 50 million words of the Wall Street Journal and a Japanese corpus of 216 million characters of the Mainichi Shimbun.
Abstract
Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora: an English corpus of 50 million words of the Wall Street Journal and a Japanese corpus of 216 million characters of the Mainichi Shimbun.

The second half of the paper uses these frequencies to find "interesting" substrings. Lexicographers have been interested in n-grams with high mutual information (MI), where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance, where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption), whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents). The combination of MI and RIDF is better than either by itself in a Japanese word extraction task.
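To make the core idea concrete, here is a minimal Python sketch (an illustration only, not the authors' optimized implementation: the toy corpus, word-level tokenization, the naive O(N^2 log N) sort, and the bisect key parameter, which needs Python 3.10+, are all simplifications). A suffix array reduces the term frequency of any n-gram to the width of a single interval found by binary search, and all substrings that select the same interval form one of the paper's equivalence classes, so they share the same tf and df.

    from bisect import bisect_left, bisect_right

    corpus = ("to be or not to be that is the question "
              "to be or not to be").split()
    N = len(corpus)

    # Suffix array: corpus positions sorted by the suffix starting there.
    # Naive construction for illustration; the paper builds this far more
    # efficiently for a 50-million-word corpus.
    sa = sorted(range(N), key=lambda i: corpus[i:])

    def interval(ngram):
        """Suffix-array interval of all suffixes that begin with ngram."""
        n = len(ngram)
        first_n = lambda i: corpus[i:i + n]   # compare first n tokens only
        return (bisect_left(sa, ngram, key=first_n),
                bisect_right(sa, ngram, key=first_n))

    def tf(ngram):
        """Term frequency: the width of the n-gram's interval."""
        lo, hi = interval(ngram)
        return hi - lo

    def df(ngram, doc_id):
        """Document frequency, given doc_id[p] = document containing
        corpus position p.  (The paper computes df for every class in
        O(1) each after a single preprocessing pass.)"""
        lo, hi = interval(ngram)
        return len({doc_id[sa[j]] for j in range(lo, hi)})

    print(tf("to be".split()))      # 4
    print(tf("to be or".split()))   # 2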


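The two "interestingness" scores from the second half of the paper can also be written down compactly. A sketch under standard formulations (IDF as negative log relative document frequency, a Poisson model of chance for RIDF, and a two-part pointwise MI; the paper's generalization of MI to longer n-grams is more involved than this):

    from math import exp, log2

    def ridf(tf, df, D):
        """Residual IDF: observed IDF minus the IDF predicted by a Poisson
        model in which the tf occurrences fall on D documents at random.
        Bursty terms (names, technical terminology, good keywords) score
        high because their occurrences cluster in few documents."""
        observed_idf = -log2(df / D)
        p_hit = 1 - exp(-tf / D)       # Poisson P(a document gets >= 1 hit)
        predicted_idf = -log2(p_hit)
        return observed_idf - predicted_idf

    def mi(tf_xy, tf_x, tf_y, N):
        """Pointwise mutual information for a substring split into parts
        x and y: joint probability versus the independence baseline."""
        return log2((tf_xy / N) / ((tf_x / N) * (tf_y / N)))

As the abstract notes, the two scores are complementary: MI is high when the parts of a phrase rarely occur apart (noncompositional semantics), while RIDF is high when a term's occurrences are distributed nonrandomly over documents.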

Citations
Book

Foundations of Statistical Natural Language Processing

TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Proceedings Article (DOI)

Mining the peanut gallery: opinion extraction and semantic classification of product reviews

TL;DR: This work develops a method for automatically distinguishing between positive and negative reviews, drawing on information retrieval techniques for feature extraction and scoring; the results for various metrics and heuristics vary depending on the testing situation.
Journal Article (DOI)

Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites

TL;DR: This paper presents a method and a tool for extracting domain ontologies from Web sites, and more generally from documents shared among the members of virtual organizations, based on a new word sense disambiguation algorithm called structural semantic interconnections.
Journal Article (DOI)

Stored Word Sequences in Language Learning: The Effect of Familiarity on Children's Repetition of Four-Word Combinations

TL;DR: This study tested the assumption that children store utterances as wholes by examining memory for familiar sequences of words: a newly available, dense corpus of child-directed speech was used to identify frequently occurring chunks in the input and match them to infrequent sequences.
Proceedings Article (DOI)

A Language Model Approach to Keyphrase Extraction

TL;DR: This paper proposes using pointwise KL divergence between multiple language models to score both phraseness and informativeness, which can be unified into a single score for ranking extracted phrases.
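A hedged sketch of the scoring idea named in this TL;DR (the function and variable names here are illustrative assumptions, and the cited paper should be consulted for the exact language models used):

    from math import log2

    def pointwise_kl(p, q):
        # One phrase's contribution to KL(P || Q): the extra code length
        # paid for modeling this phrase with Q instead of P.
        return p * log2(p / q)

    def keyphrase_score(p_fg_phrase, p_fg_indep, p_bg_phrase):
        # Phraseness: foreground n-gram probability vs. the product of its
        # unigram probabilities (an independence baseline).
        phraseness = pointwise_kl(p_fg_phrase, p_fg_indep)
        # Informativeness: foreground corpus vs. background corpus.
        informativeness = pointwise_kl(p_fg_phrase, p_bg_phrase)
        # Both pointwise KL terms unify into a single ranking score.
        return phraseness + informativeness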
References
Journal Article (DOI)

Word association norms, mutual information, and lexicography

TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer-readable corpora, making it possible to estimate norms for tens of thousands of words.
Book

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

TL;DR: This book introduces suffix trees and their use in sequence alignment, covers core string edits, alignments, and dynamic programming, and extends these core problems to more advanced applications.
Journal Article (DOI)

A statistical interpretation of term specificity and its application in retrieval

TL;DR: It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific terms are of greater value than matches on frequent terms.
Book

Statistical methods for speech recognition

TL;DR: Topics covered include the speech recognition problem, hidden Markov models, the acoustic model, basic language modelling, the Viterbi search, hypothesis search on a tree and the fast match, and elements of information theory.
Journal Article (DOI)

Estimation of probabilities from sparse data for the language model component of a speech recognizer

TL;DR: The model offers, via a nonlinear recursive procedure, a computation- and space-efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.