Statistical Models for Text Segmentation
TL;DR
Assessment of the approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts, using a new probabilistically motivated error metric.
Abstract
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
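The windowed error metric the abstract describes is now commonly known as P_k: slide a probe of width k across the text and count how often the reference and the hypothesis disagree on whether the two probe endpoints lie in the same segment. A minimal sketch of that idea follows; the function name, the segment-label encoding, and the default choice of k (half the mean reference segment length, a conventional setting) are illustrative, not taken verbatim from the paper:

```python
def pk(ref, hyp, k=None):
    """Windowed segmentation error (P_k-style).

    ref, hyp: sequences of segment labels, one label per position
    (positions with equal labels belong to the same segment).
    Returns the fraction of probe windows of width k on which the
    reference and the hypothesis disagree about whether the two
    window endpoints fall in the same segment.
    """
    n = len(ref)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        k = max(1, round(n / (2 * len(set(ref)))))
    disagreements = 0
    for i in range(n - k):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        disagreements += same_ref != same_hyp
    return disagreements / (n - k)
```

A perfect hypothesis scores 0; because the probe samples pairs of positions rather than exact boundary matches, a near-miss boundary is penalized less than a wholly absent one, which is how the metric blends precision- and recall-like behavior.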
Citations
Book
The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
Ronen Feldman, James Sanger, et al.
TL;DR: Providing an in-depth examination of core text mining and link detection algorithms and operations, this text examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches.
Journal ArticleDOI
Relevance-Based Language Models
Victor Lavrenko, W. Bruce Croft, et al.
TL;DR: This work proposes a novel technique for estimating a relevance model with no training data and demonstrates that it can produce highly accurate relevance models, addressing important notions of synonymy and polysemy.
Proceedings Article
Maximum Entropy Markov Models for Information Extraction and Segmentation
TL;DR: A new Markovian sequence model is presented that allows observations to be represented as arbitrary overlapping features (such as word, capitalization, formatting, part-of-speech), and defines the conditional probability of state sequences given observation sequences.
Journal ArticleDOI
Bursty and Hierarchical Structure in Streams
TL;DR: The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content.
Journal ArticleDOI
Inter-coder agreement for computational linguistics
TL;DR: It is argued that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks—but that their use makes the interpretation of the value of the coefficient even harder.
References
Journal ArticleDOI
Classification and Regression Trees.
Book
Bayesian Data Analysis
TL;DR: Detailed notes on Bayesian computation, the basics of Markov chain simulation, regression models, and asymptotic theorems are provided.
Book
Classification and regression trees
TL;DR: The methodology used to construct tree-structured rules is the focus of this monograph, which covers the use of trees as a data analysis method and, in a more mathematical framework, proves some of their fundamental properties.