Open Access, Journal Article

Statistical Models for Text Segmentation

Doug Beeferman, +2 more
01 Feb 1999, Vol. 34, Iss. 1, pp. 177-210
Abstract
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
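The abstract does not spell out the error metric, so the following is only a minimal sketch under a stated assumption: that the metric can be approximated by the probability that two word positions a fixed distance `k` apart are classified inconsistently (same segment in the reference but different segments in the hypothesis, or vice versa). The function names, the boundary encoding, and the choice of `k` are all hypothetical and are not taken from the paper.

```python
def segment_ids(boundaries, n):
    """Map each of n word positions to a segment index.

    Assumed encoding: a boundary at position b means a new segment
    starts at word b. Position 0 never opens a second segment.
    """
    starts = set(boundaries)
    ids = []
    seg = 0
    for i in range(n):
        if i in starts and i > 0:
            seg += 1
        ids.append(seg)
    return ids


def window_error(reference, hypothesis, n, k):
    """Fraction of position pairs (i, i + k) on which the reference and
    hypothesis disagree about 'same segment' versus 'different segment'.

    This fixed-distance form is an assumption made for illustration.
    """
    if k < 1 or k >= n:
        raise ValueError("k must satisfy 1 <= k < n")
    ref = segment_ids(reference, n)
    hyp = segment_ids(hypothesis, n)
    pairs = n - k
    errors = 0
    for i in range(pairs):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / pairs


if __name__ == "__main__":
    # Ten words, one true boundary before word 5.
    print(window_error([5], [5], n=10, k=2))  # identical segmentations
    print(window_error([5], [], n=10, k=2))   # boundary missed entirely
    print(window_error([5], [3], n=10, k=2))  # boundary placed two words early
```

A score of this kind penalizes both missed and spurious boundaries, and a near miss costs less than a boundary placed far away, which is one way to read the abstract's remark that the metric combines precision and recall.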


Citations
Book

The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data

TL;DR: Providing an in-depth examination of core text mining and link detection algorithms and operations, this text examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches.
Journal Article

Relevance-Based Language Models

TL;DR: This work proposes a novel technique for estimating a relevance model with no training data and demonstrates that it can produce highly accurate relevance models, addressing important notions of synonymy and polysemy.
Proceedings Article

Maximum Entropy Markov Models for Information Extraction and Segmentation

TL;DR: A new Markovian sequence model is presented that allows observations to be represented as arbitrary overlapping features (such as word, capitalization, formatting, part-of-speech), and defines the conditional probability of state sequences given observation sequences.
Journal Article

Bursty and Hierarchical Structure in Streams

TL;DR: The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content.
Journal Article

Inter-coder agreement for computational linguistics

TL;DR: It is argued that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks—but that their use makes the interpretation of the value of the coefficient even harder.