Unsupervised clustering with smoothing for detecting paratext boundaries in scanned documents

doi:10.1109/JCDL.2019.00018

Proceedings Article•DOI•

Ana Lucic¹, Robin Burke², John Shanahan¹•Institutions (2)

DePaul University¹, University of Colorado Boulder²

02 Jun 2019-pp 53-56

TL;DR: This study describes a method for paratext detection based on smoothed unsupervised clustering and shows that it is more accurate than simple heuristics, especially for non-fiction works and edited works with larger amounts of paratext.

Abstract: Digital humanities scholars are developing new techniques of literary study using non-consumptive processing of large collections of scanned text. A crucial step in working with such collections is to separate the main text of a work from the surrounding paratext, the content of which may distort word counts, location references, sentiment scores, and other important outputs. Simple heuristic methods have been devised, but are not accurate for some texts and some methodological needs. This study describes a method for paratext detection based on smoothed unsupervised clustering. We show that this method is more accurate than simple heuristics, especially for non-fiction works, and edited works with larger amounts of paratext. We also show that a more accurate detection of paratext boundaries improves the accuracy of subsequent text processing, as exemplified by a readability metric.
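
The abstract does not specify the features, the clustering algorithm, or the smoothing procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each page has already been assigned a binary cluster label (1 for main text, 0 for paratext) by some unsupervised clustering step, smooths those labels with a sliding-window majority vote, and takes the longest run of main-text pages as the main content span. The function names `smooth_labels` and `main_text_span` and the window size are hypothetical.

```python
from collections import Counter


def smooth_labels(labels, window=5):
    """Majority-vote smoothing of per-page cluster labels (assumed input)."""
    half = window // 2
    smoothed = []
    for i, label in enumerate(labels):
        # Neighbourhood is clipped at the start and end of the volume.
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        ranked = Counter(labels[lo:hi]).most_common()
        top_count = ranked[0][1]
        winners = {lab for lab, count in ranked if count == top_count}
        # On a tie, keep the page's original label.
        smoothed.append(label if label in winners else ranked[0][0])
    return smoothed


def main_text_span(labels, main_label=1):
    """Return (start, end) inclusive indices of the longest run of main_label."""
    best, start = None, None
    for i, label in enumerate(labels):
        if label == main_label:
            if start is None:
                start = i
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
        else:
            start = None
    return best


# Hypothetical per-page labels: noisy front matter, main text, noisy back matter.
raw = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
smoothed = smooth_labels(raw, window=5)
print(smoothed)                  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(main_text_span(smoothed))  # (4, 12)
```

Smoothing removes isolated mislabeled pages (for example, a main-text page that looks like paratext), so the boundaries are read from one contiguous block rather than from scattered labels.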

References

Book•

Readability revisited : the new Dale-Chall readability formula

[...]

Jeanne S. Chall, Edgar Dale

01 Jan 1995

679 citations

"Unsupervised clustering with smooth..." refers methods in this paper

...Figure 4 illustrates the results of the Dale-Chall measure [19] averaged over 10 text samples obtained with three methods of main content boundary identification....
[...]

Book Chapter•DOI•

N-Gram feature selection for authorship identification

[...]

John Houvardas¹, Efstathios Stamatatos¹•Institutions (1)

University of the Aegean¹

12 Sep 2006

TL;DR: The authors propose a variable-length character n-gram approach to authorship identification, inspired by previous work on selecting variable-length word sequences, and evaluate it on a subset of the new Reuters corpus consisting of texts on the same topic by 50 different authors.

Abstract: Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to represent text for stylistic purposes since they are able to capture nuances in lexical, syntactical, and structural level. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors showing that an increase in performance can be achieved using simple text pre-processing.

210 citations

Proceedings Article•DOI•

Learning to Recognize Tables in Free Text

[...]

Hwee Tou Ng¹, Chung Yong Lim¹, Jessica Li Teng Koo¹•Institutions (1)

DSO National Laboratories¹

20 Jun 1999

TL;DR: A new approach that learns to recognize tables in free text, including the boundary, rows and columns of tables, outperforms a deterministic table recognition algorithm that identifies tables based on a fixed set of conditions.

Abstract: Many real-world texts contain tables. In order to process these texts correctly and extract the information contained within the tables, it is important to identify the presence and structure of tables. In this paper, we present a new approach that learns to recognize tables in free text, including the boundary, rows and columns of tables. When tested on Wall Street Journal news documents, our learning approach outperforms a deterministic table recognition algorithm that identifies tables based on a fixed set of conditions. Our learning approach is also more flexible and easily adaptable to texts in different domains with different table characteristics.

86 citations

"Unsupervised clustering with smooth..." refers background in this paper

...structured text either on the Web [3][4][5] or in a digital library [6], [7], [8] for the purposes of extracting specific elements/structures from the text (e....
[...]

Proceedings Article•DOI•

Function Words in Authorship Attribution. From Black Magic to Theory

[...]

Mike Kestemont¹•Institutions (1)

University of Antwerp¹

01 Apr 2014

TL;DR: This paper concisely surveys the attractiveness of function words in stylometry, relates them to the use of character n-grams, and proposes replacing the term ‘function word’ with the term ‘functor’ in stylometry, due to multiple theoretical considerations.

Abstract: This position paper focuses on the use of function words in computational authorship attribution. Although recently there have been multiple successful applications of authorship attribution, the field is not particularly good at the explication of methods and theoretical issues, which might eventually compromise the acceptance of new research results in the traditional humanities community. I wish to partially help remedy this lack of explication and theory, by contributing a theoretical discussion on the use of function words in stylometry. I will concisely survey the attractiveness of function words in stylometry and relate them to the use of character n-grams. At the end of this paper, I will propose to replace the term ‘function word’ by the term ‘functor’ in stylometry, due to multiple theoretical considerations.

80 citations

"Unsupervised clustering with smooth..." refers background in this paper

...The bag of words model takes into account all of the words in the text and does not exclude stopwords and other common words and characters, as they can be important demarcators of style [12][13][14][15]....
[...]

Journal Article•DOI•

Questions of Authorship: Attribution and Beyond A Lecture Delivered on the Occasion of the Roberto Busa Award ACH-ALLC 2001, New York

[...]

John Burrows¹•Institutions (1)

University of Newcastle¹

01 Feb 2003-Computers and The Humanities

TL;DR: The author presents a method for identifying the author of a text, based on a hierarchical list of the frequencies of the words common to the whole set of texts.

Abstract: The author speaks on the occasion of receiving the 2001 Roberto Busa Award, conferred for his contribution to the field of computing and the humanities. He reviews his career and presents a new method for identifying the author of a text. His interest in this research area began in the 1970s with the analysis of a text by Jane Austen. He now presents a new approach based on a hierarchical list of the frequencies of the words common to the whole set of texts. He then describes the procedures and results of this method, outlines some possible future developments, and takes stock of the state of the art in this area of computer-assisted research.

70 citations