Proceedings Article

Unsupervised clustering with smoothing for detecting paratext boundaries in scanned documents

TL;DR: This study describes a method for paratext detection based on smoothed unsupervised clustering and shows that it is more accurate than simple heuristics, especially for non-fiction works and edited works with larger amounts of paratext.
Abstract: Digital humanities scholars are developing new techniques of literary study using non-consumptive processing of large collections of scanned text. A crucial step in working with such collections is to separate the main text of a work from the surrounding paratext, the content of which may distort word counts, location references, sentiment scores, and other important outputs. Simple heuristic methods have been devised, but are not accurate for some texts and some methodological needs. This study describes a method for paratext detection based on smoothed unsupervised clustering. We show that this method is more accurate than simple heuristics, especially for non-fiction works, and edited works with larger amounts of paratext. We also show that a more accurate detection of paratext boundaries improves the accuracy of subsequent text processing, as exemplified by a readability metric.
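The abstract does not specify the page features, the clustering algorithm, or the smoothing procedure. The following is only a minimal sketch of the general idea, under stated assumptions: each page is reduced to a single hypothetical feature (its word count), pages are split into two groups with a simple one-dimensional 2-means, labels are smoothed by a sliding majority vote, and the longest run of "main text" pages gives the boundaries. The function names `two_means_1d`, `smooth`, and `main_text_span` are illustrative and not taken from the paper.

```python
from statistics import mean


def two_means_1d(values, iterations=20):
    """Split scalars into two clusters; label 1 is the higher-centroid cluster."""
    lo, hi = min(values), max(values)  # initial centroids (assumption)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearer centroid (ties go to cluster 0).
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        low_group = [v for v, lab in zip(values, labels) if lab == 0]
        high_group = [v for v, lab in zip(values, labels) if lab == 1]
        if low_group:
            lo = mean(low_group)
        if high_group:
            hi = mean(high_group)
    return labels


def smooth(labels, window=3):
    """Replace each label by the majority vote in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        span = labels[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(span) > len(span) else 0)
    return smoothed


def main_text_span(labels):
    """Return (first, last) page indices of the longest run of 1s, or None."""
    best, start = None, None
    for i, lab in enumerate(labels + [0]):  # sentinel closes a trailing run
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best


# Hypothetical words-per-page counts: sparse front and back matter,
# dense main text with one sparse page (e.g. a plate) in the middle.
pages = [20, 35, 10, 300, 310, 15, 290, 305, 320, 300, 40, 25, 30]
raw = two_means_1d(pages)
print(main_text_span(smooth(raw)))
```

In this toy input the sparse page inside the main text is initially assigned to the paratext cluster; the smoothing step absorbs it into the surrounding run, which is the role smoothing plays in boundary detection.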
References
Book
01 Jan 1995

679 citations


"Unsupervised clustering with smooth..." refers methods in this paper

  • ...Figure 4 illustrates the results of the Dale-Chall measure [19] averaged over 10 text samples obtained with three methods of main content boundary identification....


Book Chapter
12 Sep 2006
TL;DR: The authors propose a variable-length n-gram approach to authorship identification, inspired by previous work on selecting variable-length word sequences, and evaluate it on a subset of the new Reuters corpus consisting of texts on the same topic by 50 different authors.
Abstract: Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes, since they are able to capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved using simple text pre-processing.
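As a rough illustration of what a variable-length character n-gram representation looks like, the sketch below counts every character n-gram whose length falls in a chosen range. It shows only the feature extraction step; the selection procedure proposed in the cited chapter is not reproduced here, and the function name `char_ngrams` and the default length range are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


profile = char_ngrams("the cat sat on the mat")
print(profile.most_common(3))
```

Because the window slides over raw characters, the counts include spaces, punctuation, and digits, which is why simple pre-processing choices can change the resulting feature set.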

210 citations

Proceedings Article
20 Jun 1999
TL;DR: A new approach that learns to recognize tables in free text, including the boundary, rows and columns of tables, outperforms a deterministic table recognition algorithm that identifies tables based on a fixed set of conditions.
Abstract: Many real-world texts contain tables. In order to process these texts correctly and extract the information contained within the tables, it is important to identify the presence and structure of tables. In this paper, we present a new approach that learns to recognize tables in free text, including the boundary, rows and columns of tables. When tested on Wall Street Journal news documents, our learning approach outperforms a deterministic table recognition algorithm that identifies tables based on a fixed set of conditions. Our learning approach is also more flexible and easily adaptable to texts in different domains with different table characteristics.

86 citations


"Unsupervised clustering with smooth..." refers background in this paper

  • ...structured text either on the Web [3][4][5] or in a digital library [6], [7], [8] for the purposes of extracting specific elements/structures from the text (e....


Proceedings Article
01 Apr 2014
TL;DR: This position paper concisely surveys the attractiveness of function words in stylometry, relates them to the use of character n-grams, and proposes replacing the term ‘function word’ with the term ‘functor’ in stylometry, on multiple theoretical grounds.
Abstract: This position paper focuses on the use of function words in computational authorship attribution. Although recently there have been multiple successful applications of authorship attribution, the field is not particularly good at the explication of methods and theoretical issues, which might eventually compromise the acceptance of new research results in the traditional humanities community. I wish to partially help remedy this lack of explication and theory, by contributing a theoretical discussion on the use of function words in stylometry. I will concisely survey the attractiveness of function words in stylometry and relate them to the use of character n-grams. At the end of this paper, I will propose to replace the term ‘function word’ by the term ‘functor’ in stylometry, due to multiple theoretical considerations.

80 citations


"Unsupervised clustering with smooth..." refers background in this paper

  • ...1 The bag of words model takes into account all of the words in the text and does not exclude stopwords and other common words and characters, as they can be important demarcators of style [12][13][14][15]....


Journal Article
TL;DR: The author presents a method for identifying the author of a text, based on a hierarchical list of the frequencies of the words common to the whole set of texts.
Abstract: The author speaks on the occasion of receiving the 2001 Roberto Busa Award, conferred for contributions to the field of humanities computing. The talk reviews the author's career and presents a new method for identifying the author of a text. The author's interest in this area of research began in the 1970s with the analysis of a text by Jane Austen. The new approach presented here is based on a hierarchical list of the frequencies of the words common to all of the texts. The author then describes the procedures and results of this method, outlines some possible future developments, and takes stock of the state of the art in this area of computer-assisted research.

70 citations