Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922

29 Dec 2014

About: The article was published on 2014-12-29 and is currently open access. It has received 4 citations to date. The article focuses on the topics Metadata and Digital library.


Citations
Journal Article
TL;DR: The authors studied the stylistic differences associated with literary prominence across a century and found that there is a steady tendency for new volumes of poetry to change by slightly exaggerating certain features that defined prestige in the recent past.
Abstract: A history of literary prestige needs to study both works that achieved distinction and the mass of volumes from which they were distinguished. To understand how those patterns of preference changed across a century, we gathered two samples of English-language poetry from the period 1820–1919: one drawn from volumes reviewed in prominent periodicals and one selected at random from a large digital library (in which the majority of authors are relatively obscure). The stylistic differences associated with literary prominence turn out to be quite stable: a statistical model trained to distinguish reviewed from random volumes in any quarter of this century can make predictions almost as accurate about the rest of the period. The “poetic revolutions” described by many histories are not visible in this model; instead, there is a steady tendency for new volumes of poetry to change by slightly exaggerating certain features that defined prestige in the recent past.

24 citations

01 Jan 2014

13 citations

Journal Article
TL;DR: The potential of supervised predictive models in topic modeling is sketched by describing how Jordan Sellers and I have begun to model poetic distinction in the long 19th century—revealing an arc of gradual change much longer than received literary histories would lead us to expect.
Abstract: Debates over “Big Data” shed more heat than light in the humanities, because the term ascribes new importance to statistical methods without explaining how those methods have changed. What we badly...

7 citations


Cites background or methods from "Page-Level Genre Metadata for Engli..."

  • ...If machine learning seems inscrutable, it’s more likely because recent discussions in the humanities have focused on unsupervised algorithms, like most of those used for topic modeling (DiMaggio et al., 2013; Goldstone and Underwood, 2014; Liu, 2013)....

    [...]

  • ...One we sampled from volumes reviewed in 14 British and American magazines that were widely read by literary elites; the other we assembled by randomly sampling 53,200 volumes of poetry from HathiTrust Digital Library (using methods described in Underwood, 2014)....

    [...]

Proceedings Article
02 Jun 2019
TL;DR: This study describes a method for paratext detection based on smoothed unsupervised clustering and shows that this method is more accurate than simple heuristics, especially for non-fiction works and edited works with larger amounts of paratext.
Abstract: Digital humanities scholars are developing new techniques of literary study using non-consumptive processing of large collections of scanned text. A crucial step in working with such collections is to separate the main text of a work from the surrounding paratext, the content of which may distort word counts, location references, sentiment scores, and other important outputs. Simple heuristic methods have been devised, but are not accurate for some texts and some methodological needs. This study describes a method for paratext detection based on smoothed unsupervised clustering. We show that this method is more accurate than simple heuristics, especially for non-fiction works, and edited works with larger amounts of paratext. We also show that a more accurate detection of paratext boundaries improves the accuracy of subsequent text processing, as exemplified by a readability metric.

Cites background from "Page-Level Genre Metadata for Engli..."

  • ...structured text either on the Web [3][4][5] or in a digital library [6], [7], [8] for the purposes of extracting specific elements/structures from the text (e....

    [...]