
Understanding Genre in a Collection of a Million Volumes

01 Jan 2014
About: The article was published on 2014-01-01 and is currently open access. It has received 13 citations to date.
Citations
01 Sep 2019
TL;DR: A simple but effective approach to WSD using nearest neighbor classification on CWEs is introduced, and it is shown that the pre-trained BERT model is able to place polysemic words into distinct 'sense' regions of the embedding space, while ELMo and Flair NLP do not seem to possess this ability.
Abstract: Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number of tasks, such as text classification, sequence tagging, or machine translation. Since vectors of the same word type can vary depending on the respective context, they implicitly provide a model for word sense disambiguation (WSD). We introduce a simple but effective approach to WSD using a nearest neighbor classification on CWEs. We compare the performance of different CWE models for the task and can report improvements above the current state of the art for two standard WSD benchmark datasets. We further show that the pre-trained BERT model is able to place polysemic words into distinct 'sense' regions of the embedding space, while ELMo and Flair NLP do not seem to possess this ability.
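
The following is a minimal sketch of the kind of nearest-neighbor WSD classifier this abstract describes, assuming the HuggingFace transformers and scikit-learn libraries; the sense-annotated sentences, the example word "bank", and the single-subword lookup are illustrative assumptions, not the paper's benchmark setup.

```python
# Minimal sketch: nearest-neighbor WSD over BERT contextualized embeddings.
# The training sentences and sense labels are illustrative, not benchmark data.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.neighbors import KNeighborsClassifier

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_target(sentence: str, target: str) -> torch.Tensor:
    """Return the contextualized vector of the target word in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    target_id = tokenizer.encode(target, add_special_tokens=False)[0]
    position = (enc["input_ids"][0] == target_id).nonzero()[0].item()
    return hidden[position]

# Sense-annotated training examples for the polysemous word "bank".
train = [
    ("He sat on the bank of the river.", "bank", "river_bank"),
    ("The bank approved the loan.", "bank", "financial_bank"),
]
X = torch.stack([embed_target(s, w) for s, w, _ in train]).numpy()
y = [label for _, _, label in train]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

test_vec = embed_target("She deposited cash at the bank.", "bank").numpy()
print(knn.predict([test_vec]))  # likely ['financial_bank'], given BERT's sense separation
```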

55 citations

Posted Content
TL;DR: Building upon BERT, a deep neural language model, the authors demonstrate how to combine text representations with metadata and knowledge graph embeddings that encode author information.
Abstract: In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach, we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.
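
As a hedged illustration of the fusion idea described above (not the authors' released implementation), the sketch below concatenates a BERT [CLS] representation with metadata features and a pretrained author embedding before a small classification head; the feature dimensions, label count, and concatenation strategy are assumptions.

```python
# Illustrative fusion of BERT text features with metadata and author
# (knowledge graph) embeddings for book classification. Dimensions and the
# concatenation design are assumptions for the sketch.
import torch
import torch.nn as nn
from transformers import BertModel

class BlurbClassifier(nn.Module):
    def __init__(self, num_labels=8, meta_dim=10, kg_dim=200):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size                # 768
        self.head = nn.Sequential(
            nn.Linear(hidden + meta_dim + kg_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_labels),
        )

    def forward(self, input_ids, attention_mask, meta_feats, author_emb):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]                # [CLS] representation
        fused = torch.cat([cls_vec, meta_feats, author_emb], dim=-1)
        return self.head(fused)                              # raw logits per label
```

A cross-entropy loss over the eight coarse labels (or 343 fine-grained labels) would then be applied to the logits.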

47 citations

Journal ArticleDOI
23 May 2016
TL;DR: The concept of genre is as old as literary theory itself, but centuries of debate haven’t produced much consensus on the topic.
Abstract: The concept of genre is as old as literary theory itself, but centuries of debate haven’t produced much consensus on the topic. Part of the reason is that genre looks like a different thing at different points in the life of a text.

26 citations

Proceedings Article
01 Jan 2018
TL;DR: An intelligent image analysis approach is described that automatically detects poems in digitally archived historic newspapers by integrating computer vision and machine learning to train an artificial neural network to determine whether an image has poetic text.
Abstract: We describe an intelligent image analysis approach to automatically detect poems in digitally archived historic newspapers. Our application, Image Analysis for Archival Discovery, or Aida, integrates computer vision to capture visual cues based on the visual structure of poetic works (rather than their meaning or content) and machine learning to train an artificial neural network to determine whether an image has poetic text. We have tested our application on almost 17,000 image snippets and obtained promising accuracy, precision, and recall. The application is currently being deployed at two institutions for digital library and literary research.
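
For illustration only, a small convolutional binary classifier of the sort one might train for this snippet-level decision is sketched below in PyTorch; the layer sizes and the 128x128 grayscale input are assumptions and do not reproduce Aida's actual network.

```python
# Sketch of a small CNN that scores a newspaper image snippet as poem vs. not-poem.
# Architecture and input size are assumptions, not Aida's published design.
import torch
import torch.nn as nn

class PoemSnippetClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # grayscale input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 64),                  # assumes 128x128 snippets
            nn.ReLU(),
            nn.Linear(64, 2),                             # poem vs. not-poem logits
        )

    def forward(self, x):                                 # x: (batch, 1, 128, 128)
        return self.classifier(self.features(x))

# Example: score a batch of four random 128x128 snippets.
logits = PoemSnippetClassifier()(torch.randn(4, 1, 128, 128))
print(logits.shape)  # torch.Size([4, 2])
```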

5 citations

DOI
01 Jan 2019
TL;DR: It is necessary to consider the role that language plays in the formation and development of an individual's identity.
Abstract: Poetry: Identification, Entity Recognition, and Retrieval

4 citations

References
Proceedings ArticleDOI
21 Aug 2011
TL;DR: A large scale data mining effort is presented that detects and blocks low quality or harmful (adversarial) advertisements for the benefit and safety of users, using a tiered strategy that combines automated and semi-automated methods to ensure reliable classification.
Abstract: In a large online advertising system, adversaries may attempt to profit from the creation of low quality or harmful advertisements. In this paper, we present a large scale data mining effort that detects and blocks such adversarial advertisements for the benefit and safety of our users. Because both false positives and false negatives have high cost, our deployed system uses a tiered strategy combining automated and semi-automated methods to ensure reliable classification. We also employ strategies to address the challenges of learning from highly skewed data at scale, allocating the effort of human experts, leveraging domain expert knowledge, and independently assessing the effectiveness of our system.
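
As a rough sketch of how such a tiered policy can look in practice, the snippet below handles confident model scores automatically and routes the uncertain middle band to human reviewers; the thresholds, synthetic data, and class-weighted logistic regression are illustrative assumptions, not details of the deployed system.

```python
# Illustrative tiered routing over a probabilistic ad classifier; thresholds
# and the classifier choice are assumptions for the sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def route_ad(model, features, block_thr=0.95, allow_thr=0.05):
    """Return 'block', 'allow', or 'human_review' for one ad's feature vector."""
    p_bad = model.predict_proba([features])[0][1]   # P(ad is adversarial)
    if p_bad >= block_thr:
        return "block"          # automated tier: high-confidence positives
    if p_bad <= allow_thr:
        return "allow"          # automated tier: high-confidence negatives
    return "human_review"       # semi-automated tier: uncertain middle band

# Synthetic, highly skewed training data (10 bad ads out of 100); class
# weighting is one common way to cope with the skew.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = np.array([1] * 10 + [0] * 90)
model = LogisticRegression(class_weight="balanced").fit(X, y)
print(route_ad(model, X[0]))
```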

99 citations

Proceedings ArticleDOI
01 Oct 2013
TL;DR: This work describes a multilayered solution that trains hidden Markov models to segment volumes, and uses ensembles of overlapping classifiers to address historical change on a collection of 469,200 volumes drawn from HathiTrust Digital Library.
Abstract: To mine large digital libraries in humanistically meaningful ways, we need to divide them by genre. This is a task that classification algorithms are well suited to assist with, but they need adjustment to address the specific challenges of this domain. Digital libraries pose two problems of scale not usually found in the article datasets used to test these algorithms. 1) Because libraries span several centuries, the genres being identified may change gradually across the time axis. 2) Because volumes are much longer than articles, they tend to be internally heterogeneous, and the classification task also requires segmentation. We describe a multilayered solution that trains hidden Markov models to segment volumes, and uses ensembles of overlapping classifiers to address historical change. We demonstrate this on a collection of 469,200 volumes drawn from HathiTrust Digital Library.
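
To make the segmentation step concrete, here is a minimal Viterbi-style sketch of smoothing noisy per-page genre probabilities into contiguous page ranges; the genre list, sticky transition matrix, and per-page scores are illustrative assumptions, and the ensemble component for historical change is not shown.

```python
# Minimal sketch of HMM-style smoothing of per-page genre predictions, so that
# a volume is segmented into contiguous runs of pages with one genre each.
# Genres, transition matrix, and page probabilities are illustrative assumptions.
import numpy as np

GENRES = ["fiction", "poetry", "paratext"]

def viterbi_segment(page_probs, stay=0.9):
    """Decode the most likely genre sequence for a volume's pages.

    page_probs: (n_pages, n_genres) per-page genre probabilities from any
    page-level classifier; `stay` is the self-transition probability that
    discourages switching genre between adjacent pages.
    """
    n_pages, n_genres = page_probs.shape
    trans = np.full((n_genres, n_genres), (1 - stay) / (n_genres - 1))
    np.fill_diagonal(trans, stay)

    log_emit = np.log(page_probs + 1e-12)
    log_trans = np.log(trans)
    score = np.zeros((n_pages, n_genres))
    back = np.zeros((n_pages, n_genres), dtype=int)

    score[0] = log_emit[0]
    for t in range(1, n_pages):
        step = score[t - 1][:, None] + log_trans    # (from_genre, to_genre)
        back[t] = step.argmax(axis=0)
        score[t] = step.max(axis=0) + log_emit[t]

    states = [int(score[-1].argmax())]
    for t in range(n_pages - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return [GENRES[s] for s in reversed(states)]

# Eight fiction-leaning pages followed by three paratext-leaning pages
# (e.g. publisher's ads at the back of the volume).
pages = np.array([[0.8, 0.1, 0.1]] * 8 + [[0.1, 0.1, 0.8]] * 3)
print(viterbi_segment(pages))  # ['fiction'] * 8 + ['paratext'] * 3
```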

33 citations

10 Jul 2014
TL;DR: This poster demonstrates the first phase of developing a map that identifies (at a minimum) the specific pages the authors expect to be fiction, poetry, nonfiction prose, or paratext.
Abstract: Genre metadata for digital volumes is spotty; even with broad categories like “poetry” and “drama,” we’re able to deduce genre from volume-level metadata only about a third of the time. Moreover, volumes are divided internally. A volume of poetry may include plays, begin with a life of the author, and end with twenty pages of publisher’s ads, followed by a “date due” slip. If we want to distant-read public digital collections, we need to develop a map that identifies (at a minimum) the specific pages we expect to be fiction, or poetry, or nonfiction prose, or paratext. In fact, we’ll need to go further than that; we’ll want to map narrower categories like “the epistolary novel,” and divide genres below the page level. In this poster we’re only demonstrating the first phase of this process.

1 citation