Open Access Journal Article

A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics.

Martin Gerlach, +1 more
20 Jan 2020, Vol. 22, Iss: 1, pp 126
TLDR
This paper presents the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete Project Gutenberg (PG) data containing more than 50,000 books and more than 3 × 10^9 word-tokens. It provides a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
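
The three levels of granularity named in the abstract (raw text, timeseries of word tokens, and counts of words) can be illustrated with a minimal sketch. It is not the SPGC pipeline: the regular-expression tokenizer, the function names `tokenize` and `three_levels`, and the sample sentence are all assumptions made for illustration, and the published code may tokenize and filter quite differently.

```python
import re
from collections import Counter


def tokenize(raw_text):
    """Hypothetical tokenizer: lowercase, then keep runs of letters only.

    This is an assumption for illustration, not the tokenizer used by SPGC.
    """
    return re.findall(r"[^\W\d_]+", raw_text.lower())


def three_levels(raw_text):
    """Return the text at three granularities: raw, token sequence, counts."""
    tokens = tokenize(raw_text)   # ordered sequence ("timeseries") of word tokens
    counts = Counter(tokens)      # word type -> number of occurrences
    return raw_text, tokens, counts


# Made-up sample text, not taken from the corpus.
raw = "The whale, the white whale!"
_, tokens, counts = three_levels(raw)
print(tokens)
print(counts["whale"])
```

Each level discards information the previous one keeps: the token sequence drops punctuation and case, and the counts additionally drop word order, which is why a study that depends on word order would need the token level rather than the counts.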


Citations
Journal Article

Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

TL;DR: The method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus, and example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest are presented.
Posted Content

Critical Thinking for Language Models.

TL;DR: The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills.
Monograph

Natural Language Processing for Corpus Linguistics

Jonathan Dunn
TL;DR: This Element shows how text classification and text similarity models can extend the ability to undertake corpus linguistics across very large corpora, and pairs each new methodology with a discussion of potential ethical implications.
Journal Article

The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies

Álvaro Corral, +1 more
17 Feb 2020
TL;DR: In this paper, the authors show that the corresponding bivariate joint probability distribution shows a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation for the brevity-frequency phenomenon.
Journal Article

A note on the reproducibility of chaos simulation

TL;DR: A case study of reproducibility is presented in the simulation of a chaotic jerk circuit, using the software LTspice, and the methodology developed is efficient in identifying the computer with better performance, which allows applying it to other cases in the literature.