A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics.

doi:10.3390/E22010126

Open Access Journal Article


Martin Gerlach, +1 more

- 20 Jan 2020 -

Entropy

- Vol. 22, Iss: 1, pp 126

TL;DR

The Standardized Project Gutenberg Corpus (SPGC) is an open science approach to a curated version of the complete Project Gutenberg (PG) data, containing more than 50,000 books and more than 3 × 10^9 word-tokens. It provides a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.

Abstract:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, and the obtained corpus itself at three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
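The three levels of granularity named in the abstract can be illustrated with a minimal Python sketch. It is not the SPGC pipeline: the regular-expression tokenizer, the function names `tokenize` and `word_counts`, and the sample sentence are all assumptions chosen for illustration, and the published code may tokenize and filter text quite differently.

```python
import re
from collections import Counter


def tokenize(raw_text):
    """Turn raw text into an ordered list of lowercase word tokens.

    Assumption: a token is any run of Unicode letters. This is a
    stand-in tokenizer, not the one used to build the SPGC.
    """
    return re.findall(r"[^\W\d_]+", raw_text.lower())


def word_counts(tokens):
    """Collapse the token sequence into word counts (order is lost)."""
    return Counter(tokens)


# Level 1: raw text (a made-up sample, not taken from the corpus).
raw_text = "The cat sat on the mat. The mat was flat."

# Level 2: timeseries of word tokens, in reading order.
tokens = tokenize(raw_text)

# Level 3: counts of words.
counts = word_counts(tokens)

print(tokens)
print(counts.most_common(2))
```

Each level discards information relative to the previous one: the token sequence drops punctuation and casing, and the counts drop word order, which is why a corpus that ships all three lets different analyses start from the representation they need.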

Citations


Journal Article

Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

Thayer Alshaabi, +8 more

- 25 Jul 2020 -

arXiv: Social and Information Networks

TL;DR: The method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus, and example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest are presented.


Posted Content

Critical Thinking for Language Models.

Gregor Betz

- 15 Sep 2020 -

arXiv: Computation and Language

TL;DR: The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills.


Monograph

Natural Language Processing for Corpus Linguistics

Jonathan Dunn

TL;DR: This Element shows how text classification and text similarity models can extend the ability to undertake corpus linguistics across very large corpora, and pairs each new methodology with a discussion of potential ethical implications.


Journal Article

The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies

Álvaro Corral, +1 more

- 17 Feb 2020 -

Entropy

TL;DR: In this paper, the authors show that the bivariate joint probability distribution of word length and frequency exhibits a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation of the brevity-frequency phenomenon.


Journal Article

A note on the reproducibility of chaos simulation

Thalita E. Nazare, +3 more

- 29 Aug 2020 -

Entropy

TL;DR: A case study of reproducibility is presented in the simulation of a chaotic jerk circuit, using the software LTspice, and the methodology developed is efficient in identifying the computer with better performance, which allows applying it to other cases in the literature.



References


Book

Introduction to Information Retrieval

Christopher D. Manning, +2 more

TL;DR: This book presents an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.


Journal Article

Estimating the reproducibility of psychological science

Alexander A. Aarts, +290 more

- 28 Aug 2015 -

Science

TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.


Posted Content

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, +1 more

- 09 Feb 2018 -

arXiv: Machine Learning

TL;DR: The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.


Why Most Published Research Findings Are False

John P. A. Ioannidis

TL;DR: In this paper, the authors discuss the implications of these problems for the conduct and interpretation of research and suggest that claimed research findings may often be simply accurate measures of the prevailing bias.


Journal Article

Divergence measures based on the Shannon entropy

J. Lin

- 01 Jan 1991 -

IEEE Transactions on Information Theory

TL;DR: A novel class of information-theoretic divergence measures based on the Shannon entropy is introduced, which do not require the condition of absolute continuity to be satisfied by the probability distributions involved and are established in terms of bounds.
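The best-known member of this class is commonly called the Jensen-Shannon divergence. As a rough illustration of why absolute continuity is not needed, the sketch below compares two word-count dictionaries that share no words at all. The function name `jensen_shannon_divergence` and the toy counts are assumptions for illustration, and the equal weighting of the two distributions is one common choice rather than the only one.

```python
import math


def jensen_shannon_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (in bits) between two word-count dicts.

    Both distributions are compared against their equal-weight mixture m,
    so a word with zero probability in one of them never produces a
    division by zero or a logarithm of zero.
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    divergence = 0.0
    for word in set(counts_p) | set(counts_q):
        p = counts_p.get(word, 0) / total_p
        q = counts_q.get(word, 0) / total_q
        m = 0.5 * (p + q)  # mixture probability, positive for every word seen
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy counts (hypothetical, not taken from the corpus).
text_a = {"whale": 3, "sea": 1}
text_b = {"garden": 2, "tea": 2}

same = jensen_shannon_divergence(text_a, text_a)
disjoint = jensen_shannon_divergence(text_a, text_b)
print(same, disjoint)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared words), and swapping the two arguments does not change it.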
