A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics.
TL;DR
The Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens, is presented, providing a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
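The three granularity levels mentioned in the abstract (raw text, time series of word tokens, counts of words) form a simple pipeline. A minimal sketch of that relationship, assuming a deliberately simplified tokenizer — the actual SPGC pipeline applies its own tokenization and filtering rules:

```python
import re
from collections import Counter

def raw_to_tokens(raw_text):
    # Simplified lowercase word tokenizer; a stand-in for the
    # real SPGC pre-processing, which is more careful.
    return re.findall(r"[a-z]+", raw_text.lower())

def tokens_to_counts(tokens):
    # Collapse the ordered token time series into unordered word counts.
    return Counter(tokens)

tokens = raw_to_tokens("The whale; the sea. The whale!")
counts = tokens_to_counts(tokens)
```

Each level discards information (token order, then position), which is why publishing all three lets downstream studies pick the least processed representation they need.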
Citations
Journal Article · DOI
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Thayer Alshaabi, Jane Lydia Adams, Michael Vincent Arnold, Joshua R. Minot, David Rushing Dewhurst, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds +8 more
TL;DR: The method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus, and example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest are presented.
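The core operation behind such n-gram timelines — per-period relative n-gram frequencies over a temporally ordered corpus — can be sketched as follows (a minimal illustration; the function name and interface are ours, not Storywrangler's API):

```python
from collections import Counter

def ngram_timeseries(docs_by_period, n=1):
    # docs_by_period maps a period label (e.g. a day) to a list of
    # tokenized documents; returns per-period relative n-gram frequencies.
    series = {}
    for period, docs in docs_by_period.items():
        counts = Counter()
        for tokens in docs:
            counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        total = sum(counts.values())
        series[period] = {g: c / total for g, c in counts.items()} if total else {}
    return series
```

Normalizing within each period makes frequencies comparable across periods of very different corpus size, which is essential for any temporally evolving corpus.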
Posted Content
Critical Thinking for Language Models.
TL;DR: The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills.
Monograph · DOI
Natural Language Processing for Corpus Linguistics
TL;DR: This Element shows how text classification and text similarity models can extend the ability to undertake corpus linguistics across very large corpora, and pairs each new methodology with a discussion of potential ethical implications.
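In its simplest form, a text-similarity model compares bag-of-words vectors. A minimal sketch using cosine similarity over raw counts — the Element's methods use far richer representations, so this is only a baseline illustration:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a_tokens, b_tokens):
    # Cosine of the angle between two bag-of-words count vectors;
    # 1.0 for identical bags, 0.0 for documents sharing no terms.
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```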
Journal Article · DOI
The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies
Álvaro Corral, Isabel Serra +1 more
TL;DR: In this paper, the authors show that the corresponding bivariate joint probability distribution exhibits a rich and precise phenomenology, with the type-length and type-frequency distributions as its two marginals, and with the conditional distribution of frequency at fixed length providing a clear formulation of the brevity-frequency phenomenon.
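The brevity-frequency phenomenon itself is easy to check on word counts: frequent words tend to be short, so the token-weighted mean word length falls below the type-averaged one. A minimal sketch (the function name and the toy counts are ours):

```python
def mean_lengths(counts):
    # counts: dict mapping word type -> frequency.
    # Returns (token-weighted mean length, type-averaged mean length);
    # the brevity law predicts the first is smaller.
    type_mean = sum(len(w) for w in counts) / len(counts)
    total = sum(counts.values())
    token_mean = sum(len(w) * c for w, c in counts.items()) / total
    return token_mean, type_mean

token_mean, type_mean = mean_lengths({"the": 100, "whale": 3, "misanthropy": 1})
```

Here the frequent short type "the" pulls the token-weighted mean well below the type average.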
Journal Article · DOI
A note on the reproducibility of chaos simulation
TL;DR: A case study of reproducibility is presented for the simulation of a chaotic jerk circuit using the software LTspice; the methodology developed efficiently identifies which computer performs better and can be applied to other cases in the literature.
References
Book
Introduction to Information Retrieval
TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.
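The central data structure of such systems, the inverted index, can be sketched in a few lines (a toy illustration of the idea, not the book's implementation):

```python
from collections import defaultdict

def build_index(docs):
    # docs: dict doc_id -> list of tokens.
    # Returns an inverted index: term -> sorted posting list of doc ids.
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    # Conjunctive (AND) query: intersect the posting lists.
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Keeping posting lists sorted is what makes large-scale intersection (and hence Boolean retrieval) efficient in real systems.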
Journal Article · DOI
Estimating the reproducibility of psychological science
Alexander A. Aarts, Joanna E. Anderson, Christopher J. Anderson, Peter Raymond Attridge, Angela S. Attwood, Jordan Axt, Molly Babel, Štěpán Bahník, … Kellylynn Zuni +290 more
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Posted Content
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Leland McInnes, John Healy +1 more
TL;DR: The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.
Why Most Published Research Findings Are False
TL;DR: In this paper, the authors discuss the implications of these problems for the conduct and interpretation of research and suggest that claimed research findings may often be simply accurate measures of the prevailing bias.
Journal Article · DOI
Divergence measures based on the Shannon entropy
TL;DR: A novel class of information-theoretic divergence measures based on the Shannon entropy is introduced; these measures do not require the probability distributions involved to satisfy the condition of absolute continuity, and bounds on them are established.
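The best-known member of this class is the Jensen–Shannon divergence, which stays finite even when the two distributions assign zero probability to different outcomes — exactly the situation where the Kullback–Leibler divergence requires absolute continuity. A minimal sketch for discrete distributions over a shared support:

```python
from math import log2

def entropy(p):
    # Shannon entropy in bits; 0 * log 0 is taken as 0.
    return -sum(x * log2(x) for x in p if x > 0)

def jensen_shannon(p, q):
    # JS divergence: H((p+q)/2) - (H(p) + H(q)) / 2.
    # Bounded between 0 and 1 bit; no absolute-continuity requirement.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))
```

Two disjoint distributions reach the maximum of 1 bit, while identical distributions give 0, which makes the measure convenient for comparing word-frequency distributions across corpora.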