
Understanding Genre in a Collection of a Million Volumes

01 Jan 2014
About: The article was published on 2014-01-01 and is open access. It has received 13 citations to date.
Citations
Journal Article
TL;DR: A corpus linguist based in a digital humanities center presents a view of digital humanities and English linguistics from within both disciplines, arguing that much of what linguists do is digital humanities, even if we do not always recognize it as such, and that collaboration between the fields will be fruitful.
Abstract: Where there is any perception of digital humanities (DH) within the field of English linguistics, it may be seen as a technical practice preoccupied with digitizing texts and producing digital editions. The one occurrence of digital humanities in JEngL's archival content is a passing reference to digital editions and the role of editors, in an interview rather than a research article (Grant 2014). Within DH, there is a more vigorous conversation about what DH fundamentally is, alongside creative methodological questions about what it can be. That is because DH, from within, is largely viewed as a methodological challenge, driven by meaningful, even urgent research questions originating not only in the humanities but also in the social sciences, questions that can be most effectively addressed via the development of new digital methods and tools. If DH is a methodological practice, it is in the sense of methods and epistemology: asking and debating how it is that we can know what we need to know, and testing the efficacy of selected digital methods in the service of specific research questions. As a corpus linguist based in a DH center, I present here a view of DH and English linguistics from within both disciplines. My discussion begins with a focus on corpus linguistics, but also includes English linguistics more generally, and linguistics as a whole. I argue that we as linguists should care about DH, not only because much of what we do is DH (even if we do not always recognize it as such), but also because collaborations between English linguistics and DH will be fruitful for all of us. Research questions in DH are wide-ranging; recent major DH projects that encompass humanities and social sciences include:

1 citation

01 Jan 2017
TL;DR: A new method for dimensionality reduction, "stable random projection" (hereafter "SRP"), is distinctly suited to large textual corpora like those used in the digital humanities: it is computationally efficient and easily parallelizable, scales to the largest digital libraries, and creates a standard dimensionality-reduction space for all texts so that corpora and models can be easily exchanged.
Abstract: This paper describes a new method for dimensionality reduction, "stable random projection" (hereafter "SRP"), distinctly suited for large textual corpora like those used in the digital humanities. The method is computationally efficient and easily parallelizable; scales to the largest digital libraries; and creates a standard dimensionality-reduction space for all texts so that corpora and models can be easily exchanged. The resulting space makes a wide variety of applications suited to bag-of-words data, such as nearest-neighbor searches, classification, and semantic querying, possible with data sets an order of magnitude smaller in size than traditional feature counts. SRP is a minimal, universal dimensionality reduction with two distinctive features: 1. It makes no distinction between in- and out-of-domain vocabularies. In particular, unlike standard dimensionality reduction, it creates a single space that can hold documents of any language. 2. It is trivially parallelizable, both on a local machine and through web-based architectures, because it relies only on code that can be easily transferred across servers, rather than requiring large matrices or model parameters. These two features allow dimensionality reduction to be conceived of as a piece of infrastructure for digital humanities work, rather than just an ad hoc convention used in a particular project. This method is particularly useful for provisioners and users of text data on extremely large and/or multilingual corpora, and it creates a number of new applications for dimensionality reduction, both in scale and in type. SRP features could usefully be distributed by libraries as a (much smaller and easier to work with) supplement to feature counts. After a description of the method, some novel uses for dimensionality reduction on such libraries are shown using a sharable dataset of approximately 4,500,000 books projected into SRP-space from the Hathi Trust.
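The abstract's two distinctive features both follow from one design choice: each word's projection row is derived deterministically from the word itself, so any machine can reproduce the space without exchanging a model or vocabulary. The Python sketch below illustrates that idea only; it is not the paper's implementation, and the function name srp_vector, the 640-dimension default, the SHA-256 hash, and the log1p weighting are all assumptions of this sketch.

```python
import hashlib
import numpy as np

def srp_vector(counts, dim=640):
    """Project a bag of words into a fixed, shared low-dimensional space."""
    out = np.zeros(dim)
    for word, count in counts.items():
        # Deterministic per-word signs: every machine derives the same
        # +/-1 row from a hash of the word, so documents in any language
        # (in- or out-of-domain vocabulary) land in the same space.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
        signs = np.where(np.resize(bits, dim) == 1, 1.0, -1.0)
        # Log-scaled counts damp very frequent words (a weighting
        # assumed here, not taken from the paper).
        out += np.log1p(count) * signs
    return out

# Usage: compare two documents by cosine similarity in SRP-space.
a = srp_vector({"whale": 12, "ship": 7, "sea": 30})
b = srp_vector({"whale": 9, "harpoon": 3, "sea": 22})
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```

Because the projection depends only on this small, transferable function, the work parallelizes trivially across servers, which is the infrastructural point the abstract makes.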
References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations
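As an illustration of the uniform interface behind the "ease of use" and "API consistency" the abstract emphasizes, here is a minimal supervised example; the dataset and classifier choices are ours for illustration, not drawn from the paper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Every scikit-learn estimator exposes the same fit/predict pair,
# which is what lets non-specialists swap algorithms freely.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```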

Book
13 Aug 2009
TL;DR: This book describes ggplot2, a data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to:

- produce handsome, publication-quality plots, with automatic legends created from the plot specification
- superpose multiple layers (points, lines, maps, tiles, box plots, to name a few) from different data sources, with automatically adjusted common scales
- add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models and robust regression
- save any ggplot2 plot (or part thereof) for later modification or reuse
- create custom themes that capture in-house or journal style requirements, and that can easily be applied to multiple plots
- approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot

This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e., you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.

29,504 citations
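ggplot2 itself is an R package; to keep this page's examples in one language, the layered-grammar idiom it describes is sketched below via plotnine, a Python port of ggplot2. The toy data frame and column names are invented for illustration.

```python
import pandas as pd
from plotnine import aes, geom_point, geom_smooth, ggplot, labs

# Toy data standing in for a real dataset.
df = pd.DataFrame({
    "engine": [1.6, 2.0, 2.4, 3.0, 3.5, 4.2],
    "mpg": [38, 33, 29, 25, 22, 18],
    "drive": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd"],
})

# Layers compose with "+": the point layer and the linear smoother
# share the scales declared once in aes(), and the legend is automatic.
plot = (
    ggplot(df, aes(x="engine", y="mpg", color="drive"))
    + geom_point()
    + geom_smooth(method="lm")
    + labs(x="Engine displacement (L)", y="Miles per gallon")
)
plot.save("mpg.png")  # any composed plot can be saved for later reuse
```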

Journal Article
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations