
Understanding Genre in a Collection of a Million Volumes

01 Jan 2014
About: The article was published on 2014-01-01 and is open access. It has received 13 citations to date.
Citations
Journal Article
TL;DR: A corpus linguist based in a digital humanities center presents a view of digital humanities and English linguistics from within both disciplines, arguing that much of what linguists do is digital humanities, even if we do not always recognize it as such, and that collaboration between the fields will be fruitful.
Abstract: Where there is any perception of digital humanities (DH) within the field of English linguistics, it may be seen as a technical practice preoccupied with digitizing texts and producing digital editions. The one occurrence of digital humanities in JEngL's archival content is a passing reference to digital editions and the role of editors, in an interview rather than a research article (Grant 2014). Within DH, there is a more vigorous conversation about what DH fundamentally is, alongside creative methodological questions about what it can be. That is because DH, from within, is largely viewed as a methodological challenge, driven by meaningful, even urgent research questions originating not only in the humanities but also in the social sciences, questions that can be most effectively addressed via the development of new digital methods and tools. If DH is a methodological practice, it is in the sense of methods and epistemology: asking and debating how it is that we can know what we need to know, and testing the efficacy of selected digital methods in the service of specific research questions. As a corpus linguist based in a DH center, I present here a view of DH and English linguistics from within both disciplines. My discussion begins with a focus on corpus linguistics, but also includes English linguistics more generally, and linguistics as a whole. I argue that we as linguists should care about DH, not only because much of what we do is DH (even if we do not always recognize it as such), but also because collaborations between English linguistics and DH will be fruitful for all of us. Research questions in DH are wide-ranging; recent major DH projects that encompass humanities and social sciences include:

1 citation

01 Jan 2017
TL;DR: A new method for dimensionality reduction, "stable random projection" (hereafter "SRP"), is distinctly suited to large textual corpora like those used in the digital humanities: it is computationally efficient and easily parallelizable, scales to the largest digital libraries, and creates a standard dimensionality-reduction space for all texts so that corpora and models can be easily exchanged.
Abstract: This paper describes a new method for dimensionality reduction, "stable random projection" (hereafter "SRP"), distinctly suited for large textual corpora like those used in the digital humanities. The method is computationally efficient and easily parallelizable; scales to the largest digital libraries; and creates a standard dimensionality-reduction space for all texts so that corpora and models can be easily exchanged. The resulting space makes a wide variety of applications suited to bag-of-words data, such as nearest-neighbor searches, classification, and semantic querying, possible with data sets an order of magnitude smaller in size than traditional feature counts. SRP is a minimal, universal dimensionality reduction with two distinctive features: 1. It makes no distinction between in- and out-of-domain vocabularies. In particular, unlike standard dimensionality reduction, it creates a single space that can hold documents of any language. 2. It is trivially parallelizable, both on a local machine and through web-based architectures, because it relies only on code that can be easily transferred across servers, rather than requiring large matrices or model parameters. These two features allow dimensionality reduction to be conceived of as a piece of infrastructure for digital humanities work, rather than just an ad hoc convention used in a particular project. This method is particularly useful for provisioners and users of text data on extremely large and/or multilingual corpora, and it creates a number of new applications for dimensionality reduction, both in scale and in type. SRP features could usefully be distributed by libraries as a (much smaller and easier to work with) supplement to feature counts. After a description of the method, some novel uses for dimensionality reduction on such libraries are shown using a sharable dataset of approximately 4,500,000 books projected into SRP-space from the Hathi Trust.
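The abstract's two distinctive features both follow from one design choice: each word's projection row is derived deterministically from the word itself, so any machine can reproduce the space without exchanging a model or vocabulary. The Python sketch below illustrates that idea only; it is not the paper's implementation, and the function name srp_vector, the 640-dimension default, the SHA-256 hash, and the log1p weighting are all assumptions of this sketch.

```python
import hashlib
import numpy as np

def srp_vector(counts, dim=640):
    """Project a bag of words into a fixed, shared low-dimensional space."""
    out = np.zeros(dim)
    for word, count in counts.items():
        # Deterministic per-word signs: every machine derives the same
        # +/-1 row from a hash of the word, so documents in any language
        # (in- or out-of-domain vocabulary) land in the same space.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
        signs = np.where(np.resize(bits, dim) == 1, 1.0, -1.0)
        # Log-scaled counts damp very frequent words (a weighting
        # assumed here, not taken from the paper).
        out += np.log1p(count) * signs
    return out

# Usage: compare two documents by cosine similarity in SRP-space.
a = srp_vector({"whale": 12, "ship": 7, "sea": 30})
b = srp_vector({"whale": 9, "harpoon": 3, "sea": 22})
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```

Because the projection depends only on this small, transferable function, the work parallelizes trivially across servers, which is the infrastructural point the abstract makes.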
References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations
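As an illustration of the uniform interface behind the "ease of use" and "API consistency" the abstract emphasizes, here is a minimal supervised example; the dataset and classifier choices are ours for illustration, not drawn from the paper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Every scikit-learn estimator exposes the same fit/predict pair,
# which is what lets non-specialists swap algorithms freely.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```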

Book
13 Aug 2009
TL;DR: This book describes ggplot2, a data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to:

- produce handsome, publication-quality plots, with automatic legends created from the plot specification
- superpose multiple layers (points, lines, maps, tiles, box plots, to name a few) from different data sources, with automatically adjusted common scales
- add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models and robust regression
- save any ggplot2 plot (or part thereof) for later modification or reuse
- create custom themes that capture in-house or journal style requirements, and that can easily be applied to multiple plots
- approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot

This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e., you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.

29,504 citations
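ggplot2 itself is an R package; to keep this page's examples in one language, the layered-grammar idiom it describes is sketched below via plotnine, a Python port of ggplot2. The toy data frame and column names are invented for illustration.

```python
import pandas as pd
from plotnine import aes, geom_point, geom_smooth, ggplot, labs

# Toy data standing in for a real dataset.
df = pd.DataFrame({
    "engine": [1.6, 2.0, 2.4, 3.0, 3.5, 4.2],
    "mpg": [38, 33, 29, 25, 22, 18],
    "drive": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd"],
})

# Layers compose with "+": the point layer and the linear smoother
# share the scales declared once in aes(), and the legend is automatic.
plot = (
    ggplot(df, aes(x="engine", y="mpg", color="drive"))
    + geom_point()
    + geom_smooth(method="lm")
    + labs(x="Engine displacement (L)", y="Miles per gallon")
)
plot.save("mpg.png")  # any composed plot can be saved for later reuse
```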

Journal Article
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations