Using the HTRC Data Capsule Model to Promote Reuse and
Evolution of Experimental Analysis of Digital Library Data:
A Case Study of Topic Modeling
David Bainbridge
University of Waikato
Hamilton, New Zealand
davidb@waikato.ac.nz
David M. Nichols
University of Waikato
Hamilton, New Zealand
david.nichols@waikato.ac.nz
Annika Hinze
University of Waikato
Hamilton, New Zealand
hinze@waikato.ac.nz
J. Stephen Downie
University of Illinois
Urbana-Champaign, USA
jdownie@illinois.edu
ABSTRACT
We report on a case-study to independently reproduce the work
given in a publicly available blog on how to develop a topic model
sourced from a collection of texts, where both the data set and
source code used are readily available. More specifically, we detail
the steps necessary—and the challenges that had to be overcome—to
replicate the work using the HathiTrust Research Center’s virtual
machine Data Capsule platform. From this we make recommenda-
tions for authors to follow, based on the lessons learned. We also
show that the Data Capsule model can be put to work in a way
that is of benet to those interested in supporting computational
reproducibility within their organizations.
CCS CONCEPTS
• General and reference → Experimentation; • Applied computing → Digital libraries and archives.
KEYWORDS
Experimental Reproducibility, Virtual Machine, Digital Libraries
ACM Reference Format:
David Bainbridge, David M. Nichols, Annika Hinze, and J. Stephen Downie.
2019. Using the HTRC Data Capsule Model to Promote Reuse and Evo-
lution of Experimental Analysis of Digital Library Data: A Case Study
of Topic Modeling. In Proceedings of ACM/IEEE-CS Joint Conference on
Digital Libraries (JCDL'19). ACM, New York, NY, USA, 463–464.
https://doi.org/10.1109/JCDL.2019.00124
1 INTRODUCTION
The scientific tenet of reproducibility has its own particular set
of challenges to be faced in the domain of computational science
and related areas, such as digital humanities [2]. Virtualization
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
JCDL’19, June 2019, Urbana-Champaign, Illinois, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1109/JCDL.2019.00124
techniques, such as Virtual Machines and Containerization, help by
reducing the likelihood of failure caused by installation issues or
eects from a user’s environment, but as Dumas et al. point out [
1
],
this really only establishes a baseline. The chances of reproducibility
success, they point out, are greatly enhanced through providing
a walk-through guide of the steps to run and what to expect as a
result at each stage.
In this paper we narrow attention to the more specific issue of
replication of data analysis experiments using corpora from digital
libraries and archives. Taking the HathiTrust Research Center's
(HTRC) Data Capsule [3] as a baseline virtual machine, we report
on our experiences seeking to follow one such example of
step-by-step instructions, to produce a topic model based on
publicly available texts. From this we make recommendations based
on the lessons learned, and identify particular benefits that result
from conducting the experimentation inside a Data Capsule.
2 REPLICATING TOPIC MODELING WITH A
DATA CAPSULE
In Creating a Topic Browser of HathiTrust Data,¹ Goodwin (an
English professor and practicing digital humanities scholar) provides
a "how-to" article, describing the steps he went through to develop
a web-based topic browser of a set of selected novels from the
HathiTrust DL. The article is written in a follow-along style: the
starting dataset used is linked to, as is the R programming language
open source Latent Dirichlet Allocation (LDA) package that the
approach draws upon, and it also provides the additional lines of
code written to produce the topic browser. The description of the
work is agnostic as to the operating system used, although the way
files are specified suggests Goodwin used a Unix-based operating
system. These characteristics made it a suitable candidate for our
replication study using an Ubuntu-based HTRC Data Capsule.
In describing below the problems we faced, we in no way want
this to be taken as a criticism of Goodwin’s excellent work. To
the contrary, we very much appreciated the comprehensive notes
provided. As pragmatists we knew it likely that we would encounter
issues: what we were interested in, however, was what forms they
would take, and the strategies we could develop to overcome them.
¹https://jgoodwin.net/blog/creating-hathitrust-topic-browser/

We categorize the five principal difficulties encountered as follows:
(1) Link-rot.
(2) Installation clarity.
(3) Incorrect files/directories specified.
(4) Version of programming language and packages used.
(5) Issues running commands more than once.
There was only one example of link-rot that we encountered,
and that was to the general web area where the dataset used came
from, resulting in a blank gray page. Fortunately, the link to the
precise dataset used in the article still worked, and so was not a
critical issue.
The principal programming language used was R, and this was
clearly stated, as was the use of Perl for some file manipulation
operations. What was not stated was that installing some of the
R packages triggers the compilation of Java and C code, so these
compilers too needed to be installed. Further, this compilation
sequence relied on certain libraries being present. As the Data
Capsule environment provides administration rights for the user,
these unexpected requirements could be addressed in a
straightforward manner.
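As an illustration only, the following is a minimal R sketch (not taken from
the paper) of the kind of setup involved; the exact package list is an
assumption, since the paper does not enumerate the dependencies encountered.
The system-level prerequisites come first, from the capsule's shell:

    ## prerequisites, installed from the capsule's shell (assumed package names):
    ##   sudo apt-get install default-jdk build-essential   (JDK + compilers)
    ##   sudo R CMD javareconf                               (lets R find the JVM)
    install.packages("devtools")   # used later to install dfrtopics from GitHub
    install.packages("rJava")      # compiles C glue code against the JDK
    install.packages("mallet")     # Java-based LDA toolkit that dfrtopics builds on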
The article helpfully included the exact lines of code to run.
However, on more than one occasion, the name of a file or directory
used in the code snippets was inconsistently specified: for example,
sample-00-22-tsv is used in one place for the output directory of
tab-separated value files, but 20-22-tsv is used in another. These
issues were not too hard to resolve.
The most time-consuming problem encountered concerned the
LDA topic modeling R package used, dfrtopics. Ultimately the
problem was traced to a versioning issue. In the instructions
given, this package was retrieved through a GitHub repository link
that resolved to the latest version of that code; however, updates
in the GitHub repository meant that the code checked out was
now incompatible with how Goodwin invoked it. The date of the
blog article was cross-checked with the versioning history of the
dfrtopics GitHub repository, and the R statement to install the
package changed to check out a contemporaneous version.
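A hypothetical sketch of such a pinned installation is given below; the
repository path and the ref value are illustrative assumptions, not details
taken from the paper:

    ## install a revision of dfrtopics contemporaneous with the blog post;
    ## both the repository path and the ref shown here are placeholders
    devtools::install_github("agoldst/dfrtopics",
                             ref = "COMMIT_SHA_FROM_BLOG_DATE")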
The nal main issue encountered—that of re-running commands—
was itself an artifact caused by the problems that were encountered
in trying to follow the full set of instructions. For example, we
encountered a situation where executing a step reported an error if
the directory it wanted to write to already existed. A single Linux
command was sucient to rectify this, but it served to highlight
the potential for such issues occurring in the other stages, but in
ways that were not explicitly reported.
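One way to make such a step safe to re-run, sketched here in R under the
assumption that the output directory is the one named in the blog's code
snippets (the paper does not say which step actually failed):

    out_dir <- "20-22-tsv"                     # directory name from the blog
    unlink(out_dir, recursive = TRUE)          # R equivalent of the shell's rm -r
    dir.create(out_dir, showWarnings = FALSE)  # quiet if run back-to-back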
With all the issues resolved, we deleted all the generated output
files and were able to take a clean pass through the instructions and
generate our own topic model. Given that Goodwin's article links
to a live version of the topic browser he built, we could see that
the topic browser we had produced was not the same. For the sake
of transparency, we do note that the blog article does not claim
that by following the instructions given you will achieve identical
results. That said, there is sufficient detail in the blog to understand
the main reasons for the differences.
In the blog article, the number of topic clusters to produce is set to
be 100. In Goodwin's live example he has 125 topics. Another source
of difference could be the number of iterations used to train the
model: the blog article specifies this to be 200. It was not possible to
determine how many iterations had been used in the live example.
A final factor that led to notable differences in the topics generated
was the stopword list used. The actual stopword list used by
Goodwin is described but not explicitly provided. The article
mentions a stopword list contained in dfrtopics, which we located
and used, but it is not clear if this is the same one as used by
Goodwin.
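To make concrete where these three sources of difference enter, the following
R sketch uses the mallet package that dfrtopics builds on; the placeholder
documents, the stand-in stopword file, and the direct use of mallet rather
than dfrtopics' own wrappers are all assumptions for illustration:

    library(mallet)
    doc_ids   <- c("novel-1", "novel-2")                   # placeholder ids
    doc_texts <- c("it was the best of times",
                   "call me ishmael")                      # placeholder texts
    stoplist  <- tempfile(fileext = ".txt")                # stand-in stopword list
    writeLines(c("the", "of", "it", "me"), stoplist)
    instances <- mallet.import(doc_ids, doc_texts, stoplist)
    model <- MalletLDA(num.topics = 100)  # blog sets 100; live browser shows 125
    model$loadDocuments(instances)
    model$train(200)                      # blog specifies 200 training iterations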
3 LESSONS LEARNED AND CONCLUSION
Based on the lessons learned, we recommend authors give:
• Careful consideration to the URLs used when publishing
links to datasets, and even consider using the Internet Archive's
Wayback Machine to provide better link stability.
• A how-to guide a "test-drive" by a second researcher (akin
to proof-reading). This would also help address installation
clarity.
• An explicit list of all the programming languages and pack-
ages used, along with their version numbers (see the sketch
after this list).
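One lightweight way to produce such a list in R is to capture the session
state after the analysis has run and publish the output alongside the guide:

    sessionInfo()                # base R: R version, OS, and package versions
    # devtools::session_info()   # more detailed alternative, if devtools is installed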
Following these recommendations, to augment the blog article by
Goodwin, we have created our own GitHub repository that explains
how to work through the article using an HTRC Data Capsule:
https://github.com/htrc/JGoodwin-Topic-Browser-in-a-Data-Capsule/
We conclude by highlighting two key benefits of the HTRC Data
Capsule model deployment over working directly with standalone
VMs:
• Readiness to run. Even as experienced VirtualBox users it
still took us 20 minutes to set up a VM to the point where
we could follow the blog article, compared to filling out a
form at the HTRC Analytics site and clicking on a button to
create the Data Capsule, which had us at that point in under
3 minutes.
• Network integrity. There are security issues that an institute
must work through if a researcher is going to operate VM
software directly on their own computer. In the infrastructure
developed by HTRC to run Data Capsules, these networking
concerns have already been resolved.
Without doubt, the use of walk-through guides in combination
with VMs greatly aids computational reproducibility. Our particular
use of the HTRC Data Capsule illustrates the general benefits of
the Data Capsule model that other institutions should consider
deploying if they are interested in supporting computational repro-
ducibility.
REFERENCES
[1] Guillaume Dumas, Yang-Min Kim, and Jean-Baptiste Poline. 2018.
Experimenting with reproducibility: a case study of robustness in
bioinformatics. GigaScience 7, 7 (2018), 1–8.
https://doi.org/10.1093/gigascience/giy077
[2] Roger D. Peng. 2011. Reproducible Research in Computational Science.
Science 334, 6060 (2011), 1226–1227.
https://doi.org/10.1126/science.1213847
[3] Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and
Beth Plale. 2014. Cloud Computing Data Capsules for Non-consumptive
Use of Texts. In Proceedings of the 5th ACM Workshop on Scientific Cloud
Computing (ScienceCloud '14). ACM, New York, NY, USA, 9–16.
https://doi.org/10.1145/2608029.2608031