Andrea L. Berez-Kroeker*, Lauren Gawne, Susan Smythe Kung,
Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer,
David I. Beaver, Shobhana Chelliah, Stanley Dubinsky, Richard
P. Meier, Nick Thieberger, Keren Rice and Anthony C. Woodbury
Reproducible research in linguistics:
A position statement on data citation
and attribution in our field
https://doi.org/10.1515/ling-2017-0032
Abstract: This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.

Keywords: reproducibility, attribution, data citation
*Corresponding author: Andrea L. Berez-Kroeker, Department of Linguistics, University of Hawaiʻi at Mānoa, 1890 East West Road, Moore 569, Honolulu, HI 96822, USA, E-mail: andrea.berez@hawaii.edu
Lauren Gawne, Department of Languages and Linguistics, SOAS University of London, London
WC1H 0XG, UK; La Trobe University, Melbourne, VIC 3086, Australia, E-mail:
l.gawne@latrobe.edu.au
Susan Smythe Kung, Archive of the Indigenous Languages of Latin America, University of Texas
at Austin, Austin, TX 78712, USA, E-mail: skung@austin.utexas.edu
Barbara F. Kelly, Department of Languages and Linguistics, The University of Melbourne,
Parkville, VIC 3010, Australia, E-mail: b.kelly@unimelb.edu.au
Tyler Heston, Payap University, Chiang Mai 50000, Thailand, E-mail: tylerheston@earthlink.net
Gary Holton, Department of Linguistics, University of Hawaiʻi at Mānoa, 1890 East West Road, Moore 569, Honolulu, HI 96822, USA, E-mail: holton@hawaii.edu
Peter Pulsifer, National Snow and Ice Data Center, Boulder, CO 80303, USA, E-mail:
pulsifer@nsidc.org
David I. Beaver, Department of Linguistics, University of Texas at Austin, Austin, TX 78712, USA,
E-mail: dib@utexas.edu
Shobhana Chelliah, Department of Linguistics, University of North Texas, Denton, TX 76203,
USA, E-mail: Shobhana.Chelliah@unt.edu
Stanley Dubinsky, Linguistics Program, University of South Carolina, Columbia, SC 29208, USA,
E-mail: DUBINSK@mailbox.sc.edu
Richard P. Meier, Department of Linguistics, University of Texas at Austin, Austin, TX 78712,
USA, E-mail: rmeier@austin.utexas.edu
Nick Thieberger, Department of Languages and Linguistics, The University of Melbourne,
Parkville, VIC 3010, Australia, E-mail: thien@unimelb.edu.au
Keren Rice, Department of Linguistics, University of Toronto, Toronto, ON M5S, Canada,
E-mail: rice@chass.utoronto.ca
Anthony C. Woodbury, Department of Linguistics, University of Texas at Austin, Austin, TX
78712, USA, E-mail: woodbury@austin.utexas.edu
Linguistics 2018; 56(1): 1–18
Open Access. © 2018 Berez-Kroeker et al., published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

1 Introduction
The notion of reproducible research has received considerable attention in recent
years from physical scientists, life scientists, social and behavioral scientists, and
computational scientists. In this statement we consider reproducibility as it
applies to linguistic scientists, especially with regard to facilitating a culture of
proper long-term care and citation of linguistic data sets.
This paper grows out of one effort to initiate a discipline-wide dialog around the topic of data citation and attribution in linguistics, in which some 41 linguists and data scientists convened for three workshops held between September 2015 and January 2017. Participants in these workshops addressed issues related to the proper citation of linguistic data sets, and the establishment of criteria for academic credit for the collection, preservation, curation, and sharing thereof. These workshops were supported by a grant from the National Science Foundation (Developing standards for data citation and attribution for reproducible research in linguistics [SMA-1447886]).1 The 41 participants represented diverse subfields of linguistics (syntax, semantics, phonetics, phonology, sociolinguistics, typology, dialectology, language documentation and conservation, historical linguistics, computational linguistics, first and second language acquisition, signed linguistics, and language archiving). Other data scientists came from library and information science, climatology, archaeology, and the polar sciences. The group included academics from every career stage, from graduate students to professors to department chairs to provosts, and they represented institutions of higher learning in North America, Europe, and Australia. These participants are:
Helene Andreassen (TROLLing, UiT The Arctic University of Norway)
Felix Ameka (Leiden University)
Anthony Aristar (University of Texas at Austin)
Helen Aristar-Dry (University of Texas at Austin)
David Beaver (University of Texas at Austin)
Andrea L. Berez-Kroeker (University of Hawaiʻi at Mānoa)
Hans Boas (University of Texas at Austin)
David Carlson (World Climate Research Programme)
Brian Carpenter (American Philosophical Society)
Shobhana Chelliah (University of North Texas)
Tanya E. Clement (University of Texas at Austin)
Lauren Collister (University of Pittsburgh)
Meagan Dailey (University of Hawaiʻi at Mānoa)
Stanley Dubinsky (University of South Carolina)
Ruth Duerr (Ronin Institute)
Colleen Fitzgerald (National Science Foundation)
Lauren Gawne (SOAS University of London and La Trobe University)
Jaime Perez Gonzalez (University of Texas at Austin)
Ryan Henke (University of Hawaiʻi at Mānoa)
Gary Holton (University of Hawaiʻi at Mānoa)
Kavon Hooshiar (University of Hawaiʻi at Mānoa)
Tyler Kendall (University of Oregon)
Susan Smythe Kung (University of Texas at Austin)
Julie Ann Legate (University of Pennsylvania)
Bradley McDonnell (University of Hawaiʻi at Mānoa)
Richard P. Meier (University of Texas at Austin)
Geoffrey S. Nathan (Wayne State University)
Peter Pulsifer (National Snow and Ice Data Center)
Keren Rice (University of Toronto)
Loriene Roy (University of Texas at Austin)
Mandana Seyfeddinipur (SOAS University of London)
Gary F. Simons (SIL International)
Maho Takahashi (University of Hawaiʻi at Mānoa)
Nick Thieberger (University of Melbourne)
Sarah G. Thomason (University of Michigan)
Paul Trilsbeek (The Language Archive, Max Planck Institute for Psycholinguistics)
Mark Turin (University of British Columbia)
Laura Welcher (Long Now Foundation)
Nick Williams (University of Colorado Boulder)
Margaret Winters (Wayne State University)
Anthony C. Woodbury (University of Texas at Austin)

1 https://sites.google.com/a/hawaii.edu/data-citation/

The position described here is an outcome of these meetings, and represents the collective opinion of the participants. In Section 2, we discuss reproducible research in science generally, and in linguistics in particular. In Section 3, we review some recent findings about current practices by authors of linguistics publications with regard to transparency about data sources and research methodologies. Section 4 is our summary position statement on the importance of linguistics data and the citation thereof; the need for mechanisms for evaluating data work in academic hiring, tenure, and promotion processes; and the need to engender a broad sociological shift in our field with regard to reproducible research through education, outreach, and policy development. Section 5 contains summary recommendations on actions that can be taken by linguistics researchers, departments, committees, and publishers, as well as some concluding remarks.
2 On valuing reproducibility in science
and linguistics
Reproducible research aims to provide scientific accountability by facilitating
access for other researchers to the data upon which research conclusions are
based. The term, and its value as a principle of scientific rigor, has arisen
primarily in computer science (e.g., Buckheit and Donoho 1995; de Leeuw
2001; Donoho 2010), where easy access to data and code allows other researchers to verify and refute putative claims. In a 2009 post on The open science
project, a blog dedicated to open source tools and research, Dan Gezelter
summarizes reproducible research thus:
If a scientist makes a claim that a skeptic can only reproduce by spending three decades writing and debugging a complex computer program that exactly replicates the workings of a commercial code, the original claim is really only reproducible in principle. […] Our view is that it is not healthy for scientific papers to be supported by computations that cannot be reproduced except by a few employees at a commercial software developer […] it may be research and it may be important, but unless enough details of the experimental methodology are made available so that it can be subjected to true reproducibility tests by skeptics, it isn't Science. (Gezelter 2009; emphasis original)
Reproducibility in research is an evolution of replicability, a long-standing tenet of the scientific method with which most readers are likely to already be familiar. Replicable research methods are those that can be recreated elsewhere by other scientists, leading to new data; sound scientific claims are those that can be confirmed by the new data in a replicated study.
The difference between reproducible research and replicable research is that the latter produces new data, which can then ostensibly be analyzed for either confirmation or disconfirmation of previous results; the former provides access to the original data for independent analysis. The benefit of reproducibility is evident in cases where faithfully recreating the research conditions is impossible. For example, if a researcher conducts scientific research studying the bacteria in human navels by surveying sixty people at random, that study is considered replicable because another researcher could make the same (or different) claims based on new data coming from a survey of sixty other randomly selected human navels (Hulcr et al. 2012). But in many fieldwork-based life and social sciences, true replicability is not possible to achieve. The variables contributing to a particular instance of field observation are too hard to control in many cases: for example, the mechanisms by which frog-eating bats find prey in the wild (Ryan 2011). Even in semi-controlled situations like studying primate tool use in captivity (Tomasello and Call 2011), it is difficult to replicate every environmental or non-environmental factor that may contribute to which tool a chimpanzee will select in a given situation. Thus reproducibility is a potentially useful metric for rigor in scientific investigations that take place outside of a fully controllable setting.
Because linguistics can be considered a social science dealing with observations of complex behavior, it is another field that would seem to lend itself to the kind of scientific rigor that reproducibility provides; however, we are not aware of any substantial discipline-wide discussion of how we might implement reproducibility, nor of any widespread identification of a need to do so. Like the example of the frog-eating bats, the factors contributing to the selection of one inflected form over another in spontaneous conversation by a speaker of language X are difficult to control for or even observe. Even in a prepared elicitation session or a grammaticality judgment task (a semi-controlled setting for linguistic observation), researchers cannot conceivably control for every possible variable, such as the previous experience of the individual, that leads to an utterance or judgment.
These natural limitations to our research methods are well accepted and noncontroversial, but they do not relieve us of the obligation of scientific accountability. The discussion of reproducibility has had serious professional consequences in other fields; consider for example the recent controversy in social psychology, in which a prominent researcher was found to have fabricated data in 15–20 years' worth of publications (Crocker and Cooper 2012). In addition, Fang and colleagues (2013) surveyed more than 2,000 retracted articles in biomedical and life sciences journals and found that while 21.3% of 2,047 article retractions were due to honest investigator error, fully 67.4% of retractions were due to "misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%), and plagiarism (9.8%)" (Fang et al. 2013: 1). This has led to discussions of solutions including a "transparency index" (Marcus and Oransky 2012) and a "retraction index" for journals (Fang and Casadevall 2011), as well as the publication of watchdog websites, indices, and blogs.
