Andrea L. Berez-Kroeker*, Lauren Gawne, Susan Smythe Kung,
Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer,
David I. Beaver, Shobhana Chelliah, Stanley Dubinsky, Richard
P. Meier, Nick Thieberger, Keren Rice and Anthony C. Woodbury
Reproducible research in linguistics:
A position statement on data citation
and attribution in our field
https://doi.org/10.1515/ling-2017-0032
Abstract: This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.

Keywords: reproducibility, attribution, data citation
*Corresponding author: Andrea L. Berez-Kroeker, Department of Linguistics, University of Hawaiʻi at Mānoa, 1890 East West Road, Moore 569, Honolulu, HI 96822, USA, E-mail: andrea.berez@hawaii.edu
Lauren Gawne, Department of Languages and Linguistics, SOAS University of London, London
WC1H 0XG, UK; La Trobe University, Melbourne, VIC 3086, Australia, E-mail:
l.gawne@latrobe.edu.au
Susan Smythe Kung, Archive of the Indigenous Languages of Latin America, University of Texas
at Austin, Austin, TX 78712, USA, E-mail: skung@austin.utexas.edu
Barbara F. Kelly, Department of Languages and Linguistics, The University of Melbourne,
Parkville, VIC 3010, Australia, E-mail: b.kelly@unimelb.edu.au
Tyler Heston, Payap University, Chiang Mai 50000, Thailand, E-mail: tylerheston@earthlink.net
Gary Holton, Department of Linguistics, University of Hawaiʻi at Mānoa, 1890 East West Road, Moore 569, Honolulu, HI 96822, USA, E-mail: holton@hawaii.edu
Peter Pulsifer, National Snow and Ice Data Center, Boulder, CO 80303, USA, E-mail:
pulsifer@nsidc.org
David I. Beaver, Department of Linguistics, University of Texas at Austin, Austin, TX 78712, USA,
E-mail: dib@utexas.edu
Shobhana Chelliah, Department of Linguistics, University of North Texas, Denton, TX 76203,
USA, E-mail: Shobhana.Chelliah@unt.edu
Stanley Dubinsky, Linguistics Program, University of South Carolina, Columbia, SC 29208, USA,
E-mail: DUBINSK@mailbox.sc.edu
Richard P. Meier, Department of Linguistics, University of Texas at Austin, Austin, TX 78712,
USA, E-mail: rmeier@austin.utexas.edu
Nick Thieberger, Department of Languages and Linguistics, The University of Melbourne,
Parkville, VIC 3010, Australia, E-mail: thien@unimelb.edu.au
Keren Rice, Department of Linguistics, University of Toronto, Toronto, ON M5S, Canada,
E-mail: rice@chass.utoronto.ca
Anthony C. Woodbury, Department of Linguistics, University of Texas at Austin, Austin, TX
78712, USA, E-mail: woodbury@austin.utexas.edu
Linguistics 2018; 56(1): 1–18
Open Access. © 2018 Berez-Kroeker et al., published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

1 Introduction
The notion of reproducible research has received considerable attention in recent
years from physical scientists, life scientists, social and behavioral scientists, and
computational scientists. In this statement we consider reproducibility as it
applies to linguistic scientists, especially with regard to facilitating a culture of
proper long-term care and citation of linguistic data sets.
This paper grows out of one effort to initiate a discipline-wide dialog around the topic of data citation and attribution in linguistics, in which some 41 linguists and data scientists convened for three workshops held between September 2015 and January 2017. Participants in these workshops addressed issues related to the proper citation of linguistic data sets, and the establishment of criteria for academic credit for the collection, preservation, curation, and sharing thereof. These workshops were supported by a grant from the National Science Foundation (Developing standards for data citation and attribution for reproducible research in linguistics [SMA-1447886]).1 The 41 participants represented diverse subfields of linguistics (syntax, semantics, phonetics, phonology, sociolinguistics, typology, dialectology, language documentation and conservation, historical linguistics, computational linguistics, first and second language acquisition, signed linguistics, and language archiving). Other data scientists came from library and information science, climatology, archaeology, and the polar sciences. The group included academics from every career stage, from graduate students to professors to department chairs to provosts, and they represented institutions of higher learning in North America, Europe, and Australia. These participants are:
Helene Andreassen (TROLLing, UiT The Arctic University of Norway)
Felix Ameka (Leiden University)
Anthony Aristar (University of Texas at Austin)
Helen Aristar-Dry (University of Texas at Austin)
David Beaver (University of Texas at Austin)
Andrea L. Berez-Kroeker (University of Hawaiʻi at Mānoa)
Hans Boas (University of Texas at Austin)
David Carlson (World Climate Research Programme)
Brian Carpenter (American Philosophical Society)
Shobhana Chelliah (University of North Texas)
Tanya E. Clement (University of Texas at Austin)
Lauren Collister (University of Pittsburgh)
Meagan Dailey (University of Hawaiʻi at Mānoa)
Stanley Dubinsky (University of South Carolina)
Ruth Duerr (Ronin Institute)
Colleen Fitzgerald (National Science Foundation)
Lauren Gawne (SOAS University of London and La Trobe University)
Jaime Perez Gonzalez (University of Texas at Austin)
Ryan Henke (University of Hawaiʻi at Mānoa)
Gary Holton (University of Hawaiʻi at Mānoa)
Kavon Hooshiar (University of Hawaiʻi at Mānoa)
Tyler Kendall (University of Oregon)
Susan Smythe Kung (University of Texas at Austin)
Julie Ann Legate (University of Pennsylvania)
Bradley McDonnell (University of Hawaiʻi at Mānoa)
Richard P. Meier (University of Texas at Austin)
Geoffrey S. Nathan (Wayne State University)
Peter Pulsifer (National Snow and Ice Data Center)
Keren Rice (University of Toronto)
Loriene Roy (University of Texas at Austin)
Mandana Seyfeddinipur (SOAS University of London)
Gary F. Simons (SIL International)
Maho Takahashi (University of Hawaiʻi at Mānoa)
Nick Thieberger (University of Melbourne)
Sarah G. Thomason (University of Michigan)
Paul Trilsbeek (The Language Archive, Max Planck Institute for Psycholinguistics)
Mark Turin (University of British Columbia)
Laura Welcher (Long Now Foundation)
Nick Williams (University of Colorado Boulder)
Margaret Winters (Wayne State University)
Anthony C. Woodbury (University of Texas at Austin)

1 https://sites.google.com/a/hawaii.edu/data-citation/

The position described here is an outcome of these meetings, and represents the collective opinion of the participants. In Section 2, we discuss reproducible research in science generally, and in linguistics in particular. In Section 3, we review some recent findings about current practices by authors of linguistics publications with regard to transparency about data sources and research methodologies. Section 4 is our summary position statement on the importance of linguistics data and the citation thereof; the need for mechanisms for evaluating data work in academic hiring, tenure, and promotion processes; and the need to engender a broad sociological shift in our field with regard to reproducible research through education, outreach, and policy development. Section 5 contains summary recommendations on actions that can be taken by linguistics researchers, departments, committees, and publishers, as well as some concluding remarks.
2 On valuing reproducibility in science
and linguistics
Reproducible research aims to provide scientific accountability by facilitating
access for other researchers to the data upon which research conclusions are
based. The term, and its value as a principle of scientific rigor, has arisen
primarily in computer science (e.g., Buckheit and Donoho 1995; de Leeuw
2001; Donoho 2010), where easy access to data and code allows other researchers to verify and refute putative claims. In a 2009 post on The open science
project, a blog dedicated to open source tools and research, Dan Gezelter
summarizes reproducible research thus:
If a scientist makes a claim that a skeptic can only reproduce by spending three decades writing and debugging a complex computer program that exactly replicates the workings of a commercial code, the original claim is really only reproducible in principle. […] Our view is that it is not healthy for scientific papers to be supported by computations that cannot be reproduced except by a few employees at a commercial software developer […] it may be research and it may be important, but unless enough details of the experimental methodology are made available so that it can be subjected to true reproducibility tests by skeptics, it isn't Science. (Gezelter 2009; emphasis original)
Reproducibility in research is an evolution of replicability, a long-standing tenet of the scientific method with which most readers are likely to already be familiar. Replicable research methods are those that can be recreated elsewhere by other scientists, leading to new data; sound scientific claims are those that can be confirmed by the new data in a replicated study.
The difference between reproducible research and replicable research is that the latter produces new data, which can then ostensibly be analyzed for either confirmation or disconfirmation of previous results; the former provides access to the original data for independent analysis. The benefit of reproducibility is evident in cases where faithfully recreating the research conditions is impossible. For example, if a researcher conducts scientific research studying the bacteria in human navels by surveying sixty people at random, that study is considered replicable because another researcher could make the same (or different) claims based on new data coming from a survey of sixty other randomly selected human navels (Hulcr et al. 2012). But in many fieldwork-based life and social sciences, true replicability is not possible to achieve. The variables contributing to a particular instance of field observation are too hard to control in many cases: for example, the mechanisms by which frog-eating bats find prey in the wild (Ryan 2011). Even in semi-controlled situations like studying primate tool use in captivity (Tomasello and Call 2011), it is difficult to replicate every environmental or non-environmental factor that may contribute to which tool a chimpanzee will select in a given situation. Thus reproducibility is a potentially useful metric for rigor in scientific investigations that take place outside of a fully controllable setting.
Because linguistics can be considered a social science dealing with observations of complex behavior, it is another field that would seem to lend itself to the kind of scientific rigor that reproducibility provides; however, we are not aware of any substantial discipline-wide discussion of how we might implement reproducibility, nor of any widespread identification of a need to do so. Like the example of the frog-eating bats, the factors contributing to the selection of one inflected form over another in spontaneous conversation by a speaker of language X are difficult to control for or even observe. Even in a prepared elicitation session or a grammaticality judgment task (a semi-controlled setting for linguistic observation), researchers cannot conceivably control for every possible variable, such as the previous experience of the individual, that leads to an utterance or judgment.
These natural limitations to our research methods are well accepted and noncontroversial, but they do not relieve us of the obligation of scientific accountability. The discussion of reproducibility has had serious professional consequences in other fields; consider for example the recent controversy in social psychology, in which a prominent researcher was found to have fabricated data in 15–20 years' worth of publications (Crocker and Cooper 2012). In addition, Fang and colleagues (2013) surveyed more than 2,000 retracted articles in biomedical and life sciences journals and found that while 21.3% of 2,047 article retractions were due to honest investigator error, fully 67.4% of retractions were due to "misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%), and plagiarism (9.8%)" (Fang et al. 2013: 1). This has led to discussions of solutions including a "transparency index" (Marcus and Oransky 2012) and a "retraction index" for journals (Fang and Casadevall 2011), as well as the publication of watchdog websites, indices, and blogs.
