ISSN: 2055-7671

Digital Scholarship in the Humanities 

Oxford University Press
About: Digital Scholarship in the Humanities is an academic journal published by Oxford University Press. It publishes mainly in the areas of computer science and the digital humanities. Its ISSN is 2055-7671. Over its lifetime, the journal has published 706 papers, which have received 4,422 citations. It was formerly known as Literary and Linguistic Computing.


Papers
Journal Article
TL;DR: It is demonstrated that some feature sets are indeed good indicators of translationese, thereby corroborating some hypotheses, whereas others perform much worse, indicating that some ‘universal’ assumptions have to be reconsidered.
Abstract: Much research in translation studies indicates that translated texts are ontologically different from original non-translated ones. Translated texts, in any language, can be considered a dialect of that language, known as ‘translationese’. Several characteristics of translationese have been proposed as universal in a series of hypotheses. In this work, we test these hypotheses using a computational methodology that is based on supervised machine learning. We define several classifiers that implement various linguistically informed features, and assess the degree to which different sets of features can distinguish between translated and original texts. We demonstrate that some feature sets are indeed good indicators of translationese, thereby corroborating some hypotheses, whereas others perform much worse (sometimes at chance level), indicating that some ‘universal’ assumptions have to be reconsidered. In memoriam: Miriam Shlesinger, 1947–2012
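A minimal sketch of the paper's basic recipe, assuming scikit-learn: train a classifier on one linguistically informed feature set (here, function-word counts) and check whether it separates translated from original texts better than chance. The toy corpus and the function-word list are illustrative placeholders, not the authors' features or data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "the cat sat on the mat and purred in the sun",              # original (toy)
    "rain fell on the hills and the river rose fast",            # original (toy)
    "she wrote a letter to her friend in the city",              # original (toy)
    "it is the case that the mat was sat upon by it",            # translated (toy)
    "that which fell was rain and it was that the river rose",   # translated (toy)
    "it was to the friend that the letter was written by her",   # translated (toy)
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = original, 1 = translated

# One candidate feature family: frequencies of function words.
function_words = ["the", "of", "and", "to", "in", "that", "it", "is", "was", "by"]
clf = make_pipeline(
    CountVectorizer(vocabulary=function_words),  # count function words only
    LogisticRegression(max_iter=1000),
)

# Cross-validated accuracy near 0.5 is chance level; accuracy well above it
# would corroborate the hypothesis that this feature set marks translationese.
print(cross_val_score(clf, texts, labels, cv=3).mean())
```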

141 citations

Journal Article
TL;DR: This article proposes a generic solution for specialized corpus visualizations in a Web interface using annotation-triggered style sheets, which leverage the power of modern browsers and CSS for multiple and highly customizable views of primary data.
Abstract: This article is concerned with the data structures, properties of query languages, and visualization facilities required for the generic representation of richly annotated, heterogeneous linguistic corpora. We propose that above and beyond a general graph-based data model, which is becoming increasingly popular in many complex annotation formats, a well-defined concept of multiple, potentially conflicting segmentation layers must be introduced to deal with different sources and applications of corpus data flexibly. We also propose a generic solution for specialized corpus visualizations in a Web interface using annotation-triggered style sheets, which leverage the power of modern browsers and CSS for multiple and highly customizable views of primary data. We offer an implementation and evaluation of our architecture in ANNIS3, an open-source browser-based architecture for corpus search and visualization. We present three case studies to test the coverage of the system, encompassing core linguistic and digital humanities use-cases including richly annotated newspaper treebanks, multilingual diplomatic and normalized manuscript materials edited in TEI, and analysis of multimodal recordings of spoken language.
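The 'annotation-triggered style sheet' idea can be illustrated with a short sketch: tokens are rendered as HTML spans whose CSS classes are derived mechanically from their annotations, so an ordinary style sheet, rather than the rendering code, decides how each layer is displayed. This is a conceptual illustration only, not ANNIS3's actual implementation.

```python
from html import escape

# Toy annotated tokens: (surface form, {annotation layer: value}).
tokens = [
    ("Colorless", {"pos": "ADJ"}),
    ("green",     {"pos": "ADJ"}),
    ("ideas",     {"pos": "NOUN"}),
    ("sleep",     {"pos": "VERB"}),
]

def token_to_span(form, annos):
    """Derive a CSS class like 'pos-NOUN' from each annotation, so the
    style sheet alone controls how that layer is visualized."""
    classes = " ".join(f"{k}-{v}" for k, v in sorted(annos.items()))
    return f'<span class="{classes}">{escape(form)}</span>'

# Customization lives entirely in the style sheet; the markup stays generic.
css = ".pos-NOUN { color: darkred; } .pos-VERB { font-weight: bold; }"
body = " ".join(token_to_span(form, annos) for form, annos in tokens)
print(f"<style>{css}</style>\n<p>{body}</p>")
```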

127 citations

Journal Article
TL;DR: The authors seek the minimal text-sample size for authorship attribution that yields stable results independent of random noise; controlled tests across different sample lengths, languages, and genres are discussed and compared.
Abstract: The aim of this study is to find the minimal size of text sample for authorship attribution that provides stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation concerns the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of the desired size. Although the tests were performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments were conducted with support vector machines and k-NN applied to the most frequent words, character 3-grams, character 4-grams, and part-of-speech-tag 3-grams. Despite significant differences in overall attribution success rates between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
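The two sampling strategies the study compares are easy to state in code. A minimal sketch, with a toy word list standing in for a real tokenized novel; this is an illustration of the contrast, not the authors' test harness.

```python
import random

def passage_sample(words, n, rng):
    """Classical sampling: n consecutive words from a random start point."""
    start = rng.randrange(len(words) - n + 1)
    return words[start:start + n]

def bag_of_words_sample(words, n, rng):
    """Random excerpting without replacement ('bag of words'), which the
    study found markedly more effective than contiguous passages."""
    return rng.sample(words, n)

# Toy stand-in for a tokenized novel; a real experiment would draw such
# samples at increasing sizes and watch where attribution accuracy stabilizes.
words = ("the quick brown fox jumps over the lazy dog " * 1000).split()
rng = random.Random(42)
for n in (2500, 5000):  # the range in which the minimal sizes fell
    bag = bag_of_words_sample(words, n, rng)
    run = passage_sample(words, n, rng)
    print(n, len(bag), len(run))
```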

114 citations

Journal Article
TL;DR: It is shown that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor behind the recently proposed improvement of Delta.
Abstract: This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows’s Delta (and its variants), to understand and explain how it works. Starting from the conceptual separation between feature selection, feature scaling, and distance measures, we designed a series of controlled experiments in which the kind of feature scaling (various types of standardization and normalization) and the type of distance measure (notably Manhattan, Euclidean, and cosine) served as independent variables, with correct authorship attribution as the dependent variable indicative of the performance of each of the proposed methods. In this way, we are able to describe in some detail how these two variables interact and how they influence the results. We can thus show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor behind the recently proposed improvement of Delta. We are also able to show that the information particularly relevant to identifying the author of a text lies in the profile of deviation across the most frequent words, rather than in the extent of the deviation or in the deviation of specific words only.
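The decomposition into scaling and distance can be made concrete in a few lines of numpy. The sketch below z-scores most-frequent-word frequencies (Burrows-style standardization), computes Manhattan and Euclidean distances, and verifies the identity that makes normalization 'implicit in the cosine measure': for unit-length vectors, squared Euclidean distance equals twice the cosine distance. The numbers are random toy data, not a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = rng.random((4, 10))  # 4 documents x 10 most-frequent-word frequencies

# Burrows-style feature scaling: z-score each word column across the corpus.
z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)

manhattan = np.abs(z[0] - z[1]).sum()    # the core of classic Burrows's Delta
euclidean = np.linalg.norm(z[0] - z[1])
print(f"Manhattan: {manhattan:.3f}, Euclidean: {euclidean:.3f}")

# Normalizing each profile to length 1 makes Euclidean distance a monotone
# function of cosine distance: ||u - v||^2 = 2 * (1 - cos(u, v)).
u = z[0] / np.linalg.norm(z[0])
v = z[1] / np.linalg.norm(z[1])
cosine_dist = 1 - u @ v
assert np.isclose(np.linalg.norm(u - v) ** 2, 2 * cosine_dist)
```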

102 citations

Journal Article
TL;DR: The significance estimates of various statistical tests are compared in a controlled resampling experiment and in a practical setting (differences between texts by male and female fiction writers in the British National Corpus); the authors conclude that significance testing can find consequential differences between corpora.
Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (‘Language is never, ever, ever, random’, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (‘Comparing corpora’, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (‘Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction’, in Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence the data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
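A minimal sketch of the recommended text-level testing, assuming scipy: rather than pooling all tokens (which assumes word-level independence), compute one relative frequency per text and compare the two samples of per-text values. The frequencies below are invented for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(1)
# Per-text relative frequency of some target word in each corpus (toy data).
freqs_a = rng.normal(0.0040, 0.0010, size=30)  # e.g. texts by female authors
freqs_b = rng.normal(0.0046, 0.0010, size=30)  # e.g. texts by male authors

t_stat, p_t = ttest_ind(freqs_a, freqs_b, equal_var=False)  # Welch's t-test
u_stat, p_u = mannwhitneyu(freqs_a, freqs_b)                # Wilcoxon rank-sum
print(f"t-test p = {p_t:.4f}, rank-sum p = {p_u:.4f}")
```

The bootstrap test the authors also recommend follows the same text-level logic: resample whole texts with replacement and compare the resulting frequency distributions.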

86 citations

Performance Metrics

Number of papers from the journal in previous years:

Year    Papers
2023    53
2022    131
2021    157
2020    39
2019    104
2018    59