
Showing papers in "Computers and the Humanities in 2003"


Journal ArticleDOI
TL;DR: C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions, using predicate-argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit.
Abstract: C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions. It is not simply a string-matching program – instead it uses predicate-argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit to a short answer question. C-rater has been used in two studies: the National Assessment for Educational Progress (NAEP) and a statewide assessment in Indiana. In both studies, c-rater agreed with human graders about 84% of the time.

363 citations
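The abstract names the linguistic components but not the scoring algorithm, so the following is only a toy sketch of the general idea: crediting a response in proportion to the required concepts it covers after crude morphological normalization and synonym matching. The stemmer, concept lists and example item are all invented:

# Toy concept-matching scorer in the spirit of c-rater (not its actual algorithm).
# The model answer is a list of required concepts; each concept is a set of
# interchangeable word forms (synonyms). Credit is proportional to coverage.

def stem(word):
    # Crude morphological normalization, for the sketch only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def score(response, concepts):
    tokens = {stem(w) for w in response.lower().split()}
    matched = sum(1 for synonyms in concepts if tokens & synonyms)
    return matched / len(concepts)   # 1.0 = full credit, fractions = partial

# Hypothetical item: "Why does ice float on water?"
concepts = [{"dense", "density"}, {"ice"}, {"water", "liquid"}]
print(score("Ice floats because it is less dense than liquid water", concepts))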


Journal ArticleDOI
TL;DR: Vocabulary richness is of marginal value in stylistic and authorship studies because the basic assumption that it constitutes a wordprint for authors is false.
Abstract: This article examines the usefulness of vocabulary richness for authorship attribution and tests the assumption that appropriate measures of vocabulary richness can capture an author's distinctive style or identity. After briefly discussing perceived and actual vocabulary richness, I show that doubling and combining texts affects some measures in computationally predictable but conceptually surprising ways. I discuss some theoretical and empirical problems with some measures and develop simple methods to test how well vocabulary richness distinguishes texts by different authors. These methods show that vocabulary richness is ineffective for large groups of texts because of the extreme variability within and among them. I conclude that vocabulary richness is of marginal value in stylistic and authorship studies because the basic assumption that it constitutes a wordprint for authors is false.

109 citations
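The doubling effect is easy to check numerically: concatenating a text with itself leaves every relative frequency unchanged, so a spectrum-based measure such as Yule's K changes only marginally, while the type-token ratio drops sharply. A minimal sketch in Python (the input file name is a placeholder; any plain text will do):

from collections import Counter

def ttr(tokens):
    return len(set(tokens)) / len(tokens)           # type-token ratio

def yules_k(tokens):
    n = len(tokens)
    freqs = Counter(tokens)                          # word -> frequency
    spectrum = Counter(freqs.values())               # frequency -> no. of types
    return 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)

text = open("sample.txt").read().lower().split()     # placeholder file
print(ttr(text), yules_k(text))                      # original text
print(ttr(text * 2), yules_k(text * 2))              # doubled text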


Journal ArticleDOI
TL;DR: A novel approach to the problem that employs a scoring scheme for computing phonetic similarity between phonetic segments on the basis of multivalued articulatory phonetic features, which performs better than comparable algorithms reported in the literature.
Abstract: The computation of the optimal phonetic alignment and the phonetic similarity between words is an important step in many applications in computational phonology, including dialectometry. After discussing several related algorithms, I present a novel approach to the problem that employs a scoring scheme for computing phonetic similarity between phonetic segments on the basis of multivalued articulatory phonetic features. The scheme incorporates the key concept of feature salience, which is necessary to properly balance the importance of various features. The new algorithm combines several techniques developed for sequence comparison: an extended set of edit operations, local and semiglobal modes of alignment, and the capability of retrieving a set of near-optimal alignments. On a set of 82 cognate pairs, it performs better than comparable algorithms reported in the literature.

93 citations
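In outline, the scheme assigns each pair of segments a similarity computed from multivalued feature values weighted by feature salience, and feeds that into a dynamic-programming alignment. The sketch below uses invented feature values and salience weights, and plain global alignment rather than the paper's local/semiglobal modes and near-optimal retrieval:

# Illustrative feature vectors: each feature takes a value in [0, 1];
# salience weights reflect how much each feature matters for similarity.
FEATURES = {                     # segment -> {feature: value}
    "p": {"place": 1.0, "manner": 1.0, "voice": 0.0},
    "b": {"place": 1.0, "manner": 1.0, "voice": 1.0},
    "f": {"place": 0.95, "manner": 0.8, "voice": 0.0},
    "v": {"place": 0.95, "manner": 0.8, "voice": 1.0},
}
SALIENCE = {"place": 40, "manner": 50, "voice": 10}

def sim(a, b):
    # Salience-weighted similarity between two phonetic segments.
    return sum(s * (1 - abs(FEATURES[a][f] - FEATURES[b][f]))
               for f, s in SALIENCE.items())

def align(x, y, gap=-30):
    # Global DP alignment maximizing total segment similarity.
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): d[i][0] = d[i - 1][0] + gap
    for j in range(1, n + 1): d[0][j] = d[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = max(d[i - 1][j - 1] + sim(x[i - 1], y[j - 1]),
                          d[i - 1][j] + gap, d[i][j - 1] + gap)
    return d[m][n]

print(align(["p", "f"], ["b", "v"]))   # similar segments align cheaply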


Journal ArticleDOI
TL;DR: A statistical analysis of the results shows, first, that language change can be measured, and second, that the rate of language change has not been uniform: in particular, the period 1939–1948 showed particularly slow change, while 1949–1958 and 1959–1968 showed particularly rapid change.
Abstract: This paper presents a numeric and information-theoretic model for the measurement of language change, without specifying the particular type of change. It is shown that this measurement is intuitively plausible and that meaningful measurements can be made from as few as 1000 characters. This measurement technique is extended to the task of determining the "rate" of language change based on an examination of brief excerpts from the National Geographic Magazine, determining both their linguistic distance from one another and the number of years of temporal separation. A statistical analysis of these results shows, first, that language change can be measured, and second, that the rate of language change has not been uniform; in particular, the period 1939–1948 showed particularly slow change, while 1949–1958 and 1959–1968 showed particularly rapid change.

72 citations
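The abstract does not spell out the measure, so the sketch below substitutes a generic information-theoretic proxy, normalized compression distance, which likewise needs only short character samples; it illustrates the kind of measurement rather than the authors' actual model (file names are placeholders):

import zlib

def C(s):
    return len(zlib.compress(s.encode("utf-8"), 9))

def distance(a, b):
    # Normalized compression distance: small when b is predictable from a.
    return (C(a + b) - min(C(a), C(b))) / max(C(a), C(b))

# Hypothetical excerpts from two decades, about 1000 characters each:
text_1940 = open("ng_1940.txt").read()[:1000]
text_1960 = open("ng_1960.txt").read()[:1000]
print(distance(text_1940, text_1960))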


Journal ArticleDOI
TL;DR: The author presents a method for identifying the author of a text, based on a hierarchical list of the frequencies of the words common to the whole set of texts.
Abstract: The author speaks on the occasion of receiving the 2001 Roberto Busa Award, conferred for his contribution to the field of computing and the humanities. He reviews his career and presents a new method for identifying the author of a text. His interest in this field of research began in the 1970s, with the analysis of a text by Jane Austen. He now presents a new approach based on a hierarchical list of the frequencies of the words common to the whole set of texts. He describes the procedures and results of this method, suggests some possible future developments, and surveys the state of the art in this field of computer-assisted research.

70 citations
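The bookkeeping behind such a method can be sketched as follows: build the ranked list of words common across the whole set of texts, then express each text as its relative frequencies over that list. This is only the scaffolding, with the statistical comparison itself left open:

from collections import Counter

def common_word_profiles(texts, n_words=150):
    # Rank words by total frequency across all texts; keep the top n.
    total = Counter(w for t in texts for w in t.lower().split())
    wordlist = [w for w, _ in total.most_common(n_words)]
    profiles = []
    for t in texts:
        counts = Counter(t.lower().split())
        size = sum(counts.values())
        profiles.append([counts[w] / size for w in wordlist])  # relative freqs
    return wordlist, profiles

# Texts by candidate authors plus a disputed text could then be compared
# word by word, e.g. by standardizing and correlating the profiles.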


Journal ArticleDOI
TL;DR: "Profile-based linguistic uniformity" is a method designed to compare language varieties on the basis of a wide range of potentially heterogeneous linguistic variables; its global similarity to current dialectometric methods makes it possible to compare the two approaches and to investigate the implications of notable differences.
Abstract: In this text we present "profile-based linguistic uniformity", a method designed to compare language varieties on the basis of a wide range of potentially heterogeneous linguistic variables. In many respects a parallel can be drawn with current methods in dialectometry (for an overview, see Nerbonne and Heeringa, 2001; Heeringa, Nerbonne and Kleiweg, 2002): in both cases dissimilarities between varieties on the basis of individual variables are summarized in global dissimilarities, and a series of language varieties are subsequently clustered or charted using multivariate techniques such as cluster analysis or multidimensional scaling. This global similarity between the methods makes it possible to compare them and to investigate the implications of notable differences. In this text we specifically focus on, and defend, one characteristic of our methodology: its profile-based nature.

57 citations
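The profile idea reduces each linguistic variable to a frequency distribution over its alternative variants in each variety; per-variable dissimilarities are then averaged into a global one, which can feed the clustering or multidimensional scaling mentioned above. A minimal sketch (variables, variants and frequencies are invented):

def profile_distance(p, q):
    # City-block distance between two variant-frequency distributions,
    # halved so it ranges from 0 (identical) to 1 (disjoint).
    variants = set(p) | set(q)
    return sum(abs(p.get(v, 0) - q.get(v, 0)) for v in variants) / 2

def global_dissimilarity(a, b):
    # Average the per-variable profile distances over shared variables.
    shared = set(a) & set(b)
    return sum(profile_distance(a[v], b[v]) for v in shared) / len(shared)

# Hypothetical data: variety -> variable -> variant -> relative frequency
variety_a = {"soft_drink": {"soda": 0.9, "pop": 0.1}}
variety_b = {"soft_drink": {"soda": 0.6, "pop": 0.4}}
print(global_dissimilarity(variety_a, variety_b))    # 0.3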


Journal ArticleDOI
TL;DR: A lexical distance measure is applied to assess the lexical relatedness of LAMSAS's sites, a popular focus of investigation in the past; the paper also extends dialectometric technique in suggesting means of dealing with alternate forms and multiple responses.
Abstract: The Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) is admirably accessible for reanalysis (see http://hyde.park.uga.edu/lamsas/, Kretzschmar, 1994). The present paper applies a lexical distance measure to assess the lexical relatedness of LAMSAS's sites, a popular focus of investigation in the past (Kurath, 1949; Carver, 1989; McDavid, 1994). Several conclusions are noteworthy: First, and least controversially, we note that LAMSAS is dialectometrically challenging at least due to the range of fieldworkers and questionnaires employed. Second, on the issue of which areas ought to be recognized, we note that our investigations tend to support a three-way North/South/Midlands division rather than a two-way North/South division, i.e. they tend to support Kurath and McDavid rather than Carver, but this tendency is not conclusive. Third, we extend dialectometric technique in suggesting means of dealing with alternate forms and multiple responses.

55 citations
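One natural way to handle multiple responses per site, in the spirit of the paper's third point though not necessarily its exact proposal, is to treat each site's answers to an item as a set and use a Jaccard-style distance. The responses below are invented:

def item_distance(a, b):
    # a, b: sets of lexical variants offered at two sites for one item.
    # Set overlap copes naturally with multiple responses per informant.
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

def site_distance(site1, site2):
    # Mean item distance over the questionnaire items both sites answered.
    items = [i for i in site1 if i in site2]
    return sum(item_distance(site1[i], site2[i]) for i in items) / len(items)

# Hypothetical responses to the "dragonfly" and "pail" items:
s1 = {"dragonfly": {"snake doctor"}, "pail": {"pail", "bucket"}}
s2 = {"dragonfly": {"snake feeder"}, "pail": {"bucket"}}
print(site_distance(s1, s2))    # (1 + 0.5) / 2 = 0.75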


Journal ArticleDOI
TL;DR: A new digital infrastructure for discovering language resources being developed by the Open Language Archives Community is reported on, designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice.
Abstract: As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper reports on a new digital infrastructure for discovering language resources being developed by the Open Language Archives Community (OLAC). At the core of OLAC is its metadata format, which is designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice. The paper describes OLAC metadata, its relationship to Dublin Core metadata, and its dissemination using the metadata harvesting protocol of the Open Archives Initiative.

37 citations
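Because OLAC disseminates its metadata through the Open Archives Initiative's harvesting protocol (OAI-PMH), a minimal harvester fits in a few lines of standard-library Python. The endpoint URL is a placeholder; real OLAC archives also serve the richer olac metadata prefix alongside plain Dublin Core:

import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www.example.org/oai"          # placeholder repository endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"
tree = ET.parse(urllib.request.urlopen(url))
for record in tree.findall(".//oai:record", NS):
    title = record.find(".//dc:title", NS)
    if title is not None:
        print(title.text)                    # one harvested resource per line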


Journal ArticleDOI
TL;DR: Two essay-based discourse analysis systems that identify thesis and conclusion statements from student essays written on six different essay topics show similar results, indicating that a system can generalize to unseen data – that is, essay responses on topics that the system has not seen in training.
Abstract: This study describes and evaluates two essay-based discourse analysis systems that identify thesis and conclusion statements from student essays written on six different essay topics. Essays used to train and evaluate the systems were annotated by two human judges, according to a discourse annotation protocol. Using a machine learning approach, a number of discourse-related features were automatically extracted from a set of annotated training data. Using these features, two discourse analysis models were built using C5.0 with boosting: a topic-dependent and a topic-independent model. Both systems outperformed a positional algorithm. While the topic-dependent system showed somewhat higher performance, the topic-independent system showed similar results, indicating that a system can generalize to unseen data – that is, essay responses on topics that the system has not seen in training.

34 citations
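The positional baseline that both trained systems beat is easy to state exactly: guess the first sentence as the thesis and the last as the conclusion. A sketch (the sentence splitter is deliberately naive, and the file name is a placeholder):

import re

def positional_baseline(essay):
    # Naive sentence split; a real system would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", essay) if s.strip()]
    return {"thesis": sentences[0],          # guess: first sentence
            "conclusion": sentences[-1]}     # guess: last sentence

labels = positional_baseline(open("essay.txt").read())
print(labels["thesis"])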


Journal ArticleDOI
TL;DR: The aim of this paper is to find an acoustic distance measure between dialects which approximates a perceptual distance measure, applying the Levenshtein algorithm to spectra or formant value bundles instead of transcription segments.
Abstract: Gooskens (2003) described an experiment which determined linguistic distances between 15 Norwegian dialects as perceived by Norwegian listeners. The results are compared to Levenshtein distances, calculated on the basis of transcriptions (of the words) of the same recordings as used in the perception experiment. The Levenshtein distance is equal to the sum of the weights of the insertions, deletions and substitutions needed to change one pronunciation into another. The success of the method depends on the reliability of the transcriber. The aim of this paper is to find an acoustic distance measure between dialects which approximates the perceptual distance measure. We use and compare different representations of the acoustic signal: Barkfilter spectrograms, cochleagrams and formant tracks. We apply the Levenshtein algorithm to spectra or formant value bundles instead of transcription segments. Among these acoustic representations we got the best results using the formant track representation. However, the transcription-based Levenshtein distances still correlate more closely with the perceptual distances. The acoustic signal retains some speaker-dependent influence, while a transcriber abstracts from voice quality. Using more samples per dialect word (instead of only one, as in our research) should improve the accuracy of the measurements.

27 citations
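Replacing transcription segments with acoustic vectors changes only the cost functions of the Levenshtein computation. A sketch with hypothetical two-formant vectors, using Euclidean distance for substitutions and distance from a zero "silence" vector for indels (an illustrative choice, not the paper's exact weighting):

import math

def levenshtein_acoustic(x, y):
    # x, y: sequences of acoustic feature vectors (e.g., formant values
    # per segment). Substitution cost = Euclidean distance between vectors;
    # indel cost = distance from the zero vector, as an illustration.
    def indel(v):
        return math.dist(v, (0,) * len(v))
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): d[i][0] = d[i - 1][0] + indel(x[i - 1])
    for j in range(1, n + 1): d[0][j] = d[0][j - 1] + indel(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + math.dist(x[i - 1], y[j - 1]),
                          d[i - 1][j] + indel(x[i - 1]),
                          d[i][j - 1] + indel(y[j - 1]))
    return d[m][n]

# Hypothetical (F1, F2) formant values in Hz, one vector per segment:
word_a = [(310, 2020), (600, 1200)]
word_b = [(320, 1990), (650, 1100)]
print(levenshtein_acoustic(word_a, word_b))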


Journal ArticleDOI
TL;DR: This special issue of Computers and the Humanities presents a range of recent work on dialectology and dialectometry, fields in which there has long been a perceived need for techniques that can deal with large amounts of data in a controlled manner, i.e. computational techniques.
Abstract: Dialectology is the study of dialects, and dialectometry is the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography. The earliest works in dialectology showed that language variation is complex both geographically and linguistically and cannot be reduced to simple characterizations. There has thus always been a perceived need for techniques which can deal with large amounts of data in a controlled manner, i.e. computational techniques. This special issue of Computers and the Humanities presents a range of recent work on this topic.

Journal ArticleDOI
TL;DR: The Theatre of Pompey became the architectural Ur-text for many of the numerous theatres built throughout the Roman Empire, and in the Renaissance left its imprint upon such seminal theatres as the Teatro Olimpico at Vicenza and theTeatro Farnese at Parma.
Abstract: In 55 BC the triumphal general Pompey the Great dedicated Rome's first permanent theatre and named it after himself. This was no ordinary theatre. Pompey's sumptuous and grandiose edifice, probably the largest theatre ever built, comprised, in addition to the Theatre itself (the stage of which was 300 feet wide), an extensive "leisure-complex" of gardens enclosed within a colonnade, and galleries displaying rare works of art. It also included a curia (a meeting house for the Senate), and it was in this building that Caesar was assassinated in 44 BC. A grand temple above the uppermost tiers of the auditorium, dedicated to Pompey's patron divinity, Venus Victrix, crowned the entire architecturally unified monument. Although the theatre was built upon the flats of the Campus Martius, this, its highest point, was second in height only to the temple of Jupiter on the Capitol. According to our research, the auditorium or cavea beneath it may have accommodated some 25,000 spectators. Pompey's gift to the Roman people was for centuries the site of many of the most important events in the cultural and political life of the city. Nero himself performed upon its stage, much to the disgust of the senatorial class and the delight of the masses. As late as the 6th century AD, when it was restored for the last time, the theatre was still sufficiently imposing for Cassiodorus to exclaim, "one would have thought it more likely for mountains to subside, than this strong building be shaken". Over five centuries earlier, when Vitruvius wrote his influential treatise, De Architectura, his detailed account of how a "typical" Roman theatre should be built was based upon Pompey's recently-completed edifice; indeed, at the time he wrote, it was probably still the only stone theatre in the city of Rome. Thus, through Vitruvius, the Theatre of Pompey became the architectural Ur-text for many of the numerous theatres built throughout the Roman Empire. Subsequently, in the Renaissance, through the influence of Vitruvius, the Theatre of Pompey left its imprint upon such seminal theatres as the Teatro Olimpico at Vicenza and the Teatro Farnese at Parma. This single theatre, therefore, had a unique

Journal ArticleDOI
TL;DR: Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books; a principal component analysis based on word frequencies finds that the main differences are not due to authorship.
Abstract: This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or can the author fool the computer? If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. Standard measures of lexical richness, Yule's K and Orlov's Z, both distinguish Adair from Carroll, though Z also distinguishes the two originals. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.
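A principal component analysis of this kind needs only a matrix of standardized word frequencies, one row per text. A sketch via singular value decomposition (the input file is a hypothetical texts-by-words frequency table):

import numpy as np

# Rows: texts (Carroll originals, Adair pastiche, controls);
# columns: relative frequencies of the most common words.
freqs = np.loadtxt("word_freqs.csv", delimiter=",")   # hypothetical matrix
z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)  # standardize columns
u, s, vt = np.linalg.svd(z, full_matrices=False)      # PCA via SVD
scores = u * s                   # texts projected onto the components
print(scores[:, :2])             # first two PCs, e.g. for a scatter plot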

Journal ArticleDOI
TL;DR: The mechanics of ebook production at the Etext Center, the limits of the current technology, and the conversion workflow the authors hope to implement in the future are discussed.
Abstract: Between August 2000 and August 2002, the Electronic Text Center at the University of Virginia distributed over seven million freely-available electronic books to users from more than 100 different countries. Delivered in a variety of formats, including .lit and .pdb, these ebooks have provided proof-of-concept for the adaptive uses of TEI standards beyond the World Wide Web – standards that the Electronic Text Center has employed since its inception in 1992. The first half of this paper discusses the mechanics of ebook production at the Etext Center, the limits of the current technology, and the conversion workflow we hope to implement in the future. The second half discusses user response to our ebook collection, classroom applications of ebook technology, and the advantages and disadvantages that different formats offer to scholars and instructors in the humanities.

Journal ArticleDOI
TL;DR: Meta-interpretation, a method that combines individual responses to a text, reading logs, screen recordings and limited qualitative/quantitative analysis, and critical interpretation, is outlined; the method addresses Espen Aarseth's concerns and illuminates interesting features of interactive processes in fictional environments.
Abstract: Traditional discourses upon literature have been predicated upon the ability to refer to a text that others may consult (Landow, 1994, p. 33). Texts that involve elements of feedback and non-trivial decision-making on the part of the reader (Aarseth, 1997, p. 1) therefore present a challenge to readers and critics alike. Since a persuasive case has been made against a critical method that sets out to "identify the task of interpretation as a task of territorial exploration and territorial mastery" (Aarseth, p. 87), this paper proposes the use of readers in an empirically based approach to hypertext fiction. Meta-interpretation, a method that combines individual responses to a text, reading logs, screen recordings and limited qualitative/quantitative analysis, and critical interpretation is outlined. By analysing readers' responses it is possible to suggest both the ways that textual elements may have influenced or determined readers' choices and the ways that readers' choices "configure" the text. The method thus addresses Espen Aarseth's concerns and illuminates interesting features of interactive processes in fictional environments. The paper is divided into two parts: the first part sketches out meta-interpretation through consideration of the main problems confronting the literary critic; the second part describes reading research aimed at generating data for the literary critic.

Journal ArticleDOI
TL;DR: Gene order analysis for Chaucer's Canterbury Tales supports the idea that there was no established order when the first manuscripts were written; the resulting stemma shows relationships predicted by earlier scholars, reveals new relationships, and shares features with a word variation stemma.
Abstract: Chaucer's Canterbury Tales consists of loosely-connected stories, appearing in many different orders in extant manuscripts. Differences in order result from rearrangements by scribes during copying, and may reveal relationships among manuscripts. Identifying these relationships is analogous to determining evolutionary relationships among organisms from the order of genes on a genome. We use gene order analysis to construct a stemma for the Canterbury Tales. This stemma shows relationships predicted by earlier scholars, reveals new relationships, and shares features with a word variation stemma. Our results support the idea that there was no established order when the first manuscripts were written.
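The biological analogy can be made concrete with a breakpoint distance: two manuscripts are close when tales adjacent in one are also adjacent in the other. A sketch with invented tale orders (not transcriptions of real manuscripts); a matrix of such pairwise distances could then feed a stemma-building method:

def breakpoint_distance(order1, order2):
    # Count adjacencies in order1 that are not preserved (in either
    # direction) in order2 -- a standard gene-order dissimilarity.
    adj2 = {frozenset(p) for p in zip(order2, order2[1:])}
    return sum(1 for p in zip(order1, order1[1:]) if frozenset(p) not in adj2)

ms_a = ["Knight", "Miller", "Reeve", "Cook", "Man of Law"]
ms_b = ["Knight", "Miller", "Cook", "Reeve", "Man of Law"]
print(breakpoint_distance(ms_a, ms_b))   # 2 adjacencies broken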


Journal ArticleDOI
TL;DR: An adjusted version of an articulation-based system developed by Almeida and Braun (1986) is used to find sound distances between IPA transcriptions; classifying dialects with these distances yields a division with clear similarities to traditional dialect maps.
Abstract: Measuring dialect distances can be based on the comparison of words, and the comparison of words should be based on the comparison of sounds. In this research we used an adjusted version of an articulation-based system, developed by Almeida and Braun (1986), for finding sound distances between IPA transcriptions. For the comparison of two pronunciations of a word corresponding with two different varieties, we used the Levenshtein algorithm, which finds the easiest way in which one word can be changed into the other by inserting, deleting or substituting sounds. As weights for these three operations we used the distances found with the Almeida and Braun system. The dialect distance is then equal to the average of a range of word distances. We applied the technique to 360 Dutch dialects. The transcriptions of 125 words for each dialect are taken from the Reeks Nederlandse Dialectatlassen (Blancquaert and Pee, 1925-1982). Classifying dialects with these distances yields a division with clear similarities to traditional dialect maps. Using logarithmic sound distances improves results compared to results based on constant sound distances.
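In outline: a sound-distance table supplies the operation weights, the Levenshtein algorithm finds the cheapest edit sequence per word, and the dialect distance is the mean over the word list. The toy distance table and transcriptions below are invented; the adjusted Almeida and Braun values (optionally log-scaled) would take their place:

def sound_dist(a, b):
    # Toy articulatory distances between sounds; a calibrated table
    # (possibly log-scaled) would replace these numbers.
    table = {frozenset("ao"): 0.4, frozenset("sz"): 0.2}
    if a == b:
        return 0.0
    return table.get(frozenset(a + b), 1.0)

def word_distance(w1, w2, indel=0.5):
    # Weighted Levenshtein distance between two transcriptions.
    m, n = len(w1), len(w2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): d[i][0] = i * indel
    for j in range(1, n + 1): d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + sound_dist(w1[i - 1], w2[j - 1]),
                          d[i - 1][j] + indel, d[i][j - 1] + indel)
    return d[m][n]

def dialect_distance(words1, words2):
    # Mean over the word list (125 words per dialect pair in the paper).
    return sum(map(word_distance, words1, words2)) / len(words1)

print(dialect_distance(["huis"], ["hoes"]))   # hypothetical transcriptions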

Journal ArticleDOI
TL;DR: This paper shall concentrate on the problem of assisting the automatic categorisation of small segments of a philosophical text into a set of thematic categories.
Abstract: There are two important strategies in computer-assisted reading and analysis of text (CARAT). The first relates to the classification process, and the second pertains to the categorisation process. These two often-interrelated operations have been regularly recognised as essential components of text analysis. However, the two operations are highly time-consuming. A possible solution to this problem calls upon more inductive or bottom-up strategies that are numerical and statistical in nature. In our own research, we have been exploring a few of these techniques and their combination. We now know, through our own past research and others' work, that the classification methods allow a good empirical thematic exploration of a corpus. More specifically, in this paper we shall concentrate on the problem of assisting the automatic categorisation of small segments of a philosophical text into a set of thematic categories.
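The abstract leaves the particular numerical techniques open, so the following is only a generic sketch of one bottom-up option: representing segments as word-count vectors and assigning each to the thematic category with the most similar centroid, the centroids having been built from hand-labelled training segments. All names and numbers are invented:

from collections import Counter

def vector(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def categorise(segment, centroids, vocab):
    # Assign the segment to the thematic category whose centroid is closest.
    v = vector(segment, vocab)
    return max(centroids, key=lambda cat: cosine(v, centroids[cat]))

# Hypothetical vocabulary and centroids from hand-labelled segments:
vocab = ["being", "essence", "cause", "virtue"]
centroids = {"metaphysics": [5, 4, 3, 0], "ethics": [1, 0, 1, 6]}
print(categorise("Virtue is a cause of the good", centroids, vocab))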

Journal ArticleDOI
TL;DR: Responses in personal interviews about education and career with 415 Swedish men and women (age 34) form the basis of a speech corpus with 1.8 million words, whose vocabulary is described by means of two sets of variables and related to a broad set of respondent characteristics.
Abstract: Responses in personal interviews about education and career with 415 Swedish men and women (age 34) form the basis of a speech corpus with 1.8 million words. The vocabulary is described by means of two sets of variables. One is based on the number of tokens and types, word length and sectioning of the running text. The other set divides the corpus into grammatical categories. Both sets of variables are related to a number of background variables such as gender, socioeconomic background, education, and indicators of verbal proficiency at age 13 and 32. This possibility to study the relationship between vocabulary and a broad set of respondent characteristics is a unique feature of this corpus.

Journal ArticleDOI
TL;DR: The use of a finite state machine (FSM) to disambiguate speech acts in a machine translation system is described; evaluation results show that the discourse processor is able to disambiguate speech acts and improve the quality of the dialogue translation.
Abstract: A common tool for improving the performance quality of natural language processing systems is the use of contextual information for disambiguation. Here I describe the use of a finite state machine (FSM) to disambiguate speech acts in a machine translation system. The FSM has two layers that model, respectively, the global and local structures found in naturally-occurring conversations. The FSM has been modeled on a corpus of task-oriented dialogues in a travel planning situation. In the dialogues, one of the interactants is a travel agent or hotel clerk, and the other a client requesting information or services. A discourse processor based on the FSM was implemented in order to process contextual information in a machine translation system. Evaluation results show that the discourse processor is able to disambiguate and improve the quality of the dialogue translation. Other applications include human-computer interaction and computer-assisted language learning.
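The two-layer design can be sketched as a global machine over dialogue phases plus, per phase, a local mapping from surface cues to speech acts. States, cues and acts below are invented stand-ins for the corpus-derived model:

# Global layer: transitions between dialogue phases in a travel dialogue.
GLOBAL = {("opening", "greet"): "negotiation",
          ("negotiation", "accept"): "closing",
          ("negotiation", "request"): "negotiation"}

# Local layer: within a phase, map a surface cue to a speech act.
LOCAL = {"opening":     {"hello": "greet"},
         "negotiation": {"could you": "request", "that's fine": "accept"},
         "closing":     {"bye": "farewell"}}

def disambiguate(utterances):
    state, acts = "opening", []
    for u in utterances:
        # Pick the first matching cue in the current phase; default to
        # a generic act when no cue matches.
        act = next((a for cue, a in LOCAL[state].items() if cue in u.lower()),
                   "inform")
        acts.append(act)
        state = GLOBAL.get((state, act), state)    # global transition
    return acts

print(disambiguate(["Hello!", "Could you book a room?", "That's fine."]))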

Journal ArticleDOI
TL;DR: This study shows how cluster analysis can shed light on very complex variation in a transitional dialect zone in eastern Finland, demonstrating that the effects of the old parishes, borders and settlements are still visible in the dialects.
Abstract: The aim of this study is to show how cluster analysis can shed light on very complex variation in a transitional dialect zone in eastern Finland. In the course of history this area has been on the border between Sweden and Russia, and the population has clearly been of two kinds: the Savo people and the Karelians. It is a well-known fact that there is variation among these dialects, but the spread and extent of the variation has not been demonstrated previously. The idiolects of the area were studied in the light of ten phonological and morphological features. The material consisted of recordings of 198 idiolects, totalling around 195 hours and representing 19 parishes. The variation was analysed using hierarchical cluster analysis. While the analysis showed the extent of the variation between idiolects and parishes, it also demonstrated how the effects of the old parishes, borders and settlements are still visible in the dialects. On the parish level, the data formed clear clusters that correspond with the main dialects in the area and its surroundings. On the idiolect level, however, the speakers from the surrounding areas formed fairly homogenous clusters, but the idiolects from the Savonlinna area were spread across almost all clusters.
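The analysis pipeline is conventional enough to sketch with SciPy (the feature file layout is hypothetical: one row per idiolect, one column per linguistic feature frequency):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: idiolects; columns: frequencies of the ten phonological and
# morphological features (hypothetical file layout).
data = np.loadtxt("idiolect_features.csv", delimiter=",")
z = linkage(data, method="ward")                   # agglomerative clustering
labels = fcluster(z, t=5, criterion="maxclust")    # cut tree into 5 clusters
for idiolect, label in enumerate(labels):
    print(idiolect, label)                         # cluster per speaker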

Journal ArticleDOI
TL;DR: Another approach is given to constructing functions similar to the so-called "volume function" describing the chronological distribution of information in historical texts.
Abstract: In their papers, Kalashnikov et al. (1986), Rachev et al. (1989) and Fomenko et al. (1990) introduced the so-called "volume function" describing the chronological distribution of information in historical texts. Here we give another approach to constructing similar functions.
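The cited papers define the volume function formally; informally, it records how much text a chronicle devotes to each year it covers. A toy construction under that informal reading, with invented data:

def volume_function(chapters):
    # chapters: {year: text devoted to that year}. The function maps each
    # year to the amount of text (characters, here) describing it.
    return {year: len(text) for year, text in sorted(chapters.items())}

# Hypothetical chronicle split by year:
chronicle = {1389: "A long account of the battle ...",
             1390: "Brief note.",
             1391: "Another extensive narrative ..."}
print(volume_function(chronicle))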



Journal ArticleDOI
Øyvind Eide1
TL;DR: This paper will present a publication system in which selected material from letter collections is presented as dialogues between two persons.
Abstract: In this paper, we will present a publication system in which selected material from letter collections is presented as dialogues between two persons.
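The core of such a dialogue view is a date-ordered merge of the two correspondents' letters. A minimal sketch with invented letters:

from datetime import date

def as_dialogue(letters_a, letters_b):
    # Each list holds (date, sender, text) tuples for one side of a
    # correspondence; merging by date yields a dialogue-like view.
    return sorted(letters_a + letters_b, key=lambda letter: letter[0])

a = [(date(1890, 3, 1), "A", "Dear friend, ..."),
     (date(1890, 4, 2), "A", "Thank you for ...")]
b = [(date(1890, 3, 15), "B", "In reply to yours ...")]
for when, sender, text in as_dialogue(a, b):
    print(when, sender + ":", text)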

Journal ArticleDOI
Anne Mahoney1
TL;DR: An encoding for representing quantitative metrical analyses in TEI SGML or XML documents, using only characters from the standard keyboard set, and a system for converting this encoding to other forms for display is described.
Abstract: This paper describes an encoding for representing quantitative metrical analyses in TEI SGML or XML documents, using only characters from the standard keyboard set, and a system for converting this encoding to other forms for display.
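To convey the flavour of such a scheme, here is a hypothetical keyboard notation (not the paper's actual encoding) and a converter to conventional display symbols:

# Hypothetical keyboard encoding: '-' long, 'u' short, 'x' anceps,
# '|' foot boundary; converted to conventional metrical symbols.
DISPLAY = {"-": "\u2013", "u": "\u23D1", "x": "\u00D7", "|": "|"}

def render(scansion):
    return "".join(DISPLAY[c] for c in scansion)

# A dactylic hexameter pattern, say:
print(render("-uu|-uu|-uu|-uu|-uu|-x"))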

Journal ArticleDOI
TL;DR: GIS methodology was used for the purpose of locating the disputed site of a historically significant battle, which took place in 1854 when miners on an Australian goldfield staged an armed uprising against government forces.
Abstract: GIS methodology was used for the purpose of locating the disputed site of a historically significant battle, which took place in 1854 when miners on an Australian goldfield staged an armed uprising against government forces. The route of the first survey of the area (1854) and the earliest known contour map (1856–1857) were overlaid on a modern street grid. Other features such as the vantage points of illustrators and the authors of eyewitness accounts were also incorporated. The resulting composite map was used as the key reference framework for comparing and critically evaluating a large body of primary and secondary written accounts, and for reaching a conclusion concerning the site.
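Overlaying an 1850s survey route and contour map on a modern street grid amounts to estimating a coordinate transform from control points identifiable in both. A least-squares affine fit sketch with invented coordinates:

import numpy as np

def fit_affine(src, dst):
    # Least-squares affine transform mapping historical-map coordinates
    # (src) onto modern grid coordinates (dst); needs >= 3 control points.
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    A = np.hstack([src, np.ones((len(src), 1))])   # rows of [x, y, 1]
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return lambda pts: np.hstack([np.asarray(pts, float),
                                  np.ones((len(pts), 1))]) @ coeffs

# Hypothetical control points: church corners, survey markers, etc.
to_modern = fit_affine(src=[(0, 0), (10, 0), (0, 10), (10, 10)],
                       dst=[(502, 311), (612, 309), (500, 422), (611, 421)])
print(to_modern([(5, 5)]))    # a disputed location in modern coordinates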

Journal ArticleDOI
TL;DR: This paper examines the were-subjunctive in British rural dialects in the light of data from two sources: the Survey of English Dialects (SED) questionnaire, and the Leeds Corpus of English Dialect (LCED), consisting of transcribed recordings made at the same time as the data was gathered for the questionnaire.
Abstract: This paper examines the were-subjunctive in British rural dialects in the light of data from two sources: the Survey of English Dialects (SED) questionnaire, and the Leeds Corpus of English Dialect (LCED), consisting of transcribed recordings made at the same time as the data was gathered for the questionnaire. We begin by surveying previous work on the subjunctive in general, and the were-subjunctive in dialect grammar in particular (section 1), culminating in a discussion of the SED data on the were-subjunctive. We then move on in section 2 to pose two hypotheses: firstly that the SED does not provide a complete picture of this phenomenon and thus corpus data may be of use in enriching it; secondly a "null" hypothesis that no were-subjunctive is consistently marked in the dialects in question. We then look at the methodology and data used (section 3), describing the source of our data, the LCED. We also note some potential difficulties (3.1) before moving on to discuss the choice of an area of England to examine (3.2) and of texts to analyse (3.3). In section 3.4 we describe the mark-up scheme used in the analysis of the texts, and in 3.5 the process of annotation and extraction of results from the texts. These results are presented in section 4. We consider the corpus data in relation to the questionnaire data (4.1), and to our two hypotheses (4.2 and 4.3). In our Conclusion (section 5) we summarise the implications of this study and consider some possible future routes of enquiry into the were-subjunctive in the rural dialects of England.
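Once the transcripts are machine-readable, candidate were-subjunctive contexts can be pulled out mechanically before hand-checking. A crude first-pass filter (not the paper's mark-up scheme; the file name is a placeholder):

import re

# First-pass filter: was/were within a few words of a hypothetical trigger
# (if, wish, suppose); hits still need manual annotation.
PATTERN = re.compile(
    r"\b(if|wish|wished|suppose|supposing)\b(?:\W+\w+){0,4}?\W+(was|were)\b",
    re.IGNORECASE)

for line in open("lced_transcript.txt"):        # hypothetical file name
    m = PATTERN.search(line)
    if m:
        print(m.group(2).lower(), "|", line.strip())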