Showing papers in &quot;Literary and Linguistic Computing in 2002&quot;

Automatically Categorizing Written Texts by Author Gender

764 citations

Journal Article•DOI•

[...]

Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni

‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship

TL;DR: It is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy.

...read moreread less

Abstract: The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98 per cent accuracy.

...read moreread less

667 citations

Journal Article•DOI•

[...]

John Burrows

Radiant Textuality: Literature after the World Wide Web

TL;DR: A new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship, which offers a simple but comparatively accurate addition to current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length.

...read moreread less

Abstract: This paper is a companion to my 'Questions of authorship: attribution and beyond', in which I sketched a new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship. The main emphasis of that paper was not on the new procedure but on the broader consequences of our increasing sophistication in making such comparisons and the increasing (although never absolute) reliability of our inferences about authorship. My present objects, accordingly, are to give a more complete account of the procedure itself; to report the outcome of an extensive set of trials; and to consider the strengths and limitations of the new procedure. The procedure offers a simple but comparatively accurate addition to our current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length. It is of even greater value as a method of reducing the field of likely candidates for texts of as little as 100 words in length. Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.

...read moreread less

457 citations

Journal Article•DOI•

[...]

Dirk Van Hulle

Empirical Methods for Exploiting Parallel Texts

133 citations

Journal Article•DOI•

[...]

Dan Tufis

Corpus Linguistics at Work (Studies in Corpus Linguistics.)

82 citations

Journal Article•DOI•

[...]

Vincent B. Y. Ooi

Frequent Word Sequences and Statistical Stylistics

63 citations

Journal Article•DOI•

[...]

David L. Hoover

Stemming of Amharic Words for Information Retrieval

TL;DR: It is suggested that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies and produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora.

...read moreread less

Abstract: This paper investigates the relative effectiveness and accuracy of multivariate analysis, specifically cluster analysis, of the frequencies of very frequent words and the frequencies of very frequent word sequences in distinguishing texts by different authors and grouping texts by a single author. Cluster analyses based on frequent words are fairly accurate for groups of texts by known authors, whether the texts are long sections of modern British and US novels or shorter sections of contemporary literary critical texts, but they are only rarely completely accurate. When frequent word sequences are used instead of frequent words or in addition to them, however, the accuracy of the analyses often improves, sometimes dramatically, especially when personal pronouns are eliminated. Analyses based on frequent sequences even provide completely correct results in some cases where analyses based on frequent words fail. They also produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora. Such successes suggest that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies.

...read moreread less

58 citations

Journal Article•DOI•

[...]

Nega Alemayehu, Peter Willett

01 Apr 2002-Literary and Linguistic Computing

TL;DR: An iterative stemmer has been developed that involves the removal of both prefixes and suffixes and that also takes account of letter inconsistency and reiterative verb forms.

...read moreread less

Abstract: This paper presents a stemmer for processing document and query words to facilitate searching databases of Amharic text. An iterative stemmer has been developed that involves the removal of both prefixes and suffixes and that also takes account of letter inconsistency and reiterative verb forms. Application of the stemmer to a test file of 1221 words suggested that appropriate stems were generated for ca. 95 per cent of them, with only limited overstemming and understemming.

...read moreread less

47 citations

Journal Article•DOI•

Words And Phrases

[...]

Oliver Mason

Pause patterns in Shakespeare's verse: Canon and chronology

35 citations

Journal Article•DOI•

[...]

Macd. P. Jackson

01 Apr 2002-Literary and Linguistic Computing

TL;DR: Cette analyse reprend l'analyse faite par Ants Ora sur la variation du nombre de pauses, et leur place par rapport a la fin of the ligne, que l'on peut mettre en rapport avec l'evolution de the versification au seizieme siecle.

...read moreread less

Abstract: L'A. s'interesse aux pauses syntaxiques et aux parametres iambiques de la versification de Shakespeare. Il reprend l'analyse faite par Ants Ora en 1960 sur la variation du nombre de pauses, et leur place par rapport a la fin de la ligne, que l'on peut mettre en rapport avec l'evolution de la versification au seizieme siecle. L'A. commente cette analyse ainsi que les resultats obtenus et propose une methode statistique reutilisant les donnees de Ora afin de dater les pieces de Shakespeare ou de discuter de la paternite de certaines.

...read moreread less

30 citations

Journal Article•DOI•

Word alignment in English-Chinese parallel corpora

[...]

Scott Piao

Computer‐Assisted Teaching of Translation Methods

TL;DR: A hybrid algorithm for English–Chinese word alignment, which incorporates co‐occurrence association measures, word distribution distances, English word lemmatization, and part‐of‐speech information is presented.

...read moreread less

Abstract: Word alignment in bilingual or multilingual parallel corpora has been a challenging issue for natural language engineering. An efficient algorithm for automatically aligning word translation equivalents across different languages will be of use for a number of practical applications such as multilingual lexical construction, machine translation, etc. This paper presents a hybrid algorithm for English–Chinese word alignment, which incorporates co‐occurrence association measures, word distribution distances, English word lemmatization, and part‐of‐speech information. Eleven co‐occurrence association coefficients and eight distance measures of word distribution are explored to compare their efficiency for word alignment. The paper also describes an experiment in which the algorithm is evaluated on sentence‐aligned English–Chinese parallel corpora. In the experiment, the algorithm produced encouraging success rates on two test corpora, with the highest success rate of 89.37 per cent. It provides a practical tool for extracting word translation equivalents from English–Chinese parallel corpora.

...read moreread less

Journal Article•DOI•

[...]

Chi‐Chiang Shei, Helen Pain

The Czech National Corpus: Principles, Design, and Results

TL;DR: An intelligent tutoring system designed to help student translators learn to appreciate the distinction between literal translation and liberal translation, an important and forever debated point in the literature of translation, and some other methods of translation lying between these two extremes are introduced.

...read moreread less

Abstract: This paper introduces an intelligent tutoring system designed to help student translators learn to appreciate the distinction between literal translation and liberal translation, an important and forever debated point in the literature of translation, and some other methods of translation lying between these two extremes. We identify four prominent kinds of translation methods commonly discussed in the translation literature - word-for-word translation, literal translation, semantic translation, and communicative translation - and attempt to extract computationally expedient definitions for them from two researchers' discussions on them. We then apply these computational definitions to the preparation of our translation corpus to be used in the intelligent tutoring system. In the basic working mode the system offers a source sentence for the student to translate, compares it with the inbuilt versions, and decides on the most likely method of translation used through a translation unit matching algorithm. The student can guess where on the literal and liberal continuum their translation stands by viewing this verdict and by comparing their translation with other versions for the same sentence. In the advanced working mode, the student learns some translation techniques such as the contrastive analysis approach to teaching translation, while appreciating the working of translation methods in relation to these techniques.

...read moreread less

Journal Article•DOI•

[...]

Karel Kucera

How Accurate Were Scribes? A Mathematical Model

TL;DR: The work on the CNC project has resulted in the completion of SYN2000, a 100-million-word corpus of contemporary written Czech, the organization of the cores of spoken, diachronic, and dialectal corpora, and the finding of workable solutions to some general theoretical problems involved in the building of these corpora.

...read moreread less

Abstract: This paper describes the general principles, design, and present state of the Czech National Corpus (CNC) project. The corpus has been designed to provide a firm basis for the study of both the contemporary written Czech (a goal well attainable with the present resources) and the Czech language beyond the limits of contemporary written texts (a long-term commitment including the building of a corpus of spoken Czech and diachronic and dialectal corpora). The work on the CNC project, now in the eighth year of its official existence, has resulted in the completion of SYN2000, a 100-million-word corpus of contemporary written Czech, the organization of the cores of spoken, diachronic, and dialectal corpora, and the finding of workable solutions to some general theoretical problems involved in the building of these corpora.

...read moreread less

Journal Article•DOI•

[...]

Matthew Spencer, Christopher J. Howe

Corpus Linguistics, second edition. Edinburgh Textbooks in Empirical Linguistics

TL;DR: An indirect method is developed that derives a relationship between the number of manuscripts in the tradition and the mean number of copies separating a randomly chosen pair of manuscripts, and which can be used to estimate the probability of change.

...read moreread less

Abstract: Until printing was invented, texts were copied by hand. The probability with which changes were introduced during copying was affected by the kind of text and society. We cannot usually estimate the probability of change directly. Instead, we develop an indirect method. We derive a relationship between the number of manuscripts in the tradition and the mean number of copies separating a randomly chosen pair of manuscripts. Given the rate at which the proportion of words that are different increases with the mean number of copies separating two manuscripts, we can then estimate the probability of change. We illustrate our method with an analysis of Lydgate's medieval poem The Kings of England.

...read moreread less

Journal Article•DOI•

[...]

Fiona M. Douglas

Linguistic Computing in the Shadow of Postmodernism

Journal Article•DOI•

[...]

Thomas Merriam

Developing Conceptual Glossaries for the Latin Vulgate Bible

TL;DR: The philosophical roots of the postmodernist critiques are destructive of any attempt to discover who wrote what, and authorship attribution is now distinguished from authorship ascription with only the latter applying to literary and linguistic computing.

...read moreread less

Abstract: Determination of authorship, using methods such as those of Mosteller and Wallace, has been obliquely criticized by literary scholars for years However, the most radical critique of these methods has, under the umbrella term 'theory', emerged since the 1960s in the writings of Barthes, Foucault, and Derrida Thanks to their influence, authorship attribution is now distinguished from authorship ascription with only the latter applying to literary and linguistic computing Various criticisms are examined in detail Useful as these criticisms are, the philosophical roots of the postmodernist critiques are destructive of any attempt to discover who wrote what

...read moreread less

Journal Article•DOI•

[...]

Andrew Wilson¹•Institutions (1)

Lancaster University¹

Keyword Extraction from Ancient Greek Literary Texts

TL;DR: The background to the project is described and the steps by which the glossaries are developed within a relational database framework are outlined.

...read moreread less

Abstract: A conceptual glossary is a textual reference work that combines the features of a thesaurus and an index verborum. In it, the word occurrences within a given text are classified, disambiguated, and indexed according to their membership of a set of conceptual (i.e. semantic) fields. Since 1994, we have been working towards building a set of conceptual glossaries for the Latin Vulgate Bible. So far, we have published a conceptual glossary to the Gospel according to John and are at present completing the analysis of the Gospel according to Mark and the minor epistles. This paper describes the background to our project and outlines the steps by which the glossaries are developed within a relational database framework.

...read moreread less

Journal Article•DOI•

[...]

Jeffrey A. Rydberg-Cox

Encoding Medieval Abbreviations for Computer Analysis (from Latin–Portuguese and Portuguese Non‐literary Sources)

TL;DR: The modifications required to the traditional tf* idf keyword discovery algorithm are described so that it will extract valid keywords from literary texts written in Ancient Greek.

...read moreread less

Abstract: Automatic keyword extraction is an extremely interesting prospect for computational humanists because of its potential as a tool to aid scholarship in the humanities. Keyword discovery routines can help organize large collections of texts and perhaps even guide scholars to the discovery of important elements in their source materials. It is not clear, however, that the methods designed to extract keywords from paper abstracts or newswire texts will be effective for literary texts that are not written in English. This paper describes the modifications required to the traditional tf* idf keyword discovery algorithm so that it will extract valid keywords from literary texts written in Ancient Greek.

...read moreread less

Journal Article•DOI•

[...]

Stephen Parkinson, António H. A. Emiliano

Modelling a Morpheme-based Lexicon for Modern Greek

TL;DR: This paper proposes a solution to the problem of handling scribal abbreviations in TEI-conformant transcriptions of medieval texts, following a conservative editorial strategy.

...read moreread less

Abstract: This paper proposes a solution to the problem of handling scribal abbreviations in TEI-conformant transcriptions of medieval texts, following a conservative editorial strategy A key distinction is drawn between alphabetic abbreviations, which represent sequences of letters, and logographic abbreviations which represent whole words The TEI elements and can be used systematically to separate these two types: alphabetic abbreviations will be expanded in the main text, recording the abbreviated form (including TEI entities representing the main abbreviation marks) as an attribute of , while logographic abbreviations will be represented in their abbreviated form, with the expanded form recorded as an attribute of The proposals are illustrated from common abbreviations and short text samples from tenth-century Latin-Portuguese and thirteenth-century Old Portuguese

...read moreread less

Journal Article•DOI•

[...]

Evangelos C. Papakitsos, Maria Grigoriadou, Giorgos Philokyprou

‘Pioneers! O Pioneers!’: Lessons in Electronic Editing from Stijn Streuvels's De teleurgang van den Waterhoek

TL;DR: This paper presents a method for designing and organizing a multi-purpose morpheme-based lexical database for Modern Greek using the Entity/Relationship model, according to the linguistic theory of Generative Lexical Morphology.

...read moreread less

Abstract: This paper presents a method for designing and organizing a multi-purpose morpheme-based lexical database for Modern Greek. The authors are in favour of multi-purpose lexical databases, to avoid a repetition of effort from one application to another, and of morpheme-based lexica, to achieve flexibility, reusability, expandability, and compact representation of data for future developments. The suggested method for modelling the lexical database in the word-processing function is the Entity/Relationship model, according to the linguistic theory of Generative Lexical Morphology. In the framework of this model, which depicts rich linguistic information, we can introduce new data structures for storing the morphemes. These new data structures are matrix encoding schemes; one type, called the Cartesian Lexicon, has been designed as a part of our research. The matrix data structures combine the advantages of hash-tables and tries, which are very popular data structures in supporting machine readable dictionaries. Our system was tested on the Modern Greek language, and demonstrated a satisfactory overall performance in word-processing. These methods could also be applicable to other languages having morphological systems similar to Modern Greek.

...read moreread less

Journal Article•DOI•

[...]

Daniel Paul O'Donnell

The Hengwrt Chaucer. Digital Facsimile

Journal Article•DOI•

[...]

Herman Brinkman

A Complete and Comprehensive System for Modern Greek Language Processing Proposed as a Modern Greek Language Call Method Developer

Journal Article•DOI•

[...]

Socrates D. Baldzis, Eugenia Eumeridou, Stavros A. Kolalas

A Dictionary of the Internet

TL;DR: The system comprises a parser and generator for Modern Greek sentences as well as a computational lexicon, encoding morphological, syntactic, and semantic information for words.

...read moreread less

Abstract: In this paper, we put forward a fully developed system for the teaching of Modern Greek Language (MGL). The system comprises a parser and generator for Modern Greek sentences as well as a computational lexicon, encoding morphological, syntactic, and semantic information for words. In this paper, we present the major components of the system, highlighting their suitability for the teaching of MGL in an experimental, open, and cooperative educational environment. The proposed system can be used either in a classroom environment or by Internet correspondence for the teaching of MGL as a native or foreign language.

...read moreread less

Journal Article•DOI•

[...]

Jean Aitchison

What's All the Hype in Hypertext About? A Humanities Computing Colloquium

TL;DR: Dictionary of the internet will lead you to love reading starting from now, because dictionary is the window to open the new world and reading will give you the kindness.

...read moreread less

Abstract: We may not be able to make you love reading, but dictionary of the internet will lead you to love reading starting from now. Book is the window to open the new world. The world that you want is in the better stage and level. World will always guide you to even the prestige stage of the life. You know, this is some of how reading will give you the kindness. In this case, more books you read more knowledge you know, but it can mean also the bore is full.

...read moreread less

Journal Article•DOI•

[...]

Susan Schreibman

01 Apr 2002-Literary and Linguistic Computing

Journal Article•DOI•

Web-based Dictionaries for Languages of the South-west USA

[...]

Sonya Bird¹, Michael Hammond, Maria Amarillas¹, Melody Jeffcoat¹, Heidi Harley, Mizuki Miyashita¹, Laura A. Moll¹, Mary Ann Willie, Ofelia Zepeda - Show less +5 more•Institutions (1)

University of Arizona¹

An SDRT Approach to the Temporal Structure of Modern Greek Narrative Texts

TL;DR: This paper outlines a project to create electronic dictionaries of indigenous languages of the south-west USA and make them available over the Web for language instruction as well as for linguistic, psycholinguistic, and anthropological research.

...read moreread less

Abstract: This paper outlines a project currently under way in the Linguistics Department at the University of Arizona to create electronic dictionaries of indigenous languages of the south-west USA and make them available over the Web for language instruction as well as for linguistic, psycholinguistic, and anthropological research. Working with three languages - Tohono O'odham, Navajo, and Hiaki - we have created an XML scheme that serves as a general template for structuring and archiving language databases. We describe the process of compiling databases for different languages and converting these databases to XML, which contains all the relevant information in a manner that is easily accessible. We discuss the general programming scheme used for searching, and the interfaces used for presenting the dictionary on the Web, which include several front ends for different user groups. We end with a discussion of how to ensure that special characters are displayed properly on the Web.

...read moreread less

Journal Article•DOI•

[...]

Eleni Galiotou

Genealogical Classification of Saddharmapundarika Manuscripts Based on Many-Variable Analysis

TL;DR: An attempt to analyse the temporal structure of discourse in Modern Greek following the principles of Asher's Segmented Discourse Representation Theory and the use of linguistic knowledge for the determination of these relations.

...read moreread less

Abstract: We describe an attempt to analyse the temporal structure of discourse in Modern Greek following the principles of Asher's Segmented Discourse Representation Theory. We focus on discourse relations of a temporal and causal interest and the use of linguistic knowledge for the determination of these relations. This analysis is applied to a corpus of short newspaper articles reporting car accidents in Modern Greek and the discourse grammar is implemented using the Attribute Logic Engine.

...read moreread less

Journal Article•DOI•

[...]

Yumi Ousaka, Moriichi Yamazaki

Interpolations, Pseudographs, and the New Testament Epistles

TL;DR: For the genealogical classification of these manuscripts, principal component analysis and cluster analysis, which describe the similarities between the verses of the different manuscripts, were applied and could successfully classify these manuscripts into two large groups and several smaller groups.

...read moreread less

Abstract: Many manuscripts of the Saddharmapundarika, which are among the most important manuscripts for the study of Buddhism, have been discovered in very different localities and are classified according to their place of discovery into the following three groups: Nepalese, Kashmirian, and Central Asian manuscripts. For the genealogical classification of these manuscripts, principal component analysis and cluster analysis, which describe the similarities between the verses of the different manuscripts, were applied to the data. As a result, we could successfully classify these manuscripts into two large groups and several smaller groups: one large group consists of ten paper manuscripts from Nepal and the other comprises nine palm-leaf plus two paper manuscripts. The Kashmir and Central Asian manuscripts and a few of the Nepal manuscripts belong to the small groups.

...read moreread less

Journal Article•DOI•

[...]

George K. Barr