
Showing papers in "Literary and Linguistic Computing in 2002"




Journal ArticleDOI
TL;DR: It is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy.
Abstract: The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98 per cent accuracy.

667 citations
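A minimal sketch of automated text categorization in the spirit of the result above: simple lexical features (frequent word unigrams) fed to a linear classifier. The feature set and classifier are illustrative stand-ins for the paper's combination of lexical and syntactic features, and the documents are placeholders.

```python
# Minimal sketch of feature-based categorization by author gender; the features
# and classifier here are illustrative stand-ins, not the authors' exact setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "placeholder text of a formal document written by a female author",
    "placeholder text of a formal document written by a male author",
]
train_labels = ["female", "male"]

model = make_pipeline(
    CountVectorizer(max_features=500),   # frequent words as simple lexical cues
    LogisticRegression(max_iter=1000),
)
model.fit(train_docs, train_labels)

print(model.predict(["an unseen formal written document to classify"]))
```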


Journal ArticleDOI
TL;DR: A new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship, which offers a simple but comparatively accurate addition to current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length.
Abstract: This paper is a companion to my 'Questions of authorship: attribution and beyond', in which I sketched a new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship. The main emphasis of that paper was not on the new procedure but on the broader consequences of our increasing sophistication in making such comparisons and the increasing (although never absolute) reliability of our inferences about authorship. My present objects, accordingly, are to give a more complete account of the procedure itself; to report the outcome of an extensive set of trials; and to consider the strengths and limitations of the new procedure. The procedure offers a simple but comparatively accurate addition to our current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length. It is of even greater value as a method of reducing the field of likely candidates for texts of as little as 100 words in length. Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.

457 citations
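One common way to operationalize comparisons based on the relative frequencies of very common words, along the general lines sketched above though not necessarily the paper's exact procedure, is to standardize each word's frequency across the candidate set and score a disputed text by its mean absolute z-score difference from each candidate. The frequencies below are placeholders.

```python
import numpy as np

def mean_zscore_distance(rel_freqs, test_idx):
    """rel_freqs: (n_texts, n_words) relative frequencies of the most common words.
    Returns the mean absolute z-score difference between the text at test_idx
    and every text in the set (smaller = stylistically closer)."""
    z = (rel_freqs - rel_freqs.mean(axis=0)) / rel_freqs.std(axis=0)
    return np.abs(z - z[test_idx]).mean(axis=1)

# Placeholder data: three candidate texts plus one disputed text, five common words.
rel_freqs = np.array([
    [0.061, 0.032, 0.021, 0.017, 0.012],
    [0.055, 0.035, 0.019, 0.015, 0.014],
    [0.048, 0.030, 0.025, 0.018, 0.011],
    [0.060, 0.033, 0.020, 0.016, 0.013],   # the disputed text
])
print(mean_zscore_distance(rel_freqs, test_idx=3))
```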





Journal ArticleDOI
TL;DR: It is suggested that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies and produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora.
Abstract: This paper investigates the relative effectiveness and accuracy of multivariate analysis, specifically cluster analysis, of the frequencies of very frequent words and the frequencies of very frequent word sequences in distinguishing texts by different authors and grouping texts by a single author. Cluster analyses based on frequent words are fairly accurate for groups of texts by known authors, whether the texts are long sections of modern British and US novels or shorter sections of contemporary literary critical texts, but they are only rarely completely accurate. When frequent word sequences are used instead of frequent words or in addition to them, however, the accuracy of the analyses often improves, sometimes dramatically, especially when personal pronouns are eliminated. Analyses based on frequent sequences even provide completely correct results in some cases where analyses based on frequent words fail. They also produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora. Such successes suggest that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies.

58 citations
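A minimal sketch of the kind of analysis described above: relative frequencies of word bigrams, with personal pronouns removed, fed into a hierarchical cluster analysis. The text sections below are toy stand-ins for the novel and critical-text samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

# Toy text sections standing in for long samples by known authors.
sections = {
    "AuthorA_1": "the old house stood at the end of the lane and the rain fell",
    "AuthorA_2": "the old garden lay at the end of the lane and the light fell",
    "AuthorB_1": "money was scarce and the family moved to the city in winter",
    "AuthorB_2": "work was scarce and the family moved to the coast in summer",
}
personal_pronouns = ["i", "you", "he", "she", "it", "we", "they",
                     "me", "him", "her", "us", "them"]

vec = CountVectorizer(ngram_range=(2, 2), stop_words=personal_pronouns)
X = vec.fit_transform(sections.values()).toarray().astype(float)
X = X / X.sum(axis=1, keepdims=True)      # relative frequencies of word sequences

clusters = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
print(dict(zip(sections, clusters)))      # ideally groups the sections by author
```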


Journal ArticleDOI
TL;DR: An iterative stemmer has been developed that involves the removal of both prefixes and suffixes and that also takes account of letter inconsistency and reiterative verb forms.
Abstract: This paper presents a stemmer for processing document and query words to facilitate searching databases of Amharic text. An iterative stemmer has been developed that involves the removal of both prefixes and suffixes and that also takes account of letter inconsistency and reiterative verb forms. Application of the stemmer to a test file of 1221 words suggested that appropriate stems were generated for ca. 95 per cent of them, with only limited overstemming and understemming.

47 citations
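The iterative affix-stripping idea can be sketched as below. The prefix and suffix lists are invented placeholders given in transliteration, not the paper's Amharic affix inventory, and the handling of letter inconsistency and reiterative verb forms is omitted.

```python
# Illustrative iterative affix-stripping stemmer in the general style described above.
PREFIXES = ["ye", "be", "le"]        # hypothetical prefixes
SUFFIXES = ["och", "u", "wa", "n"]   # hypothetical suffixes
MIN_STEM = 2                         # never strip below this many characters

def stem(word):
    changed = True
    while changed:                   # iterate until no further affix can be removed
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
                word, changed = word[len(p):], True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
                word, changed = word[:-len(s)], True
    return word

print(stem("yebetoch"))              # -> "bet" with these toy affix lists
```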




Journal ArticleDOI
TL;DR: The analysis revisits Ants Oras's study of the variation in the number of pauses and their placement relative to the end of the line, which can be related to the evolution of versification in the sixteenth century.
Abstract: The author examines syntactic pauses and iambic parameters in Shakespeare's versification. He revisits the analysis carried out by Ants Oras in 1960 of the variation in the number of pauses and their placement relative to the end of the line, which can be related to the evolution of versification in the sixteenth century. The author comments on this analysis and the results obtained, and proposes a statistical method that reuses Oras's data in order to date Shakespeare's plays or to discuss the authorship of some of them.

30 citations


Journal ArticleDOI
TL;DR: A hybrid algorithm for English–Chinese word alignment, which incorporates co‐occurrence association measures, word distribution distances, English word lemmatization, and part‐of‐speech information is presented.
Abstract: Word alignment in bilingual or multilingual parallel corpora has been a challenging issue for natural language engineering. An efficient algorithm for automatically aligning word translation equivalents across different languages will be of use for a number of practical applications such as multilingual lexical construction, machine translation, etc. This paper presents a hybrid algorithm for English–Chinese word alignment, which incorporates co‐occurrence association measures, word distribution distances, English word lemmatization, and part‐of‐speech information. Eleven co‐occurrence association coefficients and eight distance measures of word distribution are explored to compare their efficiency for word alignment. The paper also describes an experiment in which the algorithm is evaluated on sentence‐aligned English–Chinese parallel corpora. In the experiment, the algorithm produced encouraging success rates on two test corpora, with the highest success rate of 89.37 per cent. It provides a practical tool for extracting word translation equivalents from English–Chinese parallel corpora.
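One ingredient of the hybrid algorithm described above can be sketched as follows: scoring candidate word pairs from a sentence-aligned corpus with a co-occurrence association measure (the Dice coefficient here, one of many the paper compares). The two sentence pairs are toy data, and the Chinese side is crudely split into single characters for illustration only.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned English-Chinese corpus (placeholder data).
corpus = [
    ("the cat sleeps".split(), list("猫在睡觉")),
    ("the dog sleeps".split(), list("狗在睡觉")),
]

e_count, c_count, pair_count = Counter(), Counter(), Counter()
for e_sent, c_sent in corpus:
    for e in set(e_sent):
        e_count[e] += 1
    for c in set(c_sent):
        c_count[c] += 1
    for e, c in product(set(e_sent), set(c_sent)):
        pair_count[e, c] += 1

def dice(e, c):
    # 2 * joint sentence count / sum of marginal sentence counts
    return 2 * pair_count[e, c] / (e_count[e] + c_count[c])

ranked = sorted(((dice(e, c), e, c) for e, c in pair_count), reverse=True)
print(ranked[:5])                    # highest-scoring candidate translation pairs
```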

Journal ArticleDOI
TL;DR: An intelligent tutoring system helps student translators appreciate the distinction between literal translation and liberal translation, an important and long-debated point in the literature of translation, as well as other methods of translation lying between these two extremes.
Abstract: This paper introduces an intelligent tutoring system designed to help student translators learn to appreciate the distinction between literal translation and liberal translation, an important and long-debated point in the literature of translation, as well as some other methods of translation lying between these two extremes. We identify four prominent kinds of translation method commonly discussed in the translation literature - word-for-word translation, literal translation, semantic translation, and communicative translation - and attempt to extract computationally expedient definitions for them from two researchers' discussions of them. We then apply these computational definitions to the preparation of the translation corpus used in the intelligent tutoring system. In the basic working mode, the system offers a source sentence for the student to translate, compares the student's translation with the inbuilt versions, and decides on the most likely method of translation used through a translation unit matching algorithm. The student can gauge where on the literal-liberal continuum their translation stands by viewing this verdict and by comparing their translation with the other versions of the same sentence. In the advanced working mode, the student learns translation techniques such as the contrastive analysis approach to teaching translation, while appreciating how the translation methods work in relation to these techniques.
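As a toy illustration of the kind of verdict the system produces: compare the student's translation with inbuilt versions labelled by translation method and report the closest one. Plain word overlap is used here in place of the paper's translation unit matching algorithm, and the sentences and labels are invented examples.

```python
def overlap(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)           # Jaccard similarity of word sets

inbuilt_versions = {
    "word-for-word": "he has a heart of stone",
    "literal":       "he has a stone heart",
    "semantic":      "he is hard-hearted",
    "communicative": "he is utterly unfeeling",
}

student_translation = "he has a heart made of stone"
verdict = max(inbuilt_versions,
              key=lambda m: overlap(student_translation, inbuilt_versions[m]))
print(verdict)                               # position on the literal-liberal continuum
```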

Journal ArticleDOI
TL;DR: The work on the CNC project has resulted in the completion of SYN2000, a 100-million-word corpus of contemporary written Czech, the organization of the cores of spoken, diachronic, and dialectal corpora, and the finding of workable solutions to some general theoretical problems involved in the building of these corpora.
Abstract: This paper describes the general principles, design, and present state of the Czech National Corpus (CNC) project. The corpus has been designed to provide a firm basis for the study of both the contemporary written Czech (a goal well attainable with the present resources) and the Czech language beyond the limits of contemporary written texts (a long-term commitment including the building of a corpus of spoken Czech and diachronic and dialectal corpora). The work on the CNC project, now in the eighth year of its official existence, has resulted in the completion of SYN2000, a 100-million-word corpus of contemporary written Czech, the organization of the cores of spoken, diachronic, and dialectal corpora, and the finding of workable solutions to some general theoretical problems involved in the building of these corpora.

Journal ArticleDOI
TL;DR: An indirect method is developed that derives a relationship between the number of manuscripts in the tradition and the mean number of copies separating a randomly chosen pair of manuscripts, and which can be used to estimate the probability of change.
Abstract: Until printing was invented, texts were copied by hand. The probability with which changes were introduced during copying was affected by the kind of text and society. We cannot usually estimate the probability of change directly. Instead, we develop an indirect method. We derive a relationship between the number of manuscripts in the tradition and the mean number of copies separating a randomly chosen pair of manuscripts. Given the rate at which the proportion of words that are different increases with the mean number of copies separating two manuscripts, we can then estimate the probability of change. We illustrate our method with an analysis of Lydgate's medieval poem The Kings of England.
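A back-of-the-envelope version of the final estimation step, under a simplifying assumption of my own rather than the paper's exact derivation: if each word is altered independently with probability p at every copying and changes are never reversed, the expected proportion of differing words after k copies is d = 1 - (1 - p)^k, so p = 1 - (1 - d)^(1/k).

```python
def estimate_change_probability(d, mean_copies):
    """d: observed mean proportion of differing words between manuscript pairs.
    mean_copies: mean number of copies separating a randomly chosen pair."""
    return 1 - (1 - d) ** (1 / mean_copies)

# Placeholder numbers: pairs differ in 12% of words, separated by 8 copies on average.
print(estimate_change_probability(0.12, 8))   # ~0.016 per word per copying event
```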


Journal ArticleDOI
TL;DR: The philosophical roots of the postmodernist critiques are destructive of any attempt to discover who wrote what, and authorship attribution is now distinguished from authorship ascription with only the latter applying to literary and linguistic computing.
Abstract: Determination of authorship, using methods such as those of Mosteller and Wallace, has been obliquely criticized by literary scholars for years. However, the most radical critique of these methods has, under the umbrella term 'theory', emerged since the 1960s in the writings of Barthes, Foucault, and Derrida. Thanks to their influence, authorship attribution is now distinguished from authorship ascription, with only the latter applying to literary and linguistic computing. Various criticisms are examined in detail. Useful as these criticisms are, the philosophical roots of the postmodernist critiques are destructive of any attempt to discover who wrote what.

Journal ArticleDOI
TL;DR: The background to the project is described and the steps by which the glossaries are developed within a relational database framework are outlined.
Abstract: A conceptual glossary is a textual reference work that combines the features of a thesaurus and an index verborum. In it, the word occurrences within a given text are classified, disambiguated, and indexed according to their membership of a set of conceptual (i.e. semantic) fields. Since 1994, we have been working towards building a set of conceptual glossaries for the Latin Vulgate Bible. So far, we have published a conceptual glossary to the Gospel according to John and are at present completing the analysis of the Gospel according to Mark and the minor epistles. This paper describes the background to our project and outlines the steps by which the glossaries are developed within a relational database framework.
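A minimal relational sketch of the kind of structure described above: word occurrences in a text classified by lemma and conceptual (semantic) field. The table and column names are illustrative, not the project's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE field      (id INTEGER PRIMARY KEY, label TEXT);      -- conceptual field
CREATE TABLE lemma      (id INTEGER PRIMARY KEY, headword TEXT);
CREATE TABLE occurrence (id INTEGER PRIMARY KEY,
                         verse    TEXT,          -- reference, e.g. 'John 1:4'
                         surface  TEXT,          -- word form as it occurs
                         lemma_id INTEGER REFERENCES lemma(id),
                         field_id INTEGER REFERENCES field(id));   -- disambiguated sense
""")
db.execute("INSERT INTO field VALUES (1, 'light and darkness')")
db.execute("INSERT INTO lemma VALUES (1, 'lux')")
db.execute("INSERT INTO occurrence VALUES (1, 'John 1:4', 'lux', 1, 1)")

# Index-verborum-style query: every occurrence filed under its conceptual field.
for row in db.execute("""SELECT f.label, l.headword, o.verse, o.surface
                         FROM occurrence o
                         JOIN lemma l ON l.id = o.lemma_id
                         JOIN field f ON f.id = o.field_id"""):
    print(row)
```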

Journal ArticleDOI
TL;DR: The modifications required to the traditional tf*idf keyword discovery algorithm are described so that it will extract valid keywords from literary texts written in Ancient Greek.
Abstract: Automatic keyword extraction is an extremely interesting prospect for computational humanists because of its potential as a tool to aid scholarship in the humanities. Keyword discovery routines can help organize large collections of texts and perhaps even guide scholars to the discovery of important elements in their source materials. It is not clear, however, that the methods designed to extract keywords from paper abstracts or newswire texts will be effective for literary texts that are not written in English. This paper describes the modifications required to the traditional tf*idf keyword discovery algorithm so that it will extract valid keywords from literary texts written in Ancient Greek.
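For reference, the baseline tf*idf scoring that the paper adapts can be sketched as follows; the Greek-specific modifications themselves are not reproduced here, and the two unaccented sample lines stand in for a real collection of literary texts.

```python
import math
from collections import Counter

docs = {
    "text_a": "μηνιν αειδε θεα πηληιαδεω αχιληος".split(),
    "text_b": "ανδρα μοι εννεπε μουσα πολυτροπον".split(),
}

N = len(docs)
df = Counter(w for words in docs.values() for w in set(words))   # document frequency

def keywords(name, k=3):
    tf = Counter(docs[name])                                     # term frequency
    scores = {w: (c / len(docs[name])) * math.log(N / df[w]) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(keywords("text_a"))
```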

Journal ArticleDOI
TL;DR: This paper proposes a solution to the problem of handling scribal abbreviations in TEI-conformant transcriptions of medieval texts, following a conservative editorial strategy.
Abstract: This paper proposes a solution to the problem of handling scribal abbreviations in TEI-conformant transcriptions of medieval texts, following a conservative editorial strategy. A key distinction is drawn between alphabetic abbreviations, which represent sequences of letters, and logographic abbreviations, which represent whole words. The TEI elements <expan> and <abbr> can be used systematically to separate these two types: alphabetic abbreviations will be expanded in the main text, recording the abbreviated form (including TEI entities representing the main abbreviation marks) as an attribute of <expan>, while logographic abbreviations will be represented in their abbreviated form, with the expanded form recorded as an attribute of <abbr>. The proposals are illustrated with common abbreviations and short text samples from tenth-century Latin-Portuguese and thirteenth-century Old Portuguese.
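A toy rendering of the two encodings as strings; the sample abbreviation, expansion, and entity names are invented, not taken from the paper's Latin-Portuguese or Old Portuguese samples.

```python
def alphabetic(expanded, abbreviated):
    # alphabetic abbreviation: expansion in the main text, abbreviated form as attribute
    return f'<expan abbr="{abbreviated}">{expanded}</expan>'

def logographic(abbreviated, expanded):
    # logographic abbreviation: abbreviated form in the main text, expansion as attribute
    return f'<abbr expan="{expanded}">{abbreviated}</abbr>'

print(alphabetic("que", "q&abmark;"))   # hypothetical entity for the abbreviation mark
print(logographic("&et;", "et"))        # hypothetical entity for a whole-word sign
```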

Journal ArticleDOI
TL;DR: This paper presents a method for designing and organizing a multi-purpose morpheme-based lexical database for Modern Greek using the Entity/Relationship model, according to the linguistic theory of Generative Lexical Morphology.
Abstract: This paper presents a method for designing and organizing a multi-purpose morpheme-based lexical database for Modern Greek. The authors are in favour of multi-purpose lexical databases, to avoid a repetition of effort from one application to another, and of morpheme-based lexica, to achieve flexibility, reusability, expandability, and compact representation of data for future developments. The suggested method for modelling the lexical database in the word-processing function is the Entity/Relationship model, according to the linguistic theory of Generative Lexical Morphology. In the framework of this model, which depicts rich linguistic information, we can introduce new data structures for storing the morphemes. These new data structures are matrix encoding schemes; one type, called the Cartesian Lexicon, has been designed as a part of our research. The matrix data structures combine the advantages of hash-tables and tries, which are very popular data structures in supporting machine readable dictionaries. Our system was tested on the Modern Greek language, and demonstrated a satisfactory overall performance in word-processing. These methods could also be applicable to other languages having morphological systems similar to Modern Greek.



Journal ArticleDOI
TL;DR: The system comprises a parser and generator for Modern Greek sentences as well as a computational lexicon, encoding morphological, syntactic, and semantic information for words.
Abstract: In this paper, we put forward a fully developed system for the teaching of Modern Greek Language (MGL). The system comprises a parser and generator for Modern Greek sentences as well as a computational lexicon, encoding morphological, syntactic, and semantic information for words. In this paper, we present the major components of the system, highlighting their suitability for the teaching of MGL in an experimental, open, and cooperative educational environment. The proposed system can be used either in a classroom environment or by Internet correspondence for the teaching of MGL as a native or foreign language.



Journal ArticleDOI
TL;DR: This paper outlines a project to create electronic dictionaries of indigenous languages of the south-west USA and make them available over the Web for language instruction as well as for linguistic, psycholinguistic, and anthropological research.
Abstract: This paper outlines a project currently under way in the Linguistics Department at the University of Arizona to create electronic dictionaries of indigenous languages of the south-west USA and make them available over the Web for language instruction as well as for linguistic, psycholinguistic, and anthropological research. Working with three languages - Tohono O'odham, Navajo, and Hiaki - we have created an XML scheme that serves as a general template for structuring and archiving language databases. We describe the process of compiling databases for different languages and converting these databases to XML, which contains all the relevant information in a manner that is easily accessible. We discuss the general programming scheme used for searching, and the interfaces used for presenting the dictionary on the Web, which include several front ends for different user groups. We end with a discussion of how to ensure that special characters are displayed properly on the Web.
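A minimal sketch of the kind of XML-backed dictionary search described above; the element names (entry, headword, pos, gloss) and the sample content are hypothetical, not the project's actual template.

```python
import xml.etree.ElementTree as ET

sample = """
<dictionary lang="Tohono O'odham">
  <entry><headword>gogs</headword><pos>n</pos><gloss>dog</gloss></entry>
  <entry><headword>cuk</headword><pos>adj</pos><gloss>black</gloss></entry>
</dictionary>
"""
root = ET.fromstring(sample)

def search(gloss_query):
    # different Web front ends could call the same search routine
    return [e.findtext("headword") for e in root.iter("entry")
            if gloss_query in (e.findtext("gloss") or "")]

print(search("dog"))   # -> ['gogs']
```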

Journal ArticleDOI
TL;DR: An attempt to analyse the temporal structure of discourse in Modern Greek following the principles of Asher's Segmented Discourse Representation Theory and the use of linguistic knowledge for the determination of these relations.
Abstract: We describe an attempt to analyse the temporal structure of discourse in Modern Greek following the principles of Asher's Segmented Discourse Representation Theory. We focus on discourse relations of a temporal and causal interest and the use of linguistic knowledge for the determination of these relations. This analysis is applied to a corpus of short newspaper articles reporting car accidents in Modern Greek and the discourse grammar is implemented using the Attribute Logic Engine.

Journal ArticleDOI
TL;DR: For the genealogical classification of these manuscripts, principal component analysis and cluster analysis, which describe the similarities between the verses of the different manuscripts, were applied and could successfully classify these manuscripts into two large groups and several smaller groups.
Abstract: Many manuscripts of the Saddharmapundarika, which are among the most important manuscripts for the study of Buddhism, have been discovered in very different localities and are classified according to their place of discovery into the following three groups: Nepalese, Kashmirian, and Central Asian manuscripts. For the genealogical classification of these manuscripts, principal component analysis and cluster analysis, which describe the similarities between the verses of the different manuscripts, were applied to the data. As a result, we could successfully classify these manuscripts into two large groups and several smaller groups: one large group consists of ten paper manuscripts from Nepal and the other comprises nine palm-leaf plus two paper manuscripts. The Kashmir and Central Asian manuscripts and a few of the Nepal manuscripts belong to the small groups.
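A sketch of the multivariate step described above: manuscripts encoded as rows of a binary verse-variant matrix, reduced with principal component analysis and grouped by cluster analysis. The matrix here is a random placeholder, not the Saddharmapundarika data.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_manuscripts, n_variants = 12, 40
X = rng.integers(0, 2, size=(n_manuscripts, n_variants)).astype(float)

components = PCA(n_components=2).fit_transform(X)   # principal component analysis
tree = linkage(components, method="ward")           # cluster analysis on the scores
print(fcluster(tree, t=2, criterion="maxclust"))    # a two-group cut, as a toy example
```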

Journal ArticleDOI
TL;DR: Consideration of scale-related patterns and of the partnership of Paul and Silvanus in mission leads to a possible solution to the problem of the hapaxes and throws light on the points of contact between the Paulines (including the Pastorals), 1 and 2 Peter, and Hebrews.
Abstract: Scale-related patterns are found in all thirteen Pauline epistles. To test their distinctiveness, graphs of other texts, ancient and modern, comprising more than a million words, have been scrutinized; this survey has failed to detect any similar patterns. They may therefore be related to Pauline authorship. The longer passages claimed to be interpolations are tested against these scale-related patterns and are found to be essential parts of the original texts. Further scale-related patterns are found in 1 and 2 Peter (which received wisdom holds to be pseudonymous writings) and in Hebrews. Consideration of these patterns, and of the partnership of Paul and Silvanus in mission, leads to a possible solution to the problem of the hapaxes and throws light on the points of contact between the Paulines (including the Pastorals), 1 and 2 Peter, and Hebrews.