Journal (ISSN: 0010-4817)

Computers and The Humanities 

Springer Nature
About: Computers and The Humanities is an academic journal. It publishes mainly in the areas of computational linguistics and natural language. Its ISSN is 0010-4817. Over its lifetime, the journal has published 894 papers, which have received 13691 citations.


Papers
Journal Article
TL;DR: GATE lies at the intersection of human language computation and software engineering, and constitutes an infrastructural system supporting research and development of language processing software.
Abstract: This paper presents the design, implementation and evaluation of GATE, a General Architecture for Text Engineering. GATE lies at the intersection of human language computation and software engineering, and constitutes an infrastructural system supporting research and development of language processing software.

634 citations

Journal Article
TL;DR: The proposed method was designed to disambiguate senses that are usually associated with different topics using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval.
Abstract: Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitative and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of this testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task. In particular, the proposed method was designed to disambiguate senses that are usually associated with different topics.

614 citations
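
The training and testing phases described in the word sense disambiguation abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain naive Bayes classifier over bag-of-words contexts with add-one smoothing, and the names `train` and `classify`, the sense labels, and the toy contexts are all hypothetical.

```python
import math
from collections import Counter


def train(labelled_contexts):
    """Training phase: count context words for each sense.

    labelled_contexts is an iterable of (sense, list_of_context_words).
    """
    sense_counts = Counter()   # number of training instances per sense
    word_counts = {}           # sense -> Counter of context words
    vocab = set()
    for sense, words in labelled_contexts:
        sense_counts[sense] += 1
        word_counts.setdefault(sense, Counter()).update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(context, model):
    """Testing phase: pick the sense with the highest log posterior."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        counts = word_counts[sense]
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total)
        for w in context:
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Made-up toy data for two topically distinct senses of one noun.
training = [
    ("river", ["water", "shore", "fishing"]),
    ("river", ["muddy", "water", "boat"]),
    ("money", ["loan", "interest", "deposit"]),
    ("money", ["account", "loan", "cash"]),
]
model = train(training)
print(classify(["water", "boat"], model))
print(classify(["loan", "cash"], model))
```

Because the score depends only on which words co-occur with each sense, a classifier of this kind works best when the senses belong to different topics, which matches the scope stated in the abstract.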

Journal Article
TL;DR: In this paper, an analysis is presented in which word senses are abstractions from clusters of corpus citations, in accordance with current lexicographic practice, where the corpus citations are the basic objects in the ontology.
Abstract: Word sense disambiguation assumes word senses. Within the lexicography and linguistics literature, they are known to be very slippery entities. The first part of the paper looks at problems with existing accounts of ‘word sense’ and describes the various kinds of ways in which a word's meaning can deviate from its core meaning. An analysis is presented in which word senses are abstractions from clusters of corpus citations, in accordance with current lexicographic practice. The corpus citations, not the word senses, are the basic objects in the ontology. The corpus citations will be clustered into senses according to the purposes of whoever or whatever does the clustering. In the absence of such purposes, word senses do not exist. Word sense disambiguation also needs a set of word senses to disambiguate between. In most recent work, the set has been taken from a general-purpose lexical resource, with the assumption that the lexical resource describes the word senses of English/French/..., between which NLP applications will need to disambiguate. The implication of the first part of the paper is, by contrast, that word senses exist only relative to a task. The final part of the paper pursues this, exploring, by means of a survey, whether and how word sense ambiguity is in fact a problem for current NLP applications.

419 citations

Journal Article
TL;DR: The results suggest that the empirical trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
Abstract: A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.

391 citations
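
The lexical richness abstract above notes that vocabulary size (word types) depends heavily on text length (word tokens). A minimal sketch of how that dependence can be observed follows. The helper `vocabulary_growth`, the checkpoint `step`, and the sample sentence are hypothetical, and the type/token ratio is used here only as one simple, familiar measure; the abstract does not say which measures the paper evaluates.

```python
def vocabulary_growth(tokens, step):
    """Return (N, V(N), V(N)/N) at every `step` tokens and at the end.

    N is the text length in tokens, V(N) the number of distinct types
    seen in the first N tokens.
    """
    seen = set()
    rows = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            rows.append((i, len(seen), len(seen) / i))
    return rows


# Made-up sample text, tokenized naively on whitespace.
text = "the cat sat on the mat and the dog sat on the log"
rows = vocabulary_growth(text.split(), step=4)
for n, v, ttr in rows:
    print(f"N={n:2d}  V(N)={v:2d}  V(N)/N={ttr:.3f}")
```

Running the same function over prefixes of texts by different authors gives the kind of length-indexed curves that any comparison of richness measures has to account for.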

Journal Article
TL;DR: C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions using predicate argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit to a short answer question.
Abstract: C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions. It is not simply a string matching program – instead it uses predicate argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit to a short answer question. C-rater has been used in two studies: National Assessment for Educational Progress (NAEP) and a statewide assessment in Indiana. In both studies, c-rater agreed with human graders about 84% of the time.

363 citations

Network Information
Related Journals (5)
Computational Linguistics
1.4K papers, 154.8K citations
84% related
Information Processing and Management
3.8K papers, 151.6K citations
77% related
arXiv: Computation and Language
24.8K papers, 481.5K citations
77% related
Performance
Metrics
No. of papers from the Journal in previous years
Year  Papers
2004  23
2003  32
2002  25
2001  24
2000  38
1999  28