Showing papers by "Tony McEnery" published in 2003


Proceedings Article
Scott Piao, Paul Rayson, Dawn Archer, Andrew Wilson, Tony McEnery
12 Jul 2003
TL;DR: This paper tests a semantic field annotator, an English semantic tagger (USAS) developed at Lancaster University, as a way to identify multiword units which depict single semantic concepts, and reports that the approach provides a practical solution to MWE extraction.
Abstract: Automatic extraction of multiword expressions (MWE) presents a tough challenge for the NLP community and corpus linguistics. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction still remains an unsolved issue. In this paper, we present our research work in which we tested approaching the MWE issue using a semantic field annotator. We use an English semantic tagger (USAS) developed at Lancaster University to identify multiword units which depict single semantic concepts. The Meter Corpus (Gaizauskas et al., 2001; Clough et al., 2002) built in Sheffield was used to evaluate our approach. In our evaluation, this approach extracted a total of 4,195 MWE candidates, of which, after manual checking, 3,792 were accepted as valid MWEs, producing a precision of 90.39% and an estimated recall of 39.38%. Of the accepted MWEs, 68.22% or 2,587 are low frequency terms, occurring only once or twice in the corpus. These results show that our approach provides a practical solution to MWE extraction.

49 citations
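
The precision and low-frequency percentages quoted in the MWE extraction abstract above follow directly from the counts it reports. The short check below is only an arithmetic sketch; the variable names are illustrative and do not come from the paper. The estimated recall of 39.38% cannot be reproduced this way, because the abstract does not state the reference count it was estimated against.

```python
# Counts reported in the abstract.
candidates = 4195      # MWE candidates extracted by the tagger
accepted = 3792        # candidates accepted as valid after manual checking
low_frequency = 2587   # accepted MWEs occurring only once or twice

# Precision: share of extracted candidates that were accepted.
precision = accepted / candidates

# Share of accepted MWEs that are low-frequency terms.
low_frequency_share = low_frequency / accepted

print(f"precision: {precision:.2%}")
print(f"low-frequency share: {low_frequency_share:.2%}")
```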


01 Mar 2003
TL;DR: This paper proposes a design for a modified semantic tagger for EmodE texts that contains an ‘intelligent’ spelling regulariser, that is, a system designed to regularise spellings in their ‘correct’ context.
Abstract: As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth EmodE) period. We begin by describing how effectively the existing system tagged a training corpus prior to any modifications. The training corpus consists of newsbooks dating from December 1653 – May 1654, and totals approximately 613,000 words. We then document the various adaptations that we made to the system in an attempt to improve its efficiency, and the results we achieved when we applied the modified system to two newsbook texts, and an additional text from the Lampeter Corpus (i.e. a text that was not part of the original training corpus). To conclude, we propose a design for a modified semantic tagger for EmodE texts, that contains an ‘intelligent’ spelling regulariser, that is, a system that has been designed so as to regularise spellings in their ‘correct’ context.

37 citations


01 Jan 2003
TL;DR: This paper focuses on the corpus construction undertaken on the EMILLE Project, outlines the rationale behind data collection, and highlights a number of issues for South Asian corpus building.
Abstract: The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.

25 citations
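
The EMILLE abstract above mentions translating 8-bit language data into Unicode. As a rough illustration of what such a step can involve, the sketch below maps bytes from a legacy 8-bit encoding to Unicode characters through a lookup table. Everything in it is an assumption for illustration: the table `LEGACY_TO_UNICODE`, its byte values, and the function `to_unicode` are hypothetical and are not taken from the project. Real conversions for Indic scripts typically also need to reorder characters, which this sketch ignores.

```python
# Hypothetical 8-bit code page: these byte-to-character pairs are invented
# for illustration and do not describe any real legacy font or the EMILLE tools.
LEGACY_TO_UNICODE = {
    0xA1: "\u0915",  # DEVANAGARI LETTER KA
    0xA2: "\u0916",  # DEVANAGARI LETTER KHA
    0xA3: "\u0917",  # DEVANAGARI LETTER GA
}

REPLACEMENT = "\ufffd"  # Unicode replacement character for unmapped bytes


def to_unicode(data: bytes) -> str:
    """Convert legacy 8-bit data to a Unicode string, byte by byte."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Assume the lower half of the code page is plain ASCII.
            out.append(chr(byte))
        else:
            out.append(LEGACY_TO_UNICODE.get(byte, REPLACEMENT))
    return "".join(out)


print(to_unicode(b"\xa1\xa2 \xa3"))
```

Unmapped bytes become the replacement character rather than raising an error, which keeps such gaps visible in the converted text.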


Journal Article
TL;DR: This paper examines the were-subjunctive in British rural dialects in the light of data from two sources: the Survey of English Dialects (SED) questionnaire, and the Leeds Corpus of English Dialect (LCED), consisting of transcribed recordings made at the same time as the data was gathered for the questionnaire.
Abstract: This paper examines the were-subjunctive in British rural dialects in the light of data from two sources: the Survey of English Dialects (SED) questionnaire, and the Leeds Corpus of English Dialect (LCED), consisting of transcribed recordings made at the same time as the data was gathered for the questionnaire. We begin by surveying previous work on the subjunctive in general, and the were-subjunctive in dialect grammar in particular (section 1), culminating in a discussion of the SED data on the were-subjunctive. We then move on in section 2 to pose two hypotheses: firstly that the SED does not provide a complete picture of this phenomenon and thus corpus data may be of use in enriching it; secondly a "null" hypothesis that no were-subjunctive is consistently marked in the dialects in question. We then look at the methodology and data used (section 3), describing the source of our data, the LCED. We also note some potential difficulties (3.1) before moving on to discuss the choice of an area of England to examine (3.2) and of texts to analyse (3.3). In section 3.4 we describe the mark-up scheme used in the analysis of the texts, and in 3.5 the process of annotation and extraction of results from the texts. These results are presented in section 4. We consider the corpus data in relation to the questionnaire data (4.1), and to our two hypotheses (4.2 and 4.3). In our Conclusion (section 5) we summarise the implications of this study and consider some possible future routes of enquiry into the were-subjunctive in the rural dialects of England.

1 citation


01 Jan 2003
TL;DR: This paper focuses on the corpus construction undertaken on the EMILLE Project, outlines the rationale behind data collection, and highlights a number of issues for South Asian corpus building.
Abstract: The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.

1 citation