
Papers by Walter Daelemans published in 2000


Proceedings ArticleDOI
31 Jul 2000
TL;DR: This work applies seven machine learning algorithms to a single task, identifying base noun phrases, and shows that the best combinator, a majority vote of the top five systems, improves the best published result on a standard data set.
Abstract: We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this data set.

50 citations
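The combination step described above can be made concrete with a small sketch: per-token chunk tags from several systems are merged by majority vote, with ties broken in favour of the best individual system. The IOB tags and system outputs below are invented for illustration; this is not the paper's actual evaluation setup.

```python
from collections import Counter

def majority_vote(predictions_per_system):
    """Combine per-token IOB chunk tags from several systems by majority vote.

    predictions_per_system: list of tag sequences, one per system, all of
    equal length (one tag per token). Ties are broken in favour of the tag
    proposed by the first system, standing in for the best individual system.
    """
    combined = []
    n_tokens = len(predictions_per_system[0])
    for i in range(n_tokens):
        votes = Counter(system[i] for system in predictions_per_system)
        top_count = max(votes.values())
        best_system_tag = predictions_per_system[0][i]
        # Among tags sharing the top count, prefer the first system's choice.
        if votes[best_system_tag] == top_count:
            combined.append(best_system_tag)
        else:
            combined.append(votes.most_common(1)[0][0])
    return combined

# Hypothetical outputs of five systems for a six-token sentence.
systems = [
    ["B", "I", "O", "B", "I", "O"],
    ["B", "I", "O", "B", "O", "O"],
    ["B", "O", "O", "B", "I", "O"],
    ["B", "I", "O", "O", "I", "O"],
    ["B", "I", "O", "B", "I", "O"],
]
print(majority_vote(systems))   # ['B', 'I', 'O', 'B', 'I', 'O']
```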


Journal ArticleDOI
TL;DR: A memory-based classification architecture for word sense disambiguation and its application to the SENSEVAL evaluation task are described; for each ambiguous word, the correct sense in a new context is selected by finding the closest match to stored examples of the task.
Abstract: We describe a memory-based classification architecture for word sense disambiguation and its application to the SENSEVAL evaluation task. For each ambiguous word, a semantic word expert is automatically trained using a memory-based approach. In each expert, selecting the correct sense of a word in a new context is achieved by finding the closest match to stored examples of this task. Advantages of the approach include (i) fast development time for word experts, (ii) easy and elegant automatic integration of information sources, (iii) use of all available data for training the experts, and (iv) relatively high accuracy with minimal linguistic engineering.

46 citations
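The "closest match to stored examples" step of a memory-based word expert can be sketched as a tiny overlap-based nearest-neighbour classifier. The feature encoding (a window of surrounding words) and the training instances below are simplified assumptions, not the SENSEVAL feature set.

```python
from collections import Counter

class MemoryBasedWordExpert:
    """Minimal memory-based classifier: store all training instances and
    label a new context with the majority sense of its closest stored
    examples, using feature overlap as the similarity measure."""

    def __init__(self):
        self.memory = []  # list of (feature_tuple, sense) pairs

    def train(self, instances):
        # "Training" is just storing the examples (lazy learning).
        self.memory.extend(instances)

    def classify(self, features, k=3):
        # Rank stored examples by the number of matching feature values.
        def overlap(stored):
            return sum(a == b for a, b in zip(features, stored))
        neighbours = sorted(self.memory,
                            key=lambda ex: overlap(ex[0]),
                            reverse=True)[:k]
        senses = Counter(sense for _, sense in neighbours)
        return senses.most_common(1)[0][0]

# Hypothetical instances for the ambiguous word "bank":
# features = (word two left, word one left, word one right, word two right)
expert = MemoryBasedWordExpert()
expert.train([
    (("the", "river", "was", "flooded"), "bank_shore"),
    (("the", "central", "raised", "rates"), "bank_institution"),
    (("my", "local", "approved", "loan"), "bank_institution"),
    (("a", "muddy", "near", "town"), "bank_shore"),
])
print(expert.classify(("the", "central", "cut", "rates")))  # bank_institution
```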


Proceedings Article
01 Jan 2000
TL;DR: The authors describe the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus" and lay out the philosophy behind the high-granularity tagset designed for the project.
Abstract: This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high quality implementation of a Hidden Markov Model tagger generator.

45 citations
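The kind of system that performed best here, a Hidden Markov Model tagger, can be sketched as a bigram HMM with add-k smoothing and Viterbi decoding. The toy training sentences and the smoothing constant below are placeholders, not Spoken Dutch Corpus material.

```python
from collections import defaultdict

def train_hmm(tagged_sentences, smoothing=0.1):
    """Estimate bigram transition and emission counts from tagged sentences.
    Each sentence is a list of (word, tag) pairs."""
    transitions = defaultdict(lambda: defaultdict(float))
    emissions = defaultdict(lambda: defaultdict(float))
    tags = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            tags.add(tag)
            prev = tag
    return transitions, emissions, sorted(tags), smoothing

def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the model,
    using add-k smoothed transition and emission probabilities."""
    transitions, emissions, tags, k = model

    def trans_p(prev, tag):
        row = transitions.get(prev, {})
        return (row.get(tag, 0.0) + k) / (sum(row.values()) + k * len(tags))

    def emit_p(tag, word):
        row = emissions.get(tag, {})
        return (row.get(word, 0.0) + k) / (sum(row.values()) + k * (len(row) + 1))

    # best[i][tag] = (probability, previous tag) of the best path ending in `tag`.
    best = [{t: (trans_p("<s>", t) * emit_p(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prob, prev = max(
                (best[-1][p][0] * trans_p(p, tag) * emit_p(tag, word), p)
                for p in tags)
            column[tag] = (prob, prev)
        best.append(column)

    # Trace back from the highest-scoring final state.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy training data (hypothetical; not taken from the corpus).
train = [
    [("de", "DET"), ("man", "N"), ("loopt", "V")],
    [("de", "DET"), ("vrouw", "N"), ("zingt", "V")],
    [("een", "DET"), ("kind", "N"), ("loopt", "V")],
]
model = train_hmm(train)
print(viterbi(["de", "kind", "zingt"], model))   # ['DET', 'N', 'V']
```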



Posted Content
TL;DR: Experiments show that COMBI-BOOTSTRAP can integrate a wide variety of existing resources, and achieves much higher accuracy than both the best single tagger and an ensemble tagger constructed out of the same small training sample.
Abstract: This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.

24 citations
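The central idea of Combi-bootstrap, treating the outputs of existing taggers and lexicons as features for a second-level learner that maps them to the new tagset, might be sketched roughly as follows. The second-level "learner" here is just a majority mapping per feature combination with a global back-off, and the tag labels are invented; the paper's actual learning module and tagsets are not reproduced.

```python
from collections import Counter, defaultdict

class CombiBootstrapSketch:
    """Second-level learner in the spirit of Combi-bootstrap: the outputs of
    existing taggers / lexicons are the features, the label is the tag from
    the new tagset. The learned model is simply the majority mapping per
    feature combination, with a global majority back-off for unseen ones."""

    def train(self, sample):
        # sample: list of (feature_tuple, new_tag) pairs taken from the
        # small hand-annotated corpus sample.
        by_features = defaultdict(Counter)
        all_tags = Counter()
        for features, new_tag in sample:
            by_features[features][new_tag] += 1
            all_tags[new_tag] += 1
        self.mapping = {f: c.most_common(1)[0][0] for f, c in by_features.items()}
        self.default = all_tags.most_common(1)[0][0]

    def predict(self, features):
        return self.mapping.get(features, self.default)

# Hypothetical features: (tag from existing tagger A, tag from tagger B,
# lexicon category); the label is a tag in the new, finer-grained tagset.
sample = [
    (("N", "NOUN", "noun"), "N(soort,ev)"),
    (("N", "NOUN", "noun"), "N(soort,ev)"),
    (("V", "VERB", "verb"), "WW(pv,tgw)"),
    (("Adj", "ADJ", "adj"), "ADJ(prenom)"),
]
model = CombiBootstrapSketch()
model.train(sample)
print(model.predict(("V", "VERB", "verb")))     # WW(pv,tgw)
print(model.predict(("Pron", "PRON", "pron")))  # unseen combination: back-off
```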


Proceedings Article
01 May 2000
TL;DR: This paper exploits existing taggers and lexical resources for the annotation of corpora with new tagsets, using a second-level machine learning module that is trained on a very small sample of annotated corpus material to make the mapping to the new tagset.
Abstract: This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.

23 citations



Proceedings Article
29 Jun 2000
TL;DR: It is shown that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.
Abstract: We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to achieve the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at word level) for single classifiers is boosted significantly with additional error reductions of 31% and 38% respectively using combination of classifiers, and a further 5% using combination of meta-learners, bringing overall word level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.

12 citations
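Text-to-pronunciation tasks of this kind are commonly cast as per-letter classification over a fixed-width window of surrounding letters. The sketch below shows that windowing step plus a trivial exact-lookup "classifier"; the example word and its aligned phoneme string are made up, not Celex or Fonilex entries.

```python
def letter_windows(word, width=3):
    """Turn a word into one instance per letter: the letter itself plus
    `width` letters of left and right context, padded with '_'."""
    padded = "_" * width + word + "_" * width
    instances = []
    for i in range(len(word)):
        center = i + width
        instances.append(tuple(padded[center - width:center + width + 1]))
    return instances

# One hypothetical training pair: a spelling and an aligned phoneme string
# (one phoneme or '-' per letter), roughly in the style of lexicon alignments.
word, phonemes = "boek", ["b", "u", "-", "k"]

# Pair each letter window with its phoneme label and store the pairs.
memory = dict(zip(letter_windows(word), phonemes))

# A trivial memory-based "classifier": exact window lookup.
print(memory[letter_windows("boek")[1]])   # 'u'
```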



Proceedings Article
01 Jan 2000
TL;DR: This paper compares two rule induction techniques for the automatic extraction of phonemic knowledge and rules from pairs of pronunciation lexicons, and concludes that, whereas classification-based rule induction with C5.0 is more accurate, the transformation rules learned with TBEDL can be more easily interpreted.
Abstract: This paper describes the use of rule induction techniques for the automatic extraction of phonemic knowledge and rules from pairs of pronunciation lexicons. This extracted knowledge allows the adaptation of speech processing systems to regional variants of a language. As a case study, we apply the approach to Northern Dutch and Flemish (the variant of Dutch spoken in Flanders, a part of Belgium), based on Celex and Fonilex, pronunciation lexicons for Northern Dutch and Flemish, respectively. In our study, we compare two rule induction techniques, Transformation-Based Error-Driven Learning (TBEDL) (Brill, 1995) and C5.0 (Quinlan, 1993), and evaluate the extracted knowledge quantitatively (accuracy) and qualitatively (linguistic relevance of the rules). We conclude that, whereas classification-based rule induction with C5.0 is more accurate, the transformation rules learned with TBEDL can be more easily interpreted.

8 citations
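The transformation-based side of the comparison can be illustrated with a stripped-down error-driven learning loop: starting from the Northern Dutch transcription, the rule (here a context-free phoneme substitution) that most reduces the remaining mismatches with the Flemish transcription is repeatedly selected. The aligned phoneme strings below are invented examples, not Celex/Fonilex data, and real TBEDL uses context-sensitive rule templates.

```python
def errors(guess, target):
    """Count position-wise mismatches between two equal-length transcriptions."""
    return sum(g != t for g, t in zip(guess, target))

def apply_rule(transcription, rule):
    src, dst = rule
    return [dst if p == src else p for p in transcription]

def tbedl(pairs, max_rules=10):
    """Greedy transformation-based learning: start from the source
    transcription and repeatedly add the phoneme substitution rule that
    most reduces the remaining errors against the target transcription."""
    guesses = [list(src) for src, _ in pairs]
    targets = [list(tgt) for _, tgt in pairs]
    learned = []
    for _ in range(max_rules):
        # Candidate rules: rewrite a currently wrong phoneme into the target one.
        candidates = {(g, t) for guess, target in zip(guesses, targets)
                      for g, t in zip(guess, target) if g != t}
        best_rule, best_gain = None, 0
        for rule in candidates:
            gain = sum(errors(guess, target) - errors(apply_rule(guess, rule), target)
                       for guess, target in zip(guesses, targets))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break
        learned.append(best_rule)
        guesses = [apply_rule(guess, best_rule) for guess in guesses]
    return learned

# Toy aligned (Northern Dutch-like, Flemish-like) phoneme strings; invented.
pairs = [
    ("trAm", "trAm"),        # identical in both variants
    ("pOzitif", "pozitif"),  # O -> o
    ("lOxis", "loxis"),      # O -> o
    ("zyn", "zin"),          # y -> i (invented contrast)
]
print(tbedl(pairs))   # [('O', 'o'), ('y', 'i')]
```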



Proceedings ArticleDOI
13 Sep 2000
TL;DR: This work uses a simple genetic algorithm (GA) for feature relevance assignment in memory-based language processing on two typical natural language processing tasks, morphological synthesis and unknown word tagging, and finds that GA feature selection always significantly outperforms the MBLP variant without selection, and that GA feature ordering and weighting significantly outperform the unweighted baseline.
Abstract: We investigate the usefulness of evolutionary algorithms in three incarnations of the problem of feature relevance assignment in memory-based language processing (MBLP): feature weighting, feature ordering and feature selection. We use a simple genetic algorithm (GA) for this problem on two typical tasks in natural language processing: morphological synthesis and unknown word tagging. We find that GA feature selection always significantly outperforms the MBLP variant without selection and that feature ordering and weighting with GA significantly outperforms a situation where no weighting is used. However, GA selection does not significantly do better than simple iterative feature selection methods, and GA weighting and ordering reach only similar performance as current information-theoretic feature weighting methods.
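A minimal version of GA-based feature selection for a memory-based learner might look like the sketch below: bit strings encode which features a 1-nearest-neighbour classifier may use, and fitness is leave-one-out accuracy. The GA settings and the synthetic dataset are invented; they are not the paper's tasks or parameters.

```python
import random

def knn_loo_accuracy(data, mask):
    """Leave-one-out accuracy of a 1-NN classifier that only compares the
    features switched on in `mask` (feature overlap as similarity)."""
    correct = 0
    for i, (features, label) in enumerate(data):
        def overlap(other):
            return sum(m and a == b
                       for m, a, b in zip(mask, features, other[0]))
        neighbour = max((ex for j, ex in enumerate(data) if j != i), key=overlap)
        correct += neighbour[1] == label
    return correct / len(data)

def ga_feature_selection(data, n_features, pop_size=20, generations=30,
                         mutation_rate=0.1, seed=1):
    """Simple generational GA: individuals are feature-selection bit masks,
    with elitism, one-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: knn_loo_accuracy(data, m), reverse=True)
        next_pop = scored[:2]                      # elitism: keep the two best
        while len(next_pop) < pop_size:
            a, b = rng.sample(scored[:10], 2)      # parents from the top half
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]              # one-point crossover
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]             # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda m: knn_loo_accuracy(data, m))

# Toy task: the label depends only on features 0 and 2; features 1 and 3 are noise.
rng = random.Random(0)
data = []
for _ in range(40):
    f = [rng.randint(0, 1) for _ in range(4)]
    data.append((f, f[0] ^ f[2]))
print(ga_feature_selection(data, n_features=4))  # fittest mask found; features 0 and 2 carry the signal
```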

01 Jan 2000
TL;DR: Preliminary results from an ongoing study that investigates the performance of machine learning classifiers on a diverse set of Natural Language Processing (NLP) tasks are reported.
Abstract: In this paper we report preliminary results from an ongoing study that investigates the performance of machine learning classifiers on a diverse set of Natural Language Processing (NLP) tasks. First, we compare a number of popular existing learning methods (Neural networks, Memory-based learning, Rule induction, Decision trees, Maximum Entropy, Winnow Perceptrons, Naive Bayes and Support Vector Machines), and discuss their properties vis-à-vis typical NLP data sets. Next, we turn to methods to optimize the parameters of single learning methods through cross-validation and evolutionary algorithms. Then we investigate how we can get the best of all single methods through combination of the tested systems in classifier ensembles. Finally we discuss new and more thorough methods of automatically constructing ensembles of classifiers based on the techniques used for parameter optimization.
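One ingredient mentioned above, optimizing a learner's parameters through cross-validation, can be sketched as a small grid search. The plugged-in k-NN learner, the parameter grid and the synthetic data are placeholders rather than any of the systems compared in the study.

```python
from itertools import product

def cross_validated_score(train_fn, data, params, folds=5):
    """Average accuracy of `train_fn(train_split, **params)` over k folds.
    train_fn must return a predict(features) -> label callable."""
    fold_size = len(data) // folds
    scores = []
    for f in range(folds):
        test = data[f * fold_size:(f + 1) * fold_size]
        train = data[:f * fold_size] + data[(f + 1) * fold_size:]
        predict = train_fn(train, **params)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / len(scores)

def grid_search(train_fn, data, grid, folds=5):
    """Exhaustively try every parameter combination and keep the best one."""
    best_params, best_score = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cross_validated_score(train_fn, data, params, folds)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Placeholder learner: k-NN with feature overlap on integer feature vectors.
def knn_factory(train, k=1):
    def predict(x):
        neighbours = sorted(train,
                            key=lambda ex: sum(a == b for a, b in zip(x, ex[0])),
                            reverse=True)[:k]
        labels = [y for _, y in neighbours]
        return max(set(labels), key=labels.count)
    return predict

# Tiny synthetic dataset: the label is simply the first feature.
data = [((i % 2, i % 3, i % 5), i % 2) for i in range(60)]
print(grid_search(knn_factory, data, {"k": [1, 3, 5]}))   # ({'k': 1}, 1.0)
```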

Proceedings ArticleDOI
13 Sep 2000
TL;DR: This paper systematically compares two inductive learning approaches to tagging, MX-POST (based on maximum entropy modeling) and MBT (based on memory-based learning), and results indicate that earlier observed differences in accuracy can be attributed largely to differences in the information sources used, rather than to algorithm bias.
Abstract: Morphosyntactic Disambiguation (Part of Speech tagging) is a useful benchmark problem for system comparison because it is typical for a large class of Natural Language Processing (NLP) problems that can be defined as disambiguation in local context. This paper adds to the literature on the systematic and objective evaluation of different methods to automatically learn this type of disambiguation problem. We systematically compare two inductive learning approaches to tagging: MX-POST (based on maximum entropy modeling) and MBT (based on memory-based learning). We investigate the effect of different sources of information on accuracy when comparing the two approaches under the same conditions. Results indicate that earlier observed differences in accuracy can be attributed largely to differences in information sources used, rather than to algorithm bias.
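The point about information sources can be made concrete with the usual case representation for disambiguation in local context: each token becomes an instance built from the focus word, neighbouring words, and the tags already assigned to its left. The window sizes, sentence, and tags below are hypothetical, not the actual feature sets of MBT or MX-POST.

```python
def tagging_instances(words, tags_so_far, left=2, right=1):
    """Build one instance per token from the same information sources a
    local-context tagger can use: the focus word, `left` words and already
    assigned tags to the left, and `right` words to the right."""
    pad = max(left, right)
    padded_words = ["_"] * pad + words + ["_"] * pad
    padded_tags = ["_"] * left + tags_so_far   # tags of preceding tokens only
    instances = []
    for i, word in enumerate(words):
        center = i + pad
        instances.append({
            "focus": word,
            "left_words": tuple(padded_words[center - left:center]),
            "right_words": tuple(padded_words[center + 1:center + 1 + right]),
            "left_tags": tuple(padded_tags[i:i + left]),
        })
    return instances

# Hypothetical sentence; the first three tokens have already been tagged
# in a left-to-right pass, the last one is the current focus.
words = ["the", "old", "man", "boats"]
tags_so_far = ["DET", "ADJ", "NOUN"]
for instance in tagging_instances(words, tags_so_far):
    print(instance)
```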



Book Chapter
01 Jan 2000
TL;DR: In this article, the authors introduce machine learning ("self-learning") systems as an operationalisation of pre-Chomskyan linguistic concepts such as analogy and induction, and show how they can be applied in language description and (computational) linguistics.
Abstract: We introduce machine learning ("self-learning") systems as an operationalisation of pre-Chomskyan linguistic concepts such as analogy and induction, and show how they can be applied in language description and (computational) linguistics. As a case study we discuss two applications: the automatic induction of knowledge about the rules governing allomorphy in Dutch diminutives, and the role of segmental phonological knowledge in the learnability of Dutch stress.

Posted Content
TL;DR: This paper applies rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high-accuracy automatic annotation of corpora with pronunciation information.
Abstract: We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to achieve the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at word level) for single classifiers is boosted significantly with additional error reductions of 31% and 38% respectively using combination of classifiers, and a further 5% using combination of meta-learners, bringing overall word level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.

Posted Content
TL;DR: This article uses seven machine learning algorithms for one task, identifying base noun phrases; the results were processed by different system combination methods, all of which outperformed the best individual result.
Abstract: We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this data set.