
Showing papers on "Shallow parsing" published in 2000


Proceedings Article
01 Jan 2000
TL;DR: Two approaches are developed: a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies, and an extension of constraint satisfaction formalisms.
Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.
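To make the combination idea concrete, here is a minimal sketch (not the authors' implementation) of coherent inference over per-token classifier scores: Viterbi decoding over O/B/I phrase states with a hard constraint that an inside tag may not follow an outside tag. All scores and labels below are illustrative.

```python
import math

STATES = ["O", "B", "I"]
ILLEGAL = {("O", "I")}  # coherence constraint: "inside" cannot follow "outside"

def viterbi(scores):
    """scores: one dict per token mapping state -> classifier log-score."""
    # a phrase cannot start with "I", so rule it out at the first position
    best = [{s: ((scores[0][s] if s != "I" else -math.inf), None) for s in STATES}]
    for t in range(1, len(scores)):
        column = {}
        for s in STATES:
            candidates = [(best[t - 1][p][0] + scores[t][s], p)
                          for p in STATES if (p, s) not in ILLEGAL]
            column[s] = max(candidates)
        best.append(column)
    # backtrack from the best final state
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(scores) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path))

# toy per-token log-scores standing in for the outputs of learned classifiers
sentence_scores = [{"O": -0.1, "B": -2.5, "I": -3.0},
                   {"O": -1.9, "B": -0.3, "I": -2.0},
                   {"O": -2.2, "B": -1.5, "I": -0.2}]
print(viterbi(sentence_scores))  # ['O', 'B', 'I']
```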

182 citations


Posted Content
TL;DR: This work compares two ways of modeling the problem of learning to recognize patterns, suggests that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors, and thus contributes to the understanding of how to model shallow parsing tasks as learning problems.
Abstract: A SNoW-based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied, and experimental results for Noun-Phrase (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing so, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.
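As a hedged illustration of the two target representations compared in this paper (this is not the SNoW learner itself), the snippet below shows the same noun-phrase bracketing encoded with inside/outside (IOB) tags and with per-token open/close decisions:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
iob    = ["B",   "I",   "O",   "O",  "B",   "I"]   # inside/outside encoding

def iob_to_open_close(tags):
    """Derive per-token open ('[') and close (']') decisions from IOB tags."""
    opens = [t == "B" for t in tags]
    closes = []
    for i, t in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        closes.append(t in ("B", "I") and nxt != "I")
    return opens, closes

opens, closes = iob_to_open_close(iob)
for tok, o, c in zip(tokens, opens, closes):
    print(f"{'[ ' if o else '  '}{tok}{' ]' if c else ''}")
# prints each token with its bracket decisions: [ The ... cat ] ... [ the ... mat ]
```

In the open/close view, two separate predictors decide where a phrase may open and where an open phrase may close, rather than labeling every token as inside or outside.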

91 citations


Proceedings ArticleDOI
13 Sep 2000
TL;DR: Treating shallow parsing as part-of-speech tagging yields results comparable with other, more elaborate approaches, using the CoNLL 2000 training and testing material.
Abstract: Treating shallow parsing as part-of-speech tagging yields results comparable with other, more elaborate approaches. Using the CoNLL 2000 training and testing material, our best model had an accuracy of 94.88%, with an overall FB1 score of 91.94%. The individual FB1 scores for NPs were 92.19%, VPs 92.70% and PPs 96.69%.
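A small sketch of the underlying recipe (not the paper's tagger): pair each word's POS tag with its chunk tag into one combined label, then train any off-the-shelf tagger on the combined labels. A most-frequent-label baseline stands in for the real tagger here, and the CoNLL-2000 column format is assumed.

```python
from collections import Counter, defaultdict

# toy CoNLL-2000-style training triples: (word, POS, chunk)
train = [("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP"),
         ("the", "DT", "B-NP"), ("deficit", "NN", "I-NP"),
         ("will", "MD", "B-VP"), ("narrow", "VB", "I-VP")]

counts = defaultdict(Counter)
for word, pos, chunk in train:
    counts[word][f"{pos}+{chunk}"] += 1          # one combined POS+chunk label

def tag(word):
    """Most frequent combined label seen for the word (None if unseen)."""
    return counts[word].most_common(1)[0][0] if counts[word] else None

print([(w, tag(w)) for w in ["the", "deficit", "will"]])
# [('the', 'DT+B-NP'), ('deficit', 'NN+I-NP'), ('will', 'MD+B-VP')]
```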

47 citations


Proceedings ArticleDOI
29 Apr 2000
TL;DR: The whole approach proved very useful for processing free-word-order languages like German; the divide-and-conquer parsing strategy in particular obtained an f-measure of 87.14% on unseen data.
Abstract: We present a divide-and-conquer strategy based on finite-state technology for shallow parsing of real-world German texts. In a first phase, only the topological structure of a sentence (i.e., verb groups, subclauses) is determined. In a second phase, the phrasal grammars are applied to the contents of the different fields of the main and sub-clauses. Shallow parsing is supported by suitably configured preprocessing, including morphological and on-line compound analysis, efficient POS filtering, and named entity recognition. The whole approach proved very useful for processing free-word-order languages like German; the divide-and-conquer parsing strategy in particular obtained an f-measure of 87.14% on unseen data.
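The cascade can be pictured with a rough two-phase sketch (illustrative POS patterns only, not the authors' grammar): phase one locates the verbal elements that delimit the topological fields, phase two chunks noun phrases inside the remaining material.

```python
import re

# toy German sentence as (word, POS) pairs; the tag names are illustrative
sentence = [("Der", "ART"), ("Mann", "NN"), ("hat", "VAFIN"),
            ("das", "ART"), ("Buch", "NN"), ("gelesen", "VVPP")]
tags = " ".join(pos for _, pos in sentence)

# phase 1: finite and clause-final verbal elements delimit the middle field
verb_re = re.compile(r"VAFIN|VVPP|VVINF")
verb_spans = [m.span() for m in verb_re.finditer(tags)]

# phase 2: a minimal NP pattern (optional article + noun) applied to the fields
np_re = re.compile(r"(?:ART )?NN")
np_chunks = [m.group(0) for m in np_re.finditer(tags)]

print("verbal elements:", [tags[a:b] for a, b in verb_spans])  # ['VAFIN', 'VVPP']
print("NP chunk patterns:", np_chunks)                          # ['ART NN', 'ART NN']
```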

35 citations


Proceedings ArticleDOI
19 Jun 2000
TL;DR: A case study based on part of NASA's specification of the Node Control Software of the International Space Station is described, and the authors apply to it their method of checking properties on models obtained by shallow parsing of natural language requirements.
Abstract: The authors report on their experiences of using lightweight formal methods for the partial validation of natural language (NL) requirements documents. They describe a case study based on part of NASA's specification of the Node Control Software of the International Space Station, and apply to it their method of checking properties on models obtained by shallow parsing of natural language requirements. These experiences support the position that it is feasible and useful to perform automated analysis of requirements expressed in natural language. Indeed, the authors identified a number of errors in their case study that were also independently discovered and corrected by NASA's IV&V Facility in a subsequent version of the same document. The paper describes the techniques used and the errors found, and reflects on the lessons learned.
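A loose sketch of the general idea, not the authors' toolchain: a shallow pattern pulls an (agent, action) pair out of each "shall" requirement, and a simple property is then checked over the resulting model. The requirement texts and the property below are invented for illustration.

```python
import re

requirements = [
    "The Node Control Software shall monitor the bus status.",
    "Shall log every mode transition.",                  # no agent named
]

triple_re = re.compile(r"^(?P<agent>.*?)\s*shall\s+(?P<action>.+)\.$", re.IGNORECASE)
model = [triple_re.match(r) for r in requirements]

# property: every 'shall' requirement must name a responsible agent
for req, m in zip(requirements, model):
    agent = m.group("agent").strip() if m else ""
    if not agent:
        print("property violated (no agent):", req)
```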

33 citations


Book ChapterDOI
01 Jan 2000
TL;DR: This chapter will describe how parallel text extraction algorithms can be used for machine aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexicons and translation memory.
Abstract: This chapter will describe how parallel text extraction algorithms can be used for machine aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexicons and translation memory. Automatic word alignment and terminology extraction algorithms can be combined to substantially speed the lexicon construction process. Using a highly accurate partial alignment of term constituents, a terminologist need only recognize and correct minor errors in the recognition of term boundaries. The next generation of translation memory systems will certainly use statistical alignment algorithms and shallow parsing technology to improve coverage of current systems, by allowing for linguistic abstraction and partial sentence matching. Abstracting away from lexical units to part-of-speech, number, term, or noun phrase classes will allow these systems to mix and match components.
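A hedged sketch of the "linguistic abstraction" idea described here (not any particular translation-memory product): sentences are abstracted to class level, with numbers and a toy term list replaced by placeholders, before fuzzy matching against the memory. The term list and example sentences are made up.

```python
import re
from difflib import SequenceMatcher

TERMS = {"valve", "pump"}                        # hypothetical terminology lexicon

def abstract(sentence):
    """Replace numbers and known terms with class placeholders."""
    out = []
    for tok in sentence.lower().split():
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            out.append("<NUM>")
        elif tok in TERMS:
            out.append("<TERM>")
        else:
            out.append(tok)
    return " ".join(out)

memory = {"Close valve 12 before starting.": "Ventil 12 vor dem Start schließen."}
query = "Close pump 7 before starting."

for src, tgt in memory.items():
    score = SequenceMatcher(None, abstract(src), abstract(query)).ratio()
    print(f"match {score:.2f}: reuse '{tgt}' as a draft translation")
```

Without the abstraction step the two sentences share fewer characters and the match score drops, which illustrates the partial-matching gain the chapter anticipates.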

31 citations


Proceedings ArticleDOI
13 Sep 2000
TL;DR: This work produces tagging and chunking in a single process using an Integrated Language Model, formalized as Markov Models, that integrates several knowledge sources: lexical probabilities, a contextual Language Model for every chunk, and a contextual LM for the sentences.
Abstract: In this work, we present a stochastic approach to shallow parsing. Most of the current approaches to shallow parsing have a common characteristic: they take the sequence of lexical tags proposed by a POS tagger as input for the chunking process. Our system produces tagging and chunking in a single process using an Integrated Language Model (ILM) formalized as Markov Models. This model integrates several knowledge sources: lexical probabilities, a contextual Language Model (LM) for every chunk, and a contextual LM for the sentences. We have extended the ILM by adding lexical information to the contextual LMs. We have applied this approach to the CoNLL-2000 shared task improving the performance of the chunker.
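A simplified sketch of how such an integrated model might score one joint tag-and-chunk hypothesis (the probabilities and factorization details below are illustrative, not the paper's exact model): lexical probabilities, a within-chunk model over tags, and a sentence-level model over chunk labels are combined as one sum of log-probabilities.

```python
import math

# hypothetical model parameters
p_word_given_tag = {("the", "DT"): 0.4, ("deal", "NN"): 0.01, ("closed", "VBD"): 0.02}
p_tag_in_chunk   = {("DT", "NP"): 0.3, ("NN", "NP"): 0.5, ("VBD", "VP"): 0.6}
p_chunk_bigram   = {("<s>", "NP"): 0.5, ("NP", "VP"): 0.4}

def score(words, tags, chunks):
    logp, prev_chunk = 0.0, "<s>"
    for w, t, c in zip(words, tags, chunks):
        logp += math.log(p_word_given_tag[(w, t)])   # lexical knowledge source
        logp += math.log(p_tag_in_chunk[(t, c)])     # contextual LM for the chunk
        if c != prev_chunk:                          # a new chunk starts here
            logp += math.log(p_chunk_bigram[(prev_chunk, c)])
            prev_chunk = c
    return logp

print(score(["the", "deal", "closed"], ["DT", "NN", "VBD"], ["NP", "NP", "VP"]))
```

In a full system this score would be maximized over all tag-and-chunk sequences (e.g., with Viterbi search), which is what lets tagging and chunking happen in one pass.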

15 citations


Proceedings Article
01 May 2000
TL;DR: This paper argues in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000-word corpus of Italian, including a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora.
Abstract: In this paper we argue in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000-word corpus of Italian. Most papers present approaches to tagging which are statistically based. None of the statistically based analyses, however, produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Of course, their data refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely statistically based approaches are inefficient, basically due to the great sparsity of tag distribution: 50% or less of the tags are unambiguous when punctuation is subtracted from the total count. In addition, the level of homography is also very high: readings per word are 1.7, compared to 1.07 computed for English by [2] with a similar tagset. The current work includes a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora. In a preliminary experiment with an automatic tagger, we obtained 99.97% accuracy on the training set and 99.03% on the test set using combined approaches; the accuracy obtained from statistical tagging alone is well below 95% even on the training set, and the same applies to syntactic tagging. As to the shallow parser and GF-assigner, we shall report on a first preliminary experiment on a manually verified subset of 10,000 words.
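The ambiguity measures quoted above can be made concrete with a tiny sketch (the lexicon below is invented; the 1.7 and 50% figures come from the authors' 500,000-word corpus, not from this code):

```python
lexicon = {                       # hypothetical word -> possible tags
    "la":     {"ART", "PRON", "NOUN"},
    "porta":  {"NOUN", "VERB"},
    "è":      {"VERB"},
    "chiusa": {"ADJ", "VERB"},
}
tokens = ["la", "porta", "è", "chiusa"]

readings = sum(len(lexicon[t]) for t in tokens) / len(tokens)
unambiguous = sum(1 for t in tokens if len(lexicon[t]) == 1) / len(tokens)
print(f"readings per word: {readings:.2f}, unambiguous tokens: {unambiguous:.0%}")
# readings per word: 2.00, unambiguous tokens: 25%
```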

13 citations


Journal Article
TL;DR: A statistical algorithm is developed to recognize definite levels of Chinese chunks, and experiments show that it achieves high accuracy and robustness for shallow parsing of real Chinese texts.
Abstract: Chunk parsing is an effective method to decrease the difficulty of language parsing. This paper proposes a formal description representing the characteristics of Chinese chunks. Based on the description, a statistical algorithm is developed to recognize definite levels of Chinese chunks. Experiments show that the algorithm achieves high accuracy and robustness for shallow parsing of real Chinese texts.

6 citations


Proceedings ArticleDOI
13 Sep 2000
TL;DR: Two approaches are developed: a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies, and an extension of constraint satisfaction formalisms.
Abstract: We study the problem of identifying phrase structure. We formalize it as the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints, and develop two general approaches for it. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We also develop efficient algorithms under both models and study them experimentally in the context of shallow parsing.

6 citations


Proceedings ArticleDOI
29 Apr 2000
TL;DR: The spelling and grammar corrector described here is superior to other existing spelling checkers for Danish in its ability to deal with context-dependent errors.
Abstract: This paper reports on work carried out to develop a spelling and grammar corrector for Danish, addressing in particular the issue of how a form of shallow parsing is combined with error detection and correction for the treatment of context-dependent spelling errors. The syntactic grammar for Danish used by the system has been developed with the aim of dealing with the most frequent error types found in a parallel corpus of unedited and proofread texts specifically collected by the project's end users. By focussing on certain grammatical constructions and certain error types, it has been possible to exploit the linguistic 'intelligence' provided by syntactic parsing and yet keep the system robust and efficient. The system described is thus superior to other existing spelling checkers for Danish in its ability to deal with context-dependent errors.
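A toy sketch of the kind of context-dependent check such a system can make once NP chunks are available (this illustration, including the two-word lexicon, is not the project's grammar): the indefinite article must agree in gender with the head noun of its chunk.

```python
GENDER = {"bil": "common", "hus": "neuter"}      # tiny hypothetical Danish lexicon
ARTICLE = {"en": "common", "et": "neuter"}

def check_np(chunk):
    """chunk: list of tokens forming one NP, e.g. ['en', 'hus']."""
    article, noun = chunk[0], chunk[-1]
    if article in ARTICLE and noun in GENDER and ARTICLE[article] != GENDER[noun]:
        fix = next(a for a, g in ARTICLE.items() if g == GENDER[noun])
        return f"agreement error in '{' '.join(chunk)}': suggest '{fix} {noun}'"
    return None

print(check_np(["en", "hus"]))   # agreement error ... suggest 'et hus'
print(check_np(["et", "hus"]))   # None
```

A word-by-word spelling checker cannot flag "en hus", because both words are correctly spelled in isolation; only the chunk-level context reveals the error.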

01 Aug 2000
TL;DR: Though the system incorporates both statistical and text analysis models, the statistical model plays a major role during the automated process and a shallow parsing algorithm is used to eliminate the semantic redundancy.
Abstract: This paper introduces a Chinese summarizer called ThemePicker. Though the system incorporates both statistical and text analysis models, the statistical model plays the major role during the automated process. In addition to word segmentation and proper name identification, phrasal chunk extraction and content density calculation are based on a semantic network pre-constructed for a chosen domain. To improve the readability of the extracted sentences as the auto-generated summary, a shallow parsing algorithm is used to eliminate semantic redundancy.
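A rough sketch of the two scoring ideas mentioned here (not ThemePicker itself): sentences are ranked by a simple content-density score over hypothetical domain terms, and a near-duplicate check stands in for the shallow-parsing-based redundancy elimination.

```python
weights = {"sales": 2.0, "growth": 1.5, "quarter": 1.0}   # hypothetical term weights

def density(sentence):
    words = sentence.lower().split()
    return sum(weights.get(w, 0.0) for w in words) / len(words)

def redundant(candidate, chosen, threshold=0.6):
    a = set(candidate.lower().split())
    return any(len(a & set(c.lower().split())) / len(a | set(c.lower().split())) > threshold
               for c in chosen)

sentences = ["Sales growth accelerated this quarter",
             "Sales growth accelerated this quarter again",
             "The weather was pleasant"]
summary = []
for s in sorted(sentences, key=density, reverse=True):
    if not redundant(s, summary):
        summary.append(s)
print(summary)   # the near-duplicate second sentence is dropped
```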

01 Jan 2000
TL;DR: A system that recognizes and classifies named entities (NE) in Greek text is presented; it has been developed in the framework of the EPET II "oikONOMiA" project, which aims at the construction of a pipeline integrating NE recognition, shallow parsing, and co-reference resolution technologies.
Abstract: In this paper, we describe work in progress for the development of a Greek named entity recognizer. The system aims at information extraction applications where large scale text processing is needed. Speed of analysis, system robustness, and results accuracy have been the basic guidelines for the system's design. Pattern matching techniques have been implemented on top of an existing automated pipeline for Greek text processing and the resulting system depends on non-recursive regular expressions in order to capture different types of named entities. For development and testing purposes, we collected a corpus of financial texts from several web sources and manually annotated part of it. Overall precision and recall are 86% and 81% respectively. Introduction: In this paper, we present a system that recognizes and classifies named entities (NE) in Greek text. The system has been developed in the framework of the EPET II "oikONOMiA" project, which aims at the construction of a pipeline integrating NE recognition, shallow parsing, and co-reference resolution technologies. The pipeline will analyze text to produce a shallow semantic representation suitable for template filling in scenario-based information extraction (IE) applications. Natural Language Processing (NLP) systems performing information extraction have gained the focus of attention of both the academic and the business intelligence community. NERC is the first task in the information extraction task series. Several factors contribute to its complexity. Name-list based recognition is not adequate, since unknown names should be dealt with in addition to names appearing in the lists. Moreover, known names may be of several types; commonly used Greek names can be of type person, organization, location, or none of the above. Moreover, the name classification schema can vary significantly across domains and applications. Thus, there are two aspects in NERC: 1) recognition and classification of known names, and 2) spotting and classification of new names. It should be noted that the creation, adaptation, and maintenance of name databases comes at a significant cost; new text
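A bare-bones illustration of the non-recursive regular-expression patterns the abstract describes (the trigger word, example text, and money pattern are made-up examples, not the system's actual grammar):

```python
import re

patterns = {
    # organization: the trigger word 'Τράπεζα' (Bank) followed by a capitalized word
    "ORG": re.compile(r"Τράπεζα\s+[Α-ΩΆ-Ώ]\w+"),
    # a simple money expression in drachmas
    "MONEY": re.compile(r"\d[\d.,]*\s*δρχ\."),
}

text = "Η Τράπεζα Πειραιώς ανακοίνωσε κέρδη 500.000 δρχ."
for label, pattern in patterns.items():
    for m in pattern.finditer(text):
        print(label, "->", m.group(0))
# ORG -> Τράπεζα Πειραιώς
# MONEY -> 500.000 δρχ.
```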

Book ChapterDOI
28 Jun 2000
TL;DR: This paper focuses on the integration of NLP techniques for efficient textual database retrieval as part of the VLSHDS (Very Large Scale Hypermedia Delivery System) project, with the aim of increasing the quality of textual information search (precision/recall) compared to already existing multi-lingual IR systems.
Abstract: Improvements in hardware, communication technology and databases have led to the explosion of multimedia information repositories. In order to improve the quality of information retrieval compared to already existing advanced document management systems, research has shown that it is necessary to consider vertical integration of retrieval techniques inside the database service architecture. This paper focuses on the integration of NLP techniques for efficient textual database retrieval as part of the VLSHDS (Very Large Scale Hypermedia Delivery System) project. One target of this project is to increase the quality of textual information search (precision/recall) compared to already existing multi-lingual IR systems by applying morphological analysis and shallow parsing at the phrase level to document and query processing. The scope of this paper is limited to Thai documents. The underlying system is the Active HYpermedia Delivery System (AHYDS) framework, which provides the delivery service over the Internet. As first results, based on 1,100 Thai documents, our approach improved precision and recall from 72.666% and 56.67% in the initial implementation (without applying NLP techniques) to 85.211% and 76.876%, respectively.
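As a quick reminder of how the quoted evaluation measures are computed, here is a toy precision/recall calculation with made-up document IDs; the 85.211%/76.876% figures above come from the authors' 1,100-document Thai collection, not from this example.

```python
retrieved = {"d1", "d2", "d3", "d4"}     # documents returned for a query
relevant  = {"d1", "d2", "d5"}           # documents judged relevant

hits = retrieved & relevant
precision = len(hits) / len(retrieved)
recall    = len(hits) / len(relevant)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
# precision = 50.00%, recall = 66.67%
```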