
Showing papers on "Shallow parsing" published in 2002


Journal ArticleDOI
TL;DR: This work argues that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness, and describes a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers.
Abstract: Robustness is a key issue for natural language processing in general and parsing in particular, and many approaches have been explored in the last decade for the design of robust parsing systems. Among those approaches is shallow or partial parsing, which produces minimal and incomplete syntactic structures, often in an incremental way. We argue that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness. We describe a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers. The rule formalism allows the recognition of n-ary linguistic relations between words or constituents on the basis of global or local structural, topological and/or lexical conditions. It offers the advantage of accepting various types of inputs, ranging from raw to chunked or constituent-marked texts, so for instance it can be used to process existing annotated corpora, or to perform a deeper analysis on the output of an existing shallow parser. It has been successfully used to build a deep functional dependency parser, as well as for the task of co-reference resolution, in a modular way.
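To make concrete what a rule over chunked input might look like, here is a minimal, hypothetical sketch; the chunk format and the single SUBJECT rule are illustrative assumptions, not the formalism described in the paper:

```python
# Minimal sketch: extracting a binary relation from chunked input.
# The chunk format and the SUBJECT rule are illustrative assumptions,
# not the rule formalism described in the paper.

def head(chunk_words):
    """Naive head choice: last word of the chunk."""
    return chunk_words[-1]

def extract_subject_relations(chunks):
    """Emit SUBJECT(verb, noun) for every NP chunk directly followed by a VP chunk."""
    relations = []
    for left, right in zip(chunks, chunks[1:]):
        left_label, left_words = left
        right_label, right_words = right
        if left_label == "NP" and right_label == "VP":
            relations.append(("SUBJECT", head(right_words), head(left_words)))
    return relations

chunks = [("NP", ["the", "parser"]), ("VP", ["produces"]), ("NP", ["minimal", "structures"])]
print(extract_subject_relations(chunks))   # [('SUBJECT', 'produces', 'parser')]
```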

321 citations


Journal ArticleDOI
TL;DR: The authors presented memory-based learning approaches to shallow parsing and applied these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing.
Abstract: We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods to improve the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with those of other systems. This reveals that our approach works well for base phrase identification, while its application to recognizing embedded structures leaves some room for improvement.
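As a rough illustration of memory-based (nearest-neighbour) IOB chunking, the sketch below uses scikit-learn's KNeighborsClassifier as a stand-in for the authors' learner; the toy data and feature window are assumptions:

```python
# Sketch of memory-based IOB chunking: classify each token from a small
# context window of words and POS tags using a k-nearest-neighbour learner.
# scikit-learn stands in for the memory-based learner used in the paper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier

train = [  # (word, POS, IOB chunk tag) -- toy training data
    ("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"), ("deficit", "NN", "I-NP"),
    ("will", "MD", "B-VP"), ("narrow", "VB", "I-VP"),
]

def features(tokens, i):
    word, pos, _ = tokens[i]
    prev_pos = tokens[i - 1][1] if i > 0 else "BOS"
    next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else "EOS"
    return {"w": word.lower(), "p": pos, "p-1": prev_pos, "p+1": next_pos}

X = [features(train, i) for i in range(len(train))]
y = [tag for _, _, tag in train]

vec = DictVectorizer()
knn = KNeighborsClassifier(n_neighbors=1).fit(vec.fit_transform(X), y)

test = [("the", "DT", "?"), ("surplus", "NN", "?"), ("will", "MD", "?")]
print(knn.predict(vec.transform([features(test, i) for i in range(len(test))])))
```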

137 citations


Journal Article
TL;DR: A unified technique solves different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM): the relevant information for each task is incorporated into the models, yielding a Specialized HMM that gives more complete contextual models.
Abstract: We present a unified technique to solve different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM). This technique consists of the incorporation of the relevant information for each task into the models. To do this, the training corpus is transformed to take into account this information. In this way, no change is necessary for either the training or tagging process, so it allows for the use of a standard HMM approach. Taking into account this information, we construct a Specialized HMM which gives more complete contextual models. We have tested our system on chunking and clause identification tasks using different specialization criteria. The results obtained are in line with the results reported for most of the relevant state-of-the-art approaches.
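A minimal sketch of the specialization step, under the assumption that specialization means fusing the chunk tag and selected trigger words into the emitted tag (the actual specialization criteria in the paper may differ):

```python
# Sketch of the "specialization" step: rewrite the training corpus so that
# selected words carry their own identity inside the tag, letting a standard
# HMM learn more specific contextual models.  The trigger list is illustrative.

TRIGGER_WORDS = {"that", "to", "as"}   # words worth specializing (assumption)

def specialize(sentence):
    """sentence: list of (word, pos, chunk) triples -> list of (word, tag) pairs."""
    out = []
    for word, pos, chunk in sentence:
        tag = f"{pos}+{chunk}"
        if word.lower() in TRIGGER_WORDS:
            tag = f"{tag}+{word.lower()}"   # fuse the lexical item into the tag
        out.append((word, tag))
    return out

sent = [("He", "PRP", "B-NP"), ("wants", "VBZ", "B-VP"), ("to", "TO", "I-VP"), ("leave", "VB", "I-VP")]
print(specialize(sent))
# [('He', 'PRP+B-NP'), ('wants', 'VBZ+B-VP'), ('to', 'TO+I-VP+to'), ('leave', 'VB+I-VP')]
```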

102 citations


Journal ArticleDOI
TL;DR: The approach of checking properties of models obtained by shallow parsing of natural language requirements, applied to a case study based on part of a NASA specification of the Node Control Software on the International Space Station, supports the position that it is feasible and useful to perform automated analysis of requirements expressed in natural language.
Abstract: In this paper, we report on our experiences of using lightweight formal methods for the partial validation of natural language requirements documents. We describe our approach to checking properties of models obtained by shallow parsing of natural language requirements, and apply it to a case study based on part of a NASA specification of the Node Control Software on the International Space Station. The experience reported supports our position that it is feasible and useful to perform automated analysis of requirements expressed in natural language. Indeed, we identified a number of errors in our case study that were also independently discovered and corrected by NASA's Independent Validation and Verification Facility in a subsequent version of the same document, and others that had not been discovered. The paper describes the techniques we used and the errors we found, and reflects on the lessons learned.

92 citations


Journal ArticleDOI
TL;DR: This article introduces the problem of partial or shallow parsing (assigning partial syntactic structure to sentences), explains why it is an important natural language processing (NLP) task, and suggests future directions for machine learning of shallow parsing.
Abstract: This article introduces the problem of partial or shallow parsing (assigning partial syntactic structure to sentences) and explains why it is an important natural language processing (NLP) task. The complexity of the task makes Machine Learning an attractive option in comparison to the handcrafting of rules. On the other hand, because of the same task complexity, shallow parsing makes an excellent benchmark problem for evaluating machine learning algorithms. We sketch the origins of shallow parsing as a specific task for machine learning of language, and introduce the articles accepted for this special issue, a representative sample of current research in this area. Finally, future directions for machine learning of shallow parsing are suggested.

69 citations


Journal ArticleDOI
TL;DR: It is shown how prosody can be used together with other knowledge sources for the task of resegmentation if a first segmentation turns out to be wrong, and how a critical system evaluation can help to improve the overall performance of automatic dialogue systems.

44 citations


Journal Article
TL;DR: Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts, and special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as their sensitivity to the size and the various types of training data sets.
Abstract: Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts. The phrase structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to in the parse tree. The encoding is based on the concatenation of the phrase tags on the path from lowest to higher nodes. Various linguistic features are used in learning; the taggers are trained on the basis of lexical information only, part-of-speech only, and a combination of both, to predict the phrase structure of the tokens with or without part-of-speech. Special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as the taggers' sensitivity to the size and the various types of training data sets. The method can be easily transferred to other languages.
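The path-concatenation encoding can be illustrated in a few lines; the separator and tag names below are assumptions:

```python
# Sketch of the path-concatenation encoding: each token's label is the list of
# phrase tags from its lowest constituent up to higher nodes, joined into one
# tag that a sequence tagger can predict.  Separator and tag names are assumptions.

def encode(tree_path_per_token):
    """tree_path_per_token: for each token, the phrase labels from lowest to highest."""
    return ["|".join(path) for path in tree_path_per_token]

# Swedish "en mycket gammal bil" ("a very old car"): an NP containing an AP
# ("mycket gammal"), so the AP tokens carry both labels.
paths = [["NP"], ["AP", "NP"], ["AP", "NP"], ["NP"]]
print(encode(paths))   # ['NP', 'AP|NP', 'AP|NP', 'NP']
```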

38 citations


Proceedings ArticleDOI
06 Jul 2002
TL;DR: It is argued that a memory-based learning algorithm might not need an explicit intermediate POS-tagging step for parsing when a sufficient amount of training material is available and word form information is used for low-frequency words.
Abstract: We describe a case study in which a memory-based learning algorithm is trained to simultaneously chunk sentences and assign grammatical function tags to these chunks. We compare the algorithm's performance on this parsing task with varying training set sizes (yielding learning curves) and different input representations. In particular we compare input consisting of words only, a variant that includes word form information for low-frequency words, gold-standard POS only, and combinations of these. The word-based shallow parser displays an apparently log-linear increase in performance, and surpasses the flatter POS-based curve at about 50,000 sentences of training data. The low-frequency variant performs even better, and the combinations are best. Comparative experiments with a real POS tagger produce lower results. We argue that we might not need an explicit intermediate POS-tagging step for parsing when a sufficient amount of training material is available and word form information is used for low-frequency words.
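One way the "low-frequency variant" of the input could be built is sketched below: rare words are replaced by a coarse word-form signature while frequent words are kept as-is. The frequency threshold and the signature features are assumptions, not the authors' exact scheme:

```python
# Sketch of the input variants compared in the paper: words only, versus words
# where low-frequency items are replaced by a crude word-form signature.
# The frequency threshold and the signature features are assumptions.
from collections import Counter

def signature(word):
    return "".join([
        "C" if word[0].isupper() else "c",
        "D" if any(ch.isdigit() for ch in word) else "_",
        "H" if "-" in word else "_",
        word[-3:].lower(),            # keep a short suffix
    ])

def build_input(sentences, min_freq=10):
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_freq else signature(w) for w in sent]
            for sent in sentences]

print(build_input([["Rolls-Royce", "said", "it", "expects", "profits"]], min_freq=2))
# [['C_Hyce', 'c__aid', 'c__it', 'c__cts', 'c__its']]
```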

30 citations


01 Jan 2002
TL;DR: It is argued that a chunked syntactic representation can usefully be exploited as such for non-trivial NLP applications that do not require full text understanding, such as automatic lexical acquisition and information retrieval.
Abstract: This paper illustrates a technique of shallow parsing named “text chunking” whereby “parse incompleteness” is reinterpreted as “parse underspecification”. A text is chunked into structured units which can be identified with certainty on the basis of available knowledge. The chunking process stops at the level of granularity beyond which the analysis becomes undecidable. We argue that a chunked syntactic representation can usefully be exploited as such for non-trivial NLP applications which do not require full text understanding, such as automatic lexical acquisition and information retrieval.

28 citations


Proceedings Article
01 Jan 2002
TL;DR: Among the novelties added to QUANTUM this year is a web module that finds exact answers using high-precision reformulation of the question to anticipate the expected context of the answer.
Abstract: This year, we participated in the Question Answering task for the second time with the QUANTUM system. We entered 2 runs for the main task (one using the web, the other without) and 1 run for the list task (without the web). We essentially built on last year's experience to enhance the system. The architecture of QUANTUM is mainly the same as last year: it uses patterns that rely on shallow parsing techniques and regular expressions to analyze the question and then select the most appropriate extraction function. This extraction function is then applied to one-paragraph-long passages retrieved by Okapi to extract and score candidate answers. Among the novelties we added to QUANTUM this year is a web module that finds exact answers using high-precision reformulation of the question to anticipate the expected context of the answer.
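A hypothetical sketch of pattern-based question analysis selecting an extraction function; the regular expressions and function names are illustrative, not QUANTUM's actual patterns:

```python
# Sketch of pattern-based question analysis: a regular expression decides which
# extraction function is applied to the retrieved passages.  The patterns and
# the extraction-function names are illustrative assumptions.
import re

PATTERNS = [
    (re.compile(r"^who\b", re.I), "extract_person"),
    (re.compile(r"^(when|what year)\b", re.I), "extract_date"),
    (re.compile(r"^where\b", re.I), "extract_location"),
    (re.compile(r"^how (many|much)\b", re.I), "extract_quantity"),
]

def select_extraction_function(question):
    for pattern, function_name in PATTERNS:
        if pattern.search(question):
            return function_name
    return "extract_default"

print(select_extraction_function("When was the International Space Station launched?"))
# extract_date
```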

20 citations


Journal Article
TL;DR: This work investigates the performance of four shallow parsers trained using various types of artificially noisy material, shows that they are surprisingly robust to synthetic noise, and addresses the question of whether naturally occurring disfluencies undermine performance more than a change in distribution does.
Abstract: Shallow parsers are usually assumed to be trained on noise-free material drawn from the same distribution as the testing material. However, when the training set is either noisy or drawn from a different distribution, performance may be degraded. Using the parsed Wall Street Journal, we investigate the performance of four shallow parsers (maximum entropy, memory-based learning, N-grams and ensemble learning) trained using various types of artificially noisy material. Our first set of results shows that shallow parsers are surprisingly robust to synthetic noise, with performance gradually decreasing as the rate of noise increases. Further results show that no single shallow parser performs best in all noise situations. Final results show that simple, parser-specific extensions can improve noise tolerance. Our second set of results addresses the question of whether naturally occurring disfluencies undermine performance more than a change in distribution does. Results using the parsed Switchboard corpus suggest that, although naturally occurring disfluencies might harm performance, differences in distribution between the training set and the testing set are more significant.
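One simple way to produce "artificially noisy material" is to flip a controlled fraction of the training tags, as sketched below; this particular noise model is an assumption, not necessarily one of the schemes used in the paper:

```python
# Sketch of synthetic annotation noise: flip a fraction of the chunk tags in
# the training data to a random other tag.  This noise model is an illustrative
# assumption rather than the exact corruption scheme used in the paper.
import random

TAGSET = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]

def add_tag_noise(tagged_tokens, noise_rate, seed=0):
    rng = random.Random(seed)
    noisy = []
    for word, tag in tagged_tokens:
        if rng.random() < noise_rate:
            tag = rng.choice([t for t in TAGSET if t != tag])
        noisy.append((word, tag))
    return noisy

clean = [("the", "B-NP"), ("deficit", "I-NP"), ("will", "B-VP"), ("narrow", "I-VP")]
print(add_tag_noise(clean, noise_rate=0.25))
```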

01 Jan 2002
TL;DR: This paper argues in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000-word corpus of Italian, including a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora.
Abstract: In this paper we argue in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000-word corpus of Italian. Most papers present approaches to tagging which are statistically based. None of the statistically based analyses, however, produce an accuracy level comparable to the one obtained by means of linguistic rules [1]. Of course their data refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely statistically based approaches are inefficient, basically due to the great sparsity of tag distribution – 50% or less of unambiguous tags when punctuation is subtracted from the total count. In addition, the level of homography is also very high: readings per word are 1.7, compared to 1.07 computed for English by [2] with a similar tagset. The current work includes a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora. In a preliminary experiment with the automatic tagger, we obtained 99.97% accuracy on the training set and 99.03% on the test set using combined approaches: accuracy from statistical tagging alone is well below 95% even on the training set, and the same applies to syntactic tagging. As to the shallow parser, we report on a first preliminary experiment on a manually verified subset of 10,000 words.

Book ChapterDOI
TL;DR: The QUANTUM system relies on computational linguistics as well as information retrieval techniques, and the TREC-X data set and tools are used to evaluate the overall system and each of its components.
Abstract: In this paper, we describe our Question Answering (QA) system called QUANTUM. The goal of QUANTUM is to find the answer to a natural language question in a large document collection. QUANTUM relies on computational linguistics as well as information retrieval techniques. The system analyzes questions using shallow parsing techniques and regular expressions, then selects the appropriate extraction function. This extraction function is then applied to one-paragraph-long passages retrieved by the Okapi information retrieval system. The extraction process involves the Alembic named entity tagger and the WordNet semantic network to identify and score candidate answers. We designed QUANTUM according to the TREC-X QA track requirements; therefore, we use the TREC-X data set and tools to evaluate the overall system and each of its components.

Journal Article
TL;DR: This paper designs and implements a neural-network-based method for automatic prediction of Chinese phrase boundary location; preliminary results show a precision of 93.24% (closed test) and 92.56% (open test), respectively.
Abstract: Prediction of Chinese phrase boundary location is the basis of shallow parsing or chunk parsing. It is also very important for processing real texts. With the support of our Chinese treebank of 64,426 words, this paper designs and implements a method for automatic prediction of Chinese phrase boundary location based on a neural network. The preliminary results show that the precision is 93.24% (closed test) and 92.56% (open test), respectively.
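A minimal sketch of such a boundary predictor, using scikit-learn's MLPClassifier over a toy feature window as a stand-in for the network described in the paper; all features and settings are illustrative assumptions:

```python
# Sketch of neural phrase-boundary prediction: a small feed-forward network
# decides, for each position, whether a phrase boundary follows the token.
# Features, window size and classifier settings are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier

# (POS of current word, POS of next word) -> is there a boundary after the word?
train = [({"p0": "NN", "p1": "VV"}, 1), ({"p0": "DT", "p1": "NN"}, 0),
         ({"p0": "VV", "p1": "DT"}, 1), ({"p0": "JJ", "p1": "NN"}, 0)]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
y = [label for _, label in train]

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
# Predict whether a boundary follows a noun that precedes a verb.
print(clf.predict(vec.transform([{"p0": "NN", "p1": "VV"}])))
```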

Book ChapterDOI
23 Sep 2002
TL;DR: An application of grammatical inference to the task of shallow parsing by learning a deterministic probabilistic automaton that models the joint distribution of Chunk (syntactic phrase) tags and Part-of-speech tags, and using this automaton as a transducer to find the most likely chunk tag sequence using a dynamic programming algorithm.
Abstract: This paper presents an application of grammatical inference to the task of shallow parsing. We first learn a deterministic probabilistic automaton that models the joint distribution of chunk (syntactic phrase) tags and part-of-speech tags, and then use this automaton as a transducer to find the most likely chunk tag sequence using a dynamic programming algorithm. We discuss an efficient means of incorporating lexical information, which automatically identifies particular words that are useful using a mutual information criterion, together with an application of bagging; both improve our results. Though the results are not as high as those of comparable techniques that use models with a fixed structure, the models we learn are very compact and efficient.
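The decoding step, using the learned automaton as a transducer, is essentially Viterbi dynamic programming; the sketch below runs it over a tiny hand-specified model (the probabilities are made up for illustration, whereas the paper learns the model by grammatical inference):

```python
# Sketch of the decoding step: find the most likely chunk-tag sequence for a
# POS-tag sequence with Viterbi dynamic programming.  The toy model below is
# hand-specified; the paper learns it by grammatical inference instead.
import math

CHUNK_TAGS = ["B-NP", "I-NP", "B-VP"]
TRANS = {("B-NP", "I-NP"): 0.5, ("B-NP", "B-VP"): 0.4, ("B-NP", "B-NP"): 0.1,
         ("I-NP", "B-VP"): 0.6, ("I-NP", "I-NP"): 0.3, ("I-NP", "B-NP"): 0.1,
         ("B-VP", "B-NP"): 0.7, ("B-VP", "B-VP"): 0.2, ("B-VP", "I-NP"): 0.1}
EMIT = {("B-NP", "DT"): 0.6, ("B-NP", "NN"): 0.4, ("I-NP", "NN"): 0.9,
        ("I-NP", "DT"): 0.1, ("B-VP", "VBZ"): 1.0}
START = {"B-NP": 0.7, "I-NP": 0.05, "B-VP": 0.25}

def viterbi(pos_tags):
    best = {c: (math.log(START[c]) + math.log(EMIT.get((c, pos_tags[0]), 1e-9)), [c])
            for c in CHUNK_TAGS}
    for pos in pos_tags[1:]:
        new_best = {}
        for c in CHUNK_TAGS:
            score, path = max(
                (best[p][0] + math.log(TRANS.get((p, c), 1e-9))
                 + math.log(EMIT.get((c, pos), 1e-9)), best[p][1] + [c])
                for p in CHUNK_TAGS)
            new_best[c] = (score, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["DT", "NN", "VBZ"]))   # ['B-NP', 'I-NP', 'B-VP']
```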

01 May 2002
TL;DR: The advantages of deploying a shallow parser based on Supertagging in an automatic dialogue system in a call center that basically leaves the initiative with the user as far as (s)he wants are outlined.
Abstract: In this paper we outline the advantages of deploying a shallow parser based on Supertagging in an automatic dialogue system in a call center that basically leaves the initiative with the user as far as (s)he wants (called user-initiative or adaptive in the literature, in contrast to system-initiative dialogue systems). The Supertagger relies on a Hidden Markov model and is trained on German input texts. The design of a Hidden Markov-based Supertagger with trigrams forms the central topic of this paper. The performance of our German Supertagger lags behind that of the English one; some of the reasons are addressed later on. Nevertheless, shallow parsing with the Supertags increases accuracy compared to a basic version of KoHDaS that relies only on recurrent plausibility networks.


Proceedings Article
01 Jan 2002
TL;DR: This paper proposes a layered Finite State Transducer (FST) framework integrating hierarchical supra-lexical linguistic knowledge into speech recognition based on shallow parsing, and shows that with a higher order top-level n-gram model, pre-composition and optimization of the FSTs are highly restricted by the computational resources available.
Abstract: This paper proposes a layered Finite State Transducer (FST) framework integrating hierarchical supra-lexical linguistic knowledge into speech recognition based on shallow parsing. The shallow parsing grammar is derived directly from the full-fledged grammar for natural language understanding, and augmented with top-level n-gram probabilities and phrase-level context-dependent probabilities, which go beyond the standard context-free grammar (CFG) formalism. Such a shallow parsing approach can help balance sufficient grammar coverage and tight structural constraints. The context-dependent probabilistic shallow parsing model is represented by layered FSTs, which can be integrated with speech recognition seamlessly to impose early phrase-level structural constraints consistent with natural language understanding. It is shown that in the JUPITER [1] weather information domain, the shallow parsing model achieves lower recognition word error rates than a regular class n-gram model of the same order. However, we find that, with a higher-order top-level n-gram model, pre-composition and optimization of the FSTs are highly restricted by the available computational resources. Given the potential of such models, it may be worth pursuing an incremental approximation strategy [2], which includes part of the linguistic model FST in early optimization while introducing the complete model through dynamic composition.

01 Jan 2002
TL;DR: This stylebook gives an overview of the various categories annotated in the system's layers of chunks, topological fields and clauses, and mentions the methodology of the annotation process where it affects the annotation scheme.
Abstract: The presented system provides a shallow syntactic annotation for unrestricted German text. It requires POS-annotated text and annotates the layers of chunks, topological fields and clauses. This stylebook gives an overview of the various categories annotated in those different layers. The methodology of the annotation process is mentioned in those cases where it has an impact on the annotation scheme. Example sentences are taken from real language data, but were simplified where necessary.

Proceedings Article
01 May 2002
TL;DR: This paper presents an evaluation of four shallow parsers and attempts to demonstrate the interest of observing the ‘common boundaries’ produced by different parsers as good indices for the evaluation of these algorithms.
Abstract: This paper presents an evaluation of four shallow parsers. The interest of each of these parsers led us to design a parameterized multiplexer for syntactic information, based on the principle of merging the common boundaries of the outputs produced by each of these programs. The question of evaluating the parsers as well as the multiplexer came to the foreground because we did not own reference corpora. We attempt here to demonstrate the interest of observing the ‘common boundaries’ produced by different parsers as good indices for the evaluation of these algorithms. Such an evaluation is proposed and tested in two experiments.
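A minimal sketch of the boundary-merging idea: each parser's chunking is reduced to a set of boundary positions, and positions proposed by several parsers are kept. The voting threshold is an illustrative assumption:

```python
# Sketch of boundary multiplexing: reduce each parser's chunking to the set of
# token positions where a boundary falls, then keep positions proposed by at
# least `min_votes` parsers.  The voting threshold is an illustrative assumption.
from collections import Counter

def boundaries(chunks):
    """chunks: list of (start, end) token spans -> set of boundary positions."""
    positions = set()
    for start, end in chunks:
        positions.update((start, end))
    return positions

def common_boundaries(parser_outputs, min_votes=2):
    votes = Counter(pos for output in parser_outputs for pos in boundaries(output))
    return sorted(pos for pos, count in votes.items() if count >= min_votes)

parser_a = [(0, 2), (2, 4), (4, 6)]
parser_b = [(0, 2), (2, 6)]
parser_c = [(0, 3), (3, 6)]
print(common_boundaries([parser_a, parser_b, parser_c]))   # [0, 2, 6]
```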

Proceedings ArticleDOI
24 Aug 2002
TL;DR: A context-sensitive electronic dictionary that provides translations for any piece of text displayed on a computer screen, without requiring user interaction is introduced through a process of three phases: text acquisition from the screen, morpho-syntactic analysis of the context of the selected word, and the dictionary lookup.
Abstract: This paper introduces a context-sensitive electronic dictionary that provides translations for any piece of text displayed on a computer screen, without requiring user interaction. This is achieved through a process of three phases: text acquisition from the screen, morpho-syntactic analysis of the context of the selected word, and dictionary lookup. As with other similar tools available, this program usually works with dictionaries adapted from one or more printed dictionaries. To implement context-sensitive features, however, traditional dictionary entries need to be restructured. By splitting entries into smaller pieces and indexing them in a special way, the program is able to display a restricted set of information that is relevant to the context. Based on the information in the dictionaries, the program is able to recognize---even discontinuous---multiword expressions on the screen. The program has three major features which we believe make it unique for the time being, and on which the development focused: linguistic flexibility (stemming, morphological analysis and shallow parsing), open architecture (three major architectural blocks, all replaceable along publicly documented APIs), and a flexible user interface (replaceable dictionaries, direct user feedback). In this paper, we first assess the functional requirements of a context-sensitive dictionary; then we explain the program's three phases of operation, focusing on the implementation of the lexicons and the context-sensitive features. We conclude the paper by comparing our tool to other similar publicly available products, and summarize plans for future development.
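The discontinuous multiword-expression lookup can be sketched as follows; the toy lexicon entry, the trivial stemmer and the gap limit are assumptions, not the product's implementation:

```python
# Sketch of discontinuous multiword-expression lookup: the words of an entry
# must all appear, in order, within a limited gap around the selected word.
# The toy lexicon, the trivial "stemmer" and the gap limit are assumptions.

LEXICON = {("take", "into", "account"): "figyelembe vesz"}   # illustrative entry with a Hungarian gloss

def stem(word):
    return word.lower().rstrip("s")          # placeholder for real morphology

def find_mwe(context_words, max_gap=2):
    stems = [stem(w) for w in context_words]
    for entry, translation in LEXICON.items():
        i, last = 0, None
        for pos, s in enumerate(stems):
            if i < len(entry) and s == entry[i] and (last is None or pos - last <= max_gap + 1):
                i, last = i + 1, pos
        if i == len(entry):
            return entry, translation
    return None

print(find_mwe(["takes", "the", "figures", "into", "account"]))
# (('take', 'into', 'account'), 'figyelembe vesz')
```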

01 Jan 2002
TL;DR: A text processing system uses shallow parsing techniques to extract information from sentences in text documents and stores frames of information in a knowledge base, approaching more complete text understanding in a practical way that does not require expensive processing such as full parsing of the documents.
Abstract: The system described in this paper automatically extracts and stores information from documents. We have implemented a text processing system that uses shallow parsing techniques to extract information from sentences in text documents and stores frames of information in a knowledge base. We intend to use this system in two main application areas: open-domain Question Answering (Q&A) and specific-domain information extraction. Extraction from Documents: The system described in this paper uses a Natural Language Processing system developed at the Center for Natural Language Processing to extract information from documents and store it in a knowledge base. In the past, applications were aimed at MUC-style information extraction that filled in templates of specific types of information. Our current goal is to produce a system that can extract generic frames of information about all entities and events in the sentences of the text and represent relationships between them. This type of system approaches more complete text understanding in a practical way that does not require expensive processing such as full parsing of the documents. The heart of the generic extraction system is a set of rules written for a finite-state system that recognizes patterns of text. These rules are applied in several phases, including part-of-speech tagging, bracketing of noun phrases, and categorization of proper noun phrases. Later phases recognize the surface structure of phrases in each sentence and map the phrases to the case frame of the verbs, recognizing the phrases taking the roles of agent, object, point-in-time, etc., and creating a frame representing an “event”. The case roles are similar to those in case grammars (Fillmore 1968). Consider the example sentence: “In addition to these most recent incidents, the Abu Sayyaf have bought Russian uranium on Basilan Island.”
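A reduced sketch of the frame-building step: a pattern over the bracketed phrase sequence maps phrases to case roles such as agent and object. The single pattern and the role names are illustrative, not the system's actual rule set:

```python
# Sketch of mapping a bracketed sentence to an event frame with case roles.
# The single NP-VP-NP(-PP) pattern and the role names are illustrative
# assumptions standing in for the system's finite-state rule phases.

def build_event_frame(phrases):
    """phrases: list of (label, text); expects NP VP NP [PP-LOC] [PP-TIME]."""
    frame = {}
    labels = [label for label, _ in phrases]
    if labels[:3] == ["NP", "VP", "NP"]:
        frame["agent"] = phrases[0][1]
        frame["action"] = phrases[1][1]
        frame["object"] = phrases[2][1]
        for label, text in phrases[3:]:
            if label == "PP-LOC":
                frame["location"] = text
            elif label == "PP-TIME":
                frame["point-in-time"] = text
    return frame

phrases = [("NP", "the Abu Sayyaf"), ("VP", "have bought"),
           ("NP", "Russian uranium"), ("PP-LOC", "on Basilan Island")]
print(build_event_frame(phrases))
```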

Journal Article
TL;DR: This work was partially funded by the CICYT projects TIC2000-0664-C02-01 and TIC2000-1599-C01-01.
Abstract: This work was partially funded by the CICYT projects TIC2000-0664-C02-01 and TIC2000-1599-C01-01.

DOI
01 Jan 2002
TL;DR: An efficient FPGA-based coprocessor for natural language syntactic analysis that can deal with inputs in the form of word lattices is proposed, together with an interface between the hardware tool and a potential natural language software application running on the desktop computer.
Abstract: This thesis is at the crossroads of Natural Language Processing (NLP) and digital circuit design. It aims at delivering a custom hardware coprocessor for accelerating natural language parsing. The coprocessor has to parse real-life natural language and is targeted to be useful in several NLP applications that are time-constrained or need to process large amounts of data. More precisely, the three goals of this thesis are: (1) to propose an efficient FPGA-based coprocessor for natural language syntactic analysis that can deal with inputs in the form of word lattices, (2) to implement the coprocessor in a hardware tool ready for integration within an ordinary desktop computer and (3) to offer an interface (i.e. a software library) between the hardware tool and a potential natural language software application running on the desktop computer. The Field Programmable Gate Array (FPGA) technology has been chosen as the core of the coprocessor implementation due to its ability to efficiently exploit all levels of parallelism available in the implemented algorithms in a cost-effective solution. In addition, the FPGA technology makes it possible to efficiently design and test such a hardware coprocessor. A final reason is that future general-purpose processors are expected to contain reconfigurable resources. In such a context, an IP core implementing an efficient context-free parser, ready to be configured within the reconfigurable resources of a general-purpose processor, would support any application relying on context-free parsing and running on that processor. The context-free grammar parsing algorithms that have been implemented are the standard CYK algorithm and an enhanced version of the CYK algorithm developed at the EPFL Artificial Intelligence Laboratory. These algorithms were selected (1) for their intrinsic properties of regular data flow and data processing, which make them well suited for a hardware implementation, (2) for their property of producing partial parse trees, which makes them suitable for further shallow parsing, and (3) for their ability to parse word lattices.
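Since the implemented algorithm is standard CYK, a plain-software reference version is easy to state; the toy grammar in Chomsky normal form below is an illustrative assumption:

```python
# Reference (software) sketch of standard CYK recognition over a toy grammar in
# Chomsky normal form; the coprocessor realizes the same dynamic-programming
# table in hardware.  The grammar and sentence are illustrative assumptions.

UNARY = {("the",): {"DT"}, ("dog",): {"NN"}, ("barks",): {"VBZ", "VP"}}
BINARY = {("DT", "NN"): {"NP"}, ("NP", "VP"): {"S"}}

def cyk(words):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(UNARY.get((w,), set()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= BINARY.get((left, right), set())
    return table[0][n]

print(cyk(["the", "dog", "barks"]))   # {'S'} -- the sentence is accepted
```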

Proceedings ArticleDOI
24 Aug 2002
TL;DR: A parser for robust and flexible interpretation of user utterances in a multi-modal system for web search in newspaper databases that integrates shallow parsing techniques with knowledge-based text retrieval to allow for robust processing and coordination of input modes.
Abstract: We describe a parser for robust and flexible interpretation of user utterances in a multi-modal system for web search in newspaper databases. Users can speak or type, and they can navigate and follow links using mouse clicks. Spoken or written queries may combine search expressions with browser commands and search space restrictions. In interpreting input queries, the system has to be fault-tolerant to account for spontaneous speech phenomena as well as typing or speech recognition errors, which often distort the meaning of the utterance and are difficult to detect and correct. Our parser integrates shallow parsing techniques with knowledge-based text retrieval to allow for robust processing and coordination of input modes. Parsing relies on a two-layered approach: typical meta-expressions like those concerning search, newspaper types and dates are identified and excluded from the search string to be sent to the search engine. The search terms which are left after preprocessing are then grouped according to co-occurrence statistics which have been derived from a newspaper corpus. These co-occurrence statistics concern typical noun phrases as they appear in newspaper texts.
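A hypothetical sketch of the two-layered interpretation: meta-expressions are peeled off with patterns, then the remaining terms are grouped by co-occurrence. The patterns and the toy co-occurrence scores are assumptions:

```python
# Sketch of the two-layered query interpretation: first peel off meta-expressions
# with patterns, then group the remaining search terms by co-occurrence.
# The patterns and the toy co-occurrence scores are illustrative assumptions.
import re

META_PATTERNS = [
    (re.compile(r"\b(in|from) (19|20)\d\d\b"), "date_restriction"),
    (re.compile(r"\bsearch (for|the)\b"), "search_command"),
]
COOCCURRENCE = {frozenset({"federal", "election"}): 8.2,
                frozenset({"election", "results"}): 5.1}

def interpret(query, threshold=4.0):
    meta = []
    for pattern, label in META_PATTERNS:
        match = pattern.search(query)
        if match:
            meta.append((label, match.group(0)))
            query = pattern.sub(" ", query)
    terms = [w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 2]
    groups = [sorted(pair) for pair, score in COOCCURRENCE.items()
              if pair <= set(terms) and score >= threshold]
    return meta, terms, groups

print(interpret("search for federal election results from 1998"))
```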

Journal ArticleDOI
TL;DR: A unified technique to solve different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM), consisting of the incorporation of the relevant information for each task into the models.
Abstract: We present a unified technique to solve different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM). This technique consists of the incorporation of the relevant information for each task into the models.

Proceedings ArticleDOI
26 Nov 2002
TL;DR: In this article, the authors define criteria and a method, based on users' verbal reports, to assess the natural level of MMI, and design demonstration software to automatically process the verbal reports.
Abstract: This study concerns the concept of natural MMI. The purpose of this work is twofold: first, to define criteria and a method to assess MMI; this method, based on verbal reports, has to measure the natural level of MMI. Second, to design demonstration software to assess the natural level of MMI. The demonstration software automatically processes the verbal reports of users.