
Showing papers on "Shallow parsing" published in 2001


Posted Content
TL;DR: This article studied the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints, and developed two general approaches for an important subproblem: identifying phrase structure.
Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem: identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.
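As a concrete (and much simplified) illustration of imposing sequential constraints on per-token classifier outputs, the sketch below runs Viterbi decoding over B/I/O phrase tags while forbidding the illegal transition O -> I. The scores are invented; the paper's extended HMMs and constraint-satisfaction formulations are considerably richer than this.

```python
# Minimal sketch: constrained Viterbi over BIO phrase tags, given
# per-token classifier scores. Illustrative only, not the paper's model.
TAGS = ["B", "I", "O"]

def legal(prev, cur):
    # A phrase-internal tag must continue a phrase opened by B or I.
    return not (cur == "I" and prev == "O")

def viterbi(scores):
    """scores: one dict per token mapping tag -> classifier score."""
    # A sentence cannot open with "I".
    paths = {t: ([t], scores[0].get(t, 0.0)) for t in TAGS if t != "I"}
    for obs in scores[1:]:
        new = {}
        for cur in TAGS:
            options = [(p + [cur], s + obs.get(cur, 0.0))
                       for p, s in paths.values() if legal(p[-1], cur)]
            if options:
                new[cur] = max(options, key=lambda x: x[1])
        paths = new
    return max(paths.values(), key=lambda x: x[1])[0]

# Hypothetical scores for a three-token sentence.
scores = [{"B": 0.7, "O": 0.3}, {"I": 0.6, "O": 0.4}, {"O": 0.9, "B": 0.1}]
print(viterbi(scores))  # ['B', 'I', 'O']
```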

204 citations


Journal ArticleDOI
TL;DR: A system called ACROMED is presented; it is part of a set of Information Extraction tools designed for processing and extracting information from abstracts in the Medline database, and on biomedical texts its performance is better than that of acronym extraction systems designed for unrestricted text.
Abstract: Acronyms are widely used in biomedical and other technical texts. Understanding their meaning constitutes an important problem in the automatic extraction and mining of information from text. Here we present a system called ACROMED that is part of a set of Information Extraction tools designed for processing and extracting information from abstracts in the Medline database. In this paper, we present the results of two strategies for finding the long forms of acronyms in biomedical texts. These strategies differ from previous automated acronym extraction methods by being tuned to the complex phrase structures of the biomedical lexicon and by incorporating shallow parsing of the text into the acronym recognition algorithm. The performance of our system was tested on several data sets, achieving 72% recall with 97% precision. On biomedical texts, these results are better than the performance of acronym extraction systems designed for unrestricted text.
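ACROMED aligns acronyms against shallow-parsed noun phrases; the sketch below shows only the core character-alignment idea that such systems build on (a simplified heuristic with an invented search window, not the authors' algorithm).

```python
# Simplified sketch of acronym/long-form matching: scan leftward from a
# parenthesized acronym, requiring each acronym letter to appear, in
# order, in the preceding words. Illustrative heuristic only.
import re

def find_long_form(text):
    pairs = []
    for m in re.finditer(r"\(([A-Z][A-Za-z]{1,9})\)", text):
        acro = m.group(1)
        words = text[:m.start()].split()
        window = words[-len(acro) * 2:]   # generous, invented window size
        i = len(acro) - 1                 # match letters right-to-left
        j = len(window) - 1
        while i >= 0 and j >= 0:
            if acro[i].lower() in window[j].lower():
                i -= 1
            j -= 1
        if i < 0:
            pairs.append((acro, " ".join(window[j + 1:])))
    return pairs

print(find_long_form("Analysis of the polymerase chain reaction (PCR) products"))
# [('PCR', 'polymerase chain reaction')]
```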

135 citations


Proceedings ArticleDOI
06 Jul 2001
TL;DR: It is concluded that directly learning to perform these tasks as shallow parsers do is advantageous over full parsers, both in terms of performance and in robustness to new and lower-quality texts.
Abstract: A significant amount of work has been devoted recently to developing learning techniques that can be used to generate partial (shallow) analyses of natural language sentences rather than a full parse. In this work we set out to evaluate whether this direction is worthwhile by comparing a learned shallow parser to one of the best learned full parsers on tasks both can perform: identifying phrases in sentences. We conclude that directly learning to perform these tasks as shallow parsers do is advantageous over full parsers, both in terms of performance and in robustness to new and lower-quality texts.
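Comparisons like this one are typically scored with chunk-level precision, recall, and F1, where a predicted phrase counts as correct only if both its boundaries and its type match the gold standard exactly. A minimal sketch over BIO-encoded tags, with toy data:

```python
# Chunk-level precision/recall/F1 for BIO tags like "B-NP", "I-NP", "O".
def chunks(bio_tags):
    """Extract (start, end, type) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(bio_tags + ["O"]):   # sentinel closes last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, bio_tags[start].split("-")[1]))
            start = None
        if tag.startswith("B-"):
            start = i
    return set(spans)

def prf(gold_tags, pred_tags):
    gold, pred = chunks(gold_tags), chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["B-NP", "I-NP", "O", "B-VP", "B-NP"]
pred = ["B-NP", "I-NP", "O", "B-VP", "O"]
print(prf(gold, pred))  # (1.0, 0.666..., 0.8)
```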

76 citations


Journal Article
TL;DR: The ProBot is interesting in its link to an underlying engine capable of implementing deeper reasoning, which is usually not present in conversational agents based on shallow parsing.
Abstract: This paper describes a conversational agent, called “ProBot”, that uses a novel structure for handling context. The ProBot is implemented as a rule-based system embedded in a Prolog interpreter. The rules consist of patterns and responses, where each pattern matches a user’s input sentence and the response is an output sentence. Both patterns and responses may have attached Prolog expressions that act as constraints in the patterns and can invoke some action when used in the response. The main contributions of this work are in the use of hierarchies of contexts to handle unexpected inputs. The ProBot is also interesting in its link to an underlying engine capable of implementing deeper reasoning, which is usually not present in conversational agents based on shallow parsing.
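ProBot's rules live in Prolog with attached constraint expressions; the Python sketch below (with invented rules) only illustrates the pattern/response idea and how a context hierarchy lets unexpected inputs fall through to more general rules.

```python
# Toy pattern/response matcher with a context hierarchy: the most
# specific context is tried first, then enclosing contexts, so an
# unexpected input falls through to a general catch-all rule.
import re

RULES = {   # hypothetical rule base
    "booking": [(r".*\bcancel\b.*", "OK, cancelling your booking.")],
    "top": [
        (r".*\bhello\b.*", "Hello! How can I help?"),
        (r".*", "Sorry, I did not understand that."),   # catch-all
    ],
}

def respond(sentence, context_stack):
    # Walk from the most specific (innermost) context outward.
    for ctx in reversed(context_stack):
        for pattern, response in RULES.get(ctx, []):
            if re.fullmatch(pattern, sentence, re.IGNORECASE):
                return response
    return None

print(respond("I want to cancel", ["top", "booking"]))  # booking rule fires
print(respond("hello there", ["top", "booking"]))       # falls through to top
```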

54 citations


01 Jan 2001
TL;DR: In this article, the authors present a system called Acromed which finds acronym-meaning pairs as part of a set of information extraction tools designed for processing and extracting data from abstracts in the Medline database.
Abstract: Acronyms are widely used in biomedical and other technical texts. Understanding their meaning constitutes an important problem in the automatic extraction and mining of information from text. Moreover, an even harder problem is sense disambiguation of acronyms; that is, where a single acronym, termed a polynym, has a multiplicity of meanings, a common occurrence in the biomedical literature. In such cases, it is necessary to identify the correct corresponding sense for the polynym, which is often not directly specified in the text. Here we present a system called Acromed which finds acronym-meaning pairs as part of a set of information extraction tools designed for processing and extracting data from abstracts in the Medline database. Our strategy for finding acronym-meaning pairs differs from previous automated acronym extraction methods by incorporating shallow parsing of the text into the acronym recognition algorithm. The performance of our system has been tested on a highly diverse set of Medline texts, giving the highest precision and recall results in the literature thus far. We then present Polyfind, an algorithm for disambiguating polynyms, which uses a vector space model. Our disambiguation tests produced 97.62% accuracy in one test (on acronyms) and 86.6% accuracy in another (on aliases).
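A hedged sketch of vector-space disambiguation in the spirit of Polyfind: represent the polynym's context and each candidate sense as bag-of-words vectors and pick the sense with the highest cosine similarity. The sense profiles and weighting below are invented; the authors' feature set is their own.

```python
# Bag-of-words cosine similarity for polynym disambiguation (sketch).
from collections import Counter
from math import sqrt

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def disambiguate(context, sense_profiles):
    ctx = Counter(context.lower().split())
    scored = {sense: cosine(ctx, Counter(text.lower().split()))
              for sense, text in sense_profiles.items()}
    return max(scored, key=scored.get)

senses = {   # hypothetical sense profiles for the polynym "PCR"
    "polymerase chain reaction": "dna amplification primer polymerase cycle",
    "protein catabolic rate": "dialysis urea protein nitrogen clearance",
}
print(disambiguate("PCR amplification of the DNA primer region", senses))
# polymerase chain reaction
```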

38 citations


Book ChapterDOI
11 Sep 2001
TL;DR: This work presents a two-level stochastic model approach to the construction of the natural language understanding component of a dialog system in the domain of database queries, which answers queries about a railway timetable in Spanish.
Abstract: Over the last few years, stochastic models have been widely used in natural language understanding modeling. Almost all of this work is based on the definition of segments of words as basic semantic units for the stochastic semantic models. In this work, we present a two-level stochastic model approach to the construction of the natural language understanding component of a dialog system in the domain of database queries. This approach treats the problem in a way similar to the stochastic approach to detecting syntactic structures (shallow parsing or chunking) in natural language sentences; in this case, however, the stochastic semantic language models are based on the detection of semantic units in the user turns of the dialog. We give the results of applying this approach to the construction of the understanding component of a dialog system that answers queries about a railway timetable in Spanish.
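A toy sketch of the two-level idea: words in a (translated) user turn are first grouped into semantic units, much as chunking groups words into phrases, and the unit sequence then instantiates a query frame. The paper learns both levels stochastically; the lexicon and frame below are invented for illustration.

```python
# Level 1: tag words with semantic units; level 2: build a query frame.
UNIT_LEXICON = {   # hypothetical cue words per semantic unit
    "departure": {"from", "leave", "leaves"},
    "arrival": {"to", "arrive"},
    "time": {"when", "time", "morning"},
}

def segment(turn):
    units = []
    for word in turn.lower().split():
        for unit, cues in UNIT_LEXICON.items():
            if word in cues:
                units.append((unit, word))
    return units

def to_frame(units):
    return {"query": "timetable", "slots": [u for u, _ in units]}

turn = "When does the train leave from Valencia to Madrid"
print(to_frame(segment(turn)))
# {'query': 'timetable', 'slots': ['time', 'departure', 'departure', 'arrival']}
```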

19 citations


Proceedings Article
01 Jan 2001
TL;DR: This work makes extensive use of the Alembic named-entity tagger and the WordNet semantic network to extract candidate answers from retrieved one-paragraph-long passages, and deals with the possibility of no-answer questions by looking for a significant score drop between the extracted candidate answers.
Abstract: We participated in the TREC-X QA main task and list task with a new system named QUANTUM, which analyzes questions with shallow parsing techniques and regular expressions. Instead of using a question classification based on entity types, we classify the questions according to generic mechanisms (which we call extraction functions) for the extraction of candidate answers. We take advantage of the Okapi information retrieval system for one-paragraph-long passage retrieval. We make extensive use of the Alembic named-entity tagger and the WordNet semantic network to extract candidate answers from those passages. We deal with the possibility of no-answer questions (NIL) by looking for a significant score drop between the extracted candidate answers.
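A minimal sketch of the NIL heuristic described above: rank candidate answers by score and return the top answer only when its score drops off sharply to the runner-up. The threshold and exact criterion are invented; QUANTUM's actual test may differ.

```python
# Return the top-ranked answer only if it stands clearly above the rest.
def answer_or_nil(candidates, drop_ratio=0.5):
    """candidates: (answer, score) pairs; drop_ratio is a made-up threshold."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return "NIL"
    if len(ranked) == 1 or ranked[1][1] <= drop_ratio * ranked[0][1]:
        return ranked[0][0]   # significant score drop: confident answer
    return "NIL"              # flat scores: no candidate stands out

print(answer_or_nil([("Ottawa", 2.4), ("Toronto", 0.9)]))  # Ottawa
print(answer_or_nil([("Ottawa", 0.8), ("Toronto", 0.7)]))  # NIL
```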

18 citations


Proceedings Article
28 Jun 2001
TL;DR: A cross-language retrieval system which integrates shallow parsing and lexical semantic databases in an interactive approach to information access, optimising the use of simple and robust Natural Language resources and techniques to facilitate cross-language information access.
Abstract: This paper presents a cross-language retrieval system which integrates shallow parsing and lexical semantic databases in an interactive approach to information access. At indexing time, the system extracts a list of phrases for every language in the collection. At search time, the system bridges the gap between the user's query and the relevant phrases in the collection in any language, expanding and translating individual terms and retaining the phrases that are actually relevant in the collection. The user can access information via a standard ranked list of documents or via a hierarchy of phrasal information, in which the selection of a phrase modifies the ranked list and provides access to the documents related to the phrase. This interactive setting, we believe, optimises the use of simple and robust Natural Language resources and techniques to facilitate cross-language information access.

16 citations


Journal ArticleDOI
TL;DR: The structure of written Thai is highly ambiguous, requiring more sophisticated techniques than are necessary for comparable IE tasks in most European languages, along with large amounts of domain knowledge to cope with these ambiguities.
Abstract: The development of an information extraction (IE) system for Thai documents raises a number of issues which are not important for IE in English and other European languages. We describe the characteristics of written Thai, the problem statement, and our approach to the Thai IE system. The structure of written Thai is highly ambiguous, requiring more sophisticated techniques than are necessary for comparable IE tasks in most European languages, along with large amounts of domain knowledge to cope with these ambiguities. The basic design of this system is to provide different natural language components for analyzing the surface structure of the documents. These components include word segmentation, identification of specific lexical structure terms, and part-of-speech tagging. Further analysis performs shallow parsing over the relevant regions that contain the specific trigger terms or patterns specified in the extraction templates. Finally, the information of interest is extracted from the resulting grammar trees according to predefined concept definitions, and the user is returned a list of answers for each concept.

15 citations


Proceedings Article
01 Jan 2001
TL;DR: This work introduces shapaqa, a shallow parsing approach to online, open-domain question answering on the WorldWideWeb that uses a memory-based shallow parser to analyze web pages retrieved using normal keyword search on a search engine.
Abstract: We introduce shapaqa, a shallow parsing approach to online, open-domain question answering on the WorldWideWeb. Given a form-based natural language question as input, the system uses a memory-based shallow parser to analyze web pages retrieved using normal keyword search on a search engine. Two versions of the system are evaluated on a test set of 200 questions. In combination with two back-off methods a mean reciprocal rank of .46 is achieved.
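The mean reciprocal rank (MRR) reported above scores each question by the reciprocal of the rank of its first correct answer (0 when no correct answer is returned) and averages over all questions. A minimal sketch with invented ranks:

```python
# Mean reciprocal rank over a set of questions.
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: for each question, the rank of the first
    correct answer, or None if no correct answer was returned."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

print(mean_reciprocal_rank([1, 2, None, 5]))  # (1 + 0.5 + 0 + 0.2) / 4 = 0.425
```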

14 citations


Proceedings Article
01 Jan 2001
TL;DR: It is found that the concepts (themes) extracted by Oracle Text can be used to aggregate document information content to simplify statistical processing.
Abstract: Oracle's objective in TREC-10 was to study the behavior of Oracle information retrieval in previously unexplored application areas. The software used was Oracle9i Text, Oracle's full-text retrieval engine integrated with the Oracle relational database management system, and the Oracle PL/SQL procedural programming language. Runs were submitted in the filtering and Q/A tracks. For the filtering track we submitted three runs, in adaptive filtering, batch filtering, and routing. By comparing the TREC results, we found that the concepts (themes) extracted by Oracle Text can be used to aggregate document information content to simplify statistical processing. Oracle's Q/A system integrated information retrieval (IR) and information extraction (IE). The Q/A system relied on a combination of document and sentence ranking in IR, named-entity tagging in IE, and shallow-parsing-based classification of questions into predefined categories.

Book
02 Nov 2001
TL;DR: The parsing algorithm implemented by Cico is described formally, with some experimental data on its performance, and a complete user manual is provided for cico3, an implementation of the Cico algorithm, and for a number of associated tools.
Abstract: Domain-based parsing is a shallow parsing technique that exploits knowledge about domain-specific properties of terms in order to determine "optimal" parse trees for natural language sentences. Cico is a simple parser using domain-based parsing. It is particularly well suited for parsing natural language sentences of a technical nature (e.g., requirements documents for software systems), as in this case several simplifying assumptions hold, and it has been used successfully in several experiments in the requirements engineering field. In the first part of this report, we formally describe the parsing algorithm implemented by Cico and give some experimental data on its performance. In the second part, we provide a complete user manual for cico3, an implementation of the Cico algorithm, and for a number of associated tools. Finally, in the third part, we present some illustrative examples taken from real applications.

Jonathan H. Connell
01 Jan 2001
TL;DR: This paper proposes a specific linguistic-based format for semantic networks in which nodes correspond to “open class” words and morphological elements form the basis for atomic link labels and node tags.
Abstract: This paper proposes a specific linguistic-based format for semantic networks in which nodes correspond to “open class” words. “Closed class” words and morphological elements form the basis for atomic link labels and node tags. A simple parser has been developed to transform written text into this representation. The properties of the resulting networks are discussed and psychologically inspired limited-horizon browsing techniques are examined.

01 Jan 2001
TL;DR: A part-of-speech tagger for Czech is described that employs the DIS shallow parser for Czech, manually coded rules, and inductive logic programming.
Abstract: A part-of-speech tagger for Czech is described that employs the DIS shallow parser for Czech, manually coded rules, and inductive logic programming.

01 May 2001
TL;DR: In this paper, well-known state-of-the-art data-driven algorithms are applied to part-of-speech tagging and shallow parsing of Swedish texts.
Abstract: In this paper, well-known state-of-the-art data-driven algorithms are applied to part-of-speech tagging and shallow parsing of Swedish texts.

Proceedings Article
01 Jan 2001
TL;DR: Three data-driven algorithms are applied to shallow parsing of Swedish texts by using PoS taggers as the basis for parsing, showing that the best performance is obtained by training on PoS tags with labels marking the phrasal constituents, without considering the words themselves.
Abstract: Three data-driven algorithms are applied to shallow parsing of Swedish texts, using PoS taggers as the basis for parsing. The constituent structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to. The results show that the best performance is obtained by training on PoS tags with labels marking the phrasal constituents, without considering the words themselves. Transformation-based learning gives the highest accuracy (94.44%), followed by the Maximum Entropy framework (mxpost) (92.47%) and the Hidden Markov model (TnT) (92.42%).
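The finding that PoS tags alone carry most of the signal can be probed with the classic baseline: assign each PoS tag the chunk label it most frequently receives in training. A sketch with invented toy data (the real systems above are far stronger learners):

```python
# Most-frequent-chunk-per-PoS baseline for shallow parsing.
from collections import Counter, defaultdict

train = [("DT", "B-NP"), ("NN", "I-NP"), ("VB", "B-VP"), ("DT", "B-NP"),
         ("JJ", "I-NP"), ("NN", "I-NP")]   # toy (PoS, chunk) pairs

counts = defaultdict(Counter)
for pos, chunk in train:
    counts[pos][chunk] += 1
model = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

test = ["DT", "JJ", "NN", "VB"]
print([model.get(pos, "O") for pos in test])
# ['B-NP', 'I-NP', 'I-NP', 'B-VP']
```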

01 Jan 2001
TL;DR: The aim of this paper is not to describe the program itself but rather to present the linguist's point of view: how to detect discontinuities, that is, how to decide whether there is completion or rupture.
Abstract: Continuous media – stories, movies, songs – all have a basic linear structure from which cognitive processes are able to retrieve some temporal organization. How much semantic computation is necessarily involved in a proper framing of events and their transitions from one to another? Can this computation be approximated with the help of simple formal clues from a shallow parsing of the story stream, and how far can it go? Our experiments with a prototype application implement a method for segmenting written stories and splicing together "referential situations" that should belong to the same time-frame. This paper does not aim to describe the implementation but rather discusses the linguistic approach to detecting discontinuity in narrative texts, based on the principles of closure and rupture in temporal consistency.

Journal Article
TL;DR: In this paper, the authors focus on the integration of NLP techniques for efficient textual database retrieval as part of the VLSHDS project (Very Large Scale Hypermedia Delivery System).
Abstract: Improvements in hardware, communication technology, and databases have led to an explosion of multimedia information repositories. In order to improve the quality of information retrieval compared to existing advanced document management systems, research has shown that it is necessary to consider vertical integration of retrieval techniques inside the database service architecture. This paper focuses on the integration of NLP techniques for efficient textual database retrieval as part of the VLSHDS project (Very Large Scale Hypermedia Delivery System). One target of this project is to increase the quality of textual information search (precision/recall) compared to existing multilingual IR systems by applying morphological analysis and phrase-level shallow parsing to document and query processing. The scope of this paper is limited to Thai documents. The underlying system is the Active HYpermedia Delivery System (AHYDS) framework, which provides the delivery service over the Internet. Based on 1100 Thai documents, as first results, our approach improved precision and recall from 72.666% and 56.67% in the initial implementation (without NLP techniques) to 85.211% and 76.876%, respectively.

Book ChapterDOI
03 Sep 2001
TL;DR: The authors propose the Pyramidal Digest, a model for automated composite text digesting that combines traditional text summarization and text classification: the digest not only serves as a summary but is also able to classify text segments of any given size and answer queries relative to a context.
Abstract: We present a novel model of automated composite text digest, the Pyramidal Digest. The model integrates traditional text summarization and text classification in that the digest not only serves as a "summary" but is also able to classify text segments of any given size, and answer queries relative to a context. "Pyramidal" refers to the fact that the digest is created in at least three dimensions: scope, granularity, and scale. The Pyramidal Digest is defined recursively as a structure of extracted and abstracted features that are obtained gradually -- from specific to general, and from large to small text segment size -- through a combination of shallow parsing and machine learning algorithms. There are three noticeable threads of learning taking place: learning of characteristic relations, rhetorical relations, and lexical relations. Our model provides a principle for efficiently digesting large quantities of text: progressive learning can digest text by abstracting its significant features. This approach scales, with complexity bounded by O(n log n), where n is the size of the text. It offers a standard and systematic way of collecting as many semantic features as possible that are reachable by shallow parsing. It enables readers to query beyond keyword matches.
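The O(n log n) bound is consistent with processing the text at roughly log n levels of granularity, touching each of the n units once per level. A toy sketch of such a pyramid follows, using plain keyword counts as stand-in "features"; the paper's extracted features (characteristic, rhetorical, and lexical relations) are much richer.

```python
# Toy pyramid over a token stream: one feature summary per segment,
# with segment size halving at each level (log n levels, O(n) work each).
from collections import Counter

def pyramid(tokens, min_size=2):
    levels, size = [], len(tokens)
    while size >= min_size:
        segments = [Counter(tokens[i:i + size])
                    for i in range(0, len(tokens), size)]
        levels.append((size, segments))
        size //= 2
    return levels

for size, segs in pyramid("a b a c b a b b".split()):
    print(size, [dict(s) for s in segs])
```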