
Showing papers on "Shallow parsing published in 2004"


Journal ArticleDOI
TL;DR: This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms.
Abstract: Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms. First, we present a system for corpus-based acquisition of terminological relationships through discursive patterns. This system is built on previous work on automatic extraction of hyponymy links through shallow parsing. Second, we show how hypernym links between single-word terms can be extended to semantic links between multi-word terms through corpus-based extraction of semantic variants. The induced hierarchy is incomplete but provides an automatic generalization of single-word term relations to the multi-word terms that are pervasive in technical thesauri and corpora.
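The abstract mentions corpus-based extraction of hyponymy links through discursive patterns. As a rough, hedged illustration of that general technique (the patterns, example sentence and expected output below are invented, not the paper's own), a minimal pattern-based extractor might look like this:

```python
import re

# A minimal sketch of pattern-based hyponymy extraction, in the spirit of
# Hearst-style lexico-syntactic patterns. The patterns and example text are
# illustrative assumptions, not the ones used in the paper.
PATTERNS = [
    # "NP such as NP (, NP)* (and|or NP)?"  ->  hypernym, hyponyms
    re.compile(r"(?P<hyper>\w+(?: \w+)?) such as (?P<hypos>\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)"),
    # "NP (, NP)* and other NP"             ->  hyponyms, hypernym
    re.compile(r"(?P<hypos>\w+(?:, \w+)*) and other (?P<hyper>\w+(?: \w+)?)"),
]

def extract_hyponymy(sentence):
    """Return (hypernym, hyponym) links found by the surface patterns."""
    links = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hyper = match.group("hyper").strip()
            for hypo in re.split(r",\s*|\s+(?:and|or)\s+", match.group("hypos")):
                if hypo:
                    links.append((hyper, hypo.strip()))
    return links

print(extract_hyponymy("image formats such as JPEG, PNG and TIFF are supported"))
# [('image formats', 'JPEG'), ('image formats', 'PNG'), ('image formats', 'TIFF')]
```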

76 citations


Journal ArticleDOI
TL;DR: This paper focuses on the second phase, presenting the basic guidelines for syntactic annotation and the boundaries of the work being done, and justifies methodological principles and syntactic criteria to build Cast3LB: a treebank for Spanish.
Abstract: In this paper we present and justify methodological principles and syntactic criteria to build Cast3LB: a treebank for Spanish. As preliminary work necessary to develop it, several automatic and semi-automatic processes have been carried out: automatic morphological analysis and disambiguation; manual validation of the tagging process, which guarantees the quality of the data; and, finally, automatic shallow parsing. The syntactic annotation consists of the labelling of constituents, including some elliptical elements, and syntactic functions. In this paper we focus on the second phase, presenting the basic guidelines for syntactic annotation and the boundaries of the work being done.

51 citations


Book ChapterDOI
15 Feb 2004
TL;DR: A robust syntactic analyser for Basque and its different modules are presented; the analysis is carried out using the Constraint Grammar (CG) formalism, and the standardisation of the parsing formats using XML is also described.
Abstract: This article presents a robust syntactic analyser for Basque and the different modules it contains. The analyser is structured in analysis layers, each of which takes the information provided by the previous layer as its input, thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation process of the parsing formats using XML.

37 citations


Patent
07 Jun 2004
TL;DR: This paper proposes a method for parsing Chinese sentences that employs lexical and syntactic information to extract the more prominent entities in a Chinese sentence; the sentence is then transformed into a Triple representation by employing Triple rules that refer to elemental Chinese syntax.
Abstract: A method for processing natural language Chinese sentences can transform a Chinese sentence into a Triple representation using shallow parsing techniques. The method is concerned with parsing Chinese sentences by employing lexical and syntactic information to extract the more prominent entities in a Chinese sentence; the sentence is then transformed into a Triple representation by employing Triple rules that refer to elemental Chinese syntax: SVO (subject, verb, and object, in that order). The lexical and syntactic information in our method refers to a lexicon with part-of-speech (POS) information and to phrase-level Chinese syntax, respectively. The Triple representation consists of three elements: the agent, predicate, and patient of a sentence.
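As a rough, hedged illustration of the Triple idea (the chunk labels, rule and example below are invented and are not the paper's actual Triple rules or lexicon), an agent/predicate/patient triple can be read off a shallow-parsed SVO sequence like this:

```python
# Illustrative sketch only: extracting an (agent, predicate, patient) triple
# from a shallow-parsed (chunked) sentence assumed to follow SVO order.
# Chunk labels and the example sentence are assumptions, not the paper's rules.

def extract_triple(chunks):
    """chunks: list of (chunk_type, text) pairs in sentence order."""
    agent = predicate = patient = None
    for chunk_type, text in chunks:
        if chunk_type == "NP" and agent is None:
            agent = text                      # first noun phrase -> agent
        elif chunk_type == "VP" and predicate is None:
            predicate = text                  # first verb phrase -> predicate
        elif chunk_type == "NP" and predicate is not None and patient is None:
            patient = text                    # first NP after the verb -> patient
    return agent, predicate, patient

# e.g. a chunked rendering of "the cat chased the mouse"
print(extract_triple([("NP", "the cat"), ("VP", "chased"), ("NP", "the mouse")]))
# ('the cat', 'chased', 'the mouse')
```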

27 citations


01 Jan 2004
TL;DR: In this article, the authors present the improvements in the computational treatment of Basque, and more specifically, in the areas of morphosyntactic disambiguation and shallow parsing.
Abstract: Our goal in this article is to show the improvements in the computational treatment of Basque, and more specifically, in the areas of morphosyntactic disambiguation and shallow parsing. The improvements presented in this paper include the following: analyses of previously identified ambiguities in morphosyntax and in syntactic functions, their disambiguation, and finally, an outline of possible steps in terms of shallow parsing based on the results provided by the disambiguation process. The work is part of the current research within the field of Natural Language Processing (NLP) in Basque, and more specifically, part of the work that is being done within the IXA group.

25 citations


Proceedings Article
01 Jan 2004
TL;DR: It is concluded that existing memory-based parsing approaches can be applied to spoken Dutch successfully, but that there is room for improvement in the tagger-chunker.
Abstract: We describe the development of a Dutch memory-based shallow parser. The availability of large treebanks for Dutch, such as the one provided by the Spoken Dutch Corpus, allows memory-based learners to be trained on examples of shallow parsing taken from the treebank and to act as a shallow parser after training. An overview is given of a modular memory-based learning approach to shallow parsing, composed of a part-of-speech tagger-chunker and two grammatical relation finders, which was originally developed for English. This approach is applied to the syntactically annotated part of the Spoken Dutch Corpus to construct a Dutch shallow parser. From the generalisation scores of the parser we conclude that existing memory-based parsing approaches can be applied to spoken Dutch successfully, but that there is room for improvement in the tagger-chunker.
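As a rough illustration of the memory-based idea behind the tagger-chunker (classifying each token by analogy with the most similar stored training instances), here is a minimal toy sketch; the POS tags, chunk labels and training data are invented and do not reflect the Spoken Dutch Corpus or the memory-based software actually used in the paper:

```python
# Toy sketch of memory-based (nearest-neighbour) chunking: each token is
# classified from a window of POS tags around it by analogy with the most
# similar stored training instance (feature overlap as the similarity).

def windows(pos_tags):
    """One feature tuple per token: the POS tags at positions -1, 0, +1."""
    padded = ["_"] + list(pos_tags) + ["_"]
    return [tuple(padded[i:i + 3]) for i in range(len(pos_tags))]

# Tiny invented training set: POS windows paired with IOB chunk tags.
memory = []
for tags, labels in [(["DET", "NOUN", "VERB", "DET", "NOUN"],
                      ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]),
                     (["PRON", "VERB", "ADJ", "NOUN"],
                      ["B-NP", "B-VP", "B-NP", "I-NP"])]:
    memory.extend(zip(windows(tags), labels))

def classify(feature):
    """Return the chunk tag of the most similar stored instance."""
    overlap = lambda a, b: sum(x == y for x, y in zip(a, b))
    return max(memory, key=lambda inst: overlap(inst[0], feature))[1]

test = ["DET", "NOUN", "VERB", "PRON"]
print([(tag, classify(feat)) for tag, feat in zip(test, windows(test))])
```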

10 citations


Journal Article
TL;DR: This paper describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level.
Abstract: This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated. Though this article focuses on Spanish, this approach is extensible to other languages by simply adapting the grammar used by the parser.
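To make the indexing idea concrete, here is a small, hedged sketch of using lemmas and head-modifier dependency pairs as index terms; the lemma table, dependency triples and the "+" term notation are assumptions made for illustration only:

```python
# Minimal sketch of indexing with lemmas and head-modifier dependency pairs
# instead of stems. The lemmatisation table and the dependency triples are
# invented; the paper derives them from a Spanish lemmatiser and shallow parser.

LEMMAS = {"retrieved": "retrieve", "documents": "document", "relevant": "relevant"}

def index_terms(tokens, dependencies):
    """tokens: word forms; dependencies: (head, relation, modifier) triples."""
    terms = {LEMMAS.get(t, t) for t in tokens}                 # word-level variation
    for head, _rel, mod in dependencies:                       # phrase-level variation
        terms.add(f"{LEMMAS.get(head, head)}+{LEMMAS.get(mod, mod)}")
    return terms

print(index_terms(["retrieved", "relevant", "documents"],
                  [("documents", "mod", "relevant"), ("retrieved", "obj", "documents")]))
# e.g. {'retrieve', 'relevant', 'document', 'document+relevant', 'retrieve+document'}
```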

9 citations


28 Aug 2004
TL;DR: The Dependency Parser, called Maxuxta, is presented for the linguistic processing of Basque, which can serve as a representative of agglutinative languages that are also characterized by the free order of their constituents.
Abstract: We present the Dependency Parser, called Maxuxta, for the linguistic processing of Basque, which can serve as a representative of agglutinative languages that are also characterized by the free order of their constituents. The dependency syntactic model is applied to establish the dependency-based grammatical relations between the components within the clause. Such a deep analysis is used to improve the output of the shallow parsing, where syntactic structure ambiguity is not fully and explicitly resolved. Prior to the completion of the grammar for the dependency parsing, the design of the Dependency Structure-based Scheme had to be accomplished; we concentrated on issues that must be resolved by any practical system that uses such models. This scheme was used both for the manual tagging of the corpus and for the development of the parser. The manually tagged corpus has been used to evaluate the accuracy of the parser. We have evaluated the application of the grammar to the corpus, measuring the linking of the verb with its dependents, with satisfactory results.

9 citations


Book ChapterDOI
30 Aug 2004
TL;DR: The application of lemmatization and shallow parsing is described as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level.
Abstract: This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated. Though this article focuses on Spanish, this approach is extensible to other languages by simply adapting the grammar used by the parser.

8 citations


01 Jan 2004
TL;DR: It is shown how the combination of shallow and deep semantic NLP techniques can improve the effectiveness of eLearning systems which support communication in free natural language and can make them more satisfactory and pleasant for their users.
Abstract: Computer-Aided Language Learning (CALL) should play an important role in the modern training process because it provides easily accessible, adaptive and flexible ways of learning. This paper addresses the scenario of tutor-learner question answering and attempts to automate free-answer evaluation using the advantages of Natural Language Processing (NLP). Our current approach integrates shallow parsing for analysing the answers and allows the learners to enter various utterances to express themselves. However, this variety does not impede the assessment of the student's answer, as we check the utterances against the automatically generated scope of the correct answers. The usage of a “set of answers” instead of one predefined correct answer enables feedback elaboration that helps learners better understand their knowledge gaps. Briefly, in this paper we show how the combination of shallow and deep semantic NLP techniques can improve the effectiveness of eLearning systems which support communication in free natural language and can make them more satisfactory and pleasant for their users.
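As a toy, hedged illustration of checking a free answer against a set of acceptable answers rather than a single gold string (the key terms, lemma table and acceptable set below are invented; the paper's automatic answer-scope generation and deeper semantic analysis are not reproduced):

```python
# Toy sketch: normalise a free-text learner answer to its key content terms
# and accept it if the result is in a set of acceptable answers.
ACCEPTABLE = {("water", "boil", "100"), ("water", "boil", "hundred")}
LEMMAS = {"boils": "boil", "boiling": "boil"}
KEY_TERMS = {"water", "boil", "100", "hundred"}

def normalise(answer):
    content = [LEMMAS.get(w, w) for w in answer.lower().split()]
    return tuple(w for w in content if w in KEY_TERMS)

student = "Water boils at 100 degrees"
print("correct" if normalise(student) in ACCEPTABLE else "needs feedback")
# correct
```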

8 citations


Book ChapterDOI
01 Jan 2004
TL;DR: An approach based on recent results in formal and computational linguistics is proposed, which takes into consideration the morphosyntactic and syntactic structure of Polish and which avoids various known problems of previous valence dictionaries.
Abstract: This article presents the design of a syntactico-semantic dictionary for Polish, i.e., a valence dictionary enriched with certain semantic information. Valence dictionaries, specifying the number and morphosyntactic form of the arguments of verbs, are useful in many Natural Language Processing applications, including deep parsing, e.g., for the purpose of machine translation; shallow parsing, e.g., for the purpose of information extraction; and rule-based morphosyntactic disambiguation, e.g., for the purpose of corpus annotation. An approach based on recent results in formal and computational linguistics is proposed, which takes into consideration the morphosyntactic and syntactic structure of Polish and which avoids various known problems of previous valence dictionaries, some of them stemming from their impoverished theoretical frameworks, unable to take proper care of the syntax-semantics interface, case variations and raising predicates. An implementation of a grammar of Polish deploying the ideas presented here is currently under development.
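For illustration only, one possible shape for a syntactico-semantic valence entry as a data structure; the field names, the example verb and its frame are assumptions, not the dictionary's actual format:

```python
# Illustrative sketch of a valence dictionary entry: a verb lemma with one or
# more frames, each listing the morphosyntactic requirements of its arguments
# plus a coarse semantic label. All labels here are invented examples.
from dataclasses import dataclass, field

@dataclass
class ValenceFrame:
    arguments: list          # e.g. morphosyntactic form of each argument
    semantics: str = ""      # coarse semantic label for the frame

@dataclass
class VerbEntry:
    lemma: str
    frames: list = field(default_factory=list)

dac = VerbEntry("dać",       # Polish 'to give'
                [ValenceFrame(["subj:nom", "obj:acc", "iobj:dat"], "transfer")])
print(dac)
```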

01 Mar 2004
TL;DR: The authors first introduce the promising field of information extraction, and then describe in detail how shallow parsing techniques are used in project SOKRATES, a project to analyze German free-form battlefield reports.
Abstract: A natural way to communicate with C2 systems would be to use natural language. Natural language components are already used in military systems; for example, CommandTalk is a spoken-language interface to the ModSAF battlefield simulator. In project SOKRATES, the authors use shallow parsing techniques for written language to analyze German free-form battlefield reports. These reports are processed by transducers. The extraction result is formalized in feature structures, semantically enriched by the semantic analysis, and the augmented result is then stored in the ATCCIS database. After storage in the database, triggers initiate a change in the position of a tactical symbol on the tactical map. Shallow parsing techniques are the basis for information extraction. In this paper, the authors first introduce the promising field of information extraction, and then describe in detail how shallow parsing techniques are used in project SOKRATES. Twenty-four briefing charts summarize the presentation. (2 figures, 15 refs.)

Proceedings Article
01 Jul 2004
TL;DR: The PolyU Treebank is based on shallow parsing in which only partial syntactical structures are annotated, and well-designed workflow and effective semiautomatic and automatic annotation checking are used to ensure annotation accuracy and consistency.
Abstract: This paper presents the construction of a manually annotated Chinese shallow Treebank, named the PolyU Treebank. Different from traditional Chinese Treebanks based on full parsing, the PolyU Treebank is based on shallow parsing in which only partial syntactic structures are annotated. This Treebank can be used to support shallow parser training, testing and other natural language applications. Phrase-based Grammar, proposed by Peking University, is used to guide the design and implementation of the PolyU Treebank. The design principles include good resource sharing, low structural complexity, sufficient syntactic information and large data scale. The design issues, including corpus material preparation, standards for word segmentation and POS tagging, and the guidelines for phrase bracketing and annotation, are presented in this paper. A well-designed workflow and effective semi-automatic and automatic annotation checking are used to ensure annotation accuracy and consistency. Currently, the PolyU Treebank has completed the annotation of a 1-million-word corpus. The evaluation shows that the accuracy of annotation is higher than 98%.

DOI
01 Feb 2004
TL;DR: A method is proposed here for automatically acquiring large-scale NVEF knowledge without human intervention in order to identify a large, varied range of NVEF-sentences (sentences containing at least one NVEF word-pair).
Abstract: Noun-verb event frame (NVEF) knowledge in conjunction with an NVEF word-pair identifier [Tsai et al. 2002] comprises a system that can be used to support natural language processing (NLP) and natural language understanding (NLU). In [Tsai et al. 2002a], we demonstrated that NVEF knowledge can be used effectively to solve the Chinese word-sense disambiguation (WSD) problem with 93.7% accuracy for nouns and verbs. In [Tsai et al. 2002b], we showed that NVEF knowledge can be applied to the Chinese syllable-to-word (STW) conversion problem to achieve 99.66% accuracy for the NVEF-related portions of Chinese sentences. In [Tsai et al. 2002a], we defined a collection of NVEF knowledge as an NVEF word-pair (a meaningful NV word-pair) and its corresponding NVEF sense-pairs. No methods exist that can fully and automatically find collections of NVEF knowledge from Chinese sentences. We propose a method here for automatically acquiring large-scale NVEF knowledge without human intervention in order to identify a large, varied range of NVEF-sentences (sentences containing at least one NVEF word-pair). The auto-generation of NVEF knowledge (AUTO-NVEF) includes four major processes: (1) segmentation checking; (2) Initial Part-of-Speech (IPOS) sequence generation; (3) NV knowledge generation; and (4) NVEF knowledge auto-confirmation. Our experimental results show that AUTO-NVEF achieved 98.52% accuracy for news and 96.41% for specific text types, which included research reports, classical literature and modern literature. AUTO-NVEF automatically discovered over 400,000 NVEF word-pairs from the 2001 United Daily News (2001 UDN) corpus. According to our estimation, the acquired NVEF knowledge from 2001 UDN helped to identify 54% of the NVEF-sentences in the Academia Sinica Balanced Corpus (ASBC), and 60% in the 2001 UDN corpus. We plan to expand NVEF knowledge so that it is able to identify more than 75% of NVEF-sentences in ASBC. We will also apply the acquired NVEF knowledge to support other NLP and NLU research, such as machine translation, shallow parsing, syllable and speech understanding, and text indexing. The auto-generation of bilingual, especially Chinese-English, NVEF knowledge will also be addressed in our future work.
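As a hedged sketch of the simplest ingredient of such acquisition, collecting candidate noun-verb pairs from a POS-tagged sentence (the tag set and example are assumptions; AUTO-NVEF's sense-pair generation and auto-confirmation steps are not reproduced):

```python
# Rough sketch: collect candidate noun-verb word pairs from a POS-tagged
# sentence. Tags and the example sentence are invented for illustration.
def candidate_nv_pairs(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs; returns noun-verb pairs."""
    nouns = [w for w, pos in tagged_sentence if pos.startswith("N")]
    verbs = [w for w, pos in tagged_sentence if pos.startswith("V")]
    return [(n, v) for n in nouns for v in verbs]

print(candidate_nv_pairs([("學生", "Na"), ("閱讀", "VC"), ("報紙", "Na")]))
# [('學生', '閱讀'), ('報紙', '閱讀')]
```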


Book ChapterDOI
13 Dec 2004
TL;DR: A novel approach to identifying case roles is proposed which goes beyond shallow parsing to a deeper level of language understanding, while preserving robustness, without being bogged down in a complete linguistic analysis.
Abstract: A novel approach to identifying case roles is proposed. The approach makes use of an attributed string matching technique which takes full advantage of the huge number of sentence patterns in a Treebank. Based on the syntactic and semantic tags encoded in the Treebank, the approach goes beyond shallow parsing to a deeper level of language understanding, while preserving robustness, without being bogged down in a complete linguistic analysis. An evaluation on 5,000 Chinese sentences is carried out in order to establish its statistical significance.
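A rough, hedged sketch of the matching idea: render the input as a sequence of syntactic/semantic tags and borrow the case roles of the closest stored Treebank pattern. The tag sets, patterns and similarity measure below are assumptions made for the example:

```python
# Illustrative sketch of attributed string matching: the input tag sequence is
# compared against stored sentence patterns, and the roles of the best match
# are carried over. difflib ratio stands in for the paper's similarity measure.
from difflib import SequenceMatcher

PATTERNS = [
    (("Nh", "VC", "Na"), ("agent", "act", "theme")),
    (("Na", "VH"),       ("theme", "state")),
]

def closest_roles(tags):
    best = max(PATTERNS, key=lambda p: SequenceMatcher(None, p[0], tags).ratio())
    return list(zip(tags, best[1]))

print(closest_roles(("Nh", "VC", "Na")))
# [('Nh', 'agent'), ('VC', 'act'), ('Na', 'theme')]
```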

Journal Article
TL;DR: A method of skeleton parsing for domain-specific Chinese text is put forward: shallow parsing with a cascaded hidden Markov Model combines phrases, and the sentence skeleton is then obtained by template matching over the shallow parse tree.
Abstract: A method of skeleton parsing for domain-specific Chinese text is put forward in this paper. The method includes two key steps: shallow parsing and template matching. A template is adopted to represent the sentence skeleton. We use shallow parsing, based on a cascaded hidden Markov Model, to combine phrases. The skeleton parse is then obtained by template matching over the tree produced by shallow parsing. An experiment on sports news shows that the proposed method achieves 98.04% precision and 81.43% recall for template matching, and 96.97% precision and 84.85% recall at the sentence level.
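A minimal, hedged sketch of template matching over chunker output; the chunk labels, template and example are invented and are not the paper's templates for sports news:

```python
# Toy sketch of skeleton extraction: slide a template of chunk labels over the
# shallow-parse (chunk) sequence and return the texts that fill it.
TEMPLATE = ["NP", "VP", "NP"]      # a skeleton: who did what to whom

def match_skeleton(chunks, template=TEMPLATE):
    """chunks: (label, text) pairs; returns texts filling the template, or None."""
    labels = [label for label, _ in chunks]
    for start in range(len(labels) - len(template) + 1):
        if labels[start:start + len(template)] == template:
            return [text for _, text in chunks[start:start + len(template)]]
    return None

print(match_skeleton([("NP", "the home team"), ("VP", "won"),
                      ("NP", "the match"), ("PP", "in overtime")]))
# ['the home team', 'won', 'the match']
```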

Journal Article
TL;DR: In this article, one of the few linguistics-based systems for word-to-word alignment is presented; most existing systems are purely statistical and assume hypotheses about the structure of texts which are often not borne out.
Abstract: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment. Most systems are purely statistical and assume some hypotheses about the structure of texts which are often not borne out. Our approach combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned. The linguistic part uses shallow parsing by regular expressions and relies on very general linguistic principles. However, a component of language-specific methods can be developed to improve results. Our word-alignment system was evaluated on a Romanian-English bitext.

01 Jan 2004
TL;DR: This paper describes experiments made with Logus, a spoken understanding system based on an incremental methodology, and presents the first step of the parsing, a chunking based on rules of categorial grammars and pregroups.
Abstract: Spoken language understanding is a challenge for the development of Spoken Dialogue Systems. Recognition errors and speech repairs make it impossible to obtain a complete syntactic analysis. Shallow parsing and chunking appear to be efficient ways to begin an analysis that is both robust and precise. This paper describes experiments made with Logus, a spoken understanding system based on an incremental methodology. It presents the first step of the parsing, a chunking based on rules of categorial grammars and pregroups. These formalisms are very appropriate for this treatment, and we argue that they could be more widely used for applications of this type.

Book ChapterDOI
Ana-Maria Barbu
08 Sep 2004
TL;DR: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment; it combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned.
Abstract: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment. Most systems are purely statistical and assume some hypotheses about the structure of texts which are often not borne out. Our approach combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned. The linguistic part uses shallow parsing by regular expressions and relies on very general linguistic principles. However, a component of language-specific methods can be developed to improve results. Our word-alignment system was evaluated on a Romanian-English bitext.

Patent
20 May 2004
TL;DR: In this article, a simple sentence range recognizer and a necessary component generator were used to extract information from a complex event sentence by using the sentence form information and considering a modifier clause.
Abstract: PURPOSE: A device and a method for analyzing the simple-sentence structure of an event sentence for information extraction are provided, improving the performance of an information extraction system by dividing a long, complex event sentence into simple sentences based on declinable words and extracting the information after analyzing the structure of the divided sentences. CONSTITUTION: A simple sentence range recognizer (10) recognizes the range of a simple sentence within the input complex event sentence by using sentence-form information and taking modifier clauses into account. A necessary component generator (20) generates the necessary components from the recognized simple sentence by using the lower-category information and the object-name co-occurrence information. A necessary component extender (30) obtains the final analyzed structure of the simple sentence by recognizing and extending the generated necessary components through a compound noun dictionary and the object-name information.

Proceedings ArticleDOI
28 Oct 2004
TL;DR: This work uses the centering model of local discourse coherence to resolve zero anaphora in Chinese and to identify the topic, which is the most salient element in a sentence.
Abstract: XML topic maps enable multiple concurrent views of sets of information objects and can be used in different applications, for example thesaurus-like interfaces to corpora, navigational tools for cross-references or citation systems, and information filtering or delivery depending on user profiles. However, enriching the information of a topic map or connecting it with a document's URI is very labor-intensive and time-consuming. To solve this problem, we propose an approach based on natural language processing techniques to identify and extract useful information in raw Chinese text. Unlike most traditional approaches to parsing sentences, which are based on the integration of complex linguistic information and domain knowledge, we work on the output of a part-of-speech tagger and use shallow parsing instead of complex parsing to identify the topics of sentences. The key elements of the centering model of local discourse coherence are employed to extract the structures of discourse segments. We use the local discourse structure to solve the problem of zero anaphora in Chinese and then identify the topic, which is the most salient element in a sentence. After we obtain all the topics of a document, we may assign the document to a topic node of the topic map and add the information of the document to the topic element simultaneously.

Dissertation
17 Dec 2004
TL;DR: This project takes a corpus of aviation safety reports parsed by Cass, an existing partial parser, with a particular given grammar, and looks for instances of linguistic constructs whose treatment by the parser could be improved by modifications to the grammar.
Abstract: With the growth of the World Wide Web in the nineties, alongside the increase in storage and processing capabilities of computer hardware, the problem of information overload resulted in an increased interest in finite-state techniques for Natural Language Analysis as an alternative to fragile, slower algorithms that would attempt to find complete parses for sentences based on general theories of language. As it turns out, shallow parsing, a set of robust parsing techniques based on finite state machines, provides incomplete yet very useful parses for unconstrained running text. The technique, however, will never provide 100% accuracy and requires that grammars be geared to the needs of particular data samples. In this project, we take a corpus of aviation safety reports parsed by Cass, an existing partial parser, with a particular given grammar, and look for instances of linguistic constructs whose treatment by the parser could be improved by modifications to the grammar. A few such constructs are discussed, and the grammar is edited to reflect the desired improvements. A parser accuracy measure is implemented and evaluated before and after the grammar modifications.
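The dissertation's accuracy measure is not specified in this abstract; as a hedged illustration, one common way to score a partial parser is precision, recall and F1 over labelled constituent spans against a gold standard:

```python
# Hedged sketch of a parser accuracy measure: precision, recall and F1 over
# labelled spans (label, start, end) compared with a gold annotation.
def prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 5)]
pred = [("NP", 0, 2), ("VP", 2, 3), ("NP", 4, 5)]
print(prf(gold, pred))   # (0.666..., 0.666..., 0.666...)
```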

Book ChapterDOI
01 Jan 2004
TL;DR: It is suggested that the use of Language Technologies, and more specifically of Information Extraction technologies, provides substantial help in Customer Opinion Monitoring when compared to alternative approaches, including both the “traditional” methodology of employing human operators to read documents and formalize relevant opinions/facts to be stored, and data mining techniques based on the non-linguistic structure of the page.
Abstract: The paper addresses a crucial topic in current CRM processes, i.e. that of constantly monitoring customer opinions. We use the label “Real Time Customer Opinion Monitoring” to denote the process of retrieving, analyzing and assessing opinions, judgments and criticisms about products and brands from newsgroups, message boards, consumer association sites and other public sources on the Internet. We suggest that the use of Language Technologies, and more specifically of Information Extraction technologies, provides substantial help in Customer Opinion Monitoring when compared to alternative approaches, including both the “traditional” methodology of employing human operators to read documents and formalize relevant opinions/facts to be stored, and data mining techniques based either on the non-linguistic structure of the page (web mining) or on statistical rather than linguistic analysis of the text (text mining in its standard meaning). In the light of these considerations, a novel application (ArgoServer) is presented, where different technologies cooperate with the core linguistic information extraction engine in order to achieve the result of constantly updating a database of product- or brand-related customer opinions automatically gathered from newsgroups. The paper emphasizes how far the currently implemented shallow parsing techniques can go in understanding the contents of customers' and users' messages, thus extracting database records from relevant textual segments. It also stresses the limits inherently associated with the use of pure shallow techniques for the comprehension of language, and shows how a new emerging linguistic technology, to be developed in the context of the European project Deep Thought, could in principle overcome such limits.

Book ChapterDOI
15 Feb 2004
TL;DR: A statistical translation model incorporating linguistic knowledge of syntactic and phrasal information for better translations is presented and it is shown that the structural relationship helps construct a better translation model for structurally different languages like Korean and English.
Abstract: As part of work on the alignment of an English-Korean parallel corpus, this paper presents a statistical translation model incorporating linguistic knowledge of syntactic and phrasal information for better translations. For this, we propose three models: first, we incorporate syntactic information such as part of speech into the word-based lexical alignment. Based on this model, we propose a second model which finds phrasal correspondences in the parallel corpus. Phrasal mapping through chunk-based shallow parsing makes it possible to resolve the mismatch of meaningful units in the two languages. Lastly, we develop a two-level alignment model by combining these two models in order to construct both a word- and phrase-based translation model. Model parameters are automatically estimated from a set of bilingual sentence pairs by applying the EM algorithm. Experiments show that the structural relationship helps construct a better translation model for structurally different languages like Korean and English.
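As a hedged illustration of EM-estimated parameters for word alignment, in the style of IBM Model 1 (the usual word-based starting point), the sentence pairs below are invented and the paper's POS- and chunk-based extensions are not reproduced:

```python
# Very small sketch of EM estimation of lexical translation probabilities
# t(f|e) from sentence pairs, in the style of IBM Model 1.
from collections import defaultdict

pairs = [(["the", "house"], ["das", "haus"]),
         (["the", "book"], ["das", "buch"])]

t = defaultdict(lambda: 0.25)            # uniform initialisation of t(f|e)
for _ in range(10):                      # EM iterations
    counts, totals = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in pairs:
        for f in f_sent:                 # E-step: expected alignment counts
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                counts[(f, e)] += t[(f, e)] / norm
                totals[e] += t[(f, e)] / norm
    for (f, e), c in counts.items():     # M-step: re-estimate t(f|e)
        t[(f, e)] = c / totals[e]

print(round(t[("haus", "house")], 3), round(t[("haus", "the")], 3))
# t('haus'|'house') ends up much larger than t('haus'|'the')
```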

01 Jan 2004
TL;DR: A model is explored that does not attempt to be a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.
Abstract: Probabilistic models for sequence data can be mainly divided into two categories: I. fairly sophisticated models that are aimed at finding an all-encompassing characterisation of the whole sequence by means of an interdependent generative process (such as PCFGs and HMMs); and II. relatively simple models that make an independence assumption regarding the generation process of each of the elements of the sequence (such as Unigram, Naive Bayes, Clustering, PLSA and LDA). In this paper we explore the interval between these two extremes with a model that does not attempt to be a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.
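As a hedged illustration of the two extremes the abstract contrasts (the paper's own intermediate model is not given in this abstract), independence-based models factor the sequence probability into per-element terms, while an HMM-style model ties the elements together through latent structure:

```latex
% Illustrative contrast only; the paper's intermediate model is not reproduced here.
% Independence assumption (e.g. unigram / Naive Bayes):
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i)
% Globally interdependent generation (e.g. an HMM with hidden states s_i):
P(w_1, \dots, w_n) = \sum_{s_1, \dots, s_n} \prod_{i=1}^{n} P(s_i \mid s_{i-1}) \, P(w_i \mid s_i)
```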