
Showing papers on "Shallow parsing published in 2006"


Journal ArticleDOI
TL;DR: A novel computer-aided procedure for generating multiple-choice test items from electronic documents that makes use of language resources such as corpora and ontologies, and saves both time and production costs.
Abstract: This paper describes a novel computer-aided procedure for generating multiple-choice test items from electronic documents. In addition to employing various Natural Language Processing techniques, including shallow parsing, automatic term extraction, sentence transformation and computing of semantic distance, the system makes use of language resources such as corpora and ontologies. It identifies important concepts in the text and generates questions about these concepts as well as multiple-choice distractors, offering the user the option to post-edit the test items by means of a user-friendly interface. In assisting test developers to produce items in a fast and expedient manner without compromising quality, the tool saves both time and production costs.

216 citations
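
As an illustration of the distractor-selection step the abstract above mentions, here is a minimal sketch of choosing multiple-choice distractors by semantic distance using WordNet path similarity. This is a hypothetical stand-in, not the paper's actual procedure; the function names and candidate list are invented.

```python
# Hypothetical sketch only; the paper's distractor generator is not
# reproduced here. Requires nltk with the WordNet data downloaded
# (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def semantic_distance(word_a, word_b):
    """1 - best path similarity over all synset pairs (1.0 = unrelated)."""
    best = 0.0
    for s1 in wn.synsets(word_a):
        for s2 in wn.synsets(word_b):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return 1.0 - best

def pick_distractors(key_concept, candidates, n=3):
    """Prefer terms close to the key concept: plausible but wrong options."""
    ranked = sorted(candidates, key=lambda t: semantic_distance(key_concept, t))
    return [t for t in ranked if t != key_concept][:n]

print(pick_distractors("neuron", ["axon", "dendrite", "glia", "tractor"]))
```

Ranking candidates by closeness to the key concept favours distractors that are plausible enough to be tempting, which is the usual goal of semantic-distance-based distractor selection.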


Book ChapterDOI
13 May 2006
TL;DR: The paper describes the MT engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine, and then describes the pilot Portuguese and Spanish linguistic data in more detail.
Abstract: This paper describes the current status of development of an open-source shallow-transfer machine translation (MT) system for the [European] Portuguese ↔ Spanish language pair, developed using the OpenTrad Apertium MT toolbox (www.apertium.org). Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer, and is based on a simple rationale: to produce fast, reasonably intelligible and easily correctable translations between related languages, it suffices to use an MT strategy which uses shallow parsing techniques to refine word-for-word MT. This paper briefly describes the MT engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine, and then goes on to describe the pilot Portuguese ↔ Spanish linguistic data in more detail.

83 citations
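
To make the shallow-transfer rationale concrete, here is a toy sketch of the word-for-word lexical transfer that such systems refine. The three-entry lexicon is invented for the example and is not the real Apertium Spanish-Portuguese data; a real Apertium pipeline also tags, chunks and applies structural transfer rules.

```python
# Toy illustration of shallow-transfer MT between closely related languages.
# BILINGUAL is a three-entry invention, not Apertium's es->pt dictionary.
BILINGUAL = {"la": "a", "casa": "casa", "blanca": "branca"}  # Spanish -> Portuguese

def word_for_word(tokens):
    # Lexical transfer; unknown words pass through marked with "*",
    # mirroring Apertium's convention for untranslated words.
    return [BILINGUAL.get(t, "*" + t) for t in tokens]

print(" ".join(word_for_word("la casa blanca".split())))  # -> "a casa branca"
```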


Journal ArticleDOI
TL;DR: A new hybrid approach combining shallow parsing and pattern matching is proposed to extract relations between proteins from scientific papers on biomedical themes; it achieves an average F-score of 80% on individual verbs and 66% on all verbs.

35 citations


01 Jan 2006
TL;DR: Improvements are possible by utilizing supertagging, lightweight dependency analysis, a link grammar parser and a maximum-entropy based chunk parser to add syntactically motivated features to a statistical machine translation system in a reranking framework.
Abstract: We investigate methods that add syntactically motivated features to a statistical machine translation system in a reranking framework. The goal is to analyze whether shallow parsing techniques help in identifying ungrammatical hypotheses. We show that improvements are possible by utilizing supertagging, lightweight dependency analysis, a link grammar parser and a maximum-entropy based chunk parser. Adding features to n-best lists and discriminatively training the system on a development set increases the BLEU score by up to 0.7% on the test set.

24 citations
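
A minimal sketch of the reranking framework described above: each n-best hypothesis carries a base model score plus syntactic feature values, and a weighted linear combination picks the winner. The feature name "chunker_ok" and all numbers are invented stand-ins for the supertagging, dependency, link-grammar and chunk-parser features the paper combines.

```python
# Minimal n-best reranking sketch: base MT score plus weighted features.
def rerank(nbest, weights):
    """nbest: list of (hypothesis, base_score, feature_dict) tuples."""
    def total(item):
        _, base, feats = item
        return base + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(nbest, key=total)[0]

nbest = [
    ("he have seen it", -2.1, {"chunker_ok": 0.0}),
    ("he has seen it",  -2.3, {"chunker_ok": 1.0}),
]
print(rerank(nbest, {"chunker_ok": 0.5}))  # -> "he has seen it"
```

With the syntactic feature switched on, the grammatical hypothesis overtakes the one the base model preferred, which is exactly the effect the paper is after.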


Proceedings Article
01 Jan 2006
TL;DR: The utility of corpora-independent lexicons derived from machine-readable dictionaries is demonstrated, and substantial error reductions are shown for the tasks of part-of-speech tagging and shallow parsing.
Abstract: Many natural language processing tasks make use of a lexicon – typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpora-independent lexicons derived from machine-readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger, providing improved performance in cases where: i) limited training data is available, ii) the data is case-less, and iii) the test data genre or domain is different from that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results point towards a simple but effective methodology for increasing the adaptability of text processing systems by training models with annotated data in one genre augmented with general lexical information or lexical information pertinent to the target genre (or domain).

17 citations
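
The core idea lends itself to a short sketch: encode dictionary-derived tag sets as token features that a CRF tagger can consume alongside the usual surface features. The two-entry LEXICON and the feature names below are our own toy values, standing in for entries loaded from a machine-readable dictionary.

```python
# Sketch of dictionary-derived token features for a CRF tagger.
LEXICON = {"run": {"NN", "VB"}, "the": {"DT"}}  # toy stand-in dictionary

def token_features(tokens, i):
    w = tokens[i]
    feats = {"word": w.lower(), "is_title": w.istitle()}
    # Case-insensitive lookup keeps the feature useful on case-less text,
    # one of the scenarios the abstract highlights.
    for tag in sorted(LEXICON.get(w.lower(), {"OOV"})):
        feats["lex=" + tag] = True
    return feats

print(token_features(["The", "run"], 1))
# {'word': 'run', 'is_title': False, 'lex=NN': True, 'lex=VB': True}
```

Because the lexicon features fire regardless of whether a word appeared in the training corpus, they help most on unknown words, consistent with the reported results.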


Book ChapterDOI
17 Dec 2006
TL;DR: A novel selection method for tri-training is proposed in which a newly labeled sentence is selected for a classifier when the other two classifiers agree on its labels while the classifier itself disagrees.
Abstract: This paper presents a practical tri-training method for Chinese chunking using a small amount of labeled training data and a much larger pool of unlabeled data. We propose a novel selection method for tri-training in which newly labeled sentences are selected by comparing the agreement of the three classifiers: in each iteration, a new sample is selected for a classifier if the other two classifiers agree on its labels while the classifier itself disagrees. We compare the proposed tri-training approach with a co-training approach on the UPenn Chinese Treebank V4.0 (CTB4). The experimental results show that the proposed approach improves performance significantly.

16 citations
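
The selection rule compacts nicely into code. Below, a sentence joins classifier k's new training data when the other two classifiers agree on its label sequence but k disagrees; the predict() interface and the stub classifiers are assumed for illustration, not taken from the paper.

```python
# Sketch of the agreement-based tri-training selection rule.
def select_for(k, classifiers, unlabeled):
    i, j = [x for x in range(3) if x != k]
    selected = []
    for sent in unlabeled:
        yi = classifiers[i].predict(sent)
        yj = classifiers[j].predict(sent)
        yk = classifiers[k].predict(sent)
        if yi == yj and yi != yk:        # two agree, the target disagrees
            selected.append((sent, yi))  # adopt the agreed label sequence
    return selected

class Stub:
    """Toy classifier: looks labels up in a fixed table."""
    def __init__(self, table): self.table = table
    def predict(self, sent): return self.table[sent]

c = [Stub({"s1": ["B-NP"]}), Stub({"s1": ["B-NP"]}), Stub({"s1": ["O"]})]
print(select_for(2, c, ["s1"]))  # -> [('s1', ['B-NP'])]
```

The intuition is that a sentence where two views agree against the third is precisely where the third classifier has something to learn.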


Proceedings Article
01 Jul 2006
TL;DR: A system that automatically constructs ontologies by extracting knowledge from dictionary definition sentences using Robust Minimal Recursion Semantics (RMRS) is outlined and how this system was designed to handle multiple lexicons and languages is discussed.
Abstract: In this paper, we outline the development of a system that automatically constructs ontologies by extracting knowledge from dictionary definition sentences using Robust Minimal Recursion Semantics (RMRS). Combining deep and shallow parsing resource through the common formalism of RMRS allows us to extract ontological relations in greater quantity and quality than possible with any of the methods independently. Using this method, we construct ontologies from two different Japanese lexicons and one English lexicon. We then link them to existing, handcrafted ontologies, aligning them at the word-sense level. This alignment provides a representative evaluation of the quality of the relations being extracted. We present the results of this ontology construction and discuss how our system was designed to handle multiple lexicons and languages.

14 citations


Journal Article
TL;DR: The primary aim of the UCSG parsing architecture is a judicious combination of linguistic and statistical methods for developing wide-coverage, robust shallow parsing systems without the need for large-scale manually parsed training corpora.
Abstract: Recently, there is increasing interest in integrating rule-based methods with statistical techniques to develop robust, wide-coverage, high-performance parsing systems. In this paper, we describe an architecture, called the UCSG shallow parser architecture, which combines linguistic constraints expressed in the form of finite state grammars with statistical rating using HMMs built from a POS-tagged corpus, and an A* search for global optimization to determine the best shallow parse for a given sentence. The primary aim of the design of the UCSG parsing architecture is to develop a judicious combination of linguistic and statistical methods for building wide-coverage, robust shallow parsing systems without the need for large-scale manually parsed training corpora. The UCSG architecture uses a grammar to specify all valid structures and a statistical component to rate and rank the possible alternatives, so as to produce the best parse first without compromising the ability to produce all possible parses. The architecture supports bootstrapping with the aim of reducing the need for parsed training corpora. The complete system has been implemented in Perl under Linux. In this paper we first describe the UCSG shallow parsing architecture and then focus on the evaluation of the UCSG finite state grammar on the chunking task for English. Recalls of 91.16% and 93.73% have been obtained on the Susanne parsed corpus and the CoNLL 2000 chunking test set respectively. Extensive experimentation is under way to evaluate the other modules.

13 citations
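
A hedged sketch of the "grammar proposes, statistics ranks" design: candidate chunk sequences licensed by a grammar are rated with an HMM-style transition score and emitted best-first. A plain heap over whole candidates stands in for the paper's A* search, and the transition table is invented for the example.

```python
# Best-first ranking of grammar-licensed chunk sequences; TRANS is a toy
# transition table, not probabilities estimated from a POS-tagged corpus.
import heapq
import math

TRANS = {("NP", "VP"): 0.6, ("VP", "NP"): 0.3,
         ("NP", "NP"): 0.1, ("VP", "VP"): 0.05}

def score(seq):
    """Sum of log transition probabilities along the chunk sequence."""
    return sum(math.log(TRANS.get(pair, 1e-6))
               for pair in zip(seq, seq[1:]))

def best_first(candidates):
    """Yield candidate chunk sequences from best-rated to worst."""
    heap = [(-score(c), idx, c) for idx, c in enumerate(candidates)]
    heapq.heapify(heap)
    while heap:
        _, _, c = heapq.heappop(heap)
        yield c

cands = [["NP", "NP", "NP"], ["NP", "VP", "NP"]]
print(next(best_first(cands)))  # -> ['NP', 'VP', 'NP']
```

Because the generator can keep yielding, the best parse comes out first without giving up the ability to enumerate all grammar-valid alternatives, matching the design goal stated in the abstract.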


Book ChapterDOI
19 Feb 2006
TL;DR: The UCSG shallow parser as mentioned in this paper uses a grammar to specify all valid structures and a statistical component to rate and rank the possible alternatives, so as to produce the best parse first without compromising on the ability to produce all possible parses.
Abstract: Recently, there is increasing interest in integrating rule-based methods with statistical techniques to develop robust, wide-coverage, high-performance parsing systems. In this paper, we describe an architecture, called the UCSG shallow parser architecture, which combines linguistic constraints expressed in the form of finite state grammars with statistical rating using HMMs built from a POS-tagged corpus, and an A* search for global optimization to determine the best shallow parse for a given sentence. The primary aim of the design of the UCSG parsing architecture is to develop a judicious combination of linguistic and statistical methods for building wide-coverage, robust shallow parsing systems without the need for large-scale manually parsed training corpora. The UCSG architecture uses a grammar to specify all valid structures and a statistical component to rate and rank the possible alternatives, so as to produce the best parse first without compromising the ability to produce all possible parses. The architecture supports bootstrapping with the aim of reducing the need for parsed training corpora. The complete system has been implemented in Perl under Linux. In this paper we first describe the UCSG shallow parsing architecture and then focus on the evaluation of the UCSG finite state grammar on the chunking task for English. Recalls of 91.16% and 93.73% have been obtained on the Susanne parsed corpus and the CoNLL 2000 chunking test set respectively. Extensive experimentation is under way to evaluate the other modules.

12 citations


Book ChapterDOI
16 Aug 2006
TL;DR: This paper presents a method for Chinese POS tagging and shallow parsing based on conditional random fields (CRFs), discriminative sequential models that can incorporate many rich features and avoid the label bias problem.
Abstract: Part-of-speech (POS) tagging and shallow parsing are sequence modeling problems. HMMs and other generative models are not the most appropriate for the task of labeling sequential data. Compared with HMMs, Maximum Entropy Markov Models (MEMMs) and other discriminative finite-state models can more easily fuse features; however, they suffer from the label bias problem. This paper presents a method for Chinese POS tagging and shallow parsing based on conditional random fields (CRFs), discriminative sequential models that can incorporate many rich features and avoid the label bias problem. Moreover, we propose feeding information back from syntactic analysis to lexical analysis, since natural language understanding is by nature an interaction of multiple knowledge sources. Experiments show that the CRF approach achieves a 0.70% F-score improvement in POS tagging and a 0.67% improvement in shallow parsing. We also confirm the effectiveness of information feedback for some complicated multi-class words.

7 citations
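
For concreteness, here is a minimal chunking example with a linear-chain CRF, assuming the third-party sklearn-crfsuite package (pip install sklearn-crfsuite); the feature template and the one-sentence training set are purely illustrative, not the paper's Chinese setup.

```python
# Minimal linear-chain CRF chunking sketch using sklearn-crfsuite.
import sklearn_crfsuite

def feats(sent, i):
    word, pos = sent[i]
    return {"w": word.lower(), "pos": pos,
            "prev_pos": sent[i - 1][1] if i > 0 else "BOS"}

train = [[("He", "PRP"), ("eats", "VBZ"), ("rice", "NN")]]
y = [["B-NP", "B-VP", "B-NP"]]
X = [[feats(s, i) for i in range(len(s))] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # memorises the toy data: ['B-NP', 'B-VP', 'B-NP']
```

Unlike an MEMM, the CRF normalises over whole label sequences rather than per state, which is what lets it avoid the label bias problem the abstract mentions.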


Journal ArticleDOI
TL;DR: The experimental results show that POS information greatly helps improve the performance of Chinese shallow parsing; the approach integrates Chinese linguistic information about chunks into an HMM model.

Journal Article
TL;DR: This paper describes how the syntactic analyzer of written Estonian was adapted to the spoken language; the introduced changes are described and the achieved results are analyzed.
Abstract: In this paper we describe how we have adapted the syntactic analyzer of written Estonian to the spoken language. The Constraint Grammar shallow syntactic parser (Müürisep et al. 2003) was used for the automatic syntactic analysis of the corpus of Estonian spoken language (Hennoste et al. 2000). To adapt the parser, the clause boundary detection rules as well as some syntactic constraints had to be changed, and two new syntactic tags were introduced. In the paper the introduced changes are described and the achieved results are analyzed. Using manually morphologically disambiguated text as input, the parser determined the syntactic label unambiguously for 90% of the words in the text on average. The error rate was less than 3%.

Proceedings ArticleDOI
01 Aug 2006
TL;DR: Conditional random fields (CRFs) are presented as a new kind of discriminative sequential model that can incorporate many rich features and avoid the label bias problem that limits maximum entropy Markov models (MEMMs) and other discriminative finite-state models.
Abstract: This paper presents a sequence tagging approach based on combined machine learning methods. First, conditional random fields (CRFs) are presented as a new kind of discriminative sequential model that can incorporate many rich features and avoid the label bias problem that limits maximum entropy Markov models (MEMMs) and other discriminative finite-state models. Second, support vector machines are adapted to the sequential tagging task. Finally, these improved models are combined with other existing models, achieving state-of-the-art performance. Experimental results show that the CRF approach achieves a 0.70% improvement in POS tagging and a 0.67% improvement in shallow parsing. Moreover, our combination method achieves F-measures of 93.73% and 93.69% on the two tasks respectively, better than any sub-model.

01 Jan 2006
TL;DR: This document reports the experiments conducted at The Robert Gordon University (RGU), where Statistical Language Models were combined with shallow parsing techniques for the opinion retrieval problem.
Abstract: Blogs are highly rich in opinion, making their automatic processing appealing to marketing companies, the media, customer centres, etc. TREC ran a Blog track in 2006 with two tasks: opinion retrieval and an open task. This document reports the experiments conducted at The Robert Gordon University (RGU), where we used Statistical Language Models combined with shallow parsing techniques for the opinion retrieval problem.

Proceedings Article
01 Jan 2006
TL;DR: A probability model is proposed to score the confidence of protein-protein interactions based on both text mining results and gene expression profiles, and experimental results are presented to show the feasibility of this framework.
Abstract: Protein-protein interactions, the associations of protein molecules, are crucial for many biological functions. Since most knowledge about them is still hidden in biological publications, there is an increasing focus on mining information from the vast amount of biological literature such as MedLine. Many approaches, such as pattern matching, shallow parsing and deep parsing, have been proposed to automatically extract protein-protein interaction information from text sources, with only limited success. Moreover, to the best of our knowledge, none of the existing approaches performs automatic validation of the mining results. In this paper, we describe a novel framework in which text mining results are automatically validated using knowledge mined from gene expression profiles. A probability model is proposed to score the confidence of protein-protein interactions based on both text mining results and gene expression profiles. Experimental results are presented to show the feasibility of this framework.
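
The abstract does not specify the probability model, so as a loudly hypothetical stand-in, here is one standard way to fuse two independent evidence sources (text-mining confidence and expression-profile support) into a single score. The noisy-OR combination and the example numbers are ours, not the paper's model.

```python
# Hypothetical noisy-OR fusion of two evidence sources for a candidate
# protein-protein interaction; not the paper's actual probability model.
def combined_confidence(p_text, p_expr):
    """Interaction is supported unless both sources fail to support it."""
    return 1.0 - (1.0 - p_text) * (1.0 - p_expr)

print(combined_confidence(0.7, 0.5))  # -> 0.85
```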

01 Jan 2006
TL;DR: The system built by the Documents and Linguistic Technology (DLT) Group at University of Limerick for participation in the French-English Question Answering Task of the Cross Language Evaluation Forum (CLEF) resulted in improved performance.
Abstract: The basic architecture of our factoid system is standard in nature and comprises query type identification, query analysis and translation, retrieval query formulation, document retrieval, text file parsing, named entity recognition and answer entity selection. Factoid classification into 69 query types is carried out using keywords. Associated with each type is a set of one or more Named Entities. Xelda is used to tag the French query for part-of-speech, and shallow parsing is then carried out over these tags in order to recognise thirteen different kinds of significant phrase. These were determined after a study of the constructions used in French queries together with their English counterparts. Our observations were that (1) proper names usually only start with a capital letter, with subsequent words un-capitalised, unlike English; (2) adjective-noun combinations, capitalised or not, can have the status of compounds in French and hence need special treatment; (3) certain noun-preposition-noun phrases are also significant. The phrases are then translated into English by the WorldLingo engine and using the Grand Dictionnaire Terminologique, the results being combined. Each phrase has a weight assigned to it by the parser. A Boolean retrieval query is formulated consisting of an AND of all phrases in increasing order of weight. The corpus is indexed by sentence using Lucene. The Boolean query is submitted to the engine and, if unsuccessful, is re-submitted with the first (least significant) term removed. The process continues until the search succeeds. The documents (i.e. sentences) are retrieved and the NEs corresponding to the identified query type are marked. Significant terms from the query are also marked. Each NE is scored based on its distance from query terms and their individual weights. The answer returned is the highest-scoring NE. Temporally Restricted Factoids are treated in the same way as Factoids. Definition questions are classified in three ways: organisation, person or unknown. This year Factoids had to be recognised automatically by an extension of the classifier. An IR query is formulated using the main term in the original question plus a disjunction of phrases depending on the identified type. All matching sentences are returned complete. Results this year were as follows: 32/150 (21%) of Factoids were R, 14/150 (9%) were X, 4/40 (10%) of Definitions were R and 2 List results were R (P@N = 0.2). Our ranking on Factoids relative to all thirteen runs was fourth. However, scoring all systems over R&X together and including Definitions, our ranking would be second equal, because we had more X scores than any other system. Last year our score on Factoids was 26/150 (17%), but the difference is probably due to easier queries this year.
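
The retrieval loop described above (AND all phrases, then drop the least significant one on failure) can be sketched as follows. The search() function is a trivial substring matcher standing in for the Lucene sentence index, and the example phrases are invented.

```python
# Sketch of the query-relaxation loop: Boolean AND of all phrases, dropping
# the least significant phrase on each failure until the search succeeds.
CORPUS = ["the treaty was signed in Rome in 1957"]  # toy sentence index

def search(terms):
    return [s for s in CORPUS if all(t in s for t in terms)]

def relaxed_search(phrases_least_significant_first):
    terms = list(phrases_least_significant_first)
    while terms:
        hits = search(terms)  # Boolean AND of all remaining phrases
        if hits:
            return hits
        terms.pop(0)          # drop the least significant phrase
    return []

print(relaxed_search(["Paris", "treaty", "1957"]))  # relaxes once, then hits
```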

Journal Article
TL;DR: The authors propose a two-stage annotation method for the identification of case roles in Chinese sentences, which makes use of a feature-enhanced string matching technique that takes full advantage of the huge number of sentence patterns in a Treebank.
Abstract: A two-stage annotation method for the identification of case roles in Chinese sentences is proposed. The approach makes use of a feature-enhanced string matching technique which takes full advantage of the huge number of sentence patterns in a Treebank. The first stage of the approach is a coarse-grained syntactic parsing, which is complemented by a semantic dissimilarity analysis in the latter stage. The approach goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without getting bogged down in a complete linguistic analysis. The ideas described have been implemented, and an evaluation on 5,000 Chinese sentences is examined in order to justify the approach's significance.

01 Jan 2006
TL;DR: A shallow syntactic annotation scheme for Icelandic text is described, comprising a set of grammatical descriptors and their application guidelines, together with a grammar definition corpus annotated using the scheme.
Abstract: We describe a shallow syntactic annotation scheme for Icelandic text. The scheme comprises a set of grammatical descriptors and their application guidelines. The descriptors consist of brackets and labels which indicate constituent structure and functional relations. Additionally, we describe a grammar definition corpus, annotated using the annotation scheme. The annotation scheme has been developed as a part of a shallow parsing project.

Book ChapterDOI
17 Dec 2006
TL;DR: The contribution of the LIPN to the NLQ2NEXI task (part of the Natural Language Processing (NLP) track) of the Initiative for the Evaluation of XML Retrieval (INEX 2006) uses shallow parsing methods to analyse natural language queries.
Abstract: This article presents the contribution of the LIPN (Laboratoire d’Informatique de Paris Nord, France) to the NLQ2NEXI (Natural Language Queries to NEXI) task, part of the Natural Language Processing (NLP) track of the Initiative for the Evaluation of XML Retrieval (INEX 2006). It discusses the use of shallow parsing methods to analyse natural language queries.

Proceedings Article
01 Dec 2006
TL;DR: A hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge and a set of statistic-based association measures (AMs) as filters is presented.
Abstract: This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all the noun phrase collocations from a shallow-parsed corpus by using syntactic knowledge in the form of phrase rules. It then removes pseudo-collocations by using a set of statistic-based association measures (AMs) as filters. There are two main purposes in the design of this hybrid algorithm: (1) to maintain a reasonable recall while improving precision, and (2) to investigate the proposed association measures on Chinese noun phrase collocations. The performance is compared with a pure statistical model and a pure rule-based method on a 60 MB PoS-tagged corpus. The experimental results show that the proposed hybrid method achieves a higher precision of 92.65% and recall of 47% based on 29 randomly selected noun headwords, compared with the precision of 78.87% and recall of 27.19% of a statistics-based extraction system. The F-score improvement is 55.7%.
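
As a sketch of the statistical filtering stage, here is pointwise mutual information applied to rule-extracted candidate pairs. PMI is one common association measure; whether it is among the paper's AMs is not stated in the abstract, and the counts, threshold and single candidate pair below are toy values.

```python
# Toy association-measure filter over rule-extracted candidate collocations.
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information from raw co-occurrence counts."""
    return math.log2((n_xy / n_total) / ((n_x / n_total) * (n_y / n_total)))

candidates = {("经济", "发展"): (120, 400, 500)}  # (pair, w1, w2) counts
N, THRESHOLD = 100_000, 3.0
kept = [pair for pair, (n_xy, n_x, n_y) in candidates.items()
        if pmi(n_xy, n_x, n_y, N) >= THRESHOLD]
print(kept)  # the pair survives: PMI ~ 5.9
```

The rules keep recall high by over-generating candidates; the AM threshold then trades a little recall for the precision gain the paper reports.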

Book ChapterDOI
29 Oct 2006
TL;DR: A semi-automatic method of extracting and representing the various ontological relations of Korean numeral classifiers is proposed; shallow parsing and word-sense disambiguation were used to extract semantic relations from natural language texts and from wordnets.
Abstract: Many studies have focused on the fact that numeral classifiers give decisive clues to the semantic categorization of nouns. However, few studies have analyzed the ontological relationships of classifiers or the construction of classifier ontologies. In this paper, a semi-automatic method of extracting and representing the various ontological relations of Korean numeral classifiers is proposed. Shallow parsing and word-sense disambiguation were used to extract semantic relations from natural language texts and from wordnets.

Proceedings Article
01 Dec 2006
TL;DR: The purpose of this paper is to characterize a chunk boundary parsing algorithm that uses a statistical method combined with adjustment rules, serving as a supplement to traditional statistics-based parsing methods.
Abstract: Natural language processing (NLP) is a very active research domain. One important branch of it is sentence analysis, including Chinese sentence analysis. However, no mature deep analysis theories and techniques are currently available. An alternative is to perform shallow parsing on sentences, which is very popular in the field. Chunk identification is a fundamental task for shallow parsing. The purpose of this paper is to characterize a chunk boundary parsing algorithm that uses a statistical method combined with adjustment rules, serving as a supplement to traditional statistics-based parsing methods. The experimental results show that the model works well on a small dataset. It will contribute to subsequent processes such as chunk tagging and chunk collocation extraction.
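
One way to picture "statistical method plus adjustment rules" is a rule pass that repairs label sequences the statistical chunker should not be able to produce. The single BIO repair rule below is our illustration, not the paper's rule set.

```python
# Illustrative adjustment-rule pass over statistical chunker output.
def repair_bio(tags):
    fixed = list(tags)
    for i, tag in enumerate(fixed):
        # An I- tag cannot open a chunk at sentence start or after O.
        if tag.startswith("I-") and (i == 0 or fixed[i - 1] == "O"):
            fixed[i] = "B-" + tag[2:]
    return fixed

print(repair_bio(["O", "I-NP", "I-NP"]))  # -> ['O', 'B-NP', 'I-NP']
```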

Book ChapterDOI
23 Oct 2006
TL;DR: In the belief that punctuation can aid the process of sentence structure analysis, this work focuses on a prior assignment of values to commas in Spanish texts, with very encouraging results.
Abstract: In the belief that punctuation can aid the process of sentence structure analysis, our work focuses on a prior assignment of values to commas in Spanish texts. Supervised machine learning techniques are applied to learn comma classifiers, taking positional information and part-of-speech tags as input attributes. One of these comma classifiers and a rule-based analyzer are combined in order to recognize and label text structures. The prior assignment of values to commas allowed the simplification of the recognition rules, with very encouraging results.

Book ChapterDOI
19 Feb 2006
TL;DR: A two-stage annotation method for identification of case roles in Chinese sentences is proposed which goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without being bogged down into a complete linguistic analysis.
Abstract: A two-stage annotation method for the identification of case roles in Chinese sentences is proposed. The approach makes use of a feature-enhanced string matching technique which takes full advantage of the huge number of sentence patterns in a Treebank. The first stage of the approach is a coarse-grained syntactic parsing, which is complemented by a semantic dissimilarity analysis in the latter stage. The approach goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without getting bogged down in a complete linguistic analysis. The ideas described have been implemented, and an evaluation on 5,000 Chinese sentences is examined in order to justify the approach's significance.