
Showing papers on "Shallow parsing published in 2009"


Journal ArticleDOI
01 Mar 2009
TL;DR: Parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision, MWE precision, and grammatical precision, which bears a high importance in the perspective of the subsequent integration of extraction results in other NLP applications.
Abstract: An impressive amount of work was devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result bears a high importance, especially in the perspective of the subsequent integration of extraction results in other NLP applications.
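As background for the mobile-window baseline the evaluation compares against, below is a minimal sketch of window-based collocation candidate extraction ranked with Dunning's log-likelihood ratio; the function names, window size, frequency cutoff and the choice of association measure are illustrative assumptions, not the system described in the paper.

```python
# Hedged sketch of a mobile-window collocation extractor: collect word pairs
# co-occurring within a sliding window and rank them with Dunning's
# log-likelihood ratio. Names and thresholds are illustrative assumptions.
import math
from collections import Counter

def window_candidates(tokens, window=5):
    """Yield ordered co-occurrence pairs within a sliding window."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + window]:
            yield (w1, w2)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    def xlogx(*cells):
        return sum(c * math.log(c) for c in cells if c > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (xlogx(k11, k12, k21, k22)        # observed cells
                - xlogx(k11 + k12, k21 + k22)    # row marginals
                - xlogx(k11 + k21, k12 + k22)    # column marginals
                + xlogx(n))                      # grand total

def score_pairs(tokens, window=5, min_count=2):
    pairs = Counter(window_candidates(tokens, window))
    left, right = Counter(), Counter()
    for (w1, w2), c in pairs.items():
        left[w1] += c
        right[w2] += c
    n = sum(pairs.values())
    scored = {}
    for (w1, w2), k11 in pairs.items():
        if k11 < min_count:
            continue
        k12 = left[w1] - k11
        k21 = right[w2] - k11
        scored[(w1, w2)] = llr(k11, k12, k21, n - k11 - k12 - k21)
    return sorted(scored.items(), key=lambda kv: -kv[1])

tokens = "he made a decision and she made a decision too".split()
print(score_pairs(tokens)[:3])
```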

45 citations


Proceedings ArticleDOI
06 Aug 2009
TL;DR: This paper evaluates SRL methods that take partial parses as inputs and implements SRL systems which cast SRL as the classification of syntactic chunks with IOB2 representation for semantic roles (i.e. semantic chunks).
Abstract: Most existing systems for Chinese Semantic Role Labeling (SRL) make use of full syntactic parses. In this paper, we evaluate SRL methods that take partial parses as inputs. We first extend the study on Chinese shallow parsing presented in (Chen et al., 2006) by raising a set of additional features. On the basis of our shallow parser, we implement SRL systems which cast SRL as the classification of syntactic chunks with IOB2 representation for semantic roles (i.e. semantic chunks). Two labeling strategies are presented: 1) directly tagging semantic chunks in one stage, and 2) identifying argument boundaries as a chunking task and labeling their semantic types as a classification task. For both methods, we present encouraging results, achieving significant improvements over the best reported SRL performance in the literature. Additionally, we put forward a rule-based algorithm to automatically acquire Chinese verb formation, which is empirically shown to enhance SRL.
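To make the chunk-level output representation concrete, the sketch below decodes a sequence of IOB2 semantic-chunk tags into labeled argument spans; the toy tags and the decode_iob2 helper are illustrative only, not the classifiers or features used in the paper.

```python
# Illustrative decoder for IOB2-style "semantic chunks": each token carries a
# tag such as B-ARG0, I-ARG0 or O, and contiguous B-/I- runs are collected
# into labeled argument spans.
def decode_iob2(tokens, tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                      # extend the current span
        else:                             # "O" or an inconsistent I- tag
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return [(lab, tokens[s:e]) for lab, s, e in spans]

tokens = "警察 逮捕 了 小偷".split()
tags = ["B-ARG0", "B-V", "O", "B-ARG1"]
print(decode_iob2(tokens, tags))  # [('ARG0', ['警察']), ('V', ['逮捕']), ('ARG1', ['小偷'])]
```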

35 citations


Journal ArticleDOI
TL;DR: The experimental results showed that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improved the performance of text classification, in comparison with the best performance achieved in the BioCreAtIvE II IAS.

30 citations


Proceedings ArticleDOI
Lin Li, Xia Hu, Biyun Hu, Jun Wang, Yiming Zhou
12 Jul 2009
TL;DR: Experiments show that the proposed method compares sentence similarity more exactly and gives a more reasonable result, closer to people's comprehension of the meanings of the sentences.
Abstract: The paper proposes to determine sentence similarities from different aspects. Based on the information people get from a sentence, Objects-Specified Similarity, Objects-Property Similarity, Objects-Behavior Similarity and Overall Similarity are defined to determine sentence similarities from four aspects. Experiments show that the proposed method compares sentence similarity more exactly and gives a more reasonable result, closer to people's comprehension of the meanings of the sentences.

29 citations


Book ChapterDOI
25 Aug 2009
TL;DR: This article presents a formalism and a beta version of a new tool for simultaneous morphosyntactic disambiguation and shallow parsing, which facilitates the task of shallow parsing of morphosyntactically ambiguous or erroneously disambiguated input.
Abstract: This article presents a formalism and a beta version of a new tool for simultaneous morphosyntactic disambiguation and shallow parsing. Unlike in the case of other shallow parsing formalisms, the rules of the grammar allow for explicit morphosyntactic disambiguation statements, independently of structure-building statements, which facilitates the task of the shallow parsing of morphosyntactically ambiguous or erroneously disambiguated input.

15 citations


Proceedings Article
06 Aug 2009
TL;DR: In this paper, the authors present a specialized comparable corpora compilation tool for which quality would be close to a manually compiled corpus, based on three levels: domain, topic and type of discourse.
Abstract: We present in this paper the development of a specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. But the detection of the type of discourse needs a wide linguistic analysis. The first step of our work is to automate the detection of the type of discourse that can be found in a scientific domain (science and popular science) in French and Japanese languages. First, a contrastive stylistic analysis of the two types of discourse is done on both languages. This analysis leads to the creation of a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70% that demonstrate the efficiency of this typology. This classification tool is then inserted in a corpus compilation tool which is a text collection treatment chain realized through the IBM UIMA system. Starting from two specialized web document collections in French and Japanese, this tool creates the corresponding corpus.
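As a rough illustration of the final classification step described above, the sketch below trains a discourse-type classifier from hand-built feature dictionaries; the feature names, the toy values and the scikit-learn logistic-regression pipeline are assumptions for illustration, not the paper's typology or learner.

```python
# Hedged sketch: classifying documents as "science" vs "popular science" from
# typology-style cues (structural, modal, lexical), represented here as plain
# feature dictionaries. Feature names and values are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    {"first_person_pronouns": 0.01, "citations_per_sent": 0.8, "imperatives": 0.00},
    {"first_person_pronouns": 0.12, "citations_per_sent": 0.0, "imperatives": 0.05},
]
train_labels = ["science", "popular_science"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
model.fit(train_docs, train_labels)
print(model.predict([{"first_person_pronouns": 0.02,
                      "citations_per_sent": 0.5,
                      "imperatives": 0.00}]))
```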

12 citations


Proceedings Article
01 Sep 2009
TL;DR: A method to evaluate the PP attachment task in a more natural situation is provided, making it possible to compare the approach to full statistical parsing approaches, and the domain adaptation properties of both approaches are investigated.
Abstract: In this paper we extend a shallow parser [6] with prepositional phrase attachment. Although the PP attachment task is a well-studied task in a discriminative learning context, it is mostly addressed in the context of artificial situations like the quadruple classification task [18] in which only two possible attachment sites, each time a noun or a verb, are possible. In this paper we provide a method to evaluate the task in a more natural situation, making it possible to compare the approach to full statistical parsing approaches. First, we show how to extract anchor-pp pairs from parse trees in the GENIA and WSJ treebanks. Next, we discuss the extension of the shallow parser with a PP-attacher. We compare the PP attachment module with a statistical full parsing approach [4] and analyze the results. More specifically, we investigate the domain adaptation properties of both approaches (in this case domain shifts between journalistic and medical language).
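For contrast with the "more natural" evaluation the paper argues for, the sketch below shows the artificial quadruple formulation it mentions: each (verb, noun1, preposition, noun2) tuple is classified as a verb or noun attachment. The toy data and the scikit-learn pipeline are illustrative assumptions, not the paper's PP-attacher.

```python
# Hedged sketch of the classic "quadruple" PP-attachment setup: classify each
# (verb, noun1, preposition, noun2) tuple as a verb (V) or noun (N) attachment.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

quadruples = [
    {"v": "ate", "n1": "pizza", "p": "with", "n2": "fork"},       # verb attachment
    {"v": "ate", "n1": "pizza", "p": "with", "n2": "anchovies"},  # noun attachment
    {"v": "saw", "n1": "man", "p": "with", "n2": "telescope"},    # verb attachment
]
labels = ["V", "N", "V"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(quadruples, labels)
print(clf.predict([{"v": "ate", "n1": "salad", "p": "with", "n2": "dressing"}]))
```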

11 citations


Proceedings ArticleDOI
06 Aug 2009
TL;DR: A specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus, is presented, based on three levels: domain, topic and type of discourse.
Abstract: We present in this paper the development of a specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. But the detection of the type of discourse needs a wide linguistic analysis. The first step of our work is to automate the detection of the type of discourse that can be found in a scientific domain (science and popular science) in French and Japanese languages. First, a contrastive stylistic analysis of the two types of discourse is done on both languages. This analysis leads to the creation of a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70% that demonstrate the efficiency of this typology. This classification tool is then inserted in a corpus compilation tool which is a text collection treatment chain realized through the IBM UIMA system. Starting from two specialized web document collections in French and Japanese, this tool creates the corresponding corpus.

11 citations


Book ChapterDOI
30 Sep 2009
TL;DR: UAIC's QA systems participating in the Ro-Ro and En-En tasks adhered to the classical QA architecture, with an emphasis on simplicity and real time answers.
Abstract: 2009 marked UAIC's fourth consecutive participation at the QA@CLEF competition, with continually improving results. This paper describes UAIC's QA systems participating in the Ro-Ro and En-En tasks. Both systems adhered to the classical QA architecture, with an emphasis on simplicity and real time answers: only shallow parsing was used for question processing, the indexes used by the retrieval module were at coarse-grained paragraph and document levels, and the answer extraction component used simple pattern-based rules and lexical similarity metrics for candidate answer ranking. The results obtained for this year's participation were greatly improved from those of our team's previous participations, with an accuracy of 54% on the EN-EN task and 47% on the RO-RO task.

10 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: The results show that CRFs outperform SVMs and Maxent in terms of accuracy and will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions.
Abstract: In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We present an analysis of the application of three major learning algorithms (viz. Maximum Entropy Models [2] [9], Conditional Random Fields [12] and Support Vector Machines [8]) to part-of-speech tagging and chunking for the Hindi language using datasets of different sizes. The use of language independent features makes this analysis more general and capable of concluding important results for similar South and South East Asian languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We are able to achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using the Conditional Random Fields algorithm. The corpus we used had 138177 annotated instances for training. We report results for the three learning algorithms by varying various conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs etc.) and present an extensive analysis of the whole process. These results will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions.
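To illustrate what "language independent features" for a CRF sequence labeler typically look like, here is a minimal sketch; the feature set and the toy Hindi sentence are assumptions, and the sklearn_crfsuite package merely stands in for whichever CRF toolkit the authors actually used.

```python
# Hedged sketch of language-independent, window-based features for CRF
# chunking; names, features and the toy annotation are illustrative.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "word": w, "prefix3": w[:3], "suffix3": w[-3:],
        "is_digit": w.isdigit(), "word_len": len(w),
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# One toy sentence with hypothetical IOB chunk tags.
sentences = [["राम", "ने", "किताब", "पढ़ी"]]
chunk_tags = [["B-NP", "I-NP", "B-NP", "B-VG"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, chunk_tags)
print(crf.predict(X_train))
```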

8 citations


17 Sep 2009
TL;DR: Five basic natural language processing components were originally developed for English within OpenNLP, an open source maximum entropy based machine learning toolkit, and were retrained based on manually annotated training data from the BulTreeBank.
Abstract: We describe our efforts in adapting five basic natural language processing components to Bulgarian: sentence splitter, tokenizer, part-of-speech tagger, chunker, and syntactic parser. The components were originally developed for English within OpenNLP, an open source maximum entropy based machine learning toolkit, and were retrained based on manually annotated training data from the BulTreeBank. The evaluation results show an F1 score of 92.54% for the sentence splitter, 98.49% for the tokenizer, 94.43% for the part-of-speech tagger, 84.60% for the chunker, and 77.56% for the syntactic parser, which should be interpreted as a baseline for Bulgarian.

Proceedings ArticleDOI
Lin Li, Yiming Zhou, Boqiu Yuan, Jun Wang, Xia Hu
14 Aug 2009
TL;DR: The paper proposes a novel method to determine sentence similarities based on a semantic vector method that has a high performance in F-measure and Recall.
Abstract: The paper proposes a novel method to determine sentence similarities. First, the two compared sentences are parsed by shallow parsing and all noun phrases, verb phrases and prepositional phrases of each sentence are extracted. Then the similarity between each kind of phrase is calculated based on a semantic vector method. The overall sentence similarity is defined as a combination of the semantic similarities of the three kinds of phrases. Experiments show that the proposed method has a high performance in F-measure (81.6%) and Recall (97.4%).
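The sketch below illustrates the combination step described above: per-phrase-type similarities (NP, VP, PP) are merged into an overall score. The bag-of-words cosine and the weights are illustrative assumptions, not the paper's semantic-vector method.

```python
# Illustrative sketch: overall sentence similarity as a weighted combination of
# per-phrase-type similarities, each computed here as a cosine over
# bag-of-words vectors of the extracted phrases.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def phrase_sim(phrases_a, phrases_b):
    return cosine(Counter(w for p in phrases_a for w in p),
                  Counter(w for p in phrases_b for w in p))

def sentence_similarity(chunks_a, chunks_b, weights=None):
    weights = weights or {"NP": 0.5, "VP": 0.3, "PP": 0.2}  # hypothetical weights
    return sum(w * phrase_sim(chunks_a.get(k, []), chunks_b.get(k, []))
               for k, w in weights.items())

s1 = {"NP": [["the", "cat"]], "VP": [["sat"]], "PP": [["on", "the", "mat"]]}
s2 = {"NP": [["a", "cat"]], "VP": [["sat"]], "PP": [["on", "a", "rug"]]}
print(round(sentence_similarity(s1, s2), 3))
```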

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A parallel version of the GENIA tagger has been implemented and its performance compared on a number of different architectures, with a particular focus on the scalability of the application.
Abstract: There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle exponential growth in text data. Problem sizes are increasing by the day with the addition of new text documents. The task of labelling sequence data such as part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition is one of the most important tasks in text mining. GENIA is a POS tagger which is specifically tuned for biomedical text, built with maximum entropy modelling and a state-of-the-art tagging algorithm. A parallel version of the GENIA tagger has been implemented and its performance compared on a number of different architectures, with a particular focus on the scalability of the application. Scaling to 512 processors has been achieved, and a method to scale to 10000 processors is proposed for massively parallel text mining applications. The parallel implementation of the GENIA tagger uses MPI to achieve portable code.
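The data-parallel pattern described above can be sketched as follows: the root rank splits the document collection into per-rank batches, every rank tags its batch, and results are gathered back. The tag_document placeholder and the use of mpi4py are assumptions for illustration; the paper's implementation wraps the actual GENIA tagger.

```python
# Hedged sketch of MPI data-parallel tagging with mpi4py.
from mpi4py import MPI

def tag_document(text):
    # Placeholder for the real tagger invocation (POS tagging / chunking).
    return [(tok, "NN") for tok in text.split()]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    docs = ["p53 activates transcription", "IL-2 binds its receptor"] * 100
    batches = [docs[i::size] for i in range(size)]   # round-robin split
else:
    batches = None

local_batch = comm.scatter(batches, root=0)          # distribute batches
local_tagged = [tag_document(d) for d in local_batch]
tagged = comm.gather(local_tagged, root=0)           # collect results

if rank == 0:
    print(sum(len(b) for b in tagged), "documents tagged")
```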

Proceedings ArticleDOI
30 Oct 2009
TL;DR: Besides common lexical features, various overlap features and base phrase chunking information are used to improve the performance of the feature-based protein-protein interaction extraction from biomedical literature using Support Vector Machines.
Abstract: This paper explores protein-protein interaction extraction from biomedical literature using Support Vector Machines (SVM). Besides common lexical features, various overlap features and base phrase chunking information are used to improve the performance. Evaluation on the AIMed corpus shows that our feature-based method achieves very encouraging performances of 68.6 and 51.0 in F-measure with 10-fold pairwise cross-validation and 10-fold document-wise cross-validation respectively, which are comparable with other state-of-the-art feature-based methods. Keywords: Protein-Protein Interaction; SVM; Shallow Parsing Information
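The sketch below shows the general shape of such feature-based pair classification: each candidate protein pair becomes a feature dictionary (lexical, overlap and chunk cues) fed to an SVM. The feature names, toy values and scikit-learn pipeline are illustrative assumptions, not the paper's exact feature set.

```python
# Hedged sketch of feature-based protein-protein interaction classification.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

def pair_features(tokens, p1, p2, chunk_tags):
    lo, hi = min(p1, p2), max(p1, p2)
    between = tokens[lo + 1:hi]
    return {
        "words_between": len(between),
        "interaction_verb_between": any(w in {"binds", "activates", "inhibits"}
                                        for w in between),
        "chunks_between": sum(1 for t in chunk_tags[lo + 1:hi] if t.startswith("B-")),
    }

X = [pair_features("PROT1 binds PROT2".split(), 0, 2, ["B-NP", "B-VP", "B-NP"]),
     pair_features("PROT1 and PROT2 were studied".split(), 0, 2,
                   ["B-NP", "O", "B-NP", "B-VP", "I-VP"])]
y = [1, 0]  # interacting vs. not interacting

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([pair_features("PROT1 activates PROT2".split(), 0, 2,
                                 ["B-NP", "B-VP", "B-NP"])]))
```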

Journal ArticleDOI
TL;DR: This work is one of the first attempts to apply text-mining techniques to the task of assigning semantic roles to protein mentions, and suggests that the phrase-based CRF model benefits from the flexibility to use correlated domain-specific features that describe the dependencies between TFs and other entities.

Book ChapterDOI
02 Oct 2009
TL;DR: It is shown that the valence dictionary obtained with the use of shallow parsing attains higher quality when it is measured on the basis of a corpus of valence frames, while the dictionary produced with the help of deep parsing seems superior when the results are compared to existing valence dictionaries.
Abstract: This article presents the evaluation of a valence dictionary for Polish produced with the help of shallow parsing techniques and compares those results to earlier results involving deep parsing. We show that the valence dictionary obtained with the use of shallow parsing attains higher quality when it is measured on the basis of a corpus of valence frames, while the dictionary produced with the help of deep parsing seems superior when the results are compared to existing valence dictionaries.

Journal ArticleDOI
TL;DR: A two-phase annotation method for semantic labeling in natural language processing which goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without being bogged down into a complete linguistic analysis.
Abstract: A two-phase annotation method for semantic labeling in natural language processing is proposed. The dynamic programming approach stresses non-exact string matching which takes full advantage of the underlying grammatical structure of the parse trees in a Treebank. The first phase of the labeling is a coarse-grained syntactic parsing, which is complemented by a semantic dissimilarity analysis in the latter phase. The approach goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without being bogged down in a complete linguistic analysis. The paper presents experimental results for recognizing more than 50 different semantic labels in 10,000 sentences. Results show that the approach improves the labeling, even with incomplete information. Detailed evaluations are discussed in order to justify its significance.
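As a point of reference for the dynamic-programming, non-exact matching idea mentioned above, the sketch below computes a standard edit distance over sequences of syntactic labels, which a Treebank-based labeler could use to align a new parse against stored patterns. The unit costs are an assumption, not the paper's cost model.

```python
# Illustrative edit distance over syntactic label sequences (dynamic programming).
def edit_distance(seq_a, seq_b):
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a label
                           dp[i][j - 1] + 1,         # insert a label
                           dp[i - 1][j - 1] + cost)  # match / substitute
    return dp[m][n]

print(edit_distance(["NP", "VP", "PP"], ["NP", "VP", "NP", "PP"]))  # 1
```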

Proceedings ArticleDOI
30 Mar 2009
TL;DR: This work provides a model theory for a semantic formalism that is designed for this, namely Robust Minimal Recursion Semantics (rmrs), and shows that rmrs supports a notion of entailment that allows for comparing the semantic output of different parses of varying depth.
Abstract: One way to construct semantic representations in a robust manner is to enhance shallow language processors with semantic components. Here, we provide a model theory for a semantic formalism that is designed for this, namely Robust Minimal Recursion Semantics (rmrs). We show that rmrs supports a notion of entailment that allows it to form the basis for comparing the semantic output of different parses of varying depth.

Book ChapterDOI
17 Feb 2009
TL;DR: This paper describes the application of paraphrasing to steganography, using Modern Greek text as the cover medium, and describes the syntactic transformations, which require minimal linguistic resources and are easily portable to other inflectional languages.
Abstract: This paper describes the application of paraphrasing to steganography, using Modern Greek text as the cover medium. Paraphrases are learned in two phases: a set of shallow empirical rules is applied to every input sentence, leading to an initial pool of paraphrases. The pool is then filtered through supervised learning techniques. The syntactic transformations are shallow and require minimal linguistic resources, allowing the methodology to be easily ported to other inflectional languages. A secret key shared between two communicating parties helps them agree on one chosen paraphrase, the presence (or absence) of which represents a binary bit of hidden information. The ability to simultaneously apply more than one rule, and each rule more than once, to an input sentence increases the paraphrase pool size, thereby ensuring steganographic security.
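A toy sketch of the bit-per-sentence idea described above follows: a shared key deterministically selects one paraphrase per sentence, and emitting that paraphrase encodes a 1 while emitting the original encodes a 0. The keyed selection via HMAC-SHA256 and the toy decoder are illustrative assumptions, not the paper's actual protocol.

```python
# Hedged sketch: hide one bit per sentence by applying (or not) a key-selected
# paraphrase. Function names and the HMAC-based choice are illustrative.
import hashlib
import hmac

def keyed_choice(key, sentence, paraphrases):
    digest = hmac.new(key, sentence.encode("utf-8"), hashlib.sha256).digest()
    return paraphrases[digest[0] % len(paraphrases)]

def embed_bit(key, sentence, paraphrases, bit):
    return keyed_choice(key, sentence, paraphrases) if bit else sentence

def extract_bit(key, original, paraphrases, observed):
    # Toy decoder: assumes the receiver can reconstruct the unparaphrased form,
    # e.g. by reversing the shallow rule that produced the paraphrase.
    return int(observed == keyed_choice(key, original, paraphrases))

key = b"shared-secret"
original = "Ο υπουργός ανακοίνωσε το μέτρο χθες."
paraphrases = ["Το μέτρο ανακοινώθηκε χθες από τον υπουργό."]
cover = embed_bit(key, original, paraphrases, 1)
print(cover, extract_bit(key, original, paraphrases, cover))
```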

Journal Article
Xie Shuangxi
TL;DR: Results show that the method can automatically extract structure technical solution information from patents, assist the deeper application of patents in conceptual design, and meet the requirements of conceptual design knowledge.
Abstract: Patents have become an important knowledge resource for conceptual design on account of their innovation and practicability. Information extraction of structure technical solutions from product patents is a basic task. Aiming at mechanical product patents, a conceptual model of the technical solution for patent information extraction, which meets the requirements of conceptual design knowledge, is described. The task of information extraction is composed of two parts: technical component extraction and technical relation extraction. Moreover, the construction of a knowledge base for information extraction is studied. Technical components are extracted using non-deterministic finite state automata. Based on frame semantics, a patent verb semantic frame library is built for technical relation extraction. Further, the process of information extraction of technical solutions based on natural language understanding is put forward. Key techniques of shallow parsing and semantic parsing are also studied. The approach is illustrated on US patents. Results show that the method can automatically extract structure technical solution information from patents and assist the deeper application of patents in conceptual design.

Proceedings ArticleDOI
06 Nov 2009
TL;DR: The system streamlines and optimizes the processes of compiling, revising, editing, proofreading and typesetting, and distinguishes itself from other similar systems in that it pays more attention to learners, with much information derived from bilingual and learner's corpora by statistical techniques and shallow parsing.
Abstract: The paper reports our computer-assisted dictionary-making system for an English-Chinese learner's dictionary. The system aims to enhance language quality and format consistency in dictionary compilation, which is realized by a number of linguistic analysis modules and editorial assistant tools. The system embeds a) a concordancer of equivalent words in English-Chinese bilingual corpora based on probability coefficients, b) a collocation extraction tool over grammatical relations by shallow syntactic parsing, and c) a colligation finder based on part-of-speech tagged corpora. All these lead to better language quality of learner's dictionaries. In addition, a desktop publishing tool and a database human-machine interface are implemented to ensure format consistency in the entries produced by different lexicographers, thereby contributing to the quality of the dictionaries. The system streamlines and optimizes the processes of compiling, revising, editing, proofreading and typesetting. It distinguishes itself from other similar systems in that it pays more attention to learners, with much information derived from bilingual and learner's corpora by statistical techniques and shallow parsing.

01 Jan 2009
TL;DR: This paper presents the automatic detection of the type of discourse in French and Japanese documents, which requires a wide linguistic analysis, and creates a robust and linguistically motivated typology based on structural, modal and lexical levels.
Abstract: Our goal is to automate the compilation of smart specialized comparable corpora. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. We present in this paper the automatic detection of the type of discourse in French and Japanese documents, which needs a wide linguistic analysis. A contrastive analysis of the documents leads us to specify which information is relevant to distinguish them. Referring to classical studies on information retrieval, we create a robust and linguistically motivated typology based on three analysis levels: structural, modal and lexical. This typology is used to learn classification models using shallow parsing. We obtain good results, which demonstrates the efficiency of this typology.

Book ChapterDOI
25 Aug 2009
TL;DR: This paper introduces the strategy for adapting a rule based parser of written language to transcribed speech and gives a detailed analysis of the types of errors made by the parser while analyzing the corpus of disfluencies.
Abstract: This paper introduces our strategy for adapting a rule based parser of written language to transcribed speech. Special attention has been paid to disfluencies (repairs, repetitions and false starts). A Constraint Grammar based parser was used for shallow syntactic analysis of spoken Estonian. The modification of grammar and additional methods improved the recall from 97.5% to 97.6% and precision from 91.6% to 91.8%. Also, the paper gives a detailed analysis of the types of errors made by the parser while analyzing the corpus of disfluencies.

Book ChapterDOI
25 Aug 2009
TL;DR: This work presents an alternative approach to shallow parsing of noun phrases for Slavic languages which follows Abney's original principles, and shows that continuous phrase chunking as well as shallow constituency parsing display evident drawbacks when faced with freer word order languages.
Abstract: Shallow parsing has been proposed as a means of arriving at practically useful structures while avoiding the difficulties of full syntactic analysis. According to Abney's principles, it is preferable to leave an ambiguity pending than to make a likely wrong decision. We show that continuous phrase chunking as well as shallow constituency parsing display evident drawbacks when faced with freer word order languages. Those drawbacks may lead to unnecessary data loss as a result of decisions forced by the formalism and therefore diminish the practical value of shallow parsers for Slavic languages. We present an alternative approach to shallow parsing of noun phrases for Slavic languages which follows Abney's original principles. The proposed approach to parsing is decomposed into several stages, some of which allow for marking discontinuous phrases.

Journal Article
TL;DR: This paper presents a new algorithm for named entity recognition based on cascaded conditional random fields, and experimentally evaluates the algorithm on a large-scale corpus.
Abstract: Named entity recognition is one of the fundamental problems in many natural language processing applications, such as information extraction, information retrieval, machine translation, shallow parsing and question answering systems. This paper mainly researches the recognition of complex location and complex organization names in Chinese named entity recognition. We present a new algorithm for named entity recognition based on cascaded conditional random fields and experimentally evaluate it on a large-scale corpus. In the open test, the recall, precision and F-measure of the two recognition tasks reach 91.95%, 89.99%, 90.50% and 90.07%, 88.72%, 89.39% respectively.

Journal Article
TL;DR: This paper proposes a distributed strategy for Chinese text chunking on the basis of Conditional Random Fields and an error-driven technique, and describes a method to deal with conflicting chunks according to their F-measure values.
Abstract: This paper proposes a distributed strategy for Chinese text chunking on the basis of Conditional Random Fields (CRFs) and an error-driven technique. First, eleven types of Chinese chunks are divided into different groups to build CRF models respectively. Then, the error-driven technique is applied over the CRF chunking results for further modification. Finally, a method is described to deal with conflicting chunks according to the F-measure values. The experimental results show that this approach is effective, outperforming the single CRF-based approach, the distributed method and other hybrid approaches, reaching 94.90%, 91.00%, and 92.91% in recall, precision, and F-measure respectively in the open test.
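The conflict-resolution step described above can be pictured as follows: when chunks predicted by different per-group models overlap, the chunk from the model with the higher held-out F-measure wins. The chunk representation and the scores in the sketch are assumptions for illustration, not the paper's procedure.

```python
# Illustrative sketch: resolve overlapping chunk predictions from several
# models by preferring chunks from models with a higher F-measure.
def resolve_conflicts(chunks, model_f1):
    """chunks: list of (start, end, label, model_id) spans; keep a
    non-overlapping subset, ranked by each model's F-measure."""
    ranked = sorted(chunks, key=lambda c: model_f1[c[3]], reverse=True)
    kept = []
    for start, end, label, model in ranked:
        if all(end <= s or start >= e for s, e, _, _ in kept):
            kept.append((start, end, label, model))
    return sorted(kept)

chunks = [(0, 2, "NP", "model_np"), (1, 3, "VP", "model_vp"), (4, 5, "PP", "model_pp")]
model_f1 = {"model_np": 0.93, "model_vp": 0.90, "model_pp": 0.92}
print(resolve_conflicts(chunks, model_f1))
# [(0, 2, 'NP', 'model_np'), (4, 5, 'PP', 'model_pp')]
```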

Proceedings ArticleDOI
08 Dec 2009
TL;DR: This paper uses a statistical shallow parsing approach in the TAG formalism, named supertagging, which enriches the standard POS tags in order to exploit syntactic information about the sentence.
Abstract: Increasing the domain of locality by using tree-adjoining grammars (TAG) encourages some researchers to use it as a modeling formalism in their language applications. But parsing with a rich grammar like TAG faces two main obstacles: low parsing speed and many ambiguous syntactic parses. We use an idea from shallow parsing based on a statistical approach in the TAG formalism, named supertagging, which enriches the standard POS tags in order to exploit syntactic information about the sentence. In this paper, an error-driven method for reaching a full parse from partial parses based on the TAG formalism is presented. These partial parses basically result from the supertagger, which is followed by a simple heuristic-based light parser named the lightweight dependency analyzer (LDA). Like other error-driven methods, the process of generating the deep parses can be divided into two phases, error detection and error correction, where in each phase different completion heuristics are applied to the partial parses. The experiments on the Penn Treebank show considerable improvements in parsing time and the disambiguation process.

Proceedings ArticleDOI
25 Jul 2009
TL;DR: A multi-agent text chunking model is proposed that uses individual sensitive features of each phrase to identify different phrases; it is effective, as the F-score of English chunking with this multi-agent model reaches 95.70%, higher than the best previously reported result.
Abstract: The traditional English text chunking approach identifies phrases using only one model and the same features. It is shown that one model cannot account for each phrase's characteristics, and the same features are not suitable for all phrases. In this paper, a multi-agent text chunking model is proposed. This model uses individual sensitive features of each phrase to identify different phrases. Tests on the public training and test corpora show that this multi-agent model is effective: the F-score of English chunking with the multi-agent model reaches 95.70%, which is higher than the best result that has been reported.

Proceedings ArticleDOI
30 Nov 2009
TL;DR: A new approach to natural-language chunking using an evolutionary model that uses previously captured training information to guide the evolution of the model and a multi-objective optimization strategy is used to produce the best solutions based on the internal and the external quality of chunking.
Abstract: In this work, a new approach to natural-language chunking using an evolutionary model is proposed. This uses previously captured training information to guide the evolution of the model. In addition, a multi-objective optimization strategy is used to produce the best solutions based on the internal and the external quality of chunking. Experiments and the main results obtained using the model and state-of-the-art approaches are discussed.

01 Jan 2009
TL;DR: This paper provides the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi, shows that CRFs outperform SVMs and Maxent in terms of accuracy, and gives future researchers an insight into the comparative performance of major algorithms on datasets of various sizes and in various conditions.
Abstract: In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We present an analysis of the application of three major learning algorithms (viz. Maximum Entropy Models, Conditional Random Fields and Support Vector Machines) to part-of-speech tagging and chunking for the Hindi language using datasets of different sizes. The use of language independent features makes this analysis more general and capable of concluding important results for similar South and South East Asian languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We are able to achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using the Conditional Random Fields algorithm. The corpus we used had 138177 annotated instances for training. We report results for the three learning algorithms by varying various conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs etc.) and present an extensive analysis of the whole process. These results will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions. Both POS tagging and chunking are considered important preprocessing activities, helping in deep parsing of text and in developing information extraction and semantic processing systems; chunking divides sentences into non-recursive, inseparable phrases and can serve as the first step for full parsing.