
Showing papers on "Shallow parsing published in 2009"


Journal ArticleDOI
01 Mar 2009
TL;DR: Parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision, MWE precision, and grammatical precision, which bears a high importance in the perspective of the subsequent integration of extraction results in other NLP applications.
Abstract: An impressive amount of work was devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result bears a high importance, especially in the perspective of the subsequent integration of extraction results in other NLP applications.
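As background for the mobile-window baseline the evaluation compares against, below is a minimal sketch of window-based collocation candidate extraction ranked with Dunning's log-likelihood ratio; the function names, window size, frequency cutoff and the choice of association measure are illustrative assumptions, not the system described in the paper.

```python
# Hedged sketch of a mobile-window collocation extractor: collect word pairs
# co-occurring within a sliding window and rank them with Dunning's
# log-likelihood ratio. Names and thresholds are illustrative assumptions.
import math
from collections import Counter

def window_candidates(tokens, window=5):
    """Yield ordered co-occurrence pairs within a sliding window."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + window]:
            yield (w1, w2)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    def xlogx(*cells):
        return sum(c * math.log(c) for c in cells if c > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (xlogx(k11, k12, k21, k22)        # observed cells
                - xlogx(k11 + k12, k21 + k22)    # row marginals
                - xlogx(k11 + k21, k12 + k22)    # column marginals
                + xlogx(n))                      # grand total

def score_pairs(tokens, window=5, min_count=2):
    pairs = Counter(window_candidates(tokens, window))
    left, right = Counter(), Counter()
    for (w1, w2), c in pairs.items():
        left[w1] += c
        right[w2] += c
    n = sum(pairs.values())
    scored = {}
    for (w1, w2), k11 in pairs.items():
        if k11 < min_count:
            continue
        k12 = left[w1] - k11
        k21 = right[w2] - k11
        scored[(w1, w2)] = llr(k11, k12, k21, n - k11 - k12 - k21)
    return sorted(scored.items(), key=lambda kv: -kv[1])

tokens = "he made a decision and she made a decision too".split()
print(score_pairs(tokens)[:3])
```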

45 citations


Proceedings ArticleDOI
06 Aug 2009
TL;DR: This paper evaluates SRL methods that take partial parses as inputs and implements SRL systems which cast SRL as the classification of syntactic chunks with IOB2 representation for semantic roles (i.e. semantic chunks).
Abstract: Most existing systems for Chinese Semantic Role Labeling (SRL) make use of full syntactic parses. In this paper, we evaluate SRL methods that take partial parses as inputs. We first extend the study on Chinese shallow parsing presented in (Chen et al., 2006) by raising a set of additional features. On the basis of our shallow parser, we implement SRL systems which cast SRL as the classification of syntactic chunks with IOB2 representation for semantic roles (i.e. semantic chunks). Two labeling strategies are presented: 1) directly tagging semantic chunks in one stage, and 2) identifying argument boundaries as a chunking task and labeling their semantic types as a classification task. For both methods, we present encouraging results, achieving significant improvements over the best reported SRL performance in the literature. Additionally, we put forward a rule-based algorithm to automatically acquire Chinese verb formation, which is empirically shown to enhance SRL.
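To make the chunk-level output representation concrete, the sketch below decodes a sequence of IOB2 semantic-chunk tags into labeled argument spans; the toy tags and the decode_iob2 helper are illustrative only, not the classifiers or features used in the paper.

```python
# Illustrative decoder for IOB2-style "semantic chunks": each token carries a
# tag such as B-ARG0, I-ARG0 or O, and contiguous B-/I- runs are collected
# into labeled argument spans.
def decode_iob2(tokens, tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                      # extend the current span
        else:                             # "O" or an inconsistent I- tag
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return [(lab, tokens[s:e]) for lab, s, e in spans]

tokens = "警察 逮捕 了 小偷".split()
tags = ["B-ARG0", "B-V", "O", "B-ARG1"]
print(decode_iob2(tokens, tags))  # [('ARG0', ['警察']), ('V', ['逮捕']), ('ARG1', ['小偷'])]
```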

35 citations


Journal ArticleDOI
TL;DR: The experimental results showed that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improved the performance of text classification, in comparison with the best performance achieved in the BioCreAtIvE II IAS.

30 citations


Proceedings ArticleDOI
Lin Li, Xia Hu, Biyun Hu, Jun Wang, Yiming Zhou
12 Jul 2009
TL;DR: Experiments show that the proposed method compares sentence similarity more exactly and gives a more reasonable result, closer to people's comprehension of the meanings of the sentences.
Abstract: The paper proposes to determine sentence similarities from different aspects. Based on the information people get from a sentence, Objects-Specified Similarity, Objects-Property Similarity, Objects-Behavior Similarity and Overall Similarity are defined to determine sentence similarities from four aspects. Experiments show that the proposed method compares sentence similarity more exactly and gives a more reasonable result, closer to people's comprehension of the meanings of the sentences.

29 citations


Book ChapterDOI
25 Aug 2009
TL;DR: This article presents a formalism and a beta version of a new tool for simultaneous morphosyntactic disambiguation and shallow parsing, which facilitates the task of shallow parsing of morphosyntactically ambiguous or erroneously disambiguated input.
Abstract: This article presents a formalism and a beta version of a new tool for simultaneous morphosyntactic disambiguation and shallow parsing. Unlike in the case of other shallow parsing formalisms, the rules of the grammar allow for explicit morphosyntactic disambiguation statements, independently of structure-building statements, which facilitates the task of the shallow parsing of morphosyntactically ambiguous or erroneously disambiguated input.

15 citations


Proceedings Article
06 Aug 2009
TL;DR: In this paper, the authors present a specialized comparable corpora compilation tool for which quality would be close to a manually compiled corpus, based on three levels: domain, topic and type of discourse.
Abstract: We present in this paper the development of a specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. But the detection of the type of discourse needs a wide linguistic analysis. The first step of our work is to automate the detection of the type of discourse that can be found in a scientific domain (science and popular science) in French and Japanese languages. First, a contrastive stylistic analysis of the two types of discourse is done on both languages. This analysis leads to the creation of a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70% that demonstrate the efficiency of this typology. This classification tool is then inserted in a corpus compilation tool which is a text collection treatment chain realized through the IBM UIMA system. Starting from two specialized web document collections in French and Japanese, this tool creates the corresponding corpus.
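As a rough illustration of the final classification step described above, the sketch below trains a discourse-type classifier from hand-built feature dictionaries; the feature names, the toy values and the scikit-learn logistic-regression pipeline are assumptions for illustration, not the paper's typology or learner.

```python
# Hedged sketch: classifying documents as "science" vs "popular science" from
# typology-style cues (structural, modal, lexical), represented here as plain
# feature dictionaries. Feature names and values are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    {"first_person_pronouns": 0.01, "citations_per_sent": 0.8, "imperatives": 0.00},
    {"first_person_pronouns": 0.12, "citations_per_sent": 0.0, "imperatives": 0.05},
]
train_labels = ["science", "popular_science"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
model.fit(train_docs, train_labels)
print(model.predict([{"first_person_pronouns": 0.02,
                      "citations_per_sent": 0.5,
                      "imperatives": 0.00}]))
```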

12 citations


Proceedings Article
01 Sep 2009
TL;DR: A method to evaluate the PP attachment task in a more natural situation is provided, making it possible to compare the approach to full statistical parsing approaches, and the domain adaptation properties of both approaches are investigated.
Abstract: In this paper we extend a shallow parser [6] with prepositional phrase attachment. Although the PP attachment task is a well-studied task in a discriminative learning context, it is mostly addressed in the context of artificial situations like the quadruple classification task [18] in which only two possible attachment sites, each time a noun or a verb, are possible. In this paper we provide a method to evaluate the task in a more natural situation, making it possible to compare the approach to full statistical parsing approaches. First, we show how to extract anchor-pp pairs from parse trees in the GENIA and WSJ treebanks. Next, we discuss the extension of the shallow parser with a PP-attacher. We compare the PP attachment module with a statistical full parsing approach [4] and analyze the results. More specifically, we investigate the domain adaptation properties of both approaches (in this case domain shifts between journalistic and medical language).
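For contrast with the "more natural" evaluation the paper argues for, the sketch below shows the artificial quadruple formulation it mentions: each (verb, noun1, preposition, noun2) tuple is classified as a verb or noun attachment. The toy data and the scikit-learn pipeline are illustrative assumptions, not the paper's PP-attacher.

```python
# Hedged sketch of the classic "quadruple" PP-attachment setup: classify each
# (verb, noun1, preposition, noun2) tuple as a verb (V) or noun (N) attachment.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

quadruples = [
    {"v": "ate", "n1": "pizza", "p": "with", "n2": "fork"},       # verb attachment
    {"v": "ate", "n1": "pizza", "p": "with", "n2": "anchovies"},  # noun attachment
    {"v": "saw", "n1": "man", "p": "with", "n2": "telescope"},    # verb attachment
]
labels = ["V", "N", "V"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(quadruples, labels)
print(clf.predict([{"v": "ate", "n1": "salad", "p": "with", "n2": "dressing"}]))
```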

11 citations


Proceedings ArticleDOI
06 Aug 2009
TL;DR: A specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus, is presented, based on three levels: domain, topic and type of discourse.
Abstract: We present in this paper the development of a specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. But the detection of the type of discourse needs a wide linguistic analysis. The first step of our work is to automate the detection of the type of discourse that can be found in a scientific domain (science and popular science) in French and Japanese languages. First, a contrastive stylistic analysis of the two types of discourse is done on both languages. This analysis leads to the creation of a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70% that demonstrate the efficiency of this typology. This classification tool is then inserted in a corpus compilation tool which is a text collection treatment chain realized through the IBM UIMA system. Starting from two specialized web document collections in French and Japanese, this tool creates the corresponding corpus.

11 citations


Book ChapterDOI
30 Sep 2009
TL;DR: UAIC's QA systems participating in the Ro-Ro and En-En tasks adhered to the classical QA architecture, with an emphasis on simplicity and real time answers.
Abstract: 2009 marked UAIC's fourth consecutive participation at the QA@CLEF competition, with continually improving results. This paper describes UAIC's QA systems participating in the Ro-Ro and En-En tasks. Both systems adhered to the classical QA architecture, with an emphasis on simplicity and real time answers: only shallow parsing was used for question processing, the indexes used by the retrieval module were at coarse-grained paragraph and document levels, and the answer extraction component used simple pattern-based rules and lexical similarity metrics for candidate answer ranking. The results obtained for this year's participation were greatly improved from those of our team's previous participations, with an accuracy of 54% on the EN-EN task and 47% on the RO-RO task.

10 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: The results show that CRFs outperform SVMs and Maxent in terms of accuracy and will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions.
Abstract: In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We present an analysis of the application of three major learning algorithms (viz. Maximum Entropy Models [2] [9], Conditional Random Fields [12] and Support Vector Machines [8]) to part-of-speech tagging and chunking for the Hindi language using datasets of different sizes. The use of language independent features makes this analysis more general and capable of concluding important results for similar South and South East Asian languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We are able to achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using the Conditional Random Fields algorithm. The corpus we used had 138177 annotated instances for training. We report results for the three learning algorithms by varying various conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs etc.) and present an extensive analysis of the whole process. These results will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions.
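To illustrate what "language independent features" for a CRF sequence labeler typically look like, here is a minimal sketch; the feature set and the toy Hindi sentence are assumptions, and the sklearn_crfsuite package merely stands in for whichever CRF toolkit the authors actually used.

```python
# Hedged sketch of language-independent, window-based features for CRF
# chunking; names, features and the toy annotation are illustrative.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "word": w, "prefix3": w[:3], "suffix3": w[-3:],
        "is_digit": w.isdigit(), "word_len": len(w),
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# One toy sentence with hypothetical IOB chunk tags.
sentences = [["राम", "ने", "किताब", "पढ़ी"]]
chunk_tags = [["B-NP", "I-NP", "B-NP", "B-VG"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, chunk_tags)
print(crf.predict(X_train))
```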

8 citations


17 Sep 2009
TL;DR: Five basic natural language processing components were originally developed for English within OpenNLP, an open source maximum entropy based machine learning toolkit, and were retrained based on manually annotated training data from the BulTreeBank.
Abstract: We describe our efforts in adapting five basic natural language processing components to Bulgarian: sentence splitter, tokenizer, part-of-speech tagger, chunker, and syntactic parser. The components were originally developed for English within OpenNLP, an open source maximum entropy based machine learning toolkit, and were retrained based on manually annotated training data from the BulTreeBank. The evaluation results show an F1 score of 92.54% for the sentence splitter, 98.49% for the tokenizer, 94.43% for the part-of-speech tagger, 84.60% for the chunker, and 77.56% for the syntactic parser, which should be interpreted as a baseline for Bulgarian.

Proceedings ArticleDOI
Lin Li, Yiming Zhou, Boqiu Yuan, Jun Wang, Xia Hu
14 Aug 2009
TL;DR: The paper proposes a novel method to determine sentence similarities based on a semantic vector method that has a high performance in F-measure and Recall.
Abstract: The paper proposes a novel method to determine sentence similarities. First, the two compared sentences are parsed by shallow parsing and all noun phrases, verb phrases and prepositional phrases of each sentence are extracted. Then the similarity between each kind of phrase is calculated based on a semantic vector method. The overall sentence similarity is defined as a combination of the semantic similarities of the three kinds of phrases. Experiments show that the proposed method has a high performance in F-measure (81.6%) and Recall (97.4%).
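The sketch below illustrates the combination step described above: per-phrase-type similarities (NP, VP, PP) are merged into an overall score. The bag-of-words cosine and the weights are illustrative assumptions, not the paper's semantic-vector method.

```python
# Illustrative sketch: overall sentence similarity as a weighted combination of
# per-phrase-type similarities, each computed here as a cosine over
# bag-of-words vectors of the extracted phrases.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def phrase_sim(phrases_a, phrases_b):
    return cosine(Counter(w for p in phrases_a for w in p),
                  Counter(w for p in phrases_b for w in p))

def sentence_similarity(chunks_a, chunks_b, weights=None):
    weights = weights or {"NP": 0.5, "VP": 0.3, "PP": 0.2}  # hypothetical weights
    return sum(w * phrase_sim(chunks_a.get(k, []), chunks_b.get(k, []))
               for k, w in weights.items())

s1 = {"NP": [["the", "cat"]], "VP": [["sat"]], "PP": [["on", "the", "mat"]]}
s2 = {"NP": [["a", "cat"]], "VP": [["sat"]], "PP": [["on", "a", "rug"]]}
print(round(sentence_similarity(s1, s2), 3))
```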

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A parallel version of the GENIA tagger has been implemented and its performance compared on a number of different architectures, with a particular focus on the scalability of the application.
Abstract: There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle exponential growth in text data. Problem sizes are increasing by the day with the addition of new text documents. The task of labelling sequence data such as part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition is one of the most important tasks in text mining. GENIA is a POS tagger which is specifically tuned for biomedical text, built with maximum entropy modelling and a state-of-the-art tagging algorithm. A parallel version of the GENIA tagger has been implemented and its performance compared on a number of different architectures, with a particular focus on the scalability of the application. Scaling to 512 processors has been achieved, and a method to scale to 10000 processors is proposed for massively parallel text mining applications. The parallel implementation of the GENIA tagger uses MPI to achieve portable code.
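The data-parallel pattern described above can be sketched as follows: the root rank splits the document collection into per-rank batches, every rank tags its batch, and results are gathered back. The tag_document placeholder and the use of mpi4py are assumptions for illustration; the paper's implementation wraps the actual GENIA tagger.

```python
# Hedged sketch of MPI data-parallel tagging with mpi4py.
from mpi4py import MPI

def tag_document(text):
    # Placeholder for the real tagger invocation (POS tagging / chunking).
    return [(tok, "NN") for tok in text.split()]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    docs = ["p53 activates transcription", "IL-2 binds its receptor"] * 100
    batches = [docs[i::size] for i in range(size)]   # round-robin split
else:
    batches = None

local_batch = comm.scatter(batches, root=0)          # distribute batches
local_tagged = [tag_document(d) for d in local_batch]
tagged = comm.gather(local_tagged, root=0)           # collect results

if rank == 0:
    print(sum(len(b) for b in tagged), "documents tagged")
```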

Proceedings ArticleDOI
30 Oct 2009
TL;DR: Besides common lexical features, various overlap features and base phrase chunking information are used to improve the performance of the feature-based protein-protein interaction extraction from biomedical literature using Support Vector Machines.
Abstract: This paper explores protein-protein interaction extraction from biomedical literature using Support Vector Machines (SVM). Besides common lexical features, various overlap features and base phrase chunking information are used to improve the performance. Evaluation on the AIMed corpus shows that our feature-based method achieves very encouraging performances of 68.6 and 51.0 in F-measure with 10-fold pairwise cross-validation and 10-fold document-wise cross-validation respectively, which are comparable with other state-of-the-art feature-based methods. Keywords: Protein-Protein Interaction; SVM; Shallow Parsing Information
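The sketch below shows the general shape of such feature-based pair classification: each candidate protein pair becomes a feature dictionary (lexical, overlap and chunk cues) fed to an SVM. The feature names, toy values and scikit-learn pipeline are illustrative assumptions, not the paper's exact feature set.

```python
# Hedged sketch of feature-based protein-protein interaction classification.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

def pair_features(tokens, p1, p2, chunk_tags):
    lo, hi = min(p1, p2), max(p1, p2)
    between = tokens[lo + 1:hi]
    return {
        "words_between": len(between),
        "interaction_verb_between": any(w in {"binds", "activates", "inhibits"}
                                        for w in between),
        "chunks_between": sum(1 for t in chunk_tags[lo + 1:hi] if t.startswith("B-")),
    }

X = [pair_features("PROT1 binds PROT2".split(), 0, 2, ["B-NP", "B-VP", "B-NP"]),
     pair_features("PROT1 and PROT2 were studied".split(), 0, 2,
                   ["B-NP", "O", "B-NP", "B-VP", "I-VP"])]
y = [1, 0]  # interacting vs. not interacting

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([pair_features("PROT1 activates PROT2".split(), 0, 2,
                                 ["B-NP", "B-VP", "B-NP"])]))
```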

Journal ArticleDOI
TL;DR: This work is one of the first attempts to apply text-mining techniques to the task of assigning semantic roles to protein mentions, and suggests that the phrase-based CRF model benefits from the flexibility to use correlated domain-specific features that describe the dependencies between TFs and other entities.

Book ChapterDOI
02 Oct 2009
TL;DR: It is shown that the valence dictionary obtained with the use of shallow parsing attains higher quality when it is measured on the basis of a corpus of valence frames, while the dictionary produced with the help of deep parsing seems superior when the results are compared to existing valence dictionaries.
Abstract: This article presents the evaluation of a valence dictionary for Polish produced with the help of shallow parsing techniques and compares those results to earlier results involving deep parsing. We show that the valence dictionary obtained with the use of shallow parsing attains higher quality when it is measured on the basis of a corpus of valence frames, while the dictionary produced with the help of deep parsing seems superior when the results are compared to existing valence dictionaries.

Journal ArticleDOI
TL;DR: A two-phase annotation method for semantic labeling in natural language processing which goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without being bogged down into a complete linguistic analysis.
Abstract: A two-phase annotation method for semantic labeling in natural language processing is proposed. The dynamic programming approach stresses non-exact string matching which takes full advantage of the underlying grammatical structure of the parse trees in a Treebank. The first phase of the labeling is a coarse-grained syntactic parsing, which is complemented by a semantic dissimilarity analysis in the latter phase. The approach goes beyond shallow parsing to a deeper level of case role identification, while preserving robustness, without being bogged down in a complete linguistic analysis. The paper presents experimental results for recognizing more than 50 different semantic labels in 10,000 sentences. Results show that the approach improves the labeling, even with incomplete information. Detailed evaluations are discussed in order to justify its significance.
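As a point of reference for the dynamic-programming, non-exact matching idea mentioned above, the sketch below computes a standard edit distance over sequences of syntactic labels, which a Treebank-based labeler could use to align a new parse against stored patterns. The unit costs are an assumption, not the paper's cost model.

```python
# Illustrative edit distance over syntactic label sequences (dynamic programming).
def edit_distance(seq_a, seq_b):
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a label
                           dp[i][j - 1] + 1,         # insert a label
                           dp[i - 1][j - 1] + cost)  # match / substitute
    return dp[m][n]

print(edit_distance(["NP", "VP", "PP"], ["NP", "VP", "NP", "PP"]))  # 1
```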

Proceedings ArticleDOI
30 Mar 2009
TL;DR: This work provides a model theory for a semantic formalism that is designed for this, namely Robust Minimal Recursion Semantics (rmrs), and shows that rmrs supports a notion of entailment that allows for comparing the semantic output of different parses of varying depth.
Abstract: One way to construct semantic representations in a robust manner is to enhance shallow language processors with semantic components. Here, we provide a model theory for a semantic formalism that is designed for this, namely Robust Minimal Recursion Semantics (rmrs). We show that rmrs supports a notion of entailment that allows it to form the basis for comparing the semantic output of different parses of varying depth.

Book ChapterDOI
17 Feb 2009
TL;DR: This paper describes the application of paraphrasing to steganography, using Modern Greek text as the cover medium, and describes the syntactic transformations, which require minimal linguistic resources and are easily portable to other inflectional languages.
Abstract: This paper describes the application of paraphrasing to steganography, using Modern Greek text as the cover medium. Paraphrases are learned in two phases: a set of shallow empirical rules is applied to every input sentence, leading to an initial pool of paraphrases. The pool is then filtered through supervised learning techniques. The syntactic transformations are shallow and require minimal linguistic resources, allowing the methodology to be easily ported to other inflectional languages. A secret key shared between two communicating parties helps them agree on one chosen paraphrase, the presence (or absence) of which represents a binary bit of hidden information. The ability to simultaneously apply more than one rule, and each rule more than once, to an input sentence increases the paraphrase pool size, thereby ensuring steganographic security.
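A toy sketch of the bit-per-sentence idea described above follows: a shared key deterministically selects one paraphrase per sentence, and emitting that paraphrase encodes a 1 while emitting the original encodes a 0. The keyed selection via HMAC-SHA256 and the toy decoder are illustrative assumptions, not the paper's actual protocol.

```python
# Hedged sketch: hide one bit per sentence by applying (or not) a key-selected
# paraphrase. Function names and the HMAC-based choice are illustrative.
import hashlib
import hmac

def keyed_choice(key, sentence, paraphrases):
    digest = hmac.new(key, sentence.encode("utf-8"), hashlib.sha256).digest()
    return paraphrases[digest[0] % len(paraphrases)]

def embed_bit(key, sentence, paraphrases, bit):
    return keyed_choice(key, sentence, paraphrases) if bit else sentence

def extract_bit(key, original, paraphrases, observed):
    # Toy decoder: assumes the receiver can reconstruct the unparaphrased form,
    # e.g. by reversing the shallow rule that produced the paraphrase.
    return int(observed == keyed_choice(key, original, paraphrases))

key = b"shared-secret"
original = "Ο υπουργός ανακοίνωσε το μέτρο χθες."
paraphrases = ["Το μέτρο ανακοινώθηκε χθες από τον υπουργό."]
cover = embed_bit(key, original, paraphrases, 1)
print(cover, extract_bit(key, original, paraphrases, cover))
```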

Journal Article
Xie Shuangxi
TL;DR: Results show that the method can automatically extract structure technical solution information from patents, assist the deeper application of patents in conceptual design, and meet the requirements of conceptual design knowledge.
Abstract: Patents have become an important knowledge resource for conceptual design on account of their innovation and practicability. Information extraction of structure technical solutions from product patents is a basic task. Aiming at mechanical product patents, a conceptual model of the technical solution for patent information extraction, which meets the requirements of conceptual design knowledge, is described. The task of information extraction is composed of two parts: technical component extraction and technical relation extraction. Moreover, the construction of a knowledge base for information extraction is studied. Technical components are extracted using non-deterministic finite state automata. Based on frame semantics, a patent verb semantic frame library is built for technical relation extraction. Further, the process of information extraction of technical solutions based on natural language understanding is put forward. Key techniques of shallow parsing and semantic parsing are also studied. The approach is illustrated on US patents. Results show that the method can automatically extract structure technical solution information from patents and assist the deeper application of patents in conceptual design.

Proceedings ArticleDOI
06 Nov 2009
TL;DR: The system streamlines and optimizes the processes of compiling, revising, editing, proofreading and typesetting, and distinguishes itself from other similar systems in that it pays more attention to learners, with much information derived from bilingual and learner's corpora by statistical techniques and shallow parsing.
Abstract: The paper reports our computer-assisted dictionary-making system for an English-Chinese learner's dictionary. The system aims to enhance language quality and format consistency in dictionary compilation, which is realized by a number of linguistic analysis modules and editorial assistant tools. The system embeds a) a concordancer of equivalent words in English-Chinese bilingual corpora based on probability coefficients, b) a collocation extraction tool over grammatical relations by shallow syntactic parsing, and c) a colligation finder based on part-of-speech tagged corpora. All these lead to better language quality of learner's dictionaries. In addition, a desktop publishing tool and a database human-machine interface are implemented to ensure format consistency in the entries produced by different lexicographers, thereby contributing to the quality of the dictionaries. The system streamlines and optimizes the processes of compiling, revising, editing, proofreading and typesetting. It distinguishes itself from other similar systems in that it pays more attention to learners, with much information derived from bilingual and learner's corpora by statistical techniques and shallow parsing.

01 Jan 2009
TL;DR: This paper presents the automatic detection of the type of discourse in French and Japanese documents, which requires a wide linguistic analysis, and creates a robust and linguistically motivated typology based on structural, modal and lexical levels.
Abstract: Our goal is to automate the compilation of smart specialized comparable corpora. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. We present in this paper the automatic detection of the type of discourse in French and Japanese documents, which needs a wide linguistic analysis. A contrastive analysis of the documents leads us to specify which information is relevant to distinguish them. Referring to classical studies on information retrieval, we create a robust and linguistically motivated typology based on three analysis levels: structural, modal and lexical. This typology is used to learn classification models using shallow parsing. We obtain good results, which demonstrates the efficiency of this typology.

Book ChapterDOI
25 Aug 2009
TL;DR: This paper introduces the strategy for adapting a rule based parser of written language to transcribed speech and gives a detailed analysis of the types of errors made by the parser while analyzing the corpus of disfluencies.
Abstract: This paper introduces our strategy for adapting a rule based parser of written language to transcribed speech. Special attention has been paid to disfluencies (repairs, repetitions and false starts). A Constraint Grammar based parser was used for shallow syntactic analysis of spoken Estonian. The modification of grammar and additional methods improved the recall from 97.5% to 97.6% and precision from 91.6% to 91.8%. Also, the paper gives a detailed analysis of the types of errors made by the parser while analyzing the corpus of disfluencies.

Book ChapterDOI
25 Aug 2009
TL;DR: This work presents an alternative approach to shallow parsing of noun phrases for Slavic languages which follows Abney's original principles, and shows that continuous phrase chunking as well as shallow constituency parsing display evident drawbacks when faced with freer word order languages.
Abstract: Shallow parsing has been proposed as a means of arriving at practically useful structures while avoiding the difficulties of full syntactic analysis. According to Abney's principles, it is preferable to leave an ambiguity pending than to make a likely wrong decision. We show that continuous phrase chunking as well as shallow constituency parsing display evident drawbacks when faced with freer word order languages. Those drawbacks may lead to unnecessary data loss as a result of decisions forced by the formalism and therefore diminish the practical value of shallow parsers for Slavic languages. We present an alternative approach to shallow parsing of noun phrases for Slavic languages which follows Abney's original principles. The proposed approach to parsing is decomposed into several stages, some of which allow for marking discontinuous phrases.

Journal Article
TL;DR: This paper presents a new algorithm for named entity recognition based on cascaded conditional random fields, and experimentally evaluates the algorithm on a large-scale corpus.
Abstract: Named entity recognition is one of the fundamental problems in many natural language processing applications, such as information extraction, information retrieval, machine translation, shallow parsing and question answering systems. This paper mainly researches the recognition of complex location and complex organization names in Chinese named entity recognition. We present a new algorithm for named entity recognition based on cascaded conditional random fields and experimentally evaluate it on a large-scale corpus. In the open test, the recall, precision and F-measure of the two recognition tasks reach 91.95%, 89.99%, 90.50% and 90.07%, 88.72%, 89.39% respectively.

Journal Article
TL;DR: This paper proposes a distributed strategy for Chinese text chunking on the basis of Conditional Random Fields and an error-driven technique, and describes a method to deal with conflicting chunks according to their F-measure values.
Abstract: This paper proposes a distributed strategy for Chinese text chunking on the basis of Conditional Random Fields (CRFs) and an error-driven technique. First, eleven types of Chinese chunks are divided into different groups to build CRF models respectively. Then, the error-driven technique is applied over the CRF chunking results for further modification. Finally, a method is described to deal with conflicting chunks according to the F-measure values. The experimental results show that this approach is effective, outperforming the single CRF-based approach, the distributed method and other hybrid approaches, reaching 94.90%, 91.00%, and 92.91% in recall, precision, and F-measure respectively in the open test.
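The conflict-resolution step described above can be pictured as follows: when chunks predicted by different per-group models overlap, the chunk from the model with the higher held-out F-measure wins. The chunk representation and the scores in the sketch are assumptions for illustration, not the paper's procedure.

```python
# Illustrative sketch: resolve overlapping chunk predictions from several
# models by preferring chunks from models with a higher F-measure.
def resolve_conflicts(chunks, model_f1):
    """chunks: list of (start, end, label, model_id) spans; keep a
    non-overlapping subset, ranked by each model's F-measure."""
    ranked = sorted(chunks, key=lambda c: model_f1[c[3]], reverse=True)
    kept = []
    for start, end, label, model in ranked:
        if all(end <= s or start >= e for s, e, _, _ in kept):
            kept.append((start, end, label, model))
    return sorted(kept)

chunks = [(0, 2, "NP", "model_np"), (1, 3, "VP", "model_vp"), (4, 5, "PP", "model_pp")]
model_f1 = {"model_np": 0.93, "model_vp": 0.90, "model_pp": 0.92}
print(resolve_conflicts(chunks, model_f1))
# [(0, 2, 'NP', 'model_np'), (4, 5, 'PP', 'model_pp')]
```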

Proceedings ArticleDOI
08 Dec 2009
TL;DR: This paper uses a statistical shallow parsing approach in the TAG formalism, named supertagging, which enriches the standard POS tags in order to exploit syntactic information about the sentence.
Abstract: Increasing the domain of locality by using tree-adjoining grammars (TAG) encourages some researchers to use it as a modeling formalism in their language applications. But parsing with a rich grammar like TAG faces two main obstacles: low parsing speed and many ambiguous syntactic parses. We use an idea from shallow parsing based on a statistical approach in the TAG formalism, named supertagging, which enriches the standard POS tags in order to exploit syntactic information about the sentence. In this paper, an error-driven method for reaching a full parse from partial parses based on the TAG formalism is presented. These partial parses basically result from the supertagger, which is followed by a simple heuristic-based light parser named the lightweight dependency analyzer (LDA). Like other error-driven methods, the process of generating the deep parses can be divided into two phases, error detection and error correction, where in each phase different completion heuristics are applied to the partial parses. The experiments on the Penn Treebank show considerable improvements in parsing time and the disambiguation process.

Proceedings ArticleDOI
25 Jul 2009
TL;DR: A multi-agent text chunking model is proposed that uses individual sensitive features of each phrase to identify different phrases; it is effective, as the F-score of English chunking with this multi-agent model reaches 95.70%, higher than the best previously reported result.
Abstract: The traditional English text chunking approach identifies phrases using only one model and the same features. It is shown that one model cannot account for each phrase's characteristics, and the same features are not suitable for all phrases. In this paper, a multi-agent text chunking model is proposed. This model uses individual sensitive features of each phrase to identify different phrases. Tests on the public training and test corpora show that this multi-agent model is effective: the F-score of English chunking with the multi-agent model reaches 95.70%, which is higher than the best result that has been reported.

Proceedings ArticleDOI
30 Nov 2009
TL;DR: A new approach to natural-language chunking using an evolutionary model that uses previously captured training information to guide the evolution of the model and a multi-objective optimization strategy is used to produce the best solutions based on the internal and the external quality of chunking.
Abstract: In this work, a new approach to natural-language chunking using an evolutionary model is proposed. This uses previously captured training information to guide the evolution of the model. In addition, a multi-objective optimization strategy is used to produce the best solutions based on the internal and the external quality of chunking. Experiments and the main results obtained using the model and state-of-the-art approaches are discussed.

01 Jan 2009
TL;DR: This paper provides the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi, shows that CRFs outperform SVMs and Maxent in terms of accuracy, and gives future researchers an insight into the comparative performance of major algorithms on datasets of various sizes and in various conditions.
Abstract: In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We present an analysis of the application of three major learning algorithms (viz. Maximum Entropy Models, Conditional Random Fields and Support Vector Machines) to part-of-speech tagging and chunking for the Hindi language using datasets of different sizes. The use of language independent features makes this analysis more general and capable of concluding important results for similar South and South East Asian languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We are able to achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using the Conditional Random Fields algorithm. The corpus we used had 138177 annotated instances for training. We report results for the three learning algorithms by varying various conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs etc.) and present an extensive analysis of the whole process. These results will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions. Both POS tagging and chunking are considered important preprocessing activities, helping in deep parsing of text and in developing information extraction and semantic processing systems; chunking divides sentences into non-recursive, inseparable phrases and can serve as the first step for full parsing.