
Showing papers on "Shallow parsing published in 2008"


Journal ArticleDOI
TL;DR: This work presents a stopping criterion for active learning based on the way instances are selected during uncertainty-based sampling and verifies its applicability in a variety of settings.

143 citations


Journal ArticleDOI
22 Feb 2008
TL;DR: A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper, and Named Entity Recognition systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus.
Abstract: The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.

73 citations
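A minimal sketch of what pattern-based shallow-parsing NER can look like, assuming POS-tagged input; the clue-word lists and tag names below are hypothetical placeholders, not the paper's Bengali resources:

```python
# Illustrative pattern-based NER over POS-tagged tokens. The honorific and
# suffix lists are invented placeholders for the paper's actual patterns.

PERSON_PREFIXES = {"sri", "srimati", "dr"}      # honorifics preceding person names
LOCATION_SUFFIXES = ("pur", "ganj", "nagar")    # common place-name endings

def tag_entities(tokens, pos_tags):
    """Assign a coarse NE label to each proper noun using simple patterns."""
    labels = ["O"] * len(tokens)
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        if pos != "NNP":                        # only consider proper nouns
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        if prev in PERSON_PREFIXES:
            labels[i] = "PERSON"
        elif tok.lower().endswith(LOCATION_SUFFIXES):
            labels[i] = "LOCATION"
        else:
            labels[i] = "MISC"
    return labels
```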


Proceedings ArticleDOI
18 Aug 2008
TL;DR: This paper proposes the Best Label Path (BLP) inference algorithm, which produces the most probable label sequence on latent conditional models and outperforms two existing inference algorithms.
Abstract: Shallow parsing is one of many NLP tasks that can be reduced to a sequence labeling problem. In this paper we show that latent dynamics (i.e., the hidden substructure of shallow phrases) constitute a problem in shallow parsing, and we show that modeling this intermediate structure is useful. By analyzing the automatically learned hidden states, we show how the latent conditional model explicitly learns latent dynamics. We propose the Best Label Path (BLP) inference algorithm, which is able to produce the most probable label sequence on latent conditional models, and which outperforms two existing inference algorithms. With BLP inference, the LDCRF model significantly outperforms CRF models on word features, and achieves performance comparable to the most successful shallow parsers on the CoNLL data when part-of-speech features are added.

63 citations
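To make the decoding problem concrete, here is a toy sketch of decoding a chain model with hidden states by summing (marginalizing) state probabilities per label and taking per-position argmaxes. This marginal decoding is only a stand-in illustration of inference over latent conditional models, not the paper's BLP algorithm, whose details differ:

```python
import numpy as np
from scipy.special import logsumexp

def marginal_label_decode(log_emit, log_trans, state2label, n_labels):
    """log_emit: (T, S) per-position hidden-state scores; log_trans: (S, S)
    transition scores; state2label: length-S int array mapping each hidden
    state to its label (disjoint state sets per label, as in an LDCRF)."""
    T, S = log_emit.shape
    fwd = np.zeros((T, S))
    bwd = np.zeros((T, S))
    fwd[0] = log_emit[0]
    for t in range(1, T):                     # forward pass over hidden states
        fwd[t] = log_emit[t] + logsumexp(fwd[t - 1][:, None] + log_trans, axis=0)
    for t in range(T - 2, -1, -1):            # backward pass
        bwd[t] = logsumexp(log_trans + (log_emit[t + 1] + bwd[t + 1])[None, :], axis=1)
    labels = []
    for t in range(T):
        gamma = fwd[t] + bwd[t]               # unnormalized log-marginals per state
        per_label = [logsumexp(gamma[state2label == y]) for y in range(n_labels)]
        labels.append(int(np.argmax(per_label)))
    return labels
```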


01 Jan 2008
TL;DR: This paper presents the construction of a hybrid, three-stage named entity recognizer for Tamil that performs an in-place tagging task for a given Tamil document in three phases, namely shallow parsing, shallow semantic parsing, and statistical processing.
Abstract: The aim of this paper is to present the construction of a hybrid, three-stage named entity recognizer for Tamil. Named entity recognition performs an in-place tagging task for a given Tamil document in three phases, namely shallow parsing, shallow semantic parsing, and statistical processing. The E-M algorithm (HMM) is used in the statistical processing phase, with initial probabilities obtained from the shallow parsing phase, and a modification to the E-M algorithm deals with inputs from the shallow semantic parsing phase. This study concentrates on entity names (personal names, location names, and organization names), temporal expressions (dates and times), and number expressions. Both NER tags and POS tags are used as the hidden variables in the E-M algorithm. The average F-value obtained from the system is 72.72% across the various entity types.

20 citations


Journal Article
TL;DR: Experimental results show that this approach can analyze a wide range of questions with high accuracy and produce reasonable textual responses, demonstrating the advantages of a novel Natural Language Interface comprising shallow-parsing-based algorithms in conjunction with intelligent techniques to train the system.
Abstract: This paper deals with a natural language interface, which accepts natural language questions as input and generates textual responses. In natural language processing, keyword-matching-based paradigms generate answers; however, these answers are frequently affected by certain language-dependent phenomena such as semantic symmetry and ambiguous modification. Available techniques described in the literature deal with these problems using in-depth parsing. In this paper, we present rules to tackle these linguistic phenomena using shallow parsing and discuss the advantages of a novel Natural Language Interface comprising shallow-parsing-based algorithms in conjunction with intelligent techniques to train the system. Experimental results show that this approach can analyze a wide range of questions with high accuracy and produce reasonable textual responses.

15 citations


Journal ArticleDOI
TL;DR: This article proposes the use of syntactic dependencies as complex index terms in an attempt to solve the problems deriving from both syntactic and morpho-syntactic variation and, in this way, to obtain more precise index terms.
Abstract: The performance of information retrieval systems is limited by the linguistic variation present in natural language texts. Word-level natural language processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on the extension of these techniques for dealing with phrase-level variation in European languages, taking Spanish as a case in point. We propose the use of syntactic dependencies as complex index terms in an attempt to solve the problems deriving from both syntactic and morpho-syntactic variation and, in this way, to obtain more precise index terms. Such dependencies are obtained through a shallow parser based on cascades of finite-state transducers, in order to reduce as far as possible the overhead due to this parsing process. The use of different sources of syntactic information, queries or documents, has also been studied, as has the restriction of the applied dependencies to those obtained from noun phrases. Our approaches have been tested using the CLEF corpus, obtaining consistent improvements with regard to classical word-level non-linguistic techniques. Results show, on the one hand, that syntactic information extracted from documents is more useful than that from queries. On the other hand, it has been demonstrated that by restricting dependencies to those corresponding to noun phrases, important reductions in storage and management costs can be achieved, albeit at the expense of a slight reduction in performance.

13 citations
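A minimal sketch of the noun-phrase-restricted idea: turning shallow-parsed noun phrases into head-modifier pairs used as complex index terms. The chunk representation and the rightmost-head heuristic are simplifying assumptions, not the authors' transducer cascades:

```python
# Turn shallow-parser NP chunks into (head, modifier) complex index terms.

def np_index_terms(chunks):
    """chunks: list of (chunk_type, [(token, pos), ...]) from a shallow parser.
    Emits (head_noun, modifier) pairs as complex index terms."""
    terms = []
    for chunk_type, words in chunks:
        if chunk_type != "NP":
            continue
        nouns = [w for w, p in words if p.startswith("N")]
        if not nouns:
            continue
        head = nouns[-1]                      # rightmost noun as head (heuristic)
        for w, p in words:
            if w != head and p.startswith(("ADJ", "N")):
                terms.append((head, w))       # e.g., ("retrieval", "information")
    return terms

print(np_index_terms([("NP", [("information", "N"), ("retrieval", "N")])]))
```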


Journal Article
TL;DR: Proposed shallow-parsing-based algorithms reduce the amount of syntactic processing required to deal with problems caused by semantic symmetry and ambiguous modification, and improve the precision of a Natural Language Interface.
Abstract: The performance of a Natural Language Interface often deteriorates due to the linguistic phenomena of semantic symmetry and ambiguous modification (Katz and Lin, 2003). In this paper we present algorithms to handle problems caused by semantic symmetry and ambiguous modification. Use of these algorithms has improved the precision of the Natural Language Interface. The proposed shallow-parsing-based algorithms reduce the amount of syntactic processing required to deal with these problems; they need only POS (Part of Speech) information, which is generated by shallow parsing of the corpus text. Results are compared with the results of a basic Natural Language Interface without such algorithms. Dealing with linguistic phenomena using shallow parsing is a novel approach, as we overcome the brittleness usually associated with in-depth parsing. We also present computational results with comparative charts based on the answers extracted for the same query posed to the two systems.

13 citations


Journal Article
TL;DR: This paper develops a rule-based shallow parser to chunk Persian sentences and a knowledge-based system to assign 16 selected thematic roles to the chunks, in order to extract semantic roles from Persian sentences.
Abstract: Extracting thematic (semantic) roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and the syntactic constituents in a sentence. In this paper we present a rule-based approach to extract semantic roles from Persian sentences. The system exploits a two-phase architecture to (1) identify the arguments and (2) label them for each predicate. For the first phase we developed a rule-based shallow parser to chunk Persian sentences, and for the second phase we developed a knowledge-based system to assign 16 selected thematic roles to the chunks. The experimental results of testing each phase are shown at the end of the paper. Keywords—Natural Language Processing, Semantic Role Labeling, Shallow Parsing, Thematic Roles.

13 citations
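A toy sketch of the two-phase idea, assuming chunks are already available from a shallow parser: roles are assigned by chunk type and position relative to the predicate. The rules and role names below are invented for illustration; the paper uses a knowledge-based system with 16 Persian-specific roles:

```python
# Rule-based role assignment over pre-chunked input. Persian is typically
# verb-final, so "before"-the-predicate rules carry most of the work; the
# role inventory here is a hypothetical stand-in.

ROLE_RULES = {
    ("NP", "before"): "AGENT",
    ("NP", "after"):  "THEME",
    ("PP", "after"):  "GOAL",
}

def label_roles(chunks, predicate_index):
    """chunks: list of (chunk_type, text); predicate_index: verb position."""
    roles = []
    for i, (ctype, text) in enumerate(chunks):
        if i == predicate_index:
            roles.append((text, "PREDICATE"))
            continue
        position = "before" if i < predicate_index else "after"
        roles.append((text, ROLE_RULES.get((ctype, position), "NONE")))
    return roles
```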


Proceedings Article
01 May 2008
TL;DR: Spejd is based on a fully uniform formalism for both constituency partial parsing and morphosyntactic disambiguation, and is more flexible than either the usual shallow parsing formalisms or the usual unification-based formalisms.
Abstract: The paper presents Spejd, an Open Source Shallow Parsing and Disambiguation Engine. Spejd is based on a fully uniform formalism for both constituency partial parsing and morphosyntactic disambiguation — the same grammar rule may contain structure-building operations, as well as morphosyntactic correction and disambiguation operations. The formalism and the engine are more flexible than either the usual shallow parsing formalisms, which assume disambiguated input, or the usual unification-based formalisms, which couple disambiguation (via unification) with structure building. Current applications of Spejd include rule-based disambiguation, detection of multiword expressions, valence acquisition, and sentiment analysis. The functionality can be further extended by adding external lexical resources. While the examples are based on the set of rules prepared for the parsing of the IPI PAN Corpus of Polish, Spejd is fully language-independent and we hope it will also be useful in the processing of other languages.

10 citations


01 Jan 2008
TL;DR: At the core of the system is a language model based on lemma bigrams and part-of-speech tags, as well as an entropy computation over sentences to retrieve the best compressed sentences.
Abstract: Sentence compression is a necessary component of abstract generation. Previous studies focused mainly on syntactic tree representations of the sentence. Our approach is a statistical one that does not use syntactic trees, which can be inaccurate in sentence analysis. At the core of our system is a language model based on lemma bigrams and part-of-speech tags (only shallow parsing is performed), as well as an entropy computation over sentences to retrieve the best compressed sentences. We also introduce a perceptron, which is used to classify compressed and non-compressed sentences and to indicate whether or not a sentence should be compressed.

8 citations
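A minimal sketch of the scoring idea, assuming a lemma-bigram table estimated elsewhere: candidate compressions are ranked by a per-word (entropy-like) negative log-probability so that shorter candidates are not trivially favored. The table and back-off value are placeholders:

```python
# Rank candidate compressions with a lemma-bigram language model.

def candidate_score(lemmas, bigram_logprob, unk=-12.0):
    """Mean negative log-probability (entropy-like) of a lemma sequence."""
    pairs = zip(["<s>"] + lemmas, lemmas + ["</s>"])
    nll = -sum(bigram_logprob.get(p, unk) for p in pairs)
    return nll / (len(lemmas) + 1)            # normalize by bigram count

def best_compression(candidates, bigram_logprob):
    """Return the candidate compression with the lowest per-word entropy."""
    return min(candidates, key=lambda c: candidate_score(c, bigram_logprob))
```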


Posted Content
TL;DR: The design and implementation of the Prolog interface to the Unstructured Information Management Architecture (UIMA) and some of its applications in natural language processing are described.
Abstract: In this paper we describe the design and implementation of the Prolog interface to the Unstructured Information Management Architecture (UIMA) and some of its applications in natural language processing. The UIMA Prolog interface translates unstructured data and the UIMA Common Analysis Structure (CAS) into a Prolog knowledge base, over which developers write rules and use resolution theorem proving to search and generate new annotations over the unstructured data. These rules can explore all the previous UIMA annotations (such as the syntactic structure and parsing statistics) and external Prolog knowledge bases (such as Prolog WordNet and Extended WordNet) to implement a variety of natural language analysis tasks. We also describe applications of this logic programming interface in question analysis (such as focus detection and the detection of answer types and other constraints), shallow parsing (such as relations in the syntactic structure), and answer selection.

01 Jan 2008
TL;DR: Results are presented of a comparison of a pure "Bag of Words" approach against a mixed method extended by detecting opinion patterns using shallow-parsing techniques.
Abstract: Automated sentiment polarity prediction from text is a challenging problem addressed in this paper. We present the results of a comparison of a pure "Bag of Words" approach against a mixed method extended by detecting opinion patterns using shallow-parsing techniques. We utilize two resources for the analysis: the Spejd shallow parsing engine and the Zetema dictionary of sentiment in Polish. The performance of both approaches has been evaluated on an online product review database.

Proceedings ArticleDOI
18 Jun 2008
TL;DR: The results show that although the method does not apply any syntactic rules, the BPS algorithm, which combines the MM and SM algorithms, exploits the strong points of both and obtains favorable performance.
Abstract: Shallow parsing is a very important task in natural language processing and text mining, and the partial syntactic information it produces can help to solve many other natural language processing tasks. In this paper, we split the task of shallow parsing into two subtasks: (1) seeking all the break points that divide a part-of-speech (POS) sequence into groups; (2) tagging a phrase type for each POS group. For the first subtask, we present the break point seeking (BPS) algorithm, a combination of a scoring model (SM) and the maximum matching (MM) method. We then use a Bayes classifier to tag the phrase structure type for each POS group. The results show that although our method does not apply any syntactic rules, the BPS algorithm, which combines the MM and SM algorithms, exploits the strong points of both and obtains favorable performance.
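A toy sketch of the two-subtask split described above, assuming a learned break-score table and per-tag phrase-type likelihoods (both placeholders for the paper's SM/MM combination and Bayes classifier):

```python
# Subtask 1: choose break points in a POS sequence; subtask 2: tag each group.

def find_breaks(pos_seq, break_score, threshold=0.5):
    """Return indices i such that a group boundary falls between i-1 and i."""
    return [i for i in range(1, len(pos_seq))
            if break_score.get((pos_seq[i - 1], pos_seq[i]), 0.0) > threshold]

def split_groups(pos_seq, breaks):
    bounds = [0] + breaks + [len(pos_seq)]
    return [pos_seq[a:b] for a, b in zip(bounds, bounds[1:])]

def tag_group(group, type_logprob, types=("NP", "VP", "PP")):
    """Pick the phrase type maximizing the product of per-tag likelihoods."""
    return max(types, key=lambda t:
               sum(type_logprob.get((t, pos), -10.0) for pos in group))
```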

Proceedings ArticleDOI
20 Jun 2008
TL;DR: This paper proposes a new method to detect and resolve zero pronouns in Chinese text, integrating automatic main-verb identification, verbal logic valence, and a machine learning approach, and demonstrates that this method of identifying and resolving zero pronouns works effectively.
Abstract: This paper proposes a new method to detect and resolve zero pronouns in Chinese text, integrating automatic main-verb identification, verbal logic valence, and a machine learning approach. Zero pronoun recognition is treated as the problem of finding missing logical arguments of verbs. First, based on automatic main-verb identification, syntax hierarchies are analysed. Second, combining the syntax hierarchy with verbal logic valence theory, zero pronouns are identified. Then, using a machine learning approach, zero pronouns are resolved. Experimental results on 150 news articles indicate that the precision and recall of zero pronoun detection are 72.9% and 92.7%, respectively, and that the accuracy of antecedent estimation is 64.3%. These results demonstrate that this method of identifying and resolving zero pronouns works effectively.
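A minimal sketch of the detection idea only (finding missing verb arguments against a valence lexicon); the lexicon and the clause representation are hypothetical simplifications of the paper's Chinese-specific pipeline:

```python
# Flag clauses whose realized arguments fall short of the verb's valence.

VALENCE = {"eat": 2, "give": 3, "sleep": 1}   # hypothetical verb valences

def detect_zero_pronouns(clauses):
    """clauses: list of (verb_lemma, n_realized_args). Returns flagged clauses."""
    flagged = []
    for verb, n_args in clauses:
        expected = VALENCE.get(verb)
        if expected is not None and n_args < expected:
            flagged.append((verb, expected - n_args))  # number of missing args
    return flagged
```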

Proceedings Article
01 Jan 2008
TL;DR: A shallow parsing formalism aimed at machine translation between closely related languages, allowing grammar rules that help to (partially) disambiguate chunks in input sentences.
Abstract: This paper describes a shallow parsing formalism aimed at machine translation between closely related languages. The formalism allows writing grammar rules that help to (partially) disambiguate chunks in input sentences. The chunks are then translated into the target language without any deep syntactic or semantic processing. A stochastic ranker then selects the best translation according to the target language model. The results obtained for Czech and Slovak are presented.

Journal IssueDOI
TL;DR: In the proposed approach, shallow parsing techniques such as part-of-speech tagging and noun phrase chunking are used to parse both questions and Automated Speech Recognition (ASR) transcripts, and a sliding-window algorithm is proposed to identify the start and end boundaries of returned segments.
Abstract: Recently, lecture videos have been widely used in e-learning systems. Envisioning intelligent e-learning systems, this article addresses the challenge of information seeking in lecture videos by retrieving relevant video segments based on user queries, through dynamic segmentation of lecture speech text. In the proposed approach, shallow parsing techniques such as part-of-speech tagging and noun phrase chunking are used to parse both questions and Automated Speech Recognition (ASR) transcripts. A sliding-window algorithm is proposed to identify the start and end boundaries of returned segments. Phonetic and partial matching is utilized to correct errors from automated speech recognition and noun phrase chunking. Furthermore, extra knowledge such as lecture slides is used to facilitate ASR transcript error correction. The approach also makes use of proximity to approximate the deep parsing and structure matching between questions and sentences in ASR transcripts. The experimental results showed that both phonetic and partial matching improved segmentation performance, slides-based ASR transcript correction improved information coverage, and proximity was also effective in improving overall performance. © 2008 Wiley Periodicals, Inc.
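A minimal sketch of a sliding-window boundary finder of this flavor, assuming sentence-segmented ASR text and query noun phrases; plain substring overlap stands in for the paper's phonetic and partial matching:

```python
# Score each window of transcript sentences by overlap with query noun
# phrases and return the best-scoring window as the segment boundaries.

def best_segment(sentences, query_nps, window=5):
    """sentences: list of ASR sentence strings; query_nps: set of noun phrases."""
    def score(span):
        text = " ".join(span).lower()
        return sum(1 for np_ in query_nps if np_.lower() in text)
    best, best_score = (0, min(window, len(sentences))), -1
    for start in range(len(sentences)):
        end = min(start + window, len(sentences))
        s = score(sentences[start:end])
        if s > best_score:
            best, best_score = (start, end), s
    return best  # (start, end) boundaries of the returned segment
```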

Proceedings ArticleDOI
26 Nov 2008
TL;DR: This paper defines representations of Chinese chunks and entity relations and obtains an optimized CRF model that can label chunks and entity relations, thereby completing chunk parsing and relation extraction.
Abstract: The conditional random fields (CRFs) model is a valid probabilistic model for segmenting and labeling sequence data. Compared with other statistical models, such as HMMs and MEMMs, CRFs process a data sequence in terms of its context. Chunk analysis is a shallow parsing method that simplifies natural language processing, and entity relation extraction is used to establish relationships between entities. Because full syntactic parsing of Chinese text is complex, chunk analysis and relation extraction are important for Chinese text understanding. This paper models both problems for Chinese text: by transforming them into a labeling problem, we can use CRFs to realize chunk analysis and entity relation extraction. In the paper we define representations of Chinese chunks and entity relations, and we discuss the feature window of the labeled word. Through training we obtain an optimized CRF model that can label chunks and entity relations, thereby completing chunk parsing and relation extraction.
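A minimal sketch of chunking-as-labeling with context-window features and B/I/O chunk labels; sklearn-crfsuite is used here purely as a stand-in toolkit, and the features, tags, and toy data are illustrative, not the paper's:

```python
import sklearn_crfsuite  # stand-in CRF toolkit, not the one used in the paper

# Each word is described by a small context-window feature dict; chunks are
# encoded as B/I/O labels so chunking becomes sequence labeling.

def word_features(sent, i):
    word, pos = sent[i]
    feats = {"word": word.lower(), "pos": pos}
    if i > 0:
        feats["-1:pos"] = sent[i - 1][1]      # left context in the feature window
    if i < len(sent) - 1:
        feats["+1:pos"] = sent[i + 1][1]      # right context
    return feats

def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

train = [[("Beijing", "NR"), ("is", "VC"), ("a", "DT"), ("city", "NN")]]
labels = [["B-NP", "B-VP", "B-NP", "I-NP"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([sent2features(s) for s in train], labels)
print(crf.predict([sent2features(train[0])]))
```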

Proceedings ArticleDOI
23 Jul 2008
TL;DR: The new method yields good efficiency and effectiveness without conducting a complex, deep syntactic analysis of Chinese sentences, and can be applied to an EBMT system for better performance in Chinese-to-English translation.
Abstract: Example-based machine translation (EBMT) is an important branch of machine translation. Sentence similarity measurement is certainly one of the most significant problems addressed in EBMT. For EBMT from Chinese to English, the performance of the similarity measure for Chinese sentences greatly affects the final translation result of an input Chinese sentence. In this paper, we present an approach to Chinese sentence similarity measurement (CSSM) that combines word sequence and sentence structure information. In our experiments, the new method yields good efficiency and effectiveness without conducting a complex, deep syntactic analysis of Chinese sentences, and it can be applied to an EBMT system for better performance in Chinese-to-English translation.
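A minimal sketch of one way to combine word-sequence similarity (here via longest common subsequence) with structure similarity over chunk-tag sequences; the 0.7/0.3 weighting is an arbitrary illustration, not the paper's setting:

```python
# Combine LCS-based word-sequence similarity with chunk-sequence similarity.

def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1]

def seq_sim(a, b):
    return 2 * lcs_len(a, b) / (len(a) + len(b)) if a or b else 1.0

def sentence_similarity(words1, chunks1, words2, chunks2, w=0.7):
    """words*: token lists; chunks*: chunk-tag sequences like ['NP','VP','NP']."""
    return w * seq_sim(words1, words2) + (1 - w) * seq_sim(chunks1, chunks2)
```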

Proceedings Article
13 Jul 2008
TL;DR: A learned classifier is presented that can accurately identify reduced passive voice constructions in shallow parsing environments; mislabeling such constructions directly impacts thematic role recognition and the NLP applications that depend on it.
Abstract: Our research is motivated by the observation that NLP systems frequently mislabel passive voice verb phrases as being in the active voice when there is no auxiliary verb (e.g., "The man arrested had a long record"). These errors directly impact thematic role recognition and NLP applications that depend on it. We present a learned classifier that can accurately identify reduced passive voice constructions in shallow parsing environments.
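A sketch of the kind of shallow features such a classifier might consume for a verb like "arrested" in "The man arrested had a long record"; the feature names and the auxiliary list are illustrative assumptions, not the paper's feature set:

```python
# Shallow features for deciding whether a verb is a reduced passive.

AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being"}

def reduced_passive_features(tokens, pos_tags, i):
    """Features for the verb at position i in a POS-tagged sentence."""
    left = [t.lower() for t in tokens[max(0, i - 3):i]]
    return {
        "verb_tag": pos_tags[i],                       # VBN vs. ambiguous VBD
        "aux_to_left": any(t in AUXILIARIES for t in left),
        "followed_by_np": i + 1 < len(pos_tags) and pos_tags[i + 1].startswith("N"),
        "preceded_by_noun": i > 0 and pos_tags[i - 1].startswith("N"),
    }
```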

Proceedings Article
01 Jan 2008
TL;DR: An architecture is proposed, called the UCSG Shallow Parsing Architecture, for building wide-coverage shallow parsers using a judicious combination of linguistic and statistical techniques, without the need for a large parsed training corpus and without compromising the ability to produce all possible parses in principle.
Abstract: In this paper, we propose an architecture, called the UCSG Shallow Parsing Architecture, for building wide-coverage shallow parsers using a judicious combination of linguistic and statistical techniques, without the need for a large parsed training corpus to start with. We only need a large POS-tagged corpus. A parsed corpus can be developed using the architecture with minimal manual effort, and such a corpus can be used for evaluation as well as for performance improvement. The UCSG architecture is designed to be extended into a full parsing system, but the current work is limited to chunking and obtaining appropriate chunk sequences for a given sentence. In the UCSG architecture, a Finite State Grammar is designed to accept all possible chunks, referred to as word groups here. A separate statistical component, encoded in HMMs (Hidden Markov Models), is used to rate and rank the word groups so produced. Note that we are not pruning; we are only rating and ranking the word groups already obtained. We then use a Best First Search strategy to produce parse outputs in best-first order, without compromising the ability to produce all possible parses in principle. We propose a bootstrapping strategy for improving the HMM parameters and hence the performance of the parser as a whole. A wide-coverage shallow parser has been implemented for English starting from the British National Corpus, a nearly 100 million word POS-tagged corpus. Note that this corpus is not a parsed corpus; also, there are tagging errors, multiple tags are assigned in many cases, and some words have not been tagged. A dictionary of 138,000 words with frequency counts for each word in each tag has been built. Extensive experiments have been carried out to evaluate the performance of the various modules. We work with large data sets and the performance obtained is encouraging. A manually checked parsed corpus of 4,000 sentences has also been developed and used to improve parsing performance further. The entire system has been implemented in Perl under Linux.
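A toy sketch of the best-first step, assuming each sentence position offers pre-rated candidate word groups (the scores standing in for the HMM ratings); complete chunk sequences then pop out in best-first order:

```python
import heapq

# Partial chunk sequences are expanded in order of accumulated score, so the
# highest-scoring complete sequences are produced first, without pruning.

def best_first_parse(groups_from, sentence_len, top_k=3):
    """groups_from[i]: list of (end, label, log_score) chunks starting at i."""
    heap = [(0.0, 0, [])]                     # (neg. score, position, sequence)
    results = []
    while heap and len(results) < top_k:
        neg, pos, seq = heapq.heappop(heap)
        if pos == sentence_len:
            results.append((-neg, seq))       # complete chunk sequence
            continue
        for end, label, log_score in groups_from.get(pos, []):
            heapq.heappush(heap, (neg - log_score, end, seq + [label]))
    return results
```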

Posted Content
TL;DR: This study aims to evaluate part-of-speech (POS) tagging accuracy and explores whether comparable performance is obtained when a generic POS tagger, MontyTagger, is used in place of MedPost, a tagger trained on biomedical text.
Abstract: A recent study reported the development of Muscorian, a generic text processing tool for extracting protein-protein interactions from text, which achieved performance comparable to biomedical-specific text processing tools. This result was unexpected, since potential errors from a series of text analysis processes are likely to adversely affect the outcome of the entire process. Most biomedical entity relationship extraction tools have used biomedical-specific part-of-speech (POS) taggers, since errors in POS tagging are likely to affect subsequent semantic analysis of the text, such as shallow parsing. This study aims to evaluate POS tagging accuracy and explores whether comparable performance is obtained when a generic POS tagger, MontyTagger, is used in place of MedPost, a tagger trained on biomedical text. Our results demonstrate that MontyTagger, Muscorian's POS tagger, has a POS tagging accuracy of 83.1% when tested on biomedical text. Replacing MontyTagger with MedPost did not result in a significant improvement in entity relationship extraction from text: precision was 55.6% with MontyTagger versus 56.8% with MedPost on directional relationships, and 86.1% with MontyTagger compared to 81.8% with MedPost on non-directional relationships. This is unexpected, as poor POS tagging by MontyTagger would be expected to affect the outcome of the information extraction. An analysis of POS tagging errors demonstrated that 78.5% of tagging errors are compensated for by shallow parsing. Thus, despite 83.1% tagging accuracy, MontyTagger has a functional tagging accuracy of 94.6%.

Proceedings ArticleDOI
Qiang Zhou, Hang Yu
01 Oct 2008
TL;DR: A new relation tagging scheme is designed to represent different intra-chunk relations, and several feature engineering experiments are conducted to select the best baseline statistical model and to improve parsing performance.
Abstract: Multiword chunking is designed as a shallow parsing technique to recognize the external constituent and internal relation tags of a chunk in a sentence. In this paper, we propose a new solution to this problem. We design a new relation tagging scheme to represent different intra-chunk relations and conduct several feature engineering experiments to select the best baseline statistical model. We also apply outside knowledge from a large-scale lexical relationship knowledge base to improve parsing performance. By integrating all of the above techniques, we develop a new Chinese MWC parser. Experimental results show that its parsing performance greatly exceeds that of a rule-based parser trained and tested on the same data set.

Journal Article
TL;DR: The practical goal of this work is to enrich the information of the shallow parser with linguistic information for analyzing sequences containing an N that instantiates a kind of quantification of the other nominal constituent, by means of several different syntactic structures.
Abstract: This paper reports on work in progress to improve shallow parsing for Basque. The practical goal of our work is to enrich the information of the shallow parser with linguistic information for analyzing sequences containing an N that instantiates a kind of quantification of the other nominal constituent, by means of several different syntactic structures.
