
Showing papers on "Shallow parsing" published in 2013


Book ChapterDOI
01 Jan 2013
TL;DR: This chapter explores an alternative approach to event extraction based on BBN SERIF™ and BBN OnTopic™, two state-of-the-art statistical natural language processing engines, and empirically compares its effectiveness against existing techniques on five dimensions.
Abstract: Automated analysis of news reports is a significant empowering technology for predictive models of political instability. To date, the standard approach to this analytic task has been embodied in systems such as KEDS/TABARI [1], which use manually-generated rules and shallow parsing techniques to identify events and their participants in text. In this chapter we explore an alternative to event extraction based on BBN SERIF™ and BBN OnTopic™, two state-of-the-art statistical natural language processing engines. We empirically compare this new approach to existing event extraction techniques on five dimensions: (1) Accuracy: when an event is reported by the system, how often is it correct? (2) Coverage: how many events are correctly reported by the system? (3) Filtering of historical events: how well are historical events (e.g. 9/11) correctly filtered out of the current event data stream? (4) Topic-based event filtering: how well do systems filter out red herrings based on document topic, such as sports documents mentioning “clashes” between two countries on the playing field? (5) Domain shift: how well do event extraction models perform on data originating from diverse sources? In all dimensions we show significant improvement to the state-of-the-art by applying statistical natural language processing techniques. It is our hope that these results will lead to greater acceptance of automated coding by creators and consumers of social science models that depend on event data and provide a new way to improve the accuracy of those predictive models.

50 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This paper presents an integrated feature extraction framework for Natural Language Processing that removes wasteful redundancy and helps in rapid prototyping.
Abstract: Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Various NLP tasks share many common steps, e.g. the low-level act of reading a corpus and obtaining text windows from it. Some high-level processing steps might also be shared, e.g. testing for morpho-syntactic constraints between words. An integrated feature extraction framework removes wasteful redundancy and helps in rapid prototyping.
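
As a concrete illustration of the shared low-level step the abstract mentions, the sketch below implements one corpus reader that yields token windows for any downstream feature extractor to consume. The function names and the window size are illustrative, not the framework's actual API.

    # Minimal sketch: one shared windowing step feeding arbitrary
    # feature extractors. Names and window size are invented.
    def token_windows(tokens, size=2):
        """Yield (left context, focus token, right context) triples."""
        for i, tok in enumerate(tokens):
            yield tokens[max(0, i - size):i], tok, tokens[i + 1:i + 1 + size]

    def context_features(window):
        left, focus, right = window
        return {"focus": focus, "left": tuple(left), "right": tuple(right)}

    tokens = "the quick brown fox jumps".split()
    print([context_features(w) for w in token_windows(tokens)][2])
    # features for the window around 'brown'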

20 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: A new shallow parsing mechanism driven by handcrafted rules is implemented for recognizing threats in Dutch tweets, and the error analysis shows some clear avenues for further improvement.
Abstract: In this paper, we investigate the recognition of threats in Dutch tweets. As tweets often display irregular grammatical form and deviant orthography, analysis by standard means is problematic. Therefore, we have implemented a new shallow parsing mechanism which is driven by handcrafted rules. Experimental results are encouraging, with an F-measure of about 40% on a random sample of Dutch tweets. Moreover, the error analysis shows some clear avenues for further improvement.
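
The paper's rule set is not reproduced in the abstract, but a handcrafted rule of the kind such a system might use is sketched below: a Dutch first-person future construction aimed at a second person, checked against a small list of violent verbs. Both the pattern and the verb list are invented for illustration.

    import re

    # Illustrative handcrafted threat rule: "ik ga je <verb>"
    # ("I am going to <verb> you"). Pattern and verb list are invented.
    THREAT_VERBS = {"doodmaken", "neersteken", "afmaken"}
    PATTERN = re.compile(r"\bik ga (je|jou|jullie) (\w+)", re.IGNORECASE)

    def is_threat(tweet):
        m = PATTERN.search(tweet)
        return bool(m) and m.group(2).lower() in THREAT_VERBS

    print(is_threat("ik ga je doodmaken"))   # True
    print(is_threat("ik ga je helpen"))      # False: "helpen" (help) is harmless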

11 citations


Patent
11 Nov 2013
TL;DR: Simple grammatical errors and errors in sentence structure are detected by generating a string of parts of speech for an input sentence using n-grams and parsing the generated string on the basis of rules (shallow parsing) defined according to the connective relationships between adjacent parts of speech; corrected drafts are then proposed for the detected errors to increase the accuracy of sentence evaluation.
Abstract: An automatic sentence evaluating device using a shallow parser. Simple grammatical errors and errors in sentence structure are detected by generating a string of parts of speech using n-grams for a composed input sentence and parsing the generated string of parts of speech on the basis of rules (shallow parsing) defined according to the connective relationships between adjacent parts of speech; a corrected draft is proposed for the detected errors to thereby increase the accuracy of sentence evaluation. An error detection apparatus and a method for the same are also claimed.
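
The mechanism lends itself to a compact sketch: tag the sentence, then scan adjacent POS pairs against a rule table. The sketch below uses NLTK's off-the-shelf tagger and a toy rule table; it illustrates the adjacency-rule idea, not the patent's actual rules.

    import nltk  # assumes nltk data: 'punkt' and 'averaged_perceptron_tagger'

    # Toy rule table: adjacent POS pairs flagged as likely errors.
    FORBIDDEN = {("DT", "DT"), ("MD", "MD"), ("PRP", "PRP")}

    def check_sentence(sentence):
        tokens = nltk.word_tokenize(sentence)
        tags = [t for _, t in nltk.pos_tag(tokens)]  # the "string of parts of speech"
        return [(i, tokens[i], tokens[i + 1])
                for i in range(len(tags) - 1)
                if (tags[i], tags[i + 1]) in FORBIDDEN]

    print(check_sentence("The the cat sat on the mat."))  # flags the doubled article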

9 citations


Proceedings Article
01 Sep 2013
TL;DR: This paper selects an intersection set of Wall Street Journal documents included both in the Penn Discourse Tree Bank (PDTB) and in the Multi-Perspective Question Answering (MPQA) corpus in order to explore the usefulness of discourse-level structure for facilitating the extraction of fine-grained opinion expressions.
Abstract: Opinion analysis deals with public opinions and trends, but subjective language is highly ambiguous. In this paper, we follow a simple data-driven technique to learn fine-grained opinions. We select an intersection set of Wall Street Journal documents that is included both in the Penn Discourse Tree Bank (PDTB) and in the Multi-Perspective Question Answering (MPQA) corpus. This is done in order to explore the usefulness of discourse-level structure in facilitating the extraction of fine-grained opinion expressions. We perform shallow parsing of MPQA expressions first with connective-based discourse structure, and then also with Named Entities (NE) and some syntax features, using conditional random fields; the latter feature set is essentially a collection of NEs and a bundle of features that has proved useful in a shallow discourse parsing task. We found that both feature sets improve our baseline at different levels of this fine-grained opinion expression mining task.
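
For readers unfamiliar with the setup, the sketch below shows BIO-style tagging of opinion expressions with a CRF over token features of the kind the paper lists (word, POS, NE tag, a discourse-connective flag). It uses the sklearn-crfsuite package; the features and the one-sentence training set are invented for illustration, not the paper's configuration.

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def token_features(sent, i):
        word, pos, ne, conn = sent[i]  # token, POS, NE tag, connective flag
        feats = {"word": word.lower(), "pos": pos, "ne": ne, "connective": conn}
        if i > 0:
            feats["prev_pos"] = sent[i - 1][1]
        return feats

    # Invented training data: tokens are (word, POS, NE, is_connective),
    # labels are BIO tags over opinion expressions.
    X_train = [[("critics", "NNS", "O", False), ("denounced", "VBD", "O", False),
                ("the", "DT", "O", False), ("plan", "NN", "O", False)]]
    y_train = [["O", "B-DSE", "O", "O"]]  # "denounced" is a subjective expression

    X = [[token_features(s, i) for i in range(len(s))] for s in X_train]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y_train)
    print(crf.predict(X))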

6 citations


Proceedings Article
07 Dec 2013
TL;DR: The transducer described here performs pattern-based matching of POS tags using regular expressions that take advantage of the characteristics of German grammar, finding linguistically relevant phrases with good precision.
Abstract: Non-finite-state parsers provide fine-grained information. However, they are computationally demanding. Therefore, it is interesting to see how far a shallow parsing approach is able to go. The transducer described here performs pattern-based matching over POS tags using regular expressions that take advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn enables an estimation of the actual valency of a given verb. The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency. This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher. Possible applications include simulation of text comprehension on the syntactic level, creation of selective benchmarks, and failure analysis.
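
The core idea, regular expressions over a POS-tag string read in a single pass, can be sketched briefly. The pattern below targets a German noun phrase in STTS tags (optional article, adjectives, noun); it illustrates the approach, not the paper's transducer.

    import re

    # One-pass NP chunking over an STTS-tagged sentence; pattern is illustrative.
    NP = re.compile(r"\b(ART\s)?(ADJA\s)*(NN|NE)\b")

    def chunk(tagged):  # tagged: list of (word, STTS tag) pairs
        tagstring = " ".join(tag for _, tag in tagged)
        chunks = []
        for m in NP.finditer(tagstring):
            start = tagstring[:m.start()].count(" ")   # token index via spaces
            width = m.group(0).strip().count(" ") + 1
            chunks.append([w for w, _ in tagged[start:start + width]])
        return chunks

    sent = [("Der", "ART"), ("kleine", "ADJA"), ("Hund", "NN"), ("schläft", "VVFIN")]
    print(chunk(sent))   # [['Der', 'kleine', 'Hund']]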

5 citations


01 Jan 2013
TL;DR: A machine learning approach to automatically extract concepts and conceptual relations towards the creation of Conceptual Graphs (CGs) from patent documents, using a shallow parser, NER, and machine learning techniques.
Abstract: This paper presents a machine learning approach to automatically extract concepts and conceptual relations towards the creation of Conceptual Graphs (CGs) from patent documents using a shallow parser and NER. The main challenge in the creation of conceptual graphs from natural language texts is the automatic identification of concepts and conceptual relations. The texts analyzed in this work are patent documents, focused mainly on the claims section of the documents. The task of automatically identifying concepts and conceptual relations is difficult due to the complexities in the writing style of these documents, which are technical as well as legal. Our analysis shows that the general in-depth parsers available in the open domain fail to parse the claims-section sentences in patent documents. The failure of in-depth parsers led us to develop a methodology to extract CGs using other resources. Thus, in the present work, we devised a methodology that uses shallow parsing, NER, and machine learning techniques to extract concepts and conceptual relationships from sentences in the claim/novelty section of patent documents. The results obtained from our experiments are encouraging and are discussed in detail in this paper. We obtained a precision of 73.2% and a recall of 68.3%.

4 citations


01 Jan 2013
TL;DR: Recent improvements to the system, as well as other enhancements made with the aim of helping Ainu language researchers, are described, including enhancement of the POS tagger with analysis of morphological information.
Abstract: This paper describes our research on computer processing of the Ainu language with the use of various NLP techniques. Ainu is an endangered language close to extinction. At present, linguists and anthropologists are making a great effort to preserve the language by analyzing and understanding it. However, most of the work in this matter is done manually, which makes it an uphill task. Previously we presented POST-AL, a part-of-speech tagger for the Ainu language. This paper describes recent improvements to the system as well as other enhancements made with the aim of helping Ainu language researchers. In particular, we have enhanced the POS tagger with analysis of morphological information. We have also added a translation support tool for Ainu language translators and made a first step toward deeper syntactic analysis of the Ainu language by creating a simple shallow parser.

3 citations


Journal Article
TL;DR: This paper surveys the rich research on chunking in several aspects: the definition and classification of chunks, chunk identification, chunk annotation and evaluation, and the internal relationships within chunks.
Abstract: Chunking, as a typical form of shallow parsing, serves many language information processing systems that demand syntactic information, and acts as a bridge between lexical analysis, syntactic parsing, and semantic parsing. This paper surveys the rich research on chunking in several aspects: the definition and classification of chunks, chunk identification, chunk annotation and evaluation, and the internal relationships within chunks. Finally, this paper draws conclusions and discusses future work.

3 citations



Journal ArticleDOI
TL;DR: Different strategies are presented to improve a super-chunker based on Conditional Random Fields by combining it with a finite-state symbolic super-chunker driven by lexical and grammatical resources.
Abstract: In this paper, we focus on chunking that includes contiguous multiword expression recognition, namely super-chunking. In particular, we present different strategies to improve a super-chunker based on Conditional Random Fields by combining it with a finite-state symbolic super-chunker driven by lexical and grammatical resources. We report a substantial gain of 7.6 points in terms of overall accuracy.
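
The abstract does not spell the combination strategies out, but one plausible scheme is sketched below: trust the symbolic super-chunker wherever its lexical resources fired (it is the component that knows multiword expressions), and keep CRF chunks elsewhere. This is an invented illustration, not one of the paper's evaluated strategies.

    # Chunks are (start, end, label) token spans; symbolic spans win, CRF spans
    # survive only if they do not overlap one. Purely illustrative.
    def merge(crf_chunks, symbolic_chunks):
        def overlaps(a, b):
            return a[0] < b[1] and b[0] < a[1]
        kept = [c for c in crf_chunks
                if not any(overlaps(c, s) for s in symbolic_chunks)]
        return sorted(kept + list(symbolic_chunks))

    crf = [(0, 2, "NP"), (2, 3, "VP"), (3, 6, "NP")]
    symbolic = [(3, 6, "MWE-NP")]   # the lexicon recognised a multiword expression
    print(merge(crf, symbolic))     # [(0, 2, 'NP'), (2, 3, 'VP'), (3, 6, 'MWE-NP')]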

Dissertation
01 Jan 2013
TL;DR: This dissertation presents a grammatically motivated sentiment classification framework to handle the distinctive features of the Urdu language, using a sentiment-annotated, lexicon-based approach.
Abstract: The rise of social networking sites and blogs has stimulated a bull market in personal opinion: consumer recommendations, product reviews, ratings, and other types of online expression. For computational linguistics researchers, this fast-growing heap of information has opened an exciting research frontier, referred to as Sentiment Analysis (SA). For English, this area has been under investigation for the last decade. But other major languages, like Urdu, have been largely overlooked by the research community. Urdu is a morphologically rich and resource-poor language. Its distinctive features, such as complex morphology, flexible grammar rules, context-sensitive orthography, and free word order, make Urdu language processing a challenging problem domain. For the same reasons, sentiment analysis approaches and techniques developed for other well-explored languages are not workable for Urdu text. This dissertation presents a grammatically motivated sentiment classification framework to handle these distinctive features of the Urdu language. The main research contributions are: to highlight the linguistic (orthography, grammar, morphology, etc.) as well as technical (parsing algorithm, lexicon, corpus, etc.) aspects of this multidimensional research problem; to explore Urdu morphological operations, grammar, and orthographic rules; and to redefine these operations and rules with respect to the requirements of a sentiment analysis framework. The orthographic, morphological, grammatical, and finally the conceptual details of the language are our target concerns. Additionally, our approach can help in the sentiment analysis of other languages, like Arabic, Persian, Hindi, and Punjabi. The proposed framework emphasizes the identification of SentiUnits, rather than merely the subjective words, in the given text. SentiUnits are the sentiment-carrier expressions, which reveal the inherent sentiments of a sentence for a specific target. The targets are the noun phrases about which an opinion is expressed. The system extracts SentiUnits and target expressions through shallow-parsing-based chunking. A dependency parsing algorithm creates associations between these extracted expressions. The framework uses a sentiment-annotated, lexicon-based approach. Each entry of the lexicon is marked with its orientation (positive or negative) and an intensity (force of orientation) score. Experimental evaluation of the system, with a sentiment-annotated lexicon of Urdu words and two corpora of reviews as test beds, shows encouraging achievement in terms of accuracy, precision, recall, and F-measure.
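
The scoring step of such a framework can be sketched compactly: once chunking has paired each SentiUnit with its target noun phrase, each pair is scored against the annotated lexicon. The lexicon entries below (Urdu words in transliteration) and the scoring rule are invented for illustration, not the dissertation's actual resources.

    # word -> (orientation, intensity); entries invented for illustration
    LEXICON = {"acha": (+1, 0.8), "kharab": (-1, 0.9)}

    def score(sentiunits):
        """sentiunits: (sentiment word, target noun phrase) pairs from chunking."""
        totals = {}
        for word, target in sentiunits:
            orientation, intensity = LEXICON.get(word, (0, 0.0))
            totals[target] = totals.get(target, 0.0) + orientation * intensity
        return totals

    print(score([("acha", "camera"), ("kharab", "battery")]))
    # {'camera': 0.8, 'battery': -0.9}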

Book ChapterDOI
24 Mar 2013
TL;DR: This work investigates how the results of a pattern-based unsupervised grammar induction system improve as data on new kinds of phrases are added, leading to a significant improvement in performance.
Abstract: There is a growing interest in unsupervised grammar induction, which does not require syntactic annotations, but provides less accurate results than the supervised approach. Aiming at improving the accuracy of the unsupervised approach, we have resorted to additional information, which can be obtained more easily. Shallow parsing or chunking identifies the sentence constituents (noun phrases, verb phrases, etc.), but without specifying their internal structure. There exist highly accurate systems to perform this task, and thus this information is available even for languages for which large syntactically annotated corpora are lacking. In this work we have investigated how the results of a pattern-based unsupervised grammar induction system improve as data on new kinds of phrases are added, leading to a significant improvement in performance. We have analyzed the results for three different languages. We have also shown that the system is able to significantly improve the results of the unsupervised system using the chunks provided by automatic chunkers.

05 Jun 2013
TL;DR: The improvements presented in this paper include the following: analyses of previously identified ambiguities in morphosyntax and in syntactic functions, their disambiguation, and finally, an outline of possible steps in terms of shallow parsing based on the results provided by the disambiguation process.
Abstract: Our goal in this article is to show the improvements in the computational treatment of Basque, and more specifically, in the areas of morphosyntactic disambiguation and shallow parsing. The improvements presented in this paper include the following: analyses of previously identified ambiguities in morphosyntax and in syntactic functions, their disambiguation, and finally, an outline of possible steps in terms of shallow parsing based on the results provided by the disambiguation process. The work is part of the current research within the field of Natural Language Processing (NLP) in Basque, and more specifically, part of the work that is being done within the IXA group.

Proceedings ArticleDOI
23 Jul 2013
TL;DR: An abbreviation definition identification algorithm is proposed, which employs a variety of rules and incorporates shallow parsing of the text to identify the most probable abbreviation definition in general texts.
Abstract: The study of abbreviation identification has mostly been limited to the biomedical literature. The wide use of abbreviations in general texts, including web data and newswire data, requires us to process and extract abbreviation definitions there as well. In this paper, we propose an abbreviation definition identification algorithm, which employs a variety of rules and incorporates shallow parsing of the text to identify the most probable abbreviation definition in general texts. The performance of our system was tested with the data set provided by the 2012 NIST TAC-KBP evaluation, obtaining a performance of 94.2% recall and 95.5% precision.
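
The abstract does not detail the rules, but the classic rule for the pattern "long form (SF)", in the style of Schwartz & Hearst (2003), is sketched below; the paper's own rule set, which also incorporates shallow parsing, goes beyond this.

    import re

    def best_long_form(short, window):
        """Match short-form characters right to left in the window; the first
        character must align with the start of a word (Schwartz-Hearst style)."""
        s, l = len(short) - 1, len(window) - 1
        while s >= 0:
            c = short[s].lower()
            if not c.isalnum():
                s -= 1
                continue
            while l >= 0 and (window[l].lower() != c or
                              (s == 0 and l > 0 and window[l - 1].isalnum())):
                l -= 1
            if l < 0:
                return None
            s, l = s - 1, l - 1
        return window[l + 1:]

    def find_definitions(text):
        pairs = []
        for m in re.finditer(r"\(([A-Za-z]{2,10})\)", text):
            short = m.group(1)
            words = text[:m.start()].split()
            window = " ".join(words[-(len(short) + 5):])  # candidate word window
            long_form = best_long_form(short, window)
            if long_form:
                pairs.append((short, long_form))
        return pairs

    print(find_definitions("The National Institute of Standards and "
                           "Technology (NIST) ran the evaluation."))
    # [('NIST', 'National Institute of Standards and Technology')]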

Journal ArticleDOI
TL;DR: A text mining approach for multiclass biomedical relations based on predicate argument structure (PAS) and shallow parsing is presented, and BRES, a text mining system, is implemented based on the proposed approach.
Abstract: With an overwhelming amount of published biomedical research, the underlying biomedical knowledge is expanding at an exponential rate. This expansion makes it very difficult to find genetics knowledge of interest, and there is therefore an urgent need for text mining approaches that discover new knowledge from publications. This paper presents a text mining approach for multiclass biomedical relations based on predicate argument structure (PAS) and shallow parsing. The approach can mine explicit biomedical relations with semantic enrichment and visualize relations with a semantic network. It first identifies noun phrases based on shallow parsing, and then filters arguments from noun phrases via a biomedical ontology dictionary. We have implemented BRES, a text mining system, based on our proposed approach. Our results obtained a 67.7% F-measure, 62.5% precision, and 73.8% recall on the test dataset. This shows that our proposed approach is promising for developing biomedical text mining technology. Highlights:
• Mining multiclass biomedical relations;
• Representing biomedical relations with semantic enrichment;
• Visualizing relations by semantic network;
• Extracting direct and indirect biomedical relations.
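
The argument-filtering step the abstract describes reduces, in essence, to a dictionary lookup over shallow-parsed noun phrases, as in the sketch below; the ontology entries are invented for illustration and are far smaller than a real biomedical dictionary.

    # Keep only NP chunks whose normalized form is in the ontology dictionary.
    ONTOLOGY = {"brca1": "Gene", "p53": "Gene", "breast cancer": "Disease"}

    def filter_arguments(np_chunks):
        """np_chunks: noun-phrase strings produced by a shallow parser."""
        return [(np, ONTOLOGY[np.lower()])
                for np in np_chunks if np.lower() in ONTOLOGY]

    chunks = ["BRCA1", "a mutation", "breast cancer"]
    print(filter_arguments(chunks))
    # [('BRCA1', 'Gene'), ('breast cancer', 'Disease')]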

Book ChapterDOI
Qiong Wu
10 May 2013
TL;DR: This paper focuses on Chinese non-canonical VN collocations from the NLP perspective, classifies them, discusses their semantic features, and argues that machine recognition of Chinese non-canonical collocations should consider not only the semantic roles of the objects but also the verbs.
Abstract: This paper focuses on Chinese non-canonical VN collocations from the NLP perspective. It first makes a classification of Chinese non-canonical VN collocations, and then discusses their semantic features. This paper argues that machine recognition of Chinese non-canonical collocations should consider not only the semantic roles of the objects, but also the verbs. Idioms and chunks should be put into the lexicon directly. A flow chart for machine recognition is offered at the end of this paper.

Book ChapterDOI
01 Jan 2013
TL;DR: A new model for shallow parsing of Chinese is presented, which adopts Church's theory and performs Chinese phrase recognition based on an HMM; it improves the precision of sentence segmentation by refining the observation probabilities of the HMM model and making use of the context information of Chinese sentences.
Abstract: Complete parsing has difficulty meeting the precision and recall requirements for Chinese. To address this problem, a new model for shallow parsing of Chinese is presented in this paper. We adopt Church's theory and perform Chinese phrase recognition based on an HMM; we improve the precision of sentence segmentation by refining the observation probabilities of the HMM model and making use of the context information of Chinese sentences. At the same time, by studying the rules of Chinese sentences, we extract some rules useful for ambiguity elimination. The experimental results indicate that the HMM-based model achieves high precision and recall.
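
The decoding core of such an HMM chunker is Viterbi search over chunk tags. The sketch below decodes BIO noun-phrase tags from a POS sequence; all probabilities are invented for illustration and would in practice be estimated from a treebank, with the context-sensitive adjustments the chapter describes applied to the observation probabilities.

    import math

    STATES = ["B-NP", "I-NP", "O"]
    START = {"B-NP": 0.6, "I-NP": 0.0, "O": 0.4}
    TRANS = {"B-NP": {"B-NP": 0.1, "I-NP": 0.6, "O": 0.3},
             "I-NP": {"B-NP": 0.2, "I-NP": 0.5, "O": 0.3},
             "O":    {"B-NP": 0.5, "I-NP": 0.0, "O": 0.5}}
    # Emission over POS tags rather than words, a common HMM-chunker choice.
    EMIT = {"B-NP": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
            "I-NP": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
            "O":    {"DT": 0.1, "NN": 0.1, "VB": 0.8}}

    def viterbi(pos_tags):
        def lp(x):  # log-probability, guarding zeros
            return math.log(x) if x > 0 else float("-inf")
        score = {s: lp(START[s]) + lp(EMIT[s].get(pos_tags[0], 0)) for s in STATES}
        back = []
        for pos in pos_tags[1:]:
            prev, score, col = score, {}, {}
            for s in STATES:
                best = max(STATES, key=lambda p: prev[p] + lp(TRANS[p][s]))
                score[s] = prev[best] + lp(TRANS[best][s]) + lp(EMIT[s].get(pos, 0))
                col[s] = best
            back.append(col)
        state = max(STATES, key=score.get)
        path = [state]
        for col in reversed(back):
            state = col[state]
            path.append(state)
        return path[::-1]

    print(viterbi(["DT", "NN", "VB"]))   # ['B-NP', 'I-NP', 'O']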

Journal ArticleDOI
TL;DR: Some applied techniques of shallow parsing are introduced and a new method is tested experimentally.
Abstract: Shallow parsing is a strategy of language processing that has emerged in the domain of natural language processing in recent years. It does not focus on obtaining a full parse tree, but requires only the recognition of certain simple constituents of the structure. It separates parsing into two subtasks: one is the recognition and analysis of chunks, and the other is the analysis of the relationships among chunks. In this paper, some applied techniques of shallow parsing are introduced and a new method is tested experimentally.