
Showing papers on "Shallow parsing published in 2014"


Journal ArticleDOI
TL;DR: A grammatically motivated sentiment classification model, applied to a morphologically rich language (Urdu), achieves state-of-the-art performance in sentiment analysis of Urdu text.
Abstract: This paper presents a grammatically motivated sentiment classification model applied to a morphologically rich language: Urdu. The morphological complexity and the flexibility of the grammatical rules of this language require an improved or altogether different approach. We emphasize the identification of SentiUnits rather than individual subjective words in the given text. SentiUnits are sentiment-carrier expressions that reveal the inherent sentiment of a sentence toward a specific target. The targets are the noun phrases about which an opinion is expressed. The system extracts SentiUnits and target expressions through shallow-parsing-based chunking, and a dependency parsing algorithm creates associations between these extracted expressions. For our system, we develop a sentiment-annotated lexicon of Urdu words; each entry of the lexicon is marked with its orientation (positive or negative) and its intensity (force of orientation) score. For the evaluation of the system, two corpora of reviews from the domains of movies and electronic appliances are collected. The experimental results show that we achieve state-of-the-art performance in sentiment analysis of Urdu text.
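
For illustration, a minimal sketch in Python (not the authors' implementation) of how a target noun phrase could be scored from nearby SentiUnit chunks, assuming a hypothetical lexicon that maps each expression to an (orientation, intensity) pair as described above:

SENTI_LEXICON = {
    # expression -> (orientation, intensity); toy romanized entries for illustration only
    "behtareen": (+1, 0.9),   # "excellent"
    "kharab": (-1, 0.7),      # "bad"
}

def score_target(chunks, target_index, window=2):
    """Sum orientation * intensity over SentiUnit chunks near the target noun phrase."""
    score = 0.0
    lo = max(0, target_index - window)
    hi = min(len(chunks), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        entry = SENTI_LEXICON.get(chunks[i])
        if entry is not None:
            orientation, intensity = entry
            score += orientation * intensity
    return score

# chunks as produced by a shallow parser; the target noun phrase is at index 0
print(score_target(["film", "behtareen", "thi"], target_index=0))   # 0.9 -> positive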

34 citations


Journal ArticleDOI
22 Jul 2014-PeerJ
TL;DR: The results show how the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context.
Abstract: Background. Unlike full reading, 'skim-reading' involves the process of looking quickly over information in an attempt to cover more material whilst still being able to retain a superficial view of the underlying content. Within this work, we specifically emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text. For the extraction, we use shallow parsing, co-occurrence analysis and semantic similarity computation techniques. Our main motivation is to assist biomedical researchers and clinicians in coping with the increasingly large amounts of potentially relevant articles that are continually being published in the life sciences. Methods. To construct the high-level network overview of articles, we extract weighted binary statements from the text. We consider two types of these statements, co-occurrence and similarity, both organised in the same distributional representation (i.e., in a vector-space model). For the co-occurrence weights, we use pointwise mutual information, which indicates the degree of non-random association between two co-occurring entities. For computing the similarity statement weights, we use cosine distance based on the relevant co-occurrence vectors. These statements are used to build fuzzy indices of terms, statements and provenance article identifiers, which support fuzzy querying and subsequent result ranking. These indexing and querying processes are then used to construct a graph-based interface for searching and browsing entity networks extracted from articles, as well as articles relevant to the networks being browsed. Last but not least, we describe a methodology for automated experimental evaluation of the presented approach. The method uses formal comparison of the graphs generated by our tool to relevant gold standards based on manually curated PubMed, TREC challenge and MeSH data. Results. We provide a web-based prototype (called 'SKIMMR') that generates a network of inter-related entities from a set of documents, which a user may explore through our interface. When a particular area of the entity network looks interesting to a user, the tool displays the documents that are the most relevant to those entities of interest currently shown in the network. We present this as a methodology for browsing a collection of research articles. To illustrate the practical applicability of SKIMMR, we present examples of its use in the domains of Spinal Muscular Atrophy and Parkinson's Disease. Finally, we report on the results of experimental evaluation using the two domains and one additional dataset based on the TREC challenge. The results show how the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context.
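
A minimal sketch (not the SKIMMR code) of the two statement weights described in the Methods: pointwise mutual information for co-occurrence and cosine similarity over co-occurrence vectors; counts and entity names are toy values:

import math

def pmi(pair_count, count_a, count_b, total_pairs):
    """Pointwise mutual information of two co-occurring entities."""
    p_ab = pair_count / total_pairs
    p_a = count_a / total_pairs
    p_b = count_b / total_pairs
    return math.log(p_ab / (p_a * p_b), 2)

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse co-occurrence vectors (dict: entity -> weight)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy example: two entities co-occur in 40 of 1000 observed entity pairs
print(pmi(pair_count=40, count_a=120, count_b=90, total_pairs=1000))
print(cosine({"gene": 1.2, "motor neuron": 0.8}, {"gene": 0.9, "therapy": 0.4}))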

10 citations


Proceedings ArticleDOI
Sung Jeon Song, Go Eun Heo, Ha Jin Kim, Hyo Jung Jung, Yong Hwan Kim, Min Song
07 Nov 2014
TL;DR: This paper proposes a hybrid approach to extracting relations based on a rule-based feature set, using different classification algorithms such as SVM, Naive Bayes, and Decision Tree classifiers for relation classification.
Abstract: Relation extraction is an important task in biomedical areas such as protein-protein interactions, gene-disease interactions, and drug-disease interactions. In recent years, automatic extraction of biomedical relations from vast amounts of biomedical text data has been widely researched. In this paper, we propose a hybrid approach to extracting relations based on a rule-based feature set. We then use different classification algorithms such as SVM, Naive Bayes, and Decision Tree classifiers for relation classification. The rationale for adopting shallow parsing and other NLP techniques to extract relations is twofold: simplicity and robustness. We select seven features with the rule-based shallow parsing technique and evaluate the performance on four public PPI corpora. Our experimental results show stable performance in F-measure even with relatively few features.
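
A minimal sketch of this overall setup, with invented placeholder features (the paper's seven features are not listed in the abstract): shallow-parsing-derived features for candidate protein pairs fed to the three classifier families named above, using scikit-learn:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Each instance: features of an entity pair within one sentence; label: interacting (1) or not (0).
instances = [
    {"verb_between": "binds", "chunk_distance": 2, "same_np_chunk": 0},
    {"verb_between": "expresses", "chunk_distance": 5, "same_np_chunk": 0},
    {"verb_between": "interacts", "chunk_distance": 1, "same_np_chunk": 1},
]
labels = [1, 0, 1]

X = DictVectorizer(sparse=False).fit_transform(instances)

for clf in (SVC(kernel="linear"), BernoulliNB(), DecisionTreeClassifier()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))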

9 citations


Journal ArticleDOI
TL;DR: A novel unsupervised shallow parsing model trained on the unannotated Chinese side of a parallel Chinese-English corpus, exploiting graph-based label propagation for bilingual knowledge transfer and using the projected labels as features in the unsupervised model.
Abstract: This paper presents a novel approach to unsupervised shallow parsing, with the model trained on the unannotated Chinese side of a parallel Chinese-English corpus. In this approach, no annotated information from the Chinese side is used. The exploitation of graph-based label propagation for bilingual knowledge transfer, together with the use of the projected labels as features in the unsupervised model, contributes to better performance. Experimental comparisons with state-of-the-art algorithms show that the proposed approach achieves considerably higher accuracy in terms of F-score.
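
A minimal sketch of graph-based label propagation in general (not the paper's algorithm): chunk-tag distributions projected from the English side are clamped on seed vertices and iteratively spread to unlabeled word vertices:

import numpy as np

def propagate(adjacency, seed_labels, n_iters=50):
    """adjacency: (n, n) edge weights; seed_labels: (n, k) one-hot rows for seeds, zeros otherwise."""
    labels = seed_labels.copy()
    is_seed = seed_labels.sum(axis=1) > 0
    row_sums = adjacency.sum(axis=1, keepdims=True)
    transition = np.divide(adjacency, row_sums, out=np.zeros_like(adjacency), where=row_sums > 0)
    for _ in range(n_iters):
        labels = transition @ labels
        labels[is_seed] = seed_labels[is_seed]   # clamp the projected (seed) vertices
    return labels

# Toy graph: 3 word vertices, 2 chunk tags (e.g. B-NP, B-VP); vertices 0 and 2 are seeds
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
seeds = np.array([[1.0, 0.0],
                  [0.0, 0.0],
                  [0.0, 1.0]])
print(propagate(adj, seeds))   # vertex 1 ends up with a mixed tag distribution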

8 citations


Book ChapterDOI
01 Dec 2014
TL;DR: This work proposes a deep Natural Language Understanding approach to create complete and precise formal models of requirements specifications, providing feedback to the user and allowing the refinement of specifications into a precise and unambiguous form.
Abstract: Many attempts have been made to apply Natural Language Processing to requirements specifications. However, typical approaches rely on shallow parsing to identify object-oriented elements of the specifications (e.g. classes, attributes, and methods). As a result, the models produced are often incomplete, imprecise, and require manual revision and validation. In contrast, we propose a deep Natural Language Understanding approach to create complete and precise formal models of requirements specifications. We combine three main elements to achieve this: (1) acquisition of a lexicon from a user-supplied glossary requiring little specialised prior knowledge; (2) flexible syntactic analysis based purely on word order; and (3) Knowledge-based Configuration, which unifies several semantic analysis tasks and allows the handling of ambiguities and errors. Moreover, we provide feedback to the user, allowing the refinement of specifications into a precise and unambiguous form. We demonstrate the benefits of our approach on an example from the PROMISE requirements corpus.

3 citations


Proceedings ArticleDOI
01 Jun 2014
TL;DR: Hedge parsing is introduced as an approach to recovering constituents of length up to some maximum span L, which improves efficiency by bounding constituent size, and allows for efficient segmentation strategies prior to parsing.
Abstract: Finite-state chunking and tagging methods are very fast for annotating nonhierarchical syntactic information, and are often applied in applications that do not require full syntactic analyses. Scenarios such as incremental machine translation may benefit from some degree of hierarchical syntactic analysis without requiring fully connected parses. We introduce hedge parsing as an approach to recovering constituents of length up to some maximum span L. This approach improves efficiency by bounding constituent size, and allows for efficient segmentation strategies prior to parsing. Unlike shallow parsing methods, hedge parsing yields internal hierarchical structure of phrases within its span bound. We present the approach and some initial experiments on different inference strategies.
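
A minimal sketch of the idea (not the authors' parser): turning a full constituent tree into a "hedge" by flattening every internal constituent that spans more than L words, so that only short-span hierarchical structure under the root is kept:

from nltk import Tree

def hedge_children(tree, max_span):
    out = []
    for child in tree:
        if isinstance(child, str):
            out.append(child)
        elif len(child.leaves()) > max_span:
            out.extend(hedge_children(child, max_span))   # flatten an over-long constituent
        else:
            out.append(child)                              # keep short-span structure intact
    return out

def hedge(tree, max_span=3):
    """Keep the root but flatten any internal constituent spanning more than max_span words."""
    return Tree(tree.label(), hedge_children(tree, max_span))

t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")
print(hedge(t, max_span=3))   # the 4-word VP is flattened; NP and PP (<= 3 words) survive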

3 citations


Journal ArticleDOI
TL;DR: A new phrase chunking algorithm is proposed that accepts a POS-tagged Myanmar sentence as input and generates chunks as output; good precision, recall and F-measure were obtained with the newly developed algorithm.
Abstract: Chunking is the subdivision of sentences into non-recursive, regular syntactic groups: verbal chunks, nominal chunks, adjective chunks, adverbial chunks, postpositional chunks, etc. A chunker can operate as a preprocessor for Natural Language Processing systems. This study proposes a new phrase chunking algorithm for Myanmar natural language processing. The new algorithm accepts a POS-tagged Myanmar sentence as input and generates chunks as output. The input sentence is split into chunks using chunk markers such as postpositions, particles and conjunctions, and each chunk is labelled as a noun chunk, verb chunk, adjective chunk, adverb chunk or conjunction chunk. The algorithm was evaluated on POS-tagged Myanmar sentences using three evaluation measures. According to the results, the newly developed algorithm obtains good precision, recall and F-measure.
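
A toy sketch of marker-based chunking in this spirit (tag names are invented placeholders, not the published algorithm): the tagged sentence is cut after each marker tag, and each chunk is labelled from the tag of its head word:

MARKER_TAGS = {"PPM", "PART", "CONJ"}          # hypothetical tags: postposition, particle, conjunction
HEAD_TO_CHUNK = {"N": "NC", "V": "VC", "ADJ": "AJC", "ADV": "AVC", "CONJ": "CC"}

def chunk(tagged_sentence):
    """tagged_sentence: list of (word, tag); returns list of (chunk_label, words)."""
    chunks, current = [], []
    for word, tag in tagged_sentence:
        current.append((word, tag))
        if tag in MARKER_TAGS:                 # a marker closes the current chunk
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    labelled = []
    for c in chunks:
        head_tag = next((t for _, t in reversed(c) if t not in MARKER_TAGS), c[-1][1])
        labelled.append((HEAD_TO_CHUNK.get(head_tag, "OC"), [w for w, _ in c]))
    return labelled

print(chunk([("kyaung", "N"), ("ko", "PPM"), ("thwa", "V"), ("de", "PART")]))
# [('NC', ['kyaung', 'ko']), ('VC', ['thwa', 'de'])]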

2 citations


Book ChapterDOI
17 Sep 2014
TL;DR: This work describes an information extraction methodology that uses shallow parsing together with predefined frame templates and vocabulary stored within a domain ontology whose elements are related to the frame templates.
Abstract: This work describes an information extraction methodology that uses shallow parsing. We present detailed information on the extraction process, the data structures used within that process, and an evaluation of the described method. The extraction is fully automatic. Instead of machine learning, it uses predefined frame templates and vocabulary stored within a domain ontology whose elements are related to the frame templates. The architecture of the information extractor is modular, and the main extraction module is capable of processing various languages when lexicalization for these languages is provided.
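
A minimal sketch with hypothetical structures (not the paper's templates or ontology) of how a predefined frame template could be filled from shallow-parsed chunks using an ontology-backed vocabulary:

FRAME_TEMPLATE = {"event": "acquisition", "slots": {"buyer": "Company", "target": "Company"}}
ONTOLOGY = {"Acme Corp": "Company", "Globex": "Company"}   # lexicalization -> ontology class

def fill_frame(chunks, template, ontology):
    """chunks: list of (phrase, role_hint); returns slot assignments compatible with the ontology."""
    frame = {"event": template["event"], "slots": {}}
    for phrase, role in chunks:
        wanted = template["slots"].get(role)
        if wanted and ontology.get(phrase) == wanted:
            frame["slots"][role] = phrase
    return frame if len(frame["slots"]) == len(template["slots"]) else None

chunks = [("Acme Corp", "buyer"), ("acquired", "trigger"), ("Globex", "target")]
print(fill_frame(chunks, FRAME_TEMPLATE, ONTOLOGY))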

2 citations


Proceedings ArticleDOI
Kamal Sarkar
19 Dec 2014
TL;DR: This paper presents a new approach to automatically extracting key phrases from a Bengali document: candidate key phrases are identified through shallow parsing using lexical information and case markers, and the best items are chosen from the candidate set by a ranking method that combines statistical and linguistic features.
Abstract: This paper presents a new approach for automatically extracting key phrases from a Bengali document. The proposed approach has two important steps: (1) shallow-parsing-based candidate key phrase identification that uses lexical information and case markers, and (2) selection of the best items from the set of candidates using a ranking method that combines statistical and linguistic features. The feature set includes term frequency, position of the phrase's first occurrence, named entity information and lexical information. The proposed system has been tested on a collection of Bengali news documents. The experimental results show that it performs better than the existing approaches to which it is compared.
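
A minimal sketch under assumed weights (not the paper's ranking model) of scoring candidates with a linear combination of the feature types listed above:

def rank_candidates(candidates, doc_length, weights=(1.0, 0.5, 0.8)):
    """candidates: list of dicts with 'phrase', 'tf', 'first_pos', 'is_named_entity'."""
    w_tf, w_pos, w_ne = weights
    scored = []
    for c in candidates:
        position_score = 1.0 - c["first_pos"] / doc_length     # earlier occurrence scores higher
        score = w_tf * c["tf"] + w_pos * position_score + w_ne * float(c["is_named_entity"])
        scored.append((score, c["phrase"]))
    return [phrase for _, phrase in sorted(scored, reverse=True)]

candidates = [
    {"phrase": "kolkata book fair", "tf": 5, "first_pos": 12, "is_named_entity": True},
    {"phrase": "new approach",      "tf": 2, "first_pos": 80, "is_named_entity": False},
]
print(rank_candidates(candidates, doc_length=400))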

1 citation


Proceedings ArticleDOI
11 Apr 2014
TL;DR: The article presents preliminary results of a project aimed at developing a new algorithm for shallow parsing of natural language that allows efficient processing of word lattices, including the efficient application of RHS operations directly on a word lattice, which turns out not to be trivial due to the high expressive power of the Spejd formalism.
Abstract: The article presents preliminary results of a project which aims at developing a new algorithm for shallow parsing of a natural language. Its main feature is that the algorithm allows efficient processing of word lattices. The main application of this feature will be a new rule-based approach to automatic speech recognition, in which sentence candidates for a given utterance are scored according to their accordance with the grammar of a particular natural language. The algorithm is being implemented on the basis of Spejd, a shallow parsing and morphosyntactic disambiguation system. The Spejd formalism is designed to process a single (linear) sentence at a time. A naive baseline approach to scoring each sentence candidate in a word lattice would be to process each candidate sequentially after extracting it from the lattice. Unfortunately, the time complexity of this naive approach is exponential in the size of the lattice; hence, it is not applicable to real-life data. The rule pattern matching algorithm presented in the article uses finite-state techniques to process the whole word lattice at once in polynomial time and space. It combines the optimizations for processing linear sentences used in the existing Spejd implementation with new ideas specific to searching for regular patterns in a word lattice. The improvements follow the idea of simulating a non-deterministic finite-state automaton without backtracking, presented by Ville Laurikari. This idea was extended to handle not only a bunch of NFA threads at once, but also to simultaneously process multiple paths passing through a particular word (a lattice node). The article also discusses the efficient application of RHS operations directly on a word lattice, which turns out not to be trivial due to the high expressive power of the Spejd formalism. The article is supplemented by a preliminary speed evaluation of the new implementation.
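
A simplified sketch (far short of the Spejd formalism) of the underlying idea: matching a fixed tag sequence against every path of a word lattice in a single pass by propagating sets of pattern positions through lattice nodes instead of enumerating paths:

from collections import defaultdict

def match_lattice(edges, n_nodes, pattern):
    """edges: list of (src, dst, tag); pattern: list of tags; returns nodes where a match ends."""
    # active[node] = set of pattern positions reachable at that node (0 = nothing matched yet)
    active = defaultdict(set)
    for node in range(n_nodes):
        active[node].add(0)                          # a match may start at any node
    match_ends = set()
    for src, dst, tag in sorted(edges):              # assumes node ids are in topological order
        for pos in active[src]:
            if pos < len(pattern) and tag == pattern[pos]:
                if pos + 1 == len(pattern):
                    match_ends.add(dst)
                else:
                    active[dst].add(pos + 1)
    return match_ends

# Two alternative readings ("N" vs "V") of the same word between nodes 1 and 2
edges = [(0, 1, "DET"), (1, 2, "N"), (1, 2, "V"), (2, 3, "V")]
print(match_lattice(edges, n_nodes=4, pattern=["DET", "N", "V"]))   # {3}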

Dissertation
15 Apr 2014
TL;DR: This thesis contributes language-style and domain adaptation techniques for machine translation of spoken conversations, using off-the-shelf systems such as Google Translate and SMT systems trained on both out-of-domain and in-domain parallel data, and demonstrates that the techniques are beneficial for both close and distant language pairs.
Abstract: English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language, and data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issue; however, it requires a significant amount of effort. Recently, Statistical Machine Translation (SMT) techniques and parallel corpora were used to transfer annotations from linguistic-resource-rich languages to resource-poor languages for a variety of NLP tasks, including Part-of-Speech tagging, Noun Phrase chunking, dependency parsing, textual entailment, etc. This cross-language NLP paradigm relies on the solution of the following sub-problems:
- Data-driven NLP techniques are very sensitive to differences between training and testing conditions. Different domains, such as financial news-wire and biomedical publications, have different distributions of NLP task-specific properties; thus, domain adaptation of the source language tools -- either the development of models with good cross-domain performance or models tuned to the target domain -- is critical.
- Another difference in training and testing conditions arises with cross-genre applications such as written text (monologues) and spontaneous dialog data. Properties of written text such as punctuation and the notion of sentence are not present in spoken conversation transcriptions. Thus, style-adaptation techniques covering a wider range of genres are critical as well.
- The basis of cross-language porting is parallel corpora. Unfortunately, parallel corpora are scarce; thus, generation or retrieval of parallel corpora between the languages of interest is important. Additionally, these parallel corpora are most often not in the domains of interest; consequently, cross-language porting should be augmented with SMT domain adaptation techniques.
- Language distance plays an important role within the paradigm, since for close family language pairs (e.g. the Romance languages Italian and Spanish) the range of linguistic phenomena to consider is significantly smaller than for distant family language pairs (e.g. Italian and Turkish). The developed cross-language techniques should be applicable to both conditions.
In this thesis we address these sub-problems on the complex Natural Language Processing tasks of Discourse Parsing and Spoken Language Understanding. Both tasks are cast as token-level shallow parsing. Penn Discourse Treebank (PDTB) style discourse parsing is applied cross-domain, and we contribute feature-level domain adaptation techniques for the task. Additionally, we explore PDTB-style discourse parsing on dialog data in Italian and report on the challenges. The problems of parallel corpora creation, language style adaptation, SMT domain adaptation and language distance are addressed on the task of cross-language porting of Spoken Language Understanding. This thesis contributes language-style and domain adaptation techniques for machine translation of spoken conversations, using off-the-shelf systems like Google Translate and SMT systems trained on both out-of-domain and in-domain parallel data. We demonstrate that the techniques are beneficial for both close and distant language pairs.
We propose methodologies for the creation of parallel spoken conversation corpora via professional translation services that consider speech phenomena such as disfluencies. Additionally, we explore semantic annotation transfer using automatic SMT methods and crowdsourcing. For the latter, we propose a computational methodology to obtain an acceptable-quality corpus without target-language references and despite low worker agreement.

Proceedings ArticleDOI
13 Jul 2014
TL;DR: It is shown how to train an HM-SVM model to achieve good performance on the CoNLL-2000 shared task data set; the model yields an F-score of 95.51%, which is better than any system result from the CoNLL-2000 shared task.
Abstract: Shallow parsing systems, which provide partial syntactic information for natural language statements and meet the requirements of many language information processing applications, have received much attention in recent years. Hidden Markov Support Vector Machines (HM-SVMs) for sequence labeling offer advantages over both generative models like HMMs and classification models like SVMs, which produce a labeling decision for each position separately. We show how to train an HM-SVM model to achieve good performance on the CoNLL-2000 shared task data set. The HM-SVM yields an F-score of 95.51%, which is better than any system result from the CoNLL-2000 shared task.
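
HM-SVM implementations are not common in Python libraries, so the following sketch uses a linear-chain CRF (sklearn-crfsuite) as a stand-in sequence labeler on CoNLL-2000-style BIO chunk tags; it illustrates the task setup rather than the paper's HM-SVM training:

import sklearn_crfsuite

def word_features(sentence, i):
    word, pos = sentence[i]
    feats = {"word": word.lower(), "pos": pos, "is_title": word.istitle()}
    if i > 0:
        feats["prev_pos"] = sentence[i - 1][1]
    return feats

# Toy CoNLL-2000-style data: (word, POS) sequences with BIO chunk labels
train = [([("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"), ("deficit", "NN")],
          ["B-NP", "B-VP", "B-NP", "I-NP"])]

X = [[word_features(sent, i) for i in range(len(sent))] for sent, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))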