
Showing papers on "Shallow parsing published in 2016"


Posted Content
TL;DR: In this paper, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed, and a shallow parser has been developed.
Abstract: In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data and developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible at 1 .

59 citations
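The four pipeline stages named in the abstract (language identification, normalization, part-of-speech tagging, shallow parsing) can be sketched end to end. Everything below — the Hindi word list, the normalization map, the tag lexicon — is a tiny hypothetical stand-in for illustration, not the authors' released pipeline.

```python
# Toy sketch of a shallow-parsing pipeline for code-mixed text, mirroring the
# stages described above: language ID -> normalization -> POS tagging -> chunking.
# All lexicons here are invented placeholders, not the paper's resources.

HINDI_WORDS = {"bahut", "accha", "hai"}           # assumed Hindi-in-Roman tokens
NORMALIZE = {"gr8": "great", "u": "you"}           # assumed normalization map
POS_LEXICON = {"movie": "NOUN", "great": "ADJ", "bahut": "ADV",
               "accha": "ADJ", "hai": "VERB", "you": "PRON"}

def identify_language(token):
    return "hi" if token in HINDI_WORDS else "en"

def normalize(token):
    return NORMALIZE.get(token, token)

def pos_tag(token):
    return POS_LEXICON.get(token, "X")

def shallow_parse(sentence):
    """Return (token, lang, tag) triples plus contiguous chunks of tagged words."""
    tokens = [normalize(t) for t in sentence.lower().split()]
    tagged = [(t, identify_language(t), pos_tag(t)) for t in tokens]
    chunks, current = [], []
    for tok, _, tag in tagged:
        if tag != "X":                 # known word: extend the current chunk
            current.append(tok)
        elif current:                  # unknown word breaks the chunk
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return tagged, chunks
```

For a mixed sentence such as "Movie bahut accha hai", the sketch tags the English and Hindi tokens separately and then groups the contiguous tagged run into a single chunk.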


Proceedings ArticleDOI
01 Jun 2016
TL;DR: The problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed, and a language identifier, a normalizer, a part-of-speech tagger and a shallow parser are developed.

39 citations


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This work provides an in-depth analysis of the effect of Sandhi in developing a robust shallow parser pipeline, with experimental results emphasizing how sensitive the individual components of the shallow parser are to the accuracy of a sandhi splitter.
Abstract: This paper evaluates the challenges involved in shallow parsing of Dravidian languages, which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string, with morpho-phonemic changes at the point of concatenation. This phenomenon, known as Sandhi, in turn complicates individual word identification. Shallow parsing is the task of identifying correlated groups of words given a raw sentence. The current work is an attempt to study the effect of Sandhi in building shallow parsers for Dravidian languages by evaluating its effect on Malayalam, one of the main languages of the Dravidian family. We provide an in-depth analysis of the effect of Sandhi in developing a robust shallow parser pipeline, with experimental results emphasizing how sensitive the individual components of the shallow parser are to the accuracy of a sandhi splitter. Our work can serve as a guiding light for building robust text processing systems in Dravidian languages.

9 citations
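To see why downstream components depend so heavily on the sandhi splitter, consider a toy dictionary-based splitter: it tries each split point and undoes a single invented morpho-phonemic change before lexicon lookup. The lexicon and the rule are illustrative only, not real Malayalam morphology.

```python
# Toy illustration of why a sandhi splitter sits in front of a shallow parser:
# words fused into one string must be recovered before tagging and chunking.
# The lexicon and the one sandhi rule are invented for illustration only.

LEXICON = {"maram", "veedu", "illa"}   # hypothetical word list

def sandhi_split(compound):
    """Try every split point; undo one toy morpho-phonemic change (a final 'm'
    elided at the joint) and accept a split if both halves are dictionary words."""
    for i in range(1, len(compound)):
        left, right = compound[:i], compound[i:]
        for candidate in (left, left + "m"):   # toy rule: restore elided 'm'
            if candidate in LEXICON and right in LEXICON:
                return [candidate, right]
    return [compound]   # unsplittable: the parser sees one opaque token
```

If the splitter fails on a compound, every later stage receives a token it has never seen, which is exactly the sensitivity the paper measures.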


Book ChapterDOI
01 Jan 2016
TL;DR: This paper proposes a noun phrase chunker system for Turkish texts that uses a weighted constraint dependency parser to represent the relationship between sentence components and to determine noun phrases.
Abstract: Noun phrase chunking is a sub-category of shallow parsing that can be used for many natural language processing tasks. In this paper, we propose a noun phrase chunker system for Turkish texts. We use a weighted constraint dependency parser to represent the relationship between sentence components and to determine noun phrases. The dependency parser uses a set of hand-crafted rules which can combine morphological and semantic information for constraints. The rules are suitable for handling complex noun phrase structures because of their flexibility. The developed dependency parser can easily be used for shallow parsing of all phrase types by changing the employed rule set. The lack of reliable human-tagged datasets is a significant problem for natural language processing studies on Turkish. Therefore, we constructed a noun phrase dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset.

6 citations
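What "noun phrase chunking" means in practice can be shown with a much simpler pattern-based chunker (DET? ADJ* NOUN+) over POS-tagged input. This is a deliberately minimal stand-in, not the paper's weighted constraint dependency approach, and the English-style tags are assumptions for illustration.

```python
# Minimal pattern-based NP chunker: accept determiner? adjective* noun+.
# A simple stand-in for the paper's constraint-based method, shown only to
# make concrete what "grouping correlated words into phrases" means.

def np_chunk(tagged):
    """tagged: list of (word, tag) pairs; returns list of noun-phrase word lists."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":                       # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":  # one or more nouns
            k += 1
        if k > j:                        # at least one noun: accept the phrase
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks
```

The paper's rule-driven design makes the same kind of pattern swappable: changing the rule set retargets the chunker to other phrase types.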


Proceedings ArticleDOI
29 Apr 2016
TL;DR: In the proposed work, the applicability of the BLEU metric and its modified versions to English-to-Hindi machine translation, particularly in the agriculture domain, is checked, and a synonym replacement module is incorporated in the algorithm.
Abstract: Evaluating Machine Translation (MT) is a difficult and challenging task. The difficulty stems from the fact that translation is less a science than an art; most sentences can be translated in many acceptable forms. Consequently, there is no fixed standard against which a particular translation can be evaluated. If it were possible to build an independent algorithm able to evaluate a specific machine translation, the belief is that this evaluation algorithm would be better than the translating algorithm itself. Initially, MT evaluation was done by human beings, which was time-consuming and highly subjective; evaluation results may also vary from one human evaluator to another for the same sentence pair. We therefore need automatic evaluation systems, which are quick and objective. Different methods for automatic evaluation of machine translation have been proposed in recent years, many of which have been widely accepted by the MT community. In the proposed work, we have checked the applicability of the BLEU metric and its modified versions to English-to-Hindi machine translation, particularly in the agriculture domain. Further, we have incorporated additional features such as synonym replacement and shallow parsing modules, after which we calculate the final score using the BLEU and M-BLEU metrics. The test sentences are taken from the agriculture domain. The BLEU metric does not handle synonyms: it treats synonyms as different words, thereby lowering the final score when comparing human translations with machine translations. To overcome this drawback, we have incorporated a synonym replacement module in our algorithm: each word is first replaced by a synonym present in any of the reference human translations, and the result is then compared with the reference human translation.

3 citations
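The synonym-replacement idea described above can be sketched directly: before scoring, candidate words are mapped onto synonyms that occur in the reference, so an n-gram match is not lost to lexical variation. The synonym dictionary and the clipped unigram-precision scorer below are simplified illustrations, not the paper's M-BLEU implementation.

```python
# Sketch of synonym replacement ahead of a BLEU-style score. The synonym
# table is a hypothetical example; only unigram precision is computed here.
from collections import Counter

SYNONYMS = {"quick": {"fast", "rapid"}, "happy": {"glad"}}   # assumed table

def replace_synonyms(candidate, reference):
    """Replace a candidate word with one of its synonyms, but only if that
    synonym actually appears in the reference translation."""
    ref_set = set(reference)
    out = []
    for w in candidate:
        if w in ref_set:
            out.append(w)
        else:
            match = SYNONYMS.get(w, set()) & ref_set
            out.append(match.pop() if match else w)
    return out

def clipped_unigram_precision(candidate, reference):
    """BLEU-style clipped precision for n = 1."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(len(candidate), 1)
```

With candidate "the quick dog" against reference "the fast dog", plain precision penalizes "quick"; after replacement the match is perfect, which is exactly the score recovery the abstract describes.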


Book ChapterDOI
01 Jan 2016
TL;DR: A method to extract comparative sentences from text documents using a rule-based shallow parser is proposed, which will help researchers extract knowledge from user-generated content.
Abstract: The content generated by users on the Web plays a vital role for researchers seeking to extract knowledge from it. Users write their views by making comparisons between two or more features in a product domain. Extracting these reviews from the Web helps a business improve against its competitors. In this paper, a method to extract comparative sentences from text documents using a rule-based shallow parser is proposed. A shallow parser holds a non-overlapping area of text and allows extracting the part of the text that matches a given rule or grammar. In order to identify and classify comparatives from text documents, various rules were generated. The proposed technique is divided into two tasks: first, obtain the rules to identify comparative sentences from various text documents, and second, classify the text documents into different categories of comparatives.

3 citations
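Two example rules in the spirit of the method above can identify a sentence as comparative and assign a coarse category. Real rule sets are much richer; the patterns here (comparative/superlative POS tags plus a "than" cue) are illustrative assumptions.

```python
# Illustrative rule-based classification of comparative sentences over
# POS-tagged input (Penn-style tags: JJR/RBR comparative, JJS/RBS superlative).
# These two rules are examples, not the paper's full rule set.

def classify_comparative(tagged):
    """tagged: list of (word, tag) pairs; returns a coarse category string."""
    tags = {t for _, t in tagged}
    words = {w.lower() for w, _ in tagged}
    if tags & {"JJS", "RBS"}:                        # e.g. "best"
        return "superlative"
    if (tags & {"JJR", "RBR"}) or "than" in words:   # e.g. "better ... than"
        return "comparative"
    return "non-comparative"
```

The paper's second task, sorting documents into categories of comparatives, corresponds to running rules like these over every sentence and aggregating the labels.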


Proceedings Article
01 May 2016
TL;DR: The paper contains a description of OPFI: Opinion Finder for the Polish Language, a freely available tool for opinion target extraction that is not dependent on any particular method of sentiment identification and provides a built-in sentiment dictionary as a convenient option.
Abstract: The paper contains a description of OPFI: Opinion Finder for the Polish Language, a freely available tool for opinion target extraction. The goal of the tool is opinion finding: the task of identifying tuples composed of a sentiment (positive or negative) and its target (about what or whom the sentiment is expressed). OPFI is not dependent on any particular method of sentiment identification and provides a built-in sentiment dictionary as a convenient option. Technically, it contains implementations of three different modes of opinion tuple generation: one hybrid mode based on dependency parsing and CRF, a second based on shallow parsing, and a third on deep learning, namely a GRU neural network. The paper also contains a description of related language resources: two annotated treebanks and one set of tweets.

2 citations
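The shape of an opinion tuple — a sentiment paired with its target — can be sketched with a toy generator that pairs each sentiment-bearing word with the nearest noun. The sentiment lexicon below is a hypothetical stand-in for OPFI's built-in dictionary, and this nearest-noun heuristic is far cruder than any of the tool's three real modes.

```python
# Toy opinion-tuple generation: look up sentiment words in a small dictionary
# (a stand-in for a real sentiment lexicon) and take the nearest noun as the
# opinion target. Illustrative only; not OPFI's CRF, shallow-parsing, or GRU mode.

SENTIMENT = {"great": "positive", "awful": "negative"}   # hypothetical lexicon

def opinion_tuples(tagged):
    """tagged: list of (word, tag); returns (polarity, target) tuples."""
    tuples = []
    for i, (word, _) in enumerate(tagged):
        polarity = SENTIMENT.get(word.lower())
        if polarity is None:
            continue
        # the nearest noun by token distance is taken as the opinion target
        nouns = [(abs(i - j), w) for j, (w, t) in enumerate(tagged)
                 if t == "NOUN"]
        if nouns:
            tuples.append((polarity, min(nouns)[1]))
    return tuples
```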


Proceedings ArticleDOI
01 Jul 2016
TL;DR: This paper presents a flexible computational representation for Cognitive Construction Grammars (CCxG) which is based on the argument structure representation of CCxG and provides the visualization of annotated results by the Box Diagram.
Abstract: Construction Grammar (CxG), despite its strong explanatory power for language phenomena and language learning, is still a stranger to most natural language processing (NLP) tasks. The main reasons include challenges brought by the open definition of construction, the lack of a large-scale construction knowledge base, and the lack of annotation tools and construction-annotated corpora, which are big obstacles to using CxG in NLP. In this paper, we first present a flexible computational representation for Cognitive Construction Grammars (CCxG) which is based on the argument structure representation of CCxG. A CCxG definition and annotation system is then implemented. Through shallow parsing, this system provides visualization of annotated results as a Box Diagram. By emphasizing the computable aspect rather than the cognitive and psychological aspects of CCxG, we purposely provide NLP researchers and engineers an easily usable tool platform for building an applicable construction knowledge base and large-scale training and testing corpora for a CCxG parser. It is also a useful platform for linguists to investigate and analyze newly emerging language phenomena.

1 citation


Posted Content
TL;DR: This paper proposes an SVM- and template-based approach to Tibetan person knowledge extraction, and designs a hierarchical SVM classifier to realize the entity knowledge extraction.
Abstract: Person knowledge extraction is the foundation of Tibetan knowledge graph construction, which provides support for Tibetan question answering systems, information retrieval, information extraction and other research, and promotes national unity and social stability. This paper proposes an SVM- and template-based approach to Tibetan person knowledge extraction. Through constructing the training corpus, we build the templates based on shallow parsing analysis of Tibetan syntactic and semantic features and verbs. Using the training corpus, we design a hierarchical SVM classifier to realize the entity knowledge extraction. Finally, experimental results show that the method achieves a clear improvement in Tibetan person knowledge extraction.
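The template half of the approach can be sketched as pattern matching that turns sentences into (person, relation, value) triples. The templates below use English placeholders rather than Tibetan text, and the hierarchical SVM classification stage is omitted; both templates and the example names are invented for illustration.

```python
# Minimal template-matching sketch for person attribute extraction. English
# stand-in templates; the paper's SVM filtering/classification is not shown.
import re

TEMPLATES = [
    (re.compile(r"(?P<person>[A-Z][a-z]+) was born in (?P<place>[A-Z][a-z]+)"),
     "birthplace"),
    (re.compile(r"(?P<person>[A-Z][a-z]+) worked as an? (?P<job>[a-z]+)"),
     "occupation"),
]

def extract_triples(text):
    """Return (person, relation, value) triples for every template match."""
    triples = []
    for pattern, relation in TEMPLATES:
        for m in pattern.finditer(text):
            values = m.groupdict()
            person = values.pop("person")
            triples.append((person, relation, next(iter(values.values()))))
    return triples
```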

Journal ArticleDOI
Dage Särg
TL;DR: Extensive use of discourse particles and direct addresses, short sentence length, and a small percentage of attributes among the syntactic functions used in text appeared to be the most distinctive features of netspeak, along with the large number of elliptical sentences from which, in addition to other syntactic functions, a predicate can be left out.
Abstract: Syntactic analysis of Estonian netspeak using Constraint Grammar. The paper provides an overview of an attempt to adapt the Estonian Constraint Grammar rule set for netspeak. The rule set has been developed by Kaili Muurisep and Tiina Puolakainen for shallow and dependency parsing of Estonian literary language, and it has previously been adapted for shallow parsing of spoken Estonian by Kaili Muurisep and Heli Uibo. First, in order to adapt the rules, a 19,809-token chatroom corpus was parsed with the existing rule set. The corpus was manually revised and, based on the errors that were found, changes were made to the rule set. The changes regarded detection of clause boundaries and particle verbs, as well as assignment of syntactic tags and dependency relations.
Extensive use of discourse particles and direct addresses, short sentence length, and a small percentage of attributes among the syntactic functions used in text appeared to be the most distinctive features of netspeak, along with the large number of elliptical sentences from which, in addition to other syntactic functions, a predicate can be left out. As a result of adapting the rule set, the results of both shallow and dependency parsing improved. The most error-prone syntactic functions were subjects, predicatives, and adverbials. In dependency parsing, the largest number of errors was made in determining the governors of adverbials.
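The mechanics of a Constraint Grammar rule set like the one adapted above can be illustrated in miniature: each token starts with a set of candidate readings, and context-sensitive rules remove readings until (ideally) one remains. The single rule below, dropping a verb reading immediately after a determiner, is an invented example in CG style, not one of the Estonian rules.

```python
# Toy Constraint Grammar step: prune ambiguous readings using left context.
# One invented rule, roughly "REMOVE VERB IF (-1 DET)"; real CG rule sets
# contain thousands of such context conditions.

def apply_constraints(cohorts):
    """cohorts: list of (word, set_of_readings); prunes readings in place."""
    for i, (word, readings) in enumerate(cohorts):
        # never remove the last remaining reading (a standard CG safeguard)
        if i > 0 and "DET" in cohorts[i - 1][1] and len(readings) > 1:
            readings.discard("VERB")
    return cohorts
```

Adapting such a grammar to netspeak, as the paper does, amounts to adding, relaxing, or reordering context conditions so they tolerate ellipsis, particles, and other non-standard patterns.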

Book ChapterDOI
03 Apr 2016
TL;DR: Conditional Random Fields (CRFs), a machine learning technique, are used for automatic identification of social events and mining of social networks from literary texts in Tamil, with shallow parsing used for document processing.
Abstract: We describe our work on automatic identification of social events and mining of social networks from literary texts in Tamil. Tamil belongs to the Dravidian language family and is a morphologically rich language. It is a resource-poor language; sophisticated resources for document processing, such as parsers and phrase structure tree taggers, are not available. In our work we have used shallow parsing for document processing. Conditional Random Fields (CRFs), a machine learning technique, are used for automatic identification of social events. We have obtained an F-measure of 62% on social event identification. Social networks are mined by forming triads of the actors in the social events. The social networks are evaluated using a graph comparison technique: the system-generated social network is compared with the gold network. We have obtained a very encouraging similarity score of 0.75.
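The triad-mining step described above can be sketched directly: social events link pairs of actors, the network is the graph of those links, and triads are its closed triples. The pair-per-event input format is an assumption for illustration.

```python
# Sketch of mining a social network and its triads from pairwise social events.
# Each event is assumed to link exactly two actors.
from itertools import combinations

def build_network(events):
    """events: iterable of (actor_a, actor_b) pairs -> set of undirected edges."""
    return {frozenset(pair) for pair in events if pair[0] != pair[1]}

def triads(edges):
    """Return every triple of actors whose three pairwise edges all exist."""
    actors = sorted({a for edge in edges for a in edge})
    return [trio for trio in combinations(actors, 3)
            if all(frozenset(pair) in edges for pair in combinations(trio, 2))]
```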