
Showing papers on "Shallow parsing published in 2020"


Book ChapterDOI
01 Jan 2020
TL;DR: The chapter presents software for extracting predicate-argument constructions that characterize the composition of the structural elements of inventions in the cyber-physical domain and the relationships between them.
Abstract: The chapter presents software for extracting predicate-argument constructions that characterize the composition of the structural elements of inventions in the cyber-physical domain and the relationships between them. The extracted structures reconstruct the component structure of the invention in the form of a net. Such data is further converted into a domain ontology and used for information support of automated invention. A new method for extracting structured data from patents is proposed that takes into account the specifics of patent texts and is based on shallow parsing and sentence segmentation. The ontology scheme includes the structural elements of technical objects as concepts and the relationships between them, as well as supporting information on the invention. The results suggest that the proposed approach is promising. The authors see a further direction of research in improving the existing extraction method and expanding the ontology.
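The shallow-parsing idea behind this extraction method can be illustrated with a minimal sketch: chunk a POS-tagged sentence into noun phrases, then read off (argument, predicate, argument) triples around each verb. The tag set and example sentence below are toy assumptions, not the patent corpus or pattern inventory used in the chapter.

```python
# Minimal illustration of predicate-argument extraction via shallow parsing:
# greedy NP chunking over POS tags, then NP-VERB-NP triple extraction.
def chunk_np(tagged):
    """Greedy NP chunking: maximal runs of determiners/adjectives/nouns."""
    chunks, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in {"DET", "ADJ", "NOUN"}:
            j = i
            while j < len(tagged) and tagged[j][1] in {"DET", "ADJ", "NOUN"}:
                j += 1
            chunks.append((" ".join(w for w, _ in tagged[i:j]), "NP"))
            i = j
        else:
            chunks.append((tagged[i][0], tagged[i][1]))
            i += 1
    return chunks

def triples(chunks):
    """Predicate-argument triples: NP VERB NP patterns over the chunk stream."""
    out = []
    for k in range(1, len(chunks) - 1):
        if chunks[k][1] == "VERB" and chunks[k-1][1] == "NP" and chunks[k+1][1] == "NP":
            out.append((chunks[k-1][0], chunks[k][0], chunks[k+1][0]))
    return out

tagged = [("the", "DET"), ("rotor", "NOUN"), ("drives", "VERB"),
          ("the", "DET"), ("main", "ADJ"), ("shaft", "NOUN")]
print(triples(chunk_np(tagged)))  # -> [('the rotor', 'drives', 'the main shaft')]
```

A real system would additionally segment long patent sentences and map the extracted triples onto ontology concepts and relations, as the chapter describes.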

11 citations


Journal ArticleDOI
TL;DR: This paper presents a statistical POS tagger for the Somali language using different machine learning approaches (HMM and CRF) and a neural network model, and explores the use of word embeddings for Somali POS tagging.
Abstract: POS tagging serves as a preliminary task for many NLP applications. It refers to the process of classifying words into their parts of speech (also known as word classes or lexical categories). Somali is a member of the Cushitic languages with a limited number of NLP tools available. An accurate and reliable POS tagger is essential for many NLP tasks like shallow parsing, dependency parsing, sentiment analysis, and named entity recognition. In this paper, we present a statistical POS tagger for the Somali language using different machine learning approaches (HMM and CRF) and a neural network model. Our Somali POS tagger outperforms the state-of-the-art POS tagger with 87.51% on a tenfold cross-validation. The key contributions of this paper are (1) building a generic POS tagger, (2) comparing its performance with existing state-of-the-art techniques, and (3) exploring the use of word embeddings for Somali POS tagging.
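The HMM approach to POS tagging mentioned above can be sketched with a tiny Viterbi decoder; the tag set, training sentences, and unnormalised add-one smoothing below are illustrative assumptions, not the paper's Somali data or model.

```python
from collections import defaultdict

# Toy training data standing in for an annotated corpus.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]

def estimate(train):
    """Count transition and emission events from tagged sentences."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    tags = set()
    for sent in train:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            prev = tag
    return trans, emit, sorted(tags)

def viterbi(words, trans, emit, tags):
    """Most likely tag sequence under the HMM (scores left unnormalised
    for brevity; add-one smoothing avoids zero-probability paths)."""
    best = [{t: (trans["<s>"].get(t, 0) + 1) * (emit[t].get(words[0], 0) + 1)
             for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptr = {}, {}
        for t in tags:
            prev_t = max(tags, key=lambda p: best[-1][p] * (trans[p].get(t, 0) + 1))
            scores[t] = best[-1][prev_t] * (trans[prev_t].get(t, 0) + 1) \
                        * (emit[t].get(w, 0) + 1)
            ptr[t] = prev_t
        best.append(scores)
        back.append(ptr)
    # Trace back the best path.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

trans, emit, tags = estimate(train)
print(viterbi(["a", "dog", "sleeps"], trans, emit, tags))  # -> ['DET', 'NOUN', 'VERB']
```

A CRF tagger would replace the generative counts with discriminative feature weights over the same sequence structure.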

9 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This work proposes CNN-based models that incorporate the semantic information of Chinese characters and uses them for NER, showing an improvement over the baseline BERT-BiLSTM-CRF model.
Abstract: Most Named Entity Recognition (NER) systems use additional features like part-of-speech (POS) tags, shallow parsing, gazetteers, etc. Adding these external features to NER systems has been shown to have a positive impact. However, creating gazetteers or taggers can take a lot of time and may require extensive data cleaning. In this work, instead of using these traditional features, we use lexicographic features of Chinese characters. Chinese characters are composed of graphical components called radicals, and these components often carry semantic indicators. We propose CNN-based models that incorporate this semantic information and use them for NER. Our models show an improvement over the baseline BERT-BiLSTM-CRF model. We present one of the first studies on Chinese OntoNotes v5.0 and show an improvement of +0.64 F1 over the baseline. We present a state-of-the-art (SOTA) F1 score of 71.81 on the Weibo dataset, a competitive improvement of +0.72 over the baseline on the ResumeNER dataset, and a SOTA F1 score of 96.49 on the MSRA dataset.
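The radical idea can be illustrated with a minimal feature extractor: each character is mapped to its graphical radical, which is emitted alongside the character as input to a downstream model. The three-entry radical table below is a hypothetical stand-in for a real decomposition resource (e.g. a Kangxi radical lookup), not the paper's actual feature set.

```python
# Toy radical table: both 河 (river) and 湖 (lake) share the water radical 氵,
# while 你 (you) carries the person radical 亻.
RADICALS = {"河": "氵", "湖": "氵", "你": "亻"}

def char_features(sentence):
    """Per-character feature dicts combining the surface character,
    its radical, and its position in the sentence."""
    feats = []
    for i, ch in enumerate(sentence):
        feats.append({
            "char": ch,
            "radical": RADICALS.get(ch, ch),  # fall back to the char itself
            "position": i,
        })
    return feats

print(char_features("河你"))
```

In the paper these radical-level signals feed CNN layers rather than a flat feature dict; the sketch only shows why shared radicals give semantically related characters overlapping representations.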

7 citations


Book ChapterDOI
24 Jun 2020
TL;DR: A model for the generation of literary sentences in Spanish is proposed, based on statistical algorithms, shallow parsing, and the automatic detection of personality features of characters from well-known literary texts.
Abstract: The area of Computational Creativity has received much attention in recent years. Within this framework, we propose a model for the generation of literary sentences in Spanish, based on statistical algorithms, shallow parsing, and the automatic detection of personality features of characters from well-known literary texts. We present encouraging results from human inspection of the sentences generated by our methods.
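The statistical component of such generators is often a Markov-chain language model; the sketch below generates a sentence by following observed bigram continuations. The tiny English training string is an illustrative assumption, not the Spanish literary corpus the paper uses.

```python
import random
from collections import defaultdict

# Toy corpus standing in for the literary training texts.
corpus = "the night was dark . the sea was calm . the night was calm ."
tokens = corpus.split()

# Collect observed bigram continuations: word -> list of next words.
bigrams = defaultdict(list)
for a, b in zip(tokens, tokens[1:]):
    bigrams[a].append(b)

def generate(start, rng, max_len=8):
    """Follow random observed bigram continuations until '.' or max_len."""
    out = [start]
    while out[-1] != "." and len(out) < max_len:
        out.append(rng.choice(bigrams[out[-1]]))
    return " ".join(out)

print(generate("the", random.Random(0)))
```

The paper's model additionally filters candidate sentences through shallow parsing and character-personality features; this sketch covers only the statistical generation step.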

5 citations



Posted Content
01 Jan 2020
TL;DR: Three models of text generation are proposed, based mainly on statistical algorithms and shallow parsing, in the area of Computational Creativity (CC).
Abstract: In this work we present a state of the art in the area of Computational Creativity (CC). In particular, we address the automatic generation of literary sentences in Spanish. We propose three models of text generation based mainly on statistical algorithms and shallow parsing analysis. We also present some rather encouraging preliminary results.

2 citations


Posted Content
TL;DR: Considering the paucity of resources for code-mixed languages, CRF and HMM models are proposed for word-level language identification; the best-performing system is CRF-based with an F1 score of 0.91.
Abstract: In a multilingual or sociolingual setting, Intra-sentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays. Most people in the world know more than one language. CM usage is especially apparent on social media platforms. Moreover, ICS is particularly significant in the contexts of technology, health, and law, where conveying upcoming developments is difficult in one's native language. In applications like dialog systems, machine translation, semantic parsing, and shallow parsing, CM and Code Switching pose serious challenges. For any further advancement on code-mixed data, the necessary first step is language identification. In this paper, we present a study of various models (Naive Bayes classifier, Random Forest classifier, Conditional Random Field (CRF), and Hidden Markov Model (HMM)) for language identification on English-Telugu code-mixed data. Considering the paucity of resources for code-mixed languages, we propose CRF and HMM models for word-level language identification. Our best-performing system is CRF-based with an F1 score of 0.91.
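The Naive Bayes baseline compared above can be sketched from scratch as a word-level identifier over character bigrams; the tiny labelled word lists below are illustrative assumptions, not the English-Telugu corpus used in the paper.

```python
import math
from collections import defaultdict

# Toy labelled word lists standing in for the code-mixed training data.
train = {
    "en": ["hello", "where", "there", "when", "this"],
    "te": ["ela", "unnavu", "bagunnanu", "nuvvu", "emi"],
}

def char_bigrams(word):
    """Character bigrams with boundary markers, e.g. 'hi' -> ^h, hi, i$."""
    w = f"^{word}$"
    return [w[i:i + 2] for i in range(len(w) - 1)]

# Count bigram occurrences per language.
counts = {lang: defaultdict(int) for lang in train}
totals = {lang: 0 for lang in train}
for lang, words in train.items():
    for word in words:
        for bg in char_bigrams(word):
            counts[lang][bg] += 1
            totals[lang] += 1

def identify(word):
    """Pick the language maximising the add-one-smoothed log-likelihood."""
    vocab = len({bg for c in counts.values() for bg in c})
    def score(lang):
        return sum(math.log((counts[lang][bg] + 1) / (totals[lang] + vocab))
                   for bg in char_bigrams(word))
    return max(counts, key=score)

print(identify("hello"), identify("bagunnavu"))  # -> en te
```

A CRF, unlike this per-word classifier, would also condition on neighbouring words in the sentence, which is what gives the sequence models their edge on code-mixed text.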

1 citation


Book ChapterDOI
01 Jan 2020
TL;DR: It is found that training costs more time with the machine learning method when more features and more fine-grained tagging schemes are used, on all the corpora; nevertheless, tagging time is less affected by them.
Abstract: Text chunking, also known as shallow parsing, is an important task in natural language processing and very useful for other tasks. By means of discriminative machine learning methods and extensive experiments, this paper investigates the impact of different tagging schemes and feature types on chunking efficiency and effectiveness over corpora with different chunk specifications and languages. We find that training and tagging cost more time when the machine learning method uses more features and more fine-grained tagging schemes, on all the corpora; nevertheless, tagging time is less affected by them. Our investigation also reveals that the method with more features and more fine-grained tagging schemes performs better, but the chunk specification of the corpus may affect the choice.
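The tagging schemes compared in such studies differ in granularity; a minimal sketch contrasting plain BIO with the more fine-grained BIOES (the exact schemes in the chapter may differ) makes the distinction concrete. Spans are given as (start, end_exclusive, label).

```python
# BIO marks Begin/Inside/Outside; BIOES additionally distinguishes
# End-of-chunk and Single-token chunks, giving the tagger more classes.
def to_bio(n, spans):
    tags = ["O"] * n
    for s, e, label in spans:
        tags[s] = f"B-{label}"
        for i in range(s + 1, e):
            tags[i] = f"I-{label}"
    return tags

def to_bioes(n, spans):
    tags = ["O"] * n
    for s, e, label in spans:
        if e - s == 1:
            tags[s] = f"S-{label}"
        else:
            tags[s] = f"B-{label}"
            for i in range(s + 1, e - 1):
                tags[i] = f"I-{label}"
            tags[e - 1] = f"E-{label}"
    return tags

# "He reckons the current account deficit" -> NP, VP, NP chunks
spans = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(to_bio(6, spans))    # ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP']
print(to_bioes(6, spans))  # ['S-NP', 'S-VP', 'B-NP', 'I-NP', 'I-NP', 'E-NP']
```

The finer scheme multiplies the label set, which is one reason training time grows with tagging-scheme granularity, as the chapter reports.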


Book ChapterDOI
01 Jan 2020
TL;DR: A knowledge-poor machine learning technique that employs heuristic procedures on shallow morphological features to find equivalence classes for Hindi anaphora; the paper also discusses the ambiguities in Hindi spoken dialogues and the role of case markers (CMs) in resolving anaphors.
Abstract: Natural language processing needs the cognitive study of the human brain, mind, and intelligence to uncover new intricacies and theories and convert them into computational strategies and machine-learning-based applications. One such phenomenon of natural language processing is anaphora resolution (AR), which concerns pro-forms or pronouns that appear in context and refer back to an antecedent expression carrying the same meaning, which can be a noun phrase (NP) or a non-NP. AR is therefore the task of resolving a pronoun that has been introduced after its referent and conveys the same meaning. A reader or speaker bypasses formal linguistic analysis to understand an utterance, linking the pro-forms appearing in a dialogue to the actual referent through cognition. The productivity and performance of many natural language processing applications, such as tutoring systems, essay summarization systems, question answering systems, and machine translation, are affected by the unresolved hidden facts in pro-forms, which must be sorted out. AR is a nontrivial task: the human brain is smart enough to comprehend different writing and speaking styles, references, coreferences, idiomatic terms, etc., but computers lack real-world knowledge, and hence analyzing anaphora and their referents is problematic. Machine learning techniques have made it possible to understand and handle such an exigent task as AR. The paper presents a knowledge-poor machine learning technique that employs heuristic procedures on shallow morphological features to find equivalence classes for Hindi anaphora. The authors believe that it is a novel procedure, as no dictionaries or named entity recognizers have been used, and the investigation has been carried out solely using shallow parsing, with no information available about the animacy of entities.
The paper describes a probabilistic method that integrates filtration rules for mention detection with algorithms for identifying the actual antecedents of intrasentential and intersentential anaphora. The filtration rules check the potency of mentions as candidate antecedents and remove them if irrelevant. The resolution algorithms for pronominal and non-pronominal anaphors in Hindi texts give approximate solutions, as they are based on heuristics. The antecedents of mentions may be multiple, distributive, or phrasal in nature and may span across utterances in the discourse; hence, prior to resolution, such mentions are marked and categorized. The paper discusses the ambiguities in Hindi spoken dialogues and the role of case markers (CMs) in resolving anaphors. The limited attributes of nouns and pronouns are treated as constraint features incorporated in the algorithms, namely CMs, number, and the distance between the anaphor and the candidate NP. The approach defines its own module to categorize inanimate NPs without using any semantic knowledge. Further, the authors discuss the contribution of these features to the overall result. The final evaluation of the system has been conducted using standard metrics, and the results have been compared with other resolution approaches on the same datasets. Since the approach is not domain-based, it can easily be applied to new domains that contain dialogues. The effectiveness of the results is supported by generating a mapping table for each utterance that satisfactorily depicts human cognition.
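The constraint-feature idea (number agreement plus distance) can be sketched as a toy heuristic resolver; the candidates, weights, and English example below are illustrative assumptions, and the chapter's actual system additionally uses Hindi case markers and its own mention-filtration rules.

```python
# Toy heuristic pronoun resolution: score candidate NPs by number
# agreement and distance, and pick the highest-scoring one.
def resolve(pronoun, candidates):
    """candidates: list of (np, number, distance_in_words) tuples."""
    def score(cand):
        np_, number, distance = cand
        s = 0.0
        if number == pronoun["number"]:
            s += 2.0             # agreement constraint dominates
        s -= 0.1 * distance      # prefer nearer antecedents
        return s
    return max(candidates, key=score)[0]

pronoun = {"text": "they", "number": "plural"}
candidates = [("the students", "plural", 12),
              ("the teacher", "singular", 4),
              ("the books", "plural", 7)]
print(resolve(pronoun, candidates))  # -> the books
```

Filtration rules would run before this scoring step, discarding mentions that cannot be antecedents at all, so that scoring only ranks plausible candidates.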

Book ChapterDOI
26 Feb 2020
TL;DR: The article presents a method for extracting predicate-argument constructions characterizing the composition of the structural elements of inventions and the relationships between them; the results suggest that the proposed method is promising.
Abstract: The article presents a method for extracting predicate-argument constructions characterizing the composition of the structural elements of inventions and the relationships between them. The extracted structures are converted into a domain ontology and used in prior-art patent search and information support of automated invention. An analysis of existing natural language processing (NLP) tools for the processing of Russian-language patents has been carried out. A new method for extracting structured data from patents is proposed that takes into account the specifics of patent texts and is based on shallow parsing and sentence segmentation. The value of the F1 metric for a rigorous estimate of data extraction is 63%, and for a lax estimate, 79%. The results obtained suggest that the proposed method is promising.
