
Showing papers on "Shallow parsing" published in 2021


Journal ArticleDOI
TL;DR: This paper describes the linguistic work on developing annotation guidelines, the manual corpus annotation, the preparation of the neural models used for chunking - the first ones for the Polish language - and the evaluation of these models.

1 citation


Book ChapterDOI
20 Sep 2021
TL;DR: This article demonstrates an opposite approach: ontology-based entailment of words combined with simple shallow parsing rules, which increases the UAS metric from 0.82 for SpaCy to 0.834 for the authors' approach.
Abstract: The common approach to the analysis of natural texts assumes that semantic analysis follows the parsing stage. However, medical texts are known to be very complicated and written in a highly specific language, and traditional parsers show relatively poor performance on them. In this article, we demonstrate an opposite approach: ontology-based entailment of words combined with simple shallow parsing rules. This allows us to increase the UAS metric from 0.82 for SpaCy to 0.834 for our approach.
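
As a rough illustration of the evaluation metric quoted above, here is a minimal sketch of how UAS (unlabeled attachment score) is typically computed for dependency parses; the head-index representation is a common convention, not something taken from this paper.

    def unlabeled_attachment_score(gold_heads, predicted_heads):
        """UAS: fraction of tokens whose predicted head matches the gold head."""
        assert len(gold_heads) == len(predicted_heads)
        correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
        return correct / len(gold_heads)

    # Heads are token indices (0 = root), one entry per token.
    gold = [2, 0, 2, 5, 3]
    pred = [2, 0, 2, 5, 5]
    print(unlabeled_attachment_score(gold, pred))  # 0.8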

Proceedings ArticleDOI
12 Aug 2021
TL;DR: In this paper, the authors present an online API to access a number of Natural Language Processing services developed at KTH, including tokenization, part-of-speech tagging, shallow parsing, compound word analysis, word inflection, lemmatization, spelling error detection and correction, grammar checking, and more.
Abstract: We present an online API to access a number of Natural Language Processing services developed at KTH. The services work on Swedish text. They include tokenization, part-of-speech tagging, shallow parsing, compound word analysis, word inflection, lemmatization, spelling error detection and correction, grammar checking, and more. The services can be accessed in several ways, including a RESTful interface, direct socket communication, and premade Web forms. The services are open to anyone. The source code is also freely available, making it possible to set up another server or run the tools locally. We have also evaluated the performance of several of the services and compared them to other available systems. Both precision and recall for the Granska grammar checker are higher than for Microsoft Word and Google Docs. The evaluation also shows that recall improves greatly when all the grammar checking services in the API are combined, compared to any single method, and the API makes combining services easy.
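
As a hedged sketch of how such a RESTful service might be called from Python, consider the following; the endpoint URL and parameter names are illustrative assumptions, not the actual KTH API.

    import requests

    # Hypothetical endpoint and parameter names -- consult the real API docs.
    BASE_URL = "https://example.org/nlp-api"

    def tag_text(text):
        """Send Swedish text to a (hypothetical) part-of-speech tagging endpoint."""
        response = requests.post(f"{BASE_URL}/pos", data={"text": text})
        response.raise_for_status()
        return response.json()

    print(tag_text("Jag läser en bok."))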

Posted Content
TL;DR: This article showed that the linguistic observation on pauses can be used to improve accuracy in machine-learnt language understanding tasks and applied pause duration to enrich contextual embeddings to improve shallow parsing of entities.
Abstract: Entity tags in human-machine dialog are integral to natural language understanding (NLU) tasks in conversational assistants. However, current systems struggle to accurately parse spoken queries with the typical use of text input alone, and often fail to understand the user intent. Previous work in linguistics has identified a cross-language tendency for longer speech pauses surrounding nouns as compared to verbs. We demonstrate that this linguistic observation on pauses can be used to improve accuracy in machine-learnt language understanding tasks. Analysis of pauses in French and English utterances from a commercial voice assistant shows a statistically significant difference in pause duration at multi-token entity span boundaries compared to within entity spans. Additionally, in contrast to text-based NLU, we apply pause duration to enrich contextual embeddings to improve shallow parsing of entities. Results show that our proposed novel embeddings reduce the relative error rate by up to 8% consistently across three domains for French, without any added annotation or alignment costs to the parser.
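
One plausible realization of the pause-enriched embeddings the abstract describes is to append a normalized pause-duration feature to each contextual token embedding; the concatenation scheme and dimensions below are assumptions, not the paper's exact architecture.

    import numpy as np

    def enrich_with_pauses(token_embeddings, pause_durations):
        """Append a log-scaled pause-duration feature to each token embedding.

        token_embeddings: (num_tokens, dim) array from a pretrained encoder.
        pause_durations: seconds of silence following each token (0.0 if none).
        """
        pauses = np.log1p(np.asarray(pause_durations, dtype=np.float32)).reshape(-1, 1)
        return np.concatenate([token_embeddings, pauses], axis=1)

    # Example: 3 tokens with 4-dim embeddings; a longer pause near a span boundary.
    emb = np.random.rand(3, 4).astype(np.float32)
    print(enrich_with_pauses(emb, [0.05, 0.42, 0.0]).shape)  # (3, 5)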

Posted Content
TL;DR: In this paper, the authors introduce a shallow parsing task for which training data is relatively cheap to create, with the aim of learning a lexicon for automated compliance checking, and train a sequence tagger that achieves a 79.93 F1-score on the test set.
Abstract: Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such training data has led to research that focuses on small sub-tasks, such as shallow parsing or the extraction of a limited subset of rules. This study introduces a shallow parsing task for which training data is relatively cheap to create, with the aim of learning a lexicon for ACC. We annotate a small domain-specific dataset of 200 sentences, SPaR.txt, and train a sequence tagger that achieves a 79.93 F1-score on the test set. We then show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents, and that both contiguous and discontiguous Multi-Word Expressions (MWE) are discovered with reasonable accuracy (70.3%).
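
To make the sequence tagging task concrete, here is a minimal sketch of decoding BIO tags into term spans, the kind of output such a tagger produces; the tag scheme and example sentence are illustrative assumptions, not the SPaR.txt annotation scheme itself.

    def bio_to_spans(tokens, tags):
        """Decode BIO tags into (start, end, text) spans for tagged terms."""
        spans, start = [], None
        for i, tag in enumerate(tags):
            if tag == "B":  # beginning of a new term
                if start is not None:
                    spans.append((start, i, " ".join(tokens[start:i])))
                start = i
            elif tag == "O":  # outside any term
                if start is not None:
                    spans.append((start, i, " ".join(tokens[start:i])))
                    start = None
            # "I" continues the current term
        if start is not None:
            spans.append((start, len(tokens), " ".join(tokens[start:])))
        return spans

    tokens = ["Fire", "doors", "must", "be", "self-closing"]
    tags = ["B", "I", "O", "O", "B"]
    print(bio_to_spans(tokens, tags))
    # [(0, 2, 'Fire doors'), (4, 5, 'self-closing')]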
