Showing papers by "Sandra Maria Aluísio" published in 2018


Journal Article
TL;DR: This work aimed to identify alterations in the macrolinguistic aspects of discourse using a new computational tool to establish the therapeutic intervention in cognitive impairment and dementia.

32 citations


Book Chapter
24 Sep 2018
TL;DR: This work compiled SIMPLEX-PB, the first available corpus of lexical simplification (LS) for Brazilian Portuguese, and makes available a benchmark for evaluating the most well-known LS methods on the authors' dataset.
Abstract: Lexical Simplification replaces words or expressions with synonyms that can be understood by a larger number of people. It is very common to have in mind a target audience that will benefit from the task, such as children, low-literacy audiences, and others. In recent years there has been great activity in this field of research, especially for English, but also for other languages such as Japanese, and for multilingual and cross-lingual scenarios. Few works have children as the target audience. Currently, in Brazil, the Programa Nacional do Livro Didatico (PNLD) is an initiative with a broad impact on education, as it aims to choose, acquire, and distribute free textbooks to students in public elementary schools. In this scenario, adapting the level of complexity of a text to the reading ability of a student is a determinant of his/her improvement and of whether he/she reaches the level of reading comprehension expected for that school year. However, no publicly available resources on lexical simplification for Portuguese exist yet. Therefore, the development of this material is urgent and welcome. This work compiled SIMPLEX-PB, the first available corpus of lexical simplification for Brazilian Portuguese. We also make available a benchmark for evaluating the most well-known LS methods on our dataset.

11 citations


Proceedings Article
01 Aug 2018
TL;DR: A nontrivial sentence corpus in Portuguese is generated, taking advantage of a parallel corpus of simplification, in which each sentence triplet is aligned and has simplification operations annotated, being ideal for justifying possible mistakes of future methods.
Abstract: Effective textual communication depends on readers being proficient enough to comprehend texts, and on texts being clear enough to be understood by the intended audience in a reading task. When the meaning of textual information and instructions is not well conveyed, many losses and damages may occur. Among the solutions to alleviate this problem is the automatic evaluation of sentence readability, a task that has been receiving a lot of attention due to its wide applicability. However, a shortage of resources, such as corpora for training and evaluation, hinders the full development of this task. In this paper, we generate a nontrivial sentence corpus in Portuguese. We evaluate three scenarios for building it, taking advantage of a parallel corpus of simplification, in which each sentence triplet is aligned and has simplification operations annotated, being ideal for justifying possible mistakes of future methods. The best scenario of our corpus PorSimplesSent is composed of 4,888 pairs, which is larger than a similar corpus for English; all three versions of it are publicly available. We created four baselines for PorSimplesSent and made available a pairwise ranking method, using 17 linguistic and psycholinguistic features, which correctly identifies the ranking of sentence pairs with an accuracy of 74.2%.

10 citations


Book Chapter
24 Sep 2018
TL;DR: This work proposes a new model for NLI that achieves an \(F_1\) score of 0.72 on ASSIN, setting a new state of the art, and shows that word embeddings and syntactic knowledge are both important to achieve such results.
Abstract: Natural Language Inference (NLI) is the task of detecting relations such as entailment, contradiction and paraphrase in pairs of sentences. With the recent release of the ASSIN corpus, NLI in Portuguese is now getting more attention. However, published results on ASSIN have neither explored syntactic structure nor combined word embedding metrics with other types of features. In this work, we sought to fill this gap, proposing a new model for NLI that achieves an \(F_1\) score of 0.72 on ASSIN, setting a new state of the art. Our feature analysis shows that word embeddings and syntactic knowledge are both important to achieve such results.

8 citations


Journal Article
TL;DR: In this paper, the authors investigated the impact of the automatic sentence segmentation method DeepBond on nine syntactic complexity metrics extracted from transcripts of healthy elderly (CTL) and mild cognitive impairment (MCI) patients.
Abstract: In recent years, Mild Cognitive Impairment (MCI) has received a great deal of attention, as it may represent a pre-clinical state of Alzheimer's disease (AD). To distinguish between healthy elderly (CTL) and MCI patients, automated discourse analysis tools have been applied to narrative transcripts in English and in Brazilian Portuguese. However, the absence of sentence boundary segmentation in transcripts prevents the direct application of methods that rely on these marks for the correct use of tools, such as taggers and parsers. To our knowledge, there are only a few studies evaluating automatic sentence segmentation in transcripts of neuropsychological tests. The purpose of this study is to investigate the impact of the automatic sentence segmentation method DeepBond on nine syntactic complexity metrics extracted from transcripts of CTL and MCI patients.

1 citation


Book Chapter
24 Sep 2018
TL;DR: This work presents a method to segment the transcripts into sentences and another to detect the disfluencies present in them, to serve as a preprocessing step for the application of subsequent NLP tools.
Abstract: Natural Language Processing (NLP) tools aiming at the diagnosis of language-impairing dementias generally extract several textual metrics from narrative transcripts. However, the absence of sentence boundary segmentation in transcripts prevents the direct application of NLP methods that rely on these marks to work properly, such as taggers and parsers. We present a method to segment the transcripts into sentences and another to detect the disfluencies present in them, to serve as a preprocessing step for the application of subsequent NLP tools. Our methods use recurrent convolutional neural networks with prosodic and morphosyntactic features and word embeddings. We evaluated both tasks intrinsically, analyzing the most important features, comparing the proposed methods to simpler ones, and identifying the main hits and misses. In addition, a final method was created to combine all tasks, and it was evaluated extrinsically using 9 syntactic metrics of Coh-Metrix-Dementia. In the intrinsic evaluations, we showed that our method achieved (i) state-of-the-art results for the sentence segmentation task on impaired speech, and (ii) results that are similar to those of related works for the English language on disfluency detection tasks. Regarding the extrinsic evaluation, only 3 metrics showed a statistically significant difference between manual MCI transcripts and those generated by our method, suggesting that our method is capable of preprocessing transcriptions to be further analyzed by NLP tools.

1 citation


Proceedings Article
TL;DR: This work introduces MilkQA, a question answering dataset from the dairy domain dedicated to the study of consumer questions, and studies the behavior of four answer selection models on it: two baseline models and two convolutional neural network architectures.
Abstract: We introduce MilkQA, a question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service. Our dataset was filtered and anonymized by three human annotators. Consumer questions are a challenging kind of question that is usually employed as a form of seeking information. Although several question answering datasets are available, most of these resources are not suitable for research on answer selection models for consumer questions. We aim to fill this gap by making MilkQA publicly available. We study the behavior of four answer selection models on MilkQA: two baseline models and two convolutional neural network architectures. Our results show that MilkQA poses real challenges to computational models, particularly due to the linguistic characteristics of its questions and their unusually long lengths. Only one of the tested models gives reasonable results, at the cost of high computational requirements.

1 citation