
Showing papers by "Sandra Maria Aluísio" published in 2012


Proceedings Article
01 May 2012
TL;DR: A new annotation guide, called Propbank-Br, has been generated to deal with specific language phenomena and parser problems, and the annotation of a Brazilian Portuguese Treebank with semantic role labels following Propbank guidelines is reported.
Abstract: This paper reports the annotation of a Brazilian Portuguese Treebank with semantic role labels following Propbank guidelines. A different language and a different parser output impact the task and require some decisions on how to annotate the corpus. Therefore, a new annotation guide, called Propbank-Br, has been generated to deal with specific language phenomena and parser problems. In this phase of the project, the corpus was annotated by a single linguist. The annotation task reported here is part of a larger project for the Brazilian Portuguese language. This project aims to build Brazilian verb frame files and a broader, distributed annotation of semantic role labels in Brazilian Portuguese, allowing inter-annotator agreement measures. The corpus, available on the web, is already being used to build a semantic tagger for Portuguese.

44 citations


Journal ArticleDOI
01 Dec 2012-EPL
TL;DR: In this paper, the authors represented pieces of text with different levels of simplification in co-occurrence networks and found that topological regularity correlated negatively with textual complexity, and in less complex texts the distance between concepts, represented as nodes, tended to decrease.
Abstract: Methods from statistical physics, such as those involving complex networks, have been increasingly used in the quantitative analysis of linguistic phenomena. In this paper, we represented pieces of text with different levels of simplification in co-occurrence networks and found that topological regularity correlated negatively with textual complexity. Furthermore, in less complex texts the distance between concepts, represented as nodes, tended to decrease. The complex network metrics were treated with multivariate pattern recognition techniques, which allowed us to distinguish between original texts and their simplified versions. For each original text, two simplified versions were generated manually with an increasing number of simplification operations. As expected, distinction was easier for the strongly simplified versions, where the most relevant metrics were node strength, shortest paths and diversity. Also, the discrimination of complex texts was improved with higher hierarchical network metrics, thus pointing to the usefulness of considering wider contexts around the concepts. Though the accuracy rate in the distinction was not as high as in methods using deep linguistic knowledge, the complex network approach is still useful for a rapid screening of texts whenever assessing complexity is essential to guarantee accessibility to readers with limited reading ability.
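
As a rough illustration of two of the metrics named in this abstract (node strength and shortest paths), the following minimal sketch builds a word co-occurrence network from a token list. The function names, the `window` parameter and the choice to link only words within a fixed distance are assumptions made for this example; the abstract does not describe the authors' preprocessing or their exact metric definitions.

```python
from collections import defaultdict, deque


def cooccurrence_network(tokens, window=1):
    """Link words that occur at most `window` positions apart (assumed rule).

    Returns an undirected weighted graph as {word: {neighbour: weight}}.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:  # skip self-loops
                weights[word][other] += 1
                weights[other][word] += 1
    return {word: dict(nbrs) for word, nbrs in weights.items()}


def node_strength(graph, node):
    """Sum of the weights of all edges attached to `node`."""
    return sum(graph.get(node, {}).values())


def average_shortest_path(graph):
    """Mean hop distance over all reachable ordered pairs (unweighted BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in graph[current]:
                if neighbour not in dist:
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


graph = cooccurrence_network("the cat sat on the mat".split())
print(node_strength(graph, "the"), average_shortest_path(graph))
```

Comparing these values between an original text and its simplified version is the kind of signal the abstract feeds into pattern recognition; the classification step itself is not sketched here.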

39 citations


Proceedings Article
01 May 2012
TL;DR: MAZEA (Multi-label Argumentative Zoning for English Abstracts), a multi-label classifier which automatically identifies rhetorical moves in abstracts but allows for a given sentence to be assigned as many labels as appropriate is presented.
Abstract: The relevance of automatically identifying rhetorical moves in scientific texts has been widely acknowledged in the literature. This study focuses on abstracts of standard research papers written in English and aims to tackle a fundamental limitation of current machine-learning classifiers: they are mono-labeled, that is, a sentence can only be assigned one single label. However, such an approach does not adequately reflect actual language use since a move can be realized by a clause, a sentence, or even several sentences. Here, we present MAZEA (Multi-label Argumentative Zoning for English Abstracts), a multi-label classifier which automatically identifies rhetorical moves in abstracts but allows for a given sentence to be assigned as many labels as appropriate. We have resorted to various other NLP tools and used two large training corpora: (i) one corpus consists of 645 abstracts from physical sciences and engineering (PE) and (ii) the other corpus is made up of 690 abstracts from life and health sciences (LH). This paper presents our preliminary results and also discusses the various challenges involved in multi-label tagging and works towards satisfactory solutions. In addition, we also make our two training corpora publicly available so that they may serve as a benchmark for this new task.
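
The difference between mono-label and multi-label assignment described in this abstract can be shown with a toy sketch. This is not MAZEA's method: the label names, the cue phrases in `CUES` and the hit-counting score are all hypothetical and stand in for whatever trained classifier is actually used.

```python
# Hypothetical cue phrases per rhetorical move (assumed for illustration only).
CUES = {
    "background": ("has been", "widely"),
    "purpose": ("we present", "aims to"),
    "method": ("we used", "corpus"),
    "result": ("results", "showed"),
}


def cue_hits(sentence):
    """Count how many cue phrases of each label occur in the sentence."""
    text = sentence.lower()
    return {label: sum(cue in text for cue in cues) for label, cues in CUES.items()}


def mono_label(sentence):
    """Mono-label tagging: keep only the single best-scoring label."""
    hits = cue_hits(sentence)
    return max(hits, key=hits.get)  # ties resolve to the first label in CUES


def multi_label(sentence, min_hits=1):
    """Multi-label tagging: keep every label whose score reaches `min_hits`."""
    hits = cue_hits(sentence)
    return [label for label, count in hits.items() if count >= min_hits]


sentence = "We present a classifier and report preliminary results."
print(mono_label(sentence))
print(multi_label(sentence))
```

For a sentence that both states a purpose and reports results, the mono-label function is forced to drop one of the two moves, while the multi-label function keeps both.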

23 citations


Proceedings ArticleDOI
15 Oct 2012
TL;DR: The purpose of this paper is to present an innovative architecture of an MCAT for real users, as a Web application, and to discuss the theoretical and methodological development of such MCAT through a new approach named here Computer-based Multidimensional Adaptive Testing (CBMAT).
Abstract: Given a set of items, a Multidimensional Computer Adaptive Test (MCAT) selects those items from the bank according to the estimated abilities of the student, resulting in an individualized test. MCATs seek to maximize the test's accuracy, based on multiple abilities examined simultaneously (unlike a Computer Adaptive Test - CAT - which evaluates a single ability), using the sequence of items previously answered. Although MCATs have been very well studied from a statistical point of view, there is no computational system that covers all the steps needed for their appropriate use, such as: the use of a calibrated item bank, proposal of initial and stopping criteria for the test, criteria for estimating the ability of the examinee and criteria to select items. The purpose of this paper is twofold: (i) to present an innovative architecture of an MCAT for real users, as a Web application, and (ii) to discuss the theoretical and methodological development of such MCAT, through a new approach named here Computer-based Multidimensional Adaptive Testing (CBMAT). The proof of concept of CBMAT was an implementation called Multidimensional Adaptive Test System for Educational Purposes (MADEPT). In simulations, MADEPT proved to be a computer system suitable for applications with real users, secure, accurate and portable.
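
To make the item-selection step concrete, here is a minimal sketch for two ability dimensions. It assumes a multidimensional two-parameter logistic model and picks the item that maximizes the determinant of the accumulated information matrix (a D-optimality rule). Both choices are assumptions for illustration; the abstract does not say which response model or selection criterion CBMAT or MADEPT uses, and all names below are hypothetical.

```python
import math


def prob_correct(theta, a, d):
    """Assumed model: P(correct) = 1 / (1 + exp(-(a . theta + d)))."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))


def item_information(theta, a, d):
    """Fisher information matrix of one item: P(1 - P) * a a^T."""
    p = prob_correct(theta, a, d)
    w = p * (1.0 - p)
    return [[w * ai * aj for aj in a] for ai in a]


def add_matrices(x, y):
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def select_next_item(theta, bank, administered, prior_info):
    """Return the index of the unused item with the largest determinant gain."""
    info = [row[:] for row in prior_info]
    for idx in administered:
        info = add_matrices(info, item_information(theta, *bank[idx]))
    best_idx, best_score = None, float("-inf")
    for idx, (a, d) in enumerate(bank):
        if idx in administered:
            continue
        score = det2(add_matrices(info, item_information(theta, a, d)))
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Hypothetical item bank: (discrimination vector a, intercept d) per item.
bank = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.1), 0.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed prior information
print(select_next_item((0.0, 0.0), bank, {0}, identity))
```

After an item that measures only the first ability has been given, the rule prefers the item loading on the second ability, since it adds information where the current estimate is weakest.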

8 citations


01 Jan 2012
TL;DR: The proposed model integrates different language levels, providing both bottom-up and top-down analysis approaches; it currently works on Part of Speech (PoS) analysis and is based on Perceptron Neural Networks.
Abstract: In this work we propose a multi-level language analyser, currently working on Part of Speech (PoS) analysis, and based on Perceptron Neural Networks. The proposed model integrates different language levels, providing both bottom-up and top-down analysis approaches. The PoS implementation is evaluated using Minimum Square Errors and set measures suitable for the modelling. A syntactic dependency analyser is also proposed. The result is a huge network, in which features are lexemes, lemmas and tags.
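
A perceptron that tags words from lexeme and tag features can be sketched as follows. This is a plain multiclass perceptron with two feature types (the word form and the previous tag), written only to illustrate the general idea; it is not the multi-level network proposed in the paper, it omits lemma features, and the class name, feature templates and toy training data are assumptions.

```python
class PerceptronTagger:
    """Minimal multiclass perceptron PoS tagger (illustrative sketch)."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = {}  # (feature, tag) -> weight

    @staticmethod
    def features(words, i, prev_tag):
        """Assumed feature templates: current word form and previous tag."""
        return [f"word={words[i].lower()}", f"prev_tag={prev_tag}"]

    def predict(self, feats):
        """Pick the tag with the highest summed weight (ties: first tag)."""
        def score(tag):
            return sum(self.weights.get((f, tag), 0.0) for f in feats)
        return max(self.tags, key=score)

    def train(self, sentences, epochs=5):
        """Standard perceptron update: reward the gold tag, penalize the guess."""
        for _ in range(epochs):
            for words, gold in sentences:
                prev = "<s>"
                for i, gold_tag in enumerate(gold):
                    feats = self.features(words, i, prev)
                    guess = self.predict(feats)
                    if guess != gold_tag:
                        for f in feats:
                            self.weights[(f, gold_tag)] = self.weights.get((f, gold_tag), 0.0) + 1.0
                            self.weights[(f, guess)] = self.weights.get((f, guess), 0.0) - 1.0
                    prev = gold_tag  # gold history is used during training

    def tag(self, words):
        """Tag left to right, feeding each predicted tag to the next step."""
        prev, output = "<s>", []
        for i in range(len(words)):
            prev = self.predict(self.features(words, i, prev))
            output.append(prev)
        return output


# Toy training data, invented for this example.
data = [
    ("the dog barks".split(), ["DET", "NOUN", "VERB"]),
    ("a cat sleeps".split(), ["DET", "NOUN", "VERB"]),
]
tagger = PerceptronTagger(["DET", "NOUN", "VERB"])
tagger.train(data)
print(tagger.tag("the cat barks".split()))
```

The previous-tag feature is what lets the sketch tag a word sequence it has not seen as a whole, which loosely mirrors the abstract's point that tags themselves serve as features in the network.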