Author

Diana Santos

Bio: Diana Santos is an academic researcher at SINTEF whose work focuses on Web pages and the Internet. She has an h-index of 5 and has co-authored 6 publications that have received 52 citations.

Papers
01 Jan 2004
TL;DR: This paper documents the replication of the experiments presented in Aires et al. (2004) using all relevant Weka algorithms, providing more information on the linguistic features used and on the issues concerning algorithm choice.
Abstract: In order to improve Web information retrieval, we investigated in previous work (Aires et al., 2004) the use of stylistic features of Web texts in Portuguese to classify Web pages according to users' needs, using in most of the experiments the classification algorithm J48 (the Weka implementation of C4.5). From that study, we concluded that it was possible to identify some of the categories reliably, but that we should investigate whether even better classification schemes could be obtained with other algorithms. Language is a different domain, and the fact that C4.5 has been used successfully in other applications (even others dealing with written language) does not imply that it is also the best solution for our problem. In this paper, we document the replication of the experiments presented in Aires et al. (2004) using all relevant Weka algorithms, also providing more information on the linguistic features used and on the issues concerning algorithm choice.
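A minimal sketch of this kind of algorithm comparison, assuming scikit-learn in place of Weka: the CART decision tree stands in for J48 (Weka's C4.5 implementation), and the feature matrix is synthetic stand-in data rather than real stylistic features.

```python
# Hedged sketch: compare several classifiers on a stylistic-feature matrix,
# in the spirit of the Weka-based replication described above. scikit-learn's
# CART decision tree is a rough analogue of J48 (C4.5), not the same algorithm.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a table of stylistic features extracted from Web pages.
X, y = make_classification(n_samples=500, n_features=46, n_informative=12,
                           n_classes=4, random_state=0)

candidates = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": SVC(),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name:>25}: mean accuracy {scores.mean():.3f}")
```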

10 citations

01 Jan 2004
TL;DR: This paper presents an overview of Linguateca's activity in creating and making available resources and tools for the Portuguese language.
Abstract: In this article we present an overview of Linguateca's activity in creating and making available resources and tools for the Portuguese language. We begin with a description of Linguateca's objectives and assumptions and a brief history of its intervention, and we close with some considerations on the best way to move forward in organizing the field.

8 citations

01 Jan 2005
TL;DR: This paper looks into the use of 46 linguistic features to classify texts according to genres and text types, and employs the same features to train a classifier that decides which possible user need(s) a Web page may satisfy.
Abstract: In this paper we investigate the hypothesis that classification of Web pages according to general user intentions is feasible and useful. As a preliminary study, we look into the use of 46 linguistic features to classify texts according to genres and text types; we then employ the same features to train a classifier that decides which possible user need(s) a Web page may satisfy. We also report on experiments that customize search systems with the same set of features, training a classifier that helps users discriminate among their specific needs. Finally, we describe some user input that makes us confident in the utility of the approach.
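Since a page may satisfy several user needs at once, the problem is naturally multi-label: one binary decision per need over the same feature vector. A hedged sketch of that setting, assuming scikit-learn and synthetic data in place of the paper's 46 linguistic features:

```python
# Hedged sketch: multi-label classification where each "user need" is a
# separate binary target over the same 46-dimensional feature vector.
# The data is synthetic; the paper's actual features and labels differ.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=400, n_features=46,
                                      n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# One binary classifier per user need, sharing the same feature vector.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("subset accuracy:", clf.score(X_te, Y_te))
```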

6 citations


Cited by
Journal Article
TL;DR: Diagnostic models based on thickness outperformed those based on regional volumes across a variety of classification methods.

229 citations

Journal Article
TL;DR: This work supports the use of data mining as an exploratory tool, particularly as the domain is suffering from a data explosion due to enhanced monitoring and the (potential) storage of this data in the electronic health record.

161 citations

Proceedings Article
31 Jan 2017
TL;DR: This paper proposes a deep-learning-based technique to address the challenges of spam drift and information fabrication in Twitter spam detection, and finds that it largely outperforms existing methods.
Abstract: Twitter spam has long been a critical but difficult problem to address. So far, researchers have developed a series of machine-learning-based methods and blacklisting techniques to detect spamming activities on Twitter. According to our investigation, current methods and techniques achieve an accuracy of around 80%. However, due to the problems of spam drift and information fabrication, these machine-learning-based methods cannot efficiently detect spam activities in real-life scenarios. Moreover, blacklisting cannot keep up with the variations of spamming activities, as manually inspecting suspicious URLs is extremely time-consuming. In this paper, we propose a novel technique based on deep learning to address the above challenges. The syntax of each tweet is learned through the WordVector training mode, and a binary classifier is then constructed on top of the resulting representations. In our experiments, we collected a 10-day dataset of real tweets to evaluate the proposed method. We first studied the performance of different classifiers, and then compared our method to other existing text-based methods, finding that ours largely outperformed them. We further compared our method to non-text-based detection techniques; according to the experimental results, our method was more accurate.
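A minimal sketch of the pipeline the abstract describes, under stated assumptions: gensim's Word2Vec stands in for the paper's "WordVector training mode", tweets are represented as the mean of their word vectors, and logistic regression stands in for the paper's binary classifier. The tweets below are invented toy data.

```python
# Hedged sketch: learn word vectors from tweet text, average them into a
# fixed-length tweet representation, and train a binary spam classifier on
# top. Word2Vec and logistic regression are assumptions standing in for the
# paper's exact components; the data is a toy example.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

tweets = [
    ("win a free iphone click this link now", 1),     # spam
    ("free followers instantly click here", 1),       # spam
    ("had a great time at the conference today", 0),  # non-spam
    ("reading a paper on corpus linguistics", 0),     # non-spam
]
tokenized = [text.split() for text, _ in tweets]
labels = np.array([label for _, label in tweets])

# Learn small word vectors from the (toy) tweet corpus.
w2v = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, seed=0)

def tweet_vector(tokens):
    """Represent a tweet as the mean of its word vectors."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.stack([tweet_vector(t) for t in tokenized])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # sanity check on the training tweets
```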

111 citations

Proceedings Article
01 May 2018
TL;DR: This work presents the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages, and discusses the updated sentence-level approach for the strict removal of duplicated content.
Abstract: In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming at a size comparable to the state of the art in other languages. We also discuss our updated sentence-level approach for the strict removal of duplicated content. Following the pipeline methodology, more than 60 million pages were crawled and filtered, with 3.5 million being selected. The resulting multi-domain corpus, named brWaC, is composed of 2.7 billion tokens and has been annotated with tagging and parsing information. The incidence of non-unique long sentences, an indication of replicated content, which reaches 9% in other Web corpora, was reduced to only 0.5%. Domain diversity was also maximized, with 120,000 different websites contributing content. We are making our new resource freely available to the research community, both for querying and downloading, in the expectation of aiding new advances in the processing of Brazilian Portuguese.
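A minimal sketch of sentence-level deduplication in this spirit, assuming a simple fingerprinting scheme; the length threshold and normalization are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch: fingerprint long sentences and drop copies already seen
# elsewhere in the collection, treating repeated long sentences as
# replicated content. MIN_TOKENS and the normalization are invented.
import hashlib

MIN_TOKENS = 15  # only long sentences are treated as duplication signals

def fingerprint(sentence: str) -> str:
    """Stable fingerprint of a whitespace- and case-normalized sentence."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def dedup_sentences(pages):
    """Yield (page_id, sentence) pairs, skipping repeated long sentences."""
    seen = set()
    for page_id, sentences in pages:
        for sentence in sentences:
            if len(sentence.split()) >= MIN_TOKENS:
                fp = fingerprint(sentence)
                if fp in seen:
                    continue  # replicated content: drop this copy
                seen.add(fp)
            yield page_id, sentence

# Toy example: the same legal boilerplate appears on two different pages.
boilerplate = ("Todos os direitos reservados a este portal de noticias "
               "e aos seus parceiros comerciais em todo o territorio nacional .")
pages = [("page-a", [boilerplate]), ("page-b", [boilerplate])]
print(list(dedup_sentences(pages)))  # only the first copy survives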

59 citations

Paulo Rocha, Diana Santos
01 Jan 2000
TL;DR: This paper reports on the creation of CETEMPublico, the largest publicly available corpus of Portuguese to date, containing 180 million words and created to boost research in language engineering for Portuguese.
Abstract: This paper reports on the creation of CETEMPublico, the largest publicly available corpus of Portuguese to date, containing 180 million words, created to boost research in language engineering for Portuguese. After providing some background on its creation, we focus on the processing required, explaining in detail some of the options taken, namely: the division of articles into extracts; their random reordering and numbering in the final corpus; the marking of structural units such as sentence separation, titles and author identification; the use of a partial system for content classification; and the distribution methods.
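A toy sketch of the extract-splitting, reordering and numbering steps the abstract mentions; the tag names, extract size and sentences below are illustrative assumptions, not the actual CETEMPublico format.

```python
# Hedged sketch: split articles into fixed-size extracts, shuffle and number
# the extracts, and wrap each in simple SGML-like markup. All names and
# sizes here are invented for illustration.
import random

def to_extracts(articles, sentences_per_extract=2):
    """Split each article (a list of sentences) into fixed-size extracts."""
    for sentences in articles:
        for i in range(0, len(sentences), sentences_per_extract):
            yield sentences[i:i + sentences_per_extract]

articles = [
    ["Primeira frase do artigo um.", "Segunda frase.", "Terceira frase."],
    ["Unica frase do artigo dois."],
]

extracts = list(to_extracts(articles))
random.Random(0).shuffle(extracts)  # random reordering breaks article order

for number, extract in enumerate(extracts, start=1):
    print(f'<ext n="{number}">')
    for sentence in extract:
        print(f"<s> {sentence} </s>")
    print("</ext>")
```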

55 citations