Author

Diana Santos

Bio: Diana Santos is an academic researcher at SINTEF whose work focuses on Web pages and the Internet. She has an h-index of 5 and has co-authored 6 publications that have received 52 citations.

Papers
01 Jan 2004
TL;DR: This paper documents the replication of the experiments presented in Aires et al. (2004) using all relevant Weka algorithms, providing more information on the linguistic features used and on the issues concerning algorithm choice.
Abstract: In order to improve Web information retrieval, we investigated in previous work (Aires et al., 2004) the use of stylistic features of Web texts in Portuguese to classify Web pages according to users' needs, using in most of the experiments the classification algorithm J48 (the Weka implementation of C4.5). From that study, we concluded that it was possible to identify some of the categories reliably, but that we should investigate whether even better classification schemes could be obtained with other algorithms. Language is a different domain, and the fact that C4.5 has been used successfully in other applications (even others dealing with written language) does not imply that it is also the best solution for our problem. In this paper, we document the replication of the experiments presented in Aires et al. (2004) using all relevant Weka algorithms, also providing more information on the linguistic features used and on the issues concerning algorithm choice.
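A minimal sketch of this kind of algorithm comparison, assuming scikit-learn in place of Weka: the CART decision tree stands in for J48 (Weka's C4.5 implementation), and the feature matrix is synthetic stand-in data rather than real stylistic features.

```python
# Hedged sketch: compare several classifiers on a stylistic-feature matrix,
# in the spirit of the Weka-based replication described above. scikit-learn's
# CART decision tree is a rough analogue of J48 (C4.5), not the same algorithm.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a table of stylistic features extracted from Web pages.
X, y = make_classification(n_samples=500, n_features=46, n_informative=12,
                           n_classes=4, random_state=0)

candidates = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": SVC(),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name:>25}: mean accuracy {scores.mean():.3f}")
```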

10 citations

01 Jan 2004
TL;DR: This paper presents an overview of Linguateca's activity in creating and making available resources and tools for the Portuguese language.
Abstract: In this article we present an overview of Linguateca's activity in creating and making available resources and tools for the Portuguese language. We begin with a description of Linguateca's objectives and assumptions and a brief history of its intervention, and we close with some considerations on the best way to move forward in organizing the field.

8 citations

01 Jan 2005
TL;DR: This paper looks into the use of 46 linguistic features to classify texts according to genres and text types, and employs the same features to train a classifier that decides which possible user need(s) a Web page may satisfy.
Abstract: In this paper we investigate the hypothesis that classification of Web pages according to general user intentions is feasible and useful. As a preliminary study, we look into the use of 46 linguistic features to classify texts according to genres and text types; we then employ the same features to train a classifier that decides which possible user need(s) a Web page may satisfy. We also report on experiments that customize search systems with the same set of features, training a classifier that helps users discriminate among their specific needs. Finally, we describe some user input that makes us confident in the utility of the approach.
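Since a page may satisfy several user needs at once, the problem is naturally multi-label: one binary decision per need over the same feature vector. A hedged sketch of that setting, assuming scikit-learn and synthetic data in place of the paper's 46 linguistic features:

```python
# Hedged sketch: multi-label classification where each "user need" is a
# separate binary target over the same 46-dimensional feature vector.
# The data is synthetic; the paper's actual features and labels differ.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=400, n_features=46,
                                      n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# One binary classifier per user need, sharing the same feature vector.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("subset accuracy:", clf.score(X_te, Y_te))
```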

6 citations


Cited by
Journal Article
TL;DR: Diagnostic models based on thickness outperformed those based on regional volumes across a variety of classification methods.

229 citations

Journal Article
TL;DR: This work supports the use of data mining as an exploratory tool, particularly as the domain is suffering from a data explosion due to enhanced monitoring and the (potential) storage of this data in the electronic health record.

161 citations

Proceedings Article
31 Jan 2017
TL;DR: This paper proposes a deep-learning-based technique to address the challenges of spam drift and information fabrication in Twitter spam detection, and finds that it largely outperforms existing methods.
Abstract: Twitter spam has long been a critical but difficult problem to address. So far, researchers have developed a series of machine-learning-based methods and blacklisting techniques to detect spamming activities on Twitter. According to our investigation, current methods and techniques achieve an accuracy of around 80%. However, due to the problems of spam drift and information fabrication, these machine-learning-based methods cannot efficiently detect spam activities in real-life scenarios. Moreover, blacklisting cannot keep up with the variations of spamming activities, as manually inspecting suspicious URLs is extremely time-consuming. In this paper, we propose a novel technique based on deep learning to address the above challenges. The syntax of each tweet is learned through the WordVector training mode, and a binary classifier is then constructed on top of the resulting representations. In our experiments, we collected a 10-day dataset of real tweets to evaluate the proposed method. We first studied the performance of different classifiers, and then compared our method to other existing text-based methods, finding that ours largely outperformed them. We further compared our method to non-text-based detection techniques; according to the experimental results, our method was more accurate.
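A minimal sketch of the pipeline the abstract describes, under stated assumptions: gensim's Word2Vec stands in for the paper's "WordVector training mode", tweets are represented as the mean of their word vectors, and logistic regression stands in for the paper's binary classifier. The tweets below are invented toy data.

```python
# Hedged sketch: learn word vectors from tweet text, average them into a
# fixed-length tweet representation, and train a binary spam classifier on
# top. Word2Vec and logistic regression are assumptions standing in for the
# paper's exact components; the data is a toy example.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

tweets = [
    ("win a free iphone click this link now", 1),     # spam
    ("free followers instantly click here", 1),       # spam
    ("had a great time at the conference today", 0),  # non-spam
    ("reading a paper on corpus linguistics", 0),     # non-spam
]
tokenized = [text.split() for text, _ in tweets]
labels = np.array([label for _, label in tweets])

# Learn small word vectors from the (toy) tweet corpus.
w2v = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, seed=0)

def tweet_vector(tokens):
    """Represent a tweet as the mean of its word vectors."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.stack([tweet_vector(t) for t in tokenized])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # sanity check on the training tweets
```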

111 citations

Proceedings Article
01 May 2018
TL;DR: This work presents the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages, and discusses the updated sentence-level approach for the strict removal of duplicated content.
Abstract: In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming at a size comparable to the state of the art in other languages. We also discuss our updated sentence-level approach for the strict removal of duplicated content. Following the pipeline methodology, more than 60 million pages were crawled and filtered, with 3.5 million being selected. The resulting multi-domain corpus, named brWaC, is composed of 2.7 billion tokens and has been annotated with tagging and parsing information. The incidence of non-unique long sentences, an indication of replicated content, which reaches 9% in other Web corpora, was reduced to only 0.5%. Domain diversity was also maximized, with 120,000 different websites contributing content. We are making our new resource freely available to the research community, both for querying and downloading, in the expectation of aiding new advances in the processing of Brazilian Portuguese.
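A minimal sketch of sentence-level deduplication in this spirit, assuming a simple fingerprinting scheme; the length threshold and normalization are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch: fingerprint long sentences and drop copies already seen
# elsewhere in the collection, treating repeated long sentences as
# replicated content. MIN_TOKENS and the normalization are invented.
import hashlib

MIN_TOKENS = 15  # only long sentences are treated as duplication signals

def fingerprint(sentence: str) -> str:
    """Stable fingerprint of a whitespace- and case-normalized sentence."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def dedup_sentences(pages):
    """Yield (page_id, sentence) pairs, skipping repeated long sentences."""
    seen = set()
    for page_id, sentences in pages:
        for sentence in sentences:
            if len(sentence.split()) >= MIN_TOKENS:
                fp = fingerprint(sentence)
                if fp in seen:
                    continue  # replicated content: drop this copy
                seen.add(fp)
            yield page_id, sentence

# Toy example: the same legal boilerplate appears on two different pages.
boilerplate = ("Todos os direitos reservados a este portal de noticias "
               "e aos seus parceiros comerciais em todo o territorio nacional .")
pages = [("page-a", [boilerplate]), ("page-b", [boilerplate])]
print(list(dedup_sentences(pages)))  # only the first copy survives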

59 citations

Paulo Rocha, Diana Santos
01 Jan 2000
TL;DR: This paper reports on the creation of CETEMPublico, the largest publicly available corpus of Portuguese to date, containing 180 million words and created to boost research in language engineering for Portuguese.
Abstract: This paper reports on the creation of CETEMPublico, the largest publicly available corpus of Portuguese to date, containing 180 million words, created to boost research in language engineering for Portuguese. After providing some background on its creation, we focus on the processing required, explaining in detail some of the options taken, namely: the division of articles into extracts; their random reordering and numbering in the final corpus; the marking of structural units such as sentence separation, titles and author identification; the use of a partial system for content classification; and the distribution methods.
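A toy sketch of the extract-splitting, reordering and numbering steps the abstract mentions; the tag names, extract size and sentences below are illustrative assumptions, not the actual CETEMPublico format.

```python
# Hedged sketch: split articles into fixed-size extracts, shuffle and number
# the extracts, and wrap each in simple SGML-like markup. All names and
# sizes here are invented for illustration.
import random

def to_extracts(articles, sentences_per_extract=2):
    """Split each article (a list of sentences) into fixed-size extracts."""
    for sentences in articles:
        for i in range(0, len(sentences), sentences_per_extract):
            yield sentences[i:i + sentences_per_extract]

articles = [
    ["Primeira frase do artigo um.", "Segunda frase.", "Terceira frase."],
    ["Unica frase do artigo dois."],
]

extracts = list(to_extracts(articles))
random.Random(0).shuffle(extracts)  # random reordering breaks article order

for number, extract in enumerate(extracts, start=1):
    print(f'<ext n="{number}">')
    for sentence in extract:
        print(f"<s> {sentence} </s>")
    print("</ext>")
```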

55 citations