Home
/
Authors
/
Tiago A. Almeida

Author

Tiago A. Almeida

Other affiliations: State University of Campinas, University of Lisbon

Bio: Tiago A. Almeida is an academic researcher from Federal University of São Carlos. The author has contributed to research in topics: Spambot & Naive Bayes classifier. The author has an hindex of 19, co-authored 61 publications receiving 1153 citations. Previous affiliations of Tiago A. Almeida include State University of Campinas & University of Lisbon.

Topics: Spambot, Naive Bayes classifier, Spamdexing, Bag-of-words model, Forum spam ...read more

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2013
2012
2011
2010
2009
2007
2006
2005

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Contributions to the study of SMS spam filtering: new collection and results

[...]

Tiago A. Almeida¹, José María Gómez Hidalgo, Akebo Yamakami¹•Institutions (1)

State University of Campinas¹

19 Sep 2011

TL;DR: A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared.

...read moreread less

Abstract: The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.

...read moreread less

369 citations

Proceedings Article•DOI•

TubeSpam: Comment Spam Filtering on YouTube

[...]

Tulio C. Alberto, Johannes V. Lochter, Tiago A. Almeida

01 Dec 2015

TL;DR: The statistical analysis of results indicate that, with 99.9% of confidence level, decision trees, logistic regression, Bernoulli Naive Bayes, random forests, linear and Gaussian SVMs are statistically equivalent for comment spam filtering on YouTube.

...read moreread less

Abstract: The profitability promoted by Google in its brand new video distribution platform YouTube has attracted an increasing number of users. However, such success has also attracted malicious users, which aim to self-promote their videos or disseminate viruses and malwares. Since YouTube offers limited tools for comment moderation, the spam volume is shockingly increasing which lead owners of famous channels to disable the comments section in their videos. Automatic comment spam filtering on YouTube is a challenge even for established classification methods, since the messages are very short and often rife with slangs, symbols and abbreviations. In this work, we have evaluated several top-performance classification techniques for such purpose. The statistical analysis of results indicate that, with 99.9% of confidence level, decision trees, logistic regression, Bernoulli Naive Bayes, random forests, linear and Gaussian SVMs are statistically equivalent. Based on this, we have also offered the TubeSpam -- an accurate online system to filter comments posted on YouTube.

...read moreread less

109 citations

Journal Article•DOI•

Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers

[...]

Tiago A. Almeida¹, Jurandy Almeida¹, Akebo Yamakami¹•Institutions (1)

State University of Campinas¹

01 Feb 2011-Journal of Internet Services and Applications

TL;DR: This paper studies the performance of many term-selection techniques with several different models of Naive Bayes spam filters, and investigates the benefits of using the Matthews correlation coefficient as a measure of performance.

...read moreread less

Abstract: E-mail spam has become an increasingly important problem with a big economic impact in society. Fortunately, there are different approaches allowing to automatically detect and remove most of those messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed for avoiding the curse of dimensionality. Nevertheless, it is still unclear how the performance of Naive Bayes spam filters depends on the scheme applied for reducing the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were diligently designed to ensure statistically sound results. Moreover, we perform an analysis concerning the measurements usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance.

...read moreread less

86 citations

Towards SMS Spam Filtering: Results under a New Dataset

[...]

Tiago A. Almeida¹, JosÃ© MarÃa GÃ³mez Hidalgo, Tiago Pasqualini Silva¹•Institutions (1)

Federal University of São Carlos¹

31 Mar 2013

TL;DR: The results indicate that the procedure followed to build the collection does not lead to near-duplicates and, regarding the classifiers, the Support Vector Machines outperforms other evaluated techniques and, hence, it can be used as a good baseline for further comparison.

...read moreread less

Abstract: The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is dramatically increasing year by year. In practice, fighting such plague is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. Probably, one of the major concerns in academic settings is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, traditional content-based filters may have their performance seriously degraded since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we offer a comprehensive analysis of such dataset in order to ensure that there are no duplicated messages coming from previously existing datasets, since it may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. Im summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and, regarding the classifiers, the Support Vector Machines outperforms other evaluated techniques and, hence, it can be used as a good baseline for further comparison.

...read moreread less

84 citations

Book Chapter•DOI•

Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results

[...]

Rafael de Araujo Arosa Monteiro¹, Roney Lira de Sales Santos¹, Thiago Alexandre Salgueiro Pardo¹, Tiago A. Almeida², Evandro Eduardo Seron Ruiz¹, Oto Araújo Vale² - Show less +2 more•Institutions (2)

University of São Paulo¹, Federal University of São Carlos²

24 Sep 2018

TL;DR: The first reference corpus in this area for Portuguese is introduced, composed of aligned true and fake news, which is analyzed to uncover some of their linguistic characteristics, showing that good results may be achieved.

...read moreread less

Abstract: Fake news are a problem of our time. They may influence a large number of people on a wide range of subjects, from politics to health. Although they have always existed, the volume of fake news has recently increased due to the soaring number of users of social networks and instant messengers. These news may cause direct losses to people and corporations, as fake news may include defamation of people, products and companies. Moreover, the scarcity of labeled datasets, mainly in Portuguese, prevents training classifiers to automatically filter such documents. In this paper, we investigate the issue for the Portuguese language. Inspired by previous initiatives for other languages, we introduce the first reference corpus in this area for Portuguese, composed of aligned true and fake news, which we analyze to uncover some of their linguistic characteristics. Then, using machine learning techniques, we run some automatic detection methods in this corpus, showing that good results may be achieved.

...read moreread less

82 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Machine learning

[...]

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

...read moreread less

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

...read moreread less

13,246 citations

Journal Article•

Data Mining Practical Machine Learning Tools and Techniques

[...]

อนิรุธ สืบสิงห์

01 Jan 2014-Journal of management science

9,185 citations

The Self-Organizing Map

[...]

Teuvo Kohonen¹•Institutions (1)

Helsinki University of Technology¹

01 Jan 1990

TL;DR: An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article, where the authors present an overview of their work.

...read moreread less

Abstract: An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article.

...read moreread less

2,933 citations

C4.5: Programs for Machine Learning (書評)

[...]

重郎金田

01 May 1995

1,164 citations

Journal Article•DOI•

Bidirectional LSTM with attention mechanism and convolutional layer for text classification

[...]

Gang Liu¹, Jiabao Guo¹•Institutions (1)

Hubei University of Technology¹

14 Apr 2019-Neurocomputing

TL;DR: A novel and unified architecture which contains a bidirectional LSTM (BiLSTM), attention mechanism and the convolutional layer is proposed in this paper, which outperforms other state-of-the-art text classification methods in terms of the classification accuracy.

...read moreread less

581 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse