Filtering Template Driven Spam Mails using Vector Space Models

doi:10.5120/4891-7383

Citations


Journal Article

Performance Evaluation of Data Mining based Classifier for Classification of Spam E-Mail

[...]

Manish Kumar Sahu

27 Apr 2017-International Journal for Research in Applied Science and Engineering Technology

TL;DR: This research work recommends the Multilayer perceptron (MLP) as the best classifier for spam classification, giving 93.15% accuracy with 10-fold cross validation.


Abstract: E-mail is one of the important and economical communication media for transferring information from one person to another. The growing number of e-mails has led to a drastic increase in spam e-mail. In this research work, we used various classification techniques to classify spam and non-spam e-mails. The experiments were done in the Tanagra data mining tool. We recommend the Multilayer perceptron (MLP) as the best classifier for spam classification; it gives 93.15% accuracy with 10-fold cross validation.
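The 10-fold cross validation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the Tanagra procedure used by the authors; the function names `k_fold_indices` and `cross_validate`, and the `train_fn` / `predict_fn` callbacks, are hypothetical. It assumes the samples are already shuffled, since the folds here are simply contiguous slices.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Return the mean accuracy over k folds; each fold is held out once."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        # Train on the other k-1 folds only (hypothetical callback).
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

Each sample is used for testing exactly once and for training k-1 times, so a figure such as the reported 93.15% is an average over the ten held-out folds rather than a score on a single split.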


Cites methods from "Filtering Template Driven Spam Mail..."

...(2012) [4] have also proposed a new hybrid model using VSM and Rocchio for classification of spam e-mail....
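For readers unfamiliar with the two techniques named in this snippet, the following Python sketch shows a generic centroid-based Rocchio classifier over a vector space model (VSM). It is not the hybrid model of the cited paper, whose details are not given here. The raw term-frequency weighting, the whitespace tokenizer, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(texts, labels):
    """Average the document vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter(labels)
    for text, label in zip(texts, labels):
        sums[label].update(tf_vector(text))
    return {label: {term: w / counts[label] for term, w in vec.items()}
            for label, vec in sums.items()}


def classify(centroids, text):
    """Assign the class whose centroid is most cosine-similar to the text."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


centroids = train_centroids(
    ["win cash prize now", "cheap cash offer now",
     "meeting agenda for monday", "project meeting notes"],
    ["spam", "spam", "ham", "ham"],
)
print(classify(centroids, "cash prize offer"))
```

A practical filter would normally use TF-IDF weights and a proper tokenizer; plain counts are used here only to keep the sketch short.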
[...]

References


Journal Article

Text document clustering based on neighbors

[...]

Congnan Luo¹, Yanjun Li², Soon M. Chung³

Teradata¹, Fordham University², Wright State University³

01 Nov 2009

TL;DR: This paper proposes to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids.


Abstract: Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported to perform well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366], to document clustering. If two documents are similar enough, they are considered neighbors of each other, and the link between two documents represents the number of their common neighbors. Instead of considering only the pairwise similarity, the neighbors and link bring global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.
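The neighbor and link concepts described in this abstract can be sketched in a few lines of Python. The abstract only says that the new similarity measure combines the cosine and link functions, so the weighted sum with `alpha`, the normalisation by the number of documents, and the choice to count a document as its own neighbor are all assumptions of this sketch, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_sets(docs, theta):
    """Documents i and j are neighbors when cosine(i, j) >= theta.

    Every non-zero document is its own neighbor (assumed convention).
    """
    n = len(docs)
    return [{j for j in range(n) if cosine(docs[i], docs[j]) >= theta}
            for i in range(n)]


def link(neighbors, i, j):
    """Number of neighbors that documents i and j have in common."""
    return len(neighbors[i] & neighbors[j])


def combined_similarity(docs, neighbors, i, j, alpha=0.5):
    """Assumed combination: weighted sum of normalised link and cosine."""
    max_link = len(docs)  # upper bound on any link value
    return (alpha * link(neighbors, i, j) / max_link
            + (1 - alpha) * cosine(docs[i], docs[j]))


# Toy term vectors: documents 0 and 2 share no terms but share neighbor 1.
docs = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
neighbors = neighbor_sets(docs, theta=0.7)
print(cosine(docs[0], docs[2]), link(neighbors, 0, 2))
```

In the toy data, documents 0 and 2 have zero cosine similarity yet a non-zero link through document 1, which is the kind of global information that pairwise similarity alone misses.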


116 citations

Journal Article

Overview of textual anti-spam filtering techniques

[...]

Thamarai Subramaniam, Hamid A. Jalab, Alaa Y. Taqa¹

University of Malaya¹

04 Oct 2010-International Journal of Physical Sciences

TL;DR: Most common techniques used for anti-spam filtering by analyzing the e-mail content are summarized and machine learning algorithms such as Naive Bayesian, support vector machine and neural network that have been adopted to detect and control spam are looked into.


Abstract: Electronic mail (E-mail) is an essential communication tool that has been greatly abused by spammers to disseminate unwanted information (messages) and spread malicious content to Internet users. Current Internet technologies have further accelerated the distribution of spam. Effective controls need to be deployed to counter the ever-growing spam problem. Machine learning provides better protective mechanisms that are able to control spam. This paper summarizes the most common techniques used for anti-spam filtering by analyzing the e-mail content, and also looks into machine learning algorithms such as Naive Bayesian, support vector machine and neural network that have been adopted to detect and control spam. Each machine learning algorithm has its own strengths and limitations; as such, appropriate preprocessing needs to be carefully considered to increase the effectiveness of any given algorithm. Key words: Anti-spam filters, text categorization, electronic mail (E-mail), machine learning.


48 citations

Book Chapter

Data Mining of Text Files

[...]

Angel R. Martinez

01 Jan 2005-Handbook of Statistics

TL;DR: This chapter looks at approaches beyond the word and sentence level, such as vector space models for information retrieval, latent semantic indexing, and a new approach based on a bigram proximity matrix.


Abstract: The goal of this chapter is to present textual data mining from a broad perspective, in addition to discussing several methods in computational statistics that can be applied to this area. I begin by discussing natural language processing at the word and sentence level, since a textual data mining system that seeks to discover knowledge requires methods that will capture and represent the semantic content of the text units. This section includes descriptions of hidden Markov models, probabilistic context-free grammars, and various supervised and unsupervised methods for word sense disambiguation. Next, I look at approaches beyond the word and sentence level, such as vector space models for information retrieval, latent semantic indexing, and a new approach based on a bigram proximity matrix. I conclude with a brief description of self-organizing maps.


2 citations
