Journal ArticleDOI

Filtering Template Driven Spam Mails using Vector Space Models

29 Feb 2012-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 39, Iss: 14, pp 33-35
TL;DR: The main objective in this paper is to find out semantic distance and evaluate the applicability of the two information retrieval techniques, Simple Vector Space Models (VSM) and VSM using Rocchio Classification in the spam context.
Abstract: Spam has become a major problem for society. Some spammers use templates to send spam: to send a particular promotion, they create a template and merge the recipients' details into it. Similarities can be found among these mails and used to ignore forthcoming spam. Most high-volume spam is sent using tools that randomize parts of the message, such as the subject, body, and sender address. The general form of the template that the spammer is using can often be guessed by inspecting the features of the messages. Most spam filters are either rule-based models or Bayesian models. The main objective of this paper is to find the semantic distance and evaluate the applicability of two information retrieval techniques, simple Vector Space Models (VSM) and VSM using Rocchio classification, in the spam context. Both methods use cosine similarity to identify spam.
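The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the raw term-frequency weighting, the 0.7 threshold, the helper names, and the sample messages are all assumptions made for illustration. The simple VSM check compares a new message against each known spam message, while the Rocchio-style check compares it against one averaged centroid per class.

```python
import math
from collections import Counter


def vectorize(text):
    """Map a message to a sparse term-frequency vector (lowercased word counts)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of sparse vectors into one class prototype (Rocchio centroid)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def simple_vsm_is_spam(message, known_spam, threshold=0.7):
    """Simple VSM: flag a message if it is close to any known spam message.

    The threshold value is an assumption, not a figure from the paper.
    """
    vec = vectorize(message)
    return any(cosine(vec, vectorize(s)) >= threshold for s in known_spam)


def rocchio_label(message, spam_centroid, ham_centroid):
    """Rocchio-style rule: assign the class whose centroid is most similar."""
    vec = vectorize(message)
    if cosine(vec, spam_centroid) >= cosine(vec, ham_centroid):
        return "spam"
    return "ham"


# Made-up training messages: two fills of one hypothetical spam template.
spam = [
    "dear alice you won a free prize claim your prize now",
    "dear bob you won a free prize claim your prize now",
]
ham = [
    "meeting moved to friday please confirm the agenda",
    "attached is the report for the friday meeting",
]
spam_centroid = centroid([vectorize(m) for m in spam])
ham_centroid = centroid([vectorize(m) for m in ham])

new_message = "dear carol you won a free prize claim your prize now"
print(simple_vsm_is_spam(new_message, spam))
print(rocchio_label(new_message, spam_centroid, ham_centroid))
```

Because a template changes only a few tokens per recipient, the cosine between two fills of the same template stays close to 1, which is why both checks can catch the new message here. A real filter would likely use TF-IDF weights and a tuned threshold.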


Citations
Journal ArticleDOI
TL;DR: This research work recommends the multilayer perceptron (MLP) as the best classifier for spam classification, giving 93.15% accuracy with 10-fold cross-validation.
Abstract: E-mail is one of the important and economical communication media for transferring information from one person to another. The increase in the number of e-mails has resulted in a drastic increase in spam e-mail. In this research work, we have used various classification techniques to classify spam and non-spam e-mails. The experiments were done in the Tanagra data mining tool. We recommend the multilayer perceptron (MLP) as the best classifier for spam classification, giving 93.15% accuracy with 10-fold cross-validation.

Cites methods from "Filtering Template Driven Spam Mail..."

  • ...(2012) [4] have also proposed a new hybrid model using VSM and Racchio for classification of spam e-mail....

    [...]

References
Journal ArticleDOI
01 Nov 2009
TL;DR: This paper proposes to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids.
Abstract: Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. And the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link involve the global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.
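The neighbor and link concepts described in this abstract can be sketched as follows. This is an illustrative sketch only: the similarity threshold `theta`, the weight `alpha`, the linear blend of normalized link and cosine, the choice to count each document as its own neighbor, and the toy vectors are all assumptions, not details taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def neighbor_sets(docs, theta):
    """For each document, collect the indices of documents with cosine >= theta.

    Each document ends up in its own neighbor set (an assumption of this sketch).
    """
    n = len(docs)
    return [
        {j for j in range(n) if cosine(docs[i], docs[j]) >= theta}
        for i in range(n)
    ]


def link(neighbors, i, j):
    """Link between documents i and j: the number of neighbors they share."""
    return len(neighbors[i] & neighbors[j])


def combined_similarity(docs, neighbors, i, j, alpha=0.5):
    """Assumed blend: alpha * normalized link + (1 - alpha) * cosine."""
    max_link = len(docs)  # two documents can share at most every document
    normalized_link = link(neighbors, i, j) / max_link
    return alpha * normalized_link + (1 - alpha) * cosine(docs[i], docs[j])


# Toy document vectors: the first three are near-duplicates, the fourth is not.
docs = [
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
neighbors = neighbor_sets(docs, theta=0.8)
print(link(neighbors, 0, 1), link(neighbors, 0, 3))
print(combined_similarity(docs, neighbors, 0, 1))
```

The link term adds information that a pairwise cosine alone does not carry: two documents score higher when the documents around them also agree.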

116 citations

Journal ArticleDOI
TL;DR: Most common techniques used for anti-spam filtering by analyzing the e-mail content are summarized and machine learning algorithms such as Naive Bayesian, support vector machine and neural network that have been adopted to detect and control spam are looked into.
Abstract: Electronic mail (E-mail) is an essential communication tool that has been greatly abused by spammers to disseminate unwanted information (messages) and spread malicious content to Internet users. Current Internet technologies have further accelerated the distribution of spam. Effective controls need to be deployed to counter the ever-growing spam problem. Machine learning provides better protective mechanisms that are able to control spam. This paper summarizes the most common techniques used for anti-spam filtering by analyzing the e-mail content and also looks into machine learning algorithms such as Naive Bayesian, support vector machine and neural network that have been adopted to detect and control spam. Each machine learning algorithm has its own strengths and limitations; as such, appropriate preprocessing needs to be carefully considered to increase the effectiveness of any given algorithm. Key words: Anti-spam filters, text categorization, electronic mail (E-mail), machine learning.

48 citations

Book ChapterDOI
TL;DR: This chapter looks at approaches beyond the word and sentence level, such as vector space models for information retrieval, latent semantic indexing, and a new approach based on a bigram proximity matrix.
Abstract: The goal of this chapter is to present textual data mining from a broad perspective, in addition to discussing several methods in computational statistics that can be applied to this area. I begin by discussing natural language processing at the word and sentence level, since a textual data mining system that seeks to discover knowledge requires methods that will capture and represent the semantic content of the text units. This section includes descriptions of hidden Markov models, probabilistic context-free grammars, and various supervised and unsupervised methods for word sense disambiguation. Next, I look at approaches beyond the word and sentence level, such as vector space models for information retrieval, latent semantic indexing, and a new approach based on a bigram proximity matrix. I conclude with a brief description of self-organizing maps.

2 citations