Filtering Template Driven Spam Mails using Vector Space Models
TLDR
The main objective in this paper is to find out semantic distance and evaluate the applicability of the two information retrieval techniques, Simple Vector Space Models (VSM) and VSM using Rocchio Classification in the spam context.
Abstract:
Spam has become a major problem for society. Some spammers use templates to send spam: to send a particular promotion, they create a template and merge the recipients' details into it. Similarities can be found among these mails and used to easily ignore forthcoming spam. Most high-volume spam is sent using tools that randomize parts of the message, such as the subject, body, and sender address. The general form of the template that the spammer is using can often be guessed by inspecting the features of the messages. Most spam filters are either rule-based models or Bayesian models. The main objective of this paper is to measure semantic distance and evaluate the applicability of two information retrieval techniques, simple Vector Space Models (VSM) and VSM with Rocchio classification, in the spam context. Both methods use cosine similarity to identify the spam ...
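The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the raw term-count weighting (rather than, say, tf-idf), the toy messages, and all function names below are assumptions chosen for illustration. The simple VSM variant labels a message after its most similar training message, while the Rocchio variant compares the message against one centroid vector per class; both rank by cosine similarity.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-count vector (term -> count); whitespace tokenizing is an assumption."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors (the Rocchio class prototype)."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def vsm_classify(text, labelled):
    """Simple VSM: return the label of the most cosine-similar training message."""
    vec = tf_vector(text)
    _, label = max(labelled, key=lambda pair: cosine(vec, pair[0]))
    return label


def rocchio_classify(text, centroids):
    """Rocchio: return the label whose class centroid is most cosine-similar."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus: two messages from one spam template, two legitimate mails.
spam = ["dear alice claim your free prize now",
        "dear bob claim your free prize today"]
ham = ["meeting moved to friday afternoon",
       "please review the attached project report"]

labelled = [(tf_vector(m), "spam") for m in spam] + [(tf_vector(m), "ham") for m in ham]
centroids = {
    "spam": centroid([tf_vector(m) for m in spam]),
    "ham": centroid([tf_vector(m) for m in ham]),
}

incoming = "dear carol claim your free prize now"
print(vsm_classify(incoming, labelled), rocchio_classify(incoming, centroids))
```

Because a template fixes most of the wording and only the merged fields vary, messages from the same template share most of their terms, so their cosine similarity to earlier instances (or to the spam centroid) stays high even when the recipient name changes.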