
Journal Article

Filtering Template Driven Spam Mails using Vector Space Models

29 Feb 2012-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 39, Iss: 14, pp 33-35

TL;DR: The main objective of this paper is to measure semantic distance and evaluate the applicability of two information retrieval techniques, simple Vector Space Models (VSM) and VSM using Rocchio classification, in the spam context.

Abstract: Spam has become a major problem for society. Some spammers use templates to send spam: to send a particular promotion, they create a template and merge the recipients' details into it. Similarities can be found among these mails, so forthcoming spam can easily be ignored. Most high-volume spam is sent using tools that randomize parts of the message, such as the subject, body, and sender address. The general form of the template that the spammer is using can often be guessed by inspecting the features of the messages. Most spam filters are either rule-based or Bayesian models. The main objective of this paper is to measure semantic distance and evaluate the applicability of two information retrieval techniques, simple Vector Space Models (VSM) and VSM using Rocchio classification, in the spam context. Both methods use cosine similarity to identify spam.
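The simple VSM idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, raw term-frequency weighting, the function names, and the 0.8 threshold are all assumptions made for illustration. Each message becomes a bag-of-words vector, and a new message is flagged when its cosine similarity to any previously seen spam message is high.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split).

    Assumption: raw term counts; the paper's weighting is not given here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity_to_known_spam(message, known_spam):
    """Highest cosine similarity between the message and any stored spam."""
    vec = tf_vector(message)
    return max(
        (cosine_similarity(vec, tf_vector(s)) for s in known_spam),
        default=0.0,
    )


def is_template_spam(message, known_spam, threshold=0.8):
    """Flag the message if it is close to any known spam message.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return max_similarity_to_known_spam(message, known_spam) >= threshold
```

Two mails generated from the same template differ only in the merged fields, so most terms coincide and the cosine similarity stays close to 1, while unrelated mail shares few terms and scores near 0.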

Topics: Spamming (77%), Spambot (68%), Forum spam (67%), Bag-of-words model (60%)


Citations

Journal Article
TL;DR: This research work recommends the multilayer perceptron (MLP) as the best classifier for spam classification, giving 93.15% accuracy with 10-fold cross-validation.
Abstract: E-mail is an important and economical communication medium for transferring information from one person to another. The growing number of e-mails has resulted in a drastic increase in spam e-mail. In this research work, we used various classification techniques to classify spam and non-spam e-mails. The experiments were done in the Tanagra data mining tool. We recommend the multilayer perceptron (MLP) as the best classifier for spam classification; it gives 93.15% accuracy with 10-fold cross-validation.

Cites methods from "Filtering Template Driven Spam Mail..."

  • ...(2012) [4] have also proposed a new hybrid model using VSM and Rocchio for classification of spam e-mail....

    [...]


References

Book
01 Jan 2008
Abstract: Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. Written from a computer science perspective by three leading experts in the field, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Although originally designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also create a buzz for researchers and professionals alike.

11,798 citations


Journal Article
TL;DR: An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.
Abstract: In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.

6,281 citations


"Filtering Template Driven Spam Mail..." refers methods in this paper

  • ...The results show that VSM using the Rocchio classification scheme performs better than the simple VSM scheme....

    [...]

  • ...VSM using Rocchio classification is much faster than simple VSM because fewer iterations are required....

    [...]

  • ...In information retrieval, a vector space model (VSM) [1] is a widely used model for representing information....

    [...]

  • ...The simple VSM model is efficient at finding the exact spam template....

    [...]

  • ...In that way simple VSM can perform better than Rocchio classification....

    [...]

    [...]
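The excerpts above contrast simple VSM with Rocchio classification. A minimal sketch of the Rocchio (nearest-centroid) idea follows, under stated assumptions: raw term-frequency vectors, whitespace tokenization, and two labels named "spam" and "ham" are illustrative choices, not details taken from the paper. One plausible reading of the speed claim is that a new message is compared with one centroid per class instead of with every stored message.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting scheme)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def train_rocchio(labelled_messages):
    """Build one centroid per label from (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in labelled_messages:
        by_label[label].append(tf_vector(text))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(message, centroids):
    """Return the label whose centroid is most similar to the message."""
    vec = tf_vector(message)
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))
```

Averaging blurs the exact wording of any single template, which is consistent with the excerpt noting that simple VSM is better at pinpointing the exact spam template.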


01 Jan 1971

3,074 citations


Book
01 Jan 1971

2,820 citations


"Filtering Template Driven Spam Mail..." refers background in this paper

  • ...The distance (or similarity) between a query vector and the document vectors is the basis for the information retrieval process....

    [...]
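For reference, the similarity between a query vector and a document vector in this setting is commonly taken to be the cosine of the angle between them. The symbols below ($q$ for the query vector, $d$ for a document vector, $n$ for the vocabulary size) are notation introduced here, not taken from the paper:

```latex
\[
\operatorname{sim}(q, d) = \cos\theta
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}} \, \sqrt{\sum_{i=1}^{n} d_i^{2}}}
\]
```

A value of 1 means the two vectors point in the same direction; for vectors with non-negative term weights the value lies between 0 and 1, and 0 means the texts share no terms.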


Journal Article
TL;DR: The single and complete linkage and Ward clustering were applied to Finnish documents utilizing their relevance assessment as a new feature, and a connection between the cosine measure and the Euclidean distance was used in association with PCA.
Abstract: Clustering groups document objects represented as vectors. An extensive vector space may cause obstacles to applying these methods. Therefore, the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves the mean-correction of data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and grounded searching on the latter. We applied the single and complete linkage and Ward clustering to Finnish documents utilizing their relevance assessment as a new feature. After the normalization of the data PCA was run and relevant documents were clustered.

129 citations
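The abstract above does not spell out which connection between the cosine measure and the Euclidean distance it uses. A standard identity that fits the description, assumed here for illustration, holds for vectors $a$ and $b$ normalized to unit length:

```latex
\[
\lVert a - b \rVert^{2}
  = \lVert a \rVert^{2} + \lVert b \rVert^{2} - 2\, a \cdot b
  = 2 \bigl( 1 - \cos(a, b) \bigr),
  \qquad \lVert a \rVert = \lVert b \rVert = 1 .
\]
```

Under this identity, ranking normalized documents by increasing Euclidean distance gives the same order as ranking them by decreasing cosine similarity.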