Open Access Journal Article

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Giorgio Fumera, +2 more
01 Dec 2006, Vol. 7, Iss. 98, pp. 2699-2720
TLDR
This paper proposes an approach to anti-spam filtering which exploits the text information embedded into images sent as attachments, based on the application of state-of-the-art text categorisation techniques to the analysis of text extracted by OCR tools from images attached to e-mails.
Abstract
In recent years anti-spam filters have become necessary tools for Internet service providers to face up to the continuously growing spam phenomenon. Current server-side anti-spam filters are made up of several modules aimed at detecting different features of spam e-mails. In particular, text categorisation techniques have been investigated by researchers for the design of modules for the analysis of the semantic content of e-mails, due to their potentially higher generalisation capability with respect to manually derived classification rules used in current server-side filters. However, very recently spammers introduced a new trick consisting of embedding the spam message into attached images, which can make all current techniques based on the analysis of digital text in the subject and body fields of e-mails ineffective. In this paper we propose an approach to anti-spam filtering which exploits the text information embedded into images sent as attachments. Our approach is based on the application of state-of-the-art text categorisation techniques to the analysis of text extracted by OCR tools from images attached to e-mails. The effectiveness of the proposed approach is experimentally evaluated on two large corpora of spam e-mails.
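The pipeline described in the abstract (extract text from attached images, then apply a text categoriser to it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the OCR step is assumed to have already run, so `ocr_texts` is a hypothetical list of strings an OCR tool might return; the multinomial naive Bayes classifier is a simple stand-in for whichever text categorisation technique is actually used; and the four training messages are invented toy data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs; digits and punctuation split tokens."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Tiny multinomial naive Bayes text classifier (stand-in categoriser)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for t in tokens:
                score += math.log((self.word_counts[c][t] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


def classify_email(body_text, ocr_texts, model):
    """Classify an e-mail from its body text plus text extracted from images.

    ocr_texts is assumed to be the output of an external OCR tool,
    one string per attached image (hypothetical input, not produced here).
    """
    combined = " ".join([body_text, *ocr_texts])
    return model.predict(combined)


# Invented toy training data, for illustration only.
train_docs = [
    "cheap pills buy now limited offer",
    "win money now click here",
    "meeting agenda attached for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = MultinomialNB().fit(train_docs, train_labels)

# An e-mail with an empty body whose message sits entirely in an image.
image_only = classify_email("", ["BUY CHEAP PILLS N0W"], model)
# An ordinary text-only e-mail without image attachments.
text_only = classify_email("please review the meeting agenda", [], model)
print(image_only, text_only)
```

Concatenating the body and the OCR output is only one possible design; the extracted text could equally be scored by a separate module whose output is combined with the other filter modules.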


Citations
Proceedings Article

Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning

TL;DR: A thorough overview of the evolution of this research area over the last ten years and beyond is provided, starting from pioneering, earlier work on the security of non-deep learning algorithms up to more recent work aimed to understand the security properties of deep learning algorithms, in the context of computer vision and cybersecurity tasks.
Journal Article

Review: A review of machine learning approaches to Spam filtering

TL;DR: This comprehensive review of recent developments in the application of machine learning algorithms to spam filtering covers both textual- and image-based approaches and concludes that, while important advances have been made in recent years, several aspects remain to be explored, especially under more realistic evaluation settings.
Journal Article

A survey of learning-based techniques of email spam filtering

TL;DR: An overview of the state of the art of machine learning applications for spam filtering, and of the ways of evaluation and comparison of different filtering methods.
Book

Email Spam Filtering: A Systematic Review

TL;DR: This work examines the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe, and outlines several uncertainties and proposes experimental methods to address them.
Proceedings Article

Combating Adversarial Misspellings with Robust Word Recognition

TL;DR: This work proposes a word recognition model in front of the downstream classifier, outperforming both adversarial training and off-the-shelf spell checkers, and reveals that robustness also depends upon a quantity that the authors denote the sensitivity.
References
Journal Article

Machine learning in automated text categorization

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Proceedings Article

Advances in kernel methods: support vector learning

TL;DR: Contributions include "Support vector machines for dynamic reconstruction of a chaotic system" (Klaus-Robert Müller et al.) and "Pairwise classification and support vector machines" (Ulrich Kressel).
Posted Content

Making large scale SVM learning practical

TL;DR: SVM light is an implementation of an SVM learner that addresses the problem of large-scale SVM training with many training examples, making large-scale SVM learning more practical.
Proceedings Article

A comparison of event models for naive bayes text classification

TL;DR: It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Proceedings Article

A Bayesian Approach to Filtering Junk E-Mail

TL;DR: This work examines methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream, and shows the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.