Topic

Image spam

About: Image spam is a research topic. Over its lifetime, 175 publications have appeared within this topic, receiving 4,126 citations. The topic is also known as image-based spam.


Papers
Journal ArticleDOI
TL;DR: The use of support vector machines in classifying e-mail as spam or nonspam is studied by comparing them to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. SVMs performed best when using binary features.
Abstract: We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed. However, SVM had significantly less training time.

1,536 citations
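
The abstract above reports that SVMs performed best with binary features. As a minimal sketch of what that representation means, the following contrasts a binary (present or absent) feature vector with a term-count vector. The tokenizer, the vocabulary, and the sample message are assumptions made for illustration; the abstract does not describe how the authors built their features.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def binary_features(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def count_features(text, vocabulary):
    """Return how many times each vocabulary word occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]


# Hypothetical vocabulary and message, chosen only for this example.
vocabulary = ["free", "meeting", "offer", "report"]
message = "Free offer! Claim your FREE prize with this offer today."

binary_vector = binary_features(message, vocabulary)  # presence only
count_vector = count_features(message, vocabulary)    # repetition counted
```

Either vector could be passed to a classifier; the binary form discards how often a word repeats and keeps only whether it occurs.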

Journal ArticleDOI
TL;DR: A comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches, which concludes that while important advancements have been made in the last years, several aspects remain to be explored, especially under more realistic evaluation settings.
Abstract: In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in the last years, several aspects remain to be explored, especially under more realistic evaluation settings.

468 citations

Journal Article
TL;DR: This paper proposes an approach to anti-spam filtering which exploits the text information embedded into images sent as attachments, based on the application of state-of-the-art text categorisation techniques to the analysis of text extracted by OCR tools from images attached to e-mails.
Abstract: In recent years anti-spam filters have become necessary tools for Internet service providers to face up to the continuously growing spam phenomenon. Current server-side anti-spam filters are made up of several modules aimed at detecting different features of spam e-mails. In particular, text categorisation techniques have been investigated by researchers for the design of modules for the analysis of the semantic content of e-mails, due to their potentially higher generalisation capability with respect to manually derived classification rules used in current server-side filters. However, very recently spammers introduced a new trick consisting of embedding the spam message into attached images, which can make all current techniques based on the analysis of digital text in the subject and body fields of e-mails ineffective. In this paper we propose an approach to anti-spam filtering which exploits the text information embedded into images sent as attachments. Our approach is based on the application of state-of-the-art text categorisation techniques to the analysis of text extracted by OCR tools from images attached to e-mails. The effectiveness of the proposed approach is experimentally evaluated on two large corpora of spam e-mails.

161 citations
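
The approach in the abstract above applies text categorisation to text that OCR recovers from attached images. The following is a rough sketch of that pipeline shape only. The OCR step is passed in as a callable, `fake_ocr` is a stand-in that reads a pre-filled field rather than a real OCR tool, and the keyword-fraction score is a placeholder for a trained text classifier. None of these names or values come from the paper.

```python
import re

# Hypothetical keyword list standing in for a trained text classifier.
SPAM_WORDS = {"viagra", "lottery", "winner", "cheap"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def email_tokens(body, attachments, ocr):
    """Collect tokens from the body and from text recovered from each image."""
    tokens = tokenize(body)
    for image in attachments:
        tokens.extend(tokenize(ocr(image)))
    return tokens


def spam_score(tokens):
    """Fraction of tokens found in the keyword list (placeholder classifier)."""
    if not tokens:
        return 0.0
    return sum(token in SPAM_WORDS for token in tokens) / len(tokens)


def fake_ocr(image):
    """Stand-in for an OCR tool: returns text stored with the test image."""
    return image["embedded_text"]


email_body = "See attached."
attachments = [{"embedded_text": "CHEAP lottery tickets, every player a WINNER"}]

body_only = spam_score(email_tokens(email_body, [], fake_ocr))
with_ocr = spam_score(email_tokens(email_body, attachments, fake_ocr))
```

In this toy case the body alone contains no suspicious words, so only the score that includes the extracted image text rises above zero.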

Proceedings Article
01 Jan 2007
TL;DR: This paper presents features that focus on simple properties of the image, making classification as fast as possible, and introduces a new feature selection algorithm that selects features for classification based on their speed as well as predictive power.
Abstract: Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.

145 citations
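
The Just in Time (JIT) feature extraction described in the abstract above creates features at classification time, as the classifier needs them. A minimal sketch of that idea follows: features are computed lazily and cached, and a small decision tree requests only the features on the path it actually follows. The feature names, thresholds, tree, and image dictionaries are all hypothetical and are not taken from the paper.

```python
class LazyFeatures:
    """Compute each feature only when first requested, then cache it."""

    def __init__(self, image, extractors):
        self.image = image
        self.extractors = extractors
        self.cache = {}

    def __getitem__(self, name):
        if name not in self.cache:
            self.cache[name] = self.extractors[name](self.image)
        return self.cache[name]


# Hypothetical cheap image properties; a real system would read image files.
EXTRACTORS = {
    "file_size_kb": lambda img: img["bytes"] / 1024,
    "aspect_ratio": lambda img: img["width"] / img["height"],
    "distinct_colors": lambda img: len(set(img["pixels"])),
}

# Hand-written tree with made-up thresholds.
# A node is (feature, threshold, subtree_if_below, subtree_otherwise);
# a leaf is a label string.
TREE = ("file_size_kb", 10.0,
        "ham",
        ("distinct_colors", 4, "spam", "ham"))


def classify(tree, features):
    """Walk the tree, pulling each feature only when a node tests it."""
    while not isinstance(tree, str):
        name, threshold, below, otherwise = tree
        tree = below if features[name] < threshold else otherwise
    return tree


small = LazyFeatures(
    {"bytes": 2048, "width": 10, "height": 10, "pixels": [0, 1, 2]}, EXTRACTORS)
large = LazyFeatures(
    {"bytes": 20480, "width": 10, "height": 10, "pixels": [0, 0, 1, 1, 0]}, EXTRACTORS)

small_label = classify(TREE, small)
large_label = classify(TREE, large)
small_computed = sorted(small.cache)
large_computed = sorted(large.cache)
```

Here the small image is labelled after a single feature is computed, and `aspect_ratio` is never computed for either image because no node on the visited paths tests it.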

Proceedings ArticleDOI
01 Jan 2005
TL;DR: A novel anti-spam system which utilizes visual clues, in addition to text information in the email body, to determine whether a message is spam, using one-class support vector machines (SVM) as the underlying base classifier for anti-spam filtering.
Abstract: Unsolicited commercial email (UCE), also known as spam, has been a major problem on the Internet. In the past, researchers have addressed this problem as a text classification or categorization problem. However, as spammers' techniques continue to evolve and the genre of email content becomes more and more diverse, text-based anti-spam approaches alone are no longer sufficient. In this paper, we propose a novel anti-spam system which utilizes visual clues, in addition to text information in the email body, to determine whether a message is spam. We analyze a large collection of spam emails containing images and identify a number of useful visual features for this application. We then propose using one-class support vector machines (SVM) as the underlying base classifier for anti-spam filtering. The experimental results demonstrate that the proposed system can add significant filtering power to the existing text-based anti-spam filters.

117 citations


Network Information
Related Topics (5)
Facial recognition system: 38.7K papers, 883.4K citations, 74% related
Feature (machine learning): 33.9K papers, 798.7K citations, 73% related
Web page: 50.3K papers, 975.1K citations, 72% related
Feature vector: 48.8K papers, 954.4K citations, 72% related
Authentication: 74.7K papers, 867.1K citations, 72% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2021    7
2020    9
2019    8
2018    11
2017    4
2016    7