Proceedings Article

Clustering malware-generated spam emails with a novel fuzzy string matching algorithm

TLDR
A fuzzy-matching clustering algorithm is introduced to group the subject lines of malware-generated spam emails. It detects similar patterns even when the spammer creates a variation of the original pattern.
Abstract
In this paper, a fuzzy-matching clustering algorithm is introduced to group the subject lines of spam emails generated by malware. A modified scoring strategy is applied in dynamic programming to find subjects that are similar to each other. A recursive seed selection strategy allows the algorithm to detect similar patterns even when the spammer creates a variation of the original pattern. A sliding threshold based on string length helps to minimize false positives. The algorithm proves to be effective in detecting and grouping spam emails that use templates. It also helps spam investigators collect and sort large amounts of malware-generated spam more efficiently without looking at the email content.
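
The abstract names three ingredients: a dynamic-programming similarity score, recursive seed selection, and a length-based sliding threshold. The sketch below illustrates how such pieces could fit together; it is not the authors' implementation. It assumes plain Levenshtein distance in place of the paper's modified scoring strategy (which the abstract does not specify), and the threshold values, the function names, and the choice to treat every newly matched subject as a further seed are all hypothetical.

```python
from collections import deque


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming (two-row form).

    Assumption: stands in for the paper's modified scoring strategy.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def sliding_threshold(length: int) -> float:
    """Hypothetical schedule: shorter strings must match more strictly."""
    if length < 10:
        return 0.9
    if length < 30:
        return 0.8
    return 0.7


def cluster_subjects(subjects):
    """Group subjects; each newly matched subject is reused as a seed."""
    unassigned = list(dict.fromkeys(subjects))  # de-duplicate, keep order
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            remaining = []
            for s in unassigned:
                limit = sliding_threshold(min(len(current), len(s)))
                if similarity(current, s) >= limit:
                    cluster.append(s)
                    queue.append(s)  # matched subject becomes a new seed
                else:
                    remaining.append(s)
            unassigned = remaining
        clusters.append(cluster)
    return clusters
```

Calling `cluster_subjects` on a list of subject strings returns a list of clusters, each a list of subjects. Reusing matched subjects as seeds lets a variant that has drifted too far from the first seed still join the cluster through an intermediate subject, which is one plausible reading of how recursive seed selection tolerates template variations.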

Citations
Journal Article

A survey of emerging approaches to spam filtering

TL;DR: This survey focuses on emerging approaches to spam filtering built on recent developments in computing technologies, which include peer-to-peer computing, grid computing, semantic Web, and social networks.
Patent

System and Method for Matching Data Using Probabilistic Modeling Techniques

Shubh Bansal
TL;DR: In this paper, a system and method for matching data using probabilistic modeling techniques is provided, which includes a computer system and a data matching model/engine, which can match and identify entities from approximately matching short string text (e.g., company names, product names, addresses, etc.).
Journal Article

Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies

TL;DR: In this article, the authors evaluate various forms of fuzzy string matching between participants' responses and target sentences, as automated metrics of listener transcript accuracy, and demonstrate that one particular metric, the token sort ratio, is a consistent, highly efficient, and accurate metric for automated assessment of listener transcripts, as evidenced by high correlations with human-generated scores.
Book Chapter

A Survey of Machine Learning Algorithms and Their Application in Information Security

TL;DR: A wide variety of machine learning techniques are introduced, and a sample of the applications of each to security-related problems is briefly discussed.
Journal Article

Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering

TL;DR: This paper illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns.
References
Book

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

TL;DR: This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis.
Journal Article

A Theory for Record Linkage

TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Proceedings Article

The field matching problem: Algorithms and applications

TL;DR: Three field matching algorithms are described, one of which is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences, and their performance on real-world datasets is evaluated.
Journal Article

Similarity-Based Models of Word Cooccurrence Probabilities

TL;DR: The authors proposed a method for estimating the probability of unseen word combinations using available information on "most similar" words and applied it to language modeling and pseudo-word disambiguation tasks.