Clustering malware-generated spam emails with a novel fuzzy string matching algorithm

doi:10.1145/1529282.1529473

Proceedings Article

Chun Wei, +2 more

- pp 889-890

TLDR

A fuzzy-matching clustering algorithm is introduced to group the subjects of malware-generated spam emails and to detect similar patterns even when the spammer creates a variation of the original pattern.

Abstract:

In this paper, a fuzzy-matching clustering algorithm is introduced to group subjects found in spam emails which are generated by malware. A modified scoring strategy is applied in dynamic programming to find subjects that are similar to each other. A recursive seed selection strategy allows the algorithm to detect similar patterns even when the spammer creates a variation of the original pattern. A sliding threshold based on string length helps to minimize false positives. The algorithm proves to be effective in detecting and grouping spam emails that use templates. It also helps spam investigators to collect and sort large amounts of malware-generated spam more efficiently without looking at the email content.
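
The three ingredients named in the abstract (a dynamic-programming similarity score, seed-based cluster growth, and a length-dependent threshold) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not give the modified scoring strategy, so plain Levenshtein distance is swapped in, and the function names, the threshold bounds (0.9 down to 0.6), and the pivot length of 40 characters are all hypothetical. "Recursive seed selection" is interpreted here as letting every newly matched subject act as a seed in turn.

```python
from typing import List


def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1], computed by dynamic programming.

    Assumption: stands in for the paper's modified scoring, which is not specified.
    """
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def sliding_threshold(length: int, low: float = 0.6, high: float = 0.9,
                      pivot: int = 40) -> float:
    """Hypothetical sliding threshold: strict for short strings, looser for long ones."""
    if length >= pivot:
        return low
    return high - (high - low) * length / pivot


def cluster_subjects(subjects: List[str]) -> List[List[str]]:
    """Grow each cluster from a seed; every matched subject becomes a seed in turn."""
    remaining = list(subjects)
    clusters: List[List[str]] = []
    while remaining:
        seeds = [remaining.pop(0)]
        cluster: List[str] = []
        while seeds:
            seed = seeds.pop()
            cluster.append(seed)
            matched, rest = [], []
            for s in remaining:
                threshold = sliding_threshold(min(len(seed), len(s)))
                (matched if similarity(seed, s) >= threshold else rest).append(s)
            remaining = rest
            seeds.extend(matched)  # variations of variations are followed too
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    demo = [
        "Cheap meds online today",
        "Cheap meds online todya",
        "Cheap pills online today",
        "Your invoice is attached",
    ]
    for group in cluster_subjects(demo):
        print(group)
```

Because matched subjects are fed back in as seeds, a subject that has drifted too far from the first seed can still join the cluster through an intermediate variation. The stricter threshold for short strings reflects the idea that a few shared characters in a short subject are weaker evidence of a common template.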

Citations

Journal Article

A survey of emerging approaches to spam filtering

Godwin Caruana, +1 more

- 05 Mar 2008 -

ACM Computing Surveys

TL;DR: This survey focuses on emerging approaches to spam filtering built on recent developments in computing technologies, which include peer-to-peer computing, grid computing, semantic Web, and social networks.

Patent

System and Method for Matching Data Using Probabilistic Modeling Techniques

Shubh Bansal

TL;DR: A system and method for matching data using probabilistic modeling techniques is provided. The system includes a computer system and a data matching model/engine that can match and identify entities from approximately matching short string text (e.g., company names, product names, addresses, etc.).

Journal Article

Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies

Hans R. Bosker

- 10 Mar 2021 -

Behavior Research Methods

TL;DR: In this article, the authors evaluate various forms of fuzzy string matching between participants' responses and target sentences, as automated metrics of listener transcript accuracy, and demonstrate that one particular metric, the token sort ratio, is a consistent, highly efficient, and accurate metric for automated assessment of listener transcripts, as evidenced by high correlations with human-generated scores.

Book Chapter

A Survey of Machine Learning Algorithms and Their Application in Information Security

Mark Stamp

TL;DR: A wide variety of machine learning techniques are introduced, and a sample of the applications of each to security-related problems is briefly discussed.

Journal Article

Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering

Rasib Khan, +3 more

- 31 Mar 2014 -

The Journal of Digital Forensics, Securi...

TL;DR: This paper illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns.

References

Book

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

Richard Durbin, +3 more

TL;DR: This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis.

Journal Article

A Theory for Record Linkage

Ivan P. Fellegi, +1 more

- 01 Dec 1969 -

Journal of the American Statistical Asso...

TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.

Proceedings Article

A comparison of string distance metrics for name-matching tasks

William W. Cohen, +2 more

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.

Proceedings Article

The field matching problem: Algorithms and applications

Alvaro Monge, +1 more

TL;DR: Three field matching algorithms are described, one of which is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences, and their performance on real-world datasets is evaluated.

Journal Article

Similarity-Based Models of Word Cooccurrence Probabilities

Ido Dagan, +2 more

- 01 Feb 1999 -

Machine Learning

TL;DR: The authors proposed a method for estimating the probability of unseen word combinations using available information on "most similar" words and applied it to language modeling and pseudo-word disambiguation tasks.
