Proceedings Article
Clustering malware-generated spam emails with a novel fuzzy string matching algorithm
Chun Wei, Alan P. Sprague, Gary Warner +2 more
- pp 889-890
TL;DR: A fuzzy-matching clustering algorithm is introduced to group the subjects of malware-generated spam emails, detecting similar patterns even when the spammer creates a variation of the original pattern.
Abstract:
In this paper, a fuzzy-matching clustering algorithm is introduced to group the subjects of spam emails generated by malware. A modified scoring strategy is applied in dynamic programming to find subjects that are similar to each other. A recursive seed selection strategy allows the algorithm to detect similar patterns even when the spammer creates a variation of the original pattern. A sliding threshold based on string length helps to minimize false positives. The algorithm proves effective in detecting and grouping spam emails that use templates. It also helps spam investigators collect and sort large amounts of malware-generated spam more efficiently without looking at the email content.
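The abstract does not give the scoring scheme, the seed selection rule, or the threshold function, so the following is only a minimal sketch of how the three ideas might fit together, not the authors' implementation. Every name and number here is an assumption made for illustration: `alignment_score` uses plain global-alignment scoring (match +1, mismatch -1, gap -1) in place of the paper's modified scoring, `sliding_threshold` is a hypothetical linear rule that is stricter for short subjects, and `cluster_subjects` treats each newly matched subject as a further seed.

```python
from collections import deque


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score computed by dynamic programming.

    Assumption: standard match/mismatch/gap scoring stands in for the
    paper's modified scoring strategy, which the abstract does not specify.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Alignment score scaled by the longer length and clipped to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, alignment_score(a, b) / longest)


def sliding_threshold(length, short=0.9, long=0.6, cutoff=40):
    """Hypothetical length-dependent threshold.

    Short subjects must match almost exactly; the requirement relaxes
    linearly until `cutoff` characters and stays at `long` afterwards.
    """
    if length >= cutoff:
        return long
    return short - (short - long) * length / cutoff


def cluster_subjects(subjects):
    """Group subjects, letting every new cluster member act as a seed."""
    unassigned = list(dict.fromkeys(subjects))  # de-duplicate, keep order
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            remaining = []
            for s in unassigned:
                limit = sliding_threshold(min(len(current), len(s)))
                if similarity(current, s) >= limit:
                    cluster.append(s)
                    queue.append(s)  # the new member becomes a seed too
                else:
                    remaining.append(s)
            unassigned = remaining
        clusters.append(cluster)
    return clusters
```

Calling `cluster_subjects` on a list of subject lines returns a list of clusters, each a list of subjects. Re-seeding from every new member is what lets a cluster follow a template as it drifts: a variant that no longer matches the first seed can still join through an intermediate subject that does.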
Citations
Journal Article
A survey of emerging approaches to spam filtering
Godwin Caruana, Maozhen Li +1 more
TL;DR: This survey focuses on emerging approaches to spam filtering built on recent developments in computing technologies, which include peer-to-peer computing, grid computing, semantic Web, and social networks.
Patent
System and Method for Matching Data Using Probabilistic Modeling Techniques
TL;DR: In this paper, a system and method for matching data using probabilistic modeling techniques is provided, which includes a computer system and a data matching model/engine, which can match and identify entities from approximately matching short string text (e.g., company names, product names, addresses, etc.).
Journal Article
Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies
TL;DR: In this article, the authors evaluate various forms of fuzzy string matching between participants' responses and target sentences, as automated metrics of listener transcript accuracy, and demonstrate that one particular metric, the token sort ratio, is a consistent, highly efficient, and accurate metric for automated assessment of listener transcripts, as evidenced by high correlations with human-generated scores.
Book Chapter
A Survey of Machine Learning Algorithms and Their Application in Information Security
TL;DR: A wide variety of machine learning techniques are introduced, and a sample of the applications of each to security-related problems is briefly discussed.
Journal Article
Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
TL;DR: This paper illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns.
References
Book
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
TL;DR: This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis.
Journal Article
A Theory for Record Linkage
Ivan P. Fellegi, Alan B. Sunter +1 more
TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings Article
A comparison of string distance metrics for name-matching tasks
TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Proceedings Article
The field matching problem: Algorithms and applications
Alvaro Monge, Charles Elkan +1 more
TL;DR: Three field matching algorithms are described, one of which is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences, and their performance on real-world datasets is evaluated.
Journal Article
Similarity-Based Models of Word Cooccurrence Probabilities
TL;DR: The authors proposed a method for estimating the probability of unseen word combinations using available information on "most similar" words and applied it to language modeling and pseudo-word disambiguation tasks.