
Rule-Based On-the-fly Web Spambot Detection Using Action Strings

TL;DR: This paper proposes a rule-based web usage behaviour model, encoded as action strings and analysed with a Trie data structure, to detect web spambots and eliminate spam in Web 2.0 applications.
Abstract: Spambots are a new type of internet robot that spreads spam content through Web 2.0 applications such as online discussion boards, blogs, wikis and social networking platforms. These robots are intelligently designed to act like humans in order to fool safeguards and other users. Such spam content not only wastes valuable resources and time but may also mislead users with unsolicited content. Spam content typically intends to misinform users (scams), generate traffic, make sales (marketing/advertising), and occasionally compromise parties, people or systems by spreading spyware or malware. Current countermeasures do not effectively identify and prevent web spambots. Proactive measures to deter spambots from entering a site are limited to question/response scenarios, and the remaining efforts focus on spam content identification as a passive activity. Spammers have evolved their techniques to bypass existing anti-spam filters. In this paper, we describe a rule-based web usage behaviour action string that can be analysed using Trie data structures to detect web spambots. Our experimental results show that the proposed system successfully classifies web spambots on the fly, hence eliminating spam in Web 2.0 applications.
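
The trie-based matching idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the single-character action codes, the prefix-matching rule, and the training strings below are all hypothetical.

```python
# Minimal sketch of trie-based action-string matching for spambot detection.
# Action codes are hypothetical: e.g. 'L' = login, 'P' = post, 'S' = search.

class TrieNode:
    def __init__(self):
        self.children = {}     # action code -> child node
        self.terminal = False  # True if a known spambot action string ends here

class ActionTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, action_string):
        """Store one labelled spambot action string, e.g. 'LPPP'."""
        node = self.root
        for action in action_string:
            node = node.children.setdefault(action, TrieNode())
        node.terminal = True

    def is_spambot(self, action_string):
        """Classify a live session by walking the trie as actions arrive."""
        node = self.root
        for action in action_string:
            node = node.children.get(action)
            if node is None:
                return False   # session diverges from all known spambot patterns
            if node.terminal:
                return True    # matched a complete spambot action string
        return False

# Usage: build the trie from labelled sessions, then test new sessions.
trie = ActionTrie()
for s in ["LPPP", "LSPP"]:            # hypothetical training strings
    trie.insert(s)
print(trie.is_spambot("LPPPS"))       # True: prefix matches a known pattern
print(trie.is_spambot("LSSP"))        # False
```

Because classification happens while the trie is being walked, a session can be flagged as soon as its action prefix matches a known spambot string, which is what makes on-the-fly detection possible.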


Citations
Journal ArticleDOI
TL;DR: This paper presents a novel detection approach that relies on differences in the resource request patterns of web robots and humans, and rationalizes why those differences are expected to remain intrinsic to robots and humans despite the continuous evolution of their traffic.
Abstract: Recent academic and industry reports confirm that web robots dominate the traffic seen by web servers across the Internet. Because web robots crawl in an unregulated fashion, they may threaten the privacy, function, performance, and security of web servers. There is therefore a growing need to identify robot visitors automatically, both offline and in real time, to assess their impact and to potentially protect web servers from abusive bots. Yet contemporary detection approaches, which rely on syntactic log analysis, statistical variations between robot and human traffic, analytical learning techniques, or complex software modifications, may not be realistic to implement or may not remain effective as the behavior of robots evolves over time. Instead, this paper presents a novel detection approach that relies on the differences in the resource request patterns of web robots and humans. It rationalizes why differences in resource request patterns are expected to remain intrinsic to robots and humans despite the continuous evolution of their traffic. The performance of the approach, adoptable for both offline and real-time settings with a simple implementation, is demonstrated by playing back streams of actual web traffic with varying session lengths and proportions of robot requests.
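
The intuition behind resource request patterns can be made concrete with a short sketch. The feature (fraction of embedded-resource requests per session) and the threshold below are illustrative assumptions, not the paper's actual model: human browsers automatically fetch embedded resources (images, CSS, JS) alongside pages, while many robots request mostly pages.

```python
# Hedged sketch: flag sessions that fetch almost no embedded resources.
# Feature set and threshold are assumptions for illustration only.
from collections import Counter
import os

EMBEDDED = {".css", ".js", ".png", ".jpg", ".gif", ".ico"}

def embedded_fraction(session_urls):
    """Fraction of a session's requests that are embedded resources."""
    counts = Counter(os.path.splitext(url.split("?")[0])[1].lower()
                     for url in session_urls)
    total = sum(counts.values())
    embedded = sum(n for ext, n in counts.items() if ext in EMBEDDED)
    return embedded / total if total else 0.0

def looks_like_robot(session_urls, threshold=0.1):
    # Hypothetical threshold: near-zero embedded requests suggests a robot.
    return embedded_fraction(session_urls) < threshold

print(looks_like_robot(["/index.html", "/a.css", "/logo.png", "/p2.html"]))  # False
print(looks_like_robot(["/index.html", "/p2.html", "/p3.html"]))             # True
```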

28 citations

01 Jan 2014
TL;DR: An overview of Web spammer detection is presented, along with a comparison of traditional and burgeoning spammer detection approaches; the prospects for future development and suggestions for possible extensions are emphasized.
Abstract: With its rising popularity, as evidenced in social networks, online shopping platforms and email systems, detection of Web spammers has become one of the hottest topics in the data mining field. The main challenge of Web spammer detection is how to recognize spammer behavior patterns by examining spammer features and attributes in big datasets, in order to limit the proliferation of Internet spam and ensure the quality of Internet services. This paper presents an overview of Web spammer detection, along with a comparison of traditional and burgeoning spammer detection approaches. The key techniques and evaluation methods are classified and discussed from several aspects. Finally, the prospects for future development and suggestions for possible extensions are emphasized.

15 citations

Journal ArticleDOI
TL;DR: Wines, named after Web Immune Systems, is proposed: a design for a proxy that learns variations of attack strings from the behavior of malicious users, for the purpose of detecting those mutated attack strings.

7 citations

Dissertation
30 Apr 2012
TL;DR: This work describes new approaches to solve the problem of building a personal spam filter that requires minimal user feedback, and shows that learning filters with no user input can substantially improve the results of open-source and industry-leading commercial filters that employ no user-specific training.
Abstract: This thesis investigates ways to reduce or eliminate the necessity of user input to learning-based personal email spam filters. Personal spam filters have been shown in previous studies to yield superior effectiveness, at the cost of requiring extensive user training, which may be burdensome or impossible. This work describes new approaches to building a personal spam filter that requires minimal user feedback. An initial study investigates how well a personal filter can learn from different sources of data, as opposed to the user's own messages. Our initial studies show that inter-user training yields substantially inferior results to intra-user training using the best known methods. Moreover, contrary to previous literature, we find that transfer learning degrades the performance of spam filters when the training and test sets belong to two different users or different times. We also adapt and modify a graph-based semi-supervised learning algorithm to build a filter that can classify an entire inbox trained on twenty or fewer user judgments. Our experiments show that this approach compares well with previous techniques when trained on as few as two training examples. We also present the toolkit we developed to perform privacy-preserving user studies on spam filters. This toolkit allows researchers to evaluate any spam filter that conforms to a standard interface defined by TREC on real users' email boxes; researchers have access only to the TREC-style result file, and not to any content of a user's email stream. To eliminate the necessity of feedback from the user, we build a personal autonomous filter that learns exclusively from the result of a global spam filter. Our laboratory experiments show that learning filters with no user input can substantially improve the results of open-source and industry-leading commercial filters that employ no user-specific training. We use our toolkit to validate the performance of the autonomous filter in a user study.
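
The few-judgments, graph-based setting can be sketched with scikit-learn's LabelSpreading standing in for the thesis's modified algorithm; the toy two-dimensional "message features" and the two user judgments below are assumptions for illustration.

```python
# Sketch: classify a whole inbox from two labelled messages via
# graph-based semi-supervised learning. Data and features are toy stand-ins.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
# Two loose clusters standing in for ham and spam message features.
ham = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
spam = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
X = np.vstack([ham, spam])

# Only two user judgments; every other message is unlabeled (-1).
y = np.full(100, -1)
y[0] = 0   # one message judged ham
y[50] = 1  # one message judged spam

model = LabelSpreading(kernel="knn", n_neighbors=5).fit(X, y)
print((model.transduction_[:50] == 0).mean())  # fraction of ham recovered
print((model.transduction_[50:] == 1).mean())  # fraction of spam recovered
```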

5 citations

References
Journal ArticleDOI
U. M. Fayyad
TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Abstract: Current computing and storage technology is rapidly outstripping society's ability to make meaningful use of the torrent of available data. Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.

4,806 citations

Journal ArticleDOI
TL;DR: Although empirical predictions based on larger numbers of known protein structures tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.

4,522 citations

Journal ArticleDOI
TL;DR: A unified overview is provided of methods that are currently widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information-theoretic measures such as relative entropy and mutual information.
Abstract: We provide a unified overview of methods that currently are widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information-theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating the sensitivity and specificity of optimal systems. While the principles are general, we illustrate their applicability on specific problems such as protein secondary structure and signal peptide prediction.
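
The review's headline measures are easy to make concrete. Below is a minimal worked example computing sensitivity, specificity, and the Matthews correlation coefficient from a binary confusion matrix; the counts are toy assumptions, not data from the paper.

```python
# Worked example: sensitivity, specificity, and Matthews correlation
# coefficient (MCC) from toy confusion-matrix counts.
import math

def binary_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

print(binary_metrics(tp=40, fp=5, tn=45, fn=10))
# (0.8, 0.9, 0.7035...) -- MCC = (40*45 - 5*10) / sqrt(45*50*50*55)
```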

1,972 citations

Proceedings ArticleDOI
11 Feb 2008
TL;DR: It is shown that opinion spam is quite different from Web spam and email spam, and thus requires novel detection techniques, which the paper presents.
Abstract: Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has focused on the classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam, or the trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion-rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies have also appeared that aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.

1,385 citations

Proceedings ArticleDOI
03 Nov 1997
TL;DR: This paper defines Web mining, presents an overview of the various research issues, techniques, and development efforts, briefly describes WEBMINER, a system for Web usage mining, and concludes by listing open research issues.
Abstract: Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. We define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing research issues.

1,365 citations