Proceedings ArticleDOI

Anomaly detection in web graphs using vertex neighbourhood based signature similarity methods

TL;DR
Two types of anomalies that occur during crawling are identified, and two novel similarity measures based on vertex neighbourhoods that overcome these anomalies are proposed.
Abstract
With the massive increase in the amount of data generated each day, we need automated tools to oversee the evolution of the web and to quantify global effects such as the PageRank of web pages. Search engines periodically crawl the web to build web graphs that store information about its structure. This is an expensive and error-prone process. Central to this problem is the notion of graph similarity (between two graphs spaced in time), which validates how well search engines secure content from the web and the quality of the search results they produce. In this paper, we propose two different types of anomalies which occur during crawling and two novel similarity measures based on vertex neighbourhoods, which overcome the proposed anomalies. Extensive experimentation on real-world datasets shows significant improvement over state-of-the-art signature similarity based methods.
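
As an illustration of the vertex neighbourhood idea, the sketch below compares two crawl snapshots by the overlap of each vertex's out-neighbourhood. This is a minimal, hypothetical example: the Jaccard overlap and the plain averaging are illustrative stand-ins, not the two measures proposed in the paper.

```python
# Hypothetical vertex-neighbourhood similarity between two crawl snapshots.
# Graphs are dicts mapping a vertex to the set of its out-neighbours.

def neighbourhood_similarity(graph_a, graph_b):
    """Average per-vertex Jaccard overlap of out-neighbourhoods."""
    vertices = set(graph_a) | set(graph_b)
    if not vertices:
        return 1.0
    total = 0.0
    for v in vertices:
        na = graph_a.get(v, set())
        nb = graph_b.get(v, set())
        union = na | nb
        # Vertices with empty neighbourhoods in both snapshots count as identical.
        total += len(na & nb) / len(union) if union else 1.0
    return total / len(vertices)

# Example: between crawls, page "a" gains a link and page "c" is no longer reached.
g_t1 = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
g_t2 = {"a": {"b", "c", "d"}, "b": {"c"}, "d": set()}
print(neighbourhood_similarity(g_t1, g_t2))  # ~0.92
```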


Citations
Journal ArticleDOI

Boosting Positive and Unlabeled Learning for Anomaly Detection With Multi-Features

TL;DR: This work introduces a novel PU learning method that can handle the situation where an unlabeled data set is mostly composed of positive instances. It first uses a linear model to extract the most reliable negative instances, then applies a self-learning process that adds reliable negative and positive instances at different speeds based on the estimated positive class prior.
Posted ContentDOI

Imbalanced Aircraft Data Anomaly Detection

TL;DR: This paper proposes a Graphical Temporal Data Analysis (GTDA) framework consisting of three modules: Series-to-Image (S2I), Cluster-based Resampling approach using Euclidean Distance (CRD), and Variance-Based Loss (VBL).
References
Proceedings ArticleDOI

Similarity estimation techniques from rounding algorithms

TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
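
The LSH view this reference establishes can be made concrete with Charikar's random-hyperplane hash for cosine similarity, which arises from the SDP rounding perspective: two vectors collide on a random hyperplane bit with probability 1 - θ/π, where θ is the angle between them. The dimensionality, bit count, and function names below are illustrative assumptions, not details from the paper.

```python
import math
import random

def random_hyperplane_signature(vec, planes):
    """One bit per random hyperplane: which side of the plane the vector lies on."""
    return [1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in planes]

def estimated_cosine(sig_a, sig_b):
    """The fraction of disagreeing bits estimates theta / pi."""
    mismatch = sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return math.cos(mismatch * math.pi)

dim, n_bits = 5, 512
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
u = [1.0, 0.5, 0.0, 0.2, 0.9]
v = [0.9, 0.4, 0.1, 0.2, 1.0]
print(estimated_cosine(random_hyperplane_signature(u, planes),
                       random_hyperplane_signature(v, planes)))
```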
Proceedings ArticleDOI

Finding near-duplicate web pages: a large-scale evaluation of algorithms

TL;DR: A combined algorithm is presented which achieves a precision of 0.79 with 79% of the recall of the other algorithms; since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves better overall precision than Broder et al.'s algorithm.
Journal ArticleDOI

Web graph similarity for anomaly detection

TL;DR: This paper empirically evaluates and compares all five similarity schemes, some adapted from existing graph similarity measures and others from well-known document and vector similarity methods (namely, the shingling method and a random projection based method).
Journal ArticleDOI

Effective web crawling

TL;DR: The World Wide Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small.
Patent

Methods and apparatus for computing graph similarity via signature similarity

TL;DR: In this patent, a web graph is transformed into a set of weighted features, which are converted into a signature via the SimHash algorithm; the signature is then compared with the signatures of one or more other web graphs to determine similarity between web graphs.
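
The pipeline described above (weighted features, a fixed-width signature, bitwise comparison) can be sketched as follows. Treating edges as the weighted features, using MD5 as the underlying hash, and fixing a 64-bit signature are assumptions made for illustration, not details taken from the patent.

```python
import hashlib

BITS = 64  # assumed signature width

def simhash(weighted_features):
    """SimHash of a dict mapping feature string -> non-negative weight."""
    vector = [0.0] * BITS
    for feature, weight in weighted_features.items():
        digest = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(BITS):
            vector[i] += weight if (digest >> i) & 1 else -weight
    # Bit i of the signature is 1 where the weighted bit sum is positive.
    return sum(1 << i for i in range(BITS) if vector[i] > 0)

def signature_similarity(sig_a, sig_b):
    """Fraction of bit positions on which the two signatures agree."""
    return 1.0 - bin(sig_a ^ sig_b).count("1") / BITS

# Example: edges of two crawl snapshots as "source->target" features.
features_t1 = {"a->b": 1.0, "a->c": 1.0, "b->c": 0.5}
features_t2 = {"a->b": 1.0, "a->c": 1.0, "b->c": 0.5, "c->d": 0.2}
print(signature_similarity(simhash(features_t1), simhash(features_t2)))
```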