Proceedings Article
Anomaly detection in web graphs using vertex neighbourhood based signature similarity methods
Aritra Ghosh, Pallavi Gudipati +1 more
- Vol. 2016, pp 1-6
TLDR
This paper identifies two types of anomalies that occur during crawling and proposes two novel similarity measures, based on vertex neighbourhood, that overcome them.
Abstract:
With the massive increase in the amount of data generated each day, we need automated tools to oversee the evolution of the web and to quantify global effects such as the PageRank of webpages. Search engines crawl the web periodically to build web graphs, which store information about the structure of the web. This is an expensive and error-prone process. Central to this problem is the notion of graph similarity (between two graphs spaced in time), which validates how well search engines secure content from the web and the quality of the search results they produce. In this paper, we identify two different types of anomalies which occur during crawling and propose two novel similarity measures based on vertex neighbourhood, which overcome these anomalies. Extensive experimentation on real-world datasets shows significant improvement over state-of-the-art signature similarity based methods.
Citations
Journal Article
Boosting Positive and Unlabeled Learning for Anomaly Detection With Multi-Features
TL;DR: This work introduces a novel PU learning method, which can tackle the situation where an unlabeled data set is mostly composed of positive instances, and starts by using a linear model to extract the most reliable negative instances followed by a self-learning process to add reliable negative and positive instances with different speeds based on the estimated positive class prior.
Posted Content
Imbalanced Aircraft Data Anomaly Detection
TL;DR: This paper proposes a Graphical Temporal Data Analysis (GTDA) framework, which consists of three modules: Series-to-Image (S2I), Cluster-based Resampling Approach using Euclidean Distance (CRD), and Variance-Based Loss (VBL).
References
Proceedings Article
Similarity estimation techniques from rounding algorithms
TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
Proceedings Article
Finding near-duplicate web pages: a large-scale evaluation of algorithms
TL;DR: A combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms, and since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall than Broder et al.'s algorithm.
Journal Article
Web graph similarity for anomaly detection
TL;DR: This paper empirically evaluates and compares five similarity schemes, adapted from existing graph similarity measures and from well-known document and vector similarity methods (namely, the shingling method and the random projection based method).
Journal Article
Effective web crawling
TL;DR: The World Wide Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small.
Patent
Methods and apparatus for computing graph similarity via signature similarity
TL;DR: In this article, a web graph is transformed into a set of weighted features, which are then converted into a signature via a SimHash algorithm, and the signature is compared to the signature of one or more other web graphs in order to determine similarity between web graphs.
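The signature pipeline summarised in the last entry (weighted features, then a SimHash signature, then a signature comparison) can be illustrated with a minimal sketch. It is not the implementation from the patent or from the paper above. The feature scheme in `graph_features` (one feature per vertex and per edge, with an edge weighted by its source vertex), the SHA-256 based `feature_hash`, the 64-bit signature width, and all function names are assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def feature_hash(feature: str, bits: int = 64) -> int:
    """Map a feature name to a fixed-width integer (assumed hash: SHA-256 prefix)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_features: Dict[str, float], bits: int = 64) -> int:
    """SimHash: each feature votes +weight or -weight on every bit position."""
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    signature = 0
    for i, total in enumerate(totals):
        if total > 0:  # a positive vote total sets the bit
            signature |= 1 << i
    return signature


def signature_similarity(sig_a: int, sig_b: int, bits: int = 64) -> float:
    """Fraction of bit positions on which two signatures agree."""
    differing = bin(sig_a ^ sig_b).count("1")
    return 1.0 - differing / bits


def graph_features(
    edges: Iterable[Tuple[str, str]], vertex_weight: Dict[str, float]
) -> Dict[str, float]:
    """Hypothetical feature scheme: vertices and edges become weighted features."""
    features = {f"v:{u}": w for u, w in vertex_weight.items()}
    for u, v in edges:
        # Assumption: an edge inherits the weight of its source vertex.
        features[f"e:{u}->{v}"] = vertex_weight.get(u, 1.0)
    return features


# Two crawls of the same small site; the second crawl is missing one edge.
weights = {"a": 1.0, "b": 1.0, "c": 1.0}
crawl_1 = [("a", "b"), ("b", "c"), ("c", "a")]
crawl_2 = [("a", "b"), ("b", "c")]

sig_1 = simhash(graph_features(crawl_1, weights))
sig_2 = simhash(graph_features(crawl_2, weights))
print(signature_similarity(sig_1, sig_1))  # identical signatures give 1.0
print(signature_similarity(sig_1, sig_2))
```

In a setting like the one the abstract describes, a similarity score that falls below some chosen threshold between two consecutive crawls would flag the newer web graph as potentially anomalous. The threshold and the feature weighting are application choices that this sketch does not fix.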