Open Access Proceedings Article

Discovering large dense subgraphs in massive graphs

David Gibson, +2 more
- pp 721-732
TLDR
Dense subgraph extraction is proposed as a useful primitive for spam detection, and its incorporation into the workflow of web search engines is discussed.
Abstract
We present a new algorithm for finding large, dense subgraphs in massive graphs. Our algorithm is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources. We apply our algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web; this graph contains over 50M hosts and 11B edges, gathered from 2.1B web pages. We measure the distribution of these dense subgraphs and their evolution over time. We show that more than half of these hosts participate in some dense subgraph found by the analysis. There are several hundred giant dense subgraphs of at least ten thousand hosts; two thousand dense subgraphs of at least a thousand hosts; and almost 64K dense subgraphs of at least a hundred hosts. Upon examination, many of the dense subgraphs output by our algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. We therefore propose dense subgraph extraction as a useful primitive for spam detection, and discuss its incorporation into the workflow of web search engines.
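
The abstract names the technique (fingerprinting via shingles) but gives no parameters or implementation details. As a rough illustration only, and not the authors' implementation, the Python sketch below applies one pass of min-wise shingling to the out-link set of each node and groups nodes that share a fingerprint. The function names `shingles` and `group_by_shingle`, the parameters `s` (shingle size) and `c` (number of hash functions), and the use of BLAKE2 as the hash are all assumptions made for this example. The recursive step mentioned in the abstract is only indicated in a comment.

```python
import hashlib
from collections import defaultdict


def _hash(item, seed):
    """Deterministic 64-bit hash of an item under a seeded hash function (assumed choice: BLAKE2b)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(elements, s, c):
    """Return up to c fingerprints of a set.

    For each of c seeded hash functions, the fingerprint is the s elements
    with the smallest hash values. Sets with large overlap are likely to
    agree on at least one fingerprint.
    """
    elements = list(elements)
    if len(elements) < s:
        return set()
    result = set()
    for seed in range(c):
        smallest = sorted(elements, key=lambda e: _hash(e, seed))[:s]
        result.add((seed, tuple(sorted(smallest))))
    return result


def group_by_shingle(out_links, s=2, c=4):
    """Map each fingerprint to the nodes whose out-link sets produced it.

    Only fingerprints shared by at least two nodes are kept. A recursive
    variant would treat each node's fingerprint set as a new set and
    shingle it again; that second pass is omitted here.
    """
    groups = defaultdict(set)
    for node, targets in out_links.items():
        for sh in shingles(targets, s, c):
            groups[sh].add(node)
    return {sh: nodes for sh, nodes in groups.items() if len(nodes) > 1}


# Hypothetical toy graph: hosts "a" and "b" link to the same targets.
toy_graph = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"p", "q", "r"},
}
shared = group_by_shingle(toy_graph)
```

In this toy graph, `a` and `b` have identical out-link sets and therefore agree on every fingerprint, while `c` shares none with them. On real data the choice of `s` and `c` trades off sensitivity against the number of candidate groups.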


Citations
Journal Article

Community detection in Social Media

TL;DR: This survey first frames the concept of community and the problem of community detection in the context of Social Media, and provides a compact classification of existing algorithms based on their methodological principles, placing special emphasis on the performance of existing methods in terms of computational complexity and memory requirements.
Monograph

Social Media Mining: An Introduction

TL;DR: Social Media Mining introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining.
Journal Article

Efficient similarity joins for near-duplicate detection

TL;DR: This article proposes new filtering techniques that exploit token ordering information to drastically reduce candidate sizes, improving the efficiency of existing algorithms for finding pairs of records whose similarity is no less than a given threshold.
Book

Managing and Mining Graph Data

TL;DR: This is the first comprehensive survey book on the emerging topic of graph data processing and contains extensive surveys on important graph topics such as graph languages, indexing, clustering, data generation, pattern mining, classification, keyword search, pattern matching, and privacy.
Proceedings Article

Efficient similarity joins for near duplicate detection

TL;DR: This paper proposes new filtering techniques that exploit ordering information to drastically reduce candidate sizes and improve the efficiency of existing algorithms for finding pairs of records whose similarity is above a given threshold.
References
Journal Article

Graph structure in the Web

TL;DR: The study of the web as a graph yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution.
Journal Article

Syntactic clustering of the Web

TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Journal Article

The Space Complexity of Approximating the Frequency Moments

TL;DR: In this paper, the authors considered the space complexity of randomized algorithms that approximate the frequency moments of a sequence, where the elements of the sequence are given one by one and cannot be stored.
Proceedings Article

The space complexity of approximating the frequency moments

TL;DR: It turns out that the numbers $F_0$, $F_1$ and $F_2$ can be approximated in logarithmic space, whereas the approximation of $F_k$ for $k \ge 6$ requires $n^{\Omega(1)}$ space.