Discovering large dense subgraphs in massive graphs

Open Access Proceedings Article

- pp 721-732

TLDR

Dense subgraph extraction is proposed as a useful primitive for spam detection, and its incorporation into the workflow of web search engines is discussed.

Abstract:

We present a new algorithm for finding large, dense subgraphs in massive graphs. Our algorithm is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources. We apply our algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web; this graph contains over 50M hosts and 11B edges, gathered from 2.1B web pages. We measure the distribution of these dense subgraphs and their evolution over time. We show that more than half of these hosts participate in some dense subgraph found by the analysis. There are several hundred giant dense subgraphs of at least ten thousand hosts; two thousand dense subgraphs of at least a thousand hosts; and almost 64K dense subgraphs of at least a hundred hosts. Upon examination, many of the dense subgraphs output by our algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. We therefore propose dense subgraph extraction as a useful primitive for spam detection, and discuss its incorporation into the workflow of web search engines.
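
The abstract names the technique only as a recursive application of fingerprinting via shingles and gives no further detail. As a rough illustration of what that could look like, the sketch below assumes a min-hash style shingle: for each of `c` salted hash functions, take the `s` smallest out-links of a node and fingerprint them. Nodes sharing a shingle have overlapping link sets, and applying the same pass a second time groups shingles that share many nodes. The function names, the parameters `s` and `c`, and the use of BLAKE2 as the hash are assumptions made for this example, not the authors' implementation.

```python
import hashlib
from collections import defaultdict


def _h(salt, item):
    """Deterministic 64-bit hash of `item` under `salt` (BLAKE2 is an assumed choice)."""
    data = f"{salt}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(items, s, c):
    """Return up to c fingerprints of `items`.

    Each fingerprint covers the s elements that are smallest under one
    salted hash. Sets with fewer than s elements yield no shingles.
    """
    items = set(items)
    if len(items) < s:
        return set()
    out = set()
    for salt in range(c):
        smallest = sorted(items, key=lambda x: _h(salt, x))[:s]
        out.add(_h("fp", tuple(sorted(map(str, smallest)))))
    return out


def shingle_pass(adj, s, c):
    """Invert the graph: map each shingle to the set of nodes that produced it."""
    inverted = defaultdict(set)
    for node, nbrs in adj.items():
        for fp in shingles(nbrs, s, c):
            inverted[fp].add(node)
    return dict(inverted)


def dense_candidates(adj, s=2, c=4):
    """Two shingling passes (hypothetical simplification of the recursion).

    Pass 1 groups nodes by shared shingles of their out-links.
    Pass 2 groups those shingles by shared shingles of their node sets.
    Each resulting group is a candidate node set for a dense subgraph.
    """
    first = shingle_pass(adj, s, c)    # shingle -> nodes sharing it
    second = shingle_pass(first, s, c)  # meta-shingle -> first-level shingles
    groups = []
    for fps in second.values():
        groups.append(set().union(*(first[fp] for fp in fps)))
    return groups


# Toy example: five hosts that all link to the same five targets.
targets = {f"t{i}" for i in range(5)}
adj = {f"a{i}": set(targets) for i in range(5)}
candidates = dense_candidates(adj)
```

In this toy graph every source host has the same out-link set, so all of them produce identical shingles and every candidate group contains all five hosts. A real pipeline would still need to verify the density of each candidate and would stream the adjacency lists from disk rather than hold them in memory.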

Citations

Journal Article

Community detection in Social Media

Symeon Papadopoulos, +3 more

01 May 2012

Data Mining and Knowledge Discovery

TL;DR: This survey first frames the concept of community and the problem of community detection in the context of Social Media, and provides a compact classification of existing algorithms based on their methodological principles, placing special emphasis on the performance of existing methods in terms of computational complexity and memory requirements.

Monograph

Social Media Mining: An Introduction

Reza Zafarani, +2 more

TL;DR: Social Media Mining introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining.

Journal Article

Efficient similarity joins for near-duplicate detection

Chuan Xiao, +4 more

26 Aug 2011

ACM Transactions on Database Systems

TL;DR: This article proposes new filtering techniques that exploit token ordering information to drastically reduce candidate sizes, and hence improve the efficiency of existing algorithms for finding pairs of records whose similarity is no less than a given threshold.

Book

Managing and Mining Graph Data

Charu C. Aggarwal, +1 more

TL;DR: This is the first comprehensive survey book on the emerging topic of graph data processing, and it contains extensive surveys on important graph topics such as graph languages, indexing, clustering, data generation, pattern mining, classification, keyword search, pattern matching, and privacy.

Proceedings Article

Efficient similarity joins for near duplicate detection

Chuan Xiao, +3 more

TL;DR: This paper proposes new filtering techniques that exploit ordering information to drastically reduce candidate sizes and improve the efficiency of existing algorithms for finding pairs of records whose similarity is above a given threshold.

References

Proceedings Article

Fast Algorithms for Mining Association Rules in Large Databases

Rakesh Agrawal, +1 more

Journal Article

Graph structure in the Web

Andrei Z. Broder, +7 more

TL;DR: The study of the web as a graph yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution.

Journal Article

Syntactic clustering of the Web

Andrei Z. Broder, +3 more

TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.

Journal Article

The Space Complexity of Approximating the Frequency Moments

Noga Alon, +2 more

01 Feb 1999

Journal of Computer and System Sciences

TL;DR: In this paper, the authors considered the space complexity of randomized algorithms that approximate the frequency moments of a sequence, where the elements of the sequence are given one by one and cannot be stored.

Proceedings Article

The space complexity of approximating the frequency moments

Noga Alon, +2 more

TL;DR: It turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^Ω(1) space.

Related Papers (5)

Trawling the Web for emerging cyber-communities

Finding a Maximum Density Subgraph

Greedy approximation algorithms for finding dense components in a graph

The Dense k-Subgraph Problem

Syntactic clustering of the Web