scispace - formally typeset
Search or ask a question
Posted Content

Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions

TL;DR: The nucleus decomposition of a graph is defined, which represents the graph as a forest of nuclei, and provably efficient algorithms for nucleus decompositions are given, and empirically evaluate their behavior in a variety of real graphs.
Abstract: Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
Citations
More filters
Journal ArticleDOI
01 May 2017
TL;DR: This paper develops an efficient greedy algorithmic framework to iteratively remove nodes with the least popular attributes, and shrink the graph into an ATC, and builds an elegant index to maintain k-truss structure and attribute information, and proposes efficient query processing algorithms.
Abstract: Recently, community search over graphs has gained significant interest In applications such as analysis of protein-protein interaction (PPI) networks, citation graphs, and collaboration networks, nodes tend to have attributes Unfortunately, most previous community search algorithms ignore attributes and result in communities with poor cohesion wrt their node attributes In this paper, we study the problem of attribute-driven community search, that is, given an undirected graph G where nodes are associated with attributes, and an input query Q consisting of nodes Vq and attributes Wq, find the communities containing Vq, in which most community members are densely inter-connected and have similar attributes We formulate this problem as finding attributed truss communities (ATC), ie, finding connected and close k-truss subgraphs containing Vq, with the largest attribute relevance score We design a framework of desirable properties that good score function should satisfy We show that the problem is NP-hard However, we develop an efficient greedy algorithmic framework to iteratively remove nodes with the least popular attributes, and shrink the graph into an ATC In addition, we also build an elegant index to maintain k-truss structure and attribute information, and propose efficient query processing algorithms Extensive experiments on large real-world networks with ground-truth communities show that our algorithms significantly outperform the state of the art and demonstrates their efficiency and effectiveness

198 citations


Cites background from "Finding the Hierarchy of Dense Subg..."

  • ...Besides k-truss, there exist several other definitions of dense subgraphs including: k-(r,s)-nucleus [34], quasiclique [10], densest subgraph [37], and k-core [35]....

    [...]

Journal ArticleDOI
01 Aug 2017
TL;DR: A novel equivalence relation, k-truss equivalence, is introduced to model the intrinsic density and cohesiveness of edges in k- Truss communities and validate the efficiency and effectiveness of EquiTruss.
Abstract: We consider the community search problem defined upon a large graph G: given a query vertex q in G, to find as output all the densely connected subgraphs of G, each of which contains the query v. As an online, query-dependent variant of the well-known community detection problem, community search enables personalized community discovery that has found widely varying applications in real-world, large-scale graphs. In this paper, we study the community search problem in the truss-based model aimed at discovering all dense and cohesive k-truss communities to which the query vertex q belongs. We introduce a novel equivalence relation, k-truss equivalence, to model the intrinsic density and cohesiveness of edges in k-truss communities. Consequently, all the edges of G can be partitioned to a series of k-truss equivalence classes that constitute a space-efficient, truss-preserving index structure, EquiTruss. Community search can be henceforth addressed directly upon EquiTruss without repeated, time-demanding accesses to the original graph, G, which proves to be theoretically optimal. In addition, EquiTruss can be efficiently updated in a dynamic fashion when G evolves with edge insertion and deletion. Experimental studies in real-world, large-scale graphs validate the efficiency and effectiveness of EquiTruss, which has achieved at least an order of magnitude speedup in community search over the state-of-the-art method, TCP-Index.

127 citations


Cites background from "Finding the Hierarchy of Dense Subg..."

  • ...There has been a rich literature in modelling and quantification of dense and cohesive graphs, including clique or quasi-clique [6, 31], k-core [25, 1, 15], and nucleus [26, 27]....

    [...]

Proceedings ArticleDOI
10 Aug 2015
TL;DR: This paper presents a sampling scheme that gives densest subgraph sparsifier, yielding a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity, and devise an exact algorithm that can treat both clique and biclique densities in a unified way.
Abstract: Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of poly-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57]. When k=2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that gives densest subgraph sparsifier, yielding a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p,q)-biclique densest subgraph problem ((p,q)-Biclique-DSP), and devise an exact algorithm that can treat both clique and biclique densities in a unified way.As an example of performance, our sparsifying algorithm extracts the 5-clique densest subgraph --which is a large-near clique on 62 vertices-- from a large collaboration network. Our algorithm achieves 100% accuracy over five runs, while achieving an average speedup factor of over 10,000. Specifically, we reduce the running time from ∼2 107 seconds to an average running time of 0.15 seconds. We also use our methods to study how the k-clique densest subgraphs change as a function of time in time-evolving networks for various small values of k. We observe significant deviations between the experimental findings on real-world networks and stochastic Kronecker graphs, a random graph model that mimics real-world networks in certain aspects.We believe that our work is a significant advance in routines with rigorous theoretical guarantees for scalable extraction of large near-cliques from networks.

120 citations


Cites methods from "Finding the Hierarchy of Dense Subg..."

  • ...They are also used for expert team formation [19, 55], detecting link spam in Web graphs [32], graph compression [21], reachability and distance query indexing [36], insightful graph decompositions [51] and mining micro-blogging streams [9]....

    [...]

Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work revisits the iconic algorithm of Chiba and Nishizeki and develops the most efficient parallel algorithm for list all k-cliques in graphs containing up to tens of millions of edges, which is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism.
Abstract: Motivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all $10$-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs.

109 citations


Cites methods from "Finding the Hierarchy of Dense Subg..."

  • ...In [46] an algorithm for organizing cliques into hierarchical structures is presented, which requires to list allk-cliques....

    [...]

Posted Content
TL;DR: It is proved that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5-vertex patterns, the first practical algorithm for 5- Vertex pattern counting that runs at this scale and is able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine.
Abstract: Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$-vertex patterns. This algorithm is typically ten times faster than the state of the art $4$-vertex counters.

109 citations

References
More filters
Journal ArticleDOI
04 Jun 1998-Nature
TL;DR: Simple models of networks that can be tuned through this middle ground: regular networks ‘rewired’ to introduce increasing amounts of disorder are explored, finding that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs.
Abstract: Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self-organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes. Here we explore simple models of networks that can be tuned through this middle ground: regular networks 'rewired' to introduce increasing amounts of disorder. We find that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. We call them 'small-world' networks, by analogy with the small-world phenomenon (popularly known as six degrees of separation. The neural network of the worm Caenorhabditis elegans, the power grid of the western United States, and the collaboration graph of film actors are shown to be small-world networks. Models of dynamical systems with small-world coupling display enhanced signal-propagation speed, computational power, and synchronizability. In particular, infectious diseases spread more easily in small-world networks than in regular lattices.

39,297 citations


"Finding the Hierarchy of Dense Subg..." refers background in this paper

  • ...The classic notions of transitivity [47] and clustering coefficients [48] measure these densities, and are high for many real-world graphs [35, 40]....

    [...]

Book
25 Nov 1994
TL;DR: This paper presents mathematical representation of social networks in the social and behavioral sciences through the lens of Dyadic and Triadic Interaction Models, which describes the relationships between actor and group measures and the structure of networks.
Abstract: Part I. Introduction: Networks, Relations, and Structure: 1. Relations and networks in the social and behavioral sciences 2. Social network data: collection and application Part II. Mathematical Representations of Social Networks: 3. Notation 4. Graphs and matrixes Part III. Structural and Locational Properties: 5. Centrality, prestige, and related actor and group measures 6. Structural balance, clusterability, and transitivity 7. Cohesive subgroups 8. Affiliations, co-memberships, and overlapping subgroups Part IV. Roles and Positions: 9. Structural equivalence 10. Blockmodels 11. Relational algebras 12. Network positions and roles Part V. Dyadic and Triadic Methods: 13. Dyads 14. Triads Part VI. Statistical Dyadic Interaction Models: 15. Statistical analysis of single relational networks 16. Stochastic blockmodels and goodness-of-fit indices Part VII. Epilogue: 17. Future directions.

17,104 citations

Journal ArticleDOI
TL;DR: This work characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links that connect them.
Abstract: Social Network Analysis Methods And Social network analysis (SNA) is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network ...

12,634 citations


"Finding the Hierarchy of Dense Subg..." refers background in this paper

  • ...The classic notions of transitivity [47] and clustering coefficients [48] measure these densities, and are high for many real-world graphs [35, 40]....

    [...]

Journal ArticleDOI
TL;DR: A general framework for `soft' thresholding that assigns a connection weight to each gene pair is described and several node connectivity measures are introduced and provided empirical evidence that they can be important for predicting the biological significance of a gene.
Abstract: Gene co-expression networks are increasingly used to explore the system-level functionality of genes. The network construction is conceptually straightforward: nodes represent genes and nodes are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. In reality, it is tricky to define the connections between the nodes in such networks. An important question is whether it is biologically meaningful to encode gene co-expression using binary information (connected=1, unconnected=0). We describe a general framework for ;soft' thresholding that assigns a connection weight to each gene pair. This leads us to define the notion of a weighted gene co-expression network. For soft thresholding we propose several adjacency functions that convert the co-expression measure to a connection weight. For determining the parameters of the adjacency function, we propose a biologically motivated criterion (referred to as the scale-free topology criterion). We generalize the following important network concepts to the case of weighted networks. First, we introduce several node connectivity measures and provide empirical evidence that they can be important for predicting the biological significance of a gene. Second, we provide theoretical and empirical evidence that the ;weighted' topological overlap measure (used to define gene modules) leads to more cohesive modules than its ;unweighted' counterpart. Third, we generalize the clustering coefficient to weighted networks. Unlike the unweighted clustering coefficient, the weighted clustering coefficient is not inversely related to the connectivity. We provide a model that shows how an inverse relationship between clustering coefficient and connectivity arises from hard thresholding. We apply our methods to simulated data, a cancer microarray data set, and a yeast microarray data set.

4,448 citations


"Finding the Hierarchy of Dense Subg..." refers methods in this paper

  • ...It has been used for finding communities and spam link farms in web graphs [29, 20, 13], graph visualization [2], real-time story identification [4], DNA motif detection in biological networks [18], finding correlated genes [49], epilepsy prediction [26], finding price value motifs in financial data [14], graph compression [8], distance query indexing [27], and increasing the throughput of social networking site servers [21]....

    [...]

Journal ArticleDOI
TL;DR: The University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications, is described and a new multilevel coarsening scheme is proposed to facilitate this task.
Abstract: We describe the University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms It allows for robust and repeatable experiments: robust because performance results with artificially generated matrices can be misleading, and repeatable because matrices are curated and made publicly available in many formats Its matrices cover a wide spectrum of domains, include those arising from problems with underlying 2D or 3D geometry (as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs) We provide software for accessing and managing the Collection, from MATLAB™, Mathematica™, Fortran, and C, as well as an online search capability Graph visualization of the matrices is provided, and a new multilevel coarsening scheme is proposed to facilitate this task

3,456 citations