# SSumM: Sparse Summarization of Massive Graphs

23 Aug 2020-pp 144-154

TL;DR: SSumM is a scalable and effective graph-summarization algorithm that yields a sparse summary graph that not only merges nodes together but also sparsifies the summary graph, and the two strategies are carefully balanced based on the minimum description length principle.

Abstract: Given a graph G and the desired size k in bits, how can we summarize G within k bits, while minimizing the information loss? Large-scale graphs have become omnipresent, posing considerable computational challenges. Analyzing such large graphs can be fast and easy if they are compressed sufficiently to fit in main memory or even cache. Graph summarization, which yields a coarse-grained summary graph with merged nodes, stands out with several advantages among graph compression techniques. Thus, a number of algorithms have been developed for obtaining a concise summary graph with little information loss or equivalently small reconstruction error. However, the existing methods focus solely on reducing the number of nodes, and they often yield dense summary graphs, failing to achieve better compression rates. Moreover, due to their limited scalability, they can be applied only to moderate-size graphs. In this work, we propose SSumM, a scalable and effective graph-summarization algorithm that yields a sparse summary graph. SSumM not only merges nodes together but also sparsifies the summary graph, and the two strategies are carefully balanced based on the minimum description length principle. Compared with state-of-the-art competitors, SSumM is (a) Concise: yields up to 11.2X smaller summary graphs with similar reconstruction error, (b) Accurate: achieves up to 4.2X smaller reconstruction error with similarly concise outputs, and (c) Scalable: summarizes 26X larger graphs while exhibiting linear scalability. We validate these advantages through extensive experiments on 10 real-world graphs.

##### Citations

More filters

••

KAIST

^{1}TL;DR: MoSSo as discussed by the authors is the first incremental algorithm for lossless summarization of fully dynamic graphs, which updates the output representation by repeatedly moving nodes among supernodes and edges.

Abstract: Given a fully dynamic graph, represented as a stream of edge insertions and deletions, how can we obtain and incrementally update a lossless summary of its current snapshot? As large-scale graphs are prevalent, concisely representing them is inevitable for efficient storage and analysis. Lossless graph summarization is an effective graph-compression technique with many desirable properties. It aims to compactly represent the input graph as (a) a summary graph consisting of supernodes (i.e., sets of nodes) and superedges (i.e., edges between supernodes), which provide a rough description, and (b) edge corrections which fix errors induced by the rough description. While a number of batch algorithms, suited for static graphs, have been developed for rapid and compact graph summarization, they are highly inefficient in terms of time and space for dynamic graphs, which are common in practice. In this work, we propose MoSSo, the first incremental algorithm for lossless summarization of fully dynamic graphs. In response to each change in the input graph, MoSSo updates the output representation by repeatedly moving nodes among supernodes. MoSSo decides nodes to be moved and their destinations carefully but rapidly based on several novel ideas. Through extensive experiments on 10 real graphs, we show MoSSo is (a) Fast and 'any time': processing each change in near-constant time (less than 0.1 millisecond), up to 7 orders of magnitude faster than running state-of-the-art batch methods, (b) Scalable: summarizing graphs with hundreds of millions of edges, requiring sub-linear memory during the process, and (c) Effective: achieving comparable compression ratios even to state-of-the-art batch methods.

26 citations

•

TL;DR: The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns, and methods for mining various types of data and patterns are reviewed.

Abstract: This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.

20 citations

••

09 Jun 2021TL;DR: LDME as discussed by the authors is a correction set based graph summarization algorithm that produces compact output representations in a fast and scalable manner by using weighted locality sensitive hashing to reduce the number comparisons required to find good node merges.

Abstract: Summarizing graphs is of paramount importance due to diverse applications of large-scale graph analysis. A popular family of summarization methods is the group-based approach. The general idea consists of merging nodes of the original graph into supernodes of the summary graph, encoding original edges into superedges/correction set edges, and dropping certain superedges or correction set edges (for lossy summarization). The current state of the art has several steps in its computation that are serious bottlenecks in terms of running time and scalability. In this work, we propose algorithm LDME, a correction set based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number comparisons required to find good node merges, (2) an efficient way to compute the best quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. More interestingly, our algorithm provides performance tuning settings to allow the option of trading compression for running time. On high compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression.

9 citations

••

5 citations

••

TL;DR: In this paper , a polynomial-time approximation algorithm based on the k-Median clustering for lossless single-relation graph summarization was proposed for multi-relation graphs, where multiple edges of different types may exist between any pair of nodes.

Abstract: Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, and graph processing in modern hardware. However, the bulk of the literature on graph summarization surprisingly overlooks the possibility of having edges of different types. In this paper, we study the novel problem of producing summaries of multi-relation networks, i.e., graphs where multiple edges of different types may exist between any pair of nodes. Multi-relation graphs are an expressive model of real-world activities, in which a relation can be a topic in social networks, an interaction type in genetic networks, or a snapshot in temporal graphs. The first approach that we consider for multi-relation graph summarization is a two-step method based on summarizing each relation in isolation, and then aggregating the resulting summaries in some clever way to produce a final unique summary. In doing this, as a side contribution, we provide the first polynomial-time approximation algorithm based on the k-Median clustering for the classic problem of lossless single-relation graph summarization. Then, we demonstrate the shortcomings of these two-step methods, and propose holistic approaches, both approximate and heuristic algorithms, to compute a summary directly for multi-relation graphs. In particular, we prove that the approximation bound of k-Median clustering for the single relation solution can be maintained in a multi-relation graph with proper aggregation operation over adjacency matrices corresponding to its multiple relations. Experimental results and case studies (on co-authorship networks and brain networks) validate the effectiveness and efficiency of the proposed algorithms.

4 citations

##### References

More filters

••

TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.

Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

••

TL;DR: There is a deep and useful connection between statistical mechanics and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters), and a detailed analogy with annealing in solids provides a framework for optimization of very large and complex systems.

Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

41,772 citations

•

11 Nov 1999

TL;DR: This paper describes PageRank, a mathod for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

Abstract: The importance of a Web page is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a mathod for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.

14,400 citations

### "SSumM: Sparse Summarization of Mass..." refers background in this paper

...For example, [3, 19, 28] compute adjacency queries, PageRank [27], and triangle density [36] directly from summary graphs, without restoring the original graph....

[...]

••

IBM

^{1}TL;DR: The number of digits it takes to write down an observed sequence x1,...,xN of a time series depends on the model with its parameters that one assumes to have generated the observed data.

6,254 citations

••

21 Aug 2005TL;DR: A new graph generator is provided, based on a "forest fire" spreading process, that has a simple, intuitive justification, requires very few parameters (like the "flammability" of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the present study.

Abstract: How do real graphs evolve over time? What are "normal" growth patterns in social, technological, and information networks? Many studies have discovered patterns in static graphs, identifying properties in a single snapshot of a large network, or in a very small number of snapshots; these include heavy tails for in- and out-degree distributions, communities, small-world phenomena, and others. However, given the lack of information about network evolution over long periods, it has been hard to convert these findings into statements about trends over time.Here we study a wide range of real graphs, and we observe some surprising phenomena. First, most of these graphs densify over time, with the number of edges growing super-linearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes (like O(log n) or O(log(log n)).Existing graph generation models do not exhibit these types of behavior, even at a qualitative level. We provide a new graph generator, based on a "forest fire" spreading process, that has a simple, intuitive justification, requires very few parameters (like the "flammability" of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the present study.

2,548 citations