Showing papers in "Internet Mathematics in 2004"


Journal ArticleDOI
TL;DR: This paper surveys the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

2,199 citations
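
For concreteness, here is a minimal Python sketch of the Bloom filter construction described above. The sizing formulas (m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions) are the standard ones; deriving the k hash positions from a single SHA-256 digest via double hashing is an illustrative choice, not something prescribed by the survey.

# Minimal Bloom filter sketch (illustrative; not code from the survey).
import hashlib
import math


class BloomFilter:
    def __init__(self, n_items, fp_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: k positions derived from one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(n_items=1000, fp_rate=0.01)
for word in ("alice", "bob", "carol"):
    bf.add(word)
print("alice" in bf)    # True: members are always reported as present
print("mallory" in bf)  # usually False; a false positive occurs with probability ~1%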


Journal ArticleDOI
TL;DR: A rich and long history is traced of how lognormal distributions have arisen as a possible alternative to power law distributions across many fields, with a focus on the underlying generative models that lead to these distributions.
Abstract: Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying to learn enough about these distributions to settle the question, I found a rich and long history, spanning many fields. Indeed, several recently proposed models from the computer science community have antecedents in work from decades ago. Here, I briefly survey some of this history, focusing on underlying generative models that lead to these distributions. One finding is that lognormal and power law distributions connect quite naturally, and hence, it is not surprising that lognormal distributions have arisen as a possible alternative to power law distributions across many fields.

1,787 citations
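
As a tiny illustration of the generative story the survey traces: a purely multiplicative process produces a lognormal, because the logarithm of the product is a sum of i.i.d. terms and the central limit theorem applies. The sketch below (numpy, arbitrary parameters) checks this numerically with uniform multiplicative factors.

# Sketch: X_t = F_1 * F_2 * ... * F_t with i.i.d. positive factors is
# approximately lognormal, since log X_t is a sum of i.i.d. terms (CLT).
# Parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
samples, steps = 100_000, 50
factors = rng.uniform(0.7, 1.4, size=(samples, steps))
log_x = np.log(factors).sum(axis=1)

# log X is close to normal even though each factor is uniform:
skew = np.mean((log_x - log_x.mean()) ** 3) / log_x.std() ** 3
print(f"mean={log_x.mean():.3f}  std={log_x.std():.3f}  skewness={skew:.3f} (about 0 for a normal)")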


Journal ArticleDOI
TL;DR: A comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, and suggested alternatives to the traditional solution methods.
Abstract: This paper serves as a companion or extension to the "Inside PageRank" paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research.

910 citations
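
For reference alongside the survey, here is a minimal power-iteration PageRank in Python. The damping factor 0.85 and the uniform patch for dangling pages are conventional choices, not recommendations taken from the paper.

# Minimal PageRank by power iteration (illustrative sketch).
import numpy as np


def pagerank(adj, damping=0.85, tol=1e-12, max_iter=200):
    """adj[i][j] = 1 if page i links to page j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=1)

    # Row-stochastic matrix; dangling pages link uniformly to everything.
    P = np.empty((n, n))
    for i in range(n):
        P[i] = A[i] / out_deg[i] if out_deg[i] > 0 else 1.0 / n

    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = damping * (pi @ P) + (1.0 - damping) / n
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi


# Tiny 4-page web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {2}.
adj = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0]]
print(pagerank(adj).round(3))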


Journal ArticleDOI
TL;DR: The clustering algorithms satisfy strong theoretical criteria and perform well in practice, and it is shown that the quality of the produced clusters is bounded by strong minimum cut and expansion criteria.
Abstract: In this paper, we introduce simple graph clustering methods based on minimum cuts within the graph. The clustering methods are general enough to apply to any kind of graph but are well suited for graphs where the link structure implies a notion of reference, similarity, or endorsement, such as web and citation graphs. We show that the quality of the produced clusters is bounded by strong minimum cut and expansion criteria. We also develop a framework for hierarchical clustering and present applications to real-world data. We conclude that the clustering algorithms satisfy strong theoretical criteria and perform well in practice.

380 citations
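
The clustering idea is easy to experiment with. The sketch below does recursive bipartitioning by global minimum cuts using networkx's Stoer-Wagner routine and a crude expansion-style stopping rule; it is a simplified stand-in for illustration, not the authors' exact procedure.

# Rough sketch: recursive bipartitioning by global minimum cut, stopping when
# the cheapest cut is expensive relative to the smaller side (a crude
# expansion-style criterion). A simplified stand-in, not the paper's algorithm.
import networkx as nx


def min_cut_clusters(G, min_size=3):
    if G.number_of_nodes() <= min_size:
        return [set(G.nodes())]
    if not nx.is_connected(G):
        return [set(c) for c in nx.connected_components(G)]
    cut_value, (part_a, part_b) = nx.stoer_wagner(G)
    if cut_value >= min(len(part_a), len(part_b)):
        return [set(G.nodes())]          # no cheap cut: keep as one cluster
    clusters = []
    for part in (part_a, part_b):
        clusters.extend(min_cut_clusters(G.subgraph(part).copy(), min_size))
    return clusters


# Two 5-cliques joined by a single edge should come apart into two clusters.
G = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
G.add_edge(0, 5)
nx.set_edge_attributes(G, 1, "weight")   # Stoer-Wagner reads the "weight" attribute
print(min_cut_clusters(G))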


Journal ArticleDOI
TL;DR: It is shown that for certain families of random graphs with given expected degrees, the average distance is almost surely of order log n / log d̃, where d̃ is the weighted average of the sum of squares of the expected degrees.

Abstract: Random graph theory is used to examine the "small-world phenomenon" – any two strangers are connected through a short chain of mutual acquaintances. We will show that for certain families of random graphs with given expected degrees, the average distance is almost surely of order log n / log d̃, where d̃ is the weighted average of the sum of squares of the expected degrees. Of particular interest are power law random graphs in which the number of vertices of degree k is proportional to 1/k^β for some fixed exponent β. For the case of β > 3, we prove that the average distance of the power law graphs is almost surely of order log n / log d̃. However, many Internet, social, and citation networks are power law graphs with exponents in the range 2 < β < 3, for which the power law random graphs have average distance almost surely of order log log n, but have diameter of order log n (provided some mild constraints on the average distance and the maximum degree). In particular, these graphs contain a dense subgraph...

370 citations
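
To make the log n / log d̃ formula concrete, the sketch below builds a random graph with a given expected degree sequence (networkx's expected_degree_graph, a Chung-Lu style generator), computes d̃ as the sum of squared expected degrees divided by their sum, and compares the prediction with the measured average distance on the giant component. The expected-degree sequence is an arbitrary illustrative choice, so only order-of-magnitude agreement is expected.

# Numerical illustration of average distance ~ log n / log d~, where
# d~ = (sum of squared expected degrees) / (sum of expected degrees).
import math

import networkx as nx

n = 1000
w = [20.0] * (n // 4) + [3.0] * (3 * n // 4)         # expected degrees (arbitrary)

d_tilde = sum(x * x for x in w) / sum(w)              # second-order average degree
predicted = math.log(n) / math.log(d_tilde)

G = nx.expected_degree_graph(w, seed=1, selfloops=False)
giant = G.subgraph(max(nx.connected_components(G), key=len))
measured = nx.average_shortest_path_length(giant)

print(f"d~ = {d_tilde:.2f}   predicted order log n / log d~ = {predicted:.2f}   measured = {measured:.2f}")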


Journal ArticleDOI
TL;DR: Two basic characteristics of the LCD model are considered mathematically, namely robustness to random damage and vulnerability to malicious attack; it is shown that the LCD graph is much more robust than classical random graphs with the same number of edges, but also more vulnerable to attack.

Abstract: Recently many new "scale-free" random graph models have been introduced, motivated by the power-law degree sequences observed in many large-scale, real-world networks. Perhaps the best known, the Barabási-Albert model, has been extensively studied from heuristic and experimental points of view. Here we consider mathematically two basic characteristics of a precise version of this model, the LCD model, namely robustness to random damage, and vulnerability to malicious attack. We show that the LCD graph is much more robust than classical random graphs with the same number of edges, but also more vulnerable to attack. In particular, if vertices of the n-vertex LCD graph are deleted at random, then as long as any positive proportion remains, the graph induced on the remaining vertices has a component of order n. In contrast, if the deleted vertices are chosen maliciously, a constant fraction less than 1 can be deleted to destroy all large components. For the Barabási-Albert model, these questions have been st...

310 citations
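
The robustness/vulnerability contrast is easy to probe empirically. The sketch below uses networkx's Barabási-Albert generator as a stand-in for the LCD model (closely related, but not identical) and compares the giant component after deleting a fraction of vertices at random versus deleting the highest-degree vertices; all parameter values are arbitrary.

# Sketch: random deletion vs. targeted (highest-degree) deletion.
import random

import networkx as nx

random.seed(2)
n, m = 5000, 2
G = nx.barabasi_albert_graph(n, m, seed=2)


def giant_fraction(H):
    if H.number_of_nodes() == 0:
        return 0.0
    return max(len(c) for c in nx.connected_components(H)) / n


for frac in (0.5, 0.8):
    k = int(frac * n)
    random_victims = random.sample(list(G.nodes()), k)
    targeted_victims = sorted(G.nodes(), key=G.degree, reverse=True)[:k]

    H_random = G.copy()
    H_random.remove_nodes_from(random_victims)
    H_attack = G.copy()
    H_attack.remove_nodes_from(targeted_victims)

    print(f"delete {frac:.0%}:  random damage -> giant {giant_fraction(H_random):.3f},"
          f"  targeted attack -> giant {giant_fraction(H_attack):.3f}")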


Journal ArticleDOI
TL;DR: This work devises a version of randomized rounding that is incentive compatible, giving a truthful mechanism for combinatorial auctions with single parameter agents (e.g., "single minded bidders") that approximately maximizes the social value of the auction.

Abstract: Mechanism design seeks algorithms whose inputs are provided by selfish agents who would lie if it were to their advantage. Incentive-compatible mechanisms compel the agents to tell the truth by making it in their self-interest to do so. Often, as in combinatorial auctions, such mechanisms involve the solution of NP-hard problems. Unfortunately, approximation algorithms typically destroy incentive compatibility. Randomized rounding is a commonly used technique for designing approximation algorithms. We devise a version of randomized rounding that is incentive-compatible, giving a truthful mechanism for combinatorial auctions with single parameter agents (e.g., "single minded bidders") that approximately maximizes the social value of the auction. We discuss two orthogonal notions of truthfulness for a randomized mechanism–truthfulness with high probability and in expectation–and give a mechanism that achieves both simultaneously. We consider combinatorial auctions where multiple copies of many different item...

252 citations
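
Randomized rounding itself (solve a fractional relaxation, then include each item randomly with probability tied to its fractional value) is easy to demonstrate outside the auction setting. The sketch below applies the classic recipe to a tiny set-cover LP via scipy; it illustrates only the generic technique, not the paper's incentive-compatible mechanism, and the instance is made up for illustration.

# Generic randomized-rounding sketch on a tiny set-cover instance.
# NOT the paper's truthful auction mechanism; illustration of the technique only.
import math

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

universe = range(6)
sets = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}, {1, 4}]
costs = np.array([3.0, 1.0, 3.0, 2.0, 2.0])

# LP relaxation: minimize total cost subject to covering every element.
A = np.array([[1.0 if e in s else 0.0 for s in sets] for e in universe])
res = linprog(costs, A_ub=-A, b_ub=-np.ones(len(universe)), bounds=(0, 1))
x = res.x

# Round: include set j with probability min(1, x_j * ln|U|); repeat a few
# times and keep the cheapest feasible cover found.
best = None
for _ in range(20):
    picked = [j for j in range(len(sets))
              if rng.random() < min(1.0, x[j] * math.log(len(universe)))]
    covered = set().union(*(sets[j] for j in picked)) if picked else set()
    if covered >= set(universe):
        cost = costs[picked].sum()
        if best is None or cost < best[0]:
            best = (cost, picked)

print("LP value:", round(res.fun, 3), " best rounded cover:", best)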


Journal ArticleDOI
TL;DR: It is shown that (under certain conditions) the eigenvalues of the (normalized) Laplacian of a random power law graph follow the semicircle law while the spectrum of the adjacency matrix of a power law graph obeys the power law.

Abstract: In the study of the spectra of power law graphs, there are basically two competing approaches. One is to prove analogues of Wigner's semicircle law while the other predicts that the eigenvalues follow a power law distribution. Although the semicircle law and the power law have nothing in common, we will show that both approaches are essentially correct if one considers the appropriate matrices. We will show that (under certain conditions) the eigenvalues of the (normalized) Laplacian of a random power law graph follow the semicircle law while the spectrum of the adjacency matrix of a power law graph obeys the power law. Our results are based on the analysis of random graphs with given expected degrees and their relations to several key invariants. Of interest are a number of (new) values for the exponent β where phase transitions for eigenvalue distributions occur. The spectrum distributions have direct implications to numerous graph algorithms such as randomized algorithms that involve rapidly mixing Ma...

224 citations
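
The dichotomy can be seen numerically on a modest random graph with given expected degrees: the largest adjacency eigenvalues stick far out of the bulk, while the normalized Laplacian spectrum concentrates around 1. The sketch below uses networkx and numpy with an arbitrary power-law-ish weight sequence; it only illustrates the phenomenon, not the paper's proofs.

# Sketch: compare the adjacency spectrum with the normalized Laplacian
# spectrum for a random graph with given expected degrees.
import numpy as np
import networkx as nx

n, beta, i0 = 800, 2.5, 10
raw = [(i + i0) ** (-1.0 / (beta - 1)) for i in range(n)]
scale = 6.0 * n / sum(raw)                     # average expected degree about 6
w = [scale * r for r in raw]

G = nx.expected_degree_graph(w, seed=4, selfloops=False)
G.remove_nodes_from(list(nx.isolates(G)))      # avoid zero-degree vertices

A = nx.to_numpy_array(G)
L = nx.normalized_laplacian_matrix(G).toarray()
adj_eigs = np.sort(np.linalg.eigvalsh(A))
lap_eigs = np.linalg.eigvalsh(L)

print("largest adjacency eigenvalues (outliers tied to the largest degrees):", adj_eigs[-5:].round(2))
print("normalized Laplacian percentiles (bulk near 1):",
      np.percentile(lap_eigs, [5, 25, 50, 75, 95]).round(2))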


Journal ArticleDOI
TL;DR: The Recursive Forest File model, a new, dynamic generative user model that combines multiplicative models that generate lognormal distributions with recent work on random graph models for the web, explains the behavior of file size distributions, and may be useful for describing other power law phenomena in computer systems as well as other fields.
Abstract: In this paper, we introduce and analyze a new, dynamic generative user model to explain the behavior of file size distributions. Our Recursive Forest File model combines multiplicative models that generate lognormal distributions with recent work on random graph models for the web. Unlike similar previous work, our Recursive Forest File model allows new files to be created and old files to be deleted over time, and our analysis covers problematic issues such as correlation among file sizes. Moreover, our model allows natural variations where files that are copied or modified are more likely to be copied or modified subsequently. Previous empirical work suggests that file sizes tend to have a lognormal body but a Pareto tail. The Recursive Forest File model explains this behavior, yielding a double Pareto distribution, which has a Pareto tail but a body close to a lognormal. We believe the Recursive Forest model may be useful for describing other power law phenomena in computer systems as well as other fields.

152 citations
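
A stripped-down version of the underlying recipe (not the paper's exact Recursive Forest File process): run a multiplicative process for an exponentially distributed amount of time, so each size is lognormal given its age, while the mixture over ages develops Pareto-like tails. All parameters below are arbitrary.

# Sketch of the double Pareto recipe: log-size evolves like a random walk with
# drift, observed after an exponentially distributed age, giving a lognormal
# body with Pareto-like tails. Not the paper's exact model; parameters arbitrary.
import numpy as np

rng = np.random.default_rng(5)
n_files = 200_000
ages = rng.exponential(scale=20.0, size=n_files)
log_sizes = 8.0 + 0.1 * ages + 0.6 * np.sqrt(ages) * rng.standard_normal(n_files)
sizes = np.exp(log_sizes)

# The complementary CDF looks lognormal-shaped in the body but close to a
# straight line (power law) in the upper tail on log-log axes.
for q in (0.50, 0.90, 0.99, 0.999, 0.9999):
    print(f"P(size > {np.quantile(sizes, q):>14,.0f}) = {1 - q:g}")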


Journal ArticleDOI
TL;DR: A rigorous sensitivity analysis of Markov chains motivates a very efficient algorithm that incrementally computes good approximations to Google's PageRank as links evolve.
Abstract: We anticipate that future web search techniques will exploit changes in web structure and content. As a first step in this direction, we examine the problem of integrating observed changes in link structure into static hyperlink-based ranking computations. We present a very efficient algorithm to incrementally compute good approximations to Google's PageRank [Brin and Page 98], as links evolve. Our experiments reveal that this algorithm is both fast and yields excellent approximations to PageRank, even in light of large changes to the link structure. Our algorithm derives intuition and partial justification from a rigorous sensitivity analysis of Markov chains. Consider a regular Markov chain with stationary probability π, and suppose the transition probability into a state j is increased. We prove that this can only cause:
• π_j to increase – adding a link to a site can only cause the stationary probability of the target site to increase;
• the rank of j to improve – if the states are ordered according to thei...

97 citations
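
The monotonicity claim can be sanity-checked numerically: boost the transition probability into a state j from some source state (scaling the rest of that row down) and recompute the stationary distribution. The sketch below does this for a small random positive chain; it illustrates the statement, not the paper's incremental algorithm, and the chain is made up.

# Numerical sanity check: increasing the transition probability into state j
# (scaling the rest of that row down) should not decrease pi_j.
import numpy as np

rng = np.random.default_rng(6)


def stationary(P):
    """Stationary distribution of a regular row-stochastic matrix."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


n, j, source = 6, 2, 0
P = rng.random((n, n)) + 0.05            # strictly positive, hence regular
P /= P.sum(axis=1, keepdims=True)
pi_before = stationary(P)

Q = P.copy()
Q[source, j] += 0.3                      # more probability into state j ...
Q[source] /= Q[source].sum()             # ... the rest of the row scaled down
pi_after = stationary(Q)

print(f"pi_j before = {pi_before[j]:.4f}, after = {pi_after[j]:.4f} "
      "(per the result above, the second value should not be smaller)")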


Journal ArticleDOI
TL;DR: It is shown that for large k and t, the expected number of vertices of degree k is approximately d_k t, where as k → ∞, d_k ~ C k^(-1-β) for some constant C > 0.

Abstract: We study a dynamically evolving random graph which adds vertices and edges using preferential attachment and deletes vertices randomly. At time t, with probability α_1 > 0 we add a new vertex u_t and m random edges incident with u_t. The neighbours of u_t are chosen with probability proportional to degree. With probability α − α_1 ≥ 0 we add m random edges to existing vertices where the endpoints are chosen with probability proportional to degree. With probability 1 − α − α_0 we delete a random vertex, if there are vertices left to delete. With probability α_0 we delete m random edges. Assuming that α + α_1 + α_0 > 1 and α_0 is sufficiently small, we show that for large k and t, the expected number of vertices of degree k is approximately d_k t, where as k → ∞, d_k ~ C k^(-1-β) for a constant C > 0 and an exponent β determined by the model parameters. Note that β can take any value greater than 1.
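
The process is straightforward to simulate. The sketch below follows the described steps (probability α_1: add a vertex with m preferential edges; α − α_1: add m preferential edges; 1 − α − α_0: delete a random vertex; α_0: delete m random edges) with small arbitrary parameter values and an unoptimized preferential selection; it is an illustration, not the authors' code or analysis.

# Simulation sketch of the growth-deletion process described above
# (arbitrary small parameters; unoptimized; not the authors' code).
import random
from collections import Counter

import networkx as nx

random.seed(7)
m = 2
alpha, alpha1, alpha0 = 0.85, 0.60, 0.05
# add vertex: alpha1; add edges: alpha - alpha1; delete vertex: 1 - alpha - alpha0;
# delete edges: alpha0.  Here alpha + alpha1 + alpha0 = 1.5 > 1.

G = nx.complete_graph(m + 1)                 # small seed graph
next_label = m + 1


def preferential(G, k):
    """k endpoints chosen with probability proportional to degree."""
    nodes = list(G.nodes())
    weights = [G.degree(v) for v in nodes]
    if sum(weights) == 0:                    # safeguard if no edges are left
        return random.choices(nodes, k=k)
    return random.choices(nodes, weights=weights, k=k)


for _ in range(5000):
    r = random.random()
    if r < alpha1:                           # new vertex u_t with m preferential edges
        G.add_edges_from((next_label, t) for t in preferential(G, m))
        next_label += 1
    elif r < alpha:                          # m preferential edges among existing vertices
        ends = preferential(G, 2 * m)
        G.add_edges_from(zip(ends[::2], ends[1::2]))
    elif r < 1 - alpha0:                     # delete a random vertex
        if G.number_of_nodes() > 1:
            G.remove_node(random.choice(list(G.nodes())))
    else:                                    # delete m random edges
        if G.number_of_edges() >= m:
            G.remove_edges_from(random.sample(list(G.edges()), m))

hist = Counter(d for _, d in G.degree())
print("vertices:", G.number_of_nodes(), " edges:", G.number_of_edges())
print("degree counts for small k:", {k: hist[k] for k in sorted(hist)[:8]})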

Journal ArticleDOI
TL;DR: Six algorithmic problems that arise in web search engines and that are unsolved or only partially solved are described: uniformly sampling web pages; modeling the web graph; finding duplicate hosts; finding top gainers and losers in data streams; finding large dense bipartite graphs; and understanding how eigenvectors partition the web.

Abstract: In this paper, we describe six algorithmic problems that arise in web search engines and that are unsolved or only partially solved: (1) uniformly sampling web pages; (2) modeling the web graph; (3) finding duplicate hosts; (4) finding top gainers and losers in data streams; (5) finding large dense bipartite graphs; and (6) understanding how eigenvectors partition the web.

Journal ArticleDOI
TL;DR: A coupling technique for analyzing online models by using offline models that is especially effective for a growth-deletion model that generalizes and includes the preferential attachment model for generating large complex networks which simulate numerous realistic networks.
Abstract: We develop a coupling technique for analyzing online models by using offline models. This method is especially effective for a growth-deletion model that generalizes and includes the preferential attachment model for generating large complex networks which simulate numerous realistic networks. By coupling the online model with the offline model for random power law graphs, we derive strong bounds for a number of graph properties including diameter, average distances, connected components, and spectral bounds. For example, we prove that a power law graph generated by the growth-deletion model almost surely has diameter O(log n) and average distance O(log log n).

Journal ArticleDOI
TL;DR: Coupling techniques are used to show that in certain ways the LCD model is not too far from a standard random graph; in particular, the fractions of vertices that must be retained under an optimal attack in order to keep a giant component are within a constant factor for the scale-free and classical models.
Abstract: Recently many new "scale-free" random graph models have been introduced, motivated by the power-law degree sequences observed in many large-scale real-world networks. The most studied of these is the Barabási-Albert growth with "preferential attachment" model, made precise as the LCD model by the present authors. Here we use coupling techniques to show that in certain ways the LCD model is not too far from a standard random graph; in particular, the fractions of vertices that must be retained under an optimal attack in order to keep a giant component are within a constant factor for the scale-free and classical models.

Journal ArticleDOI
Jon Kleinberg
TL;DR: This work describes algorithms that yield provable guarantees for a particular problem of this type: detecting a network failure, and establishes a connection between graph separators and the notion of VC-dimension, using techniques based on matchings and disjoint paths.
Abstract: Measuring the properties of a large, unstructured network can be difficult: One may not have full knowledge of the network topology, and detailed global measurements may be infeasible. A valuable approach to such problems is to take measurements from selected locations within the network and then aggregate them to infer large-scale properties. One sees this notion applied in settings that range from Internet topology discovery tools to remote software agents that estimate the download times of popular web pages. Some of the most basic questions about this type of approach, however, are largely unresolved at an analytical level. How reliable are the results? How much does the choice of measurement locations affect the aggregate information one infers about the network? We describe algorithms that yield provable guarantees for a particular problem of this type: detecting a network failure. Suppose we want to detect events of the following form in an n-node network: An adversary destroys up to k nodes or edg...

Journal ArticleDOI
TL;DR: This work considers the problem of searching a randomly growing graph by a random walk, in two simple models of "web-graphs" where at each time step a new vertex is added and connected to the current graph by randomly chosen edges.
Abstract: We consider the problem of searching a randomly growing graph by a random walk. In particular we consider two simple models of "web-graphs." Thus at each time step a new vertex is added and it is connected to the current graph by randomly chosen edges. At the same time a "spider" S makes a number of steps of a random walk on the current graph. The parameter we consider is the expected proportion of vertices that have been visited by S up to time t.
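
The model is simple to simulate: grow the graph one vertex per step with a few random edges, let the spider take a fixed number of random-walk steps per time step, and track the fraction of vertices it has visited. Parameter values below are arbitrary, and the uniformly random edge choice is an assumption made for this sketch rather than the paper's specific variants.

# Simulation sketch: a "spider" doing a random walk on a growing graph.
import random

random.seed(8)
edges_per_vertex = 2     # edges connecting each new vertex to the current graph
walk_steps = 3           # random-walk steps the spider takes per time step
T = 20_000

adj = {0: [1], 1: [0]}   # seed graph: a single edge
visited = {0}
spider = 0

for t in range(2, T):
    # Add vertex t, joined to uniformly random existing vertices.
    adj[t] = []
    for _ in range(edges_per_vertex):
        u = random.randrange(t)
        adj[t].append(u)
        adj[u].append(t)
    # The spider takes a few random-walk steps on the current graph.
    for _ in range(walk_steps):
        spider = random.choice(adj[spider])
        visited.add(spider)

print(f"fraction of vertices visited after {T} steps: {len(visited) / len(adj):.3f}")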

Journal ArticleDOI
TL;DR: Using a new recursive technique, this work presents an explicit construction of an infinite family of N-superconcentrators of density 44, the most economical previously known explicit graphs of this type.
Abstract: Using a new recursive technique, we present an explicit construction of an infinite family of N-superconcentrators of density 44. The most economical previously known explicit graphs of this type have density around 60.

Journal ArticleDOI
TL;DR: It is proved that deterministic variations of the so-called copying model can lead to several nonisomorphic limits, and it is explained how limits of the copying model of the web graph share several properties with R that seem to reflect known properties of the web graph.
Abstract: Several stochastic models were proposed recently to model the dynamic evolution of the web graph. We study the infinite limits of the stochastic processes proposed to model the web graph when time goes to infinity. We prove that deterministic variations of the so-called copying model can lead to several nonisomorphic limits. Some models converge to the infinite random graph R, while the convergence of other models is sensitive to initial conditions or minor changes in the rules of the model. We explain how limits of the copying model of the web graph share several properties with R that seem to reflect known properties of the web graph.

Journal ArticleDOI
TL;DR: This paper shows that in a number of cases an online algorithm can in fact achieve a competitive ratio of 2 for rejections, and it gives matching Θ(√m) upper and lower bounds, where m is the number of edges, for arbitrary graphs with arbitrary edge capacities.
Abstract: Admission control (call control) is a well-studied online problem. We are given a fixed graph with edge capacities, and must process a sequence of calls that arrive over time, accepting some and rejecting others in order to stay within capacity limitations of the network. In the standard theoretical formulation, this problem is analyzed as a benefit problem: The goal is to devise an online algorithm that accepts at least a reasonable fraction of the maximum number of calls that could possibly have been accepted in hindsight. This formulation, however, has the property that even algorithms with optimal competitive ratios (typically O(log n) where n is the number of nodes) may end up rejecting the vast majority of calls even when it would have been possible in hindsight to reject only very few. In this paper, we instead consider the goal of approximately minimizing the number of calls rejected. This is much more appropriate for settings in which rejections are intended to be rare events. In order to avoid t...
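
To fix the setting, here is a sketch of the online admission problem with a naive greedy policy (accept a call whenever some path with spare capacity exists and route it along one such path). This is only the problem setup with a baseline policy, not the paper's algorithm or its analysis; the graph and call sequence are made up.

# Sketch of the online call-admission setting with a greedy baseline policy.
import networkx as nx

G = nx.cycle_graph(6)
nx.set_edge_attributes(G, 2, "capacity")
nx.set_edge_attributes(G, 0, "load")


def try_admit(G, s, t):
    # Consider only edges that still have spare capacity.
    free = [(u, v) for u, v, d in G.edges(data=True) if d["load"] < d["capacity"]]
    H = G.edge_subgraph(free)
    if s in H and t in H and nx.has_path(H, s, t):
        path = nx.shortest_path(H, s, t)
        for u, v in zip(path, path[1:]):
            G[u][v]["load"] += 1
        return True
    return False


calls = [(0, 3), (0, 3), (1, 4), (0, 3), (2, 5)]
print("accepted:", [try_admit(G, s, t) for s, t in calls])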

Journal ArticleDOI
TL;DR: In this article, the authors investigate the problem of extracting as much information as possible about the elements of a given subset X from the answers of a truthful adversary A. In particular, they investigate several aspects of this problem.
Abstract: We suppose we are given some fixed (but unknown) subset X of a set Ω = 𝔽_2^n, where 𝔽_2 denotes the field of two elements. Our goal is to learn as much as possible about the elements of X by asking certain binary questions. Each "question" Q is just some element of Ω, and the "answer" to Q is just the inner product Q · x ∈ 𝔽_2 for some x ∈ X. However, the choice of x is made by a truthful (but possibly malevolent) adversary A, who, we may assume, is trying to choose answers so as to yield as little information as possible about X. In this note, we investigate several aspects of this problem. In particular, we are interested in extracting as much information as possible about X from A's answers. Although A can prevent us from learning the identity of any particular element of X, with appropriate questions we can still learn quite a bit about X. We determine the maximum amount of information that can be recovered under these assumptions and describe explicit sets of questions for achieving this goal. For the c...
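
The query model itself is concrete enough to sketch: questions and hidden elements live in 𝔽_2^n, and each answer is an inner product mod 2 that the adversary may compute against any element of X. The adversary strategy below (answer with whichever bit more elements of X can justify) is a naive illustration of the mechanics, not the strategy or analysis from the paper; the set X and the questions are made up.

# Sketch of the query model over F_2^n: each question Q is answered with
# Q . x (mod 2) for some x in X chosen by the adversary.
import numpy as np

rng = np.random.default_rng(9)
n = 8
X = rng.integers(0, 2, size=(5, n))         # hidden subset X of F_2^n (5 elements)


def answer(Q):
    bits = (X @ Q) % 2                       # Q . x for every x in X
    zeros, ones = np.sum(bits == 0), np.sum(bits == 1)
    # Truthful but unhelpful: return whichever bit more elements of X attain.
    return 0 if zeros >= ones else 1


questions = np.eye(n, dtype=int)             # e.g., ask the n coordinate vectors
print("answers to coordinate questions:", [answer(q) for q in questions])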