
Showing papers in "Internet Mathematics in 2005"



Journal ArticleDOI
Pavel Berkhin
TL;DR: The theoretical foundations of the PageRank formulation are examined, along with the acceleration of PageRank computing, the effects of particular aspects of web graph structure on the optimal organization of computations, and PageRank stability.
Abstract: This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underlie PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRan...
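As a hedged illustration of the core computation the survey is concerned with, below is a minimal power-iteration sketch of PageRank over a small adjacency-list graph; the damping factor, tolerance, and toy graph are illustrative assumptions rather than settings from the paper.

```python
# Minimal PageRank power iteration over an adjacency-list graph.
# Damping factor, tolerance, and the toy graph are illustrative assumptions.

def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=100):
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:                              # spread rank along out-links
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:                                    # dangling node: spread uniformly
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        if sum(abs(new_rank[u] - rank[u]) for u in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))
```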

479 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce a structural metric that allows us to differentiate between simple, connected graphs having an identical degree sequence, which is of particular interest when that sequence satisfies a power law relationship.
Abstract: There is a large, popular, and growing literature on "scale-free" networks with the Internet along with metabolic networks representing perhaps the canonical examples. While this has in many ways reinvigorated graph theory, there is unfortunately no consistent, precise definition of scale-free graphs and few rigorous proofs of many of their claimed properties. In fact, it is easily shown that the existing theory has many inherent contradictions and that the most celebrated claims regarding the Internet and biology are verifiably false. In this paper, we introduce a structural metric that allows us to differentiate between all simple, connected graphs having an identical degree sequence, which is of particular interest when that sequence satisfies a power law relationship. We demonstrate that the proposed structural metric yields considerable insight into the claimed properties of SF graphs and provides one possible measure of the extent to which a graph is scale-free. This structural view can be related t...
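As a hedged sketch of what such a structural metric can look like (assuming, as in the s-metric literature, a score given by the sum over edges of the product of endpoint degrees; the paper's exact definition may differ), the following compares two connected graphs that share a degree sequence but wire it differently.

```python
# Sketch of an s-metric-style structural measure: the sum over all edges of the
# product of the endpoint degrees.  Graphs with the same degree sequence but
# different wiring generally receive different scores.  This definition is an
# assumption for illustration; the paper's exact metric may differ.

def s_metric(edges):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(degree[u] * degree[v] for u, v in edges)

# Two simple, connected graphs with the same degree sequence (3, 3, 2, 2, 1, 1):
g1 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"), ("b", "f"), ("c", "d")]
g2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "e"), ("d", "f")]
print(s_metric(g1), s_metric(g2))   # different scores despite identical degrees
```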

469 citations


Journal ArticleDOI
TL;DR: Full personalization is achieved by a novel algorithm that precomputes a compact database; using this database, it can serve online responses to arbitrary user-selected personalization, and it is proved that for a fixed error probability the size of the database is linear in the number of web pages.
Abstract: Personalized PageRank expresses link-based page quality around user-selected pages in a similar way as PageRank expresses quality over the entire web. Existing personalized PageRank algorithms can, however, serve online queries only for a restricted choice of pages. In this paper we achieve full personalization by a novel algorithm that precomputes a compact database; using this database, it can serve online responses to arbitrary user-selected personalization. The algorithm uses simulated random walks; we prove that for a fixed error probability the size of our database is linear in the number of web pages. We justify our estimation approach by asymptotic worst-case lower bounds: we show that on some sets of graphs, exact personalized PageRank values can only be obtained from a database of size quadratic in the number of vertices. Furthermore, we evaluate the precision of approximation experimentally on the Stanford WebBase graph.
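A hedged Monte Carlo sketch of the simulated-random-walk idea: walks started at the personalization page stop with probability 1 - alpha at each step, and the empirical distribution of their endpoints approximates the personalized PageRank vector; the walk count and alpha below are illustrative assumptions, not the paper's parameters.

```python
import random

# Monte Carlo sketch of personalized PageRank: simulate walks from the source
# that continue with probability alpha and stop otherwise (or at dangling
# nodes), and record the terminal node.  Walk count and alpha are assumptions.

def approx_ppr(out_links, source, alpha=0.85, num_walks=10000, seed=0):
    rng = random.Random(seed)
    counts = {u: 0 for u in out_links}
    for _ in range(num_walks):
        node = source
        while rng.random() < alpha and out_links[node]:
            node = rng.choice(out_links[node])
        counts[node] += 1
    return {u: c / num_walks for u, c in counts.items()}

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
print(approx_ppr(web, "a"))
```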

301 citations


Journal ArticleDOI
TL;DR: In this paper, the PageRank computation in the original random surfer model is transformed into the problem of computing the solution of a sparse linear system, and the sparsity of the obtained linear system makes it possible to exploit the effectiveness of Markov chain index reordering.
Abstract: Recently, the research community has devoted increased attention to reducing the computational time needed by web ranking algorithms. In particular, many techniques have been proposed to speed up the well-known PageRank algorithm used by Google. This interest is motivated by two dominant factors: (1) the web graph has huge dimensions and is subject to dramatic updates in terms of nodes and links, therefore the PageRank assignment tends to become obsolete very soon; (2) many PageRank vectors need to be computed according to different choices of the personalization vectors or when adopting strategies of collusion detection. In this paper, we show how the PageRank computation in the original random surfer model can be transformed into the problem of computing the solution of a sparse linear system. The sparsity of the obtained linear system makes it possible to exploit the effectiveness of Markov chain index reordering to speed up the PageRank computation. In particular, we rearrange the system matrix acco...
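A hedged sketch of the linear-system view, assuming the standard formulation in which the PageRank vector x solves (I - alpha * P^T) x = (1 - alpha) v for a row-stochastic transition matrix P; the toy matrix, alpha, and the SciPy sparse solver are illustrative choices, not the reordering scheme of the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import spsolve

# PageRank as a sparse linear system (I - alpha * P^T) x = (1 - alpha) v,
# with P the row-stochastic transition matrix of a dangling-free toy graph.
# Graph, alpha, and solver choice are illustrative assumptions.

alpha = 0.85
P = csr_matrix(np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))
n = P.shape[0]
v = np.full(n, 1.0 / n)                        # uniform personalization vector
A = identity(n, format="csr") - alpha * P.T    # sparse system matrix
x = spsolve(A.tocsc(), (1.0 - alpha) * v)
print(x)                                        # sums to 1 for this dangling-free setup
```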

113 citations


Journal ArticleDOI
Roberto I. Oliveira, Joel Spencer
TL;DR: An evolving network model of Krapivsky and Redner in which new nodes arrive sequentially, each connecting to a previously existing node b with probability proportional to the pth power of the in-degree of b is analyzed.
Abstract: We analyze an evolving network model of Krapivsky and Redner in which new nodes arrive sequentially, each connecting to a previously existing node b with probability proportional to the pth power of the in-degree of b. We restrict to the super-linear case p > 1. When 1 + 1/k < p < 1 + 1/(k - 1) for an integer k, the structure of the final countable tree is determined. There is a finite tree T with distinguished vertex v (which has a limiting distribution) on which is "glued" a specific infinite tree; v has an infinite number of children, an infinite number of which have k - 1 children, and there are only a finite number of nodes (possibly only v) with k or more children. Our basic technique is to embed the discrete process in a continuous-time process using exponential random variables, a technique that has previously been employed in the study of balls-in-bins processes with feedback.
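A hedged simulation sketch of the attachment rule described above (the discrete process only, not the continuous-time embedding); the +1 offset that lets a node with in-degree zero be chosen at all, and every parameter below, are illustrative assumptions rather than the paper's exact specification.

```python
import random

# Sequential attachment where node t chooses its parent with probability
# proportional to (1 + in_degree) ** p.  The +1 offset and all parameters are
# illustrative assumptions; for p > 1 one node tends to attract most new links.

def superlinear_tree(num_nodes, p=1.5, seed=0):
    rng = random.Random(seed)
    in_degree = [0]                      # node 0 is the root
    parent = [None]
    for new in range(1, num_nodes):
        weights = [(1 + d) ** p for d in in_degree]
        target = rng.choices(range(new), weights=weights, k=1)[0]
        parent.append(target)
        in_degree[target] += 1
        in_degree.append(0)
    return parent, in_degree

parent, deg = superlinear_tree(2000, p=1.5)
print("maximum in-degree:", max(deg), "out of", len(deg) - 1, "edges")
```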

85 citations


Journal ArticleDOI
TL;DR: It is argued that power law research must move from focusing on observation, interpretation, and modeling of power law behavior to instead considering the challenging problems of validation of models and control of systems.
Abstract: I argue that power law research must move from focusing on observation, interpretation, and modeling of power law behavior to instead considering the challenging problems of validation of models and control of systems.

79 citations


Book ChapterDOI
TL;DR: In this paper, it was shown that the largest k eigenvalues of the adjacency matrix of the preferential attachment graph have \(\lambda_k = (1 \pm o(1))\Delta_k^{1/2}\) whp.
Abstract: The preferential attachment graph is a random graph formed by adding a new vertex at each time step, with a single edge which points to a vertex selected at random with probability proportional to its degree. Every m steps the most recently added m vertices are contracted into a single vertex, so at time t there are roughly t/m vertices and exactly t edges. This process yields a graph which has been proposed as a simple model of the world wide web [BA99]. For any constant k, let \(\Delta_1 \geq \Delta_2 \geq \cdots \geq \Delta_k\) be the degrees of the k highest degree vertices. We show that at time t, for any function f with f(t) → ∞ as t → ∞, \(\frac{t^{1/2}}{f(t)} \leq \Delta_1 \leq t^{1/2}f(t),\) and for i = 2, ..., k, \(\frac{t^{1/2}}{f(t)} \leq \Delta_i \leq \Delta_{i-1} - \frac{t^{1/2}}{f(t)},\) with high probability (whp). We use this to show that at time t the largest k eigenvalues of the adjacency matrix of this graph have \(\lambda_k = (1 \pm o(1))\Delta_k^{1/2}\) whp.
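A hedged sketch of the m = 1 case of the process (ignoring the contraction step), using the standard trick of sampling a vertex proportionally to its degree by drawing a uniformly random endpoint from the edge list, and checking that the maximum degree grows on the order of t^{1/2}; the time horizons and seed are arbitrary choices.

```python
import random

# m = 1 preferential attachment without the contraction step: each new vertex
# attaches to a vertex chosen with probability proportional to its degree,
# sampled by picking a uniformly random endpoint of the growing edge list.
# The printed ratio max_degree / sqrt(t) should stay roughly stable.

def pa_max_degree(t, seed=0):
    rng = random.Random(seed)
    endpoints = [0, 0]                   # start with a self-loop at vertex 0
    degree = {0: 2}
    for new in range(1, t):
        target = rng.choice(endpoints)   # degree-proportional choice
        endpoints.extend([new, target])
        degree[new] = degree.get(new, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return max(degree.values())

for t in (1000, 4000, 16000):
    print(t, pa_max_degree(t) / t ** 0.5)
```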

64 citations


Journal ArticleDOI
TL;DR: Estimates of the expected length of ζ codes against power-law distributions are given, and the results are compared with analogous estimates for the more classical γ, δ and variable-length block codes.
Abstract: We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.
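For concreteness, here is a sketch of the classical Elias γ code that the paper uses as a point of comparison (the ζ codes themselves are not reproduced here); returning bits as Python strings is purely for readability.

```python
# Elias gamma code, one of the classical baselines mentioned above: a positive
# integer x is written as floor(log2 x) zeros followed by the binary form of x.
# Bit strings are returned as plain strings only for readability.

def elias_gamma(x):
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(x)[2:]                        # always starts with '1'
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

for x in (1, 2, 3, 4, 9, 100):
    code = elias_gamma(x)
    assert elias_gamma_decode(code) == x
    print(x, code)
```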

61 citations


Journal ArticleDOI
TL;DR: T-Rank Light and T-Rank are introduced, two link-analysis approaches that take into account the temporal aspects of freshness and activity of pages and links and can produce better rankings of web pages.
Abstract: The link structure of the web is analyzed to measure the authority of pages, which can be taken into account for ranking query results. Due to the enormous dynamics of the web, with millions of pages created, updated, deleted, and linked to every day, temporal aspects of web pages and links are crucial factors for their evaluation. Users are interested in important pages (i.e., pages with high authority score) but are equally interested in the recency of information. Time—and thus the freshness of web content and link structure—emanates as a factor that should be taken into account in link analysis when computing the importance of a page. So far only minor effort has been spent on the integration of temporal aspects into link-analysis techniques. In this paper we introduce T-Rank Light and T-Rank, two link-analysis approaches that take into account the temporal aspects freshness (i.e., timestamps of most recent updates) and activity (i.e., update rates) of pages and links. Experimental results show that T...
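The abstract does not reproduce the T-Rank formulas, so the following is only an illustrative stand-in: a PageRank-style iteration in which the weight of a link is biased by the freshness of its target, with hypothetical timestamps, half-life, and damping factor.

```python
import time

# Illustrative stand-in, not the actual T-Rank computation: a PageRank-style
# iteration where a link u -> v is weighted by how recently v was updated.
# Timestamps, half-life, damping, and the toy graph are hypothetical.

def freshness(last_update, now, half_life=30 * 86400):
    return 0.5 ** ((now - last_update) / half_life)

def temporal_rank(out_links, last_update, damping=0.85, iters=50):
    now = time.time()
    nodes = list(out_links)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / len(nodes) for u in nodes}
        for u in nodes:
            weights = {v: freshness(last_update[v], now) for v in out_links[u]}
            total = sum(weights.values())
            for v, w in weights.items():
                new[v] += damping * rank[u] * w / total
        rank = new
    return rank

day = 86400
now = time.time()
web = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
updated = {"a": now - 400 * day, "b": now - 2 * day, "c": now - 300 * day}
print(temporal_rank(web, updated))
```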

60 citations


Journal ArticleDOI
TL;DR: The algorithmic results represent an application of a particular analysis technique which can be used to characterise the asymptotic behaviour of a number of dynamic processes related to the web.
Abstract: In this paper we study the size of generalised dominating sets in two graph processes that are widely used to model aspects of the World Wide Web. On the one hand, we show that graphs generated this way have fairly large dominating sets (i.e., linear in the size of the graph). On the other hand, we present efficient strategies to construct small dominating sets. The algorithmic results represent an application of a particular analysis technique which can be used to characterise the asymptotic behaviour of a number of dynamic processes related to the web.
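As a hedged illustration of the underlying task (the standard greedy heuristic, not necessarily the strategies analysed in the paper): repeatedly pick the vertex that dominates the largest number of not-yet-dominated vertices.

```python
# Greedy heuristic for a small dominating set: repeatedly add the vertex that
# covers the most not-yet-dominated vertices (itself plus its neighbours).
# This illustrates the task; it is not the specific strategy from the paper.

def greedy_dominating_set(adj):
    undominated = set(adj)
    chosen = set()
    while undominated:
        best = max(adj, key=lambda u: len(({u} | set(adj[u])) & undominated))
        chosen.add(best)
        undominated -= {best} | set(adj[best])
    return chosen

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3, 5], 5: [4]}
print(greedy_dominating_set(graph))    # e.g. {1, 4}
```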

Journal ArticleDOI
TL;DR: A measure of effectiveness for crawl strategies is proposed: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned.
Abstract: Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall's τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are ...
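A hedged sketch of the comparison step described above: compute PageRank on the full graph and on the partially visited subgraph, then compare the induced orders with Kendall's τ over the visited nodes; the toy graph, the partial crawl, and the damping factor are illustrative assumptions.

```python
from scipy.stats import kendalltau

# Compare node orders from PageRank on the full graph and on a partial crawl
# using Kendall's tau over the visited nodes.  Toy graph, crawl subgraph, and
# damping factor are illustrative assumptions.

def pagerank(out_links, damping=0.85, iters=100):
    nodes = list(out_links)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / len(nodes) for u in nodes}
        for u in nodes:
            targets = [v for v in out_links[u] if v in new] or nodes
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank

full = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["a", "b"]}
visited = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}          # partial crawl

full_scores, partial_scores = pagerank(full), pagerank(visited)
common = sorted(partial_scores)
tau, _ = kendalltau([full_scores[u] for u in common],
                    [partial_scores[u] for u in common])
print("Kendall tau on visited nodes:", tau)
```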

Journal ArticleDOI
TL;DR: This work presents an efficient algorithm that extracts a local graph from a given realistic network and shows that the underlying local graph is robust in the sense that when the extraction algorithm is applied to a hybrid graph, it recovers the original local graph with a small error.
Abstract: The small-world phenomenon includes both small average distance and the clustering effect. Randomly generated graphs with a power law degree distribution are widely used to model large real-world networks, but while these graphs have small average distance, they generally do not exhibit the clustering effect. We introduce an improved hybrid model that combines a global graph (a random power law graph) with a local graph (a graph with high local connectivity defined by network flow). We present an efficient algorithm that extracts a local graph from a given realistic network. We show that the underlying local graph is robust in the sense that when our extraction algorithm is applied to a hybrid graph, it recovers the original local graph with a small error. The proof involves a probabilistic analysis of the growth of neighborhoods in the hybrid graph model.
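The extraction algorithm itself is not spelled out in the abstract; as a purely illustrative stand-in for "local connectivity defined by network flow", the sketch below keeps an edge when its endpoints are joined by at least a threshold number of edge-disjoint paths (max-flow in the unit-capacity graph), using networkx; the threshold and toy graph are assumptions.

```python
import networkx as nx

# Illustrative stand-in, not the paper's extraction algorithm: call an edge
# "local" when its endpoints are connected by at least `threshold` edge-disjoint
# paths, measured as max-flow with unit capacities.  Edges inside well-clustered
# regions (e.g. triangles) pass; bridge-like edges do not.

def local_edges(G, threshold=2):
    D = G.to_directed()                          # flow routines expect arcs
    nx.set_edge_attributes(D, 1, "capacity")
    return [(u, v) for u, v in G.edges()
            if nx.maximum_flow_value(D, u, v, capacity="capacity") >= threshold]

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])   # triangle plus a tail
print(local_edges(G))                            # expected: the triangle edges
```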

Journal ArticleDOI
TL;DR: This paper uses real BGP data to study some properties of the AS connectivity graph and its evolution in time, and builds a simple model that is inspired by observations made in the first part, and discusses simulations of this model.
Abstract: The connectivity of the autonomous systems (ASs) in the Internet can be modeled as a time-evolving random graph, whose nodes represent ASs (or routers), and whose edges represent direct connections between them. Even though this graph has some random aspects, its properties show it to be fundamentally different from "traditional" random graphs. In the first part of this paper, we use real BGP data to study some properties of the AS connectivity graph and its evolution in time. In the second part, we build a simple model that is inspired by observations made in the first part, and we discuss simulations of this model.

Journal ArticleDOI
TL;DR: This paper presents an explicit set of \(2^{O(k)} \log N\) questions, along with a \(2^{O(k^2)} \log^2 N\) recovery algorithm, that achieves B's goal in this game and completely solves the problem for any constant number of secrets k.
Abstract: We consider the following game introduced by Chung, Graham, and Leighton in [Chung et al. 01]. One player, A, picks k > 1 secrets from a universe of N possible secrets, and another player, B, tries to gain as much information about this set as possible by asking binary questions f : [N] → {0, 1}. Upon receiving a question f, A adversarially chooses one of her k secrets and answers f according to it. In this paper we present an explicit set of \(2^{O(k)} \log N\) questions, along with a \(2^{O(k^2)} \log^2 N\) recovery algorithm, that achieves B's goal in this game. This, in particular, completely solves the problem for any constant number of secrets k. Our strategy is based on the list decoding of Reed-Solomon codes, and it extends and generalizes ideas introduced by Alon, Guruswami, Kaufman, and Sudan in [Alon et al. 02].

Journal ArticleDOI
TL;DR: This work uses the framework of competitive analysis of online algorithms and studies upper and lower bounds for page eviction strategies in the case where data have expiration times, showing that minimal adaptations of marking algorithms achieve performance similar to that of the well-studied case of caching without the expiration time constraint.
Abstract: Caching data together with expiration times beyond which the data are no longer valid is a standard method for promoting information coherence in distributed systems, including the Internet, the World Wide Web (WWW), and Peer-to-Peer (P2P) networks. We use the framework of competitive analysis of online algorithms and study upper and lower bounds for page eviction strategies in the case where data have expiration times. We show that minimal adaptations of marking algorithms achieve performance similar to that of the well-studied case of caching without the expiration time constraint. Marking algorithms include the widely-used Least Recently Used (LRU) eviction policy. In practice, when data have expiration times, the LRU eviction policy is used widely, often without any consideration of expiration times. Our results explain and justify this standard practice.
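A minimal sketch of the setting described above: an LRU cache whose entries carry expiration times, so lookups drop stale entries and eviction is otherwise plain LRU; the interface and TTL handling are illustrative assumptions, not the marking-algorithm analysis of the paper.

```python
import time
from collections import OrderedDict

# LRU cache whose entries carry expiration times: get() ignores and evicts
# stale entries, put() evicts the least recently used entry when full.
# The interface and TTL handling are illustrative assumptions.

class ExpiringLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()               # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self._data:
            return None
        value, expires_at = self._data[key]
        if expires_at <= now:                    # expired: treat as a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as most recently used
        return value

    def put(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict least recently used

cache = ExpiringLRUCache(capacity=2)
cache.put("a", 1, ttl=60, now=0)
cache.put("b", 2, ttl=1, now=0)
print(cache.get("a", now=30), cache.get("b", now=30))   # -> 1 None
```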