Showing papers on "SimRank" published in 2006


Proceedings ArticleDOI
01 Sep 2006
TL;DR: This paper exploits the power-law distribution of links and develops a hierarchical structure called SimTree that represents similarities in a multi-granularity manner. Similarities between objects are computed without pairwise similarity computations, by merging computations that go through the same branches in the SimTree.
Abstract: Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects. Most current clustering methods rely only on the properties that belong to the objects per se. However, the similarities between objects are often indicated by the links, and desirable clusters cannot be generated using only the properties of objects. In this paper we explore linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked with them. In comparison with a previous study (SimRank) that computes links recursively on all pairs of objects, we take advantage of the power-law distribution of links, and develop a hierarchical structure called SimTree to represent similarities in a multi-granularity manner. This method avoids the high cost of computing and storing pairwise similarities but still thoroughly explores relationships among objects. An efficient algorithm is proposed to compute similarities between objects by avoiding pairwise similarity computations through merging computations that go through the same branches in the SimTree. Experiments show the proposed approach achieves high efficiency, scalability, and accuracy in clustering multi-typed linked objects.

124 citations
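
For context on the pairwise cost that the SimTree abstract above contrasts itself with, here is a minimal sketch of the baseline SimRank recursion (two objects are similar if the objects linking to them are similar). It is not the SimTree method. The function name `simrank`, the `decay` and `iterations` parameters, and the toy graph are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive SimRank over all node pairs (assumed, illustrative signature).

    in_neighbors maps every node to the list of nodes that link to it.
    Every node that appears in a list must also be a key of the dict.
    """
    nodes = list(in_neighbors)
    # Start from the identity: each node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                updated[(a, b)] = 1.0
                continue
            in_a, in_b = in_neighbors[a], in_neighbors[b]
            if not in_a or not in_b:
                # A node without in-links has similarity 0 to every other node.
                updated[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, damped by decay.
            total = sum(sim[(i, j)] for i in in_a for j in in_b)
            updated[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = updated
    return sim


# Toy graph: "a" links to both "b" and "c".
toy_graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank(toy_graph)
print(scores[("b", "c")], scores[("a", "b")])
```

Each iteration visits every pair of nodes, which is the quadratic time and storage cost the abstract refers to when it mentions computing and storing pairwise similarities.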


Proceedings ArticleDOI
23 May 2006
TL;DR: This paper achieves unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom and shows that the algorithms use an optimal amount of space by also improving earlier asymptotic worst-case lower bounds.
Abstract: Personalized PageRank expresses link-based page quality around user-selected pages. The only previous personalized PageRank algorithm that can serve on-line queries for an unrestricted choice of pages on large graphs is our Monte Carlo algorithm [WAW 2004]. In this paper we achieve unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom [WWW 2003]. We evaluate the precision of approximation experimentally on large-scale real-world data and find significant improvement over previous results. As a key theoretical contribution we show that our algorithms use an optimal amount of space by also improving earlier asymptotic worst-case lower bounds. Our lower bounds and algorithms apply to SimRank as well; of independent interest is the reduction of the SimRank computation to personalized PageRank.

92 citations
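
The personalized PageRank abstract above mentions a Monte Carlo algorithm as prior work. The general idea of Monte Carlo estimation can be sketched as follows: simulate random walks from the selected page, stop each walk with a fixed probability at every step, and use the distribution of end points as the estimate. This is a generic sketch, not the algorithm of either cited paper, and it does not implement the rounding or sketching techniques. The function name, the `teleport` value, and the handling of pages without out-links are assumptions.

```python
import random
from collections import Counter


def monte_carlo_ppr(out_neighbors, source, teleport=0.15, walks=10000, seed=0):
    """Estimate personalized PageRank for one source page (illustrative sketch).

    out_neighbors maps every node to the list of nodes it links to.
    """
    rng = random.Random(seed)
    end_points = Counter()
    for _ in range(walks):
        node = source
        # Continue the walk with probability 1 - teleport at each step.
        while rng.random() >= teleport:
            successors = out_neighbors[node]
            if not successors:
                # Assumption: a walk that reaches a page with no out-links ends there.
                break
            node = rng.choice(successors)
        end_points[node] += 1
    # The fraction of walks ending at a node approximates its personalized score.
    return {node: count / walks for node, count in end_points.items()}


# Toy graph: "a" and "b" link to each other; scores are personalized to "a".
toy_graph = {"a": ["b"], "b": ["a"]}
print(monte_carlo_ppr(toy_graph, "a"))
```

Because each source only needs its own sample of walks, an estimate can be produced for any choice of page without precomputing a full vector per page; accuracy depends on the number of walks.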


Proceedings Article
01 Jan 2006
TL;DR: This work forms classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low-dimensional projections and SimRank, and tests the method over two data sets previously used to measure spam filtering algorithms.
Abstract: We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong inside connectedness, however, have given rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classifiers by spreading information along hyperlinks from white and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low-dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms.

70 citations
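
To illustrate the similarity top list idea in the spam abstract above, the sketch below uses co-citation (the number of pages that link to both of two pages) as the similarity measure and reports what fraction of the most similar labeled pages are known spam. The abstract does not say how top lists are turned into classifier features, so the spam fraction used here is an assumed simplification, and all function names, labels, and the toy data are hypothetical.

```python
def cocitation(in_links, a, b):
    """Number of pages that link to both a and b."""
    return len(set(in_links.get(a, ())) & set(in_links.get(b, ())))


def spam_fraction_in_top_list(in_links, labels, page, k=3):
    """Fraction of spam among the k labeled pages most co-cited with `page`.

    labels maps known pages to "spam" or "honest" (assumed label names).
    Returns None when no labeled page shares an in-link with `page`.
    """
    scored = [
        (cocitation(in_links, page, other), other)
        for other in labels
        if other != page
    ]
    # Keep only pages with a positive score, highest co-citation first.
    top = sorted((item for item in scored if item[0] > 0), reverse=True)[:k]
    if not top:
        return None
    return sum(labels[other] == "spam" for _, other in top) / len(top)


# Toy data: "u" is unknown and shares in-links with two known spam pages.
in_links = {"u": ["f1", "f2"], "s1": ["f1", "f2"], "s2": ["f1"], "h1": ["x"]}
labels = {"s1": "spam", "s2": "spam", "h1": "honest"}
print(spam_fraction_in_top_list(in_links, labels, "u"))
```

A value near 1 suggests the unknown page sits among known spam pages under this measure; the same scaffold could be reused with another similarity function in place of `cocitation`.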


Proceedings ArticleDOI
23 May 2006
TL;DR: This paper introduces a novel link-based similarity measure, called PageSim, which can measure similarity between any two web pages, whereas SimRank cannot in some cases.
Abstract: To find web pages similar to a query page on the Web, this paper introduces a novel link-based similarity measure, called PageSim. In contrast to SimRank, a recursive refinement of co-citation, PageSim can measure similarity between any two web pages, whereas SimRank cannot in some cases. We give some intuitions for the PageSim model, and outline the model with mathematical definitions. Finally, we give an example to illustrate its effectiveness.

45 citations