Showing papers on "SimRank published in 2006"

Open Access

Proceedings Article•DOI•

LinkClus: efficient clustering via heterogeneous semantic links

Xiaoxin Yin¹, Jiawei Han¹, Philip S. Yu²

University of Illinois at Urbana–Champaign¹, IBM²

01 Sep 2006

TL;DR: This paper exploits the power-law distribution of links and develops a hierarchical structure called SimTree that represents similarities at multiple granularities. It avoids pairwise similarity computations by merging computations that pass through the same branches of the SimTree.

Abstract: Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects. Most current clustering methods rely only on the properties that belong to the objects per se. However, the similarities between objects are often indicated by the links, and desirable clusters cannot be generated using only the properties of objects. In this paper we explore linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked with them. In comparison with a previous study (SimRank) that computes links recursively on all pairs of objects, we take advantage of the power law distribution of links, and develop a hierarchical structure called SimTree to represent similarities in a multi-granularity manner. This method avoids the high cost of computing and storing pairwise similarities but still thoroughly explores relationships among objects. An efficient algorithm is proposed to compute similarities between objects by avoiding pairwise similarity computations through merging computations that go through the same branches in the SimTree. Experiments show the proposed approach achieves high efficiency, scalability, and accuracy in clustering multi-typed linked objects.
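For context, the pairwise recursion that the abstract contrasts SimTree with can be sketched as follows. This is a minimal illustration of the standard SimRank definition, not code from the LinkClus paper, and it does not implement SimTree. The function name `simrank`, the `in_links` dictionary layout, the toy graph, and the parameter values are all assumptions made for the example.

```python
from itertools import product

def simrank(in_links, decay=0.8, iterations=10):
    """Naive SimRank over every node pair (illustrative sketch only).

    in_links maps each node to the list of nodes that link to it.
    Every pair is stored and updated, which is the quadratic cost
    that the abstract says SimTree is designed to avoid.
    """
    nodes = list(in_links)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_links[a], in_links[b]
            if not ia or not ib:
                # By definition, a node without in-links has similarity 0
                # to every other node.
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbours.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

# Hypothetical toy graph: u links to a and b, and a links to c.
toy_in_links = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"]}
scores = simrank(toy_in_links)
print(scores[("a", "b")], scores[("a", "c")])
```

In this toy graph, a and b share the single in-neighbour u, so their score equals the decay factor, while a and c score zero even though a links directly to c.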

124 citations

Proceedings Article•DOI•

To randomize or not to randomize: space optimal summaries for hyperlink analysis

Tamas Sarlos¹, András A. Benczúr¹, Károly Csalogány¹, Dániel Fogaras², Balázs Rácz²

Hungarian Academy of Sciences¹, Budapest University of Technology and Economics²

23 May 2006

TL;DR: This paper achieves unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom and shows that the algorithms use an optimal amount of space by also improving earlier asymptotic worst-case lower bounds.

Abstract: Personalized PageRank expresses link-based page quality around user-selected pages. The only previous personalized PageRank algorithm that can serve on-line queries for an unrestricted choice of pages on large graphs is our Monte Carlo algorithm [WAW 2004]. In this paper we achieve unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom [WWW 2003]. We evaluate the precision of approximation experimentally on large-scale real-world data and find significant improvement over previous results. As a key theoretical contribution we show that our algorithms use an optimal amount of space by also improving earlier asymptotic worst-case lower bounds. Our lower bounds and algorithms apply to SimRank as well; of independent interest is the reduction of the SimRank computation to personalized PageRank.
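To make the quantity being approximated concrete, here is a small sketch of personalized PageRank for a single source page, computed by plain power iteration. It is neither the rounding and sketching method of this paper nor the dynamic programming algorithm of Jeh and Widom; it only illustrates the exact score those methods summarize. The function name, the `out_links` layout, the teleport value of 0.15, the handling of pages without out-links, and the toy graph are assumptions.

```python
def personalized_pagerank(out_links, source, teleport=0.15, iterations=50):
    """Personalized PageRank for one source page by power iteration.

    out_links maps each node to the list of nodes it links to.
    With probability `teleport` the random surfer jumps back to
    `source`; otherwise it follows a random out-link.
    """
    nodes = list(out_links)
    rank = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = (1 - teleport) * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: a page without out-links sends its
                # mass back to the source page.
                new[source] += (1 - teleport) * rank[n]
        # The teleport mass always returns to the source, so the
        # scores keep summing to 1.
        new[source] += teleport
        rank = new
    return rank

# Hypothetical three-page graph.
toy_out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ppr = personalized_pagerank(toy_out_links, source="a")
print(ppr)
```

Storing one such vector per possible source page is what makes full personalization expensive in space, which is the cost the abstract's space bounds address.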

92 citations

Proceedings Article•

[...]

András A. Benczúr, Károly Csalogány, Tamas Sarlos¹

Hungarian Academy of Sciences¹

01 Jan 2006

TL;DR: This work forms classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank, and tests the method on two data sets previously used to measure spam filtering algorithms.

Abstract: We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong internal connectedness of these structures, however, gave rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms.

70 citations

Proceedings Article•DOI•

PageSim: a novel link-based measure of web page similarity

Zhenjiang Lin¹, Michael R. Lyu¹, Irwin King¹

The Chinese University of Hong Kong¹

23 May 2006

TL;DR: This paper introduces a novel link-based similarity measure, called PageSim, which can measure similarity between any two web pages, whereas SimRank cannot in some cases.

Abstract: To find web pages similar to a query page on the Web, this paper introduces a novel link-based similarity measure, called PageSim. In contrast to SimRank, a recursive refinement of cocitation, PageSim can measure similarity between any two web pages, whereas SimRank cannot in some cases. We give some intuition for the PageSim model, and outline the model with mathematical definitions. Finally, we give an example to illustrate its effectiveness.

45 citations