
Showing papers on "SimRank published in 2012"


Proceedings ArticleDOI
01 Apr 2012
TL;DR: An algorithmic framework called TopSim is proposed that transforms the top-k SimRank problem on a graph G into finding the top-k nodes with the highest authority on the product graph G × G; TopSim is further accelerated by merging similarity paths, yielding a more efficient algorithm called TopSim-SM.
Abstract: Searching for objects similar to a given query object in a network has numerous applications, including web search and collaborative filtering. We use the notion of structural similarity to capture the commonality of two objects in a network, e.g., if two nodes are referenced by the same node, they may be similar. Meeting-based methods including SimRank and P-Rank capture structural similarity very well. Deriving inspiration from PageRank, SimRank has gained popularity through its natural intuition and domain independence. Because it is computationally expensive, subsequent work has focused on optimizing and approximating the computation of SimRank. In this paper, we approach SimRank from a top-k querying perspective where, given a query node v, we are interested in finding the top-k nodes that have the highest SimRank score w.r.t. v. The only known approaches for answering such queries are either a naive algorithm that computes the similarity matrix for all node pairs, or computing the similarity vector by comparing the query node v with each other node independently and then picking the top-k. Neither approach can handle top-k structural similarity search efficiently on very large graphs consisting of millions of nodes. We propose an algorithmic framework called TopSim based on transforming the top-k SimRank problem on a graph G to one of finding the top-k nodes with highest authority on the product graph G × G. We further accelerate TopSim by merging similarity paths and develop a more efficient algorithm called TopSim-SM. Two heuristic algorithms, Trun-TopSim-SM and Prio-TopSim-SM, are also proposed to approximate TopSim-SM on scale-free graphs and trade accuracy for speed, based on truncated random walks and prioritized propagation respectively. We analyze the accuracy and performance of the TopSim family of algorithms and report the results of a detailed experimental study.

80 citations
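
The naive baseline mentioned in the abstract above (compute SimRank for all node pairs, then pick the top-k for a query node) can be sketched as follows. This is a minimal illustration of that baseline, not the TopSim algorithm itself; the function names, the decay factor c = 0.8, and the iteration count are assumptions made for the example.

```python
from itertools import product


def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive iterative SimRank over all node pairs.

    in_neighbors maps every node to the list of nodes that link to it.
    c and iterations are illustrative defaults, not values from the paper.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            ia, ib = in_neighbors[a], in_neighbors[b]
            if a == b:
                new[(a, b)] = 1.0          # a node is fully similar to itself
            elif not ia or not ib:
                new[(a, b)] = 0.0          # no in-links: nothing to compare
            else:
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


def top_k_similar(in_neighbors, query, k, c=0.8, iterations=10):
    """Baseline top-k query: score every node against `query`, keep the best k."""
    sim = simrank(in_neighbors, c, iterations)
    scores = [(v, sim[(query, v)]) for v in in_neighbors if v != query]
    scores.sort(key=lambda pair: (-pair[1], str(pair[0])))
    return scores[:k]
```

Every iteration touches all node pairs and all pairs of their in-neighbors, which is why this baseline does not scale to graphs with millions of nodes.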


Journal ArticleDOI
TL;DR: It is shown that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank and be viewed as an extension of the traditional neighbor-counting scheme by taking the similarities between neighbors into account, leading to higher flexibility.
Abstract: Measuring object similarity in a graph is a fundamental data-mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach that is based on the intuition that “similar objects have similar neighbors” and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects by the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme by taking the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two accelerating techniques: (1) proposing a simple pruning strategy and (2) adopting an approximation algorithm for maximum matching computation. Experimental results on real-world datasets show that although our method is less efficient computationally, it outperforms classic methods in terms of accuracy.

58 citations
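
The recursive definition in the MatchSim abstract above (similarity as the average similarity of maximum-matched neighbor pairs) can be illustrated with a small sketch. It makes several assumptions that are not stated in the abstract: the matching weight is normalized by the size of the larger neighbor set, self-similarity is fixed at 1, and the maximum-weight matching is found by brute force, which is only practical for tiny neighbor lists. All function names are hypothetical.

```python
from itertools import permutations


def max_matching_weight(left, right, sim):
    """Brute-force maximum-weight matching between two small neighbor lists.

    Assumes sim is symmetric and defined for every ordered pair of nodes.
    """
    small, large = sorted((left, right), key=len)
    best = 0.0
    for chosen in permutations(large, len(small)):
        weight = sum(sim[(s, l)] for s, l in zip(small, chosen))
        best = max(best, weight)
    return best


def matchsim(neighbors, iterations=5):
    """Iterate a MatchSim-style score; neighbors maps node -> neighbor list."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                na, nb = neighbors[a], neighbors[b]
                if a == b:
                    new[(a, b)] = 1.0      # assumption: self-similarity is 1
                elif not na or not nb:
                    new[(a, b)] = 0.0      # assumption: no neighbors means 0
                else:
                    matched = max_matching_weight(na, nb, sim)
                    new[(a, b)] = matched / max(len(na), len(nb))
        sim = new
    return sim
```

Under these assumptions, two nodes with identical neighbor lists score 1, because every neighbor can be matched with itself. A practical implementation would replace the brute-force search with a polynomial-time or approximate matching algorithm, as the abstract suggests.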


Journal ArticleDOI
TL;DR: Novel optimization techniques are proposed such that each SimRank iteration takes $O(\min\{n \cdot m, n^r\})$ time and $O(n + m)$ space, and a reordering technique combined with an over-relaxation method is developed that not only speeds up the convergence rate of the existing techniques but also achieves I/O efficiency.
Abstract: SimRank has become an important similarity measure to rank web documents based on a graph model on hyperlinks. The existing approaches for conducting SimRank computation adopt an iteration paradigm. The most efficient deterministic technique yields $O(n^3)$ worst-case time per iteration with the space requirement $O(n^2)$, where $n$ is the number of nodes (web documents). In this paper, we propose novel optimization techniques such that each iteration takes $O(\min\{n \cdot m, n^r\})$ time and $O(n + m)$ space, where $m$ is the number of edges in a web-graph model and $r \le \log_2 7$. In addition, we extend the similarity transition matrix to prevent random surfers getting stuck, and devise a pruning technique to eliminate impractical similarities for each iteration. Moreover, we also develop a reordering technique combined with an over-relaxation method, not only speeding up the convergence rate of the existing techniques, but achieving I/O efficiency as well. We conduct extensive experiments on both synthetic and real data sets to demonstrate the efficiency and effectiveness of our iteration techniques.

58 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: The proposed Delta-SimRank, which is demonstrated to fit the nature of distributed computing and can be efficiently implemented using Google's MapReduce paradigm, can effectively reduce the computational cost and can also benefit the applications with non-static network structures.
Abstract: Based on the intuition that "two objects are similar if they are related to similar objects", SimRank (proposed by Jeh and Widom in 2002) has become a famous measure to compare the similarity between two nodes using network structure. Although SimRank is applicable to a wide range of areas such as social networks, citation networks, link prediction, etc., it suffers from heavy computational complexity and space requirements. Most existing efforts to accelerate SimRank computation work only for static graphs and on single machines. This paper considers the problem of computing SimRank efficiently in a distributed system while handling dynamic networks which grow with time. We first consider an abstract model called Harmonic Field on Node-pair Graph. We use this model to derive SimRank and the proposed Delta-SimRank, which is demonstrated to fit the nature of distributed computing and can be efficiently implemented using Google's MapReduce paradigm. Delta-SimRank can effectively reduce the computational cost and can also benefit the applications with non-static network structures. Our experimental results on four real world networks show that Delta-SimRank is much more efficient than the distributed SimRank algorithm, and leads to up to 30 times speed-up in the best case.

22 citations
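
The idea behind a delta-based update, as described in the Delta-SimRank abstract above, can be sketched on a single machine. Because the SimRank update is linear for pairs of distinct nodes, the change in scores between two iterations can be propagated on its own, and only non-zero changes need to be processed. The sketch below is an illustration under that assumption, not the paper's MapReduce implementation; the function name, the decay factor c, the threshold eps, and the push-style loop are all choices made for this example, and the edge list is assumed to contain no duplicate edges.

```python
from collections import defaultdict


def delta_simrank(edges, c=0.8, iterations=10, eps=1e-4):
    """Propagate only the per-iteration score changes (deltas).

    edges is a list of (source, target) pairs without duplicates.
    Returns a sparse dict of scores; pairs that are absent have score 0.
    """
    in_deg = defaultdict(int)
    out_nbrs = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_nbrs[src].append(dst)
        in_deg[dst] += 1
        nodes.update((src, dst))

    sim = {(v, v): 1.0 for v in nodes}     # iteration 0: identity scores
    delta = dict(sim)                      # seed: the change from "all zero"
    for _ in range(iterations):
        new_delta = defaultdict(float)
        # "Map": each non-zero delta on (i, j) contributes to the pairs
        # formed by the out-neighbors of i and j.
        for (i, j), d in delta.items():
            for a in out_nbrs[i]:
                for b in out_nbrs[j]:
                    if a != b:             # diagonal scores stay fixed at 1
                        new_delta[(a, b)] += c * d / (in_deg[a] * in_deg[b])
        # "Reduce": contributions are already summed per pair; drop tiny
        # changes (an approximation controlled by eps).
        delta = {p: d for p, d in new_delta.items() if abs(d) > eps}
        if not delta:
            break
        for p, d in delta.items():
            sim[p] = sim.get(p, 0.0) + d
    return sim
```

The work per iteration is proportional to the number of non-zero deltas rather than to all node pairs, so the cost shrinks as the scores converge.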


Proceedings ArticleDOI
04 Dec 2012
TL;DR: This paper proposes a novel structural similarity measure, E-Rank (Entity Rank), towards effectively computing the structural similarity of entities in SNs, based on the intuition that two entities are similar if they can arrive at common entities.
Abstract: With social networks (SNs) becoming ubiquitous and massive, the issue of similarity computation among entities becomes more challenging and draws extensive interest from various research fields. SimRank is a well-known similarity measure. However, it considers only the meetings between two nodes that walk along equal-length paths, since the path length increases strictly with each iteration of the similarity computation. In addition, it does not differentiate the importance of each link. In this paper, we propose a novel structural similarity measure, E-Rank (Entity Rank), towards effectively computing the structural similarity of entities in SNs, based on the intuition that two entities are similar if they can arrive at common entities. E-Rank can be well applied to social networks for measuring similarities of entities. Extensive experiments demonstrate the effectiveness of E-Rank by comparing it with the state-of-the-art measures.

16 citations


Journal ArticleDOI
Guoming He, Cuiping Li, Hong Chen, Xiaoyong Du, Haijun Feng
TL;DR: This paper exploits the inherent parallelism and high memory bandwidth of graphics processing units (GPU) to accelerate the computation of SimRank on large graphs and proposes the iterative aggregation techniques for uncoupling Markov chains to compute SimRank scores in parallel for large graphs.
Abstract: Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is to measure similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph theoretical model. However, existing methods on SimRank computation suffer from two limitations: 1) the computing cost can be very high in practice; and 2) they can only be applied on static graphs. In this paper, we exploit the inherent parallelism and high memory bandwidth of graphics processing units (GPU) to accelerate the computation of SimRank on large graphs. Furthermore, based on the observation that SimRank is essentially a first-order Markov Chain, we propose to utilize the iterative aggregation techniques for uncoupling Markov chains to compute SimRank scores in parallel for large graphs. The iterative aggregation method can be applied on dynamic graphs. Moreover, it can handle not only the link-updating problem but also the node-updating problem. We give the corresponding theoretical justification and analysis, propose three optimization strategies to further improve the computation efficiency, and extend the proposed algorithm to dynamic graphs. Extensive experiments on synthetic and real data sets verify that the proposed methods are efficient and effective.

4 citations


Proceedings ArticleDOI
10 Dec 2012
TL;DR: This paper addresses the problem of link-based similarity measure of nodes in an information network distributed over different parties and proposes a privacy-preserving SimRank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.
Abstract: Information network analysis has drawn a lot of attention in recent years. Among all the aspects of network analysis, similarity measure of nodes has been shown to be useful in many applications, such as clustering, link prediction and community identification, to name a few. As linkage data in a large network is inherently sparse, it is noted that collecting more data can improve the quality of similarity measure. This gives different parties a motivation to cooperate. In this paper, we address the problem of link-based similarity measure of nodes in an information network distributed over different parties. Concerning data privacy, we propose a privacy-preserving SimRank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.

4 citations


Journal Article
TL;DR: This paper presents a method to recognize synonyms based on user behaviors, using a Gradient Boost Decision Tree model, to deal with the considerable number of new words, typos, and near-synonyms in this domain.
Abstract: Focused on synonym recognition in e-commerce, this paper presents a method to recognize synonyms based on user behaviors to deal with the considerable number of new words, typos, and near-synonyms in this domain. Firstly, candidate synonym sets are retrieved by analyzing the titles and their corresponding queries based on SimRank theory. Then, features including the literal feature, title feature, query feature, and click feature are extracted. Finally, a Gradient Boost Decision Tree model is adopted to determine whether candidate synonyms are true synonyms or not. The experimental result shows that Gradient Boost Decision Tree (GBDT) is more suitable for this task, achieving a precision of 56.52%.

3 citations


Journal Article
TL;DR: This paper first extracts product feature expressions and sentimental words in pairs to build a bipartite graph, then adopts the Weight Normalized SimRank to compute similarity between different feature expressions in the bipartite graph, and finally optimizes the Bayesian classifier in Semi-Supervised Learning via the similarity.
Abstract: This paper focuses on clustering different feature expressions in product reviews into proper groups. In product reviews, the same features may have different expressions, e.g. "appearance" and "design" of a mobile phone actually indicate the same feature. Considering the fact that different expressions are always used with the same sentimental words in a sentence, this paper first extracts product feature expressions and sentimental words in pairs to build a bipartite graph, then adopts the Weight Normalized SimRank to compute similarity between different feature expressions in the bipartite graph, and finally optimizes the Bayesian classifier in Semi-Supervised Learning via the similarity. Experimental results show that the proposed method is valid.

2 citations


Posted Content
TL;DR: This paper addresses the problem of link-based similarity measure of nodes in an information network distributed over different parties and proposes a privacy-preserving SimRank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.
Abstract: Information network analysis has drawn a lot of attention in recent years. Among all the aspects of network analysis, similarity measure of nodes has been shown to be useful in many applications, such as clustering, link prediction and community identification, to name a few. As linkage data in a large network is inherently sparse, it is noted that collecting more data can improve the quality of similarity measure. This gives different parties a motivation to cooperate. In this paper, we address the problem of link-based similarity measure of nodes in an information network distributed over different parties. Concerning data privacy, we propose a privacy-preserving SimRank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.

1 citation