
Showing papers on "SimRank published in 2017"


Journal ArticleDOI
01 May 2017
TL;DR: A random walk based indexing scheme to compute SimRank efficiently and accurately over large dynamic graphs is proposed and it is shown that the algorithm outperforms the state-of-the-art static and dynamic SimRank algorithms.
Abstract: Similarity among entities in graphs plays a key role in data analysis and mining. SimRank is a widely used and popular measurement to evaluate the similarity among the vertices. In real-life applications, graphs do not only grow in size, requiring fast and precise SimRank computation for large graphs, but also change and evolve continuously over time, demanding an efficient maintenance process to handle dynamic updates. In this paper, we propose a random walk based indexing scheme to compute SimRank efficiently and accurately over large dynamic graphs. We show that our algorithm outperforms the state-of-the-art static and dynamic SimRank algorithms.

43 citations


Journal ArticleDOI
01 Sep 2017
TL;DR: ProbeSim is presented, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results and offers satisfying practical efficiency and effectiveness.
Abstract: Single-source and top-k SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-k SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.

37 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: Depending on the requirements of different applications, the optimal choice of algorithms differs, and this paper provides an empirical guideline for making such choices.
Abstract: Given a graph, SimRank is one of the most popular measures of the similarity between two vertices. We focus on efficiently calculating SimRank, which has been studied intensively over the last decade. This has led researchers to propose many algorithms that efficiently calculate or approximate SimRank. Despite these abundant research efforts, there is no systematic comparison of these algorithms. In this paper, we conduct a study to compare these algorithms to understand their pros and cons. We first introduce a taxonomy for different algorithms that calculate SimRank and classify each algorithm into one of the following three classes, namely, iterative, non-iterative, and random walk-based methods. We implement ten algorithms published from 2002 to 2015, and compare them using synthetic and real-world graphs. To ensure the fairness of our study, our implementations use the same data structure and execution framework, and we try our best to optimize each of these algorithms. Our study reveals that none of these algorithms dominates the others: algorithms based on the iterative method often have higher accuracy, while algorithms based on random walks can be more scalable. One non-iterative algorithm has good effectiveness and efficiency on graphs of medium size. Thus, depending on the requirements of different applications, the optimal choice of algorithms differs. This paper provides an empirical guideline for making such choices.

21 citations


Journal ArticleDOI
TL;DR: This work proposes to convert SimRank to the problem of solving a linear system in matrix form, and proves that the system is non-singular, diagonally dominant, and symmetric positive definite (for undirected graphs).
Abstract: SimRank is a widely adopted similarity measure for objects modeled as nodes in a graph, based on the intuition that two objects are similar if they are referenced by similar objects. The recursive nature of the SimRank definition makes it expensive to compute the similarity score even for a single pair of nodes. This defect limits the applications of SimRank. To speed up the computation, some existing works replace the original model with an approximate model and seek only a rough solution for the SimRank scores. In this work, we propose a novel solution for computing all-pair SimRank scores. In particular, we propose to convert SimRank to the problem of solving a linear system in matrix form, and further prove that the system is non-singular, diagonally dominant, and symmetric positive definite (for undirected graphs). Those features immediately lead to the adoption of Conjugate Gradient (CG) and Bi-Conjugate Gradient (BiCG) techniques for efficiently computing SimRank scores. As a result, a significant improvement in the convergence rate can be achieved; meanwhile, the sparsity of the adjacency matrix is preserved throughout. Inspired by the existing common neighbor sharing strategy, we further reduce the computational complexity of the matrix multiplication and resolve the scalability issues. The experimental results show that our proposed algorithms significantly outperform the state-of-the-art algorithms.
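For orientation, the recursive intuition stated in this abstract (two objects are similar if they are referenced by similar objects) can be sketched as the textbook naive iteration below. This is not the CG/BiCG solver the paper proposes; the function name `simrank`, the decay factor `c=0.6`, and the dictionary-of-in-neighbors input are illustrative assumptions.

```python
from itertools import product


def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank (illustrative sketch, not the paper's method).

    in_neighbors maps every node to the list of nodes that point to it.
    Returns a dict keyed by (a, b) node pairs.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node with no in-neighbors is similar to no other node.
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, damped by c.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph (assumed for illustration): u points to both a and b.
toy_graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(toy_graph)
```

On this toy graph, `a` and `b` share their only in-neighbor, so their score equals the decay factor. Each iteration touches every pair of nodes and every pair of their in-neighbors, which is the cost that motivates faster formulations.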

20 citations


Journal ArticleDOI
TL;DR: The results show that JacSim outperforms existing measures significantly in terms of accuracy and also provides better performance than the similarity measures targeted to solve the pairwise normalization problem.

20 citations


Journal ArticleDOI
31 Jul 2017
TL;DR: This article adopts “SimRank” to evaluate the similarity between two vertices in a large graph because of its generality, and proposes an efficient method without building the vertex-pair graph to find the h-go cover+ vertex pairs.
Abstract: Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, and can benefit applications such as friend recommendation in social networks and link prediction. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “SimRank” is purely structure dependent, and it does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.

12 citations


Proceedings ArticleDOI
19 Apr 2017
TL;DR: A Monte Carlo based method, UniWalk, is designed to enable the fast top-k SimRank computation over large undirected graphs without indexing, and outperforms the state-of-the-art methods by orders of magnitude.
Abstract: SimRank is an effective structural similarity measurement between two vertices in a graph, which can be used in many applications like recommender systems. Although progress has been achieved, existing methods still face challenges in handling large graphs. Besides huge index construction and maintenance cost, the existing methods require considerable search space and time overheads in the online SimRank query. In this paper, we design a Monte Carlo based method, UniWalk, to enable the fast top-k SimRank computation over large undirected graphs without indexing. UniWalk directly locates the top-k similar vertices for any single source vertex u via O(R) sampling paths originating from u only, which avoids the selection of candidate vertex set C and the following O(|C|R) bidirectional sampling paths starting from u and each candidate respectively in existing methods. We also design a space-efficient method to reduce intermediate results, and a path-sharing strategy to optimize path sampling for multiple source vertices. Furthermore, we extend UniWalk to existing distributed graph processing frameworks to improve its scalability. We conduct extensive experiments to illustrate that UniWalk has high scalability and outperforms the state-of-the-art methods by orders of magnitude, and such an improvement is achieved without any indexing overheads.
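As background for the "bidirectional sampling paths" that this abstract contrasts UniWalk with, the sketch below shows the generic Monte Carlo single-pair estimator: two reverse random walks start from u and v, and the score is estimated as the average of c raised to their first meeting step. It is not UniWalk itself; the function name `mc_simrank`, the walk count, the step cap, and the fixed seed are assumptions made for illustration.

```python
import random


def mc_simrank(in_neighbors, u, v, c=0.6, walks=2000, max_steps=10, seed=0):
    """Monte Carlo single-pair SimRank estimate (generic sketch, not UniWalk).

    in_neighbors maps every node to the list of nodes that point to it.
    Walks are truncated after max_steps, so the estimate slightly
    underestimates the true score.
    """
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            # A walk that reaches a node without in-neighbors can never meet.
            if not in_neighbors[a] or not in_neighbors[b]:
                break
            # Both walks move one step backwards along a random in-edge.
            a = rng.choice(in_neighbors[a])
            b = rng.choice(in_neighbors[b])
            if a == b:
                # First meeting at this step contributes c**step.
                total += c ** step
                break
    return total / walks


# Toy graph (assumed for illustration): u points to both a and b.
toy_graph = {"u": [], "a": ["u"], "b": ["u"]}
estimate = mc_simrank(toy_graph, "a", "b")
```

Answering a top-k query this way needs one such pair of walks per candidate vertex, which is the O(|C|R) cost the abstract says UniWalk avoids by sampling from the source vertex only.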

10 citations


Proceedings Article
01 Jan 2017
TL;DR: The experimental results indicate that the proposed aggregated similarity measure overall outperforms the other three similarity measures in terms of both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), especially in the cases of 30-100 nearest neighbors.
Abstract: This paper addresses the sparsity problem in collaborative filtering (CF) by developing an aggregated user-user similarity measure suitable for the user-based CF model. The aggregated similarity measure is a weighted aggregation of the SimRank++ similarity on the user-item bipartite graph and the cosine similarity of the Linked Open Data (LOD)-based user profiles derived from both the rating data and the items' descriptive attributes found from LOD resources. To validate the effectiveness of the aggregated similarity and evaluate the accuracy of rating predictions with the user-based CF method, comparative experiments between four similarity measures, the Pearson correlation coefficient, the SimRank++ similarity, the cosine similarity and the aggregated similarity, were conducted on the MovieLens 100k dataset and DBpedia. The experimental results indicate that the proposed aggregated similarity measure overall outperforms the other three similarity measures in terms of both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), especially in the cases of 30-100 nearest neighbors.
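The abstract does not give the form of the weighted aggregation, so the sketch below assumes the simplest option, a convex combination with a single weight `alpha`. The function names, the sparse-dictionary profile format, and the default `alpha=0.5` are hypothetical; the graph-based score is taken as a precomputed input rather than computed with SimRank++.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles given as dicts."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm_p = math.sqrt(sum(x * x for x in p.values()))
    norm_q = math.sqrt(sum(x * x for x in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def aggregated_similarity(graph_sim, profile_a, profile_b, alpha=0.5):
    """Assumed convex combination of a graph score and a profile score.

    graph_sim: precomputed graph-based similarity for the two users
               (for example a SimRank++ score), supplied by the caller.
    alpha:     hypothetical weight in [0, 1]; the paper's actual
               weighting scheme is not stated in the abstract.
    """
    return alpha * graph_sim + (1 - alpha) * cosine(profile_a, profile_b)


# Hypothetical LOD-style attribute profiles for two users.
alice = {"genre:drama": 3.0, "director:x": 1.0}
bob = {"genre:drama": 2.0, "genre:comedy": 1.0}
combined = aggregated_similarity(0.2, alice, bob, alpha=0.7)
```

A combination of this kind lets the profile term supply a non-zero similarity for users who share no rated items, which is where a purely rating-based measure is weakest.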

8 citations


Journal ArticleDOI
TL;DR: This work defines the concept of super node such that for a node in the network, the SimRank with its super node is not less than that with any others, and proposes a tight upper bound for each node that can be easily calculated after each iteration.

7 citations


Journal ArticleDOI
TL;DR: ProbeSim is an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results.
Abstract: Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.

5 citations


Journal ArticleDOI
TL;DR: This paper defines random walks on uncertain graphs, shows that the definition satisfies the Markov property, and identifies a critical difference from random walks on deterministic graphs that makes all existing SimRank computation algorithms for deterministic graphs inapplicable to uncertain graphs.
Abstract: SimRank is a similarity measure between vertices in a graph. Recently, many algorithms have been proposed to efficiently evaluate SimRank similarities. However, the existing algorithms either overlook uncertainty in graph structures or depend on an unreasonable assumption. In this paper, we study SimRank on uncertain graphs. Following the random-walk-based formulation of SimRank on deterministic graphs and the possible world model of uncertain graphs, we first define random walks on uncertain graphs and show that our definition of random walks satisfies the Markov property. We formulate our SimRank measure based on random walks on uncertain graphs. We discover a critical difference between random walks on uncertain graphs and random walks on deterministic graphs, which makes all existing SimRank computation algorithms on deterministic graphs inapplicable to uncertain graphs. For SimRank computation, we consider computing both single-pair SimRank and single-source top-$K$ SimRank. We propose three algorithms, namely the sampling algorithm with high efficiency, the two-phase algorithm with comparable efficiency and higher accuracy, and a speeding-up algorithm with much higher efficiency. Meanwhile, we present an optimized algorithm for efficiently computing the single-source top-$K$ SimRank. The experimental results verify the effectiveness of our SimRank measure and the efficiency of the proposed SimRank computation algorithms.

01 Nov 2017
TL;DR: In this paper, an aggregated user-user similarity measure was proposed for the user-based CF model, which is a weighted aggregation of the SimRank++ similarity on user-item bipartite graph and the cosine similarity of the Linked Open Data (LOD)-based user profiles derived from both the rating data and the items' descriptive attributes found from LOD resources.

14 Jun 2017
TL;DR: A supervised, graph-based method to create word representations is investigated and a new calculus for the interpretable ultradense subspaces, including polarity, concreteness, frequency and part-of-speech (POS) is introduced.
Abstract: Word representations, also called word embeddings, are generic representations, often high-dimensional vectors. They map the discrete space of words into a continuous vector space, which allows us to handle rare or even unseen events, e.g. by considering the nearest neighbors. Many Natural Language Processing tasks can be improved by word representations if we extend the task specific training data by the general knowledge incorporated in the word representations. The first publication investigates a supervised, graph-based method to create word representations. This method leads to a graph-theoretic similarity measure, CoSimRank, with equivalent formalizations that show CoSimRank’s close relationship to Personalized Page-Rank and SimRank. The new formalization is efficient because it can use the graph-based word representation to compute a single node similarity without having to compute the similarities of the entire graph. We also show how we can take advantage of fast matrix multiplication algorithms. In the second publication, we use existing unsupervised methods for word representation learning and combine these with semantic resources by learning representations for non-word objects like synsets and entities. We also investigate improved word representations which incorporate the semantic information from the resource. The method is flexible in that it can take any word representations as input and does not need an additional training corpus. A sparse tensor formalization guarantees efficiency and parallelizability. In the third publication, we introduce a method that learns an orthogonal transformation of the word representation space that focuses the information relevant for a task in an ultradense subspace of a dimensionality that is smaller by a factor of 100 than the original space. 
We use ultradense representations for a Lexicon Creation task in which words are annotated with three types of lexical information – sentiment, concreteness and frequency. The final publication introduces a new calculus for the interpretable ultradense subspaces, including polarity, concreteness, frequency and part-of-speech (POS). The calculus supports operations like “−1 × hate = love” and “give me a neutral word for greasy” (i.e., oleaginous) and extends existing analogy computations like “king − man + woman = queen”.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: Several representative similarity measures of graph nodes are investigated and an NMF-based method for community discovery using SimRank similarity measure is proposed that presents good scalability and can be used to discover communities in the large-scale complex networks.
Abstract: Nonnegative Matrix Factorization (NMF) has become a powerful model for community discovery in complex networks. Existing NMF-based methods for community discovery often factorize the adjacency matrix of a complex network to obtain its community indicator matrices, which can provide an intuitive interpretation of the community membership of nodes in the network. However, the adjacency matrix cannot represent the global structure of complex networks very well, and hence decreases the quality of discovered communities. Aiming at solving this problem, in this paper we investigate several representative similarity measures of graph nodes and propose an NMF-based method for community discovery using the SimRank similarity measure. Additionally, to improve the scalability of our method, we implement its key components using the MapReduce distributed computing framework, including computing the SimRank feature matrix and iteratively solving the NMF-based model for community discovery. We conduct extensive experiments on several complex networks. The results show that our method can obtain better results of community discovery than NMF-based methods using other similarity measures. Moreover, our method presents good scalability and can be used to discover communities in large-scale complex networks.

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter studies the relevance search problem in heterogeneous networks, where the task is to measure the relatedness of heterogeneous objects (including objects with the same type or different types), and introduces a novel measure HeteSim and its extended version.
Abstract: Similarity search is an important function in many applications, which usually focuses on measuring the similarity between objects with the same type. However, in many scenarios, we need to measure the relatedness between objects with different types. With the surge of study on heterogeneous networks, the relevance measure on objects with different types becomes increasingly important. In this chapter, we study the relevance search problem in heterogeneous networks, where the task is to measure the relatedness of heterogeneous objects (including objects with the same type or different types). And then, we introduce a novel measure HeteSim and its extended version.

Journal ArticleDOI
TL;DR: A personalized similarity measure is proposed for measuring the similarity between a query and a candidate book by combining the similarities between books, and experiments demonstrate that, when the number of input books is not limited to one, the returned rankings are more consistent with students’ query intentions.
Abstract: Personalized education aims to give students a personalized learning schedule according to their backgrounds and preferences, and the learning resources required are personalized as well. On-line bookstores allow students to collect learning resources through the Internet, but information overload plagues students, since it is difficult to find suitable books as the data become diverse and massive. Similarity search aims to find objects similar to a given query, and can be regarded as a promising solution to the problem of information overload. However, the existing similarity search approaches limit the query to only one object, so students cannot express their personal preferences. In this paper, we propose a personalized similarity search framework for finding similar books based on a student’s preference for personalized education. We build the student-book network based on the students’ ratings for books, and use SimRank to measure the similarities between books according to the student-book network. To satisfy a student’s personalized query preference, we allow the student to express a query with multiple books. A personalized similarity measure is proposed for measuring the similarity between the query and a candidate book by combining the similarities between books. Experiments on an Amazon dataset demonstrate that, when the number of input books is not limited to one, the returned rankings are more consistent with students’ query intentions.

Proceedings ArticleDOI
23 Aug 2017
TL;DR: An optimization technique for fast P-Rank computation in information networks that adopts the spirit of partial sums and develops an optimized similarity computation algorithm, which reduces the computation cost by skipping similarity scores smaller than the given threshold during accumulation operations.
Abstract: P-Rank is a simple and captivating link-based similarity measure that extends SimRank by exploiting both in- and out-links for similarity computation. However, existing P-Rank computation is expensive in terms of time and space cost and cannot efficiently support similarity computation in large information networks. To tackle this problem, in this paper we propose an optimization technique for fast P-Rank computation in information networks by adopting the spirit of partial sums. We write the P-Rank equation based on partial sums and further approximate this equation by setting a threshold for ignoring small similarity scores during iterative similarity computation. An optimized similarity computation algorithm is developed, which reduces the computation cost by skipping similarity scores smaller than the given threshold during accumulation operations. The accuracy loss under the threshold is estimated through extensive mathematical analysis. Extensive experiments demonstrate the effectiveness and efficiency of our proposed approach in comparison with the straightforward P-Rank computation algorithm.
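A minimal sketch of the two ideas named in this abstract, partial sums and threshold skipping, applied to the standard P-Rank recurrence (a weighted mix of an in-link term and an out-link term). The function name `prank`, the parameters `lam`, `c`, and `threshold`, and the exact point at which small scores are skipped are assumptions; the paper's own algorithm and its error analysis are not reproduced here.

```python
def prank(in_nb, out_nb, lam=0.5, c=0.8, iterations=5, threshold=0.0):
    """Iterative P-Rank with per-node partial sums (illustrative sketch).

    in_nb / out_nb map every node to its in- / out-neighbor lists.
    lam weights the in-link term against the out-link term.
    Scores below `threshold` are skipped when partial sums are accumulated
    (an assumed reading of the thresholding idea in the abstract).
    """
    nodes = list(in_nb)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}

    def kept(score):
        # Drop contributions that fall below the threshold.
        return score if score >= threshold else 0.0

    def term(nb, a, b, partial):
        # One direction of the recurrence, built from a's partial sums.
        if not nb[a] or not nb[b]:
            return 0.0
        return c * sum(partial[y] for y in nb[b]) / (len(nb[a]) * len(nb[b]))

    for _ in range(iterations):
        new = {}
        for a in nodes:
            # Partial sums over a's neighbors, computed once and reused
            # for every b instead of being recomputed per pair.
            part_in = {y: sum(kept(sim[x][y]) for x in in_nb[a]) for y in nodes}
            part_out = {y: sum(kept(sim[x][y]) for x in out_nb[a]) for y in nodes}
            row = {}
            for b in nodes:
                if a == b:
                    row[b] = 1.0
                else:
                    row[b] = (lam * term(in_nb, a, b, part_in)
                              + (1 - lam) * term(out_nb, a, b, part_out))
            new[a] = row
        sim = new
    return sim


# Toy graph (assumed): u -> a, u -> b, a -> w, b -> w.
in_nb = {"u": [], "a": ["u"], "b": ["u"], "w": ["a", "b"]}
out_nb = {"u": ["a", "b"], "a": ["w"], "b": ["w"], "w": []}
scores = prank(in_nb, out_nb)
```

With `lam=1.0` the out-link term vanishes and the recurrence reduces to SimRank; raising `threshold` trades accuracy for fewer accumulated terms.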

Posted Content
TL;DR: This paper shows that the SimRank update in response to each link update can be expressed as a rank-one Sylvester matrix equation, giving an incremental all-pairs method that needs $O(Kn^2)$ time and $O(n^2)$ memory in the worst case, and adds a memory-efficient strategy that needs only $O(Kn+m)$ memory.
Abstract: In this article, we study the efficient dynamical computation of all-pairs SimRanks on time-varying graphs. Li et al.'s approach requires $O(r^4 n^2)$ time and $O(r^2 n^2)$ memory in a graph with $n$ nodes, where $r$ is the target rank of the low-rank SVD. (1) We first consider edge updates that do not accompany new node insertions. We show that the SimRank update $\Delta S$ in response to every link update is expressible as a rank-one Sylvester matrix equation. This provides an incremental method requiring $O(Kn^2)$ time and $O(n^2)$ memory in the worst case to update all pairs of similarities for $K$ iterations. (2) To speed up the computation further, we propose a lossless pruning strategy that captures the "affected areas" of $\Delta S$ to eliminate unnecessary retrieval. This reduces the time of the incremental SimRank to $O(K(m+|AFF|))$, where $m$ is the number of edges in the old graph, and $|AFF| (\le n^2)$ is the size of "affected areas" in $\Delta S$, and in practice, $|AFF| \ll n^2$. (3) We also consider edge updates that accompany node insertions, and categorize them into three cases, according to which end of the inserted edge is a new node. For each case, we devise an efficient incremental algorithm that can support new node insertions. (4) We next design an efficient batch incremental method that handles "similar sink edges" simultaneously and eliminates redundant edge updates. (5) To achieve linear memory, we devise a memory-efficient strategy that dynamically updates all pairs of SimRanks column by column in just $O(Kn+m)$ memory, without the need to store all $n^2$ pairs of old SimRank scores. Experimental studies on various datasets demonstrate that our solution substantially outperforms the existing incremental SimRank methods, and is faster and more memory-efficient than its competitors on million-scale graphs.

Proceedings ArticleDOI
08 May 2017
TL;DR: A weighted improved SimRank algorithm is proposed to compute the rating similarity between users in rating data set and a trust network is built and the calculation of trust degree in the trust relationship data set is introduced.
Abstract: Collaborative filtering is one of the most widely used recommendation technologies, but the data sparsity and cold start problems of collaborative filtering algorithms are difficult to solve effectively. To alleviate the data sparsity problem in collaborative filtering, we first propose a weighted improved SimRank algorithm to compute the rating similarity between users in the rating data set. The improved SimRank can find more nearest neighbors for target users according to the transmissibility of rating similarity. Then, we build a trust network and introduce the calculation of trust degree in the trust relationship data set. Finally, we combine rating similarity and trust to build a comprehensive similarity in order to find more appropriate nearest neighbors for the target user. Experimental results show that the algorithm proposed in this paper effectively improves the recommendation precision of the collaborative filtering algorithm.