
Showing papers on "SimRank" published in 2013


Journal ArticleDOI
01 Sep 2013
TL;DR: SimRank*, a revised version of SimRank, is proposed and rigorously justified; it resolves the counter-intuitive "zero-similarity" issue while inheriting the merits of the basic SimRank philosophy.
Abstract: Similarity assessment is one of the core tasks in hyperlink analysis. Recently, with the proliferation of applications, e.g., web search and collaborative filtering, SimRank has become a well-studied measure of similarity between two nodes in a graph. It recursively follows the philosophy that "two nodes are similar if they are referenced (have incoming edges) from similar nodes", which can be viewed as an aggregation of similarities based on incoming paths. Despite its popularity, SimRank has an undesirable property, i.e., "zero-similarity": it only accommodates paths of equal length from a common "center" node, so a large portion of other paths are fully ignored. This paper attempts to remedy this issue. (1) We propose and rigorously justify SimRank*, a revised version of SimRank, which resolves such counter-intuitive "zero-similarity" issues while inheriting the merits of the basic SimRank philosophy. (2) We show that the series form of SimRank* can be reduced to a fairly succinct and elegant closed form, which looks even simpler than SimRank yet enriches semantics without increasing the computational cost. This leads to a fixed-point iterative paradigm of SimRank* in O(Knm) time on a graph of n nodes and m edges for K iterations, which is comparable to SimRank. (3) To further optimize SimRank* computation, we leverage a novel clustering strategy via edge concentration. Due to its NP-hardness, we devise an efficient and effective heuristic that speeds up SimRank* computation to O(Knm̃) time, where m̃ is generally much smaller than m. (4) Using real and synthetic data, we empirically verify the rich semantics of SimRank* and demonstrate its high computational efficiency.
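To make the "zero-similarity" discussion concrete, the sketch below implements only the baseline SimRank fixed-point iteration that SimRank* revises; it is not the authors' SimRank* algorithm. The damping factor C, the iteration count K, and the toy graph are illustrative assumptions.

```python
import numpy as np

def simrank(adj, C=0.8, K=10):
    """Baseline SimRank by fixed-point iteration (matrix form).

    adj[i, j] = 1 means an edge i -> j; similarity is aggregated over in-links.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    in_deg = adj.sum(axis=0)                       # |I(v)| for every node v
    W = np.divide(adj, in_deg, out=np.zeros_like(adj), where=in_deg > 0)
    S = np.identity(n)
    for _ in range(K):
        S = C * (W.T @ S @ W)                      # aggregate over in-neighbour pairs
        np.fill_diagonal(S, 1.0)                   # s(v, v) = 1 by definition
    return S

# Toy graph: 2 -> 0, 2 -> 1, 3 -> 2.  Nodes 0 and 1 share in-neighbour 2, so
# s(0, 1) = C.  Node 3 reaches node 0 in two steps but node 2 in one, so the
# baseline model scores s(0, 2) = 0 -- the kind of pair SimRank* is meant to handle.
A = np.zeros((4, 4))
A[2, 0] = A[2, 1] = A[3, 2] = 1
print(np.round(simrank(A), 3))
```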

72 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: The solution, SimMat, is based on two ideas: it computes the approximate similarity of a selected node pair efficiently in a non-iterative style based on the Sylvester equation, and it prunes unnecessary approximate similarity computations when searching for high-similarity nodes by exploiting estimations based on the Cauchy-Schwarz inequality.
Abstract: Graphs are a fundamental data structure and have been employed to model objects as well as their relationships. The similarity of objects on the web (e.g., webpages, photos, music, micro-blogs, and social networking service users) is the key to identifying relevant objects in many recent applications. SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many applications such as web spam detection, collaborative tagging analysis, link prediction, and so on. SimRank computes similarities iteratively, and it needs O(N^4 T) time and O(N^2) space for similarity computation, where N and T are the number of nodes and iterations, respectively. Unfortunately, this iterative approach is computationally expensive. The goal of this work is to process top-k search and range search efficiently for a given node. Our solution, SimMat, is based on two ideas: (1) It computes the approximate similarity of a selected node pair efficiently in non-iterative style based on the Sylvester equation, and (2) It prunes unnecessary approximate similarity computations when searching for the high similarity nodes by exploiting estimations based on the Cauchy-Schwarz inequality. These two ideas reduce the time and space complexities of the proposed approach to O(Nn), where n is the target rank of the low-rank approximation (n ≪ N in practice). Our experiments show that our approach is much faster, by several orders of magnitude, than previous approaches in finding the high similarity nodes.
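SimMat's actual construction (a low-rank Sylvester-equation solution plus Cauchy-Schwarz pruning) is more involved than the abstract spells out, so the sketch below only illustrates the non-iterative idea on the linearised SimRank model S = c·WᵀSW + (1−c)·I, which is a Stein (discrete Lyapunov) equation that SciPy can solve in closed form. The choice of this relaxation, the damping factor c, and the use of solve_discrete_lyapunov are our assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def linearized_simrank(adj, c=0.8):
    """Closed-form solution of the linearised SimRank model
    S = c * W.T @ S @ W + (1 - c) * I, a Stein / discrete Lyapunov equation.
    SimRank's exact diagonal correction is dropped, which is the usual
    relaxation behind non-iterative solvers; this is not the SimMat method."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    in_deg = adj.sum(axis=0)
    W = np.divide(adj, in_deg, out=np.zeros_like(adj), where=in_deg > 0)
    a = np.sqrt(c) * W.T                    # X = a X a^T + Q  with Q = (1 - c) I
    return solve_discrete_lyapunov(a, (1 - c) * np.identity(n))

A = np.zeros((4, 4))
A[2, 0] = A[2, 1] = A[3, 2] = 1             # same toy graph as the earlier sketch
print(np.round(linearized_simrank(A), 3))
```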

60 citations


Journal ArticleDOI
01 May 2013
TL;DR: This paper adopts "SimRank" to evaluate the similarity of two vertices in a large graph because of its generality, and extends the technique to the partition-based framework.
Abstract: Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting; it can benefit friend recommendation in social networks, link prediction, etc. In this paper, we adopt "SimRank" to evaluate the similarity of two vertices in a large graph because of its generality. Note that "SimRank" is purely structure dependent and does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all the vertex pairs satisfying the threshold in a data graph G. In order to reduce the search space, we propose an estimated shortest-path-distance based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover, to efficiently compute the SimRank score of a single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (called h-go covers), based on which the SimRank score of any vertex pair can be computed easily. In order to handle large graphs, we extend our technique to the partition-based framework. Thorough theoretical analysis and extensive experiments over both real and synthetic datasets confirm the efficiency and effectiveness of our solution.
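The SRJ query itself is simple to state; the paper's contribution lies in the distance-based pruning and the h-go cover index, neither of which is reproduced here. The sketch below is only the naive baseline such techniques aim to beat: scan a precomputed all-pairs SimRank matrix and keep the vertex pairs that clear the threshold. The matrix S and the threshold theta are assumed inputs.

```python
def srj_naive(S, theta):
    """Naive SimRank-based join: return all unordered vertex pairs (u, v),
    u < v, whose SimRank score reaches the threshold.  This is an O(n^2) scan
    over a fully materialised similarity matrix -- exactly the cost that the
    paper's upper-bound pruning and h-go cover index are designed to avoid."""
    n = S.shape[0]
    return [(u, v, float(S[u, v]))
            for u in range(n) for v in range(u + 1, n) if S[u, v] >= theta]

# Usage, assuming S is a precomputed SimRank matrix (see the earlier sketch):
# pairs = srj_naive(S, theta=0.5)
```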

60 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: An adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and an efficient algorithm for speeding up the computation of SimRank to O(Kd'n^2) time, where d' is typically much smaller than the average in-degree of a graph.
Abstract: SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work [17] exploits partial sums memoization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memoization reduces repeated calculations by caching parts of the similarity summations for later reuse. However, we observe that computations among different partial sums may have duplicate redundancy. Besides, for a desired accuracy ϵ, the existing SimRank model requires K = ⌈log_C ϵ⌉ iterations [17], where C is a damping factor. Nevertheless, such a geometric rate of convergence is slow in practice if high accuracy is desired. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to O(Kd'n^2) time, where d' is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of the original SimRank scores.
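The partial-sums memoization of [17] that this paper starts from fits in a few lines: for each node a, the inner sum over a's in-neighbours is computed once and reused for every partner b, rather than being recomputed per pair. The adjacency-list representation and parameters below are our own illustration; the paper's adaptive clustering and exponential-sum SimRank are not shown.

```python
from collections import defaultdict

def simrank_partial_sums(in_nbrs, C=0.8, K=10):
    """SimRank iteration with partial-sums memoization (in the spirit of [17]).

    in_nbrs: dict mapping each node to the list of its in-neighbours.
    Returns a dict of dicts holding the non-zero similarity scores."""
    nodes = list(in_nbrs)
    S = {a: defaultdict(float, {a: 1.0}) for a in nodes}
    for _ in range(K):
        new_S = {a: defaultdict(float, {a: 1.0}) for a in nodes}
        for a in nodes:
            if not in_nbrs[a]:
                continue
            # Partial sums: partial[j] = sum_{i in I(a)} s(i, j), computed once
            # per node a and shared by every partner b in the loop below.
            partial = defaultdict(float)
            for i in in_nbrs[a]:
                for j, s_ij in S[i].items():
                    partial[j] += s_ij
            for b in nodes:
                if b == a or not in_nbrs[b]:
                    continue
                total = sum(partial[j] for j in in_nbrs[b])
                if total > 0:
                    new_S[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = new_S
    return S

# Toy graph: node 2 links to 0 and 1, node 3 links to 2.
in_nbrs = {0: [2], 1: [2], 2: [3], 3: []}
print(dict(simrank_partial_sums(in_nbrs)[0]))   # node 0 ends up similar to node 1
```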

43 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: A self-contained co-saliency detection algorithm based on a superpixel affinity matrix is proposed, with a co-saliency measure that evaluates the foreground cohesiveness and locality compactness of superpixels within one image.
Abstract: Image co-saliency detection is a valuable technique for highlighting perceptually salient regions in image pairs. In this paper, we propose a self-contained co-saliency detection algorithm based on a superpixel affinity matrix. We first compute both intra- and inter-image similarities of the superpixels of an image pair. Bipartite graph matching is applied to determine the most reliable inter-image similarities. To update the similarity score between every two superpixels, we next employ a GPU-based all-pair SimRank algorithm to propagate scores over the affinity matrix. Based on the inter-superpixel affinities, we derive a co-saliency measure that evaluates the foreground cohesiveness and locality compactness of superpixels within one image. The effectiveness of our method is demonstrated in an experimental evaluation.

42 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: It is shown that ASCOS outputs a more complete similarity score than SimRank, because SimRank (and several of its variations, such as P-Rank and SimFusion) on average ignores half of the paths between nodes during calculation.
Abstract: Discovering similar objects in a social network raises many interesting issues. Here, we present ASCOS, an Asymmetric Structure COntext Similarity measure that captures the similarity scores among all pairs of nodes in a network. The definition of ASCOS is similar to that of the well-known SimRank, since both define score values recursively. However, we show that ASCOS outputs a more complete similarity score than SimRank, because SimRank (and several of its variations, such as P-Rank and SimFusion) on average ignores half of the paths between nodes during calculation. To make ASCOS tractable in both computation time and memory usage, we propose two variations of ASCOS: a low-rank approximation based approach and an iterative linear-equation solver based on Gauss-Seidel. When the target network is sparse, the run time and the required computing space of these variations are smaller than those of computing SimRank and ASCOS directly. In addition, the iterative solver divides the original network into several independent sub-systems, so that a multi-core server or a distributed computing environment, such as MapReduce, can solve the problem efficiently. We compare the performance of ASCOS with other global structure based similarity measures, including SimRank, Katz, and LHN. Experimental results based on user evaluation suggest that ASCOS gives better results than the other measures. In addition, the asymmetric property has the potential to identify the hierarchical structure of a network. Finally, the variations of ASCOS (including one distributed variation) also reduce computation in both space and time.
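The abstract does not restate the ASCOS recurrence, so the sketch below uses the form we recall from the literature, s(i, j) = c/|N(i)| · Σ_{k∈N(i)} s(k, j) for i ≠ j with s(j, j) = 1, and should be checked against the paper. Fixing the target node j turns this into a sparse linear system, which is the kind of problem the proposed Gauss-Seidel variation solves column by column; the parameter c, the sweep count, and the toy graph are our assumptions.

```python
import numpy as np

def ascos_column(nbrs, j, c=0.9, sweeps=50):
    """Gauss-Seidel sweeps for one column of an ASCOS-style similarity matrix.

    Assumed recurrence: s(i, j) = c/|N(i)| * sum of s(k, j) over i's neighbours,
    for i != j, with s(j, j) = 1.  Updating x in place during a sweep is what
    makes this Gauss-Seidel rather than Jacobi iteration."""
    n = len(nbrs)
    x = np.zeros(n)
    x[j] = 1.0
    for _ in range(sweeps):
        for i in range(n):
            if i == j or not nbrs[i]:
                continue
            x[i] = c * sum(x[k] for k in nbrs[i]) / len(nbrs[i])
    return x

# Small undirected toy graph given as adjacency lists (illustrative only).
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(np.round(ascos_column(nbrs, j=0), 3))   # similarity of every node to node 0
```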

26 citations


Proceedings ArticleDOI
27 Jun 2013
TL;DR: This paper proposes parallel algorithms for SimRank computation on the MapReduce framework, and more specifically its open-source implementation, Hadoop, and employs the proposed methods to compute similarities in order to recommend appropriate products to users in social recommender systems.
Abstract: Recently there has been a lot of interest in graph-based analysis, with examples including social network analysis, recommendation systems, document classification and clustering, and so on. A graph is an abstraction that naturally captures data objects as well as the relationships among those objects. Objects are represented as nodes and relationships as edges in the graph. In many cases, similarities among nodes need to be computed. SimRank is one of the simplest and most intuitive algorithms for this purpose; it is rigorously grounded in random-walk theory. Existing methods for SimRank computation suffer from one limitation: the computing cost can be very high in practice. In order to optimize the computation of SimRank, a few techniques have been proposed. However, the performance of these methods is still limited by the processing ability of a single computer. Ideally, we would like to develop new parallel solutions that can offer improved processing power to compute SimRank on large data sets. In this paper, we propose parallel algorithms for SimRank computation on the MapReduce framework, and more specifically its open-source implementation, Hadoop. Two different parallel methods are proposed, and their performance is evaluated and compared. Furthermore, we employ the proposed methods to compute similarities in order to recommend appropriate products to users in social recommender systems.
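The two Hadoop algorithms are not detailed in the abstract, so the following sketch only shows how one SimRank iteration decomposes into map and reduce steps, simulated in plain Python rather than on Hadoop: the mapper scatters every known pair score to the out-neighbour pairs it contributes to, and the reducer sums and rescales the contributions. The function names and the in-memory shuffle are our own illustration, not either of the paper's two methods.

```python
from collections import defaultdict
from itertools import product

def map_phase(scores, out_nbrs):
    """Emit ((a, b), s_k(i, j)) for every out-neighbour a of i and b of j."""
    for (i, j), s in scores.items():
        for a, b in product(out_nbrs.get(i, ()), out_nbrs.get(j, ())):
            yield (a, b), s

def reduce_phase(emitted, in_deg, C=0.8):
    """Sum the contributions per pair and rescale by C / (|I(a)| * |I(b)|)."""
    acc = defaultdict(float)
    for key, s in emitted:
        acc[key] += s
    new_scores = {(a, b): C * total / (in_deg[a] * in_deg[b])
                  for (a, b), total in acc.items() if a != b}
    for a in in_deg:                          # s(a, a) = 1 in every iteration
        new_scores[(a, a)] = 1.0
    return new_scores

# Toy graph: 2 -> 0, 2 -> 1, 3 -> 2 (nodes 0 and 1 share an in-neighbour).
out_nbrs = {2: [0, 1], 3: [2]}
in_deg = {0: 1, 1: 1, 2: 1, 3: 0}
scores = {(v, v): 1.0 for v in in_deg}        # S_0 = identity
for _ in range(3):                            # three "MapReduce rounds"
    scores = reduce_phase(map_phase(scores, out_nbrs), in_deg)
print(scores[(0, 1)])                         # ~ 0.8
```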

23 citations


Journal ArticleDOI
TL;DR: A comprehensive analysis and critical comparison of various link-based similarity measures and algorithms are presented and some novel and useful guidelines for users to choose the appropriate link- based measure for their applications are discovered.
Abstract: Measuring similarity between objects is a fundamental task in domains such as data mining, information retrieval, and so on. Link-based similarity measures have attracted the attention of many researchers and have been widely applied in recent years. However, most previous works mainly focus on introducing new link-based measures, and seldom provide theoretical as well as experimental comparisons with other measures. Thus, selecting the suitable measure in different situations and applications is difficult. In this paper, a comprehensive analysis and critical comparison of various link-based similarity measures and algorithms are presented. Their strengths and weaknesses are discussed. Their actual runtime performances are also compared via experiments on benchmark data sets. Some novel and useful guidelines for users to choose the appropriate link-based measure for their applications are discovered.

16 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: It is shown that contextual word connections can help to produce semantically meaningful similarity measurement between any pair of Chinese words, and a parallel all-pair SimRank algorithm is used to propagate such contextual similarities throughout the whole vocabulary.
Abstract: A lot of recent work on story segmentation focuses on developing better partitioning criteria to segment news transcripts into sequences of topically coherent stories, while relying on repetition-based hard word-level similarities and ignoring the semantic correlations between different words. In this paper, we propose a purely data-driven approach to measuring soft semantic word- and sentence-level similarity from a given corpus, without the guidance of linguistic knowledge, ground-truth topic labeling, or story boundaries. We show that contextual word connections can help to produce semantically meaningful similarity measurements between any pair of Chinese words. Based on this, we further use a parallel all-pair SimRank algorithm to propagate such contextual similarities throughout the whole vocabulary. The resulting word semantic similarity matrix is then used to refine the classical cosine similarity measurement of sentences. Experiments on benchmark Chinese news corpora show that story segmentation using the proposed soft semantic similarity measurement consistently produces better segmentation accuracy than using the hard similarity. Specifically, we achieve a 3%-10% average F1-measure improvement over state-of-the-art NCuts-based story segmentation.
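The final refinement step, plugging a word-to-word similarity matrix into the sentence-level cosine, is commonly written as the "soft cosine" below. The abstract does not give the exact formula, so the bilinear form and its normalisation should be read as our assumption of what such a refinement typically looks like.

```python
import numpy as np

def soft_cosine(u, v, M):
    """Cosine similarity of term vectors u, v refined by a word-similarity
    matrix M (M[i, j] = semantic similarity of words i and j).
    With M = I this reduces to the ordinary cosine similarity."""
    u, v, M = np.asarray(u, float), np.asarray(v, float), np.asarray(M, float)
    num = u @ M @ v
    den = np.sqrt(u @ M @ u) * np.sqrt(v @ M @ v)
    return 0.0 if den == 0 else num / den

# Two sentences over a 3-word vocabulary share no word, but their words are
# semantically related according to M (e.g., a SimRank-propagated matrix).
M = np.array([[1.0, 0.7, 0.0],
              [0.7, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(soft_cosine([1, 0, 0], [0, 1, 0], M))   # 0.7 instead of hard-cosine 0.0
```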

10 citations


Book ChapterDOI
01 Jan 2013
TL;DR: Two approximate approaches that mitigate the problem of time complexity are presented: the approximate algorithm approach (Approximate SimRank Based Similarity matrix) and the approximate data approach (Prototype-based cluster ensemble model).
Abstract: Cluster ensemble methods have emerged as powerful techniques, aggregating several input data clusterings to generate a single output clustering with improved robustness and stability. In particular, link-based similarity techniques have recently been introduced with performance superior to the conventional co-association method. Their potential and applicability are, however, limited by the underlying time complexity. In light of this shortcoming, this paper presents two approximate approaches that mitigate the problem of time complexity: the approximate algorithm approach (Approximate SimRank Based Similarity matrix) and the approximate data approach (Prototype-based cluster ensemble model). The first approach decreases the computational requirement of the existing link-based technique; the second reduces the size of the problem by finding a smaller, representative, approximate dataset derived by a density-biased sampling technique. The advantages of both approximate approaches are empirically demonstrated over 22 datasets (both artificial and real data), with statistical comparisons of performance (at the 95% confidence level) using three well-known validity criteria. The results suggest that approximate techniques can efficiently help scale up the application of link-based similarity methods to a wider range of data sizes.

7 citations


Book ChapterDOI
27 Aug 2013
TL;DR: A novel approach to conversational recommendation, UtilSim, where utilities corresponding to products get continually updated as a user iteratively interacts with the system, helping her discover her hidden preferences in the process.
Abstract: Conversational Recommender Systems belong to a class of knowledge based systems which simulate a customer’s interaction with a shopkeeper with the help of repeated user feedback till the user settles on a product. One of the modes for getting user feedback is Preference Based Feedback, which is especially suited for novice users (having little domain knowledge), who find it easy to express preferences across products as a whole, rather than specific product features. Such kind of novice users might not be aware of the specific characteristics of the items that they may be interested in, hence, the shopkeeper/system should show them a set of products during each interaction, which can constructively stimulate their preferences, leading them to a desirable product in subsequent interactions. We propose a novel approach to conversational recommendation, UtilSim, where utilities corresponding to products get continually updated as a user iteratively interacts with the system, helping her discover her hidden preferences in the process. We show that UtilSim, which combines domain-specific “dominance” knowledge with SimRank based similarity, significantly outperforms the existing conversational approaches using Preference Based Feedback in terms of recommendation efficiency.

Book ChapterDOI
22 Apr 2013
TL;DR: This work suggests a new approach to compute SimRank in disk-resident graphs and proposes optimization techniques that improve the time cost of the new approach from O(kN^2D^2) to O(kNL), and presents a threshold sieving method to reduce storage and computational cost.
Abstract: There are many real-world applications based on similarity between objects, such as clustering, similarity query processing, information retrieval, and recommendation systems. SimRank is a promising measure of similarity based on the random-surfer model. However, the computational complexity of SimRank is high, and several optimization techniques have been proposed. In this paper, the optimization of SimRank computation in disk-resident graphs is our primary focus. First, we suggest a new approach to compute SimRank. Then we propose optimization techniques that improve the time cost of the new approach from O(kN^2D^2) to O(kNL), where k is the number of iterations, N is the number of nodes, L is the number of edges, and D is the average degree of the nodes. Meanwhile, a threshold sieving method is presented to reduce storage and computational cost. On this basis, an external-memory algorithm for computing SimRank in disk-resident graphs is introduced. In our experiments, the algorithm outperforms its competitor, whose computational complexity is also O(kNL).
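The abstract's threshold sieving idea can be pictured as dropping pair scores below a cutoff after every iteration so that the score map stays sparse. The sketch below shows that idea on a plain in-memory adjacency-list iteration; the cutoff delta, the data structures, and sieving once per full pass are our assumptions, and the paper's external, disk-resident algorithm is not reproduced.

```python
def simrank_sieved(in_nbrs, C=0.8, K=10, delta=1e-3):
    """SimRank iteration that discards pair scores below `delta` after each
    pass, keeping the score map sparse (a simple form of threshold sieving).
    Not the paper's disk-resident algorithm; an in-memory illustration only."""
    S = {v: {v: 1.0} for v in in_nbrs}
    for _ in range(K):
        new_S = {v: {v: 1.0} for v in in_nbrs}
        for a in in_nbrs:
            for b in in_nbrs:
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue
                total = sum(S[i].get(j, 0.0)
                            for i in in_nbrs[a] for j in in_nbrs[b])
                score = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                if score >= delta:            # sieve: keep only significant scores
                    new_S[a][b] = score
        S = new_S
    return S
```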

Journal Article
TL;DR: The weighted SimRank algorithm is applied to construct a Latent Feature Space with feature similarity, and after reducing the mismatch of data distribution between domains, the algorithm performs well on cross-domain sentiment classification.
Abstract: Cross-domain sentiment classification has attracted increasing attention in the natural language processing field. It aims to predict the text polarity of a target domain with the help of labeled texts in a source domain. Traditional supervised classification approaches usually do not perform well due to the difference in data distribution between domains. In this paper, a weighted SimRank algorithm is proposed to address this problem. The weighted SimRank algorithm is applied to construct a Latent Feature Space (LFS) with feature similarity. Then each sample is reweighted by the mapping function learned from the LFS. After reducing the mismatch of data distribution between domains, the algorithm performs well on cross-domain sentiment classification. Experiments verify the effectiveness of the proposed algorithm.

Book
01 Jan 2013
TL;DR: A Mechanism for Stream Program Performance Recovery in Resource Limited Compute Clusters and a Hybrid Approach for Relational Similarity Measurement are introduced.
Abstract: Shortest Path Computation over Disk-Resident Large Graphs Based on Extended Bulk Synchronous Parallel Methods.- Fast SimRank Computation over Disk-Resident Graphs.- S-store: An Engine for Large RDF Graph Integrating Spatial Information.- Physical Column Organization in In-Memory Column Stores.- A Specific Encryption Solution for Data Warehouses.- NameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System.- Mapping Entity-Attribute Web Tables to Web-Scale Knowledge Bases.- On Leveraging Crowdsourcing Techniques for Schema Matching Networks.- A Mechanism for Stream Program Performance Recovery in Resource Limited Compute Clusters.- Detecting User Preference on Microblog.- Efficient SPARQL Query Evaluation via Automatic Data Partitioning.- Searching Desktop Files Based on Access Logs.- An In-Memory/GPGPU Approach to Query Processing for Aspect-Oriented Data Management.- Parallel Triangle Counting over Large Graphs.- Document Summarization via Self-Present Sentence Relevance Model.- A Hybrid Framework for Product Normalization in Online Shopping.- A Hybrid Approach for Relational Similarity Measurement.- Susceptible-Infected-Susceptible Epidemic Model.- EntityManager: An Entity-Based Dirty Data Management System.- Similarity Joins on Item Set Collections Using Zero-Suppressed Binary Decision Diagrams.- Adaptive Query Scheduling in Key-Value Data Stores.- AVR-Tree: Speeding Up the NN and ANN Queries on Location Data.- Generalization-Based Private Indexes for Outsourced Databases.- MVP Index: Towards Efficient Known-Item Search on Large Graphs.- Continuous Topically Related Queries Grouping and Its Application on Interest Identification.- Minimizing Explanations for Missing Answers to Queries on Databases.- A Compact and Efficient Labeling Scheme for XML Documents.- History-Offset Implementation Scheme of XML Documents and Its Evaluations.- On the Complexity of t-Closeness Anonymization and Related Problems.- Distributed Anonymization for Multiple Data Providers in a Cloud System.- Feel Free to Check-in: Privacy Alert against Hidden Location Inference Attacks in GeoSNs.- Consistent Query Answering Based on Repairing Inconsistent Attributes with Nulls.- On Efficient k-Skyband Query Processing over Incomplete Data.- Mining Frequent Patterns from Uncertain Data with MapReduce for Big Data Analytics.- Efficient Probabilistic Reverse k-Nearest Neighbors Query Processing on Uncertain Data.

Journal Article
TL;DR: The proposed hybrid top-N recommendation method, which combines social user tags and collaborative filtering, obtains a trustworthy user set through social relations and combines the predictive ratings of traditional collaborative filtering to provide the recommendations.
Abstract: Aiming at the cold-start user problem of traditional recommendation systems, this paper proposes a hybrid top-N recommendation method that combines social user tags and collaborative filtering. The method obtains a trustworthy user set through social relations. Based on users' personalized tags, it applies a structural-context similarity algorithm, SimRank, to generate the set of similar neighbors, which is used to generate predictive ratings. Finally, it combines these with the predictive ratings of traditional collaborative filtering to provide the recommendations. Experimental evaluations on both the full user dataset and the cold-start user dataset demonstrate that the proposed method outperforms the collaborative filtering approach in terms of recall and precision.

Journal Article
TL;DR: A method is presented to identify co-attention objects from an image pair; it provides an effective way to predict human fixations across multiple images and robustly highlights co-salient regions.
Abstract: In this paper, a method is presented to identify co-attention objects from an image pair. The method provides an effective way to predict human fixations across multiple images and robustly highlights co-salient regions. It generates the single-image saliency map (SISM) by computing three visual saliency maps within each image. For the multi-image saliency map (MISM) computation, a co-multilayer graph is introduced using a spatial pyramid representation of the image pair. Two types of descriptors (i.e., color and texture visual descriptors) are extracted for each region node, which are then used to compute the similarity between node pairs. Finally, a fast single-pair SimRank algorithm is employed to measure the similarity based on the normalized SimRank score.

Book ChapterDOI
01 Jan 2013
TL;DR: This paper exploits the features of Hadoop to compute SimRank in parallel with different methods and compares their performance.
Abstract: Many fields need to compute the similarity between objects, such as recommendation systems and search engines. SimRank is a simple and intuitive algorithm for this purpose, rigorously grounded in random-walk theory. There are three existing iterative ways to compute SimRank; however, all of them share one problem: they are time consuming. Moreover, with the rapidly growing data on the Internet, a novel parallel method is needed to compute SimRank on large-scale datasets. Hadoop is one of the popular distributed platforms. This paper exploits the features of Hadoop to compute SimRank in parallel with different methods and compares their performance.