
Showing papers on "SimRank" published in 2015


Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel sampling method for identifying products that have been targeted for manipulation and a seed set of deceptive reviewers who have been enlisted through crowdsourcing platforms are proposed, outperforming both traditional detection methods and a SimRank-based alternative clustering approach.
Abstract: Online reviews are a cornerstone of consumer decision making. However, their authenticity and quality have proven hard to control, especially as polluters target these reviews to promote products or degrade competitors. In a troubling direction, the widespread growth of crowdsourcing platforms like Mechanical Turk has created a large-scale, potentially difficult-to-detect workforce of malicious review writers. Hence, this paper tackles the challenge of uncovering crowdsourced manipulation of online reviews through a three-part effort: (i) First, we propose a novel sampling method for identifying products that have been targeted for manipulation and a seed set of deceptive reviewers who have been enlisted through crowdsourcing platforms. (ii) Second, we augment this base set of deceptive reviewers through a reviewer-reviewer graph clustering approach based on a Markov Random Field where we define individual potentials (of single reviewers) and pair potentials (between two reviewers). (iii) Finally, we embed the results of this probabilistic model into a classification framework for detecting crowd-manipulated reviews. We find that the proposed approach achieves up to 0.96 AUC, outperforming both traditional detection methods and a SimRank-based alternative clustering approach.

82 citations


Journal ArticleDOI
01 Jan 2015
TL;DR: A novel "seed germination" model that computes partial-pairs SimRank in O(k|E| min{|A|, |B|}) time and O(|E| + k|V|) memory for k iterations on a graph of |V| nodes and |E| edges, allowing scores to be assessed accurately on graphs with tens of millions of links.
Abstract: The assessment of node-to-node similarities based on graph topology arises in a myriad of applications, e.g., web search. SimRank is a notable measure of this type, with the intuition that "two nodes are similar if their in-neighbors are similar". While most existing work retrieving SimRank only considers all-pairs SimRank s(*, *) and single-source SimRank s(*, j) (scores between every node and query j), there are appealing applications for partial-pairs SimRank, e.g., similarity join. Given two node subsets A and B in a graph, partial-pairs SimRank assessment aims to retrieve only {s(a, b)} for all a ∈ A and b ∈ B. However, the best-known solution does not appear to be self-contained, since it hinges on the premise that the SimRank scores with node-pairs in an h-go cover set must be given beforehand. This paper focuses on efficient assessment of partial-pairs SimRank in a self-contained manner. (1) We devise a novel "seed germination" model that computes partial-pairs SimRank in O(k|E| min{|A|, |B|}) time and O(|E| + k|V|) memory for k iterations on a graph of |V| nodes and |E| edges. (2) We further eliminate unnecessary edge access to improve the time of partial-pairs SimRank to O(m min{|A|, |B|}), where m ≤ min{k|E|, Δ^{2k}}, and Δ is the maximum degree. (3) We show that our partial-pairs SimRank model can also handle the computations of all-pairs and single-source SimRanks. (4) We empirically verify that our algorithms are (a) 38x faster than the best-known competitors, and (b) memory-efficient, allowing scores to be assessed accurately on graphs with tens of millions of links.

63 citations
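
The intuition quoted in the abstract above ("two nodes are similar if their in-neighbors are similar") can be illustrated with a minimal Python sketch of the textbook SimRank iteration. This is not the paper's "seed germination" model: it computes all pairs and then filters to A x B, which is exactly the cost that partial-pairs algorithms aim to avoid. The function names, the toy graph, and the damping value C = 0.6 are assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, C=0.6, iterations=10):
    """Naive all-pairs SimRank by fixed-point iteration.

    in_neighbors maps every node to the list of its in-neighbors.
    """
    nodes = list(in_neighbors)
    # Iteration 0: a node is fully similar to itself, and to nothing else.
    scores = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_scores = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_scores[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar only to itself.
                new_scores[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, damped by C.
            total = sum(scores[(x, y)] for x in ia for y in ib)
            new_scores[(a, b)] = C * total / (len(ia) * len(ib))
        scores = new_scores
    return scores


def partial_pairs(scores, A, B):
    """Keep only the scores s(a, b) with a in A and b in B."""
    return {(a, b): scores[(a, b)] for a in A for b in B}


# Hypothetical toy graph: u -> a, u -> b, a -> c, b -> d.
graph = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"], "d": ["b"]}
scores = simrank(graph, C=0.6, iterations=5)
subset = partial_pairs(scores, A=["a", "c"], B=["b", "d"])
```

In this toy graph, a and b share their only in-neighbor, so s(a, b) equals C, and s(c, d) equals C times s(a, b).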


Journal ArticleDOI
01 Apr 2015
TL;DR: This paper proposes a novel two-stage random-walk sampling framework (TSF) for SimRank-based similarity search (e.g., top-k search) and demonstrates that TSF can handle dynamic billion-edge graphs with high performance.
Abstract: SimRank is an important measure of vertex-pair similarity according to the structure of graphs. The similarity search based on SimRank is an important operation for identifying similar vertices in a graph and has been employed in many data analysis applications. Nowadays, graphs in the real world become much larger and more dynamic. The existing solutions for similarity search are expensive in terms of time and space cost. None of them can efficiently support similarity search over large dynamic graphs. In this paper, we propose a novel two-stage random-walk sampling framework (TSF) for SimRank-based similarity search (e.g., top-k search). In the preprocessing stage, TSF samples a set of one-way graphs to index raw random walks in a novel manner within O(NRg) time and space, where N is the number of vertices and Rg is the number of one-way graphs. The one-way graph can be efficiently updated in accordance with the graph modification, thus TSF is well suited to dynamic graphs. During the query stage, TSF can search similar vertices fast by naturally pruning unqualified vertices based on the connectivity of one-way graphs. Furthermore, with additional Rq samples, TSF can estimate the SimRank score with probability [EQUATION] if the error of approximation is bounded by 1 -- e. Finally, to guarantee the scalability of TSF, the one-way graphs can also be compactly stored on the disk when the memory is limited. Extensive experiments have demonstrated that TSF can handle dynamic billion-edge graphs with high performance.

57 citations
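
Random-walk sampling for SimRank rests on a standard interpretation: s(a, b) is the expected value of C^τ, where τ is the first step at which two independent reverse random walks started at a and b meet. The Python sketch below estimates that expectation by direct sampling. It is a generic Monte Carlo estimator, not TSF's one-way-graph index; the function name, the `max_steps` truncation, and the toy graph are assumptions, and the truncation biases the estimate slightly downward.

```python
import random


def mc_simrank(in_neighbors, a, b, C=0.6, walks=1000, max_steps=10, seed=0):
    """Estimate s(a, b) as the average of C**tau over sampled walk pairs."""
    if a == b:
        return 1.0
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for step in range(1, max_steps + 1):
            if not in_neighbors[x] or not in_neighbors[y]:
                break  # one walk is stuck, so this pair never meets
            # Each walk moves backwards along a uniformly chosen in-edge.
            x = rng.choice(in_neighbors[x])
            y = rng.choice(in_neighbors[y])
            if x == y:
                total += C ** step  # first meeting at this step
                break
    return total / walks


# Hypothetical toy graph: u -> a, u -> b, and both a and b point to e and f.
graph = {"u": [], "a": ["u"], "b": ["u"], "e": ["a", "b"], "f": ["a", "b"]}
estimate_ab = mc_simrank(graph, "a", "b")
estimate_ef = mc_simrank(graph, "e", "f")
```

For the pair (a, b) every sampled walk pair meets at u after one step, so the estimate equals C up to rounding. For (e, f) the estimate fluctuates around the exact value 0.48 implied by the SimRank recurrence on this graph.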


Proceedings ArticleDOI
18 May 2015
TL;DR: Two existing metrics, SimRank and PageRank, are reviewed, and their suitability and performance for computing similarity between resources in RDF graphs, as well as their use in feeding a content-based recommender system, are investigated.
Abstract: The Web of Data is the natural evolution of the World Wide Web from a set of interlinked documents to a set of interlinked entities. It is a graph of information resources interconnected by semantic relations, thereby yielding the name Linked Data. The proliferation of Linked Data is surely an opportunity to create a new family of data-intensive applications such as recommender systems. In particular, since content-based recommender systems are based on the notion of similarity between items, the selection of the right graph-based similarity metric is of paramount importance to build an effective recommendation engine. In this paper, we review two existing metrics, SimRank and PageRank, and investigate their suitability and performance for computing similarity between resources in RDF graphs, as well as their usage to feed a content-based recommender system. Finally, we conduct experimental evaluations on a dataset for recommending musical artists and bands, comparing our results with two other content-based baselines and measuring their performance with precision and recall, catalog coverage, items distribution and novelty metrics.

54 citations


Journal ArticleDOI
01 Sep 2015
TL;DR: This work is the first to report results on clue-web, which is 10x larger than the largest graph ever reported for SimRank computation, and is orders of magnitude more efficient and scalable than existing solutions for large-scale problems.
Abstract: Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break such dependency for the maximum efficiency possible. Our method consists of offline and online phases. In the offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responds to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale clue-web graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on clue-web, which is 10x larger than the largest graph ever reported for SimRank computation.

53 citations


Journal ArticleDOI
TL;DR: A structural-based similarity measure, NetSim, towards efficiently computing similarity between centers in an x-star network, which requires less time and space cost than existing methods since the scale of attribute network is significantly smaller than the whole x-star network.
Abstract: The efficiency improvement is evident for similarity computation. The effectiveness of returned result is good for similarity search. The pruning algorithm is presented for supporting fast online query processing. The accuracy loss of pruning algorithm can be controlled by setting thresholds. An x-star network is an information network which consists of centers with connections among themselves, and attributes of different types linking to these centers. As x-star networks become ubiquitous, extracting knowledge from x-star networks has become an important task. Similarity search in x-star network aims to find the centers similar to a given query center, which has numerous applications including collaborative filtering, community mining and web search. Although existing methods, such as SimRank and P-Rank, yield promising similarity results, they are not applicable for massive x-star networks. In this paper, we propose a structural-based similarity measure, NetSim, towards efficiently computing similarity between centers in an x-star network. The similarity between attributes is computed in the pre-processing stage by the expected meeting probability over attribute network that is extracted from the whole structure of x-star network. The similarity between centers is computed online according to the attribute similarities based on the intuition that similar centers are linked with similar attributes. NetSim requires less time and space cost than existing methods since the scale of attribute network is significantly smaller than the whole x-star network. For supporting fast online query processing, we develop a pruning algorithm by building a pruning index, which prunes candidate centers that are not promising. Extensive experiments demonstrate the effectiveness and efficiency of our method through comparing with the state-of-the-art measures.

36 citations


Proceedings ArticleDOI
09 Aug 2015
TL;DR: The scheme, SR#, is efficient and semantically meaningful; it gives mathematical insights into the semantic difference between SimRank and its variant, and corrects an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much".
Abstract: SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.

35 citations


Journal ArticleDOI
Lingxia Du1, Cuiping Li1, Hong Chen1, Liwen Tan1, Yinglong Zhang1 
TL;DR: This paper investigates the problem of node similarity computation on large uncertain graphs and proposes a probabilistic framework to compute it, and proposes an efficient dynamic programming algorithm to degrade the time complexity from exponential to polynomial.

24 citations


Journal ArticleDOI
TL;DR: This article argues that SimRank and its families, such as P-Rank and SimRank++, fail to capture similar node pairs in certain conditions, and presents new similarity measures ASCOS and ASCOS++ to address the problem.
Abstract: In this article, we explore the relationships among digital objects in terms of their similarity based on vertex similarity measures. We argue that SimRank, a famous similarity measure, and its families, such as P-Rank and SimRank++, fail to capture similar node pairs in certain conditions, especially when two nodes can only reach each other through paths of odd lengths. We present new similarity measures ASCOS and ASCOS++ to address the problem. ASCOS outputs a more complete similarity score than SimRank and SimRank's families. ASCOS++ enriches ASCOS to include edge weight into the measure, giving all edges and network weights an opportunity to make their contribution. We show that both ASCOS++ and ASCOS can be reformulated and applied on a distributed environment for parallel contribution. Experimental results show that ASCOS++ reports a better score than SimRank and several famous similarity measures. Finally, we re-examine previous use cases of SimRank, and explain appropriate and inappropriate use cases. We suggest that future SimRank users follow the rules proposed here before naively applying it. We also discuss the relationship between ASCOS++ and PageRank.

22 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper proposes a scalable approximation algorithm with an arbitrary accuracy for the similarity join problem with the SimRank similarity measure that scales up to the network of 5M vertices and 70M edges.
Abstract: Similarity join finds all pairs of objects (i, j) with similarity score s(i, j) greater than some specified threshold θ. This is a fundamental query problem in the database research community, and is used in many practical applications, such as duplicate detection, merge/purge, record linkage, object matching, and reference conciliation.

21 citations
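
The similarity-join definition in the abstract above (all pairs (i, j) with s(i, j) greater than a threshold θ) can be written as a brute-force filter. The Python sketch below assumes the scores have already been computed and are held in a dictionary; it is not the paper's scalable approximation algorithm, and the function name and sample scores are hypothetical.

```python
def similarity_join(scores, theta):
    """Return unordered pairs (i, j), i < j, whose score exceeds theta.

    scores maps node pairs (i, j) to a similarity value; the result is
    sorted from the most similar pair to the least similar one.
    """
    result = []
    for (i, j), s in scores.items():
        # i < j skips self-pairs and the mirrored copy of each pair.
        if i < j and s > theta:
            result.append((i, j, s))
    return sorted(result, key=lambda item: -item[2])


# Hypothetical symmetric scores over three nodes.
sample_scores = {
    (1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0,
    (1, 2): 0.42, (2, 1): 0.42,
    (1, 3): 0.05, (3, 1): 0.05,
    (2, 3): 0.30, (3, 2): 0.30,
}
joined = similarity_join(sample_scores, theta=0.1)
```

The filter touches every stored pair, which is quadratic in the number of nodes; avoiding that cost is the motivation for approximate join algorithms.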


Journal ArticleDOI
TL;DR: A novel clustering strategy is proposed to eliminate duplicate computations occurring in partial sums, and an efficient algorithm is devised to accelerate SimRank computation to O(Kd'n^2) time, which achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores.
Abstract: SimRank is a powerful model for assessing vertex-pair similarities in a graph. It follows the concept that two vertices are similar if they are referenced by similar vertices. The prior work [18] exploits partial sums memoization to compute SimRank in $O(Kmn)$ time on a graph of $n$ vertices and $m$ edges, for $K$ iterations. However, computations among different partial sums may have redundancy. Besides, to guarantee a given accuracy $\epsilon$ , the existing SimRank needs $K=\lceil \log _C \,\epsilon \rceil$ iterations, where $C$ is a damping factor, but the geometric rate of convergence is slow if a high accuracy is expected. In this paper, (1) a novel clustering strategy is proposed to eliminate duplicate computations occurring in partial sums, and an efficient algorithm is then devised to accelerate SimRank computation to $O(K d^{\prime } n^2)$ time, where $d^{\prime }$ is typically much smaller than $\frac{m}{n}$ . (2) A new differential SimRank equation is proposed, which can represent the SimRank matrix as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) In bipartite domains, a novel finer-grained partial max clustering method is developed to speed up the computation of the Minimax SimRank variation from $O(Kmn)$ to $O(Km^{\prime }n)$ time, where $m^{\prime } \ ({\le} m)$ is the number of edges in a reduced graph after edge clustering, which can be typically much smaller than $m$ . 
Using real and synthetic data, we empirically verify that (1) our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude; (2) the revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores; (3) our finer-grained partial max memoization for the Minimax SimRank variation in bipartite domains is 5X-12X faster than the baselines.
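
The iteration bound quoted in the abstract above, $K=\lceil \log _C \,\epsilon \rceil$, can be made concrete with a short Python calculation. The damping factors and the accuracy below are example values chosen for illustration and do not come from the paper.

```python
import math


def iterations_for_accuracy(C, eps):
    """Smallest K with C**K <= eps, i.e. K = ceil(log_C(eps))."""
    # log_C(eps) = ln(eps) / ln(C); both logarithms are negative for 0 < C, eps < 1.
    return math.ceil(math.log(eps) / math.log(C))


k_low_damping = iterations_for_accuracy(0.6, 0.001)   # 14 iterations
k_high_damping = iterations_for_accuracy(0.8, 0.001)  # 31 iterations
```

A larger damping factor needs noticeably more iterations for the same accuracy, which is the slow geometric convergence that the abstract refers to.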

Proceedings ArticleDOI
17 Oct 2015
TL;DR: This paper proposes efficient ranking criteria that can secure correct relative orders of node-pairs with respect to SimRank scores when they are computed in an iterative fashion and shows the superiority of these criteria in harvesting top-K SimRank scores and bucket orders from a full ranking list.
Abstract: One of the important tasks in link analysis is to quantify the similarity between two objects based on hyperlink structure. SimRank is an attractive similarity measure of this type. Existing work mainly focuses on absolute SimRank scores, and often harnesses an iterative paradigm to compute them. While these iterative scores converge to exact ones with the increasing number of iterations, it is still notoriously difficult to determine how well the relative orders of these iterative scores can be preserved for a given iteration. In this paper, we propose efficient ranking criteria that can secure correct relative orders of node-pairs with respect to SimRank scores when they are computed in an iterative fashion. Moreover, we show the superiority of our criteria in harvesting top-K SimRank scores and bucket orders from a full ranking list. Finally, viable empirical studies verify the usefulness of our techniques for SimRank top-K ranking and bucket ordering.

Journal ArticleDOI
TL;DR: This paper defines effective relationship strength (ERS) to distinguish link importance by utilizing node activity, node attraction and link frequency, and formalizes ESimRank equation by combining ERS and the expected meeting probabilities of any path length.

Journal ArticleDOI
TL;DR: A link-based similarity search method towards efficiently finding similar entities in web networks, WebSim, which defines the similarity between entities as the 2-hop similarity of SimRank and develops a pruning algorithm to support fast query processing.
Abstract: The pre-computation cost in the off-line stage is significantly reduced. The efficiency of query processing is optimized by proposing a pruning algorithm. The accuracy loss of pruning algorithm is controlled by tuning threshold. The effectiveness of returned result is acceptable. Similarity search in web networks, aiming to find entities similar to the given entity, is one of the core tasks in network analysis. With the proliferation of web applications, including web search and recommendation system, SimRank has been a well-known measure for evaluating entity similarity in a network. However, the existing work computes SimRank iteratively over a huge similarity matrix, which is expensive in terms of time and space cost and cannot efficiently support similarity search over large networks. In this paper, we propose a link-based similarity search method, WebSim, towards efficiently finding similar entities in web networks. WebSim defines the similarity between entities as the 2-hop similarity of SimRank. To reduce computation cost, we divide the similarity search process into two stages: off-line stage and on-line stage. In the off-line stage, the 1-hop similarities are computed, and an optimized algorithm is designed to reduce the unnecessary accumulation operations on zero similarities. In the on-line stage, the 2-hop similarities are computed, and a pruning algorithm is developed to support fast query processing through searching similar entries from a partial sums index derived from the 1-hop similarities. The index items that are lower than a given threshold are skipped to reduce the searching space. Compared to the iterative SimRank computation, the time and space cost of similarity computation is significantly reduced, since WebSim maintains only the similarity matrix of 1-hop that is much smaller than that of multi-hop.
Experiments through comparison with SimRank and its optimized algorithms demonstrate that WebSim has on average a 99.83% reduction in the time cost and a 92.12% reduction in the space cost of similarity computation, and achieves on average 99.98% NDCG.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This study devises a model, Co-Simmate, to speed up the retrieval of all pairs of Co-Simranks to O(log2 (log(1/e))*n^3) time, and integrates it with a matrix decomposition based method on singular graphs to attain higher efficiency.
Abstract: Co-Simrank is a useful Simrank-like measure of similarity based on graph structure. The existing method iteratively computes each pair of Co-Simrank score from a dot product of two Pagerank vectors, entailing O(log(1/e)*n^3) time to compute all pairs of Co-Simranks in a graph with n nodes, to attain a desired accuracy e. In this study, we devise a model, Co-Simmate, to speed up the retrieval of all pairs of Co-Simranks to O(log2 (log(1/e))*n^3) time. Moreover, we show the optimality of Co-Simmate among other hop-(u^k) variations, and integrate it with a matrix decomposition based method on singular graphs to attain higher efficiency. The viable experiments verify the superiority of Co-Simmate to others.
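
The drop from log(1/e) to log2(log(1/e)) rounds suggests a repeated-squaring scheme, and the Python sketch below illustrates that idea under stated assumptions. It takes the Co-Simrank recurrence in the form S = C·AᵀSA + I, so that S is the series of terms C^i (A^i)ᵀ A^i, and it doubles the number of summed terms per round. The recurrence form, the normalization of A, and all function names are assumptions for illustration; this is not presented as the published Co-Simmate algorithm.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def add_scaled(X, Y, scale):
    """Return X + scale * Y, elementwise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def cosimrank_plain(A, C, terms):
    """Sum the first `terms` series terms with one new term per iteration."""
    n, At = len(A), transpose(A)
    S = identity(n)
    for _ in range(terms - 1):
        S = add_scaled(identity(n), matmul(At, matmul(S, A)), C)
    return S


def cosimrank_doubling(A, C, rounds):
    """Sum the first 2**rounds series terms, doubling the count per round."""
    S = identity(len(A))
    P, c = A, C  # P holds A**(2**k) and c holds C**(2**k) at round k
    for _ in range(rounds):
        S = add_scaled(S, matmul(transpose(P), matmul(S, P)), c)
        P, c = matmul(P, P), c * c
    return S


# Hypothetical column-normalized adjacency matrix of a 3-node graph.
A = [[0.0, 0.5, 0.0], [1.0, 0.0, 1.0], [0.0, 0.5, 0.0]]
plain = cosimrank_plain(A, 0.8, terms=8)
doubled = cosimrank_doubling(A, 0.8, rounds=3)
max_gap = max(abs(p - d) for rp, rd in zip(plain, doubled) for p, d in zip(rp, rd))
```

Three doubling rounds reproduce the same eight-term partial sum that the plain iteration reaches in seven updates, so `max_gap` is zero up to floating-point rounding.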

Proceedings ArticleDOI
Haiyan Zhang1, Chenxi Zhou1, Xun Liang1, Xi Zhao1, Yaping Li1 
01 Oct 2015
TL;DR: The problem of the local and global weighting balance is first proposed and the SimRank is next introduced as a novel edge weighting method and the fast Newman algorithm is extended to be applicable for a weighted network.
Abstract: Community detection is one of the most popular issues in analyzing and understanding networks. Existing works show that community detection can be enhanced by proper assignments of weights onto the edges of a network. Large numbers of edge weighting schemes have been developed to cope with this problem. However, a satisfactory balance between local and global weightings has hardly been found. In this paper, the problem of the local and global weighting balance is first proposed and discussed. SimRank is next introduced as a novel edge weighting method. Furthermore, the fast Newman algorithm is extended to be applicable for a weighted network. Combined with the edge weighting techniques, the extended algorithm significantly enhances the performance of the original algorithm in exhaustive experiments. By comparing with several weighting methods, the experiments demonstrate that the proposed algorithm is superior and more robust for different kinds of networks.

Posted Content
TL;DR: A generalization of SimRank similarity measure for heterogeneous information networks is proposed and it is shown that the intraclass similarity score s(a, b) is high if the set of objects that are related with a are pair-wise similar according to all imposed relations.
Abstract: We propose a generalization of SimRank similarity measure for heterogeneous information networks. Given the information network, the intraclass similarity score s(a, b) is high if the set of objects that are related with a and the set of objects that are related with b are pair-wise similar according to all imposed relations.

Journal Article
TL;DR: The concept of image-rich information networks, image retrieval system and techniques like CBIR and TBIR, and the comparative study of image ranking and retrieval algorithms like simrank, k-simRank, HMOK-simrank are explained in this paper.
Abstract: Social networking sites allow users to share images, and e-commerce websites also contain millions of images, thus forming image-rich information networks. Retrieving images from image-rich information networks is a very challenging task, due to the existence of information like text, user, image, feature, tags and group. The concept of image-rich information networks, image retrieval systems and techniques like CBIR and TBIR are explained in this paper. The comparative study of image ranking and retrieval algorithms like simrank, k-simrank, HMOK-simrank is also mentioned in this paper. General Terms: Image-rich information networks, CBIR, TBIR.

Patent
23 Sep 2015
TL;DR: In this article, the authors proposed a node similarity calculation method based on SimRank, which comprises the steps as follows: 1) using an adjacent matrix form to express a multi-relational network; 2) establishing an Eigen-SimRank model and analyzing correlation matrix information needed to calculate node similarity matrix S; 3) calculating the node similarity in the multi-relational network according to the correlation matrix information needed to calculate node similarity matrix S if a network structure is not changed.
Abstract: The invention relates to a node similarity calculation method based on SimRank. The method comprises the steps as follows: 1) using an adjacent matrix form to express a multi-relational network; using non-iterative node similarity matrix S to express the node similarity of the multi-relational network; 2) establishing an Eigen-SimRank model and analyzing correlation matrix information needed to calculate node similarity matrix S; 3) calculating the node similarity in the multi-relational network according to the correlation matrix information needed to calculate node similarity matrix S if a network structure is not changed; 4) using an Eigen-SimRank dynamic update algorithm to update the correlation matrix information if the network structure is changed and calculating new correlation matrix information needed by a similarity matrix after obtaining the change of network structure; 5) calculating node similarity according to the updated correlation matrix information; 6) analyzing a similarity value among nodes in the multi-relational network according to a similarity calculation result obtained by calculating. The node similarity calculation method based on SimRank of the invention could be widely applied to the field of node similarity calculation in the network structure.

Book ChapterDOI
08 Jun 2015
TL;DR: This paper adopts SimRank as the similarity computation measure and re-write the original inefficient iterative equation into a non-iterative one, called Eigen-SimRank, which is focused on multi-relational networks.
Abstract: SimRank is one measure that computes the similarities between nodes in applications where the returning of top-k query lists is often required. In this paper, we adopt SimRank as the similarity computation measure and rewrite the original inefficient iterative equation into a non-iterative one, which we call Eigen-SimRank. We focus on multi-relational networks, where there may exist different kinds of relationships among nodes and query results may change with different perspectives. In order to compute a top-k query list under any perspective, especially a compound perspective, we suggest a dynamic updating algorithm and rank aggregation methods. We evaluate our algorithms in the experiment section.

Journal Article
TL;DR: This work uses SimRank to find similarity between neighbours in a contextual way and evaluates it in a numerical way, and uses Jaccard similarity for calculating similarity by using LSH and various other methods.
Abstract: The dynamic growth of data over the internet and the need to store and access information efficiently bring up new challenges of finding related documents, similar nodes, domain & inter-domain similarities etc. Though SimRank is applicable to a wide range of areas, we use this similarity ranking to find similarity between neighbours in a contextual way and evaluate it in a numerical way. Here we use Jaccard similarity for calculating similarity by using LSH and various other methods. We further optimize the Jaccard algorithm by using the Token Optimization join method. The obtained result is further evaluated with a combination of four other parameters, and from the result obtained, the similarity values of nodes that are greater than the optimal threshold value φ are retrieved from the huge graph.

Proceedings ArticleDOI
14 Jun 2015
TL;DR: This paper first builds the co-purchasing network by using the relationships between different types of products, then computes the similarity between products using SimRank, and gives some experimental results by implementing this method on an Amazon dataset.
Abstract: Online bookstores have attracted millions of people and helped provide them hopeful books. Similarity search over online bookstores mainly focuses on finding the top-K most similar products for a given query. In this paper, we discuss how to find similar products for a given query product, and propose a framework for finding similar products from an online bookstore. We first build the co-purchasing network by using the relationships between different types of products, and then compute the similarity between products using SimRank. Finally, we give some experimental results by implementing this method on an Amazon dataset, which demonstrate that the proposed method can find the underlying results over a real dataset.

Journal ArticleDOI
TL;DR: This paper proposes Mok-SimRank to compute link-based similarity and a dual similarity integration algorithm for both link and content based similarity and shows that this approach is significantly better than traditional methods in terms of relevance.
Abstract: In real-world scenarios the use of images is growing rapidly; an image-rich network is one that comprises billions of images. Social media websites such as Picasa, Flickr and Facebook comprise billions of end-user-posted images along with their annotations. Similarly, electronic commerce websites such as Flipkart, Myntra and Amazon are also furnished with product-related images. In this paper, we introduce how to perform efficient and optimum information retrieval in an online image-rich system. We propose Mok-SimRank to compute link-based similarity and a dual similarity integration algorithm for both link and content based similarity. Experimental results on an online electronic commerce site show that our approach is significantly better than traditional methods in terms of relevance.

01 Jan 2015
TL;DR: An algorithm Integrated Weighted Similarity Learning (IWSL) is proposed to account for both link-based and content based similarities by considering the network structure and mutually reinforcing link similarity and feature weight learning.
Abstract: Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments, and other information, thus forming heterogeneous image-rich information networks. In this paper, the concept of (heterogeneous) image-rich information network and the problem of how to perform information retrieval and recommendation in such networks is introduced. A fast algorithm, heterogeneous minimum order k-SimRank (HMok- SimRank) is proposed to compute link-based similarity in weighted heterogeneous information networks. Then, we propose an algorithm Integrated Weighted Similarity Learning (IWSL) to account for both link-based and content based similarities by considering the network structure and mutually reinforcing link similarity and feature weight learning. Both local and global feature learning methods are designed. Experimental results on Flickr and Amazon data sets show that our approach is significantly better than traditional methods in terms of both relevance and speed. A new product search and recommendation system for e-commerce has been implemented based on our algorithm.

Posted Content
TL;DR: Following the random-walk-based formulation of SimRank on deterministic graphs and the possible worlds model of uncertain graphs, random walks on uncertain graphs are defined for the first time and shown to satisfy Markov's property, and the SimRank measure is formulated based on random walks on uncertain graphs.
Abstract: SimRank is a similarity measure between vertices in a graph, which has become a fundamental technique in graph analytics. Recently, many algorithms have been proposed for efficient evaluation of SimRank similarities. However, the existing SimRank computation algorithms either overlook uncertainty in graph structures or are based on an unreasonable assumption (Du et al.). In this paper, we study SimRank similarities on uncertain graphs based on the possible world model of uncertain graphs. Following the random-walk-based formulation of SimRank on deterministic graphs and the possible worlds model of uncertain graphs, we define random walks on uncertain graphs for the first time and show that our definition of random walks satisfies Markov's property. We formulate the SimRank measure based on random walks on uncertain graphs. We discover a critical difference between random walks on uncertain graphs and random walks on deterministic graphs, which makes all existing SimRank computation algorithms on deterministic graphs inapplicable to uncertain graphs. To efficiently compute SimRank similarities, we propose three algorithms, namely the baseline algorithm with high accuracy, the sampling algorithm with high efficiency, and the two-phase algorithm with comparable efficiency to the sampling algorithm and about an order of magnitude smaller relative error than the sampling algorithm. The extensive experiments and case studies verify the effectiveness of our SimRank measure and the efficiency of our SimRank computation algorithms.

Book ChapterDOI
08 Jun 2015
TL;DR: A model to recommend user’s potential friends by incorporating users’ generated content and structure features and a weighted SimRank algorithm is proposed to recommend the most similar users as the friends.
Abstract: Intuitively, a friendship link between two users can be recommended based on the similarity of their generated text content or structure information. Although this problem has been extensively studied, the challenge of how to effectively incorporate the information from the social interaction and user generated content remains largely open. We propose a model (LRCS) to recommend user’s potential friends by incorporating user’s generated content and structure features. First, network users are clustered based on the similarity of user’s interest and structural features. Users in the same cluster with the query user are considered as the candidate friends. Then, a weighted SimRank algorithm is proposed to recommend the most similar users as the friends. Experiments on two real-life datasets show the superiority of our approach.