Bio: Kijung Shin is an academic researcher at KAIST. He has contributed to research topics including computer science and hypergraphs, has an h-index of 18, and has co-authored 73 publications receiving 1,164 citations. Previous affiliations of Kijung Shin include Microsoft and Seoul National University.
13 Aug 2016
TL;DR: FRAUDAR is proposed, an algorithm that is camouflage-resistant, provides upper bounds on the effectiveness of fraudsters, and is effective in real-world data.
Abstract: Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look "normal". Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic. Our focus is to spot fraudsters in the presence of camouflage or hijacked accounts. We propose FRAUDAR, an algorithm that (a) is camouflage-resistant, (b) provides upper bounds on the effectiveness of fraudsters, and (c) is effective in real-world data. Experimental results under various attacks show that FRAUDAR outperforms the top competitor in accuracy of detecting both camouflaged and non-camouflaged fraud. Additionally, in real-world experiments on a Twitter follower-followee graph with 1.47 billion edges, FRAUDAR successfully detected a subgraph of more than 4000 accounts, of which a majority had tweets showing that they used follower-buying services.
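The dense-subgraph intuition behind this line of work can be illustrated with the classic greedy "peeling" scheme: repeatedly remove the minimum-degree node and keep the intermediate subgraph of highest average degree. This is a minimal sketch of that generic baseline (Charikar's greedy densest-subgraph algorithm), not FRAUDAR itself, which generalizes the density measure to down-weight edges from high-degree nodes and thereby resist camouflage; all names here are illustrative.

```python
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling sketch: edges is an iterable of (u, v) pairs.
    Returns the node set of the densest intermediate subgraph found,
    where density is average degree |E|/|V| (up to a factor of 2)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    nodes = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # current edge count
    best_density, best_nodes = 0.0, set(nodes)

    while nodes:
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        u = min(nodes, key=lambda x: len(adj[x]))  # peel the min-degree node
        for v in adj[u]:
            adj[v].discard(u)
        m -= len(adj[u])
        nodes.discard(u)

    return best_nodes
```

On a 4-clique with a pendant path attached, the peeling keeps the clique, the densest part, while discarding the sparse attachment.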
01 Dec 2016
TL;DR: Pervasive patterns related to k-cores that emerge in graphs from several diverse domains are explored, and algorithmic contributions show the usefulness of these patterns.
Abstract: What do the k-core structures of real-world graphs look like? What are the common patterns and the anomalies? How can we use them for algorithm design and applications? A k-core is the maximal subgraph where all vertices have degree at least k. This concept has been applied to such diverse areas as hierarchical structure analysis, graph visualization, and graph clustering. Here, we explore pervasive patterns related to k-cores that emerge in graphs from several diverse domains. Our discoveries are as follows: (1) Mirror Pattern: coreness of vertices (i.e., the maximum k such that each vertex belongs to the k-core) is strongly correlated with their degree. (2) Core-Triangle Pattern: degeneracy of a graph (i.e., the maximum k such that the k-core exists in the graph) obeys a 3-to-1 power law with respect to the count of triangles. (3) Structured Core Pattern: degeneracy-cores are not cliques but have non-trivial structures such as core-periphery and communities. Our algorithmic contributions show the usefulness of these patterns. (1) Core-A, which measures the deviation from Mirror Pattern, successfully finds anomalies in real-world graphs, complementing densest-subgraph-based anomaly detection methods. (2) Core-D, a single-pass streaming algorithm based on Core-Triangle Pattern, accurately estimates the degeneracy of billion-scale graphs up to 7× faster than a recent multipass algorithm. (3) Core-S, inspired by Structured Core Pattern, identifies influential spreaders up to 17× faster than top competitors with comparable accuracy.
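The coreness and degeneracy notions defined above can be computed with the standard k-core peeling procedure; a minimal sketch (not the paper's streaming algorithm Core-D, which only estimates degeneracy in one pass):

```python
def coreness(adj):
    """Standard k-core peeling: adj maps node -> set of neighbors.
    Repeatedly remove a minimum-degree vertex; its degree at removal
    time, made monotone, is its coreness. Returns node -> coreness;
    the degeneracy of the graph is max(coreness values)."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    core = {}
    k = 0
    while adj:
        u = min(adj, key=lambda x: len(adj[x]))
        k = max(k, len(adj[u]))  # coreness never decreases along the peeling order
        core[u] = k
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
    return core
```

For a triangle with one pendant vertex, the pendant has coreness 1 and the triangle vertices coreness 2, so the degeneracy is 2.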
27 May 2015
TL;DR: BEAR is proposed, a fast, scalable, and accurate method for computing RWR on large graphs that significantly outperforms other state-of-the-art methods in terms of preprocessing and query speed, space efficiency, and accuracy.
Abstract: Given a large graph, how can we calculate the relevance between nodes fast and accurately? Random walk with restart (RWR) provides a good measure for this purpose and has been applied to diverse data mining applications including ranking, community detection, link prediction, and anomaly detection. Since calculating RWR from scratch takes a long time, various preprocessing methods, most of which are related to inverting adjacency matrices, have been proposed to speed up the calculation. However, these methods do not scale to large graphs because they usually produce large and dense matrices which do not fit into memory. In this paper, we propose BEAR, a fast, scalable, and accurate method for computing RWR on large graphs. BEAR comprises the preprocessing step and the query step. In the preprocessing step, BEAR reorders the adjacency matrix of a given graph so that it contains a large and easy-to-invert submatrix, and precomputes several matrices including the Schur complement of the submatrix. In the query step, BEAR computes the RWR scores for a given query node quickly using a block elimination approach with the matrices computed in the preprocessing step. Through extensive experiments, we show that BEAR significantly outperforms other state-of-the-art methods in terms of preprocessing and query speed, space efficiency, and accuracy.
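For reference, the quantity BEAR accelerates can be defined by simple power iteration, the slow from-scratch baseline the abstract mentions; this is a sketch of that baseline only, not of BEAR's block-elimination preprocessing, and the restart probability c = 0.15 is an assumed conventional value:

```python
def rwr(adj, seed, c=0.15, iters=100):
    """Random walk with restart by power iteration.
    adj maps node -> list of out-neighbors (every node must have at
    least one); returns node -> RWR score with respect to seed.
    At each step, mass c restarts at the seed and mass (1 - c) is
    spread uniformly over out-edges."""
    nodes = list(adj)
    r = {u: (1.0 if u == seed else 0.0) for u in nodes}
    for _ in range(iters):
        nxt = {u: (c if u == seed else 0.0) for u in nodes}
        for u in nodes:
            share = (1 - c) * r[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        r = nxt
    return r
```

The scores form a probability distribution (they sum to 1), and the seed node typically receives the highest score, reflecting its relevance to itself.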
19 Sep 2016
TL;DR: This work proposes M-Zoom, a flexible framework for finding dense blocks in tensors, which works with a broad class of density measures and provides a guarantee on the lowest density of the blocks it finds.
Abstract: Given a large-scale and high-order tensor, how can we find dense blocks in it? Can we find them in near-linear time but with a quality guarantee? Extensive previous work has shown that dense blocks in tensors as well as graphs indicate anomalous or fraudulent behavior, e.g., lockstep behavior in social networks. However, available methods for detecting such dense blocks are not satisfactory in terms of speed, accuracy, or flexibility. In this work, we propose M-Zoom, a flexible framework for finding dense blocks in tensors, which works with a broad class of density measures. M-Zoom has the following properties: (1) Scalable: M-Zoom scales linearly with all aspects of tensors and is up to 114× faster than state-of-the-art methods with similar accuracy. (2) Provably accurate: M-Zoom provides a guarantee on the lowest density of the blocks it finds. (3) Flexible: M-Zoom supports multi-block detection and size bounds as well as diverse density measures. (4) Effective: M-Zoom successfully detected edit wars and bot activities in Wikipedia, and spotted network attacks from a TCP dump with near-perfect accuracy (AUC = 0.98). The data and software related to this paper are available at http://www.cs.cmu.edu/~kijungs/codes/mzoom/.
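The greedy flavor of dense-block search can be sketched as peeling attribute values from a sparse tensor: repeatedly drop the attribute value carrying the least mass, and remember the block maximizing an average-mass density. This is an illustrative toy in the spirit of M-Zoom, not the paper's algorithm; the function name and the specific density (mass divided by average mode size) are assumptions, and M-Zoom supports a broader class of measures with an approximation guarantee.

```python
def find_dense_block(entries, n_modes):
    """entries: list of (index_tuple, value) with nonnegative values,
    representing a sparse n_modes-order tensor. Greedily peels the
    attribute value of minimum total mass; returns, per mode, the set
    of attribute values of the densest intermediate block found."""
    entries = list(entries)
    best_density, best_block = -1.0, None
    while entries:
        kept = [{idx[m] for idx, _ in entries} for m in range(n_modes)]
        mass = sum(v for _, v in entries)
        density = mass / (sum(len(s) for s in kept) / n_modes)  # avg-mode-size density
        if density > best_density:
            best_density, best_block = density, kept
        # total mass carried by each (mode, attribute value) pair
        attr_mass = {}
        for idx, v in entries:
            for m in range(n_modes):
                key = (m, idx[m])
                attr_mass[key] = attr_mass.get(key, 0) + v
        m, a = min(attr_mass, key=attr_mass.get)  # peel the lightest value
        entries = [(idx, v) for idx, v in entries if idx[m] != a]
    return best_block
```

On a matrix (a 2-order tensor) containing a heavy 2×2 block plus one stray light entry, the peeling isolates the heavy block.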
TL;DR: This paper proposes two distributed tensor factorization methods, CDTF and SALS, which are scalable with all aspects of data and show a trade-off between convergence speed and memory requirements.
Abstract: Given a high-order large-scale tensor, how can we decompose it into latent factors? Can we process it on commodity computers with limited memory? These questions are closely related to recommender systems, which have modeled rating data not as a matrix but as a tensor to utilize contextual information such as time and location. This increase in the order requires tensor factorization methods scalable with both the order and size of a tensor. In this paper, we propose two distributed tensor factorization methods, CDTF and SALS. Both methods are scalable with all aspects of data and show a trade-off between convergence speed and memory requirements. CDTF, based on coordinate descent, updates one parameter at a time, while SALS generalizes the number of parameters updated at a time. In our experiments, only our methods factorized a five-order tensor with 1 billion observable entries, 10M mode length, and 1K rank, while all other state-of-the-art methods failed. Moreover, our methods required several orders of magnitude less memory than their competitors. We implemented our methods on MapReduce with two widely applicable optimization techniques: local disk caching and greedy row assignment. These sped up our methods by up to 98.2× and the competitors by up to 5.9×.
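The one-parameter-at-a-time idea can be sketched as coordinate descent on a CP factorization over observed entries: each scalar factor entry has a closed-form update while all others are held fixed. This is a single-machine toy in the spirit of CDTF, not the paper's distributed implementation; the regularization weight lam, the update order, and all names are illustrative assumptions.

```python
import random

def cdtf(observed, dims, rank=2, lam=0.1, epochs=20, seed=0):
    """Coordinate-descent CP factorization sketch for a 3-order tensor.
    observed: dict (i, j, k) -> value; dims: (I, J, K).
    Returns three factor matrices as nested lists, factors[mode][i][r]."""
    rng = random.Random(seed)
    factors = [[[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d)]
               for d in dims]

    def predict(idx):
        return sum(factors[0][idx[0]][r] * factors[1][idx[1]][r] * factors[2][idx[2]][r]
                   for r in range(rank))

    for _ in range(epochs):
        for mode in range(3):
            for i in range(dims[mode]):
                for r in range(rank):
                    num = den = 0.0
                    for idx, x in observed.items():
                        if idx[mode] != i:
                            continue
                        others = 1.0  # product of the other modes' factors at rank r
                        for m in range(3):
                            if m != mode:
                                others *= factors[m][idx[m]][r]
                        # residual excluding this coordinate's own contribution
                        resid = x - predict(idx) + factors[mode][i][r] * others
                        num += resid * others
                        den += others * others
                    # closed-form L2-regularized least-squares update
                    factors[mode][i][r] = num / (lam + den)
    return factors
```

Fitting a small rank-1 tensor built from known factors, the reconstruction error drops close to zero after a few dozen epochs.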
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, along with newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.
01 Jan 2014
01 Jan 2013
TL;DR: In this paper, the authors analyze 14 million messages spreading 400 thousand articles on Twitter during ten months in 2016 and 2017 and find evidence that social bots played a disproportionate role in spreading articles from low-credibility sources.
Abstract: The massive spread of digital misinformation has been identified as a major threat to democracies. Communication, cognitive, social, and computer scientists are studying the complex causes for the viral diffusion of misinformation, while online platforms are beginning to deploy countermeasures. Little systematic, data-based evidence has been published to guide these efforts. Here we analyze 14 million messages spreading 400 thousand articles on Twitter during ten months in 2016 and 2017. We find evidence that social bots played a disproportionate role in spreading articles from low-credibility sources. Bots amplify such content in the early spreading moments, before an article goes viral. They also target users with many followers through replies and mentions. Humans are vulnerable to this manipulation, resharing content posted by bots. Successful low-credibility sources are heavily supported by social bots. These results suggest that curbing social bots may be an effective strategy for mitigating the spread of online misinformation.
TL;DR: The DBLP Computer Science Bibliography of the University of Trier, as discussed by the authors, is a large collection of bibliographic information used by thousands of computer scientists for scientific communication.
Abstract: Publications are essential for scientific communication. Access to publications is provided by conventional libraries, digital libraries operated by learned societies or commercial publishers, and a huge number of web sites maintained by the scientists themselves or their institutions. Comprehensive meta-indices for this increasing number of information sources are missing for most areas of science. The DBLP Computer Science Bibliography of the University of Trier has grown from a very specialized small collection of bibliographic information to a major part of the infrastructure used by thousands of computer scientists. This short paper first reports the history of DBLP and sketches the very simple software behind the service. The most time-consuming task for the maintainers of DBLP may be viewed as a special instance of the authority control problem: how to normalize different spellings of person names. The third section of the paper discusses some details of this problem, which might be an interesting research issue for the information retrieval community.