Showing papers by "Reynold Cheng" published in 2017


Journal ArticleDOI
01 Jan 2017
TL;DR: It is argued that the truth inference problem is not yet fully solved; the limitations of existing algorithms are identified, and promising research directions are pointed out.
Abstract: Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, so a redundancy-based method is widely employed, which first assigns each task to multiple workers and then infers the correct answer (called the truth) for the task based on the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database and data mining communities have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate ones. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and that no algorithm consistently outperforms the others. We believe that the truth inference problem is not fully solved, and we identify the limitations of existing algorithms and point out promising research directions.
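
As a point of reference for the algorithms surveyed, the sketch below shows the simplest redundancy-based truth inference method, majority voting, which most of the 17 algorithms refine. The data layout (a dict mapping each task to its list of worker answers) is an assumption made for illustration.

    # Minimal sketch of redundancy-based truth inference via majority voting.
    from collections import Counter

    def majority_vote(answers_per_task):
        """Infer each task's truth as its most frequent worker answer."""
        truths = {}
        for task, answers in answers_per_task.items():
            truths[task] = Counter(answers).most_common(1)[0][0]
        return truths

    # Example: three workers label two entity-resolution pairs.
    answers = {"pair_1": ["match", "match", "no_match"],
               "pair_2": ["no_match", "no_match", "no_match"]}
    print(majority_vote(answers))  # {'pair_1': 'match', 'pair_2': 'no_match'}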

376 citations


Journal ArticleDOI
Yixiang Fang, Reynold Cheng, Xiaodong Li, Siqiang Luo, Jiafeng Hu
01 Feb 2017
TL;DR: Experimental results show that SACs are better than the communities returned by existing solutions, and that the three approximation solutions can find SACs accurately and efficiently.
Abstract: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS) has received plenty of attention. Given a query vertex, CS looks for a dense subgraph that contains it. Existing CS solutions do not consider the spatial extent of a community; they can yield communities whose vertices span large geographic areas. In applications that facilitate the creation of social events (e.g., finding conference attendees to join a dinner), it is important to find groups of people who are physically close to each other. In this situation, it is desirable to have a spatial-aware community (or SAC), whose vertices are close both structurally and spatially. Given a graph G and a query vertex q, we develop exact solutions for finding an SAC that contains q. Since these solutions cannot scale to large datasets, we further design three approximation algorithms to compute an SAC. We have performed an experimental evaluation of these solutions on both large real and synthetic datasets. Experimental results show that SACs are better than the communities returned by existing solutions. Moreover, our approximation solutions can find SACs accurately and efficiently.
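
The paper's exact and approximation algorithms are more involved, but the following sketch illustrates the two requirements an SAC combines: structural cohesiveness (here approximated by membership in a k-core containing q) and spatial closeness (here, the maximum pairwise distance among members). networkx and the node-to-coordinates dict pos are assumptions for this sketch.

    import itertools, math
    import networkx as nx

    def sac_candidate(G, pos, q, k):
        """Return q's connected k-core component and its spatial diameter."""
        core = nx.k_core(G, k)              # structural cohesiveness filter
        if q not in core:
            return None, float("inf")
        comp = nx.node_connected_component(core, q)
        diam = max((math.dist(pos[u], pos[v])  # spatial closeness measure
                    for u, v in itertools.combinations(comp, 2)), default=0.0)
        return comp, diam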

180 citations


Proceedings ArticleDOI
09 May 2017
TL;DR: This tutorial gives an overview of crowdsourcing, and then summarizes the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data management.
Abstract: Many important data management and analytics tasks cannot be completely addressed by automated processes. Crowdsourcing is an effective way to harness human cognitive abilities to process these computer-hard tasks, such as entity resolution, sentiment analysis, and image recognition. Crowdsourced data management has been extensively studied in research and industry recently. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowdsourced data management. We first give an overview of crowdsourcing, and then summarize the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data management. Next we review crowdsourced operators, including selection, collection, join, top-k, sort, categorize, aggregation, skyline, planning, schema matching, mining, and spatial crowdsourcing. We also discuss crowdsourcing optimization techniques and systems. Finally, we discuss the emerging challenges.
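
As a hedged illustration of the quality-control techniques the tutorial summarizes, the sketch below weights each worker's vote by an estimated accuracy rather than counting votes equally. In practice the accuracies would themselves be inferred (e.g., by an EM-style loop); here they are supplied directly, which is an assumption for illustration.

    from collections import defaultdict

    def weighted_vote(task_answers, worker_accuracy):
        """task_answers: list of (worker, answer); returns the best answer."""
        scores = defaultdict(float)
        for worker, answer in task_answers:
            # Unknown workers get a neutral prior of 0.5 (an assumption).
            scores[answer] += worker_accuracy.get(worker, 0.5)
        return max(scores, key=scores.get)

    print(weighted_vote([("w1", "cat"), ("w2", "dog"), ("w3", "cat")],
                        {"w1": 0.9, "w2": 0.6, "w3": 0.5}))  # 'cat'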

90 citations


Journal ArticleDOI
Yixiang Fang, Reynold Cheng, Yankai Chen, Siqiang Luo, Jiafeng Hu
01 Dec 2017
TL;DR: The results show that ACQ is more effective and efficient than existing community retrieval approaches, and that an AC contains more precise and personalized information than the communities found by existing community search and detection methods.
Abstract: Given a graph G and a vertex q ∈ G, the community search query returns a subgraph of G that contains vertices related to q. Communities, which are prevalent in attributed graphs such as social networks and knowledge bases, can be used in emerging applications such as product advertisement and setting up of social events. In this paper, we investigate the attributed community query (or ACQ), which returns an attributed community (AC) for an attributed graph. The AC is a subgraph of G, which satisfies both structure cohesiveness (i.e., its vertices are tightly connected) and keyword cohesiveness (i.e., its vertices share common keywords). The AC enables a better understanding of how and why a community is formed (e.g., members of an AC have a common interest in music, because they all have the same keyword “music”). An AC can be “personalized”; for example, an ACQ user may specify that an AC returned should be related to some specific keywords like “research” and “sports”. To enable efficient AC search, we develop the CL-tree index structure and three algorithms based on it. We further propose efficient algorithms for maintaining the index on dynamic graphs. Moreover, we study two problems that are related to the ACQ problem. We evaluate our solutions on six large graphs. Our results show that ACQ is more effective and efficient than existing community retrieval approaches. Moreover, an AC contains more precise and personalized information than the communities found by existing community search and detection methods.
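
The CL-tree index is what makes AC search efficient; the brute-force sketch below only illustrates the query semantics, combining a k-core check (structure cohesiveness) with a keyword filter (keyword cohesiveness). networkx and the node-to-keyword-set dict are assumed inputs, and the coarse filter does not maximize shared keywords as the paper's algorithms do.

    import networkx as nx

    def attributed_community(G, keywords, q, k):
        """Coarse filter: q's k-core component, restricted to vertices sharing
        at least one keyword with q (the paper maximizes shared keywords)."""
        core = nx.k_core(G, k)                       # structure cohesiveness
        if q not in core:
            return set()
        comp = nx.node_connected_component(core, q)
        return {v for v in comp if keywords[v] & keywords[q]}  # keyword filter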

87 citations


Proceedings ArticleDOI
Jiafeng Hu, Reynold Cheng, Zhipeng Huang, Yixiang Fang, Siqiang Luo
06 Nov 2017
TL;DR: URGE generates low-dimensional embeddings of uncertain graphs, with novel algorithms developed to enable fast evaluation; the results show that URGE attains better effectiveness than current uncertain data mining algorithms, as well as state-of-the-art embedding solutions.
Abstract: Graph data are prevalent in communication networks, social media, and biological networks. These data, which are often noisy or inexact, can be represented by uncertain graphs, whose edges are associated with probabilities to indicate the chances that they exist. Recently, researchers have studied various algorithms (e.g., clustering, classification, and k-NN) for uncertain graphs. These solutions face two problems: (1) high dimensionality: uncertain graphs are often highly complex, which can affect the mining quality; and (2) low reusability: an existing mining algorithm has to be redesigned to deal with uncertain graphs. To tackle these problems, we propose a solution called URGE, or UnceRtain Graph Embedding. Given an uncertain graph G, URGE generates G's embedding, or a set of low-dimensional vectors, which carry the proximity information of nodes in G. This embedding enables the dimensionality of G to be reduced, without destroying node proximity information. Due to its simplicity, existing mining solutions can be used on the embedding. We investigate two low- and high-order node proximity measures in the embedding generation process, and develop novel algorithms to enable fast evaluation. To the best of our knowledge, there is no prior study on the use of embedding for uncertain graphs. We have further performed extensive experiments for clustering, classification, and k-NN on several uncertain graph datasets. Our results show that URGE attains better effectiveness than current uncertain data mining algorithms, as well as state-of-the-art embedding solutions. The embedding and mining performance is also highly efficient in our experiments.
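
The following is not URGE itself, but a minimal sketch of the idea it builds on: treat the edge-probability matrix as an expected adjacency matrix, form low- and higher-order node proximities from it, and factorize to obtain low-dimensional vectors that off-the-shelf mining algorithms can consume. The particular proximity (P plus P @ P) and the plain SVD are assumptions for illustration.

    import numpy as np

    def embed_uncertain_graph(P, dim):
        """P[i, j] = probability that edge (i, j) exists; returns n x dim vectors."""
        proximity = P + P @ P            # first- plus second-order proximity
        U, S, _ = np.linalg.svd(proximity, full_matrices=False)
        return U[:, :dim] * np.sqrt(S[:dim])

    P = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.8],
                  [0.1, 0.8, 0.0]])
    print(embed_uncertain_graph(P, 2).shape)  # (3, 2)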

47 citations


Journal ArticleDOI
10 May 2017
TL;DR: The ProbTree is a data structure that stores a succinct, or indexed, version of the possible worlds of a probabilistic graph; lossless and lossy methods for generating the ProbTree are examined, which reflect the tradeoff between the accuracy and efficiency of query evaluation.
Abstract: Information in many applications, such as mobile wireless systems, social networks, and road networks, is captured by graphs. In many cases, such information is uncertain. We study the problem of querying a probabilistic graph, in which vertices are connected to each other probabilistically. In particular, we examine “source-to-target” queries (ST-queries), such as computing the shortest path between two vertices. The major difference with the deterministic setting is that query answers are enriched with probabilistic annotations. Evaluating ST-queries over probabilistic graphs is #P-hard, as it requires examining an exponential number of “possible worlds”—database instances generated from the probabilistic graph. Existing solutions to the ST-query problem, which sample possible worlds, have two downsides: (i) a possible world can be very large and (ii) many samples are needed for reasonable accuracy. To tackle these issues, we study the ProbTree, a data structure that stores a succinct, or indexed, version of the possible worlds of the graph. Existing ST-query solutions are executed on top of this structure, with the number of samples and sizes of the possible worlds reduced. We examine lossless and lossy methods for generating the ProbTree, which reflect the tradeoff between the accuracy and efficiency of query evaluation. We analyze the correctness and complexity of these approaches. Our extensive experiments on real datasets show that the ProbTree is fast to generate and small in size. It also enhances the accuracy and efficiency of existing ST-query algorithms significantly.
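
The sampling baseline that the ProbTree speeds up can be sketched as follows: draw possible worlds by keeping each edge with its probability, answer the deterministic query in each world, and average. The edge-list format and the use of networkx are assumptions; the ProbTree's indexing, which shrinks the sampled worlds, is not shown.

    import random
    import networkx as nx

    def mc_shortest_path_prob(edges, s, t, max_len, samples=1000):
        """Estimate P(dist(s, t) <= max_len) over sampled possible worlds."""
        hits = 0
        for _ in range(samples):
            # Sample one possible world: keep each edge with its probability.
            G = nx.Graph([(u, v) for u, v, p in edges if random.random() < p])
            if G.has_node(s) and G.has_node(t) and nx.has_path(G, s, t):
                if nx.shortest_path_length(G, s, t) <= max_len:
                    hits += 1
        return hits / samples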

32 citations


Journal ArticleDOI
Jiafeng Hu, Xiaowei Wu, Reynold Cheng, Siqiang Luo, Yixiang Fang
TL;DR: The minimal SMCS, the minimal subgraph of G with the maximum connectivity containing the query nodes, is investigated; efficient Expand-Refine algorithms are proposed, along with approximate versions with accuracy guarantees and a cache-based processing model that improves efficiency for the important case of a single query node.
Abstract: Given a graph G and a set Q of query nodes, we examine the Steiner Maximum-Connected Subgraph (SMCS) problem. The SMCS, or G's induced subgraph that contains Q with the largest connectivity, can be useful for customer prediction, product promotion, and team assembling. Despite its importance, the SMCS problem has only recently been studied. Existing solutions evaluate the maximum SMCS, whose number of nodes is the largest among all the SMCSs of Q. However, the maximum SMCS, which may contain a lot of nodes, can be difficult to interpret. In this paper, we investigate the minimal SMCS, which is the minimal subgraph of G with the maximum connectivity containing Q. The minimal SMCS contains far fewer nodes than its maximum counterpart, and is thus easier to understand. However, the minimal SMCS can be costly to evaluate. We thus propose efficient Expand-Refine algorithms, as well as their approximate versions with accuracy guarantees. We further develop a cache-based processing model to improve efficiency for an important case in which Q consists of a single node. Extensive experiments on large real and synthetic graph datasets validate the effectiveness and efficiency of our approaches.
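
A brute-force sketch of the connectivity notion involved: the Steiner connectivity of Q is the largest k for which all query nodes fall in a single k-edge-connected component, with networkx's k_edge_components used as a stand-in. The paper's Expand-Refine algorithms avoid this exhaustive loop.

    import networkx as nx

    def steiner_connectivity(G, Q):
        """Return (k, component) for the largest k keeping Q together."""
        best = (0, set(G))
        for k in range(1, G.number_of_edges() + 1):
            # Components whose every pair of nodes has k edge-disjoint paths.
            comps = [c for c in nx.k_edge_components(G, k) if Q <= c]
            if not comps:
                break
            best = (k, comps[0])
        return best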

31 citations


Journal ArticleDOI
Yixiang Fang, Reynold Cheng, Siqiang Luo, Jiafeng Hu, Kai Huang
01 Aug 2017
TL;DR: C-Explorer, an open-source web-based platform to assist users in extracting, visualizing, and analyzing communities, is proposed; it implements several state-of-the-art CR algorithms, as well as functions for analyzing their effectiveness.
Abstract: Community retrieval (CR) algorithms, which enable the extraction of subgraphs from large social networks (e.g., Facebook and Twitter), have attracted tremendous interest. Various CR solutions, such as k-core and codicil, have been proposed to obtain graphs whose vertices are closely related. In this paper, we propose the C-Explorer system to assist users in extracting, visualizing, and analyzing communities. C-Explorer provides online and interactive CR facilities, allowing a user to view graphs of interest, indicate her required vertex q, and display the communities to which q belongs. A distinctive feature of C-Explorer is that it uses an attributed graph, whose vertices are associated with labels and keywords, and looks for an attributed community (or AC), whose vertices are structurally and semantically related. Moreover, C-Explorer implements several state-of-the-art CR algorithms, as well as functions for analyzing their effectiveness. We plan to make C-Explorer an open-source web-based platform, and to design API functions for software developers to test their CR algorithms in our system.

26 citations


Journal ArticleDOI
TL;DR: It is shown that the FVF framework is highly efficient for shortest-path query processing on evolving graph sequences (EGSs): it gives a larger speedup, is more flexible in terms of memory requirements, and is far less sensitive to parameter values.

4 citations


Book ChapterDOI
01 Sep 2017
TL;DR: This article investigates the keyword-based attributed community (or KAC) query, which returns a KAC for a query vertex in an online manner, and proposes efficient algorithms for both KAC search and SAC search.
Abstract: Communities, which are prevalent in attributed graphs (e.g., social networks and knowledge bases), can be used in emerging applications such as product advertisement and setting up of social events. Given a graph G and a vertex q ∈ G, the community search (CS) query returns a subgraph of G that contains vertices related to q. In this article, we study CS over two common attributed graphs, where (1) vertices are associated with keywords; and (2) vertices are augmented with locations. For keyword-based attributed graphs, we investigate the keyword-based attributed community (or KAC) query, which returns a KAC for a query vertex. A KAC satisfies both structure cohesiveness (i.e., its vertices are tightly connected) and keyword cohesiveness (i.e., its vertices share common keywords). For spatial-based attributed graphs, we aim to find the spatial-aware community (or SAC), whose vertices are close structurally and spatially, for a query vertex in an online manner. To enable efficient KAC search and SAC search, we propose efficient query algorithms. We also perform experimental evaluation on large real datasets, and the results show that our methods achieve higher effectiveness than the state-of-the-art community retrieval algorithms. Moreover, our solutions are faster than baseline approaches. In addition, we develop the C-Explorer system to assist users in extracting, visualizing, and analyzing KACs.

4 citations


Book ChapterDOI
07 Jul 2017
TL;DR: This tutorial reviews meta-paths and how they are used to define relevance, explores systematic methods for finding meta-paths, and studies a solution based on the Query-by-Example (QBE) paradigm, which allows meta-paths to be discovered in an effective and efficient manner.
Abstract: A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These metrics are based on the use of meta-paths, essentially a sequence of node classes and edge types between two nodes in a HIN. In this tutorial, we will give a detailed review of meta-paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta-paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta-paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta-paths in an effective and efficient manner.
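
As a concrete example of the meta-path-based relevance measures reviewed, the sketch below computes PathSim: for a symmetric meta-path with commuting matrix M (where M[x, y] counts path instances from x to y), PathSim(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y]). The toy author-paper matrix for the meta-path A-P-A is an illustrative assumption.

    import numpy as np

    def pathsim(M, x, y):
        """PathSim relevance between objects x and y for a symmetric meta-path."""
        denom = M[x, x] + M[y, y]
        return 2 * M[x, y] / denom if denom else 0.0

    AP = np.array([[1, 1, 0],    # author-paper incidence matrix (toy data)
                   [1, 1, 1],
                   [0, 0, 1]])
    M = AP @ AP.T                # commuting matrix of meta-path A-P-A
    print(pathsim(M, 0, 1))      # relevance between authors 0 and 1: 0.8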

Book ChapterDOI
28 Feb 2017
TL;DR: This work proposes a new approach to computing Voronoi cells for objects with rectangular uncertainty regions, develops three algorithms for exploring index structures, and shows that the approach that descends both index structures in parallel yields fast query processing times.
Abstract: The problem of computing Voronoi cells for spatial objects whose locations are not certain has been recently studied. In this work, we propose a new approach to compute Voronoi cells for the case of objects having rectangular uncertainty regions. Since exact computation of Voronoi cells is hard, we propose an approximate solution. The main idea of this solution is to apply hierarchical access methods to both the data objects and the space. Our space index is used to efficiently find spatial regions which must (or must not) be inside a Voronoi cell. Our object index is used to efficiently identify Delaunay relations, i.e., data objects which affect the shape of a Voronoi cell. We develop three algorithms to explore the index structures and show that the approach that descends both index structures in parallel yields fast query processing times. Our experiments show that we are able to approximate uncertain Voronoi cells much more effectively than the state-of-the-art, and, at the same time, improve run-time performance.
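
The kind of pruning test such an approach relies on can be sketched with min/max distances between rectangles: a space region certainly lies in the Voronoi cell of object A if even the farthest point of A is closer than the nearest point of every other object. The (xlo, ylo, xhi, yhi) rectangle encoding is an assumed convention for this sketch.

    import math

    def min_dist(r1, r2):
        """Smallest distance between two axis-aligned rectangles."""
        dx = max(r2[0] - r1[2], r1[0] - r2[2], 0)
        dy = max(r2[1] - r1[3], r1[1] - r2[3], 0)
        return math.hypot(dx, dy)

    def max_dist(r1, r2):
        """Largest distance between two axis-aligned rectangles."""
        dx = max(abs(r1[2] - r2[0]), abs(r2[2] - r1[0]))
        dy = max(abs(r1[3] - r2[1]), abs(r2[3] - r1[1]))
        return math.hypot(dx, dy)

    def certainly_in_cell(region, A, others):
        """True if every point of `region` is closer to A than to all others."""
        return all(max_dist(region, A) < min_dist(region, B) for B in others)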

Proceedings ArticleDOI
Siqiang Luo, Jiafeng Hu, Reynold Cheng, Jing Yan, Ben Kao
06 Nov 2017
TL;DR: This paper proposes the Spatial Exemplar Query (SEQ), which allows the user to input a result example over an interface inside the map service, along with an effective similarity measure to evaluate the proximity between a candidate answer and the given example.
Abstract: Spatial object search is prevalent in map services (e.g., Google Maps). To rent an apartment, for example, one will take into account its nearby facilities, such as supermarkets, hospitals, and subway stations. Traditional keyword search solutions, such as the nearby function in Google Maps, are insufficient in expressing the often complex attribute/spatial requirements of users. Those requirements, however, are essential to reflect the user's search intention. In this paper, we propose the Spatial Exemplar Query (SEQ), which allows the user to input a result example over an interface inside the map service. We then propose an effective similarity measure to evaluate the proximity between a candidate answer and the given example. We conduct a user study to validate the effectiveness of SEQ. Our results show that more than 88% of users would like to have example-assisted search in map services. Moreover, SEQ receives a user satisfaction score of 4.3/5.0, more than twice that of a baseline solution.
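
The paper's similarity measure is not reproduced here; the toy score below merely illustrates the idea of comparing a candidate against the user's example by matching objects by type and penalizing differences in pairwise distances. Every name and formula in it is an assumption for illustration.

    import itertools, math

    def config_score(example, candidate):
        """example/candidate: dict type -> (x, y); lower score = more similar."""
        if set(example) != set(candidate):
            return float("inf")   # candidate lacks a required facility type
        return sum(abs(math.dist(example[a], example[b]) -
                       math.dist(candidate[a], candidate[b]))
                   for a, b in itertools.combinations(example, 2))

    ex = {"apartment": (0, 0), "subway": (1, 0), "market": (0, 2)}
    cand = {"apartment": (5, 5), "subway": (6, 5), "market": (5, 7.5)}
    print(config_score(ex, cand))  # small value: similar spatial layout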

Journal ArticleDOI
TL;DR: This paper proposes an efficient approach to identifying and evaluating iceberg cells of s-cuboids and shows that the algorithms are orders of magnitude faster than existing approaches.
Abstract: A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each of which is defined over a domain. Each variable is instantiated with all possible values in its corresponding domain to derive all possible patterns of the template. Sequences are grouped based on the patterns they possess. The answer to a PBA query is a sequence cuboid (s-cuboid), which is a multidimensional array of cells. Each cell is associated with a pattern instantiated from the query's pattern template. The value of each s-cuboid cell is obtained by applying the aggregate function F to the set of data sequences that belong to that cell. Since a pattern template can involve many variables and can be arbitrarily long, the induced s-cuboid for a PBA query can be huge. For most analytical tasks, however, only iceberg cells with very large aggregate values are of interest. This paper proposes an efficient approach to identifying and evaluating iceberg cells of s-cuboids. Experimental results show that our algorithms are orders of magnitude faster than existing approaches.
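
A brute-force sketch of the PBA query semantics: instantiate every pattern of the template from the variable domains, count the sequences containing each pattern, and keep only iceberg cells above a threshold. The subsequence-containment semantics and the COUNT aggregate are assumptions; the paper's algorithms avoid this full enumeration.

    from itertools import product

    def contains(seq, pattern):
        """Subsequence test: pattern symbols appear in seq in order."""
        it = iter(seq)
        return all(sym in it for sym in pattern)

    def iceberg_cells(sequences, domains, threshold):
        cells = {}
        for pattern in product(*domains):      # instantiate the template
            count = sum(contains(s, pattern) for s in sequences)
            if count >= threshold:             # iceberg condition
                cells[pattern] = count
        return cells

    seqs = ["abc", "acb", "bac"]
    print(iceberg_cells(seqs, ["ab", "bc"], threshold=2))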

Posted Content
TL;DR: T-Crowd integrates each worker's answers on different attributes to effectively learn his/her trustworthiness and the true data values; the attribute relationship information is also used to guide task allocation to workers.
Abstract: Crowdsourcing employs human workers to solve computer-hard problems, such as data cleaning, entity resolution, and sentiment analysis. When crowdsourcing tabular data, e.g., the attribute values of an entity set, a worker's answers on the different attributes (e.g., the nationality and age of a celebrity star) are often treated independently. This assumption is not always true and can lead to suboptimal crowdsourcing performance. In this paper, we present the T-Crowd system, which takes into consideration the intricate relationships among tasks, in order to converge faster to their true values. Particularly, T-Crowd integrates each worker's answers on different attributes to effectively learn his/her trustworthiness and the true data values. The attribute relationship information is also used to guide task allocation to workers. Finally, T-Crowd seamlessly supports categorical and continuous attributes, which are the two main datatypes found in typical databases. Our extensive experiments on real and synthetic datasets show that T-Crowd outperforms state-of-the-art methods in terms of truth inference and reducing the cost of crowdsourcing.
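
Not T-Crowd itself, but a sketch of the alternating loop it builds on: infer truths from trust-weighted answers, then re-estimate each worker's trustworthiness from agreement with those truths, pooling the worker's answers across attributes. Categorical answers and the (worker, task, value) layout are assumptions for illustration.

    from collections import defaultdict

    def iterate_truths(answers, rounds=5):
        """answers: list of (worker, task, value) over all attributes."""
        trust = defaultdict(lambda: 0.8)       # initial trust is an assumption
        for _ in range(rounds):
            # Step 1: infer each task's truth from trust-weighted votes.
            scores = defaultdict(lambda: defaultdict(float))
            for w, task, v in answers:
                scores[task][v] += trust[w]
            truths = {t: max(vs, key=vs.get) for t, vs in scores.items()}
            # Step 2: re-estimate trust as agreement rate across all attributes.
            hits = defaultdict(list)
            for w, task, v in answers:
                hits[w].append(v == truths[task])
            trust = defaultdict(lambda: 0.8,
                                {w: sum(h) / len(h) for w, h in hits.items()})
        return truths, dict(trust)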

Book ChapterDOI
07 Oct 2017
TL;DR: The metric top-k sliding average similarity (top-k SAS), which measures the reliability of the k most frequent tags, is developed; a threshold-based evaluation on top-k SAS is effective and efficient in determining whether the k most frequent tags can be considered high-quality top-k tags for r.
Abstract: Collaborative tagging systems, such as Flickr and Del.icio.us, allow users to provide keyword labels, or tags, for various Internet resources (e.g., photos, songs, and bookmarks). These tags, which provide a rich source of information, have been used in important applications such as resource searching, webpage clustering, etc. However, tags are provided by casual users, and so their quality cannot be guaranteed. In this paper, we examine a question: given a resource r and a set of user-provided tags associated with r, can r be correctly described by the k most frequent tags? To answer this question, we develop the metric top-k sliding average similarity (top-k SAS), which measures the reliability of the k most frequent tags. A threshold is then set to decide whether the reliability is sufficient for retrieving the top-k tags. Our experiments on real datasets show that the threshold-based evaluation on top-k SAS is effective and efficient in determining whether the k most frequent tags can be considered high-quality top-k tags for r.
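
The abstract does not define the top-k SAS metric, so it is not reproduced here; the sketch below only sets up the problem (ranking tags by frequency and taking the top k) together with a purely hypothetical reliability heuristic based on the frequency gap at rank k. The heuristic is an assumption, not the paper's SAS.

    from collections import Counter

    def top_k_tags(tags, k):
        """Return the k most frequent tags and a *hypothetical* reliability score."""
        ranked = Counter(tags).most_common()
        top = [t for t, _ in ranked[:k]]
        if len(ranked) <= k:
            return top, 1.0
        # Relative frequency gap at rank k; NOT the paper's SAS metric.
        gap = (ranked[k - 1][1] - ranked[k][1]) / ranked[0][1]
        return top, gap   # larger gap -> top-k more clearly separated

    tags = ["sunset", "beach", "sunset", "sky", "beach", "sunset"]
    print(top_k_tags(tags, 2))  # (['sunset', 'beach'], 0.33...)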