Author

Jeffery Xu Yu

Bio: Jeffery Xu Yu is an academic researcher from The Chinese University of Hong Kong. The author has contributed to research in topics including Graph (abstract data type) and Graph database. The author has an h-index of 7 and has co-authored 12 publications receiving 295 citations.

Papers
Book Chapter
31 Aug 2004
TL;DR: In this article, the authors propose algorithms based on the Chernoff bound for mining frequent itemsets from high-speed transactional data streams with bounded memory consumption, where the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed.
Abstract: The problem of finding frequent items has recently been studied over high-speed data streams. However, mining frequent itemsets from transactional data streams has not been well addressed yet in terms of its bounds of memory consumption. The main difficulty is due to the nature of the exponential explosion of itemsets. Given a domain of I unique items, the possible number of itemsets can be up to 2^I - 1. When the length of a data stream approaches a very large number N, the possibility of an itemset being frequent becomes larger and more difficult to track with limited memory. However, the real killer of effective frequent itemset mining is that most existing algorithms are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s-ε to be counted as frequent. Such false-positive items increase the number of false-positive frequent itemsets exponentially, which may make the problem computationally intractable with bounded memory consumption. In this paper, we develop algorithms that can effectively mine frequent item(set)s from high-speed transactional data streams with a bound on memory consumption. Our algorithms are false-negative oriented; that is, certain frequent itemsets may not appear in the results, but the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. We develop algorithms based on the Chernoff bound. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.
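
As a concrete illustration of the Chernoff-bound idea, here is a minimal single-pass sketch for individual items (not the authors' full itemset algorithm); the function names, the periodic pruning schedule, and the parameters s (minimum support) and delta (failure probability) are illustrative assumptions.

```python
import math
from collections import defaultdict

def chernoff_slack(s, delta, n):
    """Slack eps_n such that, after n transactions, an item with true
    support >= s has observed support >= s - eps_n with probability
    at least 1 - delta (by the Chernoff bound)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def mine_frequent_items(stream, s=0.01, delta=0.1):
    """Single-pass, false-negative-oriented frequent-item counting:
    items whose observed support drops below s - eps_n are pruned,
    bounding memory while probabilistically guaranteeing recall."""
    counts = defaultdict(int)
    n = 0
    for item in stream:
        n += 1
        counts[item] += 1
        if n % 1000 == 0:  # prune periodically rather than per arrival
            eps = chernoff_slack(s, delta, n)
            stale = [it for it, c in counts.items() if c < (s - eps) * n]
            for it in stale:
                del counts[it]
    return {it: c / n for it, c in counts.items() if c >= s * n}
```

Because an item is pruned only when its observed support falls below s - eps_n, a truly frequent item survives to the output with probability at least 1 - delta, which is the recall guarantee the abstract describes.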

151 citations

Journal Article
Yuhai Zhao, Jeffery Xu Yu, Guoren Wang, Lei Chen, Bin Wang, Ge Yu
TL;DR: This paper proposes a coding scheme that allows two genes to be clustered into the same cluster if they have the same code, where two genes that have the same code can be either positively or negatively regulated.
Abstract: Clustering is a popular technique for analyzing microarray data sets with n genes and m experimental conditions. As explored by biologists, there is a real need to identify coregulated gene clusters, which include both positively and negatively regulated gene clusters. The existing pattern-based and tendency-based clustering approaches cannot be directly applied to find such coregulated gene clusters, because they are designed for finding positively regulated gene clusters. In this paper, in order to cluster coregulated genes, we propose a coding scheme that allows us to cluster two genes into the same cluster if they have the same code, where two genes that have the same code can be either positively or negatively regulated. Based on the coding scheme, we propose a new algorithm for finding maximal subspace coregulated gene clusters with new pruning techniques. A maximal subspace coregulated gene cluster clusters a set of genes on a condition sequence such that the cluster is not included in any other subspace coregulated gene cluster. We conduct extensive experimental studies. Our approach can effectively and efficiently find maximal subspace coregulated gene clusters, and it outperforms the existing approaches for finding positively regulated gene clusters.
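
The coding idea can be sketched as follows. This is a minimal illustration, not the paper's actual coding scheme or its subspace/maximality machinery; the helper names (tendency_code, canonical_code) and the tolerance parameter tol are assumptions made for the example.

```python
from collections import defaultdict

def tendency_code(expr, tol=0.0):
    """Encode a gene's expression values across consecutive conditions
    as a string of up ('u'), down ('d'), and equal ('e') symbols."""
    symbols = []
    for a, b in zip(expr, expr[1:]):
        if b - a > tol:
            symbols.append('u')
        elif a - b > tol:
            symbols.append('d')
        else:
            symbols.append('e')
    return ''.join(symbols)

def canonical_code(expr, tol=0.0):
    """A gene and its negatively regulated mirror flip every 'u' to 'd'
    and vice versa; taking the lexicographically smaller of the two
    codes gives both genes the same canonical code."""
    code = tendency_code(expr, tol)
    flipped = code.translate(str.maketrans('ud', 'du'))
    return min(code, flipped)

def cluster_coregulated(genes, tol=0.0):
    """Group genes sharing one canonical code: members of a cluster are
    pairwise either positively or negatively regulated."""
    clusters = defaultdict(list)
    for name, expr in genes.items():
        clusters[canonical_code(expr, tol)].append(name)
    return dict(clusters)
```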

45 citations

Journal Article
TL;DR: A signature-based search algorithm is proposed that encodes the shortest-path distance from a vertex to any given keyword in the graph, and can find query answers by exploring fewer paths, so that the time and communication costs are low.
Abstract: Graph keyword search has drawn much research interest, since graph models can generally represent both structured and unstructured databases, and keyword searches can extract valuable information for users without knowledge of the underlying schema and query language. In practice, data graphs can be extremely large, e.g., a Web-scale graph containing billions of vertices. The state-of-the-art approaches employ centralized algorithms to process graph keyword searches, and thus they are infeasible for such large graphs, due to the limited computational power and storage space of a centralized server. To address this problem, we investigate keyword search for Web-scale graphs deployed in a distributed environment. We first give a naive search algorithm to answer the query. However, the naive search algorithm uses a flooding search strategy that incurs large time and network overhead. To remedy this shortcoming, we then propose a signature-based search algorithm. Specifically, we design a vertex signature that encodes the shortest-path distance from a vertex to any given keyword in the graph. As a result, we can find query answers by exploring fewer paths, so that the time and communication costs are low. Moreover, we reorganize the graph data in the cluster after its initial random partitioning so that the signature-based techniques are more effective. Finally, our experimental results demonstrate the feasibility of our proposed approach in performing keyword searches over Web-scale graph data.
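
The pruning idea behind vertex signatures can be sketched in a few lines. This is a centralized toy, not the paper's distributed implementation or signature encoding; adj (adjacency lists) and keyword_index (keyword to vertices) are assumed inputs.

```python
from collections import deque

def keyword_distances(adj, keyword_vertices):
    """Multi-source BFS: shortest-path distance from every vertex to the
    nearest vertex carrying the keyword."""
    dist = {v: 0 for v in keyword_vertices}
    queue = deque(keyword_vertices)
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def build_signatures(adj, keyword_index):
    """Per-vertex signature: a map from keyword to shortest-path distance.
    At query time, expansion from a vertex can be pruned whenever the
    signature distance for a keyword exceeds the remaining hop budget."""
    signatures = {v: {} for v in adj}
    for kw, vertices in keyword_index.items():
        for v, d in keyword_distances(adj, vertices).items():
            signatures[v][kw] = d
    return signatures
```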

22 citations

Proceedings Article
24 Oct 2011
TL;DR: This paper studies a variant of reachability queries, called label-constraint reachability (LCR) queries, and demonstrates the superiority of the proposed method through extensive experiments.
Abstract: In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we verify whether there exists a path from u1 to u2 under the label constraint S. Like traditional reachability queries, LCR queries are very useful, for example, in pathway finding in biological networks, inference over RDF (Resource Description Framework) graphs, and relationship finding in social networks. However, LCR queries are much more complicated than their traditional counterpart. Several techniques are proposed in this paper to minimize the search space in computing the path-label transitive closure. Furthermore, we demonstrate the superiority of our method by extensive experiments.
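
The query semantics are easy to state with a minimal online check (the paper's contribution is instead precomputing a compact path-label transitive closure so queries avoid this per-query traversal); the edge representation below is an assumption for the example.

```python
from collections import deque

def lcr_reachable(edges, u1, u2, S):
    """Label-constraint reachability: is there a path from u1 to u2 that
    uses only edges whose labels are in S? edges maps each vertex to a
    list of (neighbor, label) pairs."""
    seen = {u1}
    queue = deque([u1])
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for w, label in edges.get(v, ()):
            if label in S and w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```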

20 citations

Book Chapter
03 Jun 2018
TL;DR: In order to solve the influence minimization problem in large, real-world social networks, a robust sampling-based solution with a desirable theoretical bound is proposed, and extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.
Abstract: An online social network can be used for the diffusion of malicious information like derogatory rumors, disinformation, hate speech, revenge pornography, etc. This motivates the study of influence minimization, which aims to prevent the spread of malicious information. Unlike previous influence minimization work, this study considers influence minimization in relation to a particular group of social network users, called targeted influence minimization. Thus, the objective is to protect a set of users, called target nodes, from malicious information originating from another set of users, called active nodes. This study also addresses two fundamental, but largely ignored, issues in influence minimization problems: (i) the impact of a budget on the solution; (ii) robust sampling. To this end, two scenarios are investigated, namely unconstrained and constrained budgets. Given an unconstrained budget, we provide an optimal solution; given a constrained budget, we show the problem is NP-hard and develop a greedy algorithm with a \((1-1/e)\)-approximation. More importantly, in order to solve the influence minimization problem in large, real-world social networks, we propose a robust sampling-based solution with a desirable theoretical bound. Extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.
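
A toy version of the constrained-budget greedy can be sketched with Monte Carlo estimation under the independent cascade model. This is an illustration only: the diffusion model, the propagation probability p, and the node-blocking formulation are assumptions, and the paper's robust sampling machinery is not reproduced.

```python
import random

def simulate_ic(adj, active, blocked, p=0.1):
    """One independent-cascade run; blocked nodes never activate.
    Returns the set of activated nodes."""
    activated = set(active) - blocked
    frontier = list(activated)
    while frontier:
        nxt = []
        for v in frontier:
            for w in adj.get(v, ()):
                if w not in activated and w not in blocked and random.random() < p:
                    activated.add(w)
                    nxt.append(w)
        frontier = nxt
    return activated

def greedy_block(adj, active, targets, budget, runs=200, p=0.1):
    """Greedily choose `budget` nodes to block so as to minimize the
    expected number of target nodes activated (Monte Carlo estimate)."""
    def risk(blocked):
        return sum(len(simulate_ic(adj, active, blocked, p) & set(targets))
                   for _ in range(runs)) / runs
    blocked = set()
    candidates = set(adj) - set(active) - set(targets)
    for _ in range(budget):
        best = min(candidates - blocked, key=lambda v: risk(blocked | {v}))
        blocked.add(best)
    return blocked
```

The \((1-1/e)\) guarantee mentioned in the abstract typically comes from the submodularity of the equivalent "influence saved" maximization objective; the sketch greedily optimizes the same quantity in its minimization form.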

16 citations


Cited by
Journal Article
TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated against a simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality.
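
The concentration effect is easy to reproduce empirically. Here is a minimal sketch with i.i.d. uniform data (one of the simplest settings covered by the result; the point counts and dimensions are arbitrary choices):

```python
import math
import random

def nearest_farthest_ratio(dim, n_points=1000):
    """Ratio of nearest- to farthest-neighbor distance from a random
    query point to a uniform random data set; it approaches 1 as the
    dimensionality grows, so 'nearest' loses its discriminative meaning."""
    points = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    query = [random.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_farthest_ratio(dim), 3))
```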

1,992 citations

Journal Article
TL;DR: It is believed that frequent pattern mining research has substantially broadened the scope of data analysis and will have a deep impact on data mining methodologies and applications in the long run; however, there are still challenging research issues that need to be solved before frequent pattern mining can claim to be a cornerstone approach in data mining applications.
Abstract: Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications. In this article, we provide a brief overview of the current status of frequent pattern mining and discuss a few promising research directions. We believe that frequent pattern mining research has substantially broadened the scope of data analysis and will have a deep impact on data mining methodologies and applications in the long run. However, there are still some challenging research issues that need to be solved before frequent pattern mining can claim to be a cornerstone approach in data mining applications.
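
As a reminder of the transaction-database setting this survey starts from, here is a textbook-style Apriori toy (an illustration of level-wise frequent itemset mining in general, not of any specific algorithm covered by the survey):

```python
from collections import Counter

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: size-k candidates are unions of
    frequent (k-1)-itemsets, exploiting the Apriori property that every
    subset of a frequent itemset is itself frequent."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    level = {frozenset([i]): c for i, c in item_counts.items()
             if c / n >= min_support}
    result = dict(level)
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(level)
        k += 1
    return result

# Example: apriori([['a','b','c'], ['a','b'], ['a','c']], min_support=2/3)
# returns {a}, {b}, {c}, {a,b}, {a,c} with their support counts.
```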

1,448 citations

Journal Article
01 Mar 2006
TL;DR: There exist emerging applications of data streams that require association rule mining, such as network traffic monitoring and web click stream analysis, which raise new issues that need to be considered when developing association rule mining techniques for stream data.
Abstract: There exist emerging applications of data streams that require association rule mining, such as network traffic monitoring and web click stream analysis. Unlike data in traditional static databases, data streams typically arrive continuously at high speed, in huge volumes, and with changing data distributions. This raises new issues that need to be considered when developing association rule mining techniques for stream data. This paper discusses those issues and how they are addressed in the existing literature.

305 citations

Journal Article
TL;DR: An efficient hybrid hierarchical clustering method is proposed, based on an agglomerative approach that builds a hierarchy over a group of centroids; its performance is relatively consistent regardless of variations in the settings, i.e., clustering methods, data distributions, and distance measures.
Abstract: An efficient hybrid hierarchical clustering method is proposed based on the agglomerative approach. It performs consistently with different distance measures, and it performs consistently on data with different distributions and sizes. Hierarchical clustering is of great importance in data analytics, especially because of the exponential growth of real-world data. Often these data are unlabelled and there is little prior domain knowledge available. One challenge in handling these huge data collections is the computational cost. In this paper, we aim to improve efficiency by introducing a set of methods for agglomerative hierarchical clustering. Instead of building cluster hierarchies based on raw data points, our approach builds a hierarchy based on a group of centroids. These centroids represent a group of adjacent points in the data space. With this approach, feature extraction or dimensionality reduction is not required. To evaluate our approach, we have conducted a comprehensive experimental study. We tested the approach with different clustering methods (i.e., UPGMA and SLINK), data distributions (i.e., normal and uniform), and distance measures (i.e., Euclidean and Canberra). The experimental results indicate that, using the centroid-based approach, the computational cost can be significantly reduced without compromising clustering performance. The performance of this approach is relatively consistent regardless of the variation of the settings, i.e., clustering methods, data distributions, and distance measures.
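
A minimal sketch of this hybrid strategy using standard libraries (the paper's own centroid-construction step may differ; k-means here merely stands in for "a group of centroids representing adjacent points", and the parameter choices are arbitrary):

```python
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage, fcluster

def centroid_hierarchical(X, n_centroids=100, method='single'):
    """Summarize the raw points by centroids, then run agglomerative
    clustering on the centroids only (method='single' is SLINK,
    method='average' is UPGMA), avoiding the quadratic cost of
    linking all n raw points directly."""
    km = MiniBatchKMeans(n_clusters=n_centroids, n_init=3).fit(X)
    Z = linkage(km.cluster_centers_, method=method)
    return km, Z

# Each raw point inherits the cluster of its centroid, e.g.:
# km, Z = centroid_hierarchical(X)
# labels = fcluster(Z, t=5, criterion='maxclust')[km.labels_]
```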

294 citations

Journal Article
01 Jan 2020
TL;DR: This paper provides a comprehensive review of existing community search work, analyzing and comparing the quality of communities under different models and the performance of different solutions.
Abstract: With the rapid development of information technologies, various big graphs are prevalent in many real applications (e.g., social media and knowledge bases). An important component of these graphs is the network community. Essentially, a community is a group of vertices that are densely connected internally. Community retrieval can be used in many real applications, such as event organization, friend recommendation, and so on. Consequently, how to efficiently find high-quality communities from big graphs is an important research topic in the era of big data. Recently, a large body of research, called community search, has been proposed. It aims to provide efficient solutions for searching high-quality communities from large networks in real time. Nevertheless, these works focus on different types of graphs and formulate communities in different manners, and thus it is desirable to have a comprehensive review of them. In this survey, we conduct a thorough review of existing community search works. Moreover, we analyze and compare the quality of communities under their models, as well as the performance of different solutions. Furthermore, we point out new research directions. This survey not only helps researchers gain a better understanding of existing community search solutions, but also provides practitioners with a better basis for choosing suitable solutions.
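
To ground the terminology, one of the simplest community models appearing in this literature is the k-core. The sketch below shows k-core-based community search as an illustrative baseline only; the survey covers many richer formulations.

```python
def k_core(adj, k):
    """Compute the k-core by repeatedly peeling vertices of degree < k."""
    degree = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if degree[v] < k:
                alive.discard(v)
                for w in adj[v]:
                    if w in alive:
                        degree[w] -= 1
                changed = True
    return alive

def community_search(adj, q, k):
    """Community of a query vertex q: the connected component containing q
    within the k-core, a common community-search formulation."""
    core = k_core(adj, k)
    if q not in core:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in core and w not in component:
                component.add(w)
                stack.append(w)
    return component
```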

190 citations