
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2019"


Journal ArticleDOI
TL;DR: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure as discussed by the authors, and a significant amount of progress has been made toward this emerging network analysis paradigm.
Abstract: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, a significant amount of progress has been made toward this emerging network analysis paradigm. In this survey, we focus on categorizing and then reviewing the current development of network embedding methods, and point out future research directions. We first summarize the motivation of network embedding. We discuss the classical graph embedding algorithms and their relationship with network embedding. Afterwards, and primarily, we provide a comprehensive overview of a large number of network embedding methods in a systematic manner, covering structure- and property-preserving network embedding methods, network embedding methods with side information, and advanced information-preserving network embedding methods. Moreover, several evaluation approaches for network embedding and some useful online resources, including network data sets and software, are also reviewed. Finally, we discuss the framework of exploiting these network embedding methods to build an effective system and point out some potential future directions.

929 citations


Journal ArticleDOI
TL;DR: A novel heterogeneous network embedding based approach for HIN based recommendation, called HERec is proposed, which shows the capability of the HERec model for the cold-start problem, and reveals that the transformed embedding information from HINs can improve the recommendation performance.
Abstract: Due to its flexibility in modelling data heterogeneity, the heterogeneous information network (HIN) has been adopted to characterize complex and heterogeneous auxiliary data in recommender systems, called HIN based recommendation. It is challenging to develop effective methods for HIN based recommendation in both the extraction and the exploitation of information from HINs. Most HIN based recommendation methods rely on path based similarity, which cannot fully mine the latent structural features of users and items. In this paper, we propose a novel heterogeneous network embedding based approach for HIN based recommendation, called HERec. To embed HINs, we design a meta-path based random walk strategy to generate meaningful node sequences for network embedding. The learned node embeddings are first transformed by a set of fusion functions and subsequently integrated into an extended matrix factorization (MF) model. The extended MF model together with the fusion functions is jointly optimized for the rating prediction task. Extensive experiments on three real-world datasets demonstrate the effectiveness of the HERec model. Moreover, we show the capability of the HERec model for the cold-start problem, and reveal that the transformed embedding information from HINs can improve the recommendation performance.
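
As an illustration of the meta-path based random walk strategy mentioned above, the following minimal Python sketch generates a node sequence by alternating node types along a symmetric meta-path such as user-movie-user; the adjacency layout, type names, and sampling details are illustrative assumptions, not HERec's actual implementation.

```python
# Hypothetical sketch of a meta-path guided random walk for generating node
# sequences on a heterogeneous graph. The adjacency layout and meta-path are
# illustrative assumptions; HERec's actual walk generation may differ.
import random

def meta_path_walk(adj, start, meta_path, walk_length):
    """adj[(src_type, dst_type)][node] -> neighbors of dst_type.
    `start` is assumed to have type meta_path[0]; the meta-path is symmetric
    (first and last types match), e.g., ["user", "movie", "user"]."""
    walk, node = [start], start
    n_edges = len(meta_path) - 1
    for step in range(walk_length - 1):
        src_t = meta_path[step % n_edges]
        dst_t = meta_path[step % n_edges + 1]
        neighbors = adj.get((src_t, dst_t), {}).get(node, [])
        if not neighbors:
            break
        node = random.choice(neighbors)
        walk.append(node)
    return walk

# Toy usage with a hypothetical user-movie bipartite graph:
adj = {
    ("user", "movie"): {"u1": ["m1", "m2"], "u2": ["m1"]},
    ("movie", "user"): {"m1": ["u1", "u2"], "m2": ["u1"]},
}
print(meta_path_walk(adj, "u1", ["user", "movie", "user"], walk_length=5))
```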

768 citations


Journal ArticleDOI
TL;DR: A high quality, instructive review of current research developments and trends in the concept drift field is conducted, and a framework of learning under concept drift is established including three main components: concept drift detection, concept drift understanding, and concept drift adaptation.
Abstract: Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary to conduct a high quality, instructive review of current research developments and trends in the concept drift field. In addition, due to the rapid development of concept drift research in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in the literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper also lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aimed at handling concept drift. Concept drift related research directions are also covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
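
To make the detection component concrete, here is a generic, simplified sketch of an error-rate based drift detector of the kind such surveys cover: it flags drift when the error rate in a recent window rises well above the error rate in a reference window. The window sizes and margin are arbitrary assumptions, and this is not any specific detector from the paper.

```python
# Generic, simplified error-rate drift detector: compare a reference window of
# prediction errors with the most recent window and flag drift when the recent
# error rate exceeds the reference rate by a chosen margin. Window sizes and
# the margin are illustrative assumptions.
from collections import deque

class SimpleDriftDetector:
    def __init__(self, window=100, margin=0.1):
        self.reference = deque(maxlen=window)   # older errors (0/1)
        self.recent = deque(maxlen=window)      # newest errors (0/1)
        self.margin = margin

    def add_error(self, is_error):
        if len(self.recent) == self.recent.maxlen:
            self.reference.append(self.recent.popleft())
        self.recent.append(int(is_error))

    def drift_detected(self):
        if len(self.recent) < self.recent.maxlen or not self.reference:
            return False
        recent_rate = sum(self.recent) / len(self.recent)
        reference_rate = sum(self.reference) / len(self.reference)
        return recent_rate > reference_rate + self.margin
```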

557 citations


Journal ArticleDOI
TL;DR: Multi-view representation learning has become a rapidly growing direction in machine learning and data mining, and this paper surveys it from two perspectives: multi-view representation alignment and multi-view representation fusion.
Abstract: Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining. This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion. Accordingly, we first review the representative methods and theories of multi-view representation learning from the perspective of alignment, such as correlation-based alignment; representative examples are canonical correlation analysis (CCA) and its several extensions. Then, from the perspective of representation fusion, we investigate the advancement of multi-view representation learning that ranges from generative methods, including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods, including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural networks. Further, we also investigate several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of the theoretical foundation and state-of-the-art developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
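
For reference, the correlation-based alignment that CCA performs can be written as the following standard objective (a textbook formulation, not taken from the survey itself), where $\Sigma_{xx}$, $\Sigma_{yy}$, and $\Sigma_{xy}$ are the within-view and cross-view covariance matrices of the two views:

```latex
(\mathbf{w}_x^{*}, \mathbf{w}_y^{*})
  = \arg\max_{\mathbf{w}_x,\, \mathbf{w}_y}
    \frac{\mathbf{w}_x^{\top}\Sigma_{xy}\mathbf{w}_y}
         {\sqrt{\mathbf{w}_x^{\top}\Sigma_{xx}\mathbf{w}_x}\,
          \sqrt{\mathbf{w}_y^{\top}\Sigma_{yy}\mathbf{w}_y}}
```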

328 citations


Journal ArticleDOI
TL;DR: Some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines are discussed.
Abstract: Geosciences is a field of great societal relevance that requires solutions to several urgent problems facing humanity and the planet. As geosciences enters the era of big data, machine learning (ML), which has been widely successful in commercial domains, offers immense potential to contribute to problems in geosciences. However, geoscience applications introduce novel challenges for ML due to combinations of geoscience properties encountered in every problem, requiring novel research in machine learning. This article introduces researchers in the ML community to these challenges posed by geoscience problems and the opportunities that exist for advancing both machine learning and geosciences. We first highlight typical sources of geoscience data and describe their common properties. We then describe some of the common categories of geoscience problems where machine learning can play a role, discussing the challenges faced by existing ML methods and opportunities for novel ML research. We conclude by discussing some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines.

290 citations


Journal ArticleDOI
TL;DR: The results highlight gaps and unexplored tradeoffs in the field, including the lack of scalability of some methods and a strong divergence in their performance with respect to the different quality metrics used.
Abstract: Process mining allows analysts to exploit logs of historical executions of business processes to extract insights regarding the actual performance of these processes. One of the most widely studied process mining operations is automated process discovery. An automated process discovery method takes as input an event log, and produces as output a business process model that captures the control-flow relations between tasks that are observed in or implied by the event log. Various automated process discovery methods have been proposed in the past two decades, striking different tradeoffs between scalability, accuracy, and complexity of the resulting models. However, these methods have been evaluated in an ad-hoc manner, employing different datasets, experimental setups, evaluation measures, and baselines, often leading to incomparable conclusions and sometimes unreproducible results due to the use of closed datasets. This article provides a systematic review and comparative evaluation of automated process discovery methods, using an open-source benchmark and covering 12 publicly-available real-life event logs, 12 proprietary real-life event logs, and nine quality metrics. The results highlight gaps and unexplored tradeoffs in the field, including the lack of scalability of some methods and a strong divergence in their performance with respect to the different quality metrics used.

225 citations


Journal ArticleDOI
TL;DR: This paper argues that for NB highly predictive features should be highly correlated with the class, yet uncorrelated with other features (minimum mutual redundancy), and proposes a correlation-based feature weighting (CFW) filter for NB.
Abstract: Due to its simplicity, efficiency, and efficacy, naive Bayes (NB) has continued to be one of the top 10 algorithms in the data mining and machine learning community. Of the numerous approaches to alleviating its conditional independence assumption, feature weighting has placed more emphasis on highly predictive features than on those that are less predictive. In this paper, we argue that for NB highly predictive features should be highly correlated with the class (maximum mutual relevance), yet uncorrelated with other features (minimum mutual redundancy). Based on this premise, we propose a correlation-based feature weighting (CFW) filter for NB. In CFW, the weight for a feature is a sigmoid transformation of the difference between the feature-class correlation (mutual relevance) and the average feature-feature intercorrelation (average mutual redundancy). Experimental results show that NB with CFW significantly outperforms NB and all the other existing state-of-the-art feature weighting filters used for comparison. Compared to feature weighting wrappers for improving NB, the main advantages of CFW are its low computational complexity (no search involved) and the fact that it maintains the simplicity of the final model. In addition, we apply CFW to text classification and achieve remarkable improvements.
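
The weighting rule described above, a sigmoid of mutual relevance minus average mutual redundancy, can be sketched as follows, using mutual information between discretized features as the correlation measure; the exact correlation measure and normalization in CFW may differ, so treat this as an assumption-laden illustration rather than the paper's implementation.

```python
# Illustrative sketch of correlation-based feature weights: sigmoid of
# (feature-class relevance minus average feature-feature redundancy).
# Mutual information on discrete features stands in for the correlation
# measure; CFW's exact measure may differ.
import numpy as np
from sklearn.metrics import mutual_info_score

def cfw_like_weights(X, y):
    """X: 2-D array of discrete feature values, y: class labels."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, i], y) for i in range(n_features)])
    redundancy = np.zeros(n_features)
    for i in range(n_features):
        others = [mutual_info_score(X[:, i], X[:, j])
                  for j in range(n_features) if j != i]
        redundancy[i] = np.mean(others) if others else 0.0
    return 1.0 / (1.0 + np.exp(-(relevance - redundancy)))   # sigmoid transform
```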

179 citations


Journal ArticleDOI
TL;DR: A Low-rank Sparse Subspace (LSS) clustering method via dynamically learning the affinity matrix from low-dimensional space of the original data is proposed, which outperforms the state-of-the-art clustering methods.
Abstract: Traditional graph clustering methods consist of two sequential steps, i.e., constructing an affinity matrix from the original data and then performing spectral clustering on the resulting affinity matrix. This two-step strategy achieves an optimal solution for each step separately, but cannot guarantee that it will obtain the globally optimal clustering result. Moreover, an affinity matrix learned directly from the original data can seriously affect the clustering performance, since high-dimensional data are usually noisy and may contain redundancy. To address the above issues, this paper proposes a Low-rank Sparse Subspace (LSS) clustering method that dynamically learns the affinity matrix from a low-dimensional space of the original data. Specifically, we learn a transformation matrix to project the original data to their low-dimensional space, by conducting feature selection and subspace learning in the sample self-representation framework. Then, we utilize the rank constraint and the affinity matrix directly obtained from the original data to construct a dynamic and intrinsic affinity matrix. Each of these three matrices is updated iteratively while fixing the other two. In this way, the affinity matrix learned from the low-dimensional space yields the final clustering result. Extensive experiments conducted on both synthetic and real datasets show that our proposed LSS method outperforms state-of-the-art clustering methods.

148 citations


Journal ArticleDOI
TL;DR: A novel One-step Multi-view Spectral Clustering (OMSC) method that outputs the common affinity matrix as the final clustering result is proposed, along with an iterative optimization method to efficiently solve the proposed objective function.
Abstract: Previous multi-view spectral clustering methods follow a two-step strategy, which first learns a fixed common representation (or common affinity matrix) of all the views from the original data and then conducts k-means clustering on the resulting common affinity matrix. The two-step strategy is unable to output reasonable clustering performance since the goal of the first step (i.e., the common affinity matrix learning) is not designed for achieving the optimal clustering result. Moreover, the two-step strategy learns the common affinity matrix from the original data, which often contain noise and redundancy that degrade the quality of the common affinity matrix. To address these issues, in this paper, we design a novel One-step Multi-view Spectral Clustering (OMSC) method that outputs the common affinity matrix as the final clustering result. In the proposed method, the goal of the common affinity matrix learning is designed to achieve the optimal clustering result, and the common affinity matrix is learned from low-dimensional data where the noise and redundancy of the original high-dimensional data have been removed. We further propose an iterative optimization method to efficiently solve the proposed objective function. Experimental results on both synthetic and public datasets validate the effectiveness of our proposed method compared to state-of-the-art methods for multi-view clustering.

142 citations


Journal ArticleDOI
TL;DR: The proposed method is based on the assumption that the intrinsic underlying graph structure would assign the corresponding connected components in each graph to the same cluster, and it obtains better clustering performance than state-of-the-art methods.
Abstract: Most existing multiview clustering methods take graphs, which are usually predefined independently in each view, as input to uncover the data distribution. These methods ignore the correlation of graph structure among multiple views, and their clustering results highly depend on the quality of the predefined affinity graphs. We address the problem of multiview clustering by seamlessly integrating the graph structures of different views to fully exploit the geometric property of the underlying data structure. The proposed method is based on the assumption that the intrinsic underlying graph structure would assign the corresponding connected components in each graph to the same cluster. Different graphs from multiple views are integrated by using the Hadamard product, since different views usually admit the same underlying structure. Specifically, these graphs are integrated into a global one, and the structure of the global graph is adaptively tuned by a well-designed objective function so that the number of connected components of the graph is exactly equal to the number of clusters. It is worth noting that we directly obtain cluster indicators from the graph itself without performing further graph-cut or $k$-means clustering algorithms. Experiments show the proposed method obtains better clustering performance than state-of-the-art methods.
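
The property this objective enforces, that cluster indicators can be read directly off the graph, rests on a standard spectral fact: the number of connected components of a graph equals the multiplicity of the zero eigenvalue of its Laplacian. Below is a small, self-contained check of that fact, not code from the paper.

```python
# Standard spectral-graph fact used by such one-step methods: the multiplicity
# of eigenvalue 0 of the unnormalized graph Laplacian equals the number of
# connected components, so a graph tuned to have exactly c components directly
# encodes c clusters. Purely illustrative, not the paper's algorithm.
import numpy as np

def n_components_from_affinity(W, tol=1e-8):
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    eigvals = np.linalg.eigvalsh(L)       # symmetric, so eigvalsh is safe
    return int(np.sum(eigvals < tol))

# Two disjoint edges -> two components (and hence two clusters):
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
print(n_components_from_affinity(W))      # prints 2
```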

126 citations


Journal ArticleDOI
TL;DR: It is proved that the trace optimization of multi-layer modularity density is equivalent to the objective functions of algorithms such as kernel K-means, nonnegative matrix factorization (NMF), spectral clustering, and multi-view clustering for multi-layer networks, which serves as the theoretical foundation for designing community detection algorithms.
Abstract: Many complex systems are composed of coupled networks through different layers, where each layer represents one of many possible types of interactions. A fundamental question is how to extract communities in multi-layer networks. Current algorithms either collapse multi-layer networks into a single-layer network or extend the algorithms for single-layer networks by using consensus clustering. However, these approaches have been criticized for ignoring the connection among various layers, thereby resulting in low accuracy. To address this problem, a quantitative function (multi-layer modularity density) is proposed for community detection in multi-layer networks. Afterward, we prove that the trace optimization of multi-layer modularity density is equivalent to the objective functions of algorithms such as kernel $K$-means, nonnegative matrix factorization (NMF), spectral clustering, and multi-view clustering for multi-layer networks, which serves as the theoretical foundation for designing algorithms for community detection. Furthermore, a Semi-Supervised joint Nonnegative Matrix Factorization algorithm (S2-jNMF) is developed by simultaneously factorizing matrices that are associated with multi-layer networks. Unlike traditional semi-supervised algorithms, the partial supervision is integrated into the objective of the S2-jNMF algorithm. Finally, through extensive experiments on both artificial and real-world networks, we demonstrate that the proposed method outperforms state-of-the-art approaches for community detection in multi-layer networks.

Journal ArticleDOI
TL;DR: A stable three-order tensor is first constructed from the normalized image, so as to enhance the robustness of the TD hashing, where image hash generation is viewed as deriving a compact representation from a tensor.
Abstract: This paper presents a new image hashing designed with tensor decomposition (TD), referred to as TD hashing, in which image hash generation is viewed as deriving a compact representation from a tensor. Specifically, a stable three-order tensor is first constructed from the normalized image, so as to enhance the robustness of our TD hashing. A popular TD algorithm, called Tucker decomposition, is then exploited to decompose the three-order tensor into a core tensor and three orthogonal factor matrices. As the factor matrices reflect the intrinsic structure of the original tensor, constructing the hash from the factor matrices gives the TD hashing desirable discrimination. To examine these claims, 14,551 images are selected for our experiments. A receiver operating characteristics (ROC) graph is used to conduct theoretical analysis, and the ROC comparisons illustrate that the TD hashing outperforms some state-of-the-art algorithms in classification performance between robustness and discrimination.
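
To give a sense of the Tucker decomposition step named above, here is a minimal sketch using the tensorly library (assumed to be available): a third-order tensor standing in for the normalized image is factored into a core tensor and three factor matrices. The tensor construction, ranks, and any hash quantization are illustrative assumptions, not the paper's TD hashing pipeline.

```python
# Minimal illustration of Tucker decomposition on a third-order tensor using
# tensorly (assumed available). The tensor, ranks, and how a hash would be
# quantized from the factor matrices are illustrative assumptions.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

image_tensor = tl.tensor(np.random.rand(64, 64, 3))        # stand-in for a normalized image
core, factors = tucker(image_tensor, rank=[8, 8, 3])        # core tensor + three factor matrices
print(core.shape, [f.shape for f in factors])
```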

Journal ArticleDOI
TL;DR: A unified framework for nonuniform embedding, dynamical system revealing, and time series prediction, termed as Structured Manifold Broad Learning System (SM-BLS), which provides a homogeneous way to recover the chaotic attractor from multivariate and heterogeneous time series.
Abstract: High-dimensional and large-scale time series processing has attracted considerable research interest over recent decades. It is difficult for traditional methods to reveal the evolution state of dynamical systems and to discover the relationships among variables automatically. In this paper, we propose a unified framework for nonuniform embedding, dynamical system revealing, and time series prediction, termed the Structured Manifold Broad Learning System (SM-BLS). Structured manifold learning is introduced for nonuniform embedding and unsupervised manifold learning simultaneously. Graph embedding and feature selection are both considered to depict the intrinsic structural connections between a chaotic time series and its low-dimensional manifold. Compared with traditional methods, the proposed framework can discover potential deterministic evolution information of dynamical systems and make the modeling more interpretable. It provides a homogeneous way to recover the chaotic attractor from multivariate and heterogeneous time series. Simulation analysis and results show that SM-BLS has advantages in dynamics discovery and feature extraction for large-scale chaotic time series prediction.

Journal ArticleDOI
TL;DR: This paper presents a novel real-time nonparametric change point detection algorithm called SEP, which uses Separation distance as a divergence measure to detect change points in high-dimensional time series.
Abstract: Change Point Detection (CPD) is the problem of discovering time points at which the behavior of a time series changes abruptly. In this paper, we present a novel real-time nonparametric change point detection algorithm called SEP, which uses Separation distance as a divergence measure to detect change points in high-dimensional time series. Through experiments on artificial and real-world datasets, we demonstrate the usefulness of the proposed method in comparison with existing methods.
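
As a rough illustration of divergence-based change point detection in the spirit of the abstract (but not SEP's actual estimator), the sketch below compares two adjacent windows with a pluggable divergence function and flags a change when it exceeds a threshold; the window size, threshold, and divergence are placeholders.

```python
# Generic sliding-window change point detection: flag a change when a chosen
# divergence between two adjacent windows exceeds a threshold. The window
# size, threshold, and divergence function are placeholders; SEP's
# separation-distance estimator and real-time machinery are not reproduced.
import numpy as np

def detect_changes(series, window, divergence, threshold):
    change_points = []
    for t in range(window, len(series) - window + 1):
        past = series[t - window:t]
        future = series[t:t + window]
        if divergence(past, future) > threshold:
            change_points.append(t)
    return change_points

# Example with a crude divergence (absolute difference of window means):
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])
mean_shift = lambda a, b: abs(np.mean(a) - np.mean(b))
print(detect_changes(series, window=50, divergence=mean_shift, threshold=1.5)[:3])
```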

Journal ArticleDOI
TL;DR: Ten massiveness characteristics for big knowledge and big-knowledge systems, including massive concepts, connectedness, clean data resources, cases, confidence, capabilities, cumulativeness, concerns, consistency, and completeness, are defined and explored.
Abstract: After entering the big data era, the new term ‘big knowledge’ has been coined to address the challenges of mining a mass of knowledge from big data. While researchers have explored the basic characteristics of big data, we have not seen any studies on the general and essential properties of big knowledge. To fill this gap, this paper studies the concepts of big knowledge, big-knowledge systems, and big-knowledge engineering. Ten massiveness characteristics of big knowledge and big-knowledge systems, including massive concepts, connectedness, clean data resources, cases, confidence, capabilities, cumulativeness, concerns, consistency, and completeness, are defined and explored. Based on these characteristics, a comprehensive investigation is conducted on some large-scale knowledge engineering projects, including the Fifth Comprehensive Traffic Survey in Shanghai, China's Xia-Shang-Zhou Chronology Project, the Troy and Trojan War Project, and the International Human Genome Project, as well as the online free encyclopedia Wikipedia. We also investigate recent research efforts on knowledge graphs, analyzing which of them can be considered big knowledge and big-knowledge systems. Further, a definition of big-knowledge engineering and its life cycle paradigm is presented. All of these projects are accordingly checked to determine whether they belong to big-knowledge engineering projects. Finally, perspectives on big knowledge research are discussed.

Journal ArticleDOI
TL;DR: This work proposes a unified NRL framework by introducing community information of vertices, named as Community-enhanced Network Representation Learning (CNRL), which simultaneously detects community distribution of each vertex and learns embeddings of both vertices and communities.
Abstract: Network representation learning (NRL) aims to learn low-dimensional vectors for vertices in a network. Most existing NRL methods focus on learning representations from local context of vertices (such as their neighbors). Nevertheless, vertices in many complex networks also exhibit significant global patterns widely known as communities. It's intuitive that vertices in the same community tend to connect densely and share common attributes. These patterns are expected to improve NRL and benefit relevant evaluation tasks, such as link prediction and vertex classification. Inspired by the analogy between network representation learning and text modeling, we propose a unified NRL framework by introducing community information of vertices, named as Community-enhanced Network Representation Learning (CNRL). CNRL simultaneously detects community distribution of each vertex and learns embeddings of both vertices and communities. Moreover, the proposed community enhancement mechanism can be applied to various existing NRL models. In experiments, we evaluate our model on vertex classification, link prediction, and community detection using several real-world datasets. The results demonstrate that CNRL significantly and consistently outperforms other state-of-the-art methods while verifying our assumptions on the correlations between vertices and communities.

Journal ArticleDOI
TL;DR: This paper proposes a new formulation of linear discriminant analysis via joint $L_{2,1}$-norm minimization on the objective function to induce robustness, so as to efficiently alleviate the influence of outliers and improve the robustness of the proposed method.
Abstract: Dimensionality reduction is a critical technology in the domain of pattern recognition, and linear discriminant analysis (LDA) is one of the most popular supervised dimensionality reduction methods. However, whenever the distance criterion of its objective function uses the $L_2$-norm, it is sensitive to outliers. In this paper, we propose a new formulation of linear discriminant analysis via joint $L_{2,1}$-norm minimization on the objective function to induce robustness, so as to efficiently alleviate the influence of outliers and improve the robustness of the proposed method. An efficient iterative algorithm is proposed to solve the optimization problem and is proved to be convergent. Extensive experiments are performed on an artificial data set, on UCI data sets, and on four face data sets, which sufficiently demonstrate the efficiency of our approach compared to other methods and its robustness to outliers.
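
For clarity, the $L_{2,1}$-norm of a matrix $M \in \mathbb{R}^{n \times d}$, the quantity minimized jointly in the formulation above, has the standard definition below (row-wise $L_2$ norms summed with an $L_1$ norm); because each row's deviation enters linearly rather than squared, large outlier rows contribute less than under a squared $L_2$ criterion:

```latex
\|M\|_{2,1} \;=\; \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{d} m_{ij}^{2}}
          \;=\; \sum_{i=1}^{n} \|m_{i\cdot}\|_{2}
```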

Journal ArticleDOI
TL;DR: In this paper, the authors generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanism based on the global sensitivity of statistical queries and compare the utility of sanitized results in the tail probability and dispersion between the Gaussian and Laplace mechanisms.
Abstract: Assessment of disclosure risk is of paramount importance in data privacy research and applications. The concept of differential privacy (DP) formalizes privacy in probabilistic terms and provides a robust concept for privacy protection. Practical applications of DP involve the development of DP mechanisms to release data at a pre-specified privacy budget. In this paper, we generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanisms based on the $l_p$ global sensitivity of statistical queries. We explore the theoretical requirements for the GG mechanism to reach DP at prespecified privacy parameters, and investigate the connections and differences between the GG mechanism and the Exponential mechanism based on the GG distribution. We also present a lower bound on the scale parameter of the Gaussian mechanism of $(\epsilon, \delta)$-probabilistic DP as a special case of the GG mechanism, and compare the utility of sanitized results in the tail probability and dispersion between the Gaussian and Laplace mechanisms. Lastly, we apply the GG mechanism in three experiments and compare the accuracy of sanitized results in the $l_1$ distance and Kullback-Leibler divergence, and examine the prediction power of an SVM classifier constructed with the sanitized data relative to the original results.
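
As background (a standard textbook form, not reproduced from the paper), the generalized Gaussian density with location $\mu$, scale $b$, and shape $p$ is given below; $p=1$ recovers the Laplace distribution underlying the Laplace mechanism and $p=2$ the Gaussian distribution, which is why a mechanism adding GG noise generalizes both:

```latex
f(x \mid \mu, b, p) \;=\; \frac{p}{2\,b\,\Gamma(1/p)}
  \exp\!\left(-\left(\frac{|x-\mu|}{b}\right)^{p}\right)
```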

Journal ArticleDOI
TL;DR: An improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor Node Influence), is proposed, which detects overlapping community structures by adopting a fixed label propagation sequence based on the ascending order of node importance and a label update strategy based on neighbor node influence and a historical-label preference strategy.
Abstract: Overlapping community structure is a significant feature of large-scale complex networks. Some existing community detection algorithms cannot be applied to large-scale complex networks due to their high time or space complexity. Label propagation algorithms have been proposed for detecting communities in large-scale networks because of their linear time complexity; however, most of them can only detect non-overlapping communities, or their results are inaccurate and unstable. To address these shortcomings, we propose an improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor Node Influence), which detects overlapping community structures by adopting a fixed label propagation sequence based on the ascending order of node importance and a label update strategy based on neighbor node influence and a historical-label preference strategy. Extensive experimental results on both real networks and synthetic networks show that LPANNI can significantly improve the accuracy and stability of community detection algorithms based on label propagation in large-scale complex networks. Meanwhile, LPANNI can detect overlapping community structures in large-scale complex networks in linear time complexity.
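
As a simplified illustration of label propagation with a fixed update order (ascending node importance, crudely approximated here by degree), the sketch below assigns each node the most frequent label among its neighbors; LPANNI's neighbor-influence weighting, multi-label (overlapping) memberships, and historical-label preference are deliberately omitted.

```python
# Simplified label propagation with a fixed update order (ascending node
# "importance", approximated here by degree). Each node repeatedly adopts the
# most frequent label among its neighbors. LPANNI's neighbor-influence weights,
# overlapping memberships, and historical-label preference are omitted.
from collections import Counter
import networkx as nx

def simple_label_propagation(G, max_iter=20):
    labels = {v: v for v in G.nodes}                 # every node starts in its own community
    order = sorted(G.nodes, key=G.degree)            # fixed ascending-importance order
    for _ in range(max_iter):
        changed = False
        for v in order:
            if G.degree(v) == 0:
                continue
            counts = Counter(labels[u] for u in G.neighbors(v))
            best = counts.most_common(1)[0][0]
            if labels[v] != best:
                labels[v], changed = best, True
        if not changed:
            break
    return labels

# Example: two cliques joined by one edge separate into two communities.
G = nx.barbell_graph(5, 0)
print(simple_label_propagation(G))
```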

Journal ArticleDOI
TL;DR: DeepClue is presented, a system built to bridge text-based deep learning models and end users through visually interpreting the key factors learned in the stock price prediction model by designing the deep neural network architecture for interpretation and applying an algorithm to extract relevant predictive factors.
Abstract: The recent advance of deep learning has enabled trading algorithms to predict stock price movements more accurately. Unfortunately, there is a significant gap in the real-world deployment of this breakthrough. For example, professional traders in their long-term careers have accumulated numerous trading rules, which they can understand quite well. Deep learning models, on the other hand, have been hardly interpretable. This paper presents DeepClue, a system built to bridge text-based deep learning models and end users through visually interpreting the key factors learned in the stock price prediction model. We make three contributions in DeepClue. First, by designing the deep neural network architecture for interpretation and applying an algorithm to extract relevant predictive factors, we provide a useful case on what can be interpreted out of the prediction model for end users. Second, by exploring hierarchies over the extracted factors and displaying these factors in an interactive, hierarchical visualization interface, we shed light on how to effectively communicate the interpreted model to end users. Specifically, the interpretation separates the predictables from the unpredictables for stock prediction through the use of intercept model parameters and a risk visualization design. Third, we evaluate the integrated visualization system through two case studies in predicting the stock price with financial news and company-related tweets from social media. Quantitative experiments comparing the proposed neural network architecture with state-of-the-art models and the human baseline are conducted and reported. Feedback from an informal user study with domain experts is summarized and discussed in detail. The study results demonstrate the effectiveness of DeepClue in helping to complete stock market investment and analysis tasks.

Journal ArticleDOI
TL;DR: A novel parallel collaborative (PCol) search method based on a divide-and-conquer strategy is proposed, and an upper bound on the spatiotemporal correlation and a heuristic scheduling strategy are developed to prune the search space.
Abstract: The matching between trajectories and locations, called Trajectory-to-Location join (TL-Join), is a fundamental functionality in spatiotemporal data management. Given a set of trajectories, a set of locations, and a threshold $\theta$, the TL-Join finds all (trajectory, location) pairs from the two sets with spatiotemporal correlation above $\theta$. This join targets diverse applications, including location recommendation, event tracking, and trajectory activity analyses. We address three challenges in relation to the TL-Join: how to define the spatiotemporal correlation between trajectories and locations, how to prune the search space effectively when computing the join, and how to perform the computation in parallel. Specifically, we define new metrics to measure the spatiotemporal correlation between trajectories and locations. We develop a novel parallel collaborative (PCol) search method based on a divide-and-conquer strategy. For each location $o$, we retrieve the trajectories with high spatiotemporal correlation to $o$, and then we merge the results. An upper bound on the spatiotemporal correlation and a heuristic scheduling strategy are developed to prune the search space. The trajectory searches from different locations are independent and are performed in parallel, and the result merging cost is independent of the degree of parallelism. Studies of the performance of the developed algorithms using large spatiotemporal data sets are reported.

Journal ArticleDOI
TL;DR: This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment to tf-idf that accounts for this bias.
Abstract: Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing, a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to compute these weights. This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can manifest in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.
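
For reference, the unadjusted tf-idf weight that the paper starts from is typically written as below, where $\mathrm{tf}(t, d)$ counts occurrences of term $t$ in document $d$ and $N$ is the number of documents in the collection $D$; the paper's discourse-correlation adjustment is not reproduced here:

```latex
\mathrm{tfidf}(t, d, D) \;=\; \mathrm{tf}(t, d)\times
  \log\frac{N}{\left|\{\,d' \in D : t \in d'\,\}\right|}
```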

Journal ArticleDOI
TL;DR: This paper presents a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models, and evaluates it using the MNIST dataset, showing that it produces realistic synthetic samples, which can also be used to accurately compute arbitrary number of counting queries.
Abstract: Generative models are used in a wide range of applications building on large amounts of contextually rich information. Due to possible privacy violations of the individuals whose data is used to train these models, however, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models. We model the generator distribution of the training data with a mixture of $k$ generative neural networks. These are trained together and collectively learn the generator distribution of a dataset. Data is divided into $k$ clusters, using a novel differentially private kernel $k$-means, then each cluster is given to a separate generative neural network, such as a Restricted Boltzmann Machine or Variational Autoencoder, which is trained only on its own cluster using differentially private gradient descent. We evaluate our approach using the MNIST dataset, as well as call detail records and transit datasets, showing that it produces realistic synthetic samples, which can also be used to accurately compute an arbitrary number of counting queries.
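
To illustrate the differentially private gradient descent ingredient mentioned above, the sketch below shows the usual core step of DP-SGD style training: clipping each per-example gradient and adding Gaussian noise scaled to the clipping norm. The clip norm, noise multiplier, and the accounting needed to reach a target privacy budget are assumptions not taken from the paper.

```python
# Core step of differentially private gradient descent (DP-SGD style): bound
# each per-example gradient's L2 norm (its sensitivity), then add Gaussian
# noise proportional to that bound before averaging. The clip norm and noise
# multiplier are placeholders; the paper's exact calibration may differ.
import numpy as np

rng = np.random.default_rng(0)

def dp_average_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

# Toy usage with three fake per-example gradients:
grads = [np.array([0.5, -2.0]), np.array([3.0, 1.0]), np.array([-0.2, 0.1])]
print(dp_average_gradient(grads))
```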

Journal ArticleDOI
TL;DR: This paper analyzes the privacy leakage of a traditional DP mechanism under temporal correlation that can be modeled using Markov Chain, and reveals that, the event-level privacy loss of a DP mechanism may increase over time.
Abstract: Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of the correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call this unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.

Journal ArticleDOI
TL;DR: An empirical evaluation on both synthetic and real-world datasets shows that the proposed PrivRank framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized ranking-based recommendation.
Abstract: Personalized recommendation is crucial to help users find pertinent information. It often relies on a large collection of user data, in particular users’ online activity (e.g., tagging/rating/checking-in) on social media, to mine user preferences. However, releasing such user activity data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the users’ activity data. In this paper, we propose PrivRank, a customizable and continuous privacy-preserving social media data publishing framework that protects users against inference attacks while enabling personalized ranking-based recommendations. Its key idea is to continuously obfuscate user activity data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which bounds the ranking loss incurred by the data obfuscation process in order to preserve the utility of the data for enabling recommendations. An empirical evaluation on both synthetic and real-world datasets shows that our framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized ranking-based recommendation. Compared to state-of-the-art approaches, PrivRank achieves both better privacy protection and higher utility in all the ranking-based recommendation use cases we tested.

Journal ArticleDOI
TL;DR: This paper develops a baseline algorithm based on the concept of D-core, and proposes three index structures and corresponding query algorithms for CS on directed graphs; experimental results show that the solutions are very effective and efficient.
Abstract: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS), extracting a dense subgraph containing a query vertex $q$ from a graph, has received great attention. However, existing CS solutions are designed for undirected graphs and overlook the directions of edges, potentially losing useful information carried by directions. In many applications (e.g., Twitter), users’ relationships are often modeled as directed graphs (e.g., if a user $a$ follows another user $b$, then there is an edge from $a$ to $b$). In this paper, we study the problem of CS on directed graphs. Given a vertex $q$ of a graph $G$, we aim to find a densely connected subgraph containing $q$ from $G$, in which vertices have strong interactions and high similarities, by using the minimum in/out-degree metric. We first develop a baseline algorithm based on the concept of D-core. We further propose three index structures and corresponding query algorithms. Our experimental results on seven real graphs show that our solutions are very effective and efficient. For example, on a graph with over 1 billion edges, we only need around 40 minutes to index it and 1$\sim$2 seconds to answer a query.
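
As a rough illustration of the D-core concept the baseline builds on, the sketch below peels a directed graph down to its $(k, l)$-core, the maximal subgraph in which every remaining vertex has in-degree at least $k$ and out-degree at least $l$; the paper's query algorithms and index structures go well beyond this and are not shown.

```python
# Illustrative (k, l)-core peeling on a directed graph: repeatedly remove
# vertices whose in-degree < k or out-degree < l until none remain. This is
# only the basic D-core notion; the paper's indexes and query algorithms for
# community search are not reproduced here.
import networkx as nx

def d_core(G, k, l):
    H = G.copy()
    while True:
        to_remove = [v for v in H.nodes
                     if H.in_degree(v) < k or H.out_degree(v) < l]
        if not to_remove:
            return H
        H.remove_nodes_from(to_remove)

# Toy usage: a directed triangle survives as a (1, 1)-core.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 1)])
print(sorted(d_core(G, 1, 1).nodes))   # [1, 2, 3]
```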

Journal ArticleDOI
TL;DR: In this article, the authors propose a family of schema-agnostic Progressive Entity Resolution (ER) methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety.
Abstract: Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpora, etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce two naive schema-agnostic methods, showing that straightforward solutions exhibit poor performance that does not scale well to large volumes of data. Then, we propose four different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform, to a significant extent, both the naive and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on method selection.
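
One concrete way to work without schema knowledge is token blocking, a standard schema-agnostic blocking technique in this literature: every token appearing in any attribute value becomes a blocking key, so no attribute names are needed. The sketch below illustrates only that blocking step under assumed data structures; the progressive comparison ordering that the paper's methods add on top is not shown.

```python
# Schema-agnostic token blocking: group entity profiles by the tokens of any
# attribute value, ignoring attribute names entirely. The data layout is an
# assumption for illustration; the paper's progressive weighting/ordering of
# comparisons is not reproduced.
from collections import defaultdict

def token_blocking(profiles):
    """profiles: dict entity_id -> dict of attribute name -> value (any schema)."""
    blocks = defaultdict(set)
    for eid, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # keep only blocks that yield at least one comparison
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}

profiles = {
    "e1": {"name": "John Smith", "city": "Berlin"},
    "e2": {"fullName": "J. Smith", "location": "Berlin, DE"},
}
print(token_blocking(profiles))   # {'smith': {'e1', 'e2'}}
```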

Journal ArticleDOI
Zhe Jiang
TL;DR: A taxonomy of methods categorized by the key challenge they address is provided, to help interdisciplinary domain scientists choose techniques to solve their problems and to help data mining researchers understand the main principles and methods in spatial prediction and identify future research opportunities.
Abstract: With the advancement of GPS and remote sensing technologies, large amounts of geospatial data are being collected from various domains, driving the need for effective and efficient prediction methods. Given spatial data samples with explanatory features and targeted responses (categorical or continuous) at a set of locations, the spatial prediction problem aims to learn a model that can predict the response variable based on explanatory features. The problem is important with broad applications in earth science, urban informatics, geosocial media analytics, and public health, but is challenging due to the unique characteristics of spatial data, including spatial autocorrelation, heterogeneity, limited ground truth, and multiple scales and resolutions. This paper provides a systematic review on principles and methods in spatial prediction. We provide a taxonomy of methods categorized by the key challenge they address. For each method, we introduce its underlying assumption, theoretical foundation, and discuss its advantages and disadvantages. We also discuss spatiotemporal extensions of methods. Our goal is to help interdisciplinary domain scientists choose techniques to solve their problems, and more importantly, to help data mining researchers to understand the main principles and methods in spatial prediction and identify future research opportunities.

Journal ArticleDOI
TL;DR: Four tight average-utility upper-bounds, based on a vertical database representation, and three efficient pruning strategies are proposed to reduce the search space of itemsets.
Abstract: Mining High Average-Utility Itemsets (HAUIs) in a quantitative database is an extension of the traditional problem of frequent itemset mining, with several practical applications. Discovering HAUIs is more challenging than mining frequent itemsets using the traditional support model, since the average-utilities of itemsets do not satisfy the downward-closure property. To design algorithms for mining HAUIs that reduce the search space of itemsets, prior studies have proposed various upper-bounds on the average-utilities of itemsets. However, these algorithms can generate a huge number of unpromising HAUI candidates, which results in high memory consumption and long runtimes. To address this problem, this paper proposes four tight average-utility upper-bounds, based on a vertical database representation, and three efficient pruning strategies. Furthermore, a novel generic framework for comparing average-utility upper-bounds is presented. Based on these theoretical results, an efficient algorithm named dHAUIM is introduced for mining the complete set of HAUIs. dHAUIM represents the search space and quickly computes upper-bounds using a novel IDUL structure. Extensive experiments show that dHAUIM outperforms four state-of-the-art algorithms for mining HAUIs in terms of runtime on both real-life and synthetic databases. Moreover, the results show that the proposed pruning strategies dramatically reduce the number of candidate HAUIs.
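
For context, the quantity being mined is usually defined as below (the standard definition in the HAUI literature, stated here as background rather than quoted from the paper): the utility of an itemset $X$ is summed over the transactions of database $\mathcal{D}$ that contain it, and the average utility divides by the number of items in $X$, which is what breaks the downward-closure property:

```latex
u(X) \;=\; \sum_{T \in \mathcal{D},\; X \subseteq T} \;\sum_{i \in X} u(i, T),
\qquad
au(X) \;=\; \frac{u(X)}{|X|}
```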

Journal ArticleDOI
TL;DR: The problem of continuous SAC search on a “dynamic spatial graph,” whose vertices’ locations change with time, is studied, and three fast solutions are proposed.
Abstract: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS) has received plenty of attention. The CS problem aims to look for a dense subgraph that contains a query vertex. Existing CS solutions do not consider the spatial extent of a community. They can yield communities whose locations of vertices span large areas. In applications that facilitate setting social events (e.g., finding conference attendees to join a dinner), it is important to find groups of people who are physically close to each other, so it is desirable to have a spatial-aware community (or SAC), whose vertices are close structurally and spatially. Given a graph $G$ and a query vertex $q$, we develop an exact solution to find the SAC containing $q$, but it cannot scale to large datasets, so we design three approximation algorithms. We further study the problem of continuous SAC search on a “dynamic spatial graph,” whose vertices’ locations change with time, and propose three fast solutions. We evaluate the solutions on both real and synthetic datasets, and the results show that SACs are better than communities returned by existing solutions. Moreover, our approximation solutions perform accurately and efficiently.