
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2019"


Journal ArticleDOI
TL;DR: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure as discussed by the authors, and a significant amount of progress has been made toward this emerging network analysis paradigm.
Abstract: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, a significant amount of progress has been made toward this emerging network analysis paradigm. In this survey, we focus on categorizing and then reviewing the current development of network embedding methods, and point out future research directions. We first summarize the motivation of network embedding. We discuss the classical graph embedding algorithms and their relationship with network embedding. Afterwards, and primarily, we provide a comprehensive overview of a large number of network embedding methods in a systematic manner, covering structure- and property-preserving network embedding methods, network embedding methods with side information, and advanced information-preserving network embedding methods. Moreover, several evaluation approaches for network embedding and some useful online resources, including network data sets and software, are also reviewed. Finally, we discuss the framework of exploiting these network embedding methods to build an effective system and point out some potential future directions.

929 citations


Journal ArticleDOI
TL;DR: A novel heterogeneous network embedding based approach for HIN based recommendation, called HERec is proposed, which shows the capability of the HERec model for the cold-start problem, and reveals that the transformed embedding information from HINs can improve the recommendation performance.
Abstract: Due to its flexibility in modelling data heterogeneity, the heterogeneous information network (HIN) has been adopted to characterize complex and heterogeneous auxiliary data in recommender systems, called HIN based recommendation. It is challenging to develop effective methods for HIN based recommendation in both the extraction and the exploitation of information from HINs. Most HIN based recommendation methods rely on path based similarity, which cannot fully mine the latent structural features of users and items. In this paper, we propose a novel heterogeneous network embedding based approach for HIN based recommendation, called HERec. To embed HINs, we design a meta-path based random walk strategy to generate meaningful node sequences for network embedding. The learned node embeddings are first transformed by a set of fusion functions and subsequently integrated into an extended matrix factorization (MF) model. The extended MF model together with the fusion functions is jointly optimized for the rating prediction task. Extensive experiments on three real-world datasets demonstrate the effectiveness of the HERec model. Moreover, we show the capability of the HERec model for the cold-start problem, and reveal that the transformed embedding information from HINs can improve the recommendation performance.
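
As an illustration of the meta-path based random walk strategy mentioned above, the following minimal Python sketch generates a node sequence by alternating node types along a symmetric meta-path such as user-movie-user; the adjacency layout, type names, and sampling details are illustrative assumptions, not HERec's actual implementation.

```python
# Hypothetical sketch of a meta-path guided random walk for generating node
# sequences on a heterogeneous graph. The adjacency layout and meta-path are
# illustrative assumptions; HERec's actual walk generation may differ.
import random

def meta_path_walk(adj, start, meta_path, walk_length):
    """adj[(src_type, dst_type)][node] -> neighbors of dst_type.
    `start` is assumed to have type meta_path[0]; the meta-path is symmetric
    (first and last types match), e.g., ["user", "movie", "user"]."""
    walk, node = [start], start
    n_edges = len(meta_path) - 1
    for step in range(walk_length - 1):
        src_t = meta_path[step % n_edges]
        dst_t = meta_path[step % n_edges + 1]
        neighbors = adj.get((src_t, dst_t), {}).get(node, [])
        if not neighbors:
            break
        node = random.choice(neighbors)
        walk.append(node)
    return walk

# Toy usage with a hypothetical user-movie bipartite graph:
adj = {
    ("user", "movie"): {"u1": ["m1", "m2"], "u2": ["m1"]},
    ("movie", "user"): {"m1": ["u1", "u2"], "m2": ["u1"]},
}
print(meta_path_walk(adj, "u1", ["user", "movie", "user"], walk_length=5))
```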

768 citations


Journal ArticleDOI
TL;DR: A high quality, instructive review of current research developments and trends in the concept drift field is conducted, and a framework of learning under concept drift is established including three main components: concept drift detection, concept drift understanding, and concept drift adaptation.
Abstract: Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary to conduct a high quality, instructive review of current research developments and trends in the concept drift field. In addition, due to the rapid development of concept drift research in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in the literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper also lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aimed at handling concept drift. Concept drift related research directions are also covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
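
To make the detection component concrete, here is a generic, simplified sketch of an error-rate based drift detector of the kind such surveys cover: it flags drift when the error rate in a recent window rises well above the error rate in a reference window. The window sizes and margin are arbitrary assumptions, and this is not any specific detector from the paper.

```python
# Generic, simplified error-rate drift detector: compare a reference window of
# prediction errors with the most recent window and flag drift when the recent
# error rate exceeds the reference rate by a chosen margin. Window sizes and
# the margin are illustrative assumptions.
from collections import deque

class SimpleDriftDetector:
    def __init__(self, window=100, margin=0.1):
        self.reference = deque(maxlen=window)   # older errors (0/1)
        self.recent = deque(maxlen=window)      # newest errors (0/1)
        self.margin = margin

    def add_error(self, is_error):
        if len(self.recent) == self.recent.maxlen:
            self.reference.append(self.recent.popleft())
        self.recent.append(int(is_error))

    def drift_detected(self):
        if len(self.recent) < self.recent.maxlen or not self.reference:
            return False
        recent_rate = sum(self.recent) / len(self.recent)
        reference_rate = sum(self.reference) / len(self.reference)
        return recent_rate > reference_rate + self.margin
```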

557 citations


Journal ArticleDOI
TL;DR: Multi-view representation learning has become a rapidly growing direction in machine learning and data mining, and this paper surveys it from two perspectives: multi-view representation alignment and multi-view representation fusion.
Abstract: Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining. This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion. Accordingly, we first review the representative methods and theories of multi-view representation learning from the perspective of alignment, such as correlation-based alignment; representative examples are canonical correlation analysis (CCA) and its several extensions. Then, from the perspective of representation fusion, we investigate the advancement of multi-view representation learning that ranges from generative methods, including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods, including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural networks. Further, we also investigate several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of the theoretical foundation and state-of-the-art developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
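
For reference, the correlation-based alignment that CCA performs can be written as the following standard objective (a textbook formulation, not taken from the survey itself), where $\Sigma_{xx}$, $\Sigma_{yy}$, and $\Sigma_{xy}$ are the within-view and cross-view covariance matrices of the two views:

```latex
(\mathbf{w}_x^{*}, \mathbf{w}_y^{*})
  = \arg\max_{\mathbf{w}_x,\, \mathbf{w}_y}
    \frac{\mathbf{w}_x^{\top}\Sigma_{xy}\mathbf{w}_y}
         {\sqrt{\mathbf{w}_x^{\top}\Sigma_{xx}\mathbf{w}_x}\,
          \sqrt{\mathbf{w}_y^{\top}\Sigma_{yy}\mathbf{w}_y}}
```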

328 citations


Journal ArticleDOI
TL;DR: Some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines are discussed.
Abstract: Geosciences is a field of great societal relevance that requires solutions to several urgent problems facing humanity and the planet. As geosciences enters the era of big data, machine learning (ML), which has been widely successful in commercial domains, offers immense potential to contribute to problems in geosciences. However, geoscience applications introduce novel challenges for ML due to combinations of geoscience properties encountered in every problem, requiring novel research in machine learning. This article introduces researchers in the ML community to these challenges posed by geoscience problems and the opportunities that exist for advancing both machine learning and geosciences. We first highlight typical sources of geoscience data and describe their common properties. We then describe some of the common categories of geoscience problems where machine learning can play a role, discussing the challenges faced by existing ML methods and opportunities for novel ML research. We conclude by discussing some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines.

290 citations


Journal ArticleDOI
TL;DR: The results highlight gaps and unexplored tradeoffs in the field, including the lack of scalability of some methods and a strong divergence in their performance with respect to the different quality metrics used.
Abstract: Process mining allows analysts to exploit logs of historical executions of business processes to extract insights regarding the actual performance of these processes. One of the most widely studied process mining operations is automated process discovery. An automated process discovery method takes as input an event log, and produces as output a business process model that captures the control-flow relations between tasks that are observed in or implied by the event log. Various automated process discovery methods have been proposed in the past two decades, striking different tradeoffs between scalability, accuracy, and complexity of the resulting models. However, these methods have been evaluated in an ad-hoc manner, employing different datasets, experimental setups, evaluation measures, and baselines, often leading to incomparable conclusions and sometimes unreproducible results due to the use of closed datasets. This article provides a systematic review and comparative evaluation of automated process discovery methods, using an open-source benchmark and covering 12 publicly-available real-life event logs, 12 proprietary real-life event logs, and nine quality metrics. The results highlight gaps and unexplored tradeoffs in the field, including the lack of scalability of some methods and a strong divergence in their performance with respect to the different quality metrics used.

225 citations


Journal ArticleDOI
TL;DR: This paper argues that for NB highly predictive features should be highly correlated with the class, yet uncorrelated with other features (minimum mutual redundancy), and proposes a correlation-based feature weighting (CFW) filter for NB.
Abstract: Due to its simplicity, efficiency, and efficacy, naive Bayes (NB) has continued to be one of the top 10 algorithms in the data mining and machine learning community. Of the numerous approaches to alleviating its conditional independence assumption, feature weighting has placed more emphasis on highly predictive features than on those that are less predictive. In this paper, we argue that for NB highly predictive features should be highly correlated with the class (maximum mutual relevance), yet uncorrelated with other features (minimum mutual redundancy). Based on this premise, we propose a correlation-based feature weighting (CFW) filter for NB. In CFW, the weight for a feature is a sigmoid transformation of the difference between the feature-class correlation (mutual relevance) and the average feature-feature intercorrelation (average mutual redundancy). Experimental results show that NB with CFW significantly outperforms NB and all the other existing state-of-the-art feature weighting filters used for comparison. Compared to feature weighting wrappers for improving NB, the main advantages of CFW are its low computational complexity (no search involved) and the fact that it maintains the simplicity of the final model. In addition, we apply CFW to text classification and achieve remarkable improvements.
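
The weighting rule described above, a sigmoid of mutual relevance minus average mutual redundancy, can be sketched as follows, using mutual information between discretized features as the correlation measure; the exact correlation measure and normalization in CFW may differ, so treat this as an assumption-laden illustration rather than the paper's implementation.

```python
# Illustrative sketch of correlation-based feature weights: sigmoid of
# (feature-class relevance minus average feature-feature redundancy).
# Mutual information on discrete features stands in for the correlation
# measure; CFW's exact measure may differ.
import numpy as np
from sklearn.metrics import mutual_info_score

def cfw_like_weights(X, y):
    """X: 2-D array of discrete feature values, y: class labels."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, i], y) for i in range(n_features)])
    redundancy = np.zeros(n_features)
    for i in range(n_features):
        others = [mutual_info_score(X[:, i], X[:, j])
                  for j in range(n_features) if j != i]
        redundancy[i] = np.mean(others) if others else 0.0
    return 1.0 / (1.0 + np.exp(-(relevance - redundancy)))   # sigmoid transform
```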

179 citations


Journal ArticleDOI
TL;DR: A Low-rank Sparse Subspace (LSS) clustering method via dynamically learning the affinity matrix from low-dimensional space of the original data is proposed, which outperforms the state-of-the-art clustering methods.
Abstract: Traditional graph clustering methods consist of two sequential steps, i.e., constructing an affinity matrix from the original data and then performing spectral clustering on the resulting affinity matrix. This two-step strategy achieves an optimal solution for each step separately, but cannot guarantee that it will obtain the globally optimal clustering result. Moreover, an affinity matrix learned directly from the original data can seriously affect the clustering performance, since high-dimensional data are usually noisy and may contain redundancy. To address the above issues, this paper proposes a Low-rank Sparse Subspace (LSS) clustering method that dynamically learns the affinity matrix from a low-dimensional space of the original data. Specifically, we learn a transformation matrix to project the original data to their low-dimensional space, by conducting feature selection and subspace learning in the sample self-representation framework. Then, we utilize the rank constraint and the affinity matrix directly obtained from the original data to construct a dynamic and intrinsic affinity matrix. Each of these three matrices is updated iteratively while fixing the other two. In this way, the affinity matrix learned from the low-dimensional space yields the final clustering result. Extensive experiments conducted on both synthetic and real datasets show that our proposed LSS method outperforms state-of-the-art clustering methods.

148 citations


Journal ArticleDOI
TL;DR: A novel One-step Multi-view Spectral Clustering (OMSC) method that outputs the common affinity matrix as the final clustering result is proposed, along with an iterative optimization method to efficiently solve the proposed objective function.
Abstract: Previous multi-view spectral clustering methods follow a two-step strategy, which first learns a fixed common representation (or common affinity matrix) of all the views from the original data and then conducts k-means clustering on the resulting common affinity matrix. The two-step strategy is unable to output reasonable clustering performance since the goal of the first step (i.e., the common affinity matrix learning) is not designed for achieving the optimal clustering result. Moreover, the two-step strategy learns the common affinity matrix from the original data, which often contain noise and redundancy that degrade the quality of the common affinity matrix. To address these issues, in this paper, we design a novel One-step Multi-view Spectral Clustering (OMSC) method that outputs the common affinity matrix as the final clustering result. In the proposed method, the goal of the common affinity matrix learning is designed to achieve the optimal clustering result, and the common affinity matrix is learned from low-dimensional data where the noise and redundancy of the original high-dimensional data have been removed. We further propose an iterative optimization method to efficiently solve the proposed objective function. Experimental results on both synthetic and public datasets validate the effectiveness of our proposed method compared to state-of-the-art methods for multi-view clustering.

142 citations


Journal ArticleDOI
TL;DR: The proposed method is based on the assumption that the intrinsic underlying graph structure would assign the corresponding connected components in each graph to the same cluster, and it obtains better clustering performance than state-of-the-art methods.
Abstract: Most existing multiview clustering methods take graphs, which are usually predefined independently in each view, as input to uncover the data distribution. These methods ignore the correlation of graph structure among multiple views, and their clustering results highly depend on the quality of the predefined affinity graphs. We address the problem of multiview clustering by seamlessly integrating the graph structures of different views to fully exploit the geometric property of the underlying data structure. The proposed method is based on the assumption that the intrinsic underlying graph structure would assign the corresponding connected components in each graph to the same cluster. Different graphs from multiple views are integrated by using the Hadamard product, since different views usually admit the same underlying structure. Specifically, these graphs are integrated into a global one, and the structure of the global graph is adaptively tuned by a well-designed objective function so that the number of connected components of the graph is exactly equal to the number of clusters. It is worth noting that we directly obtain cluster indicators from the graph itself without performing further graph-cut or $k$-means clustering algorithms. Experiments show the proposed method obtains better clustering performance than state-of-the-art methods.
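
The property this objective enforces, that cluster indicators can be read directly off the graph, rests on a standard spectral fact: the number of connected components of a graph equals the multiplicity of the zero eigenvalue of its Laplacian. Below is a small, self-contained check of that fact, not code from the paper.

```python
# Standard spectral-graph fact used by such one-step methods: the multiplicity
# of eigenvalue 0 of the unnormalized graph Laplacian equals the number of
# connected components, so a graph tuned to have exactly c components directly
# encodes c clusters. Purely illustrative, not the paper's algorithm.
import numpy as np

def n_components_from_affinity(W, tol=1e-8):
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    eigvals = np.linalg.eigvalsh(L)       # symmetric, so eigvalsh is safe
    return int(np.sum(eigvals < tol))

# Two disjoint edges -> two components (and hence two clusters):
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
print(n_components_from_affinity(W))      # prints 2
```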

126 citations


Journal ArticleDOI
TL;DR: It is proved that the trace optimization of multi-layer modularity density is equivalent to the objective functions of algorithms such as kernel K-means, nonnegative matrix factorization (NMF), spectral clustering, and multi-view clustering for multi-layer networks, which serves as the theoretical foundation for designing community detection algorithms.
Abstract: Many complex systems are composed of coupled networks through different layers, where each layer represents one of many possible types of interactions. A fundamental question is how to extract communities in multi-layer networks. Current algorithms either collapse multi-layer networks into a single-layer network or extend the algorithms for single-layer networks by using consensus clustering. However, these approaches have been criticized for ignoring the connection among various layers, thereby resulting in low accuracy. To address this problem, a quantitative function (multi-layer modularity density) is proposed for community detection in multi-layer networks. Afterward, we prove that the trace optimization of multi-layer modularity density is equivalent to the objective functions of algorithms such as kernel $K$-means, nonnegative matrix factorization (NMF), spectral clustering, and multi-view clustering for multi-layer networks, which serves as the theoretical foundation for designing algorithms for community detection. Furthermore, a Semi-Supervised joint Nonnegative Matrix Factorization algorithm (S2-jNMF) is developed by simultaneously factorizing matrices that are associated with multi-layer networks. Unlike traditional semi-supervised algorithms, the partial supervision is integrated into the objective of the S2-jNMF algorithm. Finally, through extensive experiments on both artificial and real-world networks, we demonstrate that the proposed method outperforms state-of-the-art approaches for community detection in multi-layer networks.

Journal ArticleDOI
TL;DR: A stable three-order tensor is first constructed from the normalized image, so as to enhance the robustness of the TD hashing, where image hash generation is viewed as deriving a compact representation from a tensor.
Abstract: This paper presents a new image hashing designed with tensor decomposition (TD), referred to as TD hashing, in which image hash generation is viewed as deriving a compact representation from a tensor. Specifically, a stable three-order tensor is first constructed from the normalized image, so as to enhance the robustness of our TD hashing. A popular TD algorithm, called Tucker decomposition, is then exploited to decompose the three-order tensor into a core tensor and three orthogonal factor matrices. As the factor matrices reflect the intrinsic structure of the original tensor, constructing the hash from the factor matrices gives the TD hashing desirable discrimination. To examine these claims, 14,551 images are selected for our experiments. A receiver operating characteristics (ROC) graph is used to conduct theoretical analysis, and the ROC comparisons illustrate that the TD hashing outperforms some state-of-the-art algorithms in classification performance between robustness and discrimination.
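
To give a sense of the Tucker decomposition step named above, here is a minimal sketch using the tensorly library (assumed to be available): a third-order tensor standing in for the normalized image is factored into a core tensor and three factor matrices. The tensor construction, ranks, and any hash quantization are illustrative assumptions, not the paper's TD hashing pipeline.

```python
# Minimal illustration of Tucker decomposition on a third-order tensor using
# tensorly (assumed available). The tensor, ranks, and how a hash would be
# quantized from the factor matrices are illustrative assumptions.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

image_tensor = tl.tensor(np.random.rand(64, 64, 3))        # stand-in for a normalized image
core, factors = tucker(image_tensor, rank=[8, 8, 3])        # core tensor + three factor matrices
print(core.shape, [f.shape for f in factors])
```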

Journal ArticleDOI
TL;DR: A unified framework for nonuniform embedding, dynamical system revealing, and time series prediction, termed as Structured Manifold Broad Learning System (SM-BLS), which provides a homogeneous way to recover the chaotic attractor from multivariate and heterogeneous time series.
Abstract: High-dimensional and large-scale time series processing has attracted considerable research interest over recent decades. It is difficult for traditional methods to reveal the evolution state of dynamical systems and to discover the relationships among variables automatically. In this paper, we propose a unified framework for nonuniform embedding, dynamical system revealing, and time series prediction, termed the Structured Manifold Broad Learning System (SM-BLS). Structured manifold learning is introduced for nonuniform embedding and unsupervised manifold learning simultaneously. Graph embedding and feature selection are both considered to depict the intrinsic structural connections between a chaotic time series and its low-dimensional manifold. Compared with traditional methods, the proposed framework can discover potential deterministic evolution information of dynamical systems and make the modeling more interpretable. It provides a homogeneous way to recover the chaotic attractor from multivariate and heterogeneous time series. Simulation analysis and results show that SM-BLS has advantages in dynamics discovery and feature extraction for large-scale chaotic time series prediction.

Journal ArticleDOI
TL;DR: This paper presents a novel real-time nonparametric change point detection algorithm called SEP, which uses Separation distance as a divergence measure to detect change points in high-dimensional time series.
Abstract: Change Point Detection (CPD) is the problem of discovering time points at which the behavior of a time series changes abruptly. In this paper, we present a novel real-time nonparametric change point detection algorithm called SEP, which uses Separation distance as a divergence measure to detect change points in high-dimensional time series. Through experiments on artificial and real-world datasets, we demonstrate the usefulness of the proposed method in comparison with existing methods.
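
As a rough illustration of divergence-based change point detection in the spirit of the abstract (but not SEP's actual estimator), the sketch below compares two adjacent windows with a pluggable divergence function and flags a change when it exceeds a threshold; the window size, threshold, and divergence are placeholders.

```python
# Generic sliding-window change point detection: flag a change when a chosen
# divergence between two adjacent windows exceeds a threshold. The window
# size, threshold, and divergence function are placeholders; SEP's
# separation-distance estimator and real-time machinery are not reproduced.
import numpy as np

def detect_changes(series, window, divergence, threshold):
    change_points = []
    for t in range(window, len(series) - window + 1):
        past = series[t - window:t]
        future = series[t:t + window]
        if divergence(past, future) > threshold:
            change_points.append(t)
    return change_points

# Example with a crude divergence (absolute difference of window means):
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])
mean_shift = lambda a, b: abs(np.mean(a) - np.mean(b))
print(detect_changes(series, window=50, divergence=mean_shift, threshold=1.5)[:3])
```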

Journal ArticleDOI
TL;DR: Ten massiveness characteristics for big knowledge and big-knowledge systems, including massive concepts, connectedness, clean data resources, cases, confidence, capabilities, cumulativeness, concerns, consistency, and completeness, are defined and explored.
Abstract: After entering the big data era, the new term ‘big knowledge’ has been coined to address the challenges of mining a mass of knowledge from big data. While researchers have explored the basic characteristics of big data, we have not seen any studies on the general and essential properties of big knowledge. To fill this gap, this paper studies the concepts of big knowledge, big-knowledge systems, and big-knowledge engineering. Ten massiveness characteristics of big knowledge and big-knowledge systems, including massive concepts, connectedness, clean data resources, cases, confidence, capabilities, cumulativeness, concerns, consistency, and completeness, are defined and explored. Based on these characteristics, a comprehensive investigation is conducted on some large-scale knowledge engineering projects, including the Fifth Comprehensive Traffic Survey in Shanghai, China's Xia-Shang-Zhou Chronology Project, the Troy and Trojan War Project, and the International Human Genome Project, as well as the online free encyclopedia Wikipedia. We also investigate recent research efforts on knowledge graphs, analyzing which of them can be considered big knowledge and big-knowledge systems. Further, a definition of big-knowledge engineering and its life cycle paradigm is presented. All of these projects are accordingly checked to determine whether they belong to big-knowledge engineering projects. Finally, perspectives on big knowledge research are discussed.

Journal ArticleDOI
TL;DR: This work proposes a unified NRL framework by introducing community information of vertices, named as Community-enhanced Network Representation Learning (CNRL), which simultaneously detects community distribution of each vertex and learns embeddings of both vertices and communities.
Abstract: Network representation learning (NRL) aims to learn low-dimensional vectors for vertices in a network. Most existing NRL methods focus on learning representations from local context of vertices (such as their neighbors). Nevertheless, vertices in many complex networks also exhibit significant global patterns widely known as communities. It's intuitive that vertices in the same community tend to connect densely and share common attributes. These patterns are expected to improve NRL and benefit relevant evaluation tasks, such as link prediction and vertex classification. Inspired by the analogy between network representation learning and text modeling, we propose a unified NRL framework by introducing community information of vertices, named as Community-enhanced Network Representation Learning (CNRL). CNRL simultaneously detects community distribution of each vertex and learns embeddings of both vertices and communities. Moreover, the proposed community enhancement mechanism can be applied to various existing NRL models. In experiments, we evaluate our model on vertex classification, link prediction, and community detection using several real-world datasets. The results demonstrate that CNRL significantly and consistently outperforms other state-of-the-art methods while verifying our assumptions on the correlations between vertices and communities.

Journal ArticleDOI
TL;DR: This paper proposes a new formulation of linear discriminant analysis via joint $L_{2,1}$-norm minimization on the objective function to induce robustness, so as to efficiently alleviate the influence of outliers and improve the robustness of the proposed method.
Abstract: Dimensionality reduction is a critical technology in the domain of pattern recognition, and linear discriminant analysis (LDA) is one of the most popular supervised dimensionality reduction methods. However, whenever the distance criterion of its objective function uses the $L_2$-norm, it is sensitive to outliers. In this paper, we propose a new formulation of linear discriminant analysis via joint $L_{2,1}$-norm minimization on the objective function to induce robustness, so as to efficiently alleviate the influence of outliers and improve the robustness of the proposed method. An efficient iterative algorithm is proposed to solve the optimization problem and is proved to be convergent. Extensive experiments are performed on an artificial data set, on UCI data sets, and on four face data sets, which sufficiently demonstrate the efficiency of our approach compared to other methods and its robustness to outliers.
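
For clarity, the $L_{2,1}$-norm of a matrix $M \in \mathbb{R}^{n \times d}$, the quantity minimized jointly in the formulation above, has the standard definition below (row-wise $L_2$ norms summed with an $L_1$ norm); because each row's deviation enters linearly rather than squared, large outlier rows contribute less than under a squared $L_2$ criterion:

```latex
\|M\|_{2,1} \;=\; \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{d} m_{ij}^{2}}
          \;=\; \sum_{i=1}^{n} \|m_{i\cdot}\|_{2}
```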

Journal ArticleDOI
TL;DR: In this paper, the authors generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanism based on the global sensitivity of statistical queries and compare the utility of sanitized results in the tail probability and dispersion between the Gaussian and Laplace mechanisms.
Abstract: Assessment of disclosure risk is of paramount importance in data privacy research and applications. The concept of differential privacy (DP) formalizes privacy in probabilistic terms and provides a robust concept for privacy protection. Practical applications of DP involve the development of DP mechanisms to release data at a pre-specified privacy budget. In this paper, we generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanisms based on the $l_p$ global sensitivity of statistical queries. We explore the theoretical requirements for the GG mechanism to reach DP at prespecified privacy parameters, and investigate the connections and differences between the GG mechanism and the Exponential mechanism based on the GG distribution. We also present a lower bound on the scale parameter of the Gaussian mechanism of $(\epsilon, \delta)$-probabilistic DP as a special case of the GG mechanism, and compare the utility of sanitized results in the tail probability and dispersion between the Gaussian and Laplace mechanisms. Lastly, we apply the GG mechanism in three experiments and compare the accuracy of sanitized results in the $l_1$ distance and Kullback-Leibler divergence, and examine the prediction power of an SVM classifier constructed with the sanitized data relative to the original results.
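
As background (a standard textbook form, not reproduced from the paper), the generalized Gaussian density with location $\mu$, scale $b$, and shape $p$ is given below; $p=1$ recovers the Laplace distribution underlying the Laplace mechanism and $p=2$ the Gaussian distribution, which is why a mechanism adding GG noise generalizes both:

```latex
f(x \mid \mu, b, p) \;=\; \frac{p}{2\,b\,\Gamma(1/p)}
  \exp\!\left(-\left(\frac{|x-\mu|}{b}\right)^{p}\right)
```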

Journal ArticleDOI
TL;DR: An improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor Node Influence), is proposed, which detects overlapping community structures by adopting a fixed label propagation sequence based on the ascending order of node importance and a label update strategy based on neighbor node influence and a historical-label preference strategy.
Abstract: Overlapping community structure is a significant feature of large-scale complex networks. Some existing community detection algorithms cannot be applied to large-scale complex networks due to their high time or space complexity. Label propagation algorithms have been proposed for detecting communities in large-scale networks because of their linear time complexity; however, most of them can only detect non-overlapping communities, or their results are inaccurate and unstable. To address these shortcomings, we propose an improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor Node Influence), which detects overlapping community structures by adopting a fixed label propagation sequence based on the ascending order of node importance and a label update strategy based on neighbor node influence and a historical-label preference strategy. Extensive experimental results on both real networks and synthetic networks show that LPANNI can significantly improve the accuracy and stability of community detection algorithms based on label propagation in large-scale complex networks. Meanwhile, LPANNI can detect overlapping community structures in large-scale complex networks in linear time complexity.
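
As a simplified illustration of label propagation with a fixed update order (ascending node importance, crudely approximated here by degree), the sketch below assigns each node the most frequent label among its neighbors; LPANNI's neighbor-influence weighting, multi-label (overlapping) memberships, and historical-label preference are deliberately omitted.

```python
# Simplified label propagation with a fixed update order (ascending node
# "importance", approximated here by degree). Each node repeatedly adopts the
# most frequent label among its neighbors. LPANNI's neighbor-influence weights,
# overlapping memberships, and historical-label preference are omitted.
from collections import Counter
import networkx as nx

def simple_label_propagation(G, max_iter=20):
    labels = {v: v for v in G.nodes}                 # every node starts in its own community
    order = sorted(G.nodes, key=G.degree)            # fixed ascending-importance order
    for _ in range(max_iter):
        changed = False
        for v in order:
            if G.degree(v) == 0:
                continue
            counts = Counter(labels[u] for u in G.neighbors(v))
            best = counts.most_common(1)[0][0]
            if labels[v] != best:
                labels[v], changed = best, True
        if not changed:
            break
    return labels

# Example: two cliques joined by one edge separate into two communities.
G = nx.barbell_graph(5, 0)
print(simple_label_propagation(G))
```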

Journal ArticleDOI
TL;DR: DeepClue is presented, a system built to bridge text-based deep learning models and end users through visually interpreting the key factors learned in the stock price prediction model by designing the deep neural network architecture for interpretation and applying an algorithm to extract relevant predictive factors.
Abstract: The recent advance of deep learning has enabled trading algorithms to predict stock price movements more accurately. Unfortunately, there is a significant gap in the real-world deployment of this breakthrough. For example, professional traders in their long-term careers have accumulated numerous trading rules, which they can understand quite well. Deep learning models, on the other hand, have been hardly interpretable. This paper presents DeepClue, a system built to bridge text-based deep learning models and end users through visually interpreting the key factors learned in the stock price prediction model. We make three contributions in DeepClue. First, by designing the deep neural network architecture for interpretation and applying an algorithm to extract relevant predictive factors, we provide a useful case on what can be interpreted out of the prediction model for end users. Second, by exploring hierarchies over the extracted factors and displaying these factors in an interactive, hierarchical visualization interface, we shed light on how to effectively communicate the interpreted model to end users. Specifically, the interpretation separates the predictables from the unpredictables for stock prediction through the use of intercept model parameters and a risk visualization design. Third, we evaluate the integrated visualization system through two case studies in predicting the stock price with financial news and company-related tweets from social media. Quantitative experiments comparing the proposed neural network architecture with state-of-the-art models and the human baseline are conducted and reported. Feedback from an informal user study with domain experts is summarized and discussed in detail. The study results demonstrate the effectiveness of DeepClue in helping to complete stock market investment and analysis tasks.

Journal ArticleDOI
TL;DR: A novel parallel collaborative (PCol) search method based on a divide-and-conquer strategy is proposed, and an upper bound on the spatiotemporal correlation and a heuristic scheduling strategy are developed to prune the search space.
Abstract: The matching between trajectories and locations, called Trajectory-to-Location join (TL-Join), is a fundamental functionality in spatiotemporal data management. Given a set of trajectories, a set of locations, and a threshold $\theta$, the TL-Join finds all (trajectory, location) pairs from the two sets with spatiotemporal correlation above $\theta$. This join targets diverse applications, including location recommendation, event tracking, and trajectory activity analyses. We address three challenges in relation to the TL-Join: how to define the spatiotemporal correlation between trajectories and locations, how to prune the search space effectively when computing the join, and how to perform the computation in parallel. Specifically, we define new metrics to measure the spatiotemporal correlation between trajectories and locations. We develop a novel parallel collaborative (PCol) search method based on a divide-and-conquer strategy. For each location $o$, we retrieve the trajectories with high spatiotemporal correlation to $o$, and then we merge the results. An upper bound on the spatiotemporal correlation and a heuristic scheduling strategy are developed to prune the search space. The trajectory searches from different locations are independent and are performed in parallel, and the result merging cost is independent of the degree of parallelism. Studies of the performance of the developed algorithms using large spatiotemporal data sets are reported.

Journal ArticleDOI
TL;DR: This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment to tf-idf that accounts for this bias.
Abstract: Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing, a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to compute these weights. This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can manifest in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.
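
For reference, the unadjusted tf-idf weight that the paper starts from is typically written as below, where $\mathrm{tf}(t, d)$ counts occurrences of term $t$ in document $d$ and $N$ is the number of documents in the collection $D$; the paper's discourse-correlation adjustment is not reproduced here:

```latex
\mathrm{tfidf}(t, d, D) \;=\; \mathrm{tf}(t, d)\times
  \log\frac{N}{\left|\{\,d' \in D : t \in d'\,\}\right|}
```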

Journal ArticleDOI
TL;DR: This paper presents a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models, and evaluates it using the MNIST dataset, showing that it produces realistic synthetic samples, which can also be used to accurately compute arbitrary number of counting queries.
Abstract: Generative models are used in a wide range of applications building on large amounts of contextually rich information. Due to possible privacy violations of the individuals whose data is used to train these models, however, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models. We model the generator distribution of the training data with a mixture of $k$ generative neural networks. These are trained together and collectively learn the generator distribution of a dataset. Data is divided into $k$ clusters, using a novel differentially private kernel $k$-means, then each cluster is given to a separate generative neural network, such as a Restricted Boltzmann Machine or Variational Autoencoder, which is trained only on its own cluster using differentially private gradient descent. We evaluate our approach using the MNIST dataset, as well as call detail records and transit datasets, showing that it produces realistic synthetic samples, which can also be used to accurately compute an arbitrary number of counting queries.
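
To illustrate the differentially private gradient descent ingredient mentioned above, the sketch below shows the usual core step of DP-SGD style training: clipping each per-example gradient and adding Gaussian noise scaled to the clipping norm. The clip norm, noise multiplier, and the accounting needed to reach a target privacy budget are assumptions not taken from the paper.

```python
# Core step of differentially private gradient descent (DP-SGD style): bound
# each per-example gradient's L2 norm (its sensitivity), then add Gaussian
# noise proportional to that bound before averaging. The clip norm and noise
# multiplier are placeholders; the paper's exact calibration may differ.
import numpy as np

rng = np.random.default_rng(0)

def dp_average_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

# Toy usage with three fake per-example gradients:
grads = [np.array([0.5, -2.0]), np.array([3.0, 1.0]), np.array([-0.2, 0.1])]
print(dp_average_gradient(grads))
```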

Journal ArticleDOI
TL;DR: This paper analyzes the privacy leakage of a traditional DP mechanism under temporal correlation that can be modeled using Markov Chain, and reveals that, the event-level privacy loss of a DP mechanism may increase over time.
Abstract: Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of the correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call this unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.

Journal ArticleDOI
TL;DR: An empirical evaluation on both synthetic and real-world datasets shows that the proposed PrivRank framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized ranking-based recommendation.
Abstract: Personalized recommendation is crucial to help users find pertinent information. It often relies on a large collection of user data, in particular users’ online activity (e.g., tagging/rating/checking-in) on social media, to mine user preferences. However, releasing such user activity data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the users’ activity data. In this paper, we propose PrivRank, a customizable and continuous privacy-preserving social media data publishing framework that protects users against inference attacks while enabling personalized ranking-based recommendations. Its key idea is to continuously obfuscate user activity data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which bounds the ranking loss incurred by the data obfuscation process in order to preserve the utility of the data for enabling recommendations. An empirical evaluation on both synthetic and real-world datasets shows that our framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized ranking-based recommendation. Compared to state-of-the-art approaches, PrivRank achieves both better privacy protection and higher utility in all the ranking-based recommendation use cases we tested.

Journal ArticleDOI
TL;DR: This paper develops a baseline algorithm based on the concept of D-core, and proposes three index structures and corresponding query algorithms for CS on directed graphs; experimental results show that the solutions are very effective and efficient.
Abstract: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS), extracting a dense subgraph containing a query vertex $q$ from a graph, has received great attention. However, existing CS solutions are designed for undirected graphs and overlook the directions of edges, potentially losing useful information carried by directions. In many applications (e.g., Twitter), users’ relationships are often modeled as directed graphs (e.g., if a user $a$ follows another user $b$, then there is an edge from $a$ to $b$). In this paper, we study the problem of CS on directed graphs. Given a vertex $q$ of a graph $G$, we aim to find a densely connected subgraph containing $q$ from $G$, in which vertices have strong interactions and high similarities, by using the minimum in/out-degree metric. We first develop a baseline algorithm based on the concept of D-core. We further propose three index structures and corresponding query algorithms. Our experimental results on seven real graphs show that our solutions are very effective and efficient. For example, on a graph with over 1 billion edges, we only need around 40 minutes to index it and 1$\sim$2 seconds to answer a query.
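
As a rough illustration of the D-core concept the baseline builds on, the sketch below peels a directed graph down to its $(k, l)$-core, the maximal subgraph in which every remaining vertex has in-degree at least $k$ and out-degree at least $l$; the paper's query algorithms and index structures go well beyond this and are not shown.

```python
# Illustrative (k, l)-core peeling on a directed graph: repeatedly remove
# vertices whose in-degree < k or out-degree < l until none remain. This is
# only the basic D-core notion; the paper's indexes and query algorithms for
# community search are not reproduced here.
import networkx as nx

def d_core(G, k, l):
    H = G.copy()
    while True:
        to_remove = [v for v in H.nodes
                     if H.in_degree(v) < k or H.out_degree(v) < l]
        if not to_remove:
            return H
        H.remove_nodes_from(to_remove)

# Toy usage: a directed triangle survives as a (1, 1)-core.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 1)])
print(sorted(d_core(G, 1, 1).nodes))   # [1, 2, 3]
```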

Journal ArticleDOI
TL;DR: In this article, the authors propose a family of schema-agnostic Progressive Entity Resolution (ER) methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety.
Abstract: Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpora, etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce two naive schema-agnostic methods, showing that straightforward solutions exhibit poor performance that does not scale well to large volumes of data. Then, we propose four different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform, to a significant extent, both the naive and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on method selection.
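
One concrete way to work without schema knowledge is token blocking, a standard schema-agnostic blocking technique in this literature: every token appearing in any attribute value becomes a blocking key, so no attribute names are needed. The sketch below illustrates only that blocking step under assumed data structures; the progressive comparison ordering that the paper's methods add on top is not shown.

```python
# Schema-agnostic token blocking: group entity profiles by the tokens of any
# attribute value, ignoring attribute names entirely. The data layout is an
# assumption for illustration; the paper's progressive weighting/ordering of
# comparisons is not reproduced.
from collections import defaultdict

def token_blocking(profiles):
    """profiles: dict entity_id -> dict of attribute name -> value (any schema)."""
    blocks = defaultdict(set)
    for eid, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # keep only blocks that yield at least one comparison
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}

profiles = {
    "e1": {"name": "John Smith", "city": "Berlin"},
    "e2": {"fullName": "J. Smith", "location": "Berlin, DE"},
}
print(token_blocking(profiles))   # {'smith': {'e1', 'e2'}}
```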

Journal ArticleDOI
Zhe Jiang
TL;DR: A taxonomy of methods categorized by the key challenge they address is provided, to help interdisciplinary domain scientists choose techniques to solve their problems and to help data mining researchers understand the main principles and methods in spatial prediction and identify future research opportunities.
Abstract: With the advancement of GPS and remote sensing technologies, large amounts of geospatial data are being collected from various domains, driving the need for effective and efficient prediction methods. Given spatial data samples with explanatory features and targeted responses (categorical or continuous) at a set of locations, the spatial prediction problem aims to learn a model that can predict the response variable based on explanatory features. The problem is important with broad applications in earth science, urban informatics, geosocial media analytics, and public health, but is challenging due to the unique characteristics of spatial data, including spatial autocorrelation, heterogeneity, limited ground truth, and multiple scales and resolutions. This paper provides a systematic review on principles and methods in spatial prediction. We provide a taxonomy of methods categorized by the key challenge they address. For each method, we introduce its underlying assumption, theoretical foundation, and discuss its advantages and disadvantages. We also discuss spatiotemporal extensions of methods. Our goal is to help interdisciplinary domain scientists choose techniques to solve their problems, and more importantly, to help data mining researchers to understand the main principles and methods in spatial prediction and identify future research opportunities.

Journal ArticleDOI
TL;DR: Four tight average-utility upper-bounds, based on a vertical database representation, and three efficient pruning strategies are proposed to reduce the search space of itemsets.
Abstract: Mining High Average-Utility Itemsets (HAUIs) in a quantitative database is an extension of the traditional problem of frequent itemset mining, with several practical applications. Discovering HAUIs is more challenging than mining frequent itemsets using the traditional support model, since the average-utilities of itemsets do not satisfy the downward-closure property. To design algorithms for mining HAUIs that reduce the search space of itemsets, prior studies have proposed various upper-bounds on the average-utilities of itemsets. However, these algorithms can generate a huge number of unpromising HAUI candidates, which results in high memory consumption and long runtimes. To address this problem, this paper proposes four tight average-utility upper-bounds, based on a vertical database representation, and three efficient pruning strategies. Furthermore, a novel generic framework for comparing average-utility upper-bounds is presented. Based on these theoretical results, an efficient algorithm named dHAUIM is introduced for mining the complete set of HAUIs. dHAUIM represents the search space and quickly computes upper-bounds using a novel IDUL structure. Extensive experiments show that dHAUIM outperforms four state-of-the-art algorithms for mining HAUIs in terms of runtime on both real-life and synthetic databases. Moreover, the results show that the proposed pruning strategies dramatically reduce the number of candidate HAUIs.
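
For context, the quantity being mined is usually defined as below (the standard definition in the HAUI literature, stated here as background rather than quoted from the paper): the utility of an itemset $X$ is summed over the transactions of database $\mathcal{D}$ that contain it, and the average utility divides by the number of items in $X$, which is what breaks the downward-closure property:

```latex
u(X) \;=\; \sum_{T \in \mathcal{D},\; X \subseteq T} \;\sum_{i \in X} u(i, T),
\qquad
au(X) \;=\; \frac{u(X)}{|X|}
```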

Journal ArticleDOI
TL;DR: The problem of continuous SAC search on a “dynamic spatial graph,” whose vertices’ locations change with time, is studied, and three fast solutions are proposed.
Abstract: Communities are prevalent in social networks, knowledge graphs, and biological networks. Recently, the topic of community search (CS) has received plenty of attention. The CS problem aims to look for a dense subgraph that contains a query vertex. Existing CS solutions do not consider the spatial extent of a community. They can yield communities whose locations of vertices span large areas. In applications that facilitate setting social events (e.g., finding conference attendees to join a dinner), it is important to find groups of people who are physically close to each other, so it is desirable to have a spatial-aware community (or SAC), whose vertices are close structurally and spatially. Given a graph $G$ and a query vertex $q$, we develop an exact solution to find the SAC containing $q$, but it cannot scale to large datasets, so we design three approximation algorithms. We further study the problem of continuous SAC search on a “dynamic spatial graph,” whose vertices’ locations change with time, and propose three fast solutions. We evaluate the solutions on both real and synthetic datasets, and the results show that SACs are better than communities returned by existing solutions. Moreover, our approximation solutions perform accurately and efficiently.