Showing papers in "Data Mining and Knowledge Discovery in 2018"


Journal ArticleDOI
TL;DR: A novel scalable algorithm for time series subsequence all-pairs-similarity-search that computes the answer to the time series motif and time series discord problem as a side-effect and incidentally provides the fastest known algorithm for both these extensively-studied problems.
Abstract: The last decade has seen a flurry of research on all-pairs-similarity-search (or similarity joins) for text, DNA and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. However, there has been surprisingly little progress made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce only one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time and/or be accelerated by a trivial porting to a GPU framework. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for many time series data mining problems, including motif discovery, novelty discovery, shapelet discovery, semantic segmentation, density estimation, and contrast set mining. Moreover, we demonstrate the utility of our ideas on domains as diverse as seismology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring and medicine.

104 citations
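
Illustrative sketch (not the authors' algorithm): the abstract above contrasts its method with "the obvious nested-loop algorithm" for all-pairs similarity search. The Python below is a minimal version of that baseline, assuming z-normalized Euclidean distance and an exclusion zone of half the subsequence length to skip trivial self-matches. Both choices, and the names `naive_matrix_profile` and `motif_and_discord`, are assumptions made for illustration. The motif and discord then fall out as the smallest and largest nearest-neighbor distances.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def naive_matrix_profile(series, m):
    """Nested-loop all-pairs search over length-m subsequences.

    Returns (profile, index): profile[i] is the distance from subsequence i
    to its nearest non-trivial neighbor, index[i] is that neighbor's start.
    """
    n_sub = len(series) - m + 1
    excl = max(1, m // 2)  # assumed exclusion zone around each subsequence
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue  # skip trivial matches with overlapping neighbors
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index


def motif_and_discord(profile):
    """Start positions of the best motif member and the top discord."""
    finite = [(d, i) for i, d in enumerate(profile) if math.isfinite(d)]
    return min(finite)[1], max(finite)[1]


series = [0, 1, 2, 1, 0, 4, 7, 3, 8, 2, 0, 1, 2, 1, 0, 6, 1, 9, 2, 5]
profile, index = naive_matrix_profile(series, m=5)
motif, discord = motif_and_discord(profile)
```

The double loop costs on the order of n² distance computations of length m each, which is why the abstract describes this baseline as impractical for large datasets.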


Journal ArticleDOI
TL;DR: In this paper, the authors present Ditras (DIary-based TRAjectory Simulator), a framework to simulate the spatio-temporal patterns of human mobility, which operates in two steps: the generation of a mobility diary and the translation of the mobility diary into a mobility trajectory.
Abstract: The generation of realistic spatio-temporal trajectories of human mobility is of fundamental importance in a wide range of applications, such as the development of protocols for mobile ad-hoc networks or what-if analysis in urban ecosystems. Current generative algorithms fail to accurately reproduce individuals’ recurrent schedules while also accounting for the possibility that individuals may break the routine during periods of variable duration. In this article we present Ditras (DIary-based TRAjectory Simulator), a framework to simulate the spatio-temporal patterns of human mobility. Ditras operates in two steps: the generation of a mobility diary and the translation of the mobility diary into a mobility trajectory. We propose a data-driven algorithm which constructs a diary generator from real data, capturing the tendency of individuals to follow or break their routine. We also propose a trajectory generator based on the concept of preferential exploration and preferential return. We instantiate Ditras with the proposed diary and trajectory generators and compare the resulting algorithm with real data and synthetic data produced by other generative algorithms, built by instantiating Ditras with several combinations of diary and trajectory generators. We show that the proposed algorithm reproduces the statistical properties of real trajectories in the most accurate way, making a step forward in the understanding of the origin of the spatio-temporal patterns of human mobility.

84 citations


Journal ArticleDOI
TL;DR: This paper argues that concept drift mapping is an essential prerequisite for tackling concept drift and shift, and proposes tools for this purpose, emphasizing the importance of quantitative descriptions of drift and shift in marginal distributions.
Abstract: Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping—the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy-pricing, vegetation monitoring and airline scheduling.

79 citations


Journal ArticleDOI
TL;DR: The longer the time needed for the search, the higher the speedup ratio achieved by the method, and it is demonstrated that the method performs similarly to UCR suite for small queries and narrow warping constraints, but performs up to five times faster for long queries and large warping windows.
Abstract: Similarity search is the core procedure for several time series mining tasks. While different distance measures can be used for this purpose, there is clear evidence that the Dynamic Time Warping (DTW) is the most suitable distance function for a wide range of application domains. Despite its quadratic complexity, research efforts have proposed a significant number of pruning methods to speed up the similarity search under DTW. However, the search may still take a considerable amount of time depending on the parameters of the search, such as the length of the query and the warping window width. The main reason is that the current techniques for speeding up the similarity search focus on avoiding the costly distance calculation between as many pairs of time series as possible. Nevertheless, the few pairs of subsequences that were not discarded by the pruning techniques can represent a significant part of the entire search time. In this work, we adapt a recently proposed algorithm to improve the internal efficiency of the DTW calculation. Our method can speed up the UCR suite, considered the current fastest tool for similarity search under DTW. More important, the longer the time needed for the search, the higher the speedup ratio achieved by our method. We demonstrate that our method performs similarly to UCR suite for small queries and narrow warping constraints. However, it performs up to five times faster for long queries and large warping windows.

72 citations


Journal ArticleDOI
TL;DR: The importance of setting DTW’s warping window width correctly is demonstrated, and novel methods to learn this parameter in both supervised and unsupervised settings are proposed, which can produce significant improvements in classification accuracy and clustering quality.
Abstract: Dynamic Time Warping (DTW) is a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping (w). In the supervised case with ample data, w is typically set by cross-validation in the training stage. However, this method is likely to yield suboptimal results for small training sets. For the unsupervised case, learning via cross-validation is not possible because we do not have access to labeled data. Many practitioners have thus resorted to assuming that “the larger the better”, and they use the largest value of w permitted by the computational resources. However, as we will show, in most circumstances, this is a naive approach that produces inferior clusterings. Moreover, the best warping window width is generally non-transferable between the two tasks, i.e., for a single dataset, practitioners cannot simply apply the best w learned for classification to clustering or vice versa. In addition, we will demonstrate that the appropriate amount of warping depends not only on the data structure, but also on the dataset size. Thus, even if a practitioner knows the best setting for a given dataset, they will likely be at a loss if they apply that setting to a larger version of that data. All these issues seem largely unknown or at least unappreciated in the community. In this work, we demonstrate the importance of setting DTW’s warping window width correctly, and we also propose novel methods to learn this parameter in both supervised and unsupervised settings. The algorithms we propose to learn w can produce significant improvements in classification accuracy and clustering quality. We demonstrate the correctness of our novel observations and the utility of our ideas by testing them with more than one hundred publicly available datasets. Our forceful results allow us to make a perhaps unexpected claim: an underappreciated “low hanging fruit” in optimizing DTW’s performance can produce improvements that make it an even stronger baseline, closing most or all of the improvement gap of the more sophisticated methods proposed in recent years.

67 citations
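
Illustrative sketch (not the paper's learning method): the abstract above centers on DTW's single parameter, the warping window w, and notes that in the supervised case w is typically set by cross-validation. The Python below shows one common way to write that conventional baseline, assuming a band constraint |i - j| <= w, squared pointwise cost, and leave-one-out 1-nearest-neighbor accuracy as the selection score. The function names and the toy data are hypothetical.

```python
import math


def dtw_distance(a, b, w):
    """DTW distance between sequences a and b with warping window w.

    w is the maximum allowed |i - j| between aligned indices; for
    equal-length sequences, w = 0 reduces to the Euclidean distance.
    """
    n, m = len(a), len(b)
    w = max(w, abs(n - m))  # the band must be wide enough to reach the end cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def loocv_accuracy(series, labels, w):
    """Leave-one-out 1-NN classification accuracy under DTW with window w."""
    correct = 0
    for i, query in enumerate(series):
        best_d, best_label = math.inf, None
        for j, cand in enumerate(series):
            if i == j:
                continue
            d = dtw_distance(query, cand, w)
            if d < best_d:
                best_d, best_label = d, labels[j]
        correct += best_label == labels[i]
    return correct / len(series)


# Hypothetical toy data: two shifted peaks (class "A") and two near-flat lines ("B").
train = [
    [0, 0, 1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
train_labels = ["A", "A", "B", "B"]
scores = {w: loocv_accuracy(train, train_labels, w) for w in range(0, 4)}
```

With only a handful of training series the scores for different w are often tied or noisy, which is the small-training-set weakness of cross-validation that the abstract points out.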


Journal ArticleDOI
TL;DR: This paper presents a novel robust graph regularized NMF model (RGNMF) to approximate the data matrix for clustering and shows that the proposed method consistently outperforms many state-of-the-art methods.
Abstract: Nonnegative matrix factorization and its graph regularized extensions have received significant attention in machine learning and data mining. However, existing approaches are sensitive to outliers and noise due to the utilization of the squared loss function in measuring the quality of graph regularization and data reconstruction. In this paper, we present a novel robust graph regularized NMF model (RGNMF) to approximate the data matrix for clustering. Our assumption is that some entries of the data may be arbitrarily corrupted, but the corruption is sparse. To address this problem, an error matrix is introduced to capture the sparse corruption. With this sparse outlier matrix, a robust factorization result can be obtained since much cleaner data can be reconstructed. Moreover, the $\ell_1$-norm function is used to alleviate the influence of unreliable regularization which is incurred by unexpected graphs. That is, the sparse error matrix alleviates the impact of noise and outliers, and the $\ell_1$-norm function leads to a faithful regularization since the influence of the unreliable regularization errors can be reduced. Thus, RGNMF is robust to unreliable graphs and noisy data. In order to solve the optimization problem of our method, an iterative updating algorithm is proposed and its convergence is also guaranteed theoretically. Experimental results show that the proposed method consistently outperforms many state-of-the-art methods.

56 citations


Journal ArticleDOI
TL;DR: Theoretical analysis is presented to show the superiority of the proposed framework of online learning on streaming networks (OLSN), and extensive experiments on real-world networks further demonstrate the effectiveness and efficiency of the proposed OLSN framework.
Abstract: The proliferation of networked data in various disciplines motivates a surge of research interest in network or graph mining. Among them, node classification is a typical learning task that focuses on exploiting the node interactions to infer the missing labels of unlabeled nodes in the network. The vast majority of existing node classification algorithms focus on static networks and assume the whole network structure is readily available before performing learning algorithms. However, this is not the case in many real-world scenarios where new nodes and new links are continuously being added to the network. Considering the streaming nature of networks, we study how to perform online node classification on this kind of streaming network (a.k.a. online learning on streaming networks). As the existence of noisy links may negatively affect the node classification performance, we first present an online network embedding algorithm to alleviate this problem by obtaining the embedding representation of new nodes on the fly. Then we feed the learned embedding representation into a novel online soft margin kernel learning algorithm to predict the node labels in a sequential manner. Theoretical analysis is presented to show the superiority of the proposed framework of online learning on streaming networks (OLSN). Extensive experiments on real-world networks further demonstrate the effectiveness and efficiency of the proposed OLSN framework.

45 citations


Journal ArticleDOI
TL;DR: The boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner.
Abstract: Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show a good overall performance and do not seem to suffer from overfitting as traditional studies report. A variety of undersampling approaches are investigated as well and show the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast since it is parallelizable and each subset is only twice as large as the minority class size.

39 citations
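
Illustrative sketch (sampling step only): the abstract above describes EasyEnsemble as randomly sampling several balanced subsets, each "only twice as large as the minority class size", before boosting on each subset and combining the hypotheses. The Python below shows just that sampling step, assuming every subset keeps all minority instances and draws an equal number of majority instances without replacement. The function name `balanced_subsets`, the seed handling and the toy labels are hypothetical, and the boosting and combination stages are omitted.

```python
import random


def balanced_subsets(labels, n_subsets, minority_label, seed=0):
    """Return n_subsets lists of instance indices.

    Each list holds every minority instance plus an equally sized random
    sample of majority instances, so its length is twice the minority size.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if len(minority) > len(majority):
        raise ValueError("minority_label is not the minority class")
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))  # without replacement
        subsets.append(minority + sampled)
    return subsets


# Hypothetical toy labels: 3 positive (minority) and 20 negative instances.
labels = [1] * 3 + [0] * 20
subsets = balanced_subsets(labels, n_subsets=4, minority_label=1)
```

Because the subsets are drawn independently, the per-subset training can run in parallel, which matches the abstract's remark on speed.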


Journal ArticleDOI
TL;DR: The proposed statistical tests provide the unprecedented possibility to minimize the number of false positive biclusters without incurring false negatives, and to compare state-of-the-art biclustering algorithms according to the statistical significance of their outputs.
Abstract: Statistical evaluation of biclustering solutions is essential to guarantee the absence of spurious relations and to validate the high number of scientific statements inferred from unsupervised data analysis without a proper statistical ground. Most biclustering methods rely on merit functions to discover biclusters with specific homogeneity criteria. However, strong homogeneity does not guarantee the statistical significance of biclustering solutions. Furthermore, although some biclustering methods test the statistical significance of specific types of biclusters, there are no methods to assess the significance of flexible biclustering models. This work proposes a method to evaluate the statistical significance of biclustering solutions. It integrates state-of-the-art statistical views on the significance of local patterns and extends them with new principles to assess the significance of biclusters with additive, multiplicative, symmetric, order-preserving and plaid coherencies. The proposed statistical tests provide the unprecedented possibility to minimize the number of false positive biclusters without incurring false negatives, and to compare state-of-the-art biclustering algorithms according to the statistical significance of their outputs. Results on synthetic and real data support the soundness and relevance of the proposed contributions, and stress the need to combine significance and homogeneity criteria to guide the search for biclusters.

38 citations


Journal ArticleDOI
TL;DR: A new approach called x-PACS (for eXplaining Patterns of Anomalies with Characterizing Subspaces), which “reverse-engineers” the known anomalies by identifying the groups (or patterns) that they form, and the characterizing subspace and feature rules that separate each anomalous pattern from normal instances.
Abstract: Anomaly detection has numerous applications and has been studied extensively. We consider a complementary problem that has a much sparser literature: anomaly description. Interpretation of anomalies is crucial for practitioners for sense-making, troubleshooting, and planning actions. To this end, we present a new approach called x-PACS (for eXplaining Patterns of Anomalies with Characterizing Subspaces), which “reverse-engineers” the known anomalies by identifying (1) the groups (or patterns) that they form, and (2) the characterizing subspace and feature rules that separate each anomalous pattern from normal instances. Explaining anomalies in groups not only saves analyst time and gives insight into various types of anomalies, but also draws attention to potentially critical, repeating anomalies. In developing x-PACS, we first construct a set of desiderata for the anomaly description problem. From a descriptive data mining perspective, our method exhibits five desired properties in our desiderata. Namely, it can unearth anomalous patterns (i) of multiple different types, (ii) hidden in arbitrary subspaces of a high dimensional space, (iii) interpretable by human analysts, (iv) different from normal patterns of the data, and finally (v) succinct, providing a short data description. No existing work on anomaly description satisfies all of these properties simultaneously. Furthermore, x-PACS is highly parallelizable; it is linear in the number of data points and exponential in the (typically small) largest characterizing subspace size. The anomalous patterns that x-PACS finds constitute interpretable “signatures”, and while it is not our primary goal, they can be used for anomaly detection. Through extensive experiments on real-world datasets, we show the effectiveness and superiority of x-PACS in anomaly explanation over various baselines, and demonstrate its competitive detection performance as compared to the state-of-the-art.

35 citations


Journal ArticleDOI
TL;DR: This paper is concerned with the estimation of a local measure of intrinsic dimensionality recently proposed by Houle, and several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions.
Abstract: This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
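
Illustrative sketch (formula assumed, not quoted from the abstract): the abstract above lists maximum likelihood estimation among the proposed estimators of local ID but does not state the formula. A commonly cited form of the maximum-likelihood (Hill-type) estimator takes the k nearest-neighbor distances x_1, ..., x_k of a query point, lets w be the largest of them, and returns -1 / ((1/k) * sum of ln(x_i / w)). The Python below implements that form under this assumption. The name `mle_local_id` and the synthetic distances are hypothetical.

```python
import math


def mle_local_id(distances):
    """Hill-type maximum-likelihood estimate of local intrinsic dimensionality.

    distances: positive distances from a query point to its k nearest
    neighbors. Zero distances (duplicates of the query) are ignored.
    """
    dists = [d for d in distances if d > 0.0]
    if not dists:
        raise ValueError("need at least one positive distance")
    w = max(dists)
    mean_log = sum(math.log(d / w) for d in dists) / len(dists)
    if mean_log == 0.0:
        return math.inf  # all neighbors at the same distance
    return -1.0 / mean_log


# Synthetic check: if the distance distribution is F(r) = (r / w)^d, its
# i/k quantiles are w * (i / k)^(1 / d); the estimate should be close to d.
d_true, k, w = 3, 1000, 2.0
sample = [w * (i / k) ** (1.0 / d_true) for i in range(1, k + 1)]
estimate = mle_local_id(sample)
```

The estimate depends only on the ratios x_i / w, so rescaling all distances leaves it unchanged.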

Journal ArticleDOI
TL;DR: This paper defines and analyzes the role of different aspects in the location recommendation, and proposes two fused models that incorporate all the major aspects into a single recommendation model and evaluates the proposed models against two real-world datasets.
Abstract: The evolution of the World Wide Web (WWW) and smartphone technologies has revolutionized our daily life. This has facilitated the emergence of many useful systems, such as Location-based Social Networks (LBSN), which provide many factors that are crucial for the selection of Points-of-Interest (POI). Some of the major factors are: (i) the location attributes, such as geo-coordinates, category, and check-in time, (ii) the user attributes, such as comments, tips, reviews, and ratings made to the locations, and (iii) other information, such as the distance of the POI from the user’s house/office, social ties between users, and so forth. Careful selection of such factors can have a significant impact on the efficiency of POI recommendation. In this paper, we define and analyze the fusion of different major aspects in POI recommendation. Such fusion and analysis has barely been explored by other researchers. The major contributions of this paper are: (i) it analyzes the role of different aspects (e.g., check-in frequency, social, temporal, spatial, and categorical) in location recommendation, (ii) it proposes two fused models, one ranking-based and one matrix factorization-based, that incorporate all the major aspects into a single recommendation model, and (iii) it evaluates the proposed models against two real-world datasets.

Journal ArticleDOI
TL;DR: The Infinite Ensemble Clustering (IEC) is proposed, which incorporates a marginalized denoising auto-encoder with dropout noise to generate the expectation representation for infinite basic partitions, and is evaluated on pan-omics gene expression analysis via survival analysis.
Abstract: Ensemble clustering aims to fuse several diverse basic partitions into a consensus one, which has been widely recognized as a promising tool to discover novel clusters and deliver robust partitions, while representation learning with deep structure shows appealing performance in unsupervised feature pre-treatment. In the literature, it has been empirically found that with an increasing number of basic partitions, ensemble clustering achieves better performance and lower variance, yet the best number of basic partitions for a given data set remains an open problem. In light of this, we propose the Infinite Ensemble Clustering (IEC), which incorporates a marginalized denoising auto-encoder with dropout noise to generate the expectation representation for infinite basic partitions. Generally speaking, a set of basic partitions is first generated from the data. Then by converting the basic partitions to 1-of-K codings, we link the marginalized denoising auto-encoder to the infinite basic partition representation. Finally, we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. According to different types of marginalized auto-encoders, linear and non-linear versions of IEC are proposed. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate the superior performance of IEC compared to state-of-the-art ensemble clustering and deep clustering methods. Moreover, we evaluate the performance of IEC on pan-omics gene expression analysis via survival analysis.

Journal ArticleDOI
TL;DR: This paper introduces a multivariate anomaly detection algorithm which detects anomalies and identifies the dimensions and locations of the anomalous subsequences, and demonstrates that it can successfully detect the correct anomalies without requiring any prior knowledge about the data.
Abstract: The problem of anomaly detection in time series has received a lot of attention in the past two decades. However, existing techniques cannot locate where the anomalies are within anomalous time series, or they require users to provide the length of potential anomalies. To address these limitations, we propose a self-learning online anomaly detection algorithm that automatically identifies anomalous time series, as well as the exact locations where the anomalies occur in the detected time series. In addition, for multivariate time series, it is difficult to detect anomalies due to the following challenges. First, anomalies may occur in only a subset of dimensions (variables). Second, the locations and lengths of anomalous subsequences may be different in different dimensions. Third, some anomalies may look normal in each individual dimension but anomalous when dimensions are combined. To mitigate these problems, we introduce a multivariate anomaly detection algorithm which detects anomalies and identifies the dimensions and locations of the anomalous subsequences. We evaluate our approaches on several real-world datasets, including two CPU manufacturing datasets from Intel. We demonstrate that our approach can successfully detect the correct anomalies without requiring any prior knowledge about the data.

Journal ArticleDOI
TL;DR: This article presents an exhaustive review of constrained clustering algorithms and a comparative study, in which their performance is evaluated when applied to time-series, and finds that k-means based algorithms become computationally expensive and unstable under these modifications.
Abstract: Constrained clustering is becoming an increasingly popular approach in data mining. It offers a balance between the complexity of producing a formal definition of thematic classes (required by supervised methods) and unsupervised approaches, which ignore expert knowledge and intuition. Nevertheless, the application of constrained clustering to time-series analysis is relatively unknown. This is partly due to the unsuitability of the Euclidean distance metric, which is typically used in data mining, to time-series data. This article addresses this divide by presenting an exhaustive review of constrained clustering algorithms and by modifying publicly available implementations to use a more appropriate distance measure, dynamic time warping. It presents a comparative study, in which their performance is evaluated when applied to time-series. It is found that k-Means based algorithms become computationally expensive and unstable under these modifications. Spectral approaches are easily applied and offer state-of-the-art performance, whereas declarative approaches are also easily applied and guarantee constraint satisfaction. An analysis of the results reveals several factors that influence an algorithm's performance when constraints are introduced.

Journal ArticleDOI
TL;DR: This work formally defines pattern mining as a game and solves it with Monte Carlo tree search (MCTS), an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property.
Abstract: The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that makes it possible to elicit such interesting patterns from labeled data. A question remains fairly open: How to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam-search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns of high quality, and different enough, are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.

Journal ArticleDOI
TL;DR: The presented work is a step towards autonomous knowledge discovery in a domain where data volumes are increasing, the complexity of systems is growing, and dedicating human experts to build fault detection and diagnostic models for all possible faults is not economically viable.
Abstract: An approach for intelligent monitoring of mobile cyberphysical systems is described, based on consensus among distributed self-organised agents. Its usefulness is experimentally demonstrated over a long-time case study in an example domain: a fleet of city buses. The proposed solution combines several techniques, allowing for life-long learning under computational and communication constraints. The presented work is a step towards autonomous knowledge discovery in a domain where data volumes are increasing, the complexity of systems is growing, and dedicating human experts to build fault detection and diagnostic models for all possible faults is not economically viable. The embedded, self-organised agents operate on-board the cyberphysical systems, modelling their states and communicating them wirelessly to a back-office application. Those models are subsequently compared against each other to find systems which deviate from the consensus. In this way the group (e.g., a fleet of vehicles) is used to provide a standard, or to describe normal behaviour, together with its expected variability under particular operating conditions. The intention is to detect faults without the need for human experts to anticipate them beforehand. This can be used to build up a knowledge base that accumulates over the life-time of the systems. The approach is demonstrated using data collected during regular operation of a city bus fleet over the period of almost 4 years.

Journal ArticleDOI
TL;DR: This paper casts the location recommendation as a mathematical matrix-completion problem and proposes a robust algorithm named Linearized Bregman Iteration for Matrix Completion (LBIMC), which can effectively recover the user-location matrix considering structural noise and provide recommendations based solely on check-in records.
Abstract: Due to the sharply increasing number of users and venues in Location-Based Social Networks, it has become a big challenge to provide recommendations which match users’ preferences. Furthermore, the sparse data and skewed distribution (i.e., structural noise) also worsen the coverage and accuracy of recommendations. This problem is prevalent in traditional recommender methods since they assume that the collected data truly reflect users’ preferences. To overcome the limitation of current recommenders, it is imperative to explore an effective strategy which can accurately provide recommendations while tolerating the structural noise. However, few studies concentrate on the processing of noisy data in recommender systems, even among recent matrix-completion algorithms. In this paper, we cast location recommendation as a mathematical matrix-completion problem and propose a robust algorithm named Linearized Bregman Iteration for Matrix Completion (LBIMC), which can effectively recover the user-location matrix considering structural noise and provide recommendations based solely on check-in records. Our experiments are conducted on check-in data from Foursquare, and the results demonstrate the effectiveness of LBIMC.

Journal ArticleDOI
TL;DR: This paper introduces a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length).
Abstract: The exploration of repeated patterns with different lengths, also called variable-length motifs, has received a great amount of attention in recent years. However, existing algorithms to detect variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length). The discovered motifs can be ranked by different metrics such as frequency or similarity, and can benefit a wide variety of real-world applications. We demonstrate that our approach can discover motifs in time series with over one hundred million points in just minutes, which is significantly faster than the fastest existing algorithm to date. We demonstrate the superiority of our algorithm over the state-of-the-art using several real world time series datasets.

Journal ArticleDOI
TL;DR: GNG-A is proposed, an adaptive method for incremental unsupervised learning from evolving data streams experiencing various types of change and is demonstrated for anomaly and novelty detection in non-stationary environments.
Abstract: In the era of big data, considerable research focus is being put on designing efficient algorithms capable of learning and extracting high-level knowledge from ubiquitous data streams in an online fashion. While most existing algorithms assume that data samples are drawn from a stationary distribution, several complex environments deal with data streams that are subject to change over time. Taking this aspect into consideration is an important step towards building truly aware and intelligent systems. In this paper, we propose GNG-A, an adaptive method for incremental unsupervised learning from evolving data streams experiencing various types of change. The proposed method maintains a continuously updated network (graph) of neurons by extending the Growing Neural Gas algorithm with three complementary mechanisms, allowing it to closely track both gradual and sudden changes in the data distribution. First, an adaptation mechanism handles local changes where the distribution is only non-stationary in some regions of the feature space. Second, an adaptive forgetting mechanism identifies and removes neurons that become irrelevant due to the evolving nature of the stream. Finally, a probabilistic evolution mechanism creates new neurons when there is a need to represent data in new regions of the feature space. The proposed method is demonstrated for anomaly and novelty detection in non-stationary environments. Results show that the method handles different data distributions and efficiently reacts to various types of change.

Journal ArticleDOI
TL;DR: This paper proposes to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths, and develops a new similarity measure called KnowSim, which is an ensemble of selected meta-paths and results in impressive high-quality document clustering and classification performance.
Abstract: Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths, which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.
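
As a rough illustration of the PathSim measure mentioned in this abstract, the sketch below computes PathSim for a single symmetric meta-path using the commonly cited form s(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]), where M counts path instances between entities. The toy document-entity matrix and the function names are assumptions made for illustration; the sketch does not cover the meta-path selection or the KnowSim ensemble described above.

```python
def commuting_matrix(adj):
    """Return adj * adj^T: path counts for a symmetric two-hop meta-path."""
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim between entities x and y given the path-count matrix m."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: rows are documents, columns are linked entities,
# values are link counts (meta-path: document -> entity -> document).
doc_entity = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]
M = commuting_matrix(doc_entity)

sim_01 = pathsim(M, 0, 1)  # documents sharing entities score close to 1
sim_02 = pathsim(M, 0, 2)  # documents with no shared entity score 0
```

The normalisation by self-path counts keeps highly connected entities from dominating the similarity.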

Journal ArticleDOI
TL;DR: This work proposes an analytical method (i.e. simulation-free) that extends the works of Picard et al. to identify statistically significant motifs in a given network and provides an analytical expression of the mean and variance of the count under the Expected Degree Distribution random graph model.
Abstract: Network motif discovery is the problem of finding subgraphs of a network that occur more frequently than expected, according to some reasonable null hypothesis. Such subgraphs may indicate small scale interaction features in genomic interaction networks or intriguing relationships involving actors or a relationship among airlines. When nodes are labeled, they can carry information such as the genomic entity under study or the dominant genre of an actor. For that reason, labeled subgraphs convey information beyond structure and could therefore enjoy more applications. To identify statistically significant motifs in a given network, we propose an analytical method (i.e. simulation-free) that extends the works of Picard et al. (J Comput Biol 15(1):1–20, 2008) and Schbath et al. (J Bioinform Syst Biol 2009(1):616234, 2009) to label-dependent scale-free graph models. We provide an analytical expression of the mean and variance of the count under the Expected Degree Distribution random graph model. Our model deals with both induced and non-induced motifs. We have tested our methodology on a wide set of graphs ranging from protein–protein interaction networks to movie networks. The analytical model is a fast (usually faster by orders of magnitude) alternative to simulation. This advantage increases as graphs grow in size.

Journal ArticleDOI
TL;DR: It is shown that the proposed provenance network metrics can successfully identify the owners of provenance documents, assess the quality of crowdsourced data, and identify instructions from chat messages in an alternate-reality game with high levels of accuracy.
Abstract: Provenance network analytics is a novel data analytics approach that helps infer properties of data, such as quality or importance, from their provenance. Instead of analysing application data, which are typically domain-dependent, it analyses the data's provenance as represented using the World Wide Web Consortium's domain-agnostic PROV data model. Specifically, the approach proposes a number of network metrics for provenance data and applies established machine learning techniques over such metrics to build predictive models for some key properties of data. Applying this method to the provenance of real-world data from three different applications, we show that it can successfully identify the owners of provenance documents, assess the quality of crowdsourced data, and identify instructions from chat messages in an alternate-reality game with high levels of accuracy. By so doing, we demonstrate the different ways the proposed provenance network metrics can be used in analysing data, providing the foundation for provenance-based data analytics.

Journal ArticleDOI
TL;DR: This paper analyzes two uplift modeling approaches to linear regression, one based on the use of two separate models and the other based on target variable transformation, and proposes a third model which combines the benefits of both approaches and seems to be the model of choice for uplift linear regression.
Abstract: The purpose of statistical modeling is to select targets for some action, such as a medical treatment or a marketing campaign. Unfortunately, classical machine learning algorithms are not well suited to this task since they predict the results after the action, and not its causal impact. The answer to this problem is uplift modeling, which, in addition to the usual training set containing objects on which the action was taken, uses an additional control group of objects not subjected to it. The predicted true effect of the action on a given individual is modeled as the difference between responses in both groups. This paper analyzes two uplift modeling approaches to linear regression, one based on the use of two separate models and the other based on target variable transformation. Adapting the second estimator to the problem of regression is one of the contributions of the paper. We identify the situations when each model performs best and, contrary to several claims in the literature, show that the double model approach has favorable theoretical properties and often performs well in practice. Finally, based on our analysis we propose a third model which combines the benefits of both approaches and seems to be the model of choice for uplift linear regression. Experimental analysis confirms our theoretical results on both simulated and real data, clearly demonstrating good performance of the double model and the advantages of the proposed approach.
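
The two approaches contrasted in this abstract can be sketched on a one-feature toy problem. The first function fits separate regressions on the treatment and control groups and subtracts their predictions. The second regresses on a transformed target, here the common inverse-propensity form (y/p for treated objects, -y/(1-p) for controls, with p the treated fraction); that specific transformation, the helper names, and the noise-free toy data are assumptions for illustration and may differ from the estimator analysed in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def double_model_uplift(x_t, y_t, x_c, y_c):
    """Two separate models: predicted uplift is their difference."""
    a_t, b_t = fit_line(x_t, y_t)
    a_c, b_c = fit_line(x_c, y_c)
    return lambda x: (a_t + b_t * x) - (a_c + b_c * x)


def transformed_target_uplift(x_t, y_t, x_c, y_c):
    """Single model fitted to an inverse-propensity transformed target."""
    p = len(x_t) / (len(x_t) + len(x_c))  # fraction of treated objects
    xs = list(x_t) + list(x_c)
    zs = [y / p for y in y_t] + [-y / (1 - p) for y in y_c]
    a, b = fit_line(xs, zs)
    return lambda x: a + b * x


# Hypothetical noise-free data: treated y = 1 + 3x, control y = 1 + x,
# so the true uplift at x is 2x.
x_t = [0.0, 1.0, 2.0, 3.0]
y_t = [1 + 3 * x for x in x_t]
x_c = [0.0, 1.0, 2.0, 3.0]
y_c = [1 + x for x in x_c]

uplift_double = double_model_uplift(x_t, y_t, x_c, y_c)
uplift_transformed = transformed_target_uplift(x_t, y_t, x_c, y_c)
```

On this balanced, noise-free example both estimators recover the same uplift; with noisy or unbalanced data they generally differ, which is the regime the paper studies.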

Journal ArticleDOI
TL;DR: A non-parametric kernel mixture model (KMM) based probability density estimation approach, in which the data sample of a class is assumed to be drawn by several unknown independent hidden subclasses, which is able to improve the quality of estimated PDFs of conventional kernel density estimation (KDE) method.
Abstract: Estimating reliable class-conditional probability is the prerequisite to implement Bayesian classifiers, and how to estimate the probability density functions (PDFs) is also a fundamental problem for other probabilistic induction algorithms. The finite mixture model (FMM) is able to represent arbitrary complex PDFs by using a mixture of multimodal distributions, but it assumes that the component mixtures follow a given distribution, which may not be satisfied for real world data. This paper presents a non-parametric kernel mixture model (KMM) based probability density estimation approach, in which the data sample of a class is assumed to be drawn by several unknown independent hidden subclasses. Unlike traditional FMM schemes, we simply use the k-means clustering algorithm to partition the data sample into several independent components, and the regional density diversities of components are combined using the Bayes theorem. On the basis of the proposed kernel mixture model, we present a three-step Bayesian classifier, which includes partitioning, structure learning, and PDF estimation. Experimental results show that KMM is able to improve the quality of estimated PDFs of the conventional kernel density estimation (KDE) method, and also show that KMM-based Bayesian classifiers outperform existing Gaussian, GMM, and KDE-based Bayesian classifiers.

Journal ArticleDOI
TL;DR: A novel technique is proposed to localize spatiotemporal anomalous events based on tensor decomposition; it employs a spatial-feature-temporal tensor model and analyzes latent mobility patterns through unsupervised learning.
Abstract: Anomaly detection in multidimensional data is a challenging task. Detecting anomalous mobility patterns in a city needs to take spatial, temporal, and traffic information into consideration. Although existing techniques are able to extract spatiotemporal features for anomaly analysis, few systematic analyses of how different factors contribute to or affect the anomalous patterns have been proposed. In this paper, we propose a novel technique to localize spatiotemporal anomalous events based on tensor decomposition. The proposed method employs a spatial-feature-temporal tensor model and analyzes latent mobility patterns through unsupervised learning. We first train the model based on historical data and then use the model to capture the anomalies, i.e., the mobility patterns that are significantly different from the normal patterns. The proposed technique is evaluated based on the yellow-cab dataset collected from New York City. The results show several interesting latent mobility patterns and traffic anomalies that can be deemed as anomalous events in the city, suggesting the effectiveness of the proposed anomaly detection method.

Journal ArticleDOI
TL;DR: An approach is proposed that takes a noisy GPS trajectory as input and calculates the stop probability at each entry and allows the user to directly filter out any classified stops that are of an unacceptable probability for their application through the use of a minimum stop probability parameter.
Abstract: Stop and move information can be used to uncover useful semantic patterns; therefore, annotating GPS trajectories as either stopping or moving is beneficial. However, the task of automatically discovering if the entity is stopping or moving is challenging due to the spatial noisiness of real-world GPS trajectories. Existing approaches classify each entry definitively as being either a stop or a move, hiding any indication that some classifications can be made with more certainty than others. Such an indication of the “goodness of classification” of each entry would allow the user to filter out certain stop classifications that appear too ambiguous for their use-case, which in a data-mining context may ultimately lead to fewer false patterns. In this work we propose such an approach that takes a noisy GPS trajectory as input and calculates the stop probability at each entry. Through the use of a minimum stop probability parameter our proposed approach allows the user to directly filter out any classified stops that are of an unacceptable probability for their application. Using several real-world and synthetic GPS trajectories (that we have made available) we compared the classification effectiveness, parameter sensitivity, and running time of our approach to two well-known existing approaches, SMoT and CB-SMoT. Experimental results indicated the efficiency, effectiveness, and sampling rate robustness of our approach compared to the existing approaches. The results also demonstrated that the user can increase the minimum stop probability parameter to easily filter out low probability stop classifications, which equated to effectively reducing the number of false positive classifications in our ground truth experiments. Lastly, we proposed estimation heuristics for each of our approach’s parameters and empirically demonstrated the effectiveness of each heuristic using real-world trajectories. Specifically, the results revealed that even when all of the parameters were estimated the classification effectiveness of our approach was higher than existing approaches across a range of sampling rates.

Journal ArticleDOI
TL;DR: In this article, the authors define a notion of temporal stability for binary classification tasks in predictive process monitoring and evaluate existing methods with respect to both temporal stability and accuracy, and find that methods based on XGBoost and LSTM neural networks exhibit the highest temporal stability.
Abstract: Predictive process monitoring is concerned with the analysis of events produced during the execution of a business process in order to predict as early as possible the final outcome of an ongoing case. Traditionally, predictive process monitoring methods are optimized with respect to accuracy. However, in environments where users make decisions and take actions in response to the predictions they receive, it is equally important to optimize the stability of the successive predictions made for each case. To this end, this paper defines a notion of temporal stability for binary classification tasks in predictive process monitoring and evaluates existing methods with respect to both temporal stability and accuracy. We find that methods based on XGBoost and LSTM neural networks exhibit the highest temporal stability. We then show that temporal stability can be enhanced by hyperparameter-optimizing random forests and XGBoost classifiers with respect to inter-run stability. Finally, we show that time series smoothing techniques can further enhance temporal stability at the expense of slightly lower accuracy.
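
To make the idea of stability across successive predictions concrete, the sketch below uses one plausible formalisation: one minus the mean absolute difference between consecutive prediction scores of a case, averaged over cases. It also applies simple exponential smoothing to the score sequence. The exact definition, the smoothing variant, and all names here are assumptions for illustration and are not taken from the paper.

```python
def case_instability(scores):
    """Mean absolute change between successive prediction scores of one case."""
    if len(scores) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


def temporal_stability(cases):
    """1 minus the average per-case instability (assumed definition)."""
    return 1.0 - sum(case_instability(c) for c in cases) / len(cases)


def exp_smooth(scores, alpha):
    """Exponential smoothing: s_t = alpha * score_t + (1 - alpha) * s_{t-1}."""
    smoothed = [scores[0]]
    for score in scores[1:]:
        smoothed.append(alpha * score + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical prediction scores for one ongoing case after each new event.
raw = [0.2, 0.8, 0.4]
smooth = exp_smooth(raw, alpha=0.5)

stability_raw = temporal_stability([raw])
stability_smooth = temporal_stability([smooth])
```

Smoothing damps the jumps between successive scores, so the smoothed sequence scores higher on this stability measure; any effect on accuracy has to be checked separately.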

Journal ArticleDOI
TL;DR: In this paper, a counter-based algorithm for tracking correlated heavy hitters (CHHs) is presented; the authors formally prove its error bounds and correctness and show, through extensive experimental results, that it outperforms the Misra–Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.
Abstract: The problem of mining correlated heavy hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra–Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and show, through extensive experimental results, that our algorithm outperforms the Misra–Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.
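
For context on the baseline named in this abstract, the sketch below shows the classic Misra–Gries frequent-items summary for a one-dimensional stream. It keeps at most k-1 counters, and every item occurring more than n/k times in a stream of length n is guaranteed to survive in the summary. This is neither the two-dimensional CHH algorithm of Lahiri et al. nor the new counter-based algorithm of the paper; the function name and toy stream are illustrative assumptions.

```python
def misra_gries(stream, k):
    """Return a dict of at most k-1 candidate heavy hitters with lower-bound counts."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream of length 10; "a" occurs 5 times (> 10/3).
stream = list("aababcabad")
summary = misra_gries(stream, k=3)
```

Each stored count underestimates the true frequency by at most n/k, so a second pass (or an error bound) is needed to confirm which candidates are truly frequent.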

Journal ArticleDOI
TL;DR: This paper proposes a novel anytime approach, called AnyDBC, that compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time of DBSCAN.
Abstract: The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, DBSCAN requires neighborhood queries for all objects and propagation of labels from object to object. This scheme is time consuming and thus limits its applicability for large datasets. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. Moreover, instead of passively performing range queries for all objects as in existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still satisfying the cluster definition of DBSCAN. Moreover, by processing queries in blocks and merging the results into the current cluster structure, AnyDBC can be efficiently parallelized on shared memory architectures to further accelerate the performance, uniquely making it a parallel and anytime technique at the same time. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants as well as a high parallel scalability on multicore processors for very large real and synthetic complex datasets.