
Showing papers in "Data Mining and Knowledge Discovery in 2019"


Journal ArticleDOI
TL;DR: This article proposes the most exhaustive study of DNNs for TSC by training 8730 deep learning models on 97 time series datasets and provides an open source deep learning framework to the TSC community.
Abstract: Time Series Classification (TSC) is an important and challenging problem in data mining. With the increasing availability of time series data, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision, especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date.
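For concreteness, the sketch below shows a fully convolutional network (FCN) of the kind benchmarked in such studies, written with tf.keras; the layer sizes, training settings and toy data are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal FCN-style baseline for univariate time series classification.
# Layer sizes and training settings are illustrative assumptions, not the
# exact configuration benchmarked in the paper.
import numpy as np
import tensorflow as tf

def build_fcn(series_length: int, n_classes: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(128, 8, padding="same",
                               input_shape=(series_length, 1)),   # univariate input
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Conv1D(256, 5, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Conv1D(128, 3, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.GlobalAveragePooling1D(),                  # length-invariant pooling
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy usage: 100 random series of length 150, 3 classes.
X = np.random.randn(100, 150, 1).astype("float32")
y = np.random.randint(0, 3, size=100)
model = build_fcn(150, 3)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```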

1,833 citations


Journal ArticleDOI
TL;DR: In this paper, a taxonomy of all those methods that aim to classify time series using a distance based approach, as well as a discussion of the strengths and weaknesses of each method is presented.
Abstract: Time series classification is an increasing research topic due to the vast amount of time series data that is being created over a wide variety of fields. The particularity of the data makes it a challenging task and different approaches have been taken, including the distance based approach. 1-NN has been a widely used method within distance based time series classification due to its simplicity but still good performance. However, its supremacy may be attributed to being able to use specific distances for time series within the classification process and not to the classifier itself. With the aim of exploiting these distances within more complex classifiers, new approaches have arisen in the past few years that are competitive or which outperform the 1-NN based approaches. In some cases, these new methods use the distance measure to transform the series into feature vectors, bridging the gap between time series and traditional classifiers. In other cases, the distances are employed to obtain a time series kernel and enable the use of kernel methods for time series classification. One of the main challenges is that a kernel function must be positive semi-definite, a matter that is also addressed within this review. The presented review includes a taxonomy of all those methods that aim to classify time series using a distance based approach, as well as a discussion of the strengths and weaknesses of each method.

114 citations


Journal ArticleDOI
TL;DR: Proximity Forest is introduced, an algorithm that learns accurate models from datasets with millions of time series, and classifies a time series in milliseconds, and ranks among the most accurate classifiers while being significantly faster on the UCR archive.
Abstract: Research into the classification of time series has made enormous progress in the last decade. The UCR time series archive has played a significant role in challenging and guiding the development of new learners for time series classification. The largest dataset in the UCR archive holds only 10,000 time series, which may explain why the primary research focus has been on creating algorithms that have high accuracy on relatively small datasets. This paper introduces Proximity Forest, an algorithm that learns accurate models from datasets with millions of time series, and classifies a time series in milliseconds. The models are ensembles of highly randomized Proximity Trees. Whereas conventional decision trees branch on attribute values (and usually perform poorly on time series), Proximity Trees branch on the proximity of time series to one exemplar time series or another, allowing us to leverage the decades of work into developing relevant measures for time series. Proximity Forest gains both efficiency and accuracy by stochastic selection of both exemplars and similarity measures. Our work is motivated by recent time series applications that provide orders of magnitude more time series than the UCR benchmarks. Our experiments demonstrate that Proximity Forest is highly competitive on the UCR archive: it ranks among the most accurate classifiers while being significantly faster. We demonstrate on a 1M time series Earth observation dataset that Proximity Forest retains this accuracy on datasets that are many orders of magnitude greater than those in the UCR repository, while learning its models at least 100,000 times faster than the current state-of-the-art models Elastic Ensemble and COTE.
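As an illustration of the splitting idea described above, here is a minimal sketch of a single proximity split with a two-measure distance pool (Euclidean and a basic DTW); Proximity Forest itself uses a much richer pool of elastic measures and builds entire trees of such splits, so treat this as a simplification.

```python
# Sketch of a single "proximity split": each series is routed to whichever
# randomly chosen class exemplar it is closest to under a randomly chosen
# distance measure. Illustrative only; Proximity Forest uses a richer pool
# of elastic distances and builds whole trees of such splits.
import random
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def proximity_split(X, y):
    """Pick one exemplar per class and a random distance, then partition X."""
    dist = random.choice([euclidean, dtw])
    exemplars = {c: X[random.choice(np.flatnonzero(y == c))] for c in np.unique(y)}
    branches = {c: [] for c in exemplars}
    for i, series in enumerate(X):
        nearest = min(exemplars, key=lambda c: dist(series, exemplars[c]))
        branches[nearest].append(i)
    return dist.__name__, branches

X = np.random.randn(20, 50)
y = np.random.randint(0, 2, size=20)
measure, branches = proximity_split(X, y)
print(measure, {c: len(idx) for c, idx in branches.items()})
```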

111 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce a method to infer small sets of time-series features that exhibit strong classification performance across a given collection of time series problems, and are minimally redundant.
Abstract: Capturing the dynamical properties of time series concisely as interpretable feature vectors can enable efficient clustering and classification for time-series applications across science and industry. Selecting an appropriate feature-based representation of time series for a given application can be achieved through systematic comparison across a comprehensive time-series feature library, such as those in the hctsa toolbox. However, this approach is computationally expensive and involves evaluating many similar features, limiting the widespread adoption of feature-based representations of time series for real-world applications. In this work, we introduce a method to infer small sets of time-series features that (i) exhibit strong classification performance across a given collection of time-series problems, and (ii) are minimally redundant. Applying our method to a set of 93 time-series classification datasets (containing over 147,000 time series) and using a filtered version of the hctsa feature library (4791 features), we introduce a set of 22 CAnonical Time-series CHaracteristics, catch22, tailored to the dynamics typically encountered in time-series data-mining tasks. This dimensionality reduction, from 4791 to 22, is associated with an approximately 1000-fold reduction in computation time and near linear scaling with time-series length, despite an average reduction in classification accuracy of just 7%. catch22 captures a diverse and interpretable signature of time series in terms of their properties, including linear and non-linear autocorrelation, successive differences, value distributions and outliers, and fluctuation scaling properties. We provide an efficient implementation of catch22, accessible from many programming environments, that facilitates feature-based time-series analysis for scientific, industrial, financial and medical applications using a common language of interpretable time-series properties.
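The selection idea can be sketched in a few lines: score each candidate feature by how well it classifies on its own, then greedily keep high-scoring features that are not strongly correlated with those already kept. The data, classifier and thresholds below are toy assumptions; the actual catch22 pipeline scores features across many datasets and uses a more principled redundancy reduction.

```python
# Schematic sketch of a catch22-style reduction: score each candidate feature
# by cross-validated classification accuracy, then greedily keep high-scoring
# features that are not strongly correlated with features already selected.
# Synthetic data; the real pipeline aggregates scores over many datasets.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 40))              # 300 time series x 40 candidate features
y = (F[:, 0] + 0.5 * F[:, 3] > 0).astype(int)

# 1) Single-feature classification performance.
scores = np.array([
    cross_val_score(DecisionTreeClassifier(max_depth=3), F[:, [j]], y, cv=5).mean()
    for j in range(F.shape[1])
])

# 2) Greedy pick: best remaining feature that is not redundant with the kept set.
corr = np.abs(np.corrcoef(F, rowvar=False))
selected = []
for j in np.argsort(-scores):
    if all(corr[j, k] < 0.8 for k in selected):
        selected.append(j)
    if len(selected) == 5:                  # keep a small canonical set
        break
print("selected features:", selected)
```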

110 citations


Journal ArticleDOI
TL;DR: A multi-resolution multi-domain linear classifier achieves a similar accuracy to the state-of-the-art COTE ensemble, and to recent deep learning methods (FCN, ResNet), but uses a fraction of the time and memory required by either COTE or deep models.
Abstract: The time series classification literature has expanded rapidly over the last decade, with many new classification approaches published each year. Prior research has mostly focused on improving the accuracy and efficiency of classifiers, with interpretability being somewhat neglected. This aspect of classifiers has become critical for many application domains and the introduction of the EU GDPR legislation in 2018 is likely to further emphasize the importance of interpretable learning algorithms. Currently, state-of-the-art classification accuracy is achieved with very complex models based on large ensembles (COTE) or deep neural networks (FCN). These approaches are not efficient with regard to either time or space, are difficult to interpret and cannot be applied to variable-length time series, requiring pre-processing of the original series to a set fixed-length. In this paper we propose new time series classification algorithms to address these gaps. Our approach is based on symbolic representations of time series, efficient sequence mining algorithms and linear classification models. Our linear models are as accurate as deep learning models but are more efficient regarding running time and memory, can work with variable-length time series and can be interpreted by highlighting the discriminative symbolic features on the original time series. We advance the state-of-the-art in time series classification by proposing new algorithms built using the following three key ideas: (1) Multiple resolutions of symbolic representations: we combine symbolic representations obtained using different parameters, rather than one fixed representation (e.g., multiple SAX representations); (2) Multiple domain representations: we combine symbolic representations in time (e.g., SAX) and frequency (e.g., SFA) domains, to be more robust across problem types; (3) Efficient navigation in a huge symbolic-words space: we extend a symbolic sequence classifier (SEQL) to work with multiple symbolic representations and use its greedy feature selection strategy to effectively filter the best features for each representation. We show that our multi-resolution multi-domain linear classifier (mtSS-SEQL+LR) achieves a similar accuracy to the state-of-the-art COTE ensemble, and to recent deep learning methods (FCN, ResNet), but uses a fraction of the time and memory required by either COTE or deep models. To further analyse the interpretability of our classifier, we present a case study on a human motion dataset collected by the authors. We discuss the accuracy, efficiency and interpretability of our proposed algorithms and release all the results, source code and data to encourage reproducibility.
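One of the symbolic representations combined here, SAX, is simple to sketch: z-normalise the series, average it over w segments (PAA), and discretise each segment with equiprobable Gaussian breakpoints. The word length and alphabet size below are arbitrary examples; the proposed classifiers combine many such resolutions together with frequency-domain (SFA) words.

```python
# Minimal SAX transform: z-normalise, reduce to w segments via piecewise
# aggregate approximation (PAA), then discretise each segment using
# equiprobable Gaussian breakpoints. Word length and alphabet size are
# arbitrary examples of one resolution among the many combined in the paper.
import numpy as np
from scipy.stats import norm

def sax_word(x, w=8, alphabet_size=4):
    x = (x - x.mean()) / (x.std() + 1e-8)                   # z-normalise
    segments = np.array_split(x, w)                         # PAA segments
    paa = np.array([s.mean() for s in segments])
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)             # 0 .. alphabet_size-1
    return "".join(chr(ord("a") + int(s)) for s in symbols)

x = np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.1 * np.random.randn(128)
print(sax_word(x, w=8, alphabet_size=4))   # prints an 8-letter word over {a, b, c, d}
```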

61 citations


Journal ArticleDOI
TL;DR: A unifying view on what is called multi-target prediction (MTP) problems and methods is presented by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems.
Abstract: Many problem settings in machine learning are concerned with the simultaneous prediction of multiple target variables of diverse type. Amongst others, such problem settings arise in multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. These subfields of machine learning are typically studied in isolation, without highlighting or exploring important relationships. In this paper, we present a unifying view on what we call multi-target prediction (MTP) problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.

59 citations


Journal ArticleDOI
TL;DR: It is, on average, better to ensemble strong classifiers with a weighting scheme than to perform extensive tuning, and CAWPE is a sensible starting point for combining classifiers.
Abstract: Our hypothesis is that building ensembles of small sets of strong classifiers constructed with different learning algorithms is, on average, the best approach to classification for real-world problems. We propose a simple mechanism for building small heterogeneous ensembles based on exponentially weighting the probability estimates of the base classifiers with an estimate of the accuracy formed through cross-validation on the train data. We demonstrate through extensive experimentation that, given the same small set of base classifiers, this method has measurable benefits over commonly used alternative weighting, selection or meta-classifier approaches to heterogeneous ensembles. We also show how an ensemble of five well-known, fast classifiers can produce an ensemble that is not significantly worse than large homogeneous ensembles and tuned individual classifiers on datasets from the UCI archive. We provide evidence that the performance of the cross-validation accuracy weighted probabilistic ensemble (CAWPE) generalises to a completely separate set of datasets, the UCR time series classification archive, and we also demonstrate that our ensemble technique can significantly improve the state-of-the-art classifier for this problem domain. We investigate the performance in more detail, and find that the improvement is most marked in problems with smaller train sets. We perform a sensitivity analysis and an ablation study to demonstrate the robustness of the ensemble and the significant contribution of each design element of the classifier. We conclude that it is, on average, better to ensemble strong classifiers with a weighting scheme rather than perform extensive tuning and that CAWPE is a sensible starting point for combining classifiers.
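The weighting scheme itself is compact enough to sketch: each base classifier's class-probability estimates are weighted by its train-set cross-validation accuracy raised to a power alpha before averaging. The base classifiers, data and the exponent value below (alpha = 4) are assumptions for illustration.

```python
# Sketch of accuracy-weighted probabilistic ensembling in the CAWPE style:
# weight each base classifier's class-probability estimates by its
# cross-validated train-set accuracy raised to a power alpha (alpha = 4 is an
# assumed value), then normalise the weights and average the probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bases = [LogisticRegression(max_iter=1000), GaussianNB(),
         KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]
alpha = 4

weights, probas = [], []
for clf in bases:
    acc = cross_val_score(clf, X_train, y_train, cv=5).mean()  # train-set CV accuracy
    clf.fit(X_train, y_train)
    weights.append(acc ** alpha)
    probas.append(clf.predict_proba(X_test))

weights = np.array(weights) / np.sum(weights)
ensemble_proba = sum(w * p for w, p in zip(weights, probas))
print("ensemble accuracy:", (ensemble_proba.argmax(axis=1) == y_test).mean())
```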

59 citations


Journal ArticleDOI
TL;DR: A multi-dimensional algorithm is presented, which is domain agnostic, has only one, easily-determined parameter, and can handle data streaming at a high rate, and is tested on the largest and most diverse collection of time series datasets ever considered.
Abstract: Unsupervised semantic segmentation in the time series domain is a much studied problem due to its potential to detect unexpected regularities and regimes in poorly understood data. However, the current techniques have several shortcomings, which have limited the adoption of time series semantic segmentation beyond academic settings for four primary reasons. First, most methods require setting/learning many parameters and thus may have problems generalizing to novel situations. Second, most methods implicitly assume that all the data is segmentable and have difficulty when that assumption is unwarranted. Thirdly, many algorithms are only defined for the single dimensional case, despite the ubiquity of multi-dimensional data. Finally, most research efforts have been confined to the batch case, but online segmentation is clearly more useful and actionable. To address these issues, we present a multi-dimensional algorithm, which is domain agnostic, has only one, easily-determined parameter, and can handle data streaming at a high rate. In this context, we test the algorithm on the largest and most diverse collection of time series datasets ever considered for this task and demonstrate the algorithm’s superiority over current solutions.

42 citations


Journal ArticleDOI
TL;DR: A new method learns artificial neural networks while addressing the key issues of renewable energy forecasting: it performs online adaptive training and enriches the entropy measures with spatial information of the data in order to take spatial autocorrelation into account.
Abstract: In renewable energy forecasting, data are typically collected by geographically distributed sensor networks, which poses several issues. (i) Data represent physical properties that are subject to concept drift, i.e., their characteristics could change over time. To address the concept drift phenomenon, adaptive online learning methods should be considered. (ii) The error distribution is typically non-Gaussian. Therefore, traditional quality performance criteria during training, like the mean-squared error, are less suitable. In the literature, entropy-based criteria have been proposed to deal with this problem. (iii) Spatially-located sensors introduce some form of autocorrelation, that is, values collected by sensors show a correlation strictly due to their relative spatial proximity. Although all these issues have already been investigated in the literature, they have not been investigated in combination. In this paper, we propose a new method which learns artificial neural networks by addressing all these issues. The method performs online adaptive training and enriches the entropy measures with spatial information of the data, in order to take into account spatial autocorrelation. Experimental results on two photovoltaic power production datasets are clearly favorable for entropy-based measures that take into account spatial autocorrelation, including when compared with state-of-the-art methods.

41 citations


Journal ArticleDOI
TL;DR: Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery; statistical tests can also be applied to filter out patterns that are unlikely to be useful.
Abstract: Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries—patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and summary of the state-of-the-art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.
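A basic building block of this approach is easy to illustrate: test a single dependency rule with Fisher's exact test on its 2x2 contingency table and adjust the significance threshold for the number of patterns evaluated. The counts and the Bonferroni correction below are a simplified, illustrative example of the kind of procedure the tutorial covers.

```python
# Testing one dependency rule X -> Y with Fisher's exact test, then applying
# a Bonferroni correction for the number of candidate patterns evaluated.
# The contingency counts and the number of candidates are made-up examples.
from scipy.stats import fisher_exact

# 2x2 contingency table over transactions:
#                Y present   Y absent
# X present          60          40
# X absent           30         170
table = [[60, 40], [30, 170]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")  # positive dependency

n_candidate_patterns = 10_000          # patterns evaluated during mining
alpha = 0.05
corrected_threshold = alpha / n_candidate_patterns   # Bonferroni adjustment

print(f"p = {p_value:.3e}, significant after correction: {p_value <= corrected_threshold}")
```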

34 citations


Journal ArticleDOI
TL;DR: Attri2vec as discussed by the authors proposes a unified framework for attributed network embedding that learns node embeddings by discovering a latent node attribute subspace via a network structure guided transformation performed on the original attribute space.
Abstract: Network embedding aims to learn latent, low-dimensional vector representations of network nodes that are effective in supporting various network analytic tasks. While prior art on network embedding focuses primarily on preserving network topology structure to learn node representations, recently proposed attributed network embedding algorithms attempt to integrate rich node content information with network topological structure for enhancing the quality of network embedding. In reality, networks often have sparse content, incomplete node attributes, as well as the discrepancy between node attribute feature space and network structure space, which severely deteriorates the performance of existing methods. In this paper, we propose a unified framework for attributed network embedding, attri2vec, that learns node embeddings by discovering a latent node attribute subspace via a network structure guided transformation performed on the original attribute space. The resultant latent subspace can respect network structure in a more consistent way towards learning high-quality node representations. We formulate an optimization problem which is solved by an efficient stochastic gradient descent algorithm, with time complexity linear in the number of nodes. We investigate a series of linear and non-linear transformations performed on node attributes and empirically validate their effectiveness on various types of networks. Another advantage of attri2vec is its ability to solve out-of-sample problems, where embeddings of newly arriving nodes can be inferred from their node attributes through the learned mapping function. Experiments on various types of networks confirm that attri2vec is superior to state-of-the-art baselines for node classification, node clustering, as well as out-of-sample link prediction tasks. The source code of this paper is available at https://github.com/daokunzhang/attri2vec .

Journal ArticleDOI
TL;DR: ClipStream, a new interpretable approach for clustering multiple data streams in a smart grid, is used to improve the forecasting accuracy of aggregated electricity consumption and to support grid analysis; experiments show the suitability of the proposed representation in many tested applications.
Abstract: This paper presents ClipStream, a new interpretable approach for clustering multiple data streams in a smart grid, used to improve the forecasting accuracy of aggregated electricity consumption and to support grid analysis. Consumers' time series streams are compressed and represented by interpretable features extracted from the clipped representation. The proposed representation has low computational complexity and is incremental in the sense of the windowing method. From the extracted features, outlier consumers can be simply and quickly detected. The clustering phase consists of three parts: clustering non-outlier representations, the aggregation of consumption within clusters, and an unsupervised change detection procedure on windows of the aggregated time series streams. ClipStream behaviour and its forecasting accuracy improvement were evaluated on four different real datasets containing variable patterns of electricity consumption. The clustering accuracy with the proposed feature extraction method from the clipped representation was evaluated on 85 time series datasets from a large public repository. The results of experiments proved the stability of the proposed ClipStream in the sense of improving forecasting accuracy and showed the suitability of the proposed representation in many tested applications.
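The clipped representation at the core of this approach can be sketched in a few lines: each value in a stream window becomes a bit indicating whether it lies above the window mean, and simple interpretable features are then extracted from the bit string. The two features below are illustrative examples, not necessarily ClipStream's exact feature set.

```python
# Clipping a window of a consumption stream: each value becomes 1 if it is
# above the window mean, else 0. The two features extracted afterwards
# (proportion of ones, number of runs) are illustrative examples of
# interpretable clipped-representation features, not ClipStream's exact set.
import numpy as np

def clip_window(window):
    return (window > window.mean()).astype(int)

def clipped_features(bits):
    prop_ones = bits.mean()                          # share of above-average readings
    n_runs = 1 + int(np.sum(bits[1:] != bits[:-1]))  # alternations between 0 and 1
    return {"prop_ones": float(prop_ones), "n_runs": n_runs}

window = np.array([0.2, 0.3, 1.5, 1.7, 1.6, 0.4, 0.1, 1.9])
bits = clip_window(window)
print(bits, clipped_features(bits))
```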

Journal ArticleDOI
TL;DR: An exponential-time dynamic program for computing a global minimum of the Fréchet function is proposed and an exact polynomial-time algorithm for the special case of binary time series is presented.
Abstract: Averaging time series under dynamic time warping is an important tool for improving nearest-neighbor classifiers and formulating centroid-based clustering. The most promising approach poses time series averaging as the problem of minimizing a Fréchet function. Minimizing the Fréchet function is NP-hard and so far solved by several heuristics and inexact strategies. Our contributions are as follows: we first discuss some inaccuracies in the literature on exact mean computation in dynamic time warping spaces. Then we propose an exponential-time dynamic program for computing a global minimum of the Fréchet function. The proposed algorithm is useful for benchmarking and evaluating known heuristics. In addition, we present an exact polynomial-time algorithm for the special case of binary time series. Based on the proposed exponential-time dynamic program, we empirically study properties like uniqueness and length of a mean, which are of interest for devising better heuristics. Experimental evaluations indicate substantial deficits of state-of-the-art heuristics in terms of their output quality.
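The objective in question is commonly written as the mean of squared DTW distances from a candidate mean to the sample series; the sketch below only evaluates that Fréchet function for a naive candidate (the exact algorithms in the paper search for its global minimiser).

```python
# Evaluating the Frechet function F(z) = (1/n) * sum_i dtw(z, x_i)^2 for a
# candidate mean z. Exact mean computation searches for the z minimising F;
# this sketch only evaluates the objective (quadratic-time DTW, no window).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def frechet_value(candidate, sample):
    return float(np.mean([dtw_distance(candidate, x) ** 2 for x in sample]))

sample = [np.sin(np.linspace(0, 2 * np.pi, 60) + s) for s in (0.0, 0.3, 0.6)]
candidate = np.mean(sample, axis=0)       # arithmetic mean as a naive candidate
print("Frechet function value:", frechet_value(candidate, sample))
```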

Journal ArticleDOI
TL;DR: The proposed method, called EACD, offers both implicit and explicit mechanisms to deal with concept drifts, and its performance is compared with that of the state-of-the-art algorithms using the immediate and delayed prequential evaluation methods.
Abstract: This paper presents a novel ensemble learning method based on evolutionary algorithms to cope with different types of concept drifts in non-stationary data stream classification tasks. In ensemble learning, multiple learners forming an ensemble are trained to obtain a better predictive performance compared to that of a single learner, especially in non-stationary environments, where data evolve over time. The evolution of data streams can be viewed as a problem of changing environment, and evolutionary algorithms offer a natural solution to this problem. The method proposed in this paper uses random subspaces of features from a pool of features to create different classification types in the ensemble. Each such type consists of a limited number of classifiers (decision trees) that have been built at different times over the data stream. An evolutionary algorithm (replicator dynamics) is used to adapt to different concept drifts; it allows the types with a higher performance to increase and those with a lower performance to decrease in size. Genetic algorithm is then applied to build a two-layer architecture based on the proposed technique to dynamically optimise the combination of features in each type to achieve a better adaptation to new concepts. The proposed method, called EACD, offers both implicit and explicit mechanisms to deal with concept drifts. A set of experiments employing four artificial and five real-world data streams is conducted to compare its performance with that of the state-of-the-art algorithms using the immediate and delayed prequential evaluation methods. The results demonstrate favourable performance of the proposed EACD method in different environments.
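The replicator-dynamics step that adapts the ensemble can be stated compactly: each type's share grows in proportion to its fitness relative to the population average. The shares and fitness values below are illustrative; in EACD the fitness would come from recent predictive performance on the stream.

```python
# One replicator-dynamics step: the share p_i of ensemble type i grows in
# proportion to its fitness f_i relative to the population-average fitness.
# Shares and fitness values are illustrative; in EACD, fitness would come
# from recent predictive performance on the data stream.
import numpy as np

def replicator_step(shares, fitness):
    shares = np.asarray(shares, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    avg_fitness = float(shares @ fitness)
    new_shares = shares * fitness / avg_fitness
    return new_shares / new_shares.sum()        # guard against rounding drift

shares = [0.25, 0.25, 0.25, 0.25]               # four classification types
fitness = [0.80, 0.60, 0.90, 0.70]              # e.g. recent accuracies
print(replicator_step(shares, fitness))         # better-performing types gain share
```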

Journal ArticleDOI
Jihoi Park, Kihwan Nam
TL;DR: This research is the first of its kind to consider recommendation quantity and repetitive recommendations when creating group recommender systems, and proposes a user-based recommender capable of determining the best-suited recommendation items for each offline store.
Abstract: We propose a group recommender system considering the recommendation quantity and repeat purchasing by using the existing collaborative filtering algorithm in order to optimize the offline physical store inventories. This research is the first of its kind to consider recommendation quantity and repetitive recommendations when creating group recommender systems. In offline stores, physical limitations result in the ability to display only a limited number of items. The quantity and selection of items are important decisions for offline stores. In this paper, we suggest applying the user-based recommender system, which is capable of determining the best-suited recommendation items for each store. This model is evaluated by the MAE, precision, recall, and F1 measures, and shows higher performance than the baseline model. A new performance evaluation measure is also suggested in this research. New quantity precision, quantity recall, and quantity F1 measures consider a penalty for a shortage or excess of the recommendation quantity. Novelty is defined as the proportion of items that the consumer may not have experienced in the recommendation list. Through the use of this novelty measure, we assess the new profit creation effect of the suggested model. Finally, whereas previous research focused on recommendations for online customers, we expand the recommender system to incorporate offline stores. This research is not only an academic contribution to the marketing field, but also a practical contribution to offline stores through the usability of the developed offline shopping algorithm.

Journal ArticleDOI
TL;DR: This paper provides definitions for density over multiple graph snapshots that capture different semantics of connectedness over time, and proposes a set of efficient algorithms to solve the Best Friends Forever problem.
Abstract: Graphs form a natural model for relationships and interactions between entities, for example, between people in social and cooperation networks, servers in computer networks, or tags and words in documents and tweets. But which of these relationships or interactions are the most lasting ones? In this paper, we study the following problem: given a set of graph snapshots, which may correspond to the state of an evolving graph at different time instances, identify the set of nodes that are the most densely connected in all snapshots. We call this problem the Best Friends Forever (BFF) problem. We provide definitions for density over multiple graph snapshots that capture different semantics of connectedness over time, and we study the corresponding variants of the BFF problem. We then look at the On–Off BFF (O2BFF) problem that relaxes the requirement of nodes being connected in all snapshots, and asks for the densest set of nodes in at least k of a given set of graph snapshots. We show that this problem is NP-complete for all definitions of density, and we propose a set of efficient algorithms. Finally, we present experiments with synthetic and real datasets that show both the efficiency of our algorithms and the usefulness of the BFF and the O2BFF problems.
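For one natural density definition (average degree aggregated over snapshots), a greedy peeling heuristic gives the flavour of the algorithms involved: repeatedly drop the node with the smallest total degree and remember the best node set seen. The paper studies several density semantics and the relaxed O2BFF variant, so the sketch below is only one illustrative instantiation.

```python
# Greedy peeling sketch for finding a dense common subgraph across snapshots,
# using one natural density definition: the sum over snapshots of average
# degree inside the node set. The paper studies several density semantics;
# this is only one illustrative instantiation.
def density(nodes, snapshots):
    if not nodes:
        return 0.0
    total_edges = sum(sum(1 for u, v in snap if u in nodes and v in nodes)
                      for snap in snapshots)
    return 2.0 * total_edges / len(nodes)

def peel(snapshots):
    nodes = set().union(*[{u for e in snap for u in e} for snap in snapshots])
    best_nodes, best_density = set(nodes), density(nodes, snapshots)
    while len(nodes) > 1:
        # remove the node with the smallest total degree across snapshots
        degree = {v: sum(sum(1 for u, w in snap if v in (u, w) and u in nodes and w in nodes)
                         for snap in snapshots) for v in nodes}
        nodes = nodes - {min(degree, key=degree.get)}
        d = density(nodes, snapshots)
        if d > best_density:
            best_nodes, best_density = set(nodes), d
    return best_nodes, best_density

snapshots = [
    [(1, 2), (2, 3), (1, 3), (3, 4)],      # snapshot at time t1 (edge list)
    [(1, 2), (2, 3), (1, 3), (4, 5)],      # snapshot at time t2
]
print(peel(snapshots))                      # -> ({1, 2, 3}, 4.0)
```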

Journal ArticleDOI
TL;DR: This work develops methods that can efficiently sample a graph without the necessity of UNI but still enjoy similar benefits to RWwJ, and proposes a series of new graph sampling techniques by exploiting such a two-layered network structure to estimate target graph characteristics.
Abstract: Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from the slow mixing problem when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without the necessity of UNI but still enjoy similar benefits to RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and they together form a better connected two-layered network structure. This new viewpoint brings extra benefits to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques by exploiting such a two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques.

Journal ArticleDOI
TL;DR: The spatial leave-pair-out cross-validation method is introduced, which corrects for both of these biases simultaneously and is used to benchmark a number of classification methods on mineral prospectivity mapping data from the Central Lapland greenstone belt.
Abstract: Machine learning based classification methods are widely used in geoscience applications, including mineral prospectivity mapping. Typical characteristics of the data, such as a small number of positive instances, imbalanced class distributions and a lack of verified negative instances, make ROC analysis and cross-validation natural choices for classifier evaluation. However, recent literature has identified two sources of bias that can affect the reliability of area under ROC curve estimation via cross-validation on spatial data. The pooling procedure performed by methods such as leave-one-out can introduce a substantial negative bias to results. At the same time, spatial dependencies leading to spatial autocorrelation can result in overoptimistic results, if not corrected for. In this work, we introduce the spatial leave-pair-out cross-validation method, which corrects for both of these biases simultaneously. The methodology is used to benchmark a number of classification methods on mineral prospectivity mapping data from the Central Lapland greenstone belt. The evaluation highlights the dangers of obtaining misleading results on spatial data and demonstrates how these problems can be avoided. Further, the results show the advantages of simple linear models for this classification task.
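The evaluation idea can be sketched with synthetic coordinates: for every (positive, negative) test pair, discard training points within a spatial buffer of either test point, fit on the remainder, and record whether the positive is ranked above the negative; averaging these indicators yields an AUC estimate. The buffer radius, covariates and model below are made-up stand-ins.

```python
# Schematic sketch of spatial leave-pair-out cross-validation: for every
# (positive, negative) pair, remove training points lying within a spatial
# buffer of either test point, fit on the remainder, and record whether the
# positive instance is ranked above the negative one. The mean of these
# indicators estimates AUC. Coordinates, buffer radius and model are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(120, 2))                 # spatial locations
X = np.hstack([coords, rng.normal(size=(120, 3))])         # geochemistry-style covariates
y = (coords[:, 0] + rng.normal(scale=2, size=120) > 6).astype(int)

radius = 1.5
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
wins, pairs = 0.0, 0

for i in pos[:10]:                                         # subsample pairs to keep it quick
    for j in neg[:10]:
        d_pos = np.linalg.norm(coords - coords[[i]], axis=1)
        d_neg = np.linalg.norm(coords - coords[[j]], axis=1)
        train = np.flatnonzero((d_pos > radius) & (d_neg > radius))  # spatial buffer
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        s_pos, s_neg = clf.predict_proba(X[[i, j]])[:, 1]
        wins += (s_pos > s_neg) + 0.5 * (s_pos == s_neg)
        pairs += 1

print("spatial leave-pair-out AUC estimate:", wins / pairs)
```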

Journal ArticleDOI
TL;DR: FURL as discussed by the authors is a memory-efficient and accurate local triangle counting method for multigraph streams, which improves accuracy by reducing the variance of its estimation via a regularization strategy, and sampling more triangles than the state-of-the-art methods do, by using its memory space efficiently.
Abstract: Given a multigraph stream (e.g., Facebook messages or network traffic) where duplicate edges arrive continuously, how can we accurately and memory-efficiently estimate local triangles for all nodes? Local triangle counting in a graph stream is one of the fundamental tasks in graph mining with important applications including anomaly detection, social role identification, community detection, etc. Many recent graph streams include duplicate edges, hence form multigraph streams: e.g., many network packets might have the same (source, destination) pair. Although there have been several local triangle counting methods for multigraph streams, they have problems in terms of accuracy and memory efficiency; furthermore, most methods support either binary or weighted counting, and thus cannot find anomalies whose detection requires both types of counting. In this paper, we propose FURL, a memory-efficient and accurate local triangle counting method for multigraph streams. FURL has two main advantages. First, FURL improves accuracy by (1) reducing the variance of its estimation via a regularization strategy, and (2) sampling more triangles than the state-of-the-art methods do, by using its memory space efficiently. Second, FURL finds anomalies which state-of-the-art methods cannot discover. Experimental results show that FURL outperforms state-of-the-art methods in terms of accuracy and memory efficiency. Thanks to FURL, we discover interesting anomalies from a Bitcoin network.

Journal ArticleDOI
TL;DR: This work defines and investigates the new problem of mining subjectively interesting trees connecting a set of query vertices in a graph, i.e., trees that are highly surprising to the specific user at hand, and proposes heuristic algorithms to find the best trees efficiently.
Abstract: Consider a large graph or network, and a user-provided set of query vertices between which the user wishes to explore relations. For example, a researcher may want to connect research papers in a citation network, an analyst may wish to connect organized crime suspects in a communication network, or an internet user may want to organize their bookmarks given their location in the world wide web. A natural way to do this is to connect the vertices in the form of a tree structure that is present in the graph. However, in sufficiently dense graphs, most such trees will be large or somehow trivial (e.g. involving high degree vertices) and thus not insightful. Extending previous research, we define and investigate the new problem of mining subjectively interesting trees connecting a set of query vertices in a graph, i.e., trees that are highly surprising to the specific user at hand. Using information theoretic principles, we formalize the notion of interestingness of such trees mathematically, taking into account certain prior beliefs the user has specified about the graph. A remaining problem is efficiently fitting a prior belief model. We show how this can be done for a large class of prior beliefs. Given a specified prior belief model, we then propose heuristic algorithms to find the best trees efficiently. An empirical validation of our methods on large real graphs evaluates the different heuristics and validates the interestingness of the resulting trees.

Journal ArticleDOI
TL;DR: A new model selection criterion based on the minimum description length principle, named the decomposed normalized maximum likelihood (DNML) criterion, can be applied to a large class of hierarchical latent variable models, such as naive Bayes models, stochastic block models, latent Dirichlet allocations and Gaussian mixture models.
Abstract: We propose a new model selection criterion based on the minimum description length principle, named the decomposed normalized maximum likelihood (DNML) criterion. Our criterion can be applied to a large class of hierarchical latent variable models, such as naive Bayes models, stochastic block models, latent Dirichlet allocations and Gaussian mixture models, to which many conventional information criteria cannot be straightforwardly applied due to non-identifiability of latent variable models. Our method also has the advantage that it can be evaluated exactly, without asymptotic approximation, with small time complexity. We theoretically justify DNML in terms of hierarchical minimax regret and estimation optimality. Our experiments using synthetic data and benchmark data demonstrate the validity of our method in terms of computational efficiency and model selection accuracy. We show that our criterion especially dominates other existing criteria when the sample size is small and when data are noisy.

Journal ArticleDOI
TL;DR: A Hidden Hierarchical Matrix Factorization technique learns the hidden hierarchical structure from the user-item rating records and outperforms existing methods, demonstrating that the discovery of latent hierarchical structures indeed improves the quality of recommendation.
Abstract: Matrix factorization (MF) is one of the most powerful techniques used in recommender systems. MF models the (user, item) interactions behind historical explicit or implicit ratings. Standard MF does not capture the hierarchical structural correlations, such as publisher and advertiser in advertisement recommender systems, or the taxonomy (e.g., tracks, albums, artists, genres) in music recommender systems. There are a few hierarchical MF approaches, but they require the hierarchical structures to be known beforehand. In this paper, we propose a Hidden Hierarchical Matrix Factorization (HHMF) technique, which learns the hidden hierarchical structure from the user-item rating records. HHMF does not require the prior knowledge of hierarchical structure; hence, as opposed to existing hierarchical MF methods, HHMF can be applied when this information is either explicit or implicit. According to our extensive experiments, HHMF outperforms existing methods, demonstrating that the discovery of latent hierarchical structures indeed improves the quality of recommendation.
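For reference, the flat matrix factorization that hierarchical variants such as HHMF build on can be sketched as latent user and item vectors fitted to observed ratings by SGD with L2 regularisation; the hierarchy-learning component itself is not shown and the data are toy values.

```python
# Minimal flat matrix factorization by SGD (squared error + L2 regularisation).
# This is the base model that hierarchical variants such as HHMF extend by
# additionally learning a latent hierarchy; that part is not sketched here.
import numpy as np

def factorize(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, k))      # user factors
    Q = 0.1 * rng.normal(size=(n_items, k))      # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Toy (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print("predicted rating for user 0, item 2:", round(float(P[0] @ Q[2]), 2))
```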

Journal ArticleDOI
TL;DR: A mixture model, SparseMix, is proposed for clustering of sparse high dimensional binary data, connecting model-based with centroid-based clustering; every group is described by a representative and a probability distribution modeling dispersion from this representative.
Abstract: Clustering is one of the fundamental tools for preliminary analysis of data. While most of the clustering methods are designed for continuous data, sparse high-dimensional binary representations became very popular in various domains such as text mining or cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity as well as in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering of sparse high dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix: is specially designed for the processing of sparse data; can be efficiently realized by an on-line Hartigan optimization algorithm; describes every cluster by the most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with reference grouping than related methods. Moreover, constructed representatives often better reveal the internal structure of data.

Journal ArticleDOI
TL;DR: In this paper, explicit and implicit feature maps for graph kernels and graph properties are derived and applied to large-scale problems, including graph invariant kernels and random walk, shortest-path and subgraph matching kernels.
Abstract: Non-linear kernel methods can be approximated by fast linear ones using suitable explicit feature maps allowing their application to large scale problems. We investigate how convolution kernels for structured data are composed from base kernels and construct corresponding feature maps. On this basis we propose exact and approximative feature maps for widely used graph kernels based on the kernel trick. We analyze for which kernels and graph properties computation by explicit feature maps is feasible and actually more efficient. In particular, we derive approximative, explicit feature maps for state-of-the-art kernels supporting real-valued attributes including the GraphHopper and graph invariant kernels. In extensive experiments we show that our approaches often achieve a classification accuracy close to the exact methods based on the kernel trick, but require only a fraction of their running time. Moreover, we propose and analyze algorithms for computing random walk, shortest-path and subgraph matching kernels by explicit and implicit feature maps. Our theoretical results are confirmed experimentally by observing a phase transition when comparing running time with respect to label diversity, walk lengths and subgraph size, respectively.

Journal ArticleDOI
TL;DR: A deeply supervised architecture that jointly learns the semantic embeddings of a query and an ad as well as their corresponding CTR is proposed and a novel cohort negative sampling technique for learning implicit negative signals is proposed.
Abstract: In sponsored search it is critical to match ads that are relevant to a query and to accurately predict their likelihood of being clicked. Commercial search engines typically use machine learning models for both query-ad relevance matching and click-through-rate (CTR) prediction. However, matching models are based on the similarity between a query and an ad, ignoring the fact that a retrieved ad may not attract clicks, while click models rely on click history, limiting their use for new queries and ads. We propose a deeply supervised architecture that jointly learns the semantic embeddings of a query and an ad as well as their corresponding CTR. We also propose a novel cohort negative sampling technique for learning implicit negative signals. We trained the proposed architecture using one billion query-ad pairs from a major commercial web search engine. This architecture improves the best-performing baseline deep neural architectures by 2% in AUC for CTR prediction and by a statistically significant 0.5% in NDCG for query-ad matching.

Journal ArticleDOI
TL;DR: Unsupervised matrix-factorization-based dimensionality reduction techniques are popularly used for feature engineering with the goal of improving the generalization performance of predictive models, especially with massive, sparse feature sets; this paper experimentally compares their effect on predictive performance with that of well-tuned supervised regularization.
Abstract: Unsupervised matrix-factorization-based dimensionality reduction (DR) techniques are popularly used for feature engineering with the goal of improving the generalization performance of predictive models, especially with massive, sparse feature sets. Often DR is employed for the same purpose as supervised regularization and other forms of complexity control: exploiting a bias/variance tradeoff to mitigate overfitting. Contradicting this practice, there is consensus among existing expert guidelines that supervised regularization is a superior way to improve predictive performance. However, these guidelines are not always followed for this sort of data, and it is not unusual to find DR used with no comparison to modeling with the full feature set. Further, the existing literature does not take into account that DR and supervised regularization are often used in conjunction. We experimentally compare binary classification performance using DR features versus the original features under numerous conditions: using a total of 97 binary classification tasks, 6 classifiers, 3 DR techniques, and 4 evaluation metrics. Crucially, we also experiment using varied methodologies to tune and evaluate various key hyperparameters. We find a very clear, but nuanced result. Using state-of-the-art hyperparameter-selection methods, applying DR does not add value beyond supervised regularization, and can often diminish performance. However, if regularization is not done well (e.g., one just uses the default regularization parameter), DR does have relatively better performance—but these approaches result in lower performance overall. These latter results provide an explanation for why practitioners may be continuing to use DR without undertaking the necessary comparison to using the original features. However, this practice seems generally wrongheaded in light of the main results, if the goal is to maximize generalization performance.
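The comparison studied here can be reproduced in miniature: tune a regularised linear model on the full sparse features and compare it with the same model fed SVD-reduced features, evaluating both by cross-validated AUC. The dataset and hyperparameter grids below are toy stand-ins for the paper's much larger experimental design.

```python
# Miniature version of the comparison studied in the paper: a tuned, regularised
# linear model on the full sparse features versus the same model on SVD-reduced
# features. The dataset and hyperparameter grids are toy stand-ins.
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = sparse.random(1000, 2000, density=0.01, random_state=0, format="csr")
w = rng.normal(size=2000) * (rng.random(2000) < 0.05)      # sparse true weights
y = (np.asarray(X @ w).ravel() + 0.1 * rng.normal(size=1000) > 0).astype(int)

full = GridSearchCV(LogisticRegression(max_iter=2000),
                    {"C": [0.01, 0.1, 1, 10]}, scoring="roc_auc", cv=5)

reduced = GridSearchCV(
    Pipeline([("svd", TruncatedSVD(random_state=0)),
              ("clf", LogisticRegression(max_iter=2000))]),
    {"svd__n_components": [10, 50, 100], "clf__C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc", cv=5)

for name, model in [("full features + tuned regularization", full),
                    ("SVD features + tuned regularization", reduced)]:
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=3).mean()
    print(f"{name}: AUC = {auc:.3f}")
```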

Journal ArticleDOI
TL;DR: This paper shows that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification and builds upon this view to bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques.
Abstract: Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.

Journal ArticleDOI
TL;DR: The VAN model is extended to incorporate social homophily so as to further enhance its modeling power, and its performance is robust under different parameter settings.
Abstract: Modeling user check-in behavior helps us gain useful insights about venues as well as the users visiting them. These insights are important in urban planning and recommender system applications. Since check-in behavior is the result of multiple factors, this paper focuses on studying two venue related factors, namely, area attraction and neighborhood competition. The former refers to the ability of a spatial area covering multiple venues to collectively attract check-ins from users, while the latter represents the extent to which a venue can compete with other venues in the same area for check-ins. We first embark on empirical studies to ascertain the two factors using three datasets gathered from users and venues of three major cities, Singapore, Jakarta and New York City. We then propose the visitation by area attractiveness and neighborhood competition (VAN) model incorporating area attraction and neighborhood competition factors. Our VAN model is also extended to incorporate social homophily so as to further enhance its modeling power. We evaluate the VAN model using real world datasets against various state-of-the-art baselines. The results show that the VAN model outperforms the baselines in the check-in prediction task and that its performance is robust under different parameter settings.

Journal ArticleDOI
TL;DR: A new class of metrics on sets, vectors, and functions is proposed that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation; the new metrics are proved to be complete and are shown to have useful relationships with f-divergences for probability distributions.
Abstract: We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski (L^p) family, the fractional L^p family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
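One of the classical metrics this family unifies, the Marczewski-Steinhaus (generalised Jaccard) distance on non-negative vectors, is easy to state and implement; the sketch below shows that classical metric rather than the paper's new distance functions.

```python
# The classical Marczewski-Steinhaus (generalised Jaccard) distance on
# non-negative vectors, one of the metrics the paper's new family unifies:
# d(x, y) = sum_i |x_i - y_i| / sum_i max(x_i, y_i), with d := 0 if both are zero.
# This is the classical metric, not the paper's new distance functions.
import numpy as np

def marczewski_steinhaus(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    return 0.0 if denom == 0 else float(np.abs(x - y).sum() / denom)

# On binary indicator vectors it reduces to the Jaccard distance on sets.
a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])
print(marczewski_steinhaus(a, b))          # 2/4 = 0.5, same as the Jaccard distance
```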

Journal ArticleDOI
TL;DR: In this paper, the authors formulate the design of robust active attacks as an optimisation problem and give definitions of robustness for different stages of the active attack strategy, and experimentally show that the new robust attacks are considerably more resilient than the original ones, while remaining at the same level of feasibility.
Abstract: In order to prevent the disclosure of privacy-sensitive data, such as names and relations between users, social network graphs have to be anonymised before publication. Naive anonymisation of social network graphs often consists in deleting all identifying information of the users, while maintaining the original graph structure. Various types of attacks on naively anonymised graphs have been developed. Active attacks form a special type of such privacy attacks, in which the adversary enrols a number of fake users, often called sybils, to the social network, allowing the adversary to create unique structural patterns later used to re-identify the sybil nodes and other users after anonymisation. Several studies have shown that adding a small amount of noise to the published graph already suffices to mitigate such active attacks. Consequently, active attacks have been dubbed a negligible threat to privacy-preserving social graph publication. In this paper, we argue that these studies unveil shortcomings of specific attacks, rather than inherent problems of active attacks as a general strategy. In order to support this claim, we develop the notion of a robust active attack, which is an active attack that is resilient to small perturbations of the social network graph. We formulate the design of robust active attacks as an optimisation problem and we give definitions of robustness for different stages of the active attack strategy. Moreover, we introduce various heuristics to achieve these notions of robustness and experimentally show that the new robust attacks are considerably more resilient than the original ones, while remaining at the same level of feasibility.