Showing papers by "James C. Bezdek" published in 2017


Proceedings ArticleDOI
01 Apr 2017
TL;DR: The authors propose a memory-efficient incremental local outlier detection algorithm (MiLOF) for data streams and a more flexible version (MiLOF_F); both achieve accuracy close to that of iLOF within a fixed memory bound.
Abstract: Outlier detection is an important task in data mining. With the growing need to analyze high-speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version (called iLOF), it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams and a more flexible version (MiLOF_F); both achieve accuracy close to that of iLOF within a fixed memory bound. In addition, MiLOF_F is robust to changes in the number of data points, underlying clusters, and dimensions in the data stream.
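For readers unfamiliar with the score that iLOF and MiLOF approximate, the following is a minimal batch sketch of the Local Outlier Factor. It is not the incremental or memory-bounded algorithm from the paper: it keeps all points in memory, and the function name `lof_scores` is hypothetical. As a simplification, each neighbourhood is truncated to exactly k points instead of including all ties at the k-distance.

```python
import math

def lof_scores(points, k):
    """Batch LOF for a small list of points (tuples of floats).

    Simplified sketch: keeps every point in memory and truncates each
    neighbourhood to exactly k points. Duplicate points can produce an
    infinite density and are not handled specially.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_distance = [], []
    for i in range(n):
        # Indices of the other points, nearest first (ties broken by index).
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        mean_reach = sum(reach) / k
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: average neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]
```

Scores near 1 indicate a point whose density matches that of its neighbours; scores well above 1 indicate a likely outlier. For example, with the four corners of a unit square plus the point (10, 10) and k = 2, the isolated point receives by far the largest score.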

55 citations


Journal ArticleDOI
TL;DR: It is shown that the novel two-stage clusiVAT approach can produce natural and informative trajectory clusters on this real-life dataset while finding representative anomalies.
Abstract: This paper proposes a novel application of Visual Assessment of Tendency (VAT)-based hierarchical clustering algorithms (VAT, iVAT, and clusiVAT) for trajectory analysis. We introduce a new clustering-based anomaly detection framework, named iVAT+ and clusiVAT+, and use it for trajectory anomaly detection. This approach is based on partitioning the VAT-generated Minimum Spanning Tree using an efficient thresholding scheme. The trajectories are classified as normal or anomalous based on the number of paths in the clusters. On synthetic datasets with fixed and variable numbers of clusters and anomalies, we achieve 98% classification accuracy. Our two-stage clusiVAT method is applied to 26,039 trajectories of vehicles and pedestrians from a parking lot scene in the real-life MIT trajectories dataset. The first stage clusters the trajectories ignoring directionality. The second stage divides the clusters obtained from the first stage by considering trajectory direction. We show that our novel two-stage clusiVAT approach can produce natural and informative trajectory clusters on this real-life dataset while finding representative anomalies.
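The VAT reordering that the abstract builds on can be sketched as follows. This is only the basic ordering step (a Prim-style traversal of the dissimilarity matrix), assuming a symmetric matrix given as a list of lists; it does not include the iVAT transform, the clusiVAT sampling, or the paper's thresholding of the spanning tree. The names `vat_order` and `reorder` are hypothetical.

```python
def vat_order(D):
    """Return the VAT ordering for a symmetric dissimilarity matrix D."""
    n = len(D)
    # Start at one endpoint of the largest dissimilarity in D.
    first = max(range(n), key=lambda i: max(D[i]))
    order = [first]
    remaining = set(range(n)) - {first}
    while remaining:
        # Prim-style step: pick the unvisited object closest to the visited
        # set. Iterating in sorted order makes tie-breaking deterministic.
        nxt = min(sorted(remaining), key=lambda j: min(D[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def reorder(D, order):
    """Permute rows and columns of D so that clusters appear as dark blocks."""
    return [[D[i][j] for j in order] for i in order]
```

Displaying the reordered matrix as an image is what gives VAT its visual character: compact clusters show up as dark blocks along the diagonal.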

54 citations


Journal ArticleDOI
TL;DR: A normalized version of the soft mutual information cluster validity index (NMIsM) is advocated as the best overall choice, as it outperforms the other seven indices for both FCM and EM according to tests on synthetic and real data.
Abstract: Previously, eight popular information-theoretic-based cluster validity indices have been generalized and tested for probabilistic partitions built by the expectation-maximization (EM) algorithm for the Gaussian mixture model. However, the analysis was limited to probabilistic clusters, and there were limited explanations for differences in the performance of the indices. In this paper, we extend the tests to partitions found by fuzzy c-means (FCM) and provide further explanations and insights about the performance of these indices. Of the eight generalized indices, we advocate a normalized version of the soft mutual information cluster validity index (NMIsM) as the best overall choice, as it outperforms the other seven indices for both FCM and EM according to our tests on synthetic and real data. The superiority of NMIsM is most pronounced for datasets with overlapped and/or varying-sized clusters. Finally, we provide a theoretical analysis, which helps explain the superior performance of NMIsM compared with the other three normalizations of soft mutual information.
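As a rough illustration of how mutual information can be extended to soft partitions, the sketch below builds a soft contingency table from two membership matrices and normalizes the resulting mutual information. Two points are assumptions that the abstract does not confirm: the contingency table is taken as the product of the two membership matrices, and the normalizer is the larger of the two marginal entropies. The function name `soft_nmi_max` is hypothetical.

```python
import math

def soft_nmi_max(U, V):
    """Normalized mutual information between two soft partitions.

    U and V are lists of rows, one row per cluster, one column per object;
    each column is assumed to sum to 1. The max-entropy normalization is an
    illustrative choice, not necessarily the one used in the paper.
    """
    n = len(U[0])
    # Soft contingency table U V^T, scaled to joint probabilities.
    P = [[sum(u[k] * v[k] for k in range(n)) / n for v in V] for u in U]
    row = [sum(r) for r in P]
    col = [sum(P[i][j] for i in range(len(U))) for j in range(len(V))]

    mi = sum(
        P[i][j] * math.log(P[i][j] / (row[i] * col[j]))
        for i in range(len(U))
        for j in range(len(V))
        if P[i][j] > 0
    )
    h_u = -sum(p * math.log(p) for p in row if p > 0)
    h_v = -sum(p * math.log(p) for p in col if p > 0)
    denom = max(h_u, h_v)
    return mi / denom if denom > 0 else 0.0
```

With crisp memberships this reduces to the usual contingency-table computation, so two identical crisp partitions score 1, while a completely uninformative partition (all memberships equal) scores 0 against any other.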

35 citations


Journal ArticleDOI
TL;DR: In this paper, a new type of bias, ground truth (GT) bias, is identified; it arises from the distribution of the ground truth (reference) partition against which candidate partitions are compared. Theoretical explanations are given for why and when GT bias happens.

31 citations


Journal ArticleDOI
TL;DR: This paper generalizes an online efficient anomaly detection technique called iterative data capture anomaly detection to adapt to changes in the data stream by exponentially weighting past observations and illustrates the efficiency and accuracy of the approach compared to existing methods.
Abstract: Efficient localized data modeling techniques in Internet of Things (IoT) applications enable the nodes to change their behavior upon observing events of interest. Additionally, battery-powered IoT nodes can conserve their energy resources by limiting their data communications to specific events. Despite the real-time nature of the data collected in the IoT and limited memory and computational resources, most of the current data modeling approaches for the IoT involve batch training. Recently, an online efficient anomaly detection technique called iterative data capture anomaly detection has been proposed for environmental sensing and monitoring applications. However, this approach cannot handle changing environments. So far, efforts in extending this algorithm to adapt to changes in the environment have met with limited success. In this paper, we generalize this algorithm to adapt to changes in the data stream by exponentially weighting past observations. We illustrate the proposed algorithm with numerical results on both real-life and simulated data sets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.
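The idea of exponentially weighting past observations can be sketched with a one-dimensional running mean and variance governed by a forgetting factor. This is a simplified illustration and not the authors' update equations: the class name `ForgettingStats`, the default forgetting factor of 0.9, and the three-standard-deviation threshold are all assumptions made for the example.

```python
class ForgettingStats:
    """Exponentially weighted mean and variance of a scalar stream."""

    def __init__(self, forgetting=0.9):
        # forgetting in (0, 1]: values closer to 1 give a longer memory.
        self.lam = forgetting
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            # The first observation initializes the model.
            self.mean = float(x)
            return
        delta = x - self.mean
        # Standard exponentially weighted updates with weight (1 - lam)
        # on the newest observation.
        self.mean += (1.0 - self.lam) * delta
        self.var = self.lam * (self.var + (1.0 - self.lam) * delta * delta)

    def is_anomalous(self, x, threshold=3.0):
        # Hypothetical rule: flag points far from the current mean.
        if self.mean is None or self.var == 0.0:
            return False
        return abs(x - self.mean) > threshold * self.var ** 0.5
```

Because old observations decay geometrically, the model needs constant memory and follows a drifting stream: after the stream shifts to a new level, the mean converges to that level instead of averaging over the entire history.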

7 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, two new fuzzy c-means derivatives, Fuzzy c-Shapes plus (FCS+) and Fuzzy c-Shapes double plus (FCS++), are proposed for clustering time series data.
Abstract: The existence of large volumes of time series data in many applications has motivated data miners to investigate specialized methods for mining time series data. Clustering is a popular data mining method due to its powerful exploratory nature and its usefulness as a preprocessing step for other data mining techniques. This article develops two novel clustering algorithms for time series data that are extensions of a crisp c-shapes algorithm. The two new algorithms are heuristic derivatives of fuzzy c-means (FCM). Fuzzy c-Shapes plus (FCS+) replaces the inner product norm in the FCM model with a shape-based distance function. Fuzzy c-Shapes double plus (FCS++) uses the shape-based distance, and also replaces the FCM cluster centers with shape-extracted prototypes. Numerical experiments on 48 real time series data sets show that the two new algorithms outperform state-of-the-art shape-based clustering algorithms in terms of accuracy and efficiency. Four external cluster validity indices (the Rand index, Adjusted Rand Index, Variation of Information, and Normalized Mutual Information) are used to match candidate partitions generated by each of the studied algorithms. All four indices agree that for these finite waveform data sets, FCS++ gives a small improvement over FCS+, and in turn, FCS+ is better than the original crisp c-shapes method. Finally, we apply two tests of statistical significance to the three algorithms. The Wilcoxon and Friedman statistics both rank the three algorithms in exactly the same way as the four cluster validity indices.
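The substitution described for FCS+ can be sketched by plugging a shape-based distance into the usual fuzzy c-means membership formula. Several points are assumptions that the abstract does not confirm: the distance is taken to be one minus the maximum normalized cross-correlation over all shifts (as in k-Shape), the series are equal-length and are not z-normalized here, and the distance stands in for the squared norm, giving the exponent 1/(m - 1). The names `sbd` and `fuzzy_memberships` are hypothetical.

```python
import math

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation."""
    n = len(x)
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        return 1.0
    # Cross-correlation of x with y shifted by s, over the overlapping part.
    best = max(
        sum(x[i] * y[i - s] for i in range(max(0, s), min(n, n + s)))
        for s in range(-(n - 1), n)
    )
    return 1.0 - best / norm

def fuzzy_memberships(series, prototypes, m=2.0, eps=1e-12):
    """FCM-style memberships with sbd in place of the squared norm."""
    U = []
    for x in series:
        d = [sbd(x, p) for p in prototypes]
        zero = [v <= eps for v in d]
        if any(zero):
            # The series matches a prototype: share membership among matches.
            count = sum(zero)
            U.append([1.0 / count if z else 0.0 for z in zero])
        else:
            power = 1.0 / (m - 1.0)
            U.append([1.0 / sum((di / dj) ** power for dj in d) for di in d])
    return U
```

Because the distance takes the best alignment over all shifts, a series and a time-shifted copy of it are treated as having the same shape, which is the property that distinguishes this family of methods from FCM with a pointwise norm.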

5 citations


Posted Content
TL;DR: An ensemble of base classifiers in this approach is obtained by learning Naive Bayes classifiers on different training sets, which are generated by projecting the original training set to a lower-dimensional space.
Abstract: In this study, we introduce an ensemble-based approach for online machine learning. The ensemble of base classifiers in our approach is obtained by learning Naive Bayes classifiers on different training sets, which are generated by projecting the original training set to a lower-dimensional space. We propose a mechanism to learn sequences of data using the data-chunk paradigm. The experiments conducted on a number of UCI datasets and one synthetic dataset demonstrate that the proposed approach performs significantly better than some well-known online learning algorithms.

4 citations