
Showing papers by "James C. Bezdek" published in 2016


Journal ArticleDOI
TL;DR: A memory-efficient incremental local outlier detection algorithm (MiLOF) for data streams and a more flexible version (MiLOF_F) are proposed; both achieve accuracy close to incremental LOF within a fixed memory bound.
Abstract: Outlier detection is an important task in data mining, with applications ranging from intrusion detection to human gait analysis. With the growing need to analyze high speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version, it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams, and a more flexible version (MiLOF_F), both of which achieve accuracy close to incremental LOF within a fixed memory bound. Our experimental results show that both proposed approaches have better memory and time complexity than incremental LOF while having comparable accuracy. In addition, we show that MiLOF_F is robust to changes in the number of data points, the number of underlying clusters, and the number of dimensions in the data stream. These results show that MiLOF/MiLOF_F are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high volume data streams.
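
The abstract doesn't give MiLOF's summarization details, so the sketch below only illustrates the general fixed-memory idea: score each arriving point against a bounded buffer, and condense the oldest points into cluster centers when the buffer fills. The class name, buffer sizes, and the k-means condensation step are illustrative assumptions, not the authors' algorithm.

```python
# A minimal sketch of the fixed-memory idea (not the authors' MiLOF):
# a bounded buffer of raw points plus a few centroids standing in for history.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

class FixedMemoryLOF:
    def __init__(self, max_points=500, n_neighbors=20, n_summaries=25):
        self.max_points = max_points
        self.n_neighbors = n_neighbors
        self.n_summaries = n_summaries
        self.buffer = []       # recent raw points
        self.summaries = []    # centroids summarizing condensed history

    def score(self, x):
        """LOF-style score for one arriving point (lower = more outlying)."""
        ref = np.array(self.summaries + self.buffer)
        if len(ref) <= self.n_neighbors:   # not enough history yet
            score = 0.0
        else:
            # Refitting per point is slow; a real implementation updates incrementally.
            lof = LocalOutlierFactor(n_neighbors=self.n_neighbors, novelty=True)
            lof.fit(ref)
            score = float(lof.score_samples(x.reshape(1, -1))[0])
        self._insert(x)
        return score

    def _insert(self, x):
        self.buffer.append(x)
        if len(self.buffer) > self.max_points:
            # Condense the oldest half into centroids; memory stays bounded.
            old = np.array(self.buffer[: self.max_points // 2])
            km = KMeans(n_clusters=self.n_summaries, n_init=10).fit(old)
            self.summaries = list(km.cluster_centers_)
            self.buffer = self.buffer[self.max_points // 2 :]

rng = np.random.default_rng(0)
detector = FixedMemoryLOF()
scores = [detector.score(x) for x in rng.normal(size=(2000, 3))]
```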

97 citations


Journal ArticleDOI
TL;DR: A new clusiVAT algorithm is presented and shown to be the fastest and most accurate of the five compared algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4,292,637 samples in 41 dimensions) in 76 s.
Abstract: Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well-known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clustering using representatives (CURE) for numerical comparisons. clusiVAT is based on sampling the data, imaging the reordered distance matrix to estimate the number of clusters in the data visually, clustering the samples using a relative of single linkage (SL), and then noniteratively extending the labels to the rest of the dataset using the nearest prototype rule. Previous work has established that clusiVAT produces true SL clusters in compact-separated data. We have performed experiments to show that k-means and its modified algorithms suffer from initialization issues that cause many failures. On the other hand, clusiVAT needs no initialization, and almost always finds partitions that accurately match ground truth labels in labeled data. CURE also finds SL type partitions but is much slower than the other four algorithms. In our experiments, clusiVAT proves to be the fastest and most accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4,292,637 samples in 41 dimensions) in 76 s.
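
For orientation, here is a rough sketch of the sample-cluster-extend pipeline described above, assuming k is already chosen. clusiVAT's VAT/iVAT imaging step for estimating k and its smarter sampling scheme are replaced by plain random sampling, so this is an illustration of the structure, not the authors' implementation.

```python
# Sample, single-linkage cluster the sample, then extend labels
# non-iteratively with the nearest prototype rule.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, pdist

def sample_cluster_extend(X, k, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    S = X[idx]                                 # random sample of the big data
    Z = linkage(pdist(S), method="single")     # single linkage on the sample
    sample_labels = fcluster(Z, t=k, criterion="maxclust")
    # Nearest prototype rule: each remaining point takes the label
    # of its nearest sampled point.
    nearest = cdist(X, S).argmin(axis=1)
    return sample_labels[nearest]
```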

95 citations


Posted Content
TL;DR: This work identifies a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared, names it GT bias, and presents the first extensive study of such a property for external cluster validity indices.
Abstract: It has been noticed that some external cluster validity indices (CVIs) exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard index (JI) suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature. In this work, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, NCinc bias in the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behaviour, since it may affect the interpretations of CVI results. The objective of this article is to study the empirical and theoretical implications of GT bias. To the best of our knowledge, this is the first extensive study of such a property for external cluster validity indices.
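
The monotone biases described above are easy to reproduce empirically. A minimal sketch: score random candidate partitions with increasing numbers of clusters c against one fixed ground truth, computing RI and JI from pair counts. RI should drift upward with c (NCinc) and JI downward (NCdec); the cluster counts and trial counts here are arbitrary.

```python
# Empirical check of NCinc bias in RI and NCdec bias in JI.
import numpy as np
from sklearn.metrics.cluster import pair_confusion_matrix

rng = np.random.default_rng(1)
n = 1000
truth = rng.integers(0, 5, size=n)         # fixed 5-cluster ground truth

for c in (2, 5, 10, 20):
    ri, ji = [], []
    for _ in range(50):                    # average over random candidates
        cand = rng.integers(0, c, size=n)
        # pair_confusion_matrix counts ordered pairs; ratios are unaffected.
        (tn, fp), (fn, tp) = pair_confusion_matrix(truth, cand)
        ri.append((tp + tn) / (tp + tn + fp + fn))
        ji.append(tp / (tp + fp + fn))
    print(f"c={c:2d}  RI={np.mean(ri):.3f}  JI={np.mean(ji):.3f}")
```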

33 citations


Journal ArticleDOI
TL;DR: A soft generalization of the C index is developed that can be used to evaluate sets of candidate partitions found by either fuzzy or probabilistic clustering algorithms; the sum-min generalization proves to be the second-best performer in the best-c tests and the best performer in the I/E tests on small data.
Abstract: The C index is an internal cluster validity index that was introduced in 1970 as a way to define and identify a “best” crisp partition on n objects represented by either unlabeled feature vectors or dissimilarity matrix data. This index is often one of the better performers among the plethora of internal indices available for this task. This paper develops a soft generalization of the C index that can be used to evaluate sets of candidate partitions found by either fuzzy or probabilistic clustering algorithms. We define four generalizations based on relational transformations of the soft partition and then compare their performance to eight other popular internal fuzzy cluster indices using two methods of comparison (internal “best-c” and internal/external (I/E) “best match”), six synthetic datasets, and six real-world labeled datasets. Our main conclusion is that the sum-min generalization is the second-best performer in the best-c tests and the best performer in the I/E tests on small data.
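
For reference, the classical crisp C index that the paper generalizes can be written compactly: with S the sum of within-cluster pairwise distances over n_w pairs, C = (S - S_min) / (S_max - S_min), where S_min and S_max sum the n_w smallest and largest distances overall; smaller is better. The soft generalizations additionally weight pairs by fuzzy or probabilistic memberships, which this sketch of the crisp baseline does not attempt.

```python
# Crisp C index; values near 0 indicate a good partition.
import numpy as np
from scipy.spatial.distance import pdist

def c_index(X, labels):
    d = pdist(X)                                 # condensed pairwise distances
    i, j = np.triu_indices(len(labels), k=1)     # same pair order as pdist
    within = labels[i] == labels[j]
    n_w = int(within.sum())
    S = d[within].sum()
    d_sorted = np.sort(d)
    S_min, S_max = d_sorted[:n_w].sum(), d_sorted[-n_w:].sum()
    return (S - S_min) / (S_max - S_min)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(f"C index for a compact-separated partition: {c_index(X, labels):.3f}")
```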

32 citations


Journal ArticleDOI
TL;DR: Two new relatives of the visual assessment of tendency (VAT) model, which uses cluster heat maps to visualize structure in static datasets, are developed and exemplified; experiments demonstrate their ability to successfully isolate anomalies and visualize changing cluster structure in streaming data.
Abstract: The growth in pervasive network infrastructure called the Internet of Things (IoT) enables a wide range of physical objects and environments to be monitored in fine spatial and temporal detail. The detailed, dynamic data that are collected in large quantities from sensor devices provide the basis for a variety of applications. Automatic interpretation of these evolving large data is required for timely detection of interesting events. This article develops and exemplifies two new relatives of the visual assessment of tendency (VAT) and improved visual assessment of tendency (iVAT) models, which use cluster heat maps to visualize structure in static datasets. One new model is initialized with a static VAT/iVAT image, and then incrementally (hence inc-VAT/inc-iVAT) updates the current minimal spanning tree (MST) used by VAT with an efficient edge insertion scheme. Similarly, dec-VAT/dec-iVAT efficiently removes a node from the current VAT MST. A sequence of inc-iVAT/dec-iVAT images can be used for (visual) anomaly detection in evolving data streams and for sliding window based cluster assessment for time series data. The methods are illustrated with four real datasets (three of them smart city IoT data). The evaluation demonstrates the algorithms' ability to successfully isolate anomalies and visualize changing cluster structure in the streaming data.
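
The static VAT reordering that both new models update incrementally can be sketched as a Prim-style MST ordering of the dissimilarity matrix; the inc-/dec- edge insertion and node removal schemes themselves are not reproduced here.

```python
# Static VAT: reorder a dissimilarity matrix so clusters show up
# as dark blocks along the diagonal of its image.
import numpy as np

def vat(D):
    n = D.shape[0]
    J = list(range(n))
    i, _ = np.unravel_index(np.argmax(D), D.shape)  # start at one end of the longest edge
    order = [i]
    J.remove(i)
    while J:
        sub = D[np.ix_(order, J)]                   # distances from ordered set to the rest
        j = J[int(np.argmin(sub.min(axis=0)))]      # nearest unordered point (Prim step)
        order.append(j)
        J.remove(j)
    return D[np.ix_(order, order)], order

# Demo: two well-separated clusters appear as two dark diagonal blocks
# when RD is displayed, e.g. with plt.imshow(RD, cmap="gray").
from scipy.spatial.distance import pdist, squareform
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
RD, order = vat(squareform(pdist(X)))
```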

30 citations


Proceedings ArticleDOI
30 Jun 2016
TL;DR: This research presents a novel data clustering algorithm, which exploits the correlation between data points in time to cluster the data, while maintaining a set of decision boundaries to identify noisy or anomalous data.
Abstract: The intrinsic nature of streaming data requires algorithms that are capable of fast data analysis to extract knowledge. Most current unsupervised data analysis techniques rely on the implementation of known batch techniques over a sliding window, which can hinder their utility for the analysis of evolving structure in applications involving large streams of data. This research presents a novel data clustering algorithm, which exploits the correlation between data points in time to cluster the data, while maintaining a set of decision boundaries to identify noisy or anomalous data. We illustrate the proposed algorithm for online clustering with numerical results on both real-life and simulated datasets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.

29 citations


Journal ArticleDOI
TL;DR: Bezdek discusses the historical evolution of the terms AI and CI; the seductive semantics of terms such as machine learning, which owe a heavy debt to our intuitive ideas about intelligence; the evolution of the IEEE Computational Intelligence Society; and the role that buzzwords play in the lives of all researchers.
Abstract: This article is about the terms intelligence, artificial intelligence (AI), and computational intelligence (CI). Topics addressed here include 1) the historical evolution of the terms AI and CI; 2) the seductive semantics of terms such as machine learning, which owe a heavy debt to our intuitive ideas about intelligence; 3) the evolution of the IEEE Computational Intelligence Society; and 4) the role that buzzwords play in the lives of all researchers. I estimate that this article is roughly 40% facts, 10% anecdotes, 15% speculation, and 30% opinions. The other 5%? It's reserved for you to fill in the blank; one option would be "bull." [Parts of this article are excerpted from: J.C. Bezdek (2015), "The History, Philosophy and Development of Computational Intelligence (How a simple tune became a Monster hit)," in Encyclopedia of Life Support Systems (EOLSS), vol. 3, Computational Intelligence, Hisao Ishibuchi, Ed., pp. 1-22. Available: ieee-cis.sightworks.net/documents/History/Bezdekeolss-CI-history.pdf.]

17 citations


Journal ArticleDOI
TL;DR: This paper examines four methods for completing the input data with imputed values before imaging and chooses a best method using contaminated versions of the complete Iris data, for which the desired results are known.
Abstract: The iVAT (asiVAT) algorithms reorder symmetric (asymmetric) dissimilarity data so that an image of the data may reveal cluster substructure. Images formed from incomplete data don't offer a very rich interpretation of cluster structure. In this paper, we examine four methods for completing the input data with imputed values before imaging. We choose a best method using contaminated versions of the complete Iris data, for which the desired results are known. Then, we analyze two incomplete real-world datasets from social networks using the best imputation method chosen in the juried trials with Iris: (i) Sampson's monastery data, an incomplete, asymmetric relation matrix; and (ii) the karate club data, comprising a symmetric similarity matrix that is about 86 percent incomplete.
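
The abstract doesn't name the four imputation schemes, so the following is just one generic possibility for completing a dissimilarity matrix before imaging: fill each missing entry with the mean of the observed entries in its row and column. This is purely illustrative, not one of the paper's candidates.

```python
# Complete a dissimilarity matrix with NaN-marked missing entries.
import numpy as np

def impute_dissimilarity(D):
    D = D.astype(float)
    filled = D.copy()
    fallback = np.nanmean(D)                     # global mean as last resort
    for i, j in zip(*np.where(np.isnan(D))):
        # Only the originally observed entries inform each imputation,
        # so the result does not depend on fill order.
        known = np.concatenate([D[i][~np.isnan(D[i])],
                                D[:, j][~np.isnan(D[:, j])]])
        filled[i, j] = known.mean() if known.size else fallback
    return filled
```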

15 citations


Book ChapterDOI
05 Dec 2016
TL;DR: This paper proposes a novel boosting algorithm for outlier detection called BSS, which sequentially improves the accuracy of each ensemble detector in an unsupervised manner, and discusses the effectiveness of the approach in terms of the bias-variance trade-off.
Abstract: While various ensemble algorithms have been proposed for supervised ensembles or clustering ensembles, there are few ensemble based approaches for outlier detection. The main challenge in this context is the lack of knowledge about the accuracy of the outlier detectors. Hence, none of the proposed approaches has focused on sequential boosting techniques. In this paper, for the first time, we propose a novel boosting algorithm for outlier detection called BSS, in which we sequentially improve the accuracy of each ensemble detector in an unsupervised manner. We discuss the effectiveness of our approach in terms of the bias-variance trade-off. Furthermore, an extended version of BSS (called DBSS) is proposed to introduce a novel source of diversity in outlier ensemble modeling. DBSS is used to analyze the effect of changing the input parameter of BSS on its detection accuracy. Our experimental results on both synthetic and real data sets demonstrate that our approaches outperform two state-of-the-art outlier ensemble algorithms and benefit from bias reduction. In addition, our BSS approach is robust with respect to changes in its input parameter. Since each detector in our proposed BSS/DBSS uses only a subset of the whole dataset, both techniques are well suited to application environments with limited-memory processors (e.g., wireless sensor networks).
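
BSS's sequential weighting rule isn't spelled out in the abstract, so for context here is only a generic, non-sequential subsample-based outlier ensemble: each base detector sees a random subset of the data (the limited-memory property noted above), and scores are averaged. LOF as the base detector and all parameter values are assumptions.

```python
# Generic subsampling outlier ensemble (not the authors' BSS/DBSS).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def subsample_ensemble_scores(X, n_detectors=10, subsample=256, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_detectors):
        idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
        # Each detector is fit on a small random subset only.
        lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X[idx])
        scores += -lof.score_samples(X)     # negate so higher = more outlying
    return scores / n_detectors
```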

13 citations


Proceedings ArticleDOI
24 Jul 2016
TL;DR: This work studies several ways to identify a “good” rogue random projection when the target downspace has dimensions below the JL limit, and uses Pearson and Spearman correlation coefficients and a visual imaging method that usually reveals cluster structure in spaces of any dimension to do this.
Abstract: The Johnson-Lindenstrauss (JL) lemma, with known probability, sets a lower bound q_0 on the dimension for which a random projection of p-dimensional vector data is guaranteed to be within (1±ε) of being an isometry in a randomly projected downspace. We study several ways to identify a "good" rogue random projection (RRP) when the target downspace has dimension below the JL limit. The tools used towards this end are Pearson and Spearman correlation coefficients, and a visual imaging method (a cluster heat map) that usually reveals cluster structure in spaces of any dimension. We use four synthetic data sets and the ubiquitous Iris data to study our procedures for tracking the reliability of RRPs. Unsurprisingly, rogue random projection is quite unpredictable. At its best, it is every bit as good as principal components analysis, but at its worst, it is awful. Pearson and Spearman correlations do signal good and bad projections, but the visual imaging method seems even more effective in determining the quality of RRPs.
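
A minimal sketch of the correlation-based check described here, assuming Gaussian random projections and scikit-learn's JL bound helper; the data sizes and eps value are arbitrary.

```python
# Compare pairwise distances before and after projecting to a
# "rogue" dimension well below the JL lower bound.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr, spearmanr
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

q0 = johnson_lindenstrauss_min_dim(n_samples=len(X), eps=0.3)
print("JL lower bound q_0:", q0)

q = 10                                      # rogue dimension, far below q_0
Y = GaussianRandomProjection(n_components=q, random_state=0).fit_transform(X)

d_up, d_down = pdist(X), pdist(Y)
print("Pearson :", pearsonr(d_up, d_down)[0])
print("Spearman:", spearmanr(d_up, d_down)[0])
```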

8 citations


Proceedings ArticleDOI
21 Apr 2016
TL;DR: This method first summarizes the sensor measurements collected from smart homes of elderly residents into textual statements; dissimilarity between the text summaries is then computed to quantify the comparison.
Abstract: The recent rise in the availability of sensors to monitor our day-to-day lifestyles calls for techniques to glean information out of the generated data. For many of these sensors, the recorded data are periodic in nature. Moreover, it has been shown that a change in this periodic data is correlated with a shift in physical or mental health. We present a technique to compare this repetitive data, both qualitatively and quantitatively. Our method first summarizes the sensor measurements collected from smart homes of elderly residents into textual statements. Dissimilarity between the text summaries is then computed to quantify the comparison.
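
The abstract doesn't specify the dissimilarity measure for the text summaries; one simple stand-in is cosine dissimilarity between bag-of-words vectors, shown here on two made-up weekly summaries.

```python
# Cosine dissimilarity between two illustrative text summaries.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

week1 = "Resident slept 7 hours, left the house twice, kitchen activity normal."
week2 = "Resident slept 4 hours, did not leave the house, little kitchen activity."

V = CountVectorizer().fit_transform([week1, week2])
dissimilarity = 1.0 - cosine_similarity(V[0], V[1])[0, 0]
print(f"dissimilarity between weekly summaries: {dissimilarity:.2f}")
```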
