
Showing papers by "James C. Bezdek" published in 2016


Journal ArticleDOI
TL;DR: A memory-efficient incremental local outlier detection algorithm (MiLOF) for data streams and a more flexible version (MiLOF_F) are proposed; both achieve accuracy close to incremental LOF within a fixed memory bound.
Abstract: Outlier detection is an important task in data mining, with applications ranging from intrusion detection to human gait analysis. With the growing need to analyze high speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version, it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams, and a more flexible version (MiLOF_F), both of which achieve accuracy close to incremental LOF within a fixed memory bound. Our experimental results show that both proposed approaches have better memory and time complexity than incremental LOF while having comparable accuracy. In addition, we show that MiLOF_F is robust to changes in the number of data points, the number of underlying clusters, and the number of dimensions in the data stream. These results show that MiLOF/MiLOF_F are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high volume data streams.
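
The abstract doesn't give MiLOF's summarization details, so the sketch below only illustrates the general fixed-memory idea: score each arriving point against a bounded buffer, and condense the oldest points into cluster centers when the buffer fills. The class name, buffer sizes, and the k-means condensation step are illustrative assumptions, not the authors' algorithm.

```python
# A minimal sketch of the fixed-memory idea (not the authors' MiLOF):
# a bounded buffer of raw points plus a few centroids standing in for history.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

class FixedMemoryLOF:
    def __init__(self, max_points=500, n_neighbors=20, n_summaries=25):
        self.max_points = max_points
        self.n_neighbors = n_neighbors
        self.n_summaries = n_summaries
        self.buffer = []       # recent raw points
        self.summaries = []    # centroids summarizing condensed history

    def score(self, x):
        """LOF-style score for one arriving point (lower = more outlying)."""
        ref = np.array(self.summaries + self.buffer)
        if len(ref) <= self.n_neighbors:   # not enough history yet
            score = 0.0
        else:
            # Refitting per point is slow; a real implementation updates incrementally.
            lof = LocalOutlierFactor(n_neighbors=self.n_neighbors, novelty=True)
            lof.fit(ref)
            score = float(lof.score_samples(x.reshape(1, -1))[0])
        self._insert(x)
        return score

    def _insert(self, x):
        self.buffer.append(x)
        if len(self.buffer) > self.max_points:
            # Condense the oldest half into centroids; memory stays bounded.
            old = np.array(self.buffer[: self.max_points // 2])
            km = KMeans(n_clusters=self.n_summaries, n_init=10).fit(old)
            self.summaries = list(km.cluster_centers_)
            self.buffer = self.buffer[self.max_points // 2 :]

rng = np.random.default_rng(0)
detector = FixedMemoryLOF()
scores = [detector.score(x) for x in rng.normal(size=(2000, 3))]
```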

97 citations


Journal ArticleDOI
TL;DR: A new clusiVAT algorithm is presented and shown to be the fastest and most accurate of the five compared algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4,292,637 samples in 41 dimensions) in 76 s.
Abstract: Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well-known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clustering using representatives (CURE) for numerical comparisons. clusiVAT is based on sampling the data, imaging the reordered distance matrix to estimate the number of clusters in the data visually, clustering the samples using a relative of single linkage (SL), and then noniteratively extending the labels to the rest of the dataset using the nearest prototype rule. Previous work has established that clusiVAT produces true SL clusters in compact-separated data. We have performed experiments to show that k-means and its modified algorithms suffer from initialization issues that cause many failures. On the other hand, clusiVAT needs no initialization, and almost always finds partitions that accurately match ground truth labels in labeled data. CURE also finds SL type partitions but is much slower than the other four algorithms. In our experiments, clusiVAT proves to be the fastest and most accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4,292,637 samples in 41 dimensions) in 76 s.
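
For orientation, here is a rough sketch of the sample-cluster-extend pipeline described above, assuming k is already chosen. clusiVAT's VAT/iVAT imaging step for estimating k and its smarter sampling scheme are replaced by plain random sampling, so this is an illustration of the structure, not the authors' implementation.

```python
# Sample, single-linkage cluster the sample, then extend labels
# non-iteratively with the nearest prototype rule.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, pdist

def sample_cluster_extend(X, k, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    S = X[idx]                                 # random sample of the big data
    Z = linkage(pdist(S), method="single")     # single linkage on the sample
    sample_labels = fcluster(Z, t=k, criterion="maxclust")
    # Nearest prototype rule: each remaining point takes the label
    # of its nearest sampled point.
    nearest = cdist(X, S).argmin(axis=1)
    return sample_labels[nearest]
```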

95 citations


Posted Content
TL;DR: This work identifies a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared, names it GT bias, and presents the first extensive study of such a property for external cluster validity indices.
Abstract: It has been noticed that some external cluster validity indices (CVIs) exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard index (JI) suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature. In this work, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, NCinc bias in the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behaviour, since it may affect the interpretations of CVI results. The objective of this article is to study the empirical and theoretical implications of GT bias. To the best of our knowledge, this is the first extensive study of such a property for external cluster validity indices.
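
The monotone biases described above are easy to reproduce empirically. A minimal sketch: score random candidate partitions with increasing numbers of clusters c against one fixed ground truth, computing RI and JI from pair counts. RI should drift upward with c (NCinc) and JI downward (NCdec); the cluster counts and trial counts here are arbitrary.

```python
# Empirical check of NCinc bias in RI and NCdec bias in JI.
import numpy as np
from sklearn.metrics.cluster import pair_confusion_matrix

rng = np.random.default_rng(1)
n = 1000
truth = rng.integers(0, 5, size=n)         # fixed 5-cluster ground truth

for c in (2, 5, 10, 20):
    ri, ji = [], []
    for _ in range(50):                    # average over random candidates
        cand = rng.integers(0, c, size=n)
        # pair_confusion_matrix counts ordered pairs; ratios are unaffected.
        (tn, fp), (fn, tp) = pair_confusion_matrix(truth, cand)
        ri.append((tp + tn) / (tp + tn + fp + fn))
        ji.append(tp / (tp + fp + fn))
    print(f"c={c:2d}  RI={np.mean(ri):.3f}  JI={np.mean(ji):.3f}")
```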

33 citations


Journal ArticleDOI
TL;DR: A soft generalization of the C index is developed that can be used to evaluate sets of candidate partitions found by either fuzzy or probabilistic clustering algorithms; the sum-min generalization proves to be the second-best performer in the best-c tests and the best performer in the I/E tests on small data.
Abstract: The C index is an internal cluster validity index that was introduced in 1970 as a way to define and identify a “best” crisp partition on n objects represented by either unlabeled feature vectors or dissimilarity matrix data. This index is often one of the better performers among the plethora of internal indices available for this task. This paper develops a soft generalization of the C index that can be used to evaluate sets of candidate partitions found by either fuzzy or probabilistic clustering algorithms. We define four generalizations based on relational transformations of the soft partition and then compare their performance to eight other popular internal fuzzy cluster indices using two methods of comparison (internal “best-c” and internal/external (I/E) “best match”), six synthetic datasets, and six real-world labeled datasets. Our main conclusion is that the sum-min generalization is the second-best performer in the best-c tests and the best performer in the I/E tests on small data.
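
For reference, the classical crisp C index that the paper generalizes can be written compactly: with S the sum of within-cluster pairwise distances over n_w pairs, C = (S - S_min) / (S_max - S_min), where S_min and S_max sum the n_w smallest and largest distances overall; smaller is better. The soft generalizations additionally weight pairs by fuzzy or probabilistic memberships, which this sketch of the crisp baseline does not attempt.

```python
# Crisp C index; values near 0 indicate a good partition.
import numpy as np
from scipy.spatial.distance import pdist

def c_index(X, labels):
    d = pdist(X)                                 # condensed pairwise distances
    i, j = np.triu_indices(len(labels), k=1)     # same pair order as pdist
    within = labels[i] == labels[j]
    n_w = int(within.sum())
    S = d[within].sum()
    d_sorted = np.sort(d)
    S_min, S_max = d_sorted[:n_w].sum(), d_sorted[-n_w:].sum()
    return (S - S_min) / (S_max - S_min)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(f"C index for a compact-separated partition: {c_index(X, labels):.3f}")
```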

32 citations


Journal ArticleDOI
TL;DR: Two new relatives of the visual assessment of tendency (VAT) model, which uses cluster heat maps to visualize structure in static datasets, are developed and exemplified; experiments demonstrate their ability to successfully isolate anomalies and visualize changing cluster structure in streaming data.
Abstract: The growth in pervasive network infrastructure called the Internet of Things (IoT) enables a wide range of physical objects and environments to be monitored in fine spatial and temporal detail. The detailed, dynamic data that are collected in large quantities from sensor devices provide the basis for a variety of applications. Automatic interpretation of these evolving large data is required for timely detection of interesting events. This article develops and exemplifies two new relatives of the visual assessment of tendency (VAT) and improved visual assessment of tendency (iVAT) models, which use cluster heat maps to visualize structure in static datasets. One new model is initialized with a static VAT/iVAT image, and then incrementally (hence inc-VAT/inc-iVAT) updates the current minimal spanning tree (MST) used by VAT with an efficient edge insertion scheme. Similarly, dec-VAT/dec-iVAT efficiently removes a node from the current VAT MST. A sequence of inc-iVAT/dec-iVAT images can be used for (visual) anomaly detection in evolving data streams and for sliding window based cluster assessment for time series data. The methods are illustrated with four real datasets (three of them smart city IoT data). The evaluation demonstrates the algorithms' ability to successfully isolate anomalies and visualize changing cluster structure in the streaming data.
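
The static VAT reordering that both new models update incrementally can be sketched as a Prim-style MST ordering of the dissimilarity matrix; the inc-/dec- edge insertion and node removal schemes themselves are not reproduced here.

```python
# Static VAT: reorder a dissimilarity matrix so clusters show up
# as dark blocks along the diagonal of its image.
import numpy as np

def vat(D):
    n = D.shape[0]
    J = list(range(n))
    i, _ = np.unravel_index(np.argmax(D), D.shape)  # start at one end of the longest edge
    order = [i]
    J.remove(i)
    while J:
        sub = D[np.ix_(order, J)]                   # distances from ordered set to the rest
        j = J[int(np.argmin(sub.min(axis=0)))]      # nearest unordered point (Prim step)
        order.append(j)
        J.remove(j)
    return D[np.ix_(order, order)], order

# Demo: two well-separated clusters appear as two dark diagonal blocks
# when RD is displayed, e.g. with plt.imshow(RD, cmap="gray").
from scipy.spatial.distance import pdist, squareform
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
RD, order = vat(squareform(pdist(X)))
```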

30 citations


Proceedings ArticleDOI
30 Jun 2016
TL;DR: This research presents a novel data clustering algorithm, which exploits the correlation between data points in time to cluster the data, while maintaining a set of decision boundaries to identify noisy or anomalous data.
Abstract: The intrinsic nature of streaming data requires algorithms that are capable of fast data analysis to extract knowledge. Most current unsupervised data analysis techniques rely on the implementation of known batch techniques over a sliding window, which can hinder their utility for the analysis of evolving structure in applications involving large streams of data. This research presents a novel data clustering algorithm, which exploits the correlation between data points in time to cluster the data, while maintaining a set of decision boundaries to identify noisy or anomalous data. We illustrate the proposed algorithm for online clustering with numerical results on both real-life and simulated datasets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.

29 citations


Journal ArticleDOI
TL;DR: Bezdek discusses the historical evolution of the terms AI and CI; the seductive semantics of terms such as machine learning, which owe a heavy debt to our intuitive ideas about intelligence; the evolution of the IEEE Computational Intelligence Society; and the role that buzzwords play in the lives of all researchers.
Abstract: This article is about the terms intelligence, artificial intelligence (AI), and computational intelligence (CI). Topics addressed here include 1) the historical evolution of the terms AI and CI; 2) the seductive semantics of terms such as machine learning, which owe a heavy debt to our intuitive ideas about intelligence; 3) the evolution of the IEEE Computational Intelligence Society; and 4) the role that buzzwords play in the lives of all researchers. I estimate that this article is roughly 40% facts, 10% anecdotes, 15% speculation, and 30% opinions. The other 5%? It's reserved for you to fill in the blank; one option would be "bull." [Parts of this article are excerpted from: J.C. Bezdek (2015), "The History, Philosophy and Development of Computational Intelligence (How a simple tune became a Monster hit)," in Encyclopedia of Life Support Systems (EOLSS), vol. 3, Computational Intelligence, Hisao Ishibuchi, Ed., pp. 1-22. Available: ieee-cis.sightworks.net/documents/History/Bezdekeolss-CI-history.pdf.]

17 citations


Journal ArticleDOI
TL;DR: This paper examines four methods for completing the input data with imputed values before imaging and chooses a best method using contaminated versions of the complete Iris data, for which the desired results are known.
Abstract: The iVAT (asiVAT) algorithms reorder symmetric (asymmetric) dissimilarity data so that an image of the data may reveal cluster substructure. Images formed from incomplete data don't offer a very rich interpretation of cluster structure. In this paper, we examine four methods for completing the input data with imputed values before imaging. We choose a best method using contaminated versions of the complete Iris data, for which the desired results are known. Then, we analyze two incomplete real-world datasets from social networks using the best imputation method chosen in the juried trials with Iris: (i) Sampson's monastery data, an incomplete, asymmetric relation matrix; and (ii) the karate club data, comprising a symmetric similarity matrix that is about 86 percent incomplete.
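
The abstract doesn't name the four imputation schemes, so the following is just one generic possibility for completing a dissimilarity matrix before imaging: fill each missing entry with the mean of the observed entries in its row and column. This is purely illustrative, not one of the paper's candidates.

```python
# Complete a dissimilarity matrix with NaN-marked missing entries.
import numpy as np

def impute_dissimilarity(D):
    D = D.astype(float)
    filled = D.copy()
    fallback = np.nanmean(D)                     # global mean as last resort
    for i, j in zip(*np.where(np.isnan(D))):
        # Only the originally observed entries inform each imputation,
        # so the result does not depend on fill order.
        known = np.concatenate([D[i][~np.isnan(D[i])],
                                D[:, j][~np.isnan(D[:, j])]])
        filled[i, j] = known.mean() if known.size else fallback
    return filled
```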

15 citations


Book ChapterDOI
05 Dec 2016
TL;DR: This paper proposes a novel boosting algorithm for outlier detection called BSS, which sequentially improves the accuracy of each ensemble detector in an unsupervised manner, and discusses the effectiveness of the approach in terms of the bias-variance trade-off.
Abstract: While various ensemble algorithms have been proposed for supervised ensembles or clustering ensembles, there are few ensemble based approaches for outlier detection. The main challenge in this context is the lack of knowledge about the accuracy of the outlier detectors. Hence, none of the proposed approaches has focused on sequential boosting techniques. In this paper, for the first time, we propose a novel boosting algorithm for outlier detection called BSS, in which we sequentially improve the accuracy of each ensemble detector in an unsupervised manner. We discuss the effectiveness of our approach in terms of the bias-variance trade-off. Furthermore, an extended version of BSS (called DBSS) is proposed to introduce a novel source of diversity in outlier ensemble modeling. DBSS is used to analyze the effect of changing the input parameter of BSS on its detection accuracy. Our experimental results on both synthetic and real data sets demonstrate that our approaches outperform two state-of-the-art outlier ensemble algorithms and benefit from bias reduction. In addition, our BSS approach is robust with respect to changes in its input parameter. Since each detector in our proposed BSS/DBSS uses only a subset of the whole dataset, both techniques are well suited to application environments with limited-memory processors (e.g., wireless sensor networks).
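
BSS's sequential weighting rule isn't spelled out in the abstract, so for context here is only a generic, non-sequential subsample-based outlier ensemble: each base detector sees a random subset of the data (the limited-memory property noted above), and scores are averaged. LOF as the base detector and all parameter values are assumptions.

```python
# Generic subsampling outlier ensemble (not the authors' BSS/DBSS).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def subsample_ensemble_scores(X, n_detectors=10, subsample=256, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_detectors):
        idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
        # Each detector is fit on a small random subset only.
        lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X[idx])
        scores += -lof.score_samples(X)     # negate so higher = more outlying
    return scores / n_detectors
```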

13 citations


Proceedings ArticleDOI
24 Jul 2016
TL;DR: This work studies several ways to identify a “good” rogue random projection when the target downspace has dimensions below the JL limit, and uses Pearson and Spearman correlation coefficients and a visual imaging method that usually reveals cluster structure in spaces of any dimension to do this.
Abstract: The Johnson-Lindenstrauss (JL) lemma, with known probability, sets a lower bound q_0 on the dimension for which a random projection of p-dimensional vector data is guaranteed to be within (1±ε) of being an isometry in a randomly projected downspace. We study several ways to identify a "good" rogue random projection (RRP) when the target downspace has dimension below the JL limit. The tools used towards this end are Pearson and Spearman correlation coefficients, and a visual imaging method (a cluster heat map) that usually reveals cluster structure in spaces of any dimension. We use four synthetic data sets and the ubiquitous Iris data to study our procedures for tracking the reliability of RRPs. Unsurprisingly, rogue random projection is quite unpredictable. At its best, it is every bit as good as principal components analysis, but at its worst, it is awful. Pearson and Spearman correlations do signal good and bad projections, but the visual imaging method seems even more effective in determining the quality of RRPs.
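
A minimal sketch of the correlation-based check described here, assuming Gaussian random projections and scikit-learn's JL bound helper; the data sizes and eps value are arbitrary.

```python
# Compare pairwise distances before and after projecting to a
# "rogue" dimension well below the JL lower bound.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr, spearmanr
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

q0 = johnson_lindenstrauss_min_dim(n_samples=len(X), eps=0.3)
print("JL lower bound q_0:", q0)

q = 10                                      # rogue dimension, far below q_0
Y = GaussianRandomProjection(n_components=q, random_state=0).fit_transform(X)

d_up, d_down = pdist(X), pdist(Y)
print("Pearson :", pearsonr(d_up, d_down)[0])
print("Spearman:", spearmanr(d_up, d_down)[0])
```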

8 citations


Proceedings ArticleDOI
21 Apr 2016
TL;DR: This method first summarizes the sensor measurements collected from smart homes of elderly residents into textual statements; dissimilarity between the text summaries is then computed to quantify the comparison.
Abstract: The recent rise in the availability of sensors to monitor our day-to-day lifestyles calls for techniques to glean information out of the generated data. For many of these sensors, the recorded data are periodic in nature. Moreover, it has been shown that a change in this periodic data is correlated with a shift in physical or mental health. We present a technique to compare this repetitive data, both qualitatively and quantitatively. Our method first summarizes the sensor measurements collected from smart homes of elderly residents into textual statements. Dissimilarity between the text summaries is then computed to quantify the comparison.
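
The abstract doesn't specify the dissimilarity measure for the text summaries; one simple stand-in is cosine dissimilarity between bag-of-words vectors, shown here on two made-up weekly summaries.

```python
# Cosine dissimilarity between two illustrative text summaries.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

week1 = "Resident slept 7 hours, left the house twice, kitchen activity normal."
week2 = "Resident slept 4 hours, did not leave the house, little kitchen activity."

V = CountVectorizer().fit_transform([week1, week2])
dissimilarity = 1.0 - cosine_similarity(V[0], V[1])[0, 0]
print(f"dissimilarity between weekly summaries: {dissimilarity:.2f}")
```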
