
Showing papers by "James C. Bezdek published in 2019"


Journal ArticleDOI
TL;DR: This article reviews some of the key highlights of fuzzy and possibilistic clustering, based on the idea of stressing the internal nature of clusters rather than solely on metric notions or on the sharing of some significant traits.
Abstract: Fuzzy sets emerged in 1965 in a paper by Lotfi Zadeh. In 1969 Ruspini published a seminal paper that has become the basis of most fuzzy clustering algorithms. His ideas established the underlying structure for fuzzy partitioning, and also described and exemplified the first algorithm for accomplishing it. Bezdek developed the general case of the fuzzy c-means model in 1973. Many branches of this tree grew from 1969 to 1993. Then another watershed paper in fuzzy clustering appeared: Krishnapuram and Keller's work on possibilistic clustering. This tree has also developed many branches, and together, these two topics comprise two thirds (of the conceptual field) of soft clustering (the other third belongs to probabilistic clustering in its many guises). Another important class of fuzzy and possibilistic methods, known as generalized clustering, was later developed, based on the idea of stressing the internal nature of clusters rather than relying solely on metric notions or on the sharing of some significant traits. This article reviews some of the key highlights of fuzzy and possibilistic clustering. This is not a comprehensive survey: that would require an article the size of an encyclopedia and an army of well-informed authors. The best we can do here is to give readers a small glimpse of the overall reach and span of Zadeh's idea in the vast jungle that is fuzzy cluster analysis.
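As a hedged illustration of the fuzzy c-means model mentioned above (not the authors' code; function and variable names are our own), the alternating membership/centroid updates can be sketched in numpy:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: alternate centroid and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # each row is a fuzzy label vector
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]    # membership-weighted centroids
        # squared distances from every point to every centroid
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))              # first-order conditions
        U = inv / inv.sum(axis=1, keepdims=True)    # memberships sum to 1 per point
    return U, V
```

With well-separated data, the maximum-membership labels recover the clusters; the fuzzifier m > 1 controls how soft the partition is.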

104 citations


Journal ArticleDOI
TL;DR: A Fog-embedded privacy-preserving deep learning framework (FPPDL), which moves computation from the centralized Cloud to Fog nodes near the end devices, and achieves comparable accuracy to the centralized stochastic gradient descent framework, and delivers better accuracy than the standalone SGD framework.
Abstract: In current deep learning models, a centralized architecture forces participants to pool their data in the central Cloud to train a global model, while a distributed architecture requires a parameter server to mediate the training process. However, privacy issues, response delays, and computation and communication bottlenecks prevent these architectures from working well at the scale of Internet of Things devices. To counter these problems, in this paper we build a Fog-embedded privacy-preserving deep learning framework (FPPDL), which moves computation from the centralized Cloud to Fog nodes near the end devices. The experimental results on benchmark image datasets under different settings demonstrate that FPPDL achieves comparable accuracy to the centralized stochastic gradient descent (SGD) framework, and delivers better accuracy than the standalone SGD framework. Our evaluations also show that both computation and communication cost are greatly reduced by FPPDL, hence achieving the desired tradeoff between privacy and performance.

50 citations


Journal ArticleDOI
TL;DR: Experimental results suggest that FensiVAT, which can cluster large volumes of high-dimensional datasets in a few seconds, is the fastest and most accurate method of the ones tested.
Abstract: Clustering large volumes of high-dimensional data is a challenging task. Many clustering algorithms have been developed to address either handling datasets with a very large sample size or with a very high number of dimensions, but they are often impractical when the data is large in both aspects. To simultaneously overcome both the ‘curse of dimensionality’ problem due to high dimensions and scalability problems due to large sample size, we propose a new fast clustering algorithm called FensiVAT. FensiVAT is a hybrid, ensemble-based clustering algorithm which uses fast data-space reduction and an intelligent sampling strategy. In addition to clustering, FensiVAT also provides visual evidence that is used to estimate the number of clusters (cluster tendency assessment) in the data. In our experiments, we compare FensiVAT with nine state-of-the-art approaches which are popular for large sample size or high-dimensional data clustering. Experimental results suggest that FensiVAT, which can cluster large volumes of high-dimensional datasets in a few seconds, is the fastest and most accurate method of the ones tested.
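A hedged sketch of the kind of fast data-space reduction the abstract refers to (the exact FensiVAT pipeline is more involved; this is only the generic random-projection ingredient, with names of our own choosing):

```python
import numpy as np

def random_project(X, target_dim, seed=0):
    """Gaussian random projection: a cheap data-space reduction that
    approximately preserves pairwise distances (Johnson-Lindenstrauss),
    making downstream clustering of high-dimensional data tractable."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(X.shape[1], target_dim))
    return X @ R
```

Projecting, say, 1000-dimensional vectors down to 100 dimensions keeps inter-point distances roughly intact while cutting the cost of every subsequent distance computation by an order of magnitude.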

39 citations


Journal ArticleDOI
TL;DR: A scalable clustering and Markov chain-based hybrid framework, called Traj-clusiVAT-based TP, for both short- and long-term TPs, which can handle a large number of overlapping trajectories in a dense road network.
Abstract: Trajectory prediction (TP) is of great importance for a wide range of location-based applications in intelligent transport systems, such as location-based advertising, route planning, traffic management, and early warning systems. In the last few years, the widespread use of GPS navigation systems and wireless communication technology enabled vehicles has resulted in huge volumes of trajectory data. The task of utilizing these data employing spatio-temporal techniques for TP in an efficient and accurate manner is an ongoing research problem. Existing TP approaches are limited to the short-term predictions. Moreover, they cannot handle a large volume of trajectory data for long-term prediction. To address these limitations, we propose a scalable clustering and Markov chain-based hybrid framework, called Traj-clusiVAT-based TP, for both short- and long-term TPs, which can handle a large number of overlapping trajectories in a dense road network. Traj-clusiVAT can also determine the number of clusters, which represent different movement behaviors in input trajectory data. In our experiments, we compare our proposed approach with a mixed Markov model-based scheme and a trajectory clustering, NETSCAN-based TP method for both short- and long-term TPs. We performed our experiments on two real, vehicle trajectory datasets, including a large-scale trajectory dataset consisting of 3.28 million trajectories obtained from 15 061 taxis in Singapore over a period of one month. The experimental results on two real trajectory datasets show that our proposed approach outperforms the existing approaches in terms of both short- and long-term prediction performances, based on the prediction accuracy and distance error (in km).
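To make the Markov-chain half of the hybrid framework concrete, here is a minimal, hedged sketch (our own simplification, not Traj-clusiVAT itself) of first-order transition estimation over discretized road segments and greedy multi-step prediction:

```python
import numpy as np

def fit_transition_matrix(trajectories, n_states):
    """Estimate a first-order Markov transition matrix from sequences of
    discrete states (e.g. road-segment IDs). Rows with no observations
    fall back to a uniform distribution."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            counts[a, b] += 1
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row,
                     out=np.full_like(counts, 1.0 / n_states),
                     where=row > 0)

def predict_path(P, start, steps):
    """Greedy multi-step prediction: repeatedly follow the most probable transition."""
    path = [start]
    for _ in range(steps):
        path.append(int(P[path[-1]].argmax()))
    return path
```

In the paper's setting, one such chain would be fitted per trajectory cluster; the sketch above shows only the single-chain mechanics.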

37 citations


Journal ArticleDOI
TL;DR: Two incremental versions of the Xie‐Beni and Davies‐Bouldin validity indices are developed and used to monitor and control two streaming clustering algorithms (sk‐means and online ellipsoidal clustering), and it is shown that incremental cluster validity indices can send a distress signal to online monitors when evolving structure leads an algorithm astray.
Abstract: Cluster analysis is used to explore structure in unlabeled batch data sets in a wide range of applications. An important part of cluster analysis is validating the quality of computationally obtained clusters. A large number of different internal indices have been developed for validation in the offline setting. However, this concept cannot be directly extended to the online setting because streaming algorithms do not retain the data, nor maintain a partition of it, both needed by batch cluster validity indices. In this paper, we develop two incremental versions (with and without forgetting factors) of the Xie-Beni and Davies-Bouldin validity indices, and use them to monitor and control two streaming clustering algorithms (sk-means and online ellipsoidal clustering). In this context, our new incremental validity indices are more accurately viewed as performance monitoring functions. We also show that incremental cluster validity indices can send a distress signal to online monitors when evolving structure leads an algorithm astray. Our numerical examples indicate that the incremental Xie-Beni index with a forgetting factor is superior to the other three indices tested.
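For reference, a hedged numpy sketch of the batch Xie-Beni index that the paper makes incremental (the incremental variants update the same sums recursively, optionally with a forgetting factor; names here are ours):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Batch Xie-Beni index: fuzzy within-cluster compactness divided by
    n times the minimum squared separation between centroids.
    Lower values indicate better-defined clusters."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    c = V.shape[0]
    sep = min(((V[i] - V[j]) ** 2).sum()
              for i in range(c) for j in range(c) if i != j)
    return compactness / (X.shape[0] * sep)
```

A streaming version cannot hold X, so it maintains the compactness sum incrementally as points arrive, which is exactly the move the paper formalizes.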

29 citations


Journal ArticleDOI
TL;DR: The experimental results indicate that the proposed weighted multiple classifier framework is better than many of the benchmark algorithms, including three homogeneous ensemble methods, several well-known algorithms, and random projection-based ensembles with fixed combining rules.

28 citations


Journal ArticleDOI
TL;DR: Six methods for approximating DI are presented, four of them based on Maximin sampling, which identifies a skeleton of the full partition that contains some boundary points in each cluster; the experiments support the assertion that computing approximations to DI with an incremental, neighborhood-based Maximin skeleton is both tractable and reliably accurate.
Abstract: Dunn’s internal cluster validity index is used to assess partition quality and subsequently identify a “best” crisp partition of ${n}$ objects. Computing Dunn’s index (DI) for partitions of ${n}~{p}$ -dimensional feature vector data has quadratic time complexity ${O(pn^{2})}$ , so its computation is impractical for very large values of ${n}$ . This note presents six methods for approximating DI. Four methods are based on Maximin sampling, which identifies a skeleton of the full partition that contains some boundary points in each cluster. Two additional methods are presented that estimate boundary points associated with unsupervised training of one class support vector machines. Numerical examples compare approximations to DI based on all six methods. Four experiments on seven real and synthetic data sets support our assertion that computing approximations to DI with an incremental, neighborhood-based Maximin skeleton is both tractable and reliably accurate.
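A hedged sketch of the exact (quadratic-cost) Dunn's index computation that the paper's approximations are designed to avoid, using single-linkage separation and complete diameter (one common formulation; names are ours):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: minimum inter-cluster distance divided by maximum
    cluster diameter. Building the full pairwise distance matrix makes
    this O(p*n^2), which is why approximation is needed for large n."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    diam = max(D[np.ix_(labels == k, labels == k)].max() for k in clusters)
    sep = min(D[np.ix_(labels == a, labels == b)].min()
              for a in clusters for b in clusters if a != b)
    return sep / diam
```

The Maximin-skeleton approach computes the same ratio over a small, boundary-aware subset of points instead of all n.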

19 citations


Journal ArticleDOI
12 Nov 2019-PLOS ONE
TL;DR: The results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage, and it is shown that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models.
Abstract: Sorting spikes from extracellular recording into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, any spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value c for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and on subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experimental results are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used for both visualization and to extract features to be used by any clustering algorithm.

13 citations


Book ChapterDOI
16 Sep 2019
TL;DR: A novel technique called Maximin-based Anomaly Detection is proposed that addresses challenges of unsupervised anomaly detection by selecting a representative subset of data in combination with a kernel-based model construction and effectively uses active learning with a limited budget.
Abstract: Unsupervised anomaly detection is commonly performed using a distance or density based technique, such as K-Nearest neighbours, Local Outlier Factor or One-class Support Vector Machines. One-class Support Vector Machines reduce the computational cost of testing new data by providing sparse solutions. However, all these techniques have relatively high computational requirements for training. Moreover, identifying anomalies based solely on density or distance is not sufficient when both point (isolated) and cluster anomalies exist in an unlabelled training set. Finally, these unsupervised anomaly detection techniques are not readily adapted for active learning, where the training algorithm should identify examples for which labelling would make a significant impact on the accuracy of the learned model. In this paper, we propose a novel technique called Maximin-based Anomaly Detection that addresses these challenges by selecting a representative subset of data in combination with a kernel-based model construction. We show that the proposed technique (a) provides a statistically significant improvement in the accuracy as well as the computation time required for training and testing compared to several benchmark unsupervised anomaly detection techniques, and (b) effectively uses active learning with a limited budget.
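The representative-subset selection at the heart of the method can be illustrated with a hedged sketch of Maximin (farthest-point) sampling (only the sampling step, not the full kernel-based detector; names are ours):

```python
import numpy as np

def maximin_sample(X, k, seed=0):
    """Maximin (farthest-point) sampling: greedily pick the point whose
    minimum distance to all points selected so far is largest, yielding a
    spread-out representative subset in O(n*k) distance computations."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = ((X - X[idx[0]]) ** 2).sum(axis=1)   # squared distance to nearest pick
    for _ in range(k - 1):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, ((X - X[nxt]) ** 2).sum(axis=1))
    return idx
```

Because each new pick maximizes the distance to the current subset, a small k already touches every well-separated region of the data, which is what makes the downstream model construction cheap.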

7 citations


Proceedings ArticleDOI
22 Aug 2019
TL;DR: This paper shows that running the D2 seeding used in k-means++ on a random sample and then clustering the whole dataset results in faster runtime and comparable accuracy compared to the original algorithm, and proposes a new method that performs the D2 seeding and clustering on the random sample.
Abstract: K-means clustering with random seeds results in arbitrarily poor clusters. Much work has been done to improve initial centroid selection, also known as seeding; however, better seeding algorithms do not scale to large or unloadable datasets. In this paper, we first show that running the D2 seeding used in k-means++ on a random sample then clustering the whole dataset results in faster runtime and comparable accuracy compared to the original algorithm. We then propose a new method that performs the D2 seeding and clustering on the random sample. This method essentially runs k-means++ on the sample, then extends cluster assignments to every other point using nearest centroid classification. This results in faster clustering and comparable clustering quality compared to the original algorithm. We demonstrate the performance of both algorithms on synthetic and real datasets.
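A hedged sketch of D2 seeding and the paper's sample-then-seed idea (a simplified reconstruction from the abstract, not the authors' implementation; names are ours):

```python
import numpy as np

def d2_seeds(X, k, seed=0):
    """k-means++ (D^2) seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest centre so far."""
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

def sample_then_seed(X, k, sample_size, seed=0):
    """Run D^2 seeding on a uniform random sample instead of the full
    dataset, trading a little seeding quality for a large speedup."""
    rng = np.random.default_rng(seed)
    pick = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    return d2_seeds(X[pick], k, seed)
```

The resulting seeds can initialize k-means on the full data (the paper's first variant) or anchor nearest-centroid assignment of the remaining points (the second variant).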

7 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: It is shown how dissimilarity measures between different components of a multi-variate waveform database can measure the similarity, or the lack of it, between the motion of two hands in order to differentiate between different gestures, for applications in assistive technology and smart health-care.
Abstract: Clustering waveform data is used in applications ranging from healthcare to economics and entertainment. In this paper, we present a study on clustering gestures enacted by subjects while wearing wrist-worn accelerometer sensors through different dissimilarity measures between individual components of multi-variate waveform data. We show how dissimilarity measures between different components of a multi-variate waveform database can measure the similarity, or the lack of it, between the motion of two hands in order to differentiate between different gestures, for applications in assistive technology and smart health-care. In doing so, we exploit a hierarchical clustering architecture and visualize it through single-linkage dendrograms and visual assessment of cluster tendency. Using annotations of the gestures, we describe the physical significance behind the formation of the hierarchy. We also discuss combining different dissimilarity measures by convex combination to improve clustering.
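The convex combination of dissimilarity measures mentioned at the end can be sketched as follows (a hedged, generic illustration with Euclidean per-component dissimilarities; the paper studies several measures, and the names here are ours):

```python
import numpy as np

def component_dissimilarity(a, b):
    """Euclidean dissimilarity between one matched component (e.g. one
    accelerometer channel) of two equal-length waveforms."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def combined_dissimilarity(w1, w2, weights):
    """Convex combination of per-component dissimilarities between two
    multivariate waveforms (rows = components, columns = time samples)."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * component_dissimilarity(w1[i], w2[i])
               for i, w in enumerate(weights))
```

A matrix of such combined dissimilarities over all gesture pairs is what feeds the single-linkage dendrograms and VAT images described above.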

Book ChapterDOI
24 Apr 2019
TL;DR: This paper shows that the C index can be used not only to validate but also to actually find clusters, which leads to difficult discrete optimization problems which can be approximately solved by a canonical genetic algorithm.
Abstract: Clustering is an important family of unsupervised machine learning methods. Cluster validity indices are widely used to assess the quality of obtained clustering results. The C index is one of the most popular cluster validity indices. This paper shows that the C index can be used not only to validate but also to actually find clusters. This leads to difficult discrete optimization problems which can be approximately solved by a canonical genetic algorithm. Numerical experiments compare this novel approach to the well-known c-means and single linkage clustering algorithms. For all five of the popular real-world benchmark data sets considered, the proposed method yields a better C index than any of the other (pure) clustering methods.
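For readers unfamiliar with the index itself, here is a hedged sketch of the C index (Hubert and Levin's formulation, which is what this objective optimizes; lower is better, and the names are ours):

```python
import numpy as np
from itertools import combinations

def c_index(X, labels):
    """C index: (S - S_min) / (S_max - S_min), where S is the sum of
    within-cluster pairwise distances, l is the number of within-cluster
    pairs, and S_min / S_max are the sums of the l smallest / largest
    pairwise distances overall. Ranges over [0, 1]; lower is better."""
    pairs = list(combinations(range(len(X)), 2))
    d = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
    within = np.array([labels[i] == labels[j] for i, j in pairs])
    S, l = d[within].sum(), int(within.sum())
    ds = np.sort(d)
    s_min, s_max = ds[:l].sum(), ds[-l:].sum()
    return (S - s_min) / (s_max - s_min)
```

Treating labels as the decision variables and this value as the fitness gives exactly the kind of discrete optimization problem a genetic algorithm can attack.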

01 Jan 2019
TL;DR: In this paper, an unsupervised anomaly detection system that represents relationships between different locations in a city is proposed to distinguish anomalous local events from legitimate global traffic changes, which happen due to seasonal effects, weather and holidays.
Abstract: Sensors deployed in different parts of a city continuously record traffic data, such as vehicle flows and pedestrian counts. We define an unexpected change in the traffic counts as an anomalous local event. Reliable discovery of such events is very important in real-world applications such as real-time crash detection or traffic congestion detection. One of the main challenges to detecting anomalous local events is to distinguish them from legitimate global traffic changes, which happen due to seasonal effects, weather and holidays. Existing anomaly detection techniques often raise many false alarms for these legitimate traffic changes, making such techniques less reliable. To address this issue, we introduce an unsupervised anomaly detection system that represents relationships between different locations in a city. Our method uses training data to estimate the traffic count at each sensor location given the traffic counts at the other locations. The estimation error is then used to calculate the anomaly score at any given time and location in the network. We test our method on two real traffic datasets collected in the city of Melbourne, Australia, for detecting anomalous local events. Empirical results show the greater robustness of our method to legitimate global changes in traffic count than four benchmark anomaly detection methods examined in this paper. Data related to this paper are available at: https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/147696bb47544a209e0a5e79e165d1b0_0.
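The estimate-from-other-locations idea can be sketched as follows (a hedged simplification using plain least squares; the paper's estimator and scoring may differ, and all names here are ours):

```python
import numpy as np

def fit_location_models(counts):
    """For each sensor, fit a least-squares model predicting its traffic
    count from all other sensors' counts (rows = time steps, cols = sensors)."""
    n_sensors = counts.shape[1]
    models = []
    for j in range(n_sensors):
        others = np.delete(counts, j, axis=1)
        A = np.column_stack([others, np.ones(len(counts))])  # add intercept
        w, *_ = np.linalg.lstsq(A, counts[:, j], rcond=None)
        models.append(w)
    return models

def anomaly_scores(models, counts):
    """Anomaly score = absolute estimation error at each sensor and time.
    A spike at one location stands out because the other locations still
    predict its normal value, while a global change shifts all inputs too."""
    n_sensors = counts.shape[1]
    scores = np.empty_like(counts, dtype=float)
    for j in range(n_sensors):
        others = np.delete(counts, j, axis=1)
        A = np.column_stack([others, np.ones(len(counts))])
        scores[:, j] = np.abs(A @ models[j] - counts[:, j])
    return scores
```

Because every sensor is explained by its peers, a city-wide shift (holiday, weather) moves predictions and observations together and produces small residuals, which is the robustness property the paper reports.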