Showing papers on "Cluster analysis" published in 2021


Book
29 Jul 2021
TL;DR: This book presents some of the most important modeling and prediction techniques, along with relevant applications, that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years.
Abstract: An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.

3,439 citations


Book
30 Sep 2021
TL;DR: This book discusses Exploratory Data Analysis, Hierarchical Methods, Optimization Methods - k-Means, and more.
Abstract:
INTRODUCTION TO EXPLORATORY DATA ANALYSIS
Introduction to Exploratory Data Analysis: What Is Exploratory Data Analysis; Overview of the Text; A Few Words about Notation; Data Sets Used in the Book; Transforming Data.
EDA AS PATTERN DISCOVERY
Dimensionality Reduction - Linear Methods: Introduction; Principal Component Analysis (PCA); Singular Value Decomposition (SVD); Nonnegative Matrix Factorization; Factor Analysis; Fisher's Linear Discriminant; Intrinsic Dimensionality.
Dimensionality Reduction - Nonlinear Methods: Multidimensional Scaling (MDS); Manifold Learning; Artificial Neural Network Approaches.
Data Tours: Grand Tour; Interpolation Tours; Projection Pursuit; Projection Pursuit Indexes; Independent Component Analysis.
Finding Clusters: Introduction; Hierarchical Methods; Optimization Methods - k-Means; Spectral Clustering; Document Clustering; Evaluating the Clusters.
Model-Based Clustering: Overview of Model-Based Clustering; Finite Mixtures; Expectation-Maximization Algorithm; Hierarchical Agglomerative Model-Based Clustering; Model-Based Clustering; MBC for Density Estimation and Discriminant Analysis; Generating Random Variables from a Mixture Model.
Smoothing Scatterplots: Introduction; Loess; Robust Loess; Residuals and Diagnostics with Loess; Smoothing Splines; Choosing the Smoothing Parameter; Bivariate Distribution Smooths; Curve Fitting Toolbox.
GRAPHICAL METHODS FOR EDA
Visualizing Clusters: Dendrogram; Treemaps; Rectangle Plots; ReClus Plots; Data Image.
Distribution Shapes: Histograms; Boxplots; Quantile Plots; Bagplots; Rangefinder Boxplot.
Multivariate Visualization: Glyph Plots; Scatterplots; Dynamic Graphics; Coplots; Dot Charts; Plotting Points as Curves; Data Tours Revisited; Biplots.
Appendix A: Proximity Measures. Appendix B: Software Resources for EDA. Appendix C: Description of Data Sets. Appendix D: Introduction to MATLAB. Appendix E: MATLAB Functions. References. Index.
Summary, Further Reading, and Exercises appear at the end of each chapter.

320 citations


Proceedings ArticleDOI
11 Mar 2021
TL;DR: MagFace introduces an adaptive mechanism that learns well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away, which prevents models from overfitting on noisy low-quality samples and improves face recognition in the wild.
Abstract: The performance of a face recognition system degrades when the variability of the acquired faces increases. Prior work alleviates this issue by either monitoring the face quality in pre-processing or predicting the data uncertainty along with the face feature. This paper proposes MagFace, a category of losses that learn a universal feature embedding whose magnitude can measure the quality of the given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away. This prevents models from overfitting on noisy low-quality samples and improves face recognition in the wild. Extensive experiments conducted on face recognition, quality assessment, and clustering demonstrate its superiority over state-of-the-art methods. The code is available at https://github.com/IrvingMeng/MagFace.

268 citations


Journal ArticleDOI
TL;DR: HuBERT utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss, which forces the model to learn a combined acoustic and language model over the continuous inputs.
Abstract: Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960 h) and Libri-light (60,000 h) benchmarks with 10 min, 1 h, 10 h, 100 h, and 960 h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

266 citations
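
The two ingredients named in the HuBERT abstract above (offline k-means targets and a prediction loss applied over masked regions only) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `kmeans_labels` and `masked_cross_entropy`, the plain Lloyd's k-means, and the random toy features are all assumptions made for illustration, and the real model predicts targets with a Transformer rather than from fixed logits.

```python
import numpy as np

def kmeans_labels(frames, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the offline teacher).
    Returns one cluster id per frame, used as the prediction target."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct frames (fancy indexing copies).
    centers = frames[rng.choice(len(frames), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance of every frame to every centre: (n, k).
        dist = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = frames[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centres.
    dist = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dist.argmin(axis=1)

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy computed over masked frames only."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

# Toy usage: 50 frames of 8-dimensional features, 4 hidden units.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
targets = kmeans_labels(frames, k=4)
mask = np.zeros(50, dtype=bool)
mask[10:20] = True                      # pretend frames 10..19 were masked
logits = rng.normal(size=(50, 4))       # stand-in for model outputs
loss = masked_cross_entropy(logits, targets, mask)
```

Because the loss indexes with `mask`, predictions on unmasked frames have no effect on it, which is the property the abstract highlights.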


Journal ArticleDOI
TL;DR: A comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions and points to several significant research questions that are unanswered in the recent literature having the same target.
Abstract: Manufacturing organizations need to use different kinds of techniques and tools in order to fulfill their foundation goals. In this aspect, using machine learning (ML) and data mining (DM) techniques and tools could be very helpful for dealing with challenges in manufacturing. Therefore, in this paper, a comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions. Furthermore, it points to several significant research questions that are unanswered in the recent literature having the same target. Our survey aims to provide researchers with a solid understanding of the main approaches and algorithms used to improve manufacturing processes over the past two decades. It presents the previous ML studies and recent advances in manufacturing by grouping them under four main subjects: scheduling, monitoring, quality, and failure. It comprehensively discusses existing solutions in manufacturing according to various aspects, including tasks (e.g., clustering, classification, regression), algorithms (e.g., support vector machine, neural network), learning types (e.g., ensemble learning, deep learning), and performance metrics (e.g., accuracy, mean absolute error). Furthermore, the main steps of the knowledge discovery in databases (KDD) process to be followed in manufacturing applications are explained in detail. In addition, some statistics about the current state are given from different perspectives. Finally, it explains the advantages of using machine learning techniques in manufacturing, describes ways to overcome certain challenges, and offers some possible further research directions.

237 citations


Journal ArticleDOI
TL;DR: Clustered FL (CFL) exploits geometric properties of the FL loss surface to group the client population into clusters with jointly trainable data distributions, and can be viewed as a postprocessing method that will always achieve greater or equal performance than conventional FL by allowing clients to arrive at more specialized models.
Abstract: Federated learning (FL) is currently the most widely adopted framework for collaborative training of (deep) machine learning models under privacy constraints. Despite its popularity, it has been observed that FL yields suboptimal results if the local clients' data distributions diverge. To address this issue, we present clustered FL (CFL), a novel federated multitask learning (FMTL) framework, which exploits geometric properties of the FL loss surface to group the client population into clusters with jointly trainable data distributions. In contrast to existing FMTL approaches, CFL requires no modifications to the FL communication protocol, is applicable to general nonconvex objectives (in particular, deep neural networks), does not require the number of clusters to be known a priori, and comes with strong mathematical guarantees on the clustering quality. CFL is flexible enough to handle client populations that vary over time and can be implemented in a privacy-preserving way. As clustering is only performed after FL has converged to a stationary point, CFL can be viewed as a postprocessing method that will always achieve greater or equal performance than conventional FL by allowing clients to arrive at more specialized models. We verify our theoretical analysis in experiments with deep convolutional and recurrent neural networks on commonly used FL data sets.

234 citations


Journal ArticleDOI
TL;DR: This article first analyzes the main factors that influence the performance of BSO and then proposes an orthogonal learning framework to improve its learning mechanism and shows that the proposed approach is very powerful in optimizing complex functions.
Abstract: In brain storm optimization (BSO), the convergent operation utilizes a clustering strategy to group the population into multiple clusters, and the divergent operation uses this cluster information to generate new individuals. However, this mechanism is inefficient at regulating the exploration and exploitation search. This article first analyzes the main factors that influence the performance of BSO and then proposes an orthogonal learning framework to improve its learning mechanism. In this framework, two orthogonal design (OD) engines (i.e., exploration OD engine and exploitation OD engine) are introduced to discover and utilize useful search experiences for performance improvements. In addition, a pool of auxiliary transmission vectors with different features is maintained and their biases are also balanced by the OD decision mechanism. Finally, the proposed algorithm is verified on a set of benchmarks and is adopted to solve the quantitative association rule mining problem considering the support, confidence, comprehensibility, and netconf. The experimental results show that the proposed approach is very powerful in optimizing complex functions. It outperforms not only previous versions of the BSO algorithm but also several well-known OD-based algorithms.

200 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explored the performance of fuzzy system-based medical image processing for brain disease prediction, and designed a brain image processing and brain disease diagnosis prediction model based on improved fuzzy clustering and HPU-Net (Hybrid Pyramid U-Net Model for Brain Tumor Segmentation).
Abstract: The present work aims to explore the performance of fuzzy system-based medical image processing for brain disease prediction. The imaging mechanism of NMR (Nuclear Magnetic Resonance) and the complexity of human brain tissues cause brain MRI (Magnetic Resonance Imaging) images to present varying degrees of noise, weak boundaries, and artifacts. Hence, improvements are made over the fuzzy clustering algorithm. While ensuring the model safety performance, a brain image processing and brain disease diagnosis prediction model is designed based on improved fuzzy clustering and HPU-Net (Hybrid Pyramid U-Net Model for Brain Tumor Segmentation). Brain MRI images collected from the Department of Brain Oncology, XX Hospital, are employed in simulation experiments to validate the performance of the proposed algorithm. Moreover, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), FCM (Fuzzy C-Means), LDCFCM (Local Density Clustering Fuzzy C-Means), and AFCM (Adaptive Fuzzy C-Means) are included in simulation experiments for performance comparison. Results demonstrate that the proposed algorithm has more nodes, lower energy consumption, and more stable changes than other models under the same conditions. Regarding the overall network performance, the proposed algorithm completes the data transmission tasks the fastest, at about 4.5 seconds on average, which is remarkably better than other models. A further prediction performance analysis reveals that the proposed algorithm provides the highest prediction accuracy for the Whole Tumor under the DSC coefficient, reaching 0.936. Besides, its Jaccard coefficient is 0.845, proving its superior segmentation accuracy over other models. To sum up, the proposed algorithm can provide higher accuracy while ensuring energy consumption, a more apparent denoising effect, and better segmentation and recognition than other models, which can provide an experimental basis for the feature recognition and predictive diagnosis of brain images.

179 citations


Journal ArticleDOI
01 Jan 2021
TL;DR: The Butterfly Optimization Algorithm (BOA) is employed to choose an optimal cluster head from a group of nodes, and the outputs of the proposed methodology are compared with the traditional approaches LEACH and DEEC and with some existing methods.
Abstract: Wireless Sensor Networks (WSNs) consist of a large number of spatially distributed sensor nodes connected through the wireless medium to monitor and record physical information from the environment. The nodes of a WSN are battery powered, so after a certain period they lose all their energy. This energy constraint affects the lifetime of the network. The objective of this study is to minimize the overall energy consumption and to maximize the network lifetime. At present, clustering and routing algorithms are widely used in WSNs to enhance the network lifetime. In this study, the Butterfly Optimization Algorithm (BOA) is employed to choose an optimal cluster head from a group of nodes. The cluster head selection is optimized by the residual energy of the nodes, distance to the neighbors, distance to the base station, node degree and node centrality. The route between the cluster head and the base station is identified by using Ant Colony Optimization (ACO), which selects the optimal route based on the distance, residual energy and node degree. The performance of the proposed methodology is analyzed in terms of alive nodes, dead nodes, energy consumption and data packets received by the base station (BS). The outputs of the proposed methodology are compared with the traditional approaches LEACH and DEEC and with the existing methods FUCHAR, CRHS, BERA, CPSO, ALOC and FLION. For example, the proposed methodology has 200 alive nodes at 1500 iterations, which is higher than the CRHS and BERA methods.

174 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes superpixel-guided clustering (SGC) and guided prototype allocation (GPA) modules for multiple prototype extraction and allocation: SGC extracts more representative prototypes by aggregating similar feature vectors, while GPA selects matched prototypes to provide more accurate guidance.
Abstract: Prototype learning is extensively used for few-shot segmentation. Typically, a single prototype is obtained from the support feature by averaging the global object information. However, using one prototype to represent all the information may lead to ambiguities. In this paper, we propose two novel modules, named superpixel-guided clustering (SGC) and guided prototype allocation (GPA), for multiple prototype extraction and allocation. Specifically, SGC is a parameter-free and training-free approach, which extracts more representative prototypes by aggregating similar feature vectors, while GPA is able to select matched prototypes to provide more accurate guidance. By integrating the SGC and GPA together, we propose the Adaptive Superpixel-guided Network (ASGNet), which is a lightweight model and adapts to object scale and shape variation. In addition, our network can easily generalize to k-shot segmentation with substantial improvement and no additional computational cost. In particular, our evaluations on COCO demonstrate that ASGNet surpasses the state-of-the-art method by 5% in 5-shot segmentation.

172 citations


Journal ArticleDOI
TL;DR: Four methods for calculating distance between individuals using dichotomous data, and the subsequent introduction of these distances to a clustering algorithm such as Ward's, were found to work better, in nearly all cases, than using the raw data with Ward's clustering algorithm.
Abstract: The current study examines the performance of cluster analysis with dichotomous data using distance measures based on response pattern similarity. In many contexts, such as educational and psychological testing, cluster analysis is a useful means for exploring datasets and identifying underlying groups among individuals. However, standard approaches to cluster analysis assume that the variables used to group observations are continuous in nature. This paper focuses on four methods for calculating distance between individuals using dichotomous data, and the subsequent introduction of these distances to a clustering algorithm such as Ward's. The four methods in question are potentially useful for practitioners because they are relatively easy to carry out using standard statistical software such as SAS and SPSS, and have been shown to have potential for correctly grouping observations based on dichotomous data. Results of both a simulation study and application to a set of binary survey responses show that three of the four measures behave similarly, and can yield correct cluster recovery rates of between 60% and 90%. Furthermore, these methods were found to work better, in nearly all cases, than using the raw data with Ward's clustering algorithm.
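
The workflow this abstract describes (compute a pairwise distance from dichotomous responses, then pass the distances to Ward's method) might look like the following SciPy sketch. The abstract does not name its four distance measures, so the Jaccard distance here is an assumed stand-in, and `cluster_binary` and the toy response matrix are hypothetical. Note that SciPy's Ward linkage is formally defined for Euclidean distances, so applying it to another dissimilarity is a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_binary(responses, n_clusters):
    """Cluster rows of a 0/1 response matrix.
    Step 1: pairwise Jaccard distance between response patterns (assumed measure).
    Step 2: Ward's agglomerative linkage on the condensed distance vector.
    Step 3: cut the tree into at most n_clusters groups."""
    dist = pdist(np.asarray(responses, dtype=bool), metric="jaccard")
    tree = linkage(dist, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy survey: respondents 0-2 endorse items 0-3, respondents 3-5 endorse items 4-7.
responses = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
])
labels = cluster_binary(responses, n_clusters=2)
```

Swapping `metric="jaccard"` for another binary dissimilarity supported by `pdist` (for example `"dice"` or `"hamming"`) changes only step 1.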

Journal ArticleDOI
TL;DR: A novel graph-regularized matrix factorization model is developed to preserve the local geometric similarities of the learned common representations from different views and the semantic consistency constraint is introduced to stimulate these view-specific representations toward a unified discriminative representation.
Abstract: An important underlying assumption that guides the success of the existing multiview learning algorithms is the full observation of the multiview data. However, such a rigorous precondition clearly violates common-sense knowledge in practical applications, where in most cases only incomplete fractions of the multiview data are given. The presence of incomplete settings generally disables the conventional multiview clustering methods. In this article, we propose a simple but effective incomplete multiview clustering (IMC) framework, which simultaneously considers the local geometric information and the unbalanced discriminating powers of these incomplete multiview observations. Specifically, a novel graph-regularized matrix factorization model, on the one hand, is developed to preserve the local geometric similarities of the learned common representations from different views. On the other hand, the semantic consistency constraint is introduced to stimulate these view-specific representations toward a unified discriminative representation. Moreover, the importance of different views is adaptively determined to reduce the negative influence of the unbalanced incomplete views. Furthermore, an efficient learning algorithm is proposed to solve the resulting optimization problem. Extensive experimental results performed on several incomplete multiview datasets demonstrate that the proposed method can achieve superior clustering performance in comparison with some state-of-the-art multiview learning methods.

Journal ArticleDOI
TL;DR: An exact formulation based on mixed integer linear programming to fully search the solution space and produce optimal flight paths for autonomous UAVs is proposed and an original clustering-based algorithm to classify regions into clusters is designed such that coverage tasks would be carried out correctly and efficiently.
Abstract: Unmanned aerial vehicles (UAVs) have been widely applied in civilian and military applications due to their high autonomy and strong adaptability. Although UAVs can achieve effective cost reduction and flexibility enhancement in the development of large-scale systems, they result in a serious path planning and task allocation problem. Coverage path planning, which seeks flight paths that cover all regions of interest, is one of the key technologies in achieving autonomous driving of UAVs, and optimal solutions are difficult to obtain because of its NP-hard computational complexity. In this paper, we study the coverage path planning problem of autonomous heterogeneous UAVs on a bounded number of regions. First, with models of separated regions and heterogeneous UAVs, we propose an exact formulation based on mixed integer linear programming to fully search the solution space and produce optimal flight paths for autonomous UAVs. Then, inspired by density-based clustering methods, we design an original clustering-based algorithm to classify regions into clusters and obtain approximately optimal point-to-point paths for UAVs such that coverage tasks would be carried out correctly and efficiently. Experiments with randomly generated regions are conducted to demonstrate the efficiency and effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: This Review provides a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicates likely directions for further developments in the field.
Abstract: Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms for dimensionality reduction, density estimation, clustering, and kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.

Proceedings ArticleDOI
Yijie Lin, Yuanbiao Gou, Zitao Liu, Boyun Li, Jiancheng Lv, Xi Peng
01 Jun 2021
TL;DR: In this paper, a unified framework for representation learning and cross-view data recovery is proposed, where the informative and consistent representation is learned by maximizing the mutual information across different views through contrastive learning, and the missing views are recovered by minimizing the conditional entropy of different views via dual prediction.
Abstract: In this paper, we study two challenging problems in incomplete multi-view clustering analysis, namely, i) how to learn an informative and consistent representation among different views without the help of labels and ii) how to recover the missing views from data. To this end, we propose a novel objective that incorporates representation learning and data recovery into a unified framework from the view of information theory. To be specific, the informative and consistent representation is learned by maximizing the mutual information across different views through contrastive learning, and the missing views are recovered by minimizing the conditional entropy of different views through dual prediction. To the best of our knowledge, this could be the first work to provide a theoretical framework that unifies the consistent representation learning and cross-view data recovery. Extensive experimental results show the proposed method remarkably outperforms 10 competitive multi-view clustering methods on four challenging datasets. The code is available at https://pengxi.me.

Journal ArticleDOI
TL;DR: In this article, the authors present a review of state-of-the-art DL-based approaches for clustering analysis that are based on representation learning, which they hope to be useful for bioinformatics research.
Abstract: Clustering is central to much data-driven bioinformatics research and serves as a powerful computational method. In particular, clustering helps in analyzing unstructured and high-dimensional data in the form of sequences, expressions, texts and images. Further, clustering is used to gain insights into biological processes at the genomics level, e.g. clustering of gene expressions provides insights on the natural structure inherent in the data, understanding gene functions, cellular processes, subtypes of cells and understanding gene regulations. Subsequently, clustering approaches, including hierarchical, centroid-based, distribution-based, density-based and self-organizing maps, have long been studied and used in classical machine learning settings. In contrast, deep learning (DL)-based representation and feature learning for clustering have not been reviewed and employed extensively. Since the quality of clustering is not only dependent on the distribution of data points but also on the learned representation, deep neural networks can be an effective means to transform mappings from a high-dimensional data space into a lower-dimensional feature space, leading to improved clustering results. In this paper, we review state-of-the-art DL-based approaches for cluster analysis that are based on representation learning, which we hope to be useful, particularly for bioinformatics research. Further, we explore in detail the training procedures of DL-based clustering algorithms, point out different clustering quality metrics and evaluate several DL-based approaches on three bioinformatics use cases, including bioimaging, cancer genomics and biomedical text mining. We believe this review and the evaluation results will provide valuable insights and serve as a starting point for researchers wanting to apply DL-based unsupervised methods to solve emerging bioinformatics research problems.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: A group-aware Label Transfer (GLT) algorithm is proposed, which enables the online interaction and mutual promotion of pseudo-label prediction and representation learning and can better correct the noisy pseudo label in an online fashion and narrow down the search space of the target identity.
Abstract: Unsupervised Domain Adaptive (UDA) person re-identification (ReID) aims at adapting the model trained on a labeled source-domain dataset to a target-domain dataset without any further annotations. Most successful UDA-ReID approaches combine clustering-based pseudo-label prediction with representation learning and perform the two steps in an alternating fashion. However, offline interaction between these two steps may allow noisy pseudo labels to substantially hinder the capability of the model. In this paper, we propose a Group-aware Label Transfer (GLT) algorithm, which enables the online interaction and mutual promotion of pseudo-label prediction and representation learning. Specifically, a label transfer algorithm simultaneously uses pseudo labels to train the data while refining the pseudo labels as an online clustering algorithm. It treats the online label refinery problem as an optimal transport problem, which explores the minimum cost for assigning M samples to N pseudo labels. More importantly, we introduce a group-aware strategy to assign implicit attribute group IDs to samples. The combination of the online label refining algorithm and the group-aware strategy can better correct the noisy pseudo label in an online fashion and narrow down the search space of the target identity. The effectiveness of the proposed GLT is demonstrated by the experimental results (Rank-1 accuracy) for Market1501→DukeMTMC (82.0%) and DukeMTMC→Market1501 (92.2%), remarkably closing the gap between unsupervised and supervised performance on person re-identification.
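
The optimal transport view in this abstract (assigning M samples to N pseudo labels at minimum cost) can be sketched as follows. The abstract does not say which solver is used; this sketch assumes entropy-regularised Sinkhorn iterations with uniform marginals, and `sinkhorn_assign`, `eps`, and the toy cost matrix are hypothetical names and values chosen for illustration.

```python
import numpy as np

def sinkhorn_assign(cost, eps=0.5, iters=200):
    """Entropy-regularised optimal transport between M samples and N labels.
    cost[i, j] is the cost of giving sample i pseudo label j.
    Returns an (M, N) transport plan whose rows sum to 1/M and whose
    columns sum (approximately, after convergence) to 1/N."""
    m, n = cost.shape
    kernel = np.exp(-cost / eps)          # Gibbs kernel
    row_marginal = np.full(m, 1.0 / m)    # every sample carries equal mass
    col_marginal = np.full(n, 1.0 / n)    # every label receives equal mass (assumed)
    u = np.ones(m)
    for _ in range(iters):
        v = col_marginal / (kernel.T @ u)  # rescale columns
        u = row_marginal / (kernel @ v)    # rescale rows
    return u[:, None] * kernel * v[None, :]

# Toy example: 4 samples, 2 pseudo labels; low cost means a good match.
cost = np.array([
    [0.0, 1.0],
    [0.1, 0.9],
    [1.0, 0.0],
    [0.9, 0.1],
])
plan = sinkhorn_assign(cost)
refined_labels = plan.argmax(axis=1)      # hard label per sample
```

A smaller `eps` pushes the plan toward a hard assignment but slows convergence of the iterations, so the value here is a convenience for the toy data rather than a recommendation.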

Journal ArticleDOI
TL;DR: Faster Mean-Shift proposes a new online seed optimization policy to adaptively determine the minimal number of seeds, accelerate computation, and save GPU memory, achieving a 7-10 times speedup compared to the state-of-the-art embedding-based cell instance segmentation and tracking algorithm.

Journal ArticleDOI
TL;DR: This article proposes complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and nonmaximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell_n$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT), and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR$_{100}$ for object detection, and +1.1 AP and +3.5 AR$_{100}$ for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU.
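
The three geometric factors listed in this abstract (overlap area, normalized central-point distance, aspect ratio) can be combined as in the sketch below. It follows the commonly cited form of the CIoU loss, 1 - IoU + rho^2 / c^2 + alpha * v; the abstract itself gives no formula, so the trade-off weight `alpha` used here is an assumption and the article's exact definition may differ. The function name `ciou_loss` and the corner-format boxes are illustrative choices.

```python
import math

def ciou_loss(box, gt):
    """CIoU-style loss for two boxes given as (x1, y1, x2, y2).
    Assumes both boxes have strictly positive width and height."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Factor 1: overlap area, measured by IoU.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    iou = inter / (w * h + gw * gh - inter)

    # Factor 2: squared centre distance, normalised by the squared
    # diagonal of the smallest box enclosing both.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 \
       + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Factor 3: aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    # Trade-off weight (assumed form); guarded so identical boxes give 0/0 -> 0.
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss((0, 0, 1, 1), (0, 0, 1, 1))   # identical boxes: no penalty
apart = ciou_loss((0, 0, 1, 1), (2, 0, 3, 1))  # disjoint, same shape
```

Unlike a plain IoU loss, which is constant once boxes stop overlapping, the centre-distance term still varies for disjoint boxes, so `apart` depends on how far the boxes are from each other.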

Journal ArticleDOI
TL;DR: An adaptive granularity learning distributed particle swarm optimization (AGLDPSO) is proposed that draws on machine-learning techniques, including clustering analysis based on locality-sensitive hashing (LSH) and adaptive granularity control based on logistic regression (LR).
Abstract: Large-scale optimization has become a significant and challenging research topic in the evolutionary computation (EC) community. Although many improved EC algorithms have been proposed for large-scale optimization, the slow convergence in the huge search space and the trap into local optima among massive suboptima are still the challenges. Targeted to these two issues, this article proposes an adaptive granularity learning distributed particle swarm optimization (AGLDPSO) with the help of machine-learning techniques, including clustering analysis based on locality-sensitive hashing (LSH) and adaptive granularity control based on logistic regression (LR). In AGLDPSO, a master–slave multisubpopulation distributed model is adopted, where the entire population is divided into multiple subpopulations, and these subpopulations are co-evolved. Compared with other large-scale optimization algorithms with single population evolution or centralized mechanism, the multisubpopulation distributed co-evolution mechanism will fully exchange the evolutionary information among different subpopulations to further enhance the population diversity. Furthermore, we propose an adaptive granularity learning strategy (AGLS) based on LSH and LR. The AGLS is helpful to determine an appropriate subpopulation size to control the learning granularity of the distributed subpopulations in different evolutionary states to balance the exploration ability for escaping from massive suboptima and the exploitation ability for converging in the huge search space. The experimental results show that AGLDPSO performs better than or at least comparable with some other state-of-the-art large-scale optimization algorithms, even the winner of the competition on large-scale optimization, on all the 35 benchmark functions from both IEEE Congress on Evolutionary Computation (IEEE CEC2010) and IEEE CEC2013 large-scale optimization test suites.

Journal ArticleDOI
TL;DR: Wang et al. propose a new three-phase hybrid feature selection algorithm based on correlation-guided clustering and particle swarm optimization (HFS-C-P) to tackle both the curse of dimensionality and the high computational cost of evolutionary algorithms in high-dimensional feature selection.
Abstract: The "curse of dimensionality" and the high computational cost still limit the application of evolutionary algorithms in high-dimensional feature selection (FS) problems. This article proposes a new three-phase hybrid FS algorithm based on correlation-guided clustering and particle swarm optimization (PSO) (HFS-C-P) to tackle the above two problems at the same time. To this end, three kinds of FS methods are effectively integrated into the proposed algorithm based on their respective advantages. In the first and second phases, a filter FS method and a feature clustering-based method with low computational cost are designed to reduce the search space used by the third phase. After that, the third phase searches for an optimal feature subset by using an evolutionary algorithm with global search ability. Moreover, a symmetric uncertainty-based feature deletion method, a fast correlation-guided feature clustering strategy, and an improved integer PSO are developed to improve the performance of the three phases, respectively. Finally, the proposed algorithm is validated on 18 publicly available real-world datasets in comparison with nine FS algorithms. Experimental results show that the proposed algorithm can obtain a good feature subset with the lowest computational cost.
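
The first phase relies on symmetric uncertainty (SU), a standard normalized form of mutual information for discrete variables: SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes that textbook quantity only; how the paper thresholds or ranks features with it is not stated in the abstract, and the helper names `entropy` and `symmetric_uncertainty` are choices made for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        # Both variables are constant; treat them as uninformative.
        return 0.0
    # Mutual information via the identity I(X; Y) = H(X) + H(Y) - H(X, Y).
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)
```

A feature identical to the class label scores 1, and a feature statistically independent of it scores 0, which is what makes SU usable as a cheap relevance filter.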

Journal ArticleDOI
TL;DR: In this paper, a novel K-means clustering algorithm based on a noise algorithm is developed to capture urban hotspots. The noise algorithm randomly enhances the attribution of data points and adds a noise judgment to the clustering output in order to automatically obtain the number of clusters for the given data and initialize the center cluster.
Abstract: With the development of cities, urban congestion is nearly an unavoidable problem for almost every large-scale city. Road planning is an effective means to alleviate urban congestion, which is a classical non-deterministic polynomial time (NP) hard problem, and has become an important research hotspot in recent years. A K-means clustering algorithm is an iterative clustering analysis algorithm that has been regarded as an effective means to solve urban road planning problems by scholars for the past several decades; however, it is very difficult to determine the number of clusters and sensitively initialize the center cluster. In order to solve these problems, a novel K-means clustering algorithm based on a noise algorithm is developed to capture urban hotspots in this paper. The noise algorithm is employed to randomly enhance the attribution of data points and output results of clustering by adding noise judgment in order to automatically obtain the number of clusters for the given data and initialize the center cluster. Four unsupervised evaluation indexes, namely, DB, PBM, SC, and SSE, are directly used to evaluate and analyze the clustering results, and a nonparametric Wilcoxon statistical analysis method is employed to verify the distribution states and differences between clustering results. Finally, five taxi GPS datasets from Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China) are selected to test and verify the effectiveness of the proposed noise K-means clustering algorithm by comparing the algorithm with fuzzy C-means, K-means, and K-means plus approaches. The compared experiment results show that the noise algorithm can reasonably obtain the number of clusters and initialize the center cluster, and the proposed noise K-means clustering algorithm demonstrates better clustering performance and accurately obtains clustering results, as well as effectively capturing urban hotspots.
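
For reference, the baseline that the noise variant builds on can be sketched as plain Lloyd-style K-means together with SSE, one of the four evaluation indexes listed above. This is the ordinary algorithm with caller-supplied initial centers, not the proposed noise K-means; the function name `kmeans` and its signature are assumptions of this example.

```python
import math

def kmeans(points, centers, iters=10):
    """Plain K-means from given initial centers; returns (labels, centers, sse)."""
    centers = [tuple(c) for c in centers]

    def nearest(p):
        # Index of the center closest to point p (Euclidean distance).
        return min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))

    labels = [nearest(p) for p in points]
    # SSE: sum of squared distances from each point to its assigned center.
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse
```

Both inputs that the abstract calls difficult, the number of clusters and the initial centers, are passed in by hand here, which is exactly the dependence the noise algorithm is meant to remove.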

Journal ArticleDOI
TL;DR: This work revises the FCM algorithm to make it applicable to data with unequal cluster sizes, noise and outliers, and non-uniform mass distribution, and shows that the RFCM algorithm works for both cases and outperforms both categories of algorithms.
Abstract: Clustering algorithms aim at finding dense regions of data based on similarities and dissimilarities of data points. Noise and outliers contribute to the computational procedure of the algorithms just as the actual data points do, which leads to inaccurate and misplaced cluster centers. This problem also arises when the cluster sizes differ, which moves the centers of small clusters towards large clusters. In engineering and physics, the mass of the data points is as important as their location: a non-uniform mass distribution displaces the cluster centers towards heavier clusters even if the cluster sizes are identical and the data are noise-free. The Fuzzy C-Means (FCM) algorithm, which suffers from these problems, is the most popular fuzzy clustering algorithm and has been the subject of numerous studies and developments, though improvements are still marginal. This work revises the FCM algorithm to make it applicable to data with unequal cluster sizes, noise and outliers, and non-uniform mass distribution. The Revised FCM (RFCM) algorithm employs adaptive exponential functions to eliminate the impact of noise and outliers on the cluster centers and modifies the constraint of the FCM algorithm to prevent large or heavier clusters from attracting the centers of small clusters. Several algorithms are reviewed and their mathematical structures are discussed in the paper including Possibilistic Fuzzy C-Means (PFCM), Possibilistic C-Means (PCM), Robust Fuzzy C-Means (FCM-σ), Noise Clustering (NC), Kernel Fuzzy C-Means (KFCM), Intuitionistic Fuzzy C-Means (IFCM), Robust Kernel Fuzzy C-Mean (KFCM-σ), Robust Intuitionistic Fuzzy C-Means (IFCM-σ), Kernel Intuitionistic Fuzzy C-Means (KIFCM), Robust Kernel Intuitionistic Fuzzy C-Means (KIFCM-σ), Credibilistic Fuzzy C-Means (CFCM), Size-insensitive integrity-based Fuzzy C-Means (siibFCM), Size-insensitive Fuzzy C-Means (csiFCM), Subtractive Clustering (SC), Density Based Spatial Clustering of Applications with Noise (DBSCAN), Gaussian Mixture Models (GMM), Spectral clustering, and Outlier Removal Clustering (ORC). Some of these algorithms are suitable for noisy data and some others are designed for data with unequal clusters. The study shows that the RFCM algorithm works for both cases and outperforms both categories of algorithms.
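
To see why every point, including an outlier, pulls on every center, it helps to look at the standard FCM updates that RFCM revises. The sketch below implements only the classical membership and center formulas with fuzzifier `m`; the adaptive exponential functions and modified constraint of RFCM are not given in the abstract and are not reproduced. The function names are assumptions of this example.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Classical FCM memberships of one point: u_i = 1 / sum_k (d_i / d_k)^(2/(m-1))."""
    d = [math.dist(point, c) for c in centers]
    if any(di == 0.0 for di in d):
        # The point coincides with one or more centers: share full membership among them.
        hits = [di == 0.0 for di in d]
        return [1.0 / sum(hits) if hit else 0.0 for hit in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** p for dk in d) for di in d]

def fcm_center(points, memberships, m=2.0):
    """Classical FCM center: mean of the points weighted by u^m."""
    weights = [u ** m for u in memberships]
    total = sum(weights)
    return tuple(sum(w * x for w, x in zip(weights, col)) / total for col in zip(*points))
```

Because the memberships of each point are constrained to sum to 1, a distant outlier still receives non-negligible membership in some cluster and shifts that cluster's weighted mean, which is the behavior the abstract sets out to correct.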

Book ChapterDOI
TL;DR: This document is a preliminary version of an in-depth review on the state of the art of clustering financial time series and the study of correlation networks and will form a basis for implementation of an open toolbox of standard tools to study correlations, hierarchies, networks and clustering in financial markets.
Abstract: We review the state of the art of clustering financial time series and the study of their correlations alongside other interaction networks. The aim of the review is to gather in one place the relevant material from different fields, e.g. machine learning, information geometry, econophysics, statistical physics, econometrics, behavioral finance. We hope it will help researchers to use more effectively this alternative modeling of the financial time series. Decision makers and quantitative researchers may also be able to leverage its insights. Finally, we also hope that this review will form the basis of an open toolbox to study correlations, hierarchies, networks and clustering in financial markets.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed strategy can process noisy data in the IoT to improve clustering accuracy, and they verify the universality and effectiveness of the postprocessing strategy in the traditional image recognition field and the IoT field, respectively.
Abstract: The Internet-of-Things (IoT) technology is widely used in various fields. In the Earth observation system, hyperspectral images (HSIs) are acquired by hyperspectral sensors and always transmitted to the cloud for analysis. In order to reduce cost and reply promptly, we deploy artificial intelligence (AI) models for data analysis on edge servers. Subspace clustering, the core of the AI model, is employed to analyze high-dimensional image data such as HSIs. However, most traditional subspace clustering algorithms construct a single model, which is more easily affected by noise and can hardly balance the sparsity and connectivity of the representation coefficient matrix. Therefore, we propose a postprocessing strategy for subspace clustering that takes both sparsity and connectivity into account. First, we define close neighbors as those that share more common neighbors and have higher coefficients, where the close neighbors are selected according to the nondominated sorting algorithm. Second, the coefficients between a sample and its close neighbors are preserved, while incorrect or useless connections are pruned. In this way, the postprocessing strategy preserves the intrasubspace connections and prunes the intersubspace connections. In experiments, we verify the universality and effectiveness of the postprocessing strategy in the traditional image recognition field and the IoT field, respectively. The experimental results demonstrate that the proposed strategy can process noisy data in the IoT to improve clustering accuracy.

Journal ArticleDOI
TL;DR: The Partitional Implementation of Unified Form (PIUF) algorithm is designed to be used on a single machine when the processed dataset is too big to be loaded entirely into memory, and it can be scaled to multiple processing nodes to reduce the processing time required to find the optimal solution.
Abstract: This paper proposes as an element of novelty the Unified Form (UF) clustering algorithm, which treats Fuzzy C-Means (FCM) and K-Means (KM) algorithms as a single configurable algorithm. UF algorithm was designed to facilitate the FCM and KM algorithms software implementation by offering a solution to implement a single algorithm, which can be configured to work as FCM or KM. The second element of novelty of this paper is the Partitional Implementation of Unified Form (PIUF) algorithm, which is built upon the UF algorithm and designed to solve in an elegant manner the challenges of processing large datasets in a sequential manner and the scalability of the UF algorithm for processing datasets of any size. PIUF algorithm has the advantage of overcoming any possible hardware limitations that can occur if large volumes of data are processed (required to be stored, loaded in memory and processed by a certain specified computational system). PIUF algorithm is designed and formulated to be used on a single machine if the processed dataset is very big and it cannot be entirely loaded in the memory; at the same time it can be scaled to multiple processing nodes for reducing the processing time required to find the optimal solution. UF and PIUF algorithms are implemented and validated in BigTim platform, which is a distributed platform developed by the authors, and offers support for processing various datasets in a parallel manner but they can be implemented in any other data processing platforms. The Iris dataset is considered and next modified to obtain different datasets of different sizes in order to test the algorithms implementations in BigTim platform in different configurations. The analysis of PIUF algorithm and the comparison with FCM, KM and DBSCAN clustering algorithms are carried out using two performance indices; three performance indices are employed to evaluate the quality of the obtained clusters.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: This paper proposes the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model.
Abstract: Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) There are multiple sound units in each input utterance, (2) With audio-only pre-training, there is no lexicon of sound units, and (3) Sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra low-resource Libri-light 10h, 1h, 10min supervised subsets.
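
The two ingredients described above, frame-level targets from a k-means teacher and a predictive loss restricted to masked positions, can be illustrated with the toy sketch below. It is not the model's training code: the centroids are assumed to be already fitted, the per-frame log-probabilities stand in for the network output, and the names `assign_units` and `masked_nll` are hypothetical.

```python
import math

def assign_units(frames, centroids):
    """Label each feature frame with the index of its nearest centroid (the k-means teacher)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(f, centroids[k]))
        for f in frames
    ]

def masked_nll(log_probs, targets, mask):
    """Average negative log-likelihood of the targets over masked frames only."""
    losses = [-lp[t] for lp, t, is_masked in zip(log_probs, targets, mask) if is_masked]
    return sum(losses) / len(losses) if losses else 0.0
```

Unmasked frames contribute nothing to the loss, so the model is only scored on positions whose cluster label it has to infer from the surrounding context.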

Journal ArticleDOI
TL;DR: The machine learning literature is surveyed, and several commonly used machine learning approaches are presented in an optimization framework for regression, classification, clustering, deep learning, and adversarial learning, as well as new emerging applications in machine teaching, empirical model learning, and Bayesian network structure learning.

Journal ArticleDOI
TL;DR: Wang et al. propose a scalable graph learning framework, based on the ideas of anchor points and a bipartite graph, to address three drawbacks of graph-based subspace clustering: expensive time overhead, failure to explore explicit clusters, and inability to generalize to unseen data points.
Abstract: Graph-based subspace clustering methods have exhibited promising performance. However, they still suffer from several drawbacks: they incur expensive time overhead, they fail to explore the explicit clusters, and they cannot generalize to unseen data points. In this work, we propose a scalable graph learning framework, seeking to address the above three challenges simultaneously. Specifically, it is based on the ideas of anchor points and a bipartite graph. Rather than building an n × n graph, where n is the number of samples, we construct a bipartite graph to depict the relationship between samples and anchor points. Meanwhile, a connectivity constraint is employed to ensure that the connected components indicate clusters directly. We further establish the connection between our method and K-means clustering. Moreover, a model to process multiview data is also proposed, which scales linearly with respect to n. Extensive experiments demonstrate the efficiency and effectiveness of our approach with respect to many state-of-the-art clustering methods.
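
The storage argument behind anchors can be made concrete with a toy construction: linking n samples to m anchors yields an n × m matrix instead of an n × n one. The sketch below uses a simple k-nearest-anchor heuristic with inverse-distance weights, which is an assumption of this example; the paper learns the bipartite graph under a connectivity constraint, and that optimization is not reproduced here. The name `anchor_graph` is hypothetical.

```python
import math

def anchor_graph(samples, anchors, k=2):
    """Return an n x m weight matrix linking each sample to its k nearest anchors."""
    graph = []
    for s in samples:
        dists = [math.dist(s, a) for a in anchors]
        nearest = sorted(range(len(anchors)), key=dists.__getitem__)[:k]
        # Inverse-distance weights (a hypothetical choice), normalized so each row sums to 1.
        raw = {j: 1.0 / (dists[j] + 1e-12) for j in nearest}
        total = sum(raw.values())
        graph.append([raw.get(j, 0.0) / total for j in range(len(anchors))])
    return graph
```

With m fixed and much smaller than n, both the memory and the time to build this matrix grow linearly in n, which is the scaling behavior the abstract claims for the full method.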

Journal ArticleDOI
19 Mar 2021
TL;DR: An overview is given of machine learning algorithms, such as Naïve Bayes, logistic regression, support vector machine, K-nearest neighbor, K-means clustering, decision tree, and random forest, that are applied to the identification and prediction of many diseases.
Abstract: Nowadays, machine learning algorithms have become very important in the medical sector, especially for diagnosing disease from medical databases. Many companies use these techniques for the early prediction of diseases and to enhance medical diagnostics. The motivation of this paper is to give an overview of the machine learning algorithms, such as Naive Bayes, logistic regression, support vector machine, K-nearest neighbor, K-means clustering, decision tree, and random forest, that are applied for the identification and prediction of many diseases. In this work, many previous studies from the last three years that used machine learning algorithms for detecting various diseases in the medical area were reviewed. A comparison is provided concerning these algorithms, assessment processes, and the obtained results. Finally, a discussion of the previous works is presented.