
Showing papers on "Cluster analysis published in 2022"


Journal ArticleDOI
TL;DR: Clustering is an essential tool in data mining research and applications, and it is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning.

133 citations


Journal ArticleDOI
06 Jun 2022-Scanning
TL;DR: Novel research on hyperspectral microscopic images using deep learning and effective unsupervised learning is explored, and the Kullback–Leibler divergence is used to test convergence of the objective function.
Abstract: Hyperspectral microscopy is used in biology and mineralogy, and unsupervised deep neural network denoising of SRS images provides hyperspectral resolution enhancement; a single hyperspectral image is enough to train the unsupervised method. An intuitive chemical species map for a lithium ore sample is produced using k-means clustering. Many researchers are now interested in biosignals, but uncertainty limits the algorithms' capacity to evaluate these signals for further information. Even though AI systems can solve difficult problems, they remain limited. Deep learning is used where classical machine learning is inefficient, and it is vital to modern AI; however, supervised learning requires a large labeled dataset, and careful parameter selection is needed to prevent over- or underfitting. Unsupervised learning, performed here by a clustering algorithm, is used to overcome the challenges outlined above. To accomplish this, two processing steps were used: (1) nonlinear deep learning networks turn the data into a latent feature space (Z), and (2) the Kullback–Leibler divergence is used to test convergence of the objective function. This article explores novel research on hyperspectral microscopic images using deep learning and effective unsupervised learning.
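The abstract's two-step scheme (a latent feature space Z plus a Kullback–Leibler objective) matches the common deep-embedded-clustering formulation; the sketch below shows that objective under this assumption, taking the latent features `z` and centroid initialization as given. Function names are illustrative, not from the paper.

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    # Student's t-kernel similarity between latent points and cluster centroids
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Sharpened target that emphasizes high-confidence assignments
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    # KL(P || Q): the quantity monitored for objective-function convergence
    return float((p * np.log(p / q)).sum())
```

Training would alternate between updating the network (and thus `z`) and recomputing the target distribution until the KL value stops decreasing.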

116 citations


Journal ArticleDOI
TL;DR: Complete-IoU (CIoU) loss and Cluster-NMS are proposed for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains in average precision (AP) and average recall (AR) without sacrificing inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains in average precision (AP) and average recall (AR) without sacrificing inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell_n$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT) and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR100 for object detection, and +1.1 AP and +3.5 AR100 for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU.
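The three geometric factors combine into the CIoU penalty as follows; this is an illustrative single-box implementation of the published formulation, not the authors' released code:

```python
import math

def ciou_loss(b1, b2):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    x1, y1, x2, y2 = b1
    X1, Y1, X2, Y2 = b2
    # 1) overlap area (IoU)
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    a1 = (x2 - x1) * (y2 - y1)
    a2 = (X2 - X1) * (Y2 - Y1)
    iou = inter / (a1 + a2 - inter)
    # 2) central-point distance, normalized by the enclosing box diagonal
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4
    # 3) aspect-ratio consistency term with its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan((x2 - x1) / (y2 - y1))
                              - math.atan((X2 - X1) / (Y2 - Y1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of 0; the distance term keeps the gradient informative even when boxes do not overlap, which is where plain IoU loss stalls.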

113 citations


Journal ArticleDOI
TL;DR: An optimal approach to anonymization using small data is proposed, and it is shown that the suggested method consistently finishes ahead of the existing method, using the least amount of time while ensuring the greatest level of security.
Abstract: An optimal approach to anonymization using small data is proposed in this study. MapReduce is a big-data processing framework used across distributed applications. Before entering the MapReduce framework, data are distributed and clustered using a hybrid clustering algorithm that combines the k-means and MFCM clustering algorithms. The clustered data are then fed into the MapReduce framework. To guarantee privacy, an optimal k-anonymization method is recommended. Generalisation and randomization are two techniques that can be employed for k-anonymity, and the choice between them depends on the type of the quasi-identifier attribute. Our method replaces the standard k-anonymization process with an optimization algorithm, the Modified Grey Wolf Optimization (MGWO) algorithm, that dynamically determines the optimal k value. Memory, execution time, accuracy, and error value are used to assess the recommended method's performance. This experiment has shown that the suggested method consistently finishes ahead of the existing method, using the least amount of time while ensuring the greatest level of security, and achieves the maximum accuracy compared with the current technique. The solution is implemented in Java with Hadoop MapReduce, and it is tested and deployed in the cloud on Google Cloud Platform.
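The k-anonymity property that the optimization targets can be checked directly: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. A minimal sketch of the property itself (illustrative; not the paper's MGWO pipeline):

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    # Group records by their quasi-identifier tuple and require that
    # every group contains at least k indistinguishable records.
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values()) >= k
```

A dynamically chosen k, as in the paper, trades privacy (larger k) against utility (smaller k) by re-running a check like this against candidate generalizations.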

110 citations


Journal ArticleDOI
TL;DR: This work shows that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: CALLHOME, AMI and DIHARD II. It also presents for the first time the derivation and update formulae for the VBx model.

110 citations


Journal ArticleDOI
TL;DR: A clustering algorithm based on sensor-node energy states (a modified MPCT algorithm) is proposed to strengthen the network longevity of wireless sensor networks, in which cluster-head determination depends on each cluster's power centroid as well as the power of the sensor nodes.
Abstract: Wireless sensor networks (WSNs) are collections of spatially distributed, dedicated sensors for monitoring and recording physical and environmental variables such as temperature, humidity, and so on. WSNs are rapidly expanding into fields such as healthcare, industry, and climate monitoring. However, the sensor nodes have restricted battery life, and replacing or recharging the batteries in the sensor nodes is very difficult in most cases. Energy efficiency is therefore the major concern in wireless sensor networks, as it is essential for keeping them operational. In this paper, a clustering algorithm based on sensor-node energy states (a modified MPCT algorithm) is proposed to strengthen network longevity, in which cluster-head determination depends on each cluster's power centroid as well as the energy of the sensor nodes. Communication between cluster head and sink node employs a distance-threshold parameter to reduce energy consumption. The results obtained show an average increase of 60% in network lifetime compared with the low-energy adaptive protocol, the energy-efficient midpoint-initialization algorithm (EECPK-means), the Park k-means algorithm and the mobility path selection protocol.
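The abstract does not give the exact MPCT scoring rule, so the sketch below only illustrates the idea it names: electing as cluster head the node that combines high residual energy with closeness to the cluster's energy-weighted "power centroid". The score function is an assumption for illustration.

```python
def power_centroid(nodes):
    # Energy-weighted centroid of a cluster; nodes = [(x, y, energy), ...]
    total = sum(e for _, _, e in nodes)
    return (sum(x * e for x, _, e in nodes) / total,
            sum(y * e for _, y, e in nodes) / total)

def elect_cluster_head(nodes):
    cx, cy = power_centroid(nodes)
    def score(node):
        # Favor high residual energy and proximity to the power centroid
        x, y, e = node
        return e / (1.0 + ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5)
    return max(nodes, key=score)
```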

106 citations


Journal ArticleDOI
11 Mar 2022-Science
TL;DR: This work combines genome engineering, confocal live-cell imaging, mass spectrometry, and data science to systematically map the localization and interactions of human proteins, and shows that proteins that bind RNA form a separate subgroup defined by specific localization and interaction signatures.
Abstract: Elucidating the wiring diagram of the human cell is a central goal of the postgenomic era. We combined genome engineering, confocal live-cell imaging, mass spectrometry, and data science to systematically map the localization and interactions of human proteins. Our approach provides a data-driven description of the molecular and spatial networks that organize the proteome. Unsupervised clustering of these networks delineates functional communities that facilitate biological discovery. We found that remarkably precise functional information can be derived from protein localization patterns, which often contain enough information to identify molecular interactions, and that RNA binding proteins form a specific subgroup defined by unique interaction and localization properties. Paired with a fully interactive website (opencell.czbiohub.org), our work constitutes a resource for the quantitative cartography of human cellular organization.

Description: Tracking proteins
Improved understanding of how proteins are organized within human cells should enhance our systems-level understanding of how cells function. Cho et al. used CRISPR technology to express more than 1000 different proteins at near endogenous amounts with labels that allowed both fluorescent imaging of their location and immunoprecipitation and mass spectrometry analysis of interacting protein partners (see the Perspective by Michnick and Levy). The large-scale data are made available on an interactive website, with clustering and analysis performed by machine learning. The studies emphasize the unusual properties of RNA-binding proteins and indicate that protein localization is very specific and may allow predictions of function. —LBR

Genome engineering, live-cell imaging, mass spectrometry, and data science are combined to map the localization and interactions of human proteins.

INTRODUCTION
Proteins are the product of gene expression and the molecular building blocks of cells.
Examples include enzymes that orchestrate the cell’s chemistry, filaments that shape the cell’s structure, or the pharmacological targets of drugs. The genome sequence provides us with the complete set of proteins that give rise to the human cell. However, systematically characterizing how proteins organize within the cell to sustain its operation remains an important goal of modern cell biology. A comprehensive map of the human proteome’s organization will serve as a reference to explore gene function in health and disease.

RATIONALE
Subcellular localization and physical interactions are key aspects tightly related to the function of any given protein. Proteins localize to different subcellular compartments, which enables a spatial separation of cellular functions. Proteins also physically interact with one another, forming molecular networks that connect proteins involved in the same processes. Therefore, mapping the cell’s molecular organization requires a comprehensive description of where different proteins localize and how they interact. Among other strategies, a powerful approach to map cellular architecture is to visualize individual proteins using fusions with fluorescent protein “tags.” These tags allow us not only to image protein localization in live cells, but also to measure protein interactions by serving as handles for immunopurification–mass spectrometry (IP-MS). Recent advances in genome engineering facilitate tagging of endogenous human genes, so that the corresponding proteins can be characterized in their native cellular environment.

RESULTS
Using high-throughput CRISPR-mediated genome editing, we constructed a library of 1310 fluorescently tagged cell lines. By performing paired IP-MS and live-cell imaging using this library, we generated a large dataset that maps the cellular localization and physical interactions of the corresponding 1310 proteins. Applying a combination of unsupervised clustering and machine learning for image analysis allowed us to objectively identify proteins that share spatial or interaction signatures. Our data provide insights into the function of individual proteins, but also enable us to derive some general principles of human cellular organization. In particular, we show that proteins that bind RNA form a separate subgroup defined by specific localization and interaction signatures. We also show that the precise spatial distribution of a given protein is very strongly correlated with its cellular function, such that fine-grained molecular insights can be derived from the analysis of imaging data. Our open-source dataset can be explored through an interactive web interface at opencell.czbiohub.org.

CONCLUSION
Our results show that endogenous tagging coupled with interactome and microscopy analysis provides new systems-level insights about the organization of the human proteome. The information contained within the subcellular distribution of each protein is highly specific and can be paired with advances in machine learning to extrapolate fine-grained functional information using microscopy alone. This opens exciting avenues for the characterization of understudied proteins, high-throughput screening, and modeling of complex cellular states during differentiation and disease.

OpenCell: Combining endogenous tagging, live-cell imaging, and interaction proteomics to map the architecture of the human proteome. We created a library of engineered cell lines by using CRISPR to introduce fluorescent tags into 1310 individual human proteins. This allowed us to image the localization of each protein in live cells, as well as the interactions between a given target and other proteins within the cell. This large dataset enables a systems-level description of the organization of the human proteome.

102 citations


Journal ArticleDOI
01 Feb 2022-Sensors
TL;DR: An improved metaheuristics-based clustering with multihop routing protocol for underwater wireless sensor networks, named the IMCMR-UWSN technique, which helps to significantly boost the energy efficiency and lifetime of the UWSN.
Abstract: Underwater wireless sensor networks (UWSNs) comprise numerous underwater wireless sensor nodes dispersed in the marine environment, which find applicability in several areas like data collection, navigation, resource investigation, surveillance, and disaster prediction. Because of the usage of restricted battery capacity and the difficulty in replacing or charging the inbuilt batteries, energy efficiency becomes a challenging issue in the design of UWSN. Earlier studies reported that clustering and routing are considered effective ways of attaining energy efficacy in the UWSN. Clustering and routing processes can be treated as nondeterministic polynomial-time (NP) hard optimization problems, and they can be addressed by the use of metaheuristics. This study introduces an improved metaheuristics-based clustering with multihop routing protocol for underwater wireless sensor networks, named the IMCMR-UWSN technique. The major aim of the IMCMR-UWSN technique is to choose cluster heads (CHs) and optimal routes to a destination. The IMCMR-UWSN technique incorporates two major processes, namely the chaotic krill head algorithm (CKHA)-based clustering and self-adaptive glow worm swarm optimization algorithm (SA-GSO)-based multihop routing. The CKHA technique selects CHs and organizes clusters based on different parameters such as residual energy, intra-cluster distance, and inter-cluster distance. Similarly, the SA-GSO algorithm derives a fitness function involving four parameters, namely residual energy, delay, distance, and trust. Utilization of the IMCMR-UWSN technique helps to significantly boost the energy efficiency and lifetime of the UWSN. To ensure the improved performance of the IMCMR-UWSN technique, a series of simulations were carried out, and the comparative results reported the supremacy of the IMCMR-UWSN technique in terms of different measures.

88 citations


Journal ArticleDOI
TL;DR: In this paper, the authors develop an integrated approach to detect clusters in financial data and to optimize the scope of the clusters so that they can be easily interpreted; the approach is suitable for large-scale financial datasets whose features are meaningful and is also applicable to financial mining tasks such as data-distribution interpretation and anomaly detection.
Abstract: In many financial applications, such as fraud detection, reject inference, and credit evaluation, detecting clusters automatically is critical because it helps to understand the subpatterns of the data that can be used to infer user’s behaviors and identify potential risks. Due to the complexity of human behaviors and changing social environments, the distributions of financial data are usually complex and it is challenging to find clusters and give reasonable interpretations. The goal of this study is to develop an integrated approach to detect clusters in financial data, and optimize the scope of the clusters such that the clusters can be easily interpreted. Specifically, we first proposed a new cluster quality evaluation criterion, which is free from large-scale computation and can guide base clustering algorithms such as ${k}$ -Means to detect hyperellipsoidal clusters adaptively. Then, we designed a new solver for a revised support vector data description model, which efficiently refines the centroids and scopes of the detected clusters to make the clusters tighter such that the data in the clusters share greater similarities, and thus, the clusters can be easily interpreted with eigenvectors. Using ten financial datasets, the experiments showed that the proposed algorithm can efficiently find reasonable number of clusters. The proposed approach is suitable for large-scale financial datasets whose features are meaningful, and also applicable to financial mining tasks, such as data distribution interpretation and anomaly detection.

84 citations


Journal ArticleDOI
TL;DR: In this paper, a clustering-based algorithm is proposed to classify regions into clusters and obtain approximately optimal point-to-point paths for UAVs, so that coverage tasks are carried out correctly and efficiently.
Abstract: Unmanned aerial vehicles (UAVs) have been widely applied in civilian and military applications due to their high autonomy and strong adaptability. Although UAVs can achieve effective cost reduction and flexibility enhancement in the development of large-scale systems, they result in a serious path planning and task allocation problem. Coverage path planning, which tries to seek flight paths to cover all of regions of interest, is one of the key technologies in achieving autonomous driving of UAVs and difficult to obtain optimal solutions because of its NP-Hard computational complexity. In this paper, we study the coverage path planning problem of autonomous heterogeneous UAVs on a bounded number of regions. First, with models of separated regions and heterogeneous UAVs, we propose an exact formulation based on mixed integer linear programming to fully search the solution space and produce optimal flight paths for autonomous UAVs. Then, inspired from density-based clustering methods, we design an original clustering-based algorithm to classify regions into clusters and obtain approximate optimal point-to-point paths for UAVs such that coverage tasks would be carried out correctly and efficiently. Experiments with randomly generated regions are conducted to demonstrate the efficiency and effectiveness of the proposed approach.
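The density-based clustering step that groups regions can be sketched with a minimal DBSCAN over region center points. This is the textbook algorithm the paper takes inspiration from, not the paper's adapted version; points are `(x, y)` tuples and `-1` marks noise:

```python
def dbscan(points, eps, min_pts):
    # Minimal DBSCAN: assign each point a cluster id, -1 = noise.
    labels = [None] * len(points)

    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if ((px - qx) ** 2 + (py - qy) ** 2) ** 0.5 <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:       # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cid
        seeds = list(nbrs)
        while seeds:                  # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:       # former noise becomes a border point
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:
                seeds.extend(jn)
        cid += 1
    return labels
```

Unlike k-means, the number of clusters is not fixed in advance, which suits region layouts whose grouping is unknown before planning.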

82 citations


Journal ArticleDOI
06 Oct 2022
TL;DR: In this paper, the authors propose a novel framework for clustered inference on average treatment effects that incorporates a design component accounting for the variability induced on the estimator by the treatment assignment mechanism.
Abstract: Clustered standard errors, with clusters defined by factors such as geography, are widespread in empirical research in economics and many other disciplines. Formally, clustered standard errors adjust for the correlations induced by sampling the outcome variable from a data-generating process with unobserved cluster-level components. However, the standard econometric framework for clustering leaves important questions unanswered: (i) Why do we adjust standard errors for clustering in some ways but not others, for example, by state but not by gender, and in observational studies but not in completely randomized experiments? (ii) Is the clustered variance estimator valid if we observe a large fraction of the clusters in the population? (iii) In what settings does the choice of whether and how to cluster make a difference? We address these and other questions using a novel framework for clustered inference on average treatment effects. In addition to the common sampling component, the new framework incorporates a design component that accounts for the variability induced on the estimator by the treatment assignment mechanism. We show that, when the number of clusters in the sample is a nonnegligible fraction of the number of clusters in the population, conventional clustered standard errors can be severely inflated, and propose new variance estimators that correct for this bias.
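The basic mechanics of clustered standard errors (summing residuals within clusters before squaring, so within-cluster correlation inflates the variance estimate) can be sketched for the simplest case of a sample mean. This is the conventional textbook estimator, not the bias-corrected estimator the authors propose:

```python
import numpy as np

def cluster_robust_se_of_mean(y, clusters):
    # Cluster-robust standard error of the sample mean: residuals are
    # summed within each cluster before squaring, so positively
    # correlated observations within a cluster enlarge the estimate.
    y = np.asarray(y, dtype=float)
    groups = np.asarray(clusters)
    resid = y - y.mean()
    var = 0.0
    for g in set(clusters):
        s = resid[groups == g].sum()
        var += s * s
    return (var / len(y) ** 2) ** 0.5
```

With every observation in its own cluster this reduces to the heteroskedasticity-robust standard error; grouping positively correlated observations together yields a larger value.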

Journal ArticleDOI
TL;DR: A prior-dependent graph (PDG) construction method achieves substantial performance and can be deployed in an edge computing module to provide efficient solutions for massive data management and applications in AIoT.

Journal ArticleDOI
TL;DR: In this paper, an adaptive clustering-based algorithm with a symbiotic organisms search-based optimization strategy is proposed to efficiently settle the path planning problem and generate feasible paths for heterogeneous UAVs with a view to minimizing the time consumption of the search tasks.
Abstract: Due to the high maneuverability and strong adaptability, autonomous unmanned aerial vehicles (UAVs) are of high interest to many civilian and military organizations around the world. Automatic path planning which autonomously finds a good enough path that covers the whole area of interest, is an essential aspect of UAV autonomy. In this study, we focus on the automatic path planning of heterogeneous UAVs with different flight and scan capabilities, and try to present an efficient algorithm to produce appropriate paths for UAVs. First, models of heterogeneous UAVs are built, and the automatic path planning is abstracted as a multi-constraint optimization problem and solved by a linear programming formulation. Then, inspired by the density-based clustering analysis and symbiotic interaction behaviours of organisms, an adaptive clustering-based algorithm with a symbiotic organisms search-based optimization strategy is proposed to efficiently settle the path planning problem and generate feasible paths for heterogeneous UAVs with a view to minimizing the time consumption of the search tasks. Experiments on randomly generated regions are conducted to evaluate the performance of the proposed approach in terms of task completion time, execution time and deviation ratio.

Journal ArticleDOI
TL;DR: In this paper, a neighborhood linear discriminant analysis (nLDA) is proposed, in which the scatter matrices are defined on a neighborhood consisting of reverse nearest neighbors and the neighborhood can be naturally regarded as the smallest subclass.

Journal ArticleDOI
TL;DR: In this article, a neighborhood linear discriminant analysis (nLDA) is proposed to address multimodality in LDA, in which the scatter matrices are defined on a neighborhood consisting of reverse nearest neighbors.

Journal ArticleDOI
TL;DR: An improved metaheuristic-driven energy-aware cluster-based routing (IMD-EACBR) scheme for IoT-assisted WSN that intends to achieve maximum energy utilization and lifetime in the network is introduced.
Abstract: The Internet of Things (IoT) is a network of numerous devices that communicate with one another via the internet. Wireless sensor networks (WSN) play an integral part in the IoT, which helps to produce seamless data that highly influence the network’s lifetime. Despite the significant applications of the IoT, several challenging issues such as security, energy, load balancing, and storage exist. Energy efficiency is considered to be a vital part of the design of IoT-assisted WSN; this is accomplished by clustering and multi-hop routing techniques. In view of this, we introduce an improved metaheuristic-driven energy-aware cluster-based routing (IMD-EACBR) scheme for IoT-assisted WSN. The proposed IMD-EACBR model intends to achieve maximum energy utilization and lifetime in the network. To attain this, the IMD-EACBR model primarily designs an improved Archimedes optimization algorithm-based clustering (IAOAC) technique for cluster head (CH) election and cluster organization. In addition, the IAOAC algorithm computes a fitness function that combines multiple criteria, specifically energy efficiency, distance, node degree, and inter-cluster distance. Moreover, a teaching–learning-based optimization (TLBO) algorithm-based multi-hop routing (TLBO-MHR) technique is applied for optimum selection of routes to destinations. Furthermore, the TLBO-MHR method derives a fitness function using energy and distance metrics. The performance of the IMD-EACBR model has been examined in several aspects. Simulation outcomes demonstrated enhancements of the IMD-EACBR model over recent state-of-the-art approaches. IMD-EACBR is a model that has been proposed for the transmission of emergency data, and the TLBO-MHR technique is one that is based on the requirements for hop count and distance. In the end, the proposed network is subjected to rigorous testing using NS-3.26’s full simulation capabilities.
The results of the simulation reveal improvements in performance in terms of the proportion of dead nodes, the lifetime of the network, the amount of energy consumed, the packet delivery ratio (PDR), and the latency.

Journal ArticleDOI
TL;DR: In this paper, the authors present a study on different k-nearest neighbor (KNN) variants and their performance comparison for disease prediction, and provide a relative comparison among KNN variants based on precision and recall measures.
Abstract: Disease risk prediction is a rising challenge in the medical domain. Researchers have widely used machine learning algorithms to solve this challenge. The k-nearest neighbour (KNN) algorithm is the most frequently used among the wide range of machine learning algorithms. This paper presents a study on different KNN variants (the classic one, Adaptive, Locally adaptive, k-means clustering, Fuzzy, Mutual, Ensemble, Hassanat and Generalised mean distance) and their performance comparison for disease prediction. This study analysed these variants in depth through implementations and experimentations using eight machine learning benchmark datasets obtained from Kaggle, the UCI Machine Learning Repository and OpenML. The datasets were related to different disease contexts. We considered the performance measures of accuracy, precision and recall for comparative analysis. The average accuracy values of these variants ranged from 64.22% to 83.62%. The Hassanat KNN showed the highest average accuracy (83.62%), followed by the ensemble approach KNN (82.34%). A relative performance index is also proposed based on each performance measure to assess each variant and compare the results. This study identified Hassanat KNN as the best performing variant based on the accuracy-based version of this index, followed by the ensemble approach KNN. This study also provided a relative comparison among KNN variants based on precision and recall measures. Finally, this paper summarises which KNN variant is the most promising candidate to follow under the consideration of three performance measures (accuracy, precision and recall) for disease prediction. Healthcare researchers and stakeholders could use the findings of this study to select the appropriate KNN variant for predictive disease risk analytics.
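The classic variant in the comparison is plain Euclidean majority-vote KNN, which can be sketched as follows; the adaptive, fuzzy, and Hassanat variants modify the distance function or vote weighting around this same core:

```python
from collections import Counter

def knn_predict(train, query, k):
    # train = [(feature_tuple, label), ...]; plain Euclidean majority vote.
    ranked = sorted(train,
                    key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    # Fraction of test samples whose predicted label matches the true one.
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)
```

Squared distances are enough for ranking neighbors, so the square root is skipped.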

Journal ArticleDOI
TL;DR: In this paper, a prior-dependent graph (PDG) construction method is proposed for high-efficiency data clustering and dimensionality reduction in Artificial Intelligence Internet of Things (AIoT) applications.

Proceedings ArticleDOI
05 Feb 2022
TL;DR: In this paper, the authors propose an Intent Contrastive Learning (ICL) approach to learn users' intent distribution functions from unlabeled user behavior sequences and optimize SR models with contrastive self-supervised learning by considering the learnt intents to improve recommendation.
Abstract: Users’ interactions with items are driven by various intents (e.g., preparing for holiday gifts, shopping for fishing equipment, etc.). However, users’ underlying intents are often unobserved/latent, making it challenging to leverage such latent intents for sequential recommendation (SR). To investigate the benefits of latent intents and leverage them effectively for recommendation, we propose Intent Contrastive Learning (ICL), a general learning paradigm that leverages a latent intent variable into SR. The core idea is to learn users’ intent distribution functions from unlabeled user behavior sequences and optimize SR models with contrastive self-supervised learning (SSL) by considering the learnt intents to improve recommendation. Specifically, we introduce a latent variable to represent users’ intents and learn the distribution function of the latent variable via clustering. We propose to leverage the learnt intents into SR models via contrastive SSL, which maximizes the agreement between a view of a sequence and its corresponding intent. The training alternates between intent representation learning and SR model optimization steps within the generalized expectation-maximization (EM) framework. Fusing user intent information into SR also improves model robustness. Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm, which improves performance and robustness against data sparsity and noisy interaction issues.
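The clustering step that learns the intent distribution can be sketched with a few k-means iterations over sequence embeddings. This is only an illustrative analog of the intent-representation step in the paper's alternating EM-style training, with deterministic initialization for simplicity:

```python
import numpy as np

def learn_intents(embeddings, n_intents, iters=10):
    # Cluster sequence embeddings; each centroid acts as a latent
    # "intent" prototype the SR model can contrast sequence views against.
    centroids = embeddings[:n_intents].copy()  # deterministic init (illustrative)
    assign = np.zeros(len(embeddings), dtype=int)
    for _ in range(iters):
        # assignment step: nearest centroid per sequence embedding
        d = ((embeddings[:, None] - centroids[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        # update step: recompute each intent prototype
        for c in range(n_intents):
            if (assign == c).any():
                centroids[c] = embeddings[assign == c].mean(0)
    return centroids, assign
```

In the full method, the contrastive loss then pulls each sequence representation toward its assigned intent prototype before the clustering is re-run.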

Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview of data pre-processing in machine learning, covering the problems encountered while building machine learning systems, and discuss flipping, rotation by slight degrees, and other methods for augmenting image data.
Abstract: This review paper provides an overview of data pre-processing in machine learning, focusing on the problems encountered while building machine learning models. It deals with two significant issues in the pre-processing process: (i) issues with data and (ii) the steps to follow for effective data analysis. As raw data are vulnerable to noise, corruption, missing values, and inconsistency, it is necessary to perform pre-processing steps, which is done using classification, clustering, association, and many other available techniques. Poor data can primarily affect accuracy and lead to false predictions, so it is necessary to improve the dataset's quality, and data pre-processing is the best way to deal with such problems; it makes knowledge extraction from the dataset much easier through cleaning, integration, transformation, and reduction methods. Missing data and significant differences in the variety of data always exist because the information is collected from multiple sources and from real-world applications. The data augmentation approach therefore generates additional data for machine learning models, decreasing the dependency on training data and improving model performance. This paper discusses flipping, rotation by slight degrees, and other methods to augment image data, and shows how to perform data augmentation without distorting the original data.
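The flipping and rotation augmentations discussed above are one-liners on array images. A minimal sketch with NumPy (small-angle rotations need an imaging library, so only axis flips and 90-degree rotations are shown):

```python
import numpy as np

def augment(image):
    # Label-preserving augmentations: horizontal/vertical flips and
    # 90-degree rotations. Each returns a new view of the same content.
    return [np.fliplr(image),       # mirror left-right
            np.flipud(image),       # mirror top-bottom
            np.rot90(image, 1),     # rotate 90 degrees counterclockwise
            np.rot90(image, 3)]     # rotate 270 degrees counterclockwise
```

Each transform preserves the image content and label, so one labeled sample yields several training samples at no labeling cost.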

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a new three-phase hybrid feature selection algorithm based on correlation-guided clustering and particle swarm optimization (HFS-C-P) to tackle the above two problems at the same time.
Abstract: The “curse of dimensionality” and high computational cost still limit the application of evolutionary algorithms to high-dimensional feature selection (FS) problems. This article proposes a new three-phase hybrid FS algorithm based on correlation-guided clustering and particle swarm optimization (PSO), named HFS-C-P, to tackle these two problems at the same time. To this end, three kinds of FS methods are effectively integrated into the proposed algorithm based on their respective advantages. In the first and second phases, a filter FS method and a feature-clustering-based method with low computational cost are designed to reduce the search space used by the third phase. The third phase then searches for an optimal feature subset using an evolutionary algorithm with global search capability. Moreover, a symmetric-uncertainty-based feature deletion method, a fast correlation-guided feature clustering strategy, and an improved integer PSO are developed to improve the performance of the three phases, respectively. Finally, the proposed algorithm is validated on 18 publicly available real-world datasets against nine FS algorithms. Experimental results show that it can obtain a good feature subset with the lowest computational cost.
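The first-phase filter can be sketched with the standard symmetric uncertainty measure, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), which scores each feature against the class label in [0, 1]. The threshold and toy data below are illustrative assumptions, and the paper's clustering and PSO phases are omitted; discrete-valued features are assumed (continuous ones would need discretization first).

```python
import numpy as np

def entropy(x):
    """Shannon entropy (bits) of a discrete-valued array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1],
    using I(X; Y) = H(X) + H(Y) - H(X, Y) on the joint labels."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])
    return 2 * (hx + hy - entropy(joint)) / (hx + hy)

def filter_features(F, y, threshold=0.2):
    """Phase-one deletion: keep feature columns whose SU with the class
    label clears an (assumed) threshold."""
    return [j for j in range(F.shape[1])
            if symmetric_uncertainty(F[:, j], y) >= threshold]

y = np.array([0, 0, 1, 1, 0, 1])
F = np.column_stack([
    [0, 0, 1, 1, 0, 1],   # copies the label, so SU = 1.0
    [0, 1, 0, 1, 0, 1],   # nearly independent of the label, so low SU
])
print("kept columns:", filter_features(F, y))
```

Only the surviving columns would be passed on to the clustering and PSO phases, shrinking the search space those phases must cover.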

Journal ArticleDOI
TL;DR: It is demonstrated that chronic disease diagnosis can be significantly improved by heuristic-based attribute selection coupled with clustering followed by classification, and can be used to develop a decision support system to assist medical experts in the effective analysis of chronic diseases in a cost-effective manner.
Abstract: Advanced predictive analytics coupled with an effective attribute selection method plays a pivotal role in the precise assessment of chronic disorder risks in patients. Traditional attribute selection approaches suffer from premature convergence, high complexity, and computational cost. On the contrary, heuristic-based optimization applied to supervised methods minimizes the computational cost by eliminating outlier attributes. In this study, a novel buffer-enabled heuristic, the memory-based metaheuristic attribute selection (MMAS) model, is proposed, which performs a local neighborhood search to optimize chronic disorder data. The data are further filtered with unsupervised K-means clustering to remove outliers, and the result is input to a Naive Bayes classifier to determine the presence of chronic disease risk. Heart disease, breast cancer, diabetes, and hepatitis datasets were used in the research. Upon implementation of the model, a mean accuracy of 94.5% using MMAS was recorded, dropping to 93.5% when clustering was not used. The average precision, recall, and F-score were 96.05%, 94.07%, and 95.06%, respectively, and the model has a low latency of 0.8 s. Thus, chronic disease diagnosis can be significantly improved by heuristic-based attribute selection coupled with clustering followed by classification, and the model can be used to develop a decision support system to assist medical experts in the effective analysis of chronic diseases in a cost-effective manner.
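The clustering step of such a pipeline, removing outlier records before the classifier sees them, can be sketched as follows. The toy data, the deterministic k-means initialization, and the mean + 2·std distance cutoff are illustrative assumptions, not details from the paper.

```python
import numpy as np

def kmeans(X, k, iters=30):
    """Plain k-means with a deterministic init (the first k rows) so the
    sketch is reproducible; k-means++ would be the usual choice in practice."""
    c = X[:k].astype(float).copy()
    for _ in range(iters):
        a = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
        for j in range(k):
            if (a == j).any():
                c[j] = X[a == j].mean(0)
    return c, a

def drop_outliers(X, k=2, z=2.0):
    """Keep records whose distance to their own cluster centroid is within
    mean + z * std of all such distances; the rest are flagged as outliers."""
    c, a = kmeans(X, k)
    d = np.linalg.norm(X - c[a], axis=1)
    return d <= d.mean() + z * d.std()

# Two tight groups of records plus one corrupted record at the end.
# Rows are ordered so the first two rows seed the two clusters.
group_a = np.tile([0.0, 0.0], (10, 1))
group_b = np.tile([10.0, 10.0], (10, 1))
X = np.vstack([group_a[:1], group_b[:1], group_a[1:], group_b[1:],
               [[100.0, 100.0]]])
keep = drop_outliers(X)
print("records kept:", int(keep.sum()), "of", len(X))
# The cleaned data X[keep] would then be passed to the classifier.
```

The cutoff controls how aggressive the filtering is: a larger z keeps borderline records, a smaller z trades recall for cleaner training data.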

Journal ArticleDOI
01 Jan 2022-Energy
TL;DR: In this paper, a deep residual neural network (DRNN) was proposed to obtain the regional dust concentration of photovoltaic (PV) panels, and an image preprocessing method was designed to classify the dust accumulation.

Journal ArticleDOI
TL;DR: In this paper, the authors compared the statistical power of discrete (k-means), fuzzy (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
Abstract: Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis). We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3). Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
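The simulation logic the abstract describes (generate subgroups at a given separation, cluster them, and score recovery) can be sketched as a Monte-Carlo power estimate. This is a simplified stand-in, not the authors' pipeline: two spherical unit-variance groups, plain k-means with a deterministic init, and an assumed per-dataset accuracy criterion of 0.8.

```python
import numpy as np

def kmeans2(X, iters=25):
    """Two-cluster k-means with a deterministic init (first and last sample)."""
    c = np.stack([X[0], X[-1]]).astype(float)
    for _ in range(iters):
        a = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
        for j in (0, 1):
            if (a == j).any():
                c[j] = X[a == j].mean(0)
    return a

def power_sim(n_per_group=20, delta=4.0, dims=2, sims=100,
              acc_thresh=0.8, seed=0):
    """Monte-Carlo power: the fraction of simulated datasets in which
    k-means recovers two spherical-normal subgroups with accuracy
    >= acc_thresh. delta is the centroid separation in units of the
    (unit) within-group SD."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_group)
    hits = 0
    for _ in range(sims):
        mu_b = np.zeros(dims)
        mu_b[0] = delta  # groups differ along one axis only
        X = np.vstack([rng.normal(np.zeros(dims), 1.0, (n_per_group, dims)),
                       rng.normal(mu_b, 1.0, (n_per_group, dims))])
        a = kmeans2(X)
        acc = max((a == labels).mean(), (a != labels).mean())  # label switching
        hits += acc >= acc_thresh
    return hits / sims

print("power at delta=4:", power_sim(delta=4.0))
print("power at delta=1:", power_sim(delta=1.0))
```

Even this toy version reproduces the qualitative finding: power is near its ceiling at large separation and collapses at small separation, regardless of running more simulations.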

Journal ArticleDOI
TL;DR: A comprehensive review of various equal clustering, unequal clustering, and hybrid clustering approaches and their clustering attributes is presented, aiming to mitigate hotspot issues in heterogeneous WSNs through parameters such as cluster head selection, number of clusters, zone formation, transmission, and routing.
Abstract: Wireless Sensor Networks (WSNs) consist of a spatially distributed set of autonomous, connected sensor nodes. The deployed sensor nodes are extensively used for sensing and monitoring in environmental surveillance, military operations, transportation monitoring, and healthcare. The sensor nodes in these networks have limited battery, storage, and processing resources. Nodes deployed closer to the base station must forward both their own data and neighboring nodes’ data towards the base station, which depletes their energy quickly. This is called the hotspot issue, and it mainly appears where the traffic load on sensor nodes is highest. Dynamic and unequal clustering techniques have been used to mitigate hotspot issues; however, despite some benefits, these solutions suffer from coverage overhead, network connectivity issues, unbalanced energy utilization among the sink nodes, and network stability issues. In this paper, a comprehensive review of various equal clustering, unequal clustering, and hybrid clustering approaches and their clustering attributes is presented to mitigate hotspot issues in heterogeneous WSNs, covering parameters such as cluster head selection, number of clusters, zone formation, transmission, and routing. This review provides a detailed platform for new researchers to explore novel solutions to the hotspot issue in these networks.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a collaborative clustering-characteristic-based data fusion approach for intrusion detection in a blockchain-based system, where a mathematical model of data fusion is designed and an AI model is used to train and analyze data clusters in blockchain networks.
Abstract: Blockchain technology has been rapidly changing the transaction behavior and efficiency of businesses in recent years. Data privacy and system reliability are critical issues that must be addressed in Blockchain environments. Anomaly intrusion poses a significant threat to a Blockchain; therefore, this article proposes a collaborative clustering-characteristic-based data fusion approach for intrusion detection in a Blockchain-based system, where a mathematical model of data fusion is designed and an AI model is used to train and analyze data clusters in Blockchain networks. The abnormal characteristics in a Blockchain dataset are identified, a weighted combination is carried out, and the weighting coefficients among several nodes are obtained after multiple rounds of mutual competition among clustering nodes. When the weighting coefficient and a similarity matching relationship follow a standard pattern, abnormal intrusion behavior is accurately and collaboratively detected. Experimental results show that the proposed algorithm has high recognition accuracy and promising performance in the real-time detection of attacks on a Blockchain.

Journal ArticleDOI
TL;DR: A novel contrastive learning scheme that places the labels in the same embedding space as the features and performs the distance comparison between features and labels in this shared embedding space, which drastically reduces the number of pair-wise comparisons and thus improves model performance.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a scalable graph learning framework based on the ideas of anchor points and bipartite graphs, addressing expensive time overhead, the inability to explore explicit clusters, and the failure to generalize to unseen data points.
Abstract: Graph-based subspace clustering methods have exhibited promising performance. However, they still suffer from several drawbacks: they incur expensive time overhead, fail to explore explicit clusters, and cannot generalize to unseen data points. In this work, we propose a scalable graph learning framework, seeking to address the above three challenges simultaneously. Specifically, it is based on the ideas of anchor points and bipartite graphs. Rather than building an $n\times n$ graph, where $n$ is the number of samples, we construct a bipartite graph to depict the relationship between samples and anchor points. Meanwhile, a connectivity constraint is employed to ensure that the connected components indicate clusters directly. We further establish the connection between our method and $K$-means clustering. Moreover, a model to process multiview data is also proposed, which scales linearly with respect to $n$. Extensive experiments demonstrate the efficiency and effectiveness of our approach compared with many state-of-the-art clustering methods.
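The bipartite construction can be sketched as follows. The Gaussian-kernel weighting, the s-nearest-anchor sparsification, and the random anchor selection are illustrative assumptions; the paper's graph learning objective and connectivity constraint are not reproduced here.

```python
import numpy as np

def anchor_bipartite_graph(X, anchors, s=3):
    """Build the n x m sample-to-anchor affinity matrix Z: each sample keeps
    Gaussian-kernel weights to its s nearest anchors, with rows normalized
    to sum to 1. Downstream steps then scale with n * m instead of n^2."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # squared dists
    sigma = d2.mean() + 1e-12                                  # bandwidth heuristic
    Z = np.zeros_like(d2)
    for i, row in enumerate(d2):
        nn = np.argsort(row)[:s]            # s nearest anchors for sample i
        w = np.exp(-row[nn] / sigma)
        Z[i, nn] = w / w.sum()
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # n = 200 samples
anchors = X[rng.choice(200, 10, replace=False)]   # m = 10 anchor points
Z = anchor_bipartite_graph(X, anchors)
print(Z.shape)
```

An unseen sample can be embedded by computing one additional row of Z against the fixed anchors, which is one way such anchor-based frameworks generalize out of sample.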

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper examined the influential factors of China's CEEI at both national and provincial level and explored targeted provincial strategies, which are critical for China to control its CEEIs effectively and to achieve its carbon peaking aim.

Journal ArticleDOI
TL;DR: In this article , a deep learning model for detection and prevention of attack in the cloud platform is presented, which is carried out in three phases like at first, Hidden Markov Model (HMM) is incorporated for the detection of attacks.