
Showing papers on "Cluster analysis published in 2020"


Journal ArticleDOI
TL;DR: This survey aims to provide researchers and practitioners new to the field as well as more advanced readers with a solid understanding of the main approaches and algorithms developed over the past two decades, with an emphasis on the most prominent and currently relevant work.
Abstract: Semi-supervised learning is the branch of machine learning concerned with using labelled as well as unlabelled data to perform certain learning tasks. Conceptually situated between supervised and unsupervised learning, it permits harnessing the large amounts of unlabelled data available in many use cases in combination with typically smaller sets of labelled data. In recent years, research in this area has followed the general trends observed in machine learning, with much attention directed at neural network-based models and generative learning. The literature on the topic has also expanded in volume and scope, now encompassing a broad spectrum of theory, algorithms and applications. However, no recent surveys exist to collect and organize this knowledge, impeding the ability of researchers and engineers alike to utilize it. Filling this void, we present an up-to-date overview of semi-supervised learning methods, covering earlier work as well as more recent advances. We focus primarily on semi-supervised classification, where the large majority of semi-supervised learning research takes place. Our survey aims to provide researchers and practitioners new to the field as well as more advanced readers with a solid understanding of the main approaches and algorithms developed over the past two decades, with an emphasis on the most prominent and currently relevant work. Furthermore, we propose a new taxonomy of semi-supervised classification algorithms, which sheds light on the different conceptual and methodological approaches for incorporating unlabelled data into the training process. Lastly, we show how the fundamental assumptions underlying most semi-supervised learning algorithms are closely connected to each other, and how they relate to the well-known semi-supervised clustering assumption.

1,226 citations


Journal ArticleDOI
TL;DR: An unsupervised learning schema is constructed for the k-means algorithm so that it is free of initialization and parameter selection and can also simultaneously find an optimal number of clusters.
Abstract: The k-means algorithm is the best-known and most widely used clustering method, and numerous extensions of it have been proposed in the literature. Although k-means is treated as an unsupervised approach to clustering in pattern recognition and machine learning, the algorithm and its extensions remain sensitive to initialization and require the number of clusters to be specified a priori; in that sense, k-means is not a fully unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initialization and parameter selection and can simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed, and comparisons are made between U-k-means and other existing methods. Experimental results and comparisons demonstrate these advantages of the proposed U-k-means clustering algorithm.
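For context, the dependence that U-k-means removes is easiest to see in a bare-bones Lloyd's k-means loop, where both the number of clusters k and the initial centroids (here fixed by a random seed) must be supplied up front. The following is a minimal NumPy sketch of that standard baseline, not of the U-k-means procedure itself; the function and variable names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means: both k and the initial centroids (via the seed)
    must be chosen in advance -- exactly the dependence U-k-means removes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster becomes empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
```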

545 citations


Posted Content
Junnan Li1, Pan Zhou1, Caiming Xiong1, Richard Socher1, Steven C. H. Hoi1 
TL;DR: This paper introduces prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework and proposes the ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes.
Abstract: This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of instance-wise contrastive learning. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it implicitly encodes semantic structures of the data into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. We iteratively perform E-step as finding the distribution of prototypes via clustering and M-step as optimizing the network via contrastive learning. We propose ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL achieves state-of-the-art results on multiple unsupervised representation learning benchmarks, with >10% accuracy improvement in low-resource transfer tasks. Code is available at this https URL.
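A stripped-down sketch of the prototype-based contrastive idea follows: embeddings are pulled towards their assigned cluster prototype and pushed away from the other prototypes through an InfoNCE-style cross-entropy. The function name, the single fixed temperature, and the use of one clustering are simplifications; the paper's ProtoNCE additionally estimates per-cluster concentrations and combines multiple clustering granularities.

```python
import torch
import torch.nn.functional as F

def protonce_like_loss(z, prototypes, assignments, temperature=0.1):
    """Simplified, single-clustering prototype contrastive loss.
    z: (N, d) L2-normalised embeddings; prototypes: (K, d) L2-normalised
    cluster centroids; assignments: (N,) index of each sample's prototype."""
    logits = z @ prototypes.t() / temperature   # (N, K) similarity to every prototype
    return F.cross_entropy(logits, assignments)  # pull towards own prototype, push from rest

z = F.normalize(torch.randn(8, 16), dim=1)
protos = F.normalize(torch.randn(4, 16), dim=1)
assign = torch.randint(0, 4, (8,))
loss = protonce_like_loss(z, protos, assign)
```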

493 citations


Journal ArticleDOI
TL;DR: This work proposes a novel subspace clustering model for multi-view data using a latent representation termed Latent Multi-View Subspace Clustering (LMSC), which explores underlying complementary information from multiple views and simultaneously seeks the underlying latent representation.
Abstract: Subspace clustering is an effective method that has been successfully applied to many applications. Here, we propose a novel subspace clustering model for multi-view data using a latent representation termed Latent Multi-View Subspace Clustering (LMSC). Unlike most existing single-view subspace clustering methods, which directly reconstruct data points using original features, our method explores underlying complementary information from multiple views and simultaneously seeks the underlying latent representation. Using the complementarity of multiple views, the latent representation depicts data more comprehensively than each individual view, accordingly making subspace representation more accurate and robust. We propose two LMSC formulations: linear LMSC (lLMSC), based on linear correlations between the latent representation and each view, and generalized LMSC (gLMSC), based on neural networks to handle general relationships. The proposed method can be efficiently optimized under the Augmented Lagrangian Multiplier with Alternating Direction Minimization (ALM-ADM) framework. Extensive experiments on diverse datasets demonstrate the effectiveness of the proposed method.

455 citations


Proceedings Article
30 Apr 2020
TL;DR: Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition; the algorithm uses a Gumbel-Softmax or online k-means clustering to quantize the dense representations.
Abstract: We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
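The Gumbel-Softmax quantization route can be sketched in a few lines: add Gumbel noise to the codebook logits, apply a temperature-scaled softmax to obtain soft codes, and take the argmax for the discrete choice. This toy NumPy version assumes a single codebook and omits the straight-through gradient trick and grouped codebooks used in practice.

```python
import numpy as np

def gumbel_softmax_quantize(logits, codebook, tau=1.0, rng=None):
    """Toy Gumbel-Softmax vector quantisation.
    logits: (T, V) scores over V codebook entries for each of T frames;
    codebook: (V, d) code vectors. Returns soft and hard quantised frames."""
    rng = rng or np.random.default_rng(0)
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = logits + gumbel
    y = np.exp((y - y.max(axis=1, keepdims=True)) / tau)
    probs = y / y.sum(axis=1, keepdims=True)               # soft one-hot codes
    hard = np.eye(logits.shape[1])[probs.argmax(axis=1)]   # discrete code choice
    return probs @ codebook, hard @ codebook

logits = np.random.randn(5, 8)       # 5 frames, 8 codebook entries
codebook = np.random.randn(8, 16)    # 8 code vectors of dimension 16
soft_q, hard_q = gumbel_softmax_quantize(logits, codebook)
```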

438 citations


Journal ArticleDOI
TL;DR: The proposed general Graph-based Multi-view Clustering (GMC) takes the data graph matrices of all views and fuses them to generate a unified graph matrix, which helps partition the data points naturally into the required number of clusters.
Abstract: Multi-view graph-based clustering aims to provide clustering solutions to multi-view data. However, most existing methods do not give sufficient consideration to weights of different views and require an additional clustering step to produce the final clusters. They also usually optimize their objectives based on fixed graph similarity matrices of all views. In this paper, we propose a general Graph-based Multi-view Clustering (GMC) to tackle these problems. GMC takes the data graph matrices of all views and fuses them to generate a unified graph matrix. The unified graph matrix in turn improves the data graph matrix of each view, and also gives the final clusters directly. The key novelty of GMC is its learning method, which can help the learning of each view graph matrix and the learning of the unified graph matrix in a mutual reinforcement manner. A novel multi-view fusion technique can automatically weight each data graph matrix to derive the unified graph matrix. A rank constraint without introducing a tuning parameter is also imposed on the graph Laplacian matrix of the unified matrix, which helps partition the data points naturally into the required number of clusters. An alternating iterative optimization algorithm is presented to optimize the objective function. Experimental results using both toy data and real-world data demonstrate that the proposed method outperforms state-of-the-art baselines markedly.

378 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper proposes a new framework, referred to as a collaborative class conditional generative adversarial net, to bypass the dependence on the source data; the framework achieves superior performance on multiple adaptation tasks with only unlabeled target data, which verifies its effectiveness in this challenging setting.
Abstract: In this paper, we investigate a challenging unsupervised domain adaptation setting --- unsupervised model adaptation. We aim to explore how to rely only on unlabeled target data to improve performance of an existing source prediction model on the target domain, since labeled source data may not be available in some real-world scenarios due to data privacy issues. For this purpose, we propose a new framework, which is referred to as collaborative class conditional generative adversarial net to bypass the dependence on the source data. Specifically, the prediction model is to be improved through generated target-style data, which provides more accurate guidance for the generator. As a result, the generator and the prediction model can collaborate with each other without source data. Furthermore, due to the lack of supervision from source data, we propose a weight constraint that encourages similarity to the source model. A clustering-based regularization is also introduced to produce more discriminative features in the target domain. Compared to conventional domain adaptation methods, our model achieves superior performance on multiple adaptation tasks with only unlabeled target data, which verifies its effectiveness in this challenging setting.

330 citations


Posted Content
TL;DR: This work presents a systematic learning-theoretic study of personalization, and proposes and analyzes three approaches: user clustering, data interpolation, and model interpolation.
Abstract: The standard objective in machine learning is to train a single model for all users. However, in many learning scenarios, such as cloud computing and federated learning, it is possible to learn a personalized model per user. In this work, we present a systematic learning-theoretic study of personalization. We propose and analyze three approaches: user clustering, data interpolation, and model interpolation. For all three approaches, we provide learning-theoretic guarantees and efficient algorithms for which we also demonstrate the performance empirically. All of our algorithms are model-agnostic and work for any hypothesis class.

313 citations


Journal ArticleDOI
TL;DR: A new structural context descriptor is designed to characterize the structural properties of individuals in crowd scenes and a novel framework is introduced for group detection, which is able to determine the group number automatically without any parameter or threshold to be tuned.
Abstract: Detecting coherent groups is fundamentally important for crowd behavior analysis. In the past few decades, plenty of works have been conducted on this topic, but most of them have limitations due to the insufficient utilization of crowd properties and the arbitrary processing of individuals. In this study, a Multiview-based Parameter Free framework (MPF) is proposed. Based on the L1-norm and L2-norm, we design two versions of the multiview clustering method, which is the main part of the proposed framework. This paper presents the contributions on three aspects: (1) a new structural context descriptor is designed to characterize the structural properties of individuals in crowd scenes; (2) a self-weighted multiview clustering method is proposed to cluster feature points by incorporating their orientation and context similarities; and (3) a novel framework is introduced for group detection, which is able to determine the group number automatically without any parameter or threshold to be tuned. The effectiveness of the proposed framework is evaluated on real-world crowd videos, and the experimental results show its promising performance on group detection. In addition, the proposed multiview clustering method is also evaluated on a synthetic dataset and several standard benchmarks, and its superiority over the state-of-the-art competitors is demonstrated.

267 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper designs a two-branch network to extract point features and predict semantic labels and offsets, for shifting each point towards its respective instance centroid, and presents PointGroup, a new end-to-end bottom-up architecture specifically focused on better grouping the points by exploring the void space between objects.
Abstract: Instance segmentation is an important task for scene understanding. Compared to its fully developed 2D counterpart, 3D instance segmentation for point clouds has much room for improvement. In this paper, we present PointGroup, a new end-to-end bottom-up architecture, specifically focused on better grouping the points by exploring the void space between objects. We design a two-branch network to extract point features and predict semantic labels and offsets, for shifting each point towards its respective instance centroid. A clustering component then utilizes both the original and offset-shifted point coordinate sets, taking advantage of their complementary strength. Further, we formulate ScoreNet to evaluate the candidate instances, followed by Non-Maximum Suppression (NMS) to remove duplicates. We conduct extensive experiments on two challenging datasets, ScanNet v2 and S3DIS, on which our method achieves the highest performance, 63.6% and 64.0%, compared to 54.9% and 54.4% achieved by the former best solutions in terms of mAP with IoU threshold 0.5.
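The offset-based grouping step can be illustrated with a small sketch: shift every point towards its predicted instance centroid, then connect same-class points that fall within a fixed radius of each other and treat each connected component as a candidate instance. This is only a toy stand-in for PointGroup's clustering component (which also clusters the original coordinates and scores/NMS-filters candidates); the radius and minimum-size values are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def group_points(coords, offsets, sem_labels, radius=0.3, min_points=5):
    """Group offset-shifted points of the same semantic class into instances."""
    shifted = coords + offsets
    tree = cKDTree(shifted)
    instance = -np.ones(len(coords), dtype=int)
    current = 0
    for seed in range(len(coords)):
        if instance[seed] != -1:
            continue
        # flood-fill over the radius graph, restricted to the seed's class
        stack, members = [seed], []
        instance[seed] = current
        while stack:
            i = stack.pop()
            members.append(i)
            for j in tree.query_ball_point(shifted[i], r=radius):
                if instance[j] == -1 and sem_labels[j] == sem_labels[seed]:
                    instance[j] = current
                    stack.append(j)
        if len(members) < min_points:   # discard tiny fragments
            instance[members] = -1
        else:
            current += 1
    return instance

coords = np.random.rand(200, 3)
offsets = np.zeros_like(coords)        # stand-in for network-predicted offsets
sem = np.zeros(200, dtype=int)         # stand-in for predicted semantic labels
inst = group_points(coords, offsets, sem)
```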

263 citations


Posted Content
TL;DR: A one-stage online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning, which remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks.
Abstract: In this paper, we propose a one-stage online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning. To be specific, for a given dataset, the positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, the instance- and cluster-level contrastive learning are respectively conducted in the row and column space by maximizing the similarities of positive pairs while minimizing those of negative ones. Our key observation is that the rows of the feature matrix could be regarded as soft labels of instances, and accordingly the columns could be further regarded as cluster representations. By simultaneously optimizing the instance- and cluster-level contrastive loss, the model jointly learns representations and cluster assignments in an end-to-end manner. Extensive experimental results show that CC remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks. In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is up to a 19% (39%) performance improvement over the best baseline.
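The row/column view can be made concrete with a short sketch: the same symmetric InfoNCE-style loss is applied once to the rows (instance features of two augmented views) and once to the columns of the soft assignment matrices (cluster representations). This is a simplified illustration that omits CC's separate projection heads and the cluster-entropy regularizer; all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(a, b, temperature=0.5):
    """Symmetric InfoNCE over aligned vectors: a[i] and b[i] form a positive pair,
    every other combination is a negative."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# z1, z2: row-wise instance features of two augmented views;
# p1, p2: (N, K) soft cluster-assignment matrices of the same views.
N, d, K = 16, 32, 5
z1, z2 = torch.randn(N, d), torch.randn(N, d)
p1, p2 = torch.softmax(torch.randn(N, K), 1), torch.softmax(torch.randn(N, K), 1)
loss = contrastive_pair_loss(z1, z2) + contrastive_pair_loss(p1.t(), p2.t())
```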

Journal ArticleDOI
TL;DR: Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets.
Abstract: The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis along with a thorough comparison among different k-means clustering algorithms differentiates our work compared to other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
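One of the best-known remedies for the random-initialization problem discussed above is k-means++ seeding, which many surveyed variants build on: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal NumPy sketch (function and parameter names are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: spread the initial centroids out over the data."""
    rng = rng or np.random.default_rng(0)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.random.randn(300, 2)
init = kmeanspp_init(X, k=4)   # pass as the initial centroids to any k-means routine
```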

Proceedings ArticleDOI
19 Jul 2020
TL;DR: In this paper, a hierarchical clustering step (FL+HC) is introduced to separate clusters of clients by the similarity of their local updates to the global joint model, and the clusters are trained independently and in parallel on specialised models.
Abstract: Federated learning (FL) is a well-established method for performing machine learning tasks over massively distributed data. However, in settings where data is distributed in a non-iid (not independent and identically distributed) fashion - as is typical in real-world situations - the joint model produced by FL suffers in terms of test set accuracy and/or communication costs compared to training on iid data. We show that learning a single joint model is often not optimal in the presence of certain types of non-iid data. In this work we present a modification to FL by introducing a hierarchical clustering step (FL+HC) to separate clusters of clients by the similarity of their local updates to the global joint model. Once separated, the clusters are trained independently and in parallel on specialised models. We present a robust empirical analysis of the hyperparameters for FL+HC for several iid and non-iid settings. We show how FL+HC allows model training to converge in fewer communication rounds (significantly so under some non-iid settings) compared to FL without clustering. Additionally, FL+HC allows for a greater percentage of clients to reach a target accuracy compared to standard FL. Finally, we make suggestions for good default hyperparameters to promote superior-performing specialised models without modifying the underlying federated learning communication protocol.
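The clustering step itself is straightforward to sketch: flatten each client's local update (the difference between its locally trained weights and the global model), run agglomerative clustering over those vectors, and cut the dendrogram at a distance threshold. The linkage method, metric and threshold below are illustrative placeholders for the hyperparameters the paper analyses.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_clients(client_updates, distance_threshold=1.0, metric="cosine"):
    """Toy FL+HC-style clustering of clients by the similarity of their local updates."""
    X = np.stack([u.ravel() for u in client_updates])
    Z = linkage(X, method="average", metric=metric)
    return fcluster(Z, t=distance_threshold, criterion="distance")

# e.g. 8 clients, each update flattened from a small model
updates = [np.random.randn(10) for _ in range(8)]
cluster_ids = cluster_clients(updates)   # clients sharing an id then train a specialised model
```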

Posted Content
TL;DR: This work proposes a new framework dubbed the Iterative Federated Clustering Algorithm (IFCA), which alternately estimates the cluster identities of the users and optimizes model parameters for the user clusters via gradient descent, and analyzes the convergence rate of this algorithm first in a linear model with squared loss and then for generic strongly convex and smooth loss functions.
Abstract: We address the problem of federated learning (FL) where users are distributed and partitioned into clusters. This setup captures settings where different groups of users have their own objectives (learning tasks) but by aggregating their data with others in the same cluster (same learning task), they can leverage the strength in numbers in order to perform more efficient federated learning. For this new framework of clustered federated learning, we propose the Iterative Federated Clustering Algorithm (IFCA), which alternately estimates the cluster identities of the users and optimizes model parameters for the user clusters via gradient descent. We analyze the convergence rate of this algorithm first in a linear model with squared loss and then for generic strongly convex and smooth loss functions. We show that in both settings, with good initialization, IFCA is guaranteed to converge, and discuss the optimality of the statistical error rate. In particular, for the linear model with two clusters, we can guarantee that our algorithm converges as long as the initialization is slightly better than random. When the clustering structure is ambiguous, we propose to train the models by combining IFCA with the weight sharing technique in multi-task learning. In the experiments, we show that our algorithm can succeed even if we relax the requirements on initialization with random initialization and multiple restarts. We also present experimental results showing that our algorithm is efficient in non-convex problems such as neural networks. We demonstrate the benefits of IFCA over the baselines on several clustered FL benchmarks.
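The alternation at the heart of IFCA can be sketched for linear regression with squared loss: each round, every client picks the cluster model with the lowest loss on its own data, then each cluster model takes a gradient step on the data of its assigned clients. For readability this sketch collapses the federated local-computation-plus-averaging structure into one centralised step; all names and hyperparameters are illustrative.

```python
import numpy as np

def ifca_linear(clients, k, d, rounds=50, lr=0.1, rng=None):
    """Minimal IFCA-style loop: alternate cluster-identity estimation and
    per-cluster gradient steps. `clients` is a list of (X_i, y_i) datasets."""
    rng = rng or np.random.default_rng(0)
    models = rng.normal(size=(k, d))
    for _ in range(rounds):
        # each client picks the cluster model with the lowest loss on its data
        assign = [int(np.argmin([np.mean((X @ w - y) ** 2) for w in models]))
                  for X, y in clients]
        # each cluster model is updated with the average gradient of its members
        for j in range(k):
            members = [clients[i] for i, a in enumerate(assign) if a == j]
            if not members:
                continue
            grad = np.mean([2 * X.T @ (X @ models[j] - y) / len(y) for X, y in members], axis=0)
            models[j] -= lr * grad
    return models, assign

# two ground-truth clusters of clients with different linear models
rng = np.random.default_rng(1)
w_true = [np.array([1.0, -2.0]), np.array([-3.0, 0.5])]
clients = []
for i in range(6):
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ w_true[i % 2] + 0.1 * rng.normal(size=40)))
models, assign = ifca_linear(clients, k=2, d=2)
```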

Journal ArticleDOI
TL;DR: iLearn is a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences.
Abstract: With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.

Journal ArticleDOI
TL;DR: In this article, a review of clustering techniques applied in air pollution studies is presented, focusing on spatio-temporal characteristics of air pollutants, pollutant behavior in terms of source, transport pathways, apportionment and links to meteorological conditions.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors integrated imputation and clustering into a unified learning procedure, which does not require that there is at least one complete base kernel matrix over all the samples.
Abstract: Multiple kernel clustering (MKC) algorithms optimally combine a group of pre-specified base kernel matrices to improve clustering performance. However, existing MKC algorithms cannot efficiently address the situation where some rows and columns of base kernel matrices are absent. This paper proposes two simple yet effective algorithms to address this issue. Different from existing approaches where incomplete kernel matrices are first imputed and a standard MKC algorithm is applied to the imputed kernel matrices, our first algorithm integrates imputation and clustering into a unified learning procedure. Specifically, we perform multiple kernel clustering directly with the presence of incomplete kernel matrices, which are treated as auxiliary variables to be jointly optimized. Our algorithm does not require that there be at least one complete base kernel matrix over all the samples. Also, it adaptively imputes incomplete kernel matrices and combines them to best serve clustering. Moreover, we further improve this algorithm by encouraging these incomplete kernel matrices to mutually complete each other. The three-step iterative algorithm is designed to solve the resultant optimization problems. After that, we theoretically study the generalization bound of the proposed algorithms. Extensive experiments are conducted on 13 benchmark data sets to compare the proposed algorithms with existing imputation-based methods. Our algorithms consistently achieve superior performance and the improvement becomes more significant with increasing missing ratio, verifying the effectiveness and advantages of the proposed joint imputation and clustering.

Posted Content
TL;DR: A novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance in person re-identification models.
Abstract: Person re-identification (re-ID) aims at identifying the same persons' images across different cameras. However, domain diversities between different datasets pose an evident challenge for adapting the re-ID model trained on one dataset to another one. State-of-the-art unsupervised domain adaptation methods for person re-ID transferred the learned knowledge from the source domain by optimizing with pseudo labels created by clustering algorithms on the target domain. Although they achieved state-of-the-art performances, the inevitable label noise caused by the clustering procedure was ignored. Such noisy pseudo labels substantially hinder the model's capability to further improve feature representations on the target domain. In order to mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain by proposing an unsupervised framework, Mutual Mean-Teaching (MMT), to learn better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternative training manner. In addition, the common practice is to adopt both the classification loss and the triplet loss jointly for achieving optimal performances in person re-ID models. However, conventional triplet loss cannot work with softly refined labels. To solve this problem, a novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance. The proposed MMT framework achieves considerable improvements of 14.4%, 18.2%, 13.1% and 16.4% mAP on Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks. Code is available at this https URL.
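The softmax-triplet idea can be sketched as follows: for each anchor, form a two-way softmax over its (positive, negative) similarities, and supervise it with the same quantity computed by a second ("teacher") network instead of a hard 0/1 label. This is only an approximate illustration of the concept described in the abstract; the mean-teacher weight updates, the hard/soft loss mixing, and other MMT details are omitted, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_softmax_triplet_loss(anchor, pos, neg, t_anchor, t_pos, t_neg):
    """Softmax-triplet loss with soft targets from a teacher network."""
    def softmax_triplet(a, p, n):
        pos_sim = (a * p).sum(dim=1)
        neg_sim = (a * n).sum(dim=1)
        return torch.softmax(torch.stack([pos_sim, neg_sim], dim=1), dim=1)
    student = softmax_triplet(anchor, pos, neg)
    teacher = softmax_triplet(t_anchor, t_pos, t_neg).detach()  # soft target, no gradient
    return -(teacher * student.log()).sum(dim=1).mean()         # soft cross-entropy

f = lambda: F.normalize(torch.randn(8, 128), dim=1)   # stand-in embeddings
loss = soft_softmax_triplet_loss(f(), f(), f(), f(), f(), f())
```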

Proceedings ArticleDOI
20 Apr 2020
TL;DR: Structural Deep Clustering Network (SDCN) as discussed by the authors integrates the structural information into deep clustering by designing a delivery operator to transfer the representations learned by autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures.
Abstract: Clustering is a fundamental task in data analysis. Recently, deep clustering, which derives inspiration primarily from deep learning approaches, achieves state-of-the-art performance and has attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoder, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract the useful representations from the data itself, rather than the structure of data, which receives scarce attention in representation learning. Motivated by the great success of Graph Convolutional Network (GCN) in encoding the graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of data, from low-order to high-order, are naturally combined with the multiple representations learned by autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint and autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our proposed model can consistently perform better than the state-of-the-art techniques.
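The delivery operator can be sketched as a per-layer mixing step: before each graph convolution, the previous GCN representation is blended with the autoencoder's hidden representation from the matching layer, so structural and autoencoder features are combined layer by layer. The mixing weight and the bare-bones propagation rule below are illustrative, not a faithful reproduction of the paper's layer.

```python
import torch
import torch.nn as nn

class DeliveryGCNLayer(nn.Module):
    """Sketch of a GCN layer that injects autoencoder features via a delivery-style mix."""
    def __init__(self, in_dim, out_dim, eps=0.5):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)
        self.eps = eps

    def forward(self, z, h, adj_norm):
        mixed = (1 - self.eps) * z + self.eps * h      # delivery: mix in AE representation
        return torch.relu(adj_norm @ self.lin(mixed))  # standard GCN propagation

n, d = 6, 8
adj = torch.eye(n)                       # normalised adjacency (identity as a stand-in)
z = torch.randn(n, d)                    # previous GCN layer output
h = torch.randn(n, d)                    # autoencoder hidden representation, same layer
out = DeliveryGCNLayer(d, 4)(z, h, adj)
```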

Posted Content
TL;DR: This paper deviates from recent works and advocates a two-step approach where feature learning and clustering are decoupled; it achieves promising results on ImageNet and outperforms several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations.
Abstract: Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important, and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works, and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by large margins, in particular +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10 in terms of classification accuracy. Furthermore, our method is the first to perform well on a large-scale dataset for image classification. In particular, we obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations. The code is made publicly available at this https URL.

Journal ArticleDOI
TL;DR: This paper proposes a novel robust graph learning scheme to learn reliable graphs from the real-world noisy data by adaptively removing noise and errors in the raw data and shows that the proposed model outperforms the previous state-of-the-art methods.
Abstract: Automatically learning graphs from data has shown encouraging performance on clustering and semisupervised learning tasks. However, real data are often corrupted, which may cause the learned graph to be inexact or unreliable. In this paper, we propose a novel robust graph learning scheme to learn reliable graphs from the real-world noisy data by adaptively removing noise and errors in the raw data. We show that our proposed model can also be viewed as a robust version of manifold regularized robust principle component analysis (RPCA), where the quality of the graph plays a critical role. The proposed model is able to boost the performance of data clustering, semisupervised classification, and data recovery significantly, primarily due to two key factors: 1) enhanced low-rank recovery by exploiting the graph smoothness assumption and 2) improved graph construction by exploiting clean data recovered by RPCA. Thus, it boosts the clustering, semisupervised classification, and data recovery performance overall. Extensive experiments on image/document clustering, object recognition, image shadow removal, and video background subtraction reveal that our model outperforms the previous state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this paper, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix.
Abstract: This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPECs, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning 10-million-level nonlinearly-separable datasets on a PC with 64 GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669 .

Journal ArticleDOI
03 Apr 2020
TL;DR: In this article, the authors propose a method that iteratively divides samples into latent domains via clustering, and then trains the domain-invariant feature extractor shared among the divided domains via adversarial learning.
Abstract: When domains, which represent underlying data distributions, vary during training and testing processes, deep neural networks suffer a drop in their performance. Domain generalization allows improvements in the generalization performance for unseen target domains by using multiple source domains. Conventional methods assume that the domain to which each sample belongs is known in training. However, many datasets, such as those collected via web crawling, contain a mixture of multiple latent domains, in which the domain of each sample is unknown. This paper introduces domain generalization using a mixture of multiple latent domains as a novel and more realistic scenario, where we try to train a domain-generalized model without using domain labels. To address this scenario, we propose a method that iteratively divides samples into latent domains via clustering, and which trains the domain-invariant feature extractor shared among the divided latent domains via adversarial learning. We assume that the latent domain of images is reflected in their style, and thus, utilize style features for clustering. By using these features, our proposed method successfully discovers latent domains and achieves domain generalization even if the domain labels are not given. Experiments show that our proposed method can train a domain-generalized model without using domain labels. Moreover, it outperforms conventional domain generalization methods, including those that utilize domain labels.
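One simple way to realise the paper's assumption that latent domains are reflected in image style is to describe each image by channel-wise statistics of a convolutional feature map and cluster those descriptors. The sketch below is an illustrative stand-in, not the paper's pipeline; the feature layer, the statistics used, and the clustering settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def style_stats(feature_maps):
    """Style descriptor per image: channel-wise mean and std of a (N, C, H, W)
    feature map, concatenated into a (N, 2C) vector."""
    mu = feature_maps.mean(axis=(2, 3))
    sigma = feature_maps.std(axis=(2, 3))
    return np.concatenate([mu, sigma], axis=1)

features = np.random.rand(100, 64, 8, 8)   # stand-in for CNN features of 100 images
pseudo_domains = KMeans(n_clusters=3, n_init=10).fit_predict(style_stats(features))
```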

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper introduced a privacy-preserving machine learning technique named federated learning and proposed a Federated Learning-based Gated Recurrent Unit neural network algorithm (FedGRU), which differs from current centralized learning methods and updates universal learning models through a secure parameter aggregation mechanism rather than directly sharing raw data among organizations.
Abstract: Existing traffic flow forecasting approaches based on deep learning models achieve excellent results on the large volumes of data gathered by governments and organizations. However, these datasets may contain a great deal of users' private data, which challenges current prediction approaches, as user privacy has become a growing public concern in recent years. Therefore, how to develop accurate traffic prediction while preserving privacy is a significant problem to be solved, and there is a trade-off between these two objectives. To address this challenge, we introduce a privacy-preserving machine learning technique named federated learning and propose a Federated Learning-based Gated Recurrent Unit neural network algorithm (FedGRU) for traffic flow prediction. FedGRU differs from current centralized learning methods and updates universal learning models through a secure parameter aggregation mechanism rather than directly sharing raw data among organizations. In the secure parameter aggregation mechanism, we adopt a Federated Averaging algorithm to reduce the communication overhead during the model parameter transmission process. Furthermore, we design a Joint Announcement Protocol to improve the scalability of FedGRU. We also propose an ensemble clustering-based scheme for traffic flow prediction by grouping the organizations into clusters before applying the FedGRU algorithm. Through extensive case studies on a real-world dataset, it is shown that FedGRU's prediction accuracy is 90.96% higher than the advanced deep learning models, which confirms that FedGRU can achieve accurate and timely traffic prediction without compromising the privacy and security of raw data.
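The Federated Averaging building block referred to above is easy to sketch: each organisation uploads its locally trained weights (never its raw traffic data), and the server returns the sample-size-weighted average. This sketch omits the secure-aggregation cryptography and the Joint Announcement Protocol; names are illustrative.

```python
import numpy as np

def federated_averaging(client_weights, client_sizes):
    """Sample-size-weighted average of each client's list of parameter arrays."""
    total = sum(client_sizes)
    return [sum(n / total * w[i] for w, n in zip(client_weights, client_sizes))
            for i in range(len(client_weights[0]))]

# three clients, each holding two parameter arrays of a toy model
clients = [[np.random.randn(4, 4), np.random.randn(4)] for _ in range(3)]
sizes = [120, 80, 200]
global_weights = federated_averaging(clients, sizes)
```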

Journal ArticleDOI
TL;DR: The proposed method is the first work that exploits the graph learning and spectral clustering techniques to learn the common representation for incomplete multiview clustering and achieves the best performance in comparison with some state-of-the-art methods.
Abstract: In this paper, we propose a general framework for incomplete multiview clustering. The proposed method is the first work that exploits the graph learning and spectral clustering techniques to learn the common representation for incomplete multiview clustering. First, owing to the good performance of low-rank representation in discovering the intrinsic subspace structure of data, we adopt it to adaptively construct the graph of each view. Second, a spectral constraint is used to achieve the low-dimensional representation of each view based on the spectral clustering. Third, we further introduce a co-regularization term to learn the common representation of samples for all views, and then use the k-means to partition the data into their respective groups. An efficient iterative algorithm is provided to optimize the model. Experimental results conducted on seven incomplete multiview datasets show that the proposed method achieves the best performance in comparison with some state-of-the-art methods, which proves the effectiveness of the proposed method in incomplete multiview clustering.

Journal ArticleDOI
TL;DR: In this paper, the adversarial training principle is applied to enforce the latent codes to match a prior Gaussian or uniform distribution, which can be used to learn the graph embedding effectively.
Abstract: Graph embedding aims to transfer a graph into vectors to facilitate subsequent graph-analytics tasks like link prediction and graph clustering. Most approaches on graph embedding focus on preserving the graph structure or minimizing the reconstruction errors for graph data. They have mostly overlooked the embedding distribution of the latent codes, which unfortunately may lead to inferior representation in many cases. In this article, we present a novel adversarially regularized framework for graph embedding. By employing the graph convolutional network as an encoder, our framework embeds the topological information and node content into a vector representation, from which a graph decoder is further built to reconstruct the input graph. The adversarial training principle is applied to enforce our latent codes to match a prior Gaussian or uniform distribution. Based on this framework, we derive two variants of the adversarial model: the adversarially regularized graph autoencoder (ARGA) and its variational version, the adversarially regularized variational graph autoencoder (ARVGA), to learn the graph embedding effectively. We also exploit other potential variations of ARGA and ARVGA to get a deeper understanding of our designs. Experimental results that compared 12 algorithms for link prediction and 20 algorithms for graph clustering validate our solutions.

Proceedings Article
30 Apr 2020
TL;DR: In this paper, the authors propose to maximize the information between labels and input data indices, which extends standard cross-entropy minimization to an optimal transport problem, for unsupervised learning of deep neural networks.
Abstract: Combining clustering and representation learning is one of the most promising approaches for unsupervised learning of deep neural networks. However, doing so naively leads to ill-posed learning problems with degenerate solutions. In this paper, we propose a novel and principled learning formulation that addresses these issues. The method is obtained by maximizing the information between labels and input data indices. We show that this criterion extends standard cross-entropy minimization to an optimal transport problem, which we solve efficiently for millions of input images and thousands of labels using a fast variant of the Sinkhorn-Knopp algorithm. The resulting method is able to self-label visual data so as to train highly competitive image representations without manual labels. Compared to the best previous method in this class, namely DeepCluster, our formulation minimizes a single objective function for both representation learning and clustering; it also significantly outperforms DeepCluster in standard benchmarks.
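The Sinkhorn-Knopp step that produces balanced pseudo-labels can be sketched in a few lines: exponentiate the classifier scores with an entropy-regularisation temperature, then alternately normalise columns (clusters) and rows (samples) so every cluster receives roughly equal mass. The temperature and iteration count below are illustrative, and the paper's fast variant adds further tricks for scale.

```python
import numpy as np

def sinkhorn_labels(scores, n_iter=50, eps=0.05):
    """Turn an (N, K) score matrix into soft, approximately balanced pseudo-labels."""
    Q = np.exp(scores / eps)
    N, K = Q.shape
    for _ in range(n_iter):
        Q /= Q.sum(axis=0, keepdims=True)   # give each cluster (column) equal mass
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)   # give each sample (row) equal mass
        Q /= N
    return Q * N                            # rescale so each row sums to 1

scores = np.random.randn(100, 10)           # stand-in for classifier outputs
pseudo = sinkhorn_labels(scores)
hard_labels = pseudo.argmax(axis=1)
```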

Proceedings ArticleDOI
14 Jun 2020
TL;DR: The iterative training mechanism is followed but clustering is discarded, since it incurs loss from hard quantization, yet its only product, image-level similarity, can be easily replaced by pairwise computation and a softened classification task.
Abstract: Person re-identification (re-ID) is an important topic in computer vision. This paper studies the unsupervised setting of re-ID, which does not require any labeled information and thus is freely deployed to new scenarios. There are very few studies under this setting, and one of the best approaches to date uses iterative clustering and classification, so that unlabeled images are clustered into pseudo classes on which a classifier is trained, the updated features are used for clustering, and so on. This approach suffers from two problems, namely, the difficulty of determining the number of clusters, and the hard quantization loss in clustering. In this paper, we follow the iterative training mechanism but discard clustering, since it incurs loss from hard quantization, yet its only product, image-level similarity, can be easily replaced by pairwise computation and a softened classification task. With these improvements, our approach becomes more elegant and is more robust to hyper-parameter changes. Experiments on two image-based and video-based datasets demonstrate state-of-the-art performance under the unsupervised re-ID setting.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work describes the proposed method as Structurally Regularized Deep Clustering (SRDC), where it enhances target discrimination with clustering of intermediate network features, and enhance structural regularization with soft selection of less divergent source examples.
Abstract: Unsupervised domain adaptation (UDA) is to make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution shifts from the target one. Mainstream UDA methods learn aligned features between the two domains, such that a classifier trained on the source features can be readily applied to the target ones. However, such a transferring strategy has a potential risk of damaging the intrinsic discrimination of target data. To alleviate this risk, we are motivated by the assumption of structural domain similarity, and propose to directly uncover the intrinsic target discrimination via discriminative clustering of target data. We constrain the clustering solutions using structural source regularization that hinges on our assumed structural domain similarity. Technically, we use a flexible framework of deep network based discriminative clustering that minimizes the KL divergence between predictive label distribution of the network and an introduced auxiliary one; replacing the auxiliary distribution with that formed by ground-truth labels of source data implements the structural source regularization via a simple strategy of joint network training. We term our proposed method as Structurally Regularized Deep Clustering (SRDC), where we also enhance target discrimination with clustering of intermediate network features, and enhance structural regularization with soft selection of less divergent source examples. Careful ablation studies show the efficacy of our proposed SRDC. Notably, with no explicit domain alignment, SRDC outperforms all existing methods on three UDA benchmarks.
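The KL-based discriminative-clustering objective described above follows a familiar deep-clustering pattern: build an auxiliary target distribution from the network's own predictions and minimise the KL divergence between the two. The sketch below uses the well-known DEC-style sharpened target as a stand-in; whether SRDC uses exactly this form (and how source labels replace the auxiliary distribution for the structural regularization) is a detail of the paper.

```python
import torch
import torch.nn.functional as F

def auxiliary_distribution(p):
    """DEC-style target: square predicted cluster probabilities and renormalise
    per cluster, sharpening confident assignments and balancing cluster sizes."""
    weight = p ** 2 / p.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_kl_loss(p):
    q = auxiliary_distribution(p).detach()        # fixed target for this update
    return F.kl_div(p.log(), q, reduction="batchmean")

p = torch.softmax(torch.randn(32, 10), dim=1)     # network's predictive label distribution
loss = clustering_kl_loss(p)
```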

Journal ArticleDOI
TL;DR: This work states this joint problem as a co-clustering problem that is principled and tractable by existing algorithms, and demonstrates the effectiveness of this approach by combining bottom-up motion segmentation by grouping of point trajectories with high-level multiple object tracking by clustering of bounding boxes.
Abstract: Models for computer vision are commonly defined either w.r.t. low-level concepts such as pixels that are to be grouped, or w.r.t. high-level concepts such as semantic objects that are to be detected and tracked. Combining bottom-up grouping with top-down detection and tracking, although highly desirable, is a challenging problem. We state this joint problem as a co-clustering problem that is principled and tractable by existing algorithms. We demonstrate the effectiveness of this approach by combining bottom-up motion segmentation by grouping of point trajectories with high-level multiple object tracking by clustering of bounding boxes. We show that solving the joint problem is beneficial at the low-level, in terms of the FBMS59 motion segmentation benchmark, and at the high-level, in terms of the Multiple Object Tracking benchmarks MOT15, MOT16, and the MOT17 challenge, and is state-of-the-art in some metrics.