
Showing papers on "Cluster analysis published in 2014"


Journal ArticleDOI
27 Jun 2014-Science
TL;DR: A clustering method in which cluster centers are recognized as local density maxima that lie far from any point of higher density; the algorithm depends only on the relative densities of points rather than on their absolute values.
Abstract: Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.
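
To make the decision rule concrete, here is a minimal NumPy sketch of the density-peaks idea: a cutoff-kernel density rho, the distance delta to the nearest denser point, centers picked by large rho*delta, and the remaining points assigned to the cluster of their nearest denser neighbor. The parameter names (d_c, n_centers) and the toy data are illustrative, not the authors' reference implementation.

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1                              # local density (cutoff kernel)
    order = np.argsort(-rho)                                     # points by decreasing density
    delta = np.zeros(len(X))
    nearest_higher = np.full(len(X), -1)
    delta[order[0]] = D.max()                                    # densest point gets the max distance
    for k, i in enumerate(order[1:], start=1):
        higher = order[:k]                                       # all points denser than i
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j
    centers = np.argsort(-(rho * delta))[:n_centers]             # large rho and large delta
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                              # assign each point to the cluster
        if labels[i] == -1:                                      # of its nearest denser neighbor
            labels[i] = labels[nearest_higher[i]]
    return labels

# Example: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(density_peaks(X, d_c=0.5, n_centers=2))
```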

3,441 citations


Journal ArticleDOI
TL;DR: The R package NbClust provides 30 indices for determining the number of clusters in a data set and also offers the user the best clustering scheme from the different results.
Abstract: Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than to objects in different groups. Most clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation of its validity. The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, programs to test and compare them are unavailable. The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set and also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
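
NbClust itself is an R package, so the following is only a rough Python analog of the idea rather than a call into the package: score k-means partitions over a range of k with three common internal indices from scikit-learn and report the k each index prefers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
scores = {"silhouette": {}, "calinski_harabasz": {}, "davies_bouldin": {}}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores["silhouette"][k] = silhouette_score(X, labels)                # higher is better
    scores["calinski_harabasz"][k] = calinski_harabasz_score(X, labels)  # higher is better
    scores["davies_bouldin"][k] = davies_bouldin_score(X, labels)        # lower is better

for index, vals in scores.items():
    best = (min if index == "davies_bouldin" else max)(vals, key=vals.get)
    print(f"{index}: best k = {best}")
```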

1,912 citations


Journal ArticleDOI
TL;DR: It is shown that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and an automated configuration procedure for finding the best algorithm to search a particular data set is described.
Abstract: For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
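
FLANN is exposed through OpenCV's FlannBasedMatcher; the sketch below matches SIFT descriptors with a randomized k-d forest index. The image file names are placeholders, and the index constant, tree count, and check count are conventional values rather than the auto-tuned configuration described in the paper.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

index_params = dict(algorithm=1, trees=8)     # 1 = FLANN_INDEX_KDTREE (randomized k-d forest)
search_params = dict(checks=64)               # number of leaves to visit during search
matcher = cv2.FlannBasedMatcher(index_params, search_params)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe-style ratio test to keep only distinctive matches.
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
        good.append(pair[0])
print(f"{len(good)} good matches out of {len(matches)}")
```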

1,339 citations


Posted Content
TL;DR: Achieves 16-24 times compression of a state-of-the-art CNN with only 1% loss of classification accuracy, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
Abstract: Deep convolutional neural networks (CNNs) have become the most promising method for object recognition, repeatedly demonstrating record-breaking results for image classification and object detection in recent years. However, a very deep CNN generally involves many layers with millions of parameters, making the storage of the network model extremely large. This prohibits the usage of deep CNNs on resource-limited hardware, especially cell phones and other embedded devices. In this paper, we tackle this model storage issue by investigating information-theoretic vector quantization methods for compressing the parameters of CNNs. In particular, we have found that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods. Simply applying k-means clustering to the weights or conducting product quantization can lead to a very good balance between model size and recognition accuracy. For the 1000-category classification task in the ImageNet challenge, we are able to achieve 16-24 times compression of the network with only 1% loss of classification accuracy using the state-of-the-art CNN.
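
A minimal sketch of the k-means weight-sharing idea described above: cluster the entries of one dense layer's weight matrix into a small codebook and store an 8-bit index per weight. The layer size and codebook size are illustrative; the paper's full pipeline (including product quantization) is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)).astype(np.float32)        # stand-in for a dense layer

k = 256                                                   # codebook size -> 8-bit indices
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(W.reshape(-1, 1))
codebook = km.cluster_centers_.ravel().astype(np.float32)
indices = km.labels_.astype(np.uint8).reshape(W.shape)
W_quant = codebook[indices]                               # reconstructed weights

original_bits = W.size * 32
compressed_bits = W.size * 8 + k * 32                     # indices + codebook
print("compression ratio: %.1fx" % (original_bits / compressed_bits))
print("mean abs error:", np.abs(W - W_quant).mean())
```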

1,139 citations


ReportDOI
01 May 2014
TL;DR: This work presents GraphChi, a disk-based system for computing efficiently on graphs with billions of edges, and builds on Parallel Sliding Windows to propose a new data structure, Partitioned Adjacency Lists, which is used to design the online graph database GraphChi-DB.
Abstract: Current systems for graph computation require a distributed computing cluster to handle very large real-world problems, such as analysis of social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially for non-experts. In this work, we present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel Parallel Sliding Windows algorithm, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives. We build on Parallel Sliding Windows to propose a new data structure, Partitioned Adjacency Lists, which we use to design an online graph database, GraphChi-DB. We demonstrate that, on a single PC, GraphChi-DB can process over one hundred thousand graph updates per second, while simultaneously performing computation. GraphChi-DB compares favorably to existing graph databases, particularly on data that is much larger than the available memory. We evaluate our work both experimentally and theoretically. Based on the Parallel Sliding Windows algorithm, we propose new I/O-efficient algorithms for solving fundamental graph problems. We also propose a novel algorithm for simulating billions of random walks in parallel on a single computer. By repeating experiments reported for existing distributed systems, we show that with only a fraction of the resources, GraphChi can solve the same problems in a very reasonable time. Our work makes large-scale graph computation available to anyone with a modern PC.

907 citations


Book
01 May 2014
TL;DR: This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics.
Abstract: The fundamental algorithms in data mining and analysis form the basis for the emerging field of data science, which includes automated methods to analyze patterns and models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics. This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics. The main parts of the book include exploratory data analysis, pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. With its comprehensive coverage, algorithmic perspective, and wealth of examples, this book offers solid guidance in data mining for students, researchers, and practitioners alike. Key features:
- Covers both core methods and cutting-edge research
- Algorithmic approach with open-source implementations
- Minimal prerequisites: all key mathematical concepts are presented, as is the intuition behind the formulas
- Short, self-contained chapters with class-tested examples and exercises allow for flexibility in designing a course and for easy reference
- Supplementary website with lecture slides, videos, project ideas, and more

844 citations


Journal ArticleDOI
TL;DR: Introduces concepts and algorithms related to clustering, a concise survey of existing clustering algorithms, and a comparison of them from both a theoretical and an empirical perspective.
Abstract: Clustering algorithms have emerged as an alternative powerful meta-learning tool to accurately analyze the massive volume of data generated by modern applications. In particular, their main goal is to categorize data into clusters such that objects are grouped in the same cluster when they are similar according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize it for a large number of applications. However, one of the major issues in using clustering algorithms for big data that causes confusion amongst practitioners is the lack of consensus in the definition of their properties as well as a lack of formal categorization. With the intention of alleviating these problems, this paper introduces concepts and algorithms related to clustering, a concise survey of existing clustering algorithms, and a comparison from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we compared the most representative algorithm from each of the categories using a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, stability, runtime, and scalability tests. In addition, we highlight the set of clustering algorithms that perform best for big data.

833 citations


Journal ArticleDOI
25 Sep 2014-PeerJ
TL;DR: In this paper, the authors proposed Swarm, a fast, scalable, and input-order independent approach for amplicon clustering that reduces the influence of clustering parameters and produces robust operational taxonomic units.
Abstract: Popular de novo amplicon clustering methods suffer from two fundamental flaws: arbitrary global clustering thresholds, and input-order dependency induced by centroid selection. Swarm was developed to address these issues by first clustering nearly identical amplicons iteratively using a local threshold, and then by using clusters’ internal structure and amplicon abundances to refine its results. This fast, scalable, and input-order independent approach reduces the influence of clustering parameters and produces robust operational taxonomic units.
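
A toy sketch of Swarm's core idea: grow a cluster from a seed by repeatedly adding sequences within a small local edit-distance threshold d of any current member, instead of comparing every sequence to a single centroid under a global threshold. The Levenshtein routine and d=1 mirror the description above; the abundance-based refinement step is omitted.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def swarm_like(seqs, d=1):
    unassigned = set(range(len(seqs)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = [seed], [seed]
        while frontier:                                  # expand layer by layer
            nxt = [j for j in unassigned
                   if any(levenshtein(seqs[j], seqs[i]) <= d for i in frontier)]
            unassigned -= set(nxt)
            cluster += nxt
            frontier = nxt
        clusters.append(cluster)
    return clusters

seqs = ["ACGTACGT", "ACGTACGA", "ACGTACAA", "TTTTGGGG", "TTTTGGGC"]
print(swarm_like(seqs))   # two clusters: the ACGT... chain and the TTTT... pair
```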

699 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: Proposes a novel clustering model that learns the data similarity matrix and clustering structure simultaneously, derives an efficient algorithm to optimize the resulting challenging problem, and provides a theoretical analysis of the connections between the method and K-means and spectral clustering.
Abstract: Many clustering methods partition the data into groups based on an input data similarity matrix, so the clustering results depend heavily on how the data similarity is learned. Because the similarity measurement and the data clustering are usually conducted in two separate steps, the learned data similarity may not be optimal for clustering and can lead to suboptimal results. In this paper, we propose a novel clustering model that learns the data similarity matrix and the clustering structure simultaneously. Our new model learns the data similarity matrix by assigning adaptive, optimal neighbors to each data point based on local distances. Meanwhile, a new rank constraint is imposed on the Laplacian matrix of the data similarity matrix, such that the number of connected components in the resulting similarity matrix is exactly equal to the cluster number. We derive an efficient algorithm to optimize the proposed challenging problem and provide a theoretical analysis of the connections between our method and K-means and spectral clustering. We further extend the new clustering model to projected clustering to handle high-dimensional data. Extensive empirical results on both synthetic data and real-world benchmark data sets show that our new clustering methods consistently outperform the related clustering approaches.
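
The rank constraint above rests on a standard fact: the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components of the similarity graph. The short sketch below only verifies that fact numerically on a toy block-structured similarity matrix; it is not the paper's optimization procedure.

```python
import numpy as np

S = np.zeros((6, 6))
S[:3, :3] = 0.9        # two blocks of mutually similar points
S[3:, 3:] = 0.8
np.fill_diagonal(S, 0)

L = np.diag(S.sum(axis=1)) - S                  # unnormalized Laplacian
eigvals = np.linalg.eigvalsh(L)
n_components = int(np.sum(eigvals < 1e-9))
print("zero eigenvalues (= connected components):", n_components)   # 2
```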

695 citations


Journal ArticleDOI
TL;DR: It is shown in this paper that all the supervised, semi-supervised, and unsupervised ELMs can actually be put into a unified framework, which provides new perspectives for understanding the mechanism of random feature mapping, which is the key concept in ELM theory.
Abstract: Extreme learning machines (ELMs) have proven to be efficient and effective learning mechanisms for pattern classification and regression. However, ELMs are primarily applied to supervised learning problems. Only a few existing research papers have used ELMs to explore unlabeled data. In this paper, we extend ELMs to both semi-supervised and unsupervised tasks based on manifold regularization, thus greatly expanding the applicability of ELMs. The key advantages of the proposed algorithms are as follows: 1) both the semi-supervised ELM (SS-ELM) and the unsupervised ELM (US-ELM) exhibit the learning capability and computational efficiency of ELMs; 2) both algorithms naturally handle multiclass classification or multicluster clustering; and 3) both algorithms are inductive and can handle unseen data at test time directly. Moreover, it is shown in this paper that all the supervised, semi-supervised, and unsupervised ELMs can actually be put into a unified framework. This provides new perspectives for understanding the mechanism of random feature mapping, which is the key concept in ELM theory. An empirical study on a wide range of data sets demonstrates that the proposed algorithms are competitive with state-of-the-art semi-supervised or unsupervised learning algorithms in terms of accuracy and efficiency.

678 citations


Journal ArticleDOI
TL;DR: An MCDM-based approach to ranking a selection of popular clustering algorithms in the domain of financial risk analysis; the results indicate that the repeated-bisection method leads to good two-way clustering solutions on the selected financial risk data sets.

Proceedings ArticleDOI
TL;DR: This paper develops Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes that statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in thenetwork structure.
Abstract: Community detection algorithms are fundamental tools that allow us to uncover organizational principles in networks. When detecting communities, there are two possible sources of information one can use: the network structure, and the features and attributes of nodes. Even though communities form around nodes that have common edges and common attributes, typically, algorithms have only focused on one of these two data modalities: community detection algorithms traditionally focus only on the network structure, while clustering algorithms mostly consider only node attributes. In this paper, we develop Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. CESNA statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure. CESNA has a linear runtime in the network size and is able to process networks an order of magnitude larger than comparable approaches. Last, CESNA also helps with the interpretation of detected communities by finding relevant node attributes for each community.

Journal ArticleDOI
TL;DR: Proposes an accurate and robust method for detecting text in natural scene images, using a fast and effective pruning algorithm that extracts Maximally Stable Extremal Regions (MSERs) as character candidates with the strategy of minimizing regularized variations.
Abstract: Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the single-link clustering algorithm, where distance weights and clustering threshold are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with a character classifier; text candidates with high non-text probabilities are eliminated and texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition database; the f-measure is over 76%, much better than the state-of-the-art performance of 71%. Experiments on multilingual, street view, multi-orientation and even born-digital databases also demonstrate the effectiveness of the proposed method.
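
A rough sketch of the first two stages using stock OpenCV and SciPy: extract MSER regions as character candidates, then group their centers by single-link clustering. The distance cutoff is a hand-picked guess and the input file name is a placeholder; the paper instead learns the distance weights and clustering threshold with its self-training metric learning algorithm.

```python
import cv2
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
mser = cv2.MSER_create()
regions, boxes = mser.detectRegions(img)

centers = np.array([[x + w / 2.0, y + h / 2.0] for x, y, w, h in boxes])
Z = linkage(centers, method="single")                  # single-link clustering
labels = fcluster(Z, t=40.0, criterion="distance")     # cutoff in pixels (guess)
print("text candidates (groups of character candidates):", labels.max())
```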

Journal ArticleDOI
TL;DR: This paper starts with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporates a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts.
Abstract: This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
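
A two-view sketch of the CCA retrieval setup with scikit-learn on synthetic data (the paper uses a three-view, approximately kernelized extension with much stronger features and a specially designed similarity): fit CCA on paired visual/text vectors, embed both views, and rank images for a text query by cosine similarity in the shared space.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))                     # shared "semantics"
X_visual = latent @ rng.normal(size=(10, 128)) + 0.1 * rng.normal(size=(200, 128))
X_text = latent @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(200, 50))

cca = CCA(n_components=10).fit(X_visual, X_text)
V, T = cca.transform(X_visual, X_text)                  # both views in the latent space

def cosine(a, B):
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

query = T[7]                                            # tag-to-image search for item 7
ranking = np.argsort(-cosine(query, V))
print("top-5 retrieved images for text query 7:", ranking[:5])   # item 7 should rank high
```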

Proceedings Article
27 Jul 2014
TL;DR: This work proposes a simple method that first learns a nonlinear embedding of the original graph with a stacked autoencoder and then runs the k-means algorithm on the embedding to obtain the clustering result, which significantly outperforms conventional spectral clustering.
Abstract: Recently, deep learning has been successfully adopted in many applications such as speech recognition and image classification. In this work, we explore the possibility of employing deep learning in graph clustering. We propose a simple method that first learns a nonlinear embedding of the original graph with a stacked autoencoder and then runs the k-means algorithm on the embedding to obtain the clustering result. We show that this simple method has a solid theoretical foundation, due to the similarity between the autoencoder and spectral clustering in terms of what they actually optimize. We then demonstrate that the proposed method is more efficient and flexible than spectral clustering. First, the computational complexity of the autoencoder is much lower than that of spectral clustering: the former can be linear in the number of nodes for a sparse graph, while the latter is super-quadratic due to the eigenvalue decomposition. Second, when an additional sparsity constraint is imposed, we can simply employ the sparse autoencoders developed in the deep learning literature, whereas it is not straightforward to implement a sparse spectral method. Experimental results on various graph datasets show that the proposed method significantly outperforms conventional spectral clustering, which clearly indicates the effectiveness of deep learning in graph clustering.
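
A compact PyTorch sketch of the pipeline described above: train a small autoencoder on the rows of a row-normalized similarity matrix of a toy two-block graph, then run k-means on the bottleneck embedding. Layer sizes, the normalization choice, and the training settings are illustrative, not the paper's exact setup.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Toy graph: two dense blocks with a single bridge edge.
A = np.zeros((20, 20))
A[:10, :10] = 1; A[10:, 10:] = 1; A[9, 10] = A[10, 9] = 1
np.fill_diagonal(A, 0)
S = A / A.sum(1, keepdims=True)                       # row-normalized similarity
X = torch.tensor(S, dtype=torch.float32)

model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2),   # encoder
                      nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 20))   # decoder
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)        # reconstruct similarity rows
    loss.backward()
    opt.step()

with torch.no_grad():
    embedding = model[:3](X).numpy()                  # output of the 2-d bottleneck
print(KMeans(n_clusters=2, n_init=10).fit_predict(embedding))
```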

Proceedings Article
27 Jul 2014
TL;DR: Proposes a novel Markov chain method for Robust Multi-view Spectral Clustering (RMSC) with a flavor of low-rank and sparse decomposition, and shows superior performance over several state-of-the-art methods for multi-view clustering.
Abstract: Multi-view clustering, which seeks a partition of the data across multiple views that often provide complementary information to each other, has received considerable attention in recent years. In real-life clustering problems, the data in each view may contain considerable noise. However, existing clustering methods blindly combine the information from multi-view data with possibly considerable noise, which often degrades their performance. In this paper, we propose a novel Markov chain method for Robust Multi-view Spectral Clustering (RMSC). Our method has a flavor of low-rank and sparse decomposition: we first construct a transition probability matrix from each single view, and then use these matrices to recover a shared low-rank transition probability matrix as a crucial input to the standard Markov chain method for clustering. The optimization problem of RMSC has a low-rank constraint on the transition probability matrix and, simultaneously, a probabilistic simplex constraint on each of its rows. To solve this challenging optimization problem, we propose an optimization procedure based on the Augmented Lagrangian Multiplier scheme. Experimental results on various real-world datasets show that the proposed method has superior performance over several state-of-the-art methods for multi-view clustering.

Journal ArticleDOI
TL;DR: These two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets are developed: strict and relaxed hierarchical clustering, which provide the best current approaches to inferring partitions on very large datasets.
Abstract: Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics. We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere. We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores. These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.

Journal Article
TL;DR: Develops a hybrid model that combines LEACH-based energy-efficient clustering with K-means-based quick clustering to produce a new clustering scheme for WSNs that selects the number of clusters automatically.
Abstract: Wireless sensor networks (WSNs) consist of hundreds of thousands of small, cost-effective sensor nodes. Sensor nodes are used to sense environmental or physiological parameters such as temperature and pressure. For connectivity, the nodes use wireless transceivers to send and receive inter-node signals, and because they connect wirelessly, they rely on a routing process to carry packets from source to destination. These sensor nodes run on batteries with limited life. Clustering is the process of creating virtual sub-groups of the sensor nodes, which helps the nodes reduce routing computations and the size of routing data. There is ample room for research on energy-efficient clustering algorithms for WSNs; LEACH, PEGASIS, and HEED are popular energy-efficient clustering protocols. In this research, we develop a hybrid model that combines LEACH-based energy-efficient clustering with K-means-based quick clustering to produce a new clustering scheme for WSNs that selects the number of clusters automatically. In the proposed method, the optimum "k" value is found with the elbow method and clustering is done with the k-means algorithm; the traditional energy-efficient routing protocol LEACH then takes over the work of sending data from the cluster heads to the base station. Simulation results show that, after a certain point in running the proposed algorithm, the marginal gain drops dramatically and produces an angle in the graph; the correct "k", i.e., the number of clusters, is chosen at this point, hence the "elbow criterion".
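
A minimal sketch of the elbow criterion mentioned above: run k-means for a range of k, record the within-cluster sum of squares (inertia), and pick the k where the marginal gain drops sharply. The "largest drop in improvement" rule used here is just one simple way to locate that angle.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.7, random_state=1)
ks = range(1, 10)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_ for k in ks]

gains = -np.diff(inertia)                  # improvement from adding one more cluster
elbow = ks[int(np.argmax(gains[:-1] - gains[1:])) + 1]
print("inertia:", [round(v, 1) for v in inertia])
print("elbow at k =", elbow)
```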

Posted Content
TL;DR: Proposes an architecture for fine-grained visual categorization that approaches expert human performance on bird species recognition, together with a novel graph-based clustering algorithm for learning a compact pose normalization space.
Abstract: We propose an architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species. Our architecture first computes an estimate of the object's pose; this is used to compute local image features which are, in turn, used for classification. The features are computed by applying deep convolutional nets to image patches that are located and normalized by the pose. We perform an empirical study of a number of pose normalization schemes, including an investigation of higher order geometric warping functions. We propose a novel graph-based clustering algorithm for learning a compact pose normalization space. We perform a detailed investigation of state-of-the-art deep convolutional feature implementations and fine-tuning feature learning for fine-grained classification. We observe that a model that integrates lower-level feature layers with pose-normalized extraction routines and higher-level feature layers with unaligned image features works best. Our experiments advance state-of-the-art performance on bird species recognition, with a large improvement of correct classification rates over previous methods (75% vs. 55-65%).

Proceedings ArticleDOI
24 Aug 2014
TL;DR: Proposes a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (GSDMM), and finds that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge.
Abstract: Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbreviated GSDMM). We find that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and that it is fast to converge. GSDMM can also cope with the sparse and high-dimensional nature of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM achieves significantly better performance than three other clustering models.
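
A compact sketch of a collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model, following the standard DMM conditional (a number of potential clusters K is given, of which only a few typically remain non-empty). The hyperparameters, iteration count, and toy corpus are illustrative and not the authors' reference implementation.

```python
import random
from collections import Counter

def gsdmm(docs, K=10, alpha=0.1, beta=0.1, iters=30, seed=0):
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    z = [rng.randrange(K) for _ in docs]                 # initial random assignment
    m = Counter(z)                                       # docs per cluster
    n = Counter()                                        # words per cluster
    nw = Counter()                                       # (cluster, word) counts
    for d, k in zip(docs, z):
        n[k] += len(d)
        for w in d:
            nw[k, w] += 1

    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                                     # remove doc i from its cluster
            m[k] -= 1; n[k] -= len(d)
            for w in d:
                nw[k, w] -= 1
            weights = []
            for c in range(K):                           # conditional weight for each cluster
                p = (m[c] + alpha) / (D - 1 + K * alpha)
                for w, cnt in Counter(d).items():
                    for t in range(cnt):
                        p *= (nw[c, w] + beta + t)
                for t in range(len(d)):
                    p /= (n[c] + V * beta + t)
                weights.append(p)
            k = rng.choices(range(K), weights)[0]        # sample a new cluster
            z[i] = k
            m[k] += 1; n[k] += len(d)
            for w in d:
                nw[k, w] += 1
    return z

docs = [s.split() for s in ["apple banana apple", "banana fruit apple",
                            "goal match football", "football match referee"]]
print(gsdmm(docs))    # typically two non-empty clusters remain
```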

Journal ArticleDOI
TL;DR: An up-to-date review of all major nature-inspired metaheuristic algorithms employed to date for partitional clustering; key issues involved in formulating clustering as a metaheuristic problem and major application areas are also discussed.
Abstract: The partitional clustering concept started with the K-means algorithm, which was published in 1957. Since then, many classical partitional clustering algorithms based on the gradient descent approach have been reported. The 1990s kick-started a new era in cluster analysis with the application of nature-inspired metaheuristics. Nearly two decades have passed since those initial formulations, and researchers have developed numerous new algorithms in this field. This paper embodies an up-to-date review of all major nature-inspired metaheuristic algorithms employed to date for partitional clustering. Further, key issues involved in formulating clustering as a problem for various metaheuristics, along with major application areas, are discussed.
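
To make the metaheuristic formulation concrete, here is a bare-bones particle swarm optimization (PSO) sketch for partitional clustering: each particle encodes k candidate centroids, fitness is the within-cluster sum of squared errors, and the usual velocity/position updates search the centroid space. Swarm size, inertia, and acceleration constants are conventional values, not taken from any particular surveyed paper.

```python
import numpy as np

def sse(centroids, X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return (d.min(axis=1) ** 2).sum()

def pso_cluster(X, k=3, n_particles=20, iters=100, w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    pos = rng.uniform(lo, hi, size=(n_particles, k, X.shape[1]))   # particle = k centroids
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([sse(p, X) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([sse(p, X) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    labels = np.linalg.norm(X[:, None, :] - gbest[None, :, :], axis=-1).argmin(axis=1)
    return gbest, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in ((0, 0), (4, 0), (2, 4))])
centroids, labels = pso_cluster(X, k=3)
print(np.round(centroids, 2))
```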

Journal ArticleDOI
TL;DR: A state-of-the-art and comprehensive survey of clustering approaches in WSNs, which reviews approaches proposed in the past few years in a classified manner and compares them based on different metrics such as mobility, cluster count, cluster size, and algorithm complexity.

Journal ArticleDOI
TL;DR: This work poses the problem of fitting a union of subspaces to a collection of data points drawn from one or more subspaces and corrupted by noise and/or gross errors as a non-convex optimization problem, and solves the problem using an alternating minimization approach.

Journal ArticleDOI
TL;DR: This paper presents a data mining (DM) based approach to developing ensemble models for predicting next-day energy consumption and peak power demand, with the aim of improving the prediction accuracy.

Journal ArticleDOI
TL;DR: This paper presents Linear/Nonlinear Programming (LP/NLP) formulations of these problems, proposes two algorithms for them based on particle swarm optimization (PSO), and compares the results with existing algorithms to demonstrate their superiority.

Journal ArticleDOI
TL;DR: The proposed dynamic clustering algorithm can achieve significant performance gain over existing naive clustering schemes and is shown to solve the weighted sum rate maximization problem through a generalized weighted minimum mean square error approach.
Abstract: This paper considers a downlink cloud radio access network (C-RAN) in which all the base-stations (BSs) are connected to a central computing cloud via digital backhaul links with finite capacities. Each user is associated with a user-centric cluster of BSs; the central processor shares the user's data with the BSs in the cluster, which then cooperatively serve the user through joint beamforming. Under this setup, this paper investigates the user scheduling, BS clustering, and beamforming design problem from a network utility maximization perspective. Differing from previous works, this paper explicitly considers the per-BS backhaul capacity constraints. We formulate the network utility maximization problem for the downlink C-RAN under two different models depending on whether the BS clustering for each user is dynamic or static over different user scheduling time slots. In the former case, the user-centric BS cluster is dynamically optimized for each scheduled user along with the beamforming vector in each time-frequency slot, whereas in the latter case, the user-centric BS cluster is fixed for each user and we jointly optimize the user scheduling and the beamforming vector to account for the backhaul constraints. In both cases, the nonconvex per-BS backhaul constraints are approximated using the reweighted ℓ1-norm technique. This approximation allows us to reformulate the per-BS backhaul constraints into weighted per-BS power constraints and solve the weighted sum rate maximization problem through a generalized weighted minimum mean square error approach. This paper shows that the proposed dynamic clustering algorithm can achieve significant performance gain over existing naive clustering schemes. This paper also proposes two heuristic static clustering schemes that can already achieve a substantial portion of the gain.
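
As a hedged sketch of the reweighted ℓ1 step mentioned above, one common way to write the approximation is to replace the nonconvex indicator in the per-BS backhaul constraint with a weighted power term whose weight comes from the previous iterate; the notation below (w_{l,k} for the beamformer of BS l toward user k, R_k for the user rate, C_l for the backhaul capacity, tau a small regularizer) is illustrative and may differ from the paper's exact formulation.

```latex
\sum_{k} \mathbf{1}\!\left\{\|\mathbf{w}_{l,k}\|_2^2 > 0\right\} R_k \le C_l
\quad\longrightarrow\quad
\sum_{k} \beta_{l,k}\, R_k\, \|\mathbf{w}_{l,k}\|_2^2 \le C_l,
\qquad
\beta_{l,k} = \frac{1}{\|\mathbf{w}_{l,k}^{(\mathrm{prev})}\|_2^2 + \tau}
```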

Journal ArticleDOI
TL;DR: The suggested approach offers a desirable compromise between low computational complexity and reconstruction quality, when comparing it with state-of-the-art methods for single image super-resolution.
Abstract: We address single image super-resolution using a statistical prediction model based on sparse representations of low- and high-resolution image patches. The suggested model allows us to avoid any invariance assumption, which is a common practice in sparsity-based approaches treating this task. Prediction of high resolution patches is obtained via MMSE estimation and the resulting scheme has the useful interpretation of a feedforward neural network. To further enhance performance, we suggest data clustering and cascading several levels of the basic algorithm. We suggest a training scheme for the resulting network and demonstrate the capabilities of our algorithm, showing its advantages over existing methods based on a low- and high-resolution dictionary pair, in terms of computational complexity, numerical criteria, and visual appearance. The suggested approach offers a desirable compromise between low computational complexity and reconstruction quality, when comparing it with state-of-the-art methods for single image super-resolution.

Journal ArticleDOI
TL;DR: This two-part paper has surveyed different multiobjective evolutionary algorithms for clustering, association rule mining, and several other data mining tasks, and provided a general discussion on the scopes for future research in this domain.
Abstract: The aim of any data mining technique is to build an efficient predictive or descriptive model of a large amount of data. Applications of evolutionary algorithms have been found to be particularly useful for automatic processing of large quantities of raw noisy data for optimal parameter setting and to discover significant and meaningful information. Many real-life data mining problems involve multiple conflicting measures of performance, or objectives, which need to be optimized simultaneously. Under this context, multiobjective evolutionary algorithms are gradually finding more and more applications in the domain of data mining since the beginning of the last decade. In this two-part paper, we have made a comprehensive survey on the recent developments of multiobjective evolutionary algorithms for data mining problems. In this paper, Part I, some basic concepts related to multiobjective optimization and data mining are provided. Subsequently, various multiobjective evolutionary approaches for two major data mining tasks, namely feature selection and classification, are surveyed. In Part II of this paper, we have surveyed different multiobjective evolutionary algorithms for clustering, association rule mining, and several other data mining tasks, and provided a general discussion on the scopes for future research in this domain.

Journal ArticleDOI
TL;DR: Introduces the concepts of feature relevance, general procedures, evaluation criteria, and the characteristics of feature selection, and provides guidelines to help users select a feature selection algorithm without detailed knowledge of each algorithm.
Abstract: Relevant feature identification has become an essential task for applying data mining algorithms effectively in real-world scenarios. Therefore, many feature selection methods have been proposed in the literature to obtain the relevant feature or feature subsets and thereby achieve classification and clustering objectives. This paper introduces the concepts of feature relevance, general procedures, evaluation criteria, and the characteristics of feature selection. A comprehensive overview, categorization, and comparison of existing feature selection methods is also provided, along with guidelines to help users select a feature selection algorithm without detailed knowledge of each algorithm. We conclude this work with real-world applications, challenges, and future research directions of feature selection.

Journal ArticleDOI
TL;DR: Existing software for model-based clustering of high-dimensional data is reviewed, its practical use is illustrated on real-world data sets, and clustering methods based on variable selection are also reviewed.