
Showing papers on "Cluster analysis published in 2008"


Book ChapterDOI
15 Sep 2008
TL;DR: Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics; organizing data into sensible groupings is one of the most fundamental modes of understanding and learning.
Abstract: The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks (domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories.

4,255 citations


Journal ArticleDOI
TL;DR: This work introduces a class of benchmark graphs, that account for the heterogeneity in the distributions of node degrees and of community sizes, and uses this benchmark to test two popular methods of community detection, modularity optimization, and Potts model clustering.
Abstract: Community structure is one of the most important features of real networks and reveals the internal organization of the nodes. Many algorithms have been proposed but the crucial issue of testing, i.e., the question of how good an algorithm is, with respect to others, is still open. Standard tests include the analysis of simple artificial graphs with a built-in community structure, that the algorithm has to recover. However, the special graphs adopted in actual tests have a structure that does not reflect the real properties of nodes and communities found in real networks. Here we introduce a class of benchmark graphs, that account for the heterogeneity in the distributions of node degrees and of community sizes. We use this benchmark to test two popular methods of community detection, modularity optimization, and Potts model clustering. The results show that the benchmark poses a much more severe test to algorithms than standard benchmarks, revealing limits that may not be apparent at a first analysis.

2,772 citations


Journal ArticleDOI
TL;DR: The versatile Cytoscape plugin NetworkAnalyzer computes and displays a comprehensive set of topological parameters, which includes the number of nodes, edges, and connected components, the network diameter, radius, density, centralization, heterogeneity, and clustering coefficient, and the characteristic path length.
Abstract: Summary: Rapidly increasing amounts of molecular interaction data are being produced by various experimental techniques and computational prediction methods. In order to gain insight into the organization and structure of the resultant large complex networks formed by the interacting molecules, we have developed the versatile Cytoscape plugin NetworkAnalyzer. It computes and displays a comprehensive set of topological parameters, which includes the number of nodes, edges, and connected components, the network diameter, radius, density, centralization, heterogeneity, and clustering coefficient, the characteristic path length, and the distributions of node degrees, neighborhood connectivities, average clustering coefficients, and shortest path lengths. NetworkAnalyzer can be applied to both directed and undirected networks and also contains extra functionality to construct the intersection or union of two networks. It is an interactive and highly customizable application that requires no expert knowledge in graph theory from the user. Availability: NetworkAnalyzer can be downloaded via the Cytoscape web site: http://www.cytoscape.org Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

1,476 citations
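
Two of the topological parameters listed in the abstract above, network density and the clustering coefficient, can be illustrated with a short sketch. This is not NetworkAnalyzer's code; it uses the standard textbook definitions for an undirected simple graph, and the toy network, the function names, and the convention that nodes with fewer than two neighbors get a coefficient of 0 are assumptions made for illustration.

```python
from itertools import combinations

def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is stored twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the per-node clustering coefficients."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(density(toy), clustering_coefficient(toy, "c"), average_clustering(toy))
```

In the toy network, node c has three neighbors of which only one pair (a, b) is connected, so its coefficient is 1/3, while the triangle nodes a and b each have coefficient 1.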


Proceedings ArticleDOI
26 Oct 2008
TL;DR: This paper explains how machine learning can be used to identify significant terms within unstructured text and enrich it with links to the appropriate Wikipedia articles; the resulting link detector and disambiguator achieves recall and precision of almost 75%.
Abstract: This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents. This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words (indexing, clustering, retrieval, and summarization, to name a few) could use the techniques described here to draw on a vast network of concepts and semantics.

1,342 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: The proposed semantic texton forests are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors, and give at least a five-fold increase in execution speed.
Abstract: We propose semantic texton forests, efficient and powerful new low-level features. These are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors. They are extremely fast to both train and test, especially compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the trees provide (i) an implicit hierarchical clustering into semantic textons, and (ii) an explicit local classification estimate. Our second contribution, the bag of semantic textons, combines a histogram of semantic textons over an image region with a region prior category distribution. The bag of semantic textons is computed over the whole image for categorization, and over local rectangular regions for segmentation. Including both histogram and region prior allows our segmentation algorithm to exploit both textural and semantic context. Our third contribution is an image-level prior for segmentation that emphasizes those categories that the automatic categorization believes to be present. We evaluate on two datasets including the very challenging VOC 2007 segmentation dataset. Our results significantly advance the state-of-the-art in segmentation accuracy, and furthermore, our use of efficient decision forests gives at least a five-fold increase in execution speed.

1,162 citations


Journal ArticleDOI
TL;DR: It is proposed that the topological differences between divisions of normal cortex may represent the outcome of different growth processes for multimodal and transmodal networks and that neurodevelopmental abnormalities in schizophrenia specifically impact multimodal cortical organization.
Abstract: The complex organization of connectivity in the human brain is incompletely understood. Recently, topological measures based on graph theory have provided a new approach to quantify large-scale cortical networks. These methods have been applied to anatomical connectivity data on nonhuman species, and cortical networks have been shown to have small-world topology, associated with high local and global efficiency of information transfer. Anatomical networks derived from cortical thickness measurements have shown the same organizational properties of the healthy human brain, consistent with similar results reported in functional networks derived from resting state functional magnetic resonance imaging (MRI) and magnetoencephalographic data. Here we show, using anatomical networks derived from analysis of inter-regional covariation of gray matter volume in MRI data on 259 healthy volunteers, that classical divisions of cortex (multimodal, unimodal, and transmodal) have some distinct topological attributes. Although all cortical divisions shared nonrandom properties of small-worldness and efficient wiring (short mean Euclidean distance between connected regions), the multimodal network had a hierarchical organization, dominated by frontal hubs with low clustering, whereas the transmodal network was assortative. Moreover, in a sample of 203 people with schizophrenia, multimodal network organization was abnormal, as indicated by reduced hierarchy, the loss of frontal and the emergence of nonfrontal hubs, and increased connection distance. We propose that the topological differences between divisions of normal cortex may represent the outcome of different growth processes for multimodal and transmodal networks and that neurodevelopmental abnormalities in schizophrenia specifically impact multimodal cortical organization.

1,160 citations


Journal ArticleDOI
TL;DR: This work describes the full process for mining e-learning data step by step as well as how to apply the main data mining techniques used, such as statistics, visualization, classification, clustering and association rule mining of Moodle data.
Abstract: Educational data mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from the educational context. This work is a survey of the specific application of data mining in learning management systems and a case study tutorial with the Moodle system. Our objective is to introduce it both theoretically and practically to all users interested in this new research area, and in particular to online instructors and e-learning administrators. We describe the full process for mining e-learning data step by step as well as how to apply the main data mining techniques used, such as statistics, visualization, classification, clustering and association rule mining of Moodle data. We have used free data mining tools so that any user can immediately begin to apply data mining without having to purchase a commercial tool or program a specific personalized tool.

1,049 citations


Anna Huang1
01 Jan 2008
TL;DR: A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy; this paper compares and analyzes the effectiveness of these measures in partitional clustering for text document datasets.
Abstract: Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard K-means algorithm and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.

1,010 citations
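
The three measures named in the abstract above can be written down in a few lines. The sketch below uses their standard definitions; the toy term-count vectors and the helper `normalize` are hypothetical, and relative entropy is computed on vectors normalized to sum to 1, which is an assumption about how term counts would be turned into distributions.

```python
import math

def squared_euclidean(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def normalize(v):
    """Scale a non-negative count vector so that it sums to 1."""
    total = sum(v)
    return [a / total for a in v]

def relative_entropy(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical term-count vectors for three short documents.
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]   # same term proportions as doc_a, twice the length
doc_c = [1, 1, 1]

print(squared_euclidean(doc_a, doc_b))                          # length-sensitive
print(cosine_similarity(doc_a, doc_b))                          # length-invariant
print(relative_entropy(normalize(doc_a), normalize(doc_c)))     # asymmetric
```

The example is chosen to show why the choice matters for text: doc_a and doc_b have identical term proportions, so cosine similarity and relative entropy treat them as the same document, while squared Euclidean distance separates them because of document length alone.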


Book ChapterDOI
12 Oct 2008
TL;DR: It is shown that the complexity of the recently introduced medoid-shift algorithm in clustering N points is O(N^2), with a small constant, if the underlying distance is Euclidean, which makes medoid shift considerably faster than mean shift, contrary to what was previously believed.
Abstract: We show that the complexity of the recently introduced medoid-shift algorithm in clustering N points is O(N^2), with a small constant, if the underlying distance is Euclidean. This makes medoid shift considerably faster than mean shift, contrary to what was previously believed. We then exploit kernel methods to extend both mean shift and the improved medoid shift to a large family of distances, with complexity bounded by the effective rank of the resulting kernel matrix, and with explicit regularization constraints. Finally, we show that, under certain conditions, medoid shift fails to cluster data points belonging to the same mode, resulting in over-fragmentation. We propose remedies for this problem, by introducing a novel, simple and extremely efficient clustering algorithm, called quick shift, that explicitly trades off under- and over-fragmentation. Like medoid shift, quick shift operates in non-Euclidean spaces in a straightforward manner. We also show that the accelerated medoid shift can be used to initialize mean shift for increased efficiency. We apply our algorithms to clustering data on manifolds, image segmentation, and the automatic discovery of visual categories.

865 citations
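
The abstract above does not spell out how quick shift works, so the following is only a sketch of the idea as it is commonly described: estimate a density at every point, link each point to its nearest neighbor of higher density, and cut links longer than a threshold. The Gaussian Parzen window, the parameter names `sigma` and `tau`, and the toy data are assumptions, and the quadratic-time loops are written for clarity rather than speed.

```python
import math

def quick_shift(points, sigma, tau):
    """Return, for each point, the index of the root (mode) of its tree."""
    n = len(points)
    # Parzen density estimate with a Gaussian window of width sigma (assumed kernel).
    density = [
        sum(math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points)
        for p in points
    ]
    parent = list(range(n))
    for i in range(n):
        best = math.inf
        for j in range(n):
            d = math.dist(points[i], points[j])
            # Link to the nearest point whose density is strictly higher.
            if density[j] > density[i] and d < best:
                best, parent[i] = d, j
        if best > tau:
            parent[i] = i  # link too long (or no denser point): i becomes a root

    def root(i):
        # Density strictly increases along links, so this walk terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]

# Hypothetical 1-D data with two well-separated groups.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = quick_shift(points, sigma=0.5, tau=1.0)
print(labels)
```

In this sketch `tau` plays the role of the trade-off mentioned in the abstract: a small value cuts more links and fragments the data into more clusters, while a large value merges more points into fewer trees.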


Journal ArticleDOI
TL;DR: This survey covers kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters, and reports an explicit proof that the two paradigms have the same objective.

832 citations


01 Apr 2008
TL;DR: A recent proof of NP-hardness of Euclidean sum-of-squares clustering is shown to be invalid, and an alternate short proof is provided.
Abstract: A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al. (Mach. Learn. 56:9-33, 2004), is not valid. An alternate short proof is provided.
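
For readers unfamiliar with the objective whose hardness is at issue, the sketch below evaluates the sum-of-squares cost of a given partition: the total squared Euclidean distance from each point to the centroid of its cluster. It only scores a partition and says nothing about the proof itself; the function names and the toy points are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

def sum_of_squares(clusters):
    """Total squared Euclidean distance from every point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)
    return total

# Two hypothetical ways of splitting the same four points into two clusters.
good = [[(0, 0), (2, 0)], [(10, 0), (10, 4)]]
bad = [[(0, 0), (10, 0)], [(2, 0), (10, 4)]]

print(sum_of_squares(good), sum_of_squares(bad))
```

Scoring one partition is cheap; the difficulty lies in finding the partition with the lowest score, since the number of candidate partitions grows very quickly with the number of points.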

Journal ArticleDOI
TL;DR: This work almost settles a long-standing conjecture of Bang-Jensen and Thomassen and shows that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
Abstract: We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.

Journal ArticleDOI
TL;DR: The R package clValid contains functions for validating the results of a clustering analysis; the user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, and self-organizing maps (SOM).
Abstract: The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available: "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM),

Posted Content
TL;DR: This paper presents a novel framework for extracting ratable aspects of objects from online user reviews and argues that multi-grain models are more appropriate for this task since standard models tend to produce topics that correspond to global properties of objects rather than aspects of an object that tend to be rated by a user.
Abstract: In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., 'waitress' and 'bartender' are part of the same topic 'staff' for restaurants. This differentiates it from much of the previous work which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.

Journal ArticleDOI
01 Jan 2008
TL;DR: Differential evolution, one of the fast, robust, and efficient global search heuristics of current interest, is applied to the automatic clustering of large unlabeled data sets, determining the number of partitions without prior knowledge of the data.
Abstract: Differential evolution (DE) has emerged as one of the fast, robust, and efficient global search heuristics of current interest. This paper describes an application of DE to the automatic clustering of large unlabeled data sets. In contrast to most of the existing clustering techniques, the proposed algorithm requires no prior knowledge of the data to be classified. Rather, it determines the optimal number of partitions of the data "on the run." Superiority of the new method is demonstrated by comparing it with two recently developed partitional clustering techniques and one popular hierarchical clustering algorithm. The partitional clustering algorithms are based on two powerful well-known optimization algorithms, namely the genetic algorithm and the particle swarm optimization. An interesting real-world application of the proposed method to automatic segmentation of images is also reported.

Book
28 Aug 2008
TL;DR: Techniques covered range from traditional multivariate methods, such as multiple regression, principal components, canonical variates, linear discriminant analysis, factor analysis, clustering, multidimensional scaling, and correspondence analysis, to the newer methods of density estimation, projection pursuit, neural networks, and classification and regression trees.
Abstract: Remarkable advances in computation and data storage and the ready availability of huge data sets have been the keys to the growth of the new disciplines of data mining and machine learning, while the enormous success of the Human Genome Project has opened up the field of bioinformatics. These exciting developments, which led to the introduction of many innovative statistical tools for high-dimensional data analysis, are described here in detail. The author takes a broad perspective; for the first time in a book on multivariate analysis, nonlinear methods are discussed in detail as well as linear methods. Techniques covered range from traditional multivariate methods, such as multiple regression, principal components, canonical variates, linear discriminant analysis, factor analysis, clustering, multidimensional scaling, and correspondence analysis, to the newer methods of density estimation, projection pursuit, neural networks, multivariate reduced-rank regression, nonlinear manifold learning, bagging, boosting, random forests, independent component analysis, support vector machines, and classification and regression trees. Another unique feature of this book is the discussion of database management systems. This book is appropriate for advanced undergraduate students, graduate students, and researchers in statistics, computer science, artificial intelligence, psychology, cognitive sciences, business, medicine, bioinformatics, and engineering. Familiarity with multivariable calculus, linear algebra, and probability and statistics is required. The book presents a carefully-integrated mixture of theory and applications, and of classical and modern multivariate statistical techniques, including Bayesian methods. There are over 60 interesting data sets used as examples in the book, over 200 exercises, and many color illustrations and photographs.

Journal ArticleDOI
TL;DR: New optimization and estimation techniques to address two fundamental problems in machine learning are developed, which serve as the basis for the Automatic Linguistic Indexing of Pictures - Real Time (ALIPR) system of fully automatic and high speed annotation for online pictures.
Abstract: Developing effective methods for automated annotation of digital pictures continues to challenge computer scientists. The capability of annotating pictures by computers can lead to breakthroughs in a wide range of applications, including Web image search, online picture-sharing communities, and scientific experiments. In this work, the authors developed new optimization and estimation techniques to address two fundamental problems in machine learning. These new techniques serve as the basis for the automatic linguistic indexing of pictures - real time (ALIPR) system of fully automatic and high-speed annotation for online pictures. In particular, the D2-clustering method, in the same spirit as K-Means for vectors, is developed to group objects represented by bags of weighted vectors. Moreover, a generalized mixture modeling technique (kernel smoothing as a special case) for nonvector data is developed using the novel concept of hypothetical local mapping (HLM). ALIPR has been tested by thousands of pictures from an Internet photo-sharing site, unrelated to the source of those pictures used in the training process. Its performance has also been studied at an online demonstration site, where arbitrary users provide pictures of their choices and indicate the correctness of each annotation word. The experimental results show that a single computer processor can suggest annotation terms in real time and with good accuracy.

Journal ArticleDOI
TL;DR: It is proved that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other is only consistent under strong additional assumptions, which are not always satisfied in real data.
Abstract: Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that, for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result, we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
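
The two families contrasted in the abstract above differ in which graph Laplacian supplies the eigenvectors. The sketch below builds both matrices from a weight matrix W with degrees d_i = sum_j W_ij: the unnormalized Laplacian D - W and one common normalized form, I - D^{-1/2} W D^{-1/2}. Whether this symmetric normalization matches the paper's exact definition is an assumption, and the toy weight matrix is hypothetical.

```python
import math

def degrees(W):
    """Row sums of the symmetric weight matrix W."""
    return [sum(row) for row in W]

def unnormalized_laplacian(W):
    """L = D - W, where D is the diagonal degree matrix."""
    d = degrees(W)
    n = len(W)
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def normalized_laplacian(W):
    """L_sym = I - D^{-1/2} W D^{-1/2}; requires every degree to be positive."""
    d = degrees(W)
    n = len(W)
    return [
        [(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical similarity graph: node 0 is connected to nodes 1 and 2.
W = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
L = unnormalized_laplacian(W)
L_sym = normalized_laplacian(W)
print(L)
print(L_sym)
```

A quick sanity check on either matrix is its zero eigenvalue: the constant vector is in the null space of L, and the vector of square-rooted degrees is in the null space of L_sym. A spectral clustering pipeline would then cluster the rows of the leading eigenvectors, a step omitted here.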

Book
30 Sep 2008
TL;DR: This edition incorporates the latest methods, including semi-supervised learning, combining clustering algorithms, and relevance feedback, and adds many more worked examples and diagrams to give greater understanding of the methods and their application.
Abstract: This book considers classical and current theory and practice, of both supervised and unsupervised pattern recognition, to build a complete background for professionals and students of engineering. The authors, leading experts in the field of pattern recognition, have provided an up-to-date, self-contained volume encapsulating this wide spectrum of information. The very latest methods are incorporated in this edition: semi-supervised learning, combining clustering algorithms, and relevance feedback. This edition includes many more worked examples and diagrams (in two colour) to help give greater understanding of the methods and their application. Computer-based problems will be included with MATLAB code. An accompanying book contains extra worked examples and MATLAB code of all the examples used in this book. Thoroughly developed to include many more worked examples to give greater understanding of this mathematically oriented subject. Many more diagrams included, now in two color, to provide greater insight through visual presentation. An accompanying manual includes MATLAB code of the methods and algorithms in the book, together with solved problems and real-life data sets in medical imaging, remote sensing and audio recognition. The Manual is available separately or at a special packaged price (ISBN: 9780123744869). Latest hot topics included to further the reference value of the text, including semi-supervised learning, combining clustering algorithms, and relevance feedback.

16 Apr 2008
TL;DR: This note describes the performance of the ATLAS calorimeter clustering algorithms, which provide inputs for particle identification, and summarizes the steps of the calorimeter reconstruction software.
Abstract: This note describes the performance of the ATLAS calorimeter clustering algorithms, which provide inputs for particle identification. ATLAS uses two principal algorithms. The first is the “sliding-window” algorithm, which clusters calorimeter cells within fixed-size rectangles; results from this are used for electron, photon, and tau lepton identification. The second is the “topological” algorithm, which clusters together neighboring cells, as long as the signal in the cells is significant compared to noise. The results of this second algorithm are further used for jet and missing transverse energy reconstruction. This note first summarizes the steps of the calorimeter reconstruction software. A detailed description of the two clustering algorithms is then given. A last section summarizes their performance. The results presented in this note are obtained with the ATLAS ATHENA software releases 12 and 13. ATL-LARG-PUB-2008-002

Proceedings ArticleDOI
23 Oct 2008
TL;DR: This work presents a personalization algorithm for recommendation in folksonomies which relies on hierarchical tag clusters, and presents extensive experimental results on two real-world datasets, suggesting that folksonomies encompassing only one topic domain, rather than many topics, present an easier target for recommendation.
Abstract: Collaborative tagging applications allow Internet users to annotate resources with personalized tags. The complex network created by many annotations, often called a folksonomy, permits users the freedom to explore tags, resources or even other users' profiles unbound from a rigid predefined conceptual hierarchy. However, the freedom afforded users comes at a cost: an uncontrolled vocabulary can result in tag redundancy and ambiguity hindering navigation. Data mining techniques, such as clustering, provide a means to remedy these problems by identifying trends and reducing noise. Tag clusters can also be used as the basis for effective personalized recommendation assisting users in navigation. We present a personalization algorithm for recommendation in folksonomies which relies on hierarchical tag clusters. Our basic recommendation framework is independent of the clustering method, but we use a context-dependent variant of hierarchical agglomerative clustering which takes into account the user's current navigation context in cluster selection. We present extensive experimental results on two real-world datasets. While the personalization algorithm is successful in both cases, our results suggest that folksonomies encompassing only one topic domain, rather than many topics, present an easier target for recommendation, perhaps because they are more focused and often less sparse. Furthermore, context-dependent cluster selection, an integral step in our personalization algorithm, demonstrates more utility for recommendation in multi-topic folksonomies than in single-topic folksonomies. This observation suggests that topic selection is an important strategy for recommendation in multi-topic folksonomies.

Journal ArticleDOI
TL;DR: This paper models the distribution of the texture features using a mixture of Gaussian distributions, allowing the mixture components to be degenerate or nearly degenerate, and shows that such a mixture distribution can be effectively segmented by a simple agglomerative clustering algorithm derived from a lossy data compression approach.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This paper proposes a novel two-step context-aware query suggestion approach that outperforms two baseline methods in both coverage and quality of suggestions.
Abstract: Query suggestion plays an important role in improving the usability of search engines. Although some recently proposed methods can make meaningful query suggestions by mining query patterns from search logs, none of them are context-aware: they do not take into account the immediately preceding queries as context in query suggestion. In this paper, we propose a novel context-aware query suggestion approach which is in two steps. In the offline model-learning step, to address data sparseness, queries are summarized into concepts by clustering a click-through bipartite. Then, from session data a concept sequence suffix tree is constructed as the query suggestion model. In the online query suggestion step, a user's search context is captured by mapping the query sequence submitted by the user to a sequence of concepts. By looking up the context in the concept sequence suffix tree, our approach suggests queries to the user in a context-aware manner. We test our approach on a large-scale search log of a commercial search engine containing 1.8 billion search queries, 2.6 billion clicks, and 840 million query sessions. The experimental results clearly show that our approach outperforms two baseline methods in both coverage and quality of suggestions.

Journal ArticleDOI
TL;DR: This paper considers a general problem of learning from pairwise constraints in the form of must-links and cannot-links, and aims to learn a Mahalanobis distance metric.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed robust path-based spectral clustering method consistently outperforms other methods, proving significantly more robust than both spectral clustering and path-based clustering.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: It is proved that the problem of achieving (k,delta) -anonymity by space translation with minimum distortion is NP-hard, and a greedy algorithm based on clustering and enhanced with ad hoc pre-processing and outlier removal techniques is proposed.
Abstract: Preserving individual privacy when publishing data is a problem that is receiving increasing attention. According to the k-anonymity principle, each release of data must be such that each individual is indistinguishable from at least k - 1 other individuals. In this paper we study the problem of anonymity-preserving data publishing in moving objects databases. We propose a novel concept of k-anonymity based on co-localization that exploits the inherent uncertainty of the moving object's whereabouts. Due to the imprecision of sampling and positioning systems (e.g., GPS), the trajectory of a moving object is no longer a polyline in a three-dimensional space; instead it is a cylindrical volume whose radius delta represents the possible location imprecision: we know that the trajectory of the moving object is within this cylinder, but we do not know exactly where. If another object moves within the same cylinder, the two are indistinguishable from each other. This leads to the definition of (k, delta)-anonymity for moving objects databases. We first characterize the (k, delta)-anonymity problem and discuss techniques to solve it. Then we focus on the most promising technique from the point of view of information preservation, namely space translation. We develop a suitable measure of the information distortion introduced by space translation, and we prove that the problem of achieving (k, delta)-anonymity by space translation with minimum distortion is NP-hard. Faced with the hardness of our problem, we propose a greedy algorithm based on clustering and enhanced with ad hoc pre-processing and outlier removal techniques. The resulting method, named NWA (Never Walk Alone), is empirically evaluated in terms of data quality and efficiency. Data quality is assessed both by means of objective measures of information distortion, and by comparing the results of the same spatio-temporal range queries executed on the original database and on the (k, delta)-anonymized one. Experimental results show that for a wide range of values of delta and k, the relative error introduced is kept low, confirming that NWA produces high quality (k, delta)-anonymized data.
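The co-localization idea in this abstract can be illustrated with a small sketch. It is a simplification, not the NWA algorithm: it assumes trajectories are lists of (x, y) points sampled at the same timestamps, uses Euclidean distance, and treats one trajectory as the axis of the uncertainty cylinder of radius delta. The function names are hypothetical.

```python
import math

def within_cylinder(axis, other, delta):
    """True if `other` stays within distance delta of `axis` at every shared timestamp."""
    if len(axis) != len(other):
        raise ValueError("trajectories must be sampled at the same timestamps")
    return all(math.dist(p, q) <= delta for p, q in zip(axis, other))

def indistinguishable_from(axis, others, delta):
    """Return the trajectories in `others` that lie inside the cylinder around `axis`."""
    return [t for t in others if within_cylinder(axis, t, delta)]

def satisfies_k_delta(axis, others, k, delta):
    """Simplified check: `axis` is hidden among at least k - 1 other trajectories."""
    return len(indistinguishable_from(axis, others, delta)) >= k - 1
```

A trajectory that strays outside the cylinder at even one timestamp does not count toward the k - 1 companions. Roughly speaking, space translation addresses this case by moving points until they fall inside the cylinder.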

Journal ArticleDOI
TL;DR: The proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring, based on single-class support vector machine (SVM) clustering, where the novelty detection SVM capabilities are used for the identification of anomalous trajectories.
Abstract: In recent years, the task of automatic event analysis in video sequences has gained increasing attention in the research community. The application domains are disparate, ranging from video surveillance to automatic video annotation for sport videos or TV shots. Whatever the application field, most work in event analysis follows one of two main approaches: the former is based on explicit event recognition, focused on finding high-level, semantic interpretations of video sequences, and the latter on anomaly detection. This paper deals with the second approach, where the final goal is not the explicit labeling of recognized events, but the detection of anomalous events differing from typical patterns. In particular, the proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring. The proposed approach is based on single-class support vector machine (SVM) clustering, where the novelty detection SVM capabilities are used for the identification of anomalous trajectories. Particular attention is given to trajectory classification in the absence of a priori information on the distribution of outliers. Experimental results prove the validity of the proposed approach.

Reference BookDOI
12 Aug 2008
TL;DR: This volume delivers thorough coverage of the capabilities and limitations of constrained clustering methods as well as introduces new types of constraints and clustering algorithms.
Abstract: Since the initial work on constrained clustering, there have been numerous advances in methods, applications, and our understanding of the theoretical properties of constraints and constrained clustering algorithms. Bringing these developments together, Constrained Clustering: Advances in Algorithms, Theory, and Applications presents an extensive collection of the latest innovations in clustering data analysis methods that use background knowledge encoded as constraints. Algorithms: The first five chapters of this volume investigate advances in the use of instance-level, pairwise constraints for partitional and hierarchical clustering. The book then explores other types of constraints for clustering, including cluster size balancing, minimum cluster size, and cluster-level relational constraints. Theory: It also describes variations of the traditional clustering under constraints problem as well as approximation algorithms with helpful performance guarantees. Applications: The book ends by applying clustering with constraints to relational data, privacy-preserving data publishing, and video surveillance data. It discusses an interactive visual clustering approach, a distance metric learning approach, existential constraints, and automatically generated constraints. With contributions from industrial researchers and leading academic experts who pioneered the field, this volume delivers thorough coverage of the capabilities and limitations of constrained clustering methods as well as introduces new types of constraints and clustering algorithms.

Journal ArticleDOI
TL;DR: A new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters, in addition to improving prediction accuracy and interpretation.
Abstract: Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover what contributes to the group having a similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
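The abstract does not write out the penalty function. As a rough illustration, the sketch below evaluates the form commonly given for OSCAR: an L1 term plus a weighted sum of pairwise maxima of absolute coefficients. That form, the tuning weight `c`, and the function name are assumptions brought in from outside this abstract.

```python
def oscar_penalty(beta, c):
    """Evaluate sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|) for coefficients `beta`.

    Assumed form of the OSCAR penalty; `c >= 0` weights the pairwise term.
    """
    abs_beta = [abs(b) for b in beta]
    l1_term = sum(abs_beta)
    pairwise_term = sum(
        max(abs_beta[j], abs_beta[k])
        for j in range(len(abs_beta))
        for k in range(j + 1, len(abs_beta))
    )
    return l1_term + c * pairwise_term
```

The L1 term is what allows coefficients to shrink to exactly zero, while the pairwise-maximum term is what favors solutions in which several coefficients share the same magnitude, giving the grouping behavior the abstract describes.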

Proceedings ArticleDOI
22 Apr 2008
TL;DR: This paper introduces CHEF, a cluster head election mechanism using fuzzy logic, and evaluates it against LEACH in MATLAB simulations, which show that CHEF is about 22.7% more efficient than LEACH.
Abstract: In designing wireless sensor networks, energy is the most important consideration because the lifetime of a sensor node is limited by its battery. Much research has been done to overcome this limitation, and clustering is one of the representative approaches. In clustering, the cluster heads gather data from nodes, aggregate it, and send the information to the base station. In this way, the sensor nodes can reduce the communication overhead that would be generated if each sensor node reported sensed data to the base station independently. LEACH is one of the most famous clustering mechanisms. It elects a cluster head based on a probability model. This approach may reduce the network lifetime because LEACH does not consider the distribution of sensor nodes or the remaining energy of each node. However, using the location and energy information in clustering can generate large overheads. In this paper we introduce CHEF, a cluster head election mechanism using fuzzy logic. By using fuzzy logic, the overhead of collecting and calculating this information can be reduced, and ultimately the lifetime of the sensor network can be prolonged. To demonstrate the efficiency of CHEF, we compared it with LEACH in MATLAB simulations. Our simulation results show that CHEF is about 22.7% more efficient than LEACH.
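The abstract does not give CHEF's fuzzy rules, so the sketch below illustrates only the probabilistic baseline it is compared against. It assumes the threshold usually attributed to LEACH, T = p / (1 - p * (r mod 1/p)), applied to nodes that have not recently served as cluster head. The function names and the `rng` parameter are hypothetical.

```python
import random

def leach_threshold(p, round_no):
    """Election threshold for round `round_no`, given desired cluster-head fraction `p`."""
    period = round(1 / p)  # number of rounds in one rotation cycle
    return p / (1 - p * (round_no % period))

def elect_cluster_heads(node_ids, recent_heads, p, round_no, rng):
    """Each eligible node draws a random number and becomes head if it is below the threshold."""
    threshold = leach_threshold(p, round_no)
    return [
        node for node in node_ids
        if node not in recent_heads and rng.random() < threshold
    ]
```

Note that neither node position nor remaining energy enters the draw, which is the limitation the abstract attributes to LEACH.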