Showing papers on "Cluster analysis" published in 2008

Book Chapter•DOI•

Data Clustering: 50 Years Beyond K-means

Anil K. Jain¹•Institutions (1)

15 Sep 2008

TL;DR: Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics; organizing data into sensible groupings is one of the most fundamental modes of understanding and learning.

Abstract: The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks (domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories.

4,255 citations

Journal Article•DOI•

Benchmark graphs for testing community detection algorithms

Andrea Lancichinetti, Santo Fortunato, Filippo Radicchi

24 Oct 2008-Physical Review E

TL;DR: This work introduces a class of benchmark graphs that account for the heterogeneity in the distributions of node degrees and of community sizes, and uses this benchmark to test two popular methods of community detection: modularity optimization and Potts model clustering.

Abstract: Community structure is one of the most important features of real networks and reveals the internal organization of the nodes. Many algorithms have been proposed, but the crucial issue of testing, i.e., the question of how good an algorithm is with respect to others, is still open. Standard tests include the analysis of simple artificial graphs with a built-in community structure that the algorithm has to recover. However, the special graphs adopted in actual tests have a structure that does not reflect the real properties of nodes and communities found in real networks. Here we introduce a class of benchmark graphs that account for the heterogeneity in the distributions of node degrees and of community sizes. We use this benchmark to test two popular methods of community detection: modularity optimization and Potts model clustering. The results show that the benchmark poses a much more severe test to algorithms than standard benchmarks, revealing limits that may not be apparent at a first analysis.

2,772 citations

Journal Article•DOI•

Computing topological parameters of biological networks

Yassen Assenov¹, Fidel Ramírez¹, Sven-Eric Schelhorn¹, Thomas Lengauer¹, Mario Albrecht¹•Institutions (1)

Max Planck Society¹

15 Jan 2008-Bioinformatics

TL;DR: The versatile Cytoscape plugin NetworkAnalyzer computes and displays a comprehensive set of topological parameters, which includes the number of nodes, edges, and connected components, the network diameter, radius, density, centralization, heterogeneity, and clustering coefficient, and the characteristic path length.

Abstract: Summary: Rapidly increasing amounts of molecular interaction data are being produced by various experimental techniques and computational prediction methods. In order to gain insight into the organization and structure of the resultant large complex networks formed by the interacting molecules, we have developed the versatile Cytoscape plugin NetworkAnalyzer. It computes and displays a comprehensive set of topological parameters, which includes the number of nodes, edges, and connected components, the network diameter, radius, density, centralization, heterogeneity, and clustering coefficient, the characteristic path length, and the distributions of node degrees, neighborhood connectivities, average clustering coefficients, and shortest path lengths. NetworkAnalyzer can be applied to both directed and undirected networks and also contains extra functionality to construct the intersection or union of two networks. It is an interactive and highly customizable application that requires no expert knowledge in graph theory from the user. Availability: NetworkAnalyzer can be downloaded via the Cytoscape web site: http://www.cytoscape.org Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
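
As a rough illustration of two of the parameters listed above, the following Python sketch computes network density and the clustering coefficient for a small undirected graph. It uses the textbook definitions (density 2m / (n(n-1)); clustering coefficient as the fraction of a node's neighbor pairs that are themselves connected, taken as 0 for nodes with fewer than two neighbors). It is not NetworkAnalyzer code, and the function names and the toy edge list are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # each edge counted twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(edges)
print(density(adj), average_clustering(adj))
```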

1,476 citations

Proceedings Article•DOI•

Learning to link with wikipedia

David Milne, Ian H. Witten

26 Oct 2008

TL;DR: This paper explains how machine learning can be used to identify significant terms within unstructured text and enrich it with links to the appropriate Wikipedia articles; the resulting link detector and disambiguator performs very well, with recall and precision of almost 75%.

Abstract: This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents. This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words (indexing, clustering, retrieval, and summarization, to name a few) could use the techniques described here to draw on a vast network of concepts and semantics.

1,342 citations

Proceedings Article•DOI•

Semantic texton forests for image categorization and segmentation

Jamie Shotton¹, Matthew Johnson², Roberto Cipolla²•Institutions (2)

Toshiba¹, University of Cambridge²

23 Jun 2008

TL;DR: The proposed semantic texton forests are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors, and give at least a five-fold increase in execution speed.

Abstract: We propose semantic texton forests, efficient and powerful new low-level features. These are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors. They are extremely fast to both train and test, especially compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the trees provide (i) an implicit hierarchical clustering into semantic textons, and (ii) an explicit local classification estimate. Our second contribution, the bag of semantic textons, combines a histogram of semantic textons over an image region with a region prior category distribution. The bag of semantic textons is computed over the whole image for categorization, and over local rectangular regions for segmentation. Including both histogram and region prior allows our segmentation algorithm to exploit both textural and semantic context. Our third contribution is an image-level prior for segmentation that emphasizes those categories that the automatic categorization believes to be present. We evaluate on two datasets including the very challenging VOC 2007 segmentation dataset. Our results significantly advance the state-of-the-art in segmentation accuracy, and furthermore, our use of efficient decision forests gives at least a five-fold increase in execution speed.

1,162 citations

Journal Article•DOI•

Hierarchical Organization of Human Cortical Networks in Health and Schizophrenia

Danielle S. Bassett¹, Edward T. Bullmore²,³, Beth A. Verchinski¹, Venkata S. Mattay¹, Daniel R. Weinberger¹, Andreas Meyer-Lindenberg¹•Institutions (3)

National Institutes of Health¹, University of Cambridge², Cambridge University Hospitals NHS Foundation Trust³

10 Sep 2008-The Journal of Neuroscience

TL;DR: It is proposed that the topological differences between divisions of normal cortex may represent the outcome of different growth processes for multimodal and transmodal networks and that neurodevelopmental abnormalities in schizophrenia specifically impact multimodal cortical organization.

Abstract: The complex organization of connectivity in the human brain is incompletely understood. Recently, topological measures based on graph theory have provided a new approach to quantify large-scale cortical networks. These methods have been applied to anatomical connectivity data on nonhuman species, and cortical networks have been shown to have small-world topology, associated with high local and global efficiency of information transfer. Anatomical networks derived from cortical thickness measurements have shown the same organizational properties of the healthy human brain, consistent with similar results reported in functional networks derived from resting state functional magnetic resonance imaging (MRI) and magnetoencephalographic data. Here we show, using anatomical networks derived from analysis of inter-regional covariation of gray matter volume in MRI data on 259 healthy volunteers, that classical divisions of cortex (multimodal, unimodal, and transmodal) have some distinct topological attributes. Although all cortical divisions shared nonrandom properties of small-worldness and efficient wiring (short mean Euclidean distance between connected regions), the multimodal network had a hierarchical organization, dominated by frontal hubs with low clustering, whereas the transmodal network was assortative. Moreover, in a sample of 203 people with schizophrenia, multimodal network organization was abnormal, as indicated by reduced hierarchy, the loss of frontal and the emergence of nonfrontal hubs, and increased connection distance. We propose that the topological differences between divisions of normal cortex may represent the outcome of different growth processes for multimodal and transmodal networks and that neurodevelopmental abnormalities in schizophrenia specifically impact multimodal cortical organization.

1,160 citations

Journal Article•DOI•

Data mining in course management systems: Moodle case study and tutorial

Cristóbal Romero¹, Sebastián Ventura¹, Enrique García¹•Institutions (1)

University of Córdoba (Spain)¹

01 Aug 2008-Computers & Education

TL;DR: This work describes the full process for mining e-learning data step by step as well as how to apply the main data mining techniques used, such as statistics, visualization, classification, clustering and association rule mining of Moodle data.

Abstract: Educational data mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from the educational context. This work is a survey of the specific application of data mining in learning management systems and a case study tutorial with the Moodle system. Our objective is to introduce it both theoretically and practically to all users interested in this new research area, and in particular to online instructors and e-learning administrators. We describe the full process for mining e-learning data step by step as well as how to apply the main data mining techniques used, such as statistics, visualization, classification, clustering and association rule mining of Moodle data. We have used free data mining tools so that any user can immediately begin to apply data mining without having to purchase a commercial tool or program a specific personalized tool.

1,049 citations

Anna Huang¹•Institutions (1)

University of Waikato¹

01 Jan 2008

TL;DR: A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy; this paper compares and analyzes the effectiveness of these measures in partitional clustering for text document datasets.

Abstract: Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard K-means algorithm and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.
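
For orientation, the three measures named in the abstract can be sketched in plain Python as follows. These are the textbook definitions; the exact variants evaluated in the paper (for example, any symmetrized or averaged form of relative entropy) are not given in the abstract, so this should be read as an assumption-laden illustration. The function names and the toy vectors are hypothetical.

```python
import math


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) for probability vectors.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy term-frequency vectors (hypothetical) for two short documents.
doc_a = [2.0, 1.0, 0.0]
doc_b = [1.0, 1.0, 1.0]
print(squared_euclidean(doc_a, doc_b), cosine_similarity(doc_a, doc_b))
```

Note that cosine similarity ignores vector length, which is one reason it is often preferred over Euclidean distance for documents of different sizes.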

1,010 citations

Book Chapter•DOI•

Quick Shift and Kernel Methods for Mode Seeking

Andrea Vedaldi¹, Stefano Soatto¹•Institutions (1)

University of California, Los Angeles¹

12 Oct 2008

TL;DR: It is shown that the complexity of the recently introduced medoid-shift algorithm in clustering N points is O(N^2), with a small constant, if the underlying distance is Euclidean, which makes medoid shift considerably faster than mean shift, contrary to what was previously believed.

Abstract: We show that the complexity of the recently introduced medoid-shift algorithm in clustering N points is O(N^2), with a small constant, if the underlying distance is Euclidean. This makes medoid shift considerably faster than mean shift, contrary to what was previously believed. We then exploit kernel methods to extend both mean shift and the improved medoid shift to a large family of distances, with complexity bounded by the effective rank of the resulting kernel matrix, and with explicit regularization constraints. Finally, we show that, under certain conditions, medoid shift fails to cluster data points belonging to the same mode, resulting in over-fragmentation. We propose remedies for this problem, by introducing a novel, simple and extremely efficient clustering algorithm, called quick shift, that explicitly trades off under- and over-fragmentation. Like medoid shift, quick shift operates in non-Euclidean spaces in a straightforward manner. We also show that the accelerated medoid shift can be used to initialize mean shift for increased efficiency. We illustrate our algorithms on clustering data on manifolds, image segmentation, and the automatic discovery of visual categories.
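
The abstract does not spell out the steps of quick shift. The sketch below follows the way the algorithm is commonly described: estimate a Parzen (Gaussian kernel) density at each point, link each point to its nearest neighbor of strictly higher density, and drop links longer than a threshold. The parameter names `sigma` (kernel width) and `tau` (maximum link length) and the toy data are assumptions made for illustration, and the brute-force O(N^2) loops are a simplification rather than the authors' implementation.

```python
import math


def quick_shift(points, sigma=1.0, tau=2.0):
    """Return one cluster label (index of a root point) per input point."""
    n = len(points)

    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Parzen density estimate with a Gaussian kernel (unnormalized).
    density = [
        sum(math.exp(-dist2(i, j) / (2 * sigma ** 2)) for j in range(n))
        for i in range(n)
    ]

    # Link each point to its nearest neighbor with strictly higher density,
    # but only if that neighbor lies within distance tau.
    parent = list(range(n))
    for i in range(n):
        best = None
        for j in range(n):
            if density[j] > density[i]:
                d = math.sqrt(dist2(i, j))
                if d <= tau and (best is None or d < best):
                    best = d
                    parent[i] = j

    # Links always point to higher density, so following them terminates.
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]


# Two well-separated toy blobs (hypothetical data).
data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
labels = quick_shift(data, sigma=1.0, tau=1.0)
print(labels)
```

In this formulation a small `tau` breaks the data into more clusters and a large `tau` merges them, which is one way to read the trade-off between over- and under-fragmentation mentioned in the abstract.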

865 citations

Journal Article•DOI•

A survey of kernel and spectral methods for clustering

Maurizio Filippone¹, Francesco Camastra², Francesco Masulli¹, Stefano Rovetta¹•Institutions (2)

University UCINF¹, Applied Science Private University²

01 Jan 2008-Pattern Recognition

TL;DR: This survey covers kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters, and reports an explicit proof that these two paradigms have the same objective.

832 citations

NP-Hardness of Euclidean Sum-of-Squares Clustering

Daniel Aloise¹, Amit Deshpande², Pierre Hansen³, Preyas Popat⁴•Institutions (4)

École Polytechnique de Montréal¹, Microsoft², HEC Montréal³, Chennai Mathematical Institute⁴

01 Apr 2008

TL;DR: A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al., is not valid; this paper provides an alternate short proof.

Abstract: A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al. (Mach. Learn. 56:9–33, 2004), is not valid. An alternate short proof is provided.
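
For reference, the Euclidean sum-of-squares clustering problem is usually stated as follows: given points x_1, ..., x_n in Euclidean space and a number of clusters k, find a partition into C_1, ..., C_k that minimizes the total squared distance of each point to the centroid of its cluster. The notation below is the standard one and is not taken from the abstract.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

This is the same objective that the K-means heuristic tries to reduce at each iteration.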

Journal Article•DOI•

Aggregating inconsistent information: Ranking and clustering

Nir Ailon¹, Moses Charikar², Alantha Newman³•Institutions (3)

Google¹, Princeton University², Rutgers University³

05 Nov 2008-Journal of the ACM

TL;DR: This work almost settles a long-standing conjecture of Bang-Jensen and Thomassen and shows that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.

Abstract: We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.

Journal Article•DOI•

clValid: An R Package for Cluster Validation

Guy N. Brock, Vasyl Pihur, Susmita Datta, Somnath Datta

18 Mar 2008-Journal of Statistical Software

TL;DR: The R package clValid contains functions for validating the results of a clustering analysis; the user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, and self-organizing maps (SOM).

Abstract: The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available: "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM),

Posted Content•

Modeling Online Reviews with Multi-grain Topic Models

Ivan Titov¹, Ryan McDonald²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Google²

07 Jan 2008-arXiv: Information Retrieval

TL;DR: This paper presents a novel framework for extracting ratable aspects of objects from online user reviews and argues that multi-grain models are more appropriate for this task since standard models tend to produce topics that correspond to global properties of objects rather than aspects of an object that tend to be rated by a user.

Abstract: In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., `waitress' and `bartender' are part of the same topic `staff' for restaurants. This differentiates it from much of the previous work which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.

Journal Article•DOI•

Automatic Clustering Using an Improved Differential Evolution Algorithm

Swagatam Das¹, Ajith Abraham², Amit Konar¹•Institutions (2)

Jadavpur University¹, Norwegian University of Science and Technology²

01 Jan 2008

TL;DR: Differential evolution, a fast, robust, and efficient global search heuristic, is applied to the automatic clustering of large unlabeled data sets; the proposed algorithm requires no prior knowledge of the data and determines the optimal number of partitions "on the run."

Abstract: Differential evolution (DE) has emerged as one of the fast, robust, and efficient global search heuristics of current interest. This paper describes an application of DE to the automatic clustering of large unlabeled data sets. In contrast to most of the existing clustering techniques, the proposed algorithm requires no prior knowledge of the data to be classified. Rather, it determines the optimal number of partitions of the data "on the run." Superiority of the new method is demonstrated by comparing it with two recently developed partitional clustering techniques and one popular hierarchical clustering algorithm. The partitional clustering algorithms are based on two powerful well-known optimization algorithms, namely the genetic algorithm and the particle swarm optimization. An interesting real-world application of the proposed method to automatic segmentation of images is also reported.

Book•

Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning

Alan Julian Izenman

28 Aug 2008

TL;DR: Techniques covered range from traditional multivariate methods, such as multiple regression, principal components, canonical variates, linear discriminant analysis, factor analysis, clustering, multidimensional scaling, and correspondence analysis, to the newer methods of density estimation, projection pursuit, neural networks, and classification and regression trees.

Abstract: Remarkable advances in computation and data storage and the ready availability of huge data sets have been the keys to the growth of the new disciplines of data mining and machine learning, while the enormous success of the Human Genome Project has opened up the field of bioinformatics. These exciting developments, which led to the introduction of many innovative statistical tools for high-dimensional data analysis, are described here in detail. The author takes a broad perspective; for the first time in a book on multivariate analysis, nonlinear methods are discussed in detail as well as linear methods. Techniques covered range from traditional multivariate methods, such as multiple regression, principal components, canonical variates, linear discriminant analysis, factor analysis, clustering, multidimensional scaling, and correspondence analysis, to the newer methods of density estimation, projection pursuit, neural networks, multivariate reduced-rank regression, nonlinear manifold learning, bagging, boosting, random forests, independent component analysis, support vector machines, and classification and regression trees. Another unique feature of this book is the discussion of database management systems. This book is appropriate for advanced undergraduate students, graduate students, and researchers in statistics, computer science, artificial intelligence, psychology, cognitive sciences, business, medicine, bioinformatics, and engineering. Familiarity with multivariable calculus, linear algebra, and probability and statistics is required. The book presents a carefully-integrated mixture of theory and applications, and of classical and modern multivariate statistical techniques, including Bayesian methods. There are over 60 interesting data sets used as examples in the book, over 200 exercises, and many color illustrations and photographs.

Journal Article•DOI•

Real-Time Computerized Annotation of Pictures

Jia Li¹, James Z. Wang¹•Institutions (1)

Pennsylvania State University¹

01 Jun 2008-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: New optimization and estimation techniques to address two fundamental problems in machine learning are developed, which serve as the basis for the Automatic Linguistic Indexing of Pictures - Real Time (ALIPR) system of fully automatic and high speed annotation for online pictures.

Abstract: Developing effective methods for automated annotation of digital pictures continues to challenge computer scientists. The capability of annotating pictures by computers can lead to breakthroughs in a wide range of applications, including Web image search, online picture-sharing communities, and scientific experiments. In this work, the authors developed new optimization and estimation techniques to address two fundamental problems in machine learning. These new techniques serve as the basis for the automatic linguistic indexing of pictures - real time (ALIPR) system of fully automatic and high-speed annotation for online pictures. In particular, the D2-clustering method, in the same spirit as K-Means for vectors, is developed to group objects represented by bags of weighted vectors. Moreover, a generalized mixture modeling technique (kernel smoothing as a special case) for nonvector data is developed using the novel concept of hypothetical local mapping (HLM). ALIPR has been tested by thousands of pictures from an Internet photo-sharing site, unrelated to the source of those pictures used in the training process. Its performance has also been studied at an online demonstration site, where arbitrary users provide pictures of their choices and indicate the correctness of each annotation word. The experimental results show that a single computer processor can suggest annotation terms in real time and with good accuracy.

Journal Article•DOI•

Consistency of spectral clustering

U von Luxburg¹, Mikhail Belkin², Olivier Bousquet¹•Institutions (2)

Max Planck Society¹, University of Chicago²

01 Apr 2008-Annals of Statistics

TL;DR: It is proved that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other is only consistent under strong additional assumptions, which are not always satisfied in real data.

Abstract: Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that, for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result, we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
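
For readers unfamiliar with the two classes, the graph Laplacians usually associated with unnormalized and normalized spectral clustering can be written as below. Here W is a symmetric pairwise similarity matrix and D its diagonal degree matrix. These are the standard definitions and may differ in notation from the paper's own.

```latex
% W: symmetric similarity matrix, D: diagonal degree matrix
D_{ii} = \sum_{j=1}^{n} W_{ij}, \qquad
L = D - W, \qquad
L_{\mathrm{sym}} = D^{-1/2} L \, D^{-1/2} = I - D^{-1/2} W D^{-1/2}
```

Unnormalized spectral clustering uses eigenvectors of L, while the normalized variant uses eigenvectors of the normalized Laplacian.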

Book•

Pattern Recognition, Fourth Edition

Sergios Theodoridis, Konstantinos Koutroumbas

30 Sep 2008

TL;DR: This edition incorporates the latest methods, including semi-supervised learning, combining clustering algorithms, and relevance feedback, and includes many more worked examples and diagrams to help give greater understanding of the methods and their application.

Abstract: This book considers classical and current theory and practice, of both supervised and unsupervised pattern recognition, to build a complete background for professionals and students of engineering. The authors, leading experts in the field of pattern recognition, have provided an up-to-date, self-contained volume encapsulating this wide spectrum of information. The very latest methods are incorporated in this edition: semi-supervised learning, combining clustering algorithms, and relevance feedback. This edition includes many more worked examples and diagrams (in two colour) to help give greater understanding of the methods and their application. Computer-based problems will be included with MATLAB code. An accompanying book contains extra worked examples and MATLAB code of all the examples used in this book. Thoroughly developed to include many more worked examples to give greater understanding of this mathematically oriented subject. Many more diagrams included, now in two color, to provide greater insight through visual presentation. An accompanying manual includes MATLAB code of the methods and algorithms in the book, together with solved problems and real-life data sets in medical imaging, remote sensing and audio recognition. The Manual is available separately or at a special packaged price (ISBN: 9780123744869). Latest hot topics included to further the reference value of the text including semi-supervised learning, combining clustering algorithms, and relevance feedback.

Calorimeter Clustering Algorithms: Description and Performance

Walter Lampl, Peter Loch, Sven Menke, Srinivasan Rajagopalan, Sandrine Laplace, Guillaume Unal, Hong Ma, Scott Snyder, Damir Lelas, David Rousseau

16 Apr 2008

TL;DR: This note describes the performance of the ATLAS calorimeter clustering algorithms, which provide inputs for particle identification, and summarizes the steps of the calorimeter reconstruction software.

Abstract: This note describes the performance of the ATLAS calorimeter clustering algorithms, which provide inputs for particle identification. ATLAS uses two principal algorithms. The first is the “sliding-window” algorithm, which clusters calorimeter cells within fixed-size rectangles; results from this are used for electron, photon, and tau lepton identification. The second is the “topological” algorithm, which clusters together neighboring cells, as long as the signal in the cells is significant compared to noise. The results of this second algorithm are further used for jet and missing transverse energy reconstruction. This note first summarizes the steps of the calorimeter reconstruction software. A detailed description of the two clustering algorithms is then given. A last section summarizes their performance. The results presented in this note are obtained with the ATLAS ATHENA software releases 12 and 13. ATL-LARG-PUB-2008-002

Proceedings Article•DOI•

Personalized recommendation in social tagging systems using hierarchical clustering

Andriy Shepitsen¹, Jonathan Gemmell¹, Bamshad Mobasher¹, Robin Burke¹•Institutions (1)

DePaul University¹

23 Oct 2008

TL;DR: This work presents a personalization algorithm for recommendation in folksonomies which relies on hierarchical tag clusters and presents extensive experimental results on two real-world datasets, suggesting that folksonomies encompassing only one topic domain, rather than many topics, present an easier target for recommendation.

Abstract: Collaborative tagging applications allow Internet users to annotate resources with personalized tags. The complex network created by many annotations, often called a folksonomy, permits users the freedom to explore tags, resources or even other users' profiles unbound from a rigid predefined conceptual hierarchy. However, the freedom afforded users comes at a cost: an uncontrolled vocabulary can result in tag redundancy and ambiguity hindering navigation. Data mining techniques, such as clustering, provide a means to remedy these problems by identifying trends and reducing noise. Tag clusters can also be used as the basis for effective personalized recommendation assisting users in navigation. We present a personalization algorithm for recommendation in folksonomies which relies on hierarchical tag clusters. Our basic recommendation framework is independent of the clustering method, but we use a context-dependent variant of hierarchical agglomerative clustering which takes into account the user's current navigation context in cluster selection. We present extensive experimental results on two real-world datasets. While the personalization algorithm is successful in both cases, our results suggest that folksonomies encompassing only one topic domain, rather than many topics, present an easier target for recommendation, perhaps because they are more focused and often less sparse. Furthermore, context-dependent cluster selection, an integral step in our personalization algorithm, demonstrates more utility for recommendation in multi-topic folksonomies than in single-topic folksonomies. This observation suggests that topic selection is an important strategy for recommendation in multi-topic folksonomies.

Journal Article•DOI•

Unsupervised segmentation of natural images via lossy data compression

Allen Y. Yang¹, John Wright², Yi Ma², S. Shankar Sastry¹•Institutions (2)

University of California, Berkeley¹, University of Illinois at Urbana-Champaign²

01 May 2008-Computer Vision and Image Understanding

TL;DR: This paper models the distribution of the texture features using a mixture of Gaussian distributions, allowing the mixture components to be degenerate or nearly degenerate, and shows that such a mixture distribution can be effectively segmented by a simple agglomerative clustering algorithm derived from a lossy data compression approach.

Proceedings Article•DOI•

Context-aware query suggestion by mining click-through and session data

Huanhuan Cao¹, Daxin Jiang², Jian Pei³, Qi He⁴, Zhen Liao⁵, Enhong Chen¹, Hang Li²•Institutions (5)

University of Science and Technology of China¹, Microsoft², Simon Fraser University³, Nanyang Technological University⁴, Nankai University⁵

24 Aug 2008

TL;DR: This paper proposes a novel two-step, context-aware query suggestion approach that outperforms two baseline methods in both coverage and quality of suggestions.

Abstract: Query suggestion plays an important role in improving the usability of search engines. Although some recently proposed methods can make meaningful query suggestions by mining query patterns from search logs, none of them are context-aware: they do not take into account the immediately preceding queries as context in query suggestion. In this paper, we propose a novel context-aware query suggestion approach consisting of two steps. In the offline model-learning step, to address data sparseness, queries are summarized into concepts by clustering a click-through bipartite. Then, from session data, a concept sequence suffix tree is constructed as the query suggestion model. In the online query suggestion step, a user's search context is captured by mapping the query sequence submitted by the user to a sequence of concepts. By looking up the context in the concept sequence suffix tree, our approach suggests queries to the user in a context-aware manner. We test our approach on a large-scale search log of a commercial search engine containing 1.8 billion search queries, 2.6 billion clicks, and 840 million query sessions. The experimental results clearly show that our approach outperforms two baseline methods in both coverage and quality of suggestions.

Journal Article•DOI•

Learning a Mahalanobis distance metric for data clustering and classification

Shiming Xiang¹, Feiping Nie¹, Changshui Zhang¹•Institutions (1)

Tsinghua University¹

01 Dec 2008-Pattern Recognition

TL;DR: This paper considers a general problem of learning from pairwise constraints in the form of must-links and cannot-links, and aims to learn a Mahalanobis distance metric.
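As a rough illustration of what a learned Mahalanobis metric does with must-link and cannot-link pairs, the sketch below evaluates the distance sqrt((x - y)^T A (x - y)) for a fixed matrix. The matrix `A`, the helper name `mahalanobis`, and the example pairs are hypothetical and chosen by hand; this does not reproduce the learning procedure of the paper.

```python
import math

def mahalanobis(x, y, A):
    """Return sqrt((x - y)^T A (x - y)) for a symmetric positive semidefinite A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    # Quadratic form d^T A d, accumulated row by row.
    q = sum(d[i] * sum(A[i][j] * d[j] for j in range(len(d)))
            for i in range(len(d)))
    return math.sqrt(max(q, 0.0))

# Hypothetical matrix (an assumption, not a learned result):
# it stretches the first axis and shrinks the second.
A = [[4.0, 0.0],
     [0.0, 0.25]]

# Both pairs are at Euclidean distance 2.0 in the original space.
must_link = ([0.0, 0.0], [0.0, 2.0])
cannot_link = ([0.0, 0.0], [2.0, 0.0])

d_must = mahalanobis(must_link[0], must_link[1], A)
d_cannot = mahalanobis(cannot_link[0], cannot_link[1], A)
```

Under this hand-picked `A`, the must-link pair ends up closer than the cannot-link pair even though both are equally far apart in Euclidean terms, which is the kind of effect a metric learned from pairwise constraints aims for.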

Journal Article•DOI•

Robust path-based spectral clustering

Hong Chang¹, Dit-Yan Yeung²•Institutions (2)

Xerox¹, Hong Kong University of Science and Technology²

01 Jan 2008-Pattern Recognition

TL;DR: Experimental results show that the proposed robust path-based spectral clustering method is significantly more robust than both spectral clustering and path-based clustering, and that it consistently outperforms the other methods compared.

Proceedings Article•DOI•

Never Walk Alone: Uncertainty for Anonymity in Moving Objects Databases

Osman Abul, Francesco Bonchi¹, Mirco Nanni¹•Institutions (1)

Istituto di Scienza e Tecnologie dell'Informazione¹

07 Apr 2008

TL;DR: It is proved that the problem of achieving (k, delta)-anonymity by space translation with minimum distortion is NP-hard, and a greedy algorithm based on clustering and enhanced with ad hoc pre-processing and outlier removal techniques is proposed.

Abstract: Preserving individual privacy when publishing data is a problem that is receiving increasing attention. According to the k-anonymity principle, each release of data must be such that each individual is indistinguishable from at least k - 1 other individuals. In this paper we study the problem of anonymity-preserving data publishing in moving objects databases. We propose a novel concept of k-anonymity based on co-localization that exploits the inherent uncertainty of the moving object's whereabouts. Due to the imprecision of sampling and positioning systems (e.g., GPS), the trajectory of a moving object is no longer a polyline in a three-dimensional space; instead, it is a cylindrical volume whose radius delta represents the possible location imprecision: we know that the trajectory of the moving object is within this cylinder, but we do not know exactly where. If another object moves within the same cylinder, the two are indistinguishable from each other. This leads to the definition of (k, delta)-anonymity for moving objects databases. We first characterize the (k, delta)-anonymity problem and discuss techniques to solve it. Then we focus on the most promising technique from the point of view of information preservation, namely space translation. We develop a suitable measure of the information distortion introduced by space translation, and we prove that the problem of achieving (k, delta)-anonymity by space translation with minimum distortion is NP-hard. Faced with the hardness of our problem, we propose a greedy algorithm based on clustering and enhanced with ad hoc pre-processing and outlier removal techniques. The resulting method, named NWA (Never Walk Alone), is empirically evaluated in terms of data quality and efficiency. Data quality is assessed both by means of objective measures of information distortion, and by comparing the results of the same spatio-temporal range queries executed on the original database and on the (k, delta)-anonymized one. Experimental results show that for a wide range of values of delta and k, the relative error introduced is kept low, confirming that NWA produces high-quality (k, delta)-anonymized data.
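The co-localization idea in this abstract can be sketched as a simple check. The version below is a minimal sketch under stated assumptions: trajectories are lists of 2-D points sampled at the same timestamps, two trajectories count as co-localized when they stay within `delta` of each other at every timestamp, and a group qualifies when it has at least `k` members that are pairwise co-localized. The function names are hypothetical, and the paper's formal definition may differ in detail.

```python
import math
from itertools import combinations

def colocalized(t1, t2, delta):
    """True if two equally sampled trajectories stay within delta at every timestamp."""
    if len(t1) != len(t2):
        raise ValueError("trajectories must share the same timestamps")
    return all(math.dist(p, q) <= delta for p, q in zip(t1, t2))

def is_anonymity_set(trajectories, k, delta):
    """True if there are at least k trajectories and every pair is co-localized."""
    return len(trajectories) >= k and all(
        colocalized(a, b, delta) for a, b in combinations(trajectories, 2)
    )

# Illustrative trajectories (made-up coordinates): a and b move side by side,
# c follows the same direction but three units away.
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
b = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
c = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]

pair_ok = is_anonymity_set([a, b], k=2, delta=1.0)
triple_ok = is_anonymity_set([a, b, c], k=3, delta=1.0)
```

In a space-translation setting, a trajectory such as `c` would have to be moved toward the others (or discarded as an outlier) before the group of three could pass this check.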

Journal Article•DOI•

Trajectory-Based Anomalous Event Detection

Claudio Piciarelli¹, Christian Micheloni¹, Gian Luca Foresti¹•Institutions (1)

University of Udine¹

01 Nov 2008-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: The proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring, based on single-class support vector machine (SVM) clustering, where the novelty detection SVM capabilities are used for the identification of anomalous trajectories.

Abstract: In recent years, the task of automatic event analysis in video sequences has gained increasing attention in the research community. The application domains are disparate, ranging from video surveillance to automatic video annotation for sport videos or TV shots. Whatever the application field, most work in event analysis is based on two main approaches: the former based on explicit event recognition, focused on finding high-level, semantic interpretations of video sequences, and the latter based on anomaly detection. This paper deals with the second approach, where the final goal is not the explicit labeling of recognized events, but the detection of anomalous events differing from typical patterns. In particular, the proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring. The proposed approach is based on single-class support vector machine (SVM) clustering, where the novelty detection SVM capabilities are used for the identification of anomalous trajectories. Particular attention is given to trajectory classification in the absence of a priori information on the distribution of outliers. Experimental results support the validity of the proposed approach.

Reference Book•DOI•

Constrained Clustering: Advances in Algorithms, Theory, and Applications

Sugato Basu, Ian Davidson, Kiri L. Wagstaff¹•Institutions (1)

University of California, Davis¹

12 Aug 2008

TL;DR: This volume delivers thorough coverage of the capabilities and limitations of constrained clustering methods as well as introduces new types of constraints and clustering algorithms.

Abstract: Since the initial work on constrained clustering, there have been numerous advances in methods, applications, and our understanding of the theoretical properties of constraints and constrained clustering algorithms. Bringing these developments together, Constrained Clustering: Advances in Algorithms, Theory, and Applications presents an extensive collection of the latest innovations in clustering data analysis methods that use background knowledge encoded as constraints. Algorithms: The first five chapters of this volume investigate advances in the use of instance-level, pairwise constraints for partitional and hierarchical clustering. The book then explores other types of constraints for clustering, including cluster size balancing, minimum cluster size, and cluster-level relational constraints. Theory: It also describes variations of the traditional clustering under constraints problem as well as approximation algorithms with helpful performance guarantees. Applications: The book ends by applying clustering with constraints to relational data, privacy-preserving data publishing, and video surveillance data. It discusses an interactive visual clustering approach, a distance metric learning approach, existential constraints, and automatically generated constraints. With contributions from industrial researchers and leading academic experts who pioneered the field, this volume delivers thorough coverage of the capabilities and limitations of constrained clustering methods as well as introduces new types of constraints and clustering algorithms.

Journal Article•DOI•

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR.

Howard D. Bondell¹, Brian J. Reich¹•Institutions (1)

North Carolina State University¹

01 Mar 2008-Biometrics

TL;DR: A new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters, in addition to improving prediction accuracy and interpretation.

Abstract: Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover what contributes to the group having a similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
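The abstract does not spell out the penalty, so the sketch below assumes the form commonly given for OSCAR: the L1 norm of the coefficients plus `c` times the sum, over all pairs, of the larger absolute value in each pair. The function name `oscar_penalty` and the example coefficient vectors are hypothetical; this evaluates the penalty only and is not a solver for the penalized least squares problem.

```python
def oscar_penalty(beta, c):
    """Assumed OSCAR penalty: sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|)."""
    abs_b = [abs(b) for b in beta]
    l1_term = sum(abs_b)
    # Pairwise L-infinity term: the larger magnitude of every coefficient pair.
    pairwise_term = sum(
        max(abs_b[j], abs_b[k])
        for j in range(len(abs_b))
        for k in range(j + 1, len(abs_b))
    )
    return l1_term + c * pairwise_term

# Two made-up coefficient vectors with the same L1 norm (2.0).
grouped = [1.0, -1.0, 0.0]   # two coefficients share one magnitude
spread = [2.0, 0.0, 0.0]     # all weight on a single coefficient

penalty_grouped = oscar_penalty(grouped, c=0.5)
penalty_spread = oscar_penalty(spread, c=0.5)
```

With the same L1 norm, the vector whose nonzero coefficients share a magnitude is penalized less than the one that concentrates all weight on a single coefficient, which gives some intuition for why a penalty of this shape can encourage groups of equal-magnitude coefficients.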

Proceedings Article•DOI•

CHEF: Cluster Head Election mechanism using Fuzzy logic in Wireless Sensor Networks

Jong-Myoung Kim¹, Seon-Ho Park¹, Young-Ju Han¹, Tai-Myoung Chung¹•Institutions (1)

Sungkyunkwan University¹

22 Apr 2008

TL;DR: This paper introduces CHEF, a cluster head election mechanism using fuzzy logic, and compares it with LEACH in MATLAB simulations, which show that CHEF is about 22.7% more efficient than LEACH.

Abstract: In designing wireless sensor networks, energy is the most important consideration because the lifetime of a sensor node is limited by its battery. Much research has been done to overcome this limitation, and clustering is one of the representative approaches. In clustering, the cluster heads gather data from nodes, aggregate it, and send the information to the base station. In this way, the sensor nodes can reduce the communication overhead that would be generated if each sensor node reported sensed data to the base station independently. LEACH is one of the best-known clustering mechanisms. It elects a cluster head based on a probability model. This approach may reduce the network lifetime because LEACH does not consider the distribution of sensor nodes or the remaining energy of each node. However, using location and energy information in clustering can generate large overheads. In this paper we introduce CHEF, a cluster head election mechanism using fuzzy logic. By using fuzzy logic, the overhead of collecting and calculating this information can be reduced, and ultimately the lifetime of the sensor network can be prolonged. To demonstrate the efficiency of CHEF, we simulated CHEF and LEACH using MATLAB. Our simulation results show that CHEF is about 22.7% more efficient than LEACH.
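For context on the probability model that the abstract attributes to LEACH, the sketch below uses the threshold usually quoted for LEACH, T = p / (1 - p * (r mod 1/p)), where `p` is the desired fraction of cluster heads and `r` is the round number. It covers only this baseline: the abstract gives no fuzzy rules for CHEF, so none are sketched. The function names and the choice to pass in only nodes that have not yet served as head in the current cycle are assumptions made for illustration.

```python
import random

def leach_threshold(p, r):
    """Threshold for a node that has not yet been cluster head in the current cycle."""
    cycle = round(1 / p)  # number of rounds in one election cycle
    return p / (1 - p * (r % cycle))

def elect_cluster_heads(eligible_nodes, p, r, rng=random):
    """Each eligible node becomes a head if its random draw falls below the threshold."""
    t = leach_threshold(p, r)
    return [node for node in eligible_nodes if rng.random() < t]

# With p = 0.1 the threshold starts at 0.1 and grows toward 1 by the last
# round of the cycle, so every node gets a turn within about 1/p rounds.
first_round = leach_threshold(0.1, 0)
last_round = leach_threshold(0.1, 9)
heads = elect_cluster_heads(list(range(20)), p=0.1, r=0, rng=random.Random(0))
```

The draw depends only on `p` and the round number, not on node position or remaining energy, which is the limitation the abstract raises.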
