Journal ArticleDOI

A simple and fast algorithm for K-medoids clustering

01 Mar 2009-Expert Systems With Applications (Pergamon Press, Inc.)-Vol. 36, Iss: 2, pp 3336-3341
TL;DR: Experimental results show that the proposed algorithm requires significantly less computation time while achieving performance comparable to partitioning around medoids (PAM).
Abstract: This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm, and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare the results with those of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm requires significantly less computation time while achieving performance comparable to partitioning around medoids (PAM).
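The abstract describes the update rule only at a high level, so here is a minimal sketch of a K-medoids iteration of this kind: the pairwise distance matrix is computed once, objects are assigned to their nearest medoid, and each cluster's medoid is replaced by the member that minimizes the total distance to the other members. The function name, random initialization, and Euclidean distances are illustrative assumptions, not details taken from the paper (which also compares several initialization strategies).

```python
import numpy as np

def simple_kmedoids(X, k, max_iter=100, seed=None):
    """Hypothetical sketch of a K-means-like K-medoids loop on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distance matrix, computed a single time and reused every iteration.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)  # random start; the paper tests other choices
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign each object to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            # New medoid: the member with the smallest summed distance to its own cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                    # medoids unchanged: converged
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```

For example, `simple_kmedoids(np.random.rand(100, 2), k=3)` returns the indices of three medoid objects and a cluster label for each point; only the assignment and update steps touch the precomputed matrix D.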
Citations
Journal ArticleDOI
TL;DR: This review paper begins with the definition of clustering, considers the basic elements of the clustering process, such as distance or similarity measures and evaluation indicators, and analyzes clustering algorithms from two perspectives: the traditional and the modern.
Abstract: Data analysis is a common method in modern scientific research, spanning communication science, computer science, and biology. Clustering, as a basic component of data analysis, plays a significant role. On the one hand, many tools for cluster analysis have been created as information has grown and subjects have intersected. On the other hand, each clustering algorithm has its own strengths and weaknesses, owing to the complexity of the information. In this review paper, we begin with the definition of clustering, consider the basic elements involved in the clustering process, such as distance or similarity measures and evaluation indicators, and analyze the clustering algorithms from two perspectives, the traditional ones and the modern ones. All the discussed clustering algorithms are compared in detail and comprehensively summarized in Appendix Table 22.

1,234 citations


Cites background from "A simple and fast algorithm for K-m..."

  • ...K-means [7] and K-medoids [8] are the two most famous ones of this kind of clustering algorithms....

    [...]

Journal ArticleDOI
TL;DR: This paper introduces concepts and algorithms related to clustering, provides a concise survey of existing clustering algorithms, and compares them from both a theoretical and an empirical perspective.
Abstract: Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analyzing the massive volumes of data generated by modern applications. Their main goal is to categorize data into clusters such that objects grouped in the same cluster are similar according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize these algorithms for a large number of applications. However, one of the major issues in using clustering algorithms for big data, and one that causes confusion among practitioners, is the lack of consensus on the definition of their properties as well as the lack of a formal categorization. With the intention of alleviating these problems, this paper introduces concepts and algorithms related to clustering and provides a concise survey of existing clustering algorithms as well as a comparison, from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments comparing the most representative algorithm from each category on a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, as well as stability, runtime, and scalability tests. In addition, we highlight the set of clustering algorithms that perform best for big data.

833 citations

Journal ArticleDOI
TL;DR: This review describes how molecular docking was first applied to assist drug discovery tasks and illustrates newer and emerging uses and applications of docking, including prediction of adverse effects, polypharmacology, drug repurposing, and target fishing and profiling.
Abstract: Molecular docking is an established in silico structure-based method widely used in drug discovery. Docking enables the identification of novel compounds of therapeutic interest, predicting ligand-target interactions at a molecular level, or delineating structure-activity relationships (SAR), without knowing a priori the chemical structure of other target modulators. Although it was originally developed to help in understanding the mechanisms of molecular recognition between small and large molecules, the uses and applications of docking in drug discovery have changed considerably in recent years. In this review, we describe how molecular docking was first applied to assist in drug discovery tasks. Then, we illustrate newer and emerging uses and applications of docking, including prediction of adverse effects, polypharmacology, drug repurposing, and target fishing and profiling, and we also discuss future applications and the further potential of this technique when combined with emerging approaches such as artificial intelligence.

663 citations


Cites methods from "A simple and fast algorithm for K-m..."

  • ...Then, they performed clustering via the K-medoids method on the calculated molecular dynamics trajectories [88] to identify MD-derived representative conformations of the investigated targets....

    [...]

Journal ArticleDOI
22 Mar 2021
TL;DR: In this paper, the authors present a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application, and highlight challenges and potential research directions based on their study.
Abstract: In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), and particularly machine learning (ML), is key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. In addition, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for both academia and industry professionals, as well as for decision-makers in various real-world situations and application areas, particularly from a technical point of view.

659 citations

Journal ArticleDOI
TL;DR: This survey provides an overview of how machine learning has been used so far for malware analysis in Windows environments, i.e., for the analysis of Portable Executables.

316 citations


Cites background from "A simple and fast algorithm for K-m..."

  • ...3, apply to k-medoids as well, but it is less sensitive to outliers [84]....

    [...]

References
01 Jan 1967
TL;DR: The k-means algorithm described in this paper partitions an N-dimensional population into k sets on the basis of a sample; the k-means concept generalizes the ordinary sample mean, and the resulting partitions are shown to be reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E^N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then $W^2(S) = \sum_{i=1}^{k} \int_{S_i} \lVert z - u_i \rVert^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special [...]
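To make the criterion concrete, the following is a minimal sketch of its sample analogue: for a finite data set, the integral over each $S_i$ becomes the sum of squared distances of the points in cluster i to that cluster's mean. The function name and the use of integer labels for the partition are illustrative assumptions, not taken from the abstract.

```python
import numpy as np

def within_class_variance(X, labels):
    """Sample analogue of W^2(S): total squared deviation from each cluster's mean."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)              # the conditional mean u_i over cluster S_i
        total += ((pts - centroid) ** 2).sum()   # within-cluster sum of squared deviations
    return total
```

The k-means procedure alternates between assigning each point to its nearest current mean and recomputing the means, and each of these two steps can only decrease (or leave unchanged) this quantity.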

24,320 citations


"A simple and fast algorithm for K-m..." refers methods in this paper

  • ...The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step....

    [...]

  • ...K-means clustering (MacQueen, 1967) and partitioning around medoids (Kaufman & Rousseeuw, 1990) are well known techniques for performing non-hierarchical clustering....

    [...]

Journal ArticleDOI
TL;DR: A new graphical display is proposed for partitioning techniques, where each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation, and provides an evaluation of clustering validity.
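As a reminder of how the silhouette of Rousseeuw (1987) is computed for a single object i: a(i) is the mean distance to the other members of i's own cluster, b(i) is the smallest mean distance from i to the objects of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)) lies in [-1, 1]. The sketch below assumes a precomputed distance matrix D, integer cluster labels, and at least two clusters; the function name is illustrative.

```python
import numpy as np

def silhouette_value(D, labels, i):
    """Silhouette s(i) of object i, given a distance matrix D and cluster labels."""
    labels = np.asarray(labels)
    own = labels[i]
    others = (labels == own) & (np.arange(len(labels)) != i)   # other members of i's cluster
    a = D[i, others].mean() if others.any() else 0.0           # mean intra-cluster distance
    b = min(D[i, labels == c].mean()                           # closest neighbouring cluster
            for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0
```

Averaging s(i) over all objects gives the quantity that, as the excerpt below notes, van der Laan, Pollard, and Bryan (2003) maximize instead of minimizing PAM's sum of distances to the closest medoid.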

14,144 citations


"A simple and fast algorithm for K-m..." refers background in this paper

  • ...Ng and Han (1994) proposed an efficient PAM-based algorithm, which updates new medoids from some neighboring objects. van der Laan, Pollard, and Bryan (2003) tried to maximize the silhouette proposed by Rousseeuw (1987) instead of minimizing the sum of distances to the closest medoid in PAM....

    [...]

Book
01 Jan 1990
Abstract: 1. Introduction. 2. Partitioning Around Medoids (Program PAM). 3. Clustering large Applications (Program CLARA). 4. Fuzzy Analysis. 5. Agglomerative Nesting (Program AGNES). 6. Divisive Analysis (Program DIANA). 7. Monothetic Analysis (Program MONA). Appendix 1. Implementation and Structure of the Programs. Appendix 2. Running the Programs. Appendix 3. Adapting the Programs to Your Needs. Appendix 4. The Program CLUSPLOT. References. Author Index. Subject Index.

10,537 citations


"A simple and fast algorithm for K-m..." refers methods in this paper

  • ...Among many algorithms for K-medoids clustering, partitioning around medoids (PAM) proposed by Kaufman and Rousseeuw (1990) is known to be most powerful....

    [...]

  • ...Kaufman and Rousseeuw (1990) also proposed an algorithm called CLARA, which applies the PAM to sampled objects instead of all objects....

    [...]

  • ...K-means clustering (MacQueen, 1967) and partitioning around medoids (Kaufman & Rousseeuw, 1990) are well known techniques for performing non-hierarchical clustering....

    [...]

BookDOI
01 Jan 1990

9,011 citations