
Showing papers on "Cluster analysis published in 1984"


Journal ArticleDOI
TL;DR: A FORTRAN-IV coding of the fuzzy c-means (FCM) clustering program is presented; it generates fuzzy partitions and prototypes for any set of numerical data.
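
For illustration only, the following is a minimal NumPy sketch of the membership and prototype updates such a program iterates; it is not the paper's FORTRAN-IV code, and the fuzzifier m = 2, tolerance, and random initialization are illustrative choices.

import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Return fuzzy memberships U (n x c) and prototypes V (c x d) for data X (n x d)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                      # membership rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]           # prototype update
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))             # membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V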

5,287 citations


Journal ArticleDOI
TL;DR: It is shown that under certain conditions the K-means algorithm may fail to converge to a local minimum, and that it converges under differentiability conditions to a Kuhn-Tucker point.
Abstract: The K-means algorithm is a commonly used technique in cluster analysis. In this paper, several questions about the algorithm are addressed. The clustering problem is first cast as a nonconvex mathematical program. Then, a rigorous proof of the finite convergence of the K-means-type algorithm is given for any metric. It is shown that under certain conditions the algorithm may fail to converge to a local minimum, and that it converges under differentiability conditions to a Kuhn-Tucker point. Finally, a method for obtaining a local-minimum solution is given.
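
For reference, a minimal sketch of the two-step K-means iteration the paper analyzes; it is the monotone decrease of the within-cluster sum of squares under these steps that yields finite termination (though not necessarily at a local minimum, as the paper shows). The initialization and stopping rule below are illustrative, not the paper's.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: returns cluster labels and centers for data X (n x d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # assignment step
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])  # centroid step
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers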

1,180 citations


Journal ArticleDOI
TL;DR: A centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics is described.
Abstract: Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case O(n²) time and space. Whenever n objects are characterized by k-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points in k-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.
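
The nearest-neighbor-chain construction mentioned above can be sketched as follows for group-average linkage, a method that satisfies the reducibility property the chain argument relies on. This is a standard textbook-style illustration using the Lance-Williams update, not the paper's implementation, and it does not attempt the packing-based centroid result.

import numpy as np

def nn_chain_average(D):
    """D: symmetric (n x n) dissimilarity matrix. Returns merges as (i, j, height),
    where cluster j is absorbed into cluster i (group-average linkage)."""
    D = np.array(D, dtype=float)
    n = len(D)
    np.fill_diagonal(D, np.inf)
    size = np.ones(n)
    active = set(range(n))
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(min(active))
        while True:
            a = chain[-1]
            # nearest active neighbor of a, preferring the previous chain element on ties
            b = min((j for j in active if j != a), key=lambda j: D[a, j])
            if len(chain) > 1 and D[a, chain[-2]] <= D[a, b]:
                b = chain[-2]                       # reciprocal nearest neighbors found
                break
            chain.append(b)
        a = chain.pop()
        chain.pop()                                 # drop b from the chain as well
        merges.append((a, b, D[a, b]))
        for k in active - {a, b}:                   # Lance-Williams group-average update
            D[a, k] = D[k, a] = (size[a] * D[a, k] + size[b] * D[b, k]) / (size[a] + size[b])
        size[a] += size[b]
        active.discard(b)
    return merges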

877 citations


Journal ArticleDOI
TL;DR: An algorithm for constructing models on the basis of fuzzy and nonfuzzy data with the aid of fuzzy discretization and clustering techniques is proposed.

524 citations


Journal ArticleDOI
TL;DR: A simple formula to predict the distance traveled by fleets of vehicles in physical distribution problems involving a depot and its area of influence is developed.
Abstract: The purpose of this paper is to develop a simple formula to predict the distance traveled by fleets of vehicles in physical distribution problems involving a depot and its area of influence. Since the transportation cost of operating a break-bulk terminal or a warehouse is intimately related to the distance traveled, the availability of such a simple formula should facilitate the study of more complex logistics problems. A simple manual dispatching strategy, intended to mimic what dispatchers do but simple enough to admit analytical modeling, is presented. Since the formulas agree rather well with the length of nearly optimal computer-built tours, the predictions should approximate distances achievable in practice; the formulas seem realistic. The technique is a variant of the classical “cluster-first, route-second” approach to vehicle routing problems. In these approaches, the depot influence area is first partitioned into districts containing clusters of stops; one vehicle route is then constructed to serve each cluster. Our procedure is characterized by the way district shapes are chosen; ignoring shape during the clustering step can significantly increase travel distances. The technique is simple. To exercise it, one needs only a pencil, an eraser, and a scale map showing the destinations. Once mastered, the technique takes only a few minutes, and this time should increase only linearly with the number of destinations. For repetitive problems, the technique can be enhanced with the help of interactive computer graphics. A newspaper delivery problem for the city of San Francisco is used as an illustration.
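
A bare-bones sketch of the generic "cluster-first, route-second" strategy the paper builds on: stops are districted with plain k-means and each district is routed by a nearest-neighbor tour from the depot. Everything here is illustrative; the paper's actual contribution, the shape-aware districting rule and the closed-form distance formula, is not reproduced.

import numpy as np

def cluster_first_route_second(stops, depot, n_vehicles, seed=0):
    """stops: (n x 2) coordinates, depot: (2,) coordinates. Returns one stop order per vehicle."""
    rng = np.random.default_rng(seed)
    centers = stops[rng.choice(len(stops), n_vehicles, replace=False)]
    for _ in range(50):                                   # cluster first: plain k-means districting
        labels = np.linalg.norm(stops[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([stops[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(n_vehicles)])
    routes = []
    for j in range(n_vehicles):                           # route second: greedy nearest-neighbor tour
        todo = list(np.flatnonzero(labels == j))
        pos, route = depot, []
        while todo:
            nxt = min(todo, key=lambda i: np.linalg.norm(stops[i] - pos))
            route.append(nxt)
            pos = stops[nxt]
            todo.remove(nxt)
        routes.append(route)
    return routes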

258 citations


Journal ArticleDOI
TL;DR: Empirical results using both a primal heuristic and a hybrid heuristic-subgradient method for problems having n ⩽ 100 show that the algorithms locate solutions close to optimal without resorting to tree enumeration.

170 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed a new theory for galaxy clustering in an expanding universe based on the thermodynamics of gravitating systems and applied to the highly nonlinear regime of strong clustering.
Abstract: We develop a new theory for galaxy clustering in an expanding universe. It is based on the thermodynamics of gravitating systems and applies to the highly nonlinear regime of strong clustering. There are no free parameters in the simplest form of this theory. It predicts distribution functions of all orders, from voids to hundreds of galaxies. Comparison of these predictions with the results of numerical N-body experiments shows substantial agreement. Comparison with the observed distribution of galaxies may determine whether it has unrelaxed structure that retains information from much earlier epochs of the universe.

152 citations


Posted Content
TL;DR: A new method (SYNCLUS, SYNthesized CLUStering) is proposed for dealing with the problem of how the various contributory variables in a specific battery can be weighted so as to enhance any cluster structure that may be present.
Abstract: In the application of clustering methods to real world data sets, two problems frequently arise: (a) how can the various contributory variables in a specific battery be weighted so as to enhance some cluster structure that may be present, and (b) how can various alternative batteries be combined to produce a single clustering that "best" incorporates each contributory set. A new method is proposed (SYNCLUS, SYNthesized CLUStering) for dealing with these two problems.

146 citations


Journal ArticleDOI
TL;DR: The method is applied to the two-point angular correlation function of the 14-mag Zwicky galaxy catalog; assuming the power-law form ω(θ) = (θ0/θ)^γ, the authors find mean values γ = 0.80 and θ0 = 0.06 radians with standard errors σ(θ0) = 0.01 and σ(γ) = 0.13.
Abstract: The method is applied to the two-point angular correlation function of the 14-mag Zwicky galaxy catalog. If the two-point function has the form ω(θ) = (θ0/θ)^γ, standard errors of σ(θ0) = 0.01 and σ(γ) = 0.13 are found, with mean values of γ = 0.80 and θ0 = 0.06 radians.
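
A small illustrative fit of the assumed power-law form on log-log axes; the synthetic data below merely stand in for the catalog estimates, and the recovered parameters should come out near the quoted means.

import numpy as np

def fit_power_law(theta, w):
    """Return (theta0, gamma) for w ~= (theta0 / theta) ** gamma via log-log least squares."""
    b, a = np.polyfit(np.log(theta), np.log(w), 1)   # log w = a + b * log(theta), with gamma = -b
    gamma = -b
    theta0 = np.exp(a / gamma)
    return theta0, gamma

theta = np.geomspace(0.01, 0.5, 20)                  # angles in radians (synthetic)
w = (0.06 / theta) ** 0.80 * np.exp(np.random.default_rng(0).normal(0, 0.05, 20))
print(fit_power_law(theta, w))                       # roughly (0.06, 0.80)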

141 citations


Journal ArticleDOI
TL;DR: In this paper, a scaling theory for aggregation by kinetic clustering of clusters is developed, yielding a global picture of static and dynamic critical properties in which the dynamic critical exponent can be related to the fractal dimension.
Abstract: A scaling theory is developed for aggregation by means of kinetic clustering of clusters. A global picture of static and dynamic critical properties emerges, whereby the dynamic critical exponent can be related to the fractal dimension. Furthermore, the growth process is described in terms of a purely kinetic model. The scaling predictions agree well with numerical results.

138 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a new method (SYNCLUS, SYNthesized CLUStering) for dealing with two problems: (a) how the various contributory variables in a specific battery can be weighted so as to enhance some cluster structure that may be present, and (b) how various alternative batteries can be combined to produce a single clustering that best incorporates each contributory set.
Abstract: In the application of clustering methods to real world data sets, two problems frequently arise: (a) how can the various contributory variables in a specific battery be weighted so as to enhance some cluster structure that may be present, and (b) how can various alternative batteries be combined to produce a single clustering that “best” incorporates each contributory set. A new method is proposed (SYNCLUS, SYNthesized CLUStering) for dealing with these two problems.

Journal ArticleDOI
TL;DR: This work presents an approach for picture indexing and abstraction, and illustrates by examples how to apply abstraction operations to obtain various picture indexes, and how to construct icons to facilitate accessing of pictorial data.
Abstract: We present an approach for picture indexing and abstraction. Picture indexing facilitates information retrieval from a pictorial database consisting of picture objects and picture relations. To construct picture indexes, abstraction operations to perform picture object clustering and classification are formulated. To substantiate the abstraction operations, we also formalize syntactic abstraction rules and semantic abstraction rules. We then illustrate by examples how to apply these abstraction operations to obtain various picture indexes, and how to construct icons to facilitate accessing of pictorial data.

Journal ArticleDOI
TL;DR: In this article, the authors focus on methods that are based on a random sample of points and that use a combination of clustering and local search to identify all the local optima that are potentially global.
Abstract: SYNOPTIC ABSTRACT: The most efficient methods for finding the global minimum of an objective function (not necessarily convex) are those that embody stochastic elements. In this survey, we focus on methods that are based on a random sample of points and that use a combination of clustering and local search to identify all the local optima that are potentially global. Special attention is paid to a proper termination criterion for the sequence of sampling, clustering and searching, and to the analysis of the result produced by the method.

01 Jan 1984
TL;DR: In this paper, a new method for determining the number of groups in a numerical classification is proposed, based on the average similarity of an individual with the members of its group, including itself.
Abstract: A new method for determining the number of groups in a numerical classification is proposed. Extensive tests of the criterion for the "correct" or "optimum" number of groups are reported. The criterion may be used with any definition of similarity whose possible values are bounded by zero and unity, and with any agglomerative clustering method, whether it be hierarchical or nonhierarchical. It may also be used in conjunction with divisive clustering methods for which the similarity coefficients can conveniently be obtained. The procedure is based on the average similarity of an individual with the members of its group, including itself, and readily lends itself to interactive computation if one wishes to find the partition that maximizes the overall average similarity for a given number of groups. In that sense, the procedure may also be considered to be a clustering method.
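
The quantity the criterion is built on can be sketched directly: each object's average similarity to the members of its own group, itself included, averaged over all objects. The paper's rule for turning this score into the "correct" number of groups is not reproduced here; the function below only evaluates one candidate partition.

import numpy as np

def overall_average_similarity(S, labels):
    """S: (n x n) similarity matrix with values in [0, 1]; labels: group label of each object."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    # average similarity of object i to its own group, including itself
    per_object = np.array([S[i, labels == labels[i]].mean() for i in range(len(S))])
    return per_object.mean()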

Journal ArticleDOI
TL;DR: A cluster-analytic approach to benefit segmentation is described using an illustrative empirical example drawn from the Austrian domestic travel market, demonstrating that some benefits are incompatible with each other at the segment level, and that the "average" vacationer may be only a statistical artifact.
Abstract: A cluster-analytic approach to benefit segmentation is described using an illustrative empirical example drawn from the Austrian domestic travel market. The results demonstrate that some benefits are incompatible with each other at the segment level, and that the "average" vacationer may be only a statistical artifact.

Journal ArticleDOI
TL;DR: New approaches to unsupervised fuzzy classification of multidimensional data using ‘semi-fuzzy’ or ‘soft’ clustering techniques are discussed.

Journal ArticleDOI
TL;DR: A new index for the level of disease clustering in time is proposed, devised for the case where the data are grouped into several equally spaced intervals and applicable to both temporal and cyclical clustering.
Abstract: This paper presents a new index for the level of disease clustering in time, devised for the case where the data are grouped into several equally spaced intervals. This index is applicable to both temporal and cyclical clustering. The asymptotic distribution of this index is derived under the null hypothesis of no clustering in time. Monte Carlo simulation studies show that the asymptotic results are good approximations even when the sample size is as small as the number of intervals, an average of one per interval. The powers of the test based on this index for both types of clustering are compared with those of several existing procedures. Tables of upper percentage points of this index are given.

Journal ArticleDOI
TL;DR: Several techniques are given for the uniform generation of trees for use in Monte Carlo studies of clustering and tree representations, and general strategies are reviewed for random selection from a set of combinatorial objects.
Abstract: Several techniques are given for the uniform generation of trees for use in Monte Carlo studies of clustering and tree representations. First, general strategies are reviewed for random selection from a set of combinatorial objects with special emphasis on two that use random mapping operations. Theorems are given on how the number of such objects in the set (e.g., whether the number is prime) affects which strategies can be used. Based on these results, methods are presented for the random generation of six types of binary unordered trees. Three types of labeling and both rooted and unrooted forms are considered. Presentation of each method includes the theory of the method, the generation algorithm, an analysis of its computational complexity and comments on the distribution of trees over which it samples. Formal proofs and detailed algorithms are in appendices.
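
As one concrete instance of the kind of generation the paper surveys, leaf-labeled unrooted binary trees can be drawn uniformly at random by attaching each new leaf to an edge chosen uniformly, which matches the (2n-5)!! count of such trees. This is a well-known construction offered only for illustration; the paper treats several tree types and strategies in far more generality.

import random

def random_unrooted_binary_tree(n_leaves, seed=None):
    """Return an edge list; leaves are labeled 0..n_leaves-1, internal nodes follow."""
    rng = random.Random(seed)
    edges = [(0, 1)]                                   # start from the two-leaf tree
    next_internal = n_leaves
    for leaf in range(2, n_leaves):
        u, v = edges.pop(rng.randrange(len(edges)))    # pick an edge uniformly and subdivide it
        w = next_internal
        next_internal += 1
        edges += [(u, w), (w, v), (w, leaf)]           # hang the new leaf off the new node
    return edges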

Journal ArticleDOI
TL;DR: The CONCLUS model and algorithm are described in detail, as well as their flexibility for use in various applications, and Monte Carlo results are presented for two synthetic data sets with appropriate discussion of the resulting implications.
Abstract: In many classification problems, one often possesses external and/or internal information concerning the objects or units to be analyzed which makes it appropriate to impose constraints on the set of allowable classifications and their characteristics. CONCLUS, or CONstrained CLUStering, is a new methodology devised to perform constrained classification in either an overlapping or nonoverlapping (hierarchical or nonhierarchial) manner. This paper initially reviews the related classification literature. A discussion of the use of constraints in clustering problems is then presented. The CONCLUS model and algorithm are described in detail, as well as their flexibility for use in various applications. Monte Carlo results are presented for two synthetic data sets with appropriate discussion of the resulting implications. An illustration of CONCLUS is presented with respect to a sales territory design problem where the objects classified are various Forbes-500 companies. Finally, the discussion section highlights the main contribution of the paper and offers some areas for future research.

Journal ArticleDOI
TL;DR: This work gives an overview of cluster analysis and pattern recognition methods and performs hierarchical clustering on a small data set to reduce the feature set to an orthogonal one.

Journal ArticleDOI
TL;DR: The DPP program is equipped with a leading verb command language for input and job scheduling, thus providing an efficient and user-friendly operator/program interface, and with a data-base organization that accommodates a wide variety of data structures.

Journal ArticleDOI
TL;DR: A nonparametric data reduction technique is proposed that is iterative and based on the use of a criterion function and nearest neighbor density estimates to select samples that are "representative" of the entire data set.
Abstract: A nonparametric data reduction technique is proposed. Its goal is to select samples that are "representative" of the entire data set. The technique is iterative and is based on the use of a criterion function and nearest neighbor density estimates. Experiments are presented to demonstrate the algorithm.
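
A loose sketch of the idea only: score samples with a k-nearest-neighbor density estimate and greedily keep high-density samples that are not too close to representatives already chosen. The paper's criterion function and iteration differ in detail; the parameters and the particular score below are illustrative assumptions.

import numpy as np

def select_representatives(X, n_keep, k=5):
    """Pick n_keep representative rows of X using kNN density estimates (illustrative)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    kth = np.sort(D, axis=1)[:, k]                     # distance to the k-th nearest neighbor
    density = 1.0 / (kth + 1e-12)                      # simple kNN density proxy
    chosen = []
    for _ in range(n_keep):
        if chosen:
            nearest_rep = D[:, chosen].min(axis=1)     # penalize closeness to chosen points
            scores = density * nearest_rep
        else:
            scores = density.copy()
        scores[chosen] = -np.inf                       # never re-pick a representative
        chosen.append(int(scores.argmax()))
    return X[chosen], chosen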

Posted Content
TL;DR: A new methodology called INDTREES (for INdividual Differences in TREE Structures) is developed for fitting various (discrete) tree structures to three-way proximity data, illustrated on intention-to-buy data for over-the-counter pain relievers addressing three common types of maladies.
Abstract: Models for the representation of proximity data (similarities/dissimilarities) can be categorized into one of three groups of models: continuous spatial models, discrete nonspatial models, and hybrid models (which combine aspects of both spatial and discrete models). Multidimensional scaling models and associated methods, used for the spatial representation of such proximity data, have been devised to accommodate two, three, and higher-way arrays. At least one model/method for overlapping (but generally non-hierarchical) clustering called INDCLUS (Carroll and Arabie 1983) has been devised for the case of three-way arrays of proximity data. Tree-fitting methods, used for the discrete network representation of such proximity data, have only thus far been devised to handle two-way arrays. This paper develops a new methodology called INDTREES (for INdividual Differences in TREE Structures) for fitting various (discrete) tree structures to three-way proximity data. This individual differences generalization is one in which different individuals, for example, are assumed to base their judgments on the same family of trees, but are allowed to have different node heights and/or branch lengths. We initially present an introductory overview focussing on existing two-way models. The INDTREES model and algorithm are then described in detail. Monte Carlo results for the INDTREES fitting of four different three-way data sets are presented. In the application, a single ultrametric tree is fitted to three-way proximity data derived from intention-to-buy data for various brands of over-the-counter pain relievers for relieving three common types of maladies. Finally, we briefly describe how the INDTREES procedure can be extended to accommodate hybrid modelling, as well as to handle other types of applications.

Posted Content
TL;DR: This work introduces a recently developed clustering method that appears to be more suited to the analysis of such nonsymmetric data, and describes an application and comparison of the various approaches.
Abstract: Rao and Sabavala (1981) recently proposed a hierarchical clustering methodology applied to normalized brand switching matrices to assess competitive market structure. We introduce a recently developed clustering method that appears to be more suited to the analysis of such nonsymmetric data, and describe an application and comparison of the various approaches.

Journal ArticleDOI
TL;DR: Three techniques for disease time-space clustering analysis, those of Knox, Mantel and Ederer-Myers-Mantel, were applied to simulated data to study their sensitivities; the results indicate that the three techniques may not be sufficiently sensitive to detect clustering in a real data set of HD cases.
Abstract: Three techniques for disease time-space clustering analysis, those of Knox, Mantel and Ederer-Myers-Mantel, were applied to simulated data so as to study their sensitivities. The simulated data corresponded to three alternative non-null models for the distribution, transmission and development of Hodgkin's disease (HD) which were formulated in accordance with the results of published studies. The results indicate that the three techniques may not be sufficiently sensitive to the clustering in a real data set of HD cases. Therefore, the inconclusive results obtained to date with regard to clustering of HD may be related to the low power of the statistical techniques employed.

Journal ArticleDOI
TL;DR: In this paper, an individual differences generalization is proposed for fitting various discrete tree structures to three-way proximity data, in which different individuals are assumed to base their judgments on the same family of trees, but are allowed to have different node heights and/or branch lengths.
Abstract: Models for the representation of proximity data (similarities/dissimilarities) can be categorized into one of three groups of models: continuous spatial models, discrete nonspatial models, and hybrid models (which combine aspects of both spatial and discrete models). Multidimensional scaling models and associated methods, used for the spatial representation of such proximity data, have been devised to accommodate two, three, and higher-way arrays. At least one model/method for overlapping (but generally non-hierarchical) clustering called INDCLUS (Carroll and Arabie 1983) has been devised for the case of three-way arrays of proximity data. Tree-fitting methods, used for the discrete network representation of such proximity data, have only thus far been devised to handle two-way arrays. This paper develops a new methodology called INDTREES (for INdividual Differences in TREE Structures) for fitting various (discrete) tree structures to three-way proximity data. This individual differences generalization is one in which different individuals, for example, are assumed to base their judgments on the same family of trees, but are allowed to have different node heights and/or branch lengths.

Journal ArticleDOI
TL;DR: In this paper, a recently developed clustering method that appears better suited to the analysis of nonsymmetric brand-switching data is introduced and compared, through an application, with the hierarchical approach of Rao and Sabavala (1981).
Abstract: Rao and Sabavala (1981) recently proposed a hierarchical clustering methodology applied to normalized brand switching matrices to assess competitive market structure. We introduce a recently developed clustering method that appears to be more suited to the analysis of such nonsymmetric data, and describe an application and comparison of the various approaches.


Journal ArticleDOI
TL;DR: A new clustering algorithm based on a K-means approach that requires no user parameter specification is presented; it performs as well as or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Recent studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker‐independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human‐interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly, because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered tokens with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since the user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a new clustering algorithm based on a K‐means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker independent isolated word recognition system.

Proceedings ArticleDOI
F. Montel, P.L. Gouel
TL;DR: A new algorithm to lump fluid compounds into hypothetical components has been designed to bridge the gap between the increasing amount of analytical data provided by modern lab equipment and the simplified fluid description required in compositional model studies.
Abstract: This paper presents a new algorithm to lump fluid compounds into hypothetical components. It has been designed to bridge the gap between the increasing amount of analytical data provided by modern lab equipment and the simplified fluid description required in compositional model studies. The main feature of the method is a lumping scheme based on the similarities of a few properties of all the compounds identified by chromatographic analysis. An iterative clustering algorithm around mobile centers yields a classification into pseudo-components that is optimal with respect to the equation of state considered. It can be adapted to any equation of state and any number of compounds or properties. Another notable feature of this procedure is the definition of a mixed calculation of critical properties: a criterion is suggested for choosing between an accurate calculation of true critical properties and the classical mixing rules. Numerical simulations of PVT behavior were completed for an oil and a gas condensate defined by 150 compounds. The clustering algorithm produced simplified fluids lumped into 7 and 5 components, respectively. True critical property calculation for the hypothetical components and classical mixing rules were both tested, and the mixed calculation proved to be the best.
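
A much-simplified sketch of lumping by clustering around mobile centers: compounds are grouped by k-means on a few standardized properties, and the lumped properties are mole-fraction-weighted averages. The property selection, weighting, and equation-of-state coupling used in the paper are not reproduced, and every pseudo-component is assumed to receive at least one compound.

import numpy as np

def lump_compounds(props, mole_frac, n_pseudo, n_iter=100, seed=0):
    """props: (n_compounds x n_properties), mole_frac: (n_compounds,).
    Returns cluster labels and mole-fraction-weighted pseudo-component properties."""
    Z = (props - props.mean(axis=0)) / props.std(axis=0)        # standardize the properties
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), n_pseudo, replace=False)]
    for _ in range(n_iter):                                     # k-means around mobile centers
        labels = np.linalg.norm(Z[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(n_pseudo)])
        if np.allclose(new, centers):
            break
        centers = new
    pseudo = np.array([np.average(props[labels == j], axis=0,
                                  weights=mole_frac[labels == j])
                       for j in range(n_pseudo)])                # simple mole-fraction mixing
    return labels, pseudo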