
Showing papers on "Cluster analysis published in 1991"


Journal ArticleDOI
TL;DR: The authors present a fuzzy validity criterion based on a validity function which identifies compact and separate fuzzy c-partitions without assumptions as to the number of substructures inherent in the data.
Abstract: The authors present a fuzzy validity criterion based on a validity function which identifies compact and separate fuzzy c-partitions without assumptions as to the number of substructures inherent in the data. This function depends on the data set, the geometric distance measure, the distances between cluster centroids and, most importantly, the fuzzy partition generated by whatever fuzzy algorithm is used. The function is mathematically justified via its relationship to a well-defined hard clustering validity function, the separation index, for which the condition of uniqueness has already been established. The performance of this validity function compares favorably to that of several others. The application of this validity function to color image segmentation in a computer color vision system for recognition of IC wafer defects which are otherwise impossible to detect using gray-scale image processing is discussed.
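
As an illustration of the compactness-to-separation idea, here is a minimal sketch matching the widely used Xie-Beni formulation of such an index (the function name and array layout are ours, not the paper's):

```python
import numpy as np

def fuzzy_validity_index(X, centers, U, m=2.0):
    """Ratio of fuzzy compactness to centroid separation.

    X: (n, d) data; centers: (c, d) centroids; U: (c, n) memberships.
    Lower values indicate compact, well-separated c-partitions.
    """
    # compactness: membership-weighted squared distances to centroids
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)  # (c, n)
    compactness = (U ** m * d2).sum()
    # separation: minimum squared distance between distinct centroids
    cd2 = ((centers[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    separation = cd2[~np.eye(len(centers), dtype=bool)].min()
    return compactness / (len(X) * separation)
```

Scanning such an index over candidate numbers of clusters, and keeping the partition that minimizes it, is how a criterion of this kind avoids assuming the number of substructures in advance.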

3,237 citations


Journal ArticleDOI
TL;DR: A texture segmentation algorithm inspired by the multi-channel filtering theory for visual information processing in the early stages of the human visual system is presented, which is based on reconstruction of the input image from the filtered images.

2,351 citations


Journal ArticleDOI
01 Jan 1991
TL;DR: It is shown that the second smallest eigenvalue of a matrix derived from the netlist gives a provably good approximation of the optimal ratio cut partition cost.
Abstract: Partitioning of circuit netlists in VLSI design is considered. It is shown that the second smallest eigenvalue of a matrix derived from the netlist gives a provably good approximation of the optimal ratio cut partition cost. It is also demonstrated that fast Lanczos-type methods for the sparse symmetric eigenvalue problem are a robust basis for computing heuristic ratio cuts based on the eigenvector of this second eigenvalue. Effective clustering methods are an immediate by-product of the second eigenvector computation and are very successful on the difficult input classes proposed in the CAD literature. The intersection graph representation of the circuit netlist is also considered as a basis for partitioning, and a heuristic based on spectral ratio cut partitioning of the netlist intersection graph is proposed. The partitioning heuristics were tested on industry benchmark suites, and the results were good in terms of both solution quality and runtime. Several types of algorithmic speedups and directions for future work are discussed.
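
A minimal sketch of the core spectral step, using a dense eigensolver on a small symmetric adjacency matrix in place of the sparse Lanczos machinery the paper advocates:

```python
import numpy as np

def fiedler_bipartition(W):
    """Bipartition a graph from the eigenvector of the second smallest
    eigenvalue of its Laplacian L = D - W (W: symmetric weight matrix)."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second smallest eigenvector
    # split at the median value; sweeping all thresholds along the sorted
    # eigenvector and scoring each ratio cut gives better partitions
    return fiedler >= np.median(fiedler)
```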

1,282 citations


Journal ArticleDOI
TL;DR: The approach presented is applicable to a variety of fuzzy clustering algorithms as well as regression analysis, and its ability to detect ‘good’ clusters amongst noisy data is demonstrated.

722 citations


Journal ArticleDOI
TL;DR: In this paper, an econometric model of stock price clustering was derived and estimated, and it was shown that traders would frequently use odd sixteenths when trading low-price stocks if exchange regulations permitted trading on sixteenths.
Abstract: Stock prices cluster on round fractions. Clustering increases with price level and volatility, and decreases with capitalization and transaction frequency. Clustering is pervasive. Price clustering will occur if traders use discrete price sets to simplify their negotiations. Exchange regulations require that most stocks be traded on eighths. Clustering on larger fractions will occur if traders choose to use discrete price sets based on quarters, halves, or whole numbers. An econometric model of clustering is derived and estimated. Projections from the results suggest that traders would frequently use odd sixteenths when trading low-price stocks, if exchange regulations permitted trading on sixteenths.

573 citations


Book ChapterDOI
01 Jan 1991
TL;DR: This new release of the package contains 27 FORTRAN and 4 BASIC programs written for IBM-PC and compatible machines and Apple Macintosh II, Plus and SE personal computers.
Abstract: This new release of the package contains 27 FORTRAN and 4 BASIC programs written for IBM-PC and compatible machines and Apple Macintosh II, Plus and SE personal computers. Widely used standard multivariate methods and specific data analytical techniques, some of them suggested by the author, are represented in the package. The procedures programmed include hierarchical and non-hierarchical clustering, fuzzy classifications, block clustering, ordination, character ranking, comparison of classifications and ordinations, consensus methods, Monte Carlo simulation of distributions of partition agreement, simulated sampling based on digitized point patterns, and information theory functions for evaluating species assemblages. This paper gives general information on the programs; technical details are presented in the user’s manual.

492 citations


Journal ArticleDOI
TL;DR: In this paper, the authors analyse statistically the long-term properties of several instrumental earthquake catalogues and find that long-term clustering characterizes the occurrence of all earthquakes: shallow, intermediate, and deep.
Abstract: We analyse statistically the long-term properties of several instrumental earthquake catalogues. Complete catalogues exhibit both short- and long-term clustering for earthquakes of all depth ranges. After accounting for the effect of short-term clustering, we find that in residual (declustered) catalogues, long-term clustering, not periodicity, characterizes the occurrence of all earthquakes: shallow, intermediate, and deep. The degree of clustering in residual catalogues is the same for earthquakes in different depth ranges. Circumstantial evidence indicates that the long-term variation of seismicity is governed by a power-law temporal distribution; as in short-term clustering, it is scale invariant. The fractal dimension of an earthquake set on the time axis is of the order of 0.8-0.9. Therefore, mainshock occurrence is closer to a stationary Poisson process than standard aftershock sequences of shallow earthquakes.
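
As a rough illustration of the scale-invariance claim, a box-counting estimate of the fractal dimension of an event set on the time axis (a toy version; the paper's analysis also involves declustering and moment measures):

```python
import numpy as np

def boxcount_dimension(times, scales=2 ** np.arange(2, 10)):
    """Slope of log N(k) versus log k, where N(k) is the number of occupied
    boxes when the observation window is cut into k equal intervals.
    Values near 1 suggest Poisson-like filling; lower values, clustering."""
    t = np.asarray(times, dtype=float)
    t = (t - t.min()) / (t.max() - t.min())          # normalize to [0, 1]
    counts = [len(np.unique(np.floor(t * k).clip(max=k - 1))) for k in scales]
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope
```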

482 citations


Journal ArticleDOI
TL;DR: A technology for automatically assembling large software libraries which promote software reuse by helping the user locate the components closest to her/his needs is described.
Abstract: A technology for automatically assembling large software libraries which promote software reuse by helping the user locate the components closest to her/his needs is described. Software libraries are automatically assembled from a set of unorganized components by using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation by using an indexing scheme based on the notions of lexical affinities and quantity of information. Then a hierarchy for browsing is automatically generated using a clustering technique which draws only on the information provided by the attributes. Due to the free-text indexing scheme, tools following this approach can accept free-style natural language queries.

475 citations


Journal ArticleDOI
TL;DR: Guidelines are given for when to use the sample clustering and sample weights in the analysis of complex survey data; whether and how to use them depends on certain features of the design.
Abstract: BACKGROUND. Since large-scale health surveys usually have complicated sampling schemes, there is often a question as to whether the sampling design must be considered in the analysis of the data. A recent disagreement concerning the analysis of a body iron stores-cancer association found in the first National Health and Nutrition Examination Survey and its follow-up is used to highlight the issues. METHODS. We explain and illustrate the importance of two aspects of the sampling design: clustering and weighting of observations. The body iron stores-cancer data are reanalyzed by utilizing or ignoring various aspects of the sampling design. Simple formulas are given to describe how using the sampling design of a survey in the analysis will affect the conclusions of that analysis. RESULTS. The different analyses of the body iron stores-cancer data lead to very different conclusions. Application of the simple formulas suggests that utilization of the sample clustering in the analysis is appropriate, but that a...
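
A standard formula for reasoning about the clustering side of this question, stated here as the classic design effect for cluster sampling rather than as the paper's exact expression, is deff = 1 + (m - 1) * rho, where m is the average cluster size and rho the intraclass correlation:

```python
def design_effect(avg_cluster_size, intraclass_corr):
    """Variance inflation from cluster sampling: deff = 1 + (m - 1) * rho.
    A deff of 2 doubles the variance a simple-random-sample analysis
    would report, i.e. it halves the effective sample size."""
    return 1 + (avg_cluster_size - 1) * intraclass_corr

# e.g. clusters of 20 observations with modest within-cluster correlation
print(design_effect(20, 0.05))   # -> 1.95
```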

455 citations


Journal ArticleDOI
TL;DR: The simulated annealing approach for solving optimization problems is described and proposed for solving the clustering problem. The parameters of the algorithm are discussed in detail, and it is shown that the algorithm converges to a global solution of the clustering problem.
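
A minimal sketch of simulated annealing over cluster assignments, assuming a user-supplied pairwise distance; the cost function, proposal move, and geometric cooling schedule here are our illustrative choices:

```python
import math, random

def sa_cluster(X, k, dist, T=1.0, cooling=0.999, steps=20000, seed=0):
    """Propose moving one point to another cluster; accept improvements
    always and uphill moves with probability exp(-dE / T)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in X]

    def cost(lab):
        # within-cluster sum of pairwise distances (O(n^2): fine for a sketch)
        return sum(dist(X[i], X[j])
                   for i in range(len(X)) for j in range(i + 1, len(X))
                   if lab[i] == lab[j])

    current = cost(labels)
    for _ in range(steps):
        i, new = rng.randrange(len(X)), rng.randrange(k)
        old = labels[i]
        if new == old:
            continue
        labels[i] = new
        candidate = cost(labels)
        if candidate <= current or rng.random() < math.exp((current - candidate) / T):
            current = candidate          # accept the move
        else:
            labels[i] = old              # reject and restore
        T *= cooling                     # geometric cooling
    return labels
```

The convergence-to-global-optimum guarantee discussed in the paper depends on the cooling schedule; a fixed geometric schedule like the one above only approximates that behaviour.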

435 citations


Journal ArticleDOI
TL;DR: The results of the adaptive segmentation algorithm of Lakshamanan and Derin are compared with a simple nearest-neighbor classification scheme to show that if enough information is available, simple techniques could be used as alternatives to computationally expensive schemes.
Abstract: The problem of unsupervised segmentation of textured images is considered. The only explicit assumption made is that the intensity data can be modeled by a Gauss Markov random field (GMRF). The image is divided into a number of nonoverlapping regions and the GMRF parameters are computed from each of these regions. A simple clustering method is used to merge these regions. The parameters of the model estimated from the clustered segments are then used in two different schemes, one being an approximation to the maximum a posteriori estimate of the labels and the other minimizing the percentage misclassification error. The proposed approach is contrasted with the algorithm of S. Lakshamanan and H. Derin (1989), which uses a simultaneous parameter estimation and segmentation scheme. The results of the adaptive segmentation algorithm of Lakshamanan and Derin are compared with a simple nearest-neighbor classification scheme to show that if enough information is available, simple techniques could be used as alternatives to computationally expensive schemes.

01 Feb 1991
TL;DR: A new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks, is introduced, suggesting that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics.
Abstract: This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning in explaining the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations. The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation to improve its statistical properties while retaining its desirable meaning properties is proposed. Standard term clustering strategies from information retrieval (IR), based on cooccurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others. Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.
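
For the reciprocal nearest neighbour step mentioned above, a minimal sketch of one merge pass over dense term vectors (the representation and the Euclidean metric are our assumptions):

```python
import numpy as np

def rnn_pairs(vectors):
    """Return the pairs (i, j) that are each other's nearest neighbour;
    reciprocal-nearest-neighbour clustering merges exactly these pairs
    on each pass, then recomputes vectors for the merged clusters."""
    V = np.asarray(vectors, dtype=float)
    d = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn) if i < j and nn[j] == i]
```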

Journal ArticleDOI
01 Sep 1991
TL;DR: A special-purpose point clustering algorithm is described, and its application to automatic grid generation, a technique used to solve partial differential equations, is considered.
Abstract: A special-purpose point clustering algorithm is described, and its application to automatic grid generation, a technique used to solve partial differential equations, is considered. Extensions of techniques common in computer vision and pattern recognition literature are used to partition points into a set of enclosing rectangles. Examples from 2-D calculations are shown, but the algorithm generalizes readily to three dimensions.
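
A simplified sketch of clustering flagged grid points into enclosing rectangles; production grid generators split at gaps or inflection points of row/column signatures, whereas this toy version just bisects the longer side until each box is efficient:

```python
import numpy as np

def cluster_boxes(flags, efficiency=0.7):
    """Partition the True cells of a 2-D boolean grid into rectangles whose
    fill fraction (flagged cells / area) exceeds `efficiency`."""
    pts = np.argwhere(flags)
    if len(pts) == 0:
        return []
    lo, hi = pts.min(axis=0), pts.max(axis=0) + 1    # tight bounding box
    box = flags[lo[0]:hi[0], lo[1]:hi[1]]
    if box.mean() >= efficiency or min(box.shape) <= 2:
        return [(tuple(lo), tuple(hi))]
    axis = int(np.argmax(box.shape))                 # split the longer side
    mid = (lo[axis] + hi[axis]) // 2
    a, b = flags.copy(), flags.copy()
    if axis == 0:
        a[mid:, :], b[:mid, :] = False, False
    else:
        a[:, mid:], b[:, :mid] = False, False
    return cluster_boxes(a, efficiency) + cluster_boxes(b, efficiency)
```

The same recursion extends to three dimensions by letting `axis` range over three coordinates, which is the sense in which such algorithms generalize readily.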

Journal ArticleDOI
TL;DR: An approach to the assessment of spatial clustering based on the second-moment properties of a labelled point process and an application to published data on the spatial distribution of childhood leukaemia and lymphoma in North Humberside are described.
Abstract: Motivated by recent interest in the possible spatial clustering of rare diseases, the paper develops an approach to the assessment of spatial clustering based on the second-moment properties of a labelled point process. The concept of no spatial clustering is identified with the hypothesis that in a realisation of a stationary spatial point process consisting of events of two qualitatively different types, the type 1 events are a random sample from the superposition of type 1 and type 2 events. A diagnostic plot for estimating the nature and physical scale of clustering effects is proposed. The availability of Monte Carlo tests of significance is noted. An application to published data on the spatial distribution of childhood leukaemia and lymphoma in North Humberside is described.
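
A minimal sketch of a second-moment comparison in this spirit: estimate a K function for the type 1 events (cases) and the type 2 events (controls) and examine their difference (our naive estimator omits the edge corrections a real analysis needs):

```python
import numpy as np

def khat(points, s_grid, area):
    """Naive Ripley K estimate, no edge correction:
    K(s) = area * (pairs within distance s) / (n * (n - 1))."""
    p = np.asarray(points, dtype=float)
    n = len(p)
    d = np.sqrt(((p[:, None] - p[None, :]) ** 2).sum(-1))
    iu = np.triu_indices(n, 1)
    return np.array([2 * (d[iu] <= s).sum() for s in s_grid]) * area / (n * (n - 1))

def k_difference(cases, controls, s_grid, area):
    """D(s) = K_cases(s) - K_controls(s): values above zero indicate the
    cases cluster beyond the environmental heterogeneity they share with
    the controls. Significance is judged by Monte Carlo relabelling."""
    return khat(cases, s_grid, area) - khat(controls, s_grid, area)
```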

Journal ArticleDOI
TL;DR: A new dissimilarity measure, based on “position”, “span” and “content” of symbolic objects is proposed for symbolic clustering, and the results of the application of the algorithm on numeric data of known number of classes are described first to show the efficacy of the method.

Journal ArticleDOI
TL;DR: A clustering algorithm based on the minimum volume ellipsoid (MVE) robust estimator is proposed that was successfully applied to several computer vision problems formulated in the feature space paradigm: multithresholding of gray level images, analysis of the Hough space, and range image segmentation.
Abstract: A clustering algorithm based on the minimum volume ellipsoid (MVE) robust estimator is proposed. The MVE estimator identifies the least volume region containing h percent of the data points. The clustering algorithm iteratively partitions the space into clusters without prior information about their number. At each iteration, the MVE estimator is applied several times with values of h decreasing from 0.5. A cluster is hypothesized for each ellipsoid. The shapes of these clusters are compared with shapes corresponding to a known unimodal distribution by the Kolmogorov-Smirnov test. The best fitting cluster is then removed from the space, and a new iteration starts. Constrained random sampling keeps the computation low. The clustering algorithm was successfully applied to several computer vision problems formulated in the feature space paradigm: multithresholding of gray level images, analysis of the Hough space, and range image segmentation.
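
A sketch of the MVE step via random elemental subsets, which is the usual way this estimator is approximated (the parameter choices are ours):

```python
import numpy as np

def approx_mve(X, h=0.5, trials=500, seed=0):
    """Approximate minimum volume ellipsoid covering a fraction h of X:
    fit an ellipsoid to many random (d+1)-point subsets, inflate each to
    cover h of the data, and keep the one with the least volume."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    best = None
    for _ in range(trials):
        idx = rng.choice(n, size=d + 1, replace=False)
        mu = X[idx].mean(axis=0)
        C = np.cov(X[idx].T) + 1e-9 * np.eye(d)       # shape matrix (ridged)
        md2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(C), X - mu)
        r2 = np.quantile(md2, h)                      # radius covering h of X
        volume = np.sqrt(np.linalg.det(C) * r2 ** d)  # up to a constant factor
        if best is None or volume < best[0]:
            best = (volume, mu, C, r2)
    return best   # (volume, center, shape matrix, squared radius)
```

In the clustering loop described above, this estimator is rerun with decreasing h, each fit hypothesizes a cluster, and the best-fitting one (by the Kolmogorov-Smirnov test) is removed before the next iteration.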

Journal ArticleDOI
TL;DR: An iterative algorithm that finds a locally optimal partition for an arbitrary loss function, in time linear in N for each iteration, is presented and it is proven that the globally optimal partition must satisfy a nearest neighbour condition using divergence as the distance measure.
Abstract: An iterative algorithm that finds a locally optimal partition for an arbitrary loss function, in time linear in N for each iteration, is presented. The algorithm is a K-means-like clustering algorithm that uses as its distance measure a generalization of Kullback's information divergence. Moreover, it is proven that the globally optimal partition must satisfy a nearest neighbour condition using divergence as the distance measure. These results generalize similar results of L. Breiman et al. (1984) to an arbitrary number of classes or regression variables and to an arbitrary number of bins. Experimental results on a text-to-speech example are provided and additional applications of the algorithm, including the design of variable combinations, surrogate splits, composite nodes, and decision graphs, are suggested.
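
A minimal sketch of a K-means-like alternation with Kullback divergence as the distance, for rows that are discrete distributions; for KL(p || c), the divergence-minimizing centroid of a cluster is its arithmetic mean, which the update below uses:

```python
import numpy as np

def kl(p, Q):
    """Kullback divergence from distribution p to each row of Q."""
    p, Q = np.clip(p, 1e-12, 1.0), np.clip(Q, 1e-12, 1.0)
    return (p * np.log(p / Q)).sum(axis=-1)

def divergence_kmeans(P, k, iters=50, seed=0):
    """Assign each histogram to the nearest centroid in divergence,
    then recompute each centroid as the within-cluster average."""
    P = np.asarray(P, dtype=float)
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), k, replace=False)]
    for _ in range(iters):
        labels = np.array([kl(p, C).argmin() for p in P])
        C = np.array([P[labels == j].mean(axis=0) if (labels == j).any()
                      else C[j] for j in range(k)])
    return labels, C
```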

01 Jan 1991
TL;DR: In this article, a method for separating speech from speakers engaged in dialogs is described, assuming no prior knowledge of the speakers, employing a distance measure between speech segments used in conjunction with a clustering algorithm, to perform the segregation.
Abstract: A method for segregating speech from speakers engaged in dialogs is described. The method, assuming no prior knowledge of the speakers, employs a distance measure between speech segments used in conjunction with a clustering algorithm, to perform the segregation. Properties of the distance measure are discussed and an air traffic control application is described.

Journal ArticleDOI
TL;DR: An efficient non-hierarchical clustering algorithm, based on initial seeds obtained from the assignment method, for finding part-families and machine cells for group technology (GT) is presented in this article.
Abstract: An efficient nonhierarchical clustering algorithm, based on initial seeds obtained from the assignment method, for finding part-families and machine cells for group technology (GT) is presented. By a process of alternate clustering and generating seeds from rows and columns, the zero-one machine-component incidence matrix was block-diagonalized with the aim of minimizing exceptional elements (intercell movements) and blanks (machine idling). The algorithm is compared with the existing nonhierarchical clustering method and is found to yield favourable results.

Proceedings ArticleDOI
26 Oct 1991
TL;DR: It is shown that reasonable predictions of quality level are possible for the functional tests, but that scan tests produce significantly worse quality levels than predicted. Apparent clustering of defects resulted in very good quality levels for fault coverages less than 99%.
Abstract: This paper discusses the use of stuck-at fault coverage as a means of determining quality levels. Data from a part tested with both functional and scan tests is analyzed and compared to three existing theories. It is shown that reasonable predictions of quality level are possible for the functional tests, but that scan tests produce significantly worse quality levels than predicted. Apparent clustering of defects resulted in very good quality levels for fault coverages less than 99%.
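
For context, one classic model that such comparisons are made against is the Williams-Brown relation between yield Y, fault coverage T, and shipped defect level; we show it as a general reference without claiming it is one of the three theories analyzed in the paper:

```python
def defect_level(process_yield, fault_coverage):
    """Williams-Brown model: DL = 1 - Y ** (1 - T). It assumes defects
    are independent; clustered defects tend to make observed quality
    better than this prediction at modest fault coverages."""
    return 1 - process_yield ** (1 - fault_coverage)

# e.g. 50% yield and 95% stuck-at coverage -> about 3.4% shipped defective
print(defect_level(0.5, 0.95))
```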

Journal ArticleDOI
TL;DR: In this paper, a "close neighbor algorithm" is proposed to solve the problem of visual identification of machine groups and part families in cellular manufacturing systems, which overcomes many deficiencies of the CDR and ASM methods.
Abstract: The first step in creating a cellular manufacturing system is to identify machine groups and form part families. Clustering and data organization (CDR) algorithms (such as the bond energy algorithm) and array sorting (ARS) methods (such as the rank order clustering algorithm) have been proposed to solve the machine and part grouping problem. However, these methods do not always produce a solution matrix that has a block diagonal structure, making visual identification of machine groups and part families extremely difficult. This paper presents a ‘close neighbour algorithm’ to solve this problem. The algorithm overcomes many deficiencies of the CDR and ARS methods. The algorithm is tested against ten existing algorithms in solving test problems from the literature. Test results show that the algorithm is very reliable and efficient.
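
For context, a sketch of rank order clustering, the array-sorting method named above (our NumPy rendering; note the binary row/column keys overflow int64 beyond about 60 parts or machines):

```python
import numpy as np

def rank_order_clustering(M, max_iter=20):
    """King's rank order clustering: read each row of the 0/1 machine-part
    incidence matrix as a binary number, sort rows descending, do the same
    for columns, and repeat until the matrix stops changing."""
    M = np.asarray(M)
    rows, cols = np.arange(M.shape[0]), np.arange(M.shape[1])
    for _ in range(max_iter):
        prev = M
        r = np.argsort(-(M @ (2 ** np.arange(M.shape[1])[::-1])), kind='stable')
        M, rows = M[r], rows[r]
        c = np.argsort(-((2 ** np.arange(M.shape[0])[::-1]) @ M), kind='stable')
        M, cols = M[:, c], cols[c]
        if np.array_equal(M, prev):
            break
    return M, rows, cols   # reordered matrix plus row/column permutations
```

As the paper notes, sorts like this do not always surface a block diagonal structure, which is the gap the close neighbour algorithm targets.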

Journal ArticleDOI
TL;DR: The empirical results support the effectiveness of the data bindings clustering approach for localizing error-prone system structure and quantify ratios of coupling and strength in software systems.
Abstract: Using measures of data interaction called data bindings, the authors quantify ratios of coupling and strength in software systems and use the ratios to identify error-prone system structures. A 148000 source line system from a production environment was selected for empirical analysis. Software error data were collected from high-level system design through system testing and from field operation of the system. The authors use a set of five tools to calculate the data bindings automatically and use a clustering technique to determine a hierarchical description of each of the system's 77 subsystems. A nonparametric analysis of variance model is used to characterize subsystems and individual routines that had either many or few errors or high or low error correction effort. The empirical results support the effectiveness of the data bindings clustering approach for localizing error-prone system structure.

Patent
30 Apr 1991
TL;DR: In this paper, a system and method of logically and physically clustering data (tuples) in a database is presented, where data objects stored in the domains may be stored in a particular domain based upon a locality-of-reference algorithm in which a tuple of data is placed in a domain if and only if all objects referenced by the tuple are contained in the domain.
Abstract: A system and method of logically and physically clustering data (tuples) in a database. The database management system of the invention partitions (declusters) a set of relations into smaller so-called local relations and reclusters the local relations into constructs called domains. The domains are self-contained in that a domain contains the information for properly accessing and otherwise manipulating the data it contains. In other words, the data objects stored in the domains may be stored in a particular domain based upon a locality-of-reference algorithm in which a tuple of data is placed in a domain if and only if all objects referenced by the tuple are contained in the domain. On the other hand, the data objects stored in a domain may be clustered so that a tuple of data is placed in a domain based on the domain of the object referenced by a particular element of the tuple. By clustering the related object data in this manner, the database management system may more efficiently cache data to a user application program requesting data related to a particular data object. The system may also more efficiently lock and check-in and check-out data from the database so as to improve concurrency. Moreover, versioning may be more readily supported by copying tuples of a particular domain into a new domain which can then be updated as desired.

Journal ArticleDOI
TL;DR: One algorithm, called the ISNC algorithm and based on clustering around seeds, performs significantly better than all other algorithms on the primary performance measure (and performs well on the secondary measures) for the randomly generated data set, and performs well for the problems from the literature.

Journal ArticleDOI
01 May 1991
TL;DR: This method broadens the applications horizon of the FCM family by enabling users to match discontinuous multidimensional numerical data structures with similarity measures that have nonhyperelliptical topologies.
Abstract: An extension of the hard and fuzzy c-means (HCM/FCM) clustering algorithms is described. Specifically, these models are extended to admit the case where the (dis)similarity measure on pairs of numerical vectors includes two members of the Minkowski or p-norm family, viz., the p=1 and p=∞ norms. In the absence of theoretically necessary conditions to guide a numerical solution of the nonlinear constrained optimization problem associated with this case, it is shown that a certain basis exchange algorithm can be used to find approximate critical points of the new objective functions. This method broadens the applications horizon of the FCM family by enabling users to match discontinuous multidimensional numerical data structures with similarity measures that have nonhyperelliptical topologies.

Proceedings ArticleDOI
01 May 1991
TL;DR: GIDEON, a genetic algorithm system to heuristically solve the vehicle routing problem with time windows, consists of two distinct modules: a global clustering module that assigns customers to vehicles by a process called genetic sectoring and a local route optimization module (SWITCH-OPT).
Abstract: Addresses the vehicle routing problem with time windows (VRPTW). The VRPTW involves routing a fleet of vehicles, of limited capacity and travel time, from a central depot to a set of geographically dispersed customers with known demands within specified time windows. The authors describe GIDEON, a genetic algorithm system to heuristically solve the VRPTW. GIDEON consists of two distinct modules: a global clustering module that assigns customers to vehicles by a process called genetic sectoring (GENSECT) and a local route optimization module (SWITCH-OPT). On a standard set of 56 VRPTW problems obtained from the literature, GIDEON did better than the alternate methods on 41 of them, with an average reduction of 3.9% in fleet size and 4.4% in distance traveled for the 56 problems. GIDEON took an average of 127 CPU seconds to solve a problem on the Solbourne 5/802 computer.
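
A toy rendering of the sectoring idea only, not of GIDEON itself: treat a chromosome as a set of angle cuts around the depot and sweep customers into sectors by polar angle (all names and simplifications here are ours; the real system also evolves the cuts and locally reoptimizes each route):

```python
import math

def sector_assign(customers, depot, cut_angles):
    """Assign (x, y) customers to vehicles by which angular sector they
    fall in. `cut_angles` (radians, in [0, 2*pi)) play the role of one
    genetic-sectoring chromosome; a full implementation would also wrap
    the first and last sectors together across angle zero."""
    cuts = sorted(cut_angles)
    sectors = [[] for _ in range(len(cuts) + 1)]
    for c in customers:
        a = math.atan2(c[1] - depot[1], c[0] - depot[0]) % (2 * math.pi)
        sectors[sum(a >= b for b in cuts)].append(c)
    return sectors
```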

Journal ArticleDOI
TL;DR: In this paper, a fuzzy c-means clustering algorithm is proposed to formulate the problem of cellular cell formation, which not only reveals the specific part family that a part belongs to, but also provides the degree of membership of a part associated with each part.
Abstract: Cell formation, one of the most important problems faced in designing cellular manufacturing systems, is to group parts with similar geometry, function, material and process into part families and the corresponding machines into machine cells. There has been an extensive amount of work in this area and, consequently, numerous analytical approaches have been developed. One common weakness of these conventional approaches is that they implicitly assume that disjoint part families exist in the data; therefore, a part can only belong to one part family. In practice, it is clear that some parts definitely belong to certain part families, whereas there exist parts that may belong to more than one family. In this study, we propose a fuzzy c-means clustering algorithm to formulate the problem. The fuzzy approach offers a special advantage over conventional clustering. It not only reveals the specific part family that a part belongs to, but also provides the degree of membership of a part associated with each part...
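
The degree-of-membership output described above comes directly from the fuzzy c-means membership update; a minimal sketch for part feature vectors (array layout ours):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """u[i, k]: membership of part k in family i, inversely related to its
    relative distance to centroid i; a part midway between two family
    centroids receives split membership instead of a forced hard label."""
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=-1) + 1e-12
    inv = d2 ** (-1.0 / (m - 1))
    return inv / inv.sum(axis=0, keepdims=True)   # columns sum to 1
```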

Journal ArticleDOI
TL;DR: Results of using APARTY in the design process show improved register-transfer designs: the number of global routing wires is generally reduced by over 50% by following the partitioning scheme suggested by APARTY.
Abstract: APARTY is an architectural partitioning tool that uses a novel multistage clustering algorithm to extract the high-level structure of an IC design by concentrating on area and interconnect considerations. Performance is addressed implicitly. APARTY works within the framework of the system architect's workbench and can pass system-level structural information along to register-transfer level (RTL) tools to guide the completion of a data-path design. The multistage clustering algorithm and how it is used by APARTY to choose partitions are described. The system architect's workbench and how architectural partitioning can be used to guide synthesis are also described. Results of using APARTY in the design process show improved register-transfer designs. In particular, the number of global routing wires is generally reduced by over 50% by following the partitioning scheme suggested by APARTY.

Journal ArticleDOI
TL;DR: In this paper, the formation of machine and part groups, a central issue in the design of cellular manufacturing systems, is formulated as a Hamiltonian path problem, with similarity coefficients used to form a distance measure for machines and parts.
Abstract: The formation of machine and part groups is a central issue in the design of cellular manufacturing systems. The part-machine incidence matrix has formed the basis of several techniques for cell formation. In this paper, we propose formulating machine and part ordering as a Hamiltonian Path Problem. Similarity coefficients are used to form a distance measure for machines and parts. The resulting solutions are shown to be better than those obtained from binary clustering on a set of test problems.
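
A sketch of the two ingredients, with a greedy nearest-neighbour walk standing in for the Hamiltonian path solver the paper actually uses:

```python
import numpy as np

def jaccard_similarity(M):
    """Similarity coefficient between machines i and j: parts visiting
    both over parts visiting either (M: 0/1 machine-part incidence matrix)."""
    B = np.asarray(M, dtype=bool)
    both = (B[:, None, :] & B[None, :, :]).sum(axis=-1)
    either = (B[:, None, :] | B[None, :, :]).sum(axis=-1)
    return both / np.maximum(either, 1)

def greedy_path_order(S):
    """Order machines by repeatedly stepping to the most similar unvisited
    one; a crude stand-in that only illustrates the ordering idea."""
    n = len(S)
    order, seen = [0], {0}
    while len(order) < n:
        nxt = max((j for j in range(n) if j not in seen),
                  key=lambda j: S[order[-1], j])
        order.append(nxt)
        seen.add(nxt)
    return order
```

Applying the same construction to the columns orders the parts, and reading the reordered incidence matrix reveals the candidate blocks.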

Journal ArticleDOI
TL;DR: In this paper, the information criterion for discrete data is closely related to the classification maximum likelihood criterion for the latent class model, which can be derived from the Bryant-Windham construction.
Abstract: We show that a well-known clustering criterion for discrete data, the information criterion, is closely related to the classification maximum likelihood criterion for the latent class model. This relation can be derived from the Bryant-Windham construction. Emphasis is placed on binary clustering criteria which are analyzed under the maximum likelihood approach for different multivariate Bernoulli mixtures. This alternative form of criterion reveals non-apparent aspects of clustering techniques. All the criteria discussed can be optimized with the alternating optimization algorithm. Some illustrative applications are included.