
Showing papers in "Data Mining and Knowledge Discovery in 2007"


Journal ArticleDOI
TL;DR: The utility of a new symbolic representation of time series is demonstrated; the representation allows dimensionality/numerosity reduction, and it allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series.
Abstract: Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.

1,452 citations
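For readers who want the gist in code: below is a minimal sketch of the kind of symbolic conversion described above, i.e., z-normalization, Piecewise Aggregate Approximation, and discretization against equiprobable Gaussian breakpoints. It is an illustrative reconstruction, not the authors' implementation; the segment count and the 4-letter alphabet are arbitrary choices.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def to_symbols(series, n_segments=8, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then discretize against Gaussian breakpoints."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)      # z-normalize
    reduced = paa(x, n_segments)                 # dimensionality reduction
    # Breakpoints chosen so each symbol is equiprobable under N(0, 1);
    # for a 4-letter alphabet these are the standard normal quartiles.
    breakpoints = [-0.67, 0.0, 0.67]
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in reduced)

print(to_symbols(np.sin(np.linspace(0, 4 * np.pi, 128))))
```

The lower-bounding property the abstract highlights comes from comparing symbols through a distance table derived from those same breakpoints, so symbolic distances never exceed the true distances on the original series.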


Journal ArticleDOI
TL;DR: It is believed that frequent pattern mining research has substantially broadened the scope of data analysis and will have a deep impact on data mining methodologies and applications in the long run; however, there are still challenging research issues to be solved before frequent pattern mining can claim to be a cornerstone approach in data mining applications.
Abstract: Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications. In this article, we provide a brief overview of the current status of frequent pattern mining and discuss a few promising research directions. We believe that frequent pattern mining research has substantially broadened the scope of data analysis and will have a deep impact on data mining methodologies and applications in the long run. However, there are still some challenging research issues that need to be solved before frequent pattern mining can claim to be a cornerstone approach in data mining applications.

1,448 citations
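As a concrete reference point for the transaction-database itemset mining the survey starts from, here is a minimal Apriori-style miner relying on the monotonicity property the field is built on. This is an illustrative sketch only, with no claim of efficiency or fidelity to any surveyed algorithm.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support (fraction of transactions) >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while frequent:
        result.update({fs: support(fs) for fs in frequent})
        # Candidate generation: join frequent k-itemsets; monotonicity prunes
        # any candidate with an infrequent k-subset before counting.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result

print(apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}], min_support=0.5))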


Journal ArticleDOI
TL;DR: Experiments with synthetic and real-life logs show that the fitness measure indeed leads to the mining of process models that are complete (can reproduce all the behavior in the log) and precise (do not allow for extra behavior that cannot be derived from the event log).
Abstract: One of the aims of process mining is to retrieve a process model from an event log. The discovered models can be used as objective starting points during the deployment of process-aware information systems (Dumas et al., eds., Process-Aware Information Systems: Bridging People and Software Through Process Technology. Wiley, New York, 2005) and/or as a feedback mechanism to check prescribed models against enacted ones. However, current techniques have problems when mining processes that contain non-trivial constructs and/or when dealing with the presence of noise in the logs. Most of the problems happen because many current techniques are based on local information in the event log. To overcome these problems, we try to use genetic algorithms to mine process models. The main motivation is to benefit from the global search performed by this kind of algorithm. The non-trivial constructs are tackled by choosing an internal representation that supports them. The problem of noise is naturally tackled by the genetic algorithm because, by definition, these algorithms are robust to noise. The main challenge in a genetic approach is the definition of a good fitness measure because it guides the global search performed by the genetic algorithm. This paper explains how the genetic algorithm works. Experiments with synthetic and real-life logs show that the fitness measure indeed leads to the mining of process models that are complete (can reproduce all the behavior in the log) and precise (do not allow for extra behavior that cannot be derived from the event log). The genetic algorithm is implemented as a plug-in in the ProM framework.

511 citations
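A schematic of the genetic search loop described above. The init_population, fitness, crossover, and mutate callables are placeholders: in the paper, fitness rewards completeness and precision of a candidate model against the event log over an internal representation that supports non-trivial constructs, and the specific operators are the authors' own.

```python
import random

def genetic_mine(log, init_population, fitness, crossover, mutate,
                 pop_size=100, generations=500, elite=0.1, mutation_rate=0.2):
    """Generic elitist GA loop of the kind used for process model discovery.
    fitness(model, log) should reward replaying all traces in the log
    (completeness) and penalize extra behavior (precision)."""
    population = init_population(pop_size)
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, log), reverse=True)
        survivors = ranked[:max(1, int(elite * pop_size))]   # elitism
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(ranked[:pop_size // 2], 2)  # select from better half
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda m: fitness(m, log))
```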


Journal ArticleDOI
TL;DR: This paper proposes an algorithm that is able to deal with both kinds of causal dependencies between tasks, i.e., explicit and implicit ones; the algorithm has been implemented in the ProM framework, and experimental results show that it significantly improves existing process mining techniques.
Abstract: Process mining aims at extracting information from event logs to capture the business process as it is being executed. Process mining is particularly useful in situations where events are recorded but there is no system forcing people to work in a particular way. Consider for example a hospital where the diagnosis and treatment activities are recorded in the hospital information system, but where health-care professionals determine the "careflow." Many process mining approaches have been proposed in recent years. However, in spite of many researchers' persistent efforts, there are still several challenging problems to be solved. In this paper, we focus on mining non-free-choice constructs, i.e., situations where there is a mixture of choice and synchronization. Although most real-life processes exhibit non-free-choice behavior, existing algorithms are unable to adequately deal with such constructs. Using a Petri-net-based representation, we will show that there are two kinds of causal dependencies between tasks, i.e., explicit and implicit ones. We propose an algorithm that is able to deal with both kinds of dependencies. The algorithm has been implemented in the ProM framework and experimental results show that the algorithm indeed significantly improves existing process mining techniques.

312 citations


Journal ArticleDOI
TL;DR: An algorithm is introduced that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features, whose values capture the relevance of features within the corresponding cluster.
Abstract: Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves with respect to competitive methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in gene expression data, and clustering of very high-dimensional data such as text data.

238 citations
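A toy rendering of the core idea: per-cluster feature weights that grow as within-cluster dispersion along a feature shrinks, so each cluster lives in its own soft subspace. This is my own sketch under an assumed exponential weighting scheme, not the authors' algorithm; the parameter h controls how sharply weights concentrate.

```python
import numpy as np

def locally_weighted_kmeans(X, k, h=1.0, iters=20, seed=0):
    """k-means-style clustering with per-cluster feature weights."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    weights = np.ones((k, X.shape[1])) / X.shape[1]
    for _ in range(iters):
        # Assign each point by weighted squared distance to each center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2 * weights[None, :, :]).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            centers[j] = pts.mean(0)
            dispersion = ((pts - centers[j]) ** 2).mean(0)  # per-feature spread
            w = np.exp(-dispersion / h)                     # low spread -> high weight
            weights[j] = w / w.sum()
    return labels, centers, weights
```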


Journal ArticleDOI
TL;DR: The authors sketch their vision of the future of data mining: starting from the classic definition of "data mining", they elaborate on topics that, in their opinion, will set trends in the field.
Abstract: Over recent years data mining has been establishing itself as one of the major disciplines in computer science, with growing industrial impact. Undoubtedly, research in data mining will continue and even increase over coming decades. In this article, we sketch our vision of the future of data mining. Starting from the classic definition of "data mining", we elaborate on topics that, in our opinion, will set trends in data mining.

179 citations


Journal ArticleDOI
TL;DR: This paper constructs a condensed representation of all frequent itemsets, by removing those itemsets for which the support can be derived, resulting in the so-called Non-Derivable Itemsets (NDI) representation.
Abstract: All frequent itemset mining algorithms rely heavily on the monotonicity principle for pruning. This principle allows for excluding candidate itemsets from the expensive counting phase. In this paper, we present sound and complete deduction rules to derive bounds on the support of an itemset. Based on these deduction rules, we construct a condensed representation of all frequent itemsets, by removing those itemsets for which the support can be derived, resulting in the so-called Non-Derivable Itemsets (NDI) representation. We also present connections between our proposal and recent other proposals for condensed representations of frequent itemsets. Experiments on real-life datasets show the effectiveness of the NDI representation, making the search for frequent non-derivable itemsets a useful and tractable alternative to mining all frequent itemsets.

150 citations
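To make the deduction rules concrete: for a target itemset I and each proper subset J, inclusion-exclusion over the sets between J and I yields an upper bound on supp(I) when |I \ J| is odd and a lower bound when it is even; I is derivable exactly when the bounds meet. A small worked sketch, my own illustration rather than the authors' code:

```python
from itertools import chain, combinations

def support_bounds(target, supp):
    """Derive [lower, upper] bounds on supp(target) from the supports of all
    proper subsets. `supp` maps frozensets (including frozenset()) to supports."""
    target = frozenset(target)
    lower, upper = 0.0, min(supp[frozenset([i])] for i in target)
    subsets = chain.from_iterable(combinations(target, r) for r in range(len(target)))
    for j in map(frozenset, subsets):
        # Sum over all X with J <= X < target of (-1)^{|target\X|+1} * supp(X).
        rest = target - j
        total = 0.0
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                x = j | frozenset(extra)
                total += (-1) ** (len(target - x) + 1) * supp[x]
        if len(rest) % 2 == 1:
            upper = min(upper, total)   # |target \ J| odd -> upper bound
        else:
            lower = max(lower, total)   # |target \ J| even -> lower bound
    return lower, upper

# supp(ABC) need never be counted if the bounds coincide: ABC is "derivable".
s = {frozenset(): 1.0, frozenset("A"): 0.6, frozenset("B"): 0.6, frozenset("C"): 0.6,
     frozenset("AB"): 0.4, frozenset("AC"): 0.4, frozenset("BC"): 0.4}
print(support_bounds("ABC", s))   # (0.2, 0.4): not derivable here
```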


Journal ArticleDOI
TL;DR: The paper presents the results obtained by the rule extraction algorithm on a simulated dataset and on two different datasets related to biomedical applications: the first one concerns the analysis of time series coming from the monitoring of different clinical variables during hemodialysis sessions, while the other deals with the biological problem of inferring relationships between genes from DNA microarray data.
Abstract: A large volume of research in temporal data mining focuses on discovering temporal rules from time-stamped data. Most of the methods proposed so far have been devoted to mining temporal rules that describe relationships between data sequences or instantaneous events, and they do not consider the presence of complex temporal patterns in the dataset. Such complex patterns, e.g., trends or up-and-down behaviors, are often very interesting to users. In this paper we propose a new kind of temporal association rule and the related extraction algorithm; the learned rules involve complex temporal patterns in both their antecedent and consequent. Within our proposed approach, the user defines a set of complex patterns of interest that constitute the basis for the construction of the temporal rule; such complex patterns are represented and retrieved in the data through the formalism of knowledge-based Temporal Abstractions. An Apriori-like algorithm then looks for meaningful temporal relationships (in particular, precedence temporal relationships) among the complex patterns of interest. The paper presents the results obtained by the rule extraction algorithm on a simulated dataset and on two different datasets related to biomedical applications: the first one concerns the analysis of time series coming from the monitoring of different clinical variables during hemodialysis sessions, while the other one deals with the biological problem of inferring relationships between genes from DNA microarray data.

135 citations


Journal ArticleDOI
TL;DR: This work shows that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm, and demonstrates that this approach is competitive with or superior to many state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series/DNA/text/XML/video datasets.
Abstract: The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive with or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/XML/video datasets. As further evidence of the advantages of our method, we demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products.

119 citations
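The "dozen lines of code" claim is easy to make concrete. Below is a compression-based dissimilarity of the flavor the paper advocates, using zlib as the off-the-shelf compressor; this is an illustrative sketch, with ratios near 0.5 suggesting strongly related inputs and ratios near 1.0 suggesting unrelated ones.

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)),
    where C(.) is the compressed size in bytes."""
    c = lambda s: len(zlib.compress(s, 9))
    return c(x + y) / (c(x) + c(y))

s1 = b"abcabcabcabcabc" * 30
s2 = b"qrsqtuqvwqxyqzq" * 30
print(cdm(s1, s1))   # near 0.5: a sequence is maximally similar to itself
print(cdm(s1, s2))   # closer to 1.0 for unrelated sequences
```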


Journal ArticleDOI
TL;DR: This paper proposes a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton, which handles the limited amount of training data well and is well suited for the available data (categorical and numerical rather than free text).
Abstract: The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. The presence of duplicate case reports is an important data quality problem and their detection remains a formidable challenge, especially in the WHO drug safety database where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton, which handles the limited amount of training data well and is well suited for the available data (categorical and numerical rather than free text). We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate its effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps other knowledge discovery applications as well.

108 citations


Journal ArticleDOI
TL;DR: The Time Series Knowledge Representation (TSKR) is defined as a new language for expressing temporal knowledge in time interval data that has a hierarchical structure, with levels corresponding to the temporal concepts duration, coincidence, and partial order.
Abstract: We present a new method for the understandable description of local temporal relationships in multivariate data, called Time Series Knowledge Mining (TSKM). We define the Time Series Knowledge Representation (TSKR) as a new language for expressing temporal knowledge in time interval data. The patterns have a hierarchical structure, with levels corresponding to the temporal concepts duration, coincidence, and partial order. The patterns are very compact, but offer details for each element on demand. In comparison with related approaches, the TSKR is shown to have advantages in robustness, expressivity, and comprehensibility. The search for coincidence and partial order in interval data can be formulated as instances of the well known frequent itemset problem. Efficient algorithms for the discovery of the patterns are adapted accordingly. A novel form of search space pruning effectively reduces the size of the mining result to ease interpretation and speed up the algorithms. Human interaction is used during the mining to analyze and validate partial results as early as possible and guide further processing steps. The efficacy of the methods is demonstrated using two real life data sets. In an application to sports medicine the results were recognized as valid and useful by an expert of the field.

Journal ArticleDOI
TL;DR: It is shown that data mining is closely related to compression and Kolmogorov complexity, and since the latter is undecidable, data mining will always be an art in which the goal is to find better models that fit the data as well as possible.
Abstract: Will we ever have a theory of data mining analogous to the relational algebra in databases? Why do we have so many clearly different clustering algorithms? Could data mining be automated? We show that the answer to all these questions is negative, because data mining is closely related to compression and Kolmogorov complexity, and the latter is undecidable. Therefore, data mining will always be an art, where our goal will be to find better models (patterns) that fit our datasets as well as possible.

Journal ArticleDOI
TL;DR: The notion of a choice-set of constraints is introduced and it is proved that the feasibility problem for choice-sets is efficiently solvable for both ML and CL constraints.
Abstract: Recent work has looked at extending clustering algorithms with instance level must-link (ML) and cannot-link (CL) background information. Our work introduces δ and ε cluster-level constraints that influence inter-cluster distances and cluster composition. The addition of background information, though useful for providing better clustering results, raises the important feasibility question: Given a collection of constraints and a set of data, does there exist at least one partition of the data set satisfying all the constraints? We study the complexity of the feasibility problem for each of the above constraints separately and also for combinations of constraints. Our results clearly delineate combinations of constraints for which the feasibility problem is computationally intractable (i.e., NP-complete) from those for which the problem is efficiently solvable (i.e., in the computational class P). We also consider the ML and CL constraints in conjunctive and disjunctive normal forms (CNF and DNF respectively). We show that for ML constraints, the feasibility problem is intractable for CNF but efficiently solvable for DNF. Unfortunately, for CL constraints, the feasibility problem is intractable for both CNF and DNF. This effectively means that CL constraints in a non-trivial form cannot be efficiently incorporated into clustering algorithms. To overcome this, we introduce the notion of a choice-set of constraints and prove that the feasibility problem for choice-sets is efficiently solvable for both ML and CL constraints. We also present empirical results which indicate that the feasibility problem occurs extensively in real world problems.
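One tractable case is easy to illustrate: when the number of clusters is not fixed in advance, a conjunction of ML and CL constraints can be checked by closing the ML pairs transitively and verifying that no CL pair lands in the same block. A union-find sketch of that check follows (my own illustration; with a fixed number of clusters the picture changes, as the paper's intractability results show).

```python
def feasible(n, ml_pairs, cl_pairs):
    """True iff some partition of items 0..n-1 satisfies all must-link (ML)
    and cannot-link (CL) constraints. ML is transitive, so merge ML pairs
    with union-find, then check every CL pair crosses two different blocks."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in ml_pairs:
        parent[find(a)] = find(b)
    return all(find(a) != find(b) for a, b in cl_pairs)

print(feasible(4, ml_pairs=[(0, 1), (1, 2)], cl_pairs=[(0, 3)]))  # True
print(feasible(4, ml_pairs=[(0, 1), (1, 2)], cl_pairs=[(0, 2)]))  # False
```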

Journal ArticleDOI
TL;DR: The transformation of the data mining and knowledge discovery field over the last 10 years is surveyed from the unique vantage point of KDnuggets as a leading chronicler of the field.
Abstract: I survey the transformation of the data mining and knowledge discovery field over the last 10 years from the unique vantage point of KDnuggets as a leading chronicler of the field. Analysis of the most frequent words in KDnuggets News leads to revealing observations.

Journal ArticleDOI
TL;DR: How the sparseness of the data affects the redundancy and containment between the rules is discussed and a new methodology for organizing and grouping the association rules with the same consequent is provided.
Abstract: The high dimensionality of massive data results in the discovery of a large number of association rules. The huge number of rules makes it difficult to interpret and react to all of the rules, especially because many rules are redundant and contained in other rules. We discuss how the sparseness of the data affects the redundancy and containment between the rules and provide a new methodology for organizing and grouping the association rules with the same consequent. It consists of finding metarules, rules that express the associations between the discovered rules themselves. The information provided by the metarules is used to reorganize and group related rules. It is based only on data-determined relationships between the rules. We demonstrate the suggested approach on actual manufacturing data and show its effectiveness on several benchmark data sets.

Journal ArticleDOI
TL;DR: This paper proposes a new ant-based clustering algorithm, Tree-Traversing Ant (TTA), for concept formation as part of an ontology learning system, and attempts to achieve an adaptable clustering method that is highly scalable and portable across domains.
Abstract: Many conventional methods for concept formation in ontology learning have relied on the use of predefined templates and rules, and static resources such as WordNet. Such approaches are not scalable, difficult to port between different domains and incapable of handling knowledge fluctuations. Their results are also far from desirable. In this paper, we propose a new ant-based clustering algorithm, Tree-Traversing Ant (TTA), for concept formation as part of an ontology learning system. With the help of Normalized Google Distance (NGD) and n° of Wikipedia (n°W) as measures for similarity and distance between terms, we attempt to achieve an adaptable clustering method that is highly scalable and portable across domains. Evaluations on seven datasets show promising results with an average lexical overlap of 97% and ontological improvement of 48%. At the same time, the evaluations demonstrated several advantages that are not simultaneously present in standard ant-based and other conventional clustering methods.
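Normalized Google Distance, one of the two measures used above, has a standard closed form over search hit counts. A direct transcription follows; the counts in the example are made-up placeholders.

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page-hit counts:
    fx, fy - hits for terms x and y alone
    fxy    - hits for the two terms together
    n      - (scaled) total number of pages indexed."""
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Illustrative counts only: terms that co-occur often score near 0.
print(ngd(fx=10_000, fy=8_000, fxy=6_000, n=10**10))
```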

Journal ArticleDOI
TL;DR: Two algorithms, BiBoost (Bipartite Boosting) and MultBoost (Multiparty Boosting), that allow two or more participants to construct a boosting classifier without explicitly sharing their data sets are described.
Abstract: We describe two algorithms, BiBoost (Bipartite Boosting) and MultBoost (Multiparty Boosting), that allow two or more participants to construct a boosting classifier without explicitly sharing their data sets. We analyze both the computational and the security aspects of the algorithms. The algorithms inherit the excellent generalization performance of AdaBoost. Experiments indicate that the algorithms are better than AdaBoost executed separately by the participants, and that, independently of the number of participants, they perform close to AdaBoost executed using the entire data set.

Journal ArticleDOI
TL;DR: This position paper proposes knowledge-rich data mining as a focus of research, and describes initial steps in pursuing it.
Abstract: This position paper proposes knowledge-rich data mining as a focus of research, and describes initial steps in pursuing it.

Journal ArticleDOI
TL;DR: This work proposes a new approach called CrossClus, which performs multi-relational clustering under the user's guidance; unlike semi-supervised clustering, which requires a training set, it asks the user only to select one or a small set of pertinent features, and efficient and accurate approaches are designed for both feature selection and object clustering.
Abstract: Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, when clustering objects in relational databases there are usually a large number of features conveying very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her goal of clustering, we propose a new approach called CrossClus, which performs multi-relational clustering under the user's guidance. Unlike semi-supervised clustering which requires the user to provide a training set, we minimize the user's effort by using a very simple form of user guidance. The user is only required to select one or a small set of features that are pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a similar way with the user specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.

Journal ArticleDOI
TL;DR: This paper describes a scalable, effective algorithm to identify groups of correlated attributes that can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials.
Abstract: The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications, since rules would fail in the absence of pattern, but the identification of groups of mutually correlated attributes also expedites the selection of a representative subset of attributes, from which existing mappings allow others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. This algorithm can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of our evaluation of the algorithm applied to synthetic and real world datasets, and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear in the number of elements and correlations in the dataset.

Journal ArticleDOI
TL;DR: A new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM), is reviewed.
Abstract: We review a new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM) [Bishop et al. (1997) Neural Comput 10(1): 215-234]. But whereas the GTM is an extension of a mixture of experts, our new model is an extension of a product of experts [Hinton (2000) Technical report GCNU TR 2000-004, Gatsby Computational Neuroscience Unit, University College, London]. We show visualisation results on some real data sets and compare with the GTM. We then introduce a second mapping based on harmonic averages and show that it too creates a topographic mapping of the data. We compare these mappings on real and artificial data sets.

Journal ArticleDOI
TL;DR: This paper conceptualizes the trigger event problem and shows how survival analysis can be used to quantify the importance of addressing various trigger events; the method is illustrated on four real data sets from different industries and countries.
Abstract: This paper discusses a new application of data mining, quantifying the importance of responding to trigger events with reactive contacts. Trigger events happen during a customer's lifecycle and indicate some change in the relationship with the company. If detected early, the company can respond to the problem and retain the customer; otherwise the customer may switch to another company. It is usually easy to identify many potential trigger events. What is needed is a way of prioritizing which events demand interventions. We conceptualize the trigger event problem and show how survival analysis can be used to quantify the importance of addressing various trigger events. The method is illustrated on four real data sets from different industries and countries.
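A minimal illustration of the kind of comparison the paper proposes, using the Kaplan-Meier estimator from the lifelines Python library: if customers who experienced a trigger event show markedly shorter survival (time to churn) than those who did not, that event deserves a reactive contact. The data, group labels, and distributions here are synthetic placeholders, not the paper's datasets.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
# Hypothetical tenures (months until churn) for two customer groups.
tenure_trigger = rng.exponential(8, 200)    # experienced the trigger event
tenure_control = rng.exponential(20, 200)   # did not
observed = np.ones(200, dtype=bool)         # assume churn observed for all

kmf = KaplanMeierFitter()
kmf.fit(tenure_trigger, event_observed=observed, label="trigger")
print(kmf.median_survival_time_)            # much shorter -> event matters
kmf.fit(tenure_control, event_observed=observed, label="control")
print(kmf.median_survival_time_)
```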

Journal ArticleDOI
TL;DR: A systematic approach to combining the distinct problems of FS and classification is proposed and general recommendations on the optimal choice of the combination of FS decomposition type and classifier aggregation method for multiclass microarray datasets are made.
Abstract: The high dimensionality of microarray datasets endows the task of multiclass tissue classification with various difficulties, the main challenge being the selection of features deemed relevant and non-redundant to form the predictor set for classifier training. The necessity of varying the emphases on relevance and redundancy, through the use of the degree of differential prioritization (DDP) during the search for the predictor set, is also of no small importance. Furthermore, there are several types of decomposition technique for the feature selection (FS) problem: all-classes-at-once, one-vs.-all (OVA), or pairwise (PW). Also, in multiclass problems, there is the need to consider the type of classifier aggregation used, whether non-aggregated (a single machine) or aggregated (OVA or PW). From here, we first propose a systematic approach to combining the distinct problems of FS and classification. Then, using eight well-known multiclass microarray datasets, we empirically demonstrate the effectiveness of the DDP in various combinations of FS decomposition types and classifier aggregation methods. Aided by the variable DDP, feature selection leads to classification performance better than that of rank-based or equal-priorities scoring methods, and to accuracies higher than previously reported for benchmark datasets with a large number of classes. Finally, based on several criteria, we make general recommendations on the optimal choice of the combination of FS decomposition type and classifier aggregation method for multiclass microarray datasets.

Journal ArticleDOI
TL;DR: This paper investigates relational peculiarity-oriented mining by extending peculiarity identification to the record level, and shows that relational peculiarity-oriented mining is effective.
Abstract: Peculiarity rules are a new type of useful knowledge that can be discovered by searching the relevance among peculiar data. A main task in mining such knowledge is peculiarity identification. Previous methods for finding peculiar data focus on attribute values. By extending to record-level peculiarity, this paper investigates relational peculiarity-oriented mining. Peculiarity rules are mined, and more importantly explained, in a relational mining framework. Several experiments are carried out and the results show that relational peculiarity-oriented mining is effective.

Journal ArticleDOI
TL;DR: A framework to detect the inconsistency in biological databases using ontologies is proposed and a numeric estimate is provided to measure the inconsistency and identify those biological databases that are appropriate for further mining applications, aiding in enhancing the quality of databases and guaranteeing accurate and efficient mining of biological databases.
Abstract: The rapid growth of life science databases demands the fusion of knowledge from heterogeneous databases to answer complex biological questions. The discrepancies in nomenclature, various schemas and incompatible formats of biological databases, however, result in a significant lack of interoperability among databases. Therefore, data preparation is a key prerequisite for biological database mining. Integrating diverse biological molecular databases is an essential action to cope with the heterogeneity of biological databases and guarantee efficient data mining. However, the inconsistency in biological databases is a key issue for data integration. This paper proposes a framework to detect the inconsistency in biological databases using ontologies. A numeric estimate is provided to measure the inconsistency and identify those biological databases that are appropriate for further mining applications. This aids in enhancing the quality of databases and guaranteeing accurate and efficient mining of biological databases.

Journal ArticleDOI
TL;DR: A novel context sensitive algorithm for evaluation of ordinal attributes which exploits the information hidden in ordering of attributes’ and class’ values and provides a separate score for each value of the attribute is proposed.
Abstract: We propose a novel context sensitive algorithm for evaluation of ordinal attributes which exploits the information hidden in ordering of attributes' and class' values and provides a separate score for each value of the attribute. Similar to feature selection algorithm ReliefF, the proposed algorithm exploits the contextual information via selection of nearest instances. The ordEval algorithm outputs probabilistic factors corresponding to the effect an increase/decrease of attribute's value has on the class value. While the ordEval algorithm is general and can be used for analysis of any survey with graded answers, we show its utility on an important marketing problem of customer (dis)satisfaction. We develop a visualization technique and show how we can use it to detect and confirm several findings from marketing theory.

Journal ArticleDOI
TL;DR: This work models each cluster and class as a multivariate Gaussian distribution and estimates their degree of separation through an information theoretic measure (i.e., through relative entropy or Kullback–Leibler distance).
Abstract: Automated tools for knowledge discovery are frequently invoked in databases where objects already group into some known (i.e., external) classification scheme. In the context of unsupervised learning or clustering, such tools delve inside large databases looking for alternative classification schemes that are meaningful and novel. An assessment of the information gained with new clusters can be effected by looking at the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a multivariate Gaussian distribution and estimates their degree of separation through an information theoretic measure (i.e., through relative entropy or Kullback-Leibler distance). The inherently large computational cost of this step is alleviated by first projecting all data over the single dimension that best separates both distributions (using Fisher's Linear Discriminant). We test our algorithm on a dataset of Martian surfaces using the traditional division into geological units as external classes and the new, hydrology-inspired, automatically performed division as novel clusters. We find the new partitioning constitutes a formally meaningful classification that deviates substantially from the traditional classification.
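After Fisher's projection the two distributions are one-dimensional Gaussians, for which relative entropy has a closed form. A sketch of that final step, my own rendering of the computation the abstract outlines:

```python
import numpy as np

def kl_after_fisher(X1, X2):
    """Project two samples onto Fisher's discriminant direction, then return
    the closed-form KL divergence between the resulting 1-D Gaussians."""
    m1, m2 = X1.mean(0), X2.mean(0)
    S = np.cov(X1.T) + np.cov(X2.T)
    w = np.linalg.solve(S, m1 - m2)       # Fisher direction: Sw^-1 (m1 - m2)
    p, q = X1 @ w, X2 @ w                 # 1-D projections
    mu1, mu2 = p.mean(), q.mean()
    v1, v2 = p.var(ddof=1), q.var(ddof=1)
    # KL( N(mu1, v1) || N(mu2, v2) ) in closed form.
    return 0.5 * (np.log(v2 / v1) + (v1 + (mu1 - mu2) ** 2) / v2 - 1.0)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1, (500, 5))
B = rng.normal(0.5, 1, (500, 5))
print(kl_after_fisher(A, B))
```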

Journal ArticleDOI
TL;DR: A wavelet power spectrum based approach is presented for feature selection and successful classification of multiclass datasets, and is applied to the SRBCT and breast cancer multiclass cancer datasets.
Abstract: Data mining techniques are widely used in many fields; one application in bioinformatics is the classification of tissue samples. In the present work, a wavelet power spectrum based approach is presented for feature selection and successful classification of multiclass datasets. The proposed method was applied to the SRBCT and breast cancer datasets, both multiclass cancer datasets. The selected features largely coincide with those selected in previous works, and the method produced almost 100% accurate classification results. The method is simple and robust to noise, and no extensive preprocessing is required. Classification was performed with considerably fewer features than were used in the original works, and no information is lost to the initial threshold-based pruning usually performed in other methods. Because the method exploits the inherent nature of the data, it can be used for a wide range of data.
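As an illustration of the kind of feature such a method builds on, here is a per-level wavelet power spectrum computed with the PyWavelets library; the wavelet family and decomposition depth are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import pywt

def wavelet_power_spectrum(signal, wavelet="db4", level=4):
    """Energy of the wavelet coefficients at each decomposition level;
    samples whose energy concentrates at different scales separate well."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

x = np.sin(np.linspace(0, 16 * np.pi, 256))
x += 0.1 * np.random.default_rng(0).normal(size=256)
print(wavelet_power_spectrum(x))
```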