
Showing papers in "Data Mining and Knowledge Discovery in 1997"


Journal ArticleDOI
TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes and presents detailed experimental results; the methods are in use in telecommunication alarm management.
Abstract: Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management.

1,593 citations
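
As an illustration of the frequency computation, the following is a minimal sketch of window-based counting for a serial episode (events occurring in a fixed order within a time window). The event sequence, window width, and episode are made-up examples; the paper's candidate generation and incremental window updates are omitted.

```python
# Minimal sketch: count in how many sliding windows a serial episode
# (events in a fixed order, arbitrarily close together) occurs.
# Toy data; not the paper's full algorithm.

def occurs_in_order(events, episode):
    """True if the episode's events appear in 'events' in the given order."""
    it = iter(events)
    return all(any(e == target for e in it) for target in episode)

def episode_frequency(sequence, episode, win):
    """Fraction of width-'win' windows of (time, event) pairs containing the episode."""
    if not sequence:
        return 0.0
    t_min, t_max = sequence[0][0], sequence[-1][0]
    starts = range(t_min - win + 1, t_max + 1)
    hits = 0
    for s in starts:
        window = [e for (t, e) in sequence if s <= t < s + win]
        if occurs_in_order(window, episode):
            hits += 1
    return hits / len(starts)

# Hypothetical alarm sequence: (timestamp, event type)
seq = [(1, "A"), (3, "B"), (4, "C"), (6, "A"), (7, "C"), (9, "B")]
print(episode_frequency(seq, ("A", "B"), win=5))
```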


Journal ArticleDOI
TL;DR: In this article, it was shown that the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves, and that certain types of (very high) bias can be canceled by low variance to produce accurate classification.
Abstract: The classification problem is considered in which an output variable y assumes discrete values with respective probabilities that depend upon the simultaneous values of a set of input variables x = {x_1, ..., x_n}. At issue is how error in the estimates of these probabilities affects classification error when the estimates are used in a classification rule. These effects are seen to be somewhat counterintuitive in both their strength and nature. In particular, the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves. Certain types of (very high) bias can be canceled by low variance to produce accurate classification. This can dramatically mitigate the effect of the bias associated with some simple estimators like “naive” Bayes, and the bias induced by the curse-of-dimensionality on nearest-neighbor procedures. This helps explain why such simple methods are often competitive with and sometimes superior to more sophisticated ones for classification, and why “bagging/aggregating” classifiers can often improve accuracy. These results also suggest simple modifications to these procedures that can (sometimes dramatically) further improve their classification performance.

1,066 citations
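
A compact way to state the key point, in notation chosen for this sketch rather than taken verbatim from the paper: for two classes with p(x) = P(y = 1 | x) and an estimate p̂(x) learned from a random training sample,

```latex
% Notation chosen for this sketch. Probabilities on the left-hand side of the
% second display are taken over both the class label and the training sample.
\[
  \hat{y}(x) = \mathbf{1}\{\hat{p}(x) > \tfrac12\},
  \qquad
  y^{*}(x) = \mathbf{1}\{p(x) > \tfrac12\} \quad \text{(Bayes rule)}.
\]
% The increase in error rate at x over the Bayes rule is
\[
  P\bigl(\hat{y}(x) \neq y \mid x\bigr) - P\bigl(y^{*}(x) \neq y \mid x\bigr)
  = \lvert 2p(x) - 1 \rvert \; P\bigl(\hat{y}(x) \neq y^{*}(x)\bigr),
\]
% which depends only on how often the estimate falls on the wrong side of 1/2,
% not on its squared error.
```

A heavily biased estimate therefore classifies exactly like the Bayes rule as long as low variance keeps it on the correct side of 1/2.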


Journal ArticleDOI
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.

952 citations
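
The levelwise search analyzed in the paper can be sketched with frequent itemsets as the sentences and a minimum-support threshold as the selection predicate. The toy transactions and threshold below are illustrative; the paper's contribution is the analysis (borders, bounds on database accesses), not this particular instantiation.

```python
# Levelwise search sketch: one database pass per level, candidates pruned by
# requiring all (level-1)-subsets to be frequent.
from itertools import combinations

def levelwise(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, level = set(), 1
    while candidates:
        # One pass over the data: count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        freq_level = {c for c, n in counts.items() if n >= min_support}
        frequent |= freq_level
        level += 1
        # Next level: unions of frequent sets whose every smaller subset is frequent.
        candidates = {a | b for a in freq_level for b in freq_level
                      if len(a | b) == level}
        candidates = [c for c in candidates
                      if all(frozenset(s) in freq_level
                             for s in combinations(c, level - 1))]
    return frequent

data = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                               ["b", "c"], ["a", "b", "c"])]
print(sorted(map(sorted, levelwise(data, min_support=3))))
```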


Journal ArticleDOI
TL;DR: This paper uses a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions; the indicators are then used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies.
Abstract: One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions. Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies. Finally, the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms. The system has been applied to the problem of detecting cellular cloning fraud based on a database of call records. Experiments indicate that this automatic approach performs better than hand-crafted methods for detecting fraud. Furthermore, this approach can adapt to the changing conditions typical of fraud detection environments.

950 citations
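
A hedged sketch of the monitor idea: profile an account's normal level of some rule-derived indicator over a known-legitimate period, then emit a standardized deviation each day as a feature for the evidence-combining learner. The indicator, field values, and the final thresholding below are illustrative placeholders, not the paper's implementation.

```python
# One monitor = one profiled indicator per account; its output becomes a
# feature for a learned evidence combiner (replaced here by a fixed threshold).
from statistics import mean, stdev

class Monitor:
    def __init__(self, profile_days):
        self.mu = mean(profile_days)
        self.sigma = stdev(profile_days) or 1.0

    def __call__(self, today):
        # Standardized deviation of today's indicator from the account's profile.
        return (today - self.mu) / self.sigma

profile = [12, 9, 15, 11, 10, 13, 12]          # minutes/day in a legitimate period (made up)
monitor = Monitor(profile)
features = [monitor(day) for day in (11, 14, 95)]   # 95 is anomalously high
alarm = any(f > 3.0 for f in features)               # stand-in for the learned combiner
print(features, alarm)
```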


Journal ArticleDOI
TL;DR: This paper describes several phenomena that can, if ignored, invalidate an experimental comparison; these phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining.
Abstract: An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.

918 citations
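
One pitfall of this kind, comparing many algorithms against a single test set without accounting for the number of comparisons, can be illustrated with a short simulation (all numbers below are arbitrary): fifty equally useless "algorithms" are compared, and the best observed accuracy looks notable even though every method is guessing.

```python
# Illustrative simulation: the maximum accuracy over many random guessers
# exceeds 50% by an amount that is easy to mistake for a real difference.
import random

random.seed(0)
n_test, n_algorithms = 1000, 50
labels = [random.randint(0, 1) for _ in range(n_test)]

def accuracy_of_random_guesser():
    return sum(random.randint(0, 1) == y for y in labels) / n_test

accs = [accuracy_of_random_guesser() for _ in range(n_algorithms)]
print("expected accuracy of every method: 0.500")
print("best observed accuracy over 50 methods:", max(accs))
```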


Journal ArticleDOI
TL;DR: An efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution; the method is implemented in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and compared with other available methods.
Abstract: Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.

805 citations
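
A minimal sketch of the clustering feature at the heart of the method: a subcluster is summarized by CF = (N, LS, SS), CFs are additive under merging, and the centroid and radius can be recovered from the summary alone. The CF-tree that organizes these summaries and the threshold-driven insertion are omitted.

```python
# Clustering feature sketch: N points, their linear sum LS, and the sum of
# squared norms SS. Merging two subclusters only adds their summaries.

class CF:
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim            # linear sum of points
        self.ss = 0.0                    # sum of squared norms

    def add_point(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # RMS distance of member points from the centroid, from (N, LS, SS) alone.
        c2 = sum(v * v for v in self.centroid())
        return max(self.ss / self.n - c2, 0.0) ** 0.5

cf = CF(dim=2)
for p in ([1.0, 2.0], [2.0, 2.0], [1.5, 1.0]):
    cf.add_point(p)
print(cf.centroid(), cf.radius())
```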


Journal ArticleDOI
David Heckerman
TL;DR: Methods for constructing Bayesian networks from prior knowledge are discussed and Bayesian statistical methods for using data to improve these models are summarized.
Abstract: A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data modeling. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

655 citations
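
As a small, hedged illustration of the parameter-learning side, the sketch below computes posterior-mean conditional probabilities for one node of a discrete network under a Dirichlet prior. The variables, data, and prior strength are invented for the example; structure learning and missing-data techniques are not shown.

```python
# Bayesian parameter learning sketch for one node's conditional probability
# table: prior Dirichlet pseudo-counts plus observed counts give posterior means.
from collections import Counter

def posterior_cpt(records, child, parent, child_vals, parent_vals, alpha=1.0):
    """P(child | parent) as posterior means under a symmetric Dirichlet(alpha) prior."""
    joint = Counter((r[parent], r[child]) for r in records)
    cpt = {}
    for pv in parent_vals:
        total = sum(joint[(pv, cv)] for cv in child_vals) + alpha * len(child_vals)
        cpt[pv] = {cv: (joint[(pv, cv)] + alpha) / total for cv in child_vals}
    return cpt

# Hypothetical records for a two-node network sprinkler -> wet.
data = [{"sprinkler": "on", "wet": "yes"}] * 8 + \
       [{"sprinkler": "on", "wet": "no"}] * 2 + \
       [{"sprinkler": "off", "wet": "yes"}] * 1 + \
       [{"sprinkler": "off", "wet": "no"}] * 9
print(posterior_cpt(data, "wet", "sprinkler", ["yes", "no"], ["on", "off"]))
```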


Journal ArticleDOI
TL;DR: This paper describes new parallel association mining algorithms that use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets, and presents results on the performance of the algorithms on various databases, comparing them against a well-known parallel algorithm.
Abstract: Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice. Once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster inter-connected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare it against a well known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.

341 citations
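
A minimal sketch of the vertical layout and the intersection-based support counting the paper relies on: each item keeps the set of transaction ids containing it, and an itemset's support is the size of the intersection of those sets. The itemset-clustering schemes, lattice traversal, and parallel data placement are omitted; the transactions are a toy example.

```python
# Vertical database layout sketch: item -> tidset; support by set intersection.

transactions = {
    1: {"bread", "milk"},
    2: {"bread", "beer", "eggs"},
    3: {"milk", "beer", "cola"},
    4: {"bread", "milk", "beer"},
    5: {"bread", "milk", "cola"},
}

tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

def support(itemset):
    """Support count via simple tidset intersections (no hash structures)."""
    ids = None
    for item in itemset:
        ids = tidsets[item] if ids is None else ids & tidsets[item]
    return len(ids)

print(support({"bread", "milk"}))          # 3
print(support({"bread", "milk", "beer"}))  # 1
```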


Journal ArticleDOI
TL;DR: This paper presents a simple, efficient computer-based method for discovering causal relationships from databases that contain observational data, and allows interested readers to rapidly program and apply the method to their own databases, as a start toward using more elaborate causal discovery algorithms.
Abstract: This paper presents a simple, efficient computer-based method for discovering causal relationships from databases that contain observational data. Observational data is passively observed, as contrasted with experimental data. Most of the databases available for data mining are observational. There is great potential for mining such databases to discover causal relationships. We illustrate how observational data can constrain the causal relationships among measured variables, sometimes to the point that we can conclude that one variable is causing another variable. The presentation here is based on a constraint-based approach to causal discovery. A primary purpose of this paper is to present the constraint-based causal discovery method in the simplest possible fashion in order to (1) readily convey the basic ideas that underlie more complex constraint-based causal discovery techniques, and (2) permit interested readers to rapidly program and apply the method to their own databases, as a start toward using more elaborate causal discovery algorithms.

212 citations
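
The basic building block of constraint-based discovery is a statistical test of (conditional) independence between measured variables. The sketch below, which is illustrative rather than the paper's algorithm, computes a Pearson chi-squared statistic for two discrete variables and compares it with a fixed critical value; real constraint-based methods combine many such tests, together with background assumptions, to constrain and orient causal relationships.

```python
# Chi-squared independence test sketch for discrete observational data.
# The variables, records, and the 3.84 critical value (chi2 with 1 d.o.f. at
# the 0.05 level, appropriate for 2x2 tables) are illustrative choices.
from collections import Counter

def chi2_stat(pairs):
    """Pearson chi-squared statistic for independence of two discrete variables."""
    n = len(pairs)
    joint = Counter(pairs)
    xs = Counter(x for x, _ in pairs)
    ys = Counter(y for _, y in pairs)
    stat = 0.0
    for x in xs:
        for y in ys:
            expected = xs[x] * ys[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat

def dependent(pairs, critical=3.84):
    return chi2_stat(pairs) > critical

# Hypothetical observational records (smoking status, tar deposits):
records = [("yes", "high")] * 40 + [("yes", "low")] * 10 + \
          [("no", "high")] * 10 + [("no", "low")] * 40
print(dependent(records))   # True: the two variables are associated
```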


Journal ArticleDOI
TL;DR: Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases, and opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis are identified.
Abstract: Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

206 citations


Journal ArticleDOI
TL;DR: The paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software, and explains how the interpretation of patterns supports the process of knowledge discovery.
Abstract: Advanced Scout is a PC-based data mining application used by National Basketball Association (NBA) coaching staffs to discover interesting patterns in basketball game data. We describe Advanced Scout software from the perspective of data mining and knowledge discovery. This paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software and how the interpretation of patterns supports the process of knowledge discovery. The underlying technique of attribute focusing as the basis of the algorithm is also described. The process of pattern interpretation is facilitated by allowing the user to relate patterns to video tape.
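
A hedged sketch of the attribute-focusing idea: flag attribute/value combinations whose conditional frequency of an outcome deviates markedly from the overall frequency. The game events and the interestingness threshold below are invented for illustration, not drawn from the Advanced Scout data.

```python
# Attribute focusing sketch: compare each attribute value's conditional make
# rate with the overall make rate and report large deviations.
from collections import defaultdict

events = [
    {"player": "P1", "shot": "jump", "made": 1},
    {"player": "P1", "shot": "jump", "made": 0},
    {"player": "P1", "shot": "layup", "made": 1},
    {"player": "P2", "shot": "jump", "made": 0},
    {"player": "P2", "shot": "layup", "made": 1},
    {"player": "P2", "shot": "layup", "made": 1},
]

overall = sum(e["made"] for e in events) / len(events)

groups = defaultdict(list)
for e in events:
    for attr in ("player", "shot"):
        groups[(attr, e[attr])].append(e["made"])

for key, outcomes in groups.items():
    deviation = sum(outcomes) / len(outcomes) - overall
    if abs(deviation) >= 0.2:   # "interestingness" threshold, illustrative
        print(key, f"deviation from overall make rate: {deviation:+.2f}")
```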

Journal ArticleDOI
TL;DR: A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors and is utilized to discover very useful survival curves for breast cancer patients from a medical database.
Abstract: Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.
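
As a small illustration of the clustering component, the sketch below runs a k-Median-style alternation: assign each point to the nearest center under the 1-norm, then move each center to the coordinate-wise median of its assigned points. The data and k are toy values; the paper formulates this as a mathematical program with a finite termination guarantee, which is not reproduced here.

```python
# k-Median-style alternation sketch: 1-norm assignment, coordinate-wise medians.
from statistics import median

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def k_median(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: l1(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            tuple(median(p[d] for p in cl) for d in range(len(cl[0]))) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_median(pts, centers=[(0, 0), (10, 10)])
print(centers)
```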

Journal ArticleDOI
TL;DR: By building domain-specific interfaces that present information visually, human pattern-recognition skill can be combined with machines' far greater computational capacity; the idea is illustrated with a suite of visual interfaces built for telephone fraud detection.
Abstract: Human pattern recognition skills are remarkable and in many situations far exceed the ability of automated mining algorithms. By building domain-specific interfaces that present information visually, we can combine human detection with machines' far greater computational capacity. We illustrate our ideas by describing a suite of visual interfaces we built for telephone fraud detection.


Journal ArticleDOI
TL;DR: Algorithms and techniques for the construction of data cubes on distributed-memory parallel computers are presented and shown to be scalable to a large number of processors, providing a high-performance platform for OLAP and data mining on parallel systems.
Abstract: On-Line Analytical Processing (OLAP) techniques are increasingly being used in decision support systems to provide analysis of data. Queries posed on such systems are quite complex and require different views of data. Analytical models need to capture the multidimensionality of the underlying data, a task for which multidimensional databases are well suited. Multidimensional OLAP systems store data in multidimensional arrays on which analytical operations are performed. Knowledge discovery and data mining requires complex operations on the underlying data which can be very expensive in terms of computation time. High performance parallel systems can reduce this analysis time. Precomputed aggregate calculations in a Data Cube can provide efficient query processing for OLAP applications. In this article, we present algorithms for construction of data cubes on distributed-memory parallel computers. Data is loaded from a relational database into a multidimensional array. We present two methods, sort-based and hash-based for loading the base cube and compare their performances. Data cubes are used to perform consolidation queries used in roll-up operations using dimension hierarchies. Finally, we show how data cubes are used for data mining using Attribute Focusing techniques. We present results for these on the IBM-SP2 parallel machine. Results show that our algorithms and techniques for OLAP and data mining on parallel systems are scalable to a large number of processors, providing a high performance platform for such applications.
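
A minimal sketch of hash-based cube construction: for each subset of the dimensions (one group-by of the cube), a hash table maps group keys to aggregated measures. Sort-based loading, dimension hierarchies for roll-up, and the distribution over processors are omitted; the records are a toy example.

```python
# Data cube sketch: aggregate a measure over every subset of the dimensions.
from itertools import combinations
from collections import defaultdict

records = [
    {"product": "tv",  "region": "east", "year": 1997, "sales": 5},
    {"product": "tv",  "region": "west", "year": 1997, "sales": 3},
    {"product": "vcr", "region": "east", "year": 1997, "sales": 2},
    {"product": "vcr", "region": "east", "year": 1996, "sales": 4},
]
dimensions = ("product", "region", "year")

cube = {}
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        groups = defaultdict(int)          # hash table: group key -> aggregate
        for r in records:
            key = tuple(r[d] for d in dims)
            groups[key] += r["sales"]
        cube[dims] = dict(groups)

print(cube[()])                     # grand total
print(cube[("region",)])            # roll-up to region
print(cube[("product", "year")])    # two-dimensional view
```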

Journal ArticleDOI
TL;DR: The new IsoDen method, based on isodensity surfaces, has been developed to overcome some of the shortcomings of FOF; both methods have been implemented on a variety of computer systems and successfully used to extract halos from simulations with up to 16.8 million particles.
Abstract: Cosmological N-body simulations on parallel computers produce large datasets—gigabytes at each instant of simulated cosmological time, and hundreds of gigabytes over the course of a simulation. These large datasets require further analysis before they can be compared to astronomical observations. The “Halo World” tools include two methods for performing halo finding: identifying all of the gravitationally stable clusters in a point-sampled density field. One of these methods is a parallel implementation of the friends of friends (FOF) algorithm, widely used in the field of N-body cosmology. The new IsoDen method based on isodensity surfaces has been developed to overcome some of the shortcomings of FOF. Parallel processing is the only viable way of obtaining the necessary performance and storage capacity to carry out these analysis tasks. Ultimately, we must also plan to use disk storage as the only economically viable alternative for storing and manipulating such large data sets. Both IsoDen and friends of friends have been implemented on a variety of computer systems, with parallelism up to 512 processors, and successfully used to extract halos from simulations with up to 16.8 million particles.
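
A hedged sketch of the friends-of-friends definition: link every pair of particles closer than a linking length and report connected components as halos. The brute-force O(n²) pairing with union-find below only illustrates the definition; the paper's implementations rely on spatial decomposition and parallelism, and the particles here are toy values.

```python
# Friends-of-friends sketch: union-find over all particle pairs within a
# linking length b; each resulting component is a candidate halo.

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]     # path compression
        i = parent[i]
    return i

def fof(particles, b):
    n = len(particles)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - c) ** 2 for a, c in zip(particles[i], particles[j]))
            if d2 <= b * b:
                parent[find(parent, i)] = find(parent, j)   # union
    halos = {}
    for i in range(n):
        halos.setdefault(find(parent, i), []).append(i)
    return list(halos.values())

pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.1, 0.0),
       (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
print(fof(pts, b=0.3))   # two halos: indices {0, 1, 2} and {3, 4}
```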

Journal ArticleDOI
TL;DR: This work addresses the problem of computing primary images when access to the images is expensive, which is the case when the images cannot be kept locally, but must be accessed through slow communication such as the Internet, or stored in a compressed form.
Abstract: Large collections of images can be indexed by their projections on a few “primary” images. The optimal primary images are the eigenvectors of a large covariance matrix. We address the problem of computing primary images when access to the images is expensive. This is the case when the images cannot be kept locally, but must be accessed through slow communication such as the Internet, or stored in a compressed form. A distributed algorithm that computes optimal approximations to the eigenvectors (known as Ritz vectors) in one pass through the image set is proposed. When iterated, the algorithm can recover the exact eigenvectors. The widely used SVD technique for computing the primary images of a small image set is a special case of the proposed algorithm. In applications to image libraries and learning, it is necessary to compute different primary images for several sub-categories of the image set. The proposed algorithm can compute these additional primary images “offline”, without the image data. Similar computation by other algorithms is impractical even when access to the images is inexpensive.
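
A small sketch of the special case mentioned above: when the image set is small enough to hold in memory, the primary images are the top singular vectors of the mean-centered image matrix, and each image is indexed by its projections onto them. The one-pass distributed Ritz-vector algorithm for expensive-to-access images is not reproduced here; the random "images" below are placeholders.

```python
# SVD-based primary images for a small in-memory image set (the special case
# of the proposed algorithm noted in the abstract).
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((20, 64))            # 20 images, each flattened to 64 pixels
k = 3                                    # number of primary images to keep

centered = images - images.mean(axis=0)
# Rows of vt are eigenvectors of the pixel covariance matrix (up to scaling).
_, _, vt = np.linalg.svd(centered, full_matrices=False)
primary = vt[:k]                         # k primary images, shape (k, 64)

# Index each image by its projections onto the primary images.
codes = centered @ primary.T             # shape (20, k)
print(primary.shape, codes.shape)
```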