
Showing papers by "Jiawei Han published in 2001"


Proceedings ArticleDOI
02 Apr 2001
TL;DR: This work proposes a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining, and shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
Abstract: Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori, which may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when the sequential patterns to be mined are numerous. In this paper, we propose a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatly reduces the effort of candidate subsequence generation. Moreover, prefix-projection substantially reduces the size of projected databases and leads to efficient processing. Our performance study shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.

1,975 citations
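To make the prefix-projection idea concrete, here is a minimal sketch (my own illustration, not the authors' code): each frequent item extends the current prefix, and the database is projected onto the suffixes that follow that item, so no candidate subsequences are ever generated. Sequences are simplified to lists of single items (the paper also handles itemset elements), and the toy database and min_support value are assumptions.

```python
from collections import Counter

def prefixspan(sequences, min_support, prefix=None):
    """Recursively mine frequent sequential patterns by prefix projection.

    `sequences` is the projected database: a list of suffixes (item lists).
    Returns a list of (pattern, support) pairs.
    """
    prefix = prefix or []
    patterns = []
    # Count each item once per sequence (sequence-level support).
    support = Counter()
    for seq in sequences:
        for item in set(seq):
            support[item] += 1
    for item, count in sorted(support.items()):
        if count < min_support:
            continue
        new_prefix = prefix + [item]
        patterns.append((new_prefix, count))
        # Project: keep only the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

# Toy sequence database; with min_support=2, <a, b> is found through the
# <a>-projected database rather than by Apriori-style candidate generation.
db = [['a', 'b', 'c'], ['a', 'c', 'b'], ['a', 'b'], ['b', 'c']]
for pattern, sup in prefixspan(db, min_support=2):
    print(pattern, sup)
```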


Proceedings ArticleDOI
29 Nov 2001
TL;DR: The authors propose a new associative classification method, CMAR, i.e., Classification based on Multiple Association Rules, which extends an efficient frequent pattern mining method, FP-growth, constructs a class distribution-associated FP-tree, and mines large databases efficiently.
Abstract: Previous studies propose that associative classification has high classification accuracy and strong flexibility at handling unstructured data. However, it still suffers from the huge set of mined rules and sometimes biased classification or overfitting since the classification is based on only a single high-confidence rule. The authors propose a new associative classification method, CMAR, i.e., Classification based on Multiple Association Rules. The method extends an efficient frequent pattern mining method, FP-growth, constructs a class distribution-associated FP-tree, and mines large databases efficiently. Moreover, it applies a CR-tree structure to store and retrieve mined association rules efficiently, and prunes rules effectively based on confidence, correlation and database coverage. The classification is performed based on a weighted χ² analysis using multiple strong association rules. Our extensive experiments on 26 databases from the UCI machine learning database repository show that CMAR is consistent, highly effective at classification of various kinds of databases and has better average classification accuracy in comparison with CBA and C4.5. Moreover, our performance study shows that the method is highly efficient and scalable in comparison with other reported associative classification methods.

1,336 citations
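A sketch of the multiple-rule classification step, under stated assumptions: rule mining and CR-tree storage are omitted, and matching rules are combined by summing plain χ² values, which is a simplification of the paper's normalized weighted χ² measure. The Rule fields and the toy data are purely illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset  # items on the left-hand side
    klass: str             # class label on the right-hand side
    sup_p: int             # transactions containing the antecedent
    sup_c: int             # transactions with this class
    sup_pc: int            # transactions containing both

def chi2(rule, n):
    """Chi-square statistic of the 2x2 antecedent/class contingency table."""
    table = [
        (rule.sup_pc, rule.sup_p * rule.sup_c / n),
        (rule.sup_p - rule.sup_pc, rule.sup_p * (n - rule.sup_c) / n),
        (rule.sup_c - rule.sup_pc, (n - rule.sup_p) * rule.sup_c / n),
        (n - rule.sup_p - rule.sup_c + rule.sup_pc,
         (n - rule.sup_p) * (n - rule.sup_c) / n),
    ]
    return sum((o - e) ** 2 / e for o, e in table if e > 0)

def classify(case, rules, n):
    """Pick the class whose matching rules have the largest combined score.

    CMAR combines matching rules with a *weighted* chi-square; summing the
    plain chi-square values here is a simplification of that idea.
    """
    score = defaultdict(float)
    for r in rules:
        if r.antecedent <= case:
            score[r.klass] += chi2(r, n)
    return max(score, key=score.get) if score else None

rules = [
    Rule(frozenset({'urban', 'young'}), 'buys', 40, 50, 35),
    Rule(frozenset({'rural'}), 'skips', 30, 50, 25),
]
print(classify({'urban', 'young', 'renter'}, rules, n=100))  # 'buys'
```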



Book ChapterDOI
01 Dec 2001
TL;DR: This book discusses the power of Spatial Data Mining to Enhance the Applicability of GIS Technology, and the role of a Multitier Ontological Framework in Reasoning to Discover Meaningful Patterns of Sustainable Mobility.
Abstract: Contents: Introduction (Harvey J. Miller and Jiawei Han); Spatiotemporal Data Mining Paradigms and Methodologies (John F. Roddick and Brian G. Lees); Fundamentals of Spatial Data Warehousing for Geographic Knowledge Discovery (Yvan Bedard and Jiawei Han); Analysis of Spatial Data with Map Cubes: Highway Traffic Data (Chang-Tien Lu, Arnold P. Boedihardjo, and Shashi Shekhar); Data Quality Issues and Geographic Knowledge Discovery (Marc Gervais, Yvan Bedard, Marie-Andree Levesque, Eveline Bernier, and Rodolphe Devillers); Spatial Classification and Prediction Models for Geospatial Data Mining (Shashi Shekhar, Ranga Raju Vatsavai, and Sanjay Chawla); An Overview of Clustering Methods in Geographic Data Analysis (Jiawei Han, Jae-Gil Lee, and Micheline Kamber); Computing Medoids in Large Spatial Datasets (Kyriakos Mouratidis, Dimitris Papadias, and Spiros Papadimitriou); Looking for a Relationship? Try GWR (A. Stewart Fotheringham, Martin Charlton, and Urska Demsar); Leveraging the Power of Spatial Data Mining to Enhance the Applicability of GIS Technology (Donato Malerba, Antonietta Lanza, and Annalisa Appice); Visual Exploration and Explanation in Geography: Analysis with Light (Mark Gahegan); Multivariate Spatial Clustering and Geovisualization (Diansheng Guo); Toward Knowledge Discovery about Geographic Dynamics in Spatiotemporal Databases (May Yuan); The Role of a Multitier Ontological Framework in Reasoning to Discover Meaningful Patterns of Sustainable Mobility (Monica Wachowicz, Jose Macedo, Chiara Renso, and Arend Ligtenberg); Periodic Pattern Discovery from Trajectories of Moving Objects (Huiping Cao, Nikos Mamoulis, and David W. Cheung); Decentralized Spatial Data Mining for Geosensor Networks (Patrick Laube and Matt Duckham); Beyond Exploratory Visualization of Space-Time Paths (Menno-Jan Kraak and Otto Huisman).

724 citations



Proceedings ArticleDOI
29 Nov 2001
TL;DR: This study shows that H-mine has high performance in various kinds of data, outperforms the previously developed algorithms in different settings, and is highly scalable in mining large databases.
Abstract: Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as dense vs. sparse, long vs. short patterns, memory-based vs. disk-based, etc. In this study, we propose a simple and novel hyper-linked data structure, H-struct, and a new mining algorithm, H-mine, which takes advantage of this data structure and dynamically adjusts links in the mining process. A distinct feature of this method is that it has very limited and precisely predictable space overhead and runs very fast in a memory-based setting. Moreover, it can be scaled up to very large databases by database partitioning, and when the data set becomes dense, (conditional) FP-trees can be constructed dynamically as part of the mining process. Our study shows that H-mine has high performance on various kinds of data, outperforms the previously developed algorithms in different settings, and is highly scalable in mining large databases. This study also proposes a new data mining methodology, space-preserving mining, which may have a strong impact on the future development of efficient and scalable data mining methods.

452 citations
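A rough sketch of the hyper-link idea (not the authors' implementation): transactions are stored once, and each "projected database" is only a list of (row, position) pointers that get re-threaded as mining descends, which is what keeps the space overhead small and predictable. The toy database and min_support are assumptions.

```python
from collections import Counter

def h_mine(transactions, min_support):
    """Sketch of H-mine's pointer-based mining: rows are stored once, and
    each projected database is just a list of (row, start) hyper-links."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_support}
    rows = [sorted(set(t) & frequent) for t in transactions if set(t) & frequent]
    patterns = []

    def mine(prefix, links):
        # Count items reachable through the current hyper-links.
        support = Counter()
        for row, start in links:
            for item in rows[row][start:]:
                support[item] += 1
        for item, count in sorted(support.items()):
            if count < min_support:
                continue
            patterns.append((prefix + [item], count))
            # Re-thread the links past `item` instead of copying rows.
            new_links = [(row, rows[row].index(item, start) + 1)
                         for row, start in links if item in rows[row][start:]]
            mine(prefix + [item], new_links)

    mine([], [(r, 0) for r in range(len(rows))])
    return patterns

db = [['a', 'b', 'd'], ['b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'c']]
print(h_mine(db, min_support=2))
```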


Proceedings ArticleDOI
02 Apr 2001
TL;DR: A notion of convertible constraints is developed; the authors systematically analyze, classify, and characterize this class, and develop techniques that enable such constraints to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining.
Abstract: Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. The authors study constraints which cannot be handled with existing theory and techniques. For example, avg(S) θ v, median(S) θ v, and sum(S) θ v (where S can contain items of arbitrary values and θ ∈ {≥, ≤}) are customarily regarded as "tough" constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.

372 citations
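A small sketch of why a constraint such as avg(S) ≥ v is convertible: with items sorted in value-descending order, the average of a prefix can only decrease as the prefix grows, so a violating prefix prunes its whole branch. The item values, transactions, and thresholds are invented; the paper pushes this into FP-growth, whereas the sketch uses plain recursive pattern growth.

```python
# Illustrative item values and transactions (assumed for this example).
value = {'a': 90, 'b': 60, 'c': 40, 'd': 10}
db = [{'a', 'b', 'c'}, {'a', 'c', 'd'}, {'b', 'c', 'd'}, {'a', 'b'}]

def mine_avg_geq(db, value, v, min_support):
    """Find frequent itemsets S with avg(values of S) >= v.

    Items are visited in value-descending order, so the running average can
    only drop as a prefix grows; once it falls below v, the branch and all
    later siblings (which have even smaller values) can be pruned."""
    items = sorted(value, key=value.get, reverse=True)
    results = []

    def grow(prefix, total, start, rows):
        for i in range(start, len(items)):
            item = items[i]
            supporting = [t for t in rows if item in t]
            if len(supporting) < min_support:
                continue  # support is not value-ordered: only skip this item
            new_total = total + value[item]
            if new_total / (len(prefix) + 1) < v:
                break     # convertible anti-monotone: prune the rest
            results.append((prefix + [item], len(supporting)))
            grow(prefix + [item], new_total, i + 1, supporting)

    grow([], 0, 0, db)
    return results

print(mine_avg_geq(db, value, v=50, min_support=2))
```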


Proceedings ArticleDOI
26 Aug 2001
TL;DR: A novel method to efficiently find the top-n local outliers in large databases is proposed; the concept of a "micro-cluster" is introduced to compress the data.
Abstract: Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only the n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding the top-n outliers could require the same amount of computation as finding LOF values for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of a "micro-cluster" is introduced to compress the data, and an efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by overlap among the micro-clusters, we propose a meaningful cut-plane solution for overlapping data. Formal analysis and experiments show that this method achieves good performance in finding the most outstanding local outliers.

356 citations
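For background, here is a brute-force top-n LOF computation: this is the expensive baseline (a k-NN search per object) that the paper's micro-cluster compression is designed to avoid; the micro-cluster bounds themselves are too involved for a short sketch. The points, k, and n are illustrative.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (ties not handled)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return order[:k]

def top_n_lof(points, k, n):
    """Brute-force top-n LOF: one k-NN search per object, which is the cost
    the micro-cluster pruning aims to avoid for most objects."""
    nbrs = [knn(points, i, k) for i in range(len(points))]
    k_dist = [dist(points[i], points[nbrs[i][-1]]) for i in range(len(points))]

    def lrd(i):  # local reachability density
        reach = [max(k_dist[j], dist(points[i], points[j])) for j in nbrs[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(len(points))]
    lof = [sum(lrds[j] for j in nbrs[i]) / (k * lrds[i])
           for i in range(len(points))]
    return sorted(range(len(points)), key=lambda i: -lof[i])[:n]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]  # (5, 5) is isolated
print(top_n_lof(pts, k=3, n=1))  # expected: [5], the index of (5, 5)
```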


Proceedings ArticleDOI
01 May 2001
TL;DR: For efficient computation of iceberg cubes with the average measure, a top-k average pruning method is proposed, two previously studied methods are extended to Top-k Apriori and Top-k BUC, and a new iceberg cubing method, called Top-k H-Cubing, is developed.
Abstract: It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multi-dimensional aggregations for OLAP and data mining. In this paper, we study efficient methods for computing iceberg cubes with some popularly used complex measures, such as average, and develop a methodology that adopts a weaker but anti-monotonic condition for testing and pruning search space. In particular, for efficient computation of iceberg cubes with the average measure, we propose a top-k average pruning method and extend two previously studied methods, Apriori and BUC, to Top-k Apriori and Top-k BUC. To further improve the performance, an interesting hypertree structure, called H-tree, is designed and a new iceberg cubing method, called Top-k H-Cubing, is developed. Our performance study shows that Top-k BUC and Top-k H-Cubing are two promising candidates for scalable computation, and Top-k H-Cubing has better performance in most cases.

309 citations
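A sketch of the weaker anti-monotone test at the heart of the paper: for the iceberg condition count ≥ k and avg ≥ v, a cell whose top-k average (the average of its k largest measure values) falls below v cannot have any qualifying descendant, so BUC-style recursion can stop there. The H-tree and Top-k H-Cubing machinery are omitted; the schema and thresholds are made up.

```python
from collections import defaultdict

def buc_avg(rows, dims, k, v, start=0, cell=()):
    """BUC-style iceberg cubing for the condition (count >= k and avg >= v):
    partition on one dimension at a time, and prune a cell together with all
    of its descendants when the anti-monotone top-k average test fails."""
    results = []
    measures = [r[-1] for r in rows]
    top = sorted(measures, reverse=True)[:k]
    if sum(top) / len(top) < v:
        return results  # top-k average < v: no descendant can qualify
    if len(rows) >= k and sum(measures) / len(measures) >= v:
        results.append((cell, len(rows)))
    for d in range(start, len(dims)):
        groups = defaultdict(list)
        for r in rows:
            groups[r[d]].append(r)
        for val, grp in groups.items():
            results += buc_avg(grp, dims, k, v, d + 1, cell + ((dims[d], val),))
    return results

# (month, city, price): find cells with at least 2 sales averaging >= 100.
sales = [('jan', 'tor', 120), ('jan', 'tor', 90), ('jan', 'van', 40),
         ('feb', 'tor', 150), ('feb', 'van', 110)]
print(buc_avg(sales, dims=('month', 'city'), k=2, v=100))
```

Note how the (month=jan) cell itself fails avg ≥ 100 yet its child (month=jan, city=tor) qualifies: average is not anti-monotone, which is exactly why the weaker top-k condition is used for pruning instead.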


Proceedings ArticleDOI
05 Oct 2001
TL;DR: This paper examines feasible combinations of efficient sequential pattern mining and multi-dimensional analysis methods, and develops uniform methods for high-performance mining that integrate multi-dimensional analysis and sequential pattern mining.
Abstract: Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data-mining task and has broad applications. Usually, sequence patterns are associated with different circumstances, and such circumstances form a multi-dimensional space. For example, customer purchase sequences are associated with region, time, customer group, and others. It is interesting and useful to mine sequential patterns associated with multi-dimensional information. In this paper, we propose the theme of multi-dimensional sequential pattern mining, which integrates multi-dimensional analysis and sequential pattern mining. We also thoroughly explore efficient methods for multi-dimensional sequential pattern mining. We examine feasible combinations of efficient sequential pattern mining and multi-dimensional analysis methods, and develop uniform methods for high-performance mining. Extensive experiments show the advantages as well as limitations of these methods. Some recommendations on selecting the proper method with respect to data set properties are drawn.

215 citations
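A sketch of one of the combinations such a study can use: embedding dimension values as pseudo-items in front of each sequence, so that an ordinary sequential pattern miner (such as the PrefixSpan sketch above) can mine multi-dimensional patterns directly. The record format and toy data are assumptions.

```python
def uniseq(record):
    """Embed dimension values as pseudo-items in front of the sequence: a
    record (dims, sequence) becomes one ordinary sequence that any
    sequential pattern miner can process."""
    dims, seq = record
    return ['%s=%s' % kv for kv in sorted(dims.items())] + list(seq)

def is_subsequence(pattern, seq):
    """True if `pattern` occurs in `seq` in order (single-item elements)."""
    it = iter(seq)
    return all(any(p == x for x in it) for p in pattern)

def support(pattern, db):
    return sum(is_subsequence(pattern, s) for s in db)

# Customer purchase sequences tagged with region and customer group (toy data).
records = [
    ({'region': 'west', 'group': 'student'}, ['coffee', 'laptop']),
    ({'region': 'west', 'group': 'student'}, ['coffee', 'tea', 'laptop']),
    ({'region': 'east', 'group': 'retired'}, ['tea', 'kettle']),
]
db = [uniseq(r) for r in records]
# A multi-dimensional sequential pattern: region=west, then coffee, then laptop.
print(support(['region=west', 'coffee', 'laptop'], db))  # 2
```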


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is shown that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'.
Abstract: Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. In the real world, there exist many physical obstacles such as rivers, lakes, and highways, and their presence may affect the result of clustering substantially. We study the problem of clustering in the presence of obstacles and define it as a COD (Clustering with Obstructed Distance) problem. As a solution to this problem, we propose a scalable clustering algorithm, called COD-CLARANS. We discuss various forms of pre-processed information that could enhance the efficiency of COD-CLARANS. In the strictest sense, the COD problem can be treated as a change in distance function and thus could be handled by current clustering algorithms simply by changing the distance function. However, we show that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'. We conduct various performance studies to show that COD-CLARANS is both efficient and effective.
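Not the COD-CLARANS pruning itself, but a sketch of the obstructed-distance primitive the algorithm optimizes over: the shortest path between two points that crosses no obstacle, computed here with Dijkstra over a tiny visibility graph. Modeling obstacles as line segments (walls) is my simplification; the function and data names are invented.

```python
import heapq
import math

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def blocks(p, q, a, b):
    """True if segment pq strictly crosses segment ab (interiors intersect);
    touching a wall endpoint is allowed, which lets paths turn at corners."""
    return (cross(p, q, a) * cross(p, q, b) < 0 and
            cross(a, b, p) * cross(a, b, q) < 0)

def obstructed_distance(p, q, walls):
    """Shortest p-to-q distance crossing no wall: Dijkstra over a visibility
    graph whose nodes are p, q, and the wall endpoints."""
    nodes = [p, q] + [v for wall in walls for v in wall]
    dist = [math.inf] * len(nodes)
    dist[0], heap = 0.0, [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == 1:
            return d  # reached q
        if d > dist[i]:
            continue
        for j in range(len(nodes)):
            if j == i or any(blocks(nodes[i], nodes[j], a, b) for a, b in walls):
                continue
            nd = d + math.dist(nodes[i], nodes[j])
            if nd < dist[j]:
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return math.inf

# A wall between (2, -1) and (2, 1) blocks the straight line from (0, 0) to
# (4, 0); the obstructed path detours around a wall endpoint.
wall = [((2.0, -1.0), (2.0, 1.0))]
print(math.dist((0, 0), (4, 0)), obstructed_distance((0.0, 0.0), (4.0, 0.0), wall))
```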

Book ChapterDOI
11 Oct 2001
TL;DR: The penetration of data warehouses into the management and exploitation of spatial databases is a major trend as it is for non-spatial databases.
Abstract: Recent years have witnessed major changes in the Geographic Information System (GIS) market, from technological offerings to user requests. For example, spatial databases used to be implemented in GISs or in Computer-Assisted Design (CAD) systems coupled with a Relational Data Base Management System (RDBMS). Today, spatial databases are also implemented in spatial extensions of universal servers, in spatial engine software components, in GIS web servers, in analytical packages using so-called 'data cubes' and in spatial data warehouses. Such databases are structured according to either a relational, object-oriented, multi-dimensional or hybrid paradigm. In addition, these offerings are integrated as a piece of the overall technological framework of the organization and they are implemented according to very diverse architectures responding to differing users' contexts: centralized vs distributed, thin clients vs thick clients, Local Area Network (LAN) vs intranets, spatial data warehouses vs legacy systems, etc. As one may say, 'Gone are the days of a spatial database implemented solely on a stand-alone GIS' (Bédard 1999). In fact, this evolution of the GIS market follows the general trends of mainstream Information Technologies (IT). Among all these possibilities, the penetration of data warehouses into the management and exploitation of spatial databases is a major trend, as it is for non-spatial databases. According to Rawling and Kucera (1997), 'the term Data Warehouse has become the hottest industry buzzword of the decade, just behind Internet and information highway'. More specifically, this penetration of data warehouses allows developers to build new solutions geared towards one major need which has never been solved efficiently thus far: to provide a unified view of dispersed heterogeneous databases in order to efficiently feed the decision-support tools used for strategic decision making. In fact, the data warehouse emerged as the unifying solution to a series of individual circumstances related to providing the necessary basis for global knowledge discovery. First, large organizations often have several departmental or application-oriented independent databases which may overlap in content. Usually, such systems work properly for day-to-day operational-level decisions. However, when one needs to obtain aggregated or summarized information integrating data from these different

Book ChapterDOI
05 Sep 2001
TL;DR: An efficient collaborative filtering method, called RecTree (which stands for RECommendation Tree), that addresses the scalability problem with a divide-and-conquer approach and outperforms the well-known collaborative filter, CorrCF, in both execution time and accuracy.
Abstract: Many people rely on the recommendations of trusted friends to find restaurants or movies that match their tastes. But what if your friends have not sampled the item of interest? Collaborative filtering (CF) seeks to increase the effectiveness of this process by automating the derivation of a recommendation, often from a clique of advisors with whom we have no prior personal relationship. CF is a promising tool for dealing with the information overload that we face in the networked world. Prior work in CF has dealt with improving the accuracy of the predictions. However, it is still challenging to scale these methods to large databases. In this study, we develop an efficient collaborative filtering method, called RecTree (which stands for RECommendation Tree), that addresses the scalability problem with a divide-and-conquer approach. The method first performs an efficient k-means-like clustering to group data and create neighborhoods of similar users, and then performs subsequent clustering based on smaller, partitioned databases. Since the progressive partitioning reduces the search space dramatically, the search for an advisory clique is faster than scanning the entire database of users. In addition, the partitions contain users that are more similar to each other than those in other partitions. This characteristic allows RecTree to avoid the dilution of opinions from good advisors by a multitude of poor advisors, and thus yields a higher overall accuracy. Based on our experiments and performance study, RecTree outperforms the well-known collaborative filter, CorrCF, in both execution time and accuracy. In particular, RecTree's execution time scales as O(n log2(n)) with the dataset size, while CorrCF scales quadratically.
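A sketch of the divide-and-conquer structure, under assumptions: a dense ratings matrix with 0 for "unrated", a plain 2-means split instead of the paper's clustering, and no within-partition prediction step. It shows how recursive bisection yields small partitions of similar users within which advisors are then searched.

```python
import random

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def two_means(idx, ratings, iters=8, seed=0):
    """One k-means-like bisection (k = 2) over the users listed in `idx`."""
    rng = random.Random(seed)
    centers = [ratings[i] for i in rng.sample(idx, 2)]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i in idx:
            groups[sqdist(ratings[i], centers[0]) >
                   sqdist(ratings[i], centers[1])].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split; stop refining
        centers = [[sum(ratings[i][j] for i in g) / len(g)
                    for j in range(len(ratings[0]))] for g in groups]
    return groups

def rectree_partitions(idx, ratings, max_size):
    """Recursively bisect users until partitions are small; advisors for a
    prediction are then searched only inside the user's own partition."""
    if len(idx) <= max_size:
        return [idx]
    left, right = two_means(idx, ratings)
    if not left or not right:
        return [idx]  # could not split further
    return (rectree_partitions(left, ratings, max_size) +
            rectree_partitions(right, ratings, max_size))

# Toy ratings: rows are users, columns are items (0 = unrated, a simplification).
ratings = [[5, 4, 0, 1], [4, 5, 1, 0], [5, 5, 0, 0],
           [1, 0, 5, 4], [0, 1, 4, 5], [1, 1, 5, 5]]
print(rectree_partitions(list(range(len(ratings))), ratings, max_size=3))
```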

Book ChapterDOI
04 Jan 2001
TL;DR: In this article, a scalable constrained clustering algorithm is developed which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints.
Abstract: Constrained clustering--finding clusters that satisfy user-specified constraints--is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. A scalable constrained-clustering algorithm is developed in this study, which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. Our algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. Through experiments, we show the effectiveness and efficiency of the heuristics.
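A sketch of the confined-movement idea for one concrete constraint (every cluster must keep at least min_size objects): start from a feasible assignment and allow only moves that preserve the constraint. The deadlock-resolution phase and micro-cluster sharing are omitted; data and parameters are illustrative, and the initial assignment is assumed feasible.

```python
import math
import random

def centroid(pts):
    return tuple(sum(c) / len(c) for c in zip(*pts))

def constrained_kmeans(points, k, min_size, iters=20, seed=1):
    """k-means-style refinement that never violates the minimum-size
    constraint: an object may move to its nearest center only if its
    current cluster stays above min_size (a confined movement)."""
    rng = random.Random(seed)
    assign = [i % k for i in range(len(points))]  # feasible initial solution
    rng.shuffle(assign)
    for _ in range(iters):
        centers = [centroid([p for p, a in zip(points, assign) if a == c])
                   for c in range(k)]
        moved = False
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: math.dist(p, centers[c]))
            donor_size = sum(1 for a in assign if a == assign[i])
            if best != assign[i] and donor_size > min_size:
                assign[i] = best  # constraint preserved by construction
                moved = True
        if not moved:
            break
    return assign

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9), (5, 5)]
print(constrained_kmeans(pts, k=2, min_size=3))
```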



Proceedings Article
11 Sep 2001
TL;DR: An efficient algorithm is developed, which pushes constraints deep into the computation process, finding all gradient-probe cell pairs in one pass, and explores bi-directional pruning between probe cells and gradient cells, utilizing transformed measures and dimensions.
Abstract: Constrained gradient analysis (similar to the “cubegrade” problem posed by Imielinski et al. [9]) is to extract pairs of similar cell characteristics associated with big changes in measure in a data cube. Cells are considered similar if they are related by a roll-up, drill-down, or 1-dimensional mutation operation. Constrained gradient queries are expressive, capable of capturing trends in data and answering “what-if” questions. To facilitate our discussion, we call one cell in a gradient pair the probe cell and the other the gradient cell. An efficient algorithm is developed, which pushes constraints deep into the computation process, finding all gradient-probe cell pairs in one pass. It explores bi-directional pruning between probe cells and gradient cells, utilizing transformed measures and dimensions. Moreover, it adopts a hyper-tree structure and an H-cubing method to compress data and maximize the sharing of computation. Our performance study shows that this algorithm is efficient and scalable.
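A brute-force sketch of what a constrained gradient query returns: pairs of cells related by one roll-up whose average measure changes by at least a given ratio. The one-pass H-cubing evaluation and the bi-directional pruning are beyond this sketch, and the schema and thresholds are invented.

```python
from collections import defaultdict
from itertools import product

def cube_averages(rows, n_dims):
    """Average measure for every cell of the cube ('*' marks an aggregated
    dimension); the last field of each row is the measure."""
    agg = defaultdict(list)
    for row in rows:
        for mask in product([True, False], repeat=n_dims):
            key = tuple(v if keep else '*' for v, keep in zip(row[:-1], mask))
            agg[key].append(row[-1])
    return {k: sum(v) / len(v) for k, v in agg.items()}

def gradients(rows, n_dims, min_ratio):
    """Probe/gradient cell pairs related by one roll-up step whose average
    measure changes by at least min_ratio (in either direction)."""
    avg = cube_averages(rows, n_dims)
    pairs = []
    for probe, m in avg.items():
        for d in range(n_dims):
            if probe[d] == '*':
                continue
            parent = probe[:d] + ('*',) + probe[d + 1:]  # one roll-up step
            ratio = avg[parent] / m
            if ratio >= min_ratio or ratio <= 1 / min_ratio:
                pairs.append((probe, parent, round(ratio, 2)))
    return pairs

# (city, product, sales):
rows = [('tor', 'tv', 100), ('tor', 'pc', 900),
        ('van', 'tv', 120), ('van', 'pc', 150)]
print(gradients(rows, n_dims=2, min_ratio=2.0))
```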






Book
01 Jan 2001
TL;DR: Detailed descriptions of geographic data warehouses and data depositories, techniques for geographic data mining and knowledge discovery, and applications of geographic data mining and knowledge discovery are provided.
Abstract: Overview of geographic data mining and knowledge discovery. Geographic data warehouses and data depositories. Techniques for geographic data mining and knowledge discovery. Geographic knowledge discovery and geographic visualization. Applications of geographic data mining and knowledge discovery.

Book ChapterDOI
01 Jan 2001
TL;DR: This paper introduces an association-based spatial classification algorithm, called SPARC (SPatial Association Rule-based Classification), for efficient spatial classification in large geospatial databases and shows that SPARC is efficient for classification of spatial objects in large databases.
Abstract: Spatial classification classifies spatial objects based on the spatial and nonspatial features of these objects in a database. The classification results, taken as models for the data, can be used for better understanding of the relationships among the objects in the database and for prediction of characteristics and features of new objects. Spatial classification is a challenging task due to the sparsity of spatial features, which leads to high dimensionality and the "curse of dimensionality." In this paper, we introduce an association-based spatial classification algorithm, called SPARC (SPatial Association Rule-based Classification), for efficient spatial classification in large geospatial databases. SPARC explores spatial association-based classification and integrates a few important techniques developed in spatial indexing and data mining to achieve high scalability when classifying a large number of spatial data objects. These techniques include micro-clustering, spatial join indexing, feature reduction by frequent pattern mining, and association-based classification. Our performance study shows that SPARC is efficient for classification of spatial objects in large databases.
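A drastically simplified sketch of the flavor of such a pipeline (not SPARC itself): spatial predicates such as close_to:river are derived from object-to-feature distances, frequent single predicates per class act as rules, and the highest-confidence matching rule classifies. All names, data, and parameters here are invented for illustration.

```python
import math
from collections import Counter

def predicates(obj, features, radius):
    """Boolean spatial predicates for an object, e.g. 'close_to:river'
    (the predicate vocabulary is made up for this sketch)."""
    return {'close_to:' + kind for kind, pos in features
            if math.dist(obj, pos) <= radius}

def train(objects, labels, features, radius, min_support):
    """Keep, per class, the frequent single predicates as rules: a drastic
    simplification of frequent-pattern-based feature reduction."""
    rules = []
    for cls in set(labels):
        rows = [predicates(o, features, radius)
                for o, l in zip(objects, labels) if l == cls]
        counts = Counter(p for row in rows for p in row)
        rules += [(p, cls, c / len(rows))
                  for p, c in counts.items() if c / len(rows) >= min_support]
    return rules

def classify(obj, rules, features, radius):
    preds = predicates(obj, features, radius)
    matching = [(conf, cls) for p, cls, conf in rules if p in preds]
    return max(matching)[1] if matching else None

features = [('river', (0.0, 0.0)), ('highway', (10.0, 10.0))]
objects = [(0, 1), (1, 0), (9, 10), (10, 9)]
labels = ['wetland', 'wetland', 'commercial', 'commercial']
rules = train(objects, labels, features, radius=3.0, min_support=0.5)
print(classify((1.0, 1.0), rules, features, radius=3.0))  # 'wetland'
```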


