
Showing papers on "Knowledge extraction published in 2002"


Journal ArticleDOI
TL;DR: This paper compares the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, on six public domain data sets and provides evidence that Ant-Miner is competitive with CN2 with respect to predictive accuracy and that the rule lists it discovers are considerably simpler than those discovered by CN2.
Abstract: The paper proposes an algorithm for data mining called Ant-Miner (ant-colony-based data miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is inspired by both research on the behavior of real ant colonies and some data mining concepts as well as principles. We compare the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets. The results provide evidence that: 1) Ant-Miner is competitive with CN2 with respect to predictive accuracy, and 2) the rule lists discovered by Ant-Miner are considerably simpler (smaller) than those discovered by CN2.
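
For illustration only, here is a minimal Python sketch of an Ant-Miner-style sequential covering loop: artificial ants assemble rules from pheromone-weighted attribute-value terms, the best rule of each iteration is kept, and the cases it covers are removed. The heuristic function and rule pruning of the actual algorithm are omitted, and all names below are hypothetical.

```python
# Minimal sketch of an Ant-Miner-style rule discovery loop (illustrative only).
# Term selection is pheromone-weighted; the heuristic function and rule pruning
# described in the paper are omitted for brevity.
import random
from collections import Counter

def quality(rule, cases):
    """Sensitivity * specificity of a rule {'terms': {attr: value}, 'class': label}."""
    tp = fp = fn = tn = 0
    for c in cases:
        covers = all(c[a] == v for a, v in rule["terms"].items())
        positive = c["class"] == rule["class"]
        if covers and positive: tp += 1
        elif covers: fp += 1
        elif positive: fn += 1
        else: tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens * spec

def ant_miner(cases, attributes, n_ants=30, max_uncovered=5):
    rules, cases = [], list(cases)
    while len(cases) > max_uncovered:
        terms = [(a, v) for a in attributes for v in {c[a] for c in cases}]
        pher = {t: 1.0 for t in terms}            # uniform initial pheromone
        best, best_q = None, -1.0
        for _ in range(n_ants):
            picked, used = {}, set()
            for _ in range(len(attributes)):
                cand = [t for t in terms if t[0] not in used]
                a, v = random.choices(cand, weights=[pher[t] for t in cand])[0]
                picked[a] = v
                used.add(a)
                if random.random() < 0.5:         # crude rule-length stopping rule
                    break
            covered = [c for c in cases if all(c[a] == v for a, v in picked.items())]
            if not covered:
                continue
            label = Counter(c["class"] for c in covered).most_common(1)[0][0]
            rule = {"terms": picked, "class": label}
            q = quality(rule, cases)
            for t in picked.items():              # pheromone reinforcement
                pher[t] += pher[t] * q
            if q > best_q:
                best, best_q = rule, q
        if best is None:
            break
        rules.append(best)
        cases = [c for c in cases
                 if not all(c[a] == v for a, v in best["terms"].items())]
    return rules
```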

994 citations


Journal ArticleDOI
TL;DR: This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy.
Abstract: Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what are the key components of a discretization process, how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to solve and future research for discretization.
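
As a concrete illustration of the simplest end of the spectrum surveyed, the sketch below shows unsupervised equal-width and equal-frequency discretization with NumPy; the supervised, entropy-based methods discussed in the paper are not shown, and the example data is synthetic.

```python
# Illustrative unsupervised discretization: equal-width and equal-frequency
# binning of a continuous feature into k intervals.
import numpy as np

def equal_width_bins(x, k):
    """Cut the range of x into k intervals of equal width."""
    edges = np.linspace(np.min(x), np.max(x), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1), edges

def equal_frequency_bins(x, k):
    """Cut x into k intervals holding roughly the same number of values."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1), edges

if __name__ == "__main__":
    x = np.random.exponential(scale=2.0, size=1000)
    _, edges_w = equal_width_bins(x, 4)
    _, edges_f = equal_frequency_bins(x, 4)
    print("equal-width cut points:     ", np.round(edges_w, 2))
    print("equal-frequency cut points: ", np.round(edges_f, 2))
```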

981 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: Using a set of benchmark data from a KDD (knowledge discovery and data mining) competition designed by DARPA, it is demonstrated that efficient and accurate classifiers can be built to detect intrusions.
Abstract: Information security is an issue of serious global concern. The complexity, accessibility, and openness of the Internet have served to increase the security risk of information systems tremendously. This paper concerns intrusion detection. We describe approaches to intrusion detection using neural networks and support vector machines. The key ideas are to discover useful patterns or features that describe user behavior on a system, and use the set of relevant features to build classifiers that can recognize anomalies and known intrusions, hopefully in real time. Using a set of benchmark data from a KDD (knowledge discovery and data mining) competition designed by DARPA, we demonstrate that efficient and accurate classifiers can be built to detect intrusions. We compare the performance of neural networks based, and support vector machine based, systems for intrusion detection.
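
A hedged sketch of the comparison described above, using scikit-learn and synthetic data as a stand-in for the DARPA/KDD Cup connection features; the network architecture and SVM settings below are assumptions, not the paper's configuration.

```python
# Sketch of the comparison described above: an SVM and a small neural network
# trained on labeled connection records (normal vs. intrusion). Synthetic data
# stands in for the 41 KDD Cup features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=41, n_informative=12,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=500,
                                 random_state=0))

for name, model in [("SVM", svm), ("Neural net", nn)]:
    model.fit(X_tr, y_tr)
    print(f"{name} detection accuracy: {model.score(X_te, y_te):.3f}")
```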

779 citations


Journal ArticleDOI
TL;DR: A survey of the available literature on data mining using soft computing based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model is provided.
Abstract: The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.

630 citations


Book
21 Aug 2002
TL;DR: In this book, the authors integrate two areas of computer science, namely data mining and evolutionary algorithms, to discover comprehensible, interesting knowledge, which is potentially useful for intelligent decision making.
Abstract: From the Publisher: This book integrates two areas of computer science, namely data mining and evolutionary algorithms. Both these areas have become increasingly popular in the last few years, and their integration is currently an active research area. In general, data mining consists of extracting knowledge from data. In particular, in this book we emphasize the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making. In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions. In contrast, most rule induction methods perform a local, greedy search in the space of candidate rules. Intuitively, the global search of evolutionary algorithms can discover interesting rules and patterns that would be missed by the greedy search performed by most rule induction methods.
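
To make the contrast with greedy rule induction concrete, here is a toy, hypothetical genetic search over conjunctive classification rules; it illustrates the global-search idea only and is not an algorithm taken from the book.

```python
# Toy illustration of evolutionary rule induction: a population of conjunctive
# rules (attribute = value tests) is evolved globally, scored by how well each
# rule separates one class from the rest. Purely illustrative.
import random

ATTRS = ["outlook", "temp", "humidity", "windy"]
VALUES = {"outlook": ["sunny", "overcast", "rain"], "temp": ["hot", "mild", "cool"],
          "humidity": ["high", "normal"], "windy": ["true", "false"]}

def fitness(rule, data, target):
    covered = [c for c in data if all(c[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    precision = sum(c["class"] == target for c in covered) / len(covered)
    recall = sum(c["class"] == target for c in covered) / \
             max(1, sum(c["class"] == target for c in data))
    return precision * recall

def random_rule():
    attrs = random.sample(ATTRS, k=random.randint(1, 2))
    return {a: random.choice(VALUES[a]) for a in attrs}

def mutate(rule):
    child = dict(rule)
    a = random.choice(ATTRS)
    if a in child and len(child) > 1 and random.random() < 0.5:
        del child[a]                         # drop a condition
    else:
        child[a] = random.choice(VALUES[a])  # add or change a condition
    return child

def evolve(data, target, pop_size=40, generations=30):
    pop = [random_rule() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, data, target), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=lambda r: fitness(r, data, target))

if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": "false", "class": "no"},
        {"outlook": "overcast", "temp": "hot", "humidity": "high", "windy": "false", "class": "yes"},
        {"outlook": "rain", "temp": "mild", "humidity": "high", "windy": "false", "class": "yes"},
        {"outlook": "rain", "temp": "cool", "humidity": "normal", "windy": "true", "class": "no"},
        {"outlook": "sunny", "temp": "mild", "humidity": "normal", "windy": "true", "class": "yes"},
    ]
    print("best rule for class 'yes':", evolve(data, "yes"))
```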

608 citations


Book
15 Jun 2002
TL;DR: Part A: Data mining and knowledge discovery Part B: Fundamental Concepts Part C: The process of knowledge discovery in databases Part D: Discovery Systems Part E: Interdisciplinary links of KDD Part F: Business problems Part G: Industry sectors Part H: KDD in practice: case studies
Abstract: Part A: Data mining and knowledge discovery Part B: Fundamental Concepts Part C: The process of knowledge discovery in databases Part D: Discovery Systems Part E: Interdisciplinary links of KDD Part F: Business problems Part G: Industry sectors Part H: KDD in practice: case studies

502 citations


Journal ArticleDOI
01 Aug 2002
TL;DR: A new algorithm called TITANIC for computing (iceberg) concept lattices is presented, based on data mining techniques with a level-wise approach, and shows an important gain in efficiency, especially for weakly correlated data.
Abstract: We introduce the notion of iceberg concept lattices and show their use in knowledge discovery in databases. Iceberg lattices are a conceptual clustering method, which is well suited for analyzing very large databases. They also serve as a condensed representation of frequent itemsets, as starting point for computing bases of association rules, and as a visualization method for association rules. Iceberg concept lattices are based on the theory of Formal Concept Analysis, a mathematical theory with applications in data analysis, information retrieval, and knowledge discovery. We present a new algorithm called TITANIC for computing (iceberg) concept lattices. It is based on data mining techniques with a level-wise approach. In fact, TITANIC can be used for a more general problem: Computing arbitrary closure systems when the closure operator comes along with a so-called weight function. The use of weight functions for computing closure systems has not been discussed in the literature up to now. Applications providing such a weight function include association rule mining, functional dependencies in databases, conceptual clustering, and ontology engineering. The algorithm is experimentally evaluated and compared with Ganter's Next-Closure algorithm. The evaluation shows an important gain in efficiency, especially for weakly correlated data.
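
The level-wise idea can be illustrated with a small, assumption-laden sketch: itemsets are generated level by level, kept when their support (playing the role of the weight function) meets a threshold, and mapped to their closures. This is not the TITANIC algorithm itself, and the transactions below are invented.

```python
# Level-wise sketch in the spirit of iceberg concept lattices: keep itemsets
# whose support meets a threshold and record their closures (concept intents).
from itertools import combinations

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def closure(itemset, transactions):
    """All items common to every transaction containing the itemset."""
    matching = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*matching) if matching else itemset

def iceberg_concepts(transactions, min_support):
    items = frozenset.union(*transactions)
    level = [frozenset([i]) for i in items]
    concepts = {}
    while level:
        frequent = [s for s in level if support(s, transactions) >= min_support]
        for s in frequent:
            c = closure(s, transactions)
            concepts[c] = support(c, transactions)
        # candidate generation: join frequent k-sets that differ in one item
        level = list({a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == len(a) + 1})
    return concepts

tx = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                             ["b", "c"], ["a", "b", "c"])]
for concept, supp in sorted(iceberg_concepts(tx, 0.4).items(), key=lambda kv: -kv[1]):
    print(sorted(concept), round(supp, 2))
```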

494 citations


Journal ArticleDOI
TL;DR: The confluence of temporal databases and data mining is investigated, the work to date is surveyed, and the issues involved and the outstanding problems in temporal data mining are explored.
Abstract: With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining.

442 citations


Journal ArticleDOI
01 Jun 2002
TL;DR: The knowledge warehouse proposed here suggests a different direction for DSS in the next decade based on an expanded purpose of DSS, which suggests that the effectiveness of a DSS will, in the future, be measured based on how well it promotes and enhances knowledge.
Abstract: Decision support systems (DSS) are becoming increasingly more critical to the daily operation of organizations. Data warehousing, an integral part of this, provides an infrastructure that enables businesses to extract, cleanse, and store vast amounts of data. The basic purpose of a data warehouse is to empower the knowledge workers with information that allows them to make decisions based on a solid foundation of fact. However, only a fraction of the needed information exists on computers; the vast majority of a firm's intellectual assets exist as knowledge in the minds of its employees. What is needed is a new generation of knowledge-enabled systems that provides the infrastructure needed to capture, cleanse, store, organize, leverage, and disseminate not only data and information but also the knowledge of the firm. The purpose of this paper is to propose, as an extension to the data warehouse model, a knowledge warehouse (KW) architecture that will not only facilitate the capturing and coding of knowledge but also enhance the retrieval and sharing of knowledge across the organization. The knowledge warehouse proposed here suggests a different direction for DSS in the next decade. This new direction is based on an expanded purpose of DSS. That is, the purpose of DSS is knowledge improvement. This expanded purpose of DSS also suggests that the effectiveness of a DSS will, in the future, be measured based on how well it promotes and enhances knowledge, how well it improves the mental model(s) and understanding of the decision maker(s) and thereby how well it improves his/her decision making.

393 citations


Journal ArticleDOI
TL;DR: The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art.
Abstract: The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining as a separate field from data mining is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs)) is highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.

365 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
Abstract: Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
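
A minimal sketch of the machine-learning record linkage idea, with hypothetical fields, similarity measures, and training pairs: each candidate pair is turned into a comparison vector and a classifier labels it as match or non-match. The TAILOR models themselves are not reproduced.

```python
# Sketch of machine-learning record linkage: build a comparison vector for each
# candidate record pair and train a classifier to label pairs match / non-match.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def comparison_vector(r1, r2):
    return [sim(r1["name"], r2["name"]),
            sim(r1["address"], r2["address"]),
            1.0 if r1["zip"] == r2["zip"] else 0.0]

# Tiny labeled sample of record pairs (1 = same entity, 0 = different).
pairs = [
    (({"name": "John A. Smith", "address": "12 Oak St", "zip": "10001"},
      {"name": "Jon Smith", "address": "12 Oak Street", "zip": "10001"}), 1),
    (({"name": "John A. Smith", "address": "12 Oak St", "zip": "10001"},
      {"name": "Mary Jones", "address": "99 Pine Ave", "zip": "94110"}), 0),
    (({"name": "A. Gupta", "address": "5 Lake Rd", "zip": "60601"},
      {"name": "Anil Gupta", "address": "5 Lake Road", "zip": "60601"}), 1),
    (({"name": "A. Gupta", "address": "5 Lake Rd", "zip": "60601"},
      {"name": "A. Gupton", "address": "77 Hill Dr", "zip": "30303"}), 0),
]
X = [comparison_vector(r1, r2) for (r1, r2), _ in pairs]
y = [label for _, label in pairs]

clf = LogisticRegression().fit(X, y)
candidate = ({"name": "Jonathan Smith", "address": "12 Oak St.", "zip": "10001"},
             {"name": "John Smith", "address": "12 Oak Street", "zip": "10001"})
print("match probability:",
      round(clf.predict_proba([comparison_vector(*candidate)])[0, 1], 3))
```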

Book
06 Oct 2002
TL;DR: This book covers data mining fundamentals and techniques, tools for knowledge discovery, advanced methods such as neural networks and statistical techniques, and intelligent systems, including managing uncertainty in rule-based systems and integrating data mining, expert systems, and intelligent agents.
Abstract: (Each Chapter concludes with a Chapter Summary, Key Terms, and Exercises.) Preface. I. DATA MINING FUNDAMENTALS. 1. Data Mining: A First View. Data Mining: A Definition. What Can Computers Learn? Is Data Mining Appropriate for my Problem? Expert Systems or Data Mining? A Simple Data Mining Process Model. Why not Simple Search? Data Mining Applications. 2. Data Mining: A Closer Look. Data Mining Strategies. Supervised Data Mining Techniques. Association Rules. Clustering Techniques. Evaluating Performance. 3. Basic Data Mining Techniques. Decision Trees. Generating Association Rules. The K-Means Algorithm. Genetic Learning. Choosing a Data Mining Technique. 4. An Excel-Based Data Mining Tool. The iData Analyzer. ESX: A Multipurpose Tool for Data Mining. iDAV Format for Data Mining. A Five-Step Approach for Unsupervised Clustering. A Six-Step Approach for Supervised Learning. Techniques for Generating Rules. Instance Typicality. Special Considerations and Features. II. TOOLS FOR KNOWLEDGE DISCOVERY. 5. Knowledge Discovery in Databases. A KDD Process Model. Step 1: Goal Identification. Step 2: Creating a Target Data Set. Step 3: Data Preprocessing. Step 4: Data Transformation. Step 5: Data Mining. Step 6: Interpretation and Evaluation. Step 7: Taking Action. The CRISP-DM Process Model. Experimenting with ESX. 6. The Data Warehouse. Operational Databases. Data Warehouse Design. On-line Analytical Processing (OLAP). Excel Pivot Tables for Data Analysis. 7. Formal Evaluation Techniques. What Should be Evaluated? Tools for Evaluation. Computing Test Set Confidence Intervals. Comparing Supervised Learner Models. Attribute Evaluation. Unsupervised Evaluation Techniques. Evaluating Supervised Models with Numeric Output. III. ADVANCED DATA MINING TECHNIQUES. 8. Neural Networks. Feed-Forward Neural Networks. Neural Network Training: A Conceptual View. Neural Network Explanation. General Considerations. Neural Network Learning: A Detailed View. 9. Building Neural Networks with iDA. A Four-Step Approach for Backpropagation Learning. A Four-Step Approach for Neural Network Clustering. ESX for Neural Network Cluster Analysis. 10. Statistical Techniques. Linear Regression Analysis. Logistic Regression. Bayes Classifier. Clustering Algorithms. Heuristics or Statistics? 11. Specialized Techniques. Time-Series Analysis. Mining the Web. Mining Textual Data. Improving Performance. IV. INTELLIGENT SYSTEMS. 12. Rule-Based Systems. Exploring Artificial Intelligence. Problem Solving as a State Space Search. Expert Systems. Structuring a Rule-Based System. 13. Managing Uncertainty in Rule-Based Systems. Uncertainty: Sources and Solutions. Fuzzy Rule-Based Systems. A Probability-Based Approach to Uncertainty. 14. Intelligent Agents. Characteristics of Intelligent Agents. Types of Agents. Integrating Data Mining, Expert Systems, and Intelligent Agents. Appendix. Appendix A: Software Installation. Appendix B: Datasets for Data Mining. Appendix C: Decision Tree Attribute Selection. Appendix D: Statistics for Performance Evaluation. Appendix E: Excel 97 Pivot Tables. Bibliography.

01 Jan 2002
TL;DR: This analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
Abstract: Data quality is a major concern in Machine Learning and other correlated areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence of missing data, many Machine Learning algorithms handle missing data in a rather naive way. Missing data treatment should be carefully thought, otherwise bias might be introduced into the knowledge induced. In this work, we analyse the use of the k-nearest neighbour as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set by some plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
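
A short sketch of the imputation idea, using scikit-learn's KNNImputer as a stand-in for the paper's own k-nearest neighbour procedure (the paper predates this library, and the data below is invented): each missing entry is replaced by a distance-weighted mean of that feature over the k most similar complete cases.

```python
# k-nearest-neighbour imputation sketch: fill missing values from the k most
# similar rows. KNNImputer stands in for the procedure analysed in the paper.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0,    3.0],
              [1.1, np.nan, 3.2],
              [8.0, 9.0,    7.5],
              [7.9, 9.2,    np.nan]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```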

Proceedings ArticleDOI
24 Feb 2002
TL;DR: New metrics are introduced in order to demonstrate how security issues can be taken into consideration in the general framework of association rule mining, and it is shown that the complexity of the new heuristics is similar to that of the original algorithms.
Abstract: The current trend in the application space towards systems of loosely coupled and dynamically bound components that enables just-in-time integration jeopardizes the security of information that is shared between the broker, the requester, and the provider at runtime. In particular, new advances in data mining and knowledge discovery that allow for the extraction of hidden knowledge in an enormous amount of data, impose new threats on the seamless integration of information. We consider the problem of building privacy preserving algorithms for one category of data mining techniques, association rule mining. We introduce new metrics in order to demonstrate how security issues can be taken into consideration in the general framework of association rule mining, and we show that the complexity of the new heuristics is similar to that of the original algorithms.
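
For intuition only, here is a naive sanitization heuristic that hides a sensitive itemset by deleting one of its items from enough supporting transactions to push its support below the mining threshold; the metrics and heuristics proposed in the paper are not reproduced, and the transaction data is invented.

```python
# Naive illustration of association-rule sanitization: lower the support of a
# "sensitive" itemset below the mining threshold by deleting one of its items
# from just enough supporting transactions.
import random

def support_count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def sanitize(transactions, sensitive, min_support):
    """Hide `sensitive` (a frozenset of items) w.r.t. an absolute support threshold."""
    tx = [set(t) for t in transactions]
    supporting = [t for t in tx if sensitive <= t]
    to_modify = support_count(sensitive, tx) - (min_support - 1)
    for t in random.sample(supporting, max(0, to_modify)):
        t.discard(random.choice(sorted(sensitive)))   # drop one sensitive item
    return tx

db = [{"bread", "milk", "beer"}, {"bread", "beer"}, {"milk", "beer", "bread"},
      {"bread", "milk"}, {"beer", "diapers", "bread"}]
hidden = sanitize(db, frozenset({"bread", "beer"}), min_support=2)
print("support after sanitization:",
      support_count(frozenset({"bread", "beer"}), hidden))
```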

Proceedings ArticleDOI
09 Dec 2002
TL;DR: This paper carefully motivate, then introduces, a nontrivial definition of time series motifs, and proposes an efficient algorithm to discover them, and demonstrates the utility and efficiency of the approach on several real world datasets.
Abstract: The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs", because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this paper we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.
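
A brute-force sketch of the motif idea (not the paper's efficient algorithm): z-normalize sliding windows and report the subsequence with the most non-trivial matches within a distance threshold. The window length and threshold are arbitrary assumptions, and the series is synthetic.

```python
# Brute-force sketch of time-series motif discovery: count, for each
# z-normalized window, how many non-overlapping windows lie within distance r.
import numpy as np

def znorm(w):
    s = w.std()
    return (w - w.mean()) / s if s > 1e-8 else w - w.mean()

def find_motif(ts, window, r):
    ts = np.asarray(ts, dtype=float)
    subs = np.array([znorm(ts[i:i + window]) for i in range(len(ts) - window + 1)])
    best_idx, best_count = 0, -1
    for i in range(len(subs)):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # ignore trivial matches: neighbours overlapping the candidate itself
        nontrivial = [j for j in np.where(dists <= r)[0] if abs(j - i) >= window]
        if len(nontrivial) > best_count:
            best_idx, best_count = i, len(nontrivial)
    return best_idx, best_count

rng = np.random.default_rng(0)
t = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.2 * rng.standard_normal(1000)
idx, count = find_motif(t, window=50, r=3.0)
print(f"motif candidate starts at {idx} with {count} non-trivial matches")
```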

Patent
11 Mar 2002
TL;DR: In this paper, the authors proposed a method for constructing segmentation-based predictive models, such as decision-tree classifiers, where data records are partitioned into a plurality of segments and separate predictive models are constructed for each segment.
Abstract: The present invention generally relates to computer databases and, more particularly, to data mining and knowledge discovery. The invention specifically relates to a method for constructing segmentation-based predictive models, such as decision-tree classifiers, wherein data records are partitioned into a plurality of segments and separate predictive models are constructed for each segment. The present invention contemplates a computerized method for automatically building segmentation-based predictive models that substantially improves upon the modeling capabilities of decision trees and related technologies, and that automatically produces models that are competitive with, if not better than, those produced by data analysts and applied statisticians using traditional, labor-intensive statistical techniques. The invention achieves these properties by performing segmentation and multivariate statistical modeling within each segment simultaneously. Segments are constructed so as to maximize the accuracies of the predictive models within each segment. Simultaneously, the multivariate statistical models within each segment are refined so as to maximize their respective predictive accuracies.
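
The general idea can be sketched as follows, with a fixed threshold standing in for the patent's simultaneous segmentation and model refinement: records are partitioned into segments and a separate multivariate model is fitted within each. The data, threshold, and model choice are all illustrative assumptions.

```python
# Sketch of a segmentation-based predictive model: partition records into
# segments and fit a separate regression model within each segment.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 2))
# Different linear behaviour on each side of x0 = 5 (two latent segments).
y = np.where(X[:, 0] < 5, 2 * X[:, 1] + 1, -3 * X[:, 1] + 20) + rng.normal(0, 0.5, 500)

def segment_of(x):
    return 0 if x[0] < 5 else 1            # illustrative segmentation rule

models = {}
for seg in (0, 1):
    mask = np.array([segment_of(x) == seg for x in X])
    models[seg] = LinearRegression().fit(X[mask], y[mask])

def predict(x):
    return models[segment_of(x)].predict(np.asarray(x).reshape(1, -1))[0]

print(round(predict([2.0, 3.0]), 2), round(predict([8.0, 3.0]), 2))
```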

Journal ArticleDOI
TL;DR: The approach of this work is to consider the intended reuse and level of application for knowledge in order to determine the requirements for its acquisition, and an overall framework for the requirements of capturing, storing and reusing information and knowledge in engineering design is generated.

Journal ArticleDOI
01 Jun 2002
TL;DR: An integrative framework is presented for building enterprise decision support environments using model marts and model warehouses as repositories for knowledge obtained through various conversions.
Abstract: Decision support and knowledge management processes are interdependent activities in many organizations. In this paper, we propose an approach for integrating decision support and knowledge management processes using knowledge discovery techniques. Based on the proposed approach, an integrative framework is presented for building enterprise decision support environments using model marts and model warehouses as repositories for knowledge obtained through various conversions. This framework is expected to guide further research on the development of the next generation decision support environments.

BookDOI
01 Jan 2002
TL;DR: This book discusses Granular Computing in Data Mining, Granular computing with Closeness and Negligibility Relations, and the application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty.
Abstract: 1: Granular Computing - A New Paradigm.- Some Reflections on Information Granulation and its Centrality in Granular Computing, Computing with Words, the Computational Theory of Perceptions and Precisiated Natural Language.- 2: Granular Computing in Data Mining.- Data Mining Using Granular Computing: Fast Algorithms for Finding Association Rules.- Knowledge Discovery with Words Using Cartesian Granule Features: An Analysis for Classification Problems.- Validation of Concept Representation with Rule Induction and Linguistic Variables.- Granular Computing Using Information Tables.- A Query-Driven Interesting Rule Discovery Using Association and Spanning Operations.- 3: Data Mining.- An Interactive Visualization System for Mining Association Rules.- Algorithms for Mining System Audit Data.- Scoring and Ranking the Data Using Association Rules.- Finding Unexpected Patterns in Data.- Discovery of Approximate Knowledge in Medical Databases Based on Rough Set Model.- 4: Granular Computing.- Observability and the Case of Probability.- Granulation and Granularity via Conceptual Structures: A Perspective From the Point of View of Fuzzy Concept Lattices.- Granular Computing with Closeness and Negligibility Relations.- Application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty.- Basic Issues of Computing with Granular Probabilities.- Multi-dimensional Aggregation of Fuzzy Numbers Through the Extension Principle.- On Optimal Fuzzy Information Granulation.- Ordinal Decision Making with a Notion of Acceptable: Denoted Ordinal Scales.- A Framework for Building Intelligent Information-Processing Systems Based on Granular Factor Space.- 5: Rough Sets and Granular Computing.- GRS: A Generalized Rough Sets Model.- Structure of Upper and Lower Approximation Spaces of Infinite Sets.- Indexed Rough Approximations, A Polymodal System, and Generalized Possibility Measures.- Granularity, Multi-valued Logic, Bayes' Theorem and Rough Sets.- The Generic Rough Set Inductive Logic Programming (gRS-ILP) Model.- Possibilistic Data Analysis and Its Similarity to Rough Sets.

Journal ArticleDOI
TL;DR: Data mining and knowledge discovery attempts to turn raw data into nuggets and create special edges in this ever competitive world for science discovery and business intelligence.
Abstract: The digital technologies and computer advances with the booming internet uses have led to massive data collection (corporate data, data warehouses, webs, just to name a few) and information (or misinformation) explosion. Szalay and Gray described this phenomenon as "drowning in data" (Szalay and Gray, 1999). They reported that each year the detectors at the CERN particle collider in Switzerland record 1 petabyte of data; and researchers in areas of science from astronomy to the human genome are facing the same problems and choking on information. A very natural question is "now that we have gathered so much data, what do we do with it?" Raw data is rarely of direct use and manual analysis simply cannot keep pace with the fast growth of data. Data mining and knowledge discovery (KDD), as a new emerging field comprising disciplines such as databases, statistics, machine learning, comes to the rescue. KDD attempts to turn raw data into nuggets and create special edges in this ever competitive world for science discovery and business intelligence. The KDD process is defined in Fayyad et al. (1996) as

Journal ArticleDOI
TL;DR: The proposed model provides a new way to model and manage teamwork processes and a reference model for coordinating the knowledge flow process with the workflow process is suggested to provide an integrated approach to model teamwork process.
Abstract: To realize effective knowledge sharing in teamwork, this paper proposes a knowledge flow model for peer-to-peer knowledge sharing and management in cooperative teams. The model consists of the concepts, rules and methods about the knowledge flow, the knowledge flow process model, and the knowledge flow engine. A reference model for coordinating the knowledge flow process with the workflow process is suggested to provide an integrated approach to model teamwork process. We also discuss the peer-to-peer knowledge-sharing paradigm in large-scale teams and propose the approach for constructing a knowledge flow network from the corresponding workflow. The proposed model provides a new way to model and manage teamwork processes.

Journal ArticleDOI
TL;DR: The extended approach proposed in the paper outperformed the standard approach on some benchmark problems at a statistically significant level, and classifiers induced using the representation enriched by the GP-constructed features provided better classification accuracy on the test set.
Abstract: In this paper we use genetic programming for changing the representation of the input data for machine learners. In particular, the topic of interest here is feature construction in the learning-from-examples paradigm, where new features are built based on the original set of attributes. The paper first introduces the general framework for GP-based feature construction. Then, an extended approach is proposed where the useful components of representation (features) are preserved during an evolutionary run, as opposed to the standard approach where valuable features are often lost during search. Finally, we present and discuss the results of an extensive computational experiment carried out on several reference data sets. The outcomes show that classifiers induced using the representation enriched by the GP-constructed features provide better accuracy of classification on the test set. In particular, the extended approach proposed in the paper proved to be able to outperform the standard approach on some benchmark problems on a statistically significant level.
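
A toy illustration of feature construction under strong simplifications: candidate features are random arithmetic combinations of the original attributes, scored by the cross-validated accuracy they add to a base classifier. The paper's full genetic-programming search (tree crossover, preservation of useful subexpressions) is not shown, and the dataset and classifier are arbitrary stand-ins.

```python
# Toy feature construction: score random arithmetic combinations of attributes
# by the cross-validated accuracy they add to a decision tree; keep the best.
import random
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ops = {"+": np.add, "-": np.subtract, "*": np.multiply}

def random_feature(n_attrs, rng):
    i, j = rng.sample(range(n_attrs), 2)
    return (rng.choice(list(ops)), i, j)

def apply_feature(feat, X):
    op, i, j = feat
    return ops[op](X[:, i], X[:, j]).reshape(-1, 1)

def score(X_aug, y):
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X_aug, y, cv=5).mean()

rng = random.Random(0)
baseline = score(X, y)
best_feat, best_score = None, baseline
for _ in range(30):                      # "population" of candidate features
    feat = random_feature(X.shape[1], rng)
    s = score(np.hstack([X, apply_feature(feat, X)]), y)
    if s > best_score:
        best_feat, best_score = feat, s

print(f"baseline accuracy: {baseline:.3f}")
print(f"best constructed feature {best_feat} -> accuracy {best_score:.3f}")
```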

Patent
17 Jun 2002
TL;DR: In this paper, the authors propose a data mining platform consisting of a plurality of system modules, each formed from plurality of components, each consisting of an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit.
Abstract: The data mining platform comprises a plurality of system modules, each formed from a plurality of components. Each module has an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit and to provide communication to other units. Each module processes a different type of data, for example, a first module processes microarray (gene expression) data while a second module processes biomedical literature on the Internet for information supporting relationships between genes and diseases and gene functionality. In the preferred embodiment, the data analysis engine is a kernel-based learning machine, and in particular, one or more support vector machines (SVMs). The data analysis engine includes a pre-processing function for feature selection, for reducing the amount of data to be processed by selecting the optimum number of attributes, or “features”, relevant to the information to be discovered.
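
A sketch of the analysis-engine idea only, with scikit-learn's recursive feature elimination and SVC used as stand-ins: select a subset of informative features, then classify with a support vector machine. The platform's modules, web servers, and data sources are not modeled, and the data is synthetic.

```python
# Feature selection followed by an SVM classifier, as a stand-in for the
# pre-processing + kernel-learning-machine engine described above.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=200, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear"), n_features_to_select=15, step=10),
    SVC(kernel="rbf"),
)
pipeline.fit(X_tr, y_tr)
print("accuracy with selected features:", round(pipeline.score(X_te, y_te), 3))
```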

Journal ArticleDOI
TL;DR: The model organizes knowledge in a three-dimensional knowledge space, and provides a knowledge grid operation language, KGOL, which enables people to conveniently share knowledge with each other when they work on the Internet.
Abstract: This paper proposes a knowledge grid model for sharing and managing globally distributed knowledge resources. The model organizes knowledge in a three-dimensional knowledge space, and provides a knowledge grid operation language, KGOL. Internet users can use the KGOL to create their knowledge grids, to put knowledge to them, to edit knowledge, to partially or wholly open their grids to all or some particular grids, and to get the required knowledge from the open knowledge of all the knowledge grids. The model enables people to conveniently share knowledge with each other when they work on the Internet. A software platform based on the proposed model has been implemented and used for knowledge sharing in research teams.
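
As a purely illustrative reading of the operations described (create, put, edit, open, get), here is a tiny in-memory class with hypothetical method names and addressing; it is not the actual KGOL syntax or implementation.

```python
# Illustrative in-memory sketch of knowledge-grid operations. The
# three-dimensional addressing and method names are hypothetical.
class KnowledgeGrid:
    def __init__(self, owner):
        self.owner = owner
        self.cells = {}          # (dim1, dim2, dim3) -> knowledge content
        self.open_to = set()     # users allowed to read this grid

    def put(self, coord, content):
        self.cells[coord] = content

    def edit(self, coord, content):
        if coord not in self.cells:
            raise KeyError(f"no knowledge at {coord}")
        self.cells[coord] = content

    def open_grid(self, other_user):
        self.open_to.add(other_user)

    def get(self, coord, requester):
        if requester != self.owner and requester not in self.open_to:
            raise PermissionError(f"{requester} has no access to {self.owner}'s grid")
        return self.cells.get(coord)

alice = KnowledgeGrid("alice")
alice.put(("data-mining", "clustering", "survey"), "Notes on k-means variants")
alice.open_grid("bob")
print(alice.get(("data-mining", "clustering", "survey"), requester="bob"))
```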

Journal ArticleDOI
TL;DR: Knowledge discovery in databases and data mining are tools that allow identification of valid, useful, and previously unknown patterns so that the construction manager may analyze the large amount of construction project data.
Abstract: As the construction industry is adapting to new computer technologies in terms of hardware and software, computerized construction data are becoming increasingly available. The explosive growth of many business, government, and scientific databases has begun to far outpace our ability to interpret and digest the data. Such volumes of data clearly overwhelm the traditional methods of data analysis such as spreadsheets and ad-hoc queries. The traditional methods can create informative reports from data, but cannot analyze the contents of those reports. A significant need exists for a new generation of techniques and tools with the ability to automatically assist humans in analyzing the mountains of data for useful knowledge. Knowledge discovery in databases (KDD) and data mining (DM) are tools that allow identification of valid, useful, and previously unknown patterns so that the construction manager may analyze the large amount of construction project data. These technologies combine techniques from machin...

Proceedings ArticleDOI
07 Aug 2002
TL;DR: A parallel granular neural network (GNN) is developed to speed up data mining and knowledge discovery process for credit card fraud detection and gives fewer average training errors with larger amount of past training data.
Abstract: A parallel granular neural network (GNN) is developed to speed up data mining and knowledge discovery process for credit card fraud detection. The entire system is parallelized on the Silicon Graphics Origin 2000, which is a shared memory multiprocessor system consisting of 24-CPU, 4G main memory, and 200 GB hard-drive. In simulations, the parallel fuzzy neural network running on a 24-processor system is trained in parallel using training data sets, and then the trained parallel fuzzy neural network discovers fuzzy rules for future prediction. A parallel learning algorithm is implemented in C. The data are extracted into a flat file from an SQL server database containing sample Visa Card transactions and then preprocessed for applying in fraud detection. The data are classified into three categories: first for training, second for prediction, and third for fraud detection. After learning from training data, the GNN is used to predict on a second set of data and later the third set of data is applied for fraud detection. GNN gives fewer average training errors with larger amount of past training data. The higher the fraud detection error is, the greater the possibility of that transaction being actually fraudulent.
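
A single-process sketch of the workflow only, with synthetic data and a plain scikit-learn network standing in for the granular neural network: the transactions are split into training, prediction, and detection sets as described. The fuzzy-rule extraction and the 24-CPU parallel training are not reproduced.

```python
# Workflow sketch: three-way data split (training / prediction / detection)
# and a small neural network, standing in for the parallel GNN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for preprocessed card-transaction features (1 = fraud).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
train, predict_, detect = np.split(np.arange(len(X)), [1800, 2400])

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16,), max_iter=600,
                                    random_state=0))
model.fit(X[train], y[train])
print("prediction-set accuracy:", round(model.score(X[predict_], y[predict_]), 3))
flagged = np.where(model.predict(X[detect]) == 1)[0]
print(f"{len(flagged)} transactions flagged as suspicious in the detection set")
```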

Journal ArticleDOI
TL;DR: This paper proposes a new methodology to identify the customers whose decisions will be positively influenced by campaigns, which is easy to implement and can be used in conjunction with most commonly used supervised learning algorithms.
Abstract: In database marketing, data mining has been used extensively to find the optimal customer targets so as to maximize return on investment. In particular, using marketing campaign data, models are typically developed to identify characteristics of customers who are most likely to respond. While these models are helpful in identifying the likely responders, they may be targeting customers who have decided to take the desirable action or not regardless of whether they receive the campaign contact (e.g. mail, call). Based on many years of business experience, we identify the appropriate business objective and its associated mathematical objective function. We point out that the current approach is not directly designed to solve the appropriate business objective. We then propose a new methodology to identify the customers whose decisions will be positively influenced by campaigns. The proposed methodology is easy to implement and can be used in conjunction with most commonly used supervised learning algorithms. An example using simulated data is used to illustrate the proposed methodology. This paper may provide the database marketing industry with a simple but significant methodological improvement and open a new area for further research and development.
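
The paper proposes its own methodology; as a generic illustration of the underlying objective, the sketch below uses the common "two-model" uplift approach, scoring each customer by the difference between the estimated response probability if contacted and if not contacted. All data, names, and model choices are synthetic assumptions, not the authors' method.

```python
# Generic "two-model" uplift sketch: target customers with the largest
# estimated P(respond | contacted) - P(respond | not contacted).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 6))                      # customer attributes
contacted = rng.integers(0, 2, size=n)           # 1 = received the campaign
# Synthetic responses: some customers respond only when contacted.
persuadable = X[:, 0] > 0.5
base = (X[:, 1] > 1.0).astype(int)
responded = np.maximum(base, (persuadable & (contacted == 1)).astype(int))

m_treat = GradientBoostingClassifier().fit(X[contacted == 1], responded[contacted == 1])
m_control = GradientBoostingClassifier().fit(X[contacted == 0], responded[contacted == 0])

uplift = m_treat.predict_proba(X)[:, 1] - m_control.predict_proba(X)[:, 1]
targets = np.argsort(-uplift)[:500]              # top 500 customers to contact
print("mean estimated uplift of targeted group:", round(uplift[targets].mean(), 3))
```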

Proceedings Article
23 Jul 2002
TL;DR: The KDD 2002 conference, held from 23rd to 26th July 2002, was the eighth in the series and represented a return to the country in which the series was launched: the first was held in Montreal, Canada, and this, the eighth, was held in Edmonton, Canada.
Abstract: The KDD 2002 conference, held from 23rd to 26th July 2002, was the eighth in the series. It represented a return to the country in which the series was launched: the first was held in Montreal, Canada, and this, the eighth, was held in Edmonton, Canada. In the years between the first conference in the series and this present one, data mining has become a well-established discipline. It has continued to strengthen its links to other data analytic disciplines, including statistics, machine learning, pattern recognition, visualization, and database technology, but has now clearly carved out a niche of its own. Over the period in which this series has been running, hardware technology has continued to advance in great leaps, with the result that large databases have continued to grow in both number and size. The implication is that the challenge of data mining is even more important, that the problems requiring data mining solutions are ever more ubiquitous, and that new tools and methods for tackling them are even more necessary. KDD 2002 received a record number of submitted papers - 307 in total, 37 of which were considered for the industrial/application track. Among the 270 research submissions, 32 were selected (12%) for full papers; and among the 37 industrial/application submissions, 12 (32%) were selected for full papers. An additional 44 submissions were chosen to be presented as posters, a vast majority of which were research submissions. This low rate of acceptance reflects a conscious effort to maintain the very high standards of quality and relevance which have been achieved by previous conferences in the series. It means that the papers and posters in the proceedings represent the cutting edge of data mining problems, solutions, and technology. On the other hand, this policy inevitably meant that many excellent contributions did not make it to the final program. The choice had to be informed by balance as well as quality - KDD 2002 had to showcase research in data mining across the entire frontier of the discipline. This breadth was reflected in the choice of invited speakers, both well known in the data mining community, but from different backgrounds: Daryl Pregibon and Geoff Hinton. The program also includes 6 workshops in such diverse areas as 'Data Mining in Bioinformatics', 'Web Mining', 'Multimedia Data Mining', 'Multi-Relational Data Mining', 'Temporal Data Mining', and 'Fractals in Data Mining' as well as 6 tutorials on 'Text Mining for Bioinformatics', 'Querying and Mining Data Streams', 'Link Analysis', 'Multivariate Density Estimation', 'Common Reasons Data Mining Projects Fail', and 'Visual Data Mining'.

Journal ArticleDOI
TL;DR: The present article describes the range of text mining techniques that have been applied to scientific documents and divides 'automated reading' into four general subtasks: text categorization, named entity tagging, fact extraction, and collection-wide analysis.

Book
01 May 2002
TL;DR: This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework.
Abstract: This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All of these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. The book aims at researchers and advanced professionals in time series modelling, empirical data modelling, knowledge discovery, data mining and data fusion.