
Showing papers on "Knowledge extraction published in 2002"


Journal ArticleDOI
TL;DR: This paper compares the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, on six public domain data sets and provides evidence that Ant-Miner is competitive with CN2 with respect to predictive accuracy and that the rule lists it discovers are considerably simpler than those discovered by CN2.
Abstract: The paper proposes an algorithm for data mining called Ant-Miner (ant-colony-based data miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is inspired by both research on the behavior of real ant colonies and some data mining concepts as well as principles. We compare the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets. The results provide evidence that: 1) Ant-Miner is competitive with CN2 with respect to predictive accuracy, and 2) the rule lists discovered by Ant-Miner are considerably simpler (smaller) than those discovered by CN2.
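
For illustration only, here is a minimal Python sketch of an Ant-Miner-style sequential covering loop: artificial ants assemble rules from pheromone-weighted attribute-value terms, the best rule of each iteration is kept, and the cases it covers are removed. The heuristic function and rule pruning of the actual algorithm are omitted, and all names below are hypothetical.

```python
# Minimal sketch of an Ant-Miner-style rule discovery loop (illustrative only).
# Term selection is pheromone-weighted; the heuristic function and rule pruning
# described in the paper are omitted for brevity.
import random
from collections import Counter

def quality(rule, cases):
    """Sensitivity * specificity of a rule {'terms': {attr: value}, 'class': label}."""
    tp = fp = fn = tn = 0
    for c in cases:
        covers = all(c[a] == v for a, v in rule["terms"].items())
        positive = c["class"] == rule["class"]
        if covers and positive: tp += 1
        elif covers: fp += 1
        elif positive: fn += 1
        else: tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens * spec

def ant_miner(cases, attributes, n_ants=30, max_uncovered=5):
    rules, cases = [], list(cases)
    while len(cases) > max_uncovered:
        terms = [(a, v) for a in attributes for v in {c[a] for c in cases}]
        pher = {t: 1.0 for t in terms}            # uniform initial pheromone
        best, best_q = None, -1.0
        for _ in range(n_ants):
            picked, used = {}, set()
            for _ in range(len(attributes)):
                cand = [t for t in terms if t[0] not in used]
                a, v = random.choices(cand, weights=[pher[t] for t in cand])[0]
                picked[a] = v
                used.add(a)
                if random.random() < 0.5:         # crude rule-length stopping rule
                    break
            covered = [c for c in cases if all(c[a] == v for a, v in picked.items())]
            if not covered:
                continue
            label = Counter(c["class"] for c in covered).most_common(1)[0][0]
            rule = {"terms": picked, "class": label}
            q = quality(rule, cases)
            for t in picked.items():              # pheromone reinforcement
                pher[t] += pher[t] * q
            if q > best_q:
                best, best_q = rule, q
        if best is None:
            break
        rules.append(best)
        cases = [c for c in cases
                 if not all(c[a] == v for a, v in best["terms"].items())]
    return rules
```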

994 citations


Journal ArticleDOI
TL;DR: This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy.
Abstract: Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what are the key components of a discretization process, how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to solve and future research for discretization.
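
As a concrete illustration of the simplest end of the spectrum surveyed, the sketch below shows unsupervised equal-width and equal-frequency discretization with NumPy; the supervised, entropy-based methods discussed in the paper are not shown, and the example data is synthetic.

```python
# Illustrative unsupervised discretization: equal-width and equal-frequency
# binning of a continuous feature into k intervals.
import numpy as np

def equal_width_bins(x, k):
    """Cut the range of x into k intervals of equal width."""
    edges = np.linspace(np.min(x), np.max(x), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1), edges

def equal_frequency_bins(x, k):
    """Cut x into k intervals holding roughly the same number of values."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1), edges

if __name__ == "__main__":
    x = np.random.exponential(scale=2.0, size=1000)
    _, edges_w = equal_width_bins(x, 4)
    _, edges_f = equal_frequency_bins(x, 4)
    print("equal-width cut points:     ", np.round(edges_w, 2))
    print("equal-frequency cut points: ", np.round(edges_f, 2))
```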

981 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: Using a set of benchmark data from a KDD (knowledge discovery and data mining) competition designed by DARPA, it is demonstrated that efficient and accurate classifiers can be built to detect intrusions.
Abstract: Information security is an issue of serious global concern. The complexity, accessibility, and openness of the Internet have served to increase the security risk of information systems tremendously. This paper concerns intrusion detection. We describe approaches to intrusion detection using neural networks and support vector machines. The key ideas are to discover useful patterns or features that describe user behavior on a system, and use the set of relevant features to build classifiers that can recognize anomalies and known intrusions, hopefully in real time. Using a set of benchmark data from a KDD (knowledge discovery and data mining) competition designed by DARPA, we demonstrate that efficient and accurate classifiers can be built to detect intrusions. We compare the performance of neural networks based, and support vector machine based, systems for intrusion detection.
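
A hedged sketch of the comparison described above, using scikit-learn and synthetic data as a stand-in for the DARPA/KDD Cup connection features; the network architecture and SVM settings below are assumptions, not the paper's configuration.

```python
# Sketch of the comparison described above: an SVM and a small neural network
# trained on labeled connection records (normal vs. intrusion). Synthetic data
# stands in for the 41 KDD Cup features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=41, n_informative=12,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=500,
                                 random_state=0))

for name, model in [("SVM", svm), ("Neural net", nn)]:
    model.fit(X_tr, y_tr)
    print(f"{name} detection accuracy: {model.score(X_te, y_te):.3f}")
```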

779 citations


Journal ArticleDOI
TL;DR: A survey of the available literature on data mining using soft computing based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model is provided.
Abstract: The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.

630 citations


Book
21 Aug 2002
TL;DR: In this book, the authors integrate two areas of computer science, namely data mining and evolutionary algorithms, to discover comprehensible, interesting knowledge, which is potentially useful for intelligent decision making.
Abstract: From the Publisher: This book integrates two areas of computer science, namely data mining and evolutionary algorithms. Both these areas have become increasingly popular in the last few years, and their integration is currently an active research area. In general, data mining consists of extracting knowledge from data. In particular, in this book we emphasize the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making. In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions. In contrast, most rule induction methods perform a local, greedy search in the space of candidate rules. Intuitively, the global search of evolutionary algorithms can discover interesting rules and patterns that would be missed by the greedy search performed by most rule induction methods.
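
To make the contrast with greedy rule induction concrete, here is a toy, hypothetical genetic search over conjunctive classification rules; it illustrates the global-search idea only and is not an algorithm taken from the book.

```python
# Toy illustration of evolutionary rule induction: a population of conjunctive
# rules (attribute = value tests) is evolved globally, scored by how well each
# rule separates one class from the rest. Purely illustrative.
import random

ATTRS = ["outlook", "temp", "humidity", "windy"]
VALUES = {"outlook": ["sunny", "overcast", "rain"], "temp": ["hot", "mild", "cool"],
          "humidity": ["high", "normal"], "windy": ["true", "false"]}

def fitness(rule, data, target):
    covered = [c for c in data if all(c[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    precision = sum(c["class"] == target for c in covered) / len(covered)
    recall = sum(c["class"] == target for c in covered) / \
             max(1, sum(c["class"] == target for c in data))
    return precision * recall

def random_rule():
    attrs = random.sample(ATTRS, k=random.randint(1, 2))
    return {a: random.choice(VALUES[a]) for a in attrs}

def mutate(rule):
    child = dict(rule)
    a = random.choice(ATTRS)
    if a in child and len(child) > 1 and random.random() < 0.5:
        del child[a]                         # drop a condition
    else:
        child[a] = random.choice(VALUES[a])  # add or change a condition
    return child

def evolve(data, target, pop_size=40, generations=30):
    pop = [random_rule() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, data, target), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=lambda r: fitness(r, data, target))

if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": "false", "class": "no"},
        {"outlook": "overcast", "temp": "hot", "humidity": "high", "windy": "false", "class": "yes"},
        {"outlook": "rain", "temp": "mild", "humidity": "high", "windy": "false", "class": "yes"},
        {"outlook": "rain", "temp": "cool", "humidity": "normal", "windy": "true", "class": "no"},
        {"outlook": "sunny", "temp": "mild", "humidity": "normal", "windy": "true", "class": "yes"},
    ]
    print("best rule for class 'yes':", evolve(data, "yes"))
```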

608 citations


Book
15 Jun 2002
TL;DR: Part A: Data mining and knowledge discovery Part B: Fundamental Concepts Part C: The process of knowledge discovery in databases Part D: Discovery Systems Part E: Interdisciplinary links of KDD Part F: Business problems Part G: Industry sectors Part H: KDD in practice: case studies
Abstract: Part A: Data mining and knowledge discovery Part B: Fundamental Concepts Part C: The process of knowledge discovery in databases Part D: Discovery Systems Part E: Interdisciplinary links of KDD Part F: Business problems Part G: Industry sectors Part H: KDD in practice: case studies

502 citations


Journal ArticleDOI
01 Aug 2002
TL;DR: A new algorithm called TITANIC for computing (iceberg) concept lattices is presented, based on data mining techniques with a level-wise approach, and shows an important gain in efficiency, especially for weakly correlated data.
Abstract: We introduce the notion of iceberg concept lattices and show their use in knowledge discovery in databases. Iceberg lattices are a conceptual clustering method, which is well suited for analyzing very large databases. They also serve as a condensed representation of frequent itemsets, as starting point for computing bases of association rules, and as a visualization method for association rules. Iceberg concept lattices are based on the theory of Formal Concept Analysis, a mathematical theory with applications in data analysis, information retrieval, and knowledge discovery. We present a new algorithm called TITANIC for computing (iceberg) concept lattices. It is based on data mining techniques with a level-wise approach. In fact, TITANIC can be used for a more general problem: Computing arbitrary closure systems when the closure operator comes along with a so-called weight function. The use of weight functions for computing closure systems has not been discussed in the literature up to now. Applications providing such a weight function include association rule mining, functional dependencies in databases, conceptual clustering, and ontology engineering. The algorithm is experimentally evaluated and compared with Ganter's Next-Closure algorithm. The evaluation shows an important gain in efficiency, especially for weakly correlated data.
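
The level-wise idea can be illustrated with a small, assumption-laden sketch: itemsets are generated level by level, kept when their support (playing the role of the weight function) meets a threshold, and mapped to their closures. This is not the TITANIC algorithm itself, and the transactions below are invented.

```python
# Level-wise sketch in the spirit of iceberg concept lattices: keep itemsets
# whose support meets a threshold and record their closures (concept intents).
from itertools import combinations

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def closure(itemset, transactions):
    """All items common to every transaction containing the itemset."""
    matching = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*matching) if matching else itemset

def iceberg_concepts(transactions, min_support):
    items = frozenset.union(*transactions)
    level = [frozenset([i]) for i in items]
    concepts = {}
    while level:
        frequent = [s for s in level if support(s, transactions) >= min_support]
        for s in frequent:
            c = closure(s, transactions)
            concepts[c] = support(c, transactions)
        # candidate generation: join frequent k-sets that differ in one item
        level = list({a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == len(a) + 1})
    return concepts

tx = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                             ["b", "c"], ["a", "b", "c"])]
for concept, supp in sorted(iceberg_concepts(tx, 0.4).items(), key=lambda kv: -kv[1]):
    print(sorted(concept), round(supp, 2))
```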

494 citations


Journal ArticleDOI
TL;DR: The confluence of temporal databases and data mining is investigated, the work to date is surveyed, and the issues involved and the outstanding problems in temporal data mining are explored.
Abstract: With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining.

442 citations


Journal ArticleDOI
01 Jun 2002
TL;DR: The knowledge warehouse proposed here suggests a different direction for DSS in the next decade based on an expanded purpose of DSS, which suggests that the effectiveness of a DSS will, in the future, be measured based on how well it promotes and enhances knowledge.
Abstract: Decision support systems (DSS) are becoming increasingly more critical to the daily operation of organizations. Data warehousing, an integral part of this, provides an infrastructure that enables businesses to extract, cleanse, and store vast amounts of data. The basic purpose of a data warehouse is to empower the knowledge workers with information that allows them to make decisions based on a solid foundation of fact. However, only a fraction of the needed information exists on computers; the vast majority of a firm's intellectual assets exist as knowledge in the minds of its employees. What is needed is a new generation of knowledge-enabled systems that provides the infrastructure needed to capture, cleanse, store, organize, leverage, and disseminate not only data and information but also the knowledge of the firm. The purpose of this paper is to propose, as an extension to the data warehouse model, a knowledge warehouse (KW) architecture that will not only facilitate the capturing and coding of knowledge but also enhance the retrieval and sharing of knowledge across the organization. The knowledge warehouse proposed here suggests a different direction for DSS in the next decade. This new direction is based on an expanded purpose of DSS. That is, the purpose of DSS is knowledge improvement. This expanded purpose of DSS also suggests that the effectiveness of a DSS will, in the future, be measured based on how well it promotes and enhances knowledge, how well it improves the mental model(s) and understanding of the decision maker(s) and thereby how well it improves his/her decision making.

393 citations


Journal ArticleDOI
TL;DR: The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art.
Abstract: The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining as a separate field from data mining is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs)) is highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.

365 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
Abstract: Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
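
A minimal sketch of the machine-learning record linkage idea, with hypothetical fields, similarity measures, and training pairs: each candidate pair is turned into a comparison vector and a classifier labels it as match or non-match. The TAILOR models themselves are not reproduced.

```python
# Sketch of machine-learning record linkage: build a comparison vector for each
# candidate record pair and train a classifier to label pairs match / non-match.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def comparison_vector(r1, r2):
    return [sim(r1["name"], r2["name"]),
            sim(r1["address"], r2["address"]),
            1.0 if r1["zip"] == r2["zip"] else 0.0]

# Tiny labeled sample of record pairs (1 = same entity, 0 = different).
pairs = [
    (({"name": "John A. Smith", "address": "12 Oak St", "zip": "10001"},
      {"name": "Jon Smith", "address": "12 Oak Street", "zip": "10001"}), 1),
    (({"name": "John A. Smith", "address": "12 Oak St", "zip": "10001"},
      {"name": "Mary Jones", "address": "99 Pine Ave", "zip": "94110"}), 0),
    (({"name": "A. Gupta", "address": "5 Lake Rd", "zip": "60601"},
      {"name": "Anil Gupta", "address": "5 Lake Road", "zip": "60601"}), 1),
    (({"name": "A. Gupta", "address": "5 Lake Rd", "zip": "60601"},
      {"name": "A. Gupton", "address": "77 Hill Dr", "zip": "30303"}), 0),
]
X = [comparison_vector(r1, r2) for (r1, r2), _ in pairs]
y = [label for _, label in pairs]

clf = LogisticRegression().fit(X, y)
candidate = ({"name": "Jonathan Smith", "address": "12 Oak St.", "zip": "10001"},
             {"name": "John Smith", "address": "12 Oak Street", "zip": "10001"})
print("match probability:",
      round(clf.predict_proba([comparison_vector(*candidate)])[0, 1], 3))
```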

Book
06 Oct 2002
TL;DR: This book covers data mining fundamentals and techniques, tools for knowledge discovery, advanced methods such as neural networks and statistical techniques, and intelligent systems, including managing uncertainty in rule-based systems and integrating data mining, expert systems, and intelligent agents.
Abstract: (Each Chapter concludes with a Chapter Summary, Key Terms, and Exercises.) Preface. I. DATA MINING FUNDAMENTALS. 1. Data Mining: A First View. Data Mining: A Definition. What Can Computers Learn? Is Data Mining Appropriate for my Problem? Expert Systems or Data Mining? A Simple Data Mining Process Model. Why not Simple Search? Data Mining Applications. 2. Data Mining: A Closer Look. Data Mining Strategies. Supervised Data Mining Techniques. Association Rules. Clustering Techniques. Evaluating Performance. 3. Basic Data Mining Techniques. Decision Trees. Generating Association Rules. The K-Means Algorithm. Genetic Learning. Choosing a Data Mining Technique. 4. An Excel-Based Data Mining Tool. The iData Analyzer. ESX: A Multipurpose Tool for Data Mining. iDAV Format for Data Mining. A Five-Step Approach for Unsupervised Clustering. A Six-Step Approach for Supervised Learning. Techniques for Generating Rules. Instance Typicality. Special Considerations and Features. II. TOOLS FOR KNOWLEDGE DISCOVERY. 5. Knowledge Discovery in Databases. A KDD Process Model. Step 1: Goal Identification. Step 2: Creating a Target Data Set. Step 3: Data Preprocessing. Step 4: Data Transformation. Step 5: Data Mining. Step 6: Interpretation and Evaluation. Step 7: Taking Action. The CRISP-DM Process Model. Experimenting with ESX. 6. The Data Warehouse. Operational Databases. Data Warehouse Design. On-line Analytical Processing (OLAP). Excel Pivot Tables for Data Analysis. 7. Formal Evaluation Techniques. What Should be Evaluated? Tools for Evaluation. Computing Test Set Confidence Intervals. Comparing Supervised Learner Models. Attribute Evaluation. Unsupervised Evaluation Techniques. Evaluating Supervised Models with Numeric Output. III. ADVANCED DATA MINING TECHNIQUES. 8. Neural Networks. Feed-Forward Neural Networks. Neural Network Training: A Conceptual View. Neural Network Explanation. General Considerations. Neural Network Learning: A Detailed View. 9. Building Neural Networks with iDA. A Four-Step Approach for Backpropagation Learning. A Four-Step Approach for Neural Network Clustering. ESX for Neural Network Cluster Analysis. 10. Statistical Techniques. Linear Regression Analysis. Logistic Regression. Bayes Classifier. Clustering Algorithms. Heuristics or Statistics? 11. Specialized Techniques. Time-Series Analysis. Mining the Web. Mining Textual Data. Improving Performance. IV. INTELLIGENT SYSTEMS. 12. Rule-Based Systems. Exploring Artificial Intelligence. Problem Solving as a State Space Search. Expert Systems. Structuring a Rule-Based System. 13. Managing Uncertainty in Rule-Based Systems. Uncertainty: Sources and Solutions. Fuzzy Rule-Based Systems. A Probability-Based Approach to Uncertainty. 14. Intelligent Agents. Characteristics of Intelligent Agents. Types of Agents. Integrating Data Mining, Expert Systems, and Intelligent Agents. Appendix. Appendix A: Software Installation. Appendix B: Datasets for Data Mining. Appendix C: Decision Tree Attribute Selection. Appendix D: Statistics for Performance Evaluation. Appendix E: Excel 97 Pivot Tables. Bibliography.

01 Jan 2002
TL;DR: This analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
Abstract: Data quality is a major concern in Machine Learning and other correlated areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence of missing data, many Machine Learning algorithms handle missing data in a rather naive way. Missing data treatment should be carefully thought, otherwise bias might be introduced into the knowledge induced. In this work, we analyse the use of the k-nearest neighbour as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set by some plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
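
A short sketch of the imputation idea, using scikit-learn's KNNImputer as a stand-in for the paper's own k-nearest neighbour procedure (the paper predates this library, and the data below is invented): each missing entry is replaced by a distance-weighted mean of that feature over the k most similar complete cases.

```python
# k-nearest-neighbour imputation sketch: fill missing values from the k most
# similar rows. KNNImputer stands in for the procedure analysed in the paper.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0,    3.0],
              [1.1, np.nan, 3.2],
              [8.0, 9.0,    7.5],
              [7.9, 9.2,    np.nan]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```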

Proceedings ArticleDOI
24 Feb 2002
TL;DR: New metrics are introduced in order to demonstrate how security issues can be taken into consideration in the general framework of association rule mining, and it is shown that the complexity of the new heuristics is similar to that of the original algorithms.
Abstract: The current trend in the application space towards systems of loosely coupled and dynamically bound components that enables just-in-time integration jeopardizes the security of information that is shared between the broker, the requester, and the provider at runtime. In particular, new advances in data mining and knowledge discovery that allow for the extraction of hidden knowledge in an enormous amount of data, impose new threats on the seamless integration of information. We consider the problem of building privacy preserving algorithms for one category of data mining techniques, association rule mining. We introduce new metrics in order to demonstrate how security issues can be taken into consideration in the general framework of association rule mining, and we show that the complexity of the new heuristics is similar to that of the original algorithms.
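
For intuition only, here is a naive sanitization heuristic that hides a sensitive itemset by deleting one of its items from enough supporting transactions to push its support below the mining threshold; the metrics and heuristics proposed in the paper are not reproduced, and the transaction data is invented.

```python
# Naive illustration of association-rule sanitization: lower the support of a
# "sensitive" itemset below the mining threshold by deleting one of its items
# from just enough supporting transactions.
import random

def support_count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def sanitize(transactions, sensitive, min_support):
    """Hide `sensitive` (a frozenset of items) w.r.t. an absolute support threshold."""
    tx = [set(t) for t in transactions]
    supporting = [t for t in tx if sensitive <= t]
    to_modify = support_count(sensitive, tx) - (min_support - 1)
    for t in random.sample(supporting, max(0, to_modify)):
        t.discard(random.choice(sorted(sensitive)))   # drop one sensitive item
    return tx

db = [{"bread", "milk", "beer"}, {"bread", "beer"}, {"milk", "beer", "bread"},
      {"bread", "milk"}, {"beer", "diapers", "bread"}]
hidden = sanitize(db, frozenset({"bread", "beer"}), min_support=2)
print("support after sanitization:",
      support_count(frozenset({"bread", "beer"}), hidden))
```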

Proceedings ArticleDOI
09 Dec 2002
TL;DR: This paper carefully motivate, then introduces, a nontrivial definition of time series motifs, and proposes an efficient algorithm to discover them, and demonstrates the utility and efficiency of the approach on several real world datasets.
Abstract: The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs", because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this paper we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.
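
A brute-force sketch of the motif idea (not the paper's efficient algorithm): z-normalize sliding windows and report the subsequence with the most non-trivial matches within a distance threshold. The window length and threshold are arbitrary assumptions, and the series is synthetic.

```python
# Brute-force sketch of time-series motif discovery: count, for each
# z-normalized window, how many non-overlapping windows lie within distance r.
import numpy as np

def znorm(w):
    s = w.std()
    return (w - w.mean()) / s if s > 1e-8 else w - w.mean()

def find_motif(ts, window, r):
    ts = np.asarray(ts, dtype=float)
    subs = np.array([znorm(ts[i:i + window]) for i in range(len(ts) - window + 1)])
    best_idx, best_count = 0, -1
    for i in range(len(subs)):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # ignore trivial matches: neighbours overlapping the candidate itself
        nontrivial = [j for j in np.where(dists <= r)[0] if abs(j - i) >= window]
        if len(nontrivial) > best_count:
            best_idx, best_count = i, len(nontrivial)
    return best_idx, best_count

rng = np.random.default_rng(0)
t = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.2 * rng.standard_normal(1000)
idx, count = find_motif(t, window=50, r=3.0)
print(f"motif candidate starts at {idx} with {count} non-trivial matches")
```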

Patent
11 Mar 2002
TL;DR: In this paper, the authors proposed a method for constructing segmentation-based predictive models, such as decision-tree classifiers, where data records are partitioned into a plurality of segments and separate predictive models are constructed for each segment.
Abstract: The present invention generally relates to computer databases and, more particularly, to data mining and knowledge discovery. The invention specifically relates to a method for constructing segmentation-based predictive models, such as decision-tree classifiers, wherein data records are partitioned into a plurality of segments and separate predictive models are constructed for each segment. The present invention contemplates a computerized method for automatically building segmentation-based predictive models that substantially improves upon the modeling capabilities of decision trees and related technologies, and that automatically produces models that are competitive with, if not better than, those produced by data analysts and applied statisticians using traditional, labor-intensive statistical techniques. The invention achieves these properties by performing segmentation and multivariate statistical modeling within each segment simultaneously. Segments are constructed so as to maximize the accuracies of the predictive models within each segment. Simultaneously, the multivariate statistical models within each segment are refined so as to maximize their respective predictive accuracies.
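
The general idea can be sketched as follows, with a fixed threshold standing in for the patent's simultaneous segmentation and model refinement: records are partitioned into segments and a separate multivariate model is fitted within each. The data, threshold, and model choice are all illustrative assumptions.

```python
# Sketch of a segmentation-based predictive model: partition records into
# segments and fit a separate regression model within each segment.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 2))
# Different linear behaviour on each side of x0 = 5 (two latent segments).
y = np.where(X[:, 0] < 5, 2 * X[:, 1] + 1, -3 * X[:, 1] + 20) + rng.normal(0, 0.5, 500)

def segment_of(x):
    return 0 if x[0] < 5 else 1            # illustrative segmentation rule

models = {}
for seg in (0, 1):
    mask = np.array([segment_of(x) == seg for x in X])
    models[seg] = LinearRegression().fit(X[mask], y[mask])

def predict(x):
    return models[segment_of(x)].predict(np.asarray(x).reshape(1, -1))[0]

print(round(predict([2.0, 3.0]), 2), round(predict([8.0, 3.0]), 2))
```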

Journal ArticleDOI
TL;DR: The approach of this work is to consider the intended reuse and level of application for knowledge in order to determine the requirements for its acquisition, and an overall framework for the requirements of capturing, storing and reusing information and knowledge in engineering design is generated.

Journal ArticleDOI
01 Jun 2002
TL;DR: An integrative framework is presented for building enterprise decision support environments using model marts and model warehouses as repositories for knowledge obtained through various conversions.
Abstract: Decision support and knowledge management processes are interdependent activities in many organizations. In this paper, we propose an approach for integrating decision support and knowledge management processes using knowledge discovery techniques. Based on the proposed approach, an integrative framework is presented for building enterprise decision support environments using model marts and model warehouses as repositories for knowledge obtained through various conversions. This framework is expected to guide further research on the development of the next generation decision support environments.

BookDOI
01 Jan 2002
TL;DR: This book discusses Granular Computing in Data Mining, Granular computing with Closeness and Negligibility Relations, and the application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty.
Abstract: 1: Granular Computing - A New Paradigm.- Some Reflections on Information Granulation and its Centrality in Granular Computing, Computing with Words, the Computational Theory of Perceptions and Precisiated Natural Language.- 2: Granular Computing in Data Mining.- Data Mining Using Granular Computing: Fast Algorithms for Finding Association Rules.- Knowledge Discovery with Words Using Cartesian Granule Features: An Analysis for Classification Problems.- Validation of Concept Representation with Rule Induction and Linguistic Variables.- Granular Computing Using Information Tables.- A Query-Driven Interesting Rule Discovery Using Association and Spanning Operations.- 3: Data Mining.- An Interactive Visualization System for Mining Association Rules.- Algorithms for Mining System Audit Data.- Scoring and Ranking the Data Using Association Rules.- Finding Unexpected Patterns in Data.- Discovery of Approximate Knowledge in Medical Databases Based on Rough Set Model.- 4: Granular Computing.- Observability and the Case of Probability.- Granulation and Granularity via Conceptual Structures: A Perspective From the Point of View of Fuzzy Concept Lattices.- Granular Computing with Closeness and Negligibility Relations.- Application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty.- Basic Issues of Computing with Granular Probabilities.- Multi-dimensional Aggregation of Fuzzy Numbers Through the Extension Principle.- On Optimal Fuzzy Information Granulation.- Ordinal Decision Making with a Notion of Acceptable: Denoted Ordinal Scales.- A Framework for Building Intelligent Information-Processing Systems Based on Granular Factor Space.- 5: Rough Sets and Granular Computing.- GRS: A Generalized Rough Sets Model.- Structure of Upper and Lower Approximation Spaces of Infinite Sets.- Indexed Rough Approximations, A Polymodal System, and Generalized Possibility Measures.- Granularity, Multi-valued Logic, Bayes' Theorem and Rough Sets.- The Generic Rough Set Inductive Logic Programming (gRS-ILP) Model.- Possibilistic Data Analysis and Its Similarity to Rough Sets.

Journal ArticleDOI
TL;DR: Data mining and knowledge discovery attempts to turn raw data into nuggets and create special edges in this ever competitive world for science discovery and business intelligence.
Abstract: The digital technologies and computer advances with the booming internet uses have led to massive data collection (corporate data, data warehouses, webs, just to name a few) and information (or misinformation) explosion. Szalay and Gray described this phenomenon as "drowning in data" (Szalay and Gray, 1999). They reported that each year the detectors at the CERN particle collider in Switzerland record 1 petabyte of data; and researchers in areas of science from astronomy to the human genome are facing the same problems and choking on information. A very natural question is "now that we have gathered so much data, what do we do with it?" Raw data is rarely of direct use and manual analysis simply cannot keep pace with the fast growth of data. Data mining and knowledge discovery (KDD), as a new emerging field comprising disciplines such as databases, statistics, machine learning, comes to the rescue. KDD attempts to turn raw data into nuggets and create special edges in this ever competitive world for science discovery and business intelligence. The KDD process is defined in Fayyad et al. (1996) as

Journal ArticleDOI
TL;DR: The proposed model provides a new way to model and manage teamwork processes and a reference model for coordinating the knowledge flow process with the workflow process is suggested to provide an integrated approach to model teamwork process.
Abstract: To realize effective knowledge sharing in teamwork, this paper proposes a knowledge flow model for peer-to-peer knowledge sharing and management in cooperative teams. The model consists of the concepts, rules and methods about the knowledge flow, the knowledge flow process model, and the knowledge flow engine. A reference model for coordinating the knowledge flow process with the workflow process is suggested to provide an integrated approach to model teamwork process. We also discuss the peer-to-peer knowledge-sharing paradigm in large-scale teams and propose the approach for constructing a knowledge flow network from the corresponding workflow. The proposed model provides a new way to model and manage teamwork processes.

Journal ArticleDOI
TL;DR: The extended approach proposed in the paper outperformed the standard approach on some benchmark problems at a statistically significant level, and classifiers induced using the representation enriched by the GP-constructed features provided better classification accuracy on the test set.
Abstract: In this paper we use genetic programming for changing the representation of the input data for machine learners. In particular, the topic of interest here is feature construction in the learning-from-examples paradigm, where new features are built based on the original set of attributes. The paper first introduces the general framework for GP-based feature construction. Then, an extended approach is proposed where the useful components of representation (features) are preserved during an evolutionary run, as opposed to the standard approach where valuable features are often lost during search. Finally, we present and discuss the results of an extensive computational experiment carried out on several reference data sets. The outcomes show that classifiers induced using the representation enriched by the GP-constructed features provide better accuracy of classification on the test set. In particular, the extended approach proposed in the paper proved to be able to outperform the standard approach on some benchmark problems on a statistically significant level.
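
A toy illustration of feature construction under strong simplifications: candidate features are random arithmetic combinations of the original attributes, scored by the cross-validated accuracy they add to a base classifier. The paper's full genetic-programming search (tree crossover, preservation of useful subexpressions) is not shown, and the dataset and classifier are arbitrary stand-ins.

```python
# Toy feature construction: score random arithmetic combinations of attributes
# by the cross-validated accuracy they add to a decision tree; keep the best.
import random
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ops = {"+": np.add, "-": np.subtract, "*": np.multiply}

def random_feature(n_attrs, rng):
    i, j = rng.sample(range(n_attrs), 2)
    return (rng.choice(list(ops)), i, j)

def apply_feature(feat, X):
    op, i, j = feat
    return ops[op](X[:, i], X[:, j]).reshape(-1, 1)

def score(X_aug, y):
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X_aug, y, cv=5).mean()

rng = random.Random(0)
baseline = score(X, y)
best_feat, best_score = None, baseline
for _ in range(30):                      # "population" of candidate features
    feat = random_feature(X.shape[1], rng)
    s = score(np.hstack([X, apply_feature(feat, X)]), y)
    if s > best_score:
        best_feat, best_score = feat, s

print(f"baseline accuracy: {baseline:.3f}")
print(f"best constructed feature {best_feat} -> accuracy {best_score:.3f}")
```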

Patent
17 Jun 2002
TL;DR: In this paper, the authors propose a data mining platform consisting of a plurality of system modules, each formed from plurality of components, each consisting of an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit.
Abstract: The data mining platform comprises a plurality of system modules, each formed from a plurality of components. Each module has an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit and to provide communication to other units. Each module processes a different type of data, for example, a first module processes microarray (gene expression) data while a second module processes biomedical literature on the Internet for information supporting relationships between genes and diseases and gene functionality. In the preferred embodiment, the data analysis engine is a kernel-based learning machine, and in particular, one or more support vector machines (SVMs). The data analysis engine includes a pre-processing function for feature selection, for reducing the amount of data to be processed by selecting the optimum number of attributes, or “features”, relevant to the information to be discovered.
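
A sketch of the analysis-engine idea only, with scikit-learn's recursive feature elimination and SVC used as stand-ins: select a subset of informative features, then classify with a support vector machine. The platform's modules, web servers, and data sources are not modeled, and the data is synthetic.

```python
# Feature selection followed by an SVM classifier, as a stand-in for the
# pre-processing + kernel-learning-machine engine described above.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=200, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear"), n_features_to_select=15, step=10),
    SVC(kernel="rbf"),
)
pipeline.fit(X_tr, y_tr)
print("accuracy with selected features:", round(pipeline.score(X_te, y_te), 3))
```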

Journal ArticleDOI
TL;DR: The model organizes knowledge in a three-dimensional knowledge space, and provides a knowledge grid operation language, KGOL, which enables people to conveniently share knowledge with each other when they work on the Internet.
Abstract: This paper proposes a knowledge grid model for sharing and managing globally distributed knowledge resources. The model organizes knowledge in a three-dimensional knowledge space, and provides a knowledge grid operation language, KGOL. Internet users can use the KGOL to create their knowledge grids, to put knowledge to them, to edit knowledge, to partially or wholly open their grids to all or some particular grids, and to get the required knowledge from the open knowledge of all the knowledge grids. The model enables people to conveniently share knowledge with each other when they work on the Internet. A software platform based on the proposed model has been implemented and used for knowledge sharing in research teams.
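
As a purely illustrative reading of the operations described (create, put, edit, open, get), here is a tiny in-memory class with hypothetical method names and addressing; it is not the actual KGOL syntax or implementation.

```python
# Illustrative in-memory sketch of knowledge-grid operations. The
# three-dimensional addressing and method names are hypothetical.
class KnowledgeGrid:
    def __init__(self, owner):
        self.owner = owner
        self.cells = {}          # (dim1, dim2, dim3) -> knowledge content
        self.open_to = set()     # users allowed to read this grid

    def put(self, coord, content):
        self.cells[coord] = content

    def edit(self, coord, content):
        if coord not in self.cells:
            raise KeyError(f"no knowledge at {coord}")
        self.cells[coord] = content

    def open_grid(self, other_user):
        self.open_to.add(other_user)

    def get(self, coord, requester):
        if requester != self.owner and requester not in self.open_to:
            raise PermissionError(f"{requester} has no access to {self.owner}'s grid")
        return self.cells.get(coord)

alice = KnowledgeGrid("alice")
alice.put(("data-mining", "clustering", "survey"), "Notes on k-means variants")
alice.open_grid("bob")
print(alice.get(("data-mining", "clustering", "survey"), requester="bob"))
```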

Journal ArticleDOI
TL;DR: Knowledge discovery in databases and data mining are tools that allow identification of valid, useful, and previously unknown patterns so that the construction manager may analyze the large amount of construction project data.
Abstract: As the construction industry is adapting to new computer technologies in terms of hardware and software, computerized construction data are becoming increasingly available. The explosive growth of many business, government, and scientific databases has begun to far outpace our ability to interpret and digest the data. Such volumes of data clearly overwhelm the traditional methods of data analysis such as spreadsheets and ad-hoc queries. The traditional methods can create informative reports from data, but cannot analyze the contents of those reports. A significant need exists for a new generation of techniques and tools with the ability to automatically assist humans in analyzing the mountains of data for useful knowledge. Knowledge discovery in databases (KDD) and data mining (DM) are tools that allow identification of valid, useful, and previously unknown patterns so that the construction manager may analyze the large amount of construction project data. These technologies combine techniques from machin...

Proceedings ArticleDOI
07 Aug 2002
TL;DR: A parallel granular neural network (GNN) is developed to speed up data mining and knowledge discovery process for credit card fraud detection and gives fewer average training errors with larger amount of past training data.
Abstract: A parallel granular neural network (GNN) is developed to speed up data mining and knowledge discovery process for credit card fraud detection. The entire system is parallelized on the Silicon Graphics Origin 2000, which is a shared memory multiprocessor system consisting of 24-CPU, 4G main memory, and 200 GB hard-drive. In simulations, the parallel fuzzy neural network running on a 24-processor system is trained in parallel using training data sets, and then the trained parallel fuzzy neural network discovers fuzzy rules for future prediction. A parallel learning algorithm is implemented in C. The data are extracted into a flat file from an SQL server database containing sample Visa Card transactions and then preprocessed for applying in fraud detection. The data are classified into three categories: first for training, second for prediction, and third for fraud detection. After learning from training data, the GNN is used to predict on a second set of data and later the third set of data is applied for fraud detection. GNN gives fewer average training errors with larger amount of past training data. The higher the fraud detection error is, the greater the possibility of that transaction being actually fraudulent.
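
A single-process sketch of the workflow only, with synthetic data and a plain scikit-learn network standing in for the granular neural network: the transactions are split into training, prediction, and detection sets as described. The fuzzy-rule extraction and the 24-CPU parallel training are not reproduced.

```python
# Workflow sketch: three-way data split (training / prediction / detection)
# and a small neural network, standing in for the parallel GNN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for preprocessed card-transaction features (1 = fraud).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
train, predict_, detect = np.split(np.arange(len(X)), [1800, 2400])

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16,), max_iter=600,
                                    random_state=0))
model.fit(X[train], y[train])
print("prediction-set accuracy:", round(model.score(X[predict_], y[predict_]), 3))
flagged = np.where(model.predict(X[detect]) == 1)[0]
print(f"{len(flagged)} transactions flagged as suspicious in the detection set")
```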

Journal ArticleDOI
TL;DR: This paper proposes a new methodology to identify the customers whose decisions will be positively influenced by campaigns, which is easy to implement and can be used in conjunction with most commonly used supervised learning algorithms.
Abstract: In database marketing, data mining has been used extensively to find the optimal customer targets so as to maximize return on investment. In particular, using marketing campaign data, models are typically developed to identify characteristics of customers who are most likely to respond. While these models are helpful in identifying the likely responders, they may be targeting customers who have decided to take the desirable action or not regardless of whether they receive the campaign contact (e.g. mail, call). Based on many years of business experience, we identify the appropriate business objective and its associated mathematical objective function. We point out that the current approach is not directly designed to solve the appropriate business objective. We then propose a new methodology to identify the customers whose decisions will be positively influenced by campaigns. The proposed methodology is easy to implement and can be used in conjunction with most commonly used supervised learning algorithms. An example using simulated data is used to illustrate the proposed methodology. This paper may provide the database marketing industry with a simple but significant methodological improvement and open a new area for further research and development.
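
The paper proposes its own methodology; as a generic illustration of the underlying objective, the sketch below uses the common "two-model" uplift approach, scoring each customer by the difference between the estimated response probability if contacted and if not contacted. All data, names, and model choices are synthetic assumptions, not the authors' method.

```python
# Generic "two-model" uplift sketch: target customers with the largest
# estimated P(respond | contacted) - P(respond | not contacted).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 6))                      # customer attributes
contacted = rng.integers(0, 2, size=n)           # 1 = received the campaign
# Synthetic responses: some customers respond only when contacted.
persuadable = X[:, 0] > 0.5
base = (X[:, 1] > 1.0).astype(int)
responded = np.maximum(base, (persuadable & (contacted == 1)).astype(int))

m_treat = GradientBoostingClassifier().fit(X[contacted == 1], responded[contacted == 1])
m_control = GradientBoostingClassifier().fit(X[contacted == 0], responded[contacted == 0])

uplift = m_treat.predict_proba(X)[:, 1] - m_control.predict_proba(X)[:, 1]
targets = np.argsort(-uplift)[:500]              # top 500 customers to contact
print("mean estimated uplift of targeted group:", round(uplift[targets].mean(), 3))
```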

Proceedings Article
23 Jul 2002
TL;DR: The KDD 2002 conference, held from 23rd to 26th July 2002, was the eighth in the series and represented a return to the country in which the series was launched: the first was held in Montreal, Canada, and this, the eighth, was held in Edmonton, Canada.
Abstract: The KDD 2002 conference, held from 23rd to 26th July 2002, was the eighth in the series. It represented a return to the country in which the series was launched: the first was held in Montreal, Canada, and this, the eighth, was held in Edmonton, Canada. In the years between the first conference in the series and this present one, data mining has become a well-established discipline. It has continued to strengthen its links to other data analytic disciplines, including statistics, machine learning, pattern recognition, visualization, and database technology, but has now clearly carved out a niche of its own. Over the period in which this series has been running, hardware technology has continued to advance in great leaps, with the result that large databases have continued to grow in both number and size. The implication is that the challenge of data mining is even more important, that the problems requiring data mining solutions are ever more ubiquitous, and that new tools and methods for tackling them are even more necessary. KDD 2002 received a record number of submitted papers - 307 in total, 37 of which were considered for the industrial/application track. Among the 270 research submissions, 32 were selected (12%) for full papers; and among the 37 industrial/application submissions, 12 (32%) were selected for full papers. An additional 44 submissions were chosen to be presented as posters, a vast majority of which were research submissions. This low rate of acceptance reflects a conscious effort to maintain the very high standards of quality and relevance which have been achieved by previous conferences in the series. It means that the papers and posters in the proceedings represent the cutting edge of data mining problems, solutions, and technology. On the other hand, this policy inevitably meant that many excellent contributions did not make it to the final program. The choice had to be informed by balance as well as quality - KDD 2002 had to showcase research in data mining across the entire frontier of the discipline. This breadth was reflected in the choice of invited speakers, both well known in the data mining community, but from different backgrounds: Daryl Pregibon and Geoff Hinton. The program also includes 6 workshops in such diverse areas as 'Data Mining in Bioinformatics', 'Web Mining', 'Multimedia Data Mining', 'Multi-Relational Data Mining', 'Temporal Data Mining', and 'Fractals in Data Mining' as well as 6 tutorials on 'Text Mining for Bioinformatics', 'Querying and Mining Data Streams', 'Link Analysis', 'Multivariate Density Estimation', 'Common Reasons Data Mining Projects Fail', and 'Visual Data Mining'.

Journal ArticleDOI
TL;DR: The present article describes the range of text mining techniques that have been applied to scientific documents and divides 'automated reading' into four general subtasks: text categorization, named entity tagging, fact extraction, and collection-wide analysis.

Book
01 May 2002
TL;DR: This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework.
Abstract: This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All of these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. The book aims at researchers and advanced professionals in time series modelling, empirical data modelling, knowledge discovery, data mining and data fusion.