Showing papers on "Knowledge extraction published in 1998"



Book
31 Jul 1998
TL;DR: Feature Selection for Knowledge Discovery and Data Mining offers an overview of the feature selection methods developed since the 1970s, provides a general framework for examining and categorizing them, and suggests guidelines for how to use different methods under various circumstances.
Abstract: From the Publisher: With advanced computer technologies and their omnipresent usage, data accumulates at a speed that far outpaces the human capacity to process it. To meet this growing challenge, the research community of knowledge discovery from databases emerged. The key issue studied by this community is, in layman's terms, how to make advantageous use of large stores of data. In order to make raw data useful, it is necessary to represent, process, and extract knowledge for various applications. Feature Selection for Knowledge Discovery and Data Mining offers an overview of the methods developed since the 1970s and provides a general framework for examining and categorizing them. This book employs simple examples to show the essence of representative feature selection methods and compares them using data sets with combinations of intrinsic properties according to the objective of feature selection. In addition, the book suggests guidelines for how to use different methods under various circumstances and points out new challenges in this exciting area of research. Feature Selection for Knowledge Discovery and Data Mining is intended to be used by researchers in machine learning, data mining, knowledge discovery, and databases as a toolbox of relevant tools that help in solving large real-world problems. This book is also intended to serve as a reference book or secondary text for courses on machine learning, data mining, and databases.

1,867 citations
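As an illustration of the filter-style methods surveyed in the book above, here is a minimal sketch that ranks discrete features by information gain and keeps the top k. The function names and the toy data are illustrative assumptions, not material from the book.

```python
# Minimal sketch of a filter-style feature selection method: rank features by
# information gain with respect to the class label and keep the top k.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Label entropy minus the expected entropy after splitting on a discrete feature."""
    total, n, remainder = entropy(labels), len(labels), 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

def select_features(rows, labels, k):
    """Return the indices of the k features with the highest information gain."""
    gains = [(information_gain([r[j] for r in rows], labels), j)
             for j in range(len(rows[0]))]
    return [j for _, j in sorted(gains, reverse=True)[:k]]

# Toy usage: feature 0 predicts the label perfectly, feature 1 is noise.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
print(select_features(rows, labels, 1))   # -> [0]
```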


Journal ArticleDOI
TL;DR: This book can be used by researchers and graduate students in machine learning, data mining, and knowledge discovery, who wish to understand techniques of feature extraction, construction and selection for data pre-processing and to solve large, real-world problems.
Abstract: From the Publisher: The book can be used by researchers and graduate students in machine learning, data mining, and knowledge discovery, who wish to understand techniques of feature extraction, construction and selection for data pre-processing and to solve large, real-world problems. The book can also serve as a reference book for those who are conducting research on feature extraction, construction and selection, and are ready to meet the exciting challenges ahead of us.

953 citations


Proceedings Article
24 Aug 1998
TL;DR: WaveCluster is proposed, a novel clustering approach based on wavelet transforms which can effectively identify arbitrary shape clusters at different degrees of accuracy and is highly efficient in terms of time complexity.
Abstract: Many applications require the management of spatial data. Clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the outliers (noise) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrary shape clusters at different degrees of accuracy. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large data sets are presented which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods. This research is supported by Xerox Corporation. Proceedings of the 24th VLDB Conference, New York, USA, 1998.

809 citations
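To make the grid-and-transform idea behind WaveCluster concrete, here is a heavily simplified sketch: quantize 2-D points into grid cells, smooth the cell counts with a single-level Haar-style average (a crude stand-in for the full wavelet transform), keep dense cells, and label connected components. All parameters and helper names are illustrative; this is not the authors' algorithm or code.

```python
# Simplified grid-based clustering in the spirit of WaveCluster (illustrative only).
from collections import deque

def wavecluster_sketch(points, grid=16, density=2):
    # 1. Quantize 2-D points into a grid of cell counts.
    xs, ys = zip(*points)
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    counts = [[0] * grid for _ in range(grid)]
    def cell(p):
        i = min(int((p[0] - x0) / (x1 - x0 + 1e-9) * grid), grid - 1)
        j = min(int((p[1] - y0) / (y1 - y0 + 1e-9) * grid), grid - 1)
        return i, j
    for p in points:
        i, j = cell(p)
        counts[i][j] += 1
    # 2. Haar-style smoothing: average each cell with its right/lower neighbours.
    smooth = [[sum(counts[min(i + di, grid - 1)][min(j + dj, grid - 1)]
                   for di in (0, 1) for dj in (0, 1)) / 4.0
               for j in range(grid)] for i in range(grid)]
    # 3. Threshold dense cells and label connected components (4-neighbourhood).
    label = [[-1] * grid for _ in range(grid)]
    next_label = 0
    for i in range(grid):
        for j in range(grid):
            if smooth[i][j] >= density and label[i][j] == -1:
                queue = deque([(i, j)])
                label[i][j] = next_label
                while queue:
                    a, b = queue.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if (0 <= na < grid and 0 <= nb < grid
                                and smooth[na][nb] >= density and label[na][nb] == -1):
                            label[na][nb] = next_label
                            queue.append((na, nb))
                next_label += 1
    # 4. Assign each point the label of its cell (-1 marks noise/outliers).
    return [label[cell(p)[0]][cell(p)[1]] for p in points]

pts = [(0.1, 0.1), (0.12, 0.15), (0.11, 0.13), (0.9, 0.9), (0.88, 0.92), (0.5, 0.1)]
print(wavecluster_sketch(pts, grid=8, density=0.5))
# three points fall in one cluster, two in another, the isolated point is noise (-1)
```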


BookDOI
01 Jan 1998

801 citations



Book
31 Aug 1998
TL;DR: This book surveys the main technique families used in data mining and knowledge discovery, including rough sets, fuzzy sets, Bayesian methods, evolutionary computing, machine learning, neural networks, and clustering.
Abstract: Foreword. Preface. 1. Data Mining and Knowledge Discovery. 2. Rough Sets. 3. Fuzzy Sets. 4. Bayesian Methods. 5. Evolutionary Computing. 6. Machine Learning. 7. Neural Networks. 8. Clustering. 9. Preprocessing. Index.

552 citations


Journal ArticleDOI
TL;DR: A model for a new type of topic-specific overview resource that provides efficient access to distributed information is developed, which is a freely accessible Web resource that offers one hypertext 'card' for each of the more than 7000 human genes that have an approved gene symbol published by the HUGO/GDB nomenclature committee.
Abstract: Motivation: Modern biology is shifting from the 'one gene one postdoc' approach to genomic analyses that include the simultaneous monitoring of thousands of genes. The importance of efficient access to concise and integrated biomedical information to support data analysis and decision making is therefore increasing rapidly, in both academic and industrial research. However, knowledge discovery in the widely scattered resources relevant for biomedical research is often a cumbersome and non-trivial task, one that requires a significant amount of training and effort. Results: To develop a model for a new type of topic-specific overview resource that provides efficient access to distributed information, we designed a database called 'GeneCards'. It is a freely accessible Web resource that offers one hypertext 'card' for each of the more than 7000 human genes that currently have an approved gene symbol published by the HUGO/GDB nomenclature committee. The presented information aims at giving immediate insight into current knowledge about the respective gene, including a focus on its functions in health and disease. It is compiled by Perl scripts that automatically extract relevant information from several databases, including SWISS-PROT, OMIM, Genatlas and GDB. Analyses of the interactions of users with the Web interface of GeneCards triggered development of easy-to-scan displays optimized for human browsing. Also, we developed algorithms that offer 'ready-to-click' query reformulation support, to facilitate information retrieval and exploration. Many of the long-term users turn to GeneCards to quickly access information about the function of very large sets of genes, for example in the realm of large-scale expression studies using 'DNA chip' technology or two-dimensional protein electrophoresis. Availability: Freely available at http://bioinformatics.weizmann.ac.il/cards/ Contact: cards@bioinformatics.weizmann.ac.il.

402 citations
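The GeneCards abstract describes scripts that merge per-gene fields extracted from several source databases into a single card. The original used Perl; the sketch below shows the merge step in Python under the assumption of dictionary-shaped inputs, with hypothetical field names that do not reflect the real SWISS-PROT/OMIM/GDB schemas.

```python
# Minimal sketch of compiling one "card" per gene by merging records from
# several sources, keyed by the approved gene symbol (field names hypothetical).
from collections import defaultdict

def build_cards(*sources):
    """Each source maps gene symbol -> dict of fields; later sources do not
    overwrite fields already filled by earlier ones."""
    cards = defaultdict(dict)
    for source in sources:
        for symbol, fields in source.items():
            for key, value in fields.items():
                cards[symbol].setdefault(key, value)
    return dict(cards)

swissprot_like = {"TP53": {"protein": "Cellular tumor antigen p53"}}
omim_like = {"TP53": {"disease": "Li-Fraumeni syndrome"},
             "CFTR": {"disease": "Cystic fibrosis"}}
print(build_cards(swissprot_like, omim_like)["TP53"])
```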


Journal ArticleDOI
David J. Hand1
TL;DR: Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners.
Abstract: Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners. New problems arise, partly as a consequence of the sheer size of the data sets involved, and partly because of issues of pattern matching. However, since statistics provides the intellectual glue underlying the effort, it is important for statisticians to become involved. There are very real opportunities for statisticians to make significant contributions.

362 citations


Proceedings Article
01 Jul 1998
TL;DR: Technical design issues faced in the development of Open Knowledge Base Connectivity are discussed, how OKBC improves upon GFP is highlighted, and practical experiences in using it are reported on.
Abstract: The technology for building large knowledge bases (KBs) is yet to witness a breakthrough so that a KB can be constructed by the assembly of prefabricated knowledge components. Knowledge components include both pieces of domain knowledge (for example, theories of economics or fault diagnosis) and KB tools (for example, editors and theorem provers). Most of the current KB development tools can only manipulate knowledge residing in the knowledge representation system (KRS) for which the tools were originally developed. Open Knowledge Base Connectivity (OKBC) is an application programming interface for accessing KRSs, and was developed to enable the construction of reusable KB tools. OKBC improves upon its predecessor, the Generic Frame Protocol (GFP), in several significant ways. OKBC can be used with a much larger range of systems because its knowledge model supports an assertional view of a KRS. OKBC provides an explicit treatment of entities that are not frames, and it has a much better way of controlling inference and specifying default values. OKBC can be used on practically any platform because it supports network transparency and has implementations for multiple programming languages. In this paper, we discuss technical design issues faced in the development of OKBC, highlight how OKBC improves upon GFP, and report on practical experiences in using it.

354 citations
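The point of OKBC described above is that reusable KB tools program against one access layer while each knowledge representation system supplies its own backend. The sketch below illustrates that separation with an invented minimal frame interface; the class and method names are assumptions for illustration and are not OKBC's actual operation set.

```python
# Illustrative sketch of a uniform knowledge-base access layer: tools depend
# only on the abstract interface, and each KRS provides a concrete backend.
from abc import ABC, abstractmethod

class FrameStore(ABC):
    @abstractmethod
    def get_frame(self, name): ...
    @abstractmethod
    def get_slot_values(self, frame, slot): ...
    @abstractmethod
    def put_slot_value(self, frame, slot, value): ...

class InMemoryStore(FrameStore):
    """A trivial backend; a real backend would wrap an existing KRS."""
    def __init__(self):
        self.frames = {}
    def get_frame(self, name):
        return self.frames.setdefault(name, {})
    def get_slot_values(self, frame, slot):
        return self.get_frame(frame).get(slot, [])
    def put_slot_value(self, frame, slot, value):
        self.get_frame(frame).setdefault(slot, []).append(value)

# A reusable "tool" written only against the interface:
def print_slot(store: FrameStore, frame, slot):
    print(frame, slot, store.get_slot_values(frame, slot))

kb = InMemoryStore()
kb.put_slot_value("Engine", "part-of", "Car")
print_slot(kb, "Engine", "part-of")
```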


Proceedings Article
01 Jul 1998
TL;DR: This work shows how information extraction can be cast as a standard machine learning problem, and argues for the suitability of relational learning in solving it, and the implementation of a general-purpose relational learner for information extraction, SRV.
Abstract: Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
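SRV, as described above, learns from an extensible token-oriented feature set to which domain-specific (e.g. HTML) features can be added. Here is a minimal sketch of such a feature extractor; the specific features are illustrative choices, not SRV's actual feature set.

```python
# Minimal sketch of a token-oriented feature set for learning-based information
# extraction: each token maps to simple orthographic features, and extra
# domain-specific features (crude HTML ones here) can be added without
# changing the learner.
import re

def token_features(token):
    feats = {
        "length": len(token),
        "capitalized": token[:1].isupper(),
        "all_caps": token.isupper(),
        "numeric": token.isdigit(),
        "punctuation": bool(re.fullmatch(r"\W+", token)),
    }
    # Extensible, domain-specific addition (e.g. for Web pages):
    feats["looks_like_tag"] = token.startswith("<") and token.endswith(">")
    return feats

tokens = "<b> CS 101 Introduction to Programming </b>".split()
for t in tokens:
    print(t, token_features(t))
```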

Patent
Ronald M. Swartz1, Jeffrey L. Winkler1, Evelyn A. Janos1, Igor Markidan1, Qun Dou1 
29 Jun 1998
TL;DR: This patent presents a method and apparatus for integrating the operation of various independent software applications directed to the management of information within an enterprise, using an expandable architecture with built-in knowledge integration features that facilitate the monitoring of information flow into, out of, and between the integrated information management applications.
Abstract: The present invention is a method and apparatus for integrating the operation of various independent software applications directed to the management of information within an enterprise. The system architecture is, however, an expandable architecture, with built-in knowledge integration features that facilitate the monitoring of information flow into, out of, and between the integrated information management applications so as to assimilate knowledge information and facilitate the control of such information. Also included are additional tools which, using the knowledge information, enable more efficient use of the knowledge within an enterprise, including the ability to develop a context for and visualization of such knowledge.

Journal ArticleDOI
TL;DR: A rule induction method is introduced, which extracts not only classification rules but also other medical knowledge needed for diagnosis from clinical cases, and is evaluated on three clinical databases.

Proceedings ArticleDOI
06 Jan 1998
TL;DR: While knowledge discovery often refers to the process of discovering useful knowledge from data, data mining focuses on the application of algorithms for extracting patterns from data.
Abstract: While knowledge discovery often refers to the process of discovering useful knowledge from data, data mining focuses on the application of algorithms for extracting patterns from data. Knowledge discovery seeks to find patterns in data and to infer rules (that is, to discover new information) that queries and reports do not reveal effectively. Thus, knowledge discovery has an R&D flavor, while data mining has an operational-process one. Data mining is a basis of knowledge discovery.

Book ChapterDOI
23 Sep 1998
TL;DR: This paper describes the Term Extraction module of the Document Explorer system, and provides experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995–1996.
Abstract: Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining focused at the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. In this paper we will describe the Term Extraction module of the Document Explorer system, and provide experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995–1996.
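The term-extraction idea described above proceeds in stages: generate candidate terms per document, then filter by how they behave across the collection. The following is a minimal two-stage sketch (word n-gram candidates filtered by document frequency); the tokenization, n-gram length and threshold are illustrative assumptions, not the Document Explorer settings.

```python
# Minimal sketch of two-stage term extraction: candidate generation followed by
# collection-level frequency filtering.
import re
from collections import Counter

def candidates(text, max_n=3):
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def extract_terms(documents, min_doc_freq=2):
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(candidates(doc)))       # count each doc once per term
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}

docs = ["Joint venture announced between the two carmakers.",
        "The joint venture will build engines.",
        "Quarterly earnings fell sharply."]
print(sorted(t for t in extract_terms(docs) if " " in t))   # -> ['joint venture']
```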

Book
01 Oct 1998
TL;DR: Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems collects applications, case studies and software systems based on rough set methods for knowledge discovery.

Journal ArticleDOI
01 Mar 1998
TL;DR: In this article, a data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses, including characterization, comparison, association, classification, prediction, and clustering.
Abstract: Great efforts have been made in the Intelligent Database Systems Research Lab for the research and development of efficient data mining methods and the construction of on-line analytical data mining systems. Our work has been focused on the integration of data mining and OLAP technologies and the development of scalable, integrated, and multiple data mining functions. A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses. The system implements a wide spectrum of data mining functions, including characterization, comparison, association, classification, prediction, and clustering. It also builds up a user-friendly, interactive data mining environment and a set of knowledge visualization tools. In-depth research has been performed on the efficiency and scalability of data mining methods. Moreover, the research has been extended to spatial data mining, multimedia data mining, text mining, and Web mining, with several new data mining system prototypes constructed or under construction, including GeoMiner, MultiMediaMiner, and WebLogMiner. This article summarizes our research and development activities in the last several years and shares our experiences and lessons with the readers.

Journal ArticleDOI
01 Jun 1998
TL;DR: This paper describes the KDT system for Knowledge Discovery in Text, in which documents are labeled by keywords, and knowledge discovery is performed by analyzing the co-occurrence frequencies of the various keywords labeling the documents.
Abstract: Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Knowledge Discovery in Text, in which documents are labeled by keywords, and knowledge discovery is performed by analyzing the co-occurrence frequencies of the various keywords labeling the documents. We show how this keyword-frequency approach supports a range of KDD operations, providing a suitable foundation for knowledge discovery and exploration for collections of unstructured text.
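The KDT approach described above analyzes co-occurrence frequencies of the keywords labeling documents. A minimal sketch of that counting step follows; the threshold and the toy keyword sets are illustrative, not from the KDT system.

```python
# Minimal sketch of keyword co-occurrence counting over labelled documents:
# each document is a set of keyword labels, and we count how often pairs of
# keywords label the same document.
from collections import Counter
from itertools import combinations

def cooccurrence(labelled_docs, min_count=2):
    pair_counts = Counter()
    for keywords in labelled_docs:
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

docs = [{"grain", "usa", "export"},
        {"grain", "usa"},
        {"crude", "opec"},
        {"crude", "opec", "usa"}]
print(cooccurrence(docs))   # {('grain', 'usa'): 2, ('crude', 'opec'): 2}
```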

Book ChapterDOI
TL;DR: The ROSE software package is an interactive, modular system for data analysis and knowledge discovery based on rough set theory, running in 32-bit operating systems on PC computers, and includes generation of decision rules for classification systems.
Abstract: This paper briefly describes ROSE software package. It is an interactive, modular system designed for analysis and knowledge discovery based on rough set theory in 32-bit operating systems on PC computers. It implements classical rough set theory as well as its extension based on variable precision model. It includes generation of decision rules for classification systems and knowledge discovery.
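ROSE builds on classical rough set theory. As background for the entry above, here is a small sketch of the textbook lower/upper approximation computation that such systems start from; it is not ROSE code, and the example data are invented.

```python
# Minimal sketch of rough-set approximations: objects indiscernible on the
# chosen attributes form elementary sets; a decision class is described by its
# lower approximation (certainly in) and upper approximation (possibly in).
from collections import defaultdict

def approximations(objects, attributes, target_class):
    """objects: dict name -> dict of attribute values plus a 'decision' key."""
    blocks = defaultdict(set)                 # indiscernibility classes
    for name, row in objects.items():
        blocks[tuple(row[a] for a in attributes)].add(name)
    target = {n for n, r in objects.items() if r["decision"] == target_class}
    lower = set().union(*[b for b in blocks.values() if b <= target])
    upper = set().union(*[b for b in blocks.values() if b & target])
    return lower, upper

patients = {
    "p1": {"fever": "high", "cough": "yes", "decision": "flu"},
    "p2": {"fever": "high", "cough": "yes", "decision": "flu"},
    "p3": {"fever": "high", "cough": "no",  "decision": "flu"},
    "p4": {"fever": "high", "cough": "no",  "decision": "cold"},
}
print(approximations(patients, ["fever", "cough"], "flu"))
# lower = {'p1', 'p2'}, upper = {'p1', 'p2', 'p3', 'p4'}
```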

Journal ArticleDOI
01 Oct 1998
TL;DR: The paper describes the MIKE (Model-based and Incremental Knowledge Engineering) approach for developing knowledge-based systems, which integrates semiformal and formal specification techniques together with prototyping into a coherent framework.
Abstract: The paper describes the MIKE (Model-based and Incremental Knowledge Engineering) approach for developing knowledge-based systems. MIKE integrates semiformal and formal specification techniques together with prototyping into a coherent framework. All activities in the building process of a knowledge-based system are embedded in a cyclic process model. For the semiformal representation we use a hypermedia-based formalism which serves as a communication basis between expert and knowledge engineer during knowledge acquisition. The semiformal knowledge representation is also the basis for formalization, resulting in a formal and executable model specified in the Knowledge Acquisition and Representation Language (KARL). Since KARL is executable, the model of expertise can be developed and validated by prototyping. A smooth transition from a semiformal to a formal specification and further on to design is achieved because all the description techniques rely on the same conceptual model to describe the functional and nonfunctional aspects of the system. Thus, the system is thoroughly documented at different description levels, each of which focuses on a distinct aspect of the entire development effort. Traceability of requirements is supported by linking the different models to each other.

Book ChapterDOI
01 Jan 1998
TL;DR: This paper presents two examples of Text Mining tasks, association extraction and prototypical document extraction, along with several related NLP techniques.
Abstract: In the general framework of knowledge discovery, Data Mining techniques are usually dedicated to information extraction from structured databases. Text Mining techniques, on the other hand, are dedicated to information extraction from unstructured textual data and Natural Language Processing (NLP) can then be seen as an interesting tool for the enhancement of information extraction procedures. In this paper, we present two examples of Text Mining tasks, association extraction and prototypical document extraction, along with several related NLP techniques.

Journal ArticleDOI
TL;DR: This paper presents a case study of a machine-aided knowledge discovery process within the general area of drug design, and the Inductive Logic Programming (ILP) system progol is applied to the problem of identifying potential pharmacophores for ACE inhibition.
Abstract: This paper presents a case study of a machine-aided knowledge discovery process within the general area of drug design. Within drug design, the particular problem of pharmacophore discovery is isolated, and the Inductive Logic Programming (ILP) system progol is applied to the problem of identifying potential pharmacophores for ACE inhibition. The case study reported in this paper supports four general lessons for machine learning and knowledge discovery, as well as more specific lessons for pharmacophore discovery, for Inductive Logic Programming, and for ACE inhibition. The general lessons for machine learning and knowledge discovery are as follows. 1. An initial rediscovery step is a useful tool when approaching a new application domain. 2. General machine learning heuristics may fail to match the details of an application domain, but it may be possible to successfully apply a heuristic-based algorithm in spite of the mismatch. 3. A complete search for all plausible hypotheses can provide useful information to a user, although experimentation may be required to choose between competing hypotheses. 4. A declarative knowledge representation facilitates the development and debugging of background knowledge in collaboration with a domain expert, as well as the communication of final results.

Book ChapterDOI
23 Sep 1998
TL;DR: It is clarified that CKDD can be understood as a human-centered approach of Knowledge Discovery in Databases, which led to the software system TOSCANA, which is presented as a CKDD tool in this paper.
Abstract: In this paper we discuss Conceptual Knowledge Discovery in Databases (CKDD) as it is developing in the field of Conceptual Knowledge Processing (cf. [29],[30]). Conceptual Knowledge Processing is based on the mathematical theory of Formal Concept Analysis which has become a successful theory for data analysis during the last 18 years. This approach relies on the pragmatic philosophy of Ch.S. Peirce [15] who claims that we can only analyze and argue within restricted contexts where we always rely on pre-knowledge and common sense. The development of Formal Concept Analysis led to the software system TOSCANA, which is presented as a CKDD tool in this paper. TOSCANA is a flexible navigation tool that allows dynamic browsing through and zooming into the data. It supports the exploration of large databases by visualizing conceptual aspects inherent to the data. We want to clarify that CKDD can be understood as a human-centered approach of Knowledge Discovery in Databases. The actual discussion about human-centered Knowledge Discovery is therefore briefly summarized in Section 1.
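TOSCANA rests on Formal Concept Analysis, whose core objects are formal concepts: pairs of an extent (a set of objects) and an intent (the attributes they all share). The sketch below enumerates the concepts of a tiny binary context by brute force; it is for illustration only and is not TOSCANA's algorithm, which works far more efficiently.

```python
# Naive sketch of Formal Concept Analysis: enumerate (extent, intent) pairs of
# a small binary context by closing every attribute subset.
from itertools import chain, combinations

context = {                     # object -> set of attributes it has
    "lake":  {"natural", "stagnant"},
    "river": {"natural", "running"},
    "canal": {"artificial", "running"},
}
attributes = set().union(*context.values())

def extent(attrs):              # objects having all the attributes
    return frozenset(o for o, a in context.items() if attrs <= a)

def intent(objs):               # attributes shared by all the objects
    return frozenset(attributes if not objs
                     else set.intersection(*(context[o] for o in objs)))

concepts = set()
for attrs in chain.from_iterable(combinations(sorted(attributes), r)
                                 for r in range(len(attributes) + 1)):
    e = extent(set(attrs))
    concepts.add((e, intent(e)))
for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), sorted(i))
```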

Journal ArticleDOI
01 May 1998
TL;DR: CMM, a meta-learner that seeks to retain most of the accuracy gains of multiple model approaches, while still producing a single comprehensible model, is proposed and evaluated.
Abstract: If it is to qualify as knowledge, a learner's output should be accurate, stable and comprehensible. Learning multiple models can improve significantly on the accuracy and stability of single models, but at the cost of losing their comprehensibility when they possess it, as do, for example, simple decision trees and rule sets. This article proposes and evaluates CMM, a meta-learner that seeks to retain most of the accuracy gains of multiple model approaches, while still producing a single comprehensible model. CMM is based on reapplying the base learner to recover the frontiers implicit in the multiple model ensemble. This is done by giving the base learner a new training set, composed of a large number of examples generated and classified according to the ensemble, plus the original examples. CMM is evaluated using C4.5RULES as the base learner, and bagging as the multiple-model methodology. On 26 benchmark datasets, CMM retains on average 60% of the accuracy gains obtained by bagging relative to a single run of C4.5RULES, while producing a rule set whose complexity is typically a small multiple (2–6) of C4.5RULES's, and also improving stability. Further studies show that accuracy and complexity can be traded off by varying the number of artificial examples generated.
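The CMM recipe above (train an ensemble, label artificial examples with its vote, retrain a single model on original plus artificial data) is easy to sketch. The version below uses decision trees in place of C4.5RULES and an arbitrary sampling scheme; the sizes, noise scale, and depth are illustrative assumptions, not the paper's procedure.

```python
# Minimal sketch of a CMM-style meta-learner using scikit-learn decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy target

# 1. Bagging: bootstrap replicates, one tree per replicate.
ensemble = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))
    ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def ensemble_predict(X_new):
    votes = np.stack([t.predict(X_new) for t in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

# 2. Artificial examples drawn near the training data, labelled by the ensemble.
X_art = X[rng.integers(0, len(X), 1000)] + rng.normal(scale=0.3, size=(1000, 2))
y_art = ensemble_predict(X_art)

# 3. A single comprehensible model recovering the ensemble's frontiers.
cmm_tree = DecisionTreeClassifier(max_depth=5).fit(
    np.vstack([X, X_art]), np.concatenate([y, y_art]))
print("agreement with ensemble:",
      (cmm_tree.predict(X) == ensemble_predict(X)).mean())
```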

Proceedings Article
27 Aug 1998
TL;DR: In this paper, a knowledge discovery method for structured data is presented, where patterns reflect the one-to-many and many-to-many relationships of several tables, and background knowledge, represented in a uniform manner in some of the tables, has an essential role here, unlike in most data mining settings for the discovery of frequent patterns.
Abstract: The discovery of the relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity. This toxicology application was launched at IJCAI'97 as a research challenge for artificial intelligence. Our approach to the problem is descriptive rather than based on classification; the goal being to find common substructures and properties in chemical compounds, and in this way to contribute to scientific insight. This approach contrasts with previous machine learning research on this problem, which has mainly concentrated on predicting the toxicity of unknown chemicals. Our contribution to the field of data mining is the ability to discover useful frequent patterns that are beyond the complexity of association rules or their known variants. This is vital to the problem, which requires the discovery of patterns that are out of the reach of simple transformations to frequent itemsets. We present a knowledge discovery method for structured data, where patterns reflect the one-to-many and many-to-many relationships of several tables. Background knowledge, represented in a uniform manner in some of the tables, has an essential role here, unlike in most data mining settings for the discovery of frequent patterns.
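To ground the relational setting above, here is a toy sketch of counting frequent substructure patterns when atoms and bonds live in separate tables keyed by compound, with a pattern's support defined as the number of distinct compounds containing it. The tables and the single-bond pattern language are illustrative simplifications, not the paper's representation.

```python
# Minimal sketch of frequent-pattern counting over related tables.
from collections import Counter

# atom table: (compound, atom_id, element); bond table: (compound, atom1, atom2, type)
atoms = [("c1", 1, "C"), ("c1", 2, "O"), ("c2", 1, "C"), ("c2", 2, "O"),
         ("c3", 1, "N"), ("c3", 2, "C")]
bonds = [("c1", 1, 2, "double"), ("c2", 1, 2, "double"), ("c3", 1, 2, "single")]

element = {(c, a): e for c, a, e in atoms}

def bond_patterns(min_support=2):
    support, seen = Counter(), set()
    for c, a1, a2, btype in bonds:
        pattern = (tuple(sorted((element[c, a1], element[c, a2]))), btype)
        if (c, pattern) not in seen:          # count each compound once per pattern
            seen.add((c, pattern))
            support[pattern] += 1
    return {p: s for p, s in support.items() if s >= min_support}

print(bond_patterns())   # {(('C', 'O'), 'double'): 2}
```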

Journal ArticleDOI
Sung Ho Ha1, Sang Chan Park1
TL;DR: This paper presents the data mining process from data extraction to knowledge interpretation and data mining tasks, and corresponding algorithms, and proposes a new marketing strategy that fully utilizes the knowledge resulting from data mining.
Abstract: Data mining, which is also referred to as knowledge discovery in databases, is the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions. In this paper, we present the data mining process from data extraction to knowledge interpretation and data mining tasks, and corresponding algorithms. Before applying data mining techniques to a real-world application, we build a data mart on the enterprise Intranet. RFM (recency, frequency, and monetary) data extracted from the data mart are used extensively for our analysis. We then propose a new marketing strategy that fully utilizes the knowledge resulting from data mining.
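The paper above mines RFM (recency, frequency, monetary) data extracted from a data mart. A minimal sketch of deriving those three features from a raw transaction table follows; the field names, toy data, and reference date are illustrative assumptions.

```python
# Minimal sketch of computing RFM features per customer from transactions.
from datetime import date
from collections import defaultdict

transactions = [   # (customer, purchase date, amount)
    ("alice", date(1998, 1, 5), 120.0),
    ("alice", date(1998, 3, 2), 80.0),
    ("bob",   date(1997, 11, 20), 40.0),
]

def rfm(transactions, as_of=date(1998, 6, 30)):
    per_customer = defaultdict(list)
    for customer, day, amount in transactions:
        per_customer[customer].append((day, amount))
    table = {}
    for customer, rows in per_customer.items():
        last_purchase = max(day for day, _ in rows)
        table[customer] = {
            "recency_days": (as_of - last_purchase).days,
            "frequency": len(rows),
            "monetary": sum(amount for _, amount in rows),
        }
    return table

print(rfm(transactions))
```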

Journal ArticleDOI
TL;DR: The article describes the TA3IVF system, a case-based reasoning system which relies on context-based relevance assessment to assist in knowledge visualization, interactive data exploration and discovery in this domain.

Proceedings Article
01 Jan 1998
TL;DR: WYSIWYM editing is an alternative solution in which the texts used to view and edit the knowledge are generated not by the user but by the system, and each choice the user makes directly updates the knowledge base.
Abstract: Many kinds of knowledge-based system would be easier to develop and maintain if domain experts (as opposed to knowledge engineers) were in a position to define and edit the knowledge. From the viewpoint of domain experts, the best medium for defining the knowledge would be a text in natural language; however, natural language input cannot be decoded reliably unless written in controlled languages, which are difficult for domain experts to learn and use. WYSIWYM editing is an alternative solution in which the texts employed to view and edit the knowledge are generated not by the user but by the system. The user can add knowledge by clicking on 'anchors' in the text and choosing from a list of semantic alternatives; each choice directly updates the knowledge base, from which a new text is then generated.
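The generate-edit-regenerate loop described above can be illustrated with a very small sketch: a feedback text is generated from a partially filled knowledge structure, with unfilled slots rendered as anchors, and filling a slot updates the knowledge before the text is regenerated. The frame and slot names are invented for illustration and are not the cited system's knowledge model.

```python
# Minimal sketch of WYSIWYM-style feedback text generation with anchors.
procedure = {"action": "clean", "object": None, "instrument": None}

def render(kb):
    def slot(name):
        return kb[name] if kb[name] else f"[{name}]"     # anchor for an unfilled slot
    return f"To {kb['action']} {slot('object')}, use {slot('instrument')}."

print(render(procedure))                  # To clean [object], use [instrument].
procedure["object"] = "the wound"         # the user clicks the anchor and picks a value
print(render(procedure))                  # To clean the wound, use [instrument].
```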

Journal ArticleDOI
TL;DR: This paper reviews approaches for automated pattern spotting and knowledge discovery in spatially referenced data and indicates that there appears to be a general lack of understanding associated with the use and application of various clustering methods in the geographic domain.
Abstract: This paper reviews approaches for automated pattern spotting and knowledge discovery in spatially referenced data. This is an emerging field which to date has received developmental contributions primarily from researchers in statistics and knowledge discovery in databases (KDD). The field of geographical information systems (GIS) has, however, recognized its importance as a means for providing more exploratory analysis functionality. Tools based upon automated approaches that identify potentially important relationships in spatial data are essential in GIS in order to effectively deal with the increasing amounts of information being gathered. Clustering techniques are proving to be valuable, but there appears to be a general lack of understanding associated with the use and application of various clustering methods in the geographic domain. Further, there is little if any recognition of the relationships between clustering methods. As a result, the development of techniques known to be problematic or inf...

Proceedings Article
01 Jan 1998
TL;DR: Document Explorer is described, a tool that implements text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document.
Abstract: Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document. These terms plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This paper describes Document Explorer, our tool that implements text mining at the term level. It consists of a document retrieval module, which converts retrieved documents from their native formats into documents represented using the SGML mark-up language used by Document Explorer; a two-stage term-extraction approach, in which terms are first proposed in a term-generation stage, and from which a smaller set are then selected in a term-filtering stage in light of their frequencies of occurrence elsewhere in the collection; our taxonomy-creation tool, by which the user can help specify higher-level entities that inform the knowledge-discovery process; and our knowledge-discovery tools for the resulting term-labeled documents. Finally, we evaluate our approach on a collection of patent records as well as Reuters newswire stories. Our results confirm that text mining serves as a powerful technique to manage knowledge encapsulated in large document collections.