
Showing papers on "Knowledge extraction published in 1996"


Journal ArticleDOI
U.M. Fayyad
TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Abstract: Current computing and storage technology is rapidly outstripping society's ability to make meaningful use of the torrent of available data. Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.

4,806 citations


Journal ArticleDOI
TL;DR: An overview of this emerging field is provided, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases.
Abstract: ■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

4,782 citations



Journal ArticleDOI
TL;DR: In this paper, a survey of the available data mining techniques is provided and a comparative study of such techniques is presented, based on a database researcher's point-of-view.
Abstract: Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.

2,327 citations



Proceedings Article
02 Aug 1996
TL;DR: The KDD process and basic data mining algorithms are defined, links between data mining, knowledge discovery, and other related fields are described, and challenges facing practitioners in the field are analyzed.
Abstract: This paper presents a first step towards a unifying framework for Knowledge Discovery in Databases. We describe links between data mining, knowledge discovery, and other related fields. We then define the KDD process and basic data mining algorithms, discuss application issues and conclude with an analysis of challenges facing practitioners in the field.

865 citations


Journal ArticleDOI
TL;DR: The focus of the paper is on studying subjective measures of interestingness, which are classified into actionable and unexpected, and the relationship between them is examined.
Abstract: One of the central problems in the field of knowledge discovery is the development of good measures of interestingness of discovered patterns. Such measures of interestingness are divided into objective measures (those that depend only on the structure of a pattern and the underlying data used in the discovery process) and subjective measures (those that also depend on the class of users who examine the pattern). The focus of the paper is on studying subjective measures of interestingness. These measures are classified into actionable and unexpected, and the relationship between them is examined. The unexpected measure of interestingness is defined in terms of the belief system that the user has. Interestingness of a pattern is expressed in terms of how it affects the belief system. The paper also discusses how this unexpected measure of interestingness can be used in the discovery process.

746 citations
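
As a rough illustration of the subjective-interestingness idea above (not the authors' actual formulation), the sketch below scores a discovered rule as more unexpected the further its observed confidence lies from the confidence the user believes to hold; the rule encoding, belief table, and scoring function are all assumptions made for the example.

```python
# Illustrative sketch only: scoring a discovered pattern's "unexpectedness"
# against a user's belief system, loosely in the spirit of subjective
# interestingness measures. The rule encoding and the scoring formula
# are assumptions, not the paper's actual definitions.

def unexpectedness(rule, beliefs):
    """Return a score in [0, 1]: how strongly `rule` contradicts prior beliefs.

    rule    : (antecedent, consequent, observed_confidence)
    beliefs : dict mapping (antecedent, consequent) -> believed confidence
    """
    antecedent, consequent, observed = rule
    believed = beliefs.get((antecedent, consequent))
    if believed is None:
        return 0.0          # no prior belief, nothing to contradict
    return abs(observed - believed)

beliefs = {("frequent_flyer", "buys_upgrade"): 0.80}
rule = ("frequent_flyer", "buys_upgrade", 0.35)    # discovered from data
print(unexpectedness(rule, beliefs))               # 0.45 -> fairly unexpected
```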


Journal ArticleDOI
TL;DR: This article presents a comprehensive introduction to and summary of the main concepts and bibliography in the area of Data Mining, and can serve as a good starting point for newcomers to the field.
Abstract: The term knowledge discovery in databases or KDD, for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the “high-level” application of particular Data Mining (DM) methods (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Fayyad considers DM to be one of the phases of the KDD process. The DM phase concerns, mainly, the means by which patterns are extracted and enumerated from data. The literature is sometimes a source of confusion because the two terms are often used interchangeably, making it difficult to delimit each concept precisely (Benoît, 2002). Nowadays, the two terms are usually used interchangeably. Efforts are under way to create standards and rules in the field of DM, with great relevance being given to the subject of inductive databases (De Raedt, 2003; Imielinski & Mannila, 1996). Within the context of inductive databases, great relevance is given to the so-called DM languages. This article presents a comprehensive introduction to and summary of the main concepts and bibliography in the area of DM; thus, its main contribution is that it can serve as a good starting point for newcomers to the area. The remainder of the article is organized as follows. First, DM and the KDD process are introduced. Next, the main DM tasks, methods/algorithms, and models/patterns are organized and succinctly explained. SEMMA and CRISP-DM are then introduced and compared with KDD. A brief explanation of standards for DM is then presented. The article concludes with possible future research directions and conclusions.

570 citations


Proceedings Article
02 Aug 1996
TL;DR: Three field matching algorithms are described, one of which is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences, and their performance on real-world datasets is evaluated.
Abstract: To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are alternative designations of the same semantic entity. For example, the addresses Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0114, La Jolla, CA 92093 and UCSD, Computer Science and Engineering Department, CA 92093-0114 do designate the same department. This paper describes three field matching algorithms and evaluates their performance on real-world datasets. One proposed method is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences. Several applications of field matching in knowledge discovery are described briefly, including WEBFIND, a new software tool that discovers scientific papers published on the worldwide web. WEBFIND uses external information sources to guide its search for authors and papers. Like many other worldwide web tools, WEBFIND needs to solve the field matching problem in order to navigate between information sources.

557 citations
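
Because one of the evaluated field matchers is the Smith-Waterman algorithm, a minimal character-level version of its scoring recurrence is sketched below; the match/mismatch/gap parameters and the lowercasing step are illustrative choices, not the settings reported in the paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between strings a and b.

    A minimal character-level sketch; real field matchers also normalize
    case, abbreviations, and tokenization. Scoring parameters are
    illustrative defaults, not those used in the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

x = "Dept. of Comput. Sci. and Eng., University of California, San Diego"
y = "UCSD, Computer Science and Engineering Department"
print(smith_waterman(x.lower(), y.lower()))   # higher score = better local match
```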


Journal ArticleDOI
TL;DR: The concept of data mining as a querying process and the first steps toward efficient development of knowledge discovery applications are discussed.
Abstract: DATABASE MINING IS NOT SIMPLY ANOTHER buzzword for statistical data analysis or inductive learning. Database mining sets new challenges to database technology: new concepts and methods are needed for query languages, basic operations, and query processing strategies. The most important new component is the ad hoc nature of knowledge and data discovery (KDD) queries and the need for efficient query compilation into a multitude of existing and new data analysis methods. Hence, database mining builds upon the existing body of work in statistics and machine learning but provides completely new functionalities. The current generation of database systems is designed mainly to support business applications. The success of Structured Query Language (SQL) has capitalized on a small number of primitives sufficient to support a vast majority of such applications. Unfortunately, these primitives are not sufficient to capture the emerging family of new applications dealing with knowledge discovery. Most current KDD systems offer isolated discovery features using tree inducers, neural nets, and rule discovery algorithms. Such systems cannot be embedded into a large application and typically offer just one knowledge discovery feature. The concept of data mining as a querying process and the first steps toward efficient development of knowledge discovery applications are discussed.

547 citations


Journal ArticleDOI
TL;DR: An efficient algorithm called DMA (Distributed Mining of Association rules) is proposed; it generates a small number of candidate sets and requires only O(n) messages for support-count exchange per candidate set in a distributed database.
Abstract: Many sequential algorithms have been proposed for the mining of association rules. However, very little work has been done on mining association rules in distributed databases. A direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. In this study, an efficient algorithm called DMA (Distributed Mining of Association rules) is proposed. It generates a small number of candidate sets and requires only O(n) messages for support-count exchange for each candidate set, where n is the number of sites in a distributed database. The algorithm has been implemented on an experimental testbed, and its performance is studied. The results show that DMA has superior performance compared with the direct application of a popular sequential algorithm in distributed databases.
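
The headline property of DMA is that each candidate itemset needs only O(n) messages, one per site, to assemble its global support count. The toy sketch below simulates just that counting step with in-memory "sites"; DMA's candidate generation and pruning are not reproduced, and the data is invented.

```python
# Toy simulation of distributed support counting across n sites.
# Each site reports its local count for a candidate itemset once,
# so a candidate costs O(n) messages. This is only the counting step;
# DMA's candidate generation and pruning are not shown.

def local_support(transactions, itemset):
    """Count transactions at one site containing every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def global_support(sites, itemset):
    """Sum the per-site counts: one 'message' per site."""
    messages = [local_support(site, itemset) for site in sites]   # n messages
    return sum(messages)

sites = [
    [("bread", "milk"), ("bread", "beer")],          # site 1
    [("bread", "milk", "eggs"), ("milk",)],          # site 2
    [("bread", "milk")],                             # site 3
]
print(global_support(sites, ("bread", "milk")))      # -> 3
```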


Journal ArticleDOI
TL;DR: A knowledge-based framework for the creation of abstract, interval-based concepts from time-stamped clinical data, the knowledge-based temporal-abstraction (KBTA) method, is defined; the RESUME system implements the KBTA method.

Proceedings Article
01 Feb 1996
TL;DR: A method for discovering informative patterns from data, with which large databases can be reduced to only a few representative data entries, is presented; the generality of the framework makes it an attractive candidate for new applications in knowledge discovery.
Abstract: We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework also encompasses methods for cleaning databases containing corrupted data. Both on-line and off-line algorithms are proposed and experimentally checked on databases of handwritten images. The generality of the framework makes it an attractive candidate for new applications in knowledge discovery.

Journal ArticleDOI
TL;DR: The goal here is to provide a brief overview of the key issues in knowledge discovery in an industrial context and outline representative applications.
Abstract: Data is being collected and stored at a phenomenal rate. From the financial sector to telecommunications operations, companies increasingly rely on analysis of huge amounts of data to compete. Although ad hoc mixtures of statistical techniques and file management tools once sufficed for digging through mounds of corporate data, the size of modern data warehouses, the mission-critical nature of the data, and the speed with which analyses need to be made now call for a new approach. A new generation of techniques and tools is emerging to intelligently assist humans in analyzing mountains of data and finding critical nuggets of useful knowledge, and in some cases to perform analyses automatically. These techniques and tools are the subject of the growing field of knowledge discovery in databases (KDD) [5]. KDD is an umbrella term describing a variety of activities for making sense of data. We use the term to describe the overall process of finding useful patterns in data, including not only the data mining step of running specific discovery algorithms but also pre- and postprocessing and a host of other important activities. Our goal here is to provide a brief overview of the key issues in knowledge discovery in an industrial context and outline representative applications. The different data mining methods at the core of the KDD process can have different goals. In general, we distinguish two types: verification, in which the system is limited to verifying a user's hypothesis, and discovery, in which the system finds new patterns. Ad hoc techniques, no longer adequate for sifting through vast collections of data, are giving way to data mining and knowledge discovery for turning corporate data into competitive business advantage.

Journal Article
TL;DR: Knowledge discovery in databases (KDD) and data mining, as discussed by the authors, form the emerging field that is the subject of this paper; whether the context is business, medicine, science, or government, the datasets themselves (in raw form) are of little direct value.
Abstract: AS WE MARCH INTO THE AGE of digital information, the problem of data overload looms ominously ahead. Our ability to analyze and understand massive datasets lags far behind our ability to gather and store the data. A new generation of computational techniques and tools is required to support the extraction of useful knowledge from the rapidly growing volumes of data. These techniques and tools are the subject of the emerging field of knowledge discovery in databases (KDD) and data mining. Large databases of digital information are ubiquitous. Data from the neighborhood store’s checkout register, your bank’s credit card authorization device, records in your doctor’s office, patterns in your telephone calls, and many more applications generate streams of digital records archived in huge databases, sometimes in so-called data warehouses. Current hardware and database technology allow efficient and inexpensive reliable data storage and access. However, whether the context is business, medicine, science, or government, the datasets themselves (in raw form) are of little direct value. What is of value is the knowledge that can be inferred from the data and put to use. For example, the marketing database of a consumer ...

Proceedings Article
02 Aug 1996
TL;DR: A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases that provides a user-friendly, interactive data mining environment with good performance.
Abstract: A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level knowledge, and meta-rule guided mining, the system provides a user-friendly, interactive data mining environment with good performance.
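
Attribute-oriented induction, one of the techniques listed above, generalizes tuples by climbing a concept hierarchy and merging tuples that become identical, keeping a count. The sketch below shows that idea on an invented hierarchy and dataset; it is not DBMiner's implementation.

```python
from collections import Counter

# Minimal sketch of attribute-oriented induction: replace low-level values
# with their parents in a concept hierarchy, then merge identical tuples
# and keep a count. The hierarchy and data are invented for illustration.

hierarchy = {
    "city": {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"},
    "major": {"physics": "science", "biology": "science", "history": "arts"},
}

def generalize(rows, attrs):
    counts = Counter()
    for row in rows:
        g = tuple(hierarchy[a].get(v, v) for a, v in zip(attrs, row))
        counts[g] += 1
    return counts

students = [("Vancouver", "physics"), ("Toronto", "biology"),
            ("Seattle", "history"), ("Vancouver", "biology")]
for tup, n in generalize(students, ("city", "major")).items():
    print(tup, "count =", n)
# ('Canada', 'science') count = 3, ('USA', 'arts') count = 1
```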

Proceedings ArticleDOI
18 Jun 1996
TL;DR: An overview of the area of knowledge discovery in databases and data mining, which aims at semiautomatic tools for the analysis of large data sets, is given, and some of the research issues are presented, especially from the database angle.
Abstract: Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We give an overview of the area and present some of the research issues, especially from the database angle.

Proceedings Article
01 Feb 1996

Proceedings ArticleDOI
26 Feb 1996
TL;DR: The TASA (Telecommunication Network Alarm Sequence Analyzer) system for discovering and browsing knowledge from large alarm databases is described, built on the basis of viewing knowledge discovery as an interactive and iterative process, containing data collection, pattern discovery, rule postprocessing, etc.
Abstract: A telecommunication network produces large amounts of alarm data daily. The data contains hidden valuable knowledge about the behavior of the network. This knowledge can be used in filtering redundant alarms, locating problems in the network, and possibly in predicting severe faults. We describe the TASA (Telecommunication Network Alarm Sequence Analyzer) system for discovering and browsing knowledge from large alarm databases. The system is built on the basis of viewing knowledge discovery as an interactive and iterative process, containing data collection, pattern discovery, rule postprocessing, etc. The system uses a novel framework for locating frequently occurring episodes from sequential data. The TASA system offers a variety of selection and ordering criteria for episodes, and supports iterative retrieval from the discovered knowledge. This means that a large part of the iterative nature of the KDD process can be replaced by iteration in the rule postprocessing stage. The user interface is based on dynamically generated HTML. The system is in experimental use, and the results are encouraging: some of the discovered knowledge is being integrated into the alarm handling software of telecommunication operators.
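
The core analysis in TASA is locating frequently occurring episodes in alarm sequences. The sketch below counts, on invented data, how often one alarm type is followed by another within a time window, which conveys the flavor of episode counting; real frequent-episode mining over all candidate episodes is considerably more involved.

```python
# Simplified sketch of serial-episode counting in an alarm sequence:
# how many times does alarm A precede alarm B within `window` seconds?
# Frequent-episode mining (as in TASA) considers all candidate episodes
# with a window-based frequency; the data and counting rule here are toy.

def count_pairs(events, a, b, window):
    """events: list of (timestamp, alarm_type), sorted by timestamp."""
    count = 0
    for i, (t1, e1) in enumerate(events):
        if e1 != a:
            continue
        for t2, e2 in events[i + 1:]:
            if t2 - t1 > window:
                break
            if e2 == b:
                count += 1
                break                     # count each occurrence of `a` once
    return count

alarms = [(0, "link_down"), (5, "high_error_rate"), (30, "link_down"),
          (32, "high_error_rate"), (90, "link_down")]
print(count_pairs(alarms, "link_down", "high_error_rate", window=10))  # -> 2
```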

Proceedings Article
01 Feb 1996
TL;DR: The task of this chapter is to provide a perspective on statistical techniques applicable to KDD; accordingly, some major advances in statistics in the last few decades are reviewed.
Abstract: The quest to find models usefully characterizing data is a process central to the scientific method, and has been carried out on many fronts. Researchers from an expanding number of fields have designed algorithms to discover rules or equations that capture key relationships between variables in a database. The task of this chapter is to provide a perspective on statistical techniques applicable to KDD; accordingly, we review below some major advances in statistics in the last few decades. We next highlight some distinctives of what may be called a "statistical viewpoint." Finally we overview some influential classical and modern statistical methods for practical model induction. It would be unfortunate if the KDD community dismissed statistical methods on the basis of courses that they took on statistics several to many years ago. The following provides a rough chronology of "recent" significant contributions in statistics that are relevant to the KDD community. The noteworthy fact is that this time period coincides with the significant increases in computing horsepower and memory, powerful and expressive programming languages, and general accessibility to computing that has propelled us into ...

Proceedings Article
02 Aug 1996
TL;DR: FACT takes a query-centered view of knowledge discovery, in which a discovery request is viewed as a query over the implicit set of possible results supported by a collection of documents, and where background knowledge is used to specify constraints on the desired results of this query process.
Abstract: This paper describes the FACT system for knowledge discovery from text. It discovers associations (patterns of co-occurrence) amongst keywords labeling the items in a collection of textual documents. In addition, FACT is able to use background knowledge about the keywords labeling the documents in its discovery process. FACT takes a query-centered view of knowledge discovery, in which a discovery request is viewed as a query over the implicit set of possible results supported by a collection of documents, and where background knowledge is used to specify constraints on the desired results of this query process. Execution of a knowledge-discovery query is structured so that these background-knowledge constraints can be exploited in the search for possible results. Finally, rather than requiring a user to specify an explicit query expression in the knowledge-discovery query language, FACT presents the user with a simple-to-use graphical interface to the query language, with the language providing a well-defined semantics for the discovery actions performed by a user through the interface.
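
FACT's query-centered view amounts to mining keyword co-occurrence associations subject to background-knowledge constraints supplied with the query. The sketch below illustrates that idea with invented documents and a made-up constraint (the consequent must be a country name); it does not reproduce FACT's query language or semantics.

```python
from itertools import combinations
from collections import Counter

# Sketch of query-constrained keyword association discovery: find keyword
# pairs that co-occur in at least `minsup` documents, keeping only pairs
# whose second keyword satisfies a background-knowledge constraint.
# Document keywords and the constraint are invented for illustration.

docs = [
    {"crude oil", "Iran", "OPEC"},
    {"crude oil", "Iraq", "OPEC"},
    {"wheat", "USA"},
    {"crude oil", "Iran"},
]
countries = {"Iran", "Iraq", "USA"}          # assumed background knowledge

def constrained_pairs(docs, constraint, minsup):
    counts = Counter()
    for kws in docs:
        for a, b in combinations(sorted(kws), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return {(a, b): n for (a, b), n in counts.items()
            if n >= minsup and constraint(b)}

print(constrained_pairs(docs, lambda kw: kw in countries, minsup=2))
# -> {('crude oil', 'Iran'): 2}
```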

Journal ArticleDOI
TL;DR: Preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project are presented, which aimed to create graphs of domain-specific concepts and their weighted co-occurrence relationships for all major engineering domains.
Abstract: This research presents preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project. Using a variation of automatic thesaurus generation techniques, which we refer to as the concept space approach, we aimed to create graphs of domain-specific concepts (terms) and their weighted co-occurrence relationships for all major engineering domains. Merging these concept spaces and providing traversal paths across different concept spaces could potentially help alleviate the vocabulary (difference) problem evident in large-scale information retrieval. In order to address the scalability issue related to large-scale information retrieval and analysis for the current Illinois DLI project, we conducted experiments using the concept space approach on parallel supercomputers. Our test collection included computer science and electrical engineering abstracts extracted from the INSPEC database. The concept space approach called for extensive textual and statistical analysis (a form of knowledge discovery) based on automatic indexing and co-occurrence analysis algorithms, both previously tested in the biology domain. Initial testing results using a 512-node CM-5 and a 16-processor SGI Power Challenge were promising.
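
A concept space is essentially a graph of terms connected by weighted co-occurrence links. The sketch below builds such a graph from term sets using a simple asymmetric weight (co-occurrences of a and b divided by occurrences of a); the exact weighting used in the DLI work is not reproduced, and the data is invented.

```python
from collections import Counter
from itertools import combinations

# Sketch of building a weighted co-occurrence "concept space" from
# document term sets. The asymmetric weight W(a -> b) used here
# (co-occurrences of a and b divided by occurrences of a) is a common
# simplification, not necessarily the project's exact formula.

def concept_space(docs):
    term_freq, pair_freq = Counter(), Counter()
    for terms in docs:
        term_freq.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_freq[(a, b)] += 1
            pair_freq[(b, a)] += 1
    return {(a, b): n / term_freq[a] for (a, b), n in pair_freq.items()}

abstracts = [
    {"neural network", "backpropagation"},
    {"neural network", "circuit design"},
    {"circuit design", "VLSI"},
]
space = concept_space(abstracts)
print(space[("neural network", "backpropagation")])   # 0.5
print(space[("backpropagation", "neural network")])   # 1.0 (asymmetric)
```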

09 Nov 1996
TL;DR: The Co4 system is dedicated to the representation of formal knowledge in an object and task based manner and is fully interleaved with hyper-documents and thus provides integration of formal and informal knowledge.
Abstract: The Co4 system is dedicated to the representation of formal knowledge in an object and task based manner. It is fully interleaved with hyper-documents and thus provides integration of formal and informal knowledge. Moreover, consensus about the content of the knowledge bases is enforced with the help of a protocol for integrating knowledge through several levels of consensual knowledge bases. Co4 is presented here as addressing three claims about corporate memory: (1) it must be formalised to the greatest possible extent so that its semantics is clear and its manipulation can be automated; (2) it cannot be totally formalised and thus formal and informal knowledge must be organised such that they refer to each other; (3) in order to be useful, it must be accepted by the people involved (providers and users) and thus must be non-contradictory and consensual.


Dissertation
Xiaohua Hu
03 Oct 1996
TL;DR: The method is able to identify the essential subset of nonredundant attributes that determine the discovery task, can learn different kinds of knowledge rules efficiently from large databases with noisy data and in a dynamic environment, and can deal with databases with incomplete information.
Abstract: Knowledge discovery systems face challenging problems from real-world databases, which tend to be very large, redundant, noisy and dynamic. In this thesis, we develop an attribute-oriented rough set approach for knowledge discovery in databases. The method adopts the artificial intelligence "learning from examples" paradigm combined with rough set theory and database operations. The learning procedure consists of two phases: data generalization and data reduction. In data generalization, our method generalizes the data by performing attribute-oriented concept tree ascension, thus some undesirable attributes are removed and a set of tuples may be generalized to the same generalized tuple. The goal of data reduction is to find a minimal subset of interesting attributes that have all the essential information of the generalized relation; thus the minimal subset of the attributes can be used rather than the entire attribute set of the generalized relation. By removing those attributes which are not important and/or essential, the rules generated are more concise and efficacious. Our method integrates a variety of knowledge discovery algorithms, such as DBChar for deriving characteristic rules, DBClass for classification rules, DBDeci for decision rules, DBMaxi for maximal generalized rules, DMBkbs for multiple sets of knowledge rules, and DBTrend for data trend regularities, which permit a user to discover various kinds of relationships and regularities in the data. This integration inherits the advantages of the attribute-oriented induction model and rough set theory. Our method makes several contributions to KDD. A generalized rough set model is formally defined with the ability to handle statistical information and also consider the importance of attributes and objects in the databases. Our method is able to identify the essential subset of nonredundant attributes (factors) that determine the discovery task, can learn different kinds of knowledge rules efficiently from large databases with noisy data and in a dynamic environment, and can deal with databases with incomplete information. A prototype system, DBROUGH, was constructed under a Unix/C/Sybase environment. Our system implements a number of novel ideas. In our system, we use attribute-oriented induction rather than tuple-oriented induction, thus greatly improving the learning efficiency. By integrating rough set techniques into the learning procedure, the derived knowledge rules are particularly concise and pertinent, since only the attributes (factors) relevant and/or important to the learning task are considered. In our system, the combination of transition network and concept hierarchy provides a nice mechanism to handle the dynamic characteristics of data in the databases. For applications with noisy data, our system can generate multiple sets of knowledge rules through a decision matrix to improve the learning accuracy. The experiments using the NSERC information system illustrate the promise of attribute-oriented rough set learning for knowledge discovery in databases. (Abstract shortened by UMI.)
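
The data reduction phase searches for a minimal subset of attributes that preserves the information in the generalized relation. The greedy sketch below drops an attribute whenever doing so leaves no two rows that agree on the remaining condition attributes while disagreeing on the decision attribute; it is a toy illustration of reduct finding, not the DBROUGH algorithms, and the data is invented.

```python
# Greedy sketch of rough-set-style attribute reduction: drop an attribute
# if, after dropping it, no two rows agree on the remaining condition
# attributes while disagreeing on the decision attribute.

def consistent(rows, attrs, decision):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen and seen[key] != row[decision]:
            return False
        seen[key] = row[decision]
    return True

def greedy_reduct(rows, attrs, decision):
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if consistent(rows, trial, decision):
            kept = trial                  # attribute a was redundant
    return kept

data = [
    {"grant": "high", "area": "AI", "prov": "BC", "renewed": "yes"},
    {"grant": "high", "area": "DB", "prov": "ON", "renewed": "yes"},
    {"grant": "low",  "area": "AI", "prov": "BC", "renewed": "no"},
]
print(greedy_reduct(data, ["grant", "area", "prov"], "renewed"))  # ['grant']
```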

Journal ArticleDOI
TL;DR: This study shows that knowledge discovery substantially broadens the spectrum of intelligent query answering and may have deep implications on query answering in data- and knowledge-base systems.
Abstract: Knowledge discovery facilitates querying database knowledge and intelligent query answering in database systems. We investigate the application of discovered knowledge, concept hierarchies, and knowledge discovery tools for intelligent query answering in database systems. A knowledge-rich data model is constructed to incorporate discovered knowledge and knowledge discovery tools. Queries are classified into data queries and knowledge queries. Both types of queries can be answered directly by simple retrieval or intelligently by analyzing the intent of the query and providing generalized, neighborhood or associated information using stored or discovered knowledge. Techniques have been developed for intelligent query answering using discovered knowledge and/or knowledge discovery tools, including generalization, data summarization, concept clustering, rule discovery, query rewriting, deduction, lazy evaluation, application of multiple-layered databases, etc. Our study shows that knowledge discovery substantially broadens the spectrum of intelligent query answering and may have deep implications for query answering in data- and knowledge-base systems.

Proceedings Article
02 Aug 1996
TL;DR: This paper surveys the growing number of industrial applications of data mining and knowledge discovery, and describes some representative applications, and examines how to assess the potential of a knowledge discovery application.
Abstract: This paper surveys the growing number of industrial applications of data mining and knowledge discovery. We look at the existing tools, describe some representative applications, and discuss the major issues and problems for building and deploying successful applications and their adoption by business users. Finally, we examine how to assess the potential of a knowledge discovery application.

Journal ArticleDOI
01 Apr 1996
TL;DR: An extension of the conventional definition of mass functions in Evidence Theory, for use in Data Mining as a means to represent evidence of the existence of rules in the database, is suggested.
Abstract: Data Mining or Knowledge Discovery in Databases is currently one of the most exciting and challenging areas where database techniques are coupled with techniques from Artificial Intelligence and mathematical sub-disciplines to great potential advantage. It has been defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data. A lot of research effort is being directed towards building tools for discovering interesting patterns which are hidden below the surface in databases. However, most of the work being done in this field has been problem-specific and no general framework has yet been proposed for Data Mining. In this paper we seek to remedy this by proposing EDM — Evidence-based Data Mining — a general framework for Data Mining based on Evidence Theory. Having a general framework for Data Mining offers a number of advantages. It provides a common method for representing knowledge which allows prior knowledge from the user or knowledge discovered by another discovery process to be incorporated into the discovery process. A common knowledge representation also supports the discovery of meta-knowledge from knowledge discovered by different Data Mining techniques. Furthermore, a general framework can provide facilities that are common to most discovery processes, e.g. incorporating domain knowledge and dealing with missing values. The framework presented in this paper has the following additional advantages. The framework is inherently parallel. Thus, algorithms developed within this framework will also be parallel and will therefore be expected to be efficient for large data sets — a necessity as most commercial data sets, relational or otherwise, are very large. This is compounded by the fact that the algorithms are complex. Also, the parallelism within the framework allows its use in parallel, distributed and heterogeneous databases. The framework is easily updated and new discovery methods can be readily incorporated within the framework, making it ‘general’ in the functional sense in addition to the representational sense considered above. The framework provides an intuitive way of dealing with missing data during the discovery process using the concept of Ignorance borrowed from Evidence Theory. The framework consists of a method for representing data and knowledge, and methods for data manipulation or knowledge discovery. We suggest an extension of the conventional definition of mass functions in Evidence Theory for use in Data Mining, as a means to represent evidence of the existence of rules in the database. The discovery process within EDM consists of a series of operations on the mass functions. Each operation is carried out by an EDM operator. We provide a classification for the EDM operators based on the discovery functions performed by them and discuss aspects of the induction, domain and combination operator classes. The application of EDM to two separate Data Mining tasks is also addressed, highlighting the advantages of using a general framework for Data Mining in general and, in particular, using one that is based on Evidence Theory.
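
Since EDM represents evidence for rules as mass functions, the basic manipulation is combining mass functions from different sources of evidence. The sketch below implements standard Dempster combination over a two-element frame of discernment; the frame and the mass values are invented, and EDM's extended mass-function definition is not reproduced.

```python
from itertools import product

# Dempster's rule of combination for two mass functions over subsets of a
# frame of discernment, represented as frozensets. The example masses are
# invented; EDM's extended mass functions for rule evidence are not shown.

def combine(m1, m2):
    raw, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb               # mass assigned to the empty set
    norm = 1.0 - conflict
    return {s: w / norm for s, w in raw.items()}

R, NOT_R = frozenset({"rule holds"}), frozenset({"rule fails"})
THETA = R | NOT_R                             # total ignorance
m1 = {R: 0.6, THETA: 0.4}                     # evidence from source 1
m2 = {R: 0.5, NOT_R: 0.2, THETA: 0.3}         # evidence from source 2
print(combine(m1, m2))                        # mass on R rises to ~0.77
```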

01 Jan 1996
TL;DR: Examples are presented showing how the use of SHOE can support a new generation of knowledge-based search and knowledge discovery tools that operate on the World-Wide Web.
Abstract: This paper describes SHOE, a set of Simple HTML Ontology Extensions. SHOE allows World-Wide Web authors to annotate their pages with ontology-based knowledge about page contents. We present examples showing how the use of SHOE can support a new generation of knowledge-based search and knowledge discovery tools that operate on the World-Wide Web.