
Showing papers on "Knowledge extraction published in 2001"


Book ChapterDOI
03 Sep 2001
TL;DR: This work uses KDD to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae to predict novel gene functions, and learns rules which are accurate and biologically meaningful.
Abstract: The biological sciences are undergoing an explosion in the amount of available data. New data analysis methods are needed to deal with the data. We present work using KDD to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae to predict novel gene functions. The analysis of the data presented a number of challenges: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large amount of missing values. We developed resampling strategies and modified the algorithm C4.5 to deal with these problems. Rules were learnt which are accurate and biologically meaningful. The rules predict function of 83 putative genes of currently unknown function at an estimated accuracy of ≥ 80%.
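The resampling strategies and C4.5 modifications are described only at a high level above. The following is a minimal illustrative sketch (not the authors' modified C4.5) of the general idea: oversample sparsely populated classes, impute the many missing values, and fit a decision tree; the scikit-learn names and thresholds are assumptions of this sketch.

```python
# Illustrative sketch (not the authors' modified C4.5): oversample sparse
# classes and impute missing values before fitting a decision tree.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def fit_with_resampling(X, y, min_class_size=50, random_state=0):
    """Upsample sparsely populated classes, impute missing values, fit a tree."""
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_c, y_c = X[y == label], y[y == label]
        if len(y_c) < min_class_size:  # sparse class: sample with replacement
            X_c, y_c = resample(X_c, y_c, replace=True,
                                n_samples=min_class_size, random_state=random_state)
        X_parts.append(X_c)
        y_parts.append(y_c)
    X_r, y_r = np.vstack(X_parts), np.concatenate(y_parts)
    X_r = SimpleImputer(strategy="median").fit_transform(X_r)  # many missing values
    return DecisionTreeClassifier(min_samples_leaf=10,
                                  random_state=random_state).fit(X_r, y_r)
```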

714 citations


Book
03 Sep 2001
TL;DR: Leading researchers from the fields of data mining, data visualization, and statistics present findings organized around topics introduced in two recent international knowledge discovery and data mining workshops as formal chapters that together comprise a complete, cohesive body of research.
Abstract: From the Publisher: Mainstream data mining techniques significantly limit the role of human reasoning and insight. Likewise, in data visualization, the role of computational analysis is relatively small. The power demonstrated individually by these approaches to knowledge discovery suggests that somehow uniting the two could lead to increased efficiency and more valuable results. But is this true? How might it be achieved? And what are the consequences for data-dependent enterprises? Information Visualization in Data Mining and Knowledge Discovery is the first book to ask and answer these thought-provoking questions. It is also the first book to explore the fertile ground of uniting data mining and data visualization principles in a new set of knowledge discovery techniques. Leading researchers from the fields of data mining, data visualization, and statistics present findings organized around topics introduced in two recent international knowledge discovery and data mining workshops. Collected and edited by three of the area's most influential figures, these chapters introduce the concepts and components of visualization, detail current efforts to include visualization and user interaction in data mining, and explore the potential for further synthesis of data mining algorithms and data visualization techniques. This incisive, groundbreaking research is sure to wield a strong influence in subsequent efforts in both academic and corporate settings. Features: Details advances made by leading researchers from the fields of data mining, data visualization, and statistics. Provides a useful introduction to the science of visualization, sketches the current role for visualization in data mining, and then takes a long look into its mostly untapped potential. Presents the findings of recent international KDD workshops as formal chapters that together comprise a complete, cohesive body of research. Offers compelling and practical information for professionals and researchers in database technology, data mining, knowledge discovery, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, information retrieval, high-performance computing, and data visualization. Author Biography: Usama Fayyad is co-founder, president, and CEO of digiMine, a data warehousing and data mining ASP. Prior to digiMine, he founded and led Microsoft's Data Mining and Exploration Group, where he developed data mining prediction components for Microsoft Site Server and scalable algorithms for mining large databases. Georges G. Grinstein is a professor of computer science, director of the Institute for Visualization and Perception Research, and co-director of the Center for Bioinformatics and Computational Biology at the University of Massachusetts, Lowell. He is currently the chief technologist for AnVil Informatics, a data exploration company. Andreas Wierse is the managing director of VirCinity, a spin-off company of the Computing Centre of the University of Stuttgart. Previously, he worked at the Computer Centre, where he designed and implemented distributed data management for the COVISE visualization system and maintained a wide range of graphics workstations.

431 citations





Journal ArticleDOI
TL;DR: This work has developed a method that extends and transforms traditional author co-citation analysis by extracting structural patterns from the scientific literature and representing them in a 3D knowledge landscape.
Abstract: To make knowledge visualizations clear and easy to interpret, we have developed a method that extends and transforms traditional author co-citation analysis by extracting structural patterns from the scientific literature and representing them in a 3D knowledge landscape.

240 citations


BookDOI
01 Jan 2001
TL;DR: This volume serves as a comprehensive reference for graduate students, practitioners and researchers in KDD to report new developments and applications, to share hard-learned experiences in order to avoid similar pitfalls, and to shed light on the future development of instance selection.
Abstract: The ability to analyze and understand massive data sets lags far behind the ability to gather and store the data. To meet this challenge, knowledge discovery and data mining (KDD) is growing rapidly as an emerging field. However, no matter how powerful computers are now or will be in the future, KDD researchers and practitioners must consider how to manage ever-growing data which is, ironically, due to the extensive use of computers and ease of data collection with computers. Many different approaches have been used to address the data explosion issue, such as algorithm scale-up and data reduction. Instance, example, or tuple selection pertains to methods or algorithms that select or search for a representative portion of data that can fulfill a KDD task as if the whole data is used. Instance selection is directly related to data reduction and becomes increasingly important in many KDD applications due to the need for processing efficiency and/or storage efficiency. One of the major means of instance selection is sampling whereby a sample is selected for testing and analysis, and randomness is a key element in the process. Instance selection also covers methods that require search. Examples can be found in density estimation (finding the representative instances -- data points -- for a cluster); boundary hunting (finding the critical instances to form boundaries to differentiate data points of different classes); and data squashing (producing weighted new data with equivalent sufficient statistics). Other important issues related to instance selection extend to unwanted precision, focusing, concept drifts, noise/outlier removal, data smoothing, etc. Instance Selection and Construction for Data Mining brings researchers and practitioners together to report new developments and applications, to share hard-learned experiences in order to avoid similar pitfalls, and to shed light on the future development of instance selection. This volume serves as a comprehensive reference for graduate students, practitioners and researchers in KDD.
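As a concrete illustration of the sampling route to instance selection mentioned above, here is a minimal sketch (assuming a labelled NumPy dataset; not any specific method from the volume) that draws a class-stratified random subset so that downstream mining can run on the reduced data.

```python
# Minimal sketch of instance selection by stratified random sampling:
# keep a representative fraction of each class so downstream mining can
# operate on the reduced set in place of the full data set.
import numpy as np

def stratified_sample(X, y, fraction=0.1, rng=None):
    """Return a class-stratified random subset of (X, y)."""
    rng = np.random.default_rng(rng)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_keep = max(1, int(round(fraction * idx.size)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```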

228 citations


Book
01 Jan 2001
TL;DR: This monograph presents a data mining technique, heuristic measures of interestingness, and an interestingness framework, and evaluates them through experimental analyses.
Abstract: List of Figures. List of Tables. Preface. Acknowledgments. 1. Introduction. 2. Background and Related Work. 3. A Data Mining Technique. 4. Heuristic Measures of Interestingness. 5. An Interestingness Framework. 6. Experimental Analyses. 7. Conclusion. Appendices. Index.

221 citations


Journal ArticleDOI
TL;DR: A new extraction method is presented that captures nonmonotonic rules encoded in the network, and the method is proved to be sound.

217 citations


Journal ArticleDOI
01 Feb 2001
TL;DR: This correspondence introduces a general methodology for knowledge discovery in TSDB, based on signal processing techniques and the information-theoretic fuzzy approach to knowledge discovery, and demonstrates the approach on two types of time series: stock-market data and weather data.
Abstract: Adding the dimension of time to databases produces time series databases (TSDB) and introduces new aspects and difficulties to data mining and knowledge discovery. In this correspondence, we introduce a general methodology for knowledge discovery in TSDB. The process of knowledge discovery in TSDB includes cleaning and filtering of time series data, identifying the most important predicting attributes, and extracting a set of association rules that can be used to predict the time series behavior in the future. Our method is based on signal processing techniques and the information-theoretic fuzzy approach to knowledge discovery. The computational theory of perception (CTP) is used to reduce the set of extracted rules by fuzzification and aggregation. We demonstrate our approach on two types of time series: stock-market data and weather data.
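The information-theoretic fuzzy machinery itself is not reproduced here; the sketch below only illustrates the cleaning/filtering and discretization steps under simple assumptions (a univariate daily series, a moving-average filter, and a three-symbol alphabet of changes) over which association rules could then be mined.

```python
# Illustrative preprocessing for time-series rule mining (not the authors'
# information-theoretic fuzzy method): smooth the series with a moving
# average, then discretize day-to-day changes into symbols that simple
# association rules can range over, e.g. "down, down -> up".
import numpy as np

def moving_average(series, window=5):
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

def symbolize_changes(series, flat_threshold=0.001):
    """Map relative day-to-day changes to 'up', 'down' or 'flat'."""
    rel = np.diff(series) / series[:-1]
    return ["up" if r > flat_threshold else "down" if r < -flat_threshold else "flat"
            for r in rel]

prices = np.array([100.0, 101.2, 100.8, 102.5, 103.0, 102.7, 104.1, 105.0])
print(symbolize_changes(moving_average(prices, window=3)))
```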

192 citations


Patent
03 Aug 2001
TL;DR: A system and method for searching documents in a data source and, more particularly, for analyzing and clustering documents for a search engine.
Abstract: A system and method for searching documents in a data source and, more particularly, for analyzing and clustering documents for a search engine. The system and method include analyzing and processing documents to secure the infrastructure and standards for optimal document processing. By incorporating Computational Intelligence (CI) and statistical methods, the document information is analyzed and clustered using novel techniques for knowledge extraction. A comprehensive dictionary is built based on the keywords identified by these techniques from the entire text of the document. The text is parsed for keywords, the number of their occurrences, and the context in which each word appears in the documents. The whole document is identified by the knowledge that is represented in its contents. Based on such knowledge extracted from all the documents, the documents are clustered into meaningful groups in a catalog tree. The results of document analysis and clustering information are stored in a database.
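The patented Computational Intelligence techniques are not spelled out in the abstract; the following is a hedged sketch of the general keyword-dictionary-plus-clustering idea using off-the-shelf TF-IDF and k-means from scikit-learn, with toy documents standing in for a real collection.

```python
# Hedged sketch of keyword-based document clustering (not the patented
# method): build a term dictionary with TF-IDF weights and group the
# documents with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "rule induction and decision trees for classification",
    "clustering documents with keyword dictionaries",
    "association rules mined from sales transactions",
    "search engine indexing and document retrieval",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)           # keyword dictionary + weights
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(documents, labels):
    print(label, doc)
```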

184 citations


Book ChapterDOI
11 Oct 2001
TL;DR: The penetration of data warehouses into the management and exploitation of spatial databases is a major trend as it is for non-spatial databases.
Abstract: Recent years have witnessed major changes in the Geographic Information System (GIS) market, from technological offerings to user requests. For example, spatial databases used to be implemented in GISs or in Computer-Assisted Design (CAD) systems coupled with a Relational Data Base Management System (RDBMS). Today, spatial databases are also implemented in spatial extensions of universal servers, in spatial engine software components, in GIS web servers, in analytical packages using so-called 'data cubes' and in spatial data warehouses. Such databases are structured according to either a relational, object-oriented, multi-dimensional or hybrid paradigm. In addition, these offerings are integrated as a piece of the overall technological framework of the organization and they are implemented according to very diverse architectures responding to differing users' contexts: centralized vs distributed, thin-clients vs thick-clients, Local Area Network (LAN) vs intranets, spatial data warehouses vs legacy systems, etc. As one may say, 'Gone are the days of a spatial database implemented solely on a stand-alone GIS' (Bédard 1999). In fact, this evolution of the GIS market follows the general trends of mainstream Information Technologies (IT). Among all these possibilities, the penetration of data warehouses into the management and exploitation of spatial databases is a major trend, as it is for non-spatial databases. According to Rawling and Kucera (1997), 'the term Data Warehouse has become the hottest industry buzzword of the decade just behind Internet and information highway'. More specifically, this penetration of data warehouses allows developers to build new solutions geared towards one major need which has never been solved efficiently so far: to provide a unified view of dispersed heterogeneous databases in order to efficiently feed the decision-support tools used for strategic decision making. In fact, the data warehouse emerged as the unifying solution to a series of individual circumstances related to providing the necessary basis for global knowledge discovery. First, large organizations often have several departmental or application-oriented independent databases which may overlap in content. Usually, such systems work properly for day-to-day operational-level decisions. However, when one needs to obtain aggregated or summarized information integrating data from these different ...

Proceedings ArticleDOI
03 Jan 2001
TL;DR: The paper conceptualizes five types of knowledge maps that can be used in managing organizational knowledge and proposes a five-step procedure to implement knowledge maps in a corporate intranet.
Abstract: Establishes the conceptual and empirical basis for an innovative instrument of corporate knowledge management: the knowledge map. It begins by briefly outlining the rationale for knowledge mapping, i.e. providing a common context to access expertise and experience in large companies. It then conceptualizes five types of knowledge maps that can be used in managing organizational knowledge. They are: knowledge sources, assets, structures, applications and development maps. In order to illustrate these five types of maps, a series of examples is presented (from a multimedia agency, a consulting group, a market research firm and a medium-sized services company), and the advantages and disadvantages of the knowledge mapping technique for knowledge management are discussed. The paper concludes with a series of quality criteria for knowledge maps and proposes a five-step procedure to implement knowledge maps in a corporate intranet.

01 Jan 2001
TL;DR: This paper describes the integration of an interactive visualization user interface with a knowledge management tool called Protege, a general-purpose tool that allows domain experts to build knowledge-based systems by creating and modifying reusable ontologies and problem-solving methods.
Abstract: This paper describes the integration of an interactive visualization user interface with a knowledge management tool called Protege. Protege is a general-purpose tool that allows domain experts to build knowledge-based systems by creating and modifying reusable ontologies and problem-solving methods, and by instantiating ontologies to construct knowledge bases. The SHriMP (Simple Hierarchical Multi-Perspective) visualization technique was designed to enhance how people browse, explore and interact with complex information spaces. Although SHriMP is information independent, its primary use to date has been for visualizing and documenting software programs. The paper describes how we have applied software visualization techniques to more general knowledge domains. It is hoped that the integrated environment (called Jambalaya) will result in an easier to use and more powerful environment to support ontology evolution and knowledge acquisition. An example scenario of how Jambalaya can be applied to knowledge acquisition is provided.

Journal ArticleDOI
TL;DR: This paper deals with learning first-order logic rules from data lacking an explicit classification predicate, and describes a heuristic measure of confirmation, trading off novelty and satisfaction of the rule.
Abstract: This paper deals with learning first-order logic rules from data lacking an explicit classification predicate. Consequently, the learned rules are not restricted to predicate definitions as in supervised inductive logic programming. First-order logic offers the ability to deal with structured, multi-relational knowledge. Possible applications include first-order knowledge discovery, induction of integrity constraints in databases, multiple predicate learning, and learning mixed theories of predicate definitions and integrity constraints. One of the contributions of our work is a heuristic measure of confirmation, trading off novelty and satisfaction of the rule. The approach has been implemented in the Tertius system. The system performs an optimal best-first search, finding the k most confirmed hypotheses, and includes a non-redundant refinement operator to avoid duplicates in the search. Tertius can be adapted to many different domains by tuning its parameters, and it can deal either with individual-based representations by upgrading propositional representations to first-order, or with general logical rules. We describe a number of experiments demonstrating the feasibility and flexibility of our approach.
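Tertius' exact confirmation measure is not reproduced in the abstract; the snippet below is a hypothetical confirmation-style score for a propositional rule body -> head that trades off novelty (deviation of the body/head co-occurrence from independence) against satisfaction (few counter-instances), purely to illustrate the trade-off the paper describes.

```python
# Hypothetical confirmation-style score for a rule "body -> head", trading
# off novelty (how far body/head co-occurrence departs from independence)
# against satisfaction (how rarely the body holds while the head is false).
# An illustration only, not Tertius' exact measure.
def confirmation(n_total, n_body, n_head, n_body_and_head):
    p_body = n_body / n_total
    p_head = n_head / n_total
    p_both = n_body_and_head / n_total
    novelty = p_both - p_body * p_head          # > 0: more co-occurrence than chance
    counter = p_body - p_both                   # body true but head false
    satisfaction = 1.0 - (counter / p_body if p_body else 0.0)
    return novelty * satisfaction

# e.g. 1000 examples, body holds in 200, head in 300, both in 180
print(confirmation(1000, 200, 300, 180))
```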

Journal ArticleDOI
TL;DR: The Know‐Net solution is presented, which aims to innovatively fuse the process‐centred approach with the product‐centred approach by developing a knowledge asset‐centric design; it includes a theoretical framework, a corporate transformation and measurement method, and a software tool.
Abstract: Two main approaches to knowledge management (KM) have been followed by early adopters of the principle: the process‐centred approach, that mainly treats KM as a social communication process; and the product‐centred approach, that focuses on knowledge artefacts, their creation, storage and reuse in computer‐based corporate memories. This distinction is evident not only in KM implementations in companies, but also in supporting methodologies and tools. This paper presents the Know‐Net solution that aims to innovatively fuse the process‐centred approach with the product‐centred approach by developing a knowledge asset‐centric design. The Know‐Net solution includes a theoretical framework, a corporate transformation and measurement method and a software tool.

Book ChapterDOI
Luc Dehaspe, Hannu Toivonen
05 Oct 2001
TL;DR: Algorithms for relational association rule discovery that are well-suited for exploratory data mining are presented, which offer the flexibility required to experiment with examples more complex than feature vectors and patterns more complex than item sets.
Abstract: Within KDD, the discovery of frequent patterns has been studied in a variety of settings. In its simplest form, known from association rule mining, the task is to discover all frequent item sets, i.e., all combinations of items that are found in a sufficient number of examples. We present algorithms for relational association rule discovery that are well-suited for exploratory data mining. They offer the flexibility required to experiment with examples more complex than feature vectors and patterns more complex than item sets.
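For readers unfamiliar with the simplest setting mentioned above, here is a minimal level-wise (Apriori-style) search for frequent item sets; the relational generalization that the paper actually contributes is not shown, and the function name and threshold are illustrative.

```python
# Minimal Apriori-style level-wise search for frequent item sets (the
# simplest setting of frequent-pattern discovery; the relational
# generalization is not shown here).
from itertools import combinations

def frequent_itemsets(transactions, min_support=2):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # candidate generation: join surviving k-sets into (k+1)-sets
        keys = list(survivors)
        level = list({a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1})
    return frequent

print(frequent_itemsets([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 2))
```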

Journal ArticleDOI
Charu C. Aggarwal, Philip S. Yu
TL;DR: The problem of online mining of association rules in a large database of sales transactions is discussed, with the use of nonredundant association rules helping significantly in the reduction of irrelevant noise in the data mining process.
Abstract: We discuss the problem of online mining of association rules in a large database of sales transactions. The online mining is performed by preprocessing the data effectively in order to make it suitable for repeated online queries. We store the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output. The result is an online algorithm which is independent of the size of the transactional data and the size of the preprocessed data. The algorithm is almost instantaneous in the size of the output. The algorithm also supports techniques for quickly discovering association rules from large itemsets. The algorithm is capable of finding rules with specific items in the antecedent or consequent. These association rules are presented in a compact form, eliminating redundancy. The use of nonredundant association rules helps significantly in the reduction of irrelevant noise in the data mining process.
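A hedged sketch of the online-query idea follows, under the assumption that frequent item sets and their support counts have already been precomputed offline; answering a query then touches only this summary, not the raw transactions. This is not the paper's graph-theoretic algorithm, and the names are illustrative.

```python
# Hedged sketch of online rule queries over a precomputed summary: freq
# maps each frequent item set (frozenset) to its support count; a query
# derives rules with a required item in the antecedent above a minimum
# confidence without touching the transaction data.
def rules_with_item(freq, item, min_conf=0.6):
    """Yield (antecedent, consequent, confidence) triples containing `item`."""
    for itemset, support in freq.items():
        if item not in itemset or len(itemset) < 2:
            continue
        for cons in itemset:
            ante = itemset - {cons}
            if item not in ante:
                continue
            conf = support / freq[ante]
            if conf >= min_conf:
                yield ante, frozenset([cons]), conf

freq = {frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2}
for ante, cons, conf in rules_with_item(freq, "a", 0.5):
    print(set(ante), "->", set(cons), round(conf, 2))
```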

Proceedings ArticleDOI
29 Nov 2001
TL;DR: It is shown that the e-commerce domain can provide all the right ingredients for successful data mining and an integrated architecture for supporting this integration is described, which can dramatically reduce the pre-processing, cleaning, and data understanding effort in knowledge discovery projects.
Abstract: We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the Web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges.
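As a hypothetical illustration of collecting data at the application server layer rather than the Web server, the snippet below logs an event together with business metadata that a raw Web-server log would not contain; all names and fields are assumptions of this sketch, not the paper's architecture.

```python
# Hypothetical application-layer event logging: the application server
# records events with business metadata (session, product, price, query)
# in a line-oriented file ready for loading into a warehouse staging area.
import json, time

def log_event(stream, session_id, event_type, **metadata):
    record = {"ts": time.time(), "session": session_id,
              "event": event_type, **metadata}
    stream.write(json.dumps(record) + "\n")

with open("clickstream.jsonl", "a") as stream:
    log_event(stream, "sess-42", "add_to_cart", product_id="sku-123",
              price=19.99, search_query="hiking boots")
```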

Journal Article
TL;DR: This bibliography subsumes an earlier bibliography and shows that the value of investigating temporal, spatial and spatio-temporal data has been growing in both interest and applicability.
Abstract: Data mining and knowledge discovery have become important issues for research over the past decade. This has been caused not only by the growth in the size of datasets but also in the availability of otherwise unavailable datasets over the Internet and the increased value that organisations now place on the knowledge that can be gained from data analysis. It is therefore not surprising that the increased interest in temporal and spatial data has led also to an increased interest in mining such data. This bibliography subsumes an earlier bibliography and shows that the value of investigating temporal, spatial and spatio-temporal data has been growing in both interest and applicability.

Journal ArticleDOI
TL;DR: This study validated the predictive power of data mining algorithms by comparing the performance of logistic regression and two decision tree algorithms, CHAID (Chi-squared Automatic Interaction Detection) and C5.0 (a variant of C4.5), using the Korea Medical Insurance Corporation database.
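A comparison of this kind can be sketched as follows, with scikit-learn's CART-style tree standing in for CHAID/C5.0 (which scikit-learn does not provide) and synthetic data standing in for the insurance database; this illustrates the methodology, it does not reproduce the study.

```python
# Hedged sketch of comparing logistic regression with a decision tree by
# cross-validated accuracy on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(max_depth=5, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```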

Proceedings ArticleDOI
01 May 2001
TL;DR: The Epsilon Grid Order is proposed, a new algorithm for determining the similarity join of very large data sets, based on a particular sort order of the data points, obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically.
Abstract: The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
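The external sorting and scheduling that make the Epsilon Grid Order scale are not shown here; the following in-memory sketch only illustrates the underlying grid idea, namely that points need only be compared against points in the same or adjacent cells of side ε.

```python
# In-memory sketch of a grid-based similarity join: points are hashed into
# cells of side eps and only pairs from the same or adjacent cells are
# distance-checked. The external sorting and scheduling of the Epsilon
# Grid Order algorithm are not shown.
from collections import defaultdict
from itertools import product
import math

def similarity_join(points, eps):
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(math.floor(c / eps)) for c in p)
        grid[cell].append(p)
    dims = len(points[0])
    result = []
    for cell, bucket in grid.items():
        for offset in product((-1, 0, 1), repeat=dims):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            if neighbor < cell:                  # visit each cell pair only once
                continue
            other = bucket if neighbor == cell else grid.get(neighbor, [])
            for i, p in enumerate(bucket):
                start = i + 1 if neighbor == cell else 0
                for q in other[start:]:
                    if math.dist(p, q) <= eps:
                        result.append((p, q))
    return result

print(similarity_join([(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)], eps=0.1))
```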

Journal ArticleDOI
TL;DR: Electronic commerce is emerging as the killer domain for data-mining technology, and there is support for such a bold statement.
Abstract: Electronic commerce is emerging as the killer domain for data-mining technology. Is there support for such a bold statement? Data-mining technologies have been around for decades, without moving significantly beyond the domain of computer scientists, statisticians, and hard-core business analysts. Why are electronic commerce systems any different from other data-mining applications?

15 Jun 2001
TL;DR: The concept of a knowledge-rich context, its major types and its components, and a methodology for developing extraction tools that is based on lexical, grammatical and paralinguistic patterns are defined.
Abstract: Knowledge-rich contexts express conceptual information for a term. Terminographers need such contexts to construct definitions, and to acquire domain knowledge. This paper summarizes what we have learned about extracting knowledge-rich contexts semi-automatically. First, we define the concept of a knowledge-rich context, its major types and its components. Second, we describe a methodology for developing extraction tools that is based on lexical, grammatical and paralinguistic patterns. Third, we outline the most problematic research issues that must be addressed before semi-automatic knowledge extraction can become a fully mature field.
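The authors' pattern inventory is not reproduced in the abstract; the sketch below only shows the flavour of simple lexical extraction patterns (hyperonymy, meronymy, function) as regular expressions, with the pattern set and example sentences being illustrative assumptions.

```python
# Illustrative lexical patterns for spotting knowledge-rich contexts for a
# term (a hedged sketch; the authors' inventory is richer and also uses
# grammatical and paralinguistic cues).
import re

PATTERN_TEMPLATES = {
    "hyperonymy": r"\b{term}\b\s+(is|are)\s+(a|an)?\s*(kind|type|form)s?\s+of\b",
    "meronymy":   r"\b{term}\b\s+(consists of|is composed of|comprises)\b",
    "function":   r"\b{term}\b\s+(is used (to|for)|serves to)\b",
}

def knowledge_rich_contexts(term, sentences):
    """Yield (relation, sentence) pairs whose sentence matches a pattern for the term."""
    compiled = {rel: re.compile(tpl.format(term=re.escape(term)), re.IGNORECASE)
                for rel, tpl in PATTERN_TEMPLATES.items()}
    for sentence in sentences:
        for rel, regex in compiled.items():
            if regex.search(sentence):
                yield rel, sentence

sentences = ["A lexeme is a kind of abstract lexical unit.",
             "The corpus consists of medical finding reports."]
print(list(knowledge_rich_contexts("lexeme", sentences)))
```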

Journal ArticleDOI
01 Jun 2001
TL;DR: A prototype Knowledge Management System (KMS) that supports linking of artifacts to processes, flexible interaction and hypermedia services, distribution annotation and authoring as well as providing visibility to artifacts as they change over time is discussed.
Abstract: The Internet has led to the widespread trade of digital information products. These products exhibit unusual properties such as high fixed costs and near-zero marginal costs. They need to be developed on compressed time frames by spatially and temporally distributed teams, have short lifecycles, and high perishability. This paper addresses the challenges that information product development (IPD) teams face. Drawing on the knowledge intensive nature of IPD tasks, we identify potential solutions to these problems that can be provided by a knowledge management system. We discuss a prototype Knowledge Management System (KMS) that supports linking of artifacts to processes, flexible interaction and hypermedia services, distribution annotation and authoring as well as providing visibility to artifacts as they change over time. Using a case from the publishing industry, we illustrate how contextualized decision paths/traces provide a rich base of formal and informal knowledge that supports IPD teams.

Book ChapterDOI
TL;DR: Based on results about knowledge representation within the theoretical framework of Formal Concept Analysis, relatively small bases for association rules from which all rules can be deduced are presented.
Abstract: Association rules are used to investigate large databases. The analyst is usually confronted with large lists of such rules and has to find the most relevant ones for his purpose. Based on results about knowledge representation within the theoretical framework of Formal Concept Analysis, we present relatively small bases for association rules from which all rules can be deduced. We also provide algorithms for their calculation.

Journal ArticleDOI
TL;DR: The paper stresses the need for the closer integration of three largely disparate technologies: geographic visualization, knowledge discovery in databases, and geocomputation.
Abstract: This paper details the research agenda of the International Cartographic Association Commission on Visualization: Working Group on Database-Visualization Links. The paper stresses the need for the closer integration of three largely disparate technologies: geographic visualization, knowledge discovery in databases, and geocomputation. The introduction explains the meaning behind these terms, the ethos behind their practice, and their connections within the broad realm of knowledge construction activities. The state of the art is then described for different approaches to knowledge construction, concentrating where possible on visual and geographically oriented methods. From these sections, a research agenda is synthesized in the form of three sets of research questions addressing: (1) visual approaches to data mining; (2) visual support for knowledge construction and geocomputation; and (3) databases and data models that must be satisfied to make visually-led knowledge construction a reality in the geogra...


Proceedings Article
01 Jan 2001
TL;DR: A Semantic Annotation Tool for extraction of knowledge structures from web pages through the use of simple user-defined knowledge extraction patterns and to provide support for ontology population by using the information extraction component.
Abstract: This paper describes a Semantic Annotation Tool for extraction of knowledge structures from web pages through the use of simple user-defined knowledge extraction patterns. The semantic annotation tool contains: an ontology-based mark-up component which allows the user to browse and to mark-up relevant pieces of information; a learning component (Crystal from the University of Massachusetts at Amherst) which learns rules from examples; and an information extraction component which extracts the objects and the relations between these objects. Our final aim is to provide support for ontology population by using the information extraction component. Our system uses as its domain of study “KMi Planet”, a Web-based news server that helps to communicate relevant information between members in our institute.

Proceedings ArticleDOI
01 Dec 2001
TL;DR: The strong demands MEDSYNDIKATE poses to the availability of expressive knowledge sources are accounted for by two alternative approaches to (semi)automatic ontology engineering.
Abstract: MEDSYNDIKATE is a natural language processor for automatically acquiring knowledge from medical finding reports. The content of these documents is transferred to formal representation structures which constitute a corresponding text knowledge base. The system architecture integrates requirements from the analysis of single sentences, as well as those of referentially linked sentences forming cohesive texts. The strong demands MEDSYNDIKATE poses to the availability of expressive knowledge sources are accounted for by two alternative approaches to (semi)automatic ontology engineering. We also present data for the knowledge extraction performance of MEDSYNDIKATE for three major syntactic patterns in medical documents.

Journal Article
TL;DR: A rule discovery process that is based on rough set theory is discussed, using a slope-collapse database as an example showing how rules can be discovered from a large, real-life database.
Abstract: The knowledge discovery from real-life databases is a multi-phase process consisting of numerous steps, including attribute selection, discretization of realvalued attributes, and rule induction. In the paper, we discuss a rule discovery process that is based on rough set theory. The core of the process is a soft hybrid induction system called the Generalized Distribution Table and Rough Set System (GDT-RS) for discovering classification rules from databases with uncertain and incomplete data. The system is based on a combination of Generalization Distribution Table (GDT) and the Rough Set methodologies. In the preprocessing, two modules, i.e. Rough Sets with Heuristics (RSH) and Rough Sets with Boolean Reasoning (RSBR), are used for attribute selection and discretization of real-valued attributes, respectively. We use a slope-collapse database as an example showing how rules can be discovered from a large, real-life database.