
Showing papers on "Knowledge extraction published in 2000"


Journal ArticleDOI
TL;DR: Efficient algorithms are presented for discovering frequent itemsets, which form the compute-intensive phase of the association mining task, and the effect of combining different database layout schemes with the proposed decomposition and traversal techniques is examined.
Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. We present efficient algorithms for the discovery of frequent itemsets, which form the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.
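
To make the frequent-itemset phase concrete, here is a minimal level-wise (Apriori-style) sketch in Python. It is an editorial illustration, not the lattice-decomposition algorithms the abstract describes, and the toy transactions are invented.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for all itemsets with support >= min_support.

    A plain Apriori-style sketch; the paper's algorithms instead decompose
    the subset lattice into independent sublattices solved in memory.
    """
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    result, k = {}, 1
    while current:
        for s in current:
            result[s] = support(s)
        # Join frequent k-itemsets to form (k+1)-candidates, prune those
        # with an infrequent k-subset, then keep the frequent ones.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        current = [c for c in candidates
                   if all(frozenset(sub) in result for sub in combinations(c, k))
                   and support(c) >= min_support]
        k += 1
    return result

# Invented toy database of five transactions.
db = [{"bread", "milk"}, {"bread", "beer", "eggs"}, {"milk", "beer", "cola"},
      {"bread", "milk", "beer"}, {"bread", "milk", "cola"}]
for itemset, sup in sorted(frequent_itemsets(db, 3).items(), key=lambda kv: len(kv[0])):
    print(set(itemset), sup)
```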

1,637 citations


Book ChapterDOI
20 Aug 2000
TL;DR: In this paper, the authors introduce the concept of privacy preserving data mining, where two parties owning confidential databases wish to run a data mining algorithm on the union of their databases, without revealing any unnecessary information.
Abstract: In this paper we introduce the concept of privacy preserving data mining. In our model, two parties owning confidential databases wish to run a data mining algorithm on the union of their databases, without revealing any unnecessary information. This problem has many practical and important applications, such as in medical research with confidential patient records. Data mining algorithms are usually complex, especially as the size of the input is measured in megabytes, if not gigabytes. A generic secure multi-party computation solution, based on evaluation of a circuit computing the algorithm on the entire input, is therefore of no practical use. We focus on the problem of decision tree learning and use ID3, a popular and widely used algorithm for this problem. We present a solution that is considerably more efficient than generic solutions. It demands very few rounds of communication and reasonable bandwidth. In our solution, each party performs by itself a computation of the same order as computing the ID3 algorithm for its own database. The results are then combined using efficient cryptographic protocols, whose overhead is only logarithmic in the number of transactions in the databases. We feel that our result is a substantial contribution, demonstrating that secure multi-party computation can be made practical, even for complex problems and large inputs.
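
The quantity the protocol computes securely is the ordinary ID3 split-selection criterion (information gain) over the union of the two databases. The sketch below computes that criterion in the clear, purely to show what is being protected; it is not the cryptographic protocol, and the attribute names and records are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(records, attributes, target):
    """Pick the attribute with the highest information gain (plain ID3 step).

    In the privacy-preserving setting these counts come from the union of
    both parties' databases and are never revealed in the clear.
    """
    base = entropy([r[target] for r in records])

    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in records}:
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)

# Invented records, standing in for the merged databases of the two parties.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
]
print(best_split(rows, ["outlook", "windy"], "play"))
```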

995 citations


Book ChapterDOI
02 Oct 2000
TL;DR: This work describes the Protégé-2000 knowledge model, which makes the import and export of knowledge bases from and to other knowledge-base servers easy, and demonstrates that many of the differences between the knowledge models of Protégé-2000 and Resource Description Framework (RDF)--a system for annotating Web pages with knowledge elements--can be resolved by defining a new metaclass set.
Abstract: Knowledge-based systems have become ubiquitous in recent years. Knowledge-base developers need to be able to share and reuse knowledge bases that they build. Therefore, interoperability among different knowledge-representation systems is essential. The Open Knowledge-Base Connectivity protocol (OKBC) is a common query and construction interface for frame-based systems that facilitates this interoperability. Protégé-2000 is an OKBC-compatible knowledge-base-editing environment developed in our laboratory. We describe the Protégé-2000 knowledge model that makes the import and export of knowledge bases from and to other knowledge-base servers easy. We discuss how the requirements of being a usable and configurable knowledge-acquisition tool affected our decisions in the knowledge-model design. Protégé-2000 also has a flexible metaclass architecture which provides configurable templates for new classes in the knowledge base. The use of metaclasses makes Protégé-2000 easily extensible and enables its use with other knowledge models. We demonstrate that we can resolve many of the differences between the knowledge models of Protégé-2000 and Resource Description Framework (RDF)--a system for annotating Web pages with knowledge elements--by defining a new metaclass set. Resolving the differences between the knowledge models in a declarative way enables easy adaptation of Protégé-2000 as an editor for other knowledge-representation systems.

754 citations


Journal Article
TL;DR: This paper introduces the concept of privacy preserving data mining, and presents a solution that is considerably more efficient than generic solutions, and demonstrates that secure multi-party computation can be made practical, even for complex problems and large inputs.
Abstract: In this paper we introduce the concept of privacy preserving data mining. In our model, two parties owning confidential databases wish to run a data mining algorithm on the union of their databases, without revealing any unnecessary information. This problem has many practical and important applications, such as in medical research with confidential patient records. Data mining algorithms are usually complex, especially as the size of the input is measured in megabytes, if not gigabytes. A generic secure multi-party computation solution, based on evaluation of a circuit computing the algorithm on the entire input, is therefore of no practical use. We focus on the problem of decision tree learning and use ID3, a popular and widely used algorithm for this problem. We present a solution that is considerably more efficient than generic solutions. It demands very few rounds of communication and reasonable bandwidth. In our solution, each party performs by itself a computation of the same order as computing the ID3 algorithm for its own database. The results are then combined using efficient cryptographic protocols, whose overhead is only logarithmic in the number of transactions in the databases. We feel that our result is a substantial contribution, demonstrating that secure multi-party computation can be made practical, even for complex problems and large inputs.

669 citations


01 Jan 2000
TL;DR: A text mining framework consisting of two components: Text refining that transforms unstructured text documents into an intermediate form; and knowledge distillation that deduces patterns or knowledge from the intermediate form is presented.
Abstract: Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value. At last count, more than ten high-tech companies were offering products for text mining. Has text mining evolved so rapidly as to become a mature field? This article attempts to shed some light on the question. We first present a text mining framework consisting of two components: text refining, which transforms unstructured text documents into an intermediate form; and knowledge distillation, which deduces patterns or knowledge from the intermediate form. We then survey the state-of-the-art text mining products/applications and align them based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt. In conclusion, we highlight the upcoming challenges of text mining and the opportunities it offers.

560 citations


Journal ArticleDOI
TL;DR: The growing self-organizing map (GSOM) is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated.
Abstract: The growing self-organizing map (GSOM) algorithm is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated. The spread factor is independent of the dimensionality of the data and as such can be used as a controlling measure for generating maps with different dimensionality, which can then be compared and analyzed with better accuracy. The spread factor is also presented as a method of achieving hierarchical clustering of a data set with the GSOM. Such hierarchical clustering allows the data analyst to identify significant and interesting clusters at a higher level of the hierarchy, and continue with finer clustering of the interesting clusters only. Therefore, only a small map is created in the beginning with a low spread factor, which can be generated for even a very large data set. Further analysis is conducted on selected sections of the data and of smaller volume. Therefore, this method facilitates the analysis of even very large data sets.
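
For orientation, the growth threshold that the spread factor controls is usually given in the GSOM literature as GT = -D ln(SF), where D is the data dimensionality and SF is the spread factor in (0, 1); the small sketch below assumes that relation rather than restating this article.

```python
import math

def growth_threshold(dimension, spread_factor):
    """GT = -D * ln(SF), as commonly stated for the GSOM.

    A low spread factor yields a high threshold, so few nodes grow and a
    coarse map results; a high spread factor yields a finer, larger map.
    """
    return -dimension * math.log(spread_factor)

def should_grow(accumulated_error, dimension, spread_factor):
    """A boundary node spawns new neighbours once its accumulated
    quantization error exceeds the growth threshold."""
    return accumulated_error > growth_threshold(dimension, spread_factor)

# Hierarchical use on the same data set: a coarse pass, then a finer pass.
for sf in (0.1, 0.9):
    print(sf, round(growth_threshold(dimension=20, spread_factor=sf), 2))
```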

529 citations


Journal ArticleDOI
TL;DR: KDD-Cup 2000, the yearly competition in data mining, is described, for the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found.
Abstract: We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found. We chronicle the data generation phase starting from the collection at the site through its conversion to a star schema in a warehouse through data cleansing, data obfuscation for privacy protection, and data aggregation. We describe the information given to the participants, including the questions, site structure, the marketing calendar, and the data schema. Finally, we discuss interesting insights, common mistakes, and lessons learned. Three winners were announced and they describe their own experiences and lessons in the pages following this paper.

303 citations


Book ChapterDOI
Bamshad Mobasher, Honghua Dai, Tao Luo, Yuqing Sun, Jiang Zhu
04 Sep 2000
TL;DR: This paper presents a framework for Web usage mining, distinguishing between the offline tasks of data preparation and mining, and the online process of customizing Web pages based on a user's active session, and describes effective techniques based on clustering to obtain a uniform representation for both site usage and site content profiles.
Abstract: Recent proposals have suggested Web usage mining as an enabling mechanism to overcome the problems associated with more traditional Web personalization techniques such as collaborative or content-based filtering. These problems include lack of scalability, reliance on subjective user ratings or static profiles, and the inability to capture a richer set of semantic relationships among objects (in content-based systems). Yet, usage-based personalization can be problematic when little usage data is available pertaining to some objects or when the site content changes regularly. For more effective personalization, both usage and content attributes of a site must be integrated into a Web mining framework and used by the recommendation engine in a uniform manner. In this paper we present such a framework, distinguishing between the offline tasks of data preparation and mining, and the online process of customizing Web pages based on a user's active session. We describe effective techniques based on clustering to obtain a uniform representation for both site usage and site content profiles, and we show how these profiles can be used to perform real-time personalization.
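
The online step the abstract describes (matching a user's active session against profiles derived offline) can be sketched roughly as follows; the profile data, URLs, and weighting are invented for illustration, and the real framework derives the profiles by clustering usage and content data.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two {page: weight} dictionaries."""
    dot = sum(a[p] * b.get(p, 0.0) for p in a)
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(active_session, profiles, top_n=3):
    """Online personalization step: find the closest offline profile and
    recommend its highest-weighted pages not yet visited in this session."""
    best = max(profiles, key=lambda prof: cosine(active_session, prof))
    candidates = {p: w for p, w in best.items() if p not in active_session}
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

# Invented aggregate profiles, each mapping a URL to its weight in the cluster.
profiles = [
    {"/home": 0.9, "/laptops": 0.8, "/laptops/specs": 0.6, "/checkout": 0.3},
    {"/home": 0.9, "/support": 0.8, "/support/drivers": 0.7},
]
print(recommend({"/home": 1.0, "/laptops": 1.0}, profiles))
```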

293 citations


Journal ArticleDOI
01 Feb 2000
TL;DR: WaveCluster is proposed, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements and can effectively identify arbitrarily shaped clusters at different degrees of detail.
Abstract: Many applications require the management of spatial data in a multidimensional feature space. Clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the noise (outliers) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large datasets are presented, which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods.
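
A drastically simplified two-dimensional sketch of the idea (quantize points onto a grid, smooth the density with one level of Haar averaging, and take dense connected cells as clusters) is shown below; the grid size, threshold, and data are invented, and the actual algorithm applies full multiresolution wavelet transforms.

```python
import numpy as np
from scipy import ndimage

def wavecluster_2d(points, grid=32, density_threshold=2.0):
    """Toy WaveCluster-like clustering of 2-D points."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    cells = ((pts - lo) / (hi - lo + 1e-12) * (grid - 1)).astype(int)
    density = np.zeros((grid, grid))
    for x, y in cells:
        density[x, y] += 1
    # One level of Haar averaging = the mean over 2x2 blocks (the LL band).
    ll = density.reshape(grid // 2, 2, grid // 2, 2).mean(axis=(1, 3))
    labels, n_clusters = ndimage.label(ll >= density_threshold)
    return labels, n_clusters

# Invented data: two dense blobs plus a sprinkling of uniform noise.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([2, 2], 0.3, (200, 2)),
                  rng.normal([8, 8], 0.3, (200, 2)),
                  rng.uniform(0, 10, (50, 2))])
print(wavecluster_2d(data)[1])  # number of clusters found
```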

279 citations


Book
01 Dec 2000
TL;DR: This book presents rough set methods and applications for knowledge discovery in information systems, and closes with a discussion of rough sets and rough logic from a KDD perspective.
Abstract: 1. Introduction.- Introducing the Book.- 1. A Rough Set Perspective on Knowledge Discovery in Information Systems: An Essay on the Topic of the Book.- 2. Methods and Applications: Reducts, Similarity, Mereology.- 2. Rough Set Algorithms in Classification Problem.- 3. Rough Mereology in Information Systems. A Case Study: Qualitative Spatial Reasoning.- 4. Knowledge Discovery by Application of Rough Set Models.- 5. Various Approaches to Reasoning with Frequency Based Decision Reducts: A Survey.- 3. Methods and Applications: Regular Pattern Extraction, Concurrency.- 6. Regularity Analysis and its Applications in Data Mining.- 7. Rough Set Methods for the Synthesis and Analysis of Concurrent Processes.- 4. Methods and Applications: Algebraic and Statistical Aspects, Conflicts, Incompleteness.- 8. Conflict Analysis.- 9. Logical and Algebraic Techniques for Rough Set Data Analysis.- 10. Statistical Techniques for Rough Set Data Analysis.- 11. Data Mining in Incomplete Information Systems from Rough Set Perspective.- 5. Afterword.- 12. Rough Sets and Rough Logic: A KDD Perspective.- Appendix: Selected Bibliography on Rough Sets.

272 citations


Proceedings ArticleDOI
03 Oct 2000
TL;DR: This initial study develops a method to identify and extract cause-effect information that is explicitly expressed in medical abstracts in the Medline database, using a set of constructed graphical patterns matched against the syntactic parse trees of sentences.
Abstract: This paper reports the first part of a project that aims to develop a knowledge extraction and knowledge discovery system that extracts causal knowledge from textual databases. In this initial study, we develop a method to identify and extract cause-effect information that is explicitly expressed in medical abstracts in the Medline database. A set of graphical patterns were constructed that indicate the presence of a causal relation in sentences, and which part of the sentence represents the cause and which part represents the effect. The patterns are matched with the syntactic parse trees of sentences, and the parts of the parse tree that match with the slots in the patterns are extracted as the cause or the effect.
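
As a rough, surface-level illustration of cue-phrase-based cause-effect extraction, the sketch below matches a couple of invented lexical patterns against raw sentences; the paper itself matches richer graphical patterns against full syntactic parse trees, which this sketch does not attempt.

```python
import re

# Invented surface patterns; the paper matches graphical patterns on parse trees.
CAUSAL_PATTERNS = [
    re.compile(r"(?P<cause>.+?)\s+(?:causes?|leads? to|results? in)\s+(?P<effect>.+)", re.I),
    re.compile(r"(?P<effect>.+?)\s+(?:is|are|was|were) caused by\s+(?P<cause>.+)", re.I),
]

def extract_cause_effect(sentence):
    """Return a (cause, effect) pair if a causal cue phrase is found, else None."""
    for pattern in CAUSAL_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("cause").strip(" ."), match.group("effect").strip(" .")
    return None

print(extract_cause_effect("Prolonged aspirin use causes gastric bleeding."))
print(extract_cause_effect("The rash was caused by the new antibiotic."))
```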

Book
01 Jan 2000
TL;DR: This book offers a new mathematical model of knowledge that is general and expressive yet more workable in practice than previous models, and presents a style of semantic argument and formal analysis that would be cumbersome or completely impractical with other approaches.
Abstract: The idea of knowledge bases lies at the heart of symbolic, or "traditional," artificial intelligence. A knowledge-based system decides how to act by running formal reasoning procedures over a body of explicitly represented knowledge -- a knowledge base. The system is not programmed for specific tasks; rather, it is told what it needs to know and expected to infer the rest. This book is about the logic of such knowledge bases. It describes in detail the relationship between symbolic representations of knowledge and abstract states of knowledge, exploring along the way the foundations of knowledge, knowledge bases, knowledge-based systems, and knowledge representation and reasoning. Assuming some familiarity with first-order predicate logic, the book offers a new mathematical model of knowledge that is general and expressive yet more workable in practice than previous models. The book presents a style of semantic argument and formal analysis that would be cumbersome or completely impractical with other approaches. It also shows how to treat a knowledge base as an abstract data type, completely specified in an abstract way by the knowledge-level operations defined over it.

Journal ArticleDOI
TL;DR: A data mining technique that integrates neural networks, case-based reasoning, and rule-based reasoning is proposed; it searches unstructured customer service records for machine fault diagnosis.

Journal ArticleDOI
TL;DR: This paper describes a particular knowledge discovery algorithm, Genetic Programming (GP), and presents in great detail an augmented version, dimensionally aware GP, which is arguably more useful in the process of scientific discovery; it concludes with an application of dimensionally aware GP to the induction of an empirical relationship describing the additional resistance to flow induced by flexible vegetation.
Abstract: Present day instrumentation networks already provide immense quantities of data, very little of which provides any insights into the basic physical processes that are occurring in the measured medium. This is to say that the data by itself contributes little to the knowledge of such processes. Data mining and knowledge discovery aim to change this situation by providing technologies that will greatly facilitate the mining of data for knowledge. In this new setting the role of a human expert is to provide domain knowledge, interpret models suggested by the computer and devise further experiments that will provide even better data coverage. Clearly, there is an enormous amount of knowledge and understanding of physical processes that should not be just thrown away. Consequently, we strongly believe that the most appropriate way forward is to combine the best of the two approaches: theory-driven, understanding-rich with data-driven discovery process. This paper describes a particular knowledge discovery algorithm—Genetic Programming (GP). Additionally, an augmented version of GP—dimensionally aware GP—which is arguably more useful in the process of scientific discovery is described in great detail. Finally, the paper concludes with an application of dimensionally aware GP to a problem of induction of an empirical relationship describing the additional resistance to flow induced by flexible vegetation.

Journal ArticleDOI
01 Sep 2000
TL;DR: A textual data mining architecture that extends a classic paradigm for knowledge discovery in databases is introduced, and a broad view of data mining—the process of discovering patterns in large collections of data—is described in some detail.
Abstract: This paper surveys applications of data mining techniques to large text collections, and illustrates how those techniques can be used to support the management of science and technology research. Specific issues that arise repeatedly in the conduct of research management are described, and a textual data mining architecture that extends a classic paradigm for knowledge discovery in databases is introduced. That architecture integrates information retrieval from text collections, information extraction to obtain data from individual texts, data warehousing for the extracted data, data mining to discover useful patterns in the data, and visualization of the resulting patterns. At the core of this architecture is a broad view of data mining—the process of discovering patterns in large collections of data—and that step is described in some detail. The final section of the paper illustrates how these ideas can be applied in practice, drawing upon examples from the recently completed first phase of the textual data mining program at the Office of Naval Research. The paper concludes by identifying some research directions that offer significant potential for improving the utility of textual data mining for research management applications.

Journal ArticleDOI
TL;DR: KRAFT uses an open and flexible agent architecture in which knowledge sources, knowledge fusing entities and users are all represented by independent KRAFT agents, communicating using a messaging protocol.
Abstract: This paper describes the Knowledge Reuse And Fusion/Transformation (KRAFT) architecture which supports the fusion of knowledge from multiple, distributed, heterogeneous sources. The architecture uses constraints as a common knowledge interchange format, expressed against a common ontology. Knowledge held in local sources can be transformed into a common constraint language, and fused with knowledge from other sources. The fused knowledge is then used to solve some problem or deliver some information to a user. Problem solving in KRAFT typically exploits pre-existing constraint solvers. KRAFT uses an open and flexible agent architecture in which knowledge sources, knowledge fusing entities and users are all represented by independent KRAFT agents, communicating using a messaging protocol. Facilitator agents perform matchmaking and brokerage services between the various kinds of agent. KRAFT is being applied to an example application in the domain of network data services design.

Journal ArticleDOI
TL;DR: Engineers are increasingly turning to design repositories as knowledge bases to help them represent, capture, share and reuse corporate design knowledge; the paper discusses the NIST Design Repository Project.
Abstract: Driven by pressure to reduce product development time, industry has started looking for new ways to exploit stores of engineering artifact knowledge. Engineers are increasingly turning to design repositories as knowledge bases to help them represent, capture, share and reuse corporate design knowledge. The paper discusses the NIST Design Repository Project.

Proceedings ArticleDOI
04 Dec 2000
TL;DR: The hypothesis of this paper is that knowledge components and knowledge structures could serve as meta mental models that would enable learners to more easily acquire conceptual and causal networks and their associated processes.
Abstract: This paper describes knowledge components that are thought to be appropriate and sufficient to precisely describe certain types of cognitive subject matter content (knowledge). It also describes knowledge structures that show the relationships among these knowledge components and among other knowledge objects. It suggests that a knowledge structure is a form of schema such as those that learners use to represent knowledge in memory. A mental model is a schema plus cognitive processes for manipulating and modifying the knowledge stored in a schema. We suggested processes that enable learners to manipulate the knowledge components of conceptual network knowledge structures for purposes of classification, generalization, and concept elaboration. We further suggested processes that enable learners to manipulate the knowledge components of process knowledge structures (PEAnets) for purposes of explanation, prediction, and troubleshooting. The hypothesis of this paper is that knowledge components and knowledge structures, such as those described in this paper, could serve as meta mental models that would enable learners to more easily acquire conceptual and causal networks and their associated processes. The resulting specific mental models would facilitate their ability to solve problems of conceptualization and interpretation.

Journal ArticleDOI
TL;DR: The discovery task is affected by structural features of semistructured data in a nontrivial way and traditional data mining frameworks are inapplicable.
Abstract: Many semistructured objects are similarly, though not identically structured. We study the problem of discovering "typical" substructures of a collection of semistructured objects. The discovered structures can serve the following purposes: 1) the "table-of-contents" for gaining general information of a source, 2) a road map for browsing and querying information sources, 3) a basis for clustering documents, 4) partial schemas for providing standard database access methods, and 5) user/customer interests and browsing patterns. The discovery task is affected by structural features of semistructured data in a nontrivial way and traditional data mining frameworks are inapplicable. We define this discovery problem and propose a solution.
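
One bare-bones way to approximate "typical substructure" discovery over a collection of semistructured objects is to count label paths and keep those occurring in most objects; the sketch below does only that, on invented data, whereas the paper defines the discovery problem far more generally.

```python
from collections import Counter

def label_paths(obj, prefix=()):
    """Yield every label path occurring in one nested-dict object."""
    for key, value in obj.items():
        path = prefix + (key,)
        yield path
        if isinstance(value, dict):
            yield from label_paths(value, path)

def typical_paths(objects, min_frequency=0.6):
    """Label paths present in at least `min_frequency` of the objects."""
    counts = Counter()
    for obj in objects:
        counts.update(set(label_paths(obj)))  # count each path once per object
    return [p for p, c in counts.items() if c / len(objects) >= min_frequency]

# Invented semistructured records: similarly, though not identically, structured.
docs = [
    {"title": "A", "author": {"name": "X", "email": "x@a"}, "year": 1999},
    {"title": "B", "author": {"name": "Y"}, "venue": "KDD"},
    {"title": "C", "author": {"name": "Z", "email": "z@c"}},
]
for path in typical_paths(docs):
    print("/".join(path))
```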

Journal ArticleDOI
TL;DR: The knowledge discovery and data mining field draws on findings from statistics, databases, and artificial intelligence to construct tools that let users gain insight from massive data sets by paying attention to the cognitive factors that make the resulting models coherent, credible, easy to use, and easy to communicate to others.
Abstract: The knowledge discovery and data mining (KDD) field draws on findings from statistics, databases, and artificial intelligence to construct tools that let users gain insight from massive data sets. People in business, science, medicine, academia, and government collect such data sets, and several commercial packages now offer general-purpose KDD tools. An important KDD goal is to "turn data into knowledge". For example, knowledge acquired through such methods on a medical database could be published in a medical journal. Knowledge acquired from analyzing a financial or marketing database could revise business practice and influence a management school's curriculum. In addition, some US laws require reasons for rejecting a loan application, which knowledge from the KDD could provide. Occasionally, however, you must explain the learned decision criteria to a court, as in the recent lawsuit Blue Mountain filed against Microsoft for a mail filter that classified electronic greeting cards as spam mail. We expect more from knowledge discovery tools than simply creating accurate models as in machine learning, statistics, and pattern recognition. We can fully realize the benefits of data mining by paying attention to the cognitive factors that make the resulting models coherent, credible, easy to use, and easy to communicate to others.

Journal ArticleDOI
TL;DR: Data quality is a particularly troublesome issue in data mining applications, and this is examined.
Abstract: Data mining is defined as the process of seeking interesting or valuable information within large data sets. This presents novel challenges and problems, distinct from those typically arising in the allied areas of statistics, machine learning, pattern recognition or database science. A distinction is drawn between the two data mining activities of model building and pattern detection. Even though statisticians are familiar with the former, the large data sets involved in data mining mean that novel problems do arise. The second of the activities, pattern detection, presents entirely new classes of challenges, some arising, again, as a consequence of the large sizes of the data sets. Data quality is a particularly troublesome issue in data mining applications, and this is examined. The discussion is illustrated with a variety of real examples.

Proceedings ArticleDOI
01 Dec 2000
TL;DR: The relative computational simplicity of the proposed method makes it possible to process and analyze large volumes of data in a short time and significantly contributes to and enhances a user's ability to discover such embedded information.
Abstract: Research in bioinformatics in the past decade has generated a large volume of textual biological data stored in databases such as MEDLINE. It takes a copious amount of effort and time, even for expert users, to manually extract useful information embedded in such a large volume of retrieved data, and automated intelligent text analysis tools are increasingly becoming essential. In this article, we present a simple analysis and knowledge discovery method that can identify related genes as well as their shared functionality (if any) based on a collection of relevant retrieved MEDLINE documents. The relative computational simplicity of the proposed method makes it possible to process and analyze large volumes of data in a short time. Hence, it significantly contributes to and enhances a user's ability to discover such embedded information. Two case studies are presented that indicate the usefulness of the proposed method.
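
In the spirit of the computationally simple analysis the abstract describes, a bare-bones gene co-occurrence count over retrieved abstracts might look like the following; the gene lexicon, matching rule, and threshold are all invented and much cruder than the article's method.

```python
from collections import Counter
from itertools import combinations

def related_genes(abstracts, gene_lexicon, min_cooccurrence=2):
    """Count how often two gene names are mentioned in the same abstract."""
    pair_counts = Counter()
    for text in abstracts:
        found = sorted({g for g in gene_lexicon if g.lower() in text.lower()})
        pair_counts.update(combinations(found, 2))
    return [(a, b, c) for (a, b), c in pair_counts.items() if c >= min_cooccurrence]

# Invented abstracts and gene lexicon, for illustration only.
docs = [
    "BRCA1 and BRCA2 mutations are implicated in hereditary breast cancer.",
    "Interaction of BRCA1 with TP53 modulates the DNA damage response.",
    "BRCA1 and BRCA2 carriers show elevated lifetime risk.",
]
print(related_genes(docs, {"BRCA1", "BRCA2", "TP53"}))
```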

Journal ArticleDOI
TL;DR: This book covers medical data mining and knowledge discovery, with chapters ranging from data mining-based modeling of human visual perception to the discovery of clinical knowledge in databases extracted from hospital information systems and knowledge discovery in time series.
Abstract: Medical Data Mining and Knowledge Discovery * Legal Policy and Security Issues in the Handling of Medical Data * Medical Natural Language Understanding as a Supporting Technology for Data Mining in Healthcare * Anatomic Pathology Data Mining * A Data Clustering and Visualization Methodology for Epidemiological Pathology Discoveries * Mining Structure-Function Associations in a Brain Image Database * ADRIS * Knowledge Discovery in Mortality Records: An Info-Fuzzy Approach * Consistent and Complete Data and "Expert" Mining in Medicine * A Medical Data Mining Application Based on Evolutionary Computation * Methods of Temporal Data Validation and Abstraction in High-Frequency Domains * Data Mining the Matrix Associated Regions for Gene Therapy * Discovery of Temporal Patterns in Sparse Course-of-Disease Data * Data Mining-Based Modeling of Human Visual Perception * Discovery of Clinical Knowledge in Databases Extracted from Hospital Information Systems * Knowledge Discovery in Time Series.

Journal ArticleDOI
TL;DR: CiteSeer is an automatic generator of digital libraries of scientific literature that uses sophisticated acquisition, parsing, and presentation methods to eliminate most of the manual effort of finding useful publications on the Web.
Abstract: Scientific literature on the Web makes up a massive, noisy, disorganized database. Unlike large, single-source databases such as a corporate customer database, the Web database draws from many sources, each with its own organization. Also, owing to its diversity, most records in this database are irrelevant to an individual researcher. Furthermore, the database is constantly growing in content and changing in organization. All these characteristics make the Web a difficult domain for knowledge discovery. To quickly and easily gather useful knowledge from such a database, users need the help of an information filtering system that automatically extracts only relevant records as they appear in a stream of incoming records. To this end, we have developed CiteSeer. CiteSeer is an automatic generator of digital libraries of scientific literature. It uses sophisticated acquisition, parsing, and presentation methods to eliminate most of the manual effort of finding useful publications on the Web.

Journal ArticleDOI
TL;DR: GP has been demonstrated to be a really useful data mining tool, but future work should also include the application of the GP system proposed here to other data sets, to further validate the results reported in this article.
Abstract: Explores a promising data mining approach. Despite the small number of examples available in the authors' application domain (taking into account the large number of attributes), the results of their experiments can be considered very promising. The discovered rules had good performance concerning predictive accuracy, considering both the rule set as a whole and each individual rule. Furthermore, what is more important from a data mining viewpoint, the system discovered some comprehensible rules. It is interesting to note that the system achieved very consistent results by working from "tabula rasa," without any background knowledge, and with a small number of examples. The authors emphasize that their system is still an experiment in the research stage of development. Therefore, the results presented here should not be used alone for real-world diagnoses without consulting a physician. Future research includes a careful selection of attributes in a preprocessing step, so as to reduce the number of attributes (and the corresponding search space) given to the GP. Attribute selection is a very active research area in data mining. Given the results obtained so far, GP has been demonstrated to be a really useful data mining tool, but future work should also include the application of the GP system proposed here to other data sets, to further validate the results reported in this article.

Journal Article
TL;DR: A new definition of quantitative association rules based on fuzzy set theory is introduced; the algorithm uses new definitions for interesting measures, and experimental results show the efficiency of the algorithm for large databases.
Abstract: During the last ten years, data mining, also known as knowledge discovery in databases, has established its position as a prominent and important research area. Mining association rules is one of the important research problems in data mining. Many algorithms have been proposed to find association rules in databases with binary attributes. In this paper, we deal with the problem of mining association rules in databases containing both quantitative and categorical attributes. An example of such an association might be "10% of married people between age 50 and 70 have at least 2 cars". We introduce a new definition of quantitative association rules based on fuzzy set theory. Using the fuzzy set concept, the discovered rules are more understandable to a human. Moreover, fuzzy sets handle numerical values better than existing methods because fuzzy sets soften the effect of sharp boundaries. The above example could be rephrased, e.g., "10% of married old people have several cars". In this paper we present a new algorithm for mining fuzzy quantitative association rules. The algorithm uses new definitions for interesting measures. Experimental results show the efficiency of the algorithm for large databases.
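
To illustrate what the support of a fuzzy quantitative rule might look like, the sketch below defines invented trapezoidal membership functions for "old" and "several cars" and computes a fuzzy support for the example rule from the abstract; the membership functions, thresholds and records are assumptions, not the paper's definitions.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], is 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Invented fuzzy sets over the quantitative attributes.
def old(age):
    return trapezoid(age, 45, 55, 90, 100)

def several(cars):
    return trapezoid(cars, 1, 2, 10, 11)

def fuzzy_support(records, antecedent, consequent):
    """Fuzzy support of 'antecedent => consequent': mean of per-record min degrees."""
    degrees = [min(antecedent(r), consequent(r)) for r in records]
    return sum(degrees) / len(records)

# Toy records (age, married, number of cars), invented for illustration.
people = [(62, True, 2), (48, True, 3), (30, True, 1), (70, False, 2), (55, True, 2)]
married = [p for p in people if p[1]]
print(round(fuzzy_support(married,
                          antecedent=lambda p: old(p[0]),
                          consequent=lambda p: several(p[2])), 2))
```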

Journal ArticleDOI
TL;DR: The approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process, and statistical techniques are applied on concepts in order to find interesting patterns in concept distributions or associations.
Abstract: This paper presents an approach for knowledge discovery in texts extracted from the Web. Instead of analyzing words or attribute values, the approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process. Statistical techniques are applied on concepts in order to find interesting patterns in concept distributions or associations. In this way, users can perform discovery at a high level, since concepts describe real world events, objects, thoughts, etc. For identifying concepts in texts, a categorization algorithm is used associated to a previous classification task for concept definitions. Two experiments are presented: one for political analysis and the other for competitive intelligence. At the end, the approach is discussed, examining its problems and advantages in the Web context. Keywords: Knowledge discovery, data mining, information extraction, categorization, text mining. 1. INTRODUCTION The Web is a large and growing collection of texts. This amount of text is becoming a valuable resource of information and knowledge. As Garofalakis and partners comment, "

Journal ArticleDOI
TL;DR: A decade's experience with the MED is summarized to serve as a proof-of-concept that knowledge-based terminologies can support the use of coded patient data for a variety of knowledge- based activities, including the improved understanding of patient data, the access of information sources relevant to specific patient care problems, the application of expert systems directly to the care of patients, and the discovery of new medical knowledge.

Journal ArticleDOI
TL;DR: A rule induction system based on rough sets and attribute-oriented generalization is introduced and was applied to a database of congenital malformation to extract diagnostic rules and an expert system which makes a differential diagnosis on congenital disorders is developed.

Journal ArticleDOI
TL;DR: A case study involving two problems is presented: understanding customer retention patterns by classifying policy holders as likely to renew or terminate their policies, and understanding claim patterns; both are solved using a variety of techniques within the methodology of data mining.
Abstract: The insurance industry is concerned with many problems of interest to the operational research community. This paper presents a case study involving two such problems and solves them using a variety of techniques within the methodology of data mining. The first of these problems is the understanding of customer retention patterns by classifying policy holders as likely to renew or terminate their policies. The second is better understanding claim patterns, and identifying types of policy holders who are more at risk. Each of these problems impacts on the decisions relating to premium pricing, which directly affects profitability. A data mining methodology is used which views the knowledge discovery process within an holistic framework utilising hypothesis testing, statistics, clustering, decision trees, and neural networks at various stages. The impacts of the case study on the insurance company are discussed.