
Showing papers on "Knowledge extraction published in 2003"


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper proposes a general framework for mining concept-drifting data streams using weighted ensemble classifiers, and shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
Abstract: Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. Conventional knowledge discovery tools are facing two challenges, the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
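
To make the chunk-and-weight idea concrete, here is a minimal sketch (not the authors' implementation) in which each incoming chunk trains a new decision tree and every ensemble member is re-weighted by its accuracy on the newest chunk as a stand-in for the expected-accuracy weighting described above; the synthetic drifting stream and class names are assumptions for illustration.

```python
# Minimal sketch of a chunk-based weighted ensemble for a drifting stream.
# Assumes scikit-learn and numpy; accuracy on the newest chunk is used as a
# proxy for the paper's expected-accuracy weighting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedEnsemble:
    def __init__(self, max_members=10):
        self.max_members = max_members
        self.members = []          # list of [classifier, weight]

    def update(self, X_chunk, y_chunk):
        """Train a new member on the latest chunk and re-weight all members."""
        clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        self.members.append([clf, 0.0])
        for member in self.members:
            # Weight = accuracy on the most recent chunk.
            member[1] = member[0].score(X_chunk, y_chunk)
        # Keep only the top-weighted members.
        self.members = sorted(self.members, key=lambda m: m[1], reverse=True)[: self.max_members]

    def predict(self, X):
        votes = np.zeros(len(X))
        total = sum(w for _, w in self.members) or 1.0
        for clf, w in self.members:
            votes += w * clf.predict(X)
        return (votes / total >= 0.5).astype(int)   # binary labels {0, 1}

# Usage sketch: feed sequential chunks of a synthetic drifting stream.
rng = np.random.default_rng(0)
ensemble = WeightedEnsemble()
for t in range(20):
    X = rng.normal(size=(200, 5))
    drift = 0.1 * t                                  # slowly drifting concept
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    ensemble.update(X, y)
print(ensemble.predict(rng.normal(size=(5, 5))))
```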

1,403 citations


Journal ArticleDOI
TL;DR: A model of community-based, evolutionary knowledge creation is built to study how thousands of talented volunteers, dispersed across organizational and geographical boundaries, collaborate via the Internet to produce a knowledge-intensive, innovative product of high quality.
Abstract: We propose a new model of knowledge creation in purposeful, loosely coordinated, distributed systems, as an alternative to a firm-based one. Specifically, using the case of the Linux kernel development project, we build a model of community-based, evolutionary knowledge creation to study how thousands of talented volunteers, dispersed across organizational and geographical boundaries, collaborate via the Internet to produce a knowledge-intensive, innovative product of high quality. By comparing and contrasting the Linux model with the traditional/commercial model of software development and firm-based knowledge creation efforts, we show how the proposed model of knowledge creation expands beyond the boundary of the firm. Our model suggests that the product development process can be effectively organized as an evolutionary process of learning driven by criticism and error correction. We conclude by offering some theoretical implications of our community-based model of knowledge creation for the literature of organizational learning, community life, and the uses of knowledge in society.

711 citations


01 Dec 2003
TL;DR: This book integrates two areas of computer science, namely data mining and evolutionary algorithms, and emphasizes the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making.
Abstract: From the Publisher: This book integrates two areas of computer science, namely data mining and evolutionary algorithms. Both these areas have become increasingly popular in the last few years, and their integration is currently an active research area. In general, data mining consists of extracting knowledge from data. In particular, in this book we emphasize the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making. In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions. In contrast, most rule induction methods perform a local, greedy search in the space of candidate rules. Intuitively, the global search of evolutionary algorithms can discover interesting rules and patterns that would be missed by the greedy search performed by most rule induction methods.

699 citations


Journal ArticleDOI
TL;DR: The Artequakt project, which links a knowledge extraction tool with an ontology to guide information extraction and achieve continuous knowledge support, is considered; extraction is further enhanced using a lexicon-based term expansion mechanism that provides extended ontology terminology.
Abstract: To bring the Semantic Web to life and provide advanced knowledge services, we need efficient ways to access and extract knowledge from Web documents. Although Web page annotations could facilitate such knowledge gathering, annotations are rare and will probably never be rich or detailed enough to cover all the knowledge these documents contain. Manual annotation is impractical and unscalable, and automatic annotation tools remain largely undeveloped. Specialized knowledge services therefore require tools that can search and extract specific knowledge directly from unstructured text on the Web, guided by an ontology that details what type of knowledge to harvest. An ontology uses concepts and relations to classify domain knowledge. Other researchers have used ontologies to support knowledge extraction, but few have explored their full potential in this domain. The paper considers the Artequakt project which links a knowledge extraction tool with an ontology to achieve continuous knowledge support and guide information extraction. The extraction tool searches online documents and extracts knowledge that matches the given classification structure. It provides this knowledge in a machine-readable format that will be automatically maintained in a knowledge base (KB). Knowledge extraction is further enhanced using a lexicon-based term expansion mechanism that provides extended ontology terminology.
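
As a toy illustration of ontology-guided extraction (not the Artequakt tool itself), the sketch below uses an assumed ontology fragment whose relations carry lexical patterns and stores any matches as triples in a small knowledge base; all patterns, relation names and the example sentences are invented.

```python
# Toy sketch: an ontology fragment drives extraction of relation triples
# from unstructured text via lexical patterns (illustrative only).
import re

# Ontology fragment: relation name -> list of surface patterns with two slots.
ontology = {
    "date_of_birth": [r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<value>\d{4})"],
    "place_of_birth": [r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<value>[A-Z][a-z]+)"],
}

def extract(text, ontology):
    kb = []  # knowledge base of (subject, relation, value) triples
    for relation, patterns in ontology.items():
        for pattern in patterns:
            for m in re.finditer(pattern, text):
                kb.append((m.group("person"), relation, m.group("value")))
    return kb

doc = "Pablo Picasso was born in 1881. Pablo Picasso was born in Malaga."
print(extract(doc, ontology))
```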

490 citations


Book ChapterDOI
01 Jan 2003
TL;DR: This chapter discusses the use of evolutionary algorithms, particularly genetic algorithms and genetic programming, in data mining and knowledge discovery, and discusses some preprocessing and postprocessing steps of the knowledge discovery process, focusing on attribute selection and pruning of an ensemble of classifiers.
Abstract: This chapter discusses the use of evolutionary algorithms, particularly genetic algorithms and genetic programming, in data mining and knowledge discovery. We focus on the data mining task of classification. In addition, we discuss some preprocessing and postprocessing steps of the knowledge discovery process, focusing on attribute selection and pruning of an ensemble of classifiers. We show how the requirements of data mining and knowledge discovery influence the design of evolutionary algorithms. In particular, we discuss how individual representation, genetic operators and fitness functions have to be adapted for extracting high-level knowledge from data.
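
A compact sketch of the attribute-selection idea with a wrapper-style fitness function, assuming scikit-learn and a bit-mask individual representation; this illustrates the general approach rather than the chapter's specific algorithm.

```python
# Sketch: genetic algorithm for attribute (feature) selection.
# Individuals are bit masks over attributes; fitness is cross-validated
# accuracy of a classifier restricted to the selected attributes.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
n_attr = X.shape[1]

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, n_attr))            # random initial population
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]             # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, n_attr)                    # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_attr) < 0.02                 # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected attributes:", np.flatnonzero(best))
```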

452 citations


Journal ArticleDOI
TL;DR: A set of performance measures is proposed that are sensitive to two types of reconstruction errors and appropriate for different knowledge discovery (KDD) applications, helping the analyst select the heuristic best suited for the application at hand.
Abstract: Web-usage mining has become the subject of intensive research, as its potential for personalized services, adaptive Web sites and customer profiling is recognized. However, the reliability of Web-usage mining results depends heavily on the proper preparation of the input datasets. In particular, errors in the reconstruction of sessions and incomplete tracing of users' activities in a site can easily result in invalid patterns and wrong conclusions. In this study, we evaluate the performance of heuristics employed to reconstruct sessions from the server log data. Such heuristics are called to partition activities first by user and then by visit of the user in the site, where user identification mechanisms, such as cookies, may or may not be available. We propose a set of performance measures that are sensitive to two types of reconstruction errors and appropriate for different types of knowledge discovery (KDD) applications. We have tested our framework on the Web server data of a frame-based Web site. The first experiment concerned a specific KDD application and has shown the sensitivity of the heuristics to particularities of the site's structure and traffic. The second experiment is not bound to a specific application but rather compares the performance of the heuristics for different measures and thus for different application types. Our results show that there is no single best heuristic, but our measures help the analyst in the selection of the heuristic best suited for the application at hand.
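
For readers unfamiliar with sessionization, the sketch below shows one common time-oriented heuristic of the kind evaluated here, splitting a user's requests whenever the gap between consecutive hits exceeds a 30-minute inactivity threshold; the log record layout is an assumption.

```python
# Sketch of a time-oriented session reconstruction heuristic:
# requests from the same user are split into sessions whenever the gap
# between consecutive requests exceeds a 30-minute inactivity threshold.
from collections import defaultdict

MAX_GAP = 30 * 60  # seconds

def sessionize(log_records):
    """log_records: iterable of (user_id, timestamp, url) tuples (assumed layout)."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log_records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > MAX_GAP:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions

log = [("u1", 0, "/home"), ("u1", 600, "/products"), ("u1", 5000, "/home"),
       ("u2", 100, "/home")]
for user, pages in sessionize(log):
    print(user, [url for _, url in pages])
```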

358 citations


Journal ArticleDOI
TL;DR: The results show that the evolutionary instance selection algorithms consistently outperform the nonevolutionary ones, the main advantages being: better instance reduction rates, higher classification accuracy, and models that are easier to interpret.
Abstract: Evolutionary algorithms are adaptive methods based on natural evolution that may be used for search and optimization. As data reduction in knowledge discovery in databases (KDD) can be viewed as a search problem, it could be solved using evolutionary algorithms (EAs). In this paper, we have carried out an empirical study of the performance of four representative EA models in which we have taken into account two different instance selection perspectives, the prototype selection and the training set selection for data reduction in KDD. This paper includes a comparison between these algorithms and other nonevolutionary instance selection algorithms. The results show that the evolutionary instance selection algorithms consistently outperform the nonevolutionary ones, the main advantages being: better instance reduction rates, higher classification accuracy, and models that are easier to interpret.

325 citations


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper presents a method to build decision tree classifiers from the disguised data, and shows that although the data are disguised, this method can still achieve fairly high accuracy.
Abstract: Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use the randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.
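
A minimal sketch of the randomized response idea for a single binary attribute: each value is reported truthfully with probability theta and flipped otherwise, and the true proportion needed for mining is reconstructed from the disguised counts. This illustrates the disguising and estimation step only, not the paper's full decision-tree procedure.

```python
# Sketch of randomized response disguising for a binary attribute, plus the
# count reconstruction used when mining the disguised data.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8                       # probability of reporting the true value

def disguise(values, theta):
    """Report each 0/1 value truthfully with prob. theta, flipped otherwise."""
    keep = rng.random(len(values)) < theta
    return np.where(keep, values, 1 - values)

def estimate_true_proportion(disguised, theta):
    """Invert E[observed] = theta*p + (1-theta)*(1-p) to recover p."""
    observed = disguised.mean()
    return (observed - (1 - theta)) / (2 * theta - 1)

true_values = rng.integers(0, 2, size=10_000)       # hidden sensitive attribute
disguised = disguise(true_values, theta)
print("true proportion   :", true_values.mean())
print("naive (disguised) :", disguised.mean())
print("reconstructed     :", estimate_true_proportion(disguised, theta))
```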

319 citations


Journal ArticleDOI
TL;DR: The experiments show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset.
Abstract: Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset) and we formalize this notion in the framework of ε-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189–194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules (a popular kind of patterns that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
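
For context, the baseline task that free-sets condense can be pictured with a small levelwise (Apriori-style) frequent-itemset counter; the sketch below is that baseline, not the authors' free-set extraction algorithm, and the toy baskets are invented.

```python
# Baseline sketch: levelwise (Apriori-style) enumeration of frequent itemsets,
# i.e. the representation that free-sets condense. Not the free-set algorithm.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support_count} with support >= min_support."""
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join surviving k-itemsets into (k+1)-itemsets.
        keys = sorted(survivors, key=sorted)
        level = sorted({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == len(a) + 1}, key=sorted)
    return frequent

baskets = [frozenset(t) for t in
           [{"beer", "chips"}, {"beer", "chips", "salsa"}, {"chips", "salsa"}, {"beer"}]]
for itemset, count in frequent_itemsets(baskets, min_support=2).items():
    print(sorted(itemset), count)
```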

290 citations


Patent
19 Sep 2003
TL;DR: In this article, a system and method for accessing domains of information to identify heretofore unknown relationships between disparate sources of data in order to seek and obtain knowledge is presented.
Abstract: The present invention is a system and method for accessing domains of information to identify heretofore unknown relationships between disparate sources of data (7) to seek and obtain knowledge (18). The invention includes a source of data with one or more domains of information, an Object-Relationship Database (53) for integrating objects from one or more domains of information, and a knowledge discovery engine (54) where relationships between two or more objects are identified, retrieved, grouped, ranked, filtered and numerically evaluated.

218 citations


Journal ArticleDOI
TL;DR: The use of data mining techniques in knowledge discovery in medical databases is likely to be of increasing importance in the process of pharmacovigilance, as they are likely to be able to detect signals earlier than current methods.
Abstract: Aims: To discuss the potential use of data mining and knowledge discovery in databases for detection of adverse drug events (ADE) in pharmacovigilance. Methods: A literature search was conducted to identify articles, which contained details of data mining, signal generation or knowledge discovery in relation to adverse drug reactions or pharmacovigilance in medical databases. Results: ADEs are common and result in significant mortality, and despite existing systems drugs have been withdrawn due to ADEs many years after licensing. Knowledge discovery in databases (KDD) is a technique which may be used to detect potential ADEs more efficiently. KDD involves the selection of data variables and databases, data preprocessing, data mining and data interpretation and utilization. Data mining encompasses a number of statistical techniques including cluster analysis, link analysis, deviation detection and disproportionality assessment which can be utilized to determine the presence of and to assess the strength of ADE signals. Currently the only data mining methods to be used in pharmacovigilance are those of disproportionality, such as the Proportional Reporting Ratio and Information Component, which have been used to analyse the UK Yellow Card Scheme spontaneous reporting database and the WHO Uppsala Monitoring Centre database. The association of pericarditis with practolol but not with other β-blockers, the association of captopril and other angiotensin-converting enzyme inhibitors with cough, and the association of terfenadine with heart rate and rhythm disorders could be identified by mining the WHO database. Conclusion: In view of the importance of ADEs and the development of massive data storage systems and powerful computer systems, the use of data mining techniques in knowledge discovery in medical databases is likely to be of increasing importance in the process of pharmacovigilance as they are likely to be able to detect signals earlier than using current methods.
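
The Proportional Reporting Ratio mentioned above is straightforward to compute from a 2x2 contingency table of spontaneous reports; a minimal sketch with invented counts follows.

```python
# Proportional Reporting Ratio (PRR) from a 2x2 table of spontaneous reports:
#                reaction R   other reactions
#   drug D           a              b
#   other drugs      c              d
# PRR = [a / (a + b)] / [c / (c + d)]
def proportional_reporting_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

# Illustrative (made-up) counts: 40 reports of the reaction with the drug,
# 960 other reports for the drug, 200 / 49800 for all other drugs.
print(proportional_reporting_ratio(40, 960, 200, 49_800))   # PRR = 10.0
```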

01 Jan 2003
TL;DR: The main process of KDD is the data mining process, in which different algorithms are applied to produce hidden knowledge; a postprocessing step then evaluates the mining result according to users' requirements and domain knowledge.
Abstract: Data mining [Chen et al. 1996] is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large information repositories such as relational databases, data warehouses, XML repositories, etc. Data mining is also known as one of the core processes of Knowledge Discovery in Databases (KDD). Many people take data mining as a synonym for another popular term, Knowledge Discovery in Databases (KDD), while others treat data mining as the core process of KDD. The KDD processes are shown in Figure 1 [Han and Kamber 2000]. Usually there are three processes. One is called preprocessing, which is executed before data mining techniques are applied to the right data. The preprocessing includes data cleaning, integration, selection and transformation. The main process of KDD is the data mining process; in this process different algorithms are applied to produce hidden knowledge. After that comes another process called postprocessing, which evaluates the mining result according to users' requirements and domain knowledge. Regarding the evaluation results, the knowledge can be presented if the result is satisfactory; otherwise we have to run some or all of those processes again until we get a satisfactory result. The actual processes work as follows. First we need to clean and integrate the databases. Since the data source may come from different databases, which may have some inconsistencies and duplications, we must clean the data source by removing the noise or making some compromises. Suppose we have two different databases and different words are used to refer to the same thing in their schemas. When we try to integrate the two sources we can only choose one of the terms, provided we know that they denote the same thing. Also, real-world data tend to be incomplete and noisy due to manual input mistakes. The integrated data sources can be stored in a database, data warehouse or other repository. As not all the data in the database are related to our mining task, the second process is to select task-related data from the integrated resources and transform them into a format that is ready to be mined. Suppose we want to find which items are often purchased together in a supermarket; the database that records the purchase history may contain the customer ID, items bought, transaction time, prices, the number of each item and so on, but for this specific task we only need the items bought. After selection of relevant data, the database that we are going to apply our data mining techniques to will be much smaller, and consequently the whole process will be more efficient.
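
The selection and transformation step described above, keeping only the attribute relevant to the market-basket task, can be pictured with a tiny sketch; the record field names are illustrative assumptions.

```python
# Sketch of the selection/transformation step of the KDD process described
# above: from full purchase records, keep only the attribute relevant to the
# market-basket task (items bought). Field names are illustrative.
raw_records = [
    {"customer_id": 17, "items": ["milk", "bread"], "time": "09:14", "total": 4.20},
    {"customer_id": 42, "items": ["beer", "chips"], "time": "18:03", "total": 7.50},
    {"customer_id": 17, "items": ["milk", "beer", "chips"], "time": "19:40", "total": 9.10},
]

# Selection: only the task-relevant attribute; transformation: one basket per row.
baskets = [set(record["items"]) for record in raw_records]
print(baskets)   # ready to be handed to a frequent-itemset miner
```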

Journal Article
TL;DR: It is argued that a core ontology is one of the key building blocks necessary to enable the scalable assimilation of information from diverse sources and the subsequent building of a variety of services such as cross-domain searching, browsing, data mining and knowledge extraction.
Abstract: In this paper, we argue that a core ontology is one of the key building blocks necessary to enable the scalable assimilation of information from diverse sources. A complete and extensible ontology that expresses the basic concepts that are common across a variety of domains and can provide the basis for specialization into domain-specific concepts and vocabularies, is essential for well-defined mappings between domain-specific knowledge representations (i.e., metadata vocabularies) and the subsequent building of a variety of services such as cross-domain searching, browsing, data mining and knowledge extraction. This paper describes the results of a series of three workshops held in 2001 and 2002 which brought together representatives from the cultural heritage and digital library communities with the goal of harmonizing their knowledge perspectives and producing a core ontology. The knowledge perspectives of these two communities were represented by the CIDOC/CRM [31], an ontology for information exchange in the cultural heritage and museum community, and the ABC ontology [33], a model for the exchange and integration of digital library information. This paper describes the mediation process between these two different knowledge biases and the results of this mediation - the harmonization of the ABC and CIDOC/CRM ontologies, which we believe may provide a useful basis for information integration in the wider scope of the involved communities.

Journal ArticleDOI
01 Jul 2003
TL;DR: A design of a knowledge management system called KnowledgeScope is proposed that addresses these problems through an integrated workflow support capability, which captures and retrieves knowledge as an organizational process proceeds, and a process meta-model, which organizes that knowledge and context in a knowledge repository.
Abstract: Knowledge repositories have been implemented in many organizations, but they often suffer from non-use. This research considers two key design factors that cause non-use: the extra burden on users to document knowledge in the repository, and the lack of a standard knowledge structure that facilitates knowledge sharing among users with different perspectives. We propose a design of a knowledge management system called KnowledgeScope that addresses these problems through (1) an integrated workflow support capability that captures and retrieves knowledge as an organizational process proceeds, i.e., within the context in which it is created and used, and (2) a process meta-model that organizes that knowledge and context in a knowledge repository. In this paper, we describe this design and report the results from implementing it in a real-life organization.

Posted Content
TL;DR: A Challenge Evaluation task created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup is reported, in which 18 participating groups provided systems that flagged articles for curation based on whether the article contained experimental evidence for gene expression products.
Abstract: MOTIVATION: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. RESULTS: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups. CONTACT: asy@mitre.org KEYWORDS: text mining, evaluation, curation, genomics, data management
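
A hedged sketch of the general kind of "flag for curation" classifier the participating systems built, using bag-of-words features and naive Bayes via scikit-learn; the miniature corpus is invented purely for illustration and is unrelated to the FlyBase data.

```python
# Sketch of a "flag this article for curation" classifier of the general kind
# built for the KDD Challenge Cup task: bag-of-words + naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Northern blot analysis shows transcript expression in embryos",
    "in situ hybridization reveals expression of the gene product",
    "we describe a new genetic mapping technique",
    "phylogenetic analysis of the gene family across species",
]
train_labels = [1, 1, 0, 0]    # 1 = contains gene-expression evidence, curate

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["western blot shows protein expression in the wing disc"]))
```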

Journal ArticleDOI
TL;DR: This paper proposes a practical methodology to capture and represent organizational knowledge that uses a knowledge map as a tool to represent knowledge.
Abstract: Recently, research interest in knowledge management has grown rapidly. Much research on knowledge management is conducted in academic and industrial communities. Utilizing knowledge accumulated in an organization can be a strategic weapon to acquire a competitive advantage. Capturing and representing knowledge is critical in knowledge management. This paper proposes a practical methodology to capture and represent organizational knowledge. The methodology uses a knowledge map as a tool to represent knowledge. We explore several techniques of knowledge representation and suggest a roadmap with concrete procedures to build the knowledge map. A case study in a manufacturing company is provided.

Journal ArticleDOI
TL;DR: The KDD Challenge Cup evaluation task described in this paper assessed text mining techniques for flagging articles containing experimental evidence for gene expression products; the evaluation results are reported and the techniques used by the top-performing groups are described.
Abstract: Motivation: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. Results: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new (‘blind’) articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups.

Proceedings ArticleDOI
20 Jul 2003
TL;DR: An approach to network intrusion detection is investigated, based purely on a hierarchy of Self-Organizing Feature Maps, which achieves detection rates of 89% with false positive rates of 4.6%, at least as good as alternative data-mining approaches that require all 41 features.
Abstract: An approach to network intrusion detection is investigated, based purely on a hierarchy of Self-Organizing Feature Maps. Our principal interest is to establish just how far such an approach can be taken in practice. To do so, the KDD benchmark dataset from the International Knowledge Discovery and Data Mining Tools Competition is employed. This supplies a connection-based description of a fictitious computer network in which each connection is described in terms of 41 features. Unlike previous approaches, only 6 of the most basic features are employed. The resulting system is capable of detection (false positive) rates of 89% (4.6%), where this is at least as good as the alternative data-mining approaches that require all 41 features.
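
As a minimal illustration of the building block the hierarchy is composed of, here is a plain numpy self-organizing map training loop over six-dimensional inputs; the random "connection records" are stand-ins, not the KDD-99 data, and a single map is trained rather than the paper's hierarchy.

```python
# Minimal self-organizing feature map (single map, not the paper's hierarchy),
# trained on random stand-ins for 6 basic connection features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                 # placeholder "connection records"
rows, cols, dim = 10, 10, X.shape[1]
weights = rng.random((rows, cols, dim))   # neuron codebook vectors
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_steps = 5000
for t in range(n_steps):
    x = X[rng.integers(len(X))]
    # Best-matching unit: neuron whose codebook vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # Decaying learning rate and neighbourhood radius.
    lr = 0.5 * (1 - t / n_steps)
    sigma = 3.0 * (1 - t / n_steps) + 0.5
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
    weights += lr * influence * (x - weights)

print("trained codebook shape:", weights.shape)
```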


Journal ArticleDOI
TL;DR: Simulation results show that the evolutionary classifier produces comprehensible rules and good classification accuracy for the medical datasets and results obtained from t-tests further justify its robustness and invariance to random partition of datasets.

Book ChapterDOI
23 Sep 2003
TL;DR: The web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users by means of data and web mining techniques are described.
Abstract: We describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed for the purpose of offering a personalized and proactive view of the web services to users. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.


BookDOI
01 Jan 2003
TL;DR: It is argued that sailing is an interesting paradigm for a class of hybrid systems that one could call Skill-based Systems.
Abstract: This paper describes the Robosail project. It started in 1997 with the aim to build a self-learning auto pilot for a single handed sailing yacht. The goal was to make an adaptive system that would help a single handed sailor to go faster on average in a race. Presently, after five years of development and a number of sea trials, we have a commercial system available (www.robosail.com). It is a hybrid system using agent technology, machine learning, data mining and rule-based reasoning. Apart from describing the system we try to generalize our findings, and argue that sailing is an interesting paradigm for a class of hybrid systems that one could call Skill-based Systems.

Book ChapterDOI
29 Sep 2003
TL;DR: This paper compares up-to-date methods for propositionalization from two main groups, logic-oriented and database-oriented techniques: the former can handle complex background knowledge and provide expressive first-order models, while the latter can be more efficient, especially on larger data sets.
Abstract: Propositionalization has already been shown to be a promising approach for robustly and effectively handling relational data sets for knowledge discovery. In this paper, we compare up-to-date methods for propositionalization from two main groups: logic-oriented and database-oriented techniques. Experiments using several learning tasks – both ILP benchmarks and tasks from recent international data mining competitions – show that both groups have their specific advantages. While logic-oriented methods can handle complex background knowledge and provide expressive first-order models, database-oriented methods can be more efficient especially on larger data sets. Obtained accuracies vary such that a combination of the features produced by both groups seems a further valuable venture.
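
A small database-oriented propositionalization sketch using pandas: rows of a related table linked to each example are summarized into aggregate columns; the table and column names are assumptions.

```python
# Database-oriented propositionalization sketch: rows of a related table are
# summarized into aggregate features of the target table (names are assumed).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "churned": [0, 1, 0]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount":      [20.0, 35.0, 5.0, 12.0, 18.0, 40.0],
})

# Aggregate the one-to-many relation into fixed-length propositional features.
aggregates = orders.groupby("customer_id")["amount"].agg(
    n_orders="count", total_amount="sum", max_amount="max").reset_index()
table = customers.merge(aggregates, on="customer_id", how="left").fillna(0)
print(table)   # one row per example, ready for any propositional learner
```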

Journal ArticleDOI
TL;DR: Refuting the notion of technology as a replacement of knowledge, this paper focuses on a gap between them that needs to be bridged, and two models of knowledge are reviewed.

Book ChapterDOI
Yiyu Yao1
01 Jan 2003
TL;DR: A critical review and analysis of information-theoretic measures of attribute importance and attribute association, with emphasis on their interpretations and connections are presented.
Abstract: A database may be considered as a statistical population, and an attribute as a statistical variable taking values from its domain. One can carry out statistical and information-theoretic analysis on a database. Based on the attribute values, a database can be partitioned into smaller populations. An attribute is deemed important if it partitions the database such that previously unknown regularities and patterns are observable. Many information-theoretic measures have been proposed and applied to quantify the importance of attributes and relationships between attributes in various fields. In the context of knowledge discovery and data mining (KDD), we present a critical review and analysis of information-theoretic measures of attribute importance and attribute association, with emphasis on their interpretations and connections.
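
One of the measures discussed, the information gain (mutual information) of an attribute with respect to the class, can be computed directly from partition counts; a short sketch with a toy dataset follows.

```python
# Sketch: information gain (mutual information) of an attribute with respect
# to the class, one of the information-theoretic importance measures discussed.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: "outlook" partitions the database into purer sub-populations.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(information_gain(outlook, play), 3))
```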

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work presents a strategy for answering fact-based natural language questions that is guided by a characterization of real-world user queries, implemented in a system called Aranea, that extracts answers from the Web using two different techniques: knowledge annotation and knowledge mining.
Abstract: We present a strategy for answering fact-based natural language questions that is guided by a characterization of real-world user queries. Our approach, implemented in a system called Aranea, extracts answers from the Web using two different techniques: knowledge annotation and knowledge mining. Knowledge annotation is an approach to answering large classes of frequently occurring questions by utilizing semistructured and structured Web sources. Knowledge mining is a statistical approach that leverages massive amounts of Web data to overcome many natural language processing challenges. We have integrated these two different paradigms into a question answering system capable of providing users with concise answers that directly address their information needs.
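
A rough sketch of the redundancy-based knowledge mining idea: candidate answers are short n-grams that recur across many retrieved snippets, scored by frequency; the snippets, stopword list and scoring here are invented for illustration and are not Aranea's actual pipeline.

```python
# Sketch of redundancy-based "knowledge mining": candidate answers are short
# n-grams that recur across retrieved snippets, ranked by frequency.
from collections import Counter

snippets = [
    "Mount Everest is 8848 metres high and lies in the Himalayas",
    "at 8848 metres, Everest is the highest mountain on Earth",
    "the summit of Everest reaches 8848 metres above sea level",
]
stopwords = {"is", "the", "and", "in", "on", "of", "at", "above"}

def candidate_answers(snippets, max_len=2):
    counts = Counter()
    for s in snippets:
        tokens = [t.strip(",.").lower() for t in s.split()]
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if not set(gram) & stopwords:
                    counts[gram] += 1
    return counts.most_common(5)

print(candidate_answers(snippets))   # frequent n-grams surface as answer candidates
```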

Journal ArticleDOI
TL;DR: The approach to protein functional annotation is described with case studies and an examination of common identification errors, and it is illustrated that data integration in PIR supports exploration of protein relationships and may reveal protein functional associations beyond sequence homology.

Book
04 Feb 2003
TL;DR: This edited volume collects chapters on data mining and knowledge discovery, ranging from Bayesian data mining, evolutionary control of inductive bias and visualization-supported cooperative learning to feature selection, parallel and distributed mining, rough sets, text mining and applications in finance, health care and human resources.
Abstract: Contents: A Survey of Bayesian Data Mining; Control of Inductive Bias in Supervised Learning Using Evolutionary Computation: A Wrapper-Based Approach; Cooperative Learning and Virtual Reality-Based Visualization for Data Mining; Feature Selection in Data Mining; Parallel and Distributed Data Mining Through Parallel Skeletons and Distributed Objects; Data Mining Based on Rough Sets; Impact of Missing Data on Data Mining; Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps; The Pitfalls of Knowledge Discovery in Databases and Data Mining; Maximum Performance Efficiency Approaches for Estimating Best Practice Costs; Bayesian Data Mining and Knowledge Discovery; Mining Free Text for Structure; Query-by-Structure Approach for the Web; Financial Benchmarking Using Self-Organizing Maps - Studying the International Pulp and Paper Industry; Data Mining in Health Care Applications; Data Mining for Human Resource Information Systems; Data Mining in Information Technology and Bank Performance; Social, Ethical and Legal Issues of Data Mining; Data Mining in Designing an Agent-Based DSS; Critical and Future Trends in DM: A Review of the Key DM Technologies and Applications

01 Jan 2003
TL;DR: An ontology for the Data Mining domain is presented that can be used to simplify the development of distributed knowledge discovery applications on the Grid, offering a domain expert a reference model for the different kinds of data mining tasks, methodologies and software available to solve a given problem and helping a user find the most appropriate solution.
Abstract: The Grid is an integrated infrastructure for coordinated resource sharing and problem solving in distributed environments. The effective and efficient use of stored data and its transformation into information and knowledge will be a main driver in Grid evolution. The use of ontologies to describe Grid resources will simplify and structure the systematic building of Grid applications through the composition and reuse of software components and the development of knowledge-based services and tools. The paper presents an ontology for the Data Mining domain that can be used to simplify the development of distributed knowledge discovery applications on the Grid, offering a domain expert a reference model for the different kinds of data mining tasks, methodologies and software available to solve a given problem, and helping a user find the most appropriate solution. How the DAMON ontology is used to enhance the design of distributed data mining applications on the KNOWLEDGE GRID is also shown.