
Showing papers on "Knowledge extraction published in 2003"


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper proposes a general framework for mining concept-drifting data streams using weighted ensemble classifiers, and shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
Abstract: Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. Conventional knowledge discovery tools are facing two challenges, the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
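
To make the chunk-and-weight idea concrete, here is a minimal sketch (not the authors' implementation) in which each incoming chunk trains a new decision tree and every ensemble member is re-weighted by its accuracy on the newest chunk as a stand-in for the expected-accuracy weighting described above; the synthetic drifting stream and class names are assumptions for illustration.

```python
# Minimal sketch of a chunk-based weighted ensemble for a drifting stream.
# Assumes scikit-learn and numpy; accuracy on the newest chunk is used as a
# proxy for the paper's expected-accuracy weighting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedEnsemble:
    def __init__(self, max_members=10):
        self.max_members = max_members
        self.members = []          # list of [classifier, weight]

    def update(self, X_chunk, y_chunk):
        """Train a new member on the latest chunk and re-weight all members."""
        clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        self.members.append([clf, 0.0])
        for member in self.members:
            # Weight = accuracy on the most recent chunk.
            member[1] = member[0].score(X_chunk, y_chunk)
        # Keep only the top-weighted members.
        self.members = sorted(self.members, key=lambda m: m[1], reverse=True)[: self.max_members]

    def predict(self, X):
        votes = np.zeros(len(X))
        total = sum(w for _, w in self.members) or 1.0
        for clf, w in self.members:
            votes += w * clf.predict(X)
        return (votes / total >= 0.5).astype(int)   # binary labels {0, 1}

# Usage sketch: feed sequential chunks of a synthetic drifting stream.
rng = np.random.default_rng(0)
ensemble = WeightedEnsemble()
for t in range(20):
    X = rng.normal(size=(200, 5))
    drift = 0.1 * t                                  # slowly drifting concept
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    ensemble.update(X, y)
print(ensemble.predict(rng.normal(size=(5, 5))))
```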

1,403 citations


Journal ArticleDOI
TL;DR: A model of community-based, evolutionary knowledge creation is built to study how thousands of talented volunteers, dispersed across organizational and geographical boundaries, collaborate via the Internet to produce a knowledge-intensive, innovative product of high quality.
Abstract: We propose a new model of knowledge creation in purposeful, loosely coordinated, distributed systems, as an alternative to a firm-based one. Specifically, using the case of the Linux kernel development project, we build a model of community-based, evolutionary knowledge creation to study how thousands of talented volunteers, dispersed across organizational and geographical boundaries, collaborate via the Internet to produce a knowledge-intensive, innovative product of high quality. By comparing and contrasting the Linux model with the traditional/commercial model of software development and firm-based knowledge creation efforts, we show how the proposed model of knowledge creation expands beyond the boundary of the firm. Our model suggests that the product development process can be effectively organized as an evolutionary process of learning driven by criticism and error correction. We conclude by offering some theoretical implications of our community-based model of knowledge creation for the literature of organizational learning, community life, and the uses of knowledge in society.

711 citations


01 Dec 2003
TL;DR: This book integrates two areas of computer science, namely data mining and evolutionary algorithms, and emphasizes the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making.
Abstract: From the Publisher: This book integrates two areas of computer science, namely data mining and evolutionary algorithms. Both these areas have become increasingly popular in the last few years, and their integration is currently an active research area. In general, data mining consists of extracting knowledge from data. In particular, in this book we emphasize the importance of discovering comprehensible, interesting knowledge, which is potentially useful for intelligent decision making. In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions. In contrast, most rule induction methods perform a local, greedy search in the space of candidate rules. Intuitively, the global search of evolutionary algorithms can discover interesting rules and patterns that would be missed by the greedy search performed by most rule induction methods.

699 citations


Journal ArticleDOI
TL;DR: The Artequakt project, which links a knowledge extraction tool with an ontology to guide information extraction and achieve continuous knowledge support, is considered; extraction is further enhanced using a lexicon-based term expansion mechanism that provides extended ontology terminology.
Abstract: To bring the Semantic Web to life and provide advanced knowledge services, we need efficient ways to access and extract knowledge from Web documents. Although Web page annotations could facilitate such knowledge gathering, annotations are rare and will probably never be rich or detailed enough to cover all the knowledge these documents contain. Manual annotation is impractical and unscalable, and automatic annotation tools remain largely undeveloped. Specialized knowledge services therefore require tools that can search and extract specific knowledge directly from unstructured text on the Web, guided by an ontology that details what type of knowledge to harvest. An ontology uses concepts and relations to classify domain knowledge. Other researchers have used ontologies to support knowledge extraction, but few have explored their full potential in this domain. The paper considers the Artequakt project which links a knowledge extraction tool with an ontology to achieve continuous knowledge support and guide information extraction. The extraction tool searches online documents and extracts knowledge that matches the given classification structure. It provides this knowledge in a machine-readable format that will be automatically maintained in a knowledge base (KB). Knowledge extraction is further enhanced using a lexicon-based term expansion mechanism that provides extended ontology terminology.
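
As a toy illustration of ontology-guided extraction (not the Artequakt tool itself), the sketch below uses an assumed ontology fragment whose relations carry lexical patterns and stores any matches as triples in a small knowledge base; all patterns, relation names and the example sentences are invented.

```python
# Toy sketch: an ontology fragment drives extraction of relation triples
# from unstructured text via lexical patterns (illustrative only).
import re

# Ontology fragment: relation name -> list of surface patterns with two slots.
ontology = {
    "date_of_birth": [r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<value>\d{4})"],
    "place_of_birth": [r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<value>[A-Z][a-z]+)"],
}

def extract(text, ontology):
    kb = []  # knowledge base of (subject, relation, value) triples
    for relation, patterns in ontology.items():
        for pattern in patterns:
            for m in re.finditer(pattern, text):
                kb.append((m.group("person"), relation, m.group("value")))
    return kb

doc = "Pablo Picasso was born in 1881. Pablo Picasso was born in Malaga."
print(extract(doc, ontology))
```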

490 citations


Book ChapterDOI
01 Jan 2003
TL;DR: This chapter discusses the use of evolutionary algorithms, particularly genetic algorithms and genetic programming, in data mining and knowledge discovery, and discusses some preprocessing and postprocessing steps of the knowledge discovery process, focusing on attribute selection and pruning of an ensemble of classifiers.
Abstract: This chapter discusses the use of evolutionary algorithms, particularly genetic algorithms and genetic programming, in data mining and knowledge discovery. We focus on the data mining task of classification. In addition, we discuss some preprocessing and postprocessing steps of the knowledge discovery process, focusing on attribute selection and pruning of an ensemble of classifiers. We show how the requirements of data mining and knowledge discovery influence the design of evolutionary algorithms. In particular, we discuss how individual representation, genetic operators and fitness functions have to be adapted for extracting high-level knowledge from data.
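
A compact sketch of the attribute-selection idea with a wrapper-style fitness function, assuming scikit-learn and a bit-mask individual representation; this illustrates the general approach rather than the chapter's specific algorithm.

```python
# Sketch: genetic algorithm for attribute (feature) selection.
# Individuals are bit masks over attributes; fitness is cross-validated
# accuracy of a classifier restricted to the selected attributes.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
n_attr = X.shape[1]

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, n_attr))            # random initial population
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]             # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, n_attr)                    # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_attr) < 0.02                 # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected attributes:", np.flatnonzero(best))
```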

452 citations


Journal ArticleDOI
TL;DR: A set of performance measures is proposed that are sensitive to two types of reconstruction errors and appropriate for different knowledge discovery (KDD) applications, helping the analyst select the heuristic best suited for the application at hand.
Abstract: Web-usage mining has become the subject of intensive research, as its potential for personalized services, adaptive Web sites and customer profiling is recognized. However, the reliability of Web-usage mining results depends heavily on the proper preparation of the input datasets. In particular, errors in the reconstruction of sessions and incomplete tracing of users' activities in a site can easily result in invalid patterns and wrong conclusions. In this study, we evaluate the performance of heuristics employed to reconstruct sessions from the server log data. Such heuristics are called to partition activities first by user and then by visit of the user in the site, where user identification mechanisms, such as cookies, may or may not be available. We propose a set of performance measures that are sensitive to two types of reconstruction errors and appropriate for different types of knowledge discovery (KDD) applications. We have tested our framework on the Web server data of a frame-based Web site. The first experiment concerned a specific KDD application and has shown the sensitivity of the heuristics to particularities of the site's structure and traffic. The second experiment is not bound to a specific application but rather compares the performance of the heuristics for different measures and thus for different application types. Our results show that there is no single best heuristic, but our measures help the analyst in the selection of the heuristic best suited for the application at hand.
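
For readers unfamiliar with sessionization, the sketch below shows one common time-oriented heuristic of the kind evaluated here, splitting a user's requests whenever the gap between consecutive hits exceeds a 30-minute inactivity threshold; the log record layout is an assumption.

```python
# Sketch of a time-oriented session reconstruction heuristic:
# requests from the same user are split into sessions whenever the gap
# between consecutive requests exceeds a 30-minute inactivity threshold.
from collections import defaultdict

MAX_GAP = 30 * 60  # seconds

def sessionize(log_records):
    """log_records: iterable of (user_id, timestamp, url) tuples (assumed layout)."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log_records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > MAX_GAP:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions

log = [("u1", 0, "/home"), ("u1", 600, "/products"), ("u1", 5000, "/home"),
       ("u2", 100, "/home")]
for user, pages in sessionize(log):
    print(user, [url for _, url in pages])
```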

358 citations


Journal ArticleDOI
TL;DR: The results show that the evolutionary instance selection algorithms consistently outperform the nonevolutionary ones, the main advantages being: better instance reduction rates, higher classification accuracy, and models that are easier to interpret.
Abstract: Evolutionary algorithms are adaptive methods based on natural evolution that may be used for search and optimization. As data reduction in knowledge discovery in databases (KDD) can be viewed as a search problem, it could be solved using evolutionary algorithms (EAs). In this paper, we have carried out an empirical study of the performance of four representative EA models in which we have taken into account two different instance selection perspectives, the prototype selection and the training set selection for data reduction in KDD. This paper includes a comparison between these algorithms and other nonevolutionary instance selection algorithms. The results show that the evolutionary instance selection algorithms consistently outperform the nonevolutionary ones, the main advantages being: better instance reduction rates, higher classification accuracy, and models that are easier to interpret.

325 citations


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper presents a method to build decision tree classifiers from the disguised data, and shows that although the data are disguised, this method can still achieve fairly high accuracy.
Abstract: Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use the randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.
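
A minimal sketch of the randomized response idea for a single binary attribute: each value is reported truthfully with probability theta and flipped otherwise, and the true proportion needed for mining is reconstructed from the disguised counts. This illustrates the disguising and estimation step only, not the paper's full decision-tree procedure.

```python
# Sketch of randomized response disguising for a binary attribute, plus the
# count reconstruction used when mining the disguised data.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8                       # probability of reporting the true value

def disguise(values, theta):
    """Report each 0/1 value truthfully with prob. theta, flipped otherwise."""
    keep = rng.random(len(values)) < theta
    return np.where(keep, values, 1 - values)

def estimate_true_proportion(disguised, theta):
    """Invert E[observed] = theta*p + (1-theta)*(1-p) to recover p."""
    observed = disguised.mean()
    return (observed - (1 - theta)) / (2 * theta - 1)

true_values = rng.integers(0, 2, size=10_000)       # hidden sensitive attribute
disguised = disguise(true_values, theta)
print("true proportion   :", true_values.mean())
print("naive (disguised) :", disguised.mean())
print("reconstructed     :", estimate_true_proportion(disguised, theta))
```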

319 citations


Journal ArticleDOI
TL;DR: The experiments show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset.
Abstract: Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset) and we formalize this notion in the framework of ε-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189–194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules (a popular kind of patterns that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
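
For context, the baseline task that free-sets condense can be pictured with a small levelwise (Apriori-style) frequent-itemset counter; the sketch below is that baseline, not the authors' free-set extraction algorithm, and the toy baskets are invented.

```python
# Baseline sketch: levelwise (Apriori-style) enumeration of frequent itemsets,
# i.e. the representation that free-sets condense. Not the free-set algorithm.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support_count} with support >= min_support."""
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join surviving k-itemsets into (k+1)-itemsets.
        keys = sorted(survivors, key=sorted)
        level = sorted({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == len(a) + 1}, key=sorted)
    return frequent

baskets = [frozenset(t) for t in
           [{"beer", "chips"}, {"beer", "chips", "salsa"}, {"chips", "salsa"}, {"beer"}]]
for itemset, count in frequent_itemsets(baskets, min_support=2).items():
    print(sorted(itemset), count)
```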

290 citations


Patent
19 Sep 2003
TL;DR: In this article, a system and method for accessing domains of information to identify heretofore unknown relationships between disparate sources of data in order to seek and obtain knowledge is presented.
Abstract: The present invention is a system and method for accessing domains of information to identify heretofore unknown relationships between disparate sources of data (7) to seek and obtain knowledge (18). The invention includes a source of data with one or more domains of information, an Object-Relationship Database (53) for integrating objects from one or more domains of information, and a knowledge discovery engine (54) where relationships between two or more objects are identified, retrieved, grouped, ranked, filtered and numerically evaluated.

218 citations


Journal ArticleDOI
TL;DR: The use of data mining techniques in knowledge discovery in medical databases is likely to be of increasing importance in the process of pharmacovigilance, as they are likely to be able to detect signals earlier than current methods.
Abstract: Aims: To discuss the potential use of data mining and knowledge discovery in databases for detection of adverse drug events (ADE) in pharmacovigilance. Methods: A literature search was conducted to identify articles, which contained details of data mining, signal generation or knowledge discovery in relation to adverse drug reactions or pharmacovigilance in medical databases. Results: ADEs are common and result in significant mortality, and despite existing systems drugs have been withdrawn due to ADEs many years after licensing. Knowledge discovery in databases (KDD) is a technique which may be used to detect potential ADEs more efficiently. KDD involves the selection of data variables and databases, data preprocessing, data mining and data interpretation and utilization. Data mining encompasses a number of statistical techniques including cluster analysis, link analysis, deviation detection and disproportionality assessment which can be utilized to determine the presence of and to assess the strength of ADE signals. Currently the only data mining methods to be used in pharmacovigilance are those of disproportionality, such as the Proportional Reporting Ratio and Information Component, which have been used to analyse the UK Yellow Card Scheme spontaneous reporting database and the WHO Uppsala Monitoring Centre database. The association of pericarditis with practolol but not with other β-blockers, the association of captopril and other angiotensin-converting enzyme inhibitors with cough, and the association of terfenadine with heart rate and rhythm disorders could be identified by mining the WHO database. Conclusion: In view of the importance of ADEs and the development of massive data storage systems and powerful computer systems, the use of data mining techniques in knowledge discovery in medical databases is likely to be of increasing importance in the process of pharmacovigilance as they are likely to be able to detect signals earlier than using current methods.
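
The Proportional Reporting Ratio mentioned above is straightforward to compute from a 2x2 contingency table of spontaneous reports; a minimal sketch with invented counts follows.

```python
# Proportional Reporting Ratio (PRR) from a 2x2 table of spontaneous reports:
#                reaction R   other reactions
#   drug D           a              b
#   other drugs      c              d
# PRR = [a / (a + b)] / [c / (c + d)]
def proportional_reporting_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

# Illustrative (made-up) counts: 40 reports of the reaction with the drug,
# 960 other reports for the drug, 200 / 49800 for all other drugs.
print(proportional_reporting_ratio(40, 960, 200, 49_800))   # PRR = 10.0
```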

01 Jan 2003
TL;DR: The main process of KDD is the data mining process, in which different algorithms are applied to produce hidden knowledge; a postprocessing step then evaluates the mining result according to users' requirements and domain knowledge.
Abstract: Data mining [Chen et al. 1996] is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large information repositories such as relational databases, data warehouses, XML repositories, etc. Data mining is also known as one of the core processes of Knowledge Discovery in Databases (KDD). Many people take data mining as a synonym for another popular term, Knowledge Discovery in Databases (KDD), while others treat data mining as the core process of KDD. The KDD processes are shown in Figure 1 [Han and Kamber 2000]. Usually there are three processes. One is called preprocessing, which is executed before data mining techniques are applied to the right data. The preprocessing includes data cleaning, integration, selection and transformation. The main process of KDD is the data mining process; in this process different algorithms are applied to produce hidden knowledge. After that comes another process called postprocessing, which evaluates the mining result according to users' requirements and domain knowledge. Regarding the evaluation results, the knowledge can be presented if the result is satisfactory; otherwise we have to run some or all of those processes again until we get a satisfactory result. The actual processes work as follows. First we need to clean and integrate the databases. Since the data source may come from different databases, which may have some inconsistencies and duplications, we must clean the data source by removing the noise or making some compromises. Suppose we have two different databases and different words are used to refer to the same thing in their schemas. When we try to integrate the two sources we can only choose one of the terms, provided we know that they denote the same thing. Also, real-world data tend to be incomplete and noisy due to manual input mistakes. The integrated data sources can be stored in a database, data warehouse or other repository. As not all the data in the database are related to our mining task, the second process is to select task-related data from the integrated resources and transform them into a format that is ready to be mined. Suppose we want to find which items are often purchased together in a supermarket; the database that records the purchase history may contain the customer ID, items bought, transaction time, prices, the number of each item and so on, but for this specific task we only need the items bought. After selection of relevant data, the database that we are going to apply our data mining techniques to will be much smaller, and consequently the whole process will be more efficient.
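
The selection and transformation step described above, keeping only the attribute relevant to the market-basket task, can be pictured with a tiny sketch; the record field names are illustrative assumptions.

```python
# Sketch of the selection/transformation step of the KDD process described
# above: from full purchase records, keep only the attribute relevant to the
# market-basket task (items bought). Field names are illustrative.
raw_records = [
    {"customer_id": 17, "items": ["milk", "bread"], "time": "09:14", "total": 4.20},
    {"customer_id": 42, "items": ["beer", "chips"], "time": "18:03", "total": 7.50},
    {"customer_id": 17, "items": ["milk", "beer", "chips"], "time": "19:40", "total": 9.10},
]

# Selection: only the task-relevant attribute; transformation: one basket per row.
baskets = [set(record["items"]) for record in raw_records]
print(baskets)   # ready to be handed to a frequent-itemset miner
```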

Journal Article
TL;DR: It is argued that a core ontology is one of the key building blocks necessary to enable the scalable assimilation of information from diverse sources and the subsequent building of a variety of services such as cross-domain searching, browsing, data mining and knowledge extraction.
Abstract: In this paper, we argue that a core ontology is one of the key building blocks necessary to enable the scalable assimilation of information from diverse sources. A complete and extensible ontology that expresses the basic concepts that are common across a variety of domains and can provide the basis for specialization into domain-specific concepts and vocabularies, is essential for well-defined mappings between domain-specific knowledge representations (i.e., metadata vocabularies) and the subsequent building of a variety of services such as cross-domain searching, browsing, data mining and knowledge extraction. This paper describes the results of a series of three workshops held in 2001 and 2002 which brought together representatives from the cultural heritage and digital library communities with the goal of harmonizing their knowledge perspectives and producing a core ontology. The knowledge perspectives of these two communities were represented by the CIDOC/CRM [31], an ontology for information exchange in the cultural heritage and museum community, and the ABC ontology [33], a model for the exchange and integration of digital library information. This paper describes the mediation process between these two different knowledge biases and the results of this mediation - the harmonization of the ABC and CIDOC/CRM ontologies, which we believe may provide a useful basis for information integration in the wider scope of the involved communities.

Journal ArticleDOI
01 Jul 2003
TL;DR: A design of a knowledge management system called KnowledgeScope is proposed that addresses these problems through an integrated workflow support capability, which captures and retrieves knowledge as an organizational process proceeds, and a process meta-model, which organizes that knowledge and context in a knowledge repository.
Abstract: Knowledge repositories have been implemented in many organizations, but they often suffer from non-use. This research considers two key design factors that cause non-use: the extra burden on users to document knowledge in the repository, and the lack of a standard knowledge structure that facilitates knowledge sharing among users with different perspectives. We propose a design of a knowledge management system called KnowledgeScope that addresses these problems through (1) an integrated workflow support capability that captures and retrieves knowledge as an organizational process proceeds, i.e., within the context in which it is created and used, and (2) a process meta-model that organizes that knowledge and context in a knowledge repository. In this paper, we describe this design and report the results from implementing it in a real-life organization.

Posted Content
TL;DR: A Challenge Evaluation task created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup is reported, in which 18 participating groups provided systems that flagged articles for curation based on whether the article contained experimental evidence for gene expression products.
Abstract: MOTIVATION: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. RESULTS: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups. CONTACT: asy@mitre.org KEYWORDS: text mining, evaluation, curation, genomics, data management
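
A hedged sketch of the general kind of "flag for curation" classifier the participating systems built, using bag-of-words features and naive Bayes via scikit-learn; the miniature corpus is invented purely for illustration and is unrelated to the FlyBase data.

```python
# Sketch of a "flag this article for curation" classifier of the general kind
# built for the KDD Challenge Cup task: bag-of-words + naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Northern blot analysis shows transcript expression in embryos",
    "in situ hybridization reveals expression of the gene product",
    "we describe a new genetic mapping technique",
    "phylogenetic analysis of the gene family across species",
]
train_labels = [1, 1, 0, 0]    # 1 = contains gene-expression evidence, curate

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["western blot shows protein expression in the wing disc"]))
```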

Journal ArticleDOI
TL;DR: This paper proposes a practical methodology to capture and represent organizational knowledge that uses a knowledge map as a tool to represent knowledge.
Abstract: Recently, research interest in knowledge management has grown rapidly. Much research on knowledge management is conducted in academic and industrial communities. Utilizing knowledge accumulated in an organization can be a strategic weapon to acquire a competitive advantage. Capturing and representing knowledge is critical in knowledge management. This paper proposes a practical methodology to capture and represent organizational knowledge. The methodology uses a knowledge map as a tool to represent knowledge. We explore several techniques of knowledge representation and suggest a roadmap with concrete procedures to build the knowledge map. A case study in a manufacturing company is provided.

Journal ArticleDOI
TL;DR: The KDD Challenge Cup evaluation task described in this paper assessed text mining techniques for flagging articles containing experimental evidence for gene expression products; the evaluation results are reported and the techniques used by the top-performing groups are described.
Abstract: Motivation: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. Results: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new (‘blind’) articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups.

Proceedings ArticleDOI
20 Jul 2003
TL;DR: An approach to network intrusion detection is investigated, based purely on a hierarchy of Self-Organizing Feature Maps, which achieves detection rates of 89% with false positive rates of 4.6%, at least as good as alternative data-mining approaches that require all 41 features.
Abstract: An approach to network intrusion detection is investigated, based purely on a hierarchy of Self-Organizing Feature Maps. Our principal interest is to establish just how far such an approach can be taken in practice. To do so, the KDD benchmark dataset from the International Knowledge Discovery and Data Mining Tools Competition is employed. This supplies a connection-based description of a fictitious computer network in which each connection is described in terms of 41 features. Unlike previous approaches, only 6 of the most basic features are employed. The resulting system is capable of detection (false positive) rates of 89% (4.6%), where this is at least as good as the alternative data-mining approaches that require all 41 features.
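
As a minimal illustration of the building block the hierarchy is composed of, here is a plain numpy self-organizing map training loop over six-dimensional inputs; the random "connection records" are stand-ins, not the KDD-99 data, and a single map is trained rather than the paper's hierarchy.

```python
# Minimal self-organizing feature map (single map, not the paper's hierarchy),
# trained on random stand-ins for 6 basic connection features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                 # placeholder "connection records"
rows, cols, dim = 10, 10, X.shape[1]
weights = rng.random((rows, cols, dim))   # neuron codebook vectors
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_steps = 5000
for t in range(n_steps):
    x = X[rng.integers(len(X))]
    # Best-matching unit: neuron whose codebook vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # Decaying learning rate and neighbourhood radius.
    lr = 0.5 * (1 - t / n_steps)
    sigma = 3.0 * (1 - t / n_steps) + 0.5
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
    weights += lr * influence * (x - weights)

print("trained codebook shape:", weights.shape)
```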


Journal ArticleDOI
TL;DR: Simulation results show that the evolutionary classifier produces comprehensible rules and good classification accuracy for the medical datasets and results obtained from t-tests further justify its robustness and invariance to random partition of datasets.

Book ChapterDOI
23 Sep 2003
TL;DR: The web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users by means of data and web mining techniques are described.
Abstract: We describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed for the purpose of offering a personalized and proactive view of the web services to users. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.


BookDOI
01 Jan 2003
TL;DR: It is argued that sailing is an interesting paradigm for a class of hybrid systems that one could call Skill-based Systems.
Abstract: This paper describes the Robosail project. It started in 1997 with the aim to build a self-learning auto pilot for a single handed sailing yacht. The goal was to make an adaptive system that would help a single handed sailor to go faster on average in a race. Presently, after five years of development and a number of sea trials, we have a commercial system available (www.robosail.com). It is a hybrid system using agent technology, machine learning, data mining and rule-based reasoning. Apart from describing the system we try to generalize our findings, and argue that sailing is an interesting paradigm for a class of hybrid systems that one could call Skill-based Systems.

Book ChapterDOI
29 Sep 2003
TL;DR: This paper compares up-to-date methods for propositionalization from two main groups, logic-oriented and database-oriented techniques: the former can handle complex background knowledge and provide expressive first-order models, while the latter can be more efficient, especially on larger data sets.
Abstract: Propositionalization has already been shown to be a promising approach for robustly and effectively handling relational data sets for knowledge discovery. In this paper, we compare up-to-date methods for propositionalization from two main groups: logic-oriented and database-oriented techniques. Experiments using several learning tasks – both ILP benchmarks and tasks from recent international data mining competitions – show that both groups have their specific advantages. While logic-oriented methods can handle complex background knowledge and provide expressive first-order models, database-oriented methods can be more efficient especially on larger data sets. Obtained accuracies vary such that a combination of the features produced by both groups seems a further valuable venture.
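
A small database-oriented propositionalization sketch using pandas: rows of a related table linked to each example are summarized into aggregate columns; the table and column names are assumptions.

```python
# Database-oriented propositionalization sketch: rows of a related table are
# summarized into aggregate features of the target table (names are assumed).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "churned": [0, 1, 0]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount":      [20.0, 35.0, 5.0, 12.0, 18.0, 40.0],
})

# Aggregate the one-to-many relation into fixed-length propositional features.
aggregates = orders.groupby("customer_id")["amount"].agg(
    n_orders="count", total_amount="sum", max_amount="max").reset_index()
table = customers.merge(aggregates, on="customer_id", how="left").fillna(0)
print(table)   # one row per example, ready for any propositional learner
```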

Journal ArticleDOI
TL;DR: Refuting the notion of technology as a replacement of knowledge, this paper focuses on a gap between them that needs to be bridged, and two models of knowledge are reviewed.

Book ChapterDOI
Yiyu Yao1
01 Jan 2003
TL;DR: A critical review and analysis of information-theoretic measures of attribute importance and attribute association, with emphasis on their interpretations and connections are presented.
Abstract: A database may be considered as a statistical population, and an attribute as a statistical variable taking values from its domain. One can carry out statistical and information-theoretic analysis on a database. Based on the attribute values, a database can be partitioned into smaller populations. An attribute is deemed important if it partitions the database such that previously unknown regularities and patterns are observable. Many information-theoretic measures have been proposed and applied to quantify the importance of attributes and relationships between attributes in various fields. In the context of knowledge discovery and data mining (KDD), we present a critical review and analysis of information-theoretic measures of attribute importance and attribute association, with emphasis on their interpretations and connections.
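
One of the measures discussed, the information gain (mutual information) of an attribute with respect to the class, can be computed directly from partition counts; a short sketch with a toy dataset follows.

```python
# Sketch: information gain (mutual information) of an attribute with respect
# to the class, one of the information-theoretic importance measures discussed.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: "outlook" partitions the database into purer sub-populations.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(information_gain(outlook, play), 3))
```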

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work presents a strategy for answering fact-based natural language questions that is guided by a characterization of real-world user queries, implemented in a system called Aranea, that extracts answers from the Web using two different techniques: knowledge annotation and knowledge mining.
Abstract: We present a strategy for answering fact-based natural language questions that is guided by a characterization of real-world user queries. Our approach, implemented in a system called Aranea, extracts answers from the Web using two different techniques: knowledge annotation and knowledge mining. Knowledge annotation is an approach to answering large classes of frequently occurring questions by utilizing semistructured and structured Web sources. Knowledge mining is a statistical approach that leverages massive amounts of Web data to overcome many natural language processing challenges. We have integrated these two different paradigms into a question answering system capable of providing users with concise answers that directly address their information needs.
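
A rough sketch of the redundancy-based knowledge mining idea: candidate answers are short n-grams that recur across many retrieved snippets, scored by frequency; the snippets, stopword list and scoring here are invented for illustration and are not Aranea's actual pipeline.

```python
# Sketch of redundancy-based "knowledge mining": candidate answers are short
# n-grams that recur across retrieved snippets, ranked by frequency.
from collections import Counter

snippets = [
    "Mount Everest is 8848 metres high and lies in the Himalayas",
    "at 8848 metres, Everest is the highest mountain on Earth",
    "the summit of Everest reaches 8848 metres above sea level",
]
stopwords = {"is", "the", "and", "in", "on", "of", "at", "above"}

def candidate_answers(snippets, max_len=2):
    counts = Counter()
    for s in snippets:
        tokens = [t.strip(",.").lower() for t in s.split()]
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if not set(gram) & stopwords:
                    counts[gram] += 1
    return counts.most_common(5)

print(candidate_answers(snippets))   # frequent n-grams surface as answer candidates
```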

Journal ArticleDOI
TL;DR: The approach to protein functional annotation is described with case studies and an examination of common identification errors, and it is illustrated that data integration in PIR supports exploration of protein relationships and may reveal protein functional associations beyond sequence homology.

Book
04 Feb 2003
TL;DR: This edited volume collects chapters on data mining and knowledge discovery, ranging from Bayesian data mining, evolutionary control of inductive bias and visualization-supported cooperative learning to feature selection, parallel and distributed mining, rough sets, text mining and applications in finance, health care and human resources.
Abstract: Contents: A Survey of Bayesian Data Mining; Control of Inductive Bias in Supervised Learning Using Evolutionary Computation: A Wrapper-Based Approach; Cooperative Learning and Virtual Reality-Based Visualization for Data Mining; Feature Selection in Data Mining; Parallel and Distributed Data Mining Through Parallel Skeletons and Distributed Objects; Data Mining Based on Rough Sets; Impact of Missing Data on Data Mining; Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps; The Pitfalls of Knowledge Discovery in Databases and Data Mining; Maximum Performance Efficiency Approaches for Estimating Best Practice Costs; Bayesian Data Mining and Knowledge Discovery; Mining Free Text for Structure; Query-by-Structure Approach for the Web; Financial Benchmarking Using Self-Organizing Maps - Studying the International Pulp and Paper Industry; Data Mining in Health Care Applications; Data Mining for Human Resource Information Systems; Data Mining in Information Technology and Bank Performance; Social, Ethical and Legal Issues of Data Mining; Data Mining in Designing an Agent-Based DSS; Critical and Future Trends in DM: A Review of the Key DM Technologies and Applications

01 Jan 2003
TL;DR: An ontology for the Data Mining domain is presented that can be used to simplify the development of distributed knowledge discovery applications on the Grid, offering a domain expert a reference model for the different kinds of data mining tasks, methodologies and software available to solve a given problem and helping a user find the most appropriate solution.
Abstract: The Grid is an integrated infrastructure for coordinated resource sharing and problem solving in distributed environments. The effective and efficient use of stored data and its transformation into information and knowledge will be a main driver in Grid evolution. The use of ontologies to describe Grid resources will simplify and structure the systematic building of Grid applications through the composition and reuse of software components and the development of knowledge-based services and tools. The paper presents an ontology for the Data Mining domain that can be used to simplify the development of distributed knowledge discovery applications on the Grid, offering a domain expert a reference model for the different kinds of data mining tasks, methodologies and software available to solve a given problem, and helping a user find the most appropriate solution. How the DAMON ontology is used to enhance the design of distributed data mining applications on the KNOWLEDGE GRID is also shown.