Topic

Knowledge extraction

About: Knowledge extraction is a research topic. Over its lifetime, 20,251 publications have been published within this topic, receiving 413,401 citations.


Papers
01 Jan 2002
TL;DR: This analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
Abstract: Data quality is a major concern in Machine Learning and other correlated areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence of missing data, many Machine Learning algorithms handle missing data in a rather naive way. Missing data treatment should be carefully thought out; otherwise, bias might be introduced into the knowledge induced. In this work, we analyse the use of the k-nearest neighbour algorithm as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set with plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.

306 citations
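
The k-nearest neighbour imputation named in the abstract above can be illustrated with a minimal sketch. This is not the authors' procedure: the function names `row_distance` and `knn_impute`, the choice of a root-mean-square distance over shared features, and averaging the neighbours' values are all assumptions made for illustration, and the sketch handles numeric features only.

```python
import math


def row_distance(a, b):
    """Root-mean-square difference over features observed in both rows.

    Returns None when the two rows share no observed feature.
    (Assumed distance; the abstract does not specify one.)
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Return a copy of rows with each None replaced by the mean of that
    feature over the k nearest rows in which the feature is observed."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donor candidates: other rows where feature j is observed.
            candidates = []
            for other_index, other in enumerate(rows):
                if other_index == i or other[j] is None:
                    continue
                d = row_distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap if no donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical data: the last row is missing its second feature.
data = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [1.05, None]]
filled = knn_impute(data, k=2)
```

With `k=2` the two donors for the incomplete row are the first two rows, which are far closer to it than the third, so the filled value is the average of their second features.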

Posted Content
TL;DR: This article surveys the statistical challenges of high dimensionality in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management, and approaches variable selection and feature extraction through a unified framework of penalized likelihood methods.
Abstract: Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of high-dimensionality arise in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. We first give a comprehensive overview of statistical challenges with high dimensionality in these diverse disciplines. We then approach the problem of variable selection and feature extraction using a unified framework: penalized likelihood methods. Issues relevant to the choice of penalty functions are addressed. We demonstrate that for a host of statistical problems, as long as the dimensionality is not excessively large, we can estimate the model parameters as well as if the best model is known in advance. The persistence property in risk minimization is also addressed. The applicability of such a theory and method to diverse statistical problems is demonstrated. Other related problems with high-dimensionality are also discussed.

306 citations
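
How a penalty can perform variable selection, as discussed in the abstract above, can be seen in the simplest case. The sketch below assumes an L1 (lasso-type) penalty and a one-coefficient least-squares loss; the abstract does not single out any penalty function, and the names `soft_threshold` and `penalized_objective` and the sample numbers are hypothetical. For this loss, the minimizer of `0.5 * (z - b)**2 + lam * abs(b)` is the soft-thresholding rule, which sets small estimates exactly to zero.

```python
def penalized_objective(z, b, lam):
    """Squared-error loss plus an L1 penalty on the coefficient b."""
    return 0.5 * (z - b) ** 2 + lam * abs(b)


def soft_threshold(z, lam):
    """Closed-form minimizer of penalized_objective(z, b, lam) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # estimates within [-lam, lam] are dropped from the model


# Hypothetical unpenalized estimates for four candidate variables.
estimates = [3.0, -0.2, 0.05, -1.5]
lam = 0.5
selected = [soft_threshold(z, lam) for z in estimates]
# selected == [2.5, 0.0, 0.0, -1.0]
```

The two small estimates are shrunk to exactly zero, so only the first and last variables remain in the model, while the surviving coefficients are pulled toward zero by `lam`.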

Journal Article
TL;DR: Describes KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, posing new challenges in both knowledge discovery and evaluation criteria and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found.
Abstract: We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found. We chronicle the data generation phase starting from the collection at the site through its conversion to a star schema in a warehouse through data cleansing, data obfuscation for privacy protection, and data aggregation. We describe the information given to the participants, including the questions, site structure, the marketing calendar, and the data schema. Finally, we discuss interesting insights, common mistakes, and lessons learned. Three winners were announced and they describe their own experiences and lessons in the pages following this paper.

303 citations

Book Chapter
01 Jan 2011
TL;DR: In this chapter, data mining and knowledge discovery (DMKD) is presented with basic concepts, a brief history of its evolution, mathematical foundations, and usable techniques, along with the data warehouse and the decision support system (DSS).
Abstract: In this chapter, data mining and knowledge discovery (DMKD) is presented with basic concepts, a brief history of its evolution, mathematical foundations, and usable techniques, along with the data warehouse and the decision support system (DSS). First, dataset and knowledge are defined and elucidated under DMKD. DMKD is a discovery process with different hierarchies, granularities, and/or scales. Because this set of concepts may be best understood when viewed and explained from various perspectives, the chapter starts with a definition followed by a table explaining DMKD from different views (Sect. 5.1). The evolution of DMKD is then briefly tracked from the rapid advance in massive data to the birth of DMKD (Sect. 5.2). Some mathematical foundations are given in Sect. 5.3, i.e. probability theory, statistics, fuzzy set, rough set, data fields, and cloud models. Section 5.4 introduces some usable DMKD techniques. DMKD is used to discover a set of rules and exceptions with association, classification, clustering, prediction, discrimination, and exception detection. In Sects. 5.5 and 5.6, data warehouses and decision support systems are presented. The former is one of the data sources for DMKD, and DMKD is a new technique that assists the latter. Finally, trends and perspectives are summarized, and two promising fields are forecast: web mining and spatial data mining (Sect. 5.7).

300 citations


Network Information
Related Topics (5)
Cluster analysis
146.5K papers, 2.9M citations
90% related
Support vector machine
73.6K papers, 1.7M citations
90% related
Artificial neural network
207K papers, 4.5M citations
87% related
Fuzzy logic
151.2K papers, 2.3M citations
86% related
Feature extraction
111.8K papers, 2.1M citations
86% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    120
2022    285
2021    506
2020    660
2019    740
2018    683