Topic

Knowledge extraction

About: Knowledge extraction is a research topic. Over the lifetime, 20251 publications have been published within this topic receiving 413401 citations.

...read moreread less

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Data mining and knowledge discovery in databases

[...]

Usama M. Fayyad¹, Ramasamy Uthurusamy²•Institutions (2)

California Institute of Technology¹, General Motors²

01 Nov 1996-Communications of The ACM

TL;DR: This article presents a comprehensive introduction and summary of the main basic concepts and bibliography in the area of Data Mining, nowadays and can be considered as a good starting point for newcomers in the field.

...read moreread less

Abstract: The term knowledge discovery in databases or KDD, for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the “high-level” application of particular Data Mining (DM) methods (Fayyad, Piatetski-Shapiro, & Smyth, 1996). Fayyad considers DM as one of the phases of the KDD process. The DM phase concerns, mainly, the means by which the patterns are extracted and enumerated from data. The literature is sometimes a source of some confusion because the two terms are indistinctively used, making it difficult to determine exactly each of the concepts (Benoît, 2002). Nowadays, the two terms are, usually, indistinctly used. Efforts are being developed in order to create standards and rules in the field of DM with great relevance being given to the subject of inductive databases (De Raedt, 2003) (Imielinski & Mannila, 1996). Within the context of inductive databases a great relevance is given to the so called DM languages. This article presents a comprehensive introduction and summary of the main basic concepts and bibliography in the area of DM, nowadays. Thus, the main contribution of this article is that it can be considered as a good starting point for newcomers in the area. The remaining of this article is organized as follows. Firstly, DM and the KDD process are introduced. Following, the main DM tasks, methods/algorithms, and models/patterns are organized and succinctly explained. SEMMA and CRISP-DM are next introduced and compared with KDD. A brief explanation of standards for DM is then presented. The article concludes with possible future research directions and conclusion. BACKGROUND

...read moreread less

570 citations

Text Mining: The state of the art and the challenges

[...]

Ah-Hwee Tan, Heng Mui, Keng Terrace

01 Jan 2000

TL;DR: A text mining framework consisting of two components: Text refining that transforms unstructured text documents into an intermediate form; and knowledge distillation that deduces patterns or knowledge from the intermediate form is presented.

...read moreread less

Abstract: Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial values. Last count reveals that there are more than ten high-tech companies offering products for text mining. Has text mining evolved so rapidly to become a mature field? This article attempts to shed some lights to the question. We first present a text mining framework consisting of two components: Text refining that transforms unstructured text documents into an intermediate form; and knowledge distillation that deduces patterns or knowledge from the intermediate form. We then survey the state-of-the-art text mining products/applications and align them based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt. In conclusion, we highlight the upcoming challenges of text mining and the opportunities it offers.

...read moreread less

560 citations

Proceedings Article•

The field matching problem: Algorithms and applications

[...]

Alvaro Monge¹, Charles Elkan¹•Institutions (1)

University of California, San Diego¹

02 Aug 1996

TL;DR: Three field matching algorithms are described, one of which is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences, and their performance on real-world datasets is evaluated.

...read moreread less

Abstract: To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are alternative designations of the same semantic entity. For example the addresses Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0114, La Jolla. CA 92093 and UCSD, Computer Science and Engineering Department, CA 92093-0114 do designate the same department. This paper describes three field matching algorithms, and evaluates their performance on real-world datasets. One proposed method is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences. Several applications of field matching in knowledge discovery are described briefly, including WEBFIND, which is a new software tool that discovers scientific papers published on the worldwide web. WEBFIND uses external information sources to guide its search for authors and papers. Like many other worldwide web tools, WEBFIND needs to solve the field matching problem in order to navigate between information sources.

...read moreread less

557 citations

Journal Article•DOI•

Theoretical Comparison between the Gini Index and Information Gain Criteria

[...]

Laura Elena Raileanu¹, Kilian Stoffel¹•Institutions (1)

University of Neuchâtel¹

01 May 2004-Annals of Mathematics and Artificial Intelligence

TL;DR: A formal methodology is introduced, which allows us to compare multiple split criteria and permits us to present fundamental insights into the decision process.

...read moreread less

Abstract: Knowledge Discovery in Databases (KDD) is an active and important research area with the promise for a high payoff in many business and scientific applications. One of the main tasks in KDD is classification. A particular efficient method for classification is decision tree induction. The selection of the attribute used at each node of the tree to split the data (split criterion) is crucial in order to correctly classify objects. Different split criteria were proposed in the literature (Information Gain, Gini Index, etc.). It is not obvious which of them will produce the best decision tree for a given data set. A large amount of empirical tests were conducted in order to answer this question. No conclusive results were found. In this paper we introduce a formal methodology, which allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given data set. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.

...read moreread less

554 citations

Proceedings Article•DOI•

[...]

Shyam Boriah¹, Varun Chandola¹, Vipin Kumar•Institutions (1)

University of Minnesota¹

01 Jan 2008

TL;DR: In this paper, the performance of a variety of similarity measures in the context of a specific data mining task is evaluated. But their relative performance has not been evaluated for all types of problems.

...read moreread less

Abstract: Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.

...read moreread less

554 citations

Collapse

Network Information

Performance

Metrics

20,644

Papers

453,302

Citations

No. of papers in the topic in previous years
Year	Papers
2023	120
2022	285
2021	506
2020	660
2019	740
2018	683

Knowledge extraction

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics