scispace - formally typeset
Topic

Knowledge extraction

About: Knowledge extraction is a research topic. Over the lifetime, 20,251 publications have been published within this topic, receiving 413,401 citations.


Papers
01 Jan 2003
TL;DR: The main process of KDD is data mining, in which different algorithms are applied to produce hidden knowledge; a postprocessing step then evaluates the mining result against users' requirements and domain knowledge.
Abstract: Data mining [Chen et al. 1996] is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large information repositories such as relational databases, data warehouses, XML repositories, etc. Data mining is also known as one of the core processes of Knowledge Discovery in Database (KDD). Many people take data mining as a synonym for the more popular term Knowledge Discovery in Database (KDD); others treat data mining as the core process of KDD. The KDD processes are shown in Figure 1 [Han and Kamber 2000]. Usually there are three processes. The first is preprocessing, which is executed before data mining techniques are applied to the right data. Preprocessing includes data cleaning, integration, selection and transformation. The main process of KDD is the data mining process, in which different algorithms are applied to produce hidden knowledge. After that comes another process called postprocessing, which evaluates the mining result according to users' requirements and domain knowledge. If the evaluation is satisfactory, the knowledge can be presented; otherwise some or all of the processes must be run again until a satisfactory result is obtained. The actual processes work as follows. First we need to clean and integrate the databases. Since the data may come from different databases, which may contain inconsistencies and duplications, we must clean the data source by removing noise or making some compromises. Suppose we have two different databases whose schemas use different words to refer to the same thing: when we integrate the two sources we can choose only one of them, provided we know that they denote the same thing. Real-world data also tend to be incomplete and noisy due to manual input mistakes.
The integrated data sources can be stored in a database, data warehouse or other repository. As not all the data in the database are related to our mining task, the second process is to select task-related data from the integrated resources and transform them into a format that is ready to be mined. Suppose we want to find which items are often purchased together in a supermarket. The database that records the purchase history may contain customer ID, items bought, transaction time, prices, the number of each item and so on, but for this specific task we only need the items bought. After selection of the relevant data, the database to which we apply our data mining techniques will be much smaller, and consequently the whole process will be
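The supermarket example above can be sketched as a minimal KDD pipeline: selection keeps only the items-bought field, mining counts co-purchased item pairs, and postprocessing applies a support threshold. All record fields, item names and the threshold below are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase-history records; each carries fields
# (customer ID, transaction time) that the mining task does not need.
raw_records = [
    {"customer_id": 1, "items": ["bread", "milk", "eggs"], "time": "09:10"},
    {"customer_id": 2, "items": ["bread", "milk"], "time": "09:42"},
    {"customer_id": 3, "items": ["milk", "eggs"], "time": "10:05"},
    {"customer_id": 1, "items": ["bread", "milk"], "time": "11:30"},
]

# Selection/transformation: keep only the items bought.
baskets = [set(r["items"]) for r in raw_records]

# Mining: count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Postprocessing: keep only pairs meeting a minimum support threshold.
min_support = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('bread', 'milk'): 3, ('eggs', 'milk'): 2}
```

An unsatisfactory result at the postprocessing step (e.g. too few or too many frequent pairs) would send us back to adjust the threshold or the selection, mirroring the feedback loop in the KDD process.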

150 citations

Proceedings ArticleDOI
07 Aug 2002
TL;DR: A parallel granular neural network (GNN) is developed to speed up the data mining and knowledge discovery process for credit card fraud detection; it yields lower average training error with larger amounts of past training data.
Abstract: A parallel granular neural network (GNN) is developed to speed up the data mining and knowledge discovery process for credit card fraud detection. The entire system is parallelized on the Silicon Graphics Origin 2000, a shared-memory multiprocessor system with 24 CPUs, 4 GB of main memory, and a 200 GB hard drive. In simulations, the parallel fuzzy neural network running on the 24-processor system is trained in parallel using training data sets, and the trained network then discovers fuzzy rules for future prediction. A parallel learning algorithm is implemented in C. The data are extracted into a flat file from an SQL Server database containing sample Visa card transactions and then preprocessed for use in fraud detection. The data are classified into three categories: the first for training, the second for prediction, and the third for fraud detection. After learning from the training data, the GNN is used to predict on the second set, and the third set is then applied for fraud detection. The GNN yields lower average training error with larger amounts of past training data. The higher the fraud detection error, the greater the possibility that the transaction is actually fraudulent.
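The three-way split of preprocessed transactions described above (training / prediction / fraud detection) can be sketched as below. The function name and the split fractions are assumptions for illustration; the paper does not specify its partition sizes.

```python
# Hypothetical sketch of the paper's three-way data split into
# training, prediction, and fraud-detection sets.
def three_way_split(records, train_frac=0.6, predict_frac=0.2):
    """Partition a list of records into three consecutive slices."""
    n = len(records)
    n_train = int(n * train_frac)
    n_predict = int(n * predict_frac)
    training = records[:n_train]
    prediction = records[n_train:n_train + n_predict]
    detection = records[n_train + n_predict:]
    return training, prediction, detection

transactions = list(range(100))  # stand-in for preprocessed card transactions
train, pred, detect = three_way_split(transactions)
print(len(train), len(pred), len(detect))  # 60 20 20
```

The model would be fit on `train`, evaluated on `pred`, and only then applied to `detect`, so that detection quality is judged on data the network has never seen.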

150 citations

Patent
12 May 2011
TL;DR: This patent presents a knowledge representation system that includes a knowledge base in which knowledge is represented in a structured, machine-readable format that encodes meaning; techniques for extracting structured knowledge from unstructured text, and for determining the reliability of such extracted knowledge, are also described.
Abstract: Embodiments of the present invention relate to knowledge representation systems which include a knowledge base in which knowledge is represented in a structured, machine-readable format that encodes meaning. Techniques for extracting structured knowledge from unstructured text and for determining the reliability of such extracted knowledge are also described.

150 citations

Journal ArticleDOI
TL;DR: Data quality is a particularly troublesome issue in data mining applications, and this is examined.
Abstract: Data mining is defined as the process of seeking interesting or valuable information within large data sets. This presents novel challenges and problems, distinct from those typically arising in the allied areas of statistics, machine learning, pattern recognition or database science. A distinction is drawn between the two data mining activities of model building and pattern detection. Even though statisticians are familiar with the former, the large data sets involved in data mining mean that novel problems do arise. The second of the activities, pattern detection, presents entirely new classes of challenges, some arising, again, as a consequence of the large sizes of the data sets. Data quality is a particularly troublesome issue in data mining applications, and this is examined. The discussion is illustrated with a variety of real examples.

150 citations

Proceedings ArticleDOI
28 Jun 2007
TL;DR: This paper introduces a framework of distance operators based on primitive as well as derived parameters of trajectories (speed and direction) to support trajectory clustering and classification mining tasks, which require a way to quantify the distance between two trajectories.
Abstract: Trajectory database (TD) management is a relatively new topic of database research, which has emerged due to the explosion of mobile devices and positioning technologies. Trajectory similarity search forms an important class of queries in TD with applications in trajectory data analysis and spatiotemporal knowledge discovery. In contrast to related works which make use of generic similarity metrics that virtually ignore the temporal dimension, in this paper we introduce a framework consisting of a set of distance operators based on primitive (space and time) as well as derived parameters of trajectories (speed and direction). The novelty of the approach is not only to provide qualitatively different means to query for similar trajectories, but also to support trajectory clustering and classification mining tasks, which definitely imply a way to quantify the distance between two trajectories. For each of the proposed distance operators we devise highly parametric algorithms, the efficiency of which is evaluated through an extensive experimental study using synthetic and real trajectory datasets.
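The distinction above between primitive parameters (space and time) and derived ones (speed, direction) can be illustrated with a minimal sketch: a trajectory as a sequence of (x, y, t) samples, a spatial distance over time-aligned points, and a derived mean-speed parameter. The function names and definitions here are illustrative assumptions, not the operators defined in the paper.

```python
import math

def spatial_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned samples of two
    trajectories, each given as a list of (x, y, t) tuples."""
    return sum(
        math.hypot(ax - bx, ay - by)
        for (ax, ay, _), (bx, by, _) in zip(traj_a, traj_b)
    ) / min(len(traj_a), len(traj_b))

def mean_speed(traj):
    """Derived parameter: total path length over elapsed time."""
    dist = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1, _), (x2, y2, _) in zip(traj, traj[1:])
    )
    return dist / (traj[-1][2] - traj[0][2])

a = [(0, 0, 0), (3, 4, 1), (6, 8, 2)]
b = [(0, 1, 0), (3, 5, 1), (6, 9, 2)]
print(spatial_distance(a, b))  # 1.0
print(mean_speed(a))           # 5.0
```

A clustering or classification task would combine several such operators (spatial, temporal, speed, direction) into one distance, which is the kind of quantification the framework's operators provide.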

149 citations


Network Information
Related Topics (5)
Cluster analysis
146.5K papers, 2.9M citations
90% related
Support vector machine
73.6K papers, 1.7M citations
90% related
Artificial neural network
207K papers, 4.5M citations
87% related
Fuzzy logic
151.2K papers, 2.3M citations
86% related
Feature extraction
111.8K papers, 2.1M citations
86% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  120
2022  285
2021  506
2020  660
2019  740
2018  683