
Showing papers in "International Journal of Data Mining & Knowledge Management Process in 2012"


Journal ArticleDOI
TL;DR: In this article, the authors explored the applications of data mining techniques which have been developed to support knowledge management process and classified the journal articles indexed in ScienceDirect Database from 2007 to 2012.
Abstract: Data mining is one of the most important steps of the knowledge discovery in databases process and is considered a significant subfield of knowledge management. Research in data mining continues to grow in business and in learning organizations over the coming decades. This review paper explores the applications of data mining techniques which have been developed to support the knowledge management process. The journal articles indexed in the ScienceDirect Database from 2007 to 2012 are analyzed and classified. The discussion on the findings is divided into 4 topics: (i) knowledge resource; (ii) knowledge types and/or knowledge datasets; (iii) data mining tasks; and (iv) data mining techniques and applications used in knowledge management. The article first briefly describes the definition of data mining and data mining functionality. Then the knowledge management rationale and major knowledge management tools integrated in the knowledge management cycle are described. Finally, the applications of data mining techniques in the process of knowledge management are summarized and discussed.

101 citations


Journal ArticleDOI
TL;DR: The paper essentially gives an overview of the current research that will provide a background on the topic for students and research scholars, and highlights the potential of incremental learning for decision making.
Abstract: While the areas of application in data mining are growing substantially, it has become extremely necessary for incremental learning methods to move a step ahead. The tremendous growth of unlabeled data has given incremental learning a big push forward. From BI applications to image classification, from analysis to prediction, every domain needs to learn and update. Incremental learning allows a system to explore new areas while simultaneously amassing knowledge. In this paper we discuss the areas and methods of incremental learning currently in use and highlight its potential for decision making. The paper essentially gives an overview of the current research that will provide a background on the topic for students and research scholars.

58 citations


Journal ArticleDOI
TL;DR: This paper investigates the use of various data mining techniques for knowledge discovery in insurance business by introducing different exhibits for discovering knowledge in the form of association rules, clustering, classification and correlation suitable for data characteristics.
Abstract: Knowledge discovery systems in financial organizations have been built and operated mainly to support decision making, using knowledge as a strategic factor. In this paper, we investigate the use of various data mining techniques for knowledge discovery in the insurance business. Existing software is inefficient at showing such data characteristics. We introduce different exhibits for discovering knowledge in the form of association rules, clustering, classification and correlation suitable for the data characteristics. Using the proposed data mining techniques, the decision-maker can plan the expansion of insurance activities to empower the different forces in the existing life insurance sector.

34 citations



Journal ArticleDOI
TL;DR: This research paper identifies how data mining techniques can be applied in the field of digital forensics to enable forensic investigators to reach the first step in effective prosecution, namely charge-sheeting of digital crime cases.
Abstract: Data mining is part of the interdisciplinary field of knowledge discovery in databases. Research on data mining began in the 1980s and grew rapidly in the 1990s. Specific techniques that have been developed within disciplines such as artificial intelligence, machine learning and pattern recognition have been successfully employed in data mining. Data mining has been successfully introduced in many different fields. An important application area for data mining techniques is the World Wide Web. Recently, data mining techniques have also been applied to the field of criminal forensics, that is, digital forensics. Examples include detecting deceptive criminal identities, identifying groups of criminals who are engaging in various illegal activities, and many more. Data mining techniques typically aim to produce insight from large volumes of data. Digital forensics is a sophisticated and cutting-edge area of breakthrough research. The canvass of digital forensic investigation and application is growing at a rapid rate with the mammoth digitization of the information economy. Law enforcement and military organizations rely heavily on digital forensics today. As the information age revolutionizes at an inconceivable speed and information is increasingly stored in digital form, the need for accurate interception, timely retrieval, and nearly zero-fault processing of digital data is the crux of the issue. This research paper focuses on the role of data mining techniques in digital forensics. It also identifies how data mining techniques can be applied in the field of digital forensics to enable forensic investigators to reach the first step in effective prosecution, namely charge-sheeting of digital crime cases.

20 citations



Journal ArticleDOI
TL;DR: A new content extraction method is proposed, which can discover web page content according to the number of punctuation marks and the ratio of the number of non-hyperlink characters to the number of characters contained in hyperlinks, and can eliminate noise and extract main content blocks from web pages effectively.
Abstract: Extracting the main content from a web page is a preprocessing step for web information systems. The wrapper-based content extraction approach is limited to one specific information source and depends greatly on web page structure, so it is seldom employed in practice. A new content extraction method is thus proposed in this paper, which discovers web page content according to the number of punctuation marks and the ratio of the number of non-hyperlink characters to the number of characters contained in hyperlinks. It can eliminate noise and extract the main content blocks from web pages effectively. Experimental results show that this approach is accurate and suitable for most web sites.

15 citations
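The punctuation-count and link-density heuristic summarized above can be sketched roughly as follows; the function names, thresholds and block representation are illustrative assumptions, not taken from the paper:

```python
def block_score(text, link_chars, punct_threshold=3, link_ratio_threshold=0.5):
    """Score a candidate block: main-content blocks tend to contain many
    punctuation marks and a low share of characters inside hyperlinks."""
    puncts = sum(text.count(p) for p in ",.;:!?")
    total = max(len(text), 1)
    link_ratio = link_chars / total
    return puncts >= punct_threshold and link_ratio < link_ratio_threshold

def extract_main_blocks(blocks):
    """blocks: list of (text, number_of_characters_inside_hyperlinks)."""
    return [text for text, link_chars in blocks if block_score(text, link_chars)]
```

A real implementation would first segment the HTML into candidate blocks and count the characters that sit inside anchor tags; here each block is assumed to be pre-reduced to a `(text, link_chars)` pair.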


Journal ArticleDOI
TL;DR: In this article, the authors compared the performance of three decision-tree based algorithms, one artificial neural network, one statistical algorithm, one support vector machine with and without AdaBoost, and one clustering algorithm on four datasets from different domains.
Abstract: In today’s business scenario, we perceive major changes in how managers use computerized support in making decisions. As more decision-makers use computerized support in decision making, decision support systems (DSS) are developing from their beginnings as a personal support tool into a common resource in an organization. DSS serve the management, operations, and planning levels of an organization and help to make decisions which may be rapidly changing and not easily specified in advance. Data mining has a vital role in extracting the important information that informs the decisions of a decision support system. It has been an active field of research over the last two to three decades. Integration of data mining and decision support systems (DSS) can lead to improved performance and can enable the tackling of new types of problems. Artificial intelligence methods are improving the quality of decision support and have become embedded in many applications, ranging from anti-lock automobile brakes to today's interactive search engines. They provide various machine learning techniques to support data mining. Classification is one of the main and most valuable tasks of data mining. Several types of classification algorithms have been suggested, tested and compared to determine future trends based on unseen data. No single algorithm has been found to be superior over all others for all data sets. Various issues such as predictive accuracy, training time to build the model, robustness and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method. The objective of this paper is to compare various classification algorithms that have been frequently used in data mining for decision support systems.
Three decision-tree based algorithms, one artificial neural network, one statistical algorithm, one support vector machine (with and without AdaBoost) and one clustering algorithm are tested and compared on four datasets from different domains in terms of predictive accuracy, error rate, classification index, comprehensibility and training time. Experimental results demonstrate that the Genetic Algorithm (GA) and support vector machine based algorithms are better in terms of predictive accuracy. The former shows the highest comprehensibility but is slower than the latter. Among the decision-tree based algorithms, QUEST produces trees with smaller breadth and depth, showing more comprehensibility. This research work shows that the GA based algorithm is the more powerful algorithm and should be the first choice of organizations for their decision support systems. SVM without AdaBoost should be the first choice where speed matters alongside predictive accuracy. AdaBoost improves the accuracy of SVM but at the cost of a large training time.

14 citations


Journal ArticleDOI
TL;DR: A technical review on various fuzzy similarity-based models is given in this article, where a tour of different methodologies is provided which is based upon fuzzy similarity related concerns, and the technical comparisons among each model's parameters are shown in the form of a 3-D chart.
Abstract: In the current era of technology, efficient and effective text document classification is becoming a challenging and highly demanded capability for categorizing text documents into mutually exclusive categories. Fuzzy similarity provides a way to find the similarity of features among various documents. In this paper, a technical review of various fuzzy similarity based models is given. These models are discussed and compared to frame out their use and necessity. A tour of different methodologies based on fuzzy similarity related concerns is provided, showing how text and web documents are categorized efficiently into different categories. Various experimental results of these models are also discussed. The technical comparisons among each model's parameters are shown in the form of a 3-D chart. Such a study and technical review provide a strong base for research work on fuzzy similarity based text document categorization.

10 citations


Journal ArticleDOI
TL;DR: A new method of classifying text documents that preserves the sequence of term occurrence in a document with the help of a novel data structure called the ‘Status Matrix’ and proposes to index the terms in a B-tree, an efficient index scheme.
Abstract: In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is effectively preserved with the help of a novel data structure called the ‘Status Matrix’, and a corresponding classification technique is proposed for efficient classification of text documents. In addition, in order to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents which contain the term. To corroborate the efficacy of the proposed representation and the status matrix based classification, we have conducted extensive experiments on various datasets.

9 citations
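The term-to-class-label index described above can be illustrated with a small sketch; a plain dict stands in for the paper's B-tree, and the function names and tokenization below are invented for the example:

```python
from collections import defaultdict

def build_term_index(labelled_docs):
    """Map each term to the set of class labels of documents containing it
    (a dict stands in here for the paper's B-tree index)."""
    index = defaultdict(set)
    for label, text in labelled_docs:
        for term in set(text.lower().split()):
            index[term].add(label)
    return index

def candidate_classes(index, query_text):
    """Union of the class labels associated with the query's terms."""
    classes = set()
    for term in set(query_text.lower().split()):
        classes |= index.get(term, set())
    return classes
```

Looking up a query's terms then yields candidate classes directly, without sequentially matching against every document, which is the motivation for the index.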


Journal ArticleDOI
TL;DR: In this article, a multi-objective genetic algorithm based approach for data quality on categorical attributes is proposed, and the results show that the approach performs well on objectives like accuracy, completeness, comprehensibility and interestingness.
Abstract: Data quality on categorical attributes is a difficult problem that has not received as much attention as its numerical counterpart. Our basic idea is to employ association rules for the purpose of data quality measurement. Strong rule generation is an important area of data mining, and association rule mining problems can be considered as multi-objective problems rather than single-objective ones. The main area of concentration is the rules generated by association rule mining using a genetic algorithm. The advantage of using genetic algorithms to discover high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule-induction algorithms often used in data mining. The genetic algorithm based approach utilizes the linkage between association rules and feature selection. In this paper, we put forward a multi-objective genetic algorithm approach for data quality on categorical attributes. The results show that our approach performs well on objectives like accuracy, completeness, comprehensibility and interestingness.

Journal ArticleDOI
TL;DR: A new multi label text classification model for assigning more relevant set of categories to every input text document is proposed and the use of Semi Supervised Learning in MLTC greatly improves the decision making capability of classifier.
Abstract: Automatic text categorization (ATC) is a prominent research area within information retrieval. In this paper a classification model for ATC in the multi-label domain is discussed. We propose a new multi-label text classification model for assigning a more relevant set of categories to every input text document. Our model is greatly influenced by graph based frameworks and semi-supervised learning. We demonstrate the effectiveness of our model using the Enron, Slashdot, Bibtex and RCV1 datasets. Our experimental results indicate that the use of semi-supervised learning in MLTC greatly improves the decision making capability of the classifier.

Journal ArticleDOI
TL;DR: The objective of this paper focuses on the formulation of association rules from which decisions can be made for future endeavours, using the Apriori algorithm, one of the classical algorithms for deriving association rules.
Abstract: Data mining involves the use of advanced data analysis tools to find new, valid patterns and to project relationships among the patterns which were not known before. In data mining, association rule learning is a popular and familiar method for discovering new relations between variables in large databases. One of the emerging research areas under data mining is social networks. This paper focuses on the formulation of association rules from which decisions can be made for future endeavours. This research applies the Apriori algorithm, one of the classical algorithms for deriving association rules. The algorithm is applied to the Facebook 100 university dataset, which originated from Adam D’Angelo of Facebook. It contains self-defined characteristics of a person, including variables like residence, year, major, second major, gender, and school. To begin with, this research uses only ten universities; it highlights the formation of association rules between the attributes or variables, explores the association rules between a course and gender, and discovers the influence of gender on studying a course. This paper also attempts to cover the main algorithms used for clustering, with a brief and simple description of each. Previous research with this dataset applied only regression models, and this is the first time association rules have been applied.
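As a rough illustration of the Apriori procedure the paper applies, here is a minimal sketch; the toy transactions and parameter names are invented, and real attribute values from the Facebook 100 dataset (gender, major, and so on) would take the place of the example items:

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) is at least min_support, using the classic
    level-wise generate-and-prune strategy."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    frequent = {}
    current = {s for s in items if support(s) >= min_support}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        # join frequent k-itemsets into (k+1)-item candidates, then prune
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent
```

Association rules such as {gender_f} -> {course_cs} would then be derived from the frequent itemsets by comparing their supports (confidence = support(antecedent plus consequent) / support(antecedent)).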

Journal ArticleDOI
TL;DR: The proposed method extends the MST based clustering algorithm with the VAT procedure, called VAT-Based-MST-Clustering, to find prior tendency in MST based clustering by Visual Assessment of Tendency (VAT) and to find clustering results in a single step instead of several trials.
Abstract: Clustering has been widely used in data analysis. Dissimilarity assesses the distance between objects, and this is important in Minimum Spanning Tree (MST) based clustering. In MST based clustering, an inconsistent edge is identified and removed without knowledge of prior tendency, which exposes the resulting clusters in the form of sub-trees. Clustering validity is checked at every iteration of MST clustering by Dunn’s Index; a higher Dunn’s Index indicates better clustering. The existing system takes more run time when there are several iterations, whereas the proposed system takes a single step with much less run time. The key contribution of the paper is to find prior tendency in MST based clustering by Visual Assessment of Tendency (VAT) and to find clustering results in a single step instead of several trials. The proposed method extends the MST based clustering algorithm with the VAT procedure and is called VAT-Based-MST-Clustering. Results are tested on synthetic and real data sets, and show that the clustering results are improved by the proposed method with respect to runtime.
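A minimal sketch of plain MST based clustering follows, where the "inconsistent" edges are simply taken to be the heaviest MST edges; the paper's VAT-guided tendency step is not reproduced here, and the names and edge-removal rule are illustrative:

```python
def mst_clusters(points, k):
    """Cluster by building a minimum spanning tree (Kruskal's algorithm)
    and deleting the k-1 heaviest edges; the surviving connected
    components are the clusters."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # drop the k-1 heaviest MST edges, then label connected components
    keep = sorted(mst)[: max(len(mst) - (k - 1), 0)]
    parent = list(range(n))
    for w, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Validity indices such as Dunn's Index would then be computed on the resulting component labels to compare candidate values of k.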

Journal ArticleDOI
TL;DR: A technique for web people search is developed which clusters the web pages based on semantic information and maps them using ontology based decision tree making the user to access the information in more easy way.
Abstract: Nowadays, searching for people on the web is one of the most common activities performed by users. When we issue a query for a person search, it returns a set of web pages related to distinct persons with the given name. For this type of search, the job of finding the web page of interest is left to the user. In this paper, we develop a technique for web people search which clusters the web pages based on semantic information and maps them using an ontology based decision tree, allowing the user to access the information more easily. This technique uses the concept of ontology, thus reducing the number of inconsistencies. The results show that the ontology based decision tree and clustering help to increase the efficiency of the overall search.


Journal ArticleDOI
TL;DR: It was ascertained that the FPSO method outperforms the FCM and PSO methods in computational time and achieves higher solution quality in terms of the objective function value (OFV).
Abstract: This paper presents an efficient hybrid method, combining fuzzy particle swarm optimization (FPSO) and fuzzy c-means (FCM) algorithms, to solve the fuzzy clustering problem, especially for large problem sizes. When the problem becomes large, the FCM algorithm may result in an uneven distribution of data, making it difficult to find an optimal solution in a reasonable amount of time. The PSO algorithm does find a good or near-optimal solution in reasonable time, but its performance is improved by seeding the initial swarm with the result of the c-means algorithm. The fuzzy c-means, PSO and FPSO algorithms are compared using the performance factors of objective function value (OFV) and CPU execution time. It was ascertained that the FPSO method outperforms the FCM and PSO methods in computational time and achieves higher solution quality in terms of the objective function value (OFV).
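The standard fuzzy c-means update equations that both FCM and the hybrid FPSO build on can be sketched as follows; the PSO seeding itself is omitted, and the function names are illustrative:

```python
def fcm_memberships(points, centers, m=2.0):
    """Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)),
    where d_ij is the distance from point i to center j."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5 or 1e-12
    U = []
    for p in points:
        d = [dist(p, c) for c in centers]
        U.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1))
                            for k in range(len(centers)))
                  for j in range(len(centers))])
    return U

def fcm_centers(points, U, m=2.0):
    """Center update: v_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    dims = len(points[0])
    centers = []
    for j in range(len(U[0])):
        w = [U[i][j] ** m for i in range(len(points))]
        s = sum(w)
        centers.append(tuple(sum(w[i] * points[i][d] for i in range(len(points))) / s
                             for d in range(dims)))
    return centers
```

Alternating these two updates until the memberships stabilize is plain FCM; in the hybrid scheme, an FCM result seeds the initial PSO swarm instead of random positions.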

Journal ArticleDOI
TL;DR: The proposed system will support top level management to make a good decision in any time under any uncertain environment.
Abstract: A decision support system (DSS) is often treated as synonymous with a management information system (MIS). Much of the imported data is used in solutions like data mining (DM). Decision support systems also include decisions made upon individual data from external sources, management intuition, and various other data sources not included in business intelligence. Successfully supporting managerial decision-making is critically dependent upon the availability of integrated, high-quality information organized and presented in a timely and easily understood manner. Data mining has emerged to meet this need. Data mining systems serve as an integrated repository for internal and external data intelligence critical to understanding and evaluating the business within its environmental context. With the addition of models, analytic tools, and user interfaces, they have the potential to provide actionable information that supports effective problem and opportunity identification, critical decision-making, and strategy formulation, implementation, and evaluation. The proposed system will support top-level management in making good decisions at any time under any uncertain environment.

Journal ArticleDOI
TL;DR: The proposed algorithm RAQ-FIG (Resource Adaptive Quality Assuring Frequent Item Generation) accounts for the computational resources like memory available and dynamically adapts the rate of processing based on the available memory.
Abstract: The increasing importance of data streams arising in a wide range of advanced applications has led to the extensive study of mining frequent patterns. Mining data streams poses many new challenges, amongst which are the one-scan nature, the unbounded memory requirement and the high arrival rate of data streams. Further, the usage of memory resources should be managed regardless of the amount of data generated in the stream. In this work we extend the ideas of existing proposals to ensure efficient resource utilization and quality control. The proposed algorithm RAQ-FIG (Resource Adaptive Quality Assuring Frequent Item Generation) accounts for computational resources like available memory and dynamically adapts the rate of processing based on the memory available. It computes the recent approximate frequent itemsets using a single-pass algorithm. The empirical results demonstrate the efficacy of the proposed approach for finding recent frequent itemsets from a data stream.
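RAQ-FIG itself is not specified in the abstract, but the general idea of bounded-memory, one-pass frequent-item counting can be illustrated with a simplified space-saving style sketch; the capacity parameter and names are invented, and this is a stand-in, not the paper's algorithm:

```python
def stream_frequent_items(stream, capacity):
    """One-pass, bounded-memory approximate counting in the space-saving
    style: at most `capacity` counters are kept; when the table is full,
    the minimum-count entry is evicted and the newcomer inherits its
    count (an overestimate, never an underestimate)."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)
            base = counts.pop(victim)
            counts[item] = base + 1  # inherit the evicted count
    return counts
```

A memory-adaptive variant in the spirit of the paper could grow or shrink `capacity` at run time according to the memory actually available.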


Journal ArticleDOI
TL;DR: A cost model for the recommendation of candidate webviews is presented to reduce the complexity and the execution cost of the online selection of materialized webviews in data–intensive websites (DIWS).
Abstract: In this paper we present a cost model for the recommendation of candidate webviews. Our idea is to intervene at regular periods of time in order to filter the candidate webviews which will be used by an algorithm for the selection of materialized webviews in data-intensive websites (DIWS). The aim is to reduce the complexity and the execution cost of the online selection of materialized webviews. A webview is a static instance of a dynamic web page. The materialization of webviews consists of storing the results of some requests on the server in order to avoid repetitive data generation from the sources. Our experimental results show that our solution is very efficient at identifying the most profitable webviews and at improving query response time.

Journal ArticleDOI
TL;DR: In this paper, an interval-based temporal pattern is defined as a pattern that occurs across a time-interval, then disappears for some time, again recurs across another time-interval, and so on.
Abstract: We present a novel technique to identify calendar-based (annual, monthly and daily) periodicities of an interval-based temporal pattern. An interval-based temporal pattern is a pattern that occurs across a time-interval, then disappears for some time, again recurs across another time-interval and so on and so forth. Given the sequence of time-intervals in which an interval-based temporal pattern has occurred, we propose a method for identifying the extent to which the pattern is periodic with respect to a calendar cycle. In comparison to previous work, our method is asymptotically faster. We also show an interesting relationship between periodicities across different levels of any hierarchical timestamp (year/month/day, hour/minute/second etc.).

Journal ArticleDOI
TL;DR: A multi-chunk ensemble of classifiers is used to classify evolving data streams and improve the prediction accuracy over single classifiers, and the method shows better efficiency than single-chunk approaches.
Abstract: Data mining is a user-centric process that is used to extract useful patterns from large volumes of data. With the growth of the Internet, data streams are today a key area of advanced analysis and data mining. Handling data streams is a difficult task due to the variations in the data and the frequent occurrence of concept drift. No single classifier can be relied upon to correctly classify data stream data, since each is developed through a specific learning approach. Hence we use a multi-chunk ensemble of classifiers to classify evolving data streams and improve the prediction accuracy over single classifiers. We evaluate our ensemble on synthetic as well as real data, compute the precision and represent it graphically using both majority voting and the newly proposed weighted averaging, and compare its performance against individual classifiers. Current techniques include a single-chunk approach, where the entire set of data is considered as a whole; the method used here shows better efficiency.
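The weighted combination of ensemble members mentioned above can be sketched in a few lines; the weighting scheme (for example, each classifier's accuracy on the latest chunk) is an assumption made for the example:

```python
def weighted_vote(predictions, weights):
    """Combine class predictions from ensemble members, giving each
    classifier's vote a weight (e.g. its accuracy on recent data),
    and return the class with the highest total weight."""
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)
```

Plain majority voting is the special case where every weight is 1; re-estimating the weights on each new chunk is one simple way an ensemble can track concept drift.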

Journal ArticleDOI
TL;DR: This paper introduces pruning strategies by which a linear-time algorithm for detecting cliques is obtained; the algorithm has wide application in bioinformatics and graph mining.
Abstract: In graph mining, determining a clique is an NP-complete problem. This paper introduces pruning strategies by which a linear-time algorithm for detecting a clique is obtained. Clique determination is widely applicable in social network analysis, where a clique signifies that each person in the network knows every other person in the group. Here pruning is done using edge connectivity and degree constraints. Initially the graph (G) is checked for a bridge; if one is detected, the graph can be disconnected by removing it. Then minimum and maximum degree criteria are used to determine a clique. The algorithm also has wide application in bioinformatics.
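The degree-based pruning idea can be illustrated with a short sketch for the clique-checking case; this assumes a simple undirected graph with no duplicate edges and is not the paper's full algorithm:

```python
def is_clique(n, edges):
    """Degree-based check: a simple undirected graph on n vertices is a
    clique iff it has exactly n*(n-1)/2 edges and every vertex has
    degree n-1. Runs in time linear in the number of edges."""
    if len(edges) != n * (n - 1) // 2:
        return False  # the edge count alone prunes most graphs
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return all(d == n - 1 for d in degree)
```

The edge-count test acts as the cheap pruning step; the degree scan only runs on graphs that survive it. A bridge check would prune further, since any graph with a bridge on more than two vertices cannot be a clique.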


Journal ArticleDOI
TL;DR: This research proposes a new rough decision model that allows making decisions based on modularity mechanism and provides a flexible and a quick way for extracting decision rules of large size information tables using rough decision models.
Abstract: Decision models which adopt rough set theory have been used effectively in many real-world applications. However, rough decision models suffer from high computational complexity when dealing with datasets of huge size. In this research we propose a new rough decision model that allows making decisions based on a modularity mechanism. According to the proposed approach, large-size datasets can be divided into arbitrary moderate-size datasets, and then a group of rough decision models can be built as separate decision modules. The overall model decision is computed as the consensus decision of all decision modules through an aggregation technique. This approach provides a flexible and quick way of extracting decision rules from large-size information tables using rough decision models.

Journal ArticleDOI
TL;DR: A Cultural Algorithm Toolkit for Classification Rule Mining (CAT-CRM) is proposed which allows the user to control three different sets of parameters, and hence can be used for experimenting with an evolutionary system, a rule mining system or an agent-based social system.
Abstract: Cultural algorithms (CA) are inspired by the cultural evolutionary process in nature and use social intelligence to solve problems. Cultural algorithms are composed of a belief space which uses different knowledge sources, a population space, and a protocol that enables the exchange of knowledge between these sources. Knowledge created in the population space is accepted into the belief space, while the collective knowledge from these sources is combined to influence the decisions of the individual agents in solving problems. Classification rules come under descriptive knowledge discovery in data mining and are the most sought after by users, since they represent a highly comprehensible form of knowledge. The rules have certain properties which make them useful forms of actionable knowledge to users. The rules are evaluated using these properties, represented as objective and subjective measures. Objective measures are problem oriented, while subjective measures are more user oriented. Evolutionary systems allow the user to incorporate different rule metrics into the solution of a multi-objective rule mining problem. However, the algorithms found in the literature allow only certain attributes of the system to be controlled by the user. A research gap exists in providing a completely user-controlled system to experiment with evolutionary multi-objective classification rule mining. In the current study a Cultural Algorithm Toolkit for Classification Rule Mining (CAT-CRM) is proposed which allows the user to control three different sets of parameters. CAT-CRM allows the user to control the evolutionary parameters, the rule parameters and the agent parameters, and hence can be used for experimenting with an evolutionary system, a rule mining system or an agent-based social system. Results of experiments conducted to observe the effect of different crossover rates and mutation rates on classification accuracy on a benchmark data set are reported.


Journal ArticleDOI
TL;DR: An index based algorithm for ensuring the consistency of read-write mobile clients is proposed; under it, mobile clients can receive consistent and current data and are allowed to update broadcast data.
Abstract: Most emerging mobile database applications emphasize data dissemination, the delivery of data from a server to a large number of mobile clients. While the data on the server is being broadcast to mobile clients, update transactions from external sources may be executed concurrently. If the execution of update and broadcast transactions overlaps, mobile clients may observe inconsistent data. This paper proposes an index based algorithm for ensuring the consistency of read-write mobile clients. Under this algorithm mobile clients can receive consistent and current data, and it allows the mobile clients to update broadcast data.