
Showing papers in "Data mining and knowledge engineering in 2010"


Journal Article
TL;DR: Data mining is the search for new, valuable, and nontrivial information in large volumes of data, a cooperative effort of humans and computers. Its activities fall into one of two categories: predictive data mining, which produces a model of the system described by the given data set, and descriptive data mining, which produces new, nontrivial information based on the available data set.
Abstract: Understand the need for analyses of large, complex, information-rich data sets. Identify the goals and primary tasks of the data-mining process. Describe the roots of data-mining technology. Recognize the iterative character of a data-mining process and specify its basic steps. Explain the influence of data quality on a data-mining process. Establish the relation between data warehousing and data mining. Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers. In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories: Predictive data mining, which produces the model of the system described by the given data set, or Descriptive data mining, which produces new, nontrivial information based on the available data set.

4,646 citations


Journal Article
TL;DR: A model to evaluate collaborative inference based on the query sequences of collaborators and their task-sensitive collaboration levels and a technique to prevent multiple collaborative users from deriving sensitive information via inference is developed.
Abstract: Malicious users can exploit the correlation among data to infer sensitive information from a series of seemingly innocuous data accesses. Thus, we develop an inference violation detection system to protect sensitive data content. Based on data dependency, database schema, and semantic knowledge, we constructed a semantic inference model (SIM) that represents the possible inference channels from any attribute to the pre-assigned sensitive attributes. The SIM is then instantiated to a semantic inference graph (SIG) for query-time inference violation detection. For the single-user case, when a user poses a query, the detection system examines his/her past query log and calculates the probability of inferring sensitive information. The query request is denied if the inference probability exceeds the pre-specified threshold. For multi-user cases, users may share their query answers to increase the inference probability. Therefore, we develop a model to evaluate collaborative inference based on the query sequences of collaborators and their task-sensitive collaboration levels. Experimental studies reveal that information authoritativeness, communication fidelity, and honesty in collaboration are three key factors that affect the level of achievable collaboration. An example is given to illustrate the use of the proposed technique to prevent multiple collaborative users from deriving sensitive information via inference.
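As a toy illustration of the query-time check described above, the sketch below tracks the attributes a user has queried and denies a query once a hypothetical inference probability exceeds the threshold. The channel probabilities, attribute names, threshold, and independence assumption are all invented for illustration; the paper's SIM/SIG machinery is far richer.

```python
# Hypothetical inference channels: P(infer sensitive | attribute seen).
CHANNEL_PROB = {"zipcode": 0.2, "birthdate": 0.3, "gender": 0.3}
THRESHOLD = 0.5  # pre-specified inference threshold (invented)

def inference_probability(attrs):
    """Combine per-attribute channels assuming independence:
    P(infer) = 1 - prod(1 - p_i) over distinct attributes seen."""
    p_not = 1.0
    for attr in set(attrs):
        p_not *= 1.0 - CHANNEL_PROB.get(attr, 0.0)
    return 1.0 - p_not

def check_query(query_log, new_attr):
    """Deny the query if answering it would push the cumulative
    inference probability over the threshold."""
    prob = inference_probability(query_log + [new_attr])
    return ("deny" if prob > THRESHOLD else "allow"), prob

decision1, p1 = check_query(["zipcode"], "birthdate")            # 1 - 0.8*0.7 = 0.44
decision2, p2 = check_query(["zipcode", "birthdate"], "gender")  # 1 - 0.8*0.7*0.7 = 0.608
```

With the invented probabilities, the second query tips the cumulative probability past the threshold and is denied even though each individual access looks innocuous.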

15 citations


Journal Article
TL;DR: The existing approach makes use of the Apriori algorithm to generate association rules; the proposed Fuzzy HARM algorithm for association rule mining in web recommendation systems yields better quality and performance.
Abstract: Web-based product and recommendation systems have become an ever more popular on-line business practice, with increasing emphasis on modeling customer needs and providing targeted or personalized service solutions in real-time interaction. A recommender system is a specific type of information filtering technique that attempts to recommend information items, such as images and web pages, that are likely to be of interest to the user. Normally, a recommender system compares a user profile to some reference characteristics and seeks to predict the 'rating' and retrieve the query elements. These systems can be classified into two groups: content-based recommendation and collaborative recommendation. Content-based recommendation tries to recommend web sites similar to those the user has liked, whereas collaborative recommendation tries to find users who share similar tastes with the given user and recommends web sites they like to that user. In adaptive association-rule-based web mining, association rules derived from web usage data were applied to personalization. That technique makes use of the Apriori algorithm to generate association rules, but it has some disadvantages. An effective Fuzzy Association Rule Mining (FARM) algorithm is proposed by the author to overcome those disadvantages. The proposed Fuzzy HARM algorithm for association rule mining in web recommendation systems results in better quality and performance.
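For concreteness, here is a minimal sketch of the Apriori level-wise search the abstract refers to, on invented toy transactions with an invented absolute support threshold:

```python
from itertools import combinations

# Toy web-usage transactions and support threshold, invented for illustration.
transactions = [
    {"page_a", "page_b", "page_c"},
    {"page_a", "page_b"},
    {"page_a", "page_c"},
    {"page_b", "page_c"},
]
MIN_SUPPORT = 2  # absolute support count

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(items):
    """Level-wise search: a (k+1)-itemset is a candidate only if all of
    its k-subsets are frequent (the Apriori property)."""
    frequent = {frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT}
    all_frequent, k = set(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates
                    if all(frozenset(s) in all_frequent for s in combinations(c, k - 1))
                    and support(c) >= MIN_SUPPORT}
        all_frequent |= frequent
        k += 1
    return all_frequent

items = {i for t in transactions for i in t}
freq = apriori(items)  # all frequent itemsets at every level
```

On these four transactions, every single page and every pair is frequent, but the triple {page_a, page_b, page_c} occurs only once and is pruned.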

9 citations


Journal Article
TL;DR: This paper presents a survey on various clustering algorithms that are proposed earlier in literature, and provides an insight into the advantages and limitations of some of those earlier proposed clustering techniques.
Abstract: Clustering is a technique adopted in many real-world applications. Generally, clustering can be thought of as partitioning the data into groups or subsets that contain analogous objects. Many clustering techniques, such as the K-Means algorithm, the Fuzzy C-Means algorithm (FCM), and spectral clustering, have been proposed in the literature. Recently, clustering algorithms have been extensively applied to mixed data types to evaluate the performance of the clustering techniques. This paper presents a survey of various clustering algorithms proposed earlier in the literature. Moreover, it provides an insight into the advantages and limitations of some of those clustering techniques. A comparison of various clustering techniques is provided in this paper. The future enhancement section of this paper provides a general idea for improving the existing clustering algorithms to achieve better clustering accuracy.
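A minimal one-dimensional K-Means sketch makes the partitioning idea concrete; the points and initial centers below are invented for illustration.

```python
# Minimal K-Means (1-D, two clusters): alternate assignment and update
# steps until the centers stabilise.
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

After a couple of iterations the two centers settle at the means of the two obvious groups.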

6 citations


Journal Article
TL;DR: This paper concentrates on intrusion detection and authentication with data fusion in MANETs, and simulation results are presented to show the effectiveness of the proposed scheme.
Abstract: In a Mobile Ad Hoc Network (MANET), multimodal biometric technology plays a vital role in providing security for user-to-device authentication. This paper concentrates on intrusion detection and authentication with data fusion in MANETs. To overcome the weaknesses of unimodal biometric systems, multimodal biometrics is set to work with Intrusion Detection Systems (IDSs). Since each device has dimensional and estimation limitations, more than one device may need to be chosen, and Dempster-Shafer theory is used for data fusion to increase observation precision. Based on the security posture, the system decides which biosensor (IDS) to choose and whether user authentication (or IDS input) is essential. Decisions are made in a fully distributed manner by every authentication device and Intrusion Detection System (IDS). Simulation results are presented to show the effectiveness of the proposed scheme. Index Terms—Authentication, biometrics, intrusion detection, mobile ad hoc networks (MANETs), security.
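Dempster's rule of combination, which underlies the data fusion step above, can be sketched as follows; the frame of discernment and mass values are invented stand-ins for real IDS/biosensor evidence.

```python
# Dempster's rule: m(A) is proportional to the sum over B intersect C = A
# of m1(B)*m2(C), renormalised by 1-K where K is the conflicting mass.
def combine(m1, m2):
    combined, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass assigned to disjoint hypotheses
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

NORMAL, INTRUSION = frozenset({"normal"}), frozenset({"intrusion"})
THETA = NORMAL | INTRUSION  # the full frame of discernment (ignorance)
m_ids = {INTRUSION: 0.6, THETA: 0.4}                 # invented IDS evidence
m_bio = {INTRUSION: 0.5, NORMAL: 0.2, THETA: 0.3}    # invented biometric evidence
fused = combine(m_ids, m_bio)
```

Fusing the two sources concentrates more belief on "intrusion" than either source alone, which is the precision gain the abstract alludes to.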

6 citations


Journal Article
TL;DR: A general survey of various clustering algorithms is presented, and the efficiency of the Self-Organizing Map (SOM) algorithm in enhancing mixed data clustering is described.
Abstract: Clustering is a widely used technique to find interesting patterns dwelling in a dataset that would otherwise remain unknown. In general, clustering is a method of dividing the data into groups of similar objects. One significant research area in data mining is developing methods that update knowledge by using existing knowledge, since this can generally improve mining efficiency, especially for very bulky databases. Data mining uncovers hidden, previously unknown, and potentially useful information from large amounts of data. This paper presents a general survey of various clustering algorithms. In addition, the paper describes the efficiency of the Self-Organizing Map (SOM) algorithm in enhancing mixed data clustering.
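A minimal sketch of the SOM idea mentioned above: for each input, find the best-matching unit and pull it and its neighbours toward the input with a decaying learning rate. The 1-D data, map size, neighbourhood cutoff, and schedule are all invented for illustration.

```python
# Minimal 1-D Self-Organizing Map training loop.
def train_som(data, weights, lr=0.5, radius=1, epochs=20):
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the node whose weight is closest to x.
            bmu = min(range(len(weights)), key=lambda i: abs(x - weights[i]))
            for i in range(len(weights)):
                if abs(i - bmu) <= radius:  # hard-cutoff neighbourhood function
                    weights[i] += lr * (x - weights[i])
        lr *= 0.9  # decay the learning rate each epoch
    return weights

# Two clumps of inputs near 0 and near 1; three map nodes.
weights = train_som([0.0, 0.1, 0.9, 1.0], [0.3, 0.5, 0.7])
```

After training, the outer nodes have migrated toward the two input clumps, which is the topology-preserving behaviour SOMs are used for.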

6 citations


Journal Article
TL;DR: The mining of single-dimensional association rules and non-repetitive-predicate multi-dimensional association rules is combined over the transactions of a multidimensional transaction database, and each predicate is partitioned at the fuzzy-set level.
Abstract: Mining association rules in transactional or relational databases is an important task in data mining. Fuzzy predicates have been incorporated into association rule mining to extend the types of data relationships that can be represented, to allow interpretation of rules in linguistic terms, and to avoid fixed boundaries when partitioning data attributes. In this paper, the mining of single-dimensional association rules and non-repetitive-predicate multi-dimensional association rules is combined over the transactions of a multidimensional transaction database. The algorithm mines conditional hybrid-dimension association rules that satisfy a definite condition on the basis of the multi-dimensional transaction database. In this algorithm, each predicate is partitioned at the fuzzy-set level, and the support count of an itemset is calculated by performing a fuzzy AND operation on the items that constitute the itemset. The Apriori property is used in the algorithm to prune the itemsets. The implementation of the algorithm is illustrated with the help of a simple example.
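The fuzzy AND support counting described above can be sketched as follows; the linguistic items and membership degrees are invented for illustration.

```python
# Fuzzy support: membership degrees replace 0/1 occurrence, and the
# support of an itemset is the sum over transactions of the fuzzy AND
# (minimum) of the member degrees.
transactions = [
    {"income.high": 0.8, "spend.high": 0.6},
    {"income.high": 0.4, "spend.high": 0.9},
    {"income.high": 1.0, "spend.high": 0.2},
]

def fuzzy_support(itemset):
    """Sum over transactions of the minimum membership across the itemset."""
    return sum(min(t.get(item, 0.0) for item in itemset) for t in transactions)

s = fuzzy_support(["income.high", "spend.high"])  # per-row minima: 0.6, 0.4, 0.2
```

Note how the fuzzy support of the pair (1.2) is strictly less than that of either item alone, mirroring the monotonicity that lets the Apriori property prune itemsets.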

3 citations


Journal Article
TL;DR: Recognition of hand printing, handwriting, and printed text remains the main focus of research, because 100% recognition is still not possible even when the available scanned image is accurate.
Abstract: Translating scanned images into machine-readable text supports a paperless environment, which leads to the concept of optical character recognition (OCR). Demand is increasing in many emerging applications such as postal systems, banks, institutions, word processing, and library systems, where all processing is automated. OCR is a field of research in artificial intelligence, a branch of computer science that aims to create intelligence in machines. Recognition of hand printing, handwriting, and printed text is the main focus of research, because 100% recognition is still not possible even when the available scanned image is accurate. Text can be in different scripts, numerals, or images. This paper discusses an introduction to the Tamil script, previous studies on Tamil script recognition, methods for recognizing Tamil characters, and issues related to recognition.

3 citations


Journal Article
TL;DR: The paper describes the general working behavior of these approaches, the methodologies they follow, and the parameters that affect the performance of classical fuzzy clustering algorithms.
Abstract: Fuzzy clustering algorithms are helpful when a dataset contains subgroups of points with indistinct boundaries and overlap between the clusters. This paper gives an overview of different classical fuzzy clustering algorithms. Fuzzy clustering algorithms can be categorized as classical fuzzy clustering and shape-based clustering. The paper describes the general working behavior of these approaches, the methodologies they follow, and the parameters that affect the performance of classical fuzzy clustering algorithms.

3 citations


Journal Article
TL;DR: The design objective is modeling, simulation and optimal tuning of Thyristor Controlled Series Compensator (TCSC) controller for improvement of stability of Single Machine Infinite Bus (SMIB) power system.
Abstract: The design objective is the modeling, simulation, and optimal tuning of a Thyristor Controlled Series Compensator (TCSC) controller for improving the stability of a Single Machine Infinite Bus (SMIB) power system. The design of a TCSC controller requires optimization of multiple performance measures that compete with each other; the design goal is to improve power system stability with minimum control effort. A lead-lag control structure for the TCSC is proposed, and the TCSC controller parameters are tuned using global optimization of the ISE performance index of the power system, together with a comprehensive assessment of their effects. The simulation results show that the proposed controller is effective in damping low-frequency oscillations. The stability analysis of the system is carried out for various system parameters. Root locus analysis is carried out to assess the stability of the SMIB system with the proposed TCSC controller. The system transfer function is derived, and the stability of the system is analyzed with and without the controller.
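The ISE (Integral of Squared Error) performance index used for the tuning can be illustrated numerically: simulate a damped error signal and integrate its square over time. The signal form and damping ratios below are invented for illustration and are not the paper's SMIB model.

```python
import math

# ISE = integral of e(t)^2 dt, approximated by a rectangle rule, for a
# hypothetical error signal e(t) = exp(-damping*t) * cos(freq*t).
def ise(damping, freq=2.0, t_end=10.0, dt=0.001):
    total, t = 0.0, 0.0
    while t < t_end:
        e = math.exp(-damping * t) * math.cos(freq * t)
        total += e * e * dt
        t += dt
    return total

ise_weak, ise_strong = ise(0.2), ise(1.0)  # better damping gives a smaller ISE
```

A tuner that minimises this index therefore favours controller parameters that damp the oscillation quickly, which is the design goal stated above.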

2 citations


Journal Article
TL;DR: An algorithm is presented that is capable of segmenting and classifying an audio stream into male speech, female speech, music, noise, and silence, and the best-suited features for classifying the different audio classes are suggested.
Abstract: Audio classification has found widespread use in many emerging applications. It involves extracting vital temporal, spectral, and statistical features and using these to create an efficient classifier. Most audio classification work has been done on binary-class classification. In our work we suggest the best-suited features for classifying different audio classes. Here, we present an algorithm for audio classification that is capable of segmenting and classifying an audio stream into male speech, female speech, music, noise, and silence. The speech clips are further segmented into voiced and unvoiced frames. A number of timbre features that distinguish the different audio formats are discussed. For pre-classification, a threshold-based method using the Probability Density Function (PDF) is performed over each audio clip. For further classification, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers are proposed. Experiments have been performed to determine the best features for each binary class. Utilizing these features in multiclass classification yielded an accuracy of 96.34% in audio discrimination.
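Two of the temporal features commonly used in such classifiers, short-time energy and zero-crossing rate, can be sketched as follows on synthetic frames; the signals below are invented stand-ins, not real speech or noise.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# A low-frequency tone (voiced-like) vs a rapidly alternating signal (noise-like).
voiced = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noisy = [0.1 * (-1) ** n for n in range(100)]

zcr_v, zcr_n = zero_crossing_rate(voiced), zero_crossing_rate(noisy)
e_v, e_n = short_time_energy(voiced), short_time_energy(noisy)
```

Voiced-like frames show high energy and low zero-crossing rate, noise-like frames the opposite, which is why these two features separate voiced from unvoiced segments cheaply before the heavier KNN/SVM stage.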

Journal Article
TL;DR: In this paper, a modified K-NN (MKNN) classifier is proposed, which calculates group prototypes from several patterns belonging to the same class and uses these prototypes for the recognition of patterns.
Abstract: This paper describes the proposed modified K-NN (MKNN) classifier, which calculates group prototypes from several patterns belonging to the same class and uses these prototypes for the recognition of patterns. The number of prototypes created by the MKNN classifier depends on the distance factor d: more prototypes are created for smaller values of d, and vice versa. We have compared the performance of the original KNN and MKNN using a fault diagnosis database. From the experimentation, one can conclude that the performance of MKNN is better than that of the original KNN in terms of percentage recognition rate, recall time per pattern, and classification time. MKNN has thus increased the scope of the original KNN for application to large data sets, which was not possible previously.
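A rough sketch of the prototype idea behind MKNN: within each class, nearby patterns are merged into prototypes whose number depends on the distance factor d, and classification is nearest-prototype. The greedy grouping heuristic, the 1-D data, and the d value are all invented for illustration; the paper's actual prototype construction may differ.

```python
def build_prototypes(patterns, d):
    """Greedy grouping: each pattern joins the first group whose running
    mean is within d, otherwise it starts a new group. Each group's mean
    becomes one prototype, so smaller d yields more prototypes."""
    groups = []
    for x in sorted(patterns):
        for g in groups:
            if abs(x - sum(g) / len(g)) <= d:
                g.append(x)
                break
        else:
            groups.append([x])
    return [sum(g) / len(g) for g in groups]

def classify(x, prototypes_by_class):
    """Nearest-prototype classification over the reduced pattern set."""
    return min(prototypes_by_class,
               key=lambda c: min(abs(x - p) for p in prototypes_by_class[c]))

protos = {
    "normal": build_prototypes([1.0, 1.2, 1.4, 5.0], d=0.5),
    "fault": build_prototypes([9.0, 9.3, 9.1], d=0.5),
}
label = classify(8.0, protos)
```

Because classification compares against a handful of prototypes instead of every training pattern, recall time per pattern drops, which is the scaling benefit claimed above.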

Journal Article
TL;DR: The intent is to generate a social network using various network tools and to analyze the relationships between participants using block modeling techniques; the data, generated in binary form, has been analyzed and visualized with varying cluster sizes.
Abstract: With the advent of the Internet, social networks have grown enormously, and Social Network Analysis (SNA) has emerged as an important field of research. Social networks are represented as graphs in which each node, called an actor or vertex, is derived using various social network modeling techniques. One well-known technique, called 'block modeling', groups vertices into clusters and determines the relations between these clusters using matrices as computational tools. It is grounded in structural concepts such as equivalence and positions, which are related to the theoretical concepts of social roles and role sets. In this paper, the intent is to generate a social network using various network tools and to analyze the relationships between participants using block modeling techniques. The data, generated in binary form, has been analyzed and visualized with varying cluster sizes.
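One matrix computation at the heart of block modeling is reducing an adjacency matrix, under a given partition of the vertices, to a small density matrix of ties between clusters. The toy network and partition below are invented for illustration.

```python
# A 5-actor binary network: actors 0-2 form one tight group, 3-4 another,
# with a single bridge tie between actor 2 and actor 3.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
partition = {0: [0, 1, 2], 1: [3, 4]}  # cluster id -> member vertices

def block_densities(adj, partition):
    """density(r, s) = observed ties between clusters r and s divided by
    the number of possible ties (excluding self-ties within a cluster)."""
    dens = {}
    for r, members_r in partition.items():
        for s, members_s in partition.items():
            ties = sum(adj[i][j] for i in members_r for j in members_s if i != j)
            possible = (len(members_r) * (len(members_r) - 1) if r == s
                        else len(members_r) * len(members_s))
            dens[(r, s)] = ties / possible
    return dens

dens = block_densities(adj, partition)
```

Both diagonal blocks have density 1.0 (cohesive clusters) while the off-diagonal blocks are sparse, the kind of pattern block modeling uses to identify positions.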

Journal Article
TL;DR: The overall results of the report and image retrieval module and the text-assisted image feature extraction module are satisfactory to radiologists, and the overall precision and recall for medical finding extraction are 95.5% and 87.9% respectively.
Abstract: Medical text mining has gained increasing interest in recent years. Radiology reports contain rich information describing the radiologist's observations on the patient's medical conditions in the associated medical images. However, as most reports are in free-text format, the valuable information contained in those reports cannot be easily accessed and used unless proper text mining has been applied. In this paper, we propose a text mining system to extract and use the information in radiology reports. The system consists of three main modules: a medical finding extractor, a report and image retriever, and a text-assisted image feature extractor. In evaluation, the overall precision and recall for medical finding extraction are 95.5% and 87.9% respectively, and for all modifiers of the medical findings 88.2% and 82.8% respectively. The overall results of the report and image retrieval module and the text-assisted image feature extraction module are satisfactory to radiologists.
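The precision and recall figures above follow the standard definitions, which the snippet below illustrates on invented counts (not the paper's actual tallies):

```python
# precision = TP / (TP + FP): of the findings extracted, how many are correct.
# recall    = TP / (TP + FN): of the true findings, how many were extracted.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# e.g. 191 findings extracted correctly, 9 spurious, 26 missed (invented):
p, r = precision_recall(tp=191, fp=9, fn=26)
```

With these invented counts, precision is 0.955 and recall about 0.880, which shows how a 95.5%/87.9% pair could arise from raw extraction tallies.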

Journal Article
TL;DR: This survey paper aims to give an overview of previous research on this topic, evaluate the current status of the field, and envision possible future trends in this area.
Abstract: Determining association rules is a core topic of data mining. This survey paper aims to give an overview of previous research on this topic, evaluate the current status of the field, and envision possible future trends in this area. The theories behind association rules are presented at the beginning, and a comparison of different algorithms is provided as part of the evaluation.

Journal Article
TL;DR: This paper describes how ontology (in the traditional metaphysical sense) can contribute to delivering a more efficient and effective process of matching by providing a framework for the analysis, and so the basis for a methodology.
Abstract: Interoperability and integration of data sources are becoming ever more challenging issues with the increase in both the amount of data and the number of data producers. Interoperability not only has to resolve the differences in data structures, it also has to deal with semantic heterogeneity. Taking semantically heterogeneous databases as the prototypical situation, this paper describes how ontology (in the traditional metaphysical sense) can contribute to delivering a more efficient and effective process of matching by providing a framework for the analysis, and so the basis for a methodology. This delivers not only a better matching process but also a better result.

Journal Article
TL;DR: The Intelligent Knowledge Based Heterogeneous Database using OGSA-DAI Architecture (IKBHDOA) provides a solution to the problem of writing queries and knowing the technical details of databases, and has the intelligence to retrieve information from different sets of databases based on the user's inputs.
Abstract: In this paper we present a framework to manage distributed and heterogeneous databases in a grid environment using the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI). Despite considerable improvement in database technology, connecting heterogeneous databases within a single application remains a challenging task. Maintaining information for future use is central to database technology: whenever information is needed, the database is consulted, the query is processed, and the result is produced. Databases hold billions of pieces of information, and users keep their information in different databases, collecting it from those databases whenever they need it. Users cannot easily collect their information from different databases without database knowledge, and current database interfaces merely collect information from many databases. The Intelligent Knowledge Based Heterogeneous Database using OGSA-DAI Architecture (IKBHDOA) provides a solution to the problem of writing queries and knowing the technical details of databases, and has the intelligence to retrieve information from different sets of databases based on the user's inputs.

Journal Article
TL;DR: SFLA aims to set a generic paradigm of efficient mining that acquires the protein data sets for these food items and promotes prediction of protein functions with gene ontology for research work.
Abstract: In this work we present a novel approach that uses interspecies sequence homology to connect the networks of multiple species, together with gene ontology dependencies, in order to improve protein classification for research work. Proteins are involved in many biological processes, such as energy metabolism, signal transduction, and translation initiation. For a large portion of proteins, the biological function is unknown or incompletely known, so constructing efficient and reliable models for predicting protein function is necessary for research work. Our method readily extends to multi-species food data and produces improvements similar to those in the multi-species case. In the presence of multiple interacting networks, data mining is used to integrate data from various sources, contributing increased accuracy to function prediction across the multiple species. We further enhance our model to account for gene ontology dependencies by linking multiple related ontology categories; for example, we selected food items from various countries, such as yoghurt from America, oats from Australia, and soya bean from India. Data sets for these food items were obtained using logical networks from the Center for Bioinformatics Research Institute (Chennai) and stored for mining. SFLA aims to set a generic paradigm of efficient mining that acquires the protein data sets for these food items and promotes prediction of protein functions with gene ontology for research work.

Journal Article
TL;DR: This paper proposes a hash-based method for multilevel association rule mining, which extracts knowledge implicit in transaction databases with different support at each level, using a top-down, progressively deepening approach to derive large itemsets.
Abstract: Data mining plays a vital role in many applications, such as market-basket analysis and the biotechnology field. In data mining, frequent itemsets play an important role and are used to identify correlations among the fields of a database. The problem of developing models and algorithms for multilevel association mining poses new challenges for mathematics and computer science. In most studies, multilevel rules are mined through repeated mining of the database or by mining the rules at each level individually, which affects efficiency, completeness, and accuracy. This paper proposes a hash-based method for multilevel association rule mining, which extracts knowledge implicit in a transaction database with different support at each level. The proposed algorithm adopts a top-down, progressively deepening approach to derive large itemsets, and incorporates flexible boundaries instead of sharp boundary intervals. An example is also given to demonstrate that the proposed mining algorithm can derive multilevel association rules under different supports in a simple and effective manner.
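The hash-based pruning idea can be sketched in the spirit of DHP-style algorithms: while scanning transactions, hash every 2-itemset into buckets, and a pair can only be frequent if its bucket count meets the support threshold (a cheap necessary, not sufficient, condition). The transactions and bucket count below are invented; the paper's multilevel, per-level-support scheme is richer.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
MIN_SUPPORT = 2
N_BUCKETS = 7

def bucket(pair):
    """Hash a 2-itemset into one of N_BUCKETS buckets."""
    return hash(frozenset(pair)) % N_BUCKETS

# One scan: count every 2-itemset occurrence per bucket.
counts = [0] * N_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

def may_be_frequent(pair):
    """Bucket test: the bucket count upper-bounds the pair's support, so a
    low bucket count safely rules the pair out."""
    return counts[bucket(pair)] >= MIN_SUPPORT

all_items = sorted({i for t in transactions for i in t})
candidates = [p for p in combinations(all_items, 2) if may_be_frequent(p)]
```

Here every pair survives the bucket test (all are genuinely frequent); in larger databases many candidate pairs land in underfilled buckets and are pruned before the expensive support count.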

Journal Article
TL;DR: A simple yet efficient way of functionally characterizing a novel enzyme by the application of support vector machines, thereby resolving any overfitting of data that may be present in the instance sets.
Abstract: As proteinic enzyme sequences are entering the databases at a prodigious rate, the functional annotation of these sequences has become a major challenge in the field of bioinformatics. The dispersion in the data makes this task even tougher. The authors illustrate in this paper a simple yet efficient way of functionally characterizing a novel enzyme by the application of support vector machines. The best accuracy gained by this method on the generalization test is 91.55%, with a Matthews Correlation Coefficient (MCC) of 0.63. The method was further validated by three different types of testing. The resulting accuracy for the LOO estimate was found to be 91.05%, with an MCC of 0.62, thereby resolving any overfitting of data that may be present in the instance sets.
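The Matthews Correlation Coefficient quoted above is computed from a confusion matrix as follows; the counts here are invented purely to illustrate the formula, not the paper's results.

```python
import math

def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    ranging from -1 (total disagreement) to +1 (perfect prediction)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

score = mcc(tp=45, tn=40, fp=5, fn=10)  # invented confusion-matrix counts
```

Unlike raw accuracy, MCC stays honest on imbalanced classes, which is why it accompanies the accuracy figures in the abstract.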

Journal Article
TL;DR: In this paper, a redundancy check is performed on the original dataset and the result is preserved; this resultant dataset is then checked for conflicting data, and any conflicts found are corrected and updated in the original dataset.
Abstract: In real-world datasets, a lot of redundant and conflicting data exists. The performance of a classification algorithm in data mining is greatly affected by noisy information (i.e. redundant and conflicting data). Such records not only increase the cost of the mining process but also degrade the detection performance of the classifiers, so they have to be removed to increase the efficiency and accuracy of the classifiers. This process is called tuning of the dataset. A redundancy check is performed on the original dataset and the result is preserved. This resultant dataset is then checked for conflicting data, and any conflicts found are corrected and updated in the original dataset. The updated dataset is then classified using a variety of classifiers: Multilayer Perceptron, SVM, Decision Stump, KStar, LWL, REP Tree, Decision Table, ID3, J48, and Naive Bayes. The performance of the updated datasets on these classifiers is evaluated. The results show a significant improvement in classification accuracy when redundancy and conflicts are removed; once the corrected conflicts are updated in the original dataset, a great improvement in classifier performance is witnessed.
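The tuning step described above can be sketched as dropping exact duplicates and resolving conflicts (identical features, different labels) by majority vote; the toy records and the majority-vote rule are invented for illustration.

```python
from collections import Counter

# Toy records: (feature vector, class label).
records = [
    (("sunny", "hot"), "no"),
    (("sunny", "hot"), "no"),   # redundant exact duplicate
    (("rain", "mild"), "yes"),
    (("rain", "mild"), "no"),   # conflicts with the record above
    (("rain", "mild"), "yes"),
]

def tune(records):
    """Keep one record per feature vector (removes redundancy) with the
    majority label among its occurrences (resolves conflicts)."""
    votes = {}
    for features, label in records:
        votes.setdefault(features, Counter())[label] += 1
    return [(f, c.most_common(1)[0][0]) for f, c in votes.items()]

cleaned = tune(records)  # two records remain, conflicts resolved
```

The cleaned dataset is what would then be fed to the classifiers listed above to measure the accuracy improvement.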