Journal•ISSN: 2005-4270
International journal of database theory and application
NADIA
About: International journal of database theory and application is an academic journal. The journal publishes mainly in the areas of cluster analysis and cloud computing. Its ISSN is 2005-4270. Over its lifetime, the journal has published 568 papers, which have received 2530 citations.
Topics: Cluster analysis, Cloud computing, Support vector machine, Big data, Canopy clustering algorithm
Papers
TL;DR: This paper first categorizes the documents using a KNN-based machine learning approach and then returns the most relevant documents, addressing the text categorization problem.
Abstract: Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. If a document belongs to exactly one of the categories, it is a single-label classification task; otherwise, it is a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent years from both academic researchers and industry developers. In this paper, we first categorize the documents using a KNN-based machine learning approach and then return the most relevant documents.
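The two steps named in the abstract, categorizing with KNN and then returning the most relevant documents, might look roughly like the sketch below. It is not the paper's implementation: the bag-of-words vectors, the cosine similarity measure, the helper names and the tiny TRAIN corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, labelled_docs, k=3):
    """Return (predicted_label, neighbours): a majority vote over the k most similar documents."""
    q = vectorize(query)
    ranked = sorted(labelled_docs, key=lambda d: cosine(q, vectorize(d[0])), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours

# Hypothetical toy corpus of (document, category) pairs.
TRAIN = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the match ended in a draw", "sport"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell on rate fears", "finance"),
    ("the central bank cut rates", "finance"),
]

label, neighbours = knn_classify("the bank changed interest rates", TRAIN, k=3)
print(label)             # predicted category for the query
print(neighbours[0][0])  # most similar training document
```

The neighbours list doubles as the "most relevant documents" result, since it is already ranked by similarity to the query.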
197 citations
TL;DR: A student performance prediction model based on data mining techniques and new attributes, called student behavioral features, shows a strong relationship between learners' behaviors and their academic achievement; the results support the reliability of the proposed model.
Abstract: Educational data mining has received considerable attention in the last few years. Many data mining techniques have been proposed to extract hidden knowledge from educational data. The extracted knowledge helps institutions to improve their teaching methods and learning process. All these improvements enhance the performance of students and the overall educational outputs. In this paper, we propose a new student performance prediction model based on data mining techniques with new data attributes/features, which are called student behavioral features. These types of features are related to the learner's interactivity with the e-learning management system. The performance of the student predictive model is evaluated by a set of classifiers, namely Artificial Neural Network, Naive Bayesian and Decision Tree. In addition, we applied ensemble methods to improve the performance of these classifiers. We used Bagging, Boosting and Random Forest (RF), which are the common ensemble methods used in the literature. The obtained results reveal that there is a strong relationship between learners' behaviors and their academic achievement. With behavioral features, the accuracy of the proposed model improved by up to 22.1% compared with the results obtained without such features, and by up to 25.8% when ensemble methods were used. When the model was tested on newcomer students, the achieved accuracy was more than 80%. This result supports the reliability of the proposed model.
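As a rough illustration of the Bagging idea mentioned in the abstract, the sketch below trains several weak models on bootstrap samples and combines them by majority vote. It swaps in one-feature decision stumps as base learners rather than the classifiers the paper used, and the single behavioral feature (a count of raised hands) and every data value are invented toy assumptions, not the paper's dataset.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-feature threshold rule: predict `low` if x <= t, else `high`."""
    best = None
    thresholds = sorted({x for x, _ in sample})
    labels = sorted({y for _, y in sample})
    for t in thresholds:
        for low in labels:
            for high in labels:
                errors = sum((low if x <= t else high) != y for x, y in sample)
                if best is None or errors < best[0]:
                    best = (errors, t, low, high)
    _, t, low, high = best
    return lambda x: low if x <= t else high

def bagging(data, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        models.append(train_stump(sample))

    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]

    return predict

# Hypothetical (raised_hands, outcome) pairs, purely illustrative.
DATA = [(2, "fail"), (5, "fail"), (8, "fail"), (12, "fail"), (15, "fail"),
        (40, "pass"), (55, "pass"), (60, "pass"), (72, "pass"), (90, "pass")]

predict = bagging(DATA)
print(predict(1), predict(95))
```

Each stump sees a slightly different resample of the data, so the vote smooths out the variance of any single weak model, which is the motivation for bagging in general.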
195 citations
TL;DR: This work proposes to implement a typical decision tree algorithm, C4.5, using the MapReduce programming model, and transforms the traditional algorithm into a series of Map and Reduce procedures, showing both time efficiency and scalability.
Abstract: Recent years have witnessed the development of cloud computing and the big data era, which bring challenges to traditional decision tree algorithms. First, as the size of the dataset becomes extremely large, the process of building a decision tree can be quite time consuming. Second, because the data can no longer fit in memory, some computation must be moved to external storage, which increases the I/O cost. To this end, we propose to implement a typical decision tree algorithm, C4.5, using the MapReduce programming model. Specifically, we transform the traditional algorithm into a series of Map and Reduce procedures. In addition, we design some data structures to minimize the communication cost. We also conduct extensive experiments on a massive dataset. The results indicate that our algorithm exhibits both time efficiency and scalability.
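One way to picture a single Map and Reduce step in such a scheme is the in-memory sketch below: the map phase emits a count for each (attribute, value, class) triple, the reduce phase sums the counts, and the gain ratio that C4.5 uses to choose a split is computed from the totals. The function names, the record layout and the toy records are assumptions; the paper's actual procedures and data structures are not reproduced here.

```python
import math
from collections import defaultdict

def map_record(record, class_key):
    """Map phase: emit ((attribute, value, class), 1) for every non-class attribute."""
    label = record[class_key]
    for attr, value in record.items():
        if attr != class_key:
            yield (attr, value, label), 1

def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals

def entropy(counts):
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratios(totals):
    """Compute the C4.5 gain ratio of each attribute from the reduced counts."""
    by_attr = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for (attr, value, label), n in totals.items():
        by_attr[attr][value][label] += n
    ratios = {}
    for attr, values in by_attr.items():
        class_totals = defaultdict(int)
        for labels in values.values():
            for label, n in labels.items():
                class_totals[label] += n
        size = sum(class_totals.values())
        base = entropy(class_totals.values())
        cond = sum(sum(l.values()) / size * entropy(l.values()) for l in values.values())
        split_info = entropy(sum(l.values()) for l in values.values())
        ratios[attr] = (base - cond) / split_info if split_info else 0.0
    return ratios

# Hypothetical toy records; "play" is the class attribute.
RECORDS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

emitted = [pair for r in RECORDS for pair in map_record(r, "play")]
ratios = gain_ratios(reduce_counts(emitted))
best_attribute = max(ratios, key=ratios.get)
print(best_attribute, ratios)
```

Because only small count tables leave the map phase, the full dataset never has to sit in one machine's memory, which is the kind of saving the abstract is after.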
145 citations
TL;DR: The main aim of this paper is to extrapolate the various areas of SVM, providing a basis for understanding the technique and a comprehensive survey, while offering researchers a modernized picture of the depth and breadth in both the theory and applications.
Abstract: During the last two decades, a substantial amount of research effort has been devoted to applying the support vector machine to various data mining tasks. Data mining is a pioneering and attractive research area because of its wide range of applications and task primitives. The Support Vector Machine (SVM) plays a decisive role because it provides techniques that are especially well suited to obtaining results efficiently and with a good level of quality. In this paper, we survey the role of SVM in various data mining tasks such as classification, clustering, prediction, forecasting and other applications. More broadly, we review a number of research publications on data mining applications from internationally reputed journals and suggest a number of possible open issues for SVM. The main aim of this paper is to extrapolate the various areas of SVM, providing a basis for understanding the technique and a comprehensive survey, while offering researchers a modernized picture of the depth and breadth in both the theory and applications.
107 citations
TL;DR: An analysis of 10% of the KDD Cup '99 training dataset for intrusion detection establishes a relationship between the attack types and the protocols used by the hackers, using clustered data.
Abstract: The KDD Cup 99 dataset has attracted many researchers in the field of intrusion detection over the last decade. Many researchers have analyzed the dataset using different techniques. Such analysis can be used in any type of industry that produces and consumes data, and that includes security. This paper is an analysis of 10% of the KDD Cup '99 training dataset for intrusion detection. We have focused on establishing a relationship between the attack types and the protocols used by the hackers, using clustered data. The data is analyzed using k-means clustering; we used the Oracle 10g data miner as the analysis tool and built 1000 clusters to segment the 494,020 records. The investigation revealed many interesting results about the protocols and attack types preferred by the hackers for intruding into networks. Keywords: KDD 99 dataset, clustering, k-means, intrusion detection
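The kind of analysis described, clustering records with k-means and then checking which protocols dominate each cluster, can be sketched as follows. The paper used the Oracle 10g data miner; this is a plain Python stand-in with a naive initialization, and the two numeric features and all record values are invented for illustration.

```python
from collections import Counter, defaultdict

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a naive initialization (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical records: ((feature_1, feature_2), protocol).
RECORDS = [
    ((1.0, 1.0), "icmp"), ((1.5, 2.0), "icmp"), ((1.0, 0.5), "tcp"),
    ((8.0, 8.0), "tcp"), ((9.0, 9.5), "tcp"), ((8.5, 9.0), "udp"),
]

points = [features for features, _ in RECORDS]
labels, centroids = kmeans(points, k=2)

# Cross-tabulate cluster membership against protocol.
protocol_by_cluster = defaultdict(Counter)
for (_, protocol), cluster in zip(RECORDS, labels):
    protocol_by_cluster[cluster][protocol] += 1

print(labels)
print(dict(protocol_by_cluster))
```

The same cross-tabulation could be repeated with attack type in place of protocol to relate the two through shared clusters.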
93 citations