
Showing papers in "International Journal of Database Theory and Application" in 2014


Journal ArticleDOI
TL;DR: This paper first categorizes the documents using a KNN-based machine learning approach and then returns the most relevant ones, solving the text categorization problem.
Abstract: Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. If a document belongs to exactly one of the categories, it is a single-label classification task; otherwise, it is a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent years from both academic researchers and industry developers. In this paper, we first categorize the documents using a KNN-based machine learning approach and then return the most relevant documents.
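
For illustration, the pipeline this abstract describes can be sketched in a few lines of scikit-learn; the TF-IDF representation, the cosine metric and the value of k are our assumptions, since the abstract does not name them.

```python
# Minimal sketch of a kNN text categorizer (assumed TF-IDF features and
# cosine distance; the paper does not specify its representation or k).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["cheap flights and hotel deals", "election results announced"]
train_labels = ["travel", "politics"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["hotel booking discounts"])
print(knn.predict(X_test))  # -> ['travel']
```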

197 citations


Journal ArticleDOI
TL;DR: This work proposes to implement a typical decision tree algorithm, C4.5, using the MapReduce programming model: it transforms the traditional algorithm into a series of Map and Reduce procedures and demonstrates both time efficiency and scalability.
Abstract: Recent years have witnessed the development of cloud computing and the big data era, which brings challenges to traditional decision tree algorithms. First, as datasets become extremely big, the process of building a decision tree can be quite time consuming. Second, because the data can no longer fit in memory, some computation must be moved to external storage, which increases the I/O cost. To this end, we propose to implement a typical decision tree algorithm, C4.5, using the MapReduce programming model. Specifically, we transform the traditional algorithm into a series of Map and Reduce procedures. Besides, we design data structures to minimize the communication cost. We also conduct extensive experiments on a massive dataset. The results indicate that our algorithm exhibits both time efficiency and scalability.
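
To make the Map/Reduce decomposition concrete, one step of C4.5, gathering the per-attribute class counts needed for gain-ratio computation, can be cast as follows; this is a plain-Python sketch of the idea, not the authors' implementation.

```python
# Illustrative sketch of one C4.5 step as map/reduce: mappers emit
# (attribute, value, class) keys for each record, and a reducer aggregates
# them into the contingency counts from which gain ratios are computed.
from collections import Counter
from itertools import chain

def map_record(record, class_label):
    # record: dict of attribute -> value for one training example
    for attr, value in record.items():
        yield (attr, value, class_label), 1

def reduce_counts(pairs):
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts  # attribute/value/class contingency counts

records = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")]
mapped = chain.from_iterable(map_record(r, c) for r, c in records)
print(reduce_counts(mapped))
```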

145 citations


Journal ArticleDOI
TL;DR: A hybrid method of C5.0 and SVM is developed to improve the accuracy of the intrusion detection system compared to using C5.0 or SVM individually, and the performance of the proposed method is investigated and evaluated on the DARPA dataset.
Abstract: Nowadays, much attention has been paid to intrusion detection systems (IDS), which are closely linked to the safe use of network services. Several machine-learning paradigms, including neural networks, linear genetic programming (LGP), support vector machines (SVM), Bayesian networks, multivariate adaptive regression splines (MARS), fuzzy inference systems (FIS), etc., have been investigated for the design of IDS. In this paper, we develop a hybrid method of C5.0 and SVM and investigate and evaluate the performance of our proposed method on the DARPA dataset. The motivation for the hybrid approach is to improve the accuracy of the intrusion detection system compared to using C5.0 or SVM individually.
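
The abstract does not say how C5.0 and SVM are combined. One plausible hybrid, shown purely as a hypothetical sketch, feeds the tree's output in as an extra SVM feature; sklearn's DecisionTreeClassifier stands in for C5.0, which has no standard Python implementation.

```python
# Hypothetical hybrid sketch: a decision tree (stand-in for C5.0) produces
# an attack-probability feature that is appended to the SVM's input. The
# actual combination scheme in the paper is not specified in the abstract.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 5)            # placeholder for DARPA-style features
y = np.random.randint(0, 2, 200)      # 0 = normal, 1 = intrusion

tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
tree_score = tree.predict_proba(X)[:, [1]]      # tree's attack probability
X_aug = np.hstack([X, tree_score])              # augment features with it

svm = SVC(kernel="rbf").fit(X_aug, y)
print(svm.score(X_aug, y))
```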

37 citations


Journal ArticleDOI
TL;DR: Detailed descriptions of data cleaning, imbalanced-data handling and dimensionality-reduction pre-processing techniques are given in this paper, which also highlights the research opportunities and challenges of the Knowledge Discovery process.
Abstract: Knowledge Discovery in Databases (KDD) covers various processes of exploring useful information in voluminous data. These data may contain several inconsistencies, missing records or irrelevant features, which make knowledge extraction a difficult process. It is therefore essential to apply pre-processing techniques to the data in order to enhance their quality. Detailed descriptions of data cleaning, imbalanced-data handling and dimensionality-reduction pre-processing techniques are given in this paper. Another important aspect of Knowledge Discovery is to filter, integrate, visualize and evaluate the extracted knowledge. Several visualization techniques, such as scatter plots, parallel co-ordinates and pixel-oriented techniques, are explained. The paper also includes detailed descriptions of three visualization tools, DBMiner, Spotfire and WinViz, along with their comparative evaluation on the basis of certain criteria. It also highlights the research opportunities and challenges of the Knowledge Discovery process.
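
The three pre-processing families the survey covers can be lined up in a short sketch, with one representative technique each; the specific choices (mean imputation, naive random oversampling, PCA) are ours, not prescriptions from the paper.

```python
# One representative technique per pre-processing family from the survey:
# data cleaning, imbalance handling, dimensionality reduction.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 0, 1])

X = SimpleImputer(strategy="mean").fit_transform(X)   # data cleaning

minority = np.where(y == 1)[0]                        # imbalance handling:
extra = np.random.choice(minority, size=2)            # naive oversampling
X, y = np.vstack([X, X[extra]]), np.append(y, y[extra])

X_reduced = PCA(n_components=1).fit_transform(X)      # dimensionality reduction
print(X_reduced.shape)
```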

28 citations


Journal ArticleDOI
TL;DR: In this paper, the authors discuss the NewSQL data management system and compare it with NoSQL and with traditional database systems; they also provide lists of popular NoSQL and NewSQL databases in separate categorized tables.
Abstract: One of the key advances in resolving the “big-data” problem has been the emergence of alternative database technologies. Today, classic RDBMS are complemented by a rich set of alternative Data Management Systems (DMS) specially designed to handle the volume, variety, velocity and variability of Big Data collections; these DMS include NoSQL, NewSQL and Search-based systems. NewSQL is a class of modern relational database management systems (RDBMS) that provide the scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system. This paper discusses the NewSQL data management system and compares it with NoSQL and with traditional database systems. It covers the architecture, characteristics and classification of NewSQL databases for online transaction processing (OLTP) in Big Data management. It also provides lists of popular NoSQL and NewSQL databases in separate categorized tables, compares SQL-based RDBMS, NoSQL and NewSQL databases against a set of metrics, and addresses some research issues of NoSQL and NewSQL.

25 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed WLSTSVM technique performs well on imbalanced datasets, with better accuracy than other existing methods.
Abstract: This research work proposes a Weighted Least Square Twin Support Vector Machine (WLSTSVM) for imbalanced datasets. Real-world data are imbalanced in nature, which prevents most classification techniques from working well. In imbalanced data there is a huge difference between the number of samples in each class: one class has far more samples than the other. This paper discusses the traditional methods of handling imbalanced data and proposes an improvement over the Least Square Twin Support Vector Machine. Experiments are performed on five benchmark UCI datasets using 10-fold cross-validation. The results show that the proposed technique performs well on imbalanced datasets and that its accuracy is better than that of other existing methods. The formulation of the proposed approach is presented for both linear and non-linear data.
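
The abstract omits the formulation, but for orientation, a class-weighted LSTSVM objective for the first of the two hyperplanes is commonly written along the following lines (notation ours; the paper's exact formulation may differ):

```latex
% Sketch of a class-weighted LSTSVM objective (notation ours).
% A, B stack the samples of the two classes; e_1, e_2 are all-ones
% vectors; W is a diagonal weight matrix that up-weights errors on
% the minority class to counteract the imbalance.
\min_{w_1, b_1}\; \frac{1}{2}\,\lVert A w_1 + e_1 b_1 \rVert^2
  \;+\; \frac{c_1}{2}\,\xi^{\top} W \xi
\quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi = e_2
```

The equality constraint (rather than an inequality) is what lets the problem be solved in closed form via a linear system, which is the efficiency appeal of least-squares twin SVMs.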

22 citations


Journal ArticleDOI
TL;DR: The architecture of the SAMOA (Scalable Advanced Massive Online Analysis) framework and its directory structure are discussed, and practical experience in configuring and deploying the tool for massive online analysis of Big Data is reported.
Abstract: Data analytics and machine learning have always been of great importance in almost every field, especially in business decision making and strategy building, in the healthcare domain, in text mining and pattern identification on the web, in meteorology, etc. Today's daily exponential growth of data has shifted normal data analytics to the new paradigms of Big Data Analytics and Big Data Machine Learning. We need tools that perform online analysis of streaming data, achieving faster learning and faster response while remaining scalable to huge volumes of data. SAMOA (Scalable Advanced Massive Online Analysis) is a recent framework in this area. This paper discusses the architecture of the SAMOA framework and its directory structure. It also reports practical experience in configuring and deploying the tool for massive online analysis of Big Data.

17 citations


Journal ArticleDOI
TL;DR: This paper first explores the distance metrics widely used in TC problems (such as the Euclidean metric) and finds that these metrics may not be appropriate for highly skewed datasets such as those in text categorization.
Abstract: Text classification (TC) is a classic research topic in computer applications. In this paper, we first explore the distance metrics widely used in TC problems (such as the Euclidean metric) and find that these metrics may not be appropriate for highly skewed datasets such as those in text categorization. Therefore, a novel method of learning evidence from multiple distance metrics is proposed. Based on Dempster-Shafer (DS) theory, the evidence learnt from these distance metrics is combined to improve the effectiveness of a kNN-based text classifier, because the neighbors computed for a given query pattern may come from heterogeneous neighborhood sources and usually differ in their influence on the predicted class label. The ensemble of distance metrics is tested on three standard benchmark data sets. Finally, we demonstrate the robustness of the proposed approach through a series of experiments.
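
A toy version of the idea, two distance metrics each producing a kNN vote distribution that Dempster's rule then combines, might look as follows; how the paper actually constructs its mass functions may differ.

```python
# Toy sketch: kNN evidence from two distance metrics combined with
# Dempster's rule over singleton hypotheses (mass = neighbor vote fraction).
import numpy as np

def knn_masses(X, y, q, k, metric):
    d = np.array([metric(x, q) for x in X])
    nn = y[np.argsort(d)[:k]]
    return {c: np.mean(nn == c) for c in np.unique(y)}

euclid = lambda a, b: np.linalg.norm(a - b)
cosine = lambda a, b: 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
q = np.array([0.8, 0.3])

m1, m2 = knn_masses(X, y, q, 3, euclid), knn_masses(X, y, q, 3, cosine)
combined = {c: m1[c] * m2[c] for c in m1}   # Dempster's rule on singletons
norm = sum(combined.values())
print({c: v / norm for c, v in combined.items()})  # class 0 dominates
```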

13 citations


Journal ArticleDOI
TL;DR: This paper presents a comparative study (1997–2012) of different sentiment classification techniques applied to different dataset domains such as web discourse, reviews and news articles.
Abstract: The growth of social websites and electronic media contributes vast amounts of user-generated content such as customer reviews, comments and opinions. The term Sentiment Analysis refers to the extraction of the opinions of others (speakers or writers) from given source material (text) using NLP, computational linguistics and text mining. Sentiment classification of product and service reviews and comments has emerged as the most useful application in the area of sentiment analysis. This paper presents a comparative study (1997–2012) of different sentiment classification techniques applied to different dataset domains such as web discourse, reviews and news articles. Bag-of-words and feature extraction are the approaches most used by researchers to deal with sentiment analysis of opinions about movies, electronics, cars, music, etc. Sentiment analysis is used by manufacturers, politicians, news groups, and other organizations to learn the opinions of customers, the public, and social website users.

13 citations


Journal ArticleDOI
TL;DR: A trust-based method is designed to detect failures quickly and recover from them, and a checkpoint-based algorithm is applied to perform the data recovery.
Abstract: Failures happen in large-scale distributed systems such as Hadoop clusters. Native Hadoop provides basic support for fault tolerance. For example, data blocks are replicated over several HDFS nodes, and Map or Reduce tasks are re-executed if they fail. However, simply re-processing the whole task decreases the efficiency of job execution, especially when the task is almost done. To this end, we propose a fault tolerance mechanism to detect and then recover from failures. Specifically, instead of simply using a timeout configuration, we design a trust-based method to detect failures quickly. Then, a checkpoint-based algorithm is applied to perform data recovery. Our experiments show that the method exhibits good performance and is efficient.
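
As a hypothetical sketch of trust-based detection replacing a fixed timeout: each worker's trust score could rise on timely heartbeats and decay on misses, with recovery triggered once trust falls below a threshold. The concrete update rule below is our assumption, not the paper's.

```python
# Hypothetical trust-based failure detector: an exponential moving average
# of heartbeat success replaces a fixed timeout. Once trust drops below the
# threshold, the node is flagged and recovery resumes from the last
# checkpoint (checkpointing itself omitted here).
FAIL_THRESHOLD = 0.6

class WorkerTrust:
    def __init__(self):
        self.trust = 1.0

    def heartbeat(self, on_time: bool):
        self.trust = 0.8 * self.trust + 0.2 * (1.0 if on_time else 0.0)
        return self.trust >= FAIL_THRESHOLD  # False -> trigger recovery

w = WorkerTrust()
for ok in [True, False, False, False]:
    alive = w.heartbeat(ok)
print(alive, round(w.trust, 3))  # False 0.512 -> node flagged as failed
```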

10 citations


Journal ArticleDOI
TL;DR: A study on the use of the encryption, signature, and signature-followed-by-encryption security standards in a hierarchical web service is presented, along with the outcomes of statistical analysis of the observed performance metrics.
Abstract: Security implementation in hierarchical SOAP-based web services is essential from the perspectives of efficiency and reliability of the service. We present here a study on the use of the encryption, signature, and signature-followed-by-encryption security standards in a hierarchical web service. The web service with the security policies is evaluated through the development of the application, testing of the service and evaluation of the performance metrics. The performance latencies are observed and compared to study the response of the hierarchical web service under each security policy. We present in detail the architecture of the service, software aspects, the testing procedure and the performance aspects of each implemented security policy, along with the outcomes of statistical analysis of the observed performance metrics.

Journal ArticleDOI
TL;DR: A new ensemble framework with partially labeled instances for learning from textual streams, together with an adaptive selection method; empirical evaluation on textual streams reveals that this approach outperforms state-of-the-art stream classification algorithms.
Abstract: Increasing access to large-scale, high-dimensional and non-stationary streams in many real applications has made it necessary to design new dynamic classification algorithms. Most existing approaches to textual stream classification train the model on labeled data only. However, only a limited number of instances can be labeled in a real streaming environment, since large-scale data arrive at high speed. It is therefore useful to make unlabeled instances available for training and updating the ensemble models. In this paper, we present a new ensemble framework with partially labeled instances for learning from textual streams. A new semi-supervised cluster-based classifier is proposed as the sub-classifier in our approach. To integrate these sub-classifiers, we propose an adaptive selection method. Empirical evaluation on textual streams reveals that our approach outperforms state-of-the-art stream classification algorithms.

Journal ArticleDOI
TL;DR: Results based on 250,000 Microblog items published by Sina show that the proposed parallel schema for Web services, together with the Benefit Ratio of Composite Services (BROCS), can effectively improve the efficiency of composite Web services.
Abstract: The accessing, transferring and processing of data need to occur in parallel in order to tackle the problems brought on by the increasing volume of data on the Internet. In this paper, we put forward a parallel schema for data-intensive Web services and the Benefit Ratio of Composite Services (BROCS) to balance throughput and cost. Furthermore, we present a method to determine the degree of parallelism (DOP) based on the BROCS model in order to optimize the quality of composite services. The experiment demonstrates how the DOP affects the benefit ratio of a composite service. Meanwhile, results based on 250,000 Microblog items published by Sina show that our proposed parallel schema for Web services can effectively improve the efficiency of the composite Web service.

Journal ArticleDOI
TL;DR: In this article, the authors developed a SOAP-based research web service suitable for online medical services, called MedWS (a prototype research medical web service), to study its performance and to evaluate the technique used for developing the service.
Abstract: Performance testing of hierarchical web service communications is essential from the perspective of users as well as developers, since it directly reflects the behavior of the service. We have therefore developed a SOAP-based research web service suitable for online medical services, to study its performance and to evaluate the technique used for developing it. We call the service MedWS (a prototype research medical web service). Load and stress testing have been carried out on MedWS using Mercury LoadRunner to study the performance, stability, scalability and efficiency of the service. The performance depends on metrics such as hits/sec, response time, throughput, errors/sec and the transaction summary. These metrics are tested at different stress levels. Statistical analysis of the recorded data has been carried out to study the stability and quality of the application. The present study reveals that the SOAP-based web service, developed in the Java programming language, is stable, scalable and effective. We present here the architecture, the testing procedure, the results of performance testing, and the results of statistical analysis of the recorded data of MedWS.

Journal ArticleDOI
TL;DR: The authors introduce data mining technology into the association analysis of China's electricity consumption growth: they select many socio-economic indicators since 2000 to constitute a database of relevant factors, complement a few missing data points, dig out a number of indicators closely related to electricity consumption with cluster analysis, and correct the data of distorted indicators.
Abstract: The relationship between medium- and long-term load forecasting and socio-economic indicators is very difficult to describe with an accurate mathematical model. This paper introduces data mining technology into the association analysis of China's electricity consumption growth. It selects many socio-economic indicators since 2000 to constitute a database of relevant factors, complements a few missing data points, digs out a number of indicators closely related to electricity consumption with cluster analysis, and corrects the data of distorted indicators, thus building a more scientific load forecasting model. The correlation between electricity consumption and the selected indicators is validated and tested with a dynamic neural network time-series tool. The results show that the prediction model has good convergence and that its performance is satisfactory.

Journal ArticleDOI
TL;DR: This paper proposes a steel semantic model (named STSM), based on ontology and logic rules, for the representation of steel knowledge; STSM covers the basic knowledge of the steel domain, and the paper describes its content and organization.
Abstract: There are rich data resources in materials science, but these resources are heterogeneous at the levels of system, structure, syntax and semantics. A domain ontology is therefore necessary and helpful for the integration of these heterogeneous data resources, and building one is among the main tasks of materials informatics. In this paper, we propose a steel semantic model (named STSM), based on ontology and logic rules, for the representation of steel knowledge. STSM is developed with consideration of the features of materials data, and its development process is presented. We then describe the content and organization of STSM, which covers the basic knowledge of the steel domain. Further, domain axioms and logic rules are designed to enhance the reasoning ability of STSM. STSM is built and tested in Protégé, and an experimental prototype based on Jena is also developed to demonstrate its effectiveness.

Journal ArticleDOI
TL;DR: Experimental results show that the MapReduce implementation of DBSCAN running on the Hadoop cloud computing platform has good speedup and scalability.
Abstract: To address the shortcomings of the "density-based spatial clustering of applications with noise" (DBSCAN) algorithm in dealing with large data sets, the MapReduce programming model is used to implement DBSCAN clustering. Map functions complete the data analysis and derive clustering rules for the different data objects; Reduce functions then merge these clustering rules to obtain the final result. Experimental results show that the MapReduce implementation of DBSCAN running on the Hadoop cloud computing platform has good speedup and scalability.
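
A simplified sketch of this split: each mapper runs DBSCAN on its own partition, and the reducer merges local clusters whose members come within eps of each other (1-D points for brevity; real implementations must also reconcile border and noise points across partitions).

```python
# Map: local DBSCAN per partition. Reduce: merge local clusters that touch.
import numpy as np
from sklearn.cluster import DBSCAN

EPS = 0.5

def map_partition(points):
    """Mapper: local DBSCAN on one partition; returns its clusters."""
    pts = np.asarray(points).reshape(-1, 1)
    labels = DBSCAN(eps=EPS, min_samples=2).fit_predict(pts)
    return [pts[labels == c].ravel() for c in set(labels) - {-1}]

def reduce_merge(local_clusters):
    """Reducer: merge local clusters that come within EPS of each other."""
    merged = []
    for c in local_clusters:
        for i, m in enumerate(merged):
            if np.min(np.abs(m[:, None] - c[None, :])) <= EPS:
                merged[i] = np.concatenate([m, c])
                break
        else:
            merged.append(c)
    return merged

part1, part2 = [0.0, 0.2, 0.4], [0.6, 0.8, 5.0, 5.1]
clusters = map_partition(part1) + map_partition(part2)
print([sorted(c) for c in reduce_merge(clusters)])  # two final clusters
```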

Journal ArticleDOI
TL;DR: A tool automatic-identification and database-management system based on RFID is built, and the system proves to be highly reliable and efficient.
Abstract: RFID (Radio Frequency Identification) is a non-contact automatic identification technology that requires no human intervention; it is convenient to apply and can identify fast-moving objects. In tool management, the problems of low operational efficiency and a low degree of tool-information management remain unsolved. In this paper, RFID technology is applied to tool automatic identification and management, and a tool automatic-identification and database-management system based on RFID is built. The application model, the tool-information database design and the upper-computer management software for RFID-based tool identification and management are analyzed, and the overall architecture and the software function modules of the system are constructed. The database management system is developed with LabVIEW, and the interface between LabVIEW and the Access database is implemented using ActiveX, so that the management system can write to and read from the Access database. Meanwhile, the communication protocol between the upper computer and the lower computer is programmed based on VISA, with RS232 used to realize the communication between them. Finally, SECO's solid carbide ball-end milling cutter is taken as an example for an experiment, through which the system proves to be highly reliable and efficient.

Journal ArticleDOI
TL;DR: The main focus of this work is a performance analysis of the various techniques available for document clustering, including Hierarchical Agglomerative Clustering (HAC), which organizes clusters in a tree-like structure that makes browsing possible.
Abstract: Text clustering is the method of grouping texts or documents that are similar to one another and separating those that are dissimilar. Text mining is used in several text tasks, such as information and concept/entity extraction, document summarization, entity-relation modeling, categorization/classification and clustering. It operates on digital documents or text and is a method of data mining. Text clustering combines documents into categories and is applied in areas such as information retrieval and web or corporate information systems. Clustering is also called unsupervised learning because, unlike document classification, it is given no labeled documents. Hierarchical Agglomerative Clustering (HAC) manages clusters in a tree-like structure that makes browsing possible; the nodes of the tree can be viewed as a parent-child, i.e. topic-subtopic, relationship in a hierarchy. HAC starts with each example in its own cluster and iteratively combines clusters to form larger and larger ones. The main focus of this work is to present a performance analysis of the various techniques available for document clustering.
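
For illustration, HAC over TF-IDF document vectors takes only a few lines with scipy; the linkage matrix encodes the browsable topic-subtopic tree described above. The average-linkage and cosine choices are ours.

```python
# Sketch of HAC over TF-IDF document vectors: the linkage matrix Z records
# the bottom-up merges that form the browsable cluster tree.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["stock market falls", "market traders nervous",
        "team wins final", "final match highlights"]

X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")  # iterative merges
print(fcluster(Z, t=2, criterion="maxclust"))      # e.g. [1 1 2 2]
```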

Journal ArticleDOI
J. H. Ge, H. Gao, Y. P. Wang, P. Q. Fu, C. T. Zhang 
TL;DR: An optimization method for parallel, dynamically chained, real-time available resources is put forward; it performs global dynamic feedback tracing for every module, makes a preliminary optimization of the available resources on this basis, grades the preliminary available resources, and produces a real-time candidate resource set.
Abstract: To meet the real-time requirements of dynamically scheduling production resources, an optimization method for parallel, dynamically chained, real-time available resources is put forward. Starting from real-time tracing of the resource information that influences the dynamic properties of scheduling tasks, a real-time resource-information tracing and optimization model is established, which applies modular and parallel processing mechanisms to the different kinds of traced resource information. The mechanism performs global dynamic feedback tracing for every module and makes a preliminary optimization of the available resources; on this basis, it grades the preliminary available resources and produces a real-time candidate resource set. The validity of the method is tested through an example of gear production scheduling.

Journal ArticleDOI
TL;DR: By leveraging the inherent data structure of Hadoop HDFS, a query processing framework is designed which constructs a two-level index that first locates the target nodes for the desired data and then searches within each node for further combination.
Abstract: With the development of cloud computing and big data, the massive volume of datasets poses a big challenge for cloud data management systems. Unlike traditional database management methods, cloud data queries are typically parallel and distributed, and intuitively the query processing framework should embrace these characteristics. In this paper, by leveraging the inherent data structure of Hadoop HDFS, we design a query processing framework. Specifically, facilitated by the key-value structure of HDFS, we construct a two-level index that first locates the target nodes for the desired data and then searches within each node for further combination. For join operations, our query processing engine optimizes by checking whether the join key equals the index key; if not, a MapReduce-based join algorithm is called. In this way, our method reduces the cost of query processing. We also conduct experiments for empirical evaluation.
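
A toy sketch of the two-level lookup: a global index maps a key to a candidate node, and a per-node local index is then searched. The structure and names are illustrative, not the paper's actual implementation.

```python
# Level 1: global index routes a key to a node; level 2: that node's local
# index maps the key to a data block. Names and layout are hypothetical.
global_index = {"a-m": "node1", "n-z": "node2"}        # key range -> node
local_index = {
    "node1": {"apple": "block_3", "mango": "block_7"}, # key -> block
    "node2": {"peach": "block_2"},
}

def lookup(key):
    node = global_index["a-m" if key[0] <= "m" else "n-z"]
    return node, local_index[node].get(key)

print(lookup("mango"))   # ('node1', 'block_7')
print(lookup("peach"))   # ('node2', 'block_2')
```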

Journal ArticleDOI
TL;DR: The challenges confronted by mobile data mining, visualization challenges, and mobile device limitations are identified, and the possible ways to merge these two technologies with the available resources are specified.
Abstract: Rapidly improving technologies such as data mining and mobile technology need careful investigation in order to be merged. In this paper we identify the challenges confronted by mobile data mining, visualization challenges, and mobile device limitations. The paper introduces a comprehensive framework that explores how these two promising technologies can be combined, considering the challenges facing mobile data mining and all possible application scenarios. The paper further specifies the possible ways to merge the two technologies with the available resources.

Journal ArticleDOI
TL;DR: This study discusses some significant enhancements of the DBSCAN algorithm that tackle the problems of its worst-case time complexity and varied densities, and analyses all the enhancements against the original DBSCAN with respect to computational time and output.
Abstract: Clustering is the most used technique in data mining; it maximizes intra-cluster similarity and minimizes inter-cluster similarity. DBSCAN is the basic density-based clustering algorithm: a cluster is defined as a region of high density separated from regions that are less dense. The DBSCAN algorithm can discover clusters of arbitrary shape and size in large spatial databases. Despite its popularity, DBSCAN has drawbacks: its worst-case time complexity reaches O(n²); it cannot deal with varied densities; and it is hard to know the initial values of the input parameters. In this study, we discuss some significant enhancements of the DBSCAN algorithm that tackle these problems, and we analyse all the enhancements against the original DBSCAN with respect to computational time and output. The majority of the variations adopt hybrid techniques and use partitioning to overcome the limitations of DBSCAN. Some perform better, and each has its own usefulness and characteristics.

Journal ArticleDOI
TL;DR: A MapReduce algorithm for parallel and distributed selection and classification of high-dimensional gene features is proposed, aiming to save time while achieving high accuracy in the training process on large-scale gene datasets.
Abstract: In large-scale applications of high-dimensional gene expression data, which contain much redundant information, a great deal of time can be wasted on feature selection and classification. By analyzing the MapReduce computing paradigm on a cloud platform, we find that parallel and distributed feature selection in MapReduce, combined with an extreme learning machine, is appropriate for constructing a recognition method. This paper proposes a MapReduce algorithm for parallel and distributed selection and classification of high-dimensional gene features, aiming to save time while achieving high accuracy in the training process on large-scale gene datasets. Simulation experiments on gene datasets show that the running time on the cloud platform is greatly shortened while high classification accuracy is maintained.

Journal ArticleDOI
TL;DR: In this paper, a new approach for steganography in various financial statements is proposed, where additional zeroes can be added before a number and also after the fractional part of a number without changing the value of the number.
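
The trick in the TL;DR can be illustrated as follows: padding a leading zero leaves the numeric value unchanged but carries one hidden bit per amount. The encoding scheme below is our illustration; the paper's may differ.

```python
# Toy zero-padding steganography: a leading zero encodes bit 1, its absence
# encodes bit 0; the numeric value of every amount is unchanged.
def embed_bits(amounts, bits):
    return [("0" + a) if b else a for a, b in zip(amounts, bits)]

def extract_bits(amounts):
    return [1 if a.startswith("0") and a[1] != "." else 0 for a in amounts]

stego = embed_bits(["7.50", "120.00", "3.25"], [1, 0, 1])
print(stego)                 # ['07.50', '120.00', '03.25']
print(extract_bits(stego))   # [1, 0, 1]
```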

Journal ArticleDOI
TL;DR: It is observed that automobile insurer organizations face a new challenge characterized by increased competition, increased requirements on automobile insurance quality, and an increasing emphasis on time-to-market.
Abstract: Over the last decade, it has been observed that automobile insurer organizations face a new challenge characterized by increased competition, increased requirements on automobile insurance quality, and an increasing emphasis on time-to-market. Furthermore, knowledge regarding what customers think, what they want, and how to serve them is quite useful for insurance organizations wishing to generate suitable strategies in competitive markets.


Journal ArticleDOI
TL;DR: This paper proposes a method to enforce access control rules by encrypting the original table and using keys to distinguish the various rights; it also implements validation-only rights for ODBs, obliging them to fulfill any update validation without knowing the original data.
Abstract: With the growing popularity of outsourced databases (ODBs), access control for multiple users with different privileges in outsourced environments is required in more and more applications. Under the assumption that ODBs may be interested in the original data values, or may delay update operations when end users cannot verify the results, this paper attempts to enhance ODBs with fine-grained access control for multiple users with little impact on their other functionalities. Our work can be divided into two parts. In the first part, we propose a method to enforce access control rules by encrypting the original table and using keys to distinguish the various rights. In addition to read/non-read rights, read/update rights can be distinguished in our encrypted table. We also implement validation-only rights for ODBs and oblige them to fulfill any update validation without knowing the original data. In the second part, we study query evaluation over the encrypted table. Two kinds of B+-tree indexes on each column are designed, which can accelerate selection in ODBs.
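
A minimal sketch of distinguishing rights with keys, in the spirit of the first part: values are encrypted under a read key, while a separate HMAC key lets the server validate an update without ever decrypting it. The paper's actual construction (including the B+-tree indexes) is more elaborate.

```python
# Key-based rights separation: holders of read_key can decrypt; the ODB
# server holds only validate_key, so it can check update integrity without
# learning the plaintext. Construction is illustrative, not the paper's.
import hmac, hashlib
from cryptography.fernet import Fernet

read_key = Fernet.generate_key()         # held by users with read rights
validate_key = b"server-validation-key"  # held by the ODB server only

def store(value: bytes):
    ct = Fernet(read_key).encrypt(value)
    tag = hmac.new(validate_key, ct, hashlib.sha256).hexdigest()
    return ct, tag

def server_validate(ct, tag):            # validation-only right
    expected = hmac.new(validate_key, ct, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

ct, tag = store(b"salary=5000")
print(server_validate(ct, tag))          # True (no plaintext seen)
print(Fernet(read_key).decrypt(ct))      # b'salary=5000'
```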

Journal ArticleDOI
TL;DR: To improve the accuracy of memory-based recommendation while keeping the time cost low, an expected item bias (EIA) based similarity computation and a hybrid approach (HA) integrating global and local rating information are proposed.
Abstract: To improve the accuracy of memory-based recommendation while keeping the time cost low, an expected item bias (EIA) based similarity computation is proposed, along with a hybrid approach (HA) integrating global rating information and local rating information. The features of MovieLens and Netflix, two classical datasets for recommendation-system benchmarking, are analyzed. Experiments on MovieLens and Netflix show that both EIA and HA improve performance on their own, and that using them in combination leads to even better results on the two benchmark datasets.
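
The abstract gives no formula; one hypothetical reading of an item-bias-aware similarity, subtracting each item's mean rating (its expected bias) before computing user-user cosine similarity, is sketched below.

```python
# Hypothetical item-bias-adjusted similarity: de-bias ratings by item means,
# then compare users with cosine similarity. The paper's actual EIA formula
# is not given in the abstract.
import numpy as np

R = np.array([[5.0, 3.0, 4.0],     # users x items rating matrix
              [4.0, 2.0, 3.0],
              [1.0, 5.0, 2.0]])

item_bias = R.mean(axis=0)          # expected rating per item
R_adj = R - item_bias               # de-biased ratings

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cos(R_adj[0], R_adj[1]), 3))  # similar users -> positive
print(round(cos(R_adj[0], R_adj[2]), 3))  # dissimilar users -> negative
```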

Journal ArticleDOI
TL;DR: A new string dictionary index SB-trie is proposed, which is not only succinct on space, but also has good locality of reference, making it I/O efficient in external memory environment and consumes less space and has greater searching performance in disk environment.
Abstract: With the coming of the big data age, more and more string dictionaries need to be processed. The existing string dictionary indexes are either too space-consuming, or lack of locality of reference, making them inapplicable in the external memory environment. Targeted with these problems, first we design a new succinct representation of Patricia trie using LOUDS encoding. Then applying it to external memory indexing problem, we propose a new string dictionary index SB-trie, which is not only succinct on space, but also has good locality of reference, making it I/O efficient in external memory environment. Experiments show that SB-trie consumes less space and has greater searching performance in disk environment.