Book Chapter · DOI

Hybrid Paradigm to Establish Accurate Results for Null Value Problem Using Rough–Neural Network Model

01 Jan 2017, pp. 349–362
TL;DR: This paper presents a hybrid approach to the null value problem using rough set theory and a neural network; the system produces results with better efficiency, as measured by accuracy, completeness, and coverage.
Abstract: Systems in which missing (NULL) values occur are called incomplete information systems, and computations on them may lead to biased conclusions. The structural differences between datasets and the varying importance of attributes compel us to rely on uncertainty-based approaches for estimating null values. This paper presents a hybrid approach to the null value problem that combines rough set theory with a neural network. The complete tuple set is used to train the neural network, and the incomplete tuples are then evaluated with the trained model. The level of dependency is used to judge the importance of association rules [11]. Testing the dataset after removing unwanted attributes yields a reduced error percentage. The system produces results with better efficiency, as observed in the values of accuracy, completeness, and coverage; the proposed algorithm can thus be adapted, step by step, to different scenarios involving the null value problem.
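As a rough illustration of the train-on-complete / test-on-incomplete step described above, the sketch below fills nulls in a single attribute with a neural network trained on the complete tuples. The DataFrame, the column names, and the restriction of nulls to one target column are assumptions of this sketch; the rough-set dependency analysis and attribute reduction are not reproduced here.

```python
# Minimal sketch: train on complete tuples, predict the nulls.
# Assumes a numeric DataFrame whose only NaNs sit in `target`.
import pandas as pd
from sklearn.neural_network import MLPRegressor

def impute_column(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill NaN entries of `target` using an MLP trained on complete rows."""
    complete = df.dropna()                    # complete tuple set -> training data
    missing = df[df[target].isna()]           # incomplete tuples -> test data
    features = [c for c in df.columns if c != target]

    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    net.fit(complete[features], complete[target])

    out = df.copy()
    out.loc[missing.index, target] = net.predict(missing[features])
    return out
```

For a categorical target, MLPClassifier would take the regressor's place, and attribute reduction would shrink `features` before training.
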
References
Journal Article · DOI
TL;DR: In this comparative study, missForest outperforms other imputation methods, especially in data settings where complex interactions and non-linear relations are suspected, and its out-of-bag imputation error estimates prove adequate in all settings.
Abstract:
Motivation: Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately; these methods therefore ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously.
Results: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, a random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets from a diverse selection of biological fields, with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other imputation methods, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
Availability: The package missForest is freely available from http://stat.ethz.ch/CRAN/.
Contact: stekhoven@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch
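missForest itself ships as an R package; a rough Python analogue of its column-by-column random-forest scheme can be assembled from scikit-learn's experimental IterativeImputer. The toy matrix is illustrative, and this sketch handles numeric variables only, not the mixed-type case the paper targets.

```python
# Iterative random-forest imputation in the spirit of missForest:
# each column with missing values is regressed on the others in turn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)  # NaNs replaced by random-forest predictions
```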

2,928 citations

01 Jan 2010
TL;DR: Gain ratio and correlation-based feature selection (CFS) are used to illustrate the significance of feature subset selection for classifying the Pima Indian diabetes database (PIDD); results show that the feature subsets selected by the CFS filter yield a marginal improvement in classification accuracy for both the back-propagation neural network and the radial basis function network.
Abstract: Feature subset selection is of great importance in the field of data mining, since high-dimensional data makes the training and testing of general classification methods difficult. In this paper, two filter approaches, namely gain ratio and correlation-based feature selection (CFS), are used to illustrate the significance of feature subset selection for classifying the Pima Indian diabetes database (PIDD). The C4.5 tree uses gain ratio to determine the splits and to select the most important features. A genetic algorithm is used as the search method, with CFS as the subset-evaluating mechanism. The resulting feature subset is then tested using two supervised classification methods, namely a back-propagation neural network and a radial basis function network. Experimental results show that the feature subsets selected by the CFS filter yield a marginal improvement in classification accuracy for both networks when compared to the feature subset selected by the information gain filter.
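scikit-learn provides neither gain ratio nor CFS, so the sketch below substitutes mutual information as the information-theoretic filter ahead of a neural network classifier, and the built-in breast cancer dataset stands in for PIDD; it illustrates the filter-then-classify pipeline rather than reproducing the paper's setup.

```python
# Filter-style feature selection followed by a neural network classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=8),  # keep the 8 highest-scoring features
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
print(cross_val_score(clf, X, y, cv=5).mean())  # accuracy with the reduced subset
```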

361 citations

Journal Article · DOI
TL;DR: A number of methods for estimating the standard error of predicted values from a multilayer perceptron are discussed, including the delta method based on the Hessian, bootstrap estimators, and the sandwich estimator.
Abstract: We discuss a number of methods for estimating the standard error of predicted values from a multilayer perceptron. These methods include the delta method based on the Hessian, bootstrap estimators, and the “sandwich” estimator. The methods are described and compared in a number of examples. We find that the bootstrap methods perform best, partly because they capture variability due to the choice of starting weights.
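A minimal sketch of the bootstrap estimator the paper favours, on synthetic regression data: refit the network on resampled training sets and take the spread of the resulting predictions as the standard error. Refitting with a different seed also re-randomizes the starting weights, which is precisely the variability the bootstrap is credited with capturing.

```python
# Bootstrap standard errors for MLP predictions at chosen input points.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)
X_new = np.array([[0.5], [2.0]])  # points where we want error bars

preds = []
for b in range(50):  # 50 bootstrap replicates, kept small for speed
    Xb, yb = resample(X, y, random_state=b)  # sample n rows with replacement
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=b)
    net.fit(Xb, yb)  # new seed -> new starting weights as well as new data
    preds.append(net.predict(X_new))

se = np.std(preds, axis=0, ddof=1)  # bootstrap standard error per prediction
print(se)
```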

287 citations

Book Chapter · DOI
01 Jul 2003
TL;DR: This paper proposes a new rough sets model and redefines core attributes and reducts in terms of relational algebra, in order to take advantage of highly efficient set-oriented database operations.
Abstract: Rough set theory was proposed by Pawlak in the early 1980s and has been applied successfully in many domains. One of the major limitations of the traditional rough set model in real applications is the inefficiency of computing core attributes and reducts, because all of the computationally intensive operations are performed on flat files. To improve this, many novel approaches have been developed, some of which attempt to integrate database technologies. In this paper, we propose a new rough set model and redefine core attributes and reducts in terms of relational algebra, to take advantage of highly efficient set-oriented database operations. With this new model and these new definitions, we present two new algorithms to calculate core attributes and reducts for feature selection. Since relational algebra operations are efficiently implemented in most widely used database systems, the algorithms presented in this paper can be applied to these systems and adapted to a wide range of real-life applications with very large data sets. Compared with traditional rough set models, our model is efficient and scalable.
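The set-oriented idea can be sketched with pandas group-by operations, which play the role of relational projection and aggregation: the degree of dependency gamma(C, D) is the fraction of tuples whose condition-attribute values determine the decision uniquely, and an attribute belongs to the core when removing it lowers that degree. The toy table and column names are illustrative.

```python
# Dependency degree and core attributes via set-oriented (group-by) operations.
import pandas as pd

def gamma(df: pd.DataFrame, cond: list, dec: str) -> float:
    """Fraction of tuples in the positive region of dec given cond."""
    consistent = df.groupby(cond)[dec].transform("nunique") == 1
    return consistent.mean()

def core(df: pd.DataFrame, cond: list, dec: str) -> list:
    """Attributes whose removal lowers the dependency degree."""
    full = gamma(df, cond, dec)
    return [a for a in cond if gamma(df, [c for c in cond if c != a], dec) < full]

df = pd.DataFrame({
    "headache": [1, 1, 0, 0],
    "temp": ["high", "low", "high", "low"],
    "flu": ["yes", "no", "yes", "no"],
})
print(core(df, ["headache", "temp"], "flu"))  # -> ['temp']
```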

136 citations

Journal Article · DOI
TL;DR: Two ways of dealing with incomplete data, network reduction using multiple neural network classifiers and value substitution using estimated values from predictor networks, are investigated; the network reduction method is found to be superior.
Abstract: Backpropagation neural networks have been applied to prediction and classification problems in many real-world situations. However, a drawback of this type of neural network is that it requires a full set of input data, and real-world data is seldom complete. We investigated two ways of dealing with incomplete data, network reduction using multiple neural network classifiers and value substitution using estimated values from predictor networks, and compared their performance with an induction method. On a thyroid disease database collected in a clinical setting, we found that the network reduction method was superior. We conclude that network reduction can be a useful method for dealing with missing values in diagnostic systems based on backpropagation neural networks.
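A minimal sketch of the network-reduction idea, assuming two features and NaN-coded missingness: train one classifier per feature subset on the complete data and, at prediction time, route each tuple to the network whose inputs are all observed.

```python
# Network reduction: one classifier per feature subset, chosen at test time.
import numpy as np
from itertools import combinations
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic two-feature task

nets = {}
for r in (1, 2):  # every non-empty subset of the two features
    for subset in combinations(range(X.shape[1]), r):
        net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        nets[subset] = net.fit(X[:, subset], y)

def predict(x):
    """Route x to the network trained on exactly its observed features."""
    observed = tuple(np.flatnonzero(~np.isnan(x)))
    return int(nets[observed].predict(x[list(observed)].reshape(1, -1))[0])

print(predict(np.array([0.7, np.nan])))  # falls back to the feature-0 network
```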

108 citations