
Showing papers on "Cross-validation" published in 2016


Journal ArticleDOI
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.
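A minimal sketch of the "honest" splitting idea only, not the authors' causal-tree algorithm: one half of the data builds a partition of the covariate space, and a disjoint half estimates the treatment effect within each cell. The synthetic data, the transformed-outcome regression tree used to form the partition, and all variable names are illustrative assumptions.

```python
# Honest estimation sketch: one sample builds the partition, a disjoint
# sample estimates treatment effects within each leaf. Synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p = 4000, 5
X = rng.normal(size=(n, p))
w = rng.integers(0, 2, size=n)                    # randomized treatment
tau = 1.0 * (X[:, 0] > 0)                         # heterogeneous effect
y = X[:, 1] + tau * w + rng.normal(size=n)

# Split into a partition-building sample and an estimation sample.
X_tr, X_est, w_tr, w_est, y_tr, y_est = train_test_split(
    X, w, y, test_size=0.5, random_state=0)

# Crude proxy for an effect-based partition: regress a transformed outcome
# (inverse-propensity style, known propensity 0.5) on covariates with a
# shallow tree, so leaves group units with similar treatment effects.
z_tr = y_tr * (w_tr - 0.5) / 0.25
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200).fit(X_tr, z_tr)

# Honest step: estimate the effect in each leaf using only the held-out
# estimation sample (difference of treated and control means).
leaves = tree.apply(X_est)
for leaf in np.unique(leaves):
    m = leaves == leaf
    effect = y_est[m & (w_est == 1)].mean() - y_est[m & (w_est == 0)].mean()
    print(f"leaf {leaf}: estimated effect = {effect:.2f}")
```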

913 citations


Proceedings ArticleDOI
01 Feb 2016
TL;DR: Results show that, up to a certain threshold, k-fold cross-validation with the value of k varied according to the number of instances can indeed be used instead of hold-out validation for quality classification.
Abstract: While training a model with data from a dataset, we have to think of an ideal way to do so. The training should be done in such a way that the model has enough instances to train on without over-fitting, while also bearing in mind that a model trained on too few instances will not be trained properly and will give poor results when used for testing. Accuracy is important in classification, and one must always strive for the highest accuracy, provided there is no trade-off in the form of unacceptable running time. When working on small datasets, the ideal choices are k-fold cross-validation with a large value of k (but smaller than the number of instances) or leave-one-out cross-validation, whereas when working on colossal datasets, the first thought is generally to use hold-out validation. This article studies the differences between the two validation schemes and analyzes the possibility of using k-fold cross-validation instead of hold-out validation even on large datasets. Experimentation was performed on four large datasets, and the results show that, up to a certain threshold, k-fold cross-validation with the value of k varied according to the number of instances can indeed be used instead of hold-out validation for quality classification.
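A small sketch contrasting the two validation schemes discussed above on the same classifier. The public breast-cancer dataset and the logistic-regression model are illustrative stand-ins; the split ratio and values of k are assumptions, not the paper's settings.

```python
# Hold-out validation vs. k-fold cross-validation on the same classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Hold-out: a single 70/30 split gives one accuracy estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: every instance is used for both training and testing across folds.
for k in (5, 10, 20):
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"{k:>2}-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
print(f"hold-out accuracy:  {holdout_acc:.3f}")
```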

401 citations


Journal ArticleDOI
TL;DR: A new emotion recognition system based on facial expression images is proposed; it achieved an overall accuracy of 96.77±0.10% and is superior to three state-of-the-art methods.
Abstract: Emotion recognition reflects the position and motion of facial muscles. It contributes significantly to many fields, yet current approaches have not obtained good results. This paper proposes a new emotion recognition system based on facial expression images. We enrolled 20 subjects and had each subject pose seven different emotions: happiness, sadness, surprise, anger, disgust, fear, and neutral. Afterward, we employed biorthogonal wavelet entropy to extract multiscale features and used a fuzzy multiclass support vector machine as the classifier. Stratified cross-validation was employed as a strict validation scheme. The statistical analysis showed that our method achieved an overall accuracy of 96.77±0.10% and is superior to three state-of-the-art methods. In all, the proposed method is efficient.
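A sketch of the validation setup only: stratified k-fold cross-validation for a 7-class problem with 20 subjects. A standard RBF SVC stands in for the paper's fuzzy multiclass SVM, random vectors stand in for the biorthogonal wavelet entropy features, and the feature dimension and fold count are assumptions.

```python
# Stratified cross-validation of a multiclass SVM on placeholder features.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, n_emotions, n_features = 20, 7, 32
X = rng.normal(size=(n_subjects * n_emotions, n_features))   # placeholder features
y = np.tile(np.arange(n_emotions), n_subjects)               # 7 emotion labels

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"stratified 10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Stratification keeps the proportion of each emotion label roughly constant across folds, which matters when classes are small and balanced as here.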

242 citations


Journal ArticleDOI
01 Jan 2016-Geoderma
TL;DR: In this article, a case study on soil organic carbon mapping across a 50,810 km² area in northwestern China was conducted. The authors compared the quality of the maps obtained by GWR and GWRR on the one hand and multiple linear regression (MLR) on the other, and concluded that fitting regression coefficients locally, as in GWR, only paid off when no spatial random effect was included in the model.

134 citations



Journal ArticleDOI
27 May 2016-PLOS ONE
TL;DR: Generalized additive models and support vector machines performed well as risk prediction models for postoperative sepsis and AKI, while feature extraction using principal component analysis improved the performance of all models.
Abstract: Objective: To compare the performance of risk prediction models for forecasting postoperative sepsis and acute kidney injury. Design: Retrospective single-center cohort study of adult surgical patients admitted between 2000 and 2010. Patients: 50,318 adult patients undergoing major surgery. Measurements: We evaluated the performance of logistic regression, generalized additive models, naive Bayes and support vector machines for forecasting postoperative sepsis and acute kidney injury. We assessed the impact of feature reduction techniques on predictive performance. Model performance was determined using the area under the receiver operating characteristic curve, accuracy, and positive predictive value. The results were reported based on a 70/30 validation procedure in which the data were randomly split into 70% used for training the model and 30% used for validation. Main results: The areas under the receiver operating characteristic curve for the different models ranged between 0.797 and 0.858 for acute kidney injury and between 0.757 and 0.909 for severe sepsis. Logistic regression, generalized additive models, and support vector machines had better performance compared to the naive Bayes model. Generalized additive models additionally accounted for non-linearity of continuous clinical variables, as depicted in their risk pattern plots. Reducing the input feature space with LASSO had minimal effect on prediction performance, while feature extraction using principal component analysis improved the performance of the models. Conclusions: Generalized additive models and support vector machines had good performance as risk prediction models for postoperative sepsis and AKI. Feature extraction using principal component analysis improved the predictive performance of all models.
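A hedged sketch of the evaluation protocol described above: a random 70/30 split, PCA for feature extraction, and AUC as the performance measure. The public breast-cancer dataset, the logistic-regression model, and the number of components are assumptions standing in for the surgical cohort and clinical models.

```python
# 70/30 hold-out evaluation with PCA feature extraction and AUC scoring.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"hold-out AUC with PCA features: {auc:.3f}")
```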

125 citations


Journal ArticleDOI
TL;DR: Li et al., as discussed by the authors, proposed a nested linear mixed effects (LME) model including nested month-, week-, and day-specific random effects of the PM2.5-AOD relationship.

124 citations


Journal ArticleDOI
TL;DR: It is advocated that the RVM model can be employed as a promising machine learning tool for the prediction of evaporative loss.
Abstract: The forecasting of evaporative loss (E) is vital for water resource management and understanding of hydrological processes for farming practices, ecosystem management and hydrologic engineering. This study developed three machine learning algorithms, namely the relevance vector machine (RVM), extreme learning machine (ELM) and multivariate adaptive regression spline (MARS), for the prediction of E using five predictor variables: incident solar radiation (S), maximum temperature (Tmax), minimum temperature (Tmin), atmospheric vapor pressure (VP) and precipitation (P). The RVM model is based on the Bayesian formulation of a linear model with an appropriate prior that results in sparse representations. The ELM model is a computationally efficient algorithm based on a single-layer feedforward neural network with hidden neurons that randomly chooses input weights, and the MARS model is built on a flexible regression algorithm that generally divides the solution space into intervals of predictor variables and fits splines (basis functions) to each interval. Using a random sampling process, the predictor data were partitioned into a training phase (70% of the data) and a testing phase (the remaining 30%). The equations for the prediction of monthly E were formulated. The RVM model was devised using the radial basis function; the ELM model comprised 5 inputs and 10 hidden neurons and used the radial basis activation function; and the MARS model utilized 15 basis functions. The decomposition of variance among the predictor dataset of the MARS model yielded the largest magnitude of the Generalized Cross Validation statistic (≈0.03) when Tmax was used as an input, followed by relatively lower values (≈0.028, 0.019) for inputs defined by S and VP. This confirmed that the prediction of E drew the largest contribution of predictive features from Tmax, as verified emphatically by a sensitivity analysis test. The model performance statistics yielded correlation coefficients of 0.979 (RVM), 0.977 (ELM) and 0.974 (MARS), root-mean-square errors of 9.306, 9.714 and 10.457, and mean absolute errors of 0.034, 0.035 and 0.038. Despite the small differences in overall prediction skill, the RVM model appeared to be more accurate in the prediction of E. It is therefore advocated that the RVM model can be employed as a promising machine learning tool for the prediction of evaporative loss.

121 citations


Journal ArticleDOI
TL;DR: It is indicated that model prediction performance should be assessed by accounting for monitor clustering, because model accuracy in spatial prediction can be misinterpreted when validation monitors are randomly selected.
Abstract: The accuracy of estimated fine particulate matter concentrations (PM2.5), obtained by fusing station-based measurements and satellite-based aerosol optical depth (AOD), is often reduced without accounting for the spatial and temporal variations in PM2.5 and missing AOD observations. In this study, a city-specific linear regression model was first developed to fill in missing AOD data. A novel interpolation-based variable, the PM2.5 spatial interpolator (PMSI2.5), was also introduced to account for the spatial dependence in PM2.5 across grid cells. A Bayesian hierarchical model was then developed to estimate spatiotemporal relationships between AOD and PM2.5. These methods were evaluated through a city-specific 10-fold cross-validation procedure in a case study in North China in 2014. The cross-validation R² was 0.61 when PMSI2.5 was included and 0.48 when PMSI2.5 was excluded. The gap-filled AOD values also effectively improved predicted PM2.5 concentrations, with an R² of 0.78. Daily ground-level PM2.5 concentration fields at a 12 km resolution were predicted with complete spatial and temporal coverage. This study also indicates that model prediction performance should be assessed by accounting for monitor clustering, because model accuracy in spatial prediction can be misinterpreted when validation monitors are randomly selected.
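A sketch of a grouped 10-fold cross-validated R² in the spirit of the site-aware validation argued for above: whole monitors are held out together rather than random observations. The synthetic AOD/PMSI2.5 data, the number of sites, and the random-forest stand-in for the Bayesian hierarchical model are all assumptions.

```python
# Grouped 10-fold CV R^2: hold out entire monitoring sites per fold.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)
n = 2000
aod = rng.gamma(2.0, 0.3, size=n)                 # gap-filled AOD (hypothetical)
pmsi = rng.normal(size=n)                         # spatial interpolator term
site = rng.integers(0, 40, size=n)                # monitor / city identifier
pm25 = 30 * aod + 5 * pmsi + rng.normal(0, 5, size=n)

X = np.column_stack([aod, pmsi])
cv = GroupKFold(n_splits=10)
pred = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0),
                         X, pm25, cv=cv, groups=site)
print(f"grouped 10-fold CV R^2: {r2_score(pm25, pred):.2f}")
```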

120 citations


Journal ArticleDOI
TL;DR: A general, fully learning-based framework for direct bi-ventricular volume estimation is presented; it removes user inputs and unreliable assumptions and largely outperforms existing direct methods on a larger dataset of 100 subjects, including both healthy and diseased cases, with twice the number of subjects used in previous methods.

115 citations


Journal ArticleDOI
TL;DR: A generalized regression neural network with a K-fold cross-validation method is proposed for predicting the displacement of landslides, using Pearson cross-correlation coefficients (PCC) and mutual information (MI).

Journal ArticleDOI
TL;DR: It is found that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations, and this relationship is illustrated using simulations and a collection of data sets from mice, wheat and human genetics.
Abstract: The prediction of phenotypic traits using high-density genomic data has many applications such as the selection of plants and animals of commercial interest; and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we propose an approach based on clustering and resampling to investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Therefore, estimating how quickly predictive accuracy decays is important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics.

Journal ArticleDOI
TL;DR: In this paper, three heuristic regression techniques, least square support vector regression (LSSVR), multivariate adaptive regression splines (MARS) and M5 Model Tree (M5-Tree), are investigated for forecasting and predicting of monthly streamflows.
Abstract: Streamflow forecasting and prediction are significant concerns for several applications of water resources and management, including flood management, determination of river water potential, environmental flow analysis, and agriculture and hydro-power generation. Forecasting and prediction of monthly streamflows are investigated using three heuristic regression techniques: least square support vector regression (LSSVR), multivariate adaptive regression splines (MARS) and the M5 Model Tree (M5-Tree). Data from four different stations, Besiri and Malabadi located in Turkey and Hit and Baghdad located in Iraq, are used in the analysis. The cross-validation method is employed in the applications. In the first stage of the study, the heuristic regression models are compared with each other and with multiple linear regression (MLR) in forecasting one-month-ahead streamflow of each station individually. In the second stage, the models are evaluated and compared in predicting the streamflow of one station using data from a nearby station. The research also investigated the influence of the periodicity component (month number of the year) as an external sub-set in modeling long-term streamflow. In both stages, the comparison results indicate that the LSSVR model generally performs superior to the MARS, M5-Tree and MLR models. In addition, it is seen that adding periodicity as an input to the models significantly increases their accuracy in forecasting and predicting monthly streamflows in both stages of the study.

Journal ArticleDOI
TL;DR: In this article, quantile regression forests (an elaboration of random forests) are used to investigate the potential of high resolution auxiliary information alone to support the generation of accurate and interpretable geochemical maps.

Proceedings ArticleDOI
15 Jul 2016
TL;DR: In this article, a multilayer perceptron neural network was used to predict rice production yield and investigate the factors affecting the rice crop yield for various districts of Maharashtra state in India.
Abstract: Rice crop production contributes to the food security of India, accounting for more than 40% of overall crop production. Its production is reliant on favorable climatic conditions, and variability from season to season is detrimental to farmers' income and livelihoods. Improving the ability of farmers to predict crop productivity under different climatic scenarios can assist farmers and other stakeholders in making important decisions in terms of agronomy and crop choice. This study aimed to use neural networks to predict rice production yield and investigate the factors affecting the rice crop yield for various districts of Maharashtra state in India. Data were sourced from publicly available Indian Government records for 27 districts of Maharashtra state, India. The parameters considered for the present study were precipitation, minimum temperature, average temperature, maximum temperature, reference crop evapotranspiration, area, production and yield for the Kharif season (June to November) for the years 1998 to 2002. The dataset was processed using the WEKA tool, and a multilayer perceptron neural network was developed. The cross-validation method was used to validate the data. The results showed an accuracy of 97.5%, with a sensitivity of 96.3 and a specificity of 98.1. Further, the mean absolute error, root mean squared error, relative absolute error and root relative squared error were calculated for the present study. The study dataset was also executed using the Knowledge Flow of the WEKA tool, and the performance of the classifier is visually summarized using an ROC curve.
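A sketch of the same validation pattern with scikit-learn standing in for WEKA: a multilayer perceptron evaluated by 10-fold cross-validation. The public wine dataset, the hidden-layer size and the fold count are assumptions, since the district-level crop data are not bundled here.

```python
# Multilayer perceptron evaluated with 10-fold cross-validation.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                                  random_state=0))
scores = cross_val_score(mlp, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```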

Journal ArticleDOI
TL;DR: SVM and SGB models explained in the current study could yield remarkable predictive performance in the classification of ischemic stroke.

Journal ArticleDOI
TL;DR: In this article, a model-free criterion ESCV based on a new estimation stability (ES) metric and cross-validation is proposed to find a smaller and locally ES-optimal model smaller than the CV choice.
Abstract: Cross-validation (CV) is often used to select the regularization parameter in high-dimensional problems. However, when applied to the sparse modeling method Lasso, CV leads to models that are unstable in high-dimensions, and consequently not suited for reliable interpretation. In this article, we propose a model-free criterion ESCV based on a new estimation stability (ES) metric and CV. Our proposed ESCV finds a smaller and locally ES-optimal model smaller than the CV choice so that it fits the data and also enjoys estimation stability property. We demonstrate that ESCV is an effective alternative to CV at a similar easily parallelizable computational cost. In particular, we compare the two approaches with respect to several performance measures when applied to the Lasso on both simulated and real datasets. For dependent predictors common in practice, our main finding is that ESCV cuts down false positive rates often by a large margin, while sacrificing little of true positive rates. ESCV usually outperfo...
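A rough sketch of an estimation-stability-style metric for the Lasso, following a simplified reading of the ESCV idea: fitted values produced by fold-wise Lasso solutions should agree with each other. The data, the penalty grid, and the selection rule (picking the ES-minimizing penalty, without the restriction relative to the CV-selected model) are simplifying assumptions, not the authors' implementation.

```python
# Estimation-stability (ES) style metric across fold-wise Lasso fits.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)

alphas = np.logspace(-2, 0, 20)
cv = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

es = []
for a in alphas:
    # Fit the Lasso on each fold's training part, predict on the full X.
    preds = np.array([Lasso(alpha=a, max_iter=10000)
                      .fit(X[tr], y[tr]).predict(X) for tr, _ in cv])
    mean_pred = preds.mean(axis=0)
    denom = np.sum(mean_pred ** 2) + 1e-12
    # ES: average squared deviation of fold-wise fits from their mean,
    # normalized by the size of the mean fit.
    es.append(np.mean(np.sum((preds - mean_pred) ** 2, axis=1)) / denom)

best = alphas[int(np.argmin(es))]
print(f"alpha minimizing the ES metric: {best:.3f}")
```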

Journal ArticleDOI
TL;DR: In this paper, a frequentist model averaging method based on the leave-subject-out cross-validation was developed for averaging longitudinal data models and time series models which can have heteroscedastic errors.

Journal ArticleDOI
TL;DR: In this paper, a nonlinear autoregressive (exogenous) model for one-day-ahead mean hourly wind speed forecasting, where general regression neural network is employed to model nonlinearities of the system, is presented.

Journal ArticleDOI
TL;DR: An open-access "Double Cross-Validation (DCV)" software tool is presented that can be used to perform multiple linear regression (MLR) model development by employing the double cross-validation technique; two variable selection methods are incorporated in this tool.

Journal ArticleDOI
TL;DR: This study finds that breeding programs seeking efficient genomic selection in soybeans would best allocate resources by investing in a representative training set, and the most robust prediction model was the combination of reproducing kernel Hilbert space regression and BayesB.
Abstract: Many economically important traits in plant breeding have low heritability or are difficult to measure. For these traits, genomic selection has attractive features and may boost genetic gains. Our goal was to evaluate alternative scenarios to implement genomic selection for yield components in soybean (Glycine max (L.) Merr.). We used a nested association panel with cross-validation to evaluate the impacts of training population size, genotyping density, and prediction model on the accuracy of genomic prediction. Our results indicate that training population size was the factor most relevant to improvement in genome-wide prediction, with the greatest improvement observed in training sets of up to 2000 individuals. We discuss assumptions that influence the choice of the prediction model. Although alternative models had minor impacts on prediction accuracy, the most robust prediction model was the combination of reproducing kernel Hilbert space regression and BayesB. Higher genotyping density marginally improved accuracy. Our study finds that breeding programs seeking efficient genomic selection in soybeans would best allocate resources by investing in a representative training set.

Journal ArticleDOI
TL;DR: In this article, a support vector machine (SVM) model was developed to predict nitrate concentration in groundwater of Arak plain, Iran and the associated parameters for the optimum SVM model were obtained using a combination of 4-fold cross-validation and grid search technique.
Abstract: In this paper, a support vector machine (SVM) model was developed to predict nitrate concentration in groundwater of Arak plain, Iran. The model provided a tool for prediction of nitrate concentration using a set of easily measurable groundwater quality variables including water temperature, electrical conductivity, groundwater depth, total dissolved solids, dissolved oxygen, pH, land use, and season of the year as input variables. The data set comprised of 160 water samples representing 40 different wells monitored for 1 year. The associated parameters for the optimum SVM model were obtained using a combination of 4-fold cross-validation and grid search technique. The optimum model was used to predict nitrate concentration in Arak plain aquifer. The SVM model predicted nitrate concentration in training and test stage data sets with reasonably high correlation (0.92 and 0.87, respectively) with the measured values and low root mean squared errors of 0.086 and 0.111, respectively. Finally, the map of nitrate concentration in groundwater was prepared for all four seasons using the trained SVM model and a geographic information system (GIS) interpolation scheme and compared with the results with a physics-based (flow and contaminant) model. Overall, the results showed that SVM model could be used as a fast, reliable, and cost-effective method for assessment and predicting groundwater quality.
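A sketch of the tuning strategy described above: a support vector regressor whose hyperparameters are chosen by grid search combined with 4-fold cross-validation. The synthetic data, the parameter grid and the R² scoring are assumptions standing in for the groundwater-quality measurements and the paper's settings.

```python
# SVR tuned with grid search over C, gamma and epsilon using 4-fold CV.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=160, n_features=8, noise=10.0, random_state=0)

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1, 1.0],
              "svr__epsilon": [0.05, 0.1, 0.5]}
search = GridSearchCV(pipe, param_grid, cv=4, scoring="r2")
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best 4-fold CV R^2: {search.best_score_:.3f}")
```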

Journal ArticleDOI
30 Jun 2016
TL;DR: A hybrid intrusion detection model is proposed by integrating principal component analysis (PCA) and a support vector machine (SVM) with an automatic parameter selection technique; it performs better, with higher accuracy, faster convergence speed and better generalization.
Abstract: Intrusion detection is very essential for providing security to different network domains and is mostly used for locating and tracing intruders. There are many problems with traditional intrusion detection models (IDS), such as low detection capability against unknown network attacks, a high false alarm rate and insufficient analysis capability. Hence the major scope of research in this domain is to develop an intrusion detection model with improved accuracy and reduced training time. This paper proposes a hybrid intrusion detection model by integrating principal component analysis (PCA) and a support vector machine (SVM). The novelty of the paper is the optimization of the kernel parameters of the SVM classifier using an automatic parameter selection technique. This technique optimizes the punishment factor (C) and kernel parameter gamma (γ), thereby improving the accuracy of the classifier and reducing the training and testing time. The experimental results obtained on the NSL-KDD and gurekddcup datasets show that the proposed technique performs better, with higher accuracy, faster convergence speed and better generalization. Minimum resources are consumed as the classifier input requires a reduced feature set for optimum classification. A comparative analysis of hybrid models with the proposed model is also performed. ACM CCS (2012) Classification: Security and privacy → Intrusion/anomaly detection and malware mitigation → Intrusion detection systems. To cite this article: S. T. Ikram and A. K. Cherukuri, "Improving Accuracy of Intrusion Detection Model Using PCA and optimized SVM", CIT. Journal of Computing and Information Technology, vol. 24, no. 2, pp. 133–148, 2016.
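A compact sketch of the PCA + SVM pipeline with automatic selection of C and gamma via cross-validated grid search. Synthetic two-class data with a KDD-like feature count stands in for the NSL-KDD / gurekddcup records; the component count, grid values and split ratio are assumptions.

```python
# PCA for feature reduction followed by an SVM tuned over C and gamma.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=41, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
grid = {"svc__C": [1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)
print("selected C and gamma:", search.best_params_)
print(f"held-out detection accuracy: {search.score(X_te, y_te):.3f}")
```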

Journal ArticleDOI
TL;DR: The theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods.
Abstract: Consider one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample in V equal size sub-samples, and use this partitioning to define V splits in an estimation sample (one of the V subsamples) and corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V-sample specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
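A toy sketch of the sample-splitting scheme described above: each parameter-generating sample runs an algorithm that defines a data-adaptive target (here, a deliberately simple one: the mean outcome above the median of a data-chosen covariate), the complementary estimation sample estimates it, and the V estimates are averaged. The data, the "algorithm" and the target are illustrative assumptions, far simpler than the paper's examples.

```python
# Sample-split data-adaptive target parameter: average of V fold-specific
# estimates, where each target is defined on the parameter-generating sample
# and estimated on the held-out estimation sample.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 3] + rng.normal(size=n)

V = 5
estimates = []
for gen_idx, est_idx in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    # Parameter-generating sample: pick the covariate most correlated with
    # the outcome, which defines the data-adaptive target parameter.
    corrs = [abs(np.corrcoef(X[gen_idx, j], y[gen_idx])[0, 1]) for j in range(p)]
    j_star = int(np.argmax(corrs))
    # Estimation sample: estimate E[Y | X_j* above its median] on held-out data.
    cut = np.median(X[est_idx, j_star])
    estimates.append(y[est_idx][X[est_idx, j_star] > cut].mean())

print(f"sample-split data-adaptive estimate: {np.mean(estimates):.3f}")
```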

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes the linear neighborhood similarity method (LNSM), which utilizes single-source data for side effect prediction, extends LNSM to deal with multi-source data, and proposes two data integration methods that can effectively integrate multi-source data and outperform other state-of-the-art side effect prediction methods in cross-validation and independent tests.
Abstract: Predicting drug side effects is a critical task in drug discovery, which attracts great attention in both academia and industry. Although many machine learning methods have been proposed, great challenges arise with the boom of precision medicine. On the one hand, many methods are based on the assumption that similar drugs may share the same side effects, but measuring the drug-drug similarity appropriately is challenging. On the other hand, multi-source data provide diverse information for the analysis of side effects and should be integrated for high-accuracy prediction. In this paper, we tackle the side effect prediction problem through linear neighborhoods and multi-source data integration. In the feature space, linear neighborhoods are constructed to extract the drug-drug similarity, namely the "linear neighborhood similarity". By transferring the similarity into the side effect space, known side effect information is propagated through the similarity-based graph. Thus, we propose the linear neighborhood similarity method (LNSM), which utilizes single-source data for side effect prediction. Further, we extend LNSM to deal with multi-source data and propose two data integration methods: the similarity matrix integration method (LNSM-SMI) and the cost minimization integration method (LNSM-CMI), which integrate drug substructure data, drug target data, drug transporter data, drug enzyme data, drug pathway data and drug indication data to improve prediction accuracy. The proposed methods are evaluated on benchmark datasets. The linear neighborhood similarity method (LNSM) produces satisfying results on single-source data. The data integration methods (LNSM-SMI and LNSM-CMI) can effectively integrate multi-source data and outperform other state-of-the-art side effect prediction methods in cross-validation and independent tests. The proposed methods are promising for drug side effect prediction.

Journal ArticleDOI
TL;DR: Val-MI represents a valid strategy to obtain estimates of predictive performance measures in prognostic models developed on incomplete data, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism.
Abstract: Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI (internal validation followed by MI on the training and test parts separately), MI-Val (MI on the full data set followed by internal validation), and MI(-y)-Val (MI on the full data set omitting the outcome, followed by internal validation). Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adopt a strategy for confidence interval construction for incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, the accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
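A hedged sketch of the Val-MI ordering only: split first, then impute the training and test parts, repeating the imputation several times and pooling the performance estimates. A single hold-out split stands in for full bootstrap/cross-validation internal validation, and scikit-learn's IterativeImputer with sample_posterior=True (fitted on the training part and applied to the test part) is a convenient stand-in for a full MI procedure; the 20% missingness and five imputations are illustrative and may differ in detail from the paper's setup.

```python
# Val-MI-style ordering: internal validation split first, imputation after.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.2] = np.nan            # introduce 20% missingness

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

aucs = []
for m in range(5):                               # m imputations, pooled by averaging
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    X_tr_imp = imp.fit_transform(X_tr)           # imputation model fit on training part
    X_te_imp = imp.transform(X_te)               # applied separately to the test part
    clf = LogisticRegression(max_iter=5000).fit(X_tr_imp, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te_imp)[:, 1]))

print(f"pooled AUC over {len(aucs)} imputations: {np.mean(aucs):.3f}")
```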

Journal ArticleDOI
TL;DR: K-fold and Monte Carlo cross-validation and aggregation (crogging) for combining neural network autoregressive forecasts demonstrate significant improvements in forecasting accuracy, especially for short time series and long forecast horizons.
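A small sketch of the cross-validation-and-aggregation (crogging) idea named above: fit a neural-network autoregression on each fold's training portion and average the resulting forecasts. The synthetic seasonal series, the 12-lag input window, the fold count and the network size are illustrative assumptions, not the paper's setup.

```python
# Crogging sketch: average one-step forecasts from fold-wise neural AR models.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=t.size)

lags = 12
X = np.column_stack([series[i:i - lags] for i in range(lags)])   # lagged inputs
y = series[lags:]
x_next = series[-lags:].reshape(1, -1)                           # latest window

forecasts = []
for tr_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=4000, random_state=0)
    net.fit(X[tr_idx], y[tr_idx])
    forecasts.append(net.predict(x_next)[0])

print(f"aggregated one-step forecast: {np.mean(forecasts):.3f}")
```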

Journal ArticleDOI
TL;DR: It is found that final model selection depended upon level of performance and model complexity, and the classifier learner deemed most suitable for this particular problem was JRip, a rule-based learner.

Journal ArticleDOI
TL;DR: In this article, the authors presented a rigorous methodology using advanced statistical methods for the selection of the optimal tsunami intensity measure (TIM) for fragility function derivation for any given dataset, using a unique, detailed, disaggregated damage dataset from the 2011 Great East Japan earthquake and tsunami (total 67,125 buildings), identifying the optimum TIM for describing observed damage for the case study locations.
Abstract: Tsunami fragility curves are statistical models which form a key component of tsunami risk models, as they provide a probabilistic link between a tsunami intensity measure (TIM) and building damage. Existing studies apply different TIMs (e.g. depth, velocity, force etc.) with conflicting recommendations of which to use. This paper presents a rigorous methodology using advanced statistical methods for the selection of the optimal TIM for fragility function derivation for any given dataset. This methodology is demonstrated using a unique, detailed, disaggregated damage dataset from the 2011 Great East Japan earthquake and tsunami (total 67,125 buildings), identifying the optimum TIM for describing observed damage for the case study locations. This paper first presents the proposed methodology, which is broken into three steps: (1) exploratory analysis, (2) statistical model selection and trend analysis and (3) comparison and selection of TIMs. The case study dataset is then presented, and the methodology is then applied to this dataset. In Step 1, exploratory analysis on the case study dataset suggests that fragility curves should be constructed for the sub-categories of engineered (RC and steel) and non-engineered (wood and masonry) construction materials. It is shown that the exclusion of buildings of unknown construction material (common practice in existing studies) may introduce bias in the results; hence, these buildings are estimated as engineered or non-engineered through use of multiple imputation (MI) techniques. In Step 2, a sensitivity analysis of several statistical methods for fragility curve derivation is conducted in order to select multiple statistical models with which to conduct further exploratory analysis and the TIM comparison (to draw conclusions which are non-model-specific). Methods of data aggregation and ordinary least squares parameter estimation (both used in existing studies) are rejected as they are quantitatively shown to reduce fragility curve accuracy and increase uncertainty. Partially ordered probit models and generalised additive models (GAMs) are selected for the TIM comparison of Step 3. In Step 3, fragility curves are then constructed for a number of TIMs, obtained from numerical simulation of the tsunami inundation of the 2011 GEJE. These fragility curves are compared using K-fold cross-validation (KFCV), and it is found that for the case study dataset a force-based measure that considers different flow regimes (indicated by Froude number) proves the most efficient TIM. It is recommended that the methodology proposed in this paper be applied for defining future fragility functions based on optimum TIMs. With the introduction of several concepts novel to the field of fragility assessment (MI, GAMs, KFCV for model optimisation and comparison), this study has significant implications for the future generation of empirical and analytical fragility functions.

Journal ArticleDOI
TL;DR: A groundwater flow model was emulated using a Bayesian Network, an Artificial neural network, and a Gradient Boosted Regression Tree to emulate the process model with a statistical "metamodel" and the results have application for managing allocation of groundwater.
Abstract: For decision support, the insights and predictive power of numerical process models can be hampered by insufficient expertise and computational resources required to evaluate system response to new stresses. An alternative is to emulate the process model with a statistical "metamodel." Built on a dataset of collocated numerical model input and output, a groundwater flow model was emulated using a Bayesian Network, an Artificial neural network, and a Gradient Boosted Regression Tree. The response of interest was surface water depletion expressed as the source of water-to-wells. The results have application for managing allocation of groundwater. Each technique was tuned using cross validation and further evaluated using a held-out dataset. A numerical MODFLOW-USG model of the Lake Michigan Basin, USA, was used for the evaluation. The performance and interpretability of each technique was compared pointing to advantages of each technique. The metamodel can extend to unmodeled areas. Display Omitted Metamodeling can be used for decision support emulating groundwater models.Artificial neural networks, gradient boosting, and Bayesian networks each have advantages.Spatial relations among wells and streams are key drivers for source of water to groundwater wells.