
Showing papers on "Cross-validation" published in 2016


Journal ArticleDOI
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.
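A minimal sketch of the "honest" splitting idea only, not the authors' causal-tree algorithm: one half of the data builds a partition of the covariate space, and a disjoint half estimates the treatment effect within each cell. The synthetic data, the transformed-outcome regression tree used to form the partition, and all variable names are illustrative assumptions.

```python
# Honest estimation sketch: one sample builds the partition, a disjoint
# sample estimates treatment effects within each leaf. Synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p = 4000, 5
X = rng.normal(size=(n, p))
w = rng.integers(0, 2, size=n)                    # randomized treatment
tau = 1.0 * (X[:, 0] > 0)                         # heterogeneous effect
y = X[:, 1] + tau * w + rng.normal(size=n)

# Split into a partition-building sample and an estimation sample.
X_tr, X_est, w_tr, w_est, y_tr, y_est = train_test_split(
    X, w, y, test_size=0.5, random_state=0)

# Crude proxy for an effect-based partition: regress a transformed outcome
# (inverse-propensity style, known propensity 0.5) on covariates with a
# shallow tree, so leaves group units with similar treatment effects.
z_tr = y_tr * (w_tr - 0.5) / 0.25
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200).fit(X_tr, z_tr)

# Honest step: estimate the effect in each leaf using only the held-out
# estimation sample (difference of treated and control means).
leaves = tree.apply(X_est)
for leaf in np.unique(leaves):
    m = leaves == leaf
    effect = y_est[m & (w_est == 1)].mean() - y_est[m & (w_est == 0)].mean()
    print(f"leaf {leaf}: estimated effect = {effect:.2f}")
```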

913 citations


Proceedings ArticleDOI
01 Feb 2016
TL;DR: Results show that, up to a certain threshold, k-fold cross-validation with the value of k varied according to the number of instances can indeed be used instead of hold-out validation for quality classification.
Abstract: While training a model with data from a dataset, we have to think of an ideal way to do so. The training should be done in such a way that the model has enough instances to train on without over-fitting, while also bearing in mind that a model trained on too few instances will not be trained properly and will give poor results when used for testing. Accuracy is important in classification, and one must always strive for the highest accuracy, provided there is no trade-off in the form of unacceptable running time. When working on small datasets, the ideal choices are k-fold cross-validation with a large value of k (but smaller than the number of instances) or leave-one-out cross-validation, whereas when working on colossal datasets, the first thought is generally to use hold-out validation. This article studies the differences between the two validation schemes and analyzes the possibility of using k-fold cross-validation instead of hold-out validation even on large datasets. Experimentation was performed on four large datasets, and the results show that, up to a certain threshold, k-fold cross-validation with the value of k varied according to the number of instances can indeed be used instead of hold-out validation for quality classification.
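A small sketch contrasting the two validation schemes discussed above on the same classifier. The public breast-cancer dataset and the logistic-regression model are illustrative stand-ins; the split ratio and values of k are assumptions, not the paper's settings.

```python
# Hold-out validation vs. k-fold cross-validation on the same classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Hold-out: a single 70/30 split gives one accuracy estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: every instance is used for both training and testing across folds.
for k in (5, 10, 20):
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"{k:>2}-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
print(f"hold-out accuracy:  {holdout_acc:.3f}")
```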

401 citations


Journal ArticleDOI
TL;DR: A new emotion recognition system based on facial expression images is proposed; it achieved an overall accuracy of 96.77±0.10% and is superior to three state-of-the-art methods.
Abstract: Emotion recognition reflects the position and motion of facial muscles. It contributes significantly to many fields, yet current approaches have not obtained good results. This paper proposes a new emotion recognition system based on facial expression images. We enrolled 20 subjects and had each subject pose seven different emotions: happiness, sadness, surprise, anger, disgust, fear, and neutral. Afterward, we employed biorthogonal wavelet entropy to extract multiscale features and used a fuzzy multiclass support vector machine as the classifier. Stratified cross-validation was employed as a strict validation scheme. The statistical analysis showed that our method achieved an overall accuracy of 96.77±0.10% and is superior to three state-of-the-art methods. In all, the proposed method is efficient.
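A sketch of the validation setup only: stratified k-fold cross-validation for a 7-class problem with 20 subjects. A standard RBF SVC stands in for the paper's fuzzy multiclass SVM, random vectors stand in for the biorthogonal wavelet entropy features, and the feature dimension and fold count are assumptions.

```python
# Stratified cross-validation of a multiclass SVM on placeholder features.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, n_emotions, n_features = 20, 7, 32
X = rng.normal(size=(n_subjects * n_emotions, n_features))   # placeholder features
y = np.tile(np.arange(n_emotions), n_subjects)               # 7 emotion labels

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"stratified 10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Stratification keeps the proportion of each emotion label roughly constant across folds, which matters when classes are small and balanced as here.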

242 citations


Journal ArticleDOI
01 Jan 2016-Geoderma
TL;DR: In this article, a case study on soil organic carbon mapping across a 50,810 km² area in northwestern China was conducted. The authors compared the quality of the maps obtained by GWR and GWRR on the one hand and multiple linear regression (MLR) on the other, and concluded that fitting regression coefficients locally, as in GWR, only paid off when no spatial random effect was included in the model.

134 citations



Journal ArticleDOI
27 May 2016-PLOS ONE
TL;DR: Generalized additive models and support vector machines performed well as risk prediction models for postoperative sepsis and AKI, while feature extraction using principal component analysis improved the performance of all models.
Abstract: Objective: To compare the performance of risk prediction models for forecasting postoperative sepsis and acute kidney injury. Design: Retrospective single-center cohort study of adult surgical patients admitted between 2000 and 2010. Patients: 50,318 adult patients undergoing major surgery. Measurements: We evaluated the performance of logistic regression, generalized additive models, naive Bayes and support vector machines for forecasting postoperative sepsis and acute kidney injury. We assessed the impact of feature reduction techniques on predictive performance. Model performance was determined using the area under the receiver operating characteristic curve, accuracy, and positive predictive value. The results were reported based on a 70/30 validation procedure in which the data were randomly split into 70% used for training the model and 30% used for validation. Main results: The areas under the receiver operating characteristic curve for the different models ranged between 0.797 and 0.858 for acute kidney injury and between 0.757 and 0.909 for severe sepsis. Logistic regression, generalized additive models, and support vector machines had better performance compared to the naive Bayes model. Generalized additive models additionally accounted for non-linearity of continuous clinical variables, as depicted in their risk pattern plots. Reducing the input feature space with LASSO had minimal effect on prediction performance, while feature extraction using principal component analysis improved the performance of the models. Conclusions: Generalized additive models and support vector machines had good performance as risk prediction models for postoperative sepsis and AKI. Feature extraction using principal component analysis improved the predictive performance of all models.
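A hedged sketch of the evaluation protocol described above: a random 70/30 split, PCA for feature extraction, and AUC as the performance measure. The public breast-cancer dataset, the logistic-regression model, and the number of components are assumptions standing in for the surgical cohort and clinical models.

```python
# 70/30 hold-out evaluation with PCA feature extraction and AUC scoring.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"hold-out AUC with PCA features: {auc:.3f}")
```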

125 citations


Journal ArticleDOI
TL;DR: Li et al., as discussed by the authors, proposed a nested linear mixed effects (LME) model including nested month-, week-, and day-specific random effects of the PM2.5-AOD relationship.

124 citations


Journal ArticleDOI
TL;DR: It is advocated that the RVM model can be employed as a promising machine learning tool for the prediction of evaporative loss.
Abstract: The forecasting of evaporative loss (E) is vital for water resource management and understanding of hydrological processes for farming practices, ecosystem management and hydrologic engineering. This study developed three machine learning algorithms, namely the relevance vector machine (RVM), extreme learning machine (ELM) and multivariate adaptive regression spline (MARS), for the prediction of E using five predictor variables: incident solar radiation (S), maximum temperature (Tmax), minimum temperature (Tmin), atmospheric vapor pressure (VP) and precipitation (P). The RVM model is based on the Bayesian formulation of a linear model with an appropriate prior that results in sparse representations. The ELM model is a computationally efficient algorithm based on a single-layer feedforward neural network with hidden neurons that randomly chooses input weights, and the MARS model is built on a flexible regression algorithm that generally divides the solution space into intervals of predictor variables and fits splines (basis functions) to each interval. Using a random sampling process, the predictor data were partitioned into a training phase (70% of the data) and a testing phase (the remaining 30%). The equations for the prediction of monthly E were formulated. The RVM model was devised using the radial basis function; the ELM model comprised 5 inputs and 10 hidden neurons and used the radial basis activation function; and the MARS model utilized 15 basis functions. The decomposition of variance among the predictor dataset of the MARS model yielded the largest magnitude of the Generalized Cross Validation statistic (≈0.03) when Tmax was used as an input, followed by relatively lower values (≈0.028, 0.019) for inputs defined by S and VP. This confirmed that the prediction of E drew the largest contribution of predictive features from Tmax, as verified emphatically by a sensitivity analysis test. The model performance statistics yielded correlation coefficients of 0.979 (RVM), 0.977 (ELM) and 0.974 (MARS), root-mean-square errors of 9.306, 9.714 and 10.457, and mean absolute errors of 0.034, 0.035 and 0.038. Despite the small differences in overall prediction skill, the RVM model appeared to be more accurate in the prediction of E. It is therefore advocated that the RVM model can be employed as a promising machine learning tool for the prediction of evaporative loss.

121 citations


Journal ArticleDOI
TL;DR: It is indicated that model prediction performance should be assessed by accounting for monitor clustering, because model accuracy in spatial prediction can be misinterpreted when validation monitors are randomly selected.
Abstract: The accuracy of estimated fine particulate matter concentrations (PM2.5), obtained by fusing station-based measurements and satellite-based aerosol optical depth (AOD), is often reduced without accounting for the spatial and temporal variations in PM2.5 and missing AOD observations. In this study, a city-specific linear regression model was first developed to fill in missing AOD data. A novel interpolation-based variable, the PM2.5 spatial interpolator (PMSI2.5), was also introduced to account for the spatial dependence in PM2.5 across grid cells. A Bayesian hierarchical model was then developed to estimate spatiotemporal relationships between AOD and PM2.5. These methods were evaluated through a city-specific 10-fold cross-validation procedure in a case study in North China in 2014. The cross-validation R² was 0.61 when PMSI2.5 was included and 0.48 when PMSI2.5 was excluded. The gap-filled AOD values also effectively improved predicted PM2.5 concentrations, with an R² of 0.78. Daily ground-level PM2.5 concentration fields at a 12 km resolution were predicted with complete spatial and temporal coverage. This study also indicates that model prediction performance should be assessed by accounting for monitor clustering, because model accuracy in spatial prediction can be misinterpreted when validation monitors are randomly selected.
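A sketch of a grouped 10-fold cross-validated R² in the spirit of the site-aware validation argued for above: whole monitors are held out together rather than random observations. The synthetic AOD/PMSI2.5 data, the number of sites, and the random-forest stand-in for the Bayesian hierarchical model are all assumptions.

```python
# Grouped 10-fold CV R^2: hold out entire monitoring sites per fold.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)
n = 2000
aod = rng.gamma(2.0, 0.3, size=n)                 # gap-filled AOD (hypothetical)
pmsi = rng.normal(size=n)                         # spatial interpolator term
site = rng.integers(0, 40, size=n)                # monitor / city identifier
pm25 = 30 * aod + 5 * pmsi + rng.normal(0, 5, size=n)

X = np.column_stack([aod, pmsi])
cv = GroupKFold(n_splits=10)
pred = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0),
                         X, pm25, cv=cv, groups=site)
print(f"grouped 10-fold CV R^2: {r2_score(pm25, pred):.2f}")
```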

120 citations


Journal ArticleDOI
TL;DR: A general, fully learning-based framework for direct bi-ventricular volume estimation is presented; it removes user inputs and unreliable assumptions and largely outperforms existing direct methods on a larger dataset of 100 subjects, including both healthy and diseased cases, with twice the number of subjects used in previous methods.

115 citations


Journal ArticleDOI
TL;DR: A generalized regression neural network with a K-fold cross-validation method is proposed for predicting the displacement of landslides, using Pearson cross-correlation coefficients (PCC) and mutual information (MI).

Journal ArticleDOI
TL;DR: It is found that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations, and this relationship is illustrated using simulations and a collection of data sets from mice, wheat and human genetics.
Abstract: The prediction of phenotypic traits using high-density genomic data has many applications such as the selection of plants and animals of commercial interest; and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we propose an approach based on clustering and resampling to investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Therefore, estimating how quickly predictive accuracy decays is important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics.

Journal ArticleDOI
TL;DR: In this paper, three heuristic regression techniques, least square support vector regression (LSSVR), multivariate adaptive regression splines (MARS) and M5 Model Tree (M5-Tree), are investigated for forecasting and predicting of monthly streamflows.
Abstract: Streamflow forecasting and prediction are significant concerns for several applications of water resources and management, including flood management, determination of river water potential, environmental flow analysis, and agriculture and hydro-power generation. Forecasting and prediction of monthly streamflows are investigated using three heuristic regression techniques: least square support vector regression (LSSVR), multivariate adaptive regression splines (MARS) and the M5 Model Tree (M5-Tree). Data from four different stations, Besiri and Malabadi located in Turkey and Hit and Baghdad located in Iraq, are used in the analysis. The cross-validation method is employed in the applications. In the first stage of the study, the heuristic regression models are compared with each other and with multiple linear regression (MLR) in forecasting one-month-ahead streamflow of each station individually. In the second stage, the models are evaluated and compared in predicting the streamflow of one station using data from a nearby station. The research also investigated the influence of the periodicity component (month number of the year) as an external sub-set in modeling long-term streamflow. In both stages, the comparison results indicate that the LSSVR model generally performs superior to the MARS, M5-Tree and MLR models. In addition, it is seen that adding periodicity as an input to the models significantly increases their accuracy in forecasting and predicting monthly streamflows in both stages of the study.

Journal ArticleDOI
TL;DR: In this article, quantile regression forests (an elaboration of random forests) are used to investigate the potential of high resolution auxiliary information alone to support the generation of accurate and interpretable geochemical maps.

Proceedings ArticleDOI
15 Jul 2016
TL;DR: In this article, a multilayer perceptron neural network was used to predict rice production yield and investigate the factors affecting the rice crop yield for various districts of Maharashtra state in India.
Abstract: Rice crop production contributes to the food security of India, accounting for more than 40% of overall crop production. Its production is reliant on favorable climatic conditions, and variability from season to season is detrimental to farmers' income and livelihoods. Improving the ability of farmers to predict crop productivity under different climatic scenarios can assist farmers and other stakeholders in making important decisions in terms of agronomy and crop choice. This study aimed to use neural networks to predict rice production yield and investigate the factors affecting the rice crop yield for various districts of Maharashtra state in India. Data were sourced from publicly available Indian Government records for 27 districts of Maharashtra state, India. The parameters considered for the present study were precipitation, minimum temperature, average temperature, maximum temperature, reference crop evapotranspiration, area, production and yield for the Kharif season (June to November) for the years 1998 to 2002. The dataset was processed using the WEKA tool, and a multilayer perceptron neural network was developed. The cross-validation method was used to validate the data. The results showed an accuracy of 97.5%, with a sensitivity of 96.3 and a specificity of 98.1. Further, the mean absolute error, root mean squared error, relative absolute error and root relative squared error were calculated for the present study. The study dataset was also executed using the Knowledge Flow of the WEKA tool, and the performance of the classifier is visually summarized using an ROC curve.
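A sketch of the same validation pattern with scikit-learn standing in for WEKA: a multilayer perceptron evaluated by 10-fold cross-validation. The public wine dataset, the hidden-layer size and the fold count are assumptions, since the district-level crop data are not bundled here.

```python
# Multilayer perceptron evaluated with 10-fold cross-validation.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                                  random_state=0))
scores = cross_val_score(mlp, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```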

Journal ArticleDOI
TL;DR: SVM and SGB models explained in the current study could yield remarkable predictive performance in the classification of ischemic stroke.

Journal ArticleDOI
TL;DR: In this article, a model-free criterion ESCV based on a new estimation stability (ES) metric and cross-validation is proposed to find a smaller and locally ES-optimal model smaller than the CV choice.
Abstract: Cross-validation (CV) is often used to select the regularization parameter in high-dimensional problems. However, when applied to the sparse modeling method Lasso, CV leads to models that are unstable in high-dimensions, and consequently not suited for reliable interpretation. In this article, we propose a model-free criterion ESCV based on a new estimation stability (ES) metric and CV. Our proposed ESCV finds a smaller and locally ES-optimal model smaller than the CV choice so that it fits the data and also enjoys estimation stability property. We demonstrate that ESCV is an effective alternative to CV at a similar easily parallelizable computational cost. In particular, we compare the two approaches with respect to several performance measures when applied to the Lasso on both simulated and real datasets. For dependent predictors common in practice, our main finding is that ESCV cuts down false positive rates often by a large margin, while sacrificing little of true positive rates. ESCV usually outperfo...
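A rough sketch of an estimation-stability-style metric for the Lasso, following a simplified reading of the ESCV idea: fitted values produced by fold-wise Lasso solutions should agree with each other. The data, the penalty grid, and the selection rule (picking the ES-minimizing penalty, without the restriction relative to the CV-selected model) are simplifying assumptions, not the authors' implementation.

```python
# Estimation-stability (ES) style metric across fold-wise Lasso fits.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)

alphas = np.logspace(-2, 0, 20)
cv = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

es = []
for a in alphas:
    # Fit the Lasso on each fold's training part, predict on the full X.
    preds = np.array([Lasso(alpha=a, max_iter=10000)
                      .fit(X[tr], y[tr]).predict(X) for tr, _ in cv])
    mean_pred = preds.mean(axis=0)
    denom = np.sum(mean_pred ** 2) + 1e-12
    # ES: average squared deviation of fold-wise fits from their mean,
    # normalized by the size of the mean fit.
    es.append(np.mean(np.sum((preds - mean_pred) ** 2, axis=1)) / denom)

best = alphas[int(np.argmin(es))]
print(f"alpha minimizing the ES metric: {best:.3f}")
```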

Journal ArticleDOI
TL;DR: In this paper, a frequentist model averaging method based on the leave-subject-out cross-validation was developed for averaging longitudinal data models and time series models which can have heteroscedastic errors.

Journal ArticleDOI
TL;DR: In this paper, a nonlinear autoregressive (exogenous) model for one-day-ahead mean hourly wind speed forecasting, where general regression neural network is employed to model nonlinearities of the system, is presented.

Journal ArticleDOI
TL;DR: An open-access "Double Cross-Validation (DCV)" software tool is presented that can be used to perform multiple linear regression (MLR) model development by employing the double cross-validation technique; two variable selection methods are incorporated in this tool.

Journal ArticleDOI
TL;DR: This study finds that breeding programs seeking efficient genomic selection in soybeans would best allocate resources by investing in a representative training set, and the most robust prediction model was the combination of reproducing kernel Hilbert space regression and BayesB.
Abstract: Many economically important traits in plant breeding have low heritability or are difficult to measure. For these traits, genomic selection has attractive features and may boost genetic gains. Our goal was to evaluate alternative scenarios to implement genomic selection for yield components in soybean (Glycine max (L.) Merr.). We used a nested association panel with cross-validation to evaluate the impacts of training population size, genotyping density, and prediction model on the accuracy of genomic prediction. Our results indicate that training population size was the factor most relevant to improvement in genome-wide prediction, with the greatest improvement observed in training sets of up to 2000 individuals. We discuss assumptions that influence the choice of the prediction model. Although alternative models had minor impacts on prediction accuracy, the most robust prediction model was the combination of reproducing kernel Hilbert space regression and BayesB. Higher genotyping density marginally improved accuracy. Our study finds that breeding programs seeking efficient genomic selection in soybeans would best allocate resources by investing in a representative training set.

Journal ArticleDOI
TL;DR: In this article, a support vector machine (SVM) model was developed to predict nitrate concentration in groundwater of Arak plain, Iran and the associated parameters for the optimum SVM model were obtained using a combination of 4-fold cross-validation and grid search technique.
Abstract: In this paper, a support vector machine (SVM) model was developed to predict nitrate concentration in groundwater of Arak plain, Iran. The model provided a tool for prediction of nitrate concentration using a set of easily measurable groundwater quality variables including water temperature, electrical conductivity, groundwater depth, total dissolved solids, dissolved oxygen, pH, land use, and season of the year as input variables. The data set comprised of 160 water samples representing 40 different wells monitored for 1 year. The associated parameters for the optimum SVM model were obtained using a combination of 4-fold cross-validation and grid search technique. The optimum model was used to predict nitrate concentration in Arak plain aquifer. The SVM model predicted nitrate concentration in training and test stage data sets with reasonably high correlation (0.92 and 0.87, respectively) with the measured values and low root mean squared errors of 0.086 and 0.111, respectively. Finally, the map of nitrate concentration in groundwater was prepared for all four seasons using the trained SVM model and a geographic information system (GIS) interpolation scheme and compared with the results with a physics-based (flow and contaminant) model. Overall, the results showed that SVM model could be used as a fast, reliable, and cost-effective method for assessment and predicting groundwater quality.
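A sketch of the tuning strategy described above: a support vector regressor whose hyperparameters are chosen by grid search combined with 4-fold cross-validation. The synthetic data, the parameter grid and the R² scoring are assumptions standing in for the groundwater-quality measurements and the paper's settings.

```python
# SVR tuned with grid search over C, gamma and epsilon using 4-fold CV.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=160, n_features=8, noise=10.0, random_state=0)

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1, 1.0],
              "svr__epsilon": [0.05, 0.1, 0.5]}
search = GridSearchCV(pipe, param_grid, cv=4, scoring="r2")
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best 4-fold CV R^2: {search.best_score_:.3f}")
```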

Journal ArticleDOI
30 Jun 2016
TL;DR: A hybrid intrusion detection model is proposed by integrating principal component analysis (PCA) and a support vector machine (SVM) with an automatic parameter selection technique; it performs better, with higher accuracy, faster convergence speed and better generalization.
Abstract: Intrusion detection is very essential for providing security to different network domains and is mostly used for locating and tracing intruders. There are many problems with traditional intrusion detection models (IDS), such as low detection capability against unknown network attacks, a high false alarm rate and insufficient analysis capability. Hence the major scope of research in this domain is to develop an intrusion detection model with improved accuracy and reduced training time. This paper proposes a hybrid intrusion detection model by integrating principal component analysis (PCA) and a support vector machine (SVM). The novelty of the paper is the optimization of the kernel parameters of the SVM classifier using an automatic parameter selection technique. This technique optimizes the punishment factor (C) and kernel parameter gamma (γ), thereby improving the accuracy of the classifier and reducing the training and testing time. The experimental results obtained on the NSL-KDD and gurekddcup datasets show that the proposed technique performs better, with higher accuracy, faster convergence speed and better generalization. Minimum resources are consumed as the classifier input requires a reduced feature set for optimum classification. A comparative analysis of hybrid models with the proposed model is also performed. ACM CCS (2012) Classification: Security and privacy → Intrusion/anomaly detection and malware mitigation → Intrusion detection systems. To cite this article: S. T. Ikram and A. K. Cherukuri, "Improving Accuracy of Intrusion Detection Model Using PCA and optimized SVM", CIT. Journal of Computing and Information Technology, vol. 24, no. 2, pp. 133–148, 2016.
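A compact sketch of the PCA + SVM pipeline with automatic selection of C and gamma via cross-validated grid search. Synthetic two-class data with a KDD-like feature count stands in for the NSL-KDD / gurekddcup records; the component count, grid values and split ratio are assumptions.

```python
# PCA for feature reduction followed by an SVM tuned over C and gamma.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=41, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
grid = {"svc__C": [1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)
print("selected C and gamma:", search.best_params_)
print(f"held-out detection accuracy: {search.score(X_te, y_te):.3f}")
```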

Journal ArticleDOI
TL;DR: The theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods.
Abstract: Consider one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample in V equal size sub-samples, and use this partitioning to define V splits in an estimation sample (one of the V subsamples) and corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V-sample specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
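A toy sketch of the sample-splitting scheme described above: each parameter-generating sample runs an algorithm that defines a data-adaptive target (here, a deliberately simple one: the mean outcome above the median of a data-chosen covariate), the complementary estimation sample estimates it, and the V estimates are averaged. The data, the "algorithm" and the target are illustrative assumptions, far simpler than the paper's examples.

```python
# Sample-split data-adaptive target parameter: average of V fold-specific
# estimates, where each target is defined on the parameter-generating sample
# and estimated on the held-out estimation sample.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 3] + rng.normal(size=n)

V = 5
estimates = []
for gen_idx, est_idx in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    # Parameter-generating sample: pick the covariate most correlated with
    # the outcome, which defines the data-adaptive target parameter.
    corrs = [abs(np.corrcoef(X[gen_idx, j], y[gen_idx])[0, 1]) for j in range(p)]
    j_star = int(np.argmax(corrs))
    # Estimation sample: estimate E[Y | X_j* above its median] on held-out data.
    cut = np.median(X[est_idx, j_star])
    estimates.append(y[est_idx][X[est_idx, j_star] > cut].mean())

print(f"sample-split data-adaptive estimate: {np.mean(estimates):.3f}")
```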

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes the linear neighborhood similarity method (LNSM), which utilizes single-source data for side effect prediction, extends LNSM to deal with multi-source data, and proposes two data integration methods that can effectively integrate multi-source data and outperform other state-of-the-art side effect prediction methods in cross-validation and independent tests.
Abstract: Predicting drug side effects is a critical task in drug discovery, which attracts great attention in both academia and industry. Although many machine learning methods have been proposed, great challenges arise with the boom of precision medicine. On the one hand, many methods are based on the assumption that similar drugs may share the same side effects, but measuring the drug-drug similarity appropriately is challenging. On the other hand, multi-source data provide diverse information for the analysis of side effects and should be integrated for high-accuracy prediction. In this paper, we tackle the side effect prediction problem through linear neighborhoods and multi-source data integration. In the feature space, linear neighborhoods are constructed to extract the drug-drug similarity, namely the "linear neighborhood similarity". By transferring the similarity into the side effect space, known side effect information is propagated through the similarity-based graph. Thus, we propose the linear neighborhood similarity method (LNSM), which utilizes single-source data for side effect prediction. Further, we extend LNSM to deal with multi-source data and propose two data integration methods: the similarity matrix integration method (LNSM-SMI) and the cost minimization integration method (LNSM-CMI), which integrate drug substructure data, drug target data, drug transporter data, drug enzyme data, drug pathway data and drug indication data to improve prediction accuracy. The proposed methods are evaluated on benchmark datasets. The linear neighborhood similarity method (LNSM) produces satisfying results on single-source data. The data integration methods (LNSM-SMI and LNSM-CMI) can effectively integrate multi-source data and outperform other state-of-the-art side effect prediction methods in cross-validation and independent tests. The proposed methods are promising for drug side effect prediction.

Journal ArticleDOI
TL;DR: Val-MI represents a valid strategy to obtain estimates of predictive performance measures in prognostic models developed on incomplete data, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism.
Abstract: Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI (internal validation followed by MI on the training and test parts separately), MI-Val (MI on the full data set followed by internal validation), and MI(-y)-Val (MI on the full data set omitting the outcome, followed by internal validation). Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adopt a strategy for confidence interval construction for incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, the accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
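A hedged sketch of the Val-MI ordering only: split first, then impute the training and test parts, repeating the imputation several times and pooling the performance estimates. A single hold-out split stands in for full bootstrap/cross-validation internal validation, and scikit-learn's IterativeImputer with sample_posterior=True (fitted on the training part and applied to the test part) is a convenient stand-in for a full MI procedure; the 20% missingness and five imputations are illustrative and may differ in detail from the paper's setup.

```python
# Val-MI-style ordering: internal validation split first, imputation after.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.2] = np.nan            # introduce 20% missingness

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

aucs = []
for m in range(5):                               # m imputations, pooled by averaging
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    X_tr_imp = imp.fit_transform(X_tr)           # imputation model fit on training part
    X_te_imp = imp.transform(X_te)               # applied separately to the test part
    clf = LogisticRegression(max_iter=5000).fit(X_tr_imp, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te_imp)[:, 1]))

print(f"pooled AUC over {len(aucs)} imputations: {np.mean(aucs):.3f}")
```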

Journal ArticleDOI
TL;DR: K-fold and Monte Carlo cross-validation and aggregation (crogging) for combining neural network autoregressive forecasts demonstrate significant improvements in forecasting accuracy, especially for short time series and long forecast horizons.
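A small sketch of the cross-validation-and-aggregation (crogging) idea named above: fit a neural-network autoregression on each fold's training portion and average the resulting forecasts. The synthetic seasonal series, the 12-lag input window, the fold count and the network size are illustrative assumptions, not the paper's setup.

```python
# Crogging sketch: average one-step forecasts from fold-wise neural AR models.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=t.size)

lags = 12
X = np.column_stack([series[i:i - lags] for i in range(lags)])   # lagged inputs
y = series[lags:]
x_next = series[-lags:].reshape(1, -1)                           # latest window

forecasts = []
for tr_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=4000, random_state=0)
    net.fit(X[tr_idx], y[tr_idx])
    forecasts.append(net.predict(x_next)[0])

print(f"aggregated one-step forecast: {np.mean(forecasts):.3f}")
```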

Journal ArticleDOI
TL;DR: It is found that final model selection depended upon level of performance and model complexity, and the classifier learner deemed most suitable for this particular problem was JRip, a rule-based learner.

Journal ArticleDOI
TL;DR: In this article, the authors presented a rigorous methodology using advanced statistical methods for the selection of the optimal tsunami intensity measure (TIM) for fragility function derivation for any given dataset, using a unique, detailed, disaggregated damage dataset from the 2011 Great East Japan earthquake and tsunami (total 67,125 buildings), identifying the optimum TIM for describing observed damage for the case study locations.
Abstract: Tsunami fragility curves are statistical models which form a key component of tsunami risk models, as they provide a probabilistic link between a tsunami intensity measure (TIM) and building damage. Existing studies apply different TIMs (e.g. depth, velocity, force etc.) with conflicting recommendations of which to use. This paper presents a rigorous methodology using advanced statistical methods for the selection of the optimal TIM for fragility function derivation for any given dataset. This methodology is demonstrated using a unique, detailed, disaggregated damage dataset from the 2011 Great East Japan earthquake and tsunami (total 67,125 buildings), identifying the optimum TIM for describing observed damage for the case study locations. This paper first presents the proposed methodology, which is broken into three steps: (1) exploratory analysis, (2) statistical model selection and trend analysis and (3) comparison and selection of TIMs. The case study dataset is then presented, and the methodology is then applied to this dataset. In Step 1, exploratory analysis on the case study dataset suggests that fragility curves should be constructed for the sub-categories of engineered (RC and steel) and non-engineered (wood and masonry) construction materials. It is shown that the exclusion of buildings of unknown construction material (common practice in existing studies) may introduce bias in the results; hence, these buildings are estimated as engineered or non-engineered through use of multiple imputation (MI) techniques. In Step 2, a sensitivity analysis of several statistical methods for fragility curve derivation is conducted in order to select multiple statistical models with which to conduct further exploratory analysis and the TIM comparison (to draw conclusions which are non-model-specific). Methods of data aggregation and ordinary least squares parameter estimation (both used in existing studies) are rejected as they are quantitatively shown to reduce fragility curve accuracy and increase uncertainty. Partially ordered probit models and generalised additive models (GAMs) are selected for the TIM comparison of Step 3. In Step 3, fragility curves are then constructed for a number of TIMs, obtained from numerical simulation of the tsunami inundation of the 2011 GEJE. These fragility curves are compared using K-fold cross-validation (KFCV), and it is found that for the case study dataset a force-based measure that considers different flow regimes (indicated by Froude number) proves the most efficient TIM. It is recommended that the methodology proposed in this paper be applied for defining future fragility functions based on optimum TIMs. With the introduction of several concepts novel to the field of fragility assessment (MI, GAMs, KFCV for model optimisation and comparison), this study has significant implications for the future generation of empirical and analytical fragility functions.

Journal ArticleDOI
TL;DR: A groundwater flow model was emulated using a Bayesian Network, an Artificial neural network, and a Gradient Boosted Regression Tree to emulate the process model with a statistical "metamodel" and the results have application for managing allocation of groundwater.
Abstract: For decision support, the insights and predictive power of numerical process models can be hampered by insufficient expertise and computational resources required to evaluate system response to new stresses. An alternative is to emulate the process model with a statistical "metamodel." Built on a dataset of collocated numerical model input and output, a groundwater flow model was emulated using a Bayesian Network, an Artificial neural network, and a Gradient Boosted Regression Tree. The response of interest was surface water depletion expressed as the source of water-to-wells. The results have application for managing allocation of groundwater. Each technique was tuned using cross validation and further evaluated using a held-out dataset. A numerical MODFLOW-USG model of the Lake Michigan Basin, USA, was used for the evaluation. The performance and interpretability of each technique was compared pointing to advantages of each technique. The metamodel can extend to unmodeled areas. Display Omitted Metamodeling can be used for decision support emulating groundwater models.Artificial neural networks, gradient boosting, and Bayesian networks each have advantages.Spatial relations among wells and streams are key drivers for source of water to groundwater wells.