
Showing papers on "Model selection published in 2022"


Journal ArticleDOI
01 Jan 2022-Energy
TL;DR: An improved electricity price forecasting model is developed that combines adaptive data preprocessing, an advanced optimization method, and a kernel-based model; a newly proposed optimal model selection strategy is applied to determine which of the developed models provides the most desirable forecasting result.

58 citations


Journal ArticleDOI
01 Jan 2022-Energy
TL;DR: Wang et al. as discussed by the authors developed an improved electricity price forecasting model that offers the advantages of adaptive data preprocessing, an advanced optimization method, a kernel-based model, and an optimal model selection strategy.

45 citations


Journal ArticleDOI
TL;DR: In this article, a profile-likelihood approach is used to explore the relationship between parameter identifiability and model misspecification, finding that the logistic growth model does not suffer identifiability issues for the type of data considered, whereas the Gompertz and Richards' models encounter practical non-identifiability issues.
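As an illustration of the profile-likelihood workflow described above, the minimal sketch below (not the authors' code or data) simulates logistic growth observations, profiles the likelihood over the growth rate r by re-optimizing the remaining parameters at each fixed r, and reads off a 95% interval where the profile drops by χ²₁(0.95)/2 ≈ 1.92; all parameter values and the noise model are illustrative assumptions.

```python
# Hedged sketch of practical identifiability analysis via profile likelihood
# for a logistic growth model; parameter values and data are illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)

def logistic(t, x0, r, K):
    return K / (1.0 + (K / x0 - 1.0) * np.exp(-r * t))

# Simulate noisy observations
t = np.linspace(0, 10, 25)
y = logistic(t, x0=5.0, r=0.8, K=100.0) + rng.normal(0, 5.0, size=t.size)

def neg_loglik(params, r_fixed=None):
    # Gaussian log-likelihood with the noise variance profiled out analytically.
    if r_fixed is None:
        x0, r, K = params
    else:
        x0, K = params
        r = r_fixed
    resid = y - logistic(t, x0, r, K)
    sigma2 = np.mean(resid**2)
    return 0.5 * t.size * (np.log(2 * np.pi * sigma2) + 1.0)

# Full maximum likelihood fit
mle = minimize(neg_loglik, x0=[3.0, 0.5, 80.0], method="Nelder-Mead")
ll_hat = -mle.fun

# Profile the growth rate r over a grid
r_grid = np.linspace(0.3, 1.5, 61)
profile = []
for r_fixed in r_grid:
    fit = minimize(neg_loglik, x0=[3.0, 80.0], args=(r_fixed,), method="Nelder-Mead")
    profile.append(-fit.fun)
profile = np.array(profile)

# 95% threshold for a one-dimensional profile
threshold = ll_hat - chi2.ppf(0.95, df=1) / 2.0
inside = r_grid[profile >= threshold]
print(f"MLE r ~ {mle.x[1]:.3f}, 95% profile interval ~ [{inside.min():.3f}, {inside.max():.3f}]")
# A profile that never drops below the threshold on one side would indicate
# practical non-identifiability of r for this data set.
```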

22 citations


DOI
07 Mar 2022
TL;DR: The copula approach as mentioned in this paper allows a marginal model to be constructed for each variable separately and then joined through a dependence structure, providing flexible multivariate distribution classes for the massive multivariate data now available.
Abstract: With the availability of massive multivariate data comes a need to develop flexible multivariate distribution classes. The copula approach allows marginal models to be constructed for each variable...

20 citations


Journal ArticleDOI
TL;DR: Vine copula models as discussed by the authors allow marginal models to be constructed for each variable separately and joined with a dependence structure characterized by a copula, giving rise to highly flexible models that still allow for computationally tractable estimation and model selection procedures.
Abstract: With the availability of massive multivariate data comes a need to develop flexible multivariate distribution classes. The copula approach allows marginal models to be constructed for each variable separately and joined with a dependence structure characterized by a copula. The class of multivariate copulas was limited for a long time to elliptical (including the Gaussian and t-copula) and Archimedean families (such as Clayton and Gumbel copulas). Both classes are rather restrictive with regard to symmetry and tail dependence properties. The class of vine copulas overcomes these limitations by building a multivariate model using only bivariate building blocks. This gives rise to highly flexible models that still allow for computationally tractable estimation and model selection procedures. These features made vine copula models quite popular among applied researchers in numerous areas of science. This article reviews the basic ideas underlying these models, presents estimation and model selection approaches, and discusses current developments and future directions.
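As a toy illustration of the bivariate building blocks and the likelihood-based model selection discussed above, the sketch below (not tied to any particular vine copula package) fits hand-coded Gaussian and Clayton pair-copula densities to pseudo-observations and compares them by AIC; the simulated data and the restriction to two one-parameter families are illustrative assumptions.

```python
# Minimal sketch: select a bivariate copula family by AIC.
# Densities are hand-coded; real applications would use a dedicated vine copula library.
import numpy as np
from scipy.stats import norm, rankdata
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Simulate dependent data from a Gaussian copula (rho = 0.6) for illustration
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500)
u, v = norm.cdf(z[:, 0]), norm.cdf(z[:, 1])
# Pseudo-observations (rank-transformed margins), as is standard in copula estimation
u = rankdata(u) / (len(u) + 1)
v = rankdata(v) / (len(v) + 1)

def gauss_loglik(rho):
    x, y = norm.ppf(u), norm.ppf(v)
    return np.sum(-0.5 * np.log(1 - rho**2)
                  - (rho**2 * (x**2 + y**2) - 2 * rho * x * y) / (2 * (1 - rho**2)))

def clayton_loglik(theta):
    s = u**(-theta) + v**(-theta) - 1.0
    return np.sum(np.log(1 + theta) - (theta + 1) * (np.log(u) + np.log(v))
                  - (2 + 1 / theta) * np.log(s))

fits = {
    "Gaussian": minimize_scalar(lambda p: -gauss_loglik(p), bounds=(-0.99, 0.99), method="bounded"),
    "Clayton": minimize_scalar(lambda p: -clayton_loglik(p), bounds=(0.01, 15.0), method="bounded"),
}
for name, fit in fits.items():
    aic = 2 * 1 - 2 * (-fit.fun)  # one parameter per family
    print(f"{name}: parameter={fit.x:.3f}, AIC={aic:.1f}")
# The family with the smallest AIC would be kept as the pair-copula building block.
```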

19 citations


Journal ArticleDOI
01 Feb 2022-Entropy
TL;DR: The basic ideas underlying the MDL criterion are reviewed and the role of MDL in the selection of the best principal components in the well known PCA is investigated.
Abstract: The minimum description length (MDL) is a powerful criterion for model selection that is gaining increasing interest from both theorists and practitioners. It allows for automatic selection of the best model for representing data without having a priori information about them. It uses both the data and the model complexity, selecting the model that provides the shortest coding length among a predefined set of models. In this paper, we briefly review the basic ideas underlying the MDL criterion and its applications in different fields, with particular reference to the dimension reduction problem. As an example, the role of MDL in the selection of the best principal components in the well known PCA is investigated.
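As one concrete (though not necessarily the authors') instance of MDL-based dimension selection, the sketch below applies the classical Wax-Kailath MDL formula to the eigenvalues of a sample covariance matrix to choose how many principal components to retain; the simulated data and dimensions are purely illustrative.

```python
# Hedged sketch: choose the number of principal components with an MDL-style
# criterion (Wax & Kailath form); data and dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p, k_true = 500, 10, 3

# Low-rank signal plus isotropic noise
signal = rng.normal(size=(n, k_true)) @ rng.normal(size=(k_true, p))
X = signal + 0.3 * rng.normal(size=(n, p))

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order

def mdl(k, lam, n_obs):
    """MDL score for keeping k components (smaller is better)."""
    tail = lam[k:]
    m = len(tail)
    geo = np.exp(np.mean(np.log(tail)))   # geometric mean of discarded eigenvalues
    ari = np.mean(tail)                   # arithmetic mean
    fit_term = -n_obs * m * np.log(geo / ari)
    penalty = 0.5 * k * (2 * len(lam) - k) * np.log(n_obs)
    return fit_term + penalty

scores = [mdl(k, eigvals, n) for k in range(p)]
print("selected number of components:", int(np.argmin(scores)))  # expected ~ k_true
```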

19 citations


Journal ArticleDOI
TL;DR: In this paper, three models with a cross-validation approach perform the required task; feature selection is based on statistical and correlation matrices for multivariate analysis, and extensive comparative findings for the Random Search and Grid Search models are produced using retrieval, F1 score, and precision measurements.
Abstract: Diagnosing cardiovascular disease is one of the biggest medical difficulties of recent years. Coronary heart disease (CHD) is a disease of the heart and blood vessels. Predicting this sort of cardiac illness leads to more precise decisions for cardiac disorders. Implementing Grid Search Optimization (GSO) machine training models is therefore a useful way to forecast the sickness as early as possible. The state-of-the-art contribution is the tuning of hyperparameters together with feature selection, utilizing model search to minimize the false-negative rate. Three models with a cross-validation approach perform the required task. Feature selection is based on statistical and correlation matrices for multivariate analysis. For the Random Search and Grid Search models, extensive comparative findings are produced using retrieval, F1 score, and precision measurements. The models are evaluated using these metrics and kappa statistics, which illustrate the comparability of the three models. The study focuses on optimizing feature selection and tuning hyperparameters to improve model accuracy and the prediction of heart disease, examining Framingham datasets using random forest classification. Tuning the hyperparameters in the grid search model thus decreases the error rate and achieves global optimization.
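A minimal scikit-learn sketch of the kind of grid search with cross-validation described above is given below; the parameter grid, scoring choice, and synthetic data are assumptions for illustration, not the authors' exact pipeline or the Framingham data.

```python
# Illustrative sketch of hyperparameter tuning with grid search and cross-validation
# for a random forest classifier (synthetic data stands in for the Framingham set).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, n_features=15, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5],
}
# F1 as the scoring metric keeps the focus on the false-negative rate for the minority class
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```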

18 citations


Journal ArticleDOI
27 May 2022-Entropy
TL;DR: In this article, the impact of the choice of the item response theory (IRT) model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated.
Abstract: In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students’ performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used as an example of the choice of the IRT model. It turned out that the three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of the ability θ provided the best model fit. Furthermore, model uncertainty was relatively small compared to sampling error regarding country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.

14 citations


Journal ArticleDOI
01 Mar 2022
TL;DR: In this article, the authors consolidate knowledge on the use of machine learning models, techniques, and practices for choice modelling and discuss their potential; however, despite the potential benefits of using advances in machine learning to improve choice modelling practice, the field has so far been hesitant to embrace it.
Abstract: Since its inception, the choice modelling field has been dominated by theory-driven modelling approaches. Machine learning offers an alternative data-driven approach for modelling choice behaviour and is increasingly drawing interest in our field. Cross-pollination of machine learning models, techniques and practices could help overcome problems and limitations encountered in the current theory-driven modelling paradigm, such as subjective labour-intensive search processes for model selection, and the inability to work with text and image data. However, despite the potential benefits of using the advances of machine learning to improve choice modelling practices, the choice modelling field has been hesitant to embrace machine learning. This discussion paper aims to consolidate knowledge on the use of machine learning models, techniques and practices for choice modelling, and discuss their potential. Thereby, we hope not only to make the case that further integration of machine learning in choice modelling is beneficial, but also to further facilitate it. To this end, we clarify the similarities and differences between the two modelling paradigms; we review the use of machine learning for choice modelling; and we explore areas of opportunities for embracing machine learning models and techniques to improve our practices. To conclude this discussion paper, we put forward a set of research questions which must be addressed to better understand if and how machine learning can benefit choice modelling.

14 citations


Journal ArticleDOI
TL;DR: Good agreement is observed between model-predicted and experimentally identified NNMs, which verifies the effectiveness of the proposed approach for nonlinear model updating and model class selection.

14 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a multi-space collaboration (MSC) framework for optimal model selection, which adopts a space separation strategy to perform model selection on sub-spaces, increasing the probability of selecting the optimal model; a subspace elimination strategy is also introduced, so that subspaces with low development potential are gradually eliminated as the iteration progresses.

Journal ArticleDOI
TL;DR: In this paper, a Bayesian model updating and model class selection approach based on nonlinear normal modes (NNMs) is proposed to estimate the joint posterior distribution of unknown model parameters.

Journal ArticleDOI
TL;DR: In this article, the authors used empirical mode decomposition (EMD) to extract and visualize the irreversible component of deformation monitoring sequences for concrete dams, and proposed a new model selection criterion, namely the over-fitting coefficient, for evaluating and selecting the optimum model among candidate monitoring models.

Journal ArticleDOI
TL;DR: It is concluded that deep learning may not be the most suitable approach for developing models to identify AMPs, mainly because shallow models achieve comparable-to-superior performance and are simpler (Ockham's razor principle).
Abstract: In the last few decades, antimicrobial peptides (AMPs) have been explored as an alternative to classical antibiotics, which in turn motivated the development of machine learning models to predict antimicrobial activities in peptides. The first generation of these predictors consisted of what are now known as shallow learning-based models, which require the computation and selection of molecular descriptors to characterize each peptide sequence and train the models. The second generation, deep learning-based models, no longer requires the explicit computation and selection of those descriptors and started to be used for AMP prediction just four years ago. The superior performance claimed for deep models relative to shallow models has created a prevalent inertia toward using deep learning to identify AMPs. However, methodological flaws and/or modeling biases in the building of deep models do not support such superiority. Here, we analyze the main pitfalls that led to biased conclusions about the leading performance of deep models. We also examine whether deep models truly achieve better predictions than shallow models by performing fair studies on different state-of-the-art benchmarking datasets. The experiments reveal that deep models do not outperform shallow models in the classification of AMPs, and that both types of models encode similar chemical information, since their predictions are highly similar. Thus, according to the currently available datasets, we conclude that deep learning may not be the most suitable approach for developing models to identify AMPs, mainly because shallow models achieve comparable-to-superior performance and are simpler (Ockham's razor principle). Even so, we suggest using deep learning only when its capabilities yield performance gains significant enough to be worth the additional computational cost.
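The kind of "fair study" argued for above amounts to evaluating competing models on identical cross-validation splits with the same metric; the short sketch below illustrates the idea with scikit-learn, using a random forest as the shallow model and a small multilayer perceptron as a stand-in for a deep model (synthetic features replace real peptide descriptors).

```python
# Sketch of a like-for-like comparison: identical stratified folds, identical metric.
# Synthetic descriptors stand in for peptide features; models are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=12, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both models

models = {
    "shallow (random forest)": RandomForestClassifier(n_estimators=300, random_state=0),
    "deeper (MLP)": MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```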

Journal ArticleDOI
TL;DR: In this article, a scaled principal component analysis (sPCA) method was introduced to forecast oil volatility, and compared with two commonly used dimensionality reduction methods: PCA and partial least squares (PLS).

Journal ArticleDOI
TL;DR: This work proposes a sparse robust statistical regression framework that considers compositional and non-compositional measurements as predictors and identifies outliers in continuous response variables, and shows the ability of the approach to jointly select a sparse set of predictive microbial features and identify outliers in the response.

Journal ArticleDOI
TL;DR: In this paper, a new form of the Bayesian Information Criterion (BIC) was proposed for order selection in linear regression models where the parameter vector dimension is small compared to the sample size, which eliminates the scaling problem and at the same time is consistent for both large sample sizes and high-SNR scenarios.
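For context, the sketch below shows the standard Gaussian-likelihood BIC applied to order selection in nested polynomial regression models, as a baseline for the kind of criterion the article refines; it is not the new BIC form proposed in the paper, and the simulated data are illustrative.

```python
# Classical BIC order selection for nested polynomial regression models
# (baseline illustration only; not the paper's modified criterion).
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.2 * rng.normal(size=n)  # true order = 2

def bic_for_order(k):
    X = np.vander(x, k + 1, increasing=True)       # columns 1, x, ..., x^k
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n_params = k + 2                                # coefficients plus noise variance
    return n * np.log(rss / n) + n_params * np.log(n)

orders = range(0, 8)
bic = [bic_for_order(k) for k in orders]
print("selected order:", int(np.argmin(bic)))       # expected: 2
```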

Journal ArticleDOI
TL;DR: This work proposes a subdata selection method based on leverage scores, which enables the selection task to be conducted on a small subdata set and not only improves the probability of selecting the best model but also enhances the estimation efficiency.
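A bare-bones version of leverage-based subdata selection is sketched below: compute the hat-matrix diagonal, keep the rows with the largest leverage scores, and fit on that subdata. The data, subdata size, and the simple top-k rule are illustrative assumptions rather than the authors' exact algorithm.

```python
# Sketch: subdata selection by leverage scores for linear regression.
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 100_000, 10, 1_000           # full data size and subdata size (illustrative)
X = rng.standard_t(df=5, size=(n, p))  # heavier tails make leverage scores informative
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Leverage scores h_ii = diag(X (X'X)^{-1} X'), computed via a QR decomposition
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

idx = np.argsort(leverage)[-k:]        # keep the k highest-leverage rows
beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("max |subdata - full data| coefficient gap:", np.abs(beta_sub - beta_full).max())
```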

Journal ArticleDOI
TL;DR: In this article, the authors analyzed the main pitfalls that led to biased conclusions about the leading performance of deep learning models, and examined whether deep models truly achieve better predictions than shallow models by performing fair studies on different state-of-the-art benchmarking datasets.

Journal ArticleDOI
TL;DR: In this article, a mean-squared-error-based criterion is proposed to address the ill-conditioning problem in model-based design of experiments (MBDoE) and is shown to be well suited to ill-conditioned cases.

Posted ContentDOI
19 Feb 2022-bioRxiv
TL;DR: The results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization, but only under the authors' new prior settings; the authors conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied.
Abstract: Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but should be simple enough to not overfit. In phylogenetics, the most common model selection scenario concerns selecting an appropriate substitution and partition model for sequence evolution to infer a phylogenetic tree. Here we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compared posterior estimates of phylogenetic tree topologies and tree length under the true model to those under the most complex model, the GTR+Γ+I substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored four choices of prior distributions: the default substitution model priors of MrBayes, BEAST2 and RevBayes and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization but only under our new prior settings. All three default priors introduced biases for the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied.


Journal ArticleDOI
TL;DR: In this article, a review of post-model-selection inference in linear regression models is presented, incorporating perspectives from high-dimensional inference in these models, together with theoretical insights explaining the phenomena observed in a motivating simulated example.
Abstract: The research on statistical inference after data-driven model selection can be traced as far back as Koopmans (1949). The intensive research on modern model selection methods for high-dimensional data over the past three decades revived the interest in statistical inference after model selection. In recent years, there has been a surge of articles on statistical inference after model selection and now a rather vast literature exists on this topic. Our manuscript aims at presenting a holistic review of post-model-selection inference in linear regression models, while also incorporating perspectives from high-dimensional inference in these models. We first give a simulated example motivating the necessity for valid statistical inference after model selection. We then provide theoretical insights explaining the phenomena observed in the example. This is done through a literature survey on the post-selection sampling distribution of regression parameter estimators and properties of coverage probabilities of naïve confidence intervals. Categorized according to two types of estimation targets, namely the population- and projection-based regression coefficients, we present a review of recent uncertainty assessment methods. We also discuss possible pros and cons for the confidence intervals constructed by different methods. MSC2020 subject classifications: Primary 62F25; secondary 62J07.
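The need for valid post-selection inference can be reproduced with a short simulation in the spirit of the motivating example (the details here are illustrative, not the manuscript's exact setup): under a global null, select the predictor with the largest absolute t-statistic and compute its naive 95% confidence interval; over many replications the interval covers the true zero coefficient well below the nominal rate.

```python
# Simulation sketch: naive confidence intervals after data-driven selection undercover.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, reps = 100, 20, 2000
cover = 0

for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                       # global null: all true coefficients are 0
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    j = np.argmax(np.abs(beta / se))             # data-driven selection of the "best" predictor
    half = stats.t.ppf(0.975, n - p) * se[j]
    cover += (beta[j] - half <= 0.0 <= beta[j] + half)

print(f"naive 95% CI coverage after selection: {cover / reps:.3f}")  # well below 0.95
```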

Journal ArticleDOI
TL;DR: In this article, the authors used the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to select the best variables for training the model, with the best model chosen on the basis of the smallest AIC and BIC values.
Abstract: Data available in software engineering for many applications contain variability, and it is not possible to say which variables help in the prediction process. Most of the work on software defect prediction focuses on selecting the best prediction techniques; for this purpose, deep learning and ensemble models have shown promising results. In contrast, there is very little research that deals with cleaning the training data and selecting the best parameter values from the data. Sometimes the data available for training the models have high variability, and this variability may cause a decrease in model accuracy. To deal with this problem we used the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to select the best variables for training the model. A simple ANN model with one input, one output, and two hidden layers was used for training instead of a very deep and complex model. AIC and BIC values are calculated, and the combination with the minimum AIC and BIC values is selected as the best model. At first, variables were narrowed down to a smaller number using correlation values. Then subsets for all the possible variable combinations were formed. In the end, an artificial neural network (ANN) model was trained for each subset and the best model was selected on the basis of the smallest AIC and BIC values. It was found that the combination of only the two variables ns and entropy is best for software defect prediction, as it gives the minimum AIC and BIC values, while nm and npt is the worst combination, giving the maximum AIC and BIC values.
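The variable-subset search described above can be sketched as follows, with an ordinary linear least-squares fit standing in for the paper's ANN trained per subset (so run times stay trivial); the candidate feature names and the synthetic target are assumptions, and the Gaussian-likelihood AIC/BIC formulas are the standard ones.

```python
# Sketch: exhaustive subset search scored by AIC and BIC.
# A linear least-squares fit stands in for the ANN trained per subset in the paper.
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 300
features = {name: rng.normal(size=n) for name in ["ns", "entropy", "nm", "npt", "loc"]}
# Illustrative target driven by two of the candidates
y = 1.5 * features["ns"] + 0.8 * features["entropy"] + 0.5 * rng.normal(size=n)

def aic_bic(subset):
    X = np.column_stack([np.ones(n)] + [features[f] for f in subset])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1                     # coefficients plus noise variance
    ll_term = n * np.log(rss / n)          # -2 log-likelihood up to an additive constant
    return ll_term + 2 * k, ll_term + k * np.log(n)

results = []
for r in range(1, len(features) + 1):
    for subset in itertools.combinations(features, r):
        aic, bic = aic_bic(subset)
        results.append((aic, bic, subset))

best = min(results)                        # smallest AIC (ties broken by BIC)
print("best subset by AIC:", best[2], "AIC =", round(best[0], 1), "BIC =", round(best[1], 1))
```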

Journal ArticleDOI
TL;DR: Simulation studies and applications to small and large educational data sets with up to 2,485 parameters suggest that the Bayesian approach is more robust against model misspecification due to omitted covariates than ℓ1-penalized nodewise logistic regressions.

Journal ArticleDOI
TL;DR: It is concluded that the proposed method can be used to extend ML‐based covariate selection, and holds potential as a complete full model alternative to classical covariate analyses.
Abstract: In population pharmacokinetic (PK) models, interindividual variability is explained by implementation of covariates in the model. The widely used forward stepwise selection method is sensitive to bias, which may lead to an incorrect inclusion of covariates. Alternatives, such as the full fixed effects model, reduce this bias but are dependent on the chosen implementation of each covariate. As the correct functional forms are unknown, this may still lead to an inaccurate selection of covariates. Machine learning (ML) techniques can potentially be used to learn the optimal functional forms for implementing covariates directly from data. A recent study suggested that using ML resulted in an improved selection of influential covariates. However, how do we select the appropriate functional form for including these covariates? In this work, we use SHapley Additive exPlanations (SHAP) to infer the relationship between covariates and PK parameters from ML models. As a case‐study, we use data from 119 patients with hemophilia A receiving clotting factor VIII concentrate peri‐operatively. We fit both a random forest and a XGBoost model to predict empirical Bayes estimated clearance and central volume from a base nonlinear mixed effects model. Next, we show that SHAP reveals covariate relationships which match previous findings. In addition, we can reveal subtle effects arising from combinations of covariates difficult to obtain using other methods of covariate analysis. We conclude that the proposed method can be used to extend ML‐based covariate selection, and holds potential as a complete full model alternative to classical covariate analyses.
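A compact sketch of the SHAP-based covariate exploration is shown below, using synthetic pharmacokinetic-style data (weight and age driving clearance) rather than the hemophilia A cohort; the covariate names, the allometric-style relationship, and the random forest choice are illustrative assumptions, and the `shap` package is assumed to be installed.

```python
# Sketch: infer covariate-parameter relationships from an ML model with SHAP values.
# Synthetic data; covariate effects and names are illustrative only.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 500
X = pd.DataFrame({
    "weight_kg": rng.uniform(40, 110, n),
    "age_yr": rng.uniform(18, 80, n),
    "height_cm": rng.uniform(150, 200, n),   # included as a near-irrelevant covariate
})
# Clearance rising with weight (allometric-style) and falling mildly with age
clearance = 2.0 * (X["weight_kg"] / 70) ** 0.75 - 0.01 * X["age_yr"] + 0.1 * rng.normal(size=n)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, clearance)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)       # one SHAP value per sample and covariate

# Rank covariates by mean absolute SHAP contribution and inspect the weight effect
importance = np.abs(shap_values).mean(axis=0)
for name, imp in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: mean |SHAP| = {imp:.3f}")
# Plotting shap_values[:, 0] against X["weight_kg"] would reveal the functional form
# of the weight-clearance relationship suggested by the model.
```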

Journal ArticleDOI
07 Feb 2022-Energies
TL;DR: An NILM framework based on low-frequency power data, using a convex hull data selection approach and a hybrid deep learning architecture, is proposed; it compares favorably with the performance of existing approaches.
Abstract: The availability of smart meters and IoT technology has opened new opportunities, ranging from monitoring electrical energy to extracting various types of information related to household occupancy and the frequency of usage of different appliances. Non-intrusive load monitoring (NILM) allows users to disaggregate the usage of each device in the house using the total aggregated power signals collected from a smart meter that is typically installed in the household. It enables the monitoring of domestic appliance use without the need to install individual sensors for each device, thus minimizing electrical system complexities and associated costs. This paper proposes an NILM framework based on low-frequency power data using a convex hull data selection approach and hybrid deep learning architecture. It employs a sliding window of aggregated active and reactive powers sampled at 1 Hz. A randomized approximation convex hull data selection approach performs the selection of the most informative vertices of the real convex hull. The hybrid deep learning architecture is composed of two models: a classification model based on a convolutional neural network trained with a regression model based on a bidirectional long short-term memory neural network. The results obtained on the test dataset demonstrate the effectiveness of the proposed approach, achieving F1 values ranging from 0.95 to 0.99 for the four devices considered and estimation accuracy values between 0.88 and 0.98. These results compare favorably with the performance of existing approaches.
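A stripped-down version of the convex-hull data selection idea is sketched below: treat each (active power, reactive power) sample as a point in 2-D, compute the convex hull with SciPy, and keep the hull vertices as the most informative training points. The paper uses a randomized approximation on windowed features, so this is only a conceptual illustration on synthetic data.

```python
# Conceptual sketch: convex hull based selection of informative (P, Q) samples.
# The actual framework uses a randomized approximate hull on windowed features.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(8)

# Synthetic aggregated active/reactive power samples at 1 Hz (two appliance "modes" plus noise)
base = rng.normal(loc=[300.0, 50.0], scale=[20.0, 5.0], size=(5000, 2))
spikes = rng.normal(loc=[1800.0, 220.0], scale=[60.0, 15.0], size=(300, 2))
pq = np.vstack([base, spikes])

hull = ConvexHull(pq)
informative = pq[hull.vertices]             # hull vertices = boundary, most "extreme" samples
print(f"kept {len(informative)} of {len(pq)} samples for training")
```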


Journal ArticleDOI
TL;DR: In this article, an offline data-driven evolutionary optimization framework based on model selection (MS-DDEO) is proposed, in which a model pool is constructed from four radial basis function models with different degrees of smoothness for model selection.
Abstract: In data-driven evolutionary optimization, since different models are suitable for different types of problems, an appropriate surrogate model to approximate the real objective function is of great significance, especially in offline optimization. In this paper, an offline data-driven evolutionary optimization framework based on model selection (MS-DDEO) is proposed. A model pool is constructed from four radial basis function models with different degrees of smoothness for model selection. Meanwhile, two model selection criteria are designed for offline optimization: the Model Error Criterion uses some top-ranked data as a test set to assess the ability to predict the optimum, and the Distance Deviation Criterion estimates reliability from the distances between the predicted solution and some top-ranked data. Combining the two criteria, we select the most suitable surrogate model for offline optimization. Experiments show that this method can effectively select suitable models for most test problems. Results on the benchmark problems and an airfoil design example show that the proposed algorithm is able to handle offline problems with better optimization performance and less computational cost than other state-of-the-art offline data-driven optimization algorithms.
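A rough sketch of the surrogate model pool and the top-ranked-data test idea is given below using SciPy's RBF interpolator with four kernels of different smoothness; the kernel choices, the split rule, and the toy objective are illustrative assumptions and only loosely mirror the paper's Model Error Criterion.

```python
# Sketch: pick a surrogate from a pool of RBF models of different smoothness,
# scoring each on the top-ranked offline samples (loose reading of the Model Error Criterion).
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(9)

def objective(x):                             # toy offline objective (sphere function)
    return np.sum(x**2, axis=1)

X = rng.uniform(-5, 5, size=(200, 5))         # offline data, no new evaluations allowed
y = objective(X)

# Hold out the top-ranked (best-objective) points as the test set
order = np.argsort(y)
test_idx, train_idx = order[:20], order[20:]

kernels = [("linear", None), ("thin_plate_spline", None), ("cubic", None), ("gaussian", 1.0)]
errors = {}
for kernel, eps in kernels:
    kwargs = {"kernel": kernel}
    if eps is not None:
        kwargs["epsilon"] = eps               # shape parameter required by the Gaussian kernel
    model = RBFInterpolator(X[train_idx], y[train_idx], **kwargs)
    errors[kernel] = np.mean((model(X[test_idx]) - y[test_idx]) ** 2)

best = min(errors, key=errors.get)
print("selected surrogate kernel:", best, "| test errors:", {k: round(v, 2) for k, v in errors.items()})
```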

Journal ArticleDOI
TL;DR: In this paper, the authors show that the model-learning problem displays a transition from a low-noise phase, in which the true model can be learned, to a phase in which the observation noise is too high for the true model to be learned by any method.
Abstract: Given a finite and noisy dataset generated with a closed-form mathematical model, when is it possible to learn the true generating model from the data alone? This is the question we investigate here. We show that this model-learning problem displays a transition from a low-noise phase in which the true model can be learned, to a phase in which the observation noise is too high for the true model to be learned by any method. Both in the low-noise phase and in the high-noise phase, probabilistic model selection leads to optimal generalization to unseen data. This is in contrast to standard machine learning approaches, including artificial neural networks, which in this particular problem are limited, in the low-noise phase, by their ability to interpolate. In the transition region between the learnable and unlearnable phases, generalization is hard for all approaches including probabilistic model selection.