
Showing papers on "Model selection published in 2006"


Journal ArticleDOI
TL;DR: In this paper, instead of selecting factors by stepwise backward elimination, the authors focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection.
Abstract: Summary. We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations with the multifactor analysis-of-variance problem as the most important and well-known example. Instead of selecting factors by stepwise backward elimination, we focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection. The lasso, the LARS algorithm and the non-negative garrotte are recently proposed regression methods that can be used to select individual variables. We study and propose efficient algorithms for the extensions of these methods for factor selection and show that these extensions give superior performance to the traditional stepwise backward elimination method in factor selection problems. We study the similarities and the differences between these methods. Simulations and real examples are used to illustrate the methods.
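Of the three methods named above, the grouped extension of the lasso is the easiest to sketch: the penalty is the sum of the Euclidean norms of the coefficient groups, so entire factors are kept or dropped together. Below is a minimal proximal-gradient illustration in plain NumPy on synthetic data; the penalty level, step size and data are invented for illustration, and this is not the authors' algorithm or code.

```python
import numpy as np

def group_soft_threshold(beta_g, thresh):
    """Shrink a whole coefficient group toward zero (block soft-thresholding)."""
    norm = np.linalg.norm(beta_g)
    if norm <= thresh:
        return np.zeros_like(beta_g)
    return (1.0 - thresh / norm) * beta_g

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for min 0.5*||y - X b||^2 + lam * sum_g ||b_g||_2."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta - step * (X.T @ (X @ beta - y))     # gradient step on the squared loss
        for g in groups:                                 # proximal step, one group at a time
            beta[g] = group_soft_threshold(beta[g], step * lam)
    return beta

# Toy example: three factors of three columns each; only the first factor matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
beta_true = np.array([1.5, -2.0, 1.0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.5 * rng.normal(size=100)
groups = [slice(0, 3), slice(3, 6), slice(6, 9)]
print(np.round(group_lasso(X, y, groups, lam=20.0), 2))   # irrelevant factors -> exactly 0
```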

7,400 citations


Journal Article
TL;DR: It is proved that a single condition, which is called the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large.
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are "irrepresentable" (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.
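The condition itself is easy to state computationally: with C = X'X/n partitioned into the active (1) and inactive (2) blocks, it requires the sup-norm of C21 C11^{-1} sign(beta1) to stay below one. A small NumPy check on two simulated designs (an illustrative sketch, not the paper's code) is given below.

```python
import numpy as np

def irrepresentable(X, active, signs):
    """Return ||C21 C11^{-1} sign(beta_1)||_inf for C = X'X/n; the (strong)
    Irrepresentable Condition asks this quantity to be strictly smaller than 1."""
    n, p = X.shape
    inactive = np.setdiff1d(np.arange(p), active)
    C = X.T @ X / n
    C11 = C[np.ix_(active, active)]
    C21 = C[np.ix_(inactive, active)]
    return float(np.abs(C21 @ np.linalg.solve(C11, signs)).max())

rng = np.random.default_rng(1)
active, signs = [0, 1, 2], np.array([1.0, 1.0, 1.0])

# Nearly orthogonal design: the condition typically holds (value well below 1).
X = rng.normal(size=(500, 10))
print(round(irrepresentable(X, active, signs), 3))

# An inactive predictor that is almost a sum of the active ones: the condition fails.
X[:, 9] = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=500)
print(round(irrepresentable(X, active, signs), 3))
```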

2,803 citations


Book
22 Nov 2006
TL;DR: The Implied Marginal Variance-Covariance Matrix for the Final Model; Diagnostics for the Final Model; Software Notes and Recommendations; Other Analytic Approaches; Recommendations.
Abstract (table of contents):
INTRODUCTION. What Are Linear Mixed Models (LMMs)?; A Brief History of Linear Mixed Models.
LINEAR MIXED MODELS: AN OVERVIEW. Introduction; Specification of LMMs; The Marginal Linear Model; Estimation in LMMs; Computational Issues; Tools for Model Selection; Model-Building Strategies; Checking Model Assumptions (Diagnostics); Other Aspects of LMMs; Power Analysis for Linear Mixed Models; Chapter Summary.
TWO-LEVEL MODELS FOR CLUSTERED DATA: THE RAT PUP EXAMPLE. Introduction; The Rat Pup Study; Overview of the Rat Pup Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; Estimating the Intraclass Correlation Coefficients (ICCs); Calculating Predicted Values; Diagnostics for the Final Model; Software Notes and Recommendations.
THREE-LEVEL MODELS FOR CLUSTERED DATA: THE CLASSROOM EXAMPLE. Introduction; The Classroom Study; Overview of the Classroom Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; Estimating the Intraclass Correlation Coefficients (ICCs); Calculating Predicted Values; Diagnostics for the Final Model; Software Notes; Recommendations.
MODELS FOR REPEATED-MEASURES DATA: THE RAT BRAIN EXAMPLE. Introduction; The Rat Brain Study; Overview of the Rat Brain Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; The Implied Marginal Variance-Covariance Matrix for the Final Model; Diagnostics for the Final Model; Software Notes; Other Analytic Approaches; Recommendations.
RANDOM COEFFICIENT MODELS FOR LONGITUDINAL DATA: THE AUTISM EXAMPLE. Introduction; The Autism Study; Overview of the Autism Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; Calculating Predicted Values; Diagnostics for the Final Model; Software Note: Computational Problems with the D Matrix; An Alternative Approach: Fitting the Marginal Model with an Unstructured Covariance Matrix.
MODELS FOR CLUSTERED LONGITUDINAL DATA: THE DENTAL VENEER EXAMPLE. Introduction; The Dental Veneer Study; Overview of the Dental Veneer Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; The Implied Marginal Variance-Covariance Matrix for the Final Model; Diagnostics for the Final Model; Software Notes and Recommendations; Other Analytic Approaches.
MODELS FOR DATA WITH CROSSED RANDOM FACTORS: THE SAT SCORE EXAMPLE. Introduction; The SAT Score Study; Overview of the SAT Score Data Analysis; Analysis Steps in the Software Procedures; Results of Hypothesis Tests; Comparing Results across the Software Procedures; Interpreting Parameter Estimates in the Final Model; The Implied Marginal Variance-Covariance Matrix for the Final Model; Recommended Diagnostics for the Final Model; Software Notes and Additional Recommendations.
APPENDIX A: STATISTICAL SOFTWARE RESOURCES. APPENDIX B: CALCULATION OF THE MARGINAL VARIANCE-COVARIANCE MATRIX. APPENDIX C: ACRONYMS/ABBREVIATIONS. BIBLIOGRAPHY. INDEX.

1,680 citations


Journal ArticleDOI
TL;DR: It is shown that stepwise regression allows models containing significant predictors to be obtained from each year's data, and that despite their significance the selected models vary substantially between years and suggest patterns that are at odds with those determined by analysing the full, 4-year data set.
Abstract: 1. The biases and shortcomings of stepwise multiple regression are well established within the statistical literature. However, an examination of papers published in 2004 by three leading ecological and behavioural journals suggested that the use of this technique remains widespread: of 65 papers in which a multiple regression approach was used, 57% of studies used a stepwise procedure. 2. The principal drawbacks of stepwise multiple regression include bias in parameter estimation, inconsistencies among model selection algorithms, an inherent (but often overlooked) problem of multiple hypothesis testing, and an inappropriate focus or reliance on a single best model. We discuss each of these issues with examples. 3. We use a worked example of data on yellowhammer distribution collected over 4 years to highlight the pitfalls of stepwise regression. We show that stepwise regression allows models containing significant predictors to be obtained from each year's data. In spite of the significance of the selected models, they vary substantially between years and suggest patterns that are at odds with those determined by analysing the full, 4-year data set. 4. An information theoretic (IT) analysis of the yellowhammer data set illustrates why the varying outcomes of stepwise analyses arise. In particular, the IT approach identifies large numbers of competing models that could describe the data equally well, showing that no one model should be relied upon for inference.
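The information-theoretic alternative sketched in the abstract amounts to fitting every candidate model, computing AIC for each, and inspecting the whole ranked set rather than trusting a single "best" model. A compact all-subsets illustration on simulated data (NumPy only; the data and candidate set are invented) could look like this:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 4))                                   # four candidate predictors
y = 1.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # only two of them matter

def aic_ols(y, X_sub):
    """Gaussian AIC for an OLS fit: n*log(RSS/n) + 2k, up to an additive constant."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])      # intercept plus chosen predictors
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    k = Xd.shape[1] + 1                                 # coefficients + error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

results = []
for size in range(5):
    for subset in itertools.combinations(range(4), size):
        results.append((aic_ols(y, X[:, list(subset)]), subset))

results.sort()
best = results[0][0]
for aic, subset in results:
    print(f"predictors {subset or '(none)'}: AIC = {aic:6.1f}, dAIC = {aic - best:5.1f}")
```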

1,358 citations


Journal ArticleDOI
TL;DR: It is demonstrated that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins.
Abstract: In recent years, model based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner. We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily-chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three Domains of life and our results show that no single amino acid matrix is optimal for any of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins. This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.

1,067 citations


Journal ArticleDOI
TL;DR: KaKs_Calculator implements a set of candidate models in a maximum likelihood framework and adopts the Akaike information criterion to measure fitness between models and data, aiming to include as many features as needed for accurately capturing evolutionary information in protein-coding sequences.

901 citations


Journal ArticleDOI
TL;DR: A detailed analysis reveals that the COSSO does model selection by applying a novel soft thresholding type operation to the function components, which leads naturally to an iterative algorithm.
Abstract: We propose a new method for model selection and model fitting in multivariate nonparametric regression models, in the framework of smoothing spline ANOVA. The "COSSO" is a method of regularization with the penalty functional being the sum of component norms, instead of the squared norm employed in the traditional smoothing spline method. The COSSO provides a unified framework for several recent proposals for model selection in linear models and smoothing spline ANOVA models. Theoretical properties, such as the existence and the rate of convergence of the COSSO estimator, are studied. In the special case of a tensor product design with periodic functions, a detailed analysis reveals that the COSSO does model selection by applying a novel soft thresholding type operation to the function components. We give an equivalent formulation of the COSSO estimator which leads naturally to an iterative algorithm. We compare the COSSO with MARS, a popular method that builds functional ANOVA models, in simulations and real examples. The COSSO method can be extended to classification problems and we compare its performance with those of a number of machine learning algorithms on real datasets. The COSSO gives very competitive performance in these studies.

588 citations


Journal ArticleDOI
TL;DR: This work considers the problem of variable or feature selection for model-based clustering and proposes a greedy search algorithm for finding a local optimum in model space, which consistently yielded more accurate estimates of the number of groups and lower classification error rates.
Abstract: We consider the problem of variable or feature selection for model-based clustering. The problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples and found that removing irrelevant variables often improved performance. Compared with methods based on all of the variables, our variable selection method consistently yielded more accurate estimates of the number of groups and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
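The central comparison in this approach, a clustering that uses a candidate variable versus one that treats it as uninformative, can be imitated crudely with BIC differences from Gaussian mixtures (a BIC difference approximates twice the log Bayes factor). The scikit-learn sketch below is a simplified stand-in, not the authors' greedy algorithm, and the independence assumption in the "drop the variable" model is coarser than their regression formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two clusters separated along x0; x1 is pure noise and irrelevant for clustering.
x0 = np.r_[rng.normal(-2, 1, 150), rng.normal(2, 1, 150)]
x1 = rng.normal(0, 1, 300)
X = np.column_stack([x0, x1])

def best_bic(data, max_components=5):
    """Lowest BIC over the number of mixture components (smaller is better)."""
    return min(GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
               for k in range(1, max_components + 1))

# Model (a): cluster on both variables jointly.
bic_joint = best_bic(X)
# Model (b): cluster on x0 alone and treat x1 as independent noise; under independence
# the BIC of the product model is the sum of the two marginal BICs.
bic_split = best_bic(X[:, [0]]) + best_bic(X[:, [1]])

print("BIC, clustering on (x0, x1):", round(bic_joint, 1))
print("BIC, x1 treated as noise   :", round(bic_split, 1))
print("prefer dropping x1         :", bic_split < bic_joint)
```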

558 citations


Book
01 Jan 2006
TL;DR: This book studies and applies modern flexible regression models for survival data with a special focus on extensions of the Cox model and alternative models with the specific aim of describing time-varying effects of explanatory variables.
Abstract: In survival analysis there has long been a need for models that go beyond the Cox model, as the proportional hazards assumption often fails in practice. This book studies and applies modern flexible regression models for survival data with a special focus on extensions of the Cox model and alternative models with the specific aim of describing time-varying effects of explanatory variables. One model that receives special attention is Aalen’s additive hazards model, which is particularly well suited for dealing with time-varying effects. The book covers the use of residuals and resampling techniques to assess the fit of the models and also points out how the suggested models can be utilised for clustered survival data. The authors demonstrate the practically important aspect of how to do hypothesis testing of time-varying effects, making backward model selection strategies possible for the flexible models considered. The use of the suggested models and methods is illustrated on real data examples. The methods are available in the R package timereg developed by the authors, which is applied throughout the book with worked examples for the data sets. This gives the reader a unique chance of obtaining hands-on experience. This book is well suited for statistical consultants as well as for those who would like to see more about the theoretical justification of the suggested procedures. It can be used as a textbook for a graduate/master course in survival analysis, and students will appreciate the exercises included after each chapter. The applied side of the book with many worked examples accompanied by R code shows in detail how one can analyse real data and at the same time gives a deeper understanding of the underlying theory.

527 citations


Book ChapterDOI
01 Jan 2006
TL;DR: An alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.
Abstract: One of the main topics in the development of predictive models is the identification of variables that are predictors of a given outcome. Automated model selection methods, such as backward or forward stepwise regression, are classical solutions to this problem, but are generally based on strong assumptions about the functional form of the model or the distribution of residuals. In this paper an alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.
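As a concrete, much-simplified illustration of the idea, a random forest's variable importances can rank candidate predictors without assuming a functional form for the model or a residual distribution. The scikit-learn sketch below uses synthetic data and permutation importances; it is not the chapter's own procedure or dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic classification task: 5 informative predictors out of 15.
X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance on held-out data is less biased than impurity importance
# toward high-cardinality features, so rank the variables with it.
perm = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking[:8]:
    print(f"feature {idx:2d}: importance {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```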

440 citations


Journal ArticleDOI
TL;DR: In this article, a new class of semiparametric copula-based multivariate dynamic (SCOMDY) models is introduced, which specify the conditional mean and the conditional variance of a multivariate time series parametrically, but specify the multivariate distribution of the standardized innovations semiparametrically as a parametric copula evaluated at nonparametric marginal distributions.

Journal ArticleDOI
TL;DR: This paper proposes a classification system based on a genetic optimization framework formulated in such a way as to detect the best discriminative features without requiring the a priori setting of their number by the user and to estimate the best SVM parameters in a completely automatic way.
Abstract: Recent remote sensing literature has shown that support vector machine (SVM) methods generally outperform traditional statistical and neural methods in classification problems involving hyperspectral images. However, there are still open issues that, if suitably addressed, could allow further improvement of their performances in terms of classification accuracy. Two especially critical issues are: 1) the determination of the most appropriate feature subspace in which to carry out the classification task and 2) model selection. In this paper, these two issues are addressed through a classification system that optimizes the SVM classifier accuracy for this kind of imagery. This system is based on a genetic optimization framework formulated in such a way as to detect the best discriminative features without requiring the a priori setting of their number by the user and to estimate the best SVM parameters (i.e., regularization and kernel parameters) in a completely automatic way. For these purposes, it exploits fitness criteria intrinsically related to the generalization capabilities of SVM classifiers. In particular, two criteria are explored, namely: 1) the simple support vector count and 2) the radius margin bound. The effectiveness of the proposed classification system in general and of these two criteria in particular is assessed both by simulated and real experiments. In addition, a comparison with classification approaches based on three different feature selection methods is reported, i.e., the steepest ascent (SA) algorithm and two other methods explicitly developed for SVM classifiers, namely: 1) the recursive feature elimination technique and 2) the radius margin bound minimization method.
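Of the two fitness criteria explored, the support vector count is the simpler: at comparable training accuracy, fewer support vectors loosely indicates better generalization. The sketch below scores a small random population of (feature subset, C, gamma) candidates by that criterion; it is a bare-bones stand-in for the paper's genetic optimization framework (no selection, crossover or mutation), and all data and settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=0)

def fitness(mask, C, gamma):
    """Smaller is better: number of support vectors of the trained SVM
    (one of the two generalization-related criteria mentioned in the abstract)."""
    if not mask.any():
        return np.inf
    clf = SVC(C=C, gamma=gamma).fit(X[:, mask], y)
    return clf.n_support_.sum()

# "Population" of random candidates; a genetic algorithm would evolve these with
# selection, crossover and mutation instead of sampling them once.
population = [(rng.random(20) < 0.5,            # feature subset mask
               10.0 ** rng.uniform(-1, 3),      # regularization parameter C
               10.0 ** rng.uniform(-3, 0))      # RBF kernel parameter gamma
              for _ in range(30)]

scored = sorted(population, key=lambda cand: fitness(*cand))
best_mask, best_C, best_gamma = scored[0]
print("selected features:", np.flatnonzero(best_mask))
print("C=%.3g gamma=%.3g, #SV=%d" % (best_C, best_gamma, fitness(best_mask, best_C, best_gamma)))
```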

Journal ArticleDOI
TL;DR: A hybrid model is proposed for supporting the vendor selection process in new task situations, combining the multi-criteria decision-making approach with a five-step hybrid process that incorporates the technique of an analytic network process (ANP).

Journal ArticleDOI
TL;DR: The use of the Akaike information criterion (AIC) in model selection and inference, as well as the interpretation of results analysed in this framework, is illustrated with two real herpetological data sets.
Abstract: In ecology, researchers frequently use observational studies to explain a given pattern, such as the number of individuals in a habitat patch, with a large number of explanatory (i.e., independent) variables. To elucidate such relationships, ecologists have long relied on hypothesis testing to include or exclude variables in regression models, although the conclusions often depend on the approach used (e.g., forward, backward, stepwise selection). Though better tools have surfaced in the mid-1970s, they are still underutilized in certain fields, particularly in herpetology. This is the case of the Akaike information criterion (AIC), which is remarkably superior to hypothesis-based approaches for model selection (i.e., variable selection). It is simple to compute and easy to understand, but more importantly, for a given data set, it provides a measure of the strength of evidence for each model that represents a plausible biological hypothesis relative to the entire set of models considered. Using this approach, one can then compute a weighted average of the estimate and standard error for any given variable of interest across all the models considered. This procedure, termed model-averaging or multimodel inference, yields precise and robust estimates. In this paper, I illustrate the use of the AIC in model selection and inference, as well as the interpretation of results analysed in this framework with two real herpetological data sets. The AIC and measures derived from it should be routinely adopted by herpetologists.
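The multimodel-inference step described here, turning AIC differences into Akaike weights and averaging an estimate across models, is a few lines of arithmetic. A NumPy sketch with made-up AIC values and coefficient estimates (illustrative numbers only, not the paper's herpetological data) follows; the unconditional standard error uses the familiar Burnham-Anderson form.

```python
import numpy as np

# Hypothetical candidate models: their AIC scores and each model's estimate
# (with standard error) of one coefficient of interest.
aic = np.array([210.3, 211.1, 214.7, 219.2])
beta_hat = np.array([0.42, 0.35, 0.51, 0.12])
se_hat = np.array([0.10, 0.11, 0.15, 0.20])

delta = aic - aic.min()                       # AIC differences
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                      # Akaike weights, sum to 1

beta_avg = np.sum(weights * beta_hat)         # model-averaged estimate
# Unconditional SE incorporating model-selection uncertainty.
se_avg = np.sum(weights * np.sqrt(se_hat**2 + (beta_hat - beta_avg)**2))

for w, d in zip(weights, delta):
    print(f"dAIC={d:4.1f}  weight={w:.3f}")
print(f"model-averaged estimate: {beta_avg:.3f} (unconditional SE {se_avg:.3f})")
```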

Journal ArticleDOI
01 Oct 2006 - Ecology
TL;DR: The usefulness of the weighted BIC (Bayesian information criterion) is suggested as a computationally simple alternative to AIC, based on explicit selection of prior model probabilities rather than acceptance of default priors associated with AIC.
Abstract: Statistical thinking in wildlife biology and ecology has been profoundly influenced by the introduction of AIC (Akaike's information criterion) as a tool for model selection and as a basis for model averaging. In this paper, we advocate the Bayesian paradigm as a broader framework for multimodel inference, one in which model averaging and model selection are naturally linked, and in which the performance of AIC-based tools is naturally evaluated. Prior model weights implicitly associated with the use of AIC are seen to highly favor complex models: in some cases, all but the most highly parameterized models in the model set are virtually ignored a priori. We suggest the usefulness of the weighted BIC (Bayesian information criterion) as a computationally simple alternative to AIC, based on explicit selection of prior model probabilities rather than acceptance of default priors associated with AIC. We note, however, that both procedures are only approximations to the use of exact Bayes factors. We discuss and illustrate technical difficulties associated with Bayes factors, and suggest approaches to avoiding these difficulties in the context of model selection for a logistic regression. Our example highlights the predisposition of AIC weighting to favor complex models and suggests a need for caution in using the BIC for computing approximate posterior model weights.
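The contrast drawn here between Akaike weights and BIC-based approximate posterior model probabilities is easy to see numerically: for the same maximized log-likelihoods, BIC charges (log n)/2 per parameter instead of 1, so its weights lean toward simpler models. A small illustrative calculation with invented log-likelihoods:

```python
import numpy as np

n = 200                                        # sample size (hypothetical)
loglik = np.array([-310.0, -308.5, -307.9])    # maximized log-likelihoods (hypothetical)
k = np.array([2, 4, 7])                        # numbers of parameters

aic = -2 * loglik + 2 * k
bic = -2 * loglik + k * np.log(n)

def ic_weights(ic):
    """Turn information-criterion values into normalized model weights."""
    w = np.exp(-0.5 * (ic - ic.min()))
    return w / w.sum()

print("AIC weights:", np.round(ic_weights(aic), 3))
print("BIC weights:", np.round(ic_weights(bic), 3))   # lean toward the simpler models
```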

Journal ArticleDOI
TL;DR: It is shown how one can estimate the shape of a thermal performance curve using information theory, which ranks plausible models by their Akaike information criterion (AIC), a measure of a model's ability to describe the data discounted by the model's complexity.

Journal ArticleDOI
TL;DR: Analyses of datasets simulated under a range of rate-variable diversification scenarios indicate that the birth-death likelihood method has much greater power to detect variation in diversification rates when extinction is present, and appears to be the only approach available that can distinguish between a temporal increase in diversification rates and a rate-constant model with nonzero extinction.
Abstract: Maximum likelihood is a potentially powerful approach for investigating the tempo of diversification using molecular phylogenetic data. Likelihood methods distinguish between rate-constant and rate-variable models of diversification by fitting birth-death models to phylogenetic data. Because model selection in this context is a test of the null hypothesis that diversification rates have been constant over time, strategies for selecting best-fit models must minimize Type I error rates while retaining power to detect rate variation when it is present. Here I examine model selection, parameter estimation, and power to reject the null hypothesis using likelihood models based on the birth-death process. The Akaike information criterion (AIC) has often been used to select among diversification models; however, I find that selecting models based on the lowest AIC score leads to a dramatic inflation of the Type I error rate. When appropriately corrected to reduce Type I error rates, the birth-death likelihood approach performs as well or better than the widely used gamma statistic, at least when diversification rates have shifted abruptly over time. Analyses of datasets simulated under a range of rate-variable diversification scenarios indicate that the birth-death likelihood method has much greater power to detect variation in diversification rates when extinction is present. Furthermore, this method appears to be the only approach available that can distinguish between a temporal increase in diversification rates and a rate-constant model with nonzero extinction. I illustrate use of the method by analyzing a published phylogeny for Australian agamid lizards.

BookDOI
01 Dec 2006
TL;DR: In this paper, a comprehensive overview of statistical challenges with high dimensionality in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management, is presented.
Abstract: Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of high dimensionality arise in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. We first give a comprehensive overview of statistical challenges with high dimensionality in these diverse disciplines. We then approach the problem of variable selection and feature extraction using a unified framework: penalized likelihood methods. Issues relevant to the choice of penalty functions are addressed. We demonstrate that for a host of statistical problems, as long as the dimensionality is not excessively large, we can estimate the model parameters as well as if the best model is known in advance. The persistence property in risk minimization is also addressed. The applicability of such a theory and method to diverse statistical problems is demonstrated. Other related problems with high dimensionality are also discussed.

Journal ArticleDOI
TL;DR: In this article, a new evidence algorithm known as nested sampling is proposed that combines accuracy, generality of application, and computational feasibility, and it is applied to some cosmological data sets and models.
Abstract: The abundance of cosmological data becoming available means that a wider range of cosmological models are testable than ever before. However, an important distinction must be made between parameter fitting and model selection. While parameter fitting simply determines how well a model fits the data, model selection statistics, such as the Bayesian evidence, are now necessary to choose between these different models, and in particular to assess the need for new parameters. We implement a new evidence algorithm known as nested sampling, which combines accuracy, generality of application, and computational feasibility, and we apply it to some cosmological data sets and models. We find that a five-parameter model with a Harrison-Zel'dovich initial spectrum is currently preferred.
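Nested sampling estimates the evidence Z = integral of L(theta) * prior(theta) by shrinking the prior volume geometrically while recording the likelihood of the worst live point. The toy NumPy sketch below does this for a one-dimensional Gaussian likelihood under a uniform prior, using rejection sampling for the constrained draw (viable only in toy problems); it is a pedagogical illustration, not the implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: uniform prior on [-5, 5], Gaussian likelihood centred at 1 (sd 0.3),
# so the evidence is roughly Z = 1/10 and log Z is about -2.30.
LO, HI = -5.0, 5.0

def loglike(theta):
    return -0.5 * ((theta - 1.0) / 0.3) ** 2 - np.log(0.3 * np.sqrt(2 * np.pi))

def nested_sampling(n_live=100, n_iter=700):
    theta = rng.uniform(LO, HI, n_live)              # live points drawn from the prior
    logl = loglike(theta)
    log_terms = []
    for i in range(1, n_iter + 1):
        worst = int(np.argmin(logl))
        # Prior volume shrinks geometrically: X_i ~ exp(-i / n_live).
        w = np.exp(-(i - 1) / n_live) - np.exp(-i / n_live)
        log_terms.append(logl[worst] + np.log(w))
        # Replace the worst point with a prior draw above its likelihood (rejection).
        l_min = logl[worst]
        while True:
            cand = rng.uniform(LO, HI)
            if loglike(cand) > l_min:
                theta[worst], logl[worst] = cand, loglike(cand)
                break
    # Termination: spread the remaining prior volume over the surviving live points.
    x_final = np.exp(-n_iter / n_live)
    log_terms.extend(logl + np.log(x_final / n_live))
    m = max(log_terms)
    return m + np.log(np.sum(np.exp(np.array(log_terms) - m)))   # log-sum-exp -> log Z

print("nested-sampling log Z:", round(nested_sampling(), 2))
print("analytic        log Z:", round(np.log(0.1), 2))
```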

Journal ArticleDOI
David Posada
TL;DR: The ModelTest server is a web-based application for the selection of models of nucleotide substitution using the program ModelTest, which takes as input a text file with likelihood scores for the set of candidate models.
Abstract: ModelTest server is a web-based application for the selection of models of nucleotide substitution using the program ModelTest. The server takes as input a text file with likelihood scores for the set of candidate models. Models can be selected with hierarchical likelihood ratio tests, or with the Akaike or Bayesian information criteria. The output includes several statistics for the assessment of model selection uncertainty, for model averaging or to estimate the relative importance of model parameters. The server can be accessed at http://darwin.uvigo.es/software/modeltest_server.html.
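Given the likelihood scores the server takes as input, the AIC/BIC part of its output is straightforward arithmetic. The sketch below uses invented log-likelihoods, parameter counts and alignment length for three commonly named substitution models; only the formulas, not the numbers, should be taken seriously.

```python
import numpy as np

n_sites = 1200                                   # alignment length (hypothetical)
models = ["JC69", "HKY85", "GTR+G"]
loglik = np.array([-4100.2, -4052.8, -4047.1])   # hypothetical ML scores per model
n_params = np.array([0, 4, 9])                   # hypothetical free model parameters

aic = -2 * loglik + 2 * n_params
bic = -2 * loglik + n_params * np.log(n_sites)

for name, a, b in zip(models, aic, bic):
    print(f"{name:7s} AIC={a:8.1f} dAIC={a - aic.min():6.1f} "
          f"BIC={b:8.1f} dBIC={b - bic.min():6.1f}")
```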

Journal ArticleDOI
TL;DR: Alternatives to hypothesis testing are reviewed including techniques for parameter estimation and model selection using likelihood and Bayesian techniques, which hold promise for new insight in ecology by encouraging thoughtful model building as part of inquiry.
Abstract: Statistical methods emphasizing formal hypothesis testing have dominated the analyses used by ecologists to gain insight from data. Here, we review alternatives to hypothesis testing including techniques for parameter estimation and model selection using likelihood and Bayesian techniques. These methods emphasize evaluation of weight of evidence for multiple hypotheses, multimodel inference, and use of prior information in analysis. We provide a tutorial for maximum likelihood estimation of model parameters and model selection using information theoretics, including a brief treatment of procedures for model comparison, model averaging, and use of data from multiple sources. We discuss the advantages of likelihood estimation, Bayesian analysis, and meta-analysis as ways to accumulate understanding across multiple studies. These statistical methods hold promise for new insight in ecology by encouraging thoughtful model building as part of inquiry, providing a unified framework for the empirical analysis of theoretical models, and by facilitating the formal accumulation of evidence bearing on fundamental questions.

Journal ArticleDOI
TL;DR: In this paper, the authors apply information theory to the modelling of fish growth to show how model selection uncertainty may be taken into account when estimating growth parameters, and apply a multi-model inference approach for making robust parameter estimates and for dealing with uncertainty in model selection.

Journal ArticleDOI
TL;DR: A Dynamically Multi-Linked Hidden Markov Model (DML-HMM) is developed based on the discovery of salient dynamic interlinks among multiple temporal processes corresponding to multiple event classes resulting in its topology being intrinsically determined by the underlying causality and temporal order among events.
Abstract: In this work, we present a unified bottom-up and top-down automatic model-selection-based approach for modelling complex activities of multiple objects in cluttered scenes. An activity of multiple objects is represented based on discrete scene events, and their behaviours are modelled by reasoning about the temporal and causal correlations among different events. This is significantly different from the majority of the existing techniques that are centred on object tracking followed by trajectory matching. In our approach, object-independent events are detected and classified by unsupervised clustering using Expectation-Maximisation (EM) together with automatic model selection based on Schwarz's Bayesian Information Criterion (BIC). Dynamic Probabilistic Networks (DPNs) are formulated for modelling the temporal and causal correlations among discrete events for robust and holistic scene-level behaviour interpretation. In particular, we developed a Dynamically Multi-Linked Hidden Markov Model (DML-HMM) based on the discovery of salient dynamic interlinks among multiple temporal processes corresponding to multiple event classes. A DML-HMM is built using BIC-based factorisation, resulting in its topology being intrinsically determined by the underlying causality and temporal order among events. Extensive experiments are conducted on modelling activities captured in different indoor and outdoor scenes. Our experimental results demonstrate that the performance of a DML-HMM on modelling group activities in a noisy and cluttered scene is superior to those of other comparable dynamic probabilistic networks, including a Multi-Observation Hidden Markov Model (MOHMM), a Parallel Hidden Markov Model (PaHMM) and a Coupled Hidden Markov Model (CHMM).

Journal ArticleDOI
TL;DR: A simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffrey's prior, and the improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step.
Abstract: Motivation: Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffrey's prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification. Results: The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also demonstrates better estimates of conditional probability than the RVM, which are of great importance in medical applications, with similar computational expense. Availability: A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/ Contact: [email protected]

Proceedings ArticleDOI
30 Oct 2006
TL;DR: A simple and efficient approach to model selection for weighted least-squares support vector machines, and compares a variety of model selection criteria based on leave-one-out cross-validation.
Abstract: While the model parameters of many kernel learning methods are given by the solution of a convex optimisation problem, the selection of good values for the kernel and regularisation parameters, i.e. model selection, is much less straight-forward. This paper describes a simple and efficient approach to model selection for weighted least-squares support vector machines, and compares a variety of model selection criteria based on leave-one-out cross-validation. An external cross-validation procedure is used for performance estimation, with model selection performed independently in each fold to avoid selection bias. The best entry based on these methods was ranked in joint first place in the WCCI-2006 performance prediction challenge, demonstrating the effectiveness of this approach.
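The two ingredients described here, leave-one-out cross-validation for choosing the kernel and regularization parameters, wrapped in an outer cross-validation so that the reported performance is free of selection bias, map directly onto standard tooling. The scikit-learn sketch below uses kernel ridge regression as a stand-in for the weighted LS-SVM (which scikit-learn does not provide); the grids and data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

param_grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0],      # regularization parameter
              "gamma": [1e-3, 1e-2, 1e-1, 1.0]}      # RBF kernel parameter

# Inner loop: model selection by leave-one-out cross-validation.
inner = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                     cv=LeaveOneOut(), scoring="neg_mean_squared_error")

# Outer loop: performance estimation, with model selection redone in every fold
# so that the reported score is free of selection bias.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0),
                               scoring="neg_mean_squared_error")
print("nested-CV MSE: %.1f +/- %.1f" % (-outer_scores.mean(), outer_scores.std()))
```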

Journal ArticleDOI
TL;DR: In this paper, a selection of eight high performance clear sky solar irradiance models is evaluated against a set of 16 independent data banks covering 20 years/stations, altitudes from sea level to 1600 m and a large range of different climates.

Journal ArticleDOI
TL;DR: A general framework that can account for several priors in a common inverse solution is described, and it is shown how model selection, using the log-evidence, can be used to select the best combination of priors.

Journal ArticleDOI
TL;DR: The results indicate that the information criterion approach appears to provide the best basis for an automated approach to method selection, provided that it is based on Akaike's information criterion.

01 Jan 2006
TL;DR: The choice of search strategy and the actual simulation performance of the algorithm are discussed and the quick modeller is outlined, distinguishing between the costs of search and costs of inference.
Abstract: Model selection is ubiquitous as we simply do not know the underlying data generating process. However, substantial criticisms are targeted at many model selection procedures. All of these criticisms can be refuted by an automatic model selection algorithm called PcGets. This algorithm is an Ox package (see Doornik, 2001) implementing automatic general-to-specific model selection for linear regression models using a multi-path exploration approach: see Hendry and Krolzig (2001). This article sketches the algorithm, distinguishing between the costs of search and the costs of inference. The choice of search strategy and the actual simulation performance of the algorithm are discussed, and we outline the quick modeller and directions for future research.

Journal ArticleDOI
TL;DR: This paper offers a heuristic derivation of the AIC in this context and provides simulation results that show that using AIC for a geostatistical model is superior to the often-used traditional approach of ignoring spatial correlation in the selection of explanatory variables.
Abstract: We consider the problem of model selection for geospatial data. Spatial correlation is often ignored in the selection of explanatory variables, and this can influence model selection results. For example, the importance of particular explanatory variables may not be apparent when spatial correlation is ignored. To address this problem, we consider the Akaike Information Criterion (AIC) as applied to a geostatistical model. We offer a heuristic derivation of the AIC in this context and provide simulation results that show that using AIC for a geostatistical model is superior to the often-used traditional approach of ignoring spatial correlation in the selection of explanatory variables. These ideas are further demonstrated via a model for lizard abundance. We also apply the principle of minimum description length (MDL) to variable selection for the geostatistical model. The effect of sampling design on the selection of explanatory covariates is also explored. R software to implement the geostatistical model selection methods described in this paper is available in the Supplement.