
Showing papers on "Model selection" published in 2007


BookDOI
04 Oct 2007
TL;DR: This reference work and graduate level textbook considers a wide range of models and methods for analyzing and forecasting multiple time series, which include vector autoregressive, cointegrated, vector autoregressive moving average, multivariate ARCH and periodic processes as well as dynamic simultaneous equations and state space models.
Abstract: This reference work and graduate level textbook considers a wide range of models and methods for analyzing and forecasting multiple time series. The models covered include vector autoregressive, cointegrated, vector autoregressive moving average, multivariate ARCH and periodic processes as well as dynamic simultaneous equations and state space models. Least squares, maximum likelihood, and Bayesian methods are considered for estimating these models. Different procedures for model selection and model specification are treated and a wide range of tests and criteria for model checking are introduced. Causality analysis, impulse response analysis and innovation accounting are presented as tools for structural analysis. The book is accessible to graduate students in business and economics. In addition, multiple time series courses in other fields such as statistics and engineering may be based on it. Applied researchers involved in analyzing multiple time series may benefit from the book as it provides the background and tools for their tasks. It bridges the gap to the difficult technical literature on the topic.
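Because the abstract centres on order selection and impulse response analysis for multivariate time series, a minimal illustrative sketch follows. It uses statsmodels (not a tool from the book); the simulated `data` array and the lag limit are assumptions made purely for demonstration.

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Toy three-variable VAR(1) data; in practice 'data' is a (T x k) array of
# stationary (e.g. differenced) series.
rng = np.random.default_rng(0)
data = np.zeros((200, 3))
for t in range(1, 200):
    data[t] = 0.5 * data[t - 1] + rng.standard_normal(3)

model = VAR(data)
order = model.select_order(maxlags=8)      # tabulates AIC, BIC, HQIC and FPE by lag
print(order.selected_orders)               # lag length chosen by each criterion

results = model.fit(maxlags=8, ic='bic')   # estimate the VAR at the BIC-chosen lag
irf = results.irf(10)                      # impulse responses for structural analysis
```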

5,244 citations


Journal ArticleDOI
TL;DR: The implementation of the penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model is nontrivial, but it is shown that the computation can be done effectively by taking advantage of the efficient maxdet algorithm developed in convex optimization.
Abstract: SUMMARY We propose penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model. The methods lead to a sparse and shrinkage estimator of the concentration matrix that is positive definite, and thus conduct model selection and estimation simultaneously. The implementation of the methods is nontrivial because of the positive definite constraint on the concentration matrix, but we show that the computation can be done effectively by taking advantage of the efficient maxdet algorithm developed in convex optimization. We propose a BIC-type criterion for the selection of the tuning parameter in the penalized likelihood methods. The connection between our methods and existing methods is illustrated. Simulations and real examples demonstrate the competitive performance of the new methods.
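The maxdet-based computation of the paper is not reproduced here; purely as an illustration of the same ingredients (an l1-penalized Gaussian likelihood for the concentration matrix plus a BIC-type choice of the tuning parameter), a scikit-learn sketch follows. The penalty grid and the degrees-of-freedom count in the score are assumptions for the example, not the paper's exact criterion.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_sparse_spd_matrix

# Simulate data from a sparse precision (concentration) matrix.
rng = np.random.default_rng(0)
prec_true = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)
X = rng.multivariate_normal(np.zeros(10), np.linalg.inv(prec_true), size=200)
n = X.shape[0]
S = np.cov(X, rowvar=False)

def bic_type_score(theta):
    # Gaussian log-likelihood (up to constants) penalized by log(n) times the
    # number of nonzero off-diagonal entries -- an illustrative BIC-type score.
    _, logdet = np.linalg.slogdet(theta)
    loglik = 0.5 * n * (logdet - np.trace(S @ theta))
    df = np.count_nonzero(np.triu(theta, k=1))
    return -2.0 * loglik + np.log(n) * df

fits = {a: GraphicalLasso(alpha=a, max_iter=200).fit(X) for a in (0.01, 0.05, 0.1, 0.2)}
best_alpha = min(fits, key=lambda a: bic_type_score(fits[a].precision_))
print("selected tuning parameter:", best_alpha)
```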

1,824 citations


Book
Peter Grünwald1
23 Mar 2007
TL;DR: The minimum description length (MDL) principle as mentioned in this paper is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning, which is particularly well suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern.
Abstract: The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
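As a rough illustration of the idea (the crude two-part form only, not the refined universal-coding versions developed in the book), MDL selects the model class that minimizes the total code length of model plus data:

\[
\hat{M} \;=\; \arg\min_{M}\ \bigl\{\, L(M) + L(D \mid M) \,\bigr\},
\]

where L(M) is the number of bits needed to describe the model and L(D | M) the number of bits needed to describe the data with the model's help. For a k-parameter model, a common asymptotic approximation takes L(D | M) as the negative maximized log-likelihood and L(M) as (k/2) log n, which recovers a BIC-like criterion.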

1,270 citations


Posted Content
01 Jan 2007
TL;DR: In this article, the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse is formulated as a maximum likelihood problem with an added l1-norm penalty term.
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added l1-norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive l1-norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan [2006]), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data.
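For the Gaussian case, the penalized problem described above can be written, up to notation, as maximizing the log-likelihood of the precision matrix with an elementwise l1 penalty, where S is the empirical covariance and lambda > 0 the penalty level:

\[
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}\ \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda \lVert \Theta \rVert_{1},
\]

with the l1 norm taken as the sum of absolute values of the entries; the two algorithms in the paper are different ways of solving this convex program at scale.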

1,172 citations


Journal ArticleDOI
TL;DR: It is shown how the ReML objective function can be adjusted to provide an approximation to the log-evidence for a particular model, which means ReML can be used for model selection, specifically to select or compare models with different covariance components.

843 citations


Journal ArticleDOI
01 Sep 2007-Energy
TL;DR: This study presents three modeling techniques for the prediction of electricity energy consumption, including decision trees and neural networks, with model selection based on the square root of average squared error.
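The selection metric referred to above is simply the root of the average squared prediction error over the n evaluation cases:

\[
\mathrm{RMSE} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}} .
\]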

800 citations


Proceedings Article
03 Dec 2007
TL;DR: This paper proposes a direct importance estimation method that does not involve density estimation and is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized.
Abstract: A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent—weighted variants according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task particularly in high dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized. Simulations illustrate the usefulness of our approach.
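The role of the importance can be made explicit with the standard weighted-likelihood identity (general background, not a result specific to this paper): with training input density p_tr(x), test input density p_te(x) and importance w(x) = p_te(x)/p_tr(x), the weighted maximum likelihood estimator

\[
\hat{\theta} \;=\; \arg\max_{\theta}\ \sum_{i=1}^{n} w(x_i)\,\log p(y_i \mid x_i; \theta)
\]

remains consistent under covariate shift, which is why estimating w(x) accurately (here done directly, without separate density estimation) is the central task.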

785 citations


Book
12 Sep 2007
TL;DR: A generalized information criterion (GIC) and a bootstrap information criterion are presented, which provide unified tools for modeling and model evaluation for a diverse range of models, including various types of nonlinear models and model estimation procedures such as robust estimation, the maximum penalized likelihood method and a Bayesian approach.
Abstract: The Akaike information criterion (AIC) derived as an estimator of the Kullback-Leibler information discrepancy provides a useful tool for evaluating statistical models, and numerous successful applications of the AIC have been reported in various fields of natural sciences, social sciences and engineering. One of the main objectives of this book is to provide comprehensive explanations of the concepts and derivations of the AIC and related criteria, including Schwarz's Bayesian information criterion (BIC), together with a wide range of practical examples of model selection and evaluation criteria. A secondary objective is to provide a theoretical basis for the analysis and extension of information criteria via a statistical functional approach. A generalized information criterion (GIC) and a bootstrap information criterion are presented, which provide unified tools for modeling and model evaluation for a diverse range of models, including various types of nonlinear models and model estimation procedures such as robust estimation, the maximum penalized likelihood method and a Bayesian approach.
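For reference, the two criteria at the centre of the book, for a model with k free parameters, maximized likelihood \(\hat{L}\) and sample size n, are

\[
\mathrm{AIC} = -2\log\hat{L} + 2k, \qquad \mathrm{BIC} = -2\log\hat{L} + k\log n .
\]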

750 citations


Book
16 Apr 2007
TL;DR: In this book, the authors give a comprehensive treatment of the analysis of incomplete data, covering simple methods, direct likelihood, the EM algorithm, multiple imputation, weighted estimating equations, models for data that are missing not at random, and sensitivity analysis.
Abstract: Preface. Acknowledgements. I Preliminaries. 1 Introduction. 1.1 From Imbalance to the Field of Missing Data Research. 1.2 Incomplete Data in Clinical Studies. 1.3 MAR, MNAR, and Sensitivity Analysis. 1.4 Outline of the Book. 2 Key Examples. 2.1 Introduction. 2.2 The Vorozole Study. 2.3 The Orthodontic Growth Data. 2.4 Mastitis in Dairy Cattle. 2.5 The Depression Trials. 2.6 The Fluvoxamine Trial. 2.7 The Toenail Data. 2.8 Age-Related Macular Degeneration Trial. 2.9 The Analgesic Trial. 2.10 The Slovenian Public Opinion Survey. 3 Terminology and Framework. 3.1 Modelling Incompleteness. 3.2 Terminology. 3.3 Missing Data Frameworks. 3.4 Missing Data Mechanisms. 3.5 Ignorability. 3.6 Pattern-Mixture Models. II Classical Techniques and the Need for Modelling. 4 A Perspective on Simple Methods. 4.1 Introduction. 4.2 Simple Methods. 4.3 Problems with Complete Case Analysis and Last Observation Carried Forward. 4.4 Using the Available Cases: a Frequentist versus a Likelihood Perspective. 4.5 Intention to Treat. 4.6 Concluding Remarks. 5 Analysis of the Orthodontic Growth Data. 5.1 Introduction and Models. 5.2 The Original, Complete Data. 5.3 Direct Likelihood. 5.4 Comparison of Analyses. 5.5 Example SAS Code for Multivariate Linear Models. 5.6 Comparative Power under Different Covariance Structures. 5.7 Concluding Remarks. 6 Analysis of the Depression Trials. 6.1 View 1: Longitudinal Analysis. 6.2 Views 2a and 2b and All versus Two Treatment Arms. III Missing at Random and Ignorability. 7 The Direct Likelihood Method. 7.1 Introduction. 7.2 Ignorable Analyses in Practice. 7.3 The Linear Mixed Model. 7.4 Analysis of the Toenail Data. 7.5 The Generalized Linear Mixed Model. 7.6 The Depression Trials. 7.7 The Analgesic Trial. 8 The Expectation-Maximization Algorithm. 8.1 Introduction. 8.2 The Algorithm. 8.3 Missing Information. 8.4 Rate of Convergence. 8.5 EM Acceleration. 8.6 Calculation of Precision Estimates. 8.7 A Simple Illustration. 8.8 Concluding Remarks. 9 Multiple Imputation. 9.1 Introduction. 9.2 The Basic Procedure. 9.3 Theoretical Justification. 9.4 Inference under Multiple Imputation. 9.5 Efficiency. 9.6 Making Proper Imputations. 9.7 Some Roles for Multiple Imputation. 9.8 Concluding Remarks. 10 Weighted Estimating Equations. 10.1 Introduction. 10.2 Inverse Probability Weighting. 10.3 Generalized Estimating Equations for Marginal Models. 10.4 Weighted Generalized Estimating Equations. 10.5 The Depression Trials. 10.6 The Analgesic Trial. 10.7 Double Robustness. 10.8 Concluding Remarks. 11 Combining GEE and MI. 11.1 Introduction. 11.2 Data Generation and Fitting. 11.3 MI-GEE and MI-Transition. 11.4 An Asymptotic Simulation Study. 11.5 Concluding Remarks. 12 Likelihood-Based Frequentist Inference. 12.1 Introduction. 12.2 Information and Sampling Distributions. 12.3 Bivariate Normal Data. 12.4 Bivariate Binary Data. 12.5 Implications for Standard Software. 12.6 Analysis of the Fluvoxamine Trial. 12.7 The Muscatine Coronary Risk Factor Study. 12.8 The Crepeau Data. 12.9 Concluding Remarks. 13 Analysis of the Age-Related Macular Degeneration Trial. 13.1 Introduction. 13.2 Direct Likelihood Analysis of the Continuous Outcome. 13.3 Weighted Generalized Estimating Equations. 13.4 Direct Likelihood Analysis of the Binary Outcome. 13.5 Multiple Imputation. 13.6 Concluding Remarks. 14 Incomplete Data and SAS. 14.1 Introduction. 14.2 Complete Case Analysis. 14.3 Last Observation Carried Forward. 14.4 Direct Likelihood. 14.5 Weighted Estimating Equations. 14.6 Multiple Imputation. 
IV Missing Not at Random. 15 Selection Models. 15.1 Introduction. 15.2 The Diggle-Kenward Model for Continuous Outcomes. 15.3 Illustration and SAS Implementation. 15.4 An MNAR Dale Model. 15.5 A Model for Non-monotone Missingness. 15.6 Concluding Remarks. 16 Pattern-Mixture Models. 16.1 Introduction. 16.2 A Simple Gaussian Illustration. 16.3 A Paradox. 16.4 Strategies to Fit Pattern-Mixture Models. 16.5 Applying Identifying Restrictions. 16.6 Pattern-Mixture Analysis of the Vorozole Study. 16.7 A Clinical Trial in Alzheimer's Disease. 16.8 Analysis of the Fluvoxamine Trial. 16.9 Concluding Remarks. 17 Shared-Parameter Models. 18 Protective Estimation. 18.1 Introduction. 18.2 Brown's Protective Estimator for Gaussian Data. 18.3 A Protective Estimator for Categorical Data. 18.4 A Protective Estimator for Gaussian Data. 18.5 Concluding Remarks. V Sensitivity Analysis. 19 MNAR, MAR, and the Nature of Sensitivity. 19.1 Introduction. 19.2 Every MNAR Model Has an MAR Bodyguard. 19.3 The General Case of Incomplete Contingency Tables. 19.4 The Slovenian Public Opinion Survey. 19.5 Implications for Formal and Informal Model Selection. 19.6 Behaviour of the Likelihood Ratio Test for MAR versus MNAR. 19.7 Concluding Remarks. 20 Sensitivity Happens. 20.1 Introduction. 20.2 A Range of MNAR Models. 20.3 Identifiability Problems. 20.4 Analysis of the Fluvoxamine Trial. 20.5 Concluding Remarks. 21 Regions of Ignorance and Uncertainty. 21.1 Introduction. 21.2 Prevalence of HIV in Kenya. 21.3 Uncertainty and Sensitivity. 21.4 Models for Monotone Patterns. 21.5 Models for Non-monotone Patterns. 21.6 Formalizing Ignorance and Uncertainty. 21.7 Analysis of the Fluvoxamine Trial. 21.8 Artificial Examples. 21.9 The Slovenian Public Opinion Survey. 21.10 Some Theoretical Considerations. 21.11 Concluding Remarks. 22 Local and Global Influence Methods. 22.1 Introduction. 22.2 Gaussian Outcomes. 22.3 Mastitis in Dairy Cattle. 22.4 Alternative Local Influence Approaches. 22.5 The Milk Protein Content Trial. 22.6 Analysis of the Depression Trials. 22.7 A Local Influence Approach for Ordinal Data with Dropout. 22.8 Analysis of the Fluvoxamine Data. 22.9 A Local Influence Approach for Incomplete Binary Data. 22.10 Analysis of the Fluvoxamine Data. 22.11 Concluding Remarks. 23 The Nature of Local Influence. 23.1 Introduction. 23.2 The Rats Data. 23.3 Analysis and Sensitivity Analysis of the Rats Data. 23.4 Local Influence Methods and Their Behaviour. 23.5 Concluding Remarks. 24 A Latent-Class Mixture Model for Incomplete Longitudinal Gaussian Data. 24.1 Introduction. 24.2 Latent-Class Mixture Models. 24.3 The Likelihood Function and Estimation. 24.4 Classification. 24.5 Simulation Study. 24.6 Analysis of the Depression Trials. 24.7 Concluding Remarks. VI Case Studies. 25 The Age-Related Macular Degeneration Trial. 25.1 Selection Models and Local Influence. 25.2 Local Influence Analysis. 25.3 Pattern-Mixture Models. 25.4 Concluding Remarks. 26 The Vorozole Study. 26.1 Introduction. 26.2 Exploring the Vorozole Data. 26.3 A Selection Model for the Vorozole Study. 26.4 A Pattern-Mixture Model for the Vorozole Study. 26.5 Concluding Remarks. References. Index.

750 citations


Journal ArticleDOI
TL;DR: The Deviance Information Criterion as mentioned in this paper combines ideas from both heritages; it is readily computed from Monte Carlo posterior samples and, unlike the AIC and BIC, allows for parameter degeneracy.
Abstract: Model selection is the problem of distinguishing competing models, perhaps featuring different numbers of parameters. The statistics literature contains two distinct sets of tools, those based on information theory such as the Akaike Information Criterion (AIC), and those on Bayesian inference such as the Bayesian evidence and Bayesian Information Criterion (BIC). The Deviance Information Criterion combines ideas from both heritages; it is readily computed from Monte Carlo posterior samples and, unlike the AIC and BIC, allows for parameter degeneracy. I describe the properties of the information criteria, and as an example compute them from Wilkinson Microwave Anisotropy Probe 3-yr data for several cosmological models. I find that at present the information theory and Bayesian approaches give significantly different conclusions from that data.
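In the notation usual for this criterion, with deviance D(θ) = −2 log p(y | θ), posterior mean deviance D̄, and a plug-in estimate θ̄ such as the posterior mean, the quantities referred to above are

\[
p_D = \bar{D} - D(\bar{\theta}), \qquad \mathrm{DIC} = \bar{D} + p_D = 2\bar{D} - D(\bar{\theta}),
\]

both of which are straightforward to evaluate from Monte Carlo posterior samples.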

725 citations


Journal ArticleDOI
TL;DR: In this article, the performance of three different simultaneous autoregressive (SAR) model types (spatial error = SARerr, lagged = SARlag and mixed = SARmix) and of common ordinary least squares (OLS) regression in accounting for spatial autocorrelation in species distribution data is tested using four artificial data sets with known (but different) spatial autocorrelation structures.
Abstract: Aim Spatial autocorrelation is a frequent phenomenon in ecological data and can affect estimates of model coefficients and inference from statistical models. Here, we test the performance of three different simultaneous autoregressive (SAR) model types (spatial error = SARerr, lagged = SARlag and mixed = SARmix) and common ordinary least squares (OLS) regression when accounting for spatial autocorrelation in species distribution data using four artificial data sets with known (but different) spatial autocorrelation structures. Methods We evaluate the performance of SAR models by examining spatial patterns in model residuals (with correlograms and residual maps), by comparing model parameter estimates with true values, and by assessing their type I error control with calibration curves. We calculate a total of 3240 SAR models and illustrate how the best models [in terms of minimum residual spatial autocorrelation (minRSA), maximum model fit (R2), or Akaike information criterion (AIC)] can be identified using model selection procedures. Results Our study shows that the performance of SAR models depends on model specification (i.e. model type, neighbourhood distance, coding styles of spatial weights matrices) and on the kind of spatial autocorrelation present. SAR model parameter estimates might not be more precise than those from OLS regressions in all cases. SARerr models were the most reliable SAR models and performed well in all cases (independent of the kind of spatial autocorrelation induced and whether models were selected by minRSA, R2 or AIC), whereas OLS, SARlag and SARmix models showed weak type I error control and/or unpredictable biases in parameter estimates. Main conclusions SARerr models are recommended for use when dealing with spatially autocorrelated species distribution data. SARlag and SARmix might not always give better estimates of model coefficients than OLS, and can thus generate bias. Other spatial modelling techniques should be assessed comprehensively to test their predictive performance and accuracy for biogeographical and macroecological research.
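In commonly used notation (the paper's exact parameterization may differ in detail), with W a spatial weights matrix, the spatial-error and lagged specifications take the forms

\[
\text{SAR}_{\mathrm{err}}:\ y = X\beta + u,\quad u = \lambda W u + \varepsilon; \qquad
\text{SAR}_{\mathrm{lag}}:\ y = \rho W y + X\beta + \varepsilon,
\]

while the mixed model combines a spatially lagged response with spatially structured predictor terms; see the paper for its exact specification.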

Journal ArticleDOI
TL;DR: It is shown that the new Mallows model average (MMA) estimator is asymptotically optimal in the sense of achieving the lowest possible squared error in a class of discrete model average estimators.
Abstract: This paper considers the problem of selection of weights for averaging across least squares estimates obtained from a set of models. Existing model average methods are based on exponential Akaike information criterion (AIC) and Bayesian information criterion (BIC) weights. In distinction, this paper proposes selecting the weights by minimizing a Mallows criterion, the latter an estimate of the average squared error from the model average fit. We show that our new Mallows model average (MMA) estimator is asymptotically optimal in the sense of achieving the lowest possible squared error in a class of discrete model average estimators. In a simulation experiment we show that the MMA estimator compares favorably with those based on AIC and BIC weights. The proof of the main result is an application of the work of Li (1987).
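Up to notation, the weight-selection step works as follows: with candidate least-squares fits \(\hat{y}_m\) using \(k_m\) parameters each, model-average fit \(\hat{y}(w) = \sum_m w_m \hat{y}_m\), an estimate \(\sigma^2\) of the error variance, and weights restricted to the unit simplex, the MMA weights solve

\[
\hat{w} \;=\; \arg\min_{w_m \ge 0,\ \sum_m w_m = 1}\ \bigl\lVert y - \hat{y}(w) \bigr\rVert^{2} \;+\; 2\sigma^{2}\sum_{m} w_m k_m .
\]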

Journal ArticleDOI
TL;DR: In this paper, the adaptive Lasso estimator is proposed for Cox's proportional hazards model, which is based on a penalized log partial likelihood with the adaptively weighted L 1 penalty on regression coefficients.
Abstract: SUMMARY We investigate the variable selection problem for Cox's proportional hazards model, and propose a unified model selection and estimation procedure with desired theoretical properties and computational convenience. The new method is based on a penalized log partial likelihood with the adaptively weighted L1 penalty on regression coefficients, providing what we call the adaptive Lasso estimator. The method incorporates different penalties for different coefficients: unimportant variables receive larger penalties than important ones, so that important variables tend to be retained in the selection process, whereas unimportant variables are more likely to be dropped. Theoretical properties, such as consistency and rate of convergence of the estimator, are studied. We also show that, with proper choice of regularization parameters, the proposed estimator has the oracle properties. The convex optimization nature of the method leads to an efficient algorithm. Both simulated and real examples show that the method performs competitively.
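Schematically (up to notation), with log partial likelihood \(\ell(\beta)\), an initial consistent estimate \(\tilde{\beta}\), and data-driven weights \(w_j = 1/|\tilde{\beta}_j|^{\gamma}\), the adaptive Lasso estimator maximizes a weighted-penalty objective:

\[
\hat{\beta} \;=\; \arg\max_{\beta}\ \ell(\beta) \;-\; \lambda \sum_{j=1}^{p} w_j \,\lvert \beta_j \rvert ,
\]

so coefficients whose initial estimates are small receive heavy penalties and tend to be dropped, while strong signals are penalized only lightly.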

Journal ArticleDOI
TL;DR: The general simulation approach presented in this paper for identifying the most parsimonious model, as defined by information theory, should help to improve the understanding of the reliability of model selection when using AIC, and help the development of better selection rules.
Abstract: Summary 1. The ability to identify key ecological processes is important when solving applied problems. Increasingly, ecologists are adopting Akaike's information criterion (AIC) as a metric to help them assess and select among multiple process-based ecological models. Surprisingly, however, it is still unclear how best to incorporate AIC into the selection process in order to address the trade-off between maximizing the probability of retaining the most parsimonious model while minimizing the number of models retained. 2. Ecological count data are often observed to be overdispersed with respect to best-fitting models. Overdispersion is problematic when performing an AIC analysis, as it can result in selection of overly complex models which can lead to poor ecological inference. This paper describes and illustrates two approaches that deal effectively with overdispersion. The first approach involves modelling the causes of overdispersion implicitly using compound probability distributions. The second approach ignores the causes of overdispersion and uses quasi-AIC (QAIC) as a metric for model parsimony. 3. Simulations and a novel method that identifies the most parsimonious model are used to demonstrate the utility of the two overdispersion approaches within the context of two ecological examples. The first example addresses binomial data obtained from a study of fish survival (as related to habitat structure) and the second example addresses Poisson data obtained from a study of flower visitation by nectarivores. 4. Applying either overdispersion approach reduces the chance of selecting overly complex models, and both approaches result in very similar ecological inference. In addition, inference can be made more reliable by incorporating model nesting into the selection process (i.e. identifying which models are special cases of others), as it reduces the number of models selected without significantly reducing the probability of retaining the most parsimonious models. 5. Synthesis and applications. When data are overdispersed, inference can be improved by either modelling the causes of overdispersion or applying QAIC as a metric for model parsimony. Inference can also be improved by adopting a model filtering procedure based on how models are nested. The general simulation approach presented in this paper for identifying the most parsimonious model, as defined by information theory, should help to improve our understanding of the reliability of model selection when using AIC, and help the development of better selection rules.
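The quasi-AIC mentioned above adjusts the likelihood term by an overdispersion estimate \(\hat{c}\) (commonly the Pearson chi-square of a global model divided by its degrees of freedom); for a model with K estimated parameters,

\[
\mathrm{QAIC} \;=\; -\,\frac{2\log\hat{L}}{\hat{c}} \;+\; 2K .
\]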

Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of estimating, inference, and computation with multiple structural changes that occur at unknown dates in a system of equations, where changes can occur in the regression coefficients and/or the covariance matrix of the errors.
Abstract: This paper considers issues related to estimation, inference, and computation with multiple structural changes that occur at unknown dates in a system of equations. Changes can occur in the regression coefficients and/or the covariance matrix of the errors. We also allow arbitrary restrictions on these parameters, which permits the analysis of partial structural change models, common breaks that occur in all equations, breaks that occur in a subset of equations, and so forth. The method of estimation is quasi-maximum likelihood based on Normal errors. The limiting distributions are obtained under more general assumptions than previous studies. For testing, we propose likelihood ratio type statistics to test the null hypothesis of no structural change and to select the number of changes. Structural change tests with restrictions on the parameters can be constructed to achieve higher power when prior information is present. For computation, an algorithm for an efficient procedure is proposed to construct the estimates and test statistics. We also introduce a novel locally ordered breaks model, which allows the breaks in different equations to be related yet not occurring at the same dates.

Book ChapterDOI
01 Jan 2007
TL;DR: In this article, the authors propose a statistical method for investigating network structure together with relevant actor attributes as joint dependent variables in a longitudinal framework, assuming that data have been collected according to a panel design.
Abstract: This chapter proposes a statistical method for investigating network structure together with relevant actor attributes as joint dependent variables in a longitudinal framework, assuming that data have been collected according to a panel design. It discusses the specification of the stochastic model for dynamics of networks and behavior and then proceeds to parameter estimation and model selection. The chapter analyses network-behavioral coevolution and parameter estimation. It addresses Goodness-of-fit issues and model selection. The chapter presents a statistical model for the simultaneous, mutually dependent, dynamics of a relation on a given set of social actors, and the behavior of these actors as represented by one or more ordinal categorical variables. The process of network-behavioral coevolution is regarded as an emergent group-level result of the network actors’ individual decisions. Whereas the rate functions model the timing of the different actors’ different types of decisions, the objective functions model which changes are made.

Journal ArticleDOI
TL;DR: This paper is mainly devoted to a precise analysis of what kind of penalties should be used in order to perform model selection via the minimization of a penalized least-squares type criterion within a general Gaussian framework that includes the classical ones.
Abstract: This paper is mainly devoted to a precise analysis of what kind of penalties should be used in order to perform model selection via the minimization of a penalized least-squares type criterion within some general Gaussian framework including the classical ones. As compared to our previous paper on this topic (Birge and Massart in J. Eur. Math. Soc. 3, 203-268 (2001)), more elaborate forms of the penalties are given which are shown to be, in some sense, optimal. We indeed provide more precise upper bounds for the risk of the penalized estimators and lower bounds for the penalty terms, showing that the use of smaller penalties may lead to disastrous results. These lower bounds may also be used to design a practical strategy that allows one to estimate the penalty from the data when the amount of noise is unknown. We provide an illustration of the method for the problem of estimating a piecewise constant signal in Gaussian noise when neither the number, nor the location of the change points are known.

Journal ArticleDOI
TL;DR: The modified BIC is derived by asymptotic approximation of the Bayes factor for the model of Brownian motion with changing drift and performs well compared to existing methods in accurately choosing the number of regions of changed copy number.
Abstract: In the analysis of data generated by change-point processes, one critical challenge is to determine the number of change-points. The classic Bayes information criterion (BIC) statistic does not work well here because of irregularities in the likelihood function. By asymptotic approximation of the Bayes factor, we derive a modified BIC for the model of Brownian motion with changing drift. The modified BIC is similar to the classic BIC in the sense that the first term consists of the log likelihood, but it differs in the terms that penalize for model dimension. As an example of application, this new statistic is used to analyze array-based comparative genomic hybridization (array-CGH) data. Array-CGH measures the number of chromosome copies at each genome location of a cell sample, and is useful for finding the regions of genome deletion and amplification in tumor cells. The modified BIC performs well compared to existing methods in accurately choosing the number of regions of changed copy number. Unlike existing methods, it does not rely on tuning parameters or intensive computing. Thus it is impartial and easier to understand and to use.

Journal ArticleDOI
TL;DR: The generalized estimating equation (GEE) as mentioned in this paper is an extension of the generalized linear model (GLM) method to correlated data such that valid standard errors of the parameter estimates can be drawn.
Abstract: The generalized estimating equation (GEE) approach is a widely used statistical method in the analysis of longitudinal data in clinical and epidemiological studies. It is an extension of the generalized linear model (GLM) method to correlated data such that valid standard errors of the parameter estimates can be drawn. Unlike the GLM method, which is based on the maximum likelihood theory for independent observations, the GEE method is based on the quasilikelihood theory and no assumption is made about the distribution of response observations. Therefore, Akaike's information criterion, a widely used method for model selection in GLM, is not applicable to GEE directly. However, Pan (Biometrics 2001; 57: 120-125) proposed a model-selection method for GEE and termed it quasilikelihood under the independence model criterion. This criterion can also be used to select the best-working correlation structure. From Pan's methods, I developed a general Stata program, qic, that accommodates all the distribution and link functions and correlation structures available in Stata version 9. In this paper, I introduce this program and demonstrate how to use it to select the best working correlation structure and the best subset of covariates through two examples in longitudinal studies.
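Stated up to notation, Pan's criterion replaces the log-likelihood in AIC with the quasilikelihood Q evaluated under the independence working model, and replaces the parameter count with a trace term built from the model-based (independence) information \(\hat{\Omega}_I\) and the robust sandwich covariance \(\hat{V}_R\) of the estimates:

\[
\mathrm{QIC} \;=\; -2\,Q\bigl(\hat{\beta};\ I\bigr) \;+\; 2\,\operatorname{trace}\bigl(\hat{\Omega}_I \hat{V}_R\bigr).
\]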

Journal ArticleDOI
TL;DR: In this paper, the authors consider three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation and linear (L) aggregation for the regression setting and evaluate the rates of convergence of the excess risks of the estimators obtained by these procedures.
Abstract: This paper studies statistical aggregation procedures in the regression setting. A motivating factor is the existence of many different methods of estimation, leading to possibly competing estimators. We consider here three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation and linear (L) aggregation. The objective of (MS) is to select the optimal single estimator from the list; that of (C) is to select the optimal convex combination of the given estimators; and that of (L) is to select the optimal linear combination of the given estimators. We are interested in evaluating the rates of convergence of the excess risks of the estimators obtained by these procedures. Our approach is motivated by recently published minimax results [Nemirovski, A. (2000). Topics in non-parametric statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin; Tsybakov, A. B. (2003). Optimal rates of aggregation. Learning Theory and Kernel Machines. Lecture Notes in Artificial Intelligence 2777 303–313. Springer, Heidelberg]. There exist competing aggregation procedures achieving optimal convergence rates for each of the (MS), (C) and (L) cases separately. Since these procedures are not directly comparable with each other, we suggest an alternative solution. We prove that all three optimal rates, as well as those for the newly introduced (S) aggregation (subset selection), are nearly achieved via a single “universal” aggregation procedure. The procedure consists of mixing the initial estimators with weights obtained by penalized least squares. Two different penalties are considered: one of them is of the BIC type, the second one is a data-dependent $\ell_1$-type penalty.

Journal ArticleDOI
TL;DR: In this paper, the authors present inference procedures for evaluating binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values.
Abstract: Suppose that we are interested in establishing simple but reliable rules for predicting future t-year survivors through censored regression models. In this article we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models, we derive consistent estimators for the above measures through substitution and cross-validation estimation procedures. Furthermore, we provide large-sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All of the proposals are illustrated with real examples, and their finite-sample properties are evaluated through a simulation study.

Journal ArticleDOI
TL;DR: In this article, the Savage-Dickey density ratio (SDR) is used to determine the Bayes factor of two nested models and hence perform model selection, based on which a non-scale invariant spectral index of perturbations is favored for any sensible choice of prior.
Abstract: Bayesian model selection is a tool to decide whether the introduction of a new parameter is warranted by data. I argue that the usual sampling statistic significance tests for a null hypothesis can be misleading, since they do not take into account the information gained through the data, when updating the prior distribution to the posterior. On the contrary, Bayesian model selection offers a quantitative implementation of Occam’s razor. I introduce the Savage–Dickey density ratio, a computationally quick method to determine the Bayes factor of two nested models and hence perform model selection. As an illustration, I consider three key parameters for our understanding of the cosmological concordance model. By using WMAP 3-year data complemented by other cosmological measurements, I show that a non-scale-invariant spectral index of perturbations is favoured for any sensible choice of prior. It is also found that a flat Universe is favoured with odds of 29:1 over non-flat models, and that there is strong evidence against a CDM isocurvature component to the initial conditions which is totally (anti)correlated with the adiabatic mode (odds of about 2000:1), but that this is strongly dependent on the prior adopted. These results are contrasted with the analysis of WMAP 1-year data, which were not informative enough to allow a conclusion as to the status of the spectral index. In a companion paper, a new technique to forecast the Bayes factor of a future observation is presented.
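For nested models, with M_0 fixing an extra parameter at φ = φ_0 inside M_1 and a prior that separates over that parameter, the Savage–Dickey density ratio gives the Bayes factor as a ratio of the marginal posterior to the prior, both evaluated at the nested value:

\[
B_{01} \;=\; \frac{p(\phi = \phi_0 \mid d, M_1)}{p(\phi = \phi_0 \mid M_1)} ,
\]

which can be read off from posterior samples of φ alone, hence the computational convenience noted above.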

Journal Article
TL;DR: A penalized likelihood approach with an L1 penalty function is proposed, automatically realizing variable selection via thresholding and delivering a sparse solution in model-based clustering analysis with a common diagonal covariance matrix.
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of model-based clustering analysis with a common diagonal covariance matrix, which is especially suitable for "high dimension, low sample size" settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method.

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper examines a host of related algorithms that, loosely speaking, fall under the category of graphical Granger methods, and characterize their relative performance from multiple viewpoints, and shows that the Lasso algorithm exhibits consistent gain over the canonical pairwise graphical Granger method.
Abstract: The need for mining causality, beyond mere statistical correlations, for real world problems has been recognized widely. Many of these applications naturally involve temporal data, which raises the challenge of how best to leverage the temporal information for causal modeling. Recently graphical modeling with the concept of "Granger causality", based on the intuition that a cause helps predict its effects in the future, has gained attention in many domains involving time series data analysis. With the surge of interest in model selection methodologies for regression, such as the Lasso, as practical alternatives to solving structural learning of graphical models, the question arises whether and how to combine these two notions into a practically viable approach for temporal causal modeling. In this paper, we examine a host of related algorithms that, loosely speaking, fall under the category of graphical Granger methods, and characterize their relative performance from multiple viewpoints. Our experiments show, for instance, that the Lasso algorithm exhibits consistent gain over the canonical pairwise graphical Granger method. We also characterize conditions under which these variants of graphical Granger methods perform well in comparison to other benchmark methods. Finally, we apply these methods to a real world data set involving key performance indicators of corporations, and present some concrete results.
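A simplified sketch of the Lasso-flavoured graphical Granger idea follows (not the paper's exact algorithms): regress each series on L lags of all series and read nonzero coefficients as candidate Granger-causal links. The toy data, lag length and penalty level are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
T, p, L = 300, 4, 2
X = rng.standard_normal((T, p))
X[2:, 0] += 0.6 * X[:-2, 1]        # toy signal: series 1 helps predict series 0

def lagged_design(X, L):
    """Stack L lags of every series as predictors for each time t >= L."""
    rows = [X[t - L:t][::-1].ravel() for t in range(L, len(X))]
    return np.asarray(rows), X[L:]

Z, Y = lagged_design(X, L)
edges = set()
for target in range(p):
    coef = Lasso(alpha=0.05).fit(Z, Y[:, target]).coef_.reshape(L, p)
    for source in np.flatnonzero(np.any(np.abs(coef) > 1e-8, axis=0)):
        if source != target:
            edges.add((source, target))   # "source Granger-causes target"
print(sorted(edges))
```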

Journal ArticleDOI
TL;DR: A full-QTL model is presented with which to explore the genetic architecture of complex traits in multiple environments, which includes the effects of multiple QTLs, epistasis, QTL-by-environment interactions and epistasis-by-environment interactions.
Abstract: Summary: Understanding how interactions among sets of genes affect diverse phenotypes is having a greater impact on biomedical research, agriculture and evolutionary biology. Mapping and characterizing the isolated effects of a single quantitative trait locus (QTL) is a first step, but we also need to assemble networks of QTLs and define non-additive interactions (epistasis) together with a host of potential environmental modulators. In this article, we present a full-QTL model with which to explore the genetic architecture of complex traits in multiple environments. Our model includes the effects of multiple QTLs, epistasis, QTL-by-environment interactions and epistasis-by-environment interactions. A new mapping strategy, including marker interval selection, detection of marker interval interactions and genome scans, is used to evaluate putative locations of multiple QTLs and their interactions. All the mapping procedures are performed in the framework of a mixed linear model that is flexible enough to model environmental factors regardless of whether fixed or random effects are assumed. An F-statistic based on Henderson method III is used for hypothesis tests. This method is less computationally greedy than the corresponding likelihood ratio test. In each of the mapping procedures, permutation testing is exploited to control the genome-wide false positive rate, and model selection is used to reduce ghost peaks in the F-statistic profile. Parameters of the full-QTL model are estimated using a Bayesian method via Gibbs sampling. Monte Carlo simulations help define the reliability and efficiency of the method. Two real-world phenotypes (BXD mouse olfactory bulb weight data and rice yield data) are used as exemplars to demonstrate our methods. Availability: A software package is freely available at http://ibi.zju.edu.cn/software/qtlnetwork Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Journal ArticleDOI
TL;DR: Here, computer software named ‘kakusan’ is developed that enables us to solve the problems of model selection for multiple loci at the same time in model-based phylogenetic analysis of multigene sequences.
Abstract: The application of different substitution models to each gene (a.k.a. mixed model) should be considered in model-based phylogenetic analysis of multigene sequences. However, a single molecular evolution model is still usually applied. There are no computer programs able to conduct model selection for multiple loci at the same time, though several recently developed types of software for phylogenetic inference can handle mixed models. Here, I have developed computer software named ‘kakusan’ that enables us to solve the above problems. Major running steps are briefly described, and an analysis of results with kakusan is compared to that obtained with another program.

Journal ArticleDOI
TL;DR: In this article, a simple distribution-free test for non-nested model selection is proposed, which is shown to be asymptotically more efficient than the well-known Vuong test when the distribution of individual log-likelihood ratios is highly peaked.
Abstract: This paper considers a simple distribution-free test for nonnested model selection. The new test is shown to be asymptotically more efficient than the well-known Vuong test when the distribution of individual log-likelihood ratios is highly peaked. Monte Carlo results demonstrate that for many applied research situations, this distribution is indeed highly peaked. The simulation further demonstrates that the proposed test has greater power than the Vuong test under these conditions. The substantive application addresses the effect of domestic political institutions on foreign policy decision making. Do domestic institutions have effects because they hold political leaders accountable, or do they simply promote political norms that shape elite bargaining behavior? The results indicate that the latter model has greater explanatory power.
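The flavour of such a distribution-free comparison can be illustrated with a plain sign test on the individual log-likelihood ratios, tested against a fair coin. The paper's test also corrects for differing numbers of parameters; that correction, and the simulated inputs below, are omitted or assumed here for the sake of a short sketch.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test_model_selection(loglik_a, loglik_b):
    # Count observations where model A has the higher pointwise log-likelihood
    # and test the count against Binomial(n, 1/2); ties are effectively ignored.
    d = np.asarray(loglik_a) - np.asarray(loglik_b)
    wins_a = int(np.sum(d > 0))
    result = binomtest(wins_a, n=len(d), p=0.5, alternative='two-sided')
    return wins_a, result.pvalue

# Toy illustration with simulated pointwise log-likelihoods.
rng = np.random.default_rng(0)
lla = rng.normal(-1.0, 1.0, size=500)
llb = lla - rng.normal(0.05, 0.5, size=500)   # model A slightly better on average
print(sign_test_model_selection(lla, llb))
```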

Journal ArticleDOI
TL;DR: This paper investigates the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically.
Abstract: While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach.

Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of estimating the unconditional distribution of a post-model-selection estimator and show that no estimator for this distribution can be uniformly consistent (not even locally).
Abstract: We consider the problem of estimating the unconditional distribution of a post-model-selection estimator. The notion of a post-model-selection estimator here refers to the combined procedure resulting from first selecting a model (e.g., by a model selection criterion like AIC or by a hypothesis testing procedure) and then estimating the parameters in the selected model (e.g., by least-squares or maximum likelihood), all based on the same data set. We show that it is impossible to estimate the unconditional distribution with reasonable accuracy even asymptotically. In particular, we show that no estimator for this distribution can be uniformly consistent (not even locally). This follows as a corollary to (local) minimax lower bounds on the performance of estimators for the distribution; performance is here measured by the probability that the estimation error exceeds a given threshold. These lower bounds are shown to approach 1/2 or even 1 in large samples, depending on the situation considered. Similar impossibility results are also obtained for the distribution of linear functions (e.g., predictors) of the post-model-selection estimator.