
Showing papers on "Model selection" published in 2016


Journal ArticleDOI
TL;DR: This updated version of mclust adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
Abstract: Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
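
mclust itself is an R package, so as a rough illustration only, here is a minimal Python sketch of the same BIC-driven selection over component counts and covariance structures, using scikit-learn's GaussianMixture on synthetic data (this is not the mclust API).

```python
# Minimal Python analogue of BIC-based mixture model selection (mclust itself is an R package).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic two-component data for illustration.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 0.5, (150, 2))])

best = None
for k in range(1, 7):                                  # candidate numbers of components
    for cov in ("full", "tied", "diag", "spherical"):  # candidate covariance structures
        gm = GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(X)
        bic = gm.bic(X)                                # lower BIC = preferred model
        if best is None or bic < best[0]:
            best = (bic, k, cov, gm)

bic, k, cov, model = best
print(f"selected: {k} components, '{cov}' covariance, BIC={bic:.1f}")
```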

1,841 citations


Journal ArticleDOI
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.
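
The honest-splitting idea can be illustrated with a deliberately simplified Python sketch: one half of a randomized sample builds a partition with an off-the-shelf regression tree fit to a transformed outcome (whose conditional mean equals the treatment effect under 0.5 randomization), and the other half estimates leaf-level treatment effects. This is not the authors' causal-tree splitting criterion or their modified cross-validation, only the sample-splitting step; all data below are synthetic.

```python
# Sketch of "honest" estimation via sample splitting (not the authors' exact splitting criterion):
# one half builds the partition, the other half estimates leaf-level treatment effects.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
W = rng.integers(0, 2, n)                      # randomized treatment, p = 0.5
tau = 1.0 * (X[:, 0] > 0)                      # heterogeneous true effect
Y = X[:, 1] + tau * W + rng.normal(size=n)

Xa, Xb, Wa, Wb, Ya, Yb = train_test_split(X, W, Y, test_size=0.5, random_state=0)

# Build the partition on sample A using the transformed outcome Y*(W - p)/(p*(1 - p)),
# whose conditional mean equals the treatment effect when p = 0.5 (a standard baseline).
Ystar = Ya * (Wa - 0.5) / 0.25
tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=200).fit(Xa, Ystar)

# Honest step: estimate effects on sample B, leaf by leaf, as treated-minus-control means.
leaves = tree.apply(Xb)
for leaf in np.unique(leaves):
    m = leaves == leaf
    est = Yb[m & (Wb == 1)].mean() - Yb[m & (Wb == 0)].mean()
    print(f"leaf {leaf}: estimated effect = {est:.2f}")
```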

913 citations


Journal ArticleDOI
TL;DR: In this article, a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates is discussed, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood.
Abstract: This article discusses a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates. Gaussian random effects and parametric terms may also be present. By construction the method is numerically stable and convergent, and enables smoothing parameter uncertainty to be quantified. The latter enables us to fix a well known problem with AIC for such models, thereby improving the range of model selection tools available. The smooth functions are represented by reduced rank spline like smoothers, with associated quadratic penalties measuring function smoothness. Model estimation is by penalized likelihood maximization, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood. The methods cover, for example, generalized additive models for nonexponential family responses (e.g., beta, ordered categorical, scaled t distribution, negative binomial a...
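
As a toy illustration of smoothing-parameter selection for a penalized spline: the paper's approach (implemented in the R package mgcv) uses Laplace approximate marginal likelihood, whereas the Python sketch below uses the simpler GCV criterion on a hand-built truncated-power basis, so it only conveys the general idea of choosing the penalty weight from the data.

```python
# Sketch of penalized-spline smoothing with the penalty weight chosen by GCV
# (a simpler criterion than the Laplace approximate marginal likelihood used in the paper).
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Truncated-power cubic spline basis: [1, x, x^2, x^3, (x - k)^3_+ for interior knots k].
knots = np.linspace(0.05, 0.95, 15)
B = np.column_stack([x**p for p in range(4)] +
                    [np.clip(x - k, 0, None)**3 for k in knots])
# Penalize only the truncated terms (a ridge penalty as a rough smoothness measure).
D = np.diag([0.0] * 4 + [1.0] * len(knots))

def gcv(lam):
    A = B @ np.linalg.solve(B.T @ B + lam * D, B.T)   # hat matrix
    resid = y - A @ y
    edf = np.trace(A)                                  # effective degrees of freedom
    return n * (resid @ resid) / (n - edf) ** 2

lams = 10.0 ** np.arange(-6, 4)
best_lam = min(lams, key=gcv)
print(f"selected smoothing parameter: {best_lam:g}, GCV = {gcv(best_lam):.4f}")
```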

782 citations


Journal ArticleDOI
TL;DR: A general approach to valid inference after model selection is developed and specialized to the lasso, yielding valid confidence intervals for the selected coefficients and a test of whether all relevant variables have been included in the model.
Abstract: We develop a general approach to valid inference after model selection. At the core of our framework is a result that characterizes the distribution of a post-selection estimator conditioned on the selection event. We specialize the approach to model selection by the lasso to form valid confidence intervals for the selected coefficients and test whether all relevant variables have been included in the model.

616 citations




Journal ArticleDOI
TL;DR: With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
Abstract: With four parameters I can fit an elephant and with five I can make him wiggle his trunk.—John von Neumann

377 citations


Journal ArticleDOI
TL;DR: The ctmm package for the R statistical computing environment implements all of the CTSPs currently in use in the ecological literature and couples them with powerful statistical methods for autocorrelated data adapted from geostatistics and signal processing, including variograms, periodograms and non‐Markovian maximum likelihood estimation.
Abstract: Summary Movement ecology has developed rapidly over the past decade, driven by advances in tracking technology that have largely removed data limitations. Development of rigorous analytical tools has lagged behind empirical progress, and as a result, relocation data sets have been underutilized. Discrete-time correlated random walk models (CRW) have long served as the foundation for analyzing relocation data. Unfortunately, CRWs confound the sampling and movement processes. CRW parameter estimates thus depend sensitively on the sampling schedule, which makes it difficult to draw sampling-independent inferences about the underlying movement process. Furthermore, CRWs cannot accommodate the multiscale autocorrelations that typify modern, finely sampled relocation data sets. Recent developments in modelling movement as a continuous-time stochastic process (CTSP) solve these problems, but the mathematical difficulty of using CTSPs has limited their adoption in ecology. To remove this roadblock, we introduce the ctmm package for the R statistical computing environment. ctmm implements all of the CTSPs currently in use in the ecological literature and couples them with powerful statistical methods for autocorrelated data adapted from geostatistics and signal processing, including variograms, periodograms and non-Markovian maximum likelihood estimation. ctmm is built around a standard workflow that begins with visual diagnostics, proceeds to candidate model identification, and then to maximum likelihood fitting and AIC-based model selection. Once an accurate CTSP for the data has been fitted and selected, analyses that require such a model, such as quantifying home range areas via autocorrelated kernel density estimation or estimating occurrence distributions via time-series Kriging, can then be performed. We use a case study with African buffalo to demonstrate the capabilities of ctmm and highlight the steps of a typical CTSP movement analysis workflow.
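
ctmm is an R package; the empirical semi-variogram that opens its workflow is simple enough to sketch in Python for regularly sampled relocations. This is only the visual-diagnostic calculation on simulated data, not the CTSP model fitting or AIC-based selection.

```python
# Sketch of the empirical semi-variogram used as a visual diagnostic for relocation data.
import numpy as np

rng = np.random.default_rng(3)
# Simulated 2-D track with Brownian-like motion, sampled at regular unit intervals.
xy = np.cumsum(rng.normal(size=(500, 2)), axis=0)

def semivariogram(xy, max_lag=100):
    """Average squared displacement / 2 at each integer sampling lag."""
    lags = np.arange(1, max_lag + 1)
    sv = np.empty(len(lags))
    for i, lag in enumerate(lags):
        d = xy[lag:] - xy[:-lag]            # displacements separated by `lag` steps
        sv[i] = 0.5 * np.mean(np.sum(d ** 2, axis=1))
    return lags, sv

lags, sv = semivariogram(xy)
print(sv[:5])  # for Brownian-like motion the semi-variance grows roughly linearly with lag
```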

320 citations


Journal ArticleDOI
01 Mar 2016-Geoderma
TL;DR: This study provides a comprehensive comparison of machine-learning techniques for classification purposes in soil science and may assist in model selection for digital soil mapping and geomorphic modeling studies in the future.

314 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose an alternative data-driven method to infer networked nonlinear dynamical systems by using sparsity-promoting optimization to select a subset of nonlinear interactions representing dynamics on a network.
Abstract: Inferring the structure and dynamics of network models is critical to understanding the functionality and control of complex systems, such as metabolic and regulatory biological networks. The increasing quality and quantity of experimental data enable statistical approaches based on information theory for model selection and goodness-of-fit metrics. We propose an alternative data-driven method to infer networked nonlinear dynamical systems by using sparsity-promoting optimization to select a subset of nonlinear interactions representing dynamics on a network. In contrast to standard model selection methods-based upon information content for a finite number of heuristic models (order 10 or less), our model selection procedure discovers a parsimonious model from a combinatorially large set of models, without an exhaustive search. Our particular innovation is appropriate for many biological networks, where the governing dynamical systems have rational function nonlinearities with cross terms, thus requiring an implicit formulation and the equations to be identified in the null-space of a library of mixed nonlinearities, including the state and derivative terms. This method, implicit-SINDy, succeeds in inferring three canonical biological models: 1) Michaelis-Menten enzyme kinetics; 2) the regulatory network for competence in bacteria; and 3) the metabolic network for yeast glycolysis.
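
A minimal sketch of the sparse-regression step (sequential thresholded least squares over a library of candidate terms) on synthetic data is shown below in Python. The paper's implicit-SINDy additionally handles rational nonlinearities via a null-space formulation, which this sketch does not attempt.

```python
# Minimal sketch of sparsity-promoting model selection in the SINDy spirit
# (sequential thresholded least squares on a library of candidate terms).
import numpy as np

# Simulate dx/dt = -2x + x^2 and estimate derivatives by finite differences.
dt = 0.001
x = np.empty(5000); x[0] = 0.5
for i in range(len(x) - 1):
    x[i + 1] = x[i] + dt * (-2 * x[i] + x[i] ** 2)
dxdt = np.gradient(x, dt)

# Library of candidate terms: [1, x, x^2, x^3].
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Sequential thresholded least squares.
xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(xi) < 0.1                 # threshold sets negligible coefficients to zero
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)[0]

print(np.round(xi, 3))   # should recover approximately [0, -2, 1, 0]
```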

312 citations


Journal ArticleDOI
TL;DR: This work proposes a novel approach based on a machine learning tool named random forests (RF) to conduct selection among the highly complex models covered by ABC algorithms, modifying the way Bayesian model selection is both understood and operated.
Abstract: Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities may be poorly evaluated by standard ABC techniques. Results: We propose a novel approach based on a machine learning tool named random forests (RF) to conduct selection among the highly complex models covered by ABC algorithms. We thus modify the way Bayesian model selection is both understood and operated, in that we rephrase the inferential goal as a classification problem, first predicting the model that best fits the data with RF and postponing the approximation of the posterior probability of the selected model for a second stage also relying on RF. Compared with earlier implementations of ABC model choice, the ABC RF approach offers several potential improvements: (i) it often has a larger discriminative power among the competing models, (ii) it is more robust against the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a gain in computation efficiency of at least 50) and (iv) it includes an approximation of the posterior probability of the selected model. The call to RF will undoubtedly extend the range of size of datasets and complexity of models that ABC can handle. We illustrate the power of this novel methodology by analyzing controlled experiments as well as genuine population genetics datasets. Availability and implementation: The proposed methodology is implemented in the R package abcrf available on the CRAN.
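
The reference implementation is the R package abcrf; a toy Python analogue of the classification step (a random forest trained on summary statistics simulated under each candidate model, then applied to the observed statistics) looks roughly like the sketch below. The two models, the statistics, and the sample sizes are invented for illustration.

```python
# Sketch of ABC model choice with a random-forest classifier on summary statistics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)

def simulate(model, n_reps=2000, n_obs=50):
    """Simulate summary statistics (mean, sd, third moment) under two toy models."""
    stats = []
    for _ in range(n_reps):
        x = rng.normal(size=n_obs) if model == 0 else rng.exponential(size=n_obs) - 1
        stats.append([x.mean(), x.std(), np.mean(x**3)])
    return np.array(stats)

S = np.vstack([simulate(0), simulate(1)])
labels = np.repeat([0, 1], 2000)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(S, labels)

# "Observed" data set, here actually generated under model 1.
x_obs = rng.exponential(size=50) - 1
s_obs = np.array([[x_obs.mean(), x_obs.std(), np.mean(x_obs**3)]])
print("selected model:", rf.predict(s_obs)[0])
print("votes:", rf.predict_proba(s_obs)[0])   # note: vote shares are not posterior probabilities
```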

283 citations


Posted Content
TL;DR: This paper briefly discusses model selection, estimation and inference of homogeneous panel VAR models in a generalized method of moments (GMM) framework, and presents a set of Stata programs to conveniently execute them.
Abstract: Panel vector autoregression (VAR) models have been increasingly used in applied research. While programs specifically designed to estimate time-series VAR models are often included as standard features in most statistical packages, panel VAR model estimation and inference are often implemented with general-use routines that require some programming dexterity. In this paper, we briefly discuss model selection, estimation and inference of homogeneous panel VAR models in a generalized method of moments (GMM) framework, and present a set of Stata programs to conveniently execute them. We illustrate the pvar package of programs by using standard Stata datasets.

ReportDOI
TL;DR: In this article, a general construction of locally robust/orthogonal moment functions for GMM is given, in which moment conditions have zero derivative with respect to first steps, and debiased machine learning estimators are provided for functionals of high-dimensional conditional quantiles and for dynamic discrete choice parameters with high-dimensional state variables.
Abstract: Many economic and causal parameters depend on nonparametric or high dimensional first steps. We give a general construction of locally robust/orthogonal moment functions for GMM, where moment conditions have zero derivative with respect to first steps. We show that orthogonal moment functions can be constructed by adding to identifying moments the nonparametric influence function for the effect of the first step on identifying moments. Orthogonal moments reduce model selection and regularization bias, as is very important in many applications, especially for machine learning first steps. We give debiased machine learning estimators of functionals of high dimensional conditional quantiles and of dynamic discrete choice parameters with high dimensional state variables. We show that adding to identifying moments the nonparametric influence function provides a general construction of orthogonal moments, including regularity conditions, and show that the nonparametric influence function is robust to additional unknown functions on which it depends. We give a general approach to estimating the unknown functions in the nonparametric influence function and use it to automatically debias estimators of functionals of high dimensional conditional location learners. We give a variety of new doubly robust moment equations and characterize double robustness. We give general and simple regularity conditions and apply these for asymptotic inference on functionals of high dimensional regression quantiles and dynamic discrete choice parameters with high dimensional state variables.
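
One concrete instance of an orthogonal moment with machine-learning first steps is the partialling-out estimator for a partially linear model with cross-fitting. The Python sketch below (synthetic data) illustrates that special case only, not the paper's general GMM construction or its influence-function-based debiasing.

```python
# Sketch of an orthogonal (partialling-out) moment with cross-fitting for a partially
# linear model, in the debiased-machine-learning spirit the abstract describes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
n, p, theta = 2000, 20, 1.0
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)               # treatment depends on controls
y = theta * d + X[:, 0] ** 2 + rng.normal(size=n)

res_y, res_d = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Nuisance functions E[y|X] and E[d|X] learned on the training folds only.
    my = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], y[train])
    md = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], d[train])
    res_y[test] = y[test] - my.predict(X[test])
    res_d[test] = d[test] - md.predict(X[test])

# Orthogonal moment: E[(y - m_y)(d - m_d)] = theta * E[(d - m_d)^2].
theta_hat = (res_d @ res_y) / (res_d @ res_d)
print(f"debiased estimate of theta: {theta_hat:.3f}")
```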

Journal ArticleDOI
TL;DR: A generative model for robust tensor factorization in the presence of both missing data and outliers that can discover the groundtruth of CP rank and automatically adapt the sparsity inducing priors to various types of outliers is proposed.
Abstract: We propose a generative model for robust tensor factorization in the presence of both missing data and outliers. The objective is to explicitly infer the underlying low-CANDECOMP/PARAFAC (CP)-rank tensor capturing the global information and a sparse tensor capturing the local information (also considered as outliers), thus providing the robust predictive distribution over missing entries. The low-CP-rank tensor is modeled by multilinear interactions between multiple latent factors on which the column sparsity is enforced by a hierarchical prior, while the sparse tensor is modeled by a hierarchical view of Student- $t$ distribution that associates an individual hyperparameter with each element independently. For model learning, we develop an efficient variational inference under a fully Bayesian treatment, which can effectively prevent the overfitting problem and scales linearly with data size. In contrast to existing related works, our method can perform model selection automatically and implicitly without the need of tuning parameters. More specifically, it can discover the groundtruth of CP rank and automatically adapt the sparsity inducing priors to various types of outliers. In addition, the tradeoff between the low-rank approximation and the sparse representation can be optimized in the sense of maximum model evidence. The extensive experiments and comparisons with many state-of-the-art algorithms on both synthetic and real-world data sets demonstrate the superiorities of our method from several perspectives.

Journal ArticleDOI
TL;DR: It is found that the relative predictive performance of model selection by different information criteria is heavily dependent on the degree of unobserved heterogeneity between data sets, and that the choice of information criterion should ideally be based upon hypothesized properties of the population of data sets from which a given data set could have arisen.
Abstract: Summary Model selection is difficult. Even in the apparently straightforward case of choosing between standard linear regression models, there does not yet appear to be consensus in the statistical ecology literature as to the right approach. We review recent works on model selection in ecology and subsequently focus on one aspect in particular: the use of the Akaike Information Criterion (AIC) or its small-sample equivalent, AICC. We create a novel framework for simulation studies and use this to study model selection from simulated data sets with a range of properties, which differ in terms of degree of unobserved heterogeneity. We use the results of the simulation study to suggest an approach for model selection based on ideas from information criteria but requiring simulation. We find that the relative predictive performance of model selection by different information criteria is heavily dependent on the degree of unobserved heterogeneity between data sets. When heterogeneity is small, AIC or AICC are likely to perform well, but if heterogeneity is large, the Bayesian Information Criterion (BIC) will often perform better, due to the stronger penalty afforded. Our conclusion is that the choice of information criterion (or more broadly, the strength of likelihood penalty) should ideally be based upon hypothesized (or estimated from previous data) properties of the population of data sets from which a given data set could have arisen. Relying on a single form of information criterion is unlikely to be universally successful.
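
For reference, the information criteria being compared can be written out for Gaussian linear models fitted by least squares; the toy comparison below is hypothetical, with k counting the regression coefficients plus the error variance.

```python
# Sketch of AIC, AICc and BIC for Gaussian linear models fitted by least squares.
import numpy as np

def information_criteria(y, X):
    n, k = len(y), X.shape[1] + 1
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)        # small-sample correction
    bic = -2 * loglik + k * np.log(n)                 # stronger penalty for n > 7
    return aic, aicc, bic

rng = np.random.default_rng(6)
n = 60
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x])                 # true model
X2 = np.column_stack([np.ones(n), x, x**2, x**3])     # overfitted model
for name, X in [("linear", X1), ("cubic", X2)]:
    print(name, [round(v, 1) for v in information_criteria(y, X)])
```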

Journal ArticleDOI
TL;DR: A technical review of 24 large-scale hydrological models is presented to provide guidance for model selection, assessing suitability for a continental setup while noting that setup for smaller areas is also possible.
Abstract: Uncertainty in operational hydrological forecast systems forced with numerical weather predictions is often assessed by quantifying the uncertainty from the inputs only. However, part of the uncertainty in modelled discharge stems from the hydrological model. A multi-model system can account for some of this uncertainty, but there exists a plethora of hydrological models and it is not trivial to select those that fit specific needs and collectively capture a representative spread of model uncertainty. This paper provides a technical review of 24 large-scale models to provide guidance for model selection. Suitability for the European Flood Awareness System (EFAS), as an example of an operational continental flood forecasting system, is discussed based on process descriptions, flexibility in resolution, input data requirements, availability of code and more. The model choice is in the end subjective, but this review intends to objectively assist in selecting the most appropriate model for the intended purpose. We present a technical review of 24 large-scale hydrological models. We assess suitability for a continental setup, though setup for smaller areas is possible. The best model choice is often subjective, but criteria tables aid comparisons.

Journal ArticleDOI
TL;DR: The design of a class of machine-learning models, namely neural networks, for the load forecasts of medium-voltage/low-voltage substations is described, and the results show that the neural network-based models outperform the time series models.
Abstract: Accurate forecasts of electrical substations are mandatory for the efficiency of the Advanced Distribution Automation functions in distribution systems. The paper describes the design of a class of machine-learning models, namely neural networks, for the load forecasts of medium-voltage/low-voltage substations. We focus on the methodology of neural network model design in order to obtain a model that has the best achievable predictive ability given the available data. Variable selection and model selection are applied to electrical load forecasts to ensure an optimal generalization capacity of the neural network model. Real measurements collected in French distribution systems are used to validate our study. The results show that the neural network-based models outperform the time series models and that the design methodology guarantees the best generalization ability of the neural network model for the load forecasting purpose based on the same data.

Journal ArticleDOI
TL;DR: In this paper, the authors consider a multiple-hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block of hypotheses, and propose two new testing procedures and prove that they control the false discovery rate in the ordered testing setting.
Abstract: Summary We consider a multiple-hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point k. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection by using recent results on p-values in sequential model selection settings.
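
A rule of this type that is often cited in the ordered-testing literature is a ForwardStop-style stopping rule: reject the longest initial block whose transformed p-value average stays below the target level. The Python sketch below illustrates that idea with made-up p-values; consult the paper itself for its exact procedures and FDR guarantees.

```python
# Sketch of a ForwardStop-type rule for the ordered-testing setting described above.
import numpy as np

def forward_stop(pvalues, alpha=0.1):
    z = -np.log(1.0 - np.asarray(pvalues))          # transform p-values
    running_avg = np.cumsum(z) / np.arange(1, len(z) + 1)
    passed = np.nonzero(running_avg <= alpha)[0]
    return passed[-1] + 1 if len(passed) else 0     # number of ordered hypotheses to reject

# Small p-values first (ordered, as in sequential model selection), then nulls.
pvals = [0.001, 0.004, 0.02, 0.30, 0.62, 0.81, 0.47]
print("stop after", forward_stop(pvals, alpha=0.1), "hypotheses")
```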

Journal ArticleDOI
TL;DR: This work proposes a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types, and establishes the consistency and oracle properties of the proposed method under mild conditions.
Abstract: SUMMARY In modern experiments, functional and nonfunctional data are often encountered simultaneously when observations are sampled from random processes and high-dimensional scalar covariates. It is difficult to apply existing methods for model selection and estimation. We propose a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types. The new approach provides a unified and flexible framework that simultaneously takes into account multiple functional and ultrahigh-dimensional scalar predictors, enables us to identify important features, and offers improved interpretability of the estimators. The underlying processes of the functional predictors are considered to be infinite-dimensional, and one of our contributions is to characterize the effects of regularization on the resulting estimators. We establish the consistency and oracle properties of the proposed method under mild conditions, demonstrate its performance with simulation studies, and illustrate its application using air pollution data.

Journal ArticleDOI
01 Nov 2016-Genetics
TL;DR: It is argued that natural populations may experience the amount of recent positive selection required to skew inferences, and results suggest that demographic studies conducted in many species to date may have exaggerated the extent and frequency of population size changes.
Abstract: The availability of large-scale population genomic sequence data has resulted in an explosion in efforts to infer the demographic histories of natural populations across a broad range of organisms. As demographic events alter coalescent genealogies, they leave detectable signatures in patterns of genetic variation within and between populations. Accordingly, a variety of approaches have been designed to leverage population genetic data to uncover the footprints of demographic change in the genome. The vast majority of these methods make the simplifying assumption that the measures of genetic variation used as their input are unaffected by natural selection. However, natural selection can dramatically skew patterns of variation not only at selected sites, but at linked, neutral loci as well. Here we assess the impact of recent positive selection on demographic inference by characterizing the performance of three popular methods through extensive simulation of data sets with varying numbers of linked selective sweeps. In particular, we examined three different demographic models relevant to a number of species, finding that positive selection can bias parameter estimates of each of these models—often severely. We find that selection can lead to incorrect inferences of population size changes when none have occurred. Moreover, we show that linked selection can lead to incorrect demographic model selection, when multiple demographic scenarios are compared. We argue that natural populations may experience the amount of recent positive selection required to skew inferences. These results suggest that demographic studies conducted in many species to date may have exaggerated the extent and frequency of population size changes.

Journal ArticleDOI
TL;DR: The adaLASSO consistently chooses the relevant variables as the number of observations increases (model selection consistency) and has the oracle property, even when the errors are non-Gaussian and conditionally heteroskedastic.

Journal ArticleDOI
TL;DR: In this paper, the authors propose new methods for estimating and constructing confidence regions for a regression parameter of primary interest, a parameter in front of the regressor of interest, such as the treatment variable or a policy variable.
Abstract: This article considers generalized linear models in the presence of many controls. We lay out a general methodology to estimate an effect of interest based on the construction of an instrument that immunizes against model selection mistakes and apply it to the case of the logistic binary choice model. More specifically, we propose new methods for estimating and constructing confidence regions for a regression parameter of primary interest α0, a parameter in front of the regressor of interest, such as the treatment variable or a policy variable. These methods allow estimation of α0 at the root-n rate when the total number p of other regressors, called controls, potentially exceeds the sample size n, using sparsity assumptions. The sparsity assumption means that there is a subset of s < n controls that suffices to accurately approximate the nuisance part of the regression function. Importantly, the estimators and the resulting confidence regions are valid uniformly over s-sparse models satisfying $s^2 \log^2 p = o(n)$...
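
A simpler relative of this construction is post-double-selection with l1 penalties; the Python sketch below illustrates that baseline on synthetic data (select controls relevant for the outcome and for the treatment, then refit an essentially unpenalized logit on the treatment plus the union of selected controls). It is not the paper's immunized-instrument estimator or its uniform inference theory, and a linear lasso is used for the selection steps purely for brevity.

```python
# Sketch of a post-double-selection baseline with many controls.
import numpy as np
from sklearn.linear_model import LogisticRegression, LassoCV

rng = np.random.default_rng(10)
n, p = 1000, 100
X = rng.normal(size=(n, p))
d = (X[:, 0] + rng.normal(size=n) > 0).astype(float)   # treatment depends on a few controls
logit = 0.5 * d + X[:, 0] + 0.5 * X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Step 1: controls predictive of the outcome (linear lasso used here for brevity).
sel_y = np.nonzero(LassoCV(cv=5).fit(X, y).coef_)[0]
# Step 2: controls predictive of the treatment.
sel_d = np.nonzero(LassoCV(cv=5).fit(X, d).coef_)[0]
controls = sorted(set(sel_y) | set(sel_d))

# Step 3: essentially unpenalized logit of y on the treatment plus the selected controls.
Z = np.column_stack([d, X[:, controls]])
fit = LogisticRegression(C=1e10, max_iter=5000).fit(Z, y)
print(f"estimated treatment coefficient: {fit.coef_[0][0]:.2f} (truth 0.5), "
      f"{len(controls)} controls kept")
```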

Journal ArticleDOI
TL;DR: It is demonstrated that statistical theory can be applied to adjust composite likelihoods and perform robust computationally efficient statistical inference in two demographic inference tools: ∂a∂i and TRACTS.
Abstract: Many population genetics tools employ composite likelihoods, because fully modeling genomic linkage is challenging. But traditional approaches to estimating parameter uncertainties and performing model selection require full likelihoods, so these tools have relied on computationally expensive maximum-likelihood estimation (MLE) on bootstrapped data. Here, we demonstrate that statistical theory can be applied to adjust composite likelihoods and perform robust computationally efficient statistical inference in two demographic inference tools: ∂a∂i and TRACTS. On both simulated and real data, the adjustments perform comparably to MLE bootstrapping while using orders of magnitude less computational time.

Journal ArticleDOI
TL;DR: It is argued that the ability to support simpler models allows for more nuanced theoretical conclusions than traditional ANOVA F-tests provide, and it is shown how ANOVA models may be reparameterized to better address substantive questions in data analysis.
Abstract: Analysis of variance (ANOVA), the workhorse analysis of experimental designs, consists of F-tests of main effects and interactions. Yet testing, including traditional ANOVA, has recently been critiqued on a number of theoretical and practical grounds. In light of these critiques, model comparison and model selection serve as an attractive alternative. Model comparison differs from testing in that one can support a null or nested model vis-a-vis a more general alternative by penalizing more flexible models. We argue that this ability to support simpler models allows for more nuanced theoretical conclusions than provided by traditional ANOVA F-tests. We provide a model comparison strategy and show how ANOVA models may be reparameterized to better address substantive questions in data analysis.

Journal ArticleDOI
TL;DR: A " working" distribution is introduced on the space of genealogies, which enables estimating marginal likelihoods while accommodating phylogenetic uncertainty, and two different "working" distributions are proposed that help GSS to outperform PS and SS in terms of accuracy when comparing demographic and evolutionary models applied to synthetic data and real-world examples.
Abstract: Marginal likelihood estimates to compare models using Bayes factors frequently accompany Bayesian phylogenetic inference. Approaches to estimate marginal likelihoods have garnered increased attention over the past decade. In particular, the introduction of path sampling (PS) and stepping-stone sampling (SS) into Bayesian phylogenetics has tremendously improved the accuracy of model selection. These sampling techniques are now used to evaluate complex evolutionary and population genetic models on empirical data sets, but considerable computational demands hamper their widespread adoption. Further, when very diffuse, but proper priors are specified for model parameters, numerical issues complicate the exploration of the priors, a necessary step in marginal likelihood estimation using PS or SS. To avoid such instabilities, generalized SS (GSS) has recently been proposed, introducing the concept of "working distributions" to facilitate--or shorten--the integration process that underlies marginal likelihood estimation. However, the need to fix the tree topology currently limits GSS in a coalescent-based framework. Here, we extend GSS by relaxing the fixed underlying tree topology assumption. To this purpose, we introduce a "working" distribution on the space of genealogies, which enables estimating marginal likelihoods while accommodating phylogenetic uncertainty. We propose two different "working" distributions that help GSS to outperform PS and SS in terms of accuracy when comparing demographic and evolutionary models applied to synthetic data and real-world examples. Further, we show that the use of very diffuse priors can lead to a considerable overestimation in marginal likelihood when using PS and SS, while still retrieving the correct marginal likelihood using both GSS approaches. The methods used in this article are available in BEAST, a powerful user-friendly software package to perform Bayesian evolutionary analyses.

Journal ArticleDOI
TL;DR: This work presents a method for model selection that enables the user to shrink the ensemble to a few representative members, conserving the model spread and accounting for model similarity, and finds that the two most dominant patterns of climate change relate to temperature and humidity patterns.
Abstract: In climate change impact research it is crucial to carefully select the meteorological input for impact models. We present a method for model selection that enables the user to shrink the ensemble to a few representative members, conserving the model spread and accounting for model similarity. This is done in three steps: First, using principal component analysis for a multitude of meteorological parameters, to find common patterns of climate change within the multi-model ensemble. Second, detecting model similarities with regard to these multivariate patterns using cluster analysis. And third, sampling models from each cluster, to generate a subset of representative simulations. We present an application based on the ENSEMBLES regional multi-model ensemble with the aim to provide input for a variety of climate impact studies. We find that the two most dominant patterns of climate change relate to temperature and humidity patterns. The ensemble can be reduced from 25 to 5 simulations while still maintaining its essential characteristics. Having such a representative subset of simulations reduces computational costs for climate impact modeling and enhances the quality of the ensemble at the same time, as it prevents double-counting of dependent simulations that would lead to biased statistics.
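
The three-step procedure (PCA for common change patterns, clustering for model similarity, then one representative member per cluster) reduces naturally to a few lines. The Python sketch below uses random placeholder "change signals" rather than the ENSEMBLES fields analysed in the paper, so it only demonstrates the mechanics.

```python
# Sketch of ensemble sub-selection: PCA on multi-model change signals, clustering in PC
# space, then picking the member closest to each cluster centre as its representative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
n_models, n_features = 25, 40          # e.g. gridded/seasonal change signals per model
signals = rng.normal(size=(n_models, n_features))

scores = PCA(n_components=2).fit_transform(signals)                # step 1: common patterns
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scores)   # step 2: model similarity

# Step 3: take the model closest to each cluster centre as the representative member.
representatives = []
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(scores[members] - km.cluster_centers_[c], axis=1)
    representatives.append(members[np.argmin(dists)])
print("representative ensemble members:", sorted(representatives))
```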

Posted Content
TL;DR: In this article, a hierarchical modelling approach for integrating data from multiple sources is proposed, allowing spatially varying relationships between ground measurements and other factors that estimate air quality. Set within a Bayesian framework, the resulting Data Integration Model for Air Quality (DIMAQ) is used to estimate exposures, together with associated measures of uncertainty, on a high-resolution grid covering the entire world.
Abstract: Air pollution is a major risk factor for global health, with both ambient and household air pollution contributing substantial components of the overall global disease burden. One of the key drivers of adverse health effects is fine particulate matter ambient pollution (PM$_{2.5}$) to which an estimated 3 million deaths can be attributed annually. The primary source of information for estimating exposures has been measurements from ground monitoring networks but, although coverage is increasing, there remain regions in which monitoring is limited. Ground monitoring data therefore needs to be supplemented with information from other sources, such as satellite retrievals of aerosol optical depth and chemical transport models. A hierarchical modelling approach for integrating data from multiple sources is proposed allowing spatially-varying relationships between ground measurements and other factors that estimate air quality. Set within a Bayesian framework, the resulting Data Integration Model for Air Quality (DIMAQ) is used to estimate exposures, together with associated measures of uncertainty, on a high resolution grid covering the entire world. Bayesian analysis on this scale can be computationally challenging and here approximate Bayesian inference is performed using Integrated Nested Laplace Approximations. Model selection and assessment is performed by cross-validation with the final model offering substantial increases in predictive accuracy, particularly in regions where there is sparse ground monitoring, when compared to current approaches: root mean square error (RMSE) reduced from 17.1 to 10.7, and population weighted RMSE from 23.1 to 12.1 $\mu$gm$^{-3}$. Based on summaries of the posterior distributions for each grid cell, it is estimated that 92% of the world's population reside in areas exceeding the World Health Organization's Air Quality Guidelines.

Journal ArticleDOI
01 Jul 2016-Ecology
TL;DR: Several different contemporary Bayesian hierarchical approaches for checking and validating multi-species occupancy models are examined and applied to a freshwater aquatic study system in Colorado, USA, to better understand the diversity and distributions of plains fishes.
Abstract: While multi-species occupancy models (MSOMs) are emerging as a popular method for analyzing biodiversity data, formal checking and validation approaches for this class of models have lagged behind. Concurrent with the rise in application of MSOMs among ecologists, a quiet regime shift is occurring in Bayesian statistics where predictive model comparison approaches are experiencing a resurgence. Unlike single-species occupancy models that use integrated likelihoods, MSOMs are usually couched in a Bayesian framework and contain multiple levels. Standard model checking and selection methods are often unreliable in this setting and there is only limited guidance in the ecological literature for this class of models. We examined several different contemporary Bayesian hierarchical approaches for checking and validating MSOMs and applied these methods to a freshwater aquatic study system in Colorado, USA, to better understand the diversity and distributions of plains fishes. Our findings indicated distinct differences among model selection approaches, with cross-validation techniques performing the best in terms of prediction.

Journal ArticleDOI
TL;DR: It is proved that the least absolute shrinkage and selection operator (Lasso) recovers the lags structure of the HAR model asymptotically if it is the true model, and Monte Carlo evidence in finite samples is presented.
Abstract: Realized volatility computed from high-frequency data is an important measure for many applications in finance, and its dynamics have been widely investigated. Recent notable advances that perform well include the heterogeneous autoregressive (HAR) model which can approximate long memory, is very parsimonious, is easy to estimate, and features good out-of-sample performance. We prove that the least absolute shrinkage and selection operator (Lasso) recovers the lags structure of the HAR model asymptotically if it is the true model, and we present Monte Carlo evidence in finite samples. The HAR model's lags structure is not fully in agreement with the one found using the Lasso on real data. Moreover, we provide empirical evidence that there are two clear breaks in structure for most of the assets we consider. These results bring into question the appropriateness of the HAR model for realized volatility. Finally, in an out-of-sample analysis, we show equal performance of the HAR model and the Lasso approach.
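
A hypothetical sketch of the lag-selection exercise: regress realized volatility on 22 individual lags with a cross-validated Lasso and inspect which lags survive, in contrast to the HAR restriction that collapses the lags into daily (1), weekly (1-5) and monthly (1-22) averages. Synthetic persistent data is used below, so the selected set will not match the paper's empirical findings.

```python
# Sketch of Lasso-based lag selection on a realized-volatility-like series.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
T = 2000
rv = np.empty(T)
rv[0] = 1.0
shocks = np.abs(rng.normal(size=T))
for t in range(1, T):
    rv[t] = 0.1 + 0.6 * rv[t - 1] + 0.2 * shocks[t]   # persistent positive stand-in series

# Unrestricted design: 22 individual lags (HAR would collapse these into three averages).
X = np.array([[rv[t - l] for l in range(1, 23)] for t in range(22, T)])
y = rv[22:]

lasso = LassoCV(cv=5).fit(X, y)
print("lags kept by the Lasso:", np.nonzero(lasso.coef_)[0] + 1)
```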

Journal ArticleDOI
TL;DR: To address the lack of partitioned R2-based RVI metrics in multimodel inference, two metrics are proposed: Iweighted, the average model probability-weighted partitioned R2, and Ibest, the partitioned R2 derived from the best IT model, together with two approaches to eliminate or reduce the influence of correlated variables.
Abstract: Summary The sum of Akaike weights (SW) is often used to quantify relative variable importance (RVI) within the information-theoretic (IT) multimodel inference framework. A recent study (Galipaud et al. 2014, Methods in Ecology and Evolution 5: 983) questioned the validity of the SW approach. Regrettably, this study is flawed because SW was evaluated with an inappropriate benchmark. Irrespective of this study's methodological issues, RVI metrics based on the relative contribution of explanatory variables in explaining the variance in the response variable (partitioned R2-based) are lacking in multimodel inference. We re-evaluated the validity of SW by repeating Galipaud et al.'s experiment but with an appropriate benchmark. When explanatory variables are uncorrelated, the quantity that SW estimates (i.e. the probability that a variable is included in the actual best IT model) is monotonically related to squared zero-order correlation coefficients (r2) between explanatory variables and the response variable. The degree of correspondence between SW and r2 rankings (not values) of variables in data sets with uncorrelated explanatory variables was therefore used as a benchmark to evaluate the validity of SW as a RVI metric. To address the lack of partitioned R2-based RVI metrics in multimodel inference, we proposed 2 metrics: (a) Iweighted, the average model probability-weighted partitioned R2; and (b) Ibest, the partitioned R2 derived from the best IT model. We performed Monte Carlo simulations to evaluate the utility of Iweighted and Ibest versus partitioned R2 derived from the global model (Iglobal). SW rankings matched r2 rankings of variables; therefore, SW is a valid measure of RVI. Among partitioned R2-based metrics, Iweighted and Iglobal were generally more accurate in estimating the population partitioned R2. Iweighted performed better when explanatory variables were uncorrelated, whereas Iglobal was better in smaller data sets with correlated explanatory variables. To improve the utility of Iweighted in such data sets, we proposed approaches to eliminate or reduce the influence of correlated variables. Despite recent criticisms, our results show that SW is a valid RVI metric. To quantify RVI in terms of the R2 explained by each variable, Iweighted and Iglobal are the preferred RVI metrics.
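
The SW metric itself is easy to state in code: fit every candidate subset, convert AIC differences to Akaike weights, and sum the weights of the models containing each variable. The Python sketch below does this for a tiny hypothetical all-subsets problem; it does not implement the proposed Iweighted or Ibest metrics.

```python
# Sketch of the sum-of-Akaike-weights (SW) relative-variable-importance metric.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
n, names = 200, ["x1", "x2", "x3"]
X = rng.normal(size=(n, 3))
y = 1.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

models, aics = [], []
for r in range(len(names) + 1):
    for subset in combinations(range(len(names)), r):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        rss = np.sum((y - Z @ beta) ** 2)
        k = Z.shape[1] + 1                          # coefficients + error variance
        aic = n * np.log(rss / n) + 2 * k           # constants dropped; fine for comparison
        models.append(subset); aics.append(aic)

delta = np.array(aics) - min(aics)
w = np.exp(-0.5 * delta); w /= w.sum()              # Akaike weights
for j, name in enumerate(names):
    sw = sum(wi for wi, m in zip(w, models) if j in m)
    print(f"SW({name}) = {sw:.2f}")
```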

Journal ArticleDOI
TL;DR: The development and validation of a data-driven grey-box modelling toolbox for buildings is described, based on a Modelica library with thermal building and Heating, Ventilation and Air-Conditioning models and the optimization framework in JModelica.org.
Abstract: As automatic sensing and information and communication technology get cheaper, building monitoring data becomes easier to obtain. The availability of data leads to new opportunities in the context of energy efficiency in buildings. This paper describes the development and validation of a data-driven grey-box modelling toolbox for buildings. The Python toolbox is based on a Modelica library with thermal building and Heating, Ventilation and Air-Conditioning models and the optimization framework in JModelica.org. The toolchain facilitates and automates the different steps in the system identification procedure, like data handling, model selection, parameter estimation and validation. To validate the methodology, different grey-box models are identified for a single-family dwelling with detailed monitoring data from two experiments. Validated models for forecasting and control can be identified. However, in one experiment the model performance is reduced, likely due to a poor information content in the identification data set.