
Showing papers on "Linear model published in 2014"


Journal ArticleDOI
TL;DR: This paper presents a generic framework for permutation inference for complex general linear models (GLMs) when the errors are exchangeable and/or have a symmetric distribution, and shows that, even in the presence of nuisance effects, these permutation inferences are powerful while providing excellent control of false positives in a wide range of common and relevant imaging research scenarios.
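The paper's full framework is not reproduced here; as a minimal sketch of one standard way to permute in the presence of nuisance covariates (a Freedman-Lane-style scheme), the snippet below permutes residuals from the nuisance-only model and rebuilds a null distribution for the statistic of interest. Data, sample size, and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)                                   # regressor of interest
z = np.column_stack([np.ones(n), rng.normal(size=n)])    # nuisance: intercept + covariate
y = z @ np.array([1.0, 0.5]) + rng.normal(size=n)        # simulated under the null for x

def t_stat(y, x, z):
    """t statistic for x in the full model y ~ x + z."""
    X = np.column_stack([x, z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])

# Freedman-Lane: permute residuals of the nuisance-only (reduced) model
hz = z @ np.linalg.lstsq(z, y, rcond=None)[0]
rz = y - hz
t_obs = t_stat(y, x, z)
null = []
for _ in range(999):
    y_perm = hz + rng.permutation(rz)
    null.append(t_stat(y_perm, x, z))
p = (1 + np.sum(np.abs(null) >= abs(t_obs))) / (1 + len(null))
print(f"observed t = {t_obs:.2f}, permutation p = {p:.3f}")
```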

2,756 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the parameters of forward models are neurophysiologically interpretable in the sense that significant nonzero weights are only observed at channels the activity of which is related to the brain process under study, in contrast to the interpretation of backward model parameters.
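A hedged sketch of the kind of transformation this line of work motivates: for a single linear backward model (decoder) with weights w, an interpretable forward-model activation pattern can be obtained by multiplying the data covariance by w and scaling by the variance of the extracted signal. The data and names below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))        # 500 samples x 8 channels
w = rng.normal(size=8)               # backward-model (decoder) weights

s_hat = X @ w                        # extracted source estimate
cov_x = np.cov(X, rowvar=False)
# Forward-model pattern (single-source case): pattern = Cov(X) w / Var(s_hat)
pattern = cov_x @ w / s_hat.var(ddof=1)
print(pattern.round(3))
```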

1,105 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a method to construct confidence intervals for individual coefficients and linear combinations of several of them in a linear regression model by turning the regression data into an approximate Gaussian sequence of point estimators of individual regression coefficients.
Abstract: Summary The purpose of this paper is to propose methodologies for statistical inference of low dimensional parameters with high dimensional data. We focus on constructing confidence intervals for individual coefficients and linear combinations of several of them in a linear regression model, although our ideas are applicable in a much broader context. The theoretical results that are presented provide sufficient conditions for the asymptotic normality of the proposed estimators along with a consistent estimator for their finite dimensional covariance matrices. These sufficient conditions allow the number of variables to exceed the sample size and the presence of many small non-zero coefficients. Our methods and theory apply to interval estimation of a preconceived regression coefficient or contrast as well as simultaneous interval estimation of many regression coefficients. Moreover, the method proposed turns the regression data into an approximate Gaussian sequence of point estimators of individual regression coefficients, which can be used to select variables after proper thresholding. The simulation results that are presented demonstrate the accuracy of the coverage probability of the confidence intervals proposed as well as other desirable properties, strongly supporting the theoretical results.
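This is not the authors' exact construction, but a rough sketch of the de-biasing idea in this spirit: correct the lasso estimate of one coefficient using the residual from a lasso regression of that column on the remaining columns, then form a normal-approximation interval. The penalty levels and the crude noise estimate are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 300
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
resid = y - lasso.predict(X)

j = 0                                    # coefficient of interest
Xj, Xmj = X[:, j], np.delete(X, j, axis=1)
zj = Xj - Lasso(alpha=0.1).fit(Xmj, Xj).predict(Xmj)   # relaxed-projection residual

# De-biased point estimate and an approximate 95% interval
b_j = lasso.coef_[j] + zj @ resid / (zj @ Xj)
sigma = np.std(resid, ddof=1)            # crude noise estimate for illustration
se = sigma * np.linalg.norm(zj) / abs(zj @ Xj)
print(f"beta_{j}: {b_j:.3f} +/- {1.96 * se:.3f}")
```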

892 citations


Journal ArticleDOI
TL;DR: The logistic regression procedure is explained using examples to make it as simple as possible; a main advantage of the method is that analyzing the association of all variables together helps avoid confounding effects.
Abstract: Logistic regression is used to obtain odds ratio in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After definition of the technique, the basic interpretation of the results is highlighted and then some special issues are discussed.
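A minimal sketch of the kind of analysis the article describes, using statsmodels on synthetic data. The variable names, effect sizes, and sample size are invented for illustration; exponentiating the fitted coefficients gives adjusted odds ratios.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "smoker": rng.integers(0, 2, n),
})
logit = -8 + 0.1 * df["age"] + 0.9 * df["smoker"]        # assumed true effects
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(df[["age", "smoker"]])
fit = sm.Logit(df["event"], X).fit(disp=False)

# Exponentiated coefficients are odds ratios, each adjusted for the other variable
print(np.exp(fit.params))
print(np.exp(fit.conf_int()))            # 95% CIs for the odds ratios
```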

633 citations


Journal ArticleDOI
TL;DR: Efficient algorithms in the genome-wide efficient mixed model association (GEMMA) software for fitting mvLMMs and computing likelihood ratio tests are presented, which offer improved computation speed, power and P-value calibration over existing methods, and can deal with more than two phenotypes.
Abstract: Multivariate linear mixed models (mvLMMs) are powerful tools for testing associations between single-nucleotide polymorphisms and multiple correlated phenotypes while controlling for population stratification in genome-wide association studies. We present efficient algorithms in the genome-wide efficient mixed model association (GEMMA) software for fitting mvLMMs and computing likelihood ratio tests. These algorithms offer improved computation speed, power and P-value calibration over existing methods, and can deal with more than two phenotypes.

622 citations


Journal ArticleDOI
TL;DR: In this paper, a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model is proposed, which can be easily adjusted for multiplicity taking dependence among tests into account.
Abstract: We propose a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model. It can be easily adjusted for multiplicity taking dependence among tests into account. For linear models, our method is essentially the same as in Zhang and Zhang [J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014) 217–242]: we analyze its asymptotic properties and establish its asymptotic optimality in terms of semiparametric efficiency. Our method naturally extends to generalized linear models with convex loss functions. We develop the corresponding theory which includes a careful analysis for Gaussian, sub-Gaussian and bounded correlated designs.

619 citations


Book
01 Jul 2014
TL;DR: This book presents linear modeling with R, covering estimation, inference, prediction (illustrated with body fat and autoregression examples), diagnostics, transformations, model selection, shrinkage methods, missing data, and models with categorical predictors.
Abstract (table of contents): Introduction (Before You Start; Initial Data Analysis; When to Use Linear Modeling; History). Estimation (Linear Model; Matrix Representation; Estimating b; Least Squares Estimation; Examples of Calculating b; Example; QR Decomposition; Gauss-Markov Theorem; Goodness of Fit; Identifiability; Orthogonality). Inference (Hypothesis Tests to Compare Models; Testing Examples; Permutation Tests; Sampling; Confidence Intervals for b; Bootstrap Confidence Intervals). Prediction (Confidence Intervals for Predictions; Predicting Body Fat; Autoregression; What Can Go Wrong with Predictions?). Explanation (Simple Meaning; Causality; Designed Experiments; Observational Data; Matching; Covariate Adjustment; Qualitative Support for Causation). Diagnostics (Checking Error Assumptions; Finding Unusual Observations; Checking the Structure of the Model; Discussion). Problems with the Predictors (Errors in the Predictors; Changes of Scale; Collinearity). Problems with the Error (Generalized Least Squares; Weighted Least Squares; Testing for Lack of Fit; Robust Regression). Transformation (Transforming the Response; Transforming the Predictors; Broken Stick Regression; Polynomials; Splines; Additive Models; More Complex Models). Model Selection (Hierarchical Models; Testing-Based Procedures; Criterion-Based Procedures; Summary). Shrinkage Methods (Principal Components; Partial Least Squares; Ridge Regression; Lasso). Insurance Redlining - A Complete Example (Ecological Correlation; Initial Data Analysis; Full Model and Diagnostics; Sensitivity Analysis; Discussion). Missing Data (Types of Missing Data; Deletion; Single Imputation; Multiple Imputation). Categorical Predictors (A Two-Level Factor; Factors and Quantitative Predictors; Interpretation with Interaction Terms; Factors with More than Two Levels; Alternative Codings of Qualitative Predictors). One Factor Models (The Model; An Example; Diagnostics; Pairwise Comparisons; False Discovery Rate). Models with Several Factors (Two Factors with No Replication; Two Factors with Replication; Two Factors with an Interaction; Larger Factorial Experiments). Experiments with Blocks (Randomized Block Design; Latin Squares; Balanced Incomplete Block Design). Appendix: About R. Bibliography. Index.
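The book itself works in R; as a small, language-neutral illustration of one of the listed topics (least squares estimation via the QR decomposition), here is a numpy sketch on made-up data rather than the book's datasets.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares via the QR decomposition: X = QR, then solve R b = Q^T y
Q, R = np.linalg.qr(X)
beta_hat = np.linalg.solve(R, Q.T @ y)

resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
print(beta_hat.round(3), round(float(sigma2), 3))
```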

591 citations


Journal ArticleDOI
TL;DR: The knockoff filter is introduced, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables, and empirical results show that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
Abstract: In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR) - the expected fraction of false discoveries among all discoveries - is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap - their construction does not require any new data - and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
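Constructing the knockoff variables themselves (which must mimic the correlation structure of the design) is the substantive part of the method and is omitted here. Assuming knockoff statistics W_j have already been computed, large and positive when the original variable beats its knockoff, a knockoff+-style selection step looks roughly like the sketch below; the toy W values are invented.

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Select variables from knockoff statistics W at target FDR q
    using a knockoff+-style data-dependent threshold (sketch only)."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:                                  # smallest t passing the bound
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)

# Toy example: pretend the first 5 variables are real signals
rng = np.random.default_rng(5)
W = np.concatenate([rng.normal(3, 1, 5), rng.normal(0, 1, 45)])
print(knockoff_select(W, q=0.2))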

503 citations


Journal ArticleDOI
TL;DR: In this paper, the covariance test statistic is proposed to test the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path.
Abstract: In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a χ²(1) distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than χ²(1) under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the ℓ1 penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties - adaptivity and shrinkage - and its null distribution is tractable and asymptotically Exp(1).
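To make the adaptivity point concrete, here is a small self-contained simulation (synthetic data, not from the paper, and not the covariance test itself): under the global null, pick the single predictor most correlated with y and compute the drop in RSS from adding it. Because the predictor is chosen adaptively, the drop is stochastically much larger than a χ²(1) draw.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p, reps = 100, 50, 2000
drops = []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                 # global null: no predictor matters
    y = y - y.mean()
    j = np.argmax(np.abs(X.T @ y))         # adaptively chosen predictor
    xj = X[:, j] - X[:, j].mean()
    rss0 = y @ y
    beta = xj @ y / (xj @ xj)
    rss1 = np.sum((y - beta * xj) ** 2)
    drops.append(rss0 - rss1)              # sigma^2 = 1, so this is on the chi^2 scale

drops = np.array(drops)
print("mean drop in RSS:", drops.mean().round(2), " (chi2(1) mean = 1)")
print("95th percentile:", np.quantile(drops, 0.95).round(2),
      " vs chi2(1):", round(stats.chi2.ppf(0.95, df=1), 2))
```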

425 citations


Journal ArticleDOI
TL;DR: Achieving the full potential from CS projects requires meta-data describing the sampling process, reference data to allow for standardization, and insightful modeling suitable to the question of interest.

380 citations


Journal ArticleDOI
01 Mar 2014
TL;DR: This study systematically compares linear and nonlinear regression techniques for an independent, simultaneous and proportional myoelectric control of wrist movements with two DoF and shows that KRR, a nonparametric statistical learning method, outperformed the other methods.
Abstract: In recent years the number of active controllable joints in electrically powered hand-prostheses has increased significantly. However, the control strategies for these devices in current clinical use are inadequate as they require separate and sequential control of each degree-of-freedom (DoF). In this study we systematically compare linear and nonlinear regression techniques for an independent, simultaneous and proportional myoelectric control of wrist movements with two DoF. These techniques include linear regression, mixture of linear experts (ME), multilayer-perceptron, and kernel ridge regression (KRR). They are investigated offline with electro-myographic signals acquired from ten able-bodied subjects and one person with congenital upper limb deficiency. The control accuracy is reported as a function of the number of electrodes and the amount and diversity of training data providing guidance for the requirements in clinical practice. The results showed that KRR, a nonparametric statistical learning method, outperformed the other methods. However, simple transformations in the feature space could linearize the problem, so that linear models could achieve similar performance as KRR at much lower computational costs. Especially ME, a physiologically inspired extension of linear regression represents a promising candidate for the next generation of prosthetic devices.
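As a hedged sketch of the kind of comparison reported (linear regression versus kernel ridge regression), here is a scikit-learn example on synthetic features standing in for EMG features; the feature count, target function, and hyperparameters are invented.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, d = 400, 16                      # pretend: 16 EMG-like features, 1 DoF target
X = rng.normal(size=(n, d))
y = np.tanh(X[:, 0] + 0.5 * X[:, 1]) + 0.1 * rng.normal(size=n)  # mildly nonlinear target

lin = Ridge(alpha=1.0)
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.05)
print("linear R^2:", cross_val_score(lin, X, y, cv=5).mean().round(3))
print("KRR    R^2:", cross_val_score(krr, X, y, cv=5).mean().round(3))
```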

Journal ArticleDOI
TL;DR: It is found that it is possible to greatly reduce error rates by considering the results of all three methods when identifying outlier loci, and the relative ranking between the methods is impacted by the consideration of polygenic selection.
Abstract: The recent availability of next-generation sequencing (NGS) has made possible the use of dense genetic markers to identify regions of the genome that may be under the influence of selection. Several statistical methods have been developed recently for this purpose. Here, we present the results of an individual-based simulation study investigating the power and error rate of popular or recent genome scan methods: linear regression, Bayescan, BayEnv and LFMM. Contrary to previous studies, we focus on complex, hierarchical population structure and on polygenic selection. Additionally, we use a false discovery rate (FDR)-based framework, which provides a unified testing framework across frequentist and Bayesian methods. Finally, we investigate the influence of population allele frequencies versus individual genotype data specification for LFMM and the linear regression. The relative ranking between the methods is impacted by the consideration of polygenic selection, compared to a monogenic scenario. For strongly hierarchical scenarios with confounding effects between demography and environmental variables, the power of the methods can be very low. Except for one scenario, Bayescan exhibited moderate power and error rate. BayEnv performance was good under nonhierarchical scenarios, while LFMM provided the best compromise between power and error rate across scenarios. We found that it is possible to greatly reduce error rates by considering the results of all three methods when identifying outlier loci.
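The study's FDR-based framework is not reproduced here; as a minimal sketch of an FDR-controlling step that could be applied to p-values from any of the scan methods, below is a Benjamini-Hochberg procedure on hypothetical p-values.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return indices of loci declared outliers at FDR level q (BH procedure)."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = q * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresh
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.flatnonzero(passed))       # largest i with p_(i) <= i*q/m
    return order[:k + 1]

# Hypothetical p-values from a genome scan: 10 loci under selection, 990 neutral
rng = np.random.default_rng(8)
pvals = np.concatenate([rng.uniform(0, 1e-4, 10), rng.uniform(0, 1, 990)])
print(len(benjamini_hochberg(pvals, q=0.05)), "loci flagged")
```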

Journal ArticleDOI
TL;DR: This work proposes a method, FaST-LMM-EWASher, that automatically corrects for cell-type composition without the need for explicit knowledge of it, and then validate the method by comparison with the state-of-the-art approach.
Abstract: In epigenome-wide association studies, cell-type composition often differs between cases and controls, yielding associations that simply tag cell type rather than reveal fundamental biology. Current solutions require actual or estimated cell-type composition--information not easily obtainable for many samples of interest. We propose a method, FaST-LMM-EWASher, that automatically corrects for cell-type composition without the need for explicit knowledge of it, and then validate our method by comparison with the state-of-the-art approach. Corresponding software is available from http://www.microsoft.com/science/.

Journal ArticleDOI
TL;DR: In this paper, the authors evaluate the forecasting performance of neural network models relative to that of time series methods for tourism demand at the regional (destination) level, motivated by the constant growth of world tourism, and find that forecasts of tourist arrivals are more accurate than forecasts of overnight stays.

Journal ArticleDOI
TL;DR: In this paper, a case study of statistical model evaluation was conducted for the DSSAT Cropping System Model (CSM) using 10 experimental datasets for maize, peanut, soybean, wheat and potato from Brazil, China, Ghana, and the USA.

Journal ArticleDOI
TL;DR: A multivariate modeling (MVM) approach to analyzing neuroimaging data at the group level with the following advantages: there is no limit on the number of factors as long as sample sizes are deemed appropriate; quantitative covariates can be analyzed together with within-subject factors; and the severity of sphericity violation varies substantially across brain regions.

Journal ArticleDOI
TL;DR: The presented approach to the fitting of generalized linear mixed models includes an L1-penalty term that enforces variable selection and shrinkage simultaneously, and a gradient ascent algorithm is proposed that makes it possible to maximize the penalized log-likelihood, yielding models with reduced complexity.
Abstract: Generalized linear mixed models are a widely used tool for modeling longitudinal data. However, their use is typically restricted to few covariates, because the presence of many predictors yields unstable estimates. The presented approach to the fitting of generalized linear mixed models includes an L1-penalty term that enforces variable selection and shrinkage simultaneously. A gradient ascent algorithm is proposed that makes it possible to maximize the penalized log-likelihood, yielding models with reduced complexity. In contrast to common procedures, it can be used in high-dimensional settings where a large number of potentially influential explanatory variables is available. The method is investigated in simulation studies and illustrated by use of real data sets.
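The paper's algorithm handles random effects in GLMMs; as a much simpler illustration of maximizing an L1-penalized likelihood by (proximal) gradient steps, here is a sketch for penalized logistic regression with fixed effects only. The step size, penalty level, and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 300, 40
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [1.5, -1.0, 0.8]
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

lam, step, beta = 0.02, 0.01, np.zeros(p)
for _ in range(2000):
    mu = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (y - mu) / n               # gradient of the average log-likelihood
    beta = beta + step * grad               # gradient ascent step
    beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # L1 soft-threshold

print("nonzero coefficients:", np.flatnonzero(beta))
```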

Reference EntryDOI
29 Sep 2014
TL;DR: This article discusses the C4.5, CART, CRUISE, GUIDE, and QUEST methods in terms of their algorithms, features, properties, and performances.
Abstract: A classification or regression tree is a prediction model that can be represented as a decision tree. This article discusses the C4.5, CART, CRUISE, GUIDE, and QUEST methods in terms of their algorithms, features, properties, and performances. Keywords: cross-validation; discriminant; linear model; prediction accuracy; recursive partitioning; selection bias; unbiased
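C4.5, CRUISE, GUIDE, and QUEST are not all available in a single common library; as a hedged illustration of the general idea, here is a CART-style classification tree fitted with scikit-learn on a standard toy dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # CART-style binary splits
print("5-fold accuracy:", cross_val_score(tree, X, y, cv=5).mean().round(3))
tree.fit(X, y)
print(export_text(tree))          # the fitted tree as nested if/else rules
```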

Journal ArticleDOI
TL;DR: In this paper, Monte Carlo methods were used to examine model convergence rates, parameter point estimates (statistical bias), parameter interval estimates (confidence interval accuracy and precision), and both Type I error control and statistical power of tests associated with the fixed effects from linear two-level models estimated with PROC MIXED.
Abstract: Whereas general sample size guidelines have been suggested when estimating multilevel models, they are only generalizable to a relatively limited number of data conditions and model structures, both of which are not very feasible for the applied researcher. In an effort to expand our understanding of two-level multilevel models under less than ideal conditions, Monte Carlo methods, through SAS/IML, were used to examine model convergence rates, parameter point estimates (statistical bias), parameter interval estimates (confidence interval accuracy and precision), and both Type I error control and statistical power of tests associated with the fixed effects from linear two-level models estimated with PROC MIXED. These outcomes were analyzed as a function of: (a) level-1 sample size, (b) level-2 sample size, (c) intercept variance, (d) slope variance, (e) collinearity, and (f) model complexity. Bias was minimal across nearly all conditions simulated. The 95% confidence interval coverage and Type I error rate tended to be slightly conservative. The degree of statistical power was related to sample sizes and level of fixed effects; higher power was observed with larger sample sizes and level-1 fixed effects. Hierarchically organized data are commonplace in educational, clinical, and other settings in which research often occurs. Students are nested within classrooms or teachers, and teachers are nested within schools. Alternatively, service recipients are nested within social workers providing services, who may in turn be nested within local civil service entities. Conducting research at any of these levels while ignoring the more detailed levels (students) or contextual levels (schools) can lead to erroneous conclusions. As such, multilevel models have been developed to properly account
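The study was run with SAS PROC MIXED; as a rough Python analogue of a single simulation replicate, the sketch below generates one two-level dataset and fits a random-intercept, random-slope model with statsmodels MixedLM. Group counts, variances, and effect sizes are illustrative, not the study's design values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n_groups, n_per = 30, 20                      # level-2 and level-1 sample sizes
rows = []
for g in range(n_groups):
    u0 = rng.normal(scale=0.5)                # random intercept deviation
    u1 = rng.normal(scale=0.3)                # random slope deviation
    x = rng.normal(size=n_per)
    y = 1.0 + (0.4 + u1) * x + u0 + rng.normal(size=n_per)
    rows.append(pd.DataFrame({"y": y, "x": x, "group": g}))
data = pd.concat(rows, ignore_index=True)

# Random-intercept, random-slope model; the fixed effect of interest is x
model = smf.mixedlm("y ~ x", data, groups=data["group"], re_formula="~x")
fit = model.fit()
print(fit.summary())
```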

Journal ArticleDOI
TL;DR: This article proposes an expectation–maximization algorithm in order to estimate the regression coefficients of modal linear regression and provides asymptotic properties for the proposed estimator without the symmetric assumption of the error density.
Abstract (Yao, Weixin; Li, Longhai): The mode of a distribution provides an important summary of data and is often estimated on the basis of some non-parametric kernel density estimator. This article develops a new data analysis tool called modal linear regression in order to explore high-dimensional data. Modal linear regression models the conditional mode of a response Y given a set of predictors x as a linear function of x. Modal linear regression differs from standard linear regression in that standard linear regression models the conditional mean (as opposed to mode) of Y as a linear function of x. We propose an expectation–maximization algorithm in order to estimate the regression coefficients of modal linear regression. We also provide asymptotic properties for the proposed estimator without the symmetric assumption of the error density. Our empirical studies with simulated data and real data demonstrate that the proposed modal regression gives shorter predictive intervals than mean linear regression, median linear regression and MM-estimators.
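A hedged sketch of the kind of EM-style iteration used in modal regression: weight each observation by a kernel of its current residual, then solve a weighted least squares problem, and repeat. The bandwidth, data, and convergence tolerance below are made up.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Skewed errors: the mean and the mode differ, which is where modal regression helps
y = 1.0 + 2.0 * x + rng.exponential(scale=1.0, size=n)

h = 0.5                                        # kernel bandwidth (assumed)
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from OLS
for _ in range(100):
    r = y - X @ beta
    w = np.exp(-0.5 * (r / h) ** 2)            # Gaussian-kernel weights (E-step-like)
    W = np.diag(w)
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted LS (M-step-like)
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        break
    beta = beta_new

print("modal fit:", beta.round(3), " (OLS intercept is pulled up by the skewed errors)")
```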

Journal ArticleDOI
TL;DR: An empirical application investigating the determinants of cross-country savings rates finds two latent groups among 56 countries, providing empirical confirmation that higher savings rates go hand in hand with higher income growth.
Abstract: This paper provides a novel mechanism for identifying and estimating latent group structures in panel data using penalized techniques. We consider both linear and nonlinear models where the regression coefficients are heterogeneous across groups but homogeneous within a group and the group membership is unknown. Two approaches are considered: penalized profile likelihood (PPL) estimation for the general nonlinear models without endogenous regressors, and penalized GMM (PGMM) estimation for linear models with endogeneity. In both cases we develop a new variant of Lasso called classifier-Lasso (C-Lasso) that serves to shrink individual coefficients to the unknown group-specific coefficients. C-Lasso achieves simultaneous classification and consistent estimation in a single step and the classification exhibits the desirable property of uniform consistency. For PPL estimation C-Lasso also achieves the oracle property so that group-specific parameter estimators are asymptotically equivalent to infeasible estimators that use individual group identity information. For PGMM estimation the oracle property of C-Lasso is preserved in some special cases. Simulations demonstrate good finite-sample performance of the approach both in classification and estimation. Empirical applications to both linear and nonlinear models are presented.

Journal ArticleDOI
TL;DR: This article investigates how BIC can be adapted to high-dimensional linear quantile regression and shows that a modified BIC is consistent in model selection when the number of variables diverges as the sample size increases and extends the results to structured nonparametric quantile models with a diverging number of covariates.
Abstract: Bayesian information criterion (BIC) is known to identify the true model consistently as long as the predictor dimension is finite. Recently, its moderate modifications have been shown to be consistent in model selection even when the number of variables diverges. Those works have been done mostly in mean regression, but rarely in quantile regression. The best-known results about BIC for quantile regression are for linear models with a fixed number of variables. In this article, we investigate how BIC can be adapted to high-dimensional linear quantile regression and show that a modified BIC is consistent in model selection when the number of variables diverges as the sample size increases. We also discuss how it can be used for choosing the regularization parameters of penalized approaches that are designed to conduct variable selection and shrinkage estimation simultaneously. Moreover, we extend the results to structured nonparametric quantile models with a diverging number of covariates. We illustrate o...

Journal ArticleDOI
TL;DR: It is argued here that many medical applications of machine learning models in genetic disease risk prediction rely essentially on two factors: effective model regularization and rigorous model validation.
Abstract: Supervised machine learning aims at constructing a genotype–phenotype model by learning such genetic patterns from a labeled set of training examples that will also provide accurate phenotypic predictions in new cases with similar genetic background. Such predictive models are increasingly being applied to the mining of panels of genetic variants, environmental, or other nongenetic factors in the prediction of various complex traits and disease phenotypes [1]–[8]. These studies are providing increasing evidence in support of the idea that machine learning provides a complementary view into the analysis of high-dimensional genetic datasets as compared to standard statistical association testing approaches. In contrast to identifying variants explaining most of the phenotypic variation at the population level, supervised machine learning models aim to maximize the predictive (or generalization) power at the level of individuals, hence providing exciting opportunities for, e.g., individualized risk prediction based on personal genetic profiles [9]–[11]. Machine learning models can also deal with genetic interactions, which are known to play an important role in the development and treatment of many complex diseases [12]–[16], but are often missed by single-locus association tests [17]. Even in the absence of significant single-loci marginal effects, multilocus panels from distinct molecular pathways may provide synergistic contribution to the prediction power, thereby revealing part of such hidden heritability component that has remained missing because of too small marginal effects to pass the stringent genome-wide significance filters [18]. Multivariate modeling approaches have already been shown to provide improved insights into genetic mechanisms and the interaction networks behind many complex traits, including atherosclerosis, coronary heart disease, and lipid levels, which would have gone undetected using the standard univariate modeling [2], [19]–[22]. However, machine learning models also come with inherent pitfalls, such as increased computational complexity and the risk for model overfitting, which must be understood in order to avoid reporting unrealistic prediction models or over-optimistic prediction results. We argue here that many medical applications of machine learning models in genetic disease risk prediction rely essentially on two factors: effective model regularization and rigorous model validation. We demonstrate the effects of these factors using representative examples from the literature as well as illustrative case examples. This review is not meant to be a comprehensive survey of all predictive modeling approaches, but we focus on regularized machine learning models, which enforce constraints on the complexity of the learned models so that they would ignore irrelevant patterns in the training examples. Simple risk allele counting or other multilocus risk models that do not incorporate any model parameters to be learned are outside the scope of this review; in fact, such simplistic models that assume independent variants may lead to suboptimal prediction performance in the presence of either direct or indirect interactions through epistasis effects or linkage disequilibrium, respectively [23], [24]. Perhaps the simplest models considered here as learning approaches are those based on weighted risk allele summaries [23], [25].
However, even with such basic risk models intended for predictive purposes, it is important to learn the model parameters (e.g., select the variants and determine their weights) based on training data only; otherwise there is a severe risk of model overfitting, i.e., models not being capable of generalizing to new samples [5]. Representative examples of how model learning and regularization approaches address the overfitting problem are briefly summarized in Box 1, while those readers interested in their implementation details are referred to the accompanying Text S1. We specifically promote here the use of such regularized machine learning models that are scalable to the entire genome-wide scale, often based on linear models, which are easy to interpret and also enable straightforward variable selection. Genome-scale approaches avoid the need of relying on two-stage approaches [26], which apply standard statistical procedures to reduce the number of variants, since such prefiltering may miss predictive interactions across loci and therefore lead to reduced predictive performance [8], [24], [25], [27], [28]. Box 1. Synthesis of Learning Models for Genetic Risk Prediction The aim of risk models is to capture in a mathematical form the patterns in the genetic and non-genetic data most important for the prediction of disease susceptibility. The first step in model building involves choosing the functional form of the model (e.g., linear or nonlinear), and then making use of a given training data to determine the adjustable parameters of the model (e.g., a subset of variants, their weights, and other model parameters). While it is often sufficient for a statistical model to enable high enough explanatory power in the discovery material, without being overly complicated, a predictive model is also required to generalize to unseen cases. One consideration in the model construction is how to encode the genotypic measurements using genotype models, such as the dominant, recessive, multiplicative, or additive model, each implying different assumptions about the genetic effects in the data [79]. Categorical variables 0, 1, and 2 are typically used for treating genetic predictor variables (e.g., minor allele dosage), while numeric values are required for continuous risk factors (e.g., blood pressure). Expected posterior probabilities of the genotypes can also be used, especially for imputed genotypes. Transforming the genotype categories into three binary features is an alternative way to deal with missing values without imputation (used in the T1D example; see Text S1 for details). Statistical or machine learning models identify statistical or predictive interactions, respectively, rather than biological interactions between or within variants [12], [80]. While nonlinear models may better capture complex genetic interactions [7], [81], linear models are easier to interpret and provide a scalable option for performing supervised selection of multilocus variant panels at the genome-wide scale [3]. In linear models, genetic interactions are modeled implicitly by selecting such variant combinations that together are predictive of the phenotype, rather than considering pairwise gene–gene relationships explicitly. 
Formally, trait y_i to be predicted for an individual i is modeled as a linear combination of the individual's predictor variables x_ij: y_i = w_0 + w_1 x_i1 + ... + w_p x_ip (1). Here, the weights w_j are assumed constant across the n individuals, w_0 is the bias offset term and p indicates the number of predictors discovered in the training data. In its basic form, Eq. 1 can be used for modeling continuous traits y (linear regression). For case-control classification, the binary dependent variable y is often transformed using a logistic loss function, which models the probability of a case class given a genotype profile and other risk factor covariates x (logistic regression). It has been shown that the logistic regression and naive Bayes risk models are mathematically very closely related in the context of genetic risk prediction [81].
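As a minimal sketch of the review's two central points, regularization and validation, here is an L1-regularized logistic model of the form in Eq. 1 fitted to synthetic 0/1/2 genotype-like dosages and evaluated by cross-validation. The data, penalty strength, and effect sizes are invented, and this is an illustration rather than any of the specific tools cited.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
n, p = 600, 1000
X = rng.integers(0, 3, size=(n, p)).astype(float)    # minor-allele dosages 0/1/2
w = np.zeros(p); w[:10] = rng.normal(0.4, 0.1, 10)   # a few truly predictive variants
score = X @ w
y = rng.binomial(1, 1 / (1 + np.exp(-(score - score.mean()))))

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)  # L1 regularization
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")           # validation outside training
print("cross-validated AUC:", auc.mean().round(3))
```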

Journal ArticleDOI
09 May 2014-Test
TL;DR: In this paper, the authors survey some classical results in change point analysis and recent extensions to time series, multivariate, panel and functional data, and present real data examples which illustrate the utility of the surveyed results.
Abstract: A common goal in modeling and data mining is to determine, based on sample data, whether or not a change of some sort has occurred in a quantity of interest. The study of statistical problems of this nature is typically referred to as change point analysis. Though change point analysis originated nearly 70 years ago, it is still an active area of research and much effort has been put forth to develop new methodology and discover new applications to address modern statistical questions. In this paper we survey some classical results in change point analysis and recent extensions to time series, multivariate, panel and functional data. We also present real data examples which illustrate the utility of the surveyed results.
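As a minimal sketch of one of the classical ideas the survey covers, here is a CUSUM-type statistic for a single change in the mean of a univariate series; calibration of the detection threshold is omitted and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(13)
x = np.concatenate([rng.normal(0, 1, 120), rng.normal(1.2, 1, 80)])   # change at t = 120
n = len(x)

# CUSUM statistic |S_k - (k/n) S_n| / (sigma * sqrt(n)), maximized over k
S = np.cumsum(x)
k = np.arange(1, n + 1)
sigma = x.std(ddof=1)                          # crude scale estimate
cusum = np.abs(S - (k / n) * S[-1]) / (sigma * np.sqrt(n))
print("estimated change point:", int(np.argmax(cusum)) + 1,
      " max CUSUM:", cusum.max().round(2))
```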

Journal ArticleDOI
TL;DR: A combination methodology which attempts to benefit from the strengths of both RW and ANN models, and achieves reasonably better forecasting accuracies than each of RW, FANN and EANN models in isolation for all four financial time series.
Abstract: Properly comprehending and modeling the dynamics of financial data has indispensable practical importance. The prime goal of a financial time series model is to provide reliable future forecasts which are crucial for investment planning, fiscal risk hedging, governmental policy making, etc. These time series often exhibit notoriously haphazard movements which make the task of modeling and forecasting extremely difficult. As per the research evidence, the random walk (RW) is so far the best linear model for forecasting financial data. Artificial neural network (ANN) is another promising alternative with the unique capability of nonlinear self-adaptive modeling. Numerous comparisons of the performances of RW and ANN models have also been carried out in the literature with mixed conclusions. In this paper, we propose a combination methodology which attempts to benefit from the strengths of both RW and ANN models. In our proposed approach, the linear part of a financial dataset is processed through the RW model, and the remaining nonlinear residuals are processed using an ensemble of feedforward ANN (FANN) and Elman ANN (EANN) models. The forecasting ability of the proposed scheme is examined on four real-world financial time series in terms of three popular error statistics. The obtained results clearly demonstrate that our combination method achieves reasonably better forecasting accuracies than each of RW, FANN and EANN models in isolation for all four financial time series.
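A rough sketch of the general hybrid idea described: the random-walk part supplies the one-step forecast (the previous value), and a small feedforward network is trained on lagged residuals to capture the remaining nonlinear structure. A single MLP stands in for the paper's FANN/EANN ensemble, and the series, lags, and network size are invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(14)
n = 600
# Synthetic "price" series: random walk plus a small nonlinear component
series = np.cumsum(rng.normal(scale=0.5, size=n)) + 0.8 * np.sin(np.arange(n) / 15.0)

# Random-walk part: the one-step forecast is just the previous value
rw_forecast = series[:-1]
resid = series[1:] - rw_forecast               # what the RW cannot explain

# Nonlinear part: feedforward net on lagged residuals
lags = 5
Xr = np.column_stack([resid[i:len(resid) - lags + i] for i in range(lags)])
yr = resid[lags:]
split = int(0.8 * len(yr))
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
net.fit(Xr[:split], yr[:split])

hybrid = rw_forecast[lags:][split:] + net.predict(Xr[split:])
actual = series[1 + lags:][split:]
print("RW MAE:    ", np.mean(np.abs(actual - rw_forecast[lags:][split:])).round(3))
print("hybrid MAE:", np.mean(np.abs(actual - hybrid)).round(3))
```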

Journal ArticleDOI
TL;DR: The SSN package for R provides a set of functions for importing, simulating, and modeling of stream network data, including diagnostics and prediction, and traditional models that use Euclidean distance and simple random effects are included.
Abstract: The SSN package for R provides a set of functions for modeling stream network data. The package can import geographic information systems data or simulate new data as a ‘SpatialStreamNetwork’, a new object class that builds on the spatial sp classes. Functions are provided that fit spatial linear models (SLMs) for the ‘SpatialStreamNetwork’ object. The covariance matrix of the SLMs uses distance metrics and geostatistical models that are unique to stream networks; these models account for the distances and topological configuration of stream networks, including the volume and direction of flowing water. In addition, traditional models that use Euclidean distance and simple random effects are included, along with Poisson and binomial families, for a generalized linear mixed model framework. Plotting and diagnostic functions are provided. Prediction (kriging) can be performed for missing data or for a separate set of unobserved locations, or block prediction (block kriging) can be used over sets of stream segments. This article summarizes the SSN package for importing, simulating, and modeling of stream network data, including diagnostics and prediction.

Journal ArticleDOI
TL;DR: In this paper, the authors used simulated discrete lognormal data to identify factors affecting citation scores that are unrelated to scholarly quality or usefulness so that these can be taken into account.

Journal ArticleDOI
Lee H. Dicker
TL;DR: In this article, a method-of-moments-based estimator for the residual variance, the proportion of explained variation and other related quantities, such as the l2 signal strength, is proposed.
Abstract: The residual variance and the proportion of explained variation are important quantities in many statistical models and model fitting procedures. They play an important role in regression diagnostics and model selection procedures, as well as in determining the performance limits in many problems. In this paper we propose new method-of-moments-based estimators for the residual variance, the proportion of explained variation and other related quantities, such as the l2 signal strength. The proposed estimators are consistent and asymptotically normal in high-dimensional linear models with Gaussian predictors and errors, where the number of predictors d is proportional to the number of observations n; in fact, consistency holds even in settings where d/n → ∞. Existing results on residual variance estimation in high-dimensional linear models depend on sparsity in the underlying signal. Our results require no sparsity assumptions and imply that the residual variance and the proportion of explained variation can be consistently estimated even when d>n and the underlying signal itself is nonestimable. Numerical work suggests that some of our distributional assumptions may be relaxed. A real-data analysis involving gene expression data and single nucleotide polymorphism data illustrates the performance of the proposed methods.

Posted Content
TL;DR: In this article, the authors propose a semiparametric single index model: a general model in which it is only assumed that each observation y_i may depend on a_i only through the inner product <a_i, x>.
Abstract: Consider measuring an n-dimensional vector x through the inner product with several measurement vectors, a_1, a_2, ..., a_m. It is common in both signal processing and statistics to assume the linear response model y_i = <a_i, x> + e_i, where e_i is a noise term. However, in practice the precise relationship between the signal x and the observations y_i may not follow the linear model, and in some cases it may not even be known. To address this challenge, in this paper we propose a general model where it is only assumed that each observation y_i may depend on a_i only through <a_i, x>. We do not assume that the dependence is known. This is a form of the semiparametric single index model, and it includes the linear model as well as many forms of the generalized linear model as special cases. We further assume that the signal x has some structure, and we formulate this as a general assumption that x belongs to some known (but arbitrary) feasible set K. We carefully detail the benefit of using the signal structure to improve estimation. The theory is based on the mean width of K, a geometric parameter which can be used to understand its effective dimension in estimation problems. We determine a simple, efficient two-step procedure for estimating the signal based on this model -- a linear estimation followed by metric projection onto K. We give general conditions under which the estimator is minimax optimal up to a constant. This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one's ability to determine the signal, even when the non-linearity may be non-invertible. Our results may be specialized to understand the effect of non-linearities in compressed sensing.
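A small sketch of the two-step procedure described, a linear estimate followed by projection onto the structure set K, taking K to be the set of s-sparse unit vectors so that the projection amounts to keeping the s largest entries. The nonlinearity (a sign function) and all sizes are made up; the linear step recovers the signal only up to scale, so the output is compared to the true direction.

```python
import numpy as np

rng = np.random.default_rng(15)
n, d, s = 500, 200, 5
x = np.zeros(d); x[:s] = rng.normal(size=s); x /= np.linalg.norm(x)
A = rng.normal(size=(n, d))                     # measurement vectors a_i as rows
y = np.sign(A @ x) + 0.1 * rng.normal(size=n)   # unknown nonlinearity of <a_i, x> plus noise

# Step 1: linear estimate  x_lin = (1/n) * sum_i y_i a_i
x_lin = A.T @ y / n

# Step 2: metric projection onto K (here: keep the s largest entries, i.e. s-sparse vectors)
x_hat = np.zeros(d)
top = np.argsort(np.abs(x_lin))[-s:]
x_hat[top] = x_lin[top]
x_hat /= np.linalg.norm(x_hat)

print("correlation with true direction:", float(np.abs(x_hat @ x)).round(3))
```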

Journal ArticleDOI
TL;DR: This paper presents a data-driven method for the task of fault detection in nonlinear systems using locally weighted projection regression (LWPR) as a powerful tool for modeling the nonlinear process with locally linear models.
Abstract: This paper presents a data-driven method for the task of fault detection in nonlinear systems. In the proposed approach, locally weighted projection regression (LWPR) is employed to serve as a powerful tool for modeling the nonlinear process with locally linear models. In each local model, partial least squares (PLS) regression is performed and PLS-based fault detection scheme is applied to monitor the regional model. The diagnosis for the global process is based on the normalized weighted mean of all the local models. Both conventional and quality-related statistical indicators are designed to compute the test statistics. Two nonlinear systems, a numerical one and a benchmark, are used to demonstrate the effectiveness of the proposed method.