
Showing papers on "Nonparametric statistics published in 2019"


Journal ArticleDOI
TL;DR: Proposes a flexible, computationally efficient algorithm for growing generalized random forests, an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest, and an estimator for the asymptotic variance that enables valid confidence intervals.
Abstract: We propose generalized random forests, a method for nonparametric statistical estimation based on random forests (Breiman [Mach. Learn. 45 (2001) 5–32]) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: nonparametric quantile regression, conditional average partial effect estimation and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.

840 citations
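For readers who want to experiment, here is a minimal sketch of the forest-weighting idea applied to nonparametric quantile regression. It uses scikit-learn in Python rather than the authors' grf package, and follows the simpler quantile-regression-forest style of weighting; the data-generating process is invented for illustration.

```python
# Forest-based adaptive weights alpha_i(x) for a conditional quantile:
# not the grf implementation, just the weighting idea with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 3))
y = X[:, 0] + rng.standard_normal(2000) * (0.5 + 0.5 * (X[:, 1] > 0))

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20)
forest.fit(X, y)

def forest_weights(forest, X_train, x):
    """alpha_i(x): how often training point i shares a leaf with x."""
    train_leaves = forest.apply(X_train)            # (n, n_trees)
    query_leaves = forest.apply(x.reshape(1, -1))   # (1, n_trees)
    match = train_leaves == query_leaves            # broadcast over trees
    # within each tree, weight co-leaf points by 1/|leaf|, then average
    leaf_sizes = match.sum(axis=0, keepdims=True).clip(min=1)
    return (match / leaf_sizes).mean(axis=1)

def weighted_quantile(y, w, q):
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / w.sum()
    return y[order][np.searchsorted(cdf, q)]

x0 = np.array([0.5, 1.0, 0.0])
w = forest_weights(forest, X, x0)
print(weighted_quantile(y, w, 0.9))   # estimated conditional 0.9-quantile
```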


Journal ArticleDOI
TL;DR: This study discusses the summary measures and the methods used to test the normality of data; each method has its own advantages and disadvantages.
Abstract: Descriptive statistics are an important part of biomedical research and are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Measures of central tendency and dispersion are used to describe quantitative data. For continuous data, testing for normality is an important step in deciding on the measures of central tendency and the statistical methods for data analysis. When the data follow a normal distribution, parametric tests are used to compare the groups; otherwise, nonparametric methods are used. There are different methods for testing the normality of data, including numerical and visual methods, and each method has its own advantages and disadvantages. In the present study, we discuss the summary measures and the methods used to test the normality of data.

643 citations
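As a concrete illustration of the workflow described above, the following sketch checks normality with the Shapiro-Wilk test and then branches to a parametric or nonparametric two-group comparison; the data and significance level are invented.

```python
# Check normality, then choose the comparison test accordingly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(5.0, 1.0, 40)
group_b = rng.lognormal(1.6, 0.4, 40)   # deliberately skewed

alpha = 0.05
normal_a = stats.shapiro(group_a).pvalue > alpha
normal_b = stats.shapiro(group_b).pvalue > alpha

if normal_a and normal_b:
    # parametric: compare means with Welch's t-test
    result = stats.ttest_ind(group_a, group_b, equal_var=False)
else:
    # nonparametric: Mann-Whitney U test on ranks
    result = stats.mannwhitneyu(group_a, group_b)
print(result)
```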


Journal ArticleDOI
TL;DR: A new nonparametric matching framework is introduced that elucidates how various unit fixed effects models implicitly compare treated and control observations to draw causal inference and enables a diverse set of identification strategies to adjust for unobservables in the absence of dynamic causal relationships between treatment and outcome variables.
Abstract: Many researchers use unit fixed effects regression models as their default methods for causal inference with longitudinal data. We show that the ability of these models to adjust for unobserved time‐invariant confounders comes at the expense of dynamic causal relationships, which are permitted under an alternative selection‐on‐observables approach. Using the nonparametric directed acyclic graph, we highlight two key causal identification assumptions of unit fixed effects models: Past treatments do not directly influence current outcome, and past outcomes do not affect current treatment. Furthermore, we introduce a new nonparametric matching framework that elucidates how various unit fixed effects models implicitly compare treated and control observations to draw causal inference. By establishing the equivalence between matching and weighted unit fixed effects estimators, this framework enables a diverse set of identification strategies to adjust for unobservables in the absence of dynamic causal relationships between treatment and outcome variables. We illustrate the proposed methodology through its application to the estimation of GATT membership effects on dyadic trade volume.

180 citations
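A minimal sketch of the default approach the abstract critiques, a unit fixed effects regression, is shown below; the panel, its column names, and the tiny dyad-year data are all hypothetical.

```python
# Unit fixed effects via dummy variables with statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

# toy dyad-year panel (invented columns and values)
df = pd.DataFrame({
    "trade": [1.0, 1.2, 0.8, 0.9, 2.0, 2.4],
    "gatt":  [0,   1,   0,   0,   1,   1],
    "dyad":  ["a", "a", "b", "b", "c", "c"],
})

# C(dyad) absorbs time-invariant dyad confounders, at the cost of ruling
# out feedback from past outcomes to current treatment (see abstract)
fe_model = smf.ols("trade ~ gatt + C(dyad)", data=df).fit()
print(fe_model.params["gatt"])
```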


Journal ArticleDOI
TL;DR: This work proposes a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals, and finds that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias-correction properties.
Abstract: Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These two estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.

158 citations
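The subsampling recipe can be mimicked outside random forests. The sketch below recomputes a variable-importance measure on many size-b subsamples and scales the replicate variance by b/n; it uses scikit-learn's permutation importance as a stand-in for VIMP, so it is an illustration of the idea rather than the authors' implementation.

```python
# Subsampling variance estimate for a variable-importance measure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n, b, B = 1000, 200, 50
X = rng.standard_normal((n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

def vimp(X, y):
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    return permutation_importance(rf, X, y, n_repeats=5,
                                  random_state=0).importances_mean

reps = []
for _ in range(B):
    idx = rng.choice(n, b, replace=False)      # subsample without replacement
    reps.append(vimp(X[idx], y[idx]))
theta_sub = np.array(reps)

var_hat = (b / n) * theta_sub.var(axis=0, ddof=1)   # per-variable variance
print(np.sqrt(var_hat))                             # standard errors for VIMP
```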


Journal ArticleDOI
TL;DR: This technical report revisits the analysis of family-wise error rates in statistical parametric mapping, using random field theory, reported in Eklund et al. (arXiv:1511.01863).
Abstract: This technical report revisits the analysis of family-wise error rates in statistical parametric mapping, using random field theory, reported in Eklund et al. (arXiv:1511.01863). Contrary to the understandable spin that these sorts of analyses attract, a review of their results suggests that they endorse the use of parametric assumptions, and random field theory, in the analysis of functional neuroimaging data. We briefly rehearse the advantages parametric analyses offer over nonparametric alternatives and then unpack the implications of Eklund et al. (arXiv:1511.01863) for parametric procedures.

141 citations


Proceedings Article
01 Sep 2019
TL;DR: A completely general framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data is developed that can be applied to general nonlinear models, general differentiable loss functions, and generic black-box optimization routines.
Abstract: We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to a fully continuous program for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models, including GLMs, additive noise models, and index models as special cases. Unlike existing approaches that require specific modeling choices, loss functions, or algorithms, we present a completely general framework that can be applied to general nonlinear models (e.g., without additive noise), general differentiable loss functions, and generic black-box optimization routines. The code is available at this https URL.

129 citations
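The algebraic characterization the abstract builds on is, as I understand it, the trace-exponential acyclicity function of Zheng et al. (2018): for a weighted adjacency matrix W, h(W) = tr(exp(W ∘ W)) − d is zero exactly when the graph is acyclic. A minimal sketch with toy adjacency matrices:

```python
# Smooth acyclicity measure: h(W) = tr(exp(W * W)) - d, zero iff W is a DAG.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # elementwise square keeps entries >= 0

dag = np.array([[0., 1.], [0., 0.]])   # edge 1 -> 2 only
cyc = np.array([[0., 1.], [1., 0.]])   # edges 1 <-> 2

print(acyclicity(dag))   # ~0.0: acyclic
print(acyclicity(cyc))   # > 0: contains a cycle
```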


Journal ArticleDOI
TL;DR: Both univariate and multivariate nonparametric control charts are reviewed; unlike past reviews, which did not include multivariate charts, this review covers them.
Abstract: Control charts that are based on assumption(s) of a specific form for the underlying process distribution are referred to as parametric control charts. There are many applications where the...

106 citations


Proceedings Article
04 Dec 2019
TL;DR: KIV is proposed, a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs), and it is proved the consistency of KIV under mild assumptions, and conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression.
Abstract: Instrumental variable (IV) regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state-of-the-art alternatives for nonparametric IV regression.

97 citations
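For intuition about the two-stage structure, here is a naive kernelized 2SLS sketch in which kernel ridge regressions replace the linear stages. This "fitted value" shortcut is not the authors' KIV estimator (which regresses on conditional mean embeddings), and the data-generating process and hyperparameters are invented.

```python
# Naive two-stage nonparametric IV: kernel ridge in both stages.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)
n = 2000
z = rng.uniform(-3, 3, (n, 1))               # instrument
u = rng.standard_normal((n, 1))              # unmeasured confounder
x = z + u + 0.3 * rng.standard_normal((n, 1))
y = np.sin(x) + u + 0.3 * rng.standard_normal((n, 1))

stage1 = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5).fit(z, x)
x_hat = stage1.predict(z)                    # instrument-driven variation in x
stage2 = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5).fit(x_hat, y)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.c_[grid, stage2.predict(grid)])     # estimate of the structural f(x)
```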


Journal ArticleDOI
TL;DR: Extensive simulations from a stochastic frontier data-generating process document that the simple two-stage DEA + OLS model significantly outperforms the more complex Simar–Wilson model, with lower mean absolute deviation (MAD) and lower median absolute deviation (MEAD), as well as higher coverage rates when the contextual variables significantly impact productivity.

91 citations


Journal ArticleDOI
TL;DR: This work proposes a test of independence of two multivariate random vectors, given a sample from the underlying population, based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances.
Abstract: We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently developed efficient entropy estimators derived from nearest neighbour distances. The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data.

83 citations
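A sketch of the ingredients: the Kozachenko-Leonenko nearest-neighbour entropy estimator, the decomposition I(X;Y) = H(X) + H(Y) − H(X,Y), and permutation calibration. The choice of k and the data are illustrative, and this simplifies MINT rather than reproducing the paper's exact procedure.

```python
# Nearest-neighbour mutual information estimate with a permutation test.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(X, k=5):
    """Kozachenko-Leonenko entropy estimate from k-th NN distances."""
    n, d = X.shape
    dist, _ = cKDTree(X).query(X, k=k + 1)   # first hit is the point itself
    eps = dist[:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log unit-ball volume
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

def mutual_info(X, Y, k=5):
    return kl_entropy(X, k) + kl_entropy(Y, k) - kl_entropy(np.hstack([X, Y]), k)

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 2))
Y = X[:, :1] + 0.5 * rng.standard_normal((500, 1))   # dependent on X

stat = mutual_info(X, Y)
perms = [mutual_info(X, rng.permutation(Y)) for _ in range(200)]
pval = np.mean([p >= stat for p in perms])           # resampling calibration
print(stat, pval)
```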


Journal ArticleDOI
TL;DR: This article proposes incremental interventions that shift propensity score values rather than setting treatments to fixed values; these interventions avoid positivity assumptions entirely and allow longitudinal effects to be visualized with a single curve instead of lists of coefficients.
Abstract: Most work in causal inference considers deterministic interventions that set each unit’s treatment to some fixed value. However, under positivity violations these interventions can lead to nonidentification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal effects to be visualized with a single curve instead of lists of coefficients. After characterizing incremental inter...
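For reference, the incremental intervention can be written compactly. Assuming the usual notation, with time-t propensity score π_t(x) and a user-chosen increment δ > 0, the shifted treatment probability multiplies the odds of treatment by δ:

```latex
% Sketch of the incremental propensity-score intervention (notation assumed):
\[
  q_t(x;\delta) \;=\; \frac{\delta\,\pi_t(x)}{\delta\,\pi_t(x) + 1 - \pi_t(x)},
  \qquad
  \frac{q_t}{1-q_t} \;=\; \delta\,\frac{\pi_t}{1-\pi_t}.
\]
```

Because q_t stays strictly between 0 and 1 whenever π_t does, no unit is ever forced into a treatment state it could never receive, which is why positivity assumptions are not needed.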

Journal ArticleDOI
TL;DR: This work implements both parametric PrediXcan and nonparametric Bayesian methods in a convenient software tool "TIGAR" (Transcriptome-Integrated Genetic Association Resource), which imputes transcriptomic data and performs subsequent TWASs using individual-level or summary-level GWAS data.
Abstract: The transcriptome-wide association studies (TWASs) that test for association between the study trait and the imputed gene expression levels from cis-acting expression quantitative trait loci (cis-eQTL) genotypes have successfully enhanced the discovery of genetic risk loci for complex traits. By using the gene expression imputation models fitted from reference datasets that have both genetic and transcriptomic data, TWASs facilitate gene-based tests with GWAS data while accounting for the reference transcriptomic data. The existing TWAS tools like PrediXcan and FUSION use parametric imputation models that have limitations for modeling the complex genetic architecture of transcriptomic data. Therefore, to improve on this, we employ a nonparametric Bayesian method that was originally proposed for genetic prediction of complex traits, which assumes a data-driven nonparametric prior for cis-eQTL effect sizes. The nonparametric Bayesian method is flexible and general because it includes both of the parametric imputation models used by PrediXcan and FUSION as special cases. Our simulation studies showed that the nonparametric Bayesian model improved both imputation R2 for transcriptomic data and the TWAS power over PrediXcan when ≥1% cis-SNPs co-regulate gene expression and gene expression heritability ≤0.2. In real applications, the nonparametric Bayesian method fitted transcriptomic imputation models for 57.8% more genes over PrediXcan, thus improving the power of follow-up TWASs. We implement both parametric PrediXcan and nonparametric Bayesian methods in a convenient software tool "TIGAR" (Transcriptome-Integrated Genetic Association Resource), which imputes transcriptomic data and performs subsequent TWASs using individual-level or summary-level GWAS data.

Journal ArticleDOI
TL;DR: This new software provides a user‐friendly interface to estimate stability statistics accurately for plant scientists, agronomists, and breeders who deal with large volumes of quantitative data.
Abstract: Premise of the study: Access to improved crop cultivars is the foundation for successful agriculture. New cultivars must have improved yields that are determined by quantitative and qualitative traits. Genotype-by-environment interactions (GEI) occur for quantitative traits such as reproductive fitness, longevity, height, weight, yield, and disease resistance. The stability of genotypes across a range of environments can be analyzed using GEI analysis. GEI analysis includes univariate and multivariate analyses with both parametric and non-parametric models. Methods and results: The program STABILITYSOFT is online software based on JavaScript and R to calculate several univariate parametric and non-parametric statistics for various crop traits. These statistics include Plaisted and Peterson's mean variance component (θ_i), Plaisted's GE variance component (θ_(i)), Wricke's ecovalence stability index (W_i^2), regression coefficient (b_i), deviation from regression (S_di^2), Shukla's stability variance (σ_i^2), environmental coefficient of variance (CV_i), Nassar and Huhn's statistics (S^(1), S^(2)), Huhn's equations (S^(3) and S^(6)), Thennarasu's non-parametric statistics (NP^(i)), and Kang's rank-sum. These statistics are important in the identification of stable genotypes; hence, this program can compare and select genotypes across multiple-environment trials for a given data set. This program supports both repeated data across environments and matrix data types. The accuracy of the results obtained from this software was tested on several crop plants. Conclusions: This new software provides a user-friendly interface to estimate stability statistics accurately for plant scientists, agronomists, and breeders who deal with large volumes of quantitative data. This software can also show ranking patterns of genotypes and describe associations among different statistics with yield performance through a heat map plot. The software is available at https://mohsenyousefian.com/stabilitysoft/.
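As a worked example of one listed statistic, the sketch below computes Wricke's ecovalence W_i^2, each genotype's share of the genotype-by-environment interaction sum of squares, from a small invented yield matrix (it does not call STABILITYSOFT).

```python
# Wricke's ecovalence: W_i^2 = sum_j (x_ij - xbar_i. - xbar_.j + xbar_..)^2
import numpy as np

# rows = genotypes, columns = environments (invented yields)
yield_ = np.array([[4.0, 5.2, 3.9],
                   [4.5, 4.4, 4.6],
                   [3.8, 5.9, 3.1]])

g_mean = yield_.mean(axis=1, keepdims=True)   # genotype means
e_mean = yield_.mean(axis=0, keepdims=True)   # environment means
grand = yield_.mean()

interaction = yield_ - g_mean - e_mean + grand
W2 = (interaction ** 2).sum(axis=1)           # one value per genotype
print(W2)   # smaller = more stable across environments
```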

Journal ArticleDOI
TL;DR: In this article, the nonparametric least squares estimator (LSE) of a multivariate convex regression function was studied, given as the solution to a quadratic program with O(n^2) linear constraints.
Abstract: We study the nonparametric least squares estimator (LSE) of a multivariate convex regression function. The LSE, given as the solution to a quadratic program with O(n^2) linear constraints (n being t...
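The quadratic program is easy to state directly. A minimal sketch with cvxpy, on an invented design, imposes the convexity constraints θ_j ≥ θ_i + g_i'(x_j − x_i) on fitted values and subgradients:

```python
# Convex regression LSE as a quadratic program (O(n^2) constraints).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 2
X = rng.uniform(-1, 1, (n, d))
y = (X ** 2).sum(axis=1) + 0.1 * rng.standard_normal(n)

theta = cp.Variable(n)        # fitted values at the design points
g = cp.Variable((n, d))       # subgradients at the design points
constraints = [theta[j] >= theta[i] + cp.sum(cp.multiply(g[i], X[j] - X[i]))
               for i in range(n) for j in range(n) if i != j]
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints)
prob.solve()
print(theta.value[:5])   # LSE fitted values at the first five design points
```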

Posted Content
TL;DR: This paper proposes a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation, and proposes (multivariate) rank versions of distance covariance and energy statistic for testing scenarios (i) and (ii) respectively.
Abstract: In this paper, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (i) testing for mutual independence between random vectors, and (ii) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance (Szekely et al., 2007) and energy statistic (Szekely and Rizzo, 2013) for testing scenarios (i) and (ii) respectively. In both these problems, we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are tuning-free, computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein's method for exchangeable pairs, which may be of independent interest.
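Empirically, such measure-transportation ranks reduce to an optimal assignment between the sample and a fixed set of reference points. The sketch below uses a uniform reference set (the paper's grid constructions differ in detail) and scipy's assignment solver:

```python
# Multivariate ranks via optimal assignment to reference points.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(6)
n, d = 200, 2
X = rng.standard_normal((n, d))       # data
U = rng.uniform(0, 1, (n, d))         # reference "rank" points (illustrative)

# cost[i, j] = squared distance between data point i and reference point j
cost = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
row, col = linear_sum_assignment(cost)   # empirical optimal transport coupling
ranks = U[col]                           # multivariate rank of each X_i
print(ranks[:3])
```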

Journal ArticleDOI
TL;DR: The present article discusses parametric and non-parametric methods, their assumptions, and how to select appropriate statistical methods for the analysis and interpretation of biomedical data.
Abstract: In biostatistics, statistical methods are available for the analysis and interpretation of data in each specific situation. To select the appropriate statistical method, one needs to know the assumptions and conditions of the candidate methods, so that the proper method can be selected for data analysis. Two main kinds of statistical methods are used in data analysis: descriptive statistics, which summarize data using indexes such as the mean and median, and inferential statistics, which draw conclusions from data using statistical tests such as Student's t-test. Selection of the appropriate statistical method depends on three things: the aim and objective of the study, the type and distribution of the data used, and the nature of the observations (paired/unpaired). Statistical methods used to compare means are called parametric, while those used to compare quantities other than means (e.g., medians, mean ranks, or proportions) are called nonparametric. In the present article, we discuss parametric and non-parametric methods, their assumptions, and how to select the appropriate statistical method for the analysis and interpretation of biomedical data.

Journal ArticleDOI
TL;DR: In this paper, the authors consider the way metafrontiers and associated measures of efficiency are obtained from nonparametric estimates of underlying group-specific frontiers, and show that the convexification strategy of assuming a convex metaset generally leads to erroneous results.

Journal ArticleDOI
TL;DR: This paper introduces and explores predictive distribution functions that always satisfy a natural property of validity in terms of guaranteed coverage for IID observations, and applies conformal prediction to derive predictive distributions that are valid under a nonparametric assumption.
Abstract: This paper applies conformal prediction to derive predictive distributions that are valid under a nonparametric assumption. Namely, we introduce and explore predictive distribution functions that always satisfy a natural property of validity in terms of guaranteed coverage for IID observations. The focus is on a prediction algorithm that we call the Least Squares Prediction Machine (LSPM). The LSPM generalizes the classical Dempster–Hill predictive distributions to nonparametric regression problems. If the standard parametric assumptions for Least Squares linear regression hold, the LSPM is as efficient as the Dempster–Hill procedure, in a natural sense. And if those parametric assumptions fail, the LSPM is still valid, provided the observations are IID.
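A sketch of the general idea in its simplest split-conformal form, rather than the paper's studentized Least Squares Prediction Machine: calibration residuals define a predictive CDF that is valid for IID data.

```python
# Split-conformal predictive distribution from calibration residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, (400, 1))
y = 1.5 * X[:, 0] + rng.standard_normal(400)

X_tr, y_tr = X[:200], y[:200]      # proper training set
X_cal, y_cal = X[200:], y[200:]    # calibration set

model = LinearRegression().fit(X_tr, y_tr)
resid = np.sort(y_cal - model.predict(X_cal))

def predictive_cdf(x_new, y_grid):
    """Q(y) = fraction of calibration residuals <= y - yhat(x_new)."""
    mu = model.predict(np.atleast_2d(x_new))[0]
    return np.searchsorted(resid, y_grid - mu, side="right") / (len(resid) + 1)

y_grid = np.linspace(-4, 8, 5)
print(predictive_cdf([1.0], y_grid))   # predictive distribution at x = 1
```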

Journal ArticleDOI
TL;DR: In this paper, a class of partially linear functional additive models (PLFAM) is proposed that predicts a scalar response by both parametric effects of a multivariate predictor and nonparametric effects of a multivariate functional predictor.
Abstract: We investigate a class of partially linear functional additive models (PLFAM) that predicts a scalar response by both parametric effects of a multivariate predictor and nonparametric effects of a multivariate functional predictor. We jointly model multiple functional predictors that are cross-correlated using multivariate functional principal component analysis (mFPCA), and model the nonparametric effects of the principal component scores as additive components in the PLFAM. To address the high-dimensional nature of functional data, we let the number of mFPCA components diverge to infinity with the sample size, and adopt the component selection and smoothing operator (COSSO) penalty to select relevant components and regularize the fitting. A fundamental difference between our framework and the existing high-dimensional additive models is that the mFPCA scores are estimated with error, and the magnitude of measurement error increases with the order of mFPCA. We establish the asymptotic convergence ...

Proceedings Article
24 May 2019
TL;DR: The orthogonal random forest (ORF) combines Neyman orthogonality, which reduces sensitivity to estimation error in nuisance parameters, with generalized random forests (Athey et al., 2017).
Abstract: We propose the orthogonal random forest, an algorithm that combines Neyman-orthogonality to reduce sensitivity with respect to estimation error of nuisance parameters with generalized random forests (Athey et al., 2017)—a flexible nonparametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumptions on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local ℓ1-penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete treatments or continuous treatments, and we show that, unlike prior work, our method provably allows one to control for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic and real data.
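The orthogonality ingredient is easiest to see without the forest. The sketch below uses the simplest Neyman-orthogonal moment, cross-fitted residual-on-residual regression (partialling out), with invented data; it illustrates the principle rather than the ORF algorithm.

```python
# Orthogonal (partialling-out) moment: first-order nuisance errors cancel.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(8)
n = 2000
W = rng.standard_normal((n, 10))         # high-dimensional controls
T = W[:, 0] + rng.standard_normal(n)     # treatment
Y = 2.0 * T + W[:, 0] + rng.standard_normal(n)

# cross-fitted nuisance estimates of E[Y|W] and E[T|W]
y_hat = cross_val_predict(GradientBoostingRegressor(), W, Y, cv=5)
t_hat = cross_val_predict(GradientBoostingRegressor(), W, T, cv=5)

# orthogonal moment: regress Y-residuals on T-residuals
y_res, t_res = Y - y_hat, T - t_hat
theta = (t_res @ y_res) / (t_res @ t_res)
print(theta)   # ~2.0, insensitive to small nuisance-estimation errors
```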

Journal ArticleDOI
TL;DR: In this article, a fully nonparametric kernel method is proposed to account for observed covariates in regression discontinuity designs (RDD), which may increase the precision of treatment effect estimation.
Abstract: This article proposes a fully nonparametric kernel method to account for observed covariates in regression discontinuity designs (RDD), which may increase precision of treatment effect estimation. ...

Journal ArticleDOI
TL;DR: In this article, a non-parametric estimation method for a large vector autoregression (VAR) with time-varying parameters is proposed, and the estimators and their asymptotic distributions are available in closed form.
Abstract: In this paper we introduce a non-parametric estimation method for a large Vector Autoregression (VAR) with time-varying parameters. The estimators and their asymptotic distributions are available in closed form. This makes the method computationally efficient and capable of handling information sets as large as those typically handled by factor models and Factor Augmented VARs (FAVAR). When applied to the problem of forecasting key macroeconomic variables, the method outperforms constant parameter benchmarks and large (parametric) Bayesian VARs with time-varying parameters. The tool can also be used for structural analysis. As an example, we study the time-varying effects of oil price innovations on sectoral U.S. industrial output. We find that durable consumer goods and durable materials (which together account for slightly more than one fifth of total industrial output) play a key role in explaining the changing interaction between unexpected oil price increases and U.S. business cycle fluctuations.


Posted Content
TL;DR: In this article, a two-step estimator for nonparametrically estimating heterogeneous average treatment effects that vary with a limited number of discrete and continuous covariates in a selection-on-observables framework is proposed.
Abstract: This paper considers the practically important case of nonparametrically estimating heterogeneous average treatment effects that vary with a limited number of discrete and continuous covariates in a selection-on-observables framework where the number of possible confounders is very large. We propose a two-step estimator for which the first step is estimated by machine learning. We show that this estimator has desirable statistical properties like consistency, asymptotic normality and rate double robustness. In particular, we derive the coupled convergence conditions between the nonparametric and the machine learning steps. We also show that estimating population average treatment effects by averaging the estimated heterogeneous effects is semi-parametrically efficient. The new estimator is applied to an empirical example: the effects of mothers' smoking during pregnancy on the resulting birth weight.
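A sketch of the two-step pattern: machine-learned nuisances feed a doubly robust pseudo-outcome, whose average estimates the population ATE and whose regression on a low-dimensional covariate estimates heterogeneous effects. This is a generic DR-learner-style sketch that omits cross-fitting, not the authors' exact estimator.

```python
# Two-step heterogeneous effects from a doubly robust pseudo-outcome.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(9)
n = 4000
X = rng.standard_normal((n, 20))                  # many potential confounders
d = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment
y = d * (1 + X[:, 1]) + X[:, 0] + rng.standard_normal(n)

# step 1: nuisance functions by machine learning
ps = GradientBoostingClassifier().fit(X, d).predict_proba(X)[:, 1]
m1 = GradientBoostingRegressor().fit(X[d == 1], y[d == 1]).predict(X)
m0 = GradientBoostingRegressor().fit(X[d == 0], y[d == 0]).predict(X)

# doubly robust pseudo-outcome
psi = m1 - m0 + d * (y - m1) / ps - (1 - d) * (y - m0) / (1 - ps)

# step 2: regress psi on the effect modifier for heterogeneous effects
cate = GradientBoostingRegressor().fit(X[:, [1]], psi)
print(psi.mean())   # averaging psi recovers the population ATE
```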

Journal ArticleDOI
TL;DR: In this paper, a nonparametric framework that utilizes similarity information among observations was proposed to detect and estimate change-points in a sequence of multivariate or non-Euclidean observations.
Abstract: We consider the testing and estimation of change-points, locations where the distribution abruptly changes, in a sequence of multivariate or non-Euclidean observations. We study a nonparametric framework that utilizes similarity information among observations, which can be applied to various data types as long as an informative similarity measure on the sample space can be defined. The existing approach along this line has low power and/or biased estimates for change-points under some common scenarios. We address these problems by considering new tests based on similarity information. Simulation studies show that the new approaches exhibit substantial improvements in detecting and estimating change-points. In addition, under some mild conditions, the new test statistics are asymptotically distribution-free under the null hypothesis of no change. Analytic $p$-value approximations to the significance of the new test statistics for the single change-point alternative and changed interval alternative are derived, making the new approaches easy off-the-shelf tools for large datasets. The new approaches are illustrated in an analysis of New York taxi data.

Journal ArticleDOI
TL;DR: In this article, a model-free theory of general types of parametric regression for i.i.d. observations is developed, which replaces the parameters of parameterized models with statistical functionals, to be defined on large nonparametric classes of joint distributions, without assuming a correct model.
Abstract: We develop a model-free theory of general types of parametric regression for i.i.d. observations. The theory replaces the parameters of parametric models with statistical functionals, to be called "regression functionals," defined on large nonparametric classes of joint $x$-$y$ distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary $x$-$y$ distributions, without assuming a linear model (see Part I). More generally, regression functionals can be defined by minimizing objective functions, solving estimating equations, or with ad hoc constructions. In this framework, it is possible to achieve the following: (1) define a notion of "well-specification" for regression functionals that replaces the notion of correct specification of models, (2) propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3) decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order $N^{-1/2}$, (4) exhibit plug-in/sandwich estimators of standard error as limit cases of $x$-$y$ bootstrap estimators, and (5) provide theoretical heuristics to indicate that $x$-$y$ bootstrap standard errors may generally be preferred over sandwich estimators.

Journal ArticleDOI
TL;DR: A nonparametric model (NPMM) is proposed which exploits auxiliary word embeddings to infer the topic number and employs a “spike and slab” function to alleviate the sparsity problem of topic-word distributions in online short text analyses.

Journal ArticleDOI
TL;DR: This research addresses content-based image retrieval (CBIR) by fusing parametric color and shape features with a nonparametric texture feature, yielding a robust and effective algorithm.

Journal ArticleDOI
TL;DR: A non-parametric and a parametric approach are used to forecast hourly air temperature up to 24 h in advance, showing that SARMA has a better performance for the first 6 forecasted hours, after which the Non-Parametric Functional Data Analysis (NPFDA) model provides superior results.
Abstract: Air temperature is a significant meteorological variable that affects social activities and economic sectors. In this paper, a non-parametric and a parametric approach are used to forecast hourly air temperature up to 24 h in advance. The former is a regression model in the Functional Data Analysis framework. The nonlinear regression operator is estimated using a kernel function. The smoothing parameter is obtained by a cross-validation procedure and used for the selection of the optimal number of closest curves. The other method applied is a Seasonal Autoregressive Moving Average (SARMA) model, the order of which is determined by the Bayesian Information Criterion. The obtained forecasts are combined using weights calculated based on the forecast errors. The results show that SARMA has a better performance for the first 6 forecasted hours, after which the Non-Parametric Functional Data Analysis (NPFDA) model provides superior results. Forecast pooling improves the accuracy of the forecasts.
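For the parametric half of the comparison, here is a minimal sketch of a seasonal ARMA model fitted to synthetic hourly temperatures (24-hour seasonality) and producing 1- to 24-hour-ahead forecasts; the model orders are illustrative rather than the BIC-selected ones.

```python
# Seasonal ARMA forecast of hourly temperature, 24 hours ahead.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(10)
hours = np.arange(24 * 30)   # 30 days of synthetic hourly data
temp = 15 + 8 * np.sin(2 * np.pi * hours / 24) + rng.standard_normal(hours.size)

# SARMA(1,1)x(1,1)_24: daily seasonality, no differencing
model = SARIMAX(temp, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24))
fit = model.fit(disp=False)
print(fit.forecast(steps=24))   # hourly forecasts, 1 to 24 hours ahead
```

In the paper's scheme, these forecasts would then be pooled with the nonparametric functional-data forecasts using weights computed from past forecast errors.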

Journal ArticleDOI
27 Jun 2019
TL;DR: This paper develops a nonparametric method for network reconstruction from spatiotemporal data sets using multivariate Hawkes processes; the method yields improved network reconstruction, providing a basis for meaningful subsequent analysis of the reconstructed networks.
Abstract: There is often latent network structure in spatial and temporal data and the tools of network analysis can yield fascinating insights into such data. In this paper, we develop a nonparametric method for network reconstruction from spatiotemporal data sets using multivariate Hawkes processes. In contrast to prior work on network reconstruction with point-process models, which has often focused on exclusively temporal information, our approach uses both temporal and spatial information and does not assume a specific parametric form of network dynamics. This leads to an effective way of recovering an underlying network. We illustrate our approach using both synthetic networks and networks constructed from real-world data sets (a location-based social media network, a narrative of crime events, and violent gang crimes). Our results demonstrate that, in comparison to using only temporal data, our spatiotemporal approach yields improved network reconstruction, providing a basis for meaningful subsequent analysis --- such as community structure and motif analysis --- of the reconstructed networks.