
Showing papers in "Journal of the American Statistical Association in 2013"


Journal ArticleDOI
TL;DR: A new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods is proposed; it appeals to a new class of Pólya–Gamma distributions, which are constructed in detail.
Abstract: We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Pólya–Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that (1) circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Pólya–Gamma distribution, are implemented in the R package BayesLogit. Supplementary materials for this article are available online.
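
The Pólya–Gamma scheme lends itself to a compact Gibbs sampler for logistic regression. The sketch below is a minimal Python illustration (the paper's own implementation is the R package BayesLogit): it approximates PG(b, c) draws by truncating the sum-of-gammas representation, whereas the paper supplies an exact and more efficient sampler; the N(0, 100 I) prior and the truncation level are illustrative assumptions.

import numpy as np

def rpg_approx(b, c, trunc=200, rng=None):
    """Approximate draw from PG(b, c) via a truncated version of the
    sum-of-gammas representation
        PG(b, c) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1).  (The paper uses an exact sampler instead.)"""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def gibbs_logit(y, X, n_iter=1000, rng=None):
    """Minimal Gibbs sampler for Bayesian logistic regression with a N(0, 100 I)
    prior, using Polya-Gamma data augmentation."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    prior_prec = np.eye(p) / 100.0
    beta = np.zeros(p)
    kappa = y - 0.5                      # kappa_i = y_i - 1/2
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # 1. Draw latent omega_i ~ PG(1, x_i' beta) for each observation.
        omega = np.array([rpg_approx(1.0, xi @ beta, rng=rng) for xi in X])
        # 2. Draw beta from its Gaussian full conditional.
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + prior_prec)
        m = V @ (X.T @ kappa)
        beta = rng.multivariate_normal(m, V)
        draws[it] = beta
    return draws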

805 citations


Journal ArticleDOI
TL;DR: In this article, a family of tensor regression models is proposed that efficiently exploits the structure of tensor covariates, making the ultrahigh dimensionality and complex structure of high-throughput imaging data manageable.
Abstract: Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality as well as complex structure. In this article, we propose a new family of tensor regression models that efficiently exploit the special structure of tensor covariates. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. A fast and highly scalable estimation algorithm is proposed for maximum likelihood estimation and its associated asymptotic properties are studied. Effectiveness of the new methods is demonstrated on both synthetic and real MRI imaging data. Supplementary materials for this article are available online.
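
To make the dimension-reduction idea concrete, here is a minimal Python sketch for the special case of a matrix-valued covariate and a rank-1 coefficient B = b1 b2ᵀ, fit by alternating least squares under a plain linear model; the paper's framework covers general tensors, higher ranks, generalized linear models, and regularization, none of which appear here.

import numpy as np

def rank1_matrix_regression(y, X, n_iter=50):
    """Fit y_i = <X_i, b1 b2^T> + eps by alternating least squares.
    X has shape (n, p1, p2); the rank-1 structure reduces p1*p2 unknowns
    to p1 + p2, which is the dimension reduction behind tensor regression."""
    n, p1, p2 = X.shape
    b1, b2 = np.ones(p1), np.ones(p2)
    for _ in range(n_iter):
        # With b2 fixed, <X_i, b1 b2^T> = (X_i b2)' b1 is linear in b1.
        Z1 = X @ b2                                # shape (n, p1)
        b1, *_ = np.linalg.lstsq(Z1, y, rcond=None)
        # With b1 fixed, it is linear in b2.
        Z2 = np.einsum('ijk,j->ik', X, b1)         # shape (n, p2)
        b2, *_ = np.linalg.lstsq(Z2, y, rcond=None)
    return np.outer(b1, b2)

# Purely synthetic illustration.
rng = np.random.default_rng(0)
B_true = np.outer(rng.normal(size=8), rng.normal(size=8))
X = rng.normal(size=(200, 8, 8))
y = np.einsum('ijk,jk->i', X, B_true) + 0.1 * rng.normal(size=200)
print(np.linalg.norm(rank1_matrix_regression(y, X) - B_true) / np.linalg.norm(B_true))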

425 citations


Journal ArticleDOI
TL;DR: Bayesian model averaging is extended to wind speed, taking account of a skewed distribution and observations that are coarsely discretized, and this method provides calibrated and sharp probabilistic forecasts.
Abstract: The current weather forecasting paradigm is deterministic, based on numerical models. Multiple estimates of the current state of the atmosphere are used to generate an ensemble of deterministic predictions. Ensemble forecasts, while providing information on forecast uncertainty, are often uncalibrated. Bayesian model averaging (BMA) is a statistical ensemble postprocessing method that creates calibrated predictive probability density functions (PDFs). Probabilistic wind forecasting offers two challenges: a skewed distribution, and observations that are coarsely discretized. We extend BMA to wind speed, taking account of these challenges. This method provides calibrated and sharp probabilistic forecasts. Comparisons are made between several formulations.
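
As a rough illustration of the BMA predictive distribution (not the authors' exact specification), the sketch below evaluates a weighted mixture of gamma kernels whose means and spreads are linear in the ensemble member forecasts; in practice the weights and coefficients are estimated from training data, typically by EM, rather than supplied as they are here.

import numpy as np
from scipy import stats

def bma_wind_pdf(y, forecasts, weights, a0, a1, c0, c1):
    """BMA predictive density for wind speed y: a weighted mixture of gamma
    kernels, one per ensemble member k, with an illustrative link
        mean_k = a0 + a1 * f_k,   sd_k = c0 + c1 * f_k."""
    f = np.asarray(forecasts, dtype=float)
    mean, sd = a0 + a1 * f, c0 + c1 * f
    shape = (mean / sd) ** 2          # gamma: mean = shape*scale, var = shape*scale^2
    scale = sd ** 2 / mean
    return float(np.dot(weights, stats.gamma.pdf(y, a=shape, scale=scale)))

# Three-member ensemble, equal weights (all numbers illustrative).
print(bma_wind_pdf(6.0, forecasts=[5.2, 7.1, 6.4],
                   weights=[1/3, 1/3, 1/3], a0=0.5, a1=0.9, c0=0.8, c1=0.1))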

271 citations


Journal ArticleDOI
TL;DR: A new test of the hypothesis H0 of equal covariance matrices is proposed; it is shown to enjoy certain optimality and to be especially powerful against sparse alternatives, and applications to gene selection are discussed.
Abstract: In the high-dimensional setting, this article considers three interrelated problems: (a) testing the equality of two covariance matrices Σ1 and Σ2; (b) recovering the support of Σ1 − Σ2; and (c) testing the equality of Σ1 and Σ2 row by row. We propose a new test for testing the hypothesis H0: Σ1 = Σ2 and investigate its theoretical and numerical properties. The limiting null distribution of the test statistic is derived and the power of the test is studied. The test is shown to enjoy certain optimality and to be especially powerful against sparse alternatives. The simulation results show that the test significantly outperforms the existing methods both in terms of size and power. Analysis of a prostate cancer dataset is carried out to demonstrate the application of the testing procedures. When the null hypothesis of equal covariance matrices is rejected, it is often of significant interest to further investigate how they differ from each other. Motivated by applications in genomics, we also consider recovering the support of Σ1 − Σ2 and ...
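
A max-type statistic of the kind described can be sketched in a few lines of Python: standardize the entrywise squared differences of the two sample covariance matrices by estimated variances, take the maximum, and calibrate against an extreme-value limit. The variance estimator and the normalizing constants below follow my reading of this line of work and should be checked against the paper before use.

import numpy as np

def max_cov_test(X, Y):
    """Max-type test of H0: Sigma1 = Sigma2 for X (n1 x p) and Y (n2 x p).
    Returns the statistic and an approximate p-value from a type-I
    extreme-value calibration (constants as I recall them; verify)."""
    n1, p = X.shape
    n2, _ = Y.shape
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S1, S2 = Xc.T @ Xc / n1, Yc.T @ Yc / n2
    # theta_ij estimates Var((X_i - mu_i)(X_j - mu_j)) entrywise.
    theta1 = np.einsum('ki,kj->ij', Xc**2, Xc**2) / n1 - S1**2
    theta2 = np.einsum('ki,kj->ij', Yc**2, Yc**2) / n2 - S2**2
    M = np.max((S1 - S2) ** 2 / (theta1 / n1 + theta2 / n2))
    t = M - 4 * np.log(p) + np.log(np.log(p))
    pval = 1 - np.exp(-np.exp(-t / 2) / np.sqrt(8 * np.pi))
    return M, pval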

254 citations


Journal ArticleDOI
Matt Taddy1
TL;DR: This article proposes a framework of sentiment-sufficient dimension reduction for text data using multinomial inverse regression and shows that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information.
Abstract: Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. It is also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with very high-dimensional response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for ma...

189 citations


Journal ArticleDOI
TL;DR: This article considers the problem of constructing nonparametric tolerance/prediction sets by starting from the general conformal prediction approach, and uses a kernel density estimator as a measure of agreement between a sample point and the underlying distribution.
Abstract: This article introduces a new approach to prediction by bringing together two different nonparametric ideas: distribution-free inference and nonparametric smoothing. Specifically, we consider the problem of constructing nonparametric tolerance/prediction sets. We start from the general conformal prediction approach, and we use a kernel density estimator as a measure of agreement between a sample point and the underlying distribution. The resulting prediction set is shown to be closely related to plug-in density level sets with carefully chosen cutoff values. Under standard smoothness conditions, we get an asymptotic efficiency result that is near optimal for a wide range of function classes. But the coverage is guaranteed whether or not the smoothness conditions hold and regardless of the sample size. The performance of our method is investigated through simulation studies and illustrated in a real data example.
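
For brevity, the Python sketch below uses the split-conformal variant of the density-based prediction set rather than the full conformal construction studied in the article; the finite-sample coverage guarantee is retained, while the 50/50 split and the default KDE bandwidth are arbitrary choices.

import numpy as np
from scipy.stats import gaussian_kde

def split_conformal_density_set(X, alpha=0.1, rng=None):
    """Return (kde, t) such that {x : kde(x) >= t} contains a new point from
    the same distribution with probability at least 1 - alpha."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    idx = rng.permutation(len(X))
    train, calib = idx[: len(X) // 2], idx[len(X) // 2:]
    kde = gaussian_kde(X[train].T)          # fit the density on one half
    scores = kde(X[calib].T)                # evaluate it on the other half
    k = int(np.floor(alpha * (len(scores) + 1)))
    t = -np.inf if k < 1 else np.sort(scores)[k - 1]
    return kde, t

# Membership of new points in the prediction set (synthetic data).
rng = np.random.default_rng(1)
kde, t = split_conformal_density_set(rng.normal(size=(500, 2)), alpha=0.1, rng=rng)
print(kde(rng.normal(size=(5, 2)).T) >= t)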

150 citations


Journal ArticleDOI
TL;DR: This article shows a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, using the data from a current study involving similar comparator treatments, and proposes the best scoring system among all competing models.
Abstract: When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this article, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, using the data from a current study involving similar comparator treatments. Specifically, using the existing data, we first create a parametric scoring system as a function of multiple baseline covariates to estimate subject-specific treatment differences. Based on this scoring system, we specify a desired level of treatment difference and obtain a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated threshold-specific treatment difference curve across a range of score values is constructed. The subpopulation of patients satisfying any
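
A bare-bones Python version of the scoring idea, with an ordinary linear working model containing treatment-by-covariate interactions standing in for the paper's parametric scoring system; the threshold and the synthetic data are placeholders, and the empirically calibrated treatment difference curve and the inference developed in the article are omitted.

import numpy as np

def fit_score_system(y, trt, Z):
    """Fit a linear working model y ~ Z + trt + trt:Z by least squares and
    return a function mapping covariates z to the estimated treatment
    difference (the 'score')."""
    n, p = Z.shape
    D = np.column_stack([np.ones(n), Z, trt, trt[:, None] * Z])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    gamma0, gamma1 = coef[p + 1], coef[p + 2:]   # main treatment effect, interactions
    return lambda z: gamma0 + np.asarray(z) @ gamma1

# Select the subgroup whose estimated benefit exceeds a working threshold.
rng = np.random.default_rng(2)
n, p = 400, 3
Z = rng.normal(size=(n, p))
trt = rng.integers(0, 2, size=n)
y = Z @ np.array([0.5, -0.2, 0.0]) + trt * (0.3 + 0.8 * Z[:, 0]) + rng.normal(size=n)
score = fit_score_system(y, trt, Z)
subgroup = np.array([score(z) for z in Z]) > 0.5      # desired treatment difference
print(subgroup.mean())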

148 citations


Journal ArticleDOI
TL;DR: A novel class of Bayesian Gaussian copula factor models that decouple the latent factors from the marginal distributions is proposed and new theoretical and empirical justifications for using this likelihood in Bayesian inference are provided.
Abstract: Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables or through generalized latent trait models accommodating measurements in the exponential family. However, when generalizing to non-Gaussian measured variables, the latent variables typically influence both the dependence structure and the form of the marginal distributions, complicating interpretation and introducing artifacts. To address this problem, we propose a novel class of Bayesian Gaussian copula factor models that decouple the latent factors from the marginal distributions. A semiparametric specification for the marginals based on the extended rank likelihood yields straightforward implementation and substantial computational gains. We provide new theoretical and empirical justifications for using this likelihood in Bayesian inference. We propose new default pri...

142 citations


Journal ArticleDOI
TL;DR: This article proposes a class of penalized robust regression estimators based on exponential squared loss that can achieve the highest asymptotic breakdown point of 1/2 and shows that their influence functions are bounded with respect to the outliers in either the response or the covariate domain.
Abstract: Robust variable selection procedures through penalized regression have been gaining increased attention in the literature. They can be used to perform variable selection and are expected to yield robust estimates. However, to the best of our knowledge, the robustness of those penalized regression procedures has not been well characterized. In this article, we propose a class of penalized robust regression estimators based on exponential squared loss. The motivation for this new procedure is that it enables us to characterize its robustness in a way that has not been done for the existing procedures, while its performance is near optimal and superior to some recently developed methods. Specifically, under defined regularity conditions, our estimators are √n-consistent and possess the oracle property. Importantly, we show that our estimators can achieve the highest asymptotic breakdown point of 1/2 and that their influence functions are bounded with respect to the outliers in either the response or the covari...
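
The exponential squared loss phi_gamma(r) = 1 − exp(−r²/γ) downweights large residuals, which is what yields the bounded influence functions and high breakdown point. The Python sketch below minimizes this loss plus a smoothed L1 penalty with a generic quasi-Newton solver; the paper's actual procedure uses a nonconvex penalty, data-driven tuning of γ, and a dedicated algorithm.

import numpy as np
from scipy.optimize import minimize

def exp_squared_loss_fit(X, y, gamma=1.0, lam=0.1):
    """Penalized robust regression with the exponential squared loss
    1 - exp(-r^2 / gamma) and a smoothed L1 penalty (sketch only)."""
    n, p = X.shape

    def objective(beta):
        r = y - X @ beta
        loss = np.sum(1.0 - np.exp(-r**2 / gamma))
        penalty = lam * n * np.sum(np.sqrt(beta**2 + 1e-8))   # smooth stand-in for |beta_j|
        return loss + penalty

    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]              # least-squares warm start
    return minimize(objective, beta0, method='BFGS').x

# Gross outliers in the response barely move the fit (synthetic check).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(size=200)
y[:10] += 30.0
print(np.round(exp_squared_loss_fit(X, y), 2))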

138 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a new framework for probabilistic inference based on inferential models (IMs), which not only provides data-dependent measures of uncertainty about the unknown parameter, but also does so with an automatic long-run frequency-calibration property.
Abstract: Posterior probabilistic statistical inference without priors is an important but so far elusive goal. Fisher’s fiducial inference, Dempster–Shafer theory of belief functions, and Bayesian inference with default priors are attempts to achieve this goal but, to date, none has given a completely satisfactory picture. This article presents a new framework for probabilistic inference, based on inferential models (IMs), which not only provides data-dependent probabilistic measures of uncertainty about the unknown parameter, but also does so with an automatic long-run frequency-calibration property. The key to this new approach is the identification of an unobservable auxiliary variable associated with observable data and unknown parameter, and the prediction of this auxiliary variable with a random set before conditioning on data. Here we present a three-step IM construction, and prove a frequency-calibration property of the IM’s belief function under mild conditions. A corresponding optimality theory is develo...

131 citations


Journal ArticleDOI
TL;DR: Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy.
Abstract: Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from the measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function ...

Journal ArticleDOI
Drew A. Linzer1
TL;DR: A dynamic Bayesian forecasting model is presented that enables early and accurate prediction of U.S. presidential election outcomes at the state level and it is demonstrated that the victory of Barack Obama was never realistically in doubt.
Abstract: I present a dynamic Bayesian forecasting model that enables early and accurate prediction of U.S. presidential election outcomes at the state level. The method systematically combines information from historical forecasting models in real time with results from the large number of state-level opinion surveys that are released publicly during the campaign. The result is a set of forecasts that are initially as good as the historical model, and then gradually increase in accuracy as Election Day nears. I employ a hierarchical specification to overcome the limitation that not every state is polled on every day, allowing the model to borrow strength both across states and, through the use of random-walk priors, across time. The model also filters away day-to-day variation in the polls due to sampling error and national campaign effects, which enables daily tracking of voter preferences toward the presidential candidates at the state and national levels. Simulation techniques are used to estimate the candidate...

Journal ArticleDOI
TL;DR: A Bayesian information criterion based on marginal modeling is proposed that can consistently select the number of principal components for both sparse and dense functional data, and it performs well for sparse functional data.
Abstract: Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components in such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion based on the expected Kullback–Leibler information under a Gaussian assumption. In connecting with the time series literature, we also consider a class of information criteria proposed for factor analysis of multivariate time series and show that they are still consistent for dense functional data, if a prescribed undersmoothing scheme is undertaken in the FPCA algor...

Journal ArticleDOI
TL;DR: An L1-penalized likelihood approach to estimate the structure of causal Gaussian networks is developed, and it is established that model selection consistency for causal Gaussian networks can be achieved with the adaptive lasso penalty and sufficient experimental interventions.
Abstract: Causal networks are graphically represented by directed acyclic graphs (DAGs). Learning causal networks from data is a challenging problem due to the size of the space of DAGs, the acyclicity constraint placed on the graphical structures, and the presence of equivalence classes. In this article, we develop an L1-penalized likelihood approach to estimate the structure of causal Gaussian networks. A blockwise coordinate descent algorithm, which takes advantage of the acyclicity constraint, is proposed for seeking a local maximizer of the penalized likelihood. We establish that model selection consistency for causal Gaussian networks can be achieved with the adaptive lasso penalty and sufficient experimental interventions. Simulation and real data examples are used to demonstrate the effectiveness of our method. In particular, our method shows satisfactory performance for DAGs with 200 nodes, which have about 20,000 free parameters. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: A Bayesian model is proposed to overcome the limitations of the various data sources and produces a synthetic database with measures of uncertainty for international migration flows and other model parameters from 2002 to 2008.
Abstract: International migration data in Europe are collected by individual countries with separate collection systems and designs. As a result, reported data are inconsistent in availability, definition, and quality. In this article, we propose a Bayesian model to overcome the limitations of the various data sources. The focus is on estimating recent international migration flows among 31 countries in the European Union and European Free Trade Association from 2002 to 2008, using data collated by Eurostat. We also incorporate covariate information and information provided by experts on the effects of undercount, measurement, and accuracy of data collection systems. The methodology is integrated and produces a synthetic database with measures of uncertainty for international migration flows and other model parameters. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: This work presents a new algorithm that can be used to find optimal designs with respect to a broad class of optimality criteria, when the model parameters or functions thereof are of interest, and for both locally optimal and multistage design strategies.
Abstract: Finding optimal designs for nonlinear models is challenging in general. Although some recent results allow us to focus on a simple subclass of designs for most problems, deriving a specific optimal design still mainly depends on numerical approaches. There is a need for a general and efficient algorithm that is more broadly applicable than the current state-of-the-art methods. We present a new algorithm that can be used to find optimal designs with respect to a broad class of optimality criteria, when the model parameters or functions thereof are of interest, and for both locally optimal and multistage design strategies. We prove convergence to the optimal design, and show in various examples that the new algorithm outperforms the current state-of-the-art algorithms.

Journal ArticleDOI
TL;DR: In this paper, the authors developed new methods for analyzing randomized experiments with noncompliance and, by extension, instrumental variable settings, when the often controversial, but key, exclusion restriction assumption is violated.
Abstract: We develop new methods for analyzing randomized experiments with noncompliance and, by extension, instrumental variable settings, when the often controversial, but key, exclusion restriction assumption is violated. We show how existing large-sample bounds on intention-to-treat effects for the subpopulations of compliers, never-takers, and always-takers can be tightened by exploiting the joint distribution of the outcome of interest and a secondary outcome, for which the exclusion restriction is satisfied. The derived bounds can be used to detect violations of the exclusion restriction and the magnitude of these violations in instrumental variables settings. It is shown that the reduced width of the bounds depends on the strength of the association of the auxiliary variable with the primary outcome and the compliance status. We also show how the setup we consider offers new identifying assumptions of intention-to-treat effects. The role of the auxiliary information is shown in two examples of a real social...

Journal ArticleDOI
TL;DR: This work proposes a nonparametric finite mixture of regression models, and develops an estimation procedure by employing kernel regression, which preserves the ascent property of the EM algorithm in an asymptotic sense.
Abstract: Motivated by an analysis of U.S. house price index (HPI) data, we propose nonparametric finite mixture of regression models. We study the identifiability issue of the proposed models, and develop an estimation procedure by employing kernel regression. We further systematically study the sampling properties of the proposed estimators, and establish their asymptotic normality. A modified EM algorithm is proposed to carry out the estimation procedure. We show that our algorithm preserves the ascent property of the EM algorithm in an asymptotic sense. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed estimation procedure. An empirical analysis of the U.S. HPI data is illustrated for the proposed methodology.

Journal ArticleDOI
TL;DR: In this article, a hybrid approach for the modeling and short-term forecasting of electricity loads is proposed, which combines linear regression with curve response and curve regressors, and dimension reduction based on a singular value decomposition in a Hilbert space.
Abstract: We propose a hybrid approach for the modeling and the short-term forecasting of electricity loads. Two building blocks of our approach are (1) modeling the overall trend and seasonality by fitting a generalized additive model to the weekly averages of the load and (2) modeling the dependence structure across consecutive daily loads via curve linear regression. For the latter, a new methodology is proposed for linear regression with both curve response and curve regressors. The key idea behind the proposed methodology is dimension reduction based on a singular value decomposition in a Hilbert space, which reduces the curve regression problem to several ordinary (i.e., scalar) linear regression problems. We illustrate the hybrid method using French electricity loads between 1996 and 2009, on which we also compare our method with other available models including the Electricite de France operational model. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: In this article, estimation of the parameters of a copula via a simulated method of moments (MM) type approach is considered; the approach is attractive when the likelihood of the copula model is not known in closed form, or when the researcher has a set of dependence measures or other functionals of the copula that are of particular interest.
Abstract: This article considers the estimation of the parameters of a copula via a simulated method of moments (MM) type approach. This approach is attractive when the likelihood of the copula model is not known in closed form, or when the researcher has a set of dependence measures or other functionals of the copula that are of particular interest. The proposed approach naturally also nests MM and generalized method of moments estimators. Drawing on results for simulation-based estimation and on recent work in empirical copula process theory, we show the consistency and asymptotic normality of the proposed estimator, and obtain a simple test of overidentifying restrictions as a specification test. The results apply to both iid and time series data. We analyze the finite-sample behavior of these estimators in an extensive simulation study. We apply the model to a group of seven financial stock returns and find evidence of statistically significant tail dependence, and mild evidence that the dependence between thes...
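
To make the simulated method-of-moments idea concrete, the Python sketch below calibrates the parameter of a Clayton copula by matching the sample Kendall's tau of the data to that of simulated copula draws, using common random numbers across evaluations; the choice of copula, the single moment, and the simulation size are illustrative, and the article treats general moment vectors, overidentification tests, and time series data.

import numpy as np
from scipy.stats import kendalltau
from scipy.optimize import minimize_scalar

def simulate_clayton(theta, n, rng):
    """Draw n pairs from a Clayton copula via the Marshall-Olkin (frailty) method."""
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)
    e = rng.exponential(size=(n, 2))
    return (1.0 + e / v[:, None]) ** (-1.0 / theta)

def smm_clayton(u, n_sim=20000, seed=0):
    """Choose theta so the Kendall's tau of simulated copula data matches the
    sample tau of the (pseudo-)observations u."""
    tau_hat = kendalltau(u[:, 0], u[:, 1])[0]

    def distance(theta):
        sim = simulate_clayton(theta, n_sim, np.random.default_rng(seed))  # common random numbers
        return (tau_hat - kendalltau(sim[:, 0], sim[:, 1])[0]) ** 2

    return minimize_scalar(distance, bounds=(0.05, 20.0), method='bounded').x

# Check on simulated data with theta = 2 (Kendall's tau = theta / (theta + 2) = 0.5).
u = simulate_clayton(2.0, 2000, np.random.default_rng(4))
print(smm_clayton(u))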

Journal ArticleDOI
TL;DR: A spline-backfitted kernel (SBK) estimator for the component functions and the constant is proposed, which are oracally efficient under weak dependence and usable for analyzing high-dimensional time series.
Abstract: The generalized additive model (GAM) is a multivariate nonparametric regression tool for non-Gaussian responses including binary and count data. We propose a spline-backfitted kernel (SBK) estimator for the component functions and the constant, which are oracally efficient under weak dependence. The SBK technique is both computationally expedient and theoretically reliable, thus usable for analyzing high-dimensional time series. Inference can be made on component functions based on asymptotic normality. Simulation evidence strongly corroborates the asymptotic theory. The method is applied to estimate insolvent probability and to obtain higher accuracy ratio than a previous study. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: The new estimator is used to estimate the parameters of a stochastic error model given by the sum of three first-order Gauss–Markov processes, using a sample of over 800,000 measurements obtained from gyroscopes that compose inertial navigation systems.
Abstract: This article presents a new estimation method for the parameters of a time series model. We consider here composite Gaussian processes that are the sum of independent Gaussian processes, each of which explains an important aspect of the time series, as is the case in engineering and natural sciences. The proposed estimation method offers an alternative to classical likelihood-based estimation that is straightforward to implement and often the only feasible estimation method for complex models. The estimator is obtained by optimizing a criterion based on a standardized distance between the sample wavelet variance (WV) estimates and the model-based WV. Indeed, the WV provides a decomposition of the process variance across different scales, so that it contains information about different features of the stochastic model. We derive the asymptotic properties of the proposed estimator for inference and perform a simulation study to compare our estimator to the MLE and the LSE with...
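
A stripped-down, fully simulation-based Python version of the idea: compute Haar wavelet variances of the data at dyadic scales, compute the same quantity on a long simulation from the candidate model, and minimize a distance between the two. The Haar convention used below (half the variance of differences of adjacent block means) and the toy model (white noise plus random walk, fit over a crude grid) are illustrative assumptions; the paper derives the model-based WV analytically and uses a standardized distance with supporting asymptotic theory.

import numpy as np

def haar_wavelet_variance(x, max_scale=None):
    """Haar-type wavelet variance at dyadic scales tau = 2^j, computed (up to a
    convention-dependent constant) as half the variance of differences of
    adjacent non-overlapping block means; the constant cancels in the fit
    because the same function is applied to data and simulations."""
    n = len(x)
    max_scale = max_scale or int(np.log2(n)) - 3
    wv = []
    for j in range(1, max_scale + 1):
        tau = 2 ** j
        m = n // tau
        means = x[: m * tau].reshape(m, tau).mean(axis=1)
        wv.append(0.5 * np.mean(np.diff(means) ** 2))
    return np.array(wv)

def fit_wn_plus_rw(x, n_sim=100000, seed=0):
    """Match the empirical WV of x to the WV of simulated white noise plus
    random walk, over a small grid of (sigma_wn, sigma_rw) values."""
    rng = np.random.default_rng(seed)
    target = haar_wavelet_variance(x)
    z_wn = rng.standard_normal(n_sim)                 # common random numbers
    z_rw = np.cumsum(rng.standard_normal(n_sim))
    best, best_val = None, np.inf
    for s_wn in np.linspace(0.1, 2.0, 20):
        for s_rw in np.linspace(0.001, 0.1, 20):
            wv = haar_wavelet_variance(s_wn * z_wn + s_rw * z_rw, max_scale=len(target))
            val = np.sum((np.log(wv) - np.log(target)) ** 2)
            if val < best_val:
                best, best_val = (s_wn, s_rw), val
    return best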

Journal ArticleDOI
TL;DR: In this paper, the authors propose to directly reduce the dimensionality to the intrinsic dimension d of the manifold, and perform the popular local linear regression (LLR) on a tangent plane estimate.
Abstract: High-dimensional data analysis has been an active area, and the main focus areas have been variable selection and dimension reduction. In practice, it occurs often that the variables are located on an unknown, lower-dimensional nonlinear manifold. Under this manifold assumption, one purpose of this article is regression and gradient estimation on the manifold, and another is developing a new tool for manifold learning. As regards the first aim, we suggest directly reducing the dimensionality to the intrinsic dimension d of the manifold, and performing the popular local linear regression (LLR) on a tangent plane estimate. An immediate consequence is a dramatic reduction in the computational time when the ambient space dimension p ≫ d. We provide rigorous theoretical justification of the convergence of the proposed regression and gradient estimators by carefully analyzing the curvature, boundary, and nonuniform sampling effects. We propose a bandwidth selector that can handle heteroscedastic errors. With re...

Journal ArticleDOI
Zhou Zhou1
TL;DR: In this article, a simple and unified bootstrap testing procedure that provides consistent testing results under general forms of smooth and abrupt changes in the temporal dynamics of the time series is proposed.
Abstract: The assumption of (weak) stationarity is crucial for the validity of most of the conventional tests of structure change in time series. Under complicated nonstationary temporal dynamics, we argue that traditional testing procedures result in mixed structural change signals of the first and second order and hence could lead to biased testing results. The article proposes a simple and unified bootstrap testing procedure that provides consistent testing results under general forms of smooth and abrupt changes in the temporal dynamics of the time series. Monte Carlo experiments are performed to compare our testing procedure with various traditional tests. Our robust bootstrap test is applied to testing changes in an environmental and a financial time series and our procedure is shown to provide more reliable results than the conventional tests.

Journal ArticleDOI
TL;DR: The procedure generates m datasets in which the matches between the two files are imputed; results can be combined using Rubin's multiple imputation rules, and the approach can be applied in other file-linking applications.
Abstract: End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled ef...

Journal ArticleDOI
TL;DR: A local extension of depth, aimed at analyzing multimodal or nonconvexly supported distributions through data depth, is introduced; it has the advantages of maintaining affine invariance and applying to all depths in a generic way.
Abstract: Aiming at analyzing multimodal or nonconvexly supported distributions through data depth, we introduce a local extension of depth. Our construction is obtained by conditioning the distribution to appropriate depth-based neighborhoods and has the advantages, among others, of maintaining affine-invariance and applying to all depths in a generic way. Most importantly, unlike their competitors, which (for extreme localization) rather measure probability mass, the resulting local depths focus on centrality and remain of a genuine depth nature at any locality level. We derive their main properties, establish consistency of their sample versions, and study their behavior under extreme localization. We present two applications of the proposed local depth (for classification and for symmetry testing), and we extend our construction to the regression depth context. Throughout, we illustrate the results on several datasets, both artificial and real, univariate and multivariate. Supplementary materials for this artic...

Journal ArticleDOI
TL;DR: This article proposes a class of regularization methods for simultaneous variable selection and estimation in the additive hazards model, by combining the nonconcave penalized likelihood approach and the pseudoscore method, and establishes the weak oracle property and oracle property under mild, interpretable conditions.
Abstract: High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by modern applications in high-throughput genomic data analysis and credit risk analysis. In this article, we propose a class of regularization methods for simultaneous variable selection and estimation in the additive hazards model, by combining the nonconcave penalized likelihood approach and the pseudoscore method. In a high-dimensional setting where the dimensionality can grow fast, polynomially or nonpolynomially, with the sample size, we establish the weak oracle property and oracle property under mild, interpretable conditions, thus providing strong performance guarantees for the proposed methodology. Moreover, we show that the regularity conditions required by the L 1 method are substantially relaxed by a certain class of sparsity-inducing concave penalties. As a result, concave penalties such as the smoothly clipped absolute deviation, minimax concave penalty, and smooth integration of co...
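
For reference, the smoothly clipped absolute deviation (SCAD) penalty mentioned above has a standard closed form (linear near zero, a quadratic taper, then constant); a small Python sketch with the conventional a = 3.7:

import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li evaluated at |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    p1 = lam * t                                            # |t| <= lam
    p2 = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    p3 = lam**2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))

print(scad_penalty(np.array([0.5, 2.0, 10.0]), lam=1.0))    # -> [0.5, ~1.81, 2.35]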

Journal ArticleDOI
TL;DR: The proposed probabilistic method for linking multiple datafiles works well, opens new directions for future research, and uses a mixture model to fit matching probabilities via maximum likelihood using the Expectation–Maximization algorithm.
Abstract: We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record systems need to be integrated for posterior analysis. Our method generalizes the Fellegi–Sunter theory for linking records from two datafiles and its modern implementations. The goal of multiple record linkage is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the Expectation–Maximization algorithm. We present a method to decide the record K-tuples membership to the subsets of matching patterns and we prove its optimality. We apply our method to th
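
The two-file building block that the article generalizes can be sketched compactly in Python: binary field-agreement patterns are modeled as a two-component mixture (matches versus nonmatches) with conditionally independent fields, and the m-probabilities, u-probabilities, and mixing weight are fit by EM. The K-file generalization, the transitivity handling, and the optimal assignment rule of the paper go well beyond this sketch.

import numpy as np

def fellegi_sunter_em(G, n_iter=200):
    """EM for a two-class mixture over binary agreement vectors G (pairs x fields).
    Returns match prevalence pi, P(agree | match) m, P(agree | nonmatch) u, and
    posterior match probabilities w for each record pair."""
    n, f = G.shape
    pi, m, u = 0.1, np.full(f, 0.8), np.full(f, 0.2)     # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        lm = np.prod(m**G * (1 - m)**(1 - G), axis=1)
        lu = np.prod(u**G * (1 - u)**(1 - G), axis=1)
        w = pi * lm / (pi * lm + (1 - pi) * lu)
        # M-step: weighted agreement frequencies.
        pi = w.mean()
        m = (w[:, None] * G).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * G).sum(axis=0) / (1 - w).sum()
    return pi, m, u, w

# Synthetic comparison data: 3 fields, about 5% true matches.
rng = np.random.default_rng(5)
is_match = rng.random(5000) < 0.05
G = np.where(is_match[:, None], rng.random((5000, 3)) < 0.9,
             rng.random((5000, 3)) < 0.15).astype(int)
pi, m, u, w = fellegi_sunter_em(G)
print(round(pi, 3), np.round(m, 2), np.round(u, 2))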

Journal ArticleDOI
TL;DR: In this paper, the authors review some of the strategies proposed in the literature, from a theoretical point of view using arguments of sampling theory and in practical terms using several examples with a known answer, showing that sampling methods with frequency-based estimators outperform searching methods with renormalized estimators.
Abstract: One important aspect of Bayesian model selection is how to deal with huge model spaces, since the exhaustive enumeration of all the models entertained is not feasible and inferences have to be based on the very small proportion of models visited. This is the case for the variable selection problem with a moderately large number of possible explanatory variables considered in this article. We review some of the strategies proposed in the literature, from a theoretical point of view using arguments of sampling theory and in practical terms using several examples with a known answer. All our results seem to indicate that sampling methods with frequency-based estimators outperform searching methods with renormalized estimators. Supplementary materials for this article are available online.
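
The distinction at issue can be shown in a few lines of Python: run a Metropolis-type sampler over the space of variable-inclusion vectors and estimate posterior model probabilities either from visit frequencies or by renormalizing the unnormalized posterior mass over the models visited. The log_post argument is a hypothetical placeholder for any routine returning the log marginal likelihood plus log prior of a model, up to a constant.

import numpy as np

def mc3(log_post, p, n_iter=20000, seed=0):
    """Metropolis search over inclusion vectors gamma in {0,1}^p, flipping one
    coordinate at a time; returns visit counts and log-posterior values of the
    models visited."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=int)
    lp = log_post(gamma)
    visits, lps = {}, {}
    for _ in range(n_iter):
        prop = gamma.copy()
        prop[rng.integers(p)] ^= 1
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            gamma, lp = prop, lp_prop
        key = tuple(gamma)
        visits[key] = visits.get(key, 0) + 1
        lps[key] = lp
    return visits, lps

def freq_and_renorm(visits, lps):
    """Frequency-based versus renormalized estimates of model probabilities."""
    total = sum(visits.values())
    freq = {k: v / total for k, v in visits.items()}
    mx = max(lps.values())
    mass = {k: np.exp(v - mx) for k, v in lps.items()}
    z = sum(mass.values())
    return freq, {k: s / z for k, s in mass.items()}

# Usage (log_posterior is user-supplied): visits, lps = mc3(log_posterior, p=10)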

Journal ArticleDOI
TL;DR: Asymptotic properties of the classifiers are studied, and it is shown that, in a variety of settings, they can even produce asymptotically perfect classification.
Abstract: We consider classification of functional data when the training curves are not observed on the same interval. Different types of classifier are suggested, one of which involves a new curve extension procedure. Our approach enables us to exploit the information contained in the endpoints of these intervals by incorporating it in an explicit but flexible way. We study asymptotic properties of our classifiers, and show that, in a variety of settings, they can even produce asymptotically perfect classification. The performance of our techniques is illustrated in applications to real and simulated data. Supplementary materials for this article are available online.