
Showing papers in "Statistics and Computing in 2003"


Journal ArticleDOI
TL;DR: Today, most major statistical programs perform, by default, unbalanced ANOVA based on Type III sums of squares (Yates's weighted squares of means), which is founded on unrealistic models—models with interactions, but without all corresponding main effects.
Abstract: Methods for analyzing unbalanced factorial designs can be traced back to Yates (1934). Today, most major statistical programs perform, by default, unbalanced ANOVA based on Type III sums of squares (Yates's weighted squares of means). As criticized by Nelder and Lane (1995), this analysis is founded on unrealistic models—models with interactions, but without all corresponding main effects. The Type II analysis (Yates's method of fitting constants) is usually not preferred because of the underlying assumption of no interactions. This argument is, however, also founded on unrealistic models. Furthermore, by considering the power of the two methods, it is clear that Type II is preferable.

397 citations
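
The Type II/Type III contrast is easy to reproduce in practice. The sketch below uses statsmodels on a small artificial unbalanced two-way layout; the data, factor names, and effect sizes are invented for illustration and are not taken from the paper.

```python
# Sketch: Type II vs Type III ANOVA tables for an unbalanced two-way design.
# The data set is artificial; the point is only the comparison of the two tables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
# Unbalanced layout: unequal cell counts across the A x B cells.
a = np.repeat(["a1", "a2"], [12, 20])
b = np.concatenate([np.repeat(["b1", "b2"], [4, 8]),
                    np.repeat(["b1", "b2"], [14, 6])])
y = rng.normal(size=a.size) + (a == "a2") * 0.5 + (b == "b2") * 1.0
df = pd.DataFrame({"y": y, "A": a, "B": b})

# Type II: main effects adjusted for each other, ignoring the interaction.
m2 = smf.ols("y ~ C(A) + C(B) + C(A):C(B)", data=df).fit()
print(anova_lm(m2, typ=2))

# Type III: main effects adjusted for the interaction; requires sum-to-zero contrasts.
m3 = smf.ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
print(anova_lm(m3, typ=3))
```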


Journal ArticleDOI
TL;DR: A model based approach to the problem of limiting the disclosure of information gathered on a set of companies or individuals is considered by utilizing the information contained in the sufficient statistics obtained from fitting a model to the public data by conditioning on the survey data.
Abstract: The problem of limiting the disclosure of information gathered on a set of companies or individuals (the “respondents”) is considered, the aim being to provide useful information while preserving confidentiality of sensitive information. The paper proposes a method which explicitly preserves certain information contained in the data. The data are assumed to consist of two sets of information on each “respondent”: public data and specific survey data. It is assumed in this paper that both sets of data are liable to be released for a subset of respondents. However, the public data will be altered in some way to preserve confidentiality whereas the specific survey data is to be disclosed without alteration. The paper proposes a model based approach to this problem by utilizing the information contained in the sufficient statistics obtained from fitting a model to the public data by conditioning on the survey data. Deterministic and stochastic variants of the method are considered.

95 citations


Journal ArticleDOI
TL;DR: This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.
Abstract: The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated with the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set, whose records are to be linked to those in the protected data set; the percentage of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.

80 citations


Journal ArticleDOI
TL;DR: A new theoretical basis for perturbation methods is discussed, which shows that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements.
Abstract: In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.

66 citations
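
As a rough illustration of the idea (not the paper's own implementation), the sketch below fits a joint Gaussian model to one confidential and two non-confidential numeric variables and replaces the confidential values with independent draws from the estimated conditional distribution; the variables, the sample, and the normality assumption are all illustrative.

```python
# Sketch: perturb a confidential variable by drawing from its estimated
# conditional distribution given the non-confidential variables.
# Assumes joint normality purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 500
S = rng.normal(size=(n, 2))                                          # non-confidential variables
x = 2.0 + S @ np.array([1.5, -0.7]) + rng.normal(scale=0.8, size=n)  # confidential variable

data = np.column_stack([x, S])
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# Partition the fitted joint covariance: index 0 = confidential, 1: = non-confidential.
s_xx = cov[0, 0]
s_xs = cov[0, 1:]
s_ss = cov[1:, 1:]

beta = np.linalg.solve(s_ss, s_xs)        # regression coefficients of x on S
cond_var = s_xx - s_xs @ beta             # conditional variance of x given S
cond_mean = mu[0] + (S - mu[1:]) @ beta   # conditional mean for each record

# Independent realizations from the conditional distribution replace x.
x_perturbed = cond_mean + rng.normal(scale=np.sqrt(cond_var), size=n)
```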


Journal ArticleDOI
TL;DR: The same estimation technique can fit models with both additive and multiplicative effects (FANOVA models) to two-way tables, thereby extending the median polish technique.
Abstract: In this paper a robust approach for fitting multiplicative models is presented. Focus is on the factor analysis model, where we will estimate factor loadings and scores by a robust alternating regression algorithm. The approach is highly robust, and also works well when there are more variables than observations. The technique yields a robust biplot, depicting the interaction structure between individuals and variables. This biplot is not predetermined by outliers, which can be retrieved from the residual plot. Also provided is an accompanying robust R2-plot to determine the appropriate number of factors. The approach is illustrated by real and artificial examples and compared with factor analysis based on robust covariance matrix estimators. The same estimation technique can fit models with both additive and multiplicative effects (FANOVA models) to two-way tables, thereby extending the median polish technique.

64 citations
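
A bare-bones version of robust alternating regression for a rank-one factor model can be sketched as follows. It substitutes scikit-learn's Huber regression for whichever robust regression the authors use, works on synthetic data, and is only meant to convey the alternation between loadings and scores, not to reproduce the paper's algorithm.

```python
# Sketch: robust alternating regression for a rank-1 factor model X ~ f l' + E.
# Huber regression stands in for the robust fits; the data are artificial.
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(2)
n, p = 60, 8
f_true = rng.normal(size=n)
l_true = rng.normal(size=p)
X = np.outer(f_true, l_true) + 0.1 * rng.normal(size=(n, p))
X[:3, :3] += 10.0                          # a few gross outliers

f = X[:, 0].copy()                         # crude starting scores
for _ in range(20):
    # Update each loading by robustly regressing a column of X on the scores.
    l = np.array([HuberRegressor(fit_intercept=False)
                  .fit(f.reshape(-1, 1), X[:, j]).coef_[0] for j in range(p)])
    # Update each score by robustly regressing a row of X on the loadings.
    f = np.array([HuberRegressor(fit_intercept=False)
                  .fit(l.reshape(-1, 1), X[i, :]).coef_[0] for i in range(n)])
    f /= np.linalg.norm(f) or 1.0          # fix the scale indeterminacy
```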


Journal ArticleDOI
TL;DR: An optimal algorithm which computes all bivariate depth contours in O(n^2) time and space, using topological sweep of the dual arrangement of lines, is described.
Abstract: The concept of location depth was introduced as a way to extend the univariate notion of ranking to a bivariate configuration of data points. It has been used successfully for robust estimation, hypothesis testing, and graphical display. The depth contours form a collection of nested polygons, and the center of the deepest contour is called the Tukey median. The only available implemented algorithms for the depth contours and the Tukey median are slow, which limits their usefulness. In this paper we describe an optimal algorithm which computes all bivariate depth contours in O(n^2) time and space, using topological sweep of the dual arrangement of lines. Once these contours are known, the location depth of any point can be computed in O(log^2 n) time with no additional preprocessing or in O(log n) time after O(n^2) preprocessing. We provide fast implementations of these algorithms to allow their use in everyday statistical practice.

61 citations
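
For intuition, the bivariate location (Tukey) depth of a single query point can be computed naively by checking only halfplanes whose boundary passes through the query point and one data point. The sketch below does this in O(n^2) per query, assuming the query point is not itself a data point and no two data points are collinear with it; it is unrelated to the paper's much faster topological-sweep implementation.

```python
# Sketch: naive O(n^2) location (Tukey) depth of a query point p in the plane.
import numpy as np

def location_depth(p, X):
    d = np.asarray(X, dtype=float) - np.asarray(p, dtype=float)
    best = len(d)
    for v in d:
        if not np.any(v):
            continue                       # data point coincides with p
        u = np.array([-v[1], v[0]])        # normal of the line through p and this data point
        s = d @ u
        # Strict counts evaluate the halfplane counts just off this boundary direction;
        # for points in general position the minimum over all closed halfplanes
        # containing p is attained at one of these values.
        best = min(best, int(np.sum(s > 0)), int(np.sum(s < 0)))
    return best

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
print(location_depth(X.mean(axis=0), X))   # deep point: depth close to n/2
print(location_depth([5.0, 5.0], X))       # point far outside the cloud: depth 0
```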


Journal ArticleDOI
Jerome P. Reiter
TL;DR: The proposed synthetic diagnostics can reveal model inadequacies without substantial increase in the risk of disclosures, and can be used to develop remote server diagnostics for generalized linear models.
Abstract: To protect public-use microdata, one approach is not to allow users access to the microdata. Instead, users submit analyses to a remote computer that reports back basic output from the fitted model, such as coefficients and standard errors. To be most useful, this remote server also should provide some way for users to check the fit of their models, without disclosing actual data values. This paper discusses regression diagnostics for remote servers. The proposal is to release synthetic diagnostics—i.e. simulated values of residuals and dependent and independent variables–constructed to mimic the relationships among the real-data residuals and independent variables. Using simulations, it is shown that the proposed synthetic diagnostics can reveal model inadequacies without substantial increase in the risk of disclosures. This approach also can be used to develop remote server diagnostics for generalized linear models.

56 citations


Journal ArticleDOI
TL;DR: A generalization of the Shapiro and Botha (1991) approach that allows one to obtain flexible spatio-temporal stationary variogram models and it is shown that if the weighted least squares criterion is chosen, the fitting of such models to pilot estimations of the variogram can be easily carried out by solving a quadratic programming problem.
Abstract: In this paper we propose a generalization of the Shapiro and Botha (1991) approach that allows one to obtain flexible spatio-temporal stationary variogram models. It is shown that if the weighted least squares criterion is chosen, the fitting of such models to pilot estimations of the variogram can be easily carried out by solving a quadratic programming problem. The work also includes an application to real data and a simulation study in order to illustrate the performance of the proposed space-time dependency modeling.

47 citations


Journal ArticleDOI
TL;DR: A combination of gradient function steps and EM steps is proposed to achieve global convergence, leading to the EM algorithm with gradient function update (EMGFU), which retains the number of components at exactly k and typically converges to the global maximum.
Abstract: The paper focuses on some recent developments in nonparametric mixture distributions. It discusses nonparametric maximum likelihood estimation of the mixing distribution and emphasizes gradient-type results, especially in terms of global results and global convergence of algorithms such as the vertex direction or vertex exchange method. However, the NPMLE (or the algorithms constructing it) also provides an estimate of the number of components of the mixing distribution, which might not be desirable for theoretical reasons or might not be allowed by the physical interpretation of the mixture model. When the number of components is fixed in advance, the aforementioned algorithms cannot be used, and globally convergent algorithms do not exist up to now. Instead, the EM algorithm is often used to find maximum likelihood estimates. However, in this case multiple maxima often occur. An example from a meta-analysis of vitamin A and childhood mortality is used to illustrate the considerable inferential importance of identifying the correct global maximum of the likelihood. To improve the behavior of the EM algorithm we suggest a combination of gradient function steps and EM steps to achieve global convergence, leading to the EM algorithm with gradient function update (EMGFU). This algorithm retains the number of components at exactly k and typically converges to the global maximum. The behavior of the algorithm is illustrated with several examples.

45 citations


Journal ArticleDOI
TL;DR: A simple rule is proposed for choosing the number of blocks with the IEM algorithm; for the extreme case of one observation per block, efficient updating formulas are provided which avoid the direct calculation of the inverses and determinants of the component-covariance matrices.
Abstract: The EM algorithm is a popular method for parameter estimation in situations where the data can be viewed as being incomplete. As each E-step visits each data point on a given iteration, the EM algorithm requires considerable computation time in its application to large data sets. Two versions, the incremental EM (IEM) algorithm and a sparse version of the EM algorithm, were proposed recently by Neal R.M. and Hinton G.E. in Jordan M.I. (Ed.), Learning in Graphical Models, Kluwer, Dordrecht, 1998, pp. 355–368 to reduce the computational cost of applying the EM algorithm. With the IEM algorithm, the available n observations are divided into B (B ≤ n) blocks and the E-step is implemented for only a block of observations at a time before the next M-step is performed. With the sparse version of the EM algorithm for the fitting of mixture models, only those posterior probabilities of component membership of the mixture that are above a specified threshold are updated; the remaining component-posterior probabilities are held fixed. In this paper, simulations are performed to assess the relative performances of the IEM algorithm with various numbers of blocks and the standard EM algorithm. In particular, we propose a simple rule for choosing the number of blocks with the IEM algorithm. For the IEM algorithm in the extreme case of one observation per block, we provide efficient updating formulas, which avoid the direct calculation of the inverses and determinants of the component-covariance matrices. Moreover, a sparse version of the IEM algorithm (SPIEM) is formulated by combining the sparse E-step of the EM algorithm and the partial E-step of the IEM algorithm. This SPIEM algorithm can further reduce the computation time of the IEM algorithm.

45 citations
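
The incremental flavour of the E-step is easy to see in code. The sketch below runs IEM for a univariate two-component Gaussian mixture with the observations split into B blocks, caching per-block sufficient statistics so that each visit updates only one block before the M-step. It is a simplified illustration under these assumptions, not the paper's implementation, and it does not use the special one-observation-per-block updating formulas.

```python
# Sketch: incremental EM (IEM) for a univariate 2-component Gaussian mixture.
# Per-block sufficient statistics are cached; each visit refreshes one block, then the M-step runs.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])
n, K, B = x.size, 2, 10
blocks = np.array_split(np.arange(n), B)

pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
stats = np.zeros((B, 3, K))        # per block: sum(resp), sum(resp*x), sum(resp*x^2)
totals = np.zeros((3, K))

def responsibilities(xb):
    w = pi * norm.pdf(xb[:, None], mu, sd)
    return w / w.sum(axis=1, keepdims=True)

for sweep in range(30):
    for b, idx in enumerate(blocks):
        xb = x[idx]
        r = responsibilities(xb)   # partial E-step: this block only
        new = np.stack([r.sum(axis=0),
                        (r * xb[:, None]).sum(axis=0),
                        (r * xb[:, None] ** 2).sum(axis=0)])
        totals += new - stats[b]   # swap the block's old statistics for the new ones
        stats[b] = new
        # M-step from the accumulated totals
        pi = totals[0] / totals[0].sum()
        mu = totals[1] / totals[0]
        sd = np.sqrt(np.maximum(totals[2] / totals[0] - mu ** 2, 1e-8))

print(pi, mu, sd)
```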


Journal ArticleDOI
TL;DR: This paper presents solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table; these include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting.
Abstract: Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
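
For readers unfamiliar with iterative proportional fitting, the short sketch below fits a two-way table to given row and column margins by alternately rescaling rows and columns. The seed table and margins are invented, and the dense numpy representation ignores the sparsity and high-dimensionality issues the paper is concerned with.

```python
# Sketch: iterative proportional fitting (IPF) of a 2-way table to target margins.
# Dense toy example; the paper's contribution concerns sparse, high-dimensional tables.
import numpy as np

seed_table = np.array([[10.0, 5.0, 3.0],
                       [ 4.0, 8.0, 6.0]])
row_targets = np.array([25.0, 11.0])
col_targets = np.array([12.0, 14.0, 10.0])   # must sum to the same total as row_targets

T = seed_table.copy()
for _ in range(100):
    T *= (row_targets / T.sum(axis=1))[:, None]   # rescale rows to match row margins
    T *= col_targets / T.sum(axis=0)              # rescale columns to match column margins

print(T.sum(axis=1), T.sum(axis=0))               # reproduces the target margins
```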

Journal ArticleDOI
TL;DR: It is proposed that the cumulative probability be used instead of the probability density when transforming non-uniform distributions for FAST; this increases the accuracy of the transformation by reducing errors and makes it more convenient to use in practice.
Abstract: The Fourier amplitude sensitivity test (FAST) can be used to calculate the relative variance contribution of model input parameters to the variance of predictions made with functional models. It is widely used in the analyses of complicated process modeling systems. This study provides an improved transformation procedure of the Fourier amplitude sensitivity test (FAST) for non-uniform distributions that can be used to represent the input parameters. Here it is proposed that the cumulative probability be used instead of the probability density when transforming non-uniform distributions for FAST. This improvement increases the accuracy of the transformation by reducing errors and makes the transformation more convenient to use in practice. In an evaluation of the procedure, the improved procedure was demonstrated to have very high accuracy in comparison to the procedure that is currently widely in use.
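
The idea can be illustrated with the standard FAST search curve: the curve is first mapped to a cumulative probability in (0, 1) and then pushed through the inverse CDF of the desired input distribution. The sketch below does this for a normally distributed parameter; the search curve, frequency, and Normal(10, 2) target are illustrative choices, not taken from the paper.

```python
# Sketch: FAST-style sampling of a non-uniform input via the cumulative probability.
# The search curve yields a value in (0, 1); the inverse CDF maps it to the target
# distribution (here an illustrative Normal(10, 2) parameter).
import numpy as np
from scipy.stats import norm

N = 1001                                   # points along the search curve
omega = 11                                 # integer frequency assigned to this parameter
s = np.linspace(-np.pi, np.pi, N)          # search variable

u = 0.5 + np.arcsin(np.sin(omega * s)) / np.pi            # cumulative probability in [0, 1]
u = np.clip(u, 0.5 / N, 1 - 0.5 / N)                      # keep the inverse CDF finite
x = norm.ppf(u, loc=10.0, scale=2.0)                      # transformed input values

print(x.mean(), x.std())   # roughly 10 and 2: the target distribution is reproduced
```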

Journal ArticleDOI
TL;DR: The MLP model is reformulated with the original perceptron in mind so that each node in the "hidden layers" can be considered as a latent Bernoulli random variable, and the likelihood for the reformulated latent variable model is constructed and maximised by standard finite mixture ML methods using an EM algorithm.
Abstract: Multi-layer perceptrons (MLPs), a common type of artificial neural networks (ANNs), are widely used in computer science and engineering for object recognition, discrimination and classification, and have more recently found use in process monitoring and control. "Training" such networks is not a straightforward optimisation problem, and we examine features of these networks which contribute to the optimisation difficulty. Although the original "perceptron", developed in the late 1950s (Rosenblatt 1958, Widrow and Hoff 1960), had a binary output from each "node", this was not compatible with back-propagation and similar training methods for the MLP. Hence the output of each node (and the final network output) was made a differentiable function of the network inputs. We reformulate the MLP model with the original perceptron in mind so that each node in the "hidden layers" can be considered as a latent (that is, unobserved) Bernoulli random variable. This maintains the property of binary output from the nodes, and with an imposed logistic regression of the hidden layer nodes on the inputs, the expected output of our model is identical to the MLP output with a logistic sigmoid activation function (for the case of one hidden layer). We examine the usual MLP objective function—the sum of squares—and show its multi-modal form and the corresponding optimisation difficulty. We also construct the likelihood for the reformulated latent variable model and maximise it by standard finite mixture ML methods using an EM algorithm, which provides stable ML estimates from random starting positions without the need for regularisation or cross-validation. Over-fitting of the number of nodes does not affect this stability. This algorithm is closely related to the EM algorithm of Jordan and Jacobs (1994) for the Mixture of Experts model. We conclude with some general comments on the relation between the MLP and latent variable models.

Journal ArticleDOI
TL;DR: This paper considers a Gaussian model with the mean and the variance modeled flexibly as functions of the independent variables using a Bayesian approach that allows the identification of significant variables in the variance function, as well as averaging over all possible models in both the mean and the variance functions.
Abstract: The article considers a Gaussian model with the mean and the variance modeled flexibly as functions of the independent variables. The estimation is carried out using a Bayesian approach that allows the identification of significant variables in the variance function, as well as averaging over all possible models in both the mean and the variance functions. The computation is carried out by a simulation method that is carefully constructed to ensure that it converges quickly and produces iterates from the posterior distribution that have low correlation. Real and simulated examples demonstrate that the proposed method works well. The method in this paper is important because (a) it produces more realistic prediction intervals than nonparametric regression estimators that assume a constant variance; (b) variable selection identifies the variables in the variance function that are important; (c) variable selection and model averaging produce more efficient prediction intervals than those obtained by regular nonparametric regression.

Journal ArticleDOI
TL;DR: The characteristics, limitations and desired properties of a remote access system are discussed, and the discussion is illustrated by the system used at LIS/LES.
Abstract: Statistical Agencies manage huge amounts of microdata. The main task of these agencies is to provide a variety of users with general information about, for instance, the population and the economy. However, in some cases users request additional, more specific information. Many agencies have therefore set up facilities that enable selected users to obtain tailor-made statistical information. A remote access system is an example of such a facility where users can submit queries for statistical information from their own computer. These queries are handled by the statistical agency and the generated, possibly confidentialised, output is returned to the user. This way the agency still keeps control over its own data while the user does not need to make frequent visits to the agency. For some years, the Luxembourg Income Study (LIS) and Luxembourg Employment Study (LES) have made use of an advanced remote access system. At Statistics Netherlands and other statistical institutes, the need for a similar system has recently been expressed. In this article, we discuss the characteristics, limitations and desired properties of a remote access system. We illustrate the discussion by the system used at LIS/LES.

Journal ArticleDOI
TL;DR: An exact simulation algorithm is provided that produces variables from truncated Gaussian distributions on (\mathbb{R}^+)^p via a perfect sampling scheme based on stochastic ordering and slice sampling, since accept-reject algorithms like those of Geweke and Robert are difficult to extend to higher dimensions.
Abstract: We provide an exact simulation algorithm that produces variables from truncated Gaussian distributions on (\mathbb{R}^+)^p via a perfect sampling scheme, based on stochastic ordering and slice sampling, since accept-reject algorithms like those of Geweke (1991) and Robert (1995) are difficult to extend to higher dimensions.

Journal ArticleDOI
TL;DR: Local dependence is made a more readily interpretable practical tool, by introducing dependence maps, which simplify the estimated local dependence structure between two variables by identifying regions of (significant) positive, (not significant) zero and ( significant) negative local dependence.
Abstract: There is often more structure in the way two random variables are associated than a single scalar dependence measure, such as correlation, can reflect. Local dependence functions such as that of Holland and Wang (1987) are, therefore, useful. However, it can be argued that estimated local dependence functions convey information that is too detailed to be easily interpretable. We seek to remedy this difficulty, and hence make local dependence a more readily interpretable practical tool, by introducing dependence maps. Via local permutation testing, dependence maps simplify the estimated local dependence structure between two variables by identifying regions of (significant) positive, (not significant) zero and (significant) negative local dependence. When viewed in conjunction with an estimate of the joint density, a comprehensive picture of the joint behaviour of the variables is provided. A little theory, many implementational details and several examples are given.

Journal ArticleDOI
TL;DR: This paper proposes a different protection methodology consisting of replacing some table entries by appropriate intervals containing the actual value of the unpublished cells, and calls this methodology Partial Cell Suppression, as opposed to the classical “complete” cell suppression.
Abstract: In this paper we address the problem of protecting confidentiality in statistical tables containing sensitive information that cannot be disseminated. This is an issue of primary importance in practice. Cell Suppression is a widely-used technique for avoiding disclosure of sensitive information, which consists in suppressing all sensitive table entries along with a certain number of other entries, called complementary suppressions. Determining a pattern of complementary suppressions that minimizes the overall loss of information results in a difficult (i.e., \mathcal{NP}-hard) optimization problem known as the Cell Suppression Problem. We propose here a different protection methodology consisting of replacing some table entries by appropriate intervals containing the actual value of the unpublished cells. We call this methodology Partial Cell Suppression, as opposed to the classical “complete” cell suppression. Partial cell suppression has the important advantage of reducing the overall information loss needed to protect the sensitive information. Also, the new method provides automatically auditing ranges for each unpublished cell, thus saving an often time-consuming task to the statistical office while increasing the information explicitly provided with the table. Moreover, we propose an efficient (i.e., polynomial-time) algorithm to find an optimal partial suppression solution. A preliminary computational comparison between partial and complete suppression methodologies is reported, showing the advantages of the new approach. Finally, we address possible extensions leading to a unified complete/partial cell suppression framework.
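
The auditing ranges mentioned above are just the extreme values a suppressed cell can take subject to the published information. As a toy illustration (not the paper's algorithm), the sketch below uses linear programming to bound one suppressed cell of a 2x3 table when the row and column totals and the remaining cells are published; the table values are invented.

```python
# Sketch: feasibility interval (auditing range) for one suppressed cell of a 2x3 table,
# given published row/column totals and the other published cells, via linear programming.
import numpy as np
from scipy.optimize import linprog

# True (unpublished) table, for reference only:
#        c1  c2  c3   row totals
#  r1    20   7   3        30
#  r2     5  10  15        30
# Cells (r1,c1), (r1,c2), (r2,c1), (r2,c2) are suppressed (unknowns x0..x3);
# the c3 column (3 and 15) and all margins are published.
row_tot = np.array([30.0, 30.0])
col_tot = np.array([25.0, 17.0, 18.0])

# Equality constraints from row and column totals (minus the published cells).
A_eq = np.array([[1, 1, 0, 0],    # row 1: x0 + x1 = 30 - 3
                 [0, 0, 1, 1],    # row 2: x2 + x3 = 30 - 15
                 [1, 0, 1, 0],    # col 1: x0 + x2 = 25
                 [0, 1, 0, 1]])   # col 2: x1 + x3 = 17
b_eq = np.array([27.0, 15.0, 25.0, 17.0])
bounds = [(0, None)] * 4          # cell counts are non-negative

c = np.zeros(4); c[0] = 1.0       # objective: the suppressed cell (r1, c1)
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
print(lo, hi)                     # the interval an intruder could infer for cell (r1, c1)
```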

Journal ArticleDOI
TL;DR: A new disclosure limitation procedure based on simulation is proposed to protect actual microdata by drawing artificial units from a probability model, that is estimated from the observed data.
Abstract: The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model, that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to be equal to their corresponding sample averages; the simulated data, then, reproduce on average the sample characteristics. If the set of constraints covers the parameters of interest of a user, information loss is controlled for, while, as the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents with very low probability. Disclosure is mainly discussed from the viewpoint of record re-identification. According to this definition, as the pledge for confidentiality only involves the actual respondents, release of synthetic units should in principle rule out the concern for confidentiality. The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied in more generality, and especially suits quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures uncertainty of a distribution: its maximisation leads to a distribution which is consistent with the given information but is maximally noncommittal with regard to missing information. Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.

Journal ArticleDOI
TL;DR: It is shown that for equal-variance mixture models, direct computation time can be reduced to O(D^k n^k), where relevant continuous parameters are each divided into D regions, and direct inference is now possible on genuine data sets for small k, where the quality of approximation is determined by the level of discretisation.
Abstract: The problem of inference in Bayesian Normal mixture models is known to be difficult. In particular, direct Bayesian inference (via quadrature) suffers from a combinatorial explosion in having to consider every possible partition of n observations into k mixture components, resulting in a computation time which is O(k^n). This paper explores the use of discretised parameters and shows that for equal-variance mixture models, direct computation time can be reduced to O(D^k n^k), where relevant continuous parameters are each divided into D regions. As a consequence, direct inference is now possible on genuine data sets for small k, where the quality of approximation is determined by the level of discretisation. For large problems, where the computational complexity is still too great in O(D^k n^k) time, discretisation can provide a convergence diagnostic for a Markov chain Monte Carlo analysis.

Journal ArticleDOI
TL;DR: Metropolis-Hastings algorithms for exact conditional inference, including goodness-of-fit tests, confidence intervals and residual analysis, for binomial and multinomial logistic regression models are developed.
Abstract: We develop Metropolis-Hastings algorithms for exact conditional inference, including goodness-of-fit tests, confidence intervals and residual analysis, for binomial and multinomial logistic regression models. We present examples where the exact results, obtained by enumeration, are available for comparison. We also present examples where Monte Carlo methods provide the only feasible approach for exact inference.

Journal ArticleDOI
TL;DR: This paper describes how importance sampling can be applied to estimate likelihoods for spatio-temporal stochastic models of epidemics in plant populations, where observations consist of the set of diseased individuals at two or more distinct times.
Abstract: This paper describes how importance sampling can be applied to estimate likelihoods for spatio-temporal stochastic models of epidemics in plant populations, where observations consist of the set of diseased individuals at two or more distinct times. Likelihood computation is problematic because of the inherent lack of independence of the status of individuals in the population whenever disease transmission is distance-dependent. The methods of this paper overcome this by partitioning the population into a number of sectors and then attempting to take account of this dependence within each sector, while neglecting that between sectors. Application to both simulated and real epidemic data sets shows that the techniques perform well in comparison with existing approaches. Moreover, the results confirm the validity of likelihood estimates obtained elsewhere using Markov chain Monte Carlo methods.

Journal ArticleDOI
TL;DR: An alternative based on synthetic (composite) estimation is proposed for the problem of prediction in ordinary regression and its properties are explored by simulations for the simple regression.
Abstract: The weaknesses of established model selection procedures based on hypothesis testing and similar criteria are discussed and an alternative based on synthetic (composite) estimation is proposed. It is developed for the problem of prediction in ordinary regression and its properties are explored by simulations for the simple regression. Extensions to a general setting are described and an example with multiple regression is analysed. Arguments are presented against using a selected model for any inferences.

Journal ArticleDOI
TL;DR: A single-pass, low-storage, sequential method for estimating an arbitrary quantile of an unknown distribution that performs very well when compared to existing methods for estimating the median as well as arbitrary quantiles for a wide range of densities.
Abstract: We present a single-pass, low-storage, sequential method for estimating an arbitrary quantile of an unknown distribution. The proposed method performs very well when compared to existing methods for estimating the median as well as arbitrary quantiles for a wide range of densities. In addition to explaining the method and presenting the results of the simulation study, we discuss intuition behind the method and demonstrate empirically, for certain densities, that the proposed estimator converges to the sample quantile.
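
A generic single-pass quantile estimator of the stochastic-approximation type conveys the flavour of such sequential schemes: the running estimate is nudged up or down by a step that shrinks over time. The sketch below is a Robbins-Monro-style scheme, not the estimator proposed in the paper.

```python
# Sketch: single-pass, low-storage estimation of the p-quantile of a data stream
# by a Robbins-Monro-style stochastic approximation update (a generic scheme,
# not the estimator proposed in the paper).
import numpy as np

def streaming_quantile(stream, p, c=5.0):
    # c is a gain constant; a value of order 1 / f(q_p), the density at the quantile,
    # gives good asymptotic behaviour.
    q = None
    for n, x in enumerate(stream, start=1):
        if q is None:
            q = x                               # initialise with the first observation
            continue
        q += (c / n) * ((x > q) - (1.0 - p))    # move up if x exceeds the estimate, down otherwise
    return q

rng = np.random.default_rng(5)
data = rng.normal(size=200_000)
print(streaming_quantile(data, p=0.9), np.quantile(data, 0.9))   # should be close
```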

Journal ArticleDOI
TL;DR: It is shown that the best linear projection that captures the structure in the data is not necessarily a (linear) principal component, and the ability of certain nonlinear projections to capture data structure is affected by the choice of constraint in the eigendecomposition of a nonlinear transform of the data.
Abstract: Principal Components Analysis (PCA) is traditionally a linear technique for projecting multidimensional data onto lower dimensional subspaces with minimal loss of variance. However, there are several applications where the data lie in a lower dimensional subspace that is not linear; in these cases linear PCA is not the optimal method to recover this subspace and thus account for the largest proportion of variance in the data. Nonlinear PCA addresses the nonlinearity problem by relaxing the linear restrictions on standard PCA. We investigate both linear and nonlinear approaches to PCA both exclusively and in combination. In particular we introduce a combination of projection pursuit and nonlinear regression for nonlinear PCA. We compare the success of PCA techniques in variance recovery by applying linear, nonlinear and hybrid methods to some simulated and real data sets. We show that the best linear projection that captures the structure in the data (in the sense that the original data can be reconstructed from the projection) is not necessarily a (linear) principal component. We also show that the ability of certain nonlinear projections to capture data structure is affected by the choice of constraint in the eigendecomposition of a nonlinear transform of the data. Similar success in recovering data structure was observed for both linear and nonlinear projections.
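
A minimal version of the projection-plus-nonlinear-regression idea can be sketched as follows. It is a toy example on a made-up curved data set, using a plain polynomial regression for the reconstruction rather than the projection pursuit machinery of the paper, so it only conveys why a nonlinear reconstruction from a one-dimensional projection can beat linear PCA.

```python
# Sketch: one-dimensional "nonlinear PCA" by linear projection + polynomial reconstruction,
# compared with ordinary one-component (linear) PCA on data lying near a curve.
import numpy as np

rng = np.random.default_rng(6)
t = rng.uniform(-1, 1, 400)
X = np.column_stack([t, t ** 2]) + 0.05 * rng.normal(size=(400, 2))
Xc = X - X.mean(axis=0)

# Linear PCA: project on the first principal component and reconstruct linearly.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
score = Xc @ Vt[0]
X_lin = np.outer(score, Vt[0])
err_lin = np.mean((Xc - X_lin) ** 2)

# Nonlinear reconstruction: same 1-D projection, but each coordinate is
# reconstructed by a cubic polynomial regression on the projection.
X_nl = np.column_stack([np.polyval(np.polyfit(score, Xc[:, j], deg=3), score)
                        for j in range(2)])
err_nl = np.mean((Xc - X_nl) ** 2)

print(err_lin, err_nl)   # the nonlinear reconstruction captures the curved structure much better
```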

Journal ArticleDOI
TL;DR: It is found that although the spatial method often induces higher inferential errors, it almost always provides more protection and the aggregated areas from the spatial procedure can be somewhat more spatially smooth, and hence possibly more meaningful, than those from the non-spatial approach.
Abstract: In this paper we discuss methodology for the safe release of business microdata. In particular we extend the model-based protection procedure of Franconi and Stander (2002, The Statistician 51: 1–11) by allowing the model to take account of the spatial structure underlying the geographical information in the microdata. We discuss the use of the Gibbs sampler for performing the computations required by this spatial approach. We provide an empirical comparison of these non-spatial and spatial disclosure limitation methods based on the Italian sample from the Community Innovation Survey. We quantify the level of protection achieved for the released microdata and the error induced when various inferences are performed. We find that although the spatial method often induces higher inferential errors, it almost always provides more protection. Moreover the aggregated areas from the spatial procedure can be somewhat more spatially smooth, and hence possibly more meaningful, than those from the non-spatial approach. We discuss possible applications of these model-based protection procedures to more spatially extensive data sets.

Journal ArticleDOI
TL;DR: The one-step-late EM algorithm is adapted to PXEM to establish a fast closed-form algorithm that improves on the one-step-late EM algorithm by ensuring monotone convergence, and is used to fit a probit regression model and a variety of dynamic linear models.
Abstract: The EM algorithm is a popular method for computing maximum likelihood estimates or posterior modes in models that can be formulated in terms of missing data or latent structure. Although easy implementation and stable convergence help to explain the popularity of the algorithm, its convergence is sometimes notoriously slow. In recent years, however, various adaptations have significantly improved the speed of EM while maintaining its stability and simplicity. One especially successful method for maximum likelihood is known as the parameter expanded EM or PXEM algorithm. Unfortunately, PXEM does not generally have a closed-form M-step when computing posterior modes, even when the corresponding EM algorithm is in closed form. In this paper we confront this problem by adapting the one-step-late EM algorithm to PXEM to establish a fast closed form algorithm that improves on the one-step-late EM algorithm by ensuring monotone convergence. We use this algorithm to fit a probit regression model and a variety of dynamic linear models, showing computational savings of as much as 99.9%, with the biggest savings occurring when the EM algorithm is the slowest to converge.

Journal ArticleDOI
TL;DR: It is illustrated that by choosing latent variables appropriately, certain monotonicity properties hold which facilitate the use of a perfect simulation algorithm.
Abstract: The Reed-Frost epidemic model is a simple stochastic process with parameter q that describes the spread of an infectious disease among a closed population. Given data on the final outcome of an epidemic, it is possible to perform Bayesian inference for q using a simple Gibbs sampler algorithm. In this paper it is illustrated that by choosing latent variables appropriately, certain monotonicity properties hold which facilitate the use of a perfect simulation algorithm. The methods are applied to real data.
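
For concreteness, the Reed-Frost chain-binomial dynamics are easy to simulate: each susceptible independently escapes infection from each current infective with probability q, so the number of new infections in a generation is binomial. The sketch below simulates final outcomes for a closed population; it illustrates the model itself, not the Gibbs or perfect-simulation machinery of the paper, and the population size and q are invented.

```python
# Sketch: forward simulation of the Reed-Frost epidemic model.
# q = probability that a given susceptible escapes infection by a given infective
# in one generation; the epidemic ends when no infectives remain.
import numpy as np

def reed_frost_final_size(n_susceptible, n_infective, q, rng):
    s, i = n_susceptible, n_infective
    total_infected = n_infective
    while i > 0:
        escape_prob = q ** i                        # escape all current infectives
        new_infected = rng.binomial(s, 1.0 - escape_prob)
        s -= new_infected
        total_infected += new_infected
        i = new_infected
    return total_infected

rng = np.random.default_rng(7)
sizes = [reed_frost_final_size(99, 1, q=0.97, rng=rng) for _ in range(2000)]
print(np.mean(sizes), np.max(sizes))                # distribution of final epidemic size
```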

Journal ArticleDOI
TL;DR: For smaller samples, this work proposes to use the current posterior as the next prior distribution to make the posterior simulations closer to the maximum likelihood estimate (MLE) and hence improve the likelihood approximation.
Abstract: For models with random effects or missing data, the likelihood function is sometimes intractable analytically but amenable to Monte Carlo approximation. To get a good approximation, the parameter value that drives the simulations should be sufficiently close to the maximum likelihood estimate (MLE) which unfortunately is unknown. Introducing a working prior distribution, we express the likelihood function as a posterior expectation and approximate it using posterior simulations. If the sample size is large, the sample information is likely to outweigh the prior specification and the posterior simulations will be concentrated around the MLE automatically, leading to good approximation of the likelihood near the MLE. For smaller samples, we propose to use the current posterior as the next prior distribution to make the posterior simulations closer to the MLE and hence improve the likelihood approximation. By using the technique of data duplication, we can simulate from the sharpened posterior distribution without actually updating the prior distribution. The suggested method works well in several test cases. A more complex example involving censored spatial data is also discussed.

Journal ArticleDOI
TL;DR: The present work attempts to classify both problems and algorithmic tools in an effort to prescribe suitable techniques in a variety of situations to minimize functions of several parameters where the function need not be computed precisely.
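
The core device described for this paper, fitting a local quadratic response-surface model to noisy function values and reading an approximate minimizer and Hessian off the fitted coefficients, can be sketched as follows. The objective function, sampling design, and dimension are invented for illustration, and this single least-squares surface fit is not the authors' full algorithm.

```python
# Sketch: estimate the minimizer and Hessian of a noisy 2-parameter function
# by least-squares fitting of a quadratic surface to sampled function values.
import numpy as np

rng = np.random.default_rng(8)

def noisy_negloglik(theta):                          # illustrative noisy objective
    a, b = theta
    true_value = 2.0 * (a - 1.0) ** 2 + (b + 0.5) ** 2 + 0.8 * (a - 1.0) * (b + 0.5)
    return true_value + rng.normal(scale=0.05)       # imprecise function evaluations

# Evaluate the function on a cloud of points around a current guess.
center = np.array([0.0, 0.0])
pts = center + rng.normal(scale=0.7, size=(200, 2))
vals = np.array([noisy_negloglik(p) for p in pts])

# Quadratic model: f ~ c0 + g.x + 0.5 x'Hx, fitted by ordinary least squares.
x1, x2 = pts[:, 0], pts[:, 1]
design = np.column_stack([np.ones_like(x1), x1, x2,
                          0.5 * x1 ** 2, 0.5 * x2 ** 2, x1 * x2])
coef, *_ = np.linalg.lstsq(design, vals, rcond=None)
g = coef[1:3]
H = np.array([[coef[3], coef[5]],
              [coef[5], coef[4]]])

theta_hat = -np.linalg.solve(H, g)                   # stationary point of the fitted quadratic
print(theta_hat, H, np.linalg.inv(H))                # estimated minimizer, Hessian, inverse Hessian
```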
Abstract: This paper presents an investigation of a method for minimizing functions of several parameters where the function need not be computed precisely. Motivated by problems requiring the optimization of negative log-likelihoods, we also want to estimate the (inverse) Hessian at the point of minimum. The imprecision of the function values impedes the application of conventional optimization methods, and the goal of Hessian estimation adds a lot to the difficulty of developing an algorithm. The present class of methods is based on statistical approximation of the functional surface by a quadratic model, so is similar in motivation to many conventional techniques. The present work attempts to classify both problems and algorithmic tools in an effort to prescribe suitable techniques in a variety of situations. The codes are available from the authors' web site http://macnash.admin.uottawa.ca/~rsmin/.