Author

Silvia Polettini

Bio: Silvia Polettini is an academic researcher from Sapienza University of Rome. The author has contributed to research in topics: Small area estimation & Parametric statistics. The author has an h-index of 5 and has co-authored 27 publications receiving 128 citations. Previous affiliations of Silvia Polettini include the University of Naples Federico II & the National Institute of Statistics.

Papers
Book ChapterDOI
09 Jun 2004
TL;DR: The paper reviews the main aspects of the individual risk methodology and defines measures of risk for protection of files of independent records as well as hierarchical files.
Abstract: Individual risk estimation was one of the issues that the European Union project CASC targeted. On this subject ISTAT has built on previous work by Benedetti and Franconi (1998) to improve individual risk measures. These permit the identification of unsafe records to be protected by disclosure limitation techniques. The software μ-Argus now contains a routine, implemented by CBS Netherlands in cooperation with ISTAT, for computing the Benedetti-Franconi (individual) risk of disclosure. The paper reviews the main aspects of the individual risk methodology. Such an approach defines measures of risk for the protection of files of independent records as well as hierarchical files. The theory and some practical issues, such as threshold setting, are illustrated for both cases.

34 citations
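
As a rough illustration of the individual risk idea only (not the μ-Argus routine itself), the sketch below scores each record by the reciprocal of the estimated population frequency of its key-variable combination, approximating that frequency by summing design weights within the cell; the Benedetti-Franconi measure instead averages the reciprocal over a posterior for the population frequency. Column names and the 0.05 threshold are hypothetical.

```python
import pandas as pd

def individual_risk(df, keys, weight_col):
    """Naive per-record re-identification risk: 1 / (estimated population
    frequency of the record's key-variable combination).  The population
    frequency F_k is approximated by summing design weights within each
    cell -- a simplification of the Benedetti-Franconi measure, which
    instead averages 1/F_k over a posterior distribution for F_k."""
    cells = df.groupby(keys)[weight_col]
    pop_freq = cells.transform("sum")       # estimated F_k for each record
    sample_freq = cells.transform("size")   # observed f_k for each record
    return pd.DataFrame({"f_k": sample_freq,
                         "F_k_hat": pop_freq,
                         "risk": 1.0 / pop_freq})

# Hypothetical usage: flag records whose estimated risk exceeds a threshold.
# out = individual_risk(df, keys=["sex", "age_class", "region"], weight_col="weight")
# unsafe = out["risk"] > 0.05
```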

Journal ArticleDOI
TL;DR: A new disclosure limitation procedure based on simulation is proposed to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data.
Abstract: The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to be equal to their corresponding sample averages; the simulated data, then, reproduce on average the sample characteristics. If the set of constraints covers the parameters of interest of a user, information loss is controlled for, while, as the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents with very low probability. Disclosure is mainly discussed from the viewpoint of record re-identification. According to this definition, as the pledge for confidentiality only involves the actual respondents, release of synthetic units should in principle rule out the concern for confidentiality. The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied in more generality, and especially suits quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures the uncertainty of a distribution: its maximisation leads to a distribution which is consistent with the given information but is maximally noncommittal with regard to missing information. Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are the selection of characteristics, goodness of fit and the strength of dependence relations.

24 citations
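
The semi-parametric component rests on a standard property of the maximum entropy principle: when the only constraints are on the first two moments, the maximum-entropy density is Gaussian. The toy sketch below simulates synthetic values that reproduce the sample mean and variance on average; it is an assumption-laden simplification, not the CIS model, which combines several moment constraints with a regression component. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_maxent(x, n_synth):
    """Maximum-entropy synthesis under first- and second-moment constraints.
    When the only constraints are E[X] = sample mean and E[X^2] = sample
    second moment, the maximum-entropy density is Gaussian, so drawing from
    a normal with the sample mean and variance reproduces those constraints
    on average while discarding the individual observed values."""
    mu, sigma = x.mean(), x.std(ddof=1)
    return rng.normal(mu, sigma, size=n_synth)

# Hypothetical usage on a single quantitative trait (e.g. log turnover):
# x_synth = synthesize_maxent(np.log(df["turnover"].to_numpy()), n_synth=len(df))
```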

Book ChapterDOI
09 Jun 2004
TL;DR: This paper builds on previous work by [BF98] to define a Bayesian hierarchical model for risk estimation, following a superpopulation approach similar to [BKP90] and [Rin03], and applies it to an artificial sample of the Italian 1991 Census data.
Abstract: When microdata files for research are released, it is possible that external users may attempt to breach confidentiality. For this reason most National Statistical Institutes apply some form of disclosure risk assessment and data protection. Risk assessment first requires a measure of disclosure risk to be defined. In this paper we build on previous work by [BF98] to define a Bayesian hierarchical model for risk estimation. We follow a superpopulation approach similar to [BKP90] and [Rin03]. For each combination of values of the key variables we derive the posterior distribution of the population frequency given the observed sample frequency. Knowledge of this posterior distribution enables us to obtain suitable summaries that can be used to estimate the risk of disclosure. One such summary is the mean of the reciprocal of the population frequency or Benedetti-Franconi risk, but we also investigate others such as the mode. We apply our approach to an artificial sample of the Italian 1991 Census data, drawn by means of a widely used sampling scheme. We report on results of this application and document the computational difficulties that we encountered. The risk estimates that we obtain are sensible, but suggest possible improvements and modifications to our methodology. We discuss these together with potential alternative strategies.

14 citations
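
A minimal sketch of the superpopulation idea, under an assumed Gamma-Poisson prior and Bernoulli sampling rather than the paper's exact hierarchical specification: draw from the posterior of the population frequency F_k given the sample frequency f_k and average the reciprocal to obtain the Benedetti-Franconi summary.

```python
import numpy as np

rng = np.random.default_rng(1)

def risk_posterior_mean(f_k, pi, a=1.0, b=1.0, n_draws=10_000):
    """Monte Carlo estimate of E[1/F_k | f_k] under an assumed toy model:
        lambda_k ~ Gamma(a, rate=b),  F_k | lambda_k ~ Poisson(lambda_k),
        f_k | F_k ~ Binomial(F_k, pi)   (Bernoulli sampling with rate pi).
    By Poisson thinning, lambda_k | f_k ~ Gamma(a + f_k, rate=b + pi) and
    the unseen count F_k - f_k | lambda_k ~ Poisson(lambda_k * (1 - pi))."""
    lam = rng.gamma(shape=a + f_k, scale=1.0 / (b + pi), size=n_draws)
    F = f_k + rng.poisson(lam * (1.0 - pi))
    return np.mean(1.0 / F)

# A sample-unique cell (f_k = 1) under a 2% sampling fraction:
# print(risk_posterior_mean(f_k=1, pi=0.02))
```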

Journal ArticleDOI
TL;DR: In this paper, Dirichlet process random effects are used to reduce the number of fixed effects required to achieve reliable risk estimates; the results show that mixed models with main effects only produce estimates roughly equivalent to those of all-two-way-interaction models and are effective in defusing potential shortcomings of traditional log-linear models.
Abstract: Statistical agencies and other institutions collect data under the promise to protect the confidentiality of respondents. When releasing microdata samples, the risk that records can be identified must be assessed. To this aim, a widely adopted approach is to isolate categorical variables key to the identification and analyze multi-way contingency tables of such variables. Common disclosure risk measures focus on sample unique cells in these tables and adopt parametric log-linear models as the standard statistical tools for the problem. Such models often have to deal with large and extremely sparse tables that pose a number of challenges to risk estimation. This paper proposes to overcome these problems by studying nonparametric alternatives based on Dirichlet process random effects. The main finding is that the inclusion of such random effects allows us to reduce considerably the number of fixed effects required to achieve reliable risk estimates. This is studied on applications to real data, suggesting, in particular, that our mixed models with main effects only produce roughly equivalent estimates compared to the all two-way interactions models, and are effective in defusing potential shortcomings of traditional log-linear models. This paper adopts a fully Bayesian approach that accounts for all sources of uncertainty, including that about the population frequencies, and supplies unconditional (posterior) variances and credible intervals.

13 citations
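
For context, the parametric baseline that the paper extends can be sketched as a main-effects Poisson log-linear model fitted to the key-variable contingency table; the Dirichlet process random effects themselves are not reproduced here, and the variable names and sampling fraction below are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def loglinear_risk_uniques(sample_df, keys, pi):
    """Risk for sample-unique cells from a main-effects Poisson log-linear
    model fitted to the key-variable contingency table.  Assumes
    F_k ~ Poisson(lambda_k) with Bernoulli sampling at rate pi, so that
    f_k ~ Poisson(pi * lambda_k) and, for a sample unique,
    E[1/F_k | f_k = 1] = (1 - exp(-m)) / m with m = (1 - pi) * lambda_k.
    For brevity the model is fitted to observed cells only; a full analysis
    would also include the zero cells of the cross-classification."""
    table = sample_df.groupby(keys).size().rename("f").reset_index()
    formula = "f ~ " + " + ".join(f"C({k})" for k in keys)
    fit = smf.glm(formula, data=table, family=sm.families.Poisson()).fit()
    lam = fit.fittedvalues / pi                # lambda_k = mu_k / pi
    m = (1.0 - pi) * lam
    table["risk"] = (1.0 - np.exp(-m)) / m     # E[1/F_k | f_k = 1]
    return table[table["f"] == 1]

# Hypothetical usage:
# uniques = loglinear_risk_uniques(df, keys=["sex", "age_class", "region"], pi=0.02)
```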

Book ChapterDOI
TL;DR: It is argued that any microdata protection strategy is based on a formal reference model, and the application of a regression-based imputation procedure for business microdata to the Italian sample from the Community Innovation Survey is discussed.
Abstract: We argue that any microdata protection strategy is based on a formal reference model. The extent of model specification yields "parametric", "semiparametric", or "nonparametric" strategies. Following this classification, a parametric probability model, such as a normal regression model, or a multivariate distribution for simulation can be specified. Matrix masking (Cox [2]), covering local suppression, coarsening, microaggregation (Domingo-Ferrer [8]), noise injection and perturbation (e.g. Kim [15]; Fuller [12]), provides examples of the second and third classes of models. Finally, a nonparametric approach, e.g. the use of bootstrap procedures for generating synthetic microdata (e.g. Dandekar et al. [4]), can be adopted. In this paper we discuss the application of a regression-based imputation procedure for business microdata to the Italian sample from the Community Innovation Survey. A set of regressions (Franconi and Stander [11]) is used to generate flexible perturbation, so that the protection varies according to the identifiability of the enterprise; a spatial aggregation strategy is also proposed, based on principal components analysis. The inferential usefulness of the released data and the protection achieved by the strategy are evaluated.

10 citations
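
A generic stand-in for regression-based perturbation (not the exact Franconi-Stander scheme, which also modulates the amount of protection by the identifiability of the enterprise): fit a regression of the sensitive variable on covariates and release fitted values plus resampled residuals. Variable names in the usage comment are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

def regression_perturb(X, y):
    """Regression-based perturbation: release fitted values plus residuals
    resampled with replacement, so the individual observed values of the
    sensitive variable y are not published.  A generic stand-in only; the
    Franconi-Stander regressions additionally vary the perturbation with
    how identifiable the enterprise is."""
    X1 = sm.add_constant(X)
    fit = sm.OLS(y, X1).fit()
    resampled = rng.choice(fit.resid, size=len(y), replace=True)
    return fit.fittedvalues + resampled

# Hypothetical usage: protect turnover given employment and investment.
# y_protected = regression_perturb(df[["employees", "investment"]].to_numpy(),
#                                  df["turnover"].to_numpy())
```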


Cited by
Journal ArticleDOI
TL;DR: Rubin's monograph develops multiple imputation as a general framework for handling nonresponse in surveys.
Abstract: 25. Multiple Imputation for Nonresponse in Surveys. By D. B. Rubin. ISBN 0 471 08705 X. Wiley, Chichester, 1987. 258 pp. £30.25.

3,216 citations

Journal ArticleDOI
TL;DR: A. J. Miller's monograph Subset Selection in Regression (Monographs on Statistics and Applied Probability, no. 40, Chapman and Hall, 1990) treats methods for selecting subsets of predictor variables in regression.
Abstract: 8. Subset Selection in Regression (Monographs on Statistics and Applied Probability, no. 40). By A. J. Miller. ISBN 0 412 35380 6. Chapman and Hall, London, 1990. 240 pp. £25.00.

1,154 citations

Journal Article
TL;DR: The methodology proposed automatically adapts to the local structure when simulating paths across this manifold, providing highly efficient convergence and exploration of the target density, and substantial improvements in the time‐normalized effective sample size are reported when compared with alternative sampling approaches.
Abstract: The paper proposes Metropolis adjusted Langevin and Hamiltonian Monte Carlo sampling methods defined on the Riemann manifold to resolve the shortcomings of existing Monte Carlo algorithms when sampling from target densities that may be high dimensional and exhibit strong correlations. The methods provide fully automated adaptation mechanisms that circumvent the costly pilot runs that are required to tune proposal densities for Metropolis-Hastings or indeed Hamiltonian Monte Carlo and Metropolis adjusted Langevin algorithms. This allows for highly efficient sampling even in very high dimensions where different scalings may be required for the transient and stationary phases of the Markov chain. The methodology proposed exploits the Riemann geometry of the parameter space of statistical models and thus automatically adapts to the local structure when simulating paths across this manifold, providing highly efficient convergence and exploration of the target density. The performance of these Riemann manifold Monte Carlo methods is rigorously assessed by performing inference on logistic regression models, log-Gaussian Cox point processes, stochastic volatility models and Bayesian estimation of dynamic systems described by non-linear differential equations. Substantial improvements in the time-normalized effective sample size are reported when compared with alternative sampling approaches. MATLAB code that is available from http://www.ucl.ac.uk/statistics/research/rmhmc allows replication of all the results reported.

1,031 citations
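
For orientation, the sketch below shows one plain Metropolis-adjusted Langevin step with a fixed step size and an identity metric; the paper's contribution is precisely to replace that identity metric with a position-dependent one derived from the Riemann geometry of the model, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def mala_step(theta, log_post, grad_log_post, eps):
    """One Metropolis-adjusted Langevin step with a fixed step size and an
    identity metric.  The paper's methods replace this identity metric with
    a position-dependent one (e.g. the expected Fisher information), which
    is what lets the sampler adapt to local structure; that extension is
    not shown here."""
    g = grad_log_post(theta)
    prop = theta + 0.5 * eps**2 * g + eps * rng.standard_normal(theta.shape)
    gp = grad_log_post(prop)
    # Gaussian proposal log-densities q(prop | theta) and q(theta | prop).
    log_q_fwd = -np.sum((prop - theta - 0.5 * eps**2 * g) ** 2) / (2.0 * eps**2)
    log_q_bwd = -np.sum((theta - prop - 0.5 * eps**2 * gp) ** 2) / (2.0 * eps**2)
    log_alpha = log_post(prop) - log_post(theta) + log_q_bwd - log_q_fwd
    return prop if np.log(rng.uniform()) < log_alpha else theta
```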

Journal ArticleDOI
TL;DR: It is found that a hypothesis-testing approach provides the best control over re-identification risk and reduces the extent of information loss compared with baseline k-anonymity.

254 citations
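
A minimal sketch of the baseline notion the comparison is made against: the k of k-anonymity is the smallest equivalence-class size over the quasi-identifiers. The hypothesis-testing approach itself is not reproduced, and the column names are hypothetical.

```python
def k_anonymity(df, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifiers: the
    pandas DataFrame df is k-anonymous for this value of k."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical usage:
# k = k_anonymity(df, ["sex", "age_class", "zip3"])
```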

Journal ArticleDOI
Jerome P. Reiter
TL;DR: Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands and to illustrate the specification of synthetic data imputation models.
Abstract: Summary. The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data and to illustrate the specification of synthetic data imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed.

230 citations
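
A sketch of the combining rules typically used for inference from m fully synthetic data sets (in the Raghunathan-Reiter-Rubin style; assumed here, not quoted from the paper): pool the point estimates and combine the between- and within-synthesis variances.

```python
import numpy as np

def fully_synthetic_inference(q, u):
    """Combining rules for m fully synthetic data sets in the
    Raghunathan-Reiter-Rubin style (assumed, not quoted from the paper):
    q holds the point estimates from each synthetic data set and u the
    corresponding estimated variances.  Returns the pooled estimate and its
    variance T = (1 + 1/m) * b - u_bar; in finite samples T can come out
    negative, which calls for an adjustment not shown here."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    b = q.var(ddof=1)                     # between-synthesis variance
    T = (1.0 + 1.0 / m) * b - u.mean()    # u.mean() = within-synthesis variance
    return q.mean(), T

# Hypothetical usage with m = 5 synthetic data sets:
# q_bar, T = fully_synthetic_inference(q=[10.2, 9.8, 10.5, 10.1, 9.9],
#                                      u=[0.30, 0.28, 0.31, 0.29, 0.30])
```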