
Showing papers in "Transactions on Data Privacy in 2008"


Journal Article
TL;DR: The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets and performs automated recalculation of frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step.
Abstract: The demand for high-quality microdata for analytical purposes has grown rapidly among researchers and the public over the last few years. In order to respect existing laws on data privacy and to be able to provide microdata to researchers and the public, statistical institutes, agencies and other institutions may provide masked data. Using our flexible software tools, with which protection methods can be applied in an exploratory manner, it is possible to generate high-quality confidential (micro-)data. In this paper we present highly flexible and easy-to-use software for the generation of anonymized microdata and give insights into the implementation and design of the R package sdcMicro. R is a highly extensible system for statistical computing and graphics, distributed over the net. sdcMicro contains almost all popular methods for the anonymization of both categorical and continuous variables. Furthermore, several new methods have been implemented. The package can also be used for the comparison of methods and for measuring the information loss and disclosure risk of the masked data.
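For illustration, here is a minimal sketch of the kind of frequency-count bookkeeping the package recomputes after each anonymization step, written in Python rather than R and not using the sdcMicro API: it derives the sample frequency of each key-variable combination and flags records that are (near-)unique on their key, a basic re-identification risk indicator. Function and variable names are illustrative assumptions.

```python
# Illustrative sketch (not the sdcMicro API): recompute sample frequency counts
# for key-variable combinations and flag records that are unique on their key.
from collections import Counter

def frequency_counts(records, key_vars):
    """Return the sample frequency of each record's key-variable combination."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def risky_records(records, key_vars, threshold=1):
    """Indices of records whose key combination occurs at most `threshold` times."""
    fk = frequency_counts(records, key_vars)
    return [i for i, f in enumerate(fk) if f <= threshold]

if __name__ == "__main__":
    data = [
        {"age": "30-39", "sex": "f", "region": "N", "income": 52000},
        {"age": "30-39", "sex": "f", "region": "N", "income": 48000},
        {"age": "60-69", "sex": "m", "region": "S", "income": 61000},
    ]
    print(risky_records(data, key_vars=["age", "sex", "region"]))  # -> [2]
```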

118 citations


Journal Article
TL;DR: The advantages and disadvantages of two approaches that provide disclosure control by generating synthetic datasets are discussed: the first, proposed by Rubin [1], generates fully synthetic datasets, while the second, suggested by Little [2], imputes values only for selected variables that bear a high risk of disclosure.
Abstract: For datasets considered for public release, statistical agencies face the dilemma of guaranteeing the confidentiality of survey respondents on the one hand and offering sufficiently detailed data for scientific use on the other. For that reason a variety of methods that address this problem can be found in the literature. In this paper we discuss the advantages and disadvantages of two approaches that provide disclosure control by generating synthetic datasets: the first, proposed by Rubin [1], generates fully synthetic datasets, while the second, suggested by Little [2], imputes values only for selected variables that bear a high risk of disclosure. Changing only some variables will in general lead to higher analytical validity. However, the disclosure risk will also increase for partially synthetic data, since true values remain in the datasets. Thus, agencies willing to release synthetic datasets will have to decide which of the two methods best balances the trade-off between data utility and disclosure risk for their data. We offer some guidelines to help make this decision. To our knowledge, the two approaches have never been empirically compared in the literature so far. We apply the two methods to a set of variables from the 1997 wave of the German IAB Establishment Panel and evaluate their quality by comparing results from the original data with results we achieve for the same analyses run on the datasets after the imputation procedures. The results are as expected: in both cases the analytical validity of the synthetic data is high, with partially synthetic datasets outperforming fully synthetic datasets in terms of data utility. But this advantage comes at the price of a higher disclosure risk for the partially synthetic data.
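As a rough illustration of the partially synthetic approach, the hedged Python sketch below (an assumed setup, not the authors' imputation models for the IAB Establishment Panel) replaces a single high-risk numeric variable with draws from a linear model fitted on the remaining variables, leaving all other values untouched.

```python
# Minimal sketch of the partially synthetic idea (assumed setup): one high-risk
# column is replaced by draws from a regression fitted on the other columns.
import numpy as np

def partially_synthesize(X, risk_col, rng=None):
    """Replace column `risk_col` of X with draws from a linear predictive model."""
    rng = np.random.default_rng(rng)
    y = X[:, risk_col]
    Z = np.delete(X, risk_col, axis=1)
    Z1 = np.column_stack([np.ones(len(Z)), Z])            # add intercept
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)          # fit y ~ Z
    resid_sd = np.std(y - Z1 @ beta, ddof=Z1.shape[1])
    out = X.copy()
    out[:, risk_col] = Z1 @ beta + rng.normal(0.0, resid_sd, size=len(y))
    return out                                              # only the risky column changes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:, 2] = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)
    X_syn = partially_synthesize(X, risk_col=2, rng=1)
    print(np.corrcoef(X[:, 2], X_syn[:, 2])[0, 1])          # high, but values differ
```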

77 citations


Journal ArticleDOI
TL;DR: Two new privacy protection models, called (p, α)-sensitive k-anonymity and (p+, α)-sensitive k-anonymity respectively, are proposed; they allow much more information to be released without compromising privacy.
Abstract: Publishing data for analysis from a microdata table containing sensitive attributes, while maintaining individual privacy, is a problem of increasing significance today. The k-anonymity model was proposed for privacy-preserving data publication. While it focuses on identity disclosure, the k-anonymity model fails to a certain extent to protect against attribute disclosure. Many efforts have recently been made to enhance the k-anonymity model. In this paper, we propose two new privacy protection models called (p, α)-sensitive k-anonymity and (p+, α)-sensitive k-anonymity, respectively. Unlike the previous p-sensitive k-anonymity model, these newly introduced models allow us to release considerably more information without compromising privacy. Moreover, we prove that the (p, α)-sensitive and (p+, α)-sensitive k-anonymity problems are NP-hard. We also include testing and heuristic generation algorithms to produce the desired microdata table. Experimental results show that the introduced models can significantly reduce privacy breaches.
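The exact (p, α) and (p+, α) definitions are given in the paper; the hedged sketch below encodes only one plausible reading of such requirements for a single sensitive attribute: every equivalence class must contain at least k records and at least p distinct sensitive values, with no single sensitive value occupying more than an α fraction of the class. The function names and the α interpretation are illustrative assumptions.

```python
# Hedged sketch: an assumed (p, alpha)-style check on equivalence classes,
# not the paper's exact definitions.
from collections import Counter, defaultdict

def satisfies(records, quasi_ids, sensitive, k, p, alpha):
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_ids)].append(r[sensitive])
    for values in classes.values():
        counts = Counter(values)
        if len(values) < k or len(counts) < p:
            return False                                  # too small or too uniform
        if max(counts.values()) / len(values) > alpha:
            return False                                  # one sensitive value dominates
    return True

if __name__ == "__main__":
    table = [
        {"zip": "47***", "age": "2*", "disease": "flu"},
        {"zip": "47***", "age": "2*", "disease": "flu"},
        {"zip": "47***", "age": "2*", "disease": "hepatitis"},
    ]
    print(satisfies(table, ["zip", "age"], "disease", k=3, p=2, alpha=0.7))  # True
```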

72 citations


Journal Article
TL;DR: This paper outlines an approach that draws on advances in both the social science and computer science literatures to develop new access modalities that not only provide access but also preserve data and create scientific communities.
Abstract: The vast amount of data now collected on human beings and organizations as a result of cyberinfrastructure advances has created similarly vast opportunities for social scientists to study and understand human behavior. It has also made traditional ways of protecting social science data obsolete. The challenge to social scientists is to exploit advances in cyberinfrastructure to develop new access modalities that not only provide access but also preserve data and create scientific communities. This paper outlines an approach that draws on advances in both the social science and computer science literatures.

51 citations


Journal Article
TL;DR: The development of methods for microaggregation that avoid the introduction of inconsistencies is studied, that is, methods that ensure the protected data satisfy a set of given constraints.
Abstract: Privacy-preserving data mining and statistical disclosure control have introduced several methods for data perturbation that can be used to ensure the privacy of data respondents. Such methods, for example rank swapping and microaggregation, perturb the data by introducing some kind of noise. Nevertheless, data are usually edited with care after collection to remove inconsistencies, and such perturbation might introduce new inconsistencies. In this paper we study the development of methods for microaggregation that avoid the introduction of such inconsistencies, that is, methods that ensure the protected data satisfy a set of given constraints.
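As a toy illustration of plain microaggregation followed by a constraint (edit-rule) check, and not of the constrained algorithms developed in the paper, the sketch below groups sorted values into blocks of at least k records, replaces each value by its group mean, and then verifies a user-supplied edit rule; the rule shown is an assumed example.

```python
# Minimal univariate microaggregation with a post-hoc edit-rule check (assumed
# constraint form; the paper's constrained methods work differently).
import numpy as np

def microaggregate(values, k=3):
    """Replace each value by the mean of its size->=k group (sorted grouping)."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    n, start = len(values), 0
    while start < n:
        end = n if n - start < 2 * k else start + k   # last group absorbs the remainder
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

if __name__ == "__main__":
    age = np.array([23, 25, 31, 34, 35, 41, 44, 52.0])
    retirement_age = np.full_like(age, 65.0)
    masked_age = microaggregate(age, k=3)
    # Example edit rule: masked age must stay below retirement age.
    assert np.all(masked_age < retirement_age), "constraint violated, adjust grouping"
    print(masked_age)
```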

34 citations


Journal Article
TL;DR: This study provides a new methodology for generating non-synthetic perturbed data that keeps the mean vector and covariance matrix of the masked data exactly the same as those of the original data while offering a selectable degree of similarity between original and perturbed data.
Abstract: The mean vector and covariance matrix are sufficient statistics when the underlying distribution is multivariate normal. Many types of statistical analyses used in practice rely on the assumption of multivariate normality (Gaussian model). For these analyses, maintaining the mean vector and covariance matrix of the masked data to be the same as those of the original data implies that if the masked data is analyzed using these techniques, the results of such analysis will be the same as those obtained using the original data. For numerical confidential data, a recently proposed perturbation method makes it possible to maintain the mean vector and covariance matrix of the masked data to be exactly the same as the original data. However, as it is currently proposed, the perturbed values from this method are considered synthetic because they are generated without considering the values of the confidential variables (and are based only on the non-confidential variables). Some researchers argue that synthetic data results in information loss. In this study, we provide a new methodology for generating non-synthetic perturbed data that maintains the mean vector and covariance matrix of the masked data to be exactly the same as the original data while offering a selectable degree of similarity between original and perturbed data.
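The exact moment-matching idea can be illustrated with a short, hedged sketch that is not the paper's generation scheme: any candidate perturbed matrix (here, noise-added data that retains record-level similarity to the original) is linearly transformed via Cholesky factors so that its sample mean vector and covariance matrix equal those of the original data exactly.

```python
# Hedged sketch of exact moment matching (not the paper's method): rescale a
# candidate perturbed matrix so its sample mean and covariance match the original.
import numpy as np

def match_moments(original, perturbed):
    """Transform `perturbed` so its sample mean and covariance equal `original`'s."""
    mu_x, mu_y = original.mean(axis=0), perturbed.mean(axis=0)
    Lx = np.linalg.cholesky(np.cov(original, rowvar=False))
    Ly = np.linalg.cholesky(np.cov(perturbed, rowvar=False))
    # Whiten the perturbed data, then recolor with the original covariance factor.
    Z = (perturbed - mu_y) @ np.linalg.inv(Ly).T
    return Z @ Lx.T + mu_x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 2]], size=500)
    Y = X + rng.normal(scale=0.3, size=X.shape)            # noise-added candidate
    Y_star = match_moments(X, Y)
    print(np.allclose(Y_star.mean(axis=0), X.mean(axis=0)))                     # True
    print(np.allclose(np.cov(Y_star, rowvar=False), np.cov(X, rowvar=False)))   # True
```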

27 citations


Journal ArticleDOI
TL;DR: The demand for high quality microdata for analytical purposes has grown rapidly among researchers and the public over the last few years and in order to respect existing laws on data privacy and to be...
Abstract: The demand for high quality microdata for analytical purposes has grown rapidly among researchers and the public over the last few years. In order to respect existing laws on data privacy and to be...

19 citations


Journal Article
TL;DR: A new method for assessing disclosure risk for tables of counts is presented: the subtraction-attribution probability (SAP) method, which can be applied to exact or perturbed individual tables and sets of tables.
Abstract: The paper describes a new method for assessing disclosure risk for tables of counts: the subtraction-attribution probability (SAP) method. The SAP score is the probability of an intruder recovering a 'risky' subpopulation table given a quantity of information about the individuals in a population table. The method can be applied to exact or perturbed individual tables and sets of tables. The method can also be used to compare the risk impact of different disclosure control regimes.
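The sketch below illustrates only the underlying subtraction idea, under an assumed table layout; the SAP score itself is a probability defined in the paper and is not reproduced here. An intruder subtracts a table of known individuals from a released table of counts and checks whether any row of the residual table forces attribution, for example because all remaining counts fall in one category.

```python
# Illustrative subtraction step behind such attacks (assumed layout, not the SAP score).
import numpy as np

def subtraction_attribution(population_table, known_table):
    """Return the residual table and rows where all remaining counts fall in one cell."""
    residual = population_table - known_table
    if (residual < 0).any():
        raise ValueError("known counts exceed the released table")
    risky_rows = [i for i, row in enumerate(residual) if (row > 0).sum() == 1]
    return residual, risky_rows

if __name__ == "__main__":
    released = np.array([[3, 1], [2, 2]])   # rows: areas, columns: sensitive categories
    known = np.array([[2, 1], [0, 0]])      # individuals the intruder already knows
    residual, risky = subtraction_attribution(released, known)
    print(residual)   # [[1 0] [2 2]]
    print(risky)      # [0] -> attribution possible for the first area
```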

16 citations


Journal Article
TL;DR: An image quality metric, the Czenakowski Measure, which is substantially sensitive to LSB embedding, is utilized to derive an effective image-adaptive threshold; the resulting method is capable of detecting stego images with an embedding of even 10% payload, while earlier methods achieve the same detection rate only at 20% payload.
Abstract: We present a novel technique for effective steganalysis of high-color-depth digital images that have been subjected to embedding by LSB steganographic algorithms. The detection theory is based on the idea that under repeated embedding, the disruption of the signal characteristics is highest for the first embedding and decreases subsequently; that is, the marginal distortions due to repeated embeddings decrease monotonically. This decreasing-distortion property, exploited together with the Close Color Pair signature, is used to construct a classifier that can distinguish between stego and cover images. For evaluation, a database composed of 1200 plain and stego images (at 10% and 20% payload, each one artificially adulterated with 20% additional data) was established. Based on this database, extensive experiments were conducted to prove the feasibility of our proposed system. Our main results are: (i) a 90%+ positive-detection rate; (ii) the Close Color Pair ratio is not modified significantly when additional bit streams are embedded into a test image that has already been tampered with a message; (iii) an image quality metric, the Czenakowski Measure, which is substantially sensitive to LSB embedding, is utilized to derive an effective image-adaptive threshold; (iv) the method is capable of detecting stego images with an embedding of even 10% payload, while earlier methods achieve the same detection rate only at 20% payload.
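A hedged sketch of a close-color-pair statistic under repeated LSB embedding is given below; the exact feature and classifier construction in the paper differ, and the pair definition used here (unique colors whose channels all differ by at most one) is an assumption for illustration.

```python
# Assumed close-color-pair statistic and a simulated repeated LSB embedding
# (illustrative only; not the paper's exact feature or classifier).
import numpy as np

def close_color_pair_ratio(img):
    """Fraction of unique-color pairs whose channels all differ by at most 1."""
    colors = np.unique(img.reshape(-1, 3), axis=0).astype(np.int16)
    u = len(colors)
    if u < 2:
        return 0.0
    diff = np.abs(colors[:, None, :] - colors[None, :, :]).max(axis=2)
    close = np.triu(diff <= 1, k=1).sum()
    return close / (u * (u - 1) // 2)

def lsb_embed(img, payload=0.1, rng=None):
    """Flip the least significant bit of a random `payload` fraction of samples."""
    rng = np.random.default_rng(rng)
    out = img.copy()
    mask = rng.random(out.shape) < payload
    out[mask] ^= 1
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cover = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
    stego1 = lsb_embed(cover, 0.1, rng=1)     # first embedding
    stego2 = lsb_embed(stego1, 0.1, rng=2)    # repeated embedding
    # Compare how the ratio evolves across repeated embeddings.
    print([close_color_pair_ratio(x) for x in (cover, stego1, stego2)])
```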

8 citations


Journal Article
TL;DR: This work addresses the problem of counting the number of points of the integer grid at which two digitized straight lines overlap, in the particular case when the crossing point of the non-digitized versions of the lines has integer coordinates and the slopes belong to the set of all possible slopes of segments between two different points of the N × N grid.
Abstract: We consider the unbounded integer grid and the digitized version of the straight line y = αx + β, with α, β ∈ ℝ, defined as the set of points (i, [αi + β]), i ∈ ℤ, where [·] is the integer rounding operator ([x] − 0.5 ≤ x < [x] + 0.5). We address the problem of counting the number of points of the integer grid at which two digitized straight lines overlap each other, in the particular case when the crossing point of the non-digitized version of the lines has integer coordinates and the slopes belong to the set {a/b : a ∈ {−(N−1),…,(N−1)}, b ∈ {1,…,(N−1)}}, that is, all the possible slopes of the segments between two different points in the N × N grid. Applications of this problem are explained, with a special focus on a shared steganographic system with error correction.
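A brute-force check of the quantity being counted can be written in a few lines; the sketch below is illustrative only (the paper derives the counts analytically) and simply compares the two digitizations over a window of abscissae around an integer crossing point, using the rounding operator defined above.

```python
# Brute-force comparison of two digitized lines around an integer crossing point
# (illustrative only; the slopes and window are example choices).
import math
from fractions import Fraction

def digitize(alpha, beta, i):
    """Integer rounding [x] with [x] - 0.5 <= x < [x] + 0.5, i.e. floor(x + 1/2)."""
    return math.floor(alpha * i + beta + Fraction(1, 2))

def overlap_count(a1, b1, a2, b2, cross=(0, 0), window=50):
    """Count abscissae i near the crossing point where both digitized lines agree."""
    cx, cy = cross
    alpha1, alpha2 = Fraction(a1, b1), Fraction(a2, b2)
    beta1, beta2 = cy - alpha1 * cx, cy - alpha2 * cx   # both lines pass through (cx, cy)
    return sum(
        digitize(alpha1, beta1, i) == digitize(alpha2, beta2, i)
        for i in range(cx - window, cx + window + 1)
    )

if __name__ == "__main__":
    # Example: slopes 1/3 and 2/5 (admissible for N >= 6), crossing at the origin.
    print(overlap_count(1, 3, 2, 5, cross=(0, 0), window=20))
```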

3 citations