A Framework for Efficient Data
Anonymization under Privacy and
Accuracy Constraints
GABRIEL GHINITA, PANAGIOTIS KARRAS, and PANOS KALNIS
National University of Singapore
and
NIKOS MAMOULIS
University of Hong Kong
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem.

In this article, we propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, we focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and ℓ-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multidimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the existing approaches in terms of execution time and information loss.
Categories and Subject Descriptors: H.2.0 [Database Management]: General—Security, integrity,
and protection
General Terms: Design, Experimentation, Security
Additional Key Words and Phrases: Privacy, anonymity
This work was partially supported by grant HKU 715108E from Hong Kong RGC.
Authors' addresses: G. Ghinita, P. Karras, P. Kalnis, National University of Singapore, Computing 1, Computing Drive, Singapore 117417; email: {ghinitag,karras,kalnis}@comp.nus.edu.sg; N. Mamoulis, University of Hong Kong, Pokfulam Road, Hong Kong; email: nikos@cs.hku.hk.
Permission to make digital or hard copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2009 ACM 0362-5915/2009/06-ART9 $10.00
DOI 10.1145/1538909.1538911 http://doi.acm.org/10.1145/1538909.1538911

ACM Reference Format:
Ghinita, G., Karras, P., Kalnis, P., and Mamoulis, N. 2009. A framework for efficient data anonymization under privacy and accuracy constraints. ACM Trans. Database Syst. 34, 2, Article 9 (June 2009), 47 pages. DOI = 10.1145/1538909.1538911 http://doi.acm.org/10.1145/1538909.1538911
1. INTRODUCTION
Organizations, such as hospitals, need to release microdata (e.g., medical records) for research and other public benefit purposes. However, sensitive personal information (e.g., medical condition of a specific person) may be revealed in this process. Conventionally, identifying attributes such as name or social security number are not disclosed, in order to protect privacy. Still, recent research [Froomkin 2000; Sweeney 2002] has demonstrated that this is not sufficient, due to the existence of quasi-identifiers in the released microdata. Quasi-identifiers are sets of attributes (e.g., ZIP, Gender, DateOfBirth) which can be joined with information obtained from diverse sources (e.g., public voting registration data) in order to reveal the identity of individual records.
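To illustrate the threat, here is a small Python sketch (a toy example of ours, not from the paper) that joins a "de-identified" release with a public registry on the quasi-identifier attributes; all data and field names are illustrative:

```python
def link(released, public, qi_attrs):
    """Re-identify released records by joining them with a public
    dataset on the quasi-identifier attributes."""
    index = {}
    for p in public:
        index.setdefault(tuple(p[a] for a in qi_attrs), []).append(p["Name"])
    matches = {}
    for r in released:
        key = tuple(r[a] for a in qi_attrs)
        if len(index.get(key, [])) == 1:  # unique QI combination => identified
            matches[index[key][0]] = r["Disease"]
    return matches

released = [{"ZIP": "117417", "Gender": "M", "DateOfBirth": "1970-01-01",
             "Disease": "gastritis"}]
public   = [{"ZIP": "117417", "Gender": "M", "DateOfBirth": "1970-01-01",
             "Name": "Bob"}]
print(link(released, public, ["ZIP", "Gender", "DateOfBirth"]))  # {'Bob': 'gastritis'}
```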
To address this threat, Samarati [2001] and Sweeney [2002] proposed the k-anonymity model: For every record in a released table there should be at least k − 1 other records identical to it along a set of quasi-identifying attributes. Records with identical quasi-identifier values constitute an equivalence class. k-anonymity is commonly achieved either by generalization (e.g., show only the area code instead of the exact phone number) or suppression (i.e., hide some values of the quasi-identifier), both of which inevitably lead to information loss. Still, the data should remain as accurate as possible in order to be useful in practice. Hence a trade-off between privacy and information loss emerges.
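To make the definition concrete, a minimal sketch (ours, not the paper's) that checks whether a table satisfies k-anonymity by counting the size of every equivalence class; the table layout and helper names are illustrative assumptions:

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Check k-anonymity: every combination of quasi-identifier
    values must be shared by at least k records."""
    # Count the size of each equivalence class (identical QI values).
    classes = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(size >= k for size in classes.values())

# Illustrative usage with a toy, already-generalized table.
table = [
    {"Age": "35-55", "Weight": "50-70", "Disease": "gastritis"},
    {"Age": "35-55", "Weight": "50-70", "Disease": "flu"},
]
print(is_k_anonymous(table, ["Age", "Weight"], k=2))  # True
```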
Recently, the concept of ℓ-diversity [Machanavajjhala et al. 2006] was introduced to address the limitations of k-anonymity. The latter may disclose sensitive information when there are many identical Sensitive Attribute (SA) values within an equivalence class¹ (e.g., all persons suffer from the same disease). ℓ-diversity prevents uniformity and background knowledge attacks by ensuring that at least ℓ SA values are well represented in each equivalence class (e.g., the probability to associate a tuple with an SA value is bounded by 1/ℓ [Xiao and Tao 2006a]). Machanavajjhala et al. [2006] suggest that any k-anonymization algorithm can be adapted to achieve ℓ-diversity. However, the following example demonstrates that such an approach may yield excessive information loss.

¹k-anonymity remains a useful concept, suitable for cases where the sensitive attribute is implicit or omitted (e.g., a database containing information about convicted persons, regardless of specific crimes).
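The 1/ℓ bound can be checked per equivalence class by inspecting the most frequent SA value. A minimal sketch, assuming the same toy table layout as above (function and field names are ours, not the paper's):

```python
from collections import Counter

def is_l_diverse(records, qi_attrs, sa_attr, l):
    """Check (frequency-based) l-diversity: in every equivalence class,
    no single SA value may occur in more than |class|/l of its records,
    so the association probability is bounded by 1/l."""
    classes = {}
    for r in records:
        classes.setdefault(tuple(r[a] for a in qi_attrs), []).append(r[sa_attr])
    for sa_values in classes.values():
        most_common = Counter(sa_values).most_common(1)[0][1]
        if most_common * l > len(sa_values):  # probability > 1/l: violation
            return False
    return True
```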
Fig. 1. k-anonymization example (k = 4).

Consider the privacy-constrained anonymization problem for the microdata in Figure 1(a), where the combination of Age, Weight is the quasi-identifier and Disease is the sensitive attribute. Let the required privacy constraint, within the k-anonymity model, be k = 4. The current state-of-the-art k-anonymization algorithm (i.e., Mondrian [LeFevre et al. 2006a]) sorts the data points along each dimension (i.e., Age and Weight), and partitions across the dimension with the widest normalized range of values. In our example, the normalized ranges for both dimensions are the same. Mondrian selects the first one (i.e., Age) and splits it into segments 35–55 and 60–70 (see Figure 1(b)). Further partitioning is not possible because any split would result in groups with fewer than 4 records.
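The Mondrian splitting procedure described above can be sketched as follows (a simplified version of ours: strict partitioning, numerical attributes only; all names are illustrative):

```python
def mondrian(points, dims, ranges, k):
    """Simplified Mondrian: recursively split on the dimension with the
    widest normalized range, at the median, while both halves keep >= k
    points; otherwise the current set becomes one equivalence class."""
    # Pick the dimension whose extent is largest relative to its domain.
    def norm_extent(d):
        vals = [p[d] for p in points]
        return (max(vals) - min(vals)) / ranges[d]
    for d in sorted(dims, key=norm_extent, reverse=True):
        vals = sorted(p[d] for p in points)
        median = vals[len(vals) // 2]
        left  = [p for p in points if p[d] <  median]
        right = [p for p in points if p[d] >= median]
        if len(left) >= k and len(right) >= k:   # allowable split found
            return (mondrian(left, dims, ranges, k)
                    + mondrian(right, dims, ranges, k))
    return [points]  # no allowable split: emit one equivalence class
```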
We propose a different approach. First, we map the multidimensional quasi-identifier to a 1D value. In this example we use an 8 × 8 Hilbert space filling curve (see Section 6 for details); other mappings are also possible. The resulting sorted 1D values are shown in Figure 1(a) (column 1D). Next, we partition the 1D space. We prove that the optimal 1D partitions are nonoverlapping and contain between k and 2k − 1 records. We obtain 3 groups which correspond to 1D ranges [22..31], [33..42], and [55..63]. The resulting 2D partitions are enclosed by three rectangles in Figure 1(b). In this example, our method causes less information loss because the extents of the obtained groups are smaller than in the case of Mondrian. For instance, consider the query "Find how many persons are in the age segment 35–45 and weight interval 50–60": The correct answer is 3. Assuming that records are uniformly distributed within each group, our method returns the answer 4 × 9/12 = 3 (there are 4 records in Group 1, 9 data space cells that match the query, and a total of 12 cells in Group 1). On the other hand, the answer obtained with Mondrian is 6 × 9/40 = 1.35 (from the group situated to the left of the dotted line). Clearly, our k-anonymization algorithm is more accurate.
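The nonoverlapping-groups property makes the 1D step easy to picture: over the sorted 1D keys, an optimal k-anonymous partitioning uses only consecutive groups of size k to 2k − 1. The sketch below finds one with a simple dynamic program; the cost (sum of group extents) is a stand-in for the paper's information-loss metric, the keys are illustrative rather than the exact Figure 1 values, and the paper's own heuristics run in linear time, which this DP does not:

```python
def partition_1d(values, k):
    """Partition sorted 1D values into consecutive groups of size
    k..2k-1, minimizing the total extent (max - min) of the groups.
    The extent sum is a stand-in for the paper's information loss."""
    values = sorted(values)
    n = len(values)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: min cost of partitioning values[:i]
    cut = [0] * (n + 1)      # cut[i]: start index of the last group
    best[0] = 0.0
    for i in range(k, n + 1):
        # Last group is values[j:i] with k <= i - j <= 2k - 1.
        for j in range(max(0, i - (2 * k - 1)), i - k + 1):
            cost = best[j] + (values[i - 1] - values[j])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Recover the groups by walking the cut positions backwards.
    groups, i = [], n
    while i > 0:
        groups.append(values[cut[i]:i])
        i = cut[i]
    return list(reversed(groups))

# Illustrative 1D keys (not the exact Figure 1 dataset); the DP yields
# three groups matching the ranges [22..31], [33..42], [55..63].
print(partition_1d([22, 25, 27, 31, 33, 35, 40, 42, 55, 56, 60, 63], k=4))
```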
Fig. 2. ℓ-diversification example (ℓ = 3).

The advantages of our approach are even more prominent with the ℓ-diversification problem. This problem is more difficult because, in order to cover a variety of SA values, the optimal 1D partitioning may have to include overlapping ranges. For example, if ℓ = 3, group 2 in Figure 2(a) contains tuples {30, 35, 56}, whereas the third group contains tuples {33, 40, 42}. Nevertheless, we prove that there exist optimal partitionings consisting of only consecutive ranges with respect to each individual value of the sensitive attribute. Based on this property, we develop a heuristic which essentially groups together records that are close to each other in the 1D space, but have different sensitive attribute values. The four resulting groups² are shown in Figure 2(b). From the result we can infer, for instance, that no person younger than 55 suffers from Alzheimer's. On the other hand, if we use Mondrian, we cannot partition the space at all because any possible disjoint partitioning would violate the ℓ-diversity property. For example, if the Age axis was split into segments 35–55 and 60–70 (i.e., as in the k-anonymity case), then gastritis would appear in the left-side partition with probability 3/6, which is larger than the allowed 1/ℓ = 1/3. Since Mondrian includes all tuples in the same partition, young or old persons are ascribed the same probability to suffer from Alzheimer's. Obviously the resulting information loss is unacceptable.

²Note that although groups may overlap in their quasi-identifier extents, each record belongs to exactly one group.
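A simplified rendering of that grouping idea (not the paper's exact heuristic): bucket the records by SA value, keeping each bucket in 1D order, and repeatedly draw one record from each of the ℓ largest remaining buckets, so every group holds ℓ distinct SA values. Drawing from the largest buckets mirrors the frequency condition behind ℓ_max discussed below; all names are ours:

```python
from collections import defaultdict

def l_diverse_groups(records, l, key="key", sa="sa"):
    """Form l-diverse groups: one record from each of the l largest
    remaining SA buckets per group, buckets kept in 1D-key order.
    A simplified stand-in for the paper's locality-aware heuristic;
    leftover records (< l distinct SA values) need separate handling."""
    buckets = defaultdict(list)
    for r in sorted(records, key=lambda r: r[key]):
        buckets[r[sa]].append(r)
    groups = []
    while len(buckets) >= l:
        # Pick the l SA values with the most remaining records.
        top = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop(0) for v in top])
        for v in top:
            if not buckets[v]:
                del buckets[v]
    return groups
```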
The previous example demonstrates that existing techniques for the privacy-constrained k-anonymization problem, such as Mondrian, are not appropriate for the ℓ-diversification problem. In Section 2 we also explain that Anatomy [Xiao and Tao 2006a], which is an ℓ-diversity-specific method, exhibits high information loss, despite relaxing the privacy requirements (i.e., it publishes the exact quasi-identifier). Moreover, while our techniques resemble clustering, our experiments show that existing clustering-based anonymization techniques (e.g., Xu et al. [2006]) are worse in terms of information loss and considerably slower.
So far, research efforts focused on the privacy-constrained anonymization problem, which minimizes information loss for a given value of k or ℓ; we call this the direct anonymization problem. However, the resulting information loss may be high, rendering the published data useless for specific applications. In practice, the data recipient may require certain bounds on the amount of information loss. For instance, it is well known that the occurrence of certain diseases is highly correlated with age (e.g., Alzheimer's can only occur in elderly patients). To ensure that anonymized hospital records make practical sense, a medical researcher may require that no anonymized group should span a range on attribute Age larger than 10 years. Motivated by such scenarios, we introduce the accuracy-constrained or dual anonymization problem. Let E be the maximum acceptable amount of information loss (the metric is formally defined in Section 2). The accuracy-constrained anonymization problem finds the maximum degree of privacy (i.e., k or ℓ) that can be achieved such that information loss does not exceed E. Subsequently, the data publisher can assess whether the attainable privacy under this constraint is satisfactory, and can decide whether it makes sense to publish the data at all. To the best of our knowledge, the dual problem has not been addressed previously, despite its important practical applications.

Fig. 3. Iterative privacy-constrained solution for the accuracy-constrained (i.e., dual) problem.
A possible solution for the dual problem is to use an existing method for privacy-constrained anonymization, as shown in Figure 3 (we consider the ℓ-diversity case). The algorithm, called Iterative Privacy-Constrained Solution for the Dual problem (IPCSD), performs a binary search to find the maximum value of ℓ for which the information loss does not exceed E. The ℓ_min value of 1 (line 1) corresponds to no privacy, whereas ℓ_max is the maximum achievable privacy, and is a characteristic of the dataset. As we will formally discuss in Section 2.3, ℓ_max is equal to the total number of records divided by the number of occurrences of the SA value with the highest frequency. The algorithm stops when the search interval for the ℓ value is reduced below a certain threshold Thr. IPCSD is a generic solution that can be used in conjunction with any privacy-constrained ℓ-diversification method, such as our proposed 1D method described earlier, or Mondrian. The invocation of a particular method is done in line 4 of the pseudocode.
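Since Figure 3 is not reproduced here, the following sketch reconstructs IPCSD from the description above; the anonymizer callback, the information-loss function, and the parameter names are our assumptions, not the paper's exact pseudocode:

```python
from collections import Counter

def ipcsd(records, sa_attr, E, anonymize, info_loss, thr=0.5):
    """Binary-search the largest l whose anonymization stays within the
    information-loss bound E (reconstruction of IPCSD, Figure 3).
    `anonymize(records, l)` may be any privacy-constrained
    l-diversification method (e.g., the 1D method or Mondrian) and may
    return None when l is infeasible; `info_loss` scores its output."""
    # l_min = 1 is no privacy; l_max = N / (max SA frequency) is the
    # highest l the dataset can support (Section 2.3). l may be fractional.
    l_min = 1.0
    l_max = len(records) / max(Counter(r[sa_attr] for r in records).values())
    best = None
    while l_max - l_min > thr:             # stop when interval < Thr
        l = (l_min + l_max) / 2
        partition = anonymize(records, l)  # line 4 of the pseudocode
        if partition is not None and info_loss(partition) <= E:
            best, l_min = partition, l     # feasible: try more privacy
        else:
            l_max = l                      # infeasible: lower the target
    return best
```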
Because it is not specifically tailored for the dual problem, IPCSD can yield unsatisfactory results. Consider the example of Figure 4 and assume the E bound requires that the span of each group along any of the quasi-identifier attributes does not exceed 15. IPCSD, used in conjunction with Mondrian, will give the result in Figure 4(a), with a maximum achievable privacy metric of ℓ = 4/3 (ℓ is the inverse of the maximum association probability between a record and an SA value, which is 3/4 for group 2). It is easy to see that all splits, except that between Weight values 55 and 60, leave on one side of the split only records with the same SA value, hence association probability is 100% (no privacy). The solution depicted in the example is the only one where

References

Sweeney, L. 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 5, 557–570.

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. 2007. ℓ-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 1, 1, Article 3.

Li, N., Li, T., and Venkatasubramanian, S. 2007. t-Closeness: Privacy beyond k-anonymity and ℓ-diversity. In Proceedings of the IEEE International Conference on Data Engineering (ICDE).

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. 2006. ℓ-Diversity: Privacy beyond k-anonymity. In Proceedings of the IEEE International Conference on Data Engineering (ICDE).

Samarati, P. 2001. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13, 6, 1010–1027.
Frequently Asked Questions (7)
Q1. What contributions have the authors mentioned in the paper "A framework for efficient data anonymization under privacy and accuracy constraints"?

Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. In this article, the authors propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, the authors focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and ℓ-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Extensive experimental evaluation shows that their techniques clearly outperform the existing approaches in terms of execution time and information loss.

In the future the authors plan to extend their framework to other privacy paradigms, such as t-closeness and m-invariance. Furthermore, the authors intend to study the privacy- and accuracy-constrained problems for data streams. 

In their experiments the authors use KL-Divergence (KLD), which has been acknowledged as a representative metric in the data anonymization literature [Kifer and Gehrke 2006].

To address the inflexibility of single-dimensional recoding, Mondrian [LeFevre et al. 2006a] employs multidimensional global recoding, which achieves finer granularity. 


Because Mondrian uses space partitioning, the data points within a group are not necessarily close to each other in the QT space (e.g., points 22 and 55 in Figure 1(b)), causing high information loss. 

The NCP of class G over all quasi-identifier attributes is

$NCP(G) = \sum_{i=1}^{d} w_i \cdot NCP_{A_i}(G)$,    (1)

where d is the number of attributes in QT (i.e., the dimensionality).
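A direct transcription of Eq. (1), under the assumption (ours, for illustration) that the per-attribute penalty NCP_{A_i}(G) of a numerical attribute is the group's extent on that attribute divided by the attribute's full domain range:

```python
def ncp(group, qi_attrs, domain_ranges, weights=None):
    """Normalized Certainty Penalty of an equivalence class, Eq. (1):
    NCP(G) = sum_i w_i * NCP_{A_i}(G). Per-attribute NCP is taken as the
    group's extent divided by the attribute's domain range -- an assumed,
    common instantiation for numerical attributes."""
    weights = weights or {a: 1.0 for a in qi_attrs}
    total = 0.0
    for a in qi_attrs:
        values = [r[a] for r in group]
        extent = max(values) - min(values)
        total += weights[a] * extent / domain_ranges[a]
    return total

# Toy example: a group spanning 10 years of Age and 20 kg of Weight,
# over illustrative domain widths of 35 and 40.
g = [{"Age": 35, "Weight": 50}, {"Age": 45, "Weight": 70}]
print(ncp(g, ["Age", "Weight"], {"Age": 35.0, "Weight": 40.0}))
```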