A Framework for Efficient Data
Anonymization under Privacy and
Accuracy Constraints
GABRIEL GHINITA, PANAGIOTIS KARRAS, and PANOS KALNIS
National University of Singapore
and
NIKOS MAMOULIS
University of Hong Kong
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem.

In this article, we propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, we focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and ℓ-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multidimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the existing approaches in terms of execution time and information loss.
Categories and Subject Descriptors: H.2.0 [Database Management]: General—Security, integrity,
and protection
General Terms: Design, Experimentation, Security
Additional Key Words and Phrases: Privacy, anonymity
This work was partially supported by grant HKU 715108E from Hong Kong RGC.
Authors' addresses: G. Ghinita, P. Karras, P. Kalnis, National University of Singapore, Computing 1, Computing Drive, Singapore 117417; email: {ghinitag,karras,kalnis}@comp.nus.edu.sg; N. Mamoulis, University of Hong Kong, Pokfulam Road, Hong Kong; email: nikos@cs.hku.hk.
Permission to make digital or hard copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2009 ACM 0362-5915/2009/06-ART9 $10.00
DOI 10.1145/1538909.1538911 http://doi.acm.org/10.1145/1538909.1538911

ACM Reference Format:
Ghinita, G., Karras, P., Kalnis, P., and Mamoulis, N. 2009. A framework for efficient data anonymization under privacy and accuracy constraints. ACM Trans. Database Syst. 34, 2, Article 9 (June 2009), 47 pages. DOI = 10.1145/1538909.1538911 http://doi.acm.org/10.1145/1538909.1538911
1. INTRODUCTION
Organizations, such as hospitals, need to release microdata (e.g., medical records) for research and other public benefit purposes. However, sensitive personal information (e.g., medical condition of a specific person) may be revealed in this process. Conventionally, identifying attributes such as name or social security number are not disclosed, in order to protect privacy. Still, recent research [Froomkin 2000; Sweeney 2002] has demonstrated that this is not sufficient, due to the existence of quasi-identifiers in the released microdata. Quasi-identifiers are sets of attributes (e.g., ZIP, Gender, DateOfBirth) which can be joined with information obtained from diverse sources (e.g., public voting registration data) in order to reveal the identity of individual records.
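To illustrate the threat, here is a small Python sketch (a toy example of ours, not from the paper) that joins a "de-identified" release with a public registry on the quasi-identifier attributes; all data and field names are illustrative:

```python
def link(released, public, qi_attrs):
    """Re-identify released records by joining them with a public
    dataset on the quasi-identifier attributes."""
    index = {}
    for p in public:
        index.setdefault(tuple(p[a] for a in qi_attrs), []).append(p["Name"])
    matches = {}
    for r in released:
        key = tuple(r[a] for a in qi_attrs)
        if len(index.get(key, [])) == 1:  # unique QI combination => identified
            matches[index[key][0]] = r["Disease"]
    return matches

released = [{"ZIP": "117417", "Gender": "M", "DateOfBirth": "1970-01-01",
             "Disease": "gastritis"}]
public   = [{"ZIP": "117417", "Gender": "M", "DateOfBirth": "1970-01-01",
             "Name": "Bob"}]
print(link(released, public, ["ZIP", "Gender", "DateOfBirth"]))  # {'Bob': 'gastritis'}
```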
To address this threat, Samarati [2001] and Sweeney [2002] proposed the k-anonymity model: For every record in a released table there should be at least k − 1 other records identical to it along a set of quasi-identifying attributes. Records with identical quasi-identifier values constitute an equivalence class. k-anonymity is commonly achieved either by generalization (e.g., show only the area code instead of the exact phone number) or suppression (i.e., hide some values of the quasi-identifier), both of which inevitably lead to information loss. Still, the data should remain as accurate as possible in order to be useful in practice. Hence a trade-off between privacy and information loss emerges.
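To make the definition concrete, a minimal sketch (ours, not the paper's) that checks whether a table satisfies k-anonymity by counting the size of every equivalence class; the table layout and helper names are illustrative assumptions:

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Check k-anonymity: every combination of quasi-identifier
    values must be shared by at least k records."""
    # Count the size of each equivalence class (identical QI values).
    classes = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(size >= k for size in classes.values())

# Illustrative usage with a toy, already-generalized table.
table = [
    {"Age": "35-55", "Weight": "50-70", "Disease": "gastritis"},
    {"Age": "35-55", "Weight": "50-70", "Disease": "flu"},
]
print(is_k_anonymous(table, ["Age", "Weight"], k=2))  # True
```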
Recently, the concept of ℓ-diversity [Machanavajjhala et al. 2006] was introduced to address the limitations of k-anonymity. The latter may disclose sensitive information when there are many identical Sensitive Attribute (SA) values within an equivalence class¹ (e.g., all persons suffer from the same disease). ℓ-diversity prevents uniformity and background knowledge attacks by ensuring that at least ℓ SA values are well represented in each equivalence class (e.g., the probability to associate a tuple with an SA value is bounded by 1/ℓ [Xiao and Tao 2006a]). Machanavajjhala et al. [2006] suggest that any k-anonymization algorithm can be adapted to achieve ℓ-diversity. However, the following example demonstrates that such an approach may yield excessive information loss.

¹k-anonymity remains a useful concept, suitable for cases where the sensitive attribute is implicit or omitted (e.g., a database containing information about convicted persons, regardless of specific crimes).
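The 1/ℓ bound can be checked per equivalence class by inspecting the most frequent SA value. A minimal sketch, assuming the same toy table layout as above (function and field names are ours, not the paper's):

```python
from collections import Counter

def is_l_diverse(records, qi_attrs, sa_attr, l):
    """Check (frequency-based) l-diversity: in every equivalence class,
    no single SA value may occur in more than |class|/l of its records,
    so the association probability is bounded by 1/l."""
    classes = {}
    for r in records:
        classes.setdefault(tuple(r[a] for a in qi_attrs), []).append(r[sa_attr])
    for sa_values in classes.values():
        most_common = Counter(sa_values).most_common(1)[0][1]
        if most_common * l > len(sa_values):  # probability > 1/l: violation
            return False
    return True
```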
Fig. 1. k-anonymization example (k = 4).

Consider the privacy-constrained anonymization problem for the microdata in Figure 1(a), where the combination of Age, Weight is the quasi-identifier and Disease is the sensitive attribute. Let the required privacy constraint, within the k-anonymity model, be k = 4. The current state-of-the-art k-anonymization algorithm (i.e., Mondrian [LeFevre et al. 2006a]) sorts the data points along each dimension (i.e., Age and Weight), and partitions across the dimension with the widest normalized range of values. In our example, the normalized ranges for both dimensions are the same. Mondrian selects the first one (i.e., Age) and splits it into segments 35–55 and 60–70 (see Figure 1(b)). Further partitioning is not possible because any split would result in groups with fewer than 4 records.
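The Mondrian splitting procedure described above can be sketched as follows (a simplified version of ours: strict partitioning, numerical attributes only; all names are illustrative):

```python
def mondrian(points, dims, ranges, k):
    """Simplified Mondrian: recursively split on the dimension with the
    widest normalized range, at the median, while both halves keep >= k
    points; otherwise the current set becomes one equivalence class."""
    # Pick the dimension whose extent is largest relative to its domain.
    def norm_extent(d):
        vals = [p[d] for p in points]
        return (max(vals) - min(vals)) / ranges[d]
    for d in sorted(dims, key=norm_extent, reverse=True):
        vals = sorted(p[d] for p in points)
        median = vals[len(vals) // 2]
        left  = [p for p in points if p[d] <  median]
        right = [p for p in points if p[d] >= median]
        if len(left) >= k and len(right) >= k:   # allowable split found
            return (mondrian(left, dims, ranges, k)
                    + mondrian(right, dims, ranges, k))
    return [points]  # no allowable split: emit one equivalence class
```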
We propose a different approach. First, we map the multidimensional quasi-identifier to a 1D value. In this example we use an 8 × 8 Hilbert space filling curve (see Section 6 for details); other mappings are also possible. The resulting sorted 1D values are shown in Figure 1(a) (column 1D). Next, we partition the 1D space. We prove that the optimal 1D partitions are nonoverlapping and contain between k and 2k − 1 records. We obtain 3 groups which correspond to 1D ranges [22..31], [33..42], and [55..63]. The resulting 2D partitions are enclosed by three rectangles in Figure 1(b). In this example, our method causes less information loss because the extents of the obtained groups are smaller than in the case of Mondrian. For instance, consider the query "Find how many persons are in the age segment 35–45 and weight interval 50–60": The correct answer is 3. Assuming that records are uniformly distributed within each group, our method returns the answer 4 × 9/12 = 3 (there are 4 records in Group 1, 9 data space cells that match the query, and a total of 12 cells in Group 1). On the other hand, the answer obtained with Mondrian is 6 × 9/40 = 1.35 (from the group situated to the left of the dotted line). Clearly, our k-anonymization algorithm is more accurate.
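The nonoverlapping-groups property makes the 1D step easy to picture: over the sorted 1D keys, an optimal k-anonymous partitioning uses only consecutive groups of size k to 2k − 1. The sketch below finds one with a simple dynamic program; the cost (sum of group extents) is a stand-in for the paper's information-loss metric, the keys are illustrative rather than the exact Figure 1 values, and the paper's own heuristics run in linear time, which this DP does not:

```python
def partition_1d(values, k):
    """Partition sorted 1D values into consecutive groups of size
    k..2k-1, minimizing the total extent (max - min) of the groups.
    The extent sum is a stand-in for the paper's information loss."""
    values = sorted(values)
    n = len(values)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: min cost of partitioning values[:i]
    cut = [0] * (n + 1)      # cut[i]: start index of the last group
    best[0] = 0.0
    for i in range(k, n + 1):
        # Last group is values[j:i] with k <= i - j <= 2k - 1.
        for j in range(max(0, i - (2 * k - 1)), i - k + 1):
            cost = best[j] + (values[i - 1] - values[j])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Recover the groups by walking the cut positions backwards.
    groups, i = [], n
    while i > 0:
        groups.append(values[cut[i]:i])
        i = cut[i]
    return list(reversed(groups))

# Illustrative 1D keys (not the exact Figure 1 dataset); the DP yields
# three groups matching the ranges [22..31], [33..42], [55..63].
print(partition_1d([22, 25, 27, 31, 33, 35, 40, 42, 55, 56, 60, 63], k=4))
```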
Fig. 2. ℓ-diversification example (ℓ = 3).

The advantages of our approach are even more prominent with the ℓ-diversification problem. This problem is more difficult because, in order to cover a variety of SA values, the optimal 1D partitioning may have to include overlapping ranges. For example, if ℓ = 3, group 2 in Figure 2(a) contains tuples {30, 35, 56}, whereas the third group contains tuples {33, 40, 42}. Nevertheless, we prove that there exist optimal partitionings consisting of only consecutive ranges with respect to each individual value of the sensitive attribute. Based on this property, we develop a heuristic which essentially groups together records that are close to each other in the 1D space, but have different sensitive attribute values. The four resulting groups² are shown in Figure 2(b). From the result we can infer, for instance, that no person younger than 55 suffers from Alzheimer's. On the other hand, if we use Mondrian, we cannot partition the space at all because any possible disjoint partitioning would violate the ℓ-diversity property. For example, if the Age axis was split into segments 35–55 and 60–70 (i.e., as in the k-anonymity case), then gastritis would appear in the left-side partition with probability 3/6, which is larger than the allowed 1/ℓ = 1/3. Since Mondrian includes all tuples in the same partition, young or old persons are ascribed the same probability to suffer from Alzheimer's. Obviously the resulting information loss is unacceptable.

²Note that although groups may overlap in their quasi-identifier extents, each record belongs to exactly one group.
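A simplified rendering of that grouping idea (not the paper's exact heuristic): bucket the records by SA value, keeping each bucket in 1D order, and repeatedly draw one record from each of the ℓ largest remaining buckets, so every group holds ℓ distinct SA values. Drawing from the largest buckets mirrors the frequency condition behind ℓ_max discussed below; all names are ours:

```python
from collections import defaultdict

def l_diverse_groups(records, l, key="key", sa="sa"):
    """Form l-diverse groups: one record from each of the l largest
    remaining SA buckets per group, buckets kept in 1D-key order.
    A simplified stand-in for the paper's locality-aware heuristic;
    leftover records (< l distinct SA values) need separate handling."""
    buckets = defaultdict(list)
    for r in sorted(records, key=lambda r: r[key]):
        buckets[r[sa]].append(r)
    groups = []
    while len(buckets) >= l:
        # Pick the l SA values with the most remaining records.
        top = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop(0) for v in top])
        for v in top:
            if not buckets[v]:
                del buckets[v]
    return groups
```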
The previous example demonstrates that existing techniques for the privacy-constrained k-anonymization problem, such as Mondrian, are not appropriate for the ℓ-diversification problem. In Section 2 we also explain that Anatomy [Xiao and Tao 2006a], which is an ℓ-diversity-specific method, exhibits high information loss, despite relaxing the privacy requirements (i.e., it publishes the exact quasi-identifier). Moreover, while our techniques resemble clustering, our experiments show that existing clustering-based anonymization techniques (e.g., Xu et al. [2006]) are worse in terms of information loss and considerably slower.
So far, research efforts focused on the privacy-constrained anonymization problem, which minimizes information loss for a given value of k or ℓ; we call this the direct anonymization problem. However, the resulting information loss may be high, rendering the published data useless for specific applications. In practice, the data recipient may require certain bounds on the amount of information loss. For instance, it is well known that the occurrence of certain diseases is highly correlated with age (e.g., Alzheimer's can only occur in elderly patients). To ensure that anonymized hospital records make practical sense, a medical researcher may require that no anonymized group should span a range on attribute Age larger than 10 years. Motivated by such scenarios, we introduce the accuracy-constrained or dual anonymization problem. Let E be the maximum acceptable amount of information loss (the metric is formally defined in Section 2). The accuracy-constrained anonymization problem finds the maximum degree of privacy (i.e., k or ℓ) that can be achieved such that information loss does not exceed E. Subsequently, the data publisher can assess whether the attainable privacy under this constraint is satisfactory, and can decide whether it makes sense to publish the data at all. To the best of our knowledge, the dual problem has not been addressed previously, despite its important practical applications.

Fig. 3. Iterative privacy-constrained solution for the accuracy-constrained (i.e., dual) problem.
A possible solution for the dual problem is to use an existing method for privacy-constrained anonymization, as shown in Figure 3 (we consider the ℓ-diversity case). The algorithm, called Iterative Privacy-Constrained Solution for the Dual problem (IPCSD), performs a binary search to find the maximum value of ℓ for which the information loss does not exceed E. The ℓ_min value of 1 (line 1) corresponds to no privacy, whereas ℓ_max is the maximum achievable privacy, and is a characteristic of the dataset. As we will formally discuss in Section 2.3, ℓ_max is equal to the total number of records divided by the number of occurrences of the SA value with the highest frequency. The algorithm stops when the search interval for the ℓ value is reduced below a certain threshold Thr. IPCSD is a generic solution that can be used in conjunction with any privacy-constrained ℓ-diversification method, such as our proposed 1D method described earlier, or Mondrian. The invocation of a particular method is done in line 4 of the pseudocode.
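Since Figure 3 is not reproduced here, the following sketch reconstructs IPCSD from the description above; the anonymizer callback, the information-loss function, and the parameter names are our assumptions, not the paper's exact pseudocode:

```python
from collections import Counter

def ipcsd(records, sa_attr, E, anonymize, info_loss, thr=0.5):
    """Binary-search the largest l whose anonymization stays within the
    information-loss bound E (reconstruction of IPCSD, Figure 3).
    `anonymize(records, l)` may be any privacy-constrained
    l-diversification method (e.g., the 1D method or Mondrian) and may
    return None when l is infeasible; `info_loss` scores its output."""
    # l_min = 1 is no privacy; l_max = N / (max SA frequency) is the
    # highest l the dataset can support (Section 2.3). l may be fractional.
    l_min = 1.0
    l_max = len(records) / max(Counter(r[sa_attr] for r in records).values())
    best = None
    while l_max - l_min > thr:             # stop when interval < Thr
        l = (l_min + l_max) / 2
        partition = anonymize(records, l)  # line 4 of the pseudocode
        if partition is not None and info_loss(partition) <= E:
            best, l_min = partition, l     # feasible: try more privacy
        else:
            l_max = l                      # infeasible: lower the target
    return best
```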
Because it is not specifically tailored for the dual problem, IPCSD can yield unsatisfactory results. Consider the example of Figure 4 and assume the E bound requires that the span of each group along any of the quasi-identifier attributes does not exceed 15. IPCSD, used in conjunction with Mondrian, will give the result in Figure 4(a), with a maximum achievable privacy metric of ℓ = 4/3 (ℓ is the inverse of the maximum association probability between a record and an SA value, which is 3/4 for group 2). It is easy to see that all splits, except that between Weight values 55 and 60, leave on one side of the split only records with the same SA value, hence association probability is 100% (no privacy). The solution depicted in the example is the only one where

References

Sweeney, L. 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 5, 557–570.

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. 2007. ℓ-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 1, 1, Article 3.

Li, N., Li, T., and Venkatasubramanian, S. 2007. t-Closeness: Privacy beyond k-anonymity and ℓ-diversity. In Proceedings of the IEEE International Conference on Data Engineering (ICDE).

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. 2006. ℓ-Diversity: Privacy beyond k-anonymity. In Proceedings of the IEEE International Conference on Data Engineering (ICDE).

Samarati, P. 2001. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13, 6, 1010–1027.
Frequently Asked Questions (7)
Q1. What contributions have the authors mentioned in the paper "A framework for efficient data anonymization under privacy and accuracy constraints"?

Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. In this article, the authors propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, the authors focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and ℓ-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Extensive experimental evaluation shows that their techniques clearly outperform the existing approaches in terms of execution time and information loss.

In the future the authors plan to extend their framework to other privacy paradigms, such as t-closeness and m-invariance. Furthermore, the authors intend to study the privacy- and accuracy-constrained problems for data streams. 

In their experiments the authors use KL-Divergence (KLD), which has been acknowledged as a representative metric in the data anonymization literature [Kifer and Gehrke 2006].

To address the inflexibility of single-dimensional recoding, Mondrian [LeFevre et al. 2006a] employs multidimensional global recoding, which achieves finer granularity. 


Because Mondrian uses space partitioning, the data points within a group are not necessarily close to each other in the QT space (e.g., points 22 and 55 in Figure 1(b)), causing high information loss. 

The NCP of class G over all quasi-identifier attributes is

$NCP(G) = \sum_{i=1}^{d} w_i \cdot NCP_{A_i}(G)$,    (1)

where d is the number of attributes in QT (i.e., the dimensionality).
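A direct transcription of Eq. (1), under the assumption (ours, for illustration) that the per-attribute penalty NCP_{A_i}(G) of a numerical attribute is the group's extent on that attribute divided by the attribute's full domain range:

```python
def ncp(group, qi_attrs, domain_ranges, weights=None):
    """Normalized Certainty Penalty of an equivalence class, Eq. (1):
    NCP(G) = sum_i w_i * NCP_{A_i}(G). Per-attribute NCP is taken as the
    group's extent divided by the attribute's domain range -- an assumed,
    common instantiation for numerical attributes."""
    weights = weights or {a: 1.0 for a in qi_attrs}
    total = 0.0
    for a in qi_attrs:
        values = [r[a] for r in group]
        extent = max(values) - min(values)
        total += weights[a] * extent / domain_ranges[a]
    return total

# Toy example: a group spanning 10 years of Age and 20 kg of Weight,
# over illustrative domain widths of 35 and 40.
g = [{"Age": 35, "Weight": 50}, {"Age": 45, "Weight": 70}]
print(ncp(g, ["Age", "Weight"], {"Age": 35.0, "Weight": 40.0}))
```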