Journal ArticleDOI

L-diversity: Privacy beyond k-anonymity

TL;DR: This paper shows with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems, and proposes a novel and powerful privacy definition called ℓ-diversity, which is practical and can be implemented efficiently.
Abstract: Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain identifying attributes. In this article, we show using two simple attacks that a k-anonymized dataset has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. This is a known problem. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks, and we propose a novel and powerful privacy criterion called ℓ-diversity that can defend against such attacks. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.
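As a rough illustration of the idea, the sketch below checks the simplest reading of the principle ("distinct ℓ-diversity"): every group of records that agree on the quasi-identifiers must contain at least ℓ distinct sensitive values. The table, column names, and the `ell` parameter are hypothetical; the paper's actual instantiations (entropy ℓ-diversity and recursive (c, ℓ)-diversity) are stronger than this distinct-value check.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_ids, sensitive, ell):
    """Check distinct l-diversity: every q*-block (records agreeing on the
    quasi-identifiers) must contain at least `ell` distinct sensitive values."""
    blocks = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_ids)   # the q*-block this record falls in
        blocks[key].add(r[sensitive])          # sensitive values seen in the block
    return all(len(values) >= ell for values in blocks.values())

# Hypothetical toy release: zip/age are quasi-identifiers, disease is sensitive.
table = [
    {"zip": "130**", "age": "<30", "disease": "heart disease"},
    {"zip": "130**", "age": "<30", "disease": "viral infection"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "1485*", "age": ">40", "disease": "cancer"},
    {"zip": "1485*", "age": ">40", "disease": "cancer"},
]
print(is_distinct_l_diverse(table, ["zip", "age"], "disease", ell=2))  # False
```

The second block is k-anonymous for k = 2 yet every record in it has the same disease, which is exactly the homogeneity attack the abstract describes.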


Citations
Book ChapterDOI
Cynthia Dwork
25 Apr 2008
TL;DR: This survey recalls the definition of differential privacy and two basic techniques for achieving it, and shows some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
Abstract: Over the past five years a new approach to privacy-preserving data analysis has borne fruit [13, 18, 7, 19, 5, 37, 35, 8, 32]. This approach differs from much (but not all!) of the related literature in the statistics, databases, theory, and cryptography communities, in that a formal and ad omnia privacy guarantee is defined, and the data analysis techniques presented are rigorously proved to satisfy the guarantee. The key privacy guarantee that has emerged is differential privacy. Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database. In this survey, we recall the definition of differential privacy and two basic techniques for achieving it. We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
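One standard technique for achieving differential privacy is the Laplace mechanism: add noise drawn from Lap(Δf/ε) to a query answer, where Δf is the query's global sensitivity. The minimal sketch below applies it to a counting query (sensitivity 1); the function name and toy dataset are hypothetical, not taken from the survey.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-differentially-private answer by adding
    Laplace(sensitivity / epsilon) noise to the true query answer."""
    if rng is None:
        rng = np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query: how many records have age > 40?
ages = [23, 45, 31, 67, 52, 38, 71]
true_count = sum(a > 40 for a in ages)   # a count has global sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(true_count, round(noisy_count, 2))
```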

3,314 citations


Additional excerpts

  • ...We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning....


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper proposes t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t).
Abstract: The k-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain "identifying" attributes) contains at least k records. Recently, several authors have recognized that k-anonymity cannot prevent attribute disclosure. The notion of l-diversity has been proposed to address this; l-diversity requires that each equivalence class has at least l well-represented values for each sensitive attribute. In this paper we show that l-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We choose to use the earth mover distance measure for our t-closeness requirement. We discuss the rationale for t-closeness and illustrate its advantages through examples and experiments.
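A rough sketch of that check for a numeric sensitive attribute, using the one-dimensional earth mover's (Wasserstein) distance between each equivalence class and the overall table. The table, columns, and threshold `t` are hypothetical; the paper normalizes numeric distances to [0, 1] and also defines EMD over categorical hierarchies, neither of which this sketch covers.

```python
from collections import defaultdict
from scipy.stats import wasserstein_distance

def satisfies_t_closeness(records, quasi_ids, sensitive, t):
    """Check that the sensitive-attribute distribution of every equivalence
    class is within earth mover's distance t of the overall distribution."""
    overall = [r[sensitive] for r in records]
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_ids)].append(r[sensitive])
    return all(wasserstein_distance(vals, overall) <= t
               for vals in classes.values())

# Hypothetical release: salary (in $k, raw scale for brevity) is sensitive.
table = [
    {"zip": "476**", "salary": 30}, {"zip": "476**", "salary": 40},
    {"zip": "476**", "salary": 50}, {"zip": "479**", "salary": 60},
    {"zip": "479**", "salary": 80}, {"zip": "479**", "salary": 100},
]
print(satisfies_t_closeness(table, ["zip"], "salary", t=25))
```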

3,281 citations

Proceedings ArticleDOI
03 Apr 2006
TL;DR: This paper shows with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems, and proposes a novel and powerful privacy definition called ℓ-diversity, which is practical and can be implemented efficiently.
Abstract: Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain "identifying" attributes. In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy definition called ℓ-diversity. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.

2,700 citations

Proceedings ArticleDOI
18 May 2008
TL;DR: This work applies the de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service, and demonstrates that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset.
Abstract: We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
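The core of such an attack is a scoring step: the adversary's few (movie, rating, date) observations are compared against every released record, and the best-scoring record is reported if it stands out. The sketch below is a heavily simplified, hypothetical version of that matching step, not the paper's actual algorithm; the tolerances, data layout, and names are illustration choices.

```python
def score(aux, record, rating_tol=1, date_tol_days=14):
    """Count how many of the adversary's (movie -> (rating, day)) observations
    are approximately matched by a candidate record."""
    hits = 0
    for movie, (rating, day) in aux.items():
        if movie in record:
            r, d = record[movie]
            if abs(r - rating) <= rating_tol and abs(d - day) <= date_tol_days:
                hits += 1
    return hits

def best_match(aux, dataset):
    """Return the record id that best matches the auxiliary information."""
    return max(dataset, key=lambda rid: score(aux, dataset[rid]))

# Hypothetical released data: record id -> {movie: (rating, day-number)}
dataset = {
    "r1": {"m1": (5, 100), "m2": (3, 120), "m3": (4, 300)},
    "r2": {"m1": (2, 101), "m4": (5, 150)},
}
aux = {"m1": (5, 98), "m3": (4, 310)}   # what the adversary learned elsewhere
print(best_match(aux, dataset))          # "r1"
```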

2,241 citations


Cites background from "L-diversity: Privacy beyond k-anony..."

  • ...This does not guarantee any privacy, because the values of sensitive attributes associated with a given quasi-identifier may not be sufficiently diverse [20, 21] or the adversary may know more than just the quasi-identifiers [20]....


Proceedings ArticleDOI
21 May 2015
TL;DR: A decentralized personal data management system that ensures users own and control their data is described, and a protocol that turns a blockchain into an automated access-control manager that does not require trust in a third party is implemented.
Abstract: The recent increase in reported incidents of surveillance and security breaches compromising users' privacy calls into question the current model, in which third parties collect and control massive amounts of personal data. Bitcoin has demonstrated in the financial space that trusted, auditable computing is possible using a decentralized network of peers accompanied by a public ledger. In this paper, we describe a decentralized personal data management system that ensures users own and control their data. We implement a protocol that turns a blockchain into an automated access-control manager that does not require trust in a third party. Unlike Bitcoin, transactions in our system are not strictly financial -- they are used to carry instructions, such as storing, querying and sharing data. Finally, we discuss possible future extensions to blockchains that could harness them into a well-rounded solution for trusted computing problems in society.
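As a very rough, hypothetical illustration of the access-control idea only, the sketch below models permissions as the latest policy transaction recorded for a (user, service) pair and checks requests against it. It deliberately omits the paper's actual blockchain, off-chain storage, and identity details; all names here are made up.

```python
# Minimal, hypothetical model of the access-control ledger: an append-only
# list of "policy transactions", each granting a service a set of permissions.
ledger = []

def record_policy(user, service, permissions):
    """Append a policy transaction (e.g. store/query/share permissions)."""
    ledger.append({"user": user, "service": service,
                   "permissions": set(permissions)})

def is_allowed(user, service, action):
    """Check the most recent policy for (user, service); default-deny."""
    for tx in reversed(ledger):
        if tx["user"] == user and tx["service"] == service:
            return action in tx["permissions"]
    return False

record_policy("alice", "fitness-app", {"store", "query"})
print(is_allowed("alice", "fitness-app", "query"))   # True
print(is_allowed("alice", "fitness-app", "share"))   # False
record_policy("alice", "fitness-app", set())          # user revokes access
print(is_allowed("alice", "fitness-app", "query"))   # False
```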

1,953 citations

References
Proceedings Article
12 Sep 1994

10,454 citations


"L-diversity: Privacy beyond k-anony..." refers background or methods in this paper

  • ...[1-10], [11-20], etc), we would end up with very large q-blocks....


  • ...This is called the monotonicity property, and it has been used extensively in frequent itemset mining algorithms [4]....


  • ...This is called the monotonicity property, and it has been used extensively in frequent itemset mining algorithms [Agrawal and Srikant 1994]. k-anonymity satisfies the monotonicity property, and it is this property which guarantees the correctness of all efficient algorithms [Bayardo and Agrawal…...


  • ...[1-5], [6-10], [11-15], etc) were generalized to age groups of length 10 (i....


Journal ArticleDOI
TL;DR: The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment and examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected.
Abstract: Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.
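A minimal sketch of the k-anonymity condition itself, assuming (hypothetically) that the quasi-identifier columns are known: every combination of quasi-identifier values that appears in the release must occur in at least k records. The column names and toy release below are illustrative only.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """A release is k-anonymous if every quasi-identifier combination that
    appears occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical generalized release: zip truncated, age bucketed.
release = [
    {"zip": "0214*", "age": "20-29", "diagnosis": "flu"},
    {"zip": "0214*", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "0213*", "age": "30-39", "diagnosis": "diabetes"},
]
print(is_k_anonymous(release, ["zip", "age"], k=2))  # False: the last row is unique
```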

7,925 citations


"L-diversity: Privacy beyond k-anony..." refers background or methods in this paper

  • ...To counter linking attacks using quasi-identifiers, Samarati and Sweeney proposed a definition of privacy called k-anonymity [Samarati 2001; Sweeney 2002]....


  • ...This “linking attack” managed to uniquely identify the medical records of the governor of Massachusetts in the medical data [24]....


  • ...Samarati 2001; Sweeney 2002; Zhong et al. 2005], k-anonymity has grown in popularity....


  • ...Because of its conceptual simplicity, k-anonymity has been widely discussed as a viable definition of privacy in data publishing, and due to algorithmic advances in creating k-anonymous versions of a dataset [3, 6, 16, 18, 21, 24, 25], k-anonymity has grown in popularity....


  • ...has been proposed which guarantees that every individual is hidden in a group of size k with respect to the non-sensitive attributes [24]....


Proceedings ArticleDOI
01 Jan 1987
TL;DR: This work presents a polynomial-time algorithm that, given as input the description of a game with incomplete information and any number of players, produces a protocol for playing the game that leaks no partial information, provided the majority of the players is honest.
Abstract: We present a polynomial-time algorithm that, given as input the description of a game with incomplete information and any number of players, produces a protocol for playing the game that leaks no partial information, provided the majority of the players is honest. Our algorithm automatically solves all the multi-party protocol problems addressed in complexity-based cryptography during the last 10 years. It actually is a completeness theorem for the class of distributed protocols with honest majority. Such a completeness theorem is optimal in the sense that, if the majority of the players is not honest, some protocol problems have no efficient solution [C].

3,579 citations

Journal ArticleDOI
16 May 2000
TL;DR: This work considers the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed and proposes a novel reconstruction procedure to accurately estimate the distribution of original data values.
Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
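A compressed sketch of the perturb-then-reconstruct idea: each individual reports x_i + y_i with noise from a known distribution, and the collector iteratively estimates the original value distribution over bins with a Bayes-style update. The binning, noise model, iteration count, and data here are arbitrary illustration choices under that assumption, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
true_vals = rng.normal(40, 10, size=2000)                 # hypothetical original ages
noise_std = 20.0
reported = true_vals + rng.normal(0, noise_std, 2000)     # what the data miner sees

# Iteratively reconstruct the original distribution over bins.
bins = np.linspace(0, 80, 41)
centers = 0.5 * (bins[:-1] + bins[1:])
f = np.full(len(centers), 1.0 / len(centers))             # start from a uniform estimate

def noise_pdf(y):                                         # known perturbation density
    return np.exp(-0.5 * (y / noise_std) ** 2) / (noise_std * np.sqrt(2 * np.pi))

for _ in range(50):
    # posterior probability that report w came from bin a, for every (w, a) pair
    lik = noise_pdf(reported[:, None] - centers[None, :]) * f
    post = lik / lik.sum(axis=1, keepdims=True)
    f = post.mean(axis=0)                                 # updated histogram estimate

print(centers[np.argmax(f)])                              # peaks near the true mean of 40
```

The reconstructed histogram, not the individual values, is then what a classifier-building algorithm would consume.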

3,173 citations


"L-diversity: Privacy beyond k-anony..." refers background in this paper

  • ...[Agrawal and Srikant 2000] propose randomization techniques that can be employed by individuals to mask their sensitive information while allowing the data collector to build good decision trees on the data....


Journal ArticleDOI
TL;DR: A survey technique for improving the reliability of responses to sensitive interview questions is described, which permits the respondent to answer "yes" or "no" to a question without the interviewer knowing what information is being conveyed by the respondent.
Abstract: For various reasons individuals in a sample survey may prefer not to confide to the interviewer the correct answers to certain questions. In such cases the individuals may elect not to reply at all or to reply with incorrect answers. The resulting evasive answer bias is ordinarily difficult to assess. In this paper it is argued that such bias is potentially removable through allowing the interviewee to maintain privacy through the device of randomizing his response. A randomized response method for estimating a population proportion is presented as an example. Unbiased maximum likelihood estimates are obtained and their mean square errors are compared with the mean square errors of conventional estimates under various assumptions about the underlying population.
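In the classic setup, each respondent answers the sensitive question truthfully with probability p and answers its complement otherwise, so a "yes" is observed with probability λ = pπ + (1 − p)(1 − π) and the population proportion can be recovered as π̂ = (λ̂ − (1 − p)) / (2p − 1), for p ≠ 1/2. The simulation below is a small sketch of that estimator with arbitrary, hypothetical parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
true_pi, p, n = 0.30, 0.75, 10_000      # true proportion, design probability, sample size

sensitive = rng.random(n) < true_pi              # whether each respondent has the trait
truthful = rng.random(n) < p                     # spinner: answer the sensitive question?
responses = np.where(truthful, sensitive, ~sensitive)   # otherwise answer the complement

lam_hat = responses.mean()                       # observed proportion of "yes" answers
pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)       # Warner's estimator (requires p != 1/2)
print(round(pi_hat, 3))                          # close to 0.30
```

Because the interviewer never learns which question was answered, no individual response reveals the respondent's true status, yet the aggregate proportion remains estimable.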

2,929 citations