scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Differentially private histogram publication

TL;DR: This paper investigates the publication of DP-compliant histograms, which is an important analytical tool for showing the distribution of a random variable, e.g., hospital bill size for certain patients, and proposes two novel mechanisms, namely NoiseFirst and StructureFirst, for computing DP- Compliant histogram structure.
Abstract: Differential privacy (DP) is a promising scheme for releasing the results of statistical queries on sensitive data, with strong privacy guarantees against adversaries with arbitrary background knowledge. Existing studies on differential privacy mostly focus on simple aggregations such as counts. This paper investigates the publication of DP-compliant histograms, which is an important analytical tool for showing the distribution of a random variable, e.g., hospital bill size for certain patients. Compared to simple aggregations whose results are purely numerical, a histogram query is inherently more complex, since it must also determine its structure, i.e., the ranges of the bins. As we demonstrate in the paper, a DP-compliant histogram with finer bins may actually lead to significantly lower accuracy than a coarser one, since the former requires stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself may reveal sensitive information, which further complicates the problem. Motivated by this, we propose two novel mechanisms, namely NoiseFirst and StructureFirst, for computing DP-compliant histograms. Their main difference lies in the relative order of the noise injection and the histogram structure computation steps. NoiseFirst has the additional benefit that it can improve the accuracy of an already published DP-compliant histogram computed using a naive method. For each of proposed mechanisms, we design algorithms for computing the optimal histogram structure with two different objectives: minimizing the mean square error and the mean absolute error, respectively. Going one step further, we extend both mechanisms to answer arbitrary range queries. Extensive experiments, using several real datasets, confirm that our two proposals output highly accurate query answers and consistently outperform existing competitors.
Citations
More filters
Proceedings ArticleDOI
24 Oct 2016
TL;DR: The main idea is to first gather a candidate set of heavy hitters using a portion of the privacy budget, and focus the remaining budget on refining the candidate set in a second phase, which is much more efficient budget-wise than obtaining the heavy hitters directly from the whole dataset.
Abstract: In local differential privacy (LDP), each user perturbs her data locally before sending the noisy data to a data collector. The latter then analyzes the data to obtain useful statistics. Unlike the setting of centralized differential privacy, in LDP the data collector never gains access to the exact values of sensitive data, which protects not only the privacy of data contributors but also the collector itself against the risk of potential data leakage. Existing LDP solutions in the literature are mostly limited to the case that each user possesses a tuple of numeric or categorical values, and the data collector computes basic statistics such as counts or mean values. To the best of our knowledge, no existing work tackles more complex data mining tasks such as heavy hitter discovery over set-valued data. In this paper, we present a systematic study of heavy hitter mining under LDP. We first review existing solutions, extend them to the heavy hitter estimation, and explain why their effectiveness is limited. We then propose LDPMiner, a two-phase mechanism for obtaining accurate heavy hitters with LDP. The main idea is to first gather a candidate set of heavy hitters using a portion of the privacy budget, and focus the remaining budget on refining the candidate set in a second phase, which is much more efficient budget-wise than obtaining the heavy hitters directly from the whole dataset. We provide both in-depth theoretical analysis and extensive experiments to compare LDPMiner against adaptations of previous solutions. The results show that LDPMiner significantly improves over existing methods. More importantly, LDPMiner successfully identifies the majority true heavy hitters in practical settings.

304 citations


Cites background from "Differentially private histogram pu..."

  • ...[31] propose a two-phase kdtree based spatial decomposition mechanism to publish histograms....

    [...]

Posted Content
TL;DR: In this paper, the authors proposed the functional mechanism, which perturbs the objective function of the optimization problem, rather than its results, and applied it to linear regression and logistic regression.
Abstract: \epsilon-differential privacy is the state-of-the-art model for releasing sensitive information while protecting privacy. Numerous methods have been proposed to enforce epsilon-differential privacy in various analytical tasks, e.g., regression analysis. Existing solutions for regression analysis, however, are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce epsilon-differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to address two most widely used regression models, namely, linear regression and logistic regression. Both theoretical analysis and thorough experimental evaluations show that the functional mechanism is highly effective and efficient, and it significantly outperforms existing solutions.

297 citations

Journal ArticleDOI
TL;DR: This survey compares the diverse release mechanisms of differentially private data publishing given a variety of input data in terms of query type, the maximum number of queries, efficiency, and accuracy.
Abstract: Differential privacy is an essential and prevalent privacy model that has been widely explored in recent decades. This survey provides a comprehensive and structured overview of two research directions: differentially private data publishing and differentially private data analysis. We compare the diverse release mechanisms of differentially private data publishing given a variety of input data in terms of query type, the maximum number of queries, efficiency, and accuracy. We identify two basic frameworks for differentially private data analysis and list the typical algorithms used within each framework. The results are compared and discussed based on output accuracy and efficiency. Further, we propose several possible directions for future research and possible applications.

265 citations

Proceedings ArticleDOI
18 Jun 2014
TL;DR: Blowfish, a class of privacy definitions inspired by the Pufferfish framework, is presented that allows data publishers to extend differential privacy using a policy, which specifies secrets, or information that must be kept secret, and constraints that may be known about the data.
Abstract: Privacy definitions provide ways for trading-off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. In particular, we allow data publishers to extend differential privacy using a policy, which specifies (a) secrets, or information that must be kept secret, and (b) constraints that may be known about the data. While the secret specification allows increased utility by lessening protection for certain individual properties, the constraint specification provides added protection against an adversary who knows correlations in the data (arising from constraints). We formalize policies and present novel algorithms that can handle general specifications of sensitive information and certain count constraints. We show that there are reasonable policies under which our privacy mechanisms for k-means clustering, histograms and range queries introduce significantly lesser noise than their differentially private counterparts. We quantify the privacy-utility trade-offs for various policies analytically and empirically on real datasets.

179 citations


Cites methods from "Differentially private histogram pu..."

  • ...Various hierarchical methods have been proposed in the literature [9, 19, 15, 20, 18]....

    [...]

Journal ArticleDOI
01 Sep 2013
TL;DR: This paper examines the factors affecting the accuracy of hierarchical approaches by studying the mean squared error (MSE) when answering range queries, and analyzes how the MSE changes with different branching factors, after employing constrained inference, and with different methods to allocate the privacy budget among hierarchy levels.
Abstract: In recent years, many approaches to differentially privately publish histograms have been proposed. Several approaches rely on constructing tree structures in order to decrease the error when answer large range queries. In this paper, we examine the factors affecting the accuracy of hierarchical approaches by studying the mean squared error (MSE) when answering range queries. We start with one-dimensional histograms, and analyze how the MSE changes with different branching factors, after employing constrained inference, and with different methods to allocate the privacy budget among hierarchy levels. Our analysis and experimental results show that combining the choice of a good branching factor with constrained inference outperform the current state of the art. Finally, we extend our analysis to multi-dimensional histograms. We show that the benefits from employing hierarchical methods beyond a single dimension are significantly diminished, and when there are 3 or more dimensions, it is almost always better to use the Flat method instead of a hierarchy.

174 citations


Cites methods from "Differentially private histogram pu..."

  • ...[23] propose an alternative non-hierarchical mechanism for publishing histograms that can improve upon the Flat method....

    [...]

  • ...In order to generate the optimal histogram structure, [23] proposes using dynamic programming techniques....

    [...]

References
More filters
Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition,this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition,Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity,and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition,this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further,the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm,a design technique,an application area,or a related topic. The chapters are not dependent on one another,so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally,the new edition offers a 25% increase over the first edition in the number of problems,giving the book 155 problems and over 900 exercises thatreinforcethe concepts the students are learning.

21,651 citations

Book ChapterDOI
04 Mar 2006
TL;DR: In this article, the authors show that for several particular applications substantially less noise is needed than was previously understood to be the case, and also show the separation results showing the increased value of interactive sanitization mechanisms over non-interactive.
Abstract: We continue a line of research initiated in [10,11]on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = ∑ig(xi), where xi denotes the ith row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive.

6,211 citations

Journal Article
TL;DR: The study is extended to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f, which is the amount that any single argument to f can change its output.
Abstract: We continue a line of research initiated in [10, 11] on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = Σ i g(x i ), where x i denotes the ith row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive.

3,629 citations


"Differentially private histogram pu..." refers background or methods in this paper

  • ...This section experimentally compares the effectiveness of the proposed solutions NoiseFirst and StructureFirst for rangecount queries with three state-of-the-art methods, referred to as Dwork [3], Privelet [8] and Boost [6]....

    [...]

  • ...First, Dwork and NoiseFirst have significant advantages compared to the other solutions on unit-length range count queries....

    [...]

  • ...The recently proposed concept of differential privacy (DP) [3]–[8] addresses the above issues by injecting a small amount of random noise into statistical results....

    [...]

  • ...In the first step, the algorithm follows the standard solution to differential privacy [3] by adding Laplace noise to each xi in the original count sequence, outputting a randomized sequence...

    [...]

  • ...Especially, on NetTrace with = 0:1, Median-NF is over 6 times better than Dwork in terms of squared error....

    [...]

Book ChapterDOI
Cynthia Dwork1
25 Apr 2008
TL;DR: This survey recalls the definition of differential privacy and two basic techniques for achieving it, and shows some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
Abstract: Over the past five years a new approach to privacy-preserving data analysis has born fruit [13, 18, 7, 19, 5, 37, 35, 8, 32]. This approach differs from much (but not all!) of the related literature in the statistics, databases, theory, and cryptography communities, in that a formal and ad omnia privacy guarantee is defined, and the data analysis techniques presented are rigorously proved to satisfy the guarantee. The key privacy guarantee that has emerged is differential privacy. Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database. In this survey, we recall the definition of differential privacy and two basic techniques for achieving it. We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.

3,314 citations

Book
01 Jan 2001
TL;DR: The complexity class P is formally defined as the set of concrete decision problems that are polynomial-time solvable, and encodings are used to map abstract problems to concrete problems.
Abstract: problems To understand the class of polynomial-time solvable problems, we must first have a formal notion of what a "problem" is. We define an abstract problem Q to be a binary relation on a set I of problem instances and a set S of problem solutions. For example, an instance for SHORTEST-PATH is a triple consisting of a graph and two vertices. A solution is a sequence of vertices in the graph, with perhaps the empty sequence denoting that no path exists. The problem SHORTEST-PATH itself is the relation that associates each instance of a graph and two vertices with a shortest path in the graph that connects the two vertices. Since shortest paths are not necessarily unique, a given problem instance may have more than one solution. This formulation of an abstract problem is more general than is required for our purposes. As we saw above, the theory of NP-completeness restricts attention to decision problems: those having a yes/no solution. In this case, we can view an abstract decision problem as a function that maps the instance set I to the solution set {0, 1}. For example, a decision problem related to SHORTEST-PATH is the problem PATH that we saw earlier. If i = G, u, v, k is an instance of the decision problem PATH, then PATH(i) = 1 (yes) if a shortest path from u to v has at most k edges, and PATH(i) = 0 (no) otherwise. Many abstract problems are not decision problems, but rather optimization problems, in which some value must be minimized or maximized. As we saw above, however, it is usually a simple matter to recast an optimization problem as a decision problem that is no harder. Encodings If a computer program is to solve an abstract problem, problem instances must be represented in a way that the program understands. An encoding of a set S of abstract objects is a mapping e from S to the set of binary strings. For example, we are all familiar with encoding the natural numbers N = {0, 1, 2, 3, 4,...} as the strings {0, 1, 10, 11, 100,...}. Using this encoding, e(17) = 10001. Anyone who has looked at computer representations of keyboard characters is familiar with either the ASCII or EBCDIC codes. In the ASCII code, the encoding of A is 1000001. Even a compound object can be encoded as a binary string by combining the representations of its constituent parts. Polygons, graphs, functions, ordered pairs, programs-all can be encoded as binary strings. Thus, a computer algorithm that "solves" some abstract decision problem actually takes an encoding of a problem instance as input. We call a problem whose instance set is the set of binary strings a concrete problem. We say that an algorithm solves a concrete problem in time O(T (n)) if, when it is provided a problem instance i of length n = |i|, the algorithm can produce the solution in O(T (n)) time. A concrete problem is polynomial-time solvable, therefore, if there exists an algorithm to solve it in time O(n) for some constant k. We can now formally define the complexity class P as the set of concrete decision problems that are polynomial-time solvable. We can use encodings to map abstract problems to concrete problems. Given an abstract decision problem Q mapping an instance set I to {0, 1}, an encoding e : I → {0, 1}* can be used to induce a related concrete decision problem, which we denote by e(Q). If the solution to an abstract-problem instance i I is Q(i) {0, 1}, then the solution to the concreteproblem instance e(i) {0, 1}* is also Q(i). As a technicality, there may be some binary strings that represent no meaningful abstract-problem instance. For convenience, we shall assume that any such string is mapped arbitrarily to 0. Thus, the concrete problem produces the same solutions as the abstract problem on binary-string instances that represent the encodings of abstract-problem instances. We would like to extend the definition of polynomial-time solvability from concrete problems to abstract problems by using encodings as the bridge, but we would like the definition to be independent of any particular encoding. That is, the efficiency of solving a problem should not depend on how the problem is encoded. Unfortunately, it depends quite heavily on the encoding. For example, suppose that an integer k is to be provided as the sole input to an algorithm, and suppose that the running time of the algorithm is Θ(k). If the integer k is provided in unary-a string of k 1's-then the running time of the algorithm is O(n) on length-n inputs, which is polynomial time. If we use the more natural binary representation of the integer k, however, then the input length is n = ⌊lg k⌋ + 1. In this case, the running time of the algorithm is Θ (k) = Θ(2), which is exponential in the size of the input. Thus, depending on the encoding, the algorithm runs in either polynomial or superpolynomial time. The encoding of an abstract problem is therefore quite important to our under-standing of polynomial time. We cannot really talk about solving an abstract problem without first specifying an encoding. Nevertheless, in practice, if we rule out "expensive" encodings such as unary ones, the actual encoding of a problem makes little difference to whether the problem can be solved in polynomial time. For example, representing integers in base 3 instead of binary has no effect on whether a problem is solvable in polynomial time, since an integer represented in base 3 can be converted to an integer represented in base 2 in polynomial time. We say that a function f : {0, 1}* → {0,1}* is polynomial-time computable if there exists a polynomial-time algorithm A that, given any input x {0, 1}*, produces as output f (x). For some set I of problem instances, we say that two encodings e1 and e2 are polynomially related if there exist two polynomial-time computable functions f12 and f21 such that for any i I , we have f12(e1(i)) = e2(i) and f21(e2(i)) = e1(i). That is, the encoding e2(i) can be computed from the encoding e1(i) by a polynomial-time algorithm, and vice versa. If two encodings e1 and e2 of an abstract problem are polynomially related, whether the problem is polynomial-time solvable or not is independent of which encoding we use, as the following lemma shows. Lemma 34.1 Let Q be an abstract decision problem on an instance set I , and let e1 and e2 be polynomially related encodings on I . Then, e1(Q) P if and only if e2(Q) P. Proof We need only prove the forward direction, since the backward direction is symmetric. Suppose, therefore, that e1(Q) can be solved in time O(nk) for some constant k. Further, suppose that for any problem instance i, the encoding e1(i) can be computed from the encoding e2(i) in time O(n) for some constant c, where n = |e2(i)|. To solve problem e2(Q), on input e2(i), we first compute e1(i) and then run the algorithm for e1(Q) on e1(i). How long does this take? The conversion of encodings takes time O(n), and therefore |e1(i)| = O(n), since the output of a serial computer cannot be longer than its running time. Solving the problem on e1(i) takes time O(|e1(i)|) = O(n), which is polynomial since both c and k are constants. Thus, whether an abstract problem has its instances encoded in binary or base 3 does not affect its "complexity," that is, whether it is polynomial-time solvable or not, but if instances are encoded in unary, its complexity may change. In order to be able to converse in an encoding-independent fashion, we shall generally assume that problem instances are encoded in any reasonable, concise fashion, unless we specifically say otherwise. To be precise, we shall assume that the encoding of an integer is polynomially related to its binary representation, and that the encoding of a finite set is polynomially related to its encoding as a list of its elements, enclosed in braces and separated by commas. (ASCII is one such encoding scheme.) With such a "standard" encoding in hand, we can derive reasonable encodings of other mathematical objects, such as tuples, graphs, and formulas. To denote the standard encoding of an object, we shall enclose the object in angle braces. Thus, G denotes the standard encoding of a graph G. As long as we implicitly use an encoding that is polynomially related to this standard encoding, we can talk directly about abstract problems without reference to any particular encoding, knowing that the choice of encoding has no effect on whether the abstract problem is polynomial-time solvable. Henceforth, we shall generally assume that all problem instances are binary strings encoded using the standard encoding, unless we explicitly specify the contrary. We shall also typically neglect the distinction between abstract and concrete problems. The reader should watch out for problems that arise in practice, however, in which a standard encoding is not obvious and the encoding does make a difference. A formal-language framework One of the convenient aspects of focusing on decision problems is that they make it easy to use the machinery of formal-language theory. It is worthwhile at this point to review some definitions from that theory. An alphabet Σ is a finite set of symbols. A language L over Σ is any set of strings made up of symbols from Σ. For example, if Σ = {0, 1}, the set L = {10, 11, 101, 111, 1011, 1101, 10001,...} is the language of binary representations of prime numbers. We denote the empty string by ε, and the empty language by Ø. The language of all strings over Σ is denoted Σ*. For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000,...} is the set of all binary strings. Every language L over Σ is a subset of Σ*. There are a variety of operations on languages. Set-theoretic operations, such as union and intersection, follow directly from the set-theoretic definitions. We define the complement of L by . The concatenation of two languages L1 and L2 is the language L = {x1x2 : x1 L1 and x2 L2}. The closure or Kleene star of a language L is the language L*= {ε} L L L ···, where Lk is the language obtained by

2,817 citations