
Showing papers in "Transactions on Data Privacy in 2012"


Journal ArticleDOI
TL;DR: In this paper, the problem of constructing private classifiers using decision trees, within the framework of differential privacy, was studied and a differentially private decision tree ensemble algorithm based on random decision trees was proposed.
Abstract: In this paper, we study the problem of constructing private classifiers using decision trees, within the framework of differential privacy. We first present experimental evidence that creating a differentially private ID3 tree using differentially private low-level queries does not simultaneously provide good privacy and good accuracy, particularly for small datasets. In search of better privacy and accuracy, we then present a differentially private decision tree ensemble algorithm based on random decision trees. We demonstrate experimentally that this approach yields good prediction while maintaining good privacy, even for small datasets. We also present differentially private extensions of our algorithm to two settings: (1) new data is periodically appended to an existing database and (2) the database is horizontally or vertically partitioned between multiple users.

166 citations
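
The random-tree construction lends itself to a compact illustration: because the tree structure is chosen independently of the data, only the leaf class counts consume privacy budget. Below is a minimal Python sketch of one such tree and ensemble; the binary attributes, the uniform budget split, and all identifiers are illustrative assumptions, not the authors' code.

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def build_random_tree(attributes, depth):
    """Choose split attributes at random, independently of the data,
    so only the leaf counts consume privacy budget."""
    if depth == 0 or not attributes:
        return {"counts": {}}                       # leaf: class -> count
    attr = random.choice(attributes)
    rest = [a for a in attributes if a != attr]
    return {"attr": attr,
            "children": {v: build_random_tree(rest, depth - 1)
                         for v in (0, 1)}}          # assume binary attributes

def route(tree, record):
    while "attr" in tree:
        tree = tree["children"][record[tree["attr"]]]
    return tree

def iter_leaves(tree):
    if "attr" not in tree:
        yield tree
    else:
        for child in tree["children"].values():
            yield from iter_leaves(child)

def fit_private(tree, data, labels, classes, epsilon):
    for record, y in zip(data, labels):
        leaf = route(tree, record)
        leaf["counts"][y] = leaf["counts"].get(y, 0) + 1
    # Each record increments exactly one leaf cell, so sensitivity is 1;
    # every (leaf, class) cell, including zero cells, must be noised.
    for leaf in iter_leaves(tree):
        for y in classes:
            leaf["counts"][y] = leaf["counts"].get(y, 0) + laplace_noise(1.0 / epsilon)

def predict(ensemble, record, classes):
    """Sum noisy leaf counts across the trees and take the argmax."""
    totals = {c: sum(route(t, record)["counts"][c] for t in ensemble)
              for c in classes}
    return max(totals, key=totals.get)

attrs = ["smoker", "over_40"]                       # invented binary features
data = [{"smoker": 1, "over_40": 0}, {"smoker": 0, "over_40": 1}]
labels = ["sick", "well"]
ensemble = [build_random_tree(attrs, depth=2) for _ in range(5)]
for t in ensemble:    # every tree sees all records: split the total budget
    fit_private(t, data, labels, ["sick", "well"], epsilon=1.0 / 5)
print(predict(ensemble, {"smoker": 1, "over_40": 0}, ["sick", "well"]))
```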


Journal Article
TL;DR: An alternative disclosure risk assessment approach is presented that integrates some of the strong confidentiality protection features in ϵ-differential privacy with the interpretability and data-specific nature of probabilistic disclosure risk measures.
Abstract: We compare the disclosure risk criterion of ϵ-differential privacy with a criterion based on probabilities that intruders uncover actual values given the released data. To do so, we generate fully synthetic data that satisfy ϵ-differential privacy at different levels of ϵ, make assumptions about the information available to intruders, and compute posterior probabilities of uncovering true values. The simulation results suggest that the two paradigms are not easily reconciled, since differential privacy is agnostic to the specific values in the observed data whereas probabilistic disclosure risk measures depend greatly on them. The results also suggest, perhaps surprisingly, that probabilistic disclosure risk measures can be small even when ϵ is large. Motivated by these findings, we present an alternative disclosure risk assessment approach that integrates some of the strong confidentiality protection features in ϵ-differential privacy with the interpretability and data-specific nature of probabilistic disclosure risk measures.

74 citations
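
The point that differential privacy is agnostic to the observed values can be made concrete in a few lines of Python: for the Laplace mechanism, the likelihood ratio of any released value under two neighbouring count databases is bounded by exp(ϵ), no matter what the counts are. The numbers below are illustrative assumptions, not the paper's simulation.

```python
import math

def laplace_pdf(x, mu, scale):
    """Density of Laplace(mu, scale) at x."""
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

def likelihood_ratio(released, count_a, count_b, epsilon):
    """count_a and count_b are true counts of neighbouring databases
    (they differ by one record, so the count query has sensitivity 1)."""
    scale = 1.0 / epsilon
    return laplace_pdf(released, count_a, scale) / laplace_pdf(released, count_b, scale)

for eps in (0.1, 1.0, 5.0):
    r = likelihood_ratio(released=42.7, count_a=42, count_b=43, epsilon=eps)
    print(f"epsilon={eps}: ratio={r:.3f} <= bound exp(eps)={math.exp(eps):.1f}")
```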


Journal Article
TL;DR: A novel information theoretic approach to text sanitization is provided and efficient heuristics to sanitize text documents are developed to protect sensitive and identifying information.
Abstract: De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, and it can needlessly harm the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the removal of obvious identifiers. Observe that a diagnosis of "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be hidden behind more general but semantically related terms, protecting sensitive and identifying information without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information-theoretic approach to text sanitization and to develop efficient heuristics to sanitize text documents.

56 citations
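
As a toy illustration of generalization-driven sanitization (not the paper's heuristics), the following Python sketch climbs an assumed hypernym hierarchy until a term's support clears a threshold, charging an information-theoretic cost in bits for each replacement; the hierarchy, supports, and threshold are all invented for the example.

```python
import math

# Assumed hypernym hierarchy and record supports; both invented.
HYPERNYMS = {"tuberculosis": "infectious disease",
             "infectious disease": "disease"}
SUPPORT = {"tuberculosis": 12, "infectious disease": 340, "disease": 5100}

def generalize(term, min_support):
    """Climb the hierarchy until the term is shared by enough records
    that it no longer singles out a small group."""
    bits_lost = 0.0
    while SUPPORT[term] < min_support and term in HYPERNYMS:
        parent = HYPERNYMS[term]
        # Information-theoretic cost of the replacement, in bits.
        bits_lost += math.log2(SUPPORT[parent] / SUPPORT[term])
        term = parent
    return term, bits_lost

term, cost = generalize("tuberculosis", min_support=100)
print(term, f"({cost:.2f} bits of specificity given up)")
```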


Journal Article
TL;DR: It is shown how k-concealment can also be enhanced by a security measure such as p-sensitivity or l-diversity, and the usefulness of the models and algorithms presented is demonstrated through extensive experiments.
Abstract: We introduce a new model of k-type anonymity, called k-concealment, as an alternative to the well-known model of k-anonymity. The new model achieves privacy goals similar to those of k-anonymity: while k-anonymity generalizes the table records so that each one becomes equal to at least k-1 other records when projected on the subset of quasi-identifiers, k-concealment proposes to generalize the table records so that each one becomes computationally indistinguishable from at least k-1 others. As the new model extends that of k-anonymity, it offers higher utility. To motivate the new model and to lay the ground for its introduction, we first present three other models, called (1,k)-, (k,1)- and (k,k)-anonymity, which also extend k-anonymity. We characterize the interrelation between the four models and propose algorithms for anonymizing data according to them. Since k-anonymity on its own is insecure, as it may allow adversaries to learn the sensitive information of some individuals, it must be enhanced by a security measure such as p-sensitivity or l-diversity. We show how k-concealment, too, can be enhanced by such measures. We demonstrate the usefulness of our models and algorithms through extensive experiments.

55 citations
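
For reference, plain k-anonymity, on which all four extended models build, reduces to a multiset check on the quasi-identifier projection; k-concealment relaxes the equality below to computational indistinguishability. A minimal sketch with an invented table:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every record's QI projection must occur at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "021**", "age": "30-40", "disease": "flu"},
    {"zip": "021**", "age": "30-40", "disease": "asthma"},
    {"zip": "021**", "age": "30-40", "disease": "flu"},
]
print(is_k_anonymous(rows, quasi_identifiers=("zip", "age"), k=3))  # True
```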


Journal Article
TL;DR: This article considers a new hypothesis, motivated by the fact that when the auxiliary information contains values corresponding to rare attributes, the de-anonymization achieved is stronger; it formalizes this using the notion of the long tail, and gives new theorems expressing the level of de-anonymization in terms of the parameters of the tail of the database D.
Abstract: Consider a database D with records containing the history of individuals' transactions, which has been de-identified, i.e., the variables that uniquely associate records with individuals have been removed from the data. An adversary de-anonymizes D via a linkage attack if, using some auxiliary information about a certain individual in the database, it can determine which record of D corresponds to that individual. One example of this is given in the article Robust De-anonymization of Large Sparse Datasets, by Narayanan and Shmatikov [19], which shows that an anonymized database containing records with ratings of different movies rented by customers of Netflix could in fact be de-anonymized using very little auxiliary information, even with errors. Besides the heuristic de-anonymization of the Netflix database, Narayanan and Shmatikov provide interesting theoretical results about database de-anonymization that an adversary can produce under general conditions. In this article we revisit these theoretical results and develop them further. Our first contribution is to exhibit different simple cases in which the algorithm Scoreboard, meant to produce the theoretical de-anonymization in [19], fails to do so. By requiring 1-sim to be a pseudo-metric, and requiring the algorithm producing the de-anonymization to output a record with minimum support among the candidates, we obtain and prove de-anonymization results similar to those described in [19]. We then consider a new hypothesis, motivated by the fact (observed in heuristic de-anonymizations) that when the auxiliary information contains values corresponding to rare attributes, the de-anonymization achieved is stronger. We formalize this using the notion of the long tail [4], and give new theorems expressing the level of de-anonymization in terms of the parameters of the tail of the database D. The improvement in the de-anonymization is reflected in the fact that when at least one value in the auxiliary information corresponds to a rare attribute of D, the size of the auxiliary information can be reduced by about 50%, provided that D has a long tail. We then explore a microdata file from the Joint Canada/United States Survey of Health 2004 [22], where the records reflect the answers of the survey respondents. While many of the variables are related to health issues, some other variables are related to characteristics that individuals may disclose easily, such as physical activities (sports) or demographic characteristics. We perform an experiment with this microdata file and show that, using only some non-sensitive attribute values, it is possible, with a significant probability, to link those values to the corresponding full record.

41 citations
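
The rare-attribute effect can be sketched in a few lines: a Scoreboard-style linkage score that weights matches by the rarity of the matched value lets a single long-tail attribute dominate the linkage. The rarity weighting, the per-attribute frequency table, and the unique-maximum rule below are illustrative simplifications of the paper's formal setting, not its theorems.

```python
import math

def score(record, aux, value_frequency):
    """Sum rarity weights over matching auxiliary values;
    value_frequency[attr] is the assumed frequency of the aux value."""
    return sum(math.log(1.0 / value_frequency[attr])
               for attr, value in aux.items()
               if record.get(attr) == value)

def link(database, aux, value_frequency):
    scored = [(score(r, aux, value_frequency), i)
              for i, r in enumerate(database)]
    best = max(s for s, _ in scored)
    candidates = [i for s, i in scored if s == best]
    # The discussion above additionally requires the output to have
    # minimum support among candidates; here we just demand uniqueness.
    return candidates[0] if len(candidates) == 1 else None

db = [{"sport": "curling", "age": "40s"},
      {"sport": "soccer",  "age": "40s"}]
freq = {"sport": 0.01, "age": 0.30}     # curling assumed rare
print(link(db, {"sport": "curling", "age": "40s"}, freq))   # -> 0
```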


Journal Article
TL;DR: In this paper, a distributed architecture for secure database services is proposed as a solution to this problem where data is stored at multiple servers and the results at the servers are integrated to obtain the answer at the client.
Abstract: The advent of database services has resulted in privacy concerns on the part of clients storing data with third-party database service providers. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. A distributed architecture for secure database services is proposed as a solution to this problem, in which data is stored at multiple servers. The distributed architecture provides both privacy and fault tolerance to the client. In this paper we provide algorithms for (1) distributing data, where our results include hardness-of-approximation results and hence a heuristic greedy algorithm for the distribution problem, and (2) partitioning the query at the client into queries for the servers, implemented by a bottom-up state-based algorithm. Finally, the results at the servers are integrated to obtain the answer at the client. We provide an experimental validation and performance study of our algorithms.

34 citations
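
The flavour of a greedy distribution heuristic (though not the authors' algorithm) can be sketched as follows: place attributes one at a time on the first server that does not complete any privacy constraint, i.e., any set of attributes that must not co-reside in the clear. Constraint and attribute names are invented for the example.

```python
def distribute(attributes, constraints, n_servers):
    """Greedily assign each attribute to a server so that no constraint
    (a set of attributes) ends up entirely on one server."""
    placement = {}  # attribute -> server index
    for attr in attributes:
        for server in range(n_servers):
            trial = dict(placement, **{attr: server})
            # A constraint is violated only if all of its attributes
            # are placed, and placed on the same server.
            violated = any(
                all(a in trial and trial[a] == server for a in c)
                for c in constraints if attr in c)
            if not violated:
                placement[attr] = server
                break
        else:
            return None  # greedy failure: may need more servers
    return placement

attrs = ["name", "dob", "zip", "salary"]
constraints = [{"name", "salary"}, {"dob", "zip"}]  # must not co-reside
print(distribute(attrs, constraints, n_servers=2))
# e.g. {'name': 0, 'dob': 0, 'zip': 1, 'salary': 1}
```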


Journal Article
TL;DR: A novel clustering-based framework for anonymizing transaction data is proposed, which provides the basis for designing algorithms that better preserve data utility, and two anonymization algorithms are developed which explore a larger solution space than existing methods and can satisfy a wide range of privacy requirements.
Abstract: Transaction data about individuals are increasingly collected to support a plethora of applications, ranging from marketing to biomedical studies. Publishing these data is required by many organizations, but may result in privacy breaches if an attacker exploits potentially identifying information to link individuals to their records in the published data. Algorithms that prevent this threat by transforming transaction data prior to their release have been proposed recently, but they may incur significant utility loss due to their inability to: (i) accommodate a range of different privacy requirements that data owners often have, and (ii) guarantee that the produced data will satisfy data owners' utility requirements. To address this issue, we propose a novel clustering-based framework for anonymizing transaction data, which provides the basis for designing algorithms that better preserve data utility. Based on this framework, we develop two anonymization algorithms which explore a larger solution space than existing methods and can satisfy a wide range of privacy requirements. Additionally, the second algorithm allows the specification and enforcement of utility requirements, thereby ensuring that the anonymized data remain useful in intended tasks. Experiments with both benchmark and real medical datasets verify that our algorithms significantly outperform the current state-of-the-art algorithms in terms of data utility, while being comparable in terms of efficiency.

26 citations
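
A highly simplified Python sketch of the clustering idea (not the paper's algorithms): greedily form clusters of at least k transactions by Jaccard similarity and release each cluster's item union, so that records within a cluster are indistinguishable; utility loss grows with the items added to each record. Transactions are assumed non-empty item sets.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def anonymize(transactions, k):
    remaining = list(range(len(transactions)))
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Pull in the k-1 most similar remaining transactions.
        remaining.sort(key=lambda i: -jaccard(transactions[seed],
                                              transactions[i]))
        members = [seed] + [remaining.pop(0) for _ in range(k - 1)]
        clusters.append(members)
    if remaining and clusters:       # fold leftovers into the last cluster
        clusters[-1].extend(remaining)
    released = {}
    for members in clusters:
        union = set().union(*(transactions[i] for i in members))
        for i in members:
            released[i] = union      # crude generalization: the utility
    return released                  # loss is the items added per record

tx = [{"bread", "milk"}, {"bread", "beer"},
      {"milk", "beer"}, {"bread", "milk", "eggs"}]
print(anonymize(tx, k=2))
```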


Journal Article
TL;DR: A novel suite of algorithms called MobiPriv is introduced that addresses the shortcomings of previous work in location and query privacy in mobile systems, and its efficiency and effectiveness are evaluated against previously proposed anonymization approaches.
Abstract: Many mobile phones have a GPS sensor that can report accurate location. Thus, if these location data are not protected adequately, they may cause privacy breaches. Moreover, several reports are available in which persons have been stalked through GPS. The contributions of this paper are twofold. First, we examine privacy issues in snapshot queries, and present our work and results in this area. The proposed method can guarantee that all queries are protected, while previously proposed algorithms achieve only a low success rate in some situations. Next, we discuss continuous queries and illustrate that current snapshot solutions cannot be applied to continuous queries. Then, we present results for our robust models for continuous queries. We introduce a novel suite of algorithms called MobiPriv that addresses the shortcomings of previous work in location and query privacy in mobile systems. We evaluated the efficiency and effectiveness of the MobiPriv scheme against previously proposed anonymization approaches. For our experiments, we utilized real-world traffic volume data, a real-world road network, and mobile users generated realistically by a mobile object generator.

11 citations
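
The guarantee that no query goes unprotected can be illustrated with a dummy-filling sketch: if fewer than k users fall in the cloaking region, synthesize the difference before forwarding the query. MobiPriv generates realistic, traffic-aware dummies; the uniform dummies and coordinates below are crude stand-ins for the example only.

```python
import random

def cloak(query_xy, nearby_users, k, region_radius):
    """Return k candidate locations to send with the query."""
    candidates = [query_xy] + nearby_users[:k - 1]
    while len(candidates) < k:
        # Synthesize a dummy user inside the cloaking region.
        dx = random.uniform(-region_radius, region_radius)
        dy = random.uniform(-region_radius, region_radius)
        candidates.append((query_xy[0] + dx, query_xy[1] + dy))
    random.shuffle(candidates)   # don't reveal which entry is real
    return candidates

print(cloak((40.71, -74.00), nearby_users=[(40.712, -74.002)],
            k=4, region_radius=0.01))
```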


Journal ArticleDOI
TL;DR: An example enables a side-by-side comparison of the outputs of exploratory data analysis and linear regression analysis conducted on a sample business dataset under the statistical disclosure control and remote analysis approaches, and supports the conclusion that the advantages may outweigh the disadvantages in some cases, including for some analyses of unconfidentialised business data.
Abstract: This paper is concerned with the challenge of allowing statistical analysis of confidential business data while maintaining confidentiality. The most widely-used approach to date is statistical disclosure control, which involves modifying or confidentialising data before releasing it to users. Newer proposed approaches include the release of multiply imputed synthetic data in place of the original data, and the use of a remote analysis system enabling users to submit statistical queries and receive output without direct access to data. Most implementations of statistical disclosure control methods to date involve census or survey microdata on individual persons, because existing methods are generally acknowledged to provide inadequate confidentiality protection to business (or enterprise) data. In this paper we seek to compare the statistical disclosure control approach with the remote analysis approach, in the context of protecting the confidentiality of business data in statistical analysis. We provide an example which enables a side-by-side comparison of the outputs of exploratory data analysis and linear regression analysis conducted on a sample business dataset under these two approaches, and provide traditional unconfidentialised results as a standard for comparison. There are certainly advantages and disadvantages in the remote analysis approach and it is unlikely that remote analysis will replace statistical disclosure control methods in all applications. If the disadvantages are judged too serious in a given situation, the analyst may have to seek access to the unconfidentialised dataset. However, our example supports the conclusion that the advantages may outweigh the disadvantages in some cases, including for some analyses of unconfidentialised business data, provided the analyst is aware of the output confidentialisation methods and their potential impact.

10 citations
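
A minimal sketch of the remote-analysis idea: the server computes on the raw data and confidentialises only the output, here by small-cell suppression and rounding of a frequency table. Real systems apply much richer rules (e.g., to regression output); the threshold and rounding base below are assumptions for illustration.

```python
from collections import Counter

def remote_frequency_table(records, attribute, min_cell=5, round_to=3):
    """Analyst submits a query; only confidentialised output returns."""
    counts = Counter(r[attribute] for r in records)
    out = {}
    for value, n in counts.items():
        if n < min_cell:
            out[value] = "suppressed"    # too few units: disclosure risk
        else:
            out[value] = round_to * round(n / round_to)  # blur exact counts
    return out

records = [{"industry": "mining"}] * 2 + [{"industry": "retail"}] * 14
print(remote_frequency_table(records, "industry"))
# {'mining': 'suppressed', 'retail': 15}
```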


Journal Article
TL;DR: A knowledge-model-sharing approach is presented which learns a global knowledge model from pseudo-data generated according to anonymized knowledge models published by local data sources, and which can obtain significantly better predictive models (especially decision trees) than other methods.
Abstract: Privacy-preserving data mining (PPDM) is an important problem and is currently studied via three approaches: the cryptographic approach, data publishing, and model publishing. However, each of these approaches has problems. The cryptographic approach does not protect the privacy of learned knowledge models and may have performance and scalability issues. Data publishing, although popular, may suffer from too much utility loss for certain types of data mining applications. Model publishing lacks efficient algorithms for practical use in a multiple-data-source environment. In this paper, we present a knowledge-model-sharing approach which learns a global knowledge model from pseudo-data generated according to anonymized knowledge models published by local data sources. Specifically, for the anonymization of knowledge models, we present two privacy measures for decision trees and an algorithm that obtains an anonymized decision tree by tree pruning. For the pseudo-data generation, we present an algorithm that generates useful pseudo-data from decision trees. We empirically study our method by comparing it with several PPDM methods that utilize existing techniques, including three methods that publish anonymized data, one method that learns anonymized decision trees directly from the original data, and one method that uses ensemble classification. Our results show that in both single-data-source and multiple-data-source environments, and for several different datasets, predictive models, and utility measures, our method can obtain significantly better predictive models (especially decision trees) than the other methods.

7 citations
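
The pseudo-data step admits a compact sketch: sample a leaf of the published tree in proportion to its (possibly pruned) counts, fix the attribute values on the path to that leaf, and draw the label from the leaf distribution. The tree encoding below is an illustrative assumption, not the paper's format.

```python
import random

# Node = ("node", attr, {value: subtree}); Leaf = ("leaf", {label: count})
TREE = ("node", "smoker",
        {"yes": ("leaf", {"disease": 30, "healthy": 10}),
         "no":  ("leaf", {"disease": 5,  "healthy": 55})})

def leaves_with_paths(tree, path=()):
    if tree[0] == "leaf":
        yield dict(path), tree[1]
    else:
        _, attr, children = tree
        for value, sub in children.items():
            yield from leaves_with_paths(sub, path + ((attr, value),))

def generate_pseudo_data(tree, n):
    leaves = list(leaves_with_paths(tree))
    weights = [sum(counts.values()) for _, counts in leaves]
    records = []
    for _ in range(n):
        # Pick a leaf in proportion to its total count, then a label.
        path, counts = random.choices(leaves, weights)[0]
        labels, freqs = zip(*counts.items())
        records.append(dict(path, label=random.choices(labels, freqs)[0]))
    return records

print(generate_pseudo_data(TREE, 5))
```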


Journal Article
TL;DR: A data utility measurement, called the research value (RV), is proposed, which reflects the importance of a database attribute with respect to the other attributes in a dataset, as well as the significance of the content of the data from a researcher's point of view.
Abstract: As medical data continue to transition to electronic formats, opportunities arise for researchers to use this microdata to discover patterns and increase knowledge that can improve patient care. We propose a data utility measurement, called the research value (RV), which reflects the importance of a database attribute with respect to the other attributes in a dataset, as well as the significance of the content of the data from a researcher's point of view. Our algorithms use these research values to assess an attribute's data utility as they generalize the data to ensure k-anonymity. The proposed algorithms scale efficiently even on datasets with large numbers of attributes.
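
A rough sketch of RV-guided anonymization (not the paper's algorithms): generalize the quasi-identifier with the lowest research value first, one hierarchy level at a time, until the table is k-anonymous. The hierarchy and RV scores below are invented for the example.

```python
from collections import Counter

HIERARCHY = {"zip": {"02138": "021**", "02139": "021**", "021**": "0****"},
             "age": {"34": "30-40", "30-40": "*"}}
RV = {"zip": 0.2, "age": 0.9}    # researchers value age more here

def is_k_anonymous(rows, qis, k):
    groups = Counter(tuple(r[q] for q in qis) for r in rows)
    return all(c >= k for c in groups.values())

def generalize_once(rows, attr):
    return [dict(r, **{attr: HIERARCHY[attr].get(r[attr], r[attr])})
            for r in rows]

def rv_anonymize(rows, qis, k):
    # Sacrifice low-RV attributes first to preserve research utility.
    for attr in sorted(qis, key=lambda a: RV[a]):
        for _ in HIERARCHY[attr]:          # at most all hierarchy levels
            if is_k_anonymous(rows, qis, k):
                return rows
            rows = generalize_once(rows, attr)
    return rows

rows = [{"zip": "02138", "age": "34"}, {"zip": "02139", "age": "34"}]
print(rv_anonymize(rows, qis=("zip", "age"), k=2))
# zip is coarsened; the high-RV age attribute survives untouched
```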

Journal Article
TL;DR: It is argued that P3ERS could contribute to increasing objectivity during the conference review process while maintaining the privacy of the various participants.
Abstract: Even though they integrate some blind-submission functionalities, current conference review systems, such as EasyChair and EDAS, do not fully protect the privacy of authors and reviewers, in particular from the eyes of the program chair. As a consequence, their use may cause a lack of objectivity in the decision process. In this paper, we address this issue by proposing P3ERS (for Privacy-Preserving PEer Review System), a distributed conference review system based on group signatures, which aims at preserving the privacy of all participants involved in the peer review process. For this purpose, we have improved on a generic group signature scheme with revocation features and implemented it in order to ensure the anonymity of the submission and reviewing phases. We argue that P3ERS could contribute to increasing objectivity during the conference review process while maintaining the privacy of the various participants.

Journal Article
TL;DR: This paper introduces a secure, efficient protocol to semantically join sources when the join attributes are long attributes, and provides two secure protocols covering both the scenario in which a training set exists and the scenario in which no training set is available.
Abstract: During the similarity join process, one or more sources may not allow sharing their data with other sources. In this case, a privacy-preserving similarity join is required. We showed in our previous work [4] that using long attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, can improve similarity join accuracy using supervised learning. However, the existing secure protocols for similarity join methods cannot be used to join sources using these long attributes. Moreover, the majority of the existing privacy-preserving protocols do not consider semantic similarities during the similarity join process. In this paper, we introduce a secure, efficient protocol to semantically join sources when the join attributes are long attributes. We provide two secure protocols, one for the scenario in which a training set exists and one for the scenario in which no training set is available. Furthermore, we introduce the multi-label supervised secure protocol and the expandable supervised secure protocol. Results show that our protocols can efficiently join sources using the long attributes by considering the semantic relationships among the long string values, thereby improving the overall secure similarity join performance.
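
The plaintext core of such a semantic join over long attributes can be sketched with scikit-learn: compare TF-IDF vectors of the long text fields by cosine similarity and keep pairs above a threshold. The secure protocols described above wrap a computation like this so that neither source sees the other's raw text; that layer is omitted here, and the threshold and sample texts are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_join(texts_a, texts_b, threshold=0.3):
    """Join long-attribute values across two sources by TF-IDF cosine."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts_a + texts_b)
    sims = cosine_similarity(tfidf[:len(texts_a)], tfidf[len(texts_a):])
    return [(i, j, sims[i, j])
            for i in range(len(texts_a))
            for j in range(len(texts_b))
            if sims[i, j] >= threshold]

a = ["a study of privacy preserving data mining over medical records"]
b = ["privacy preserving mining of medical record databases",
     "a survey of gpu rendering techniques"]
print(similarity_join(a, b))   # only the semantically related pair joins
```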