
Showing papers on "Data publishing published in 2016"


Journal ArticleDOI
TL;DR: An in-depth survey of the state-of-the-art privacy preserving techniques for social network data publishing, metrics for quantifying the anonymity level provided, and information loss as well as challenges and new research directions are presented.
Abstract: The introduction of online social networks (OSN) has transformed the way people connect and interact with each other as well as share information. OSN have led to a tremendous explosion of network-centric data that could be harvested for better understanding of interesting phenomena such as sociological and behavioural aspects of individuals or groups. As a result, online social network service operators are compelled to publish the social network data for use by third party consumers such as researchers and advertisers. As social network data publication is vulnerable to a wide variety of reidentification and disclosure attacks, developing privacy-preserving mechanisms is an active research area. This paper presents a comprehensive survey of the recent developments in social network data publishing: privacy risks, attacks, and privacy-preserving techniques. We survey and present various types of privacy attacks and the information exploited by adversaries to perpetrate privacy attacks on anonymized social network data. We present an in-depth survey of the state-of-the-art privacy-preserving techniques for social network data publishing, metrics for quantifying the anonymity level provided, and information loss, as well as challenges and new research directions. The survey helps readers understand the threats, the various privacy-preserving mechanisms, and their vulnerabilities to privacy breach attacks in social network data publishing, as well as observe common themes and future directions.

108 citations


Proceedings ArticleDOI
Qian Wang, Yan Zhang, Xiao Lu, Zhibo Wang, Zhan Qin, Kui Ren
10 Apr 2016
TL;DR: RescueDP, an online aggregate monitoring scheme over infinite streams with a privacy guarantee, achieves w-event privacy over data generated and published periodically by crowd users, outperforms existing methods, and improves utility with a strong privacy guarantee.
Abstract: Nowadays, gigantic crowd-sourced data collected from mobile phone users have become widely available, enabling many important data mining applications that improve the quality of our daily lives. While providing tremendous benefits, the release of these data to the public poses a considerable threat to mobile users' privacy. To solve this problem, the notion of differential privacy has been proposed to provide privacy with a theoretical guarantee, and recently it has been applied in streaming data publishing. However, most of the existing literature focuses on either event-level privacy on infinite streams or user-level privacy on finite streams. In this paper, we investigate the problem of real-time spatiotemporal crowd-sourced data publishing with privacy preservation. Specifically, we consider continuous publication of population statistics for monitoring purposes and design RescueDP, an online aggregate monitoring scheme over infinite streams with a privacy guarantee. RescueDP's key components include adaptive sampling, adaptive budget allocation, dynamic grouping, perturbation and filtering, which are seamlessly integrated as a whole to provide privacy-preserving statistics publishing over infinite time stamps. We show that RescueDP achieves w-event privacy over data generated and published periodically by crowd users. We evaluate our scheme on real-world as well as synthetic datasets and compare it with two representative w-event-private benchmarks. Experimental results show that our solution outperforms the existing methods and improves utility with a strong privacy guarantee.

76 citations
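
For readers unfamiliar with w-event privacy, a minimal sketch may help: if each timestamp's release spends an equal share ε/w of the budget, then any window of w consecutive timestamps spends at most ε in total. The function below illustrates only this uniform baseline, not RescueDP's adaptive sampling, grouping, or filtering; all names are our own illustrative choices.

```python
import numpy as np

def publish_stream(true_counts, epsilon, w):
    """Release a noisy count at every timestamp while spending at most
    `epsilon` inside any sliding window of `w` timestamps. Each timestamp
    gets the uniform share epsilon / w; counts have sensitivity 1, so
    Laplace noise with scale w / epsilon suffices at each step."""
    scale = w / epsilon                      # 1 / (epsilon / w)
    return [c + np.random.laplace(0.0, scale) for c in true_counts]

# Example: a stream of population counts published under (w=3)-event privacy.
noisy = publish_stream([120, 133, 128, 140, 151], epsilon=1.0, w=3)
print([round(x, 1) for x in noisy])
```

RescueDP improves on this baseline by skipping timestamps whose statistics barely change and reallocating their budget share, which is why it retains more utility at the same ε.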


Journal ArticleDOI
TL;DR: This study analyses the solutions offered by generalist scientific data repositories, i.e., repositories supporting the deposition of any type of research data, and finds that these repositories implement well-consolidated practices and pragmatic solutions inherited from literature repositories.
Abstract: Research data publishing is intended as the release of research data to make it possible for practitioners to (re)use them according to "open science" dynamics. Three main actors deal with research data publishing practices: researchers, publishers, and data repositories. This study analyses the solutions offered by generalist scientific data repositories, i.e., repositories supporting the deposition of any type of research data. These repositories cannot make any assumption on the application domain and must therefore accommodate the almost open-ended variety of data types used in science. The current practices promoted by such repositories are analysed with respect to eight key aspects of data publishing: dataset formatting, documentation, licensing, publication costs, validation, availability, discovery and access, and citation. From this analysis it emerges that these repositories implement well-consolidated practices and pragmatic solutions inherited from literature repositories. These practices and solutions cannot fully meet the needs of managing and using dataset resources, especially in a context where rapid technological changes continuously open new exploitation prospects.

74 citations


Journal ArticleDOI
TL;DR: PPTD aims to strike a balance between the conflicting goals of data utility and data privacy in accordance with the privacy requirements of moving objects and can significantly improve the data utility of anonymized trajectory databases when compared with previous work in the literature.
Abstract: Trajectory data often provide useful information that can be used in real-life applications, such as traffic management, Geo-marketing, and location-based advertising. However, a trajectory database may contain detailed information about moving objects and associate them with sensitive attributes, such as disease, job, and income. Therefore, improper publishing of the trajectory database can put the privacy of moving objects at risk, especially when an adversary uses partial trajectory information as its background knowledge. The existing approaches for privacy preservation in trajectory data publishing provide the same privacy protection for all moving objects. The consequence is that some moving objects may be offered insufficient privacy protection, while some others may not require high privacy protection. In this paper, we address this problem and present PPTD, a novel approach for preserving privacy in trajectory data publishing based on the concept of personalized privacy. It aims to strike a balance between the conflicting goals of data utility and data privacy in accordance with the privacy requirements of moving objects. To the best of our knowledge, this is the first paper that combines sensitive attribute generalization and trajectory local suppression to achieve a tailored personalized privacy model for trajectory data publishing. Our experiments on two synthetic trajectory datasets suggest that PPTD is effective for preserving personalized privacy in trajectory data publishing. In particular, PPTD can significantly improve the data utility of anonymized trajectory databases when compared with previous work in the literature.

70 citations
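
As a rough illustration of the local-suppression half of the idea (sensitive attribute generalization is omitted), the sketch below drops any trajectory point whose support falls below its owner's personal anonymity requirement. The function and variable names are ours, not the paper's, and PPTD itself chooses suppressions far more carefully to preserve utility.

```python
from collections import Counter

def personalized_suppress(trajectories, k_req):
    """Drop every (location, time) point whose support (number of distinct
    trajectories containing it) is below the owner's personal requirement.
    `trajectories`: object id -> list of (loc, t); `k_req`: object id -> k."""
    support = Counter(p for traj in trajectories.values() for p in set(traj))
    return {oid: [p for p in traj if support[p] >= k_req[oid]]
            for oid, traj in trajectories.items()}

trajs = {"u1": [("a", 1), ("b", 2)], "u2": [("a", 1), ("c", 2)], "u3": [("a", 1)]}
print(personalized_suppress(trajs, {"u1": 2, "u2": 3, "u3": 1}))
```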


Journal ArticleDOI
22 Aug 2016 - PeerJ
TL;DR: This article proposes to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies, and presents a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data.
Abstract: Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there currently exist no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.

67 citations


Proceedings ArticleDOI
14 Jun 2016
TL;DR: The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts.
Abstract: We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.

66 citations


Proceedings ArticleDOI
Haina Ye, Xinzhou Cheng, Mingqiang Yuan, Lexi Xu, Jie Gao, Chen Cheng
01 Sep 2016
TL;DR: The effects of characteristics of big data on information security and privacy are described and privacy-preserving trajectory data publishing is studied due to its future utilization, especially in telecom operation.
Abstract: Big data has been attracting growing interest in both scientific and industrial fields for its potential value. However, before deploying big data technology in large-scale applications, a basic yet fundamental topic must be investigated: security and privacy. In this paper, recent research and development on security and privacy in big data is surveyed. First, the effects of the characteristics of big data on information security and privacy are described. Then, topics and issues in security are discussed and reviewed. Further, privacy-preserving trajectory data publishing is studied due to its future utilization, especially in telecom operation.

56 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: The key idea of PrivCheck is to obfuscate user check-in data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which ensures the utility of the obfuscated data to empower personalized LBSs.
Abstract: With the widespread adoption of smartphones, we have observed an increasing popularity of Location-Based Services (LBSs) in the past decade. To improve user experience, LBSs often provide personalized recommendations to users by mining their activity (i.e., check-in) data from location-based social networks. However, releasing user check-in data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the users' check-in data. In this paper, we propose PrivCheck, a customizable and continuous privacy-preserving check-in data publishing framework providing users with continuous privacy protection against inference attacks. The key idea of PrivCheck is to obfuscate user check-in data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which ensures the utility of the obfuscated data to empower personalized LBSs. Since users often give LBS providers access to both their historical check-in data and future check-in streams, we develop two data obfuscation methods for historical and online check-in publishing, respectively. An empirical evaluation on two real-world datasets shows that our framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized LBSs.

46 citations
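
A toy version of obfuscation under a distortion budget can be sketched as follows: swap each check-in for another venue of the same category with probability equal to the budget. This is only a hedged illustration with invented names; PrivCheck solves an optimization that explicitly minimizes leakage of the user-specified private attribute, which this uniform random swap does not.

```python
import random

def obfuscate_checkins(checkins, same_category, budget):
    """With probability `budget`, replace a check-in venue by a random venue
    of the same category; otherwise publish it unchanged. Keeping the
    category retains coarse utility for recommendation while blurring the
    exact venues an attribute-inference attack would exploit."""
    return [random.choice(same_category[v]) if random.random() < budget else v
            for v in checkins]

cats = {"cafe_A": ["cafe_B", "cafe_C"], "gym_X": ["gym_Y"],
        "cafe_B": ["cafe_A", "cafe_C"]}
print(obfuscate_checkins(["cafe_A", "gym_X", "cafe_B"], cats, budget=0.3))
```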


01 Jan 2016
TL;DR: Record for the book Introduction to Privacy Preserving Data Publishing: Concepts and Techniques.

37 citations


Journal ArticleDOI
TL;DR: This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks featured by any data correlations between QIs and sensitive attribute values as the adversary's background knowledge.
Abstract: We define a privacy model based on k-anonymity and one of its strong refinements to prevent the background-knowledge attack. We propose two hierarchical anonymization algorithms to satisfy our privacy model; they outperform the state-of-the-art anonymization algorithms in terms of utility and privacy. We also extend an information loss measure to capture data inaccuracies caused by records that do not fit in any equivalence class. Preserving privacy in the presence of an adversary's background knowledge is very important in data publishing. The k-anonymity model, while protecting identity, does not protect against attribute disclosure. One strong refinement of k-anonymity, β-likeness, does not protect against identity disclosure. Neither model protects against attacks exploiting background knowledge. This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks that exploit, as the adversary's background knowledge, any data correlations between QIs and sensitive attribute values. In particular, two hierarchical anonymization algorithms are proposed. Both algorithms apply agglomerative clustering techniques in their first stage in order to generate clusters of records whose probability distributions, as extracted from background knowledge, are similar. In the next phase, k-anonymity and β-likeness are enforced in order to prevent identity and attribute disclosures. Our extensive experiments demonstrate that the proposed algorithms outperform other state-of-the-art anonymization algorithms in terms of privacy and data utility, with fewer unpublished records than the others. Because well-known information loss metrics fail to precisely measure the inaccuracies stemming from the removal of records that cannot be published in any equivalence class, this research also introduces an extension of the Global Certainty Penalty metric that accounts for unpublished records.

31 citations
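
The β-likeness condition itself is easy to state in code: within every equivalence class, no sensitive value's frequency may exceed its table-wide frequency by a relative gain of more than β. The checker below is our own illustrative code, using the basic positive-gain form of β-likeness, and verifies k-anonymity alongside it.

```python
from collections import Counter

def satisfies_k_beta(classes, k, beta):
    """`classes`: one list of sensitive values per equivalence class. Checks
    k-anonymity (every class has >= k records) and basic beta-likeness: for
    each sensitive value, the in-class frequency p may exceed the table-wide
    frequency q by a relative gain of at most beta, i.e. (p - q) / q <= beta."""
    table = [s for cls in classes for s in cls]
    overall, n = Counter(table), len(table)
    for cls in classes:
        if len(cls) < k:
            return False
        for s, c in Counter(cls).items():
            p, q = c / len(cls), overall[s] / n
            if (p - q) / q > beta:
                return False
    return True

print(satisfies_k_beta([["flu", "flu", "cold"], ["cold", "cold", "flu"]], k=3, beta=1.0))
```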


Proceedings ArticleDOI
05 Dec 2016
TL;DR: This paper proposes two multi-cloud-based outsourced-ABE schemes, namely the parallel-cloud ABE and the chain- cloud ABE, which enable the receivers to partially outsource the computationally expensive decryption operations to the clouds, while preventing user attributes from being disclosed.
Abstract: With the increased popularity of ubiquitous computing and connectivity, the Internet of Things (IoT) also introduces new vulnerabilities and attack vectors. While secure data collection (i.e. the upward link) has been well studied in the literature, secure data dissemination (i.e. the downward link) remains an open problem. Attribute-based encryption (ABE) and outsourced ABE have been used for secure message distribution in IoT; however, existing mechanisms suffer from extensive computation and/or privacy issues. In this paper, we explore the problem of privacy-preserving targeted broadcast in IoT. We propose two multi-cloud-based outsourced-ABE schemes, namely the parallel-cloud ABE and the chain-cloud ABE, which enable the receivers to partially outsource the computationally expensive decryption operations to the clouds, while preventing user attributes from being disclosed. In particular, the proposed solution protects three types of privacy (i.e., data, attribute and access policy privacy) by enforcing collaborations among multiple clouds. Our schemes also provide delegation verifiability, allowing the receivers to verify whether the clouds have faithfully performed the outsourced operations. We extensively analyze the security guarantees of the proposed mechanisms and demonstrate the effectiveness and efficiency of our schemes with simulated resource-constrained IoT devices, which outsource operations to Amazon EC2 and Microsoft Azure.

Journal ArticleDOI
TL;DR: This work describes the process of migrating a database of notable relevance to the plant sciences—the Pathogen-Host Interaction Database (PHI-base)—to a form that conforms to each of the FAIR Principles, and demonstrates the utility of providing explicit and reliable access to provenance information.
Abstract: Pathogen-Host interaction data is core to our understanding of disease processes and their molecular/genetic bases. Facile access to such core data is particularly important for the plant sciences, where individual genetic and phenotypic observations have the added complexity of being dispersed over a wide diversity of plant species versus the relatively fewer host species of interest to biomedical researchers. Recently, an international initiative interested in scholarly data publishing proposed that all scientific data should be “FAIR” - Findable, Accessible, Interoperable, and Reusable. In this work, we describe the process of migrating a database of notable relevance to the plant sciences - the Pathogen-Host Interaction Database (PHI-base) - to a form that conforms to each of the FAIR Principles. We discuss the technical and architectural decisions, and the migration pathway, including observations of the difficulty and/or fidelity of each step. We examine how multiple FAIR principles can be addressed simultaneously through careful design decisions, including making data FAIR for both humans and machines with minimal duplication of effort. We note how FAIR data publishing involves more than data reformatting, requiring features beyond those exhibited by most life science Semantic Web or Linked Data resources. We explore the value-added by completing this FAIR data transformation, and then test the result through integrative questions that could not easily be asked over traditional Web-based data resources. Finally, we demonstrate the utility of providing explicit and reliable access to provenance information, which we argue enhances citation rates by encouraging and facilitating transparent scholarly reuse of these valuable data holdings.

Journal ArticleDOI
TL;DR: A hybrid algorithm, which combines sampling, perturbation and generalization to protect data privacy from composition attacks, is proposed; experiments demonstrate that the proposed anonymization technique significantly reduces the risk of composition attacks while preserving good data utility.

Journal ArticleDOI
TL;DR: To increase the privacy of published data in the sliced tables, a new method called value swapping is proposed in this work, aimed at decreasing the attribute disclosure risk for the absolute facts and ensuring the l-diverse slicing.
Abstract: Privacy is an important concern in society, and it is a fundamental issue when analyzing and publishing data involving individuals' sensitive information. Recently, the slicing method has been popularly used for privacy preservation in data publishing, because of its potential for preserving more data utility than approaches such as generalization and bucketization. However, in this paper, we show that the slicing method carries disclosure risks for some absolute facts, which would help the adversary find invalid records in the sliced microdata table, resulting in a breach of individual privacy. To increase the privacy of published data in sliced tables, a new method called value swapping is proposed in this work, aimed at decreasing the attribute disclosure risk for absolute facts and ensuring l-diverse slicing. With value swapping, the published table contains no invalid information, so the adversary cannot breach individual privacy. Experimental results also show that the new method is able to keep more data utility than the existing slicing methods in a published microdata table.
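
To make the slicing step concrete, here is a minimal sketch of within-bucket permutation: each column group's value tuples are shuffled independently across the rows of a bucket, breaking the linkage between groups. The value-swapping refinement proposed in the abstract would additionally constrain these permutations so that no absolute fact exposes an invalid record; names below are illustrative.

```python
import random

def slice_bucket(bucket, column_groups):
    """Within one bucket (a list of row dicts), independently shuffle each
    column group's value tuples across rows, severing cross-group linkage.
    Example groups: [["age", "zip"], ["disease"]]."""
    rows = [dict(r) for r in bucket]
    for group in column_groups:
        tuples = [[r[a] for a in group] for r in bucket]
        random.shuffle(tuples)
        for row, vals in zip(rows, tuples):
            row.update(zip(group, vals))
    return rows

bucket = [{"age": 25, "zip": "47906", "disease": "flu"},
          {"age": 29, "zip": "47907", "disease": "cold"}]
print(slice_bucket(bucket, [["age", "zip"], ["disease"]]))
```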

Journal ArticleDOI
Jingyu Hua, An Tang, Yixin Fang, Zhenyu Shen, Sheng Zhong
TL;DR: A privacy-preserving utility verification mechanism based upon cryptographic technique for DiffPart-a differentially private scheme designed for set-valued data that can measure the data utility based upon the encrypted frequencies of the aggregated raw data instead of the plain values, which thus prevents privacy breach.
Abstract: In the problem of privacy-preserving collaborative data publishing, a central data publisher is responsible for aggregating sensitive data from multiple parties and then anonymizing it before publishing it for data mining. In such scenarios, the data users may have a strong demand to measure the utility of the published data, since most anonymization techniques have side effects on data utility. Nevertheless, this task is non-trivial, because utility measurement usually requires the aggregated raw data, which is not revealed to the data users due to privacy concerns. Furthermore, the data publisher may even cheat in the raw data, since no one, including the individual providers, knows the full data set. In this paper, we first propose a privacy-preserving utility verification mechanism based upon cryptographic techniques for DiffPart, a differentially private scheme designed for set-valued data. This proposal can measure the data utility based upon the encrypted frequencies of the aggregated raw data instead of the plain values, which thus prevents privacy breach. Moreover, it can privately check the correctness of the encrypted frequencies provided by the publisher, which helps detect dishonest publishers. We also extend this mechanism to DiffGen, another differentially private publishing scheme designed for relational data. Our theoretical and experimental evaluations demonstrate the security and efficiency of the proposed mechanism.

Book ChapterDOI
29 May 2016
TL;DR: This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.
Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.

Journal ArticleDOI
TL;DR: This work quantitatively shows the perfect and (1 − ε)-perfect de-anonymization conditions of 26 real-world structural data sets, including social networks, collaboration networks, communication networks, autonomous systems, peer-to-peer networks, and so on.
Abstract: In this paper, we study the quantification, practice, and implications of structural data de-anonymization, including social data, mobility traces, and so on. First, we answer several open questions in structural data de-anonymization by quantifying perfect and (1 − ε)-perfect structural data de-anonymization, where ε is the error tolerated by a de-anonymization scheme. To the best of our knowledge, this is the first work on quantifying structural data de-anonymization under a general data model, which closes the gap between structural data de-anonymization practice and theory. Second, we conduct the first large-scale study on the de-anonymizability of 26 real-world structural data sets, including social networks, collaboration networks, communication networks, autonomous systems, peer-to-peer networks, and so on. We also quantitatively show the perfect and (1 − ε)-perfect de-anonymization conditions of the 26 data sets. Third, following our quantification, we present a practical attack: a novel single-phase cold-start optimization-based de-anonymization (ODA) algorithm. An experimental analysis of ODA shows that about 77.7%–83.3% of the users in Gowalla (196,591 users and 950,327 edges) and 86.9%–95.5% of the users in Google+ (4,692,671 users and 90,751,480 edges) are de-anonymizable in different scenarios, which implies that structure-based de-anonymization is powerful in practice. Finally, we discuss the implications of our de-anonymization quantification and our ODA attack and provide some general suggestions for future secure data publishing.

Proceedings Article
12 Feb 2016
TL;DR: In this article, the authors lay the foundations of privacy-preserving data publishing (PPDP) in the context of linked data, and define notions of safe and optimal anonymisations.
Abstract: The widespread adoption of Linked Data has been driven by the increasing demand for information exchange between organisations, as well as by data publishing regulations in domains such as health care and governance. In this setting, sensitive information is at risk of disclosure since published data can be linked with arbitrary external data sources. In this paper we lay the foundations of privacy-preserving data publishing (PPDP) in the context of Linked Data. We consider anonymisations of RDF graphs (and, more generally, relational datasets with labelled nulls) and define notions of safe and optimal anonymisations. Safety ensures that the anonymised data can be published with provable protection guarantees against linking attacks, whereas optimality ensures that it preserves as much information from the original data as possible, while satisfying the safety requirement. We establish the complexity of the underpinning decision problems both under open-world semantics inherent to RDF and a closed-world semantics, where we assume that an attacker has complete knowledge over some part of the original data.

Journal ArticleDOI
TL;DR: A new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks and an anonymization algorithm to sanitize SRS data in accordance with the proposed model is proposed.
Abstract: To facilitate long-term safety surveillance of marketed drugs, many spontaneous reporting systems (SRSs) for ADR events have been established worldwide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, this raises the issue of privacy-preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting the privacy of SRS data, and none of the existing anonymization methods suits SRS datasets, which exhibit characteristics such as rare events, multiple individual records, and multi-valued sensitive attributes. We propose a new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ*, for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy clustering strategy to group records into clusters, conforming to an innovative anonymization metric that aims to minimize privacy risk as well as maintain data utility for ADR detection. An empirical study was conducted using the FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ*: uniform, level-wise, and frequency-based. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals. With all three threshold settings for sensitive values, our method successively prevents the disclosure of sensitive values (nearly all observed DRs are zero) without sacrificing too much data utility. With non-uniform threshold settings, level-wise or frequency-based, our MS(k, θ*)-bounding exhibits the best data utility and the least privacy risk among all the models. The experiments conducted on selected ADR signals from MedWatch show only a very small difference in signal strength (PRR or ROR). The results show that our method can effectively prevent the disclosure of patients' sensitive information without sacrificing data utility for ADR signal detection. In summary, we propose a new privacy model for protecting SRS data that possesses characteristics overlooked by contemporary models, and an anonymization algorithm to sanitize SRS data in accordance with the proposed model. Empirical evaluation on the real SRS dataset, i.e., FAERS, shows that our method can effectively solve the privacy problem in SRS data without influencing the ADR signal strength.

Journal ArticleDOI
TL;DR: A novel methodology, the Cooperative Privacy Game (CoPG), is proposed to achieve data privacy using cooperative game theory; the resulting notion is named Cooperative Privacy (CoP).
Abstract: Achieving data privacy before publishing has become a major concern of researchers, individuals, and service providers. A novel methodology, the Cooperative Privacy Game (CoPG), is proposed to achieve data privacy using cooperative game theory; the resulting notion is named Cooperative Privacy (CoP). The core idea of CoP is that each player preserves his own privacy by playing his best strategy, which in turn contributes to preserving the other players' privacy. CoP considers each tuple as a player, and tuples form coalitions as described in the procedure. The main objective of CoP is to treat each individual's (player's) privacy as a goal in which the other individuals (players) are rationally interested. CoP is formally defined in terms of Nash equilibria, i.e., all players are in their best coalitions, to achieve k-anonymity. The cooperative value of each tuple is measured using the characteristic function of the CoPG to identify the coalitions. As the underlying game is convex, the algorithm is efficient and yields high-quality coalition formation with respect to intensity and dispersion. The efficiency of the anonymization process is calculated using an information loss metric. The variations of the information loss with the parameters α (the weight factor of nearness) and β (multiplicity) are analyzed and the obtained results discussed.

Journal ArticleDOI
TL;DR: A new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS), which uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k- anonymity.
Abstract: In healthcare, there is a vast amount of patients' data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared and hence such information is underused. A new area of research has emerged, called privacy-preserving data publishing (PPDP), which aims to share data in a way that preserves privacy while keeping information loss to a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection to reduce the data dimensionality and then combines attribute and record suppression to obtain k-anonymity. Five datasets from different areas of the life sciences (RETINOPATHY, Single Photon Emission Computed Tomography imaging, gene sequencing and two drug discovery datasets) were anonymised with kPB-MS. The produced anonymised datasets were evaluated using four different classifiers and, in 74% of the test cases, they produced similar or better accuracies than the full datasets.

Proceedings ArticleDOI
20 Jul 2016
TL;DR: The proposed algorithm is a non-interactive method to publish an anonymized data set, and a decision tree classifier showed better results on the anonymized data set compared to other existing classification work.
Abstract: In the recent past, there has been a tremendous increase in large repositories of data, examples being healthcare data, consumer data from retailers, and airline passenger data. These data are continually being shared with interested parties, either anonymously, for research purposes, or openly, by financial or insurance companies, for decision-making purposes. When data is shared anonymously, there is still the possibility of de-anonymizing it. Privacy Preserving Data Publishing (PPDP) is a way to share secure data while ensuring protection against identity disclosure of an individual. Generalization of attributes is a data anonymization technique where an attribute is replaced with a more general value. Differential privacy is a technique that ensures the highest level of privacy for a record owner while providing actual information about the data set. This research develops a framework that generalizes the attributes of a data set so as to satisfy differential privacy principles when publishing secure data for sharing. The proposed algorithm is a non-interactive method to publish an anonymized data set, and a decision tree classifier showed better results compared to other existing classification work on anonymized data sets. In this paper, differential privacy refers to ε-differential privacy.
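
A minimal sketch of the combination the abstract describes, generalizing an attribute and then releasing differentially private counts, assuming a simple equal-width binning (the paper's own generalization scheme is not reproduced here):

```python
import numpy as np
from collections import Counter

def dp_generalized_histogram(ages, bin_width, epsilon):
    """Generalize each age to an equal-width range, then add Lap(1/epsilon)
    noise to every cell count. Cells are disjoint and one individual changes
    one count by at most 1, so the release is epsilon-differentially private."""
    def cell(a):
        lo = (a // bin_width) * bin_width
        return f"{lo}-{lo + bin_width - 1}"
    counts = Counter(cell(a) for a in ages)
    return {c: n + np.random.laplace(0.0, 1.0 / epsilon) for c, n in counts.items()}

print(dp_generalized_histogram([23, 27, 34, 38, 41], bin_width=10, epsilon=0.5))
```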

Proceedings ArticleDOI
16 May 2016
TL;DR: This work presents QB2OLAP, a tool for enabling OLAP on existing QB data without requiring any RDF, QB(4OLAP), or SPARQL skills, which allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics.
Abstract: Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.

Journal ArticleDOI
TL;DR: The AusPlots Rangelands field data collection solution is introduced, providing a systems-level view and motivating its development through the discussion of key functional requirements, and it is demonstrated that the combined system provides a unique end-to-end data collection, curation, archiving and publishing mechanism for ecological data.

Journal ArticleDOI
TL;DR: The proposed scheme is able to automatically detect sensitive data in users' publications; construct sanitized versions of such data; and provide privacy-preserving transparent access to sensitive contents by disclosing more or less information to readers according to their credentials toward the owner of the publications.
Abstract: A system to enforce privacy protection in existing social networks. An automatic method to assess the privacy risks of users' publications. A transparent data storage and accessing mechanism according to users' credentials. Feasibility study and two illustrative case studies in PatientsLikeMe and Twitter. Social networks have become an essential meeting point for millions of individuals willing to publish and consume huge quantities of heterogeneous information. Some studies have shown that the data published in these platforms may contain sensitive personal information and that external entities can gather and exploit this knowledge for their own benefit. Even though some methods to preserve the privacy of social network users have been proposed, they generally apply rigid access control measures to the protected content and, even worse, they do not enable the users to understand which contents are sensitive. Last but not least, most of them require the collaboration of social network operators or fail to provide a practical solution capable of working with well-known and already deployed social platforms (e.g., Twitter). In this paper, we propose a new scheme that addresses all these issues. The new system is envisaged as an independent piece of software that does not depend on the social network in use and that can be transparently applied to most existing ones. According to a set of privacy requirements intuitively defined by the users of a social network, the proposed scheme is able to: (i) automatically detect sensitive data in users' publications; (ii) construct sanitized versions of such data; and (iii) provide privacy-preserving transparent access to sensitive contents by disclosing more or less information to readers according to their credentials toward the owner of the publications. We also study the applicability of the proposed system in general and illustrate its behavior in two well-known social networks (i.e., Twitter and PatientsLikeMe).

Journal ArticleDOI
15 Sep 2016
TL;DR: This article extends two standards for privacy protection in tabular data (k-anonymity and ℓ-diversity) and applies them to hierarchical data and presents utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values.
Abstract: Many applications today rely on storage and management of semi-structured information, for example, XML databases and document-oriented databases. These data often have to be shared with untrusted third parties, which makes individuals' privacy a fundamental problem. In this article, we propose anonymization techniques for privacy-preserving publishing of hierarchical data. We show that the problem of anonymizing hierarchical data poses unique challenges that cannot be readily solved by existing mechanisms. We extend two standards for privacy protection in tabular data (k-anonymity and ℓ-diversity) and apply them to hierarchical data. We present utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values. To evaluate our algorithms and their heuristics, we experiment on synthetic and real datasets obtained from two universities. Our experiments show that we significantly outperform related methods that provide comparable privacy guarantees.
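
The generalization-and-suppression primitive being extended can be sketched on flat values with a small domain hierarchy; anonymizing hierarchical records additionally requires handling structure, which this illustration (with an invented hierarchy and global rather than utility-aware generalization) omits.

```python
from collections import Counter

# An invented domain generalization hierarchy: each value maps to its parent.
HIERARCHY = {"Paris": "France", "Lyon": "France", "Bonn": "Germany",
             "France": "Europe", "Germany": "Europe", "Europe": "*"}

def generalize_until_k(values, k):
    """Globally lift all values one hierarchy level at a time until every
    remaining value occurs at least k times; '*' (full suppression) is the
    top of the hierarchy and always satisfies the condition."""
    while any(c < k for c in Counter(values).values()):
        values = [HIERARCHY.get(v, "*") for v in values]
    return values

print(generalize_until_k(["Paris", "Lyon", "Bonn"], k=2))  # -> ['Europe', 'Europe', 'Europe']
```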

Journal ArticleDOI
TL;DR: The status of the work to develop data formats for engineering materials in the frame of CEN Workshops is described and the added value of data citation beyond simply ensuring that data creators are properly accredited for their work is reported on.

Proceedings ArticleDOI
01 Jun 2016
TL;DR: A differentially private trajectory data publishing algorithm aiming to protect the privacy of sensitive areas is proposed and it is shown that the proposed scheme achieves good data utility and is scalable to large trajectory databases.
Abstract: Trajectory data, such as human mobility traces, in participatory sensing are of vital importance to many applications, such as traffic monitoring, urban planning and social relationship mining. However, improper release of trajectory data can incur great threats to users' privacy. Recent research has adopted the Laplace mechanism to achieve differential privacy, which guarantees that a small change to one record in the database will not breach a user's privacy. However, existing work cannot guarantee privacy perfectly, because randomly picked noise will not contribute to a meaningful trajectory data release and people need to hide their visits to certain sensitive areas. In this paper, we propose a differentially private trajectory data publishing algorithm aiming to protect the privacy of sensitive areas. Privacy analysis shows that the proposed scheme achieves differential privacy, and experiments with real trajectory data show that it achieves good data utility and is scalable to large trajectory databases.
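
One hedged way to picture the goal, though cruder than the cited algorithm, is to suppress sensitive regions before releasing Laplace-noised visit counts, so that no output references them at all. Region names and the function below are invented for illustration; the paper steers the noise itself rather than plainly suppressing regions.

```python
import numpy as np
from collections import Counter

SENSITIVE = {"clinic", "shelter"}   # invented sensitive regions

def dp_region_counts(visits, epsilon):
    """Drop sensitive regions entirely, then release Lap(1/epsilon)-noised
    visit counts for the remaining regions; suppressed regions never appear
    in any output, noisy or otherwise."""
    counts = Counter(v for v in visits if v not in SENSITIVE)
    return {r: c + np.random.laplace(0.0, 1.0 / epsilon) for r, c in counts.items()}

print(dp_region_counts(["mall", "clinic", "mall", "park"], epsilon=1.0))
```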

Journal ArticleDOI
01 Dec 2016
TL;DR: An improved version of anatomy is proposed: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection while retaining significantly more information in the microdata.
Abstract: In data publishing, anonymization techniques have been designed to provide privacy protection. Anatomy is an important technique for privacy preservation in data publication and has attracted considerable attention in the literature. However, anatomy is fragile under the background knowledge attack and the presence attack. In addition, anatomy can only be applied in limited applications. To overcome these drawbacks, we propose an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection while retaining significantly more information in the microdata. We present the details of the technique and build its underlying theory. Extensive experiments on real data are conducted, showing that our technique allows highly effective data analysis, while offering strong privacy guarantees.
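
Since permutation anonymization builds on anatomy, a compact sketch of the anatomy release itself may help: quasi-identifiers and sensitive values are published in two tables joined only by a group id, with sensitive values shuffled within each group. This is our illustration of anatomy, not the paper's permutation technique, and all names are our own.

```python
import random

def anatomize(records, qi_attrs, sa, group_size):
    """Split records into groups of `group_size` and publish two tables: the
    quasi-identifier table keeps QI values with a group id, and the sensitive
    table lists each group's (shuffled) sensitive values, so the exact
    QI-to-SA link inside a group is severed."""
    records = list(records)
    random.shuffle(records)
    qi_table, sa_table = [], []
    for gid, start in enumerate(range(0, len(records), group_size)):
        group = records[start:start + group_size]
        qi_table += [{**{a: r[a] for a in qi_attrs}, "gid": gid} for r in group]
        sas = [r[sa] for r in group]
        random.shuffle(sas)
        sa_table += [{"gid": gid, sa: v} for v in sas]
    return qi_table, sa_table

recs = [{"age": 25, "zip": "47906", "disease": "flu"},
        {"age": 29, "zip": "47907", "disease": "cold"},
        {"age": 31, "zip": "47908", "disease": "flu"},
        {"age": 35, "zip": "47909", "disease": "hiv"}]
qi, sa = anatomize(recs, ["age", "zip"], "disease", group_size=2)
```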

01 Jan 2016
TL;DR: In this article, the authors describe the opportunities and challenges for privacy-preserving visualization of electronic health record data by analyzing the different disclosure risk types, and vulnerabilities associated with commonly used visualization techniques.
Abstract: In this paper, we reflect on the use of visualization techniques for analyzing electronic health record data with privacy concerns. Privacy-preserving data visualization is a relatively new area of research compared to the more established research areas of privacy-preserving data publishing and data mining. We describe the opportunities and challenges for privacy-preserving visualization of electronic health record data by analyzing the different disclosure risk types, and vulnerabilities associated with commonly used visualization techniques.