Showing papers on "Data publishing published in 2016"

PDF

Open Access

Journal Article•DOI•

Privacy Preserving Social Network Data Publication

[...]

Jemal H. Abawajy¹, Mohd Izuan Hafez Ninggal, Tutut Herawan²•Institutions (2)

Deakin University¹, University of Malaya²

08 Mar 2016-IEEE Communications Surveys and Tutorials

TL;DR: An in-depth survey of the state-of-the-art privacy preserving techniques for social network data publishing, metrics for quantifying the anonymity level provided, and information loss as well as challenges and new research directions are presented.

...read moreread less

Abstract: The introduction of online social networks (OSN) has transformed the way people connect and interact with each other as well as share information. OSN have led to a tremendous explosion of network-centric data that could be harvested for better understanding of interesting phenomena such as sociological and behavioural aspects of individuals or groups. As a result, online social network service operators are compelled to publish the social network data for use by third party consumers such as researchers and advertisers. As social network data publication is vulnerable to a wide variety of reidentification and disclosure attacks, developing privacy preserving mechanisms are an active research area. This paper presents a comprehensive survey of the recent developments in social networks data publishing privacy risks, attacks, and privacy-preserving techniques. We survey and present various types of privacy attacks and information exploited by adversaries to perpetrate privacy attacks on anonymized social network data. We present an in-depth survey of the state-of-the-art privacy preserving techniques for social network data publishing, metrics for quantifying the anonymity level provided, and information loss as well as challenges and new research directions. The survey helps readers understand the threats, various privacy preserving mechanisms, and their vulnerabilities to privacy breach attacks in social network data publishing as well as observe common themes and future directions.

...read moreread less

108 citations

Proceedings Article•DOI•

RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy

[...]

Qian Wang¹, Yan Zhang¹, Xiao Lu¹, Zhibo Wang¹, Zhan Qin¹, Kui Ren² - Show less +2 more•Institutions (2)

Wuhan University¹, University at Buffalo²

10 Apr 2016

TL;DR: RescueDP-an online aggregate monitoring scheme over infinite streams with privacy guarantee that can achieve w-event privacy over data generated and published periodically by crowd users and outperforms the existing methods and improves the utility with strong privacy guarantee.

...read moreread less

Abstract: Nowadays gigantic crowd-sourced data collected from mobile phone users have become widely available, which enables the possibility of many important data mining applications to improve the quality of our daily lives. While providing tremendous benefits, the release of these data to the public will pose a considerable threat to mobile users' privacy. To solve this problem, the notion of differential privacy has been proposed to provide privacy with theoretical guarantee, and recently it has been applied in streaming data publishing. However, most of the existing literature focus on either event-level privacy on infinite streams or user-level privacy on finite streams. In this paper, we investigate the problem of real-time spatiotemporal crowd-sourced data publishing with privacy preservation. Specifically, we consider continuous publication of population statistics for monitoring purposes and design RescueDP — an online aggregate monitoring scheme over infinite streams with privacy guarantee. RescueDP's key components include adaptive sampling, adaptive budget allocation, dynamic grouping, perturbation and filtering, which are seamlessly integrated as a whole to provide privacy-preserving statistics publishing on infinite time stamps. We show that RescueDP can achieve w-event privacy over data generated and published periodically by crowd users. We evaluate our scheme with real-world as well as synthetic datasets and compare it with two w-event privacy-assured representative benchmarks. Experimental results show that our solution outperforms the existing methods and improves the utility with strong privacy guarantee.

...read moreread less

76 citations

Journal Article•DOI•

Are Scientific Data Repositories Coping with Research Data Publishing

[...]

Massimiliano Assante¹, Leonardo Candela¹, Donatella Castelli¹, Alice Tani¹•Institutions (1)

Istituto di Scienza e Tecnologie dell'Informazione¹

26 Apr 2016-Data Science Journal

TL;DR: This study analyses the solutions offered by generalist scientific data repositories, i.e., repositories supporting the deposition of any type of research data, and emerges that these repositories implement well consolidated practices and pragmatic solutions for literature repositories.

...read moreread less

Abstract: Research data publishing is intended as the release of research data to make it possible for practitioners to (re)use them according to “open science” dynamics. There are three main actors called to deal with research data publishing practices: researchers, publishers, and data repositories. This study analyses the solutions offered by generalist scientific data repositories, i.e., repositories supporting the deposition of any type of research data. These repositories cannot make any assumption on the application domain. They are actually called to face with the almost open ended typologies of data used in science. The current practices promoted by such repositories are analysed with respect to eight key aspects of data publishing, i.e., dataset formatting, documentation, licensing, publication costs, validation, availability, discovery and access, and citation. From this analysis it emerges that these repositories implement well consolidated practices and pragmatic solutions for literature repositories. These practices and solutions can not totally meet the needs of management and use of datasets resources, especially in a context where rapid technological changes continuously open new exploitation prospects.

...read moreread less

74 citations

Journal Article•DOI•

PPTD: Preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression

[...]

Elahe Ghasemi Komishani¹, Mahdi Abadi¹, Fatemeh Deldar¹•Institutions (1)

Tarbiat Modares University¹

15 Feb 2016-Knowledge Based Systems

TL;DR: PTD aims to strike a balance between the conflicting goals of data utility and data privacy in accordance with the privacy requirements of moving objects and can significantly improve the data utility of anonymized trajectory databases when compared with previous work in the literature.

...read moreread less

Abstract: Trajectory data often provide useful information that can be used in real-life applications, such as traffic management, Geo-marketing, and location-based advertising. However, a trajectory database may contain detailed information about moving objects and associate them with sensitive attributes, such as disease, job, and income. Therefore, improper publishing of the trajectory database can put the privacy of moving objects at risk, especially when an adversary uses partial trajectory information as its background knowledge. The existing approaches for privacy preservation in trajectory data publishing provide the same privacy protection for all moving objects. The consequence is that some moving objects may be offered insufficient privacy protection, while some others may not require high privacy protection. In this paper, we address this problem and present PPTD, a novel approach for preserving privacy in trajectory data publishing based on the concept of personalized privacy. It aims to strike a balance between the conflicting goals of data utility and data privacy in accordance with the privacy requirements of moving objects. To the best of our knowledge, this is the first paper that combines sensitive attribute generalization and trajectory local suppression to achieve a tailored personalized privacy model for trajectory data publishing. Our experiments on two synthetic trajectory datasets suggest that PPTD is effective for preserving personalized privacy in trajectory data publishing. In particular, PPTD can significantly improve the data utility of anonymized trajectory databases when compared with previous work in the literature.

...read moreread less

70 citations

Journal Article•DOI•

Decentralized provenance-aware publishing with nanopublications

[...]

Tobias Kuhn¹, Christine Chichester², Michael Krauthammer³, Núria Queralt-Rosinach⁴, Ruben Verborgh⁵, George Giannakopoulos, Axel-Cyrille Ngonga Ngomo⁶, Raffaele Viglianti⁷, Michel Dumontier⁸ - Show less +5 more•Institutions (8)

VU University Amsterdam¹, Nestlé², Yale University³, Pompeu Fabra University⁴, Ghent University⁵, Leipzig University⁶, University of Maryland, College Park⁷, Stanford University⁸

22 Aug 2016-PeerJ

TL;DR: This article proposes to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies, and presents a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data.

...read moreread less

Abstract: Publication and archival of scientific results is still commonly considered the responsability of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there exist currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.

...read moreread less

67 citations

Proceedings Article•DOI•

SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment

[...]

Shrainik Jain¹, Dominik Moritz¹, Daniel Halperin¹, Bill Howe¹, Edward D. Lazowska¹ - Show less +1 more•Institutions (1)

University of Washington¹

14 Jun 2016

TL;DR: The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts.

...read moreread less

Abstract: We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.

...read moreread less

66 citations

Proceedings Article•DOI•

A survey of security and privacy in big data

[...]

Haina Ye¹, Xinzhou Cheng¹, Mingqiang Yuan¹, Lexi Xu¹, Jie Gao¹, Chen Cheng¹ - Show less +2 more•Institutions (1)

China Unicom¹

01 Sep 2016

TL;DR: The effects of characteristics of big data on information security and privacy are described and privacy-preserving trajectory data publishing is studied due to its future utilization, especially in telecom operation.

...read moreread less

Abstract: Big data has been arising a growing interest in both scientific and industrial fields for its potential value. However, before employing big data technology into massive applications, a basic but also principle topic should be investigated: security and privacy. In this paper, the recent research and development on security and privacy in big data is surveyed. First, the effects of characteristics of big data on information security and privacy are described. Then, topics and issues on security are discussed and reviewed. Further, privacy-preserving trajectory data publishing is studied due to its future utilization, especially in telecom operation.

...read moreread less

56 citations

Proceedings Article•DOI•

PrivCheck: privacy-preserving check-in data publishing for personalized location based services

[...]

Dingqi Yang¹, Daqing Zhang², Bingqing Qu³, Philippe Cudré-Mauroux¹•Institutions (3)

University of Fribourg¹, Peking University², University of Rennes³

12 Sep 2016

TL;DR: The key idea of PrivCheck is to obfuscate user check-in data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which ensures the utility of the obfuscated data to empower personalized LBSs.

...read moreread less

Abstract: With the widespread adoption of smartphones, we have observed an increasing popularity of Location-Based Services (LBSs) in the past decade. To improve user experience, LBSs often provide personalized recommendations to users by mining their activity (i.e., check-in) data from location-based social networks. However, releasing user check-in data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the users' check-in data. In this paper, we propose PrivCheck, a customizable and continuous privacy-preserving check-in data publishing framework providing users with continuous privacy protection against inference attacks. The key idea of PrivCheck is to obfuscate user check-in data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which ensures the utility of the obfuscated data to empower personalized LBSs. Since users often give LBS providers access to both their historical check-in data and future check-in streams, we develop two data obfuscation methods for historical and online check-in publishing, respectively. An empirical evaluation on two real-world datasets shows that our framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized LBSs.

...read moreread less

46 citations

Introduction To Privacy Preserving Data Publishing Concepts And Techniques

[...]

Torsten Werner

01 Jan 2016

TL;DR: This introduction to privacy preserving data publishing concepts and techniques helps people to enjoy a good book with a cup of tea in the afternoon, instead they cope with some malicious bugs inside their desktop computer.

...read moreread less

Abstract: Thank you very much for reading introduction to privacy preserving data publishing concepts and techniques. Maybe you have knowledge that, people have search numerous times for their favorite books like this introduction to privacy preserving data publishing concepts and techniques, but end up in harmful downloads. Rather than enjoying a good book with a cup of tea in the afternoon, instead they cope with some malicious bugs inside their desktop computer.

...read moreread less

37 citations

Journal Article•DOI•

Hierarchical anonymization algorithms against background knowledge attack in data releasing

[...]

Fatemeh Amiri¹, Nasser Yazdani¹, Azadeh Shakery, Amir H. Chinaei²•Institutions (2)

University of Tehran¹, University of Puerto Rico at Mayagüez²

01 Jun 2016-Knowledge Based Systems

TL;DR: This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks featured by any data correlations between QIs and sensitive attribute values as the adversary's background knowledge.

...read moreread less

Abstract: We define a privacy model based on k-anonymity and one of its strong refinements to prevent the background knowledge attack.We propose two hierarchical anonymization algorithm to satisfy our privacy model.Our algorithms outperform the state-of the art anonymization algorithm in terms of utility and privacy.We extend an information loss measure to capture data inaccuracies caused by not-fitted records in any equivalence class. Preserving privacy in the presence of adversary's background knowledge is very important in data publishing. The k-anonymity model, while protecting identity, does not protect against attribute disclosure. One of strong refinements of k-anonymity, β-likeness, does not protect against identity disclosure. Neither model protects against attacks featured by background knowledge. This research proposes two approaches for generating k-anonymous β-likeness datasets that protect against identity and attribute disclosures and prevent attacks featured by any data correlations between QIs and sensitive attribute values as the adversary's background knowledge. In particular, two hierarchical anonymization algorithms are proposed. Both algorithms apply agglomerative clustering techniques in their first stage in order to generate clusters of records whose probability distributions extracted by background knowledge are similar. In the next phase, k-anonymity and β-likeness are enforced in order to prevent identity and attribute disclosures. Our extensive experiments demonstrate that the proposed algorithms outperform other state-of-the-art anonymization algorithms in terms of privacy and data utility where the number of unpublished records in our algorithms is less than that of the others. As well-known information loss metrics fail to measure precisely the imposed data inaccuracies stemmed from the removal of records that cannot be published in any equivalence class. This research also introduces an extension into the Global Certainty Penalty metric that considers unpublished records.

...read moreread less

31 citations

Proceedings Article•DOI•

A multi-cloud based privacy-preserving data publishing scheme for the internet of things

[...]

Lei Yang¹, Abdulmalik Humayed¹, Fengjun Li¹•Institutions (1)

University of Kansas¹

05 Dec 2016

TL;DR: This paper proposes two multi-cloud-based outsourced-ABE schemes, namely the parallel-cloud ABE and the chain- cloud ABE, which enable the receivers to partially outsource the computationally expensive decryption operations to the clouds, while preventing user attributes from being disclosed.

...read moreread less

Abstract: With the increased popularity of ubiquitous computing and connectivity, the Internet of Things (IoT) also introduces new vulnerabilities and attack vectors. While secure data collection (i.e. the upward link) has been well studied in the literature, secure data dissemination (i.e. the downward link) remains an open problem. Attribute-based encryption (ABE) and outsourced-ABE has been used for secure message distribution in IoT, however, existing mechanisms suffer from extensive computation and/or privacy issues. In this paper, we explore the problem of privacy-preserving targeted broadcast in IoT. We propose two multi-cloud-based outsourced-ABE schemes, namely the parallel-cloud ABE and the chain-cloud ABE, which enable the receivers to partially outsource the computationally expensive decryption operations to the clouds, while preventing user attributes from being disclosed. In particular, the proposed solution protects three types of privacy (i.e., data, attribute and access policy privacy) by enforcing collaborations among multiple clouds. Our schemes also provide delegation verifiability that allows the receivers to verify whether the clouds have faithfully performed the outsourced operations. We extensively analyze the security guarantees of the proposed mechanisms and demonstrate the effectiveness and efficiency of our schemes with simulated resource-constrained IoT devices, which outsource operations to Amazon EC2 and Microsoft Azure.

...read moreread less

Journal Article•DOI•

Publishing FAIR Data: an exemplar methodology utilizing PHI-base

[...]

Alejandro Rodríguez-Iglesias¹, Alejandro Rodríguez-González¹, Alistair G. Irvine², Ane Sesma¹, Martin Urban², Kim E. Hammond-Kosack², Mark Wilkinson¹ - Show less +3 more•Institutions (2)

Technical University of Madrid¹, Rothamsted Research²

12 May 2016-Frontiers in Plant Science

TL;DR: This work describes the process of migrating a database of notable relevance to the plant sciences—the Pathogen-Host Interaction Database (PHI-base)—to a form that conforms to each of the FAIR Principles, and demonstrates the utility of providing explicit and reliable access to provenance information.

...read moreread less

Abstract: Pathogen-Host interaction data is core to our understanding of disease processes and their molecular/genetic bases. Facile access to such core data is particularly important for the plant sciences, where individual genetic and phenotypic observations have the added complexity of being dispersed over a wide diversity of plant species versus the relatively fewer host species of interest to biomedical researchers. Recently, an international initiative interested in scholarly data publishing proposed that all scientific data should be “FAIR” - Findable, Accessible, Interoperable, and Reusable. In this work, we describe the process of migrating a database of notable relevance to the plant sciences - the Pathogen-Host Interaction Database (PHI-base) - to a form that conforms to each of the FAIR Principles. We discuss the technical and architectural decisions, and the migration pathway, including observations of the difficulty and/or fidelity of each step. We examine how multiple FAIR principles can be addressed simultaneously through careful design decisions, including making data FAIR for both humans and machines with minimal duplication of effort. We note how FAIR data publishing involves more than data reformatting, requiring features beyond those exhibited by most life science Semantic Web or Linked Data resources. We explore the value-added by completing this FAIR data transformation, and then test the result through integrative questions that could not easily be asked over traditional Web-based data resources. Finally, we demonstrate the utility of providing explicit and reliable access to provenance information, which we argue enhances citation rates by encouraging and facilitating transparent scholarly reuse of these valuable data holdings.

...read moreread less

Journal Article•DOI•

A hybrid approach to prevent composition attacks for independent data releases

[...]

Jiuyong Li¹, Muzammil M. Baig¹, A.H.M. Sarowar Sattar¹, Xiaofeng Ding², Jixue Liu¹, Millist W. Vincent¹ - Show less +2 more•Institutions (2)

University of South Australia¹, Huazhong University of Science and Technology²

01 Nov 2016-Information Sciences

TL;DR: A hybrid algorithm, which combines sampling, perturbation and generalization to protect data privacy from composition attacks is proposed and experimentally demonstrates that the proposed anonymization technique significantly reduces the risk of composition attacks and also preserves good data utility.

...read moreread less

Journal Article•DOI•

An effective value swapping method for privacy preserving data publishing

[...]

A S M Touhidul Hasan¹, Qingshan Jiang, Jun Luo, Chengming Li, Lifei Chen² - Show less +1 more•Institutions (2)

Chinese Academy of Sciences¹, Fujian Normal University²

10 Nov 2016-Security and Communication Networks

TL;DR: To increase the privacy of published data in the sliced tables, a new method called value swapping is proposed in this work, aimed at decreasing the attribute disclosure risk for the absolute facts and ensuring the l-diverse slicing.

...read moreread less

Abstract: Privacy is an important concern in the society, and it has been a fundamental issue when to analyze and publish data involving human individual's sensitive information. Recently, the slicing method has been popularly used for privacy preservation in data publishing, because of its potential for preserving more data utility than others such as the generalization and bucketization approaches. However, in this paper, we show that the slicing method has disclosure risks for some absolute facts, which would help the adversary to find invalid records in the sliced microdata table, resulting in breach of individual privacy. To increase the privacy of published data in the sliced tables, a new method called value swapping is proposed in this work, aimed at decreasing the attribute disclosure risk for the absolute facts and ensuring the l-diverse slicing. By value swapping, the published table contains no invalid information such that the adversary cannot breach the individual privacy. Experimental results also show that the NEW method is able to keep more data utility than the existing slicing methods in a published microdata table. Copyright © 2016 John Wiley & Sons, Ltd.

...read moreread less

Journal Article•DOI•

Privacy-Preserving Utility Verification of the Data Published by Non-Interactive Differentially Private Mechanisms

[...]

Jingyu Hua¹, An Tang¹, Yixin Fang¹, Zhenyu Shen¹, Sheng Zhong¹ - Show less +1 more•Institutions (1)

Nanjing University¹

25 Feb 2016-IEEE Transactions on Information Forensics and Security

TL;DR: A privacy-preserving utility verification mechanism based upon cryptographic technique for DiffPart-a differentially private scheme designed for set-valued data that can measure the data utility based upon the encrypted frequencies of the aggregated raw data instead of the plain values, which thus prevents privacy breach.

...read moreread less

Abstract: In the problem of privacy-preserving collaborative data publishing, a central data publisher is responsible for aggregating sensitive data from multiple parties and then anonymizing it before publishing for data mining. In such scenarios, the data users may have a strong demand to measure the utility of the published data, since most anonymization techniques have side effects on data utility. Nevertheless, this task is non-trivial, because the utility measuring usually requires the aggregated raw data, which is not revealed to the data users due to privacy concerns. Furthermore, the data publishers may even cheat in the raw data, since no one, including the individual providers, knows the full data set. In this paper, we first propose a privacy-preserving utility verification mechanism based upon cryptographic technique for DiffPart —a differentially private scheme designed for set-valued data. This proposal can measure the data utility based upon the encrypted frequencies of the aggregated raw data instead of the plain values, which thus prevents privacy breach. Moreover, it is enabled to privately check the correctness of the encrypted frequencies provided by the publisher, which helps detect dishonest publishers. We also extend this mechanism to DiffGen—another differentially private publishing scheme designed for relational data. Our theoretical and experimental evaluations demonstrate the security and efficiency of the proposed mechanism.

...read moreread less

Book Chapter•DOI•

DataGraft: Simplifying Open Data Publishing

[...]

Dumitru Roman¹, Marin Dimitrov², Nikolay Nikolov¹, Antoine Putlier¹, Dina Sukhobok¹, Brian Elvesæter¹, Arne J. Berre¹, Xianglin Ye¹, Alex Simov², Yavor Petkov² - Show less +6 more•Institutions (2)

SINTEF¹, Ontotext²

29 May 2016

TL;DR: This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.

...read moreread less

Abstract: In this demonstrator we introduce DataGraft – a platform for Open Data management. DataGraft provides data transformation, publishing and hosting capabilities that aim to simplify the data publishing lifecycle for data workers (i.e., Open Data publishers, Linked Data developers, data scientists). This demonstrator highlights the key features of DataGraft by exemplifying a data transformation and publishing use case with property-related data.

...read moreread less

Journal Article•DOI•

Structural Data De-Anonymization: Theory and Practice

[...]

Shouling Ji¹, Weiqing Li², Mudhakar Srivatsa³, Raheem Beyah²•Institutions (3)

Zhejiang University¹, Georgia Institute of Technology², IBM³

01 Dec 2016-IEEE ACM Transactions on Networking

TL;DR: This work quantitatively shows the perfect and (1 - ε)-perfect de-anonymization conditions of the 26 real world structural data sets, including social networks, collaborations networks, communication networks, autonomous systems, peer-to-peer networks, and so on.

...read moreread less

Abstract: In this paper, we study the quantification, practice, and implications of structural data de-anonymization, including social data, mobility traces, and so on. First, we answer several open questions in structural data de-anonymization by quantifying perfect and $(1-\epsilon )$ -perfect structural data de-anonymization, where $\epsilon $ is the error tolerated by a de-anonymization scheme. To the best of our knowledge, this is the first work on quantifying structural data de-anonymization under a general data model, which closes the gap between the structural data de-anonymization practice and theory. Second, we conduct the first large-scale study on the de-anonymizability of 26 real world structural data sets, including social networks, collaborations networks, communication networks, autonomous systems, peer-to-peer networks, and so on. We also quantitatively show the perfect and $(1-\epsilon )$ -perfect de-anonymization conditions of the 26 data sets. Third, following our quantification, we present a practical attack [a novel single-phase cold start optimization-based de-anonymization (ODA) algorithm]. An experimental analysis of ODA shows that $\sim 77.7$ %–83.3% of the users in Gowalla (196 591 users and 950 327 edges) and 86.9%–95.5% of the users in Google+ (4 692 671 users and 90 751 480 edges) are de-anonymizable in different scenarios, which implies that the structure-based de-anonymization is powerful in practice. Finally, we discuss the implications of our de-anonymization quantification and our ODA attack and provide some general suggestions for future secure data publishing.

...read moreread less

Proceedings Article•

Logical foundations of privacy-preserving publishing of Linked Data

[...]

Bernardo Cuenca Grau¹, Egor V. Kostylev¹•Institutions (1)

University of Oxford¹

12 Feb 2016

TL;DR: In this article, the authors lay the foundations of privacy-preserving data publishing (PPDP) in the context of linked data, and define notions of safe and optimal anonymisations.

...read moreread less

Abstract: The widespread adoption of Linked Data has been driven by the increasing demand for information exchange between organisations, as well as by data publishing regulations in domains such as health care and governance. In this setting, sensitive information is at risk of disclosure since published data can be linked with arbitrary external data sources. In this paper we lay the foundations of privacy-preserving data publishing (PPDP) in the context of Linked Data. We consider anonymisations of RDF graphs (and, more generally, relational datasets with labelled nulls) and define notions of safe and optimal anonymisations. Safety ensures that the anonymised data can be published with provable protection guarantees against linking attacks, whereas optimality ensures that it preserves as much information from the original data as possible, while satisfying the safety requirement. We establish the complexity of the underpinning decision problems both under open-world semantics inherent to RDF and a closed-world semantics, where we assume that an attacker has complete knowledge over some part of the original data.

...read moreread less

Journal Article•DOI•

Privacy preserving data anonymization of spontaneous ADE reporting system dataset.

[...]

Wen-Yang Lin¹, Duen-Chuan Yang¹, Jie-Teng Wang¹•Institutions (1)

National University of Kaohsiung¹

18 Jul 2016-BMC Medical Informatics and Decision Making

TL;DR: A new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks and an anonymization algorithm to sanitize SRS data in accordance with the proposed model is proposed.

...read moreread less

Abstract: To facilitate long-term safety surveillance of marketing drugs, many spontaneously reporting systems (SRSs) of ADR events have been established world-wide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, it procures the issue of privacy preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting privacy of SRS data and none of the anonymization methods is favorable for SRS datasets, due to which contain some characteristics such as rare events, multiple individual records, and multi-valued sensitive attributes. We propose a new privacy model called MS(k, θ * )-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ * , for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy-based clustering strategy to group the records into clusters, conforming to an innovative anonymization metric aiming to minimize the privacy risk as well as maintain the data utility for ADR detection. Empirical study was conducted using FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, Multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ * , including uniform setting, level-wise setting and frequency-based setting. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals. With all three different threshold settings for sensitive value, our method can successively prevent the disclosure of sensitive values (nearly all observed DRs are zeros) without sacrificing too much of data utility. With non-uniform threshold setting, level-wise or frequency-based, our MS(k, θ *)-bounding exhibits the best data utility and the least privacy risk among all the models. The experiments conducted on selected ADR signals from MedWatch show that only very small difference on signal strength (PRR or ROR) were observed. The results show that our method can effectively prevent the disclosure of patient sensitive information without sacrificing data utility for ADR signal detection. We propose a new privacy model for protecting SRS data that possess some characteristics overlooked by contemporary models and an anonymization algorithm to sanitize SRS data in accordance with the proposed model. Empirical evaluation on the real SRS dataset, i.e., FAERS, shows that our method can effectively solve the privacy problem in SRS data without influencing the ADR signal strength.

...read moreread less

Journal Article•DOI•

Cooperative privacy game: a novel strategy for preserving privacy in data publishing

[...]

V. Valli Kumari¹, Srinivasa L. Chakravarthy•Institutions (1)

Andhra University¹

01 Dec 2016-Human-centric Computing and Information Sciences

TL;DR: A novel methodology, Cooperative Privacy Game (CoPG), has been proposed to achieve data privacy in which Cooperative Game Theory is used to achieve the privacy and is named as Cooperative Privacy (CoP).

...read moreread less

Abstract: Achieving data privacy before publishing has been becoming an extreme concern of researchers, individuals and service providers. A novel methodology, Cooperative Privacy Game (CoPG), has been proposed to achieve data privacy in which Cooperative Game Theory is used to achieve the privacy and is named as Cooperative Privacy (CoP). The core idea of CoP is to play the best strategy for a player to preserve his privacy by himself which in turn contributes to preserving other players privacy. CoP considers each tuple as a player and tuples form coalitions as described in the procedure. The main objective of the CoP is to obtain individuals (player) privacy as a goal that is rationally interested in other individuals' (players) privacy. CoP is formally defined in terms of Nash equilibria, i.e., all the players are in their best coalition, to achieve k-anonymity. The cooperative values of the each tuple are measured using the characteristic function of the CoPG to identify the coalitions. As the underlying game is convex; the algorithm is efficient and yields high quality coalition formation with respect to intensity and disperse. The efficiency of anonymization process is calculated using information loss metric. The variations of the information loss with the parameters $$\alpha$$ź (weight factor of nearness) and $$\beta$$β (multiplicity) are analyzed and the obtained results are discussed.

...read moreread less

Journal Article•DOI•

Privacy preserving data publishing of categorical data through k-anonymity and feature selection.

[...]

Aristos Aristodimou¹, Athos Antoniades¹, Constantinos S. Pattichis¹•Institutions (1)

University of Cyprus¹

23 Mar 2016-Healthcare technology letters

TL;DR: A new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS), which uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k- anonymity.

...read moreread less

Abstract: In healthcare, there is a vast amount of patients' data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared and hence such information is underused. A new area of research has emerged, called privacy preserving data publishing (PPDP), which aims in sharing data in a way that privacy is preserved while the information lost is kept at a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k-anonymity. Five datasets from different areas of life sciences [RETINOPATHY, Single Proton Emission Computed Tomography imaging, gene sequencing and drug discovery (two datasets)], were anonymised with kPB-MS. The produced anonymised datasets were evaluated using four different classifiers and in 74% of the test cases, they produced similar or better accuracies than using the full datasets.

...read moreread less

Proceedings Article•DOI•

A Novel Differential Privacy Approach that Enhances Classification Accuracy

[...]

A. N. K. Zaman¹, Charlie Obimbo¹, Rozita Dara¹•Institutions (1)

University of Guelph¹

20 Jul 2016

TL;DR: The proposed algorithm is a non-interactive method to publish anonymize data set, and the decision tree classifier showed better results compared to other existing classification works on anonymized data set.

...read moreread less

Abstract: In the recent past, there has been a tremendous increase of large repositories of data, examples being in healthcare data, consumer data from retailers, and airline passenger data. These data are continually being shared with interested parties, either anonymously -- for research purposes, or openly by financial or insurance companies, for decision-making purposes. When is shared anonymously, there is still the possibility of de-anonymizing the data. Privacy Preserving Data Publishing (PPDP) is a way to allow one to share secure data while ensuring protection against identity disclosure of an individual. Generalization of attributes is a technique of data anonymization where an attribute is replaced with a more generalized value. Differential privacy is a technique that ensures the highest level of privacy for a record owner while providing actual information about the data set. This research develops a framework by generalizing attributes of a data set that satisfy differential privacy principles for publishing secure data for sharing. The proposed algorithm is a non-interactive method to publish anonymize data set, and the decision tree classifier showed better results compared to other existing classification works on anonymized data set. In this paper differential privacy refers to e-differential privacy.

...read moreread less

Proceedings Article•DOI•

QB2OLAP: Enabling OLAP on Statistical Linked Open Data

[...]

Jovan Varga¹, Lorena Etcheverry, Alejandro A. Vaisman², Oscar Romero¹, Torben Bach Pedersen³, Christian Thomsen³ - Show less +2 more•Institutions (3)

Polytechnic University of Catalonia¹, Instituto Tecnológico de Buenos Aires², Aalborg University³

16 May 2016

TL;DR: This work presents QB2OLAP, a tool for enabling OLAP on existing QB data without requiring any RDF, QB(4OLAP), or SPARQL skills, and allows semi-automatic transformation of a QB data set into a QB4 OLAP one via enrichment with QB4olAP semantics.

...read moreread less

Abstract: Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.

...read moreread less

Journal Article•DOI•

AusPlots Rangelands field data collection and publication

[...]

Andrew Tokmakoff¹, Ben Sparrow¹, David Turner¹, Andrew J. Lowe¹•Institutions (1)

University of Adelaide¹

01 Mar 2016-Future Generation Computer Systems

TL;DR: The AusPlots Rangelands field data collection solution is introduced, providing a systems-level view and motivating its development through the discussion of key functional requirements, and it is demonstrated that the combined system provides a unique end-to-end data collection, curation, archiving and publishing mechanism for ecological data.

...read moreread less

Journal Article•DOI•

Enforcing transparent access to private content in social networks by means of automatic sanitization

[...]

Alexandre Viejo, David Sánchez

15 Nov 2016-Expert Systems With Applications

TL;DR: The proposed scheme is able to automatically detect sensitive data in users' publications; construct sanitized versions of such data; and provide privacy-preserving transparent access to sensitive contents by disclosing more or less information to readers according to their credentials toward the owner of the publications.

...read moreread less

Abstract: A system to enforce privacy protection in existing social networks.An automatic method to assess the privacy risks of users' publications.A transparent data storage and accessing mechanism according to users' credentials.Feasibility study and two illustrative case studies in PatientsLikeMe and Twitter. Social networks have become an essential meeting point for millions of individuals willing to publish and consume huge quantities of heterogeneous information. Some studies have shown that the data published in these platforms may contain sensitive personal information and that external entities can gather and exploit this knowledge for their own benefit. Even though some methods to preserve the privacy of social networks users have been proposed, they generally apply rigid access control measures to the protected content and, even worse, they do not enable the users to understand which contents are sensitive. Last but not least, most of them require the collaboration of social network operators or they fail to provide a practical solution capable of working with well-known and already deployed social platforms (e.g., Twitter). In this paper, we propose a new scheme that addresses all these issues. The new system is envisaged as an independent piece of software that does not depend on the social network in use and that can be transparently applied to most existing ones. According to a set of privacy requirements intuitively defined by the users of a social network, the proposed scheme is able to: (i) automatically detect sensitive data in users' publications; (ii) construct sanitized versions of such data; and (iii) provide privacy-preserving transparent access to sensitive contents by disclosing more or less information to readers according to their credentials toward the owner of the publications. We also study the applicability of the proposed system in general and illustrate its behavior in two well-known social networks (i.e., Twitter and PatientsLikeMe).

...read moreread less

Journal Article•DOI•

Privacy-Preserving Publishing of Hierarchical Data

[...]

Ismet Ozalp¹, Mehmet Emre Gursoy², Mehmet Ercan Nergiz, Yucel Saygin¹•Institutions (2)

Sabancı University¹, University of California, Los Angeles²

15 Sep 2016

TL;DR: This article extends two standards for privacy protection in tabular data (k-anonymity and ℓ-diversity) and applies them to hierarchical data and presents utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values.

...read moreread less

Abstract: Many applications today rely on storage and management of semi-structured information, for example, XML databases and document-oriented databases. These data often have to be shared with untrusted third parties, which makes individuals’ privacy a fundamental problem. In this article, we propose anonymization techniques for privacy-preserving publishing of hierarchical data. We show that the problem of anonymizing hierarchical data poses unique challenges that cannot be readily solved by existing mechanisms. We extend two standards for privacy protection in tabular data (k-anonymity and e-diversity) and apply them to hierarchical data. We present utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values. To evaluate our algorithms and their heuristics, we experiment on synthetic and real datasets obtained from two universities. Our experiments show that we significantly outperform related methods that provide comparable privacy guarantees.

...read moreread less

Journal Article•DOI•

Towards a digital infrastructure for engineering materials data

[...]

Tim Austin

01 Mar 2016-Materials Discovery

TL;DR: The status of the work to develop data formats for engineering materials in the frame of CEN Workshops is described and the added value of data citation beyond simply ensuring that data creators are properly accredited for their work is reported on.

...read moreread less

Proceedings Article•DOI•

Differentially Private Publication Scheme for Trajectory Data

[...]

Meng Li¹, Liehuang Zhu¹, Zijian Zhang¹, Rixin Xu¹•Institutions (1)

Beijing Institute of Technology¹

01 Jun 2016

TL;DR: A differentially private trajectory data publishing algorithm aiming to protect the privacy of sensitive areas is proposed and it is shown that the proposed scheme achieves good data utility and is scalable to large trajectory databases.

...read moreread less

Abstract: Trajectory data. Like human mobility trace, in participatory sensing is of vital importance to many applications, like traffic monitoring, urban planning and social relationship mining. However, improper release of trajectory data can incur great threats to user's privacy. Recent researches have adopted Laplace mechanism to achieve differential privacy which can guarantee that small change of one record in database will not breach a user's privacy. However, existing work cannot guarantee privacy perfectly because a randomly picked noise will not contribute to a meaningful trajectory data release and people need to hide their visits to certain sensitive area. In this paper, we propose a differentially private trajectory data publishing algorithm aiming to protect the privacy of sensitive areas. Privacy analysis show that the proposed scheme achieves differential privacy and experiments with real trajectory data exhibits that the proposed scheme achieves good data utility and is scalable to large trajectory databases.

...read moreread less

Journal Article•DOI•

Permutation anonymization

[...]

Dong Li¹, Xianmang He², Longbin Cao³, Huahui Chen⁴•Institutions (4)

National Natural Science Foundation of China¹, Fudan University², University of Technology, Sydney³, Ningbo University⁴

01 Dec 2016

TL;DR: An improved version of anatomy is proposed: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection, and in the meanwhile is able to retain significantly more information in the microdata.

...read moreread less

Abstract: In data publishing, anonymization techniques have been designed to provide privacy protection. Anatomy is an important techniques for privacy preserving in data publication and attracts considerable attention in the literature. However, anatomy is fragile under background knowledge attack and the presence attack. In addition, anatomy can only be applied into limited applications. To overcome these drawbacks, we propose an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection, and in the meanwhile is able to retain significantly more information in the microdata. We present the detail of the technique and build the underlying theory of the technique. Extensive experiments on real data are conducted, showing that our technique allows highly effective data analysis, while offering strong privacy guarantees.

...read moreread less

Opportunities and challenges for privacy-preserving visualization of electronic health record data

[...]

Aritra Dasgupta, Eamonn Maguire, Alfie Abdul-Rahman, Minsi Chen

01 Jan 2016

TL;DR: In this article, the authors describe the opportunities and challenges for privacy-preserving visualization of electronic health record data by analyzing the different disclosure risk types, and vulnerabilities associated with commonly used visualization techniques.

...read moreread less

Abstract: In this paper, we reflect on the use of visualization techniques for analyzing electronic health record data with privacy concerns. Privacy-preserving data visualization is a relatively new area of research compared to the more established research areas of privacy-preserving data publishing and data mining. We describe the opportunities and challenges for privacy-preserving visualization of electronic health record data by analyzing the different disclosure risk types, and vulnerabilities associated with commonly used visualization techniques.

...read moreread less