
Showing papers on "Data publishing" published in 2014


Journal ArticleDOI
Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, Yong Ren
TL;DR: This paper identifies four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker, and examines various approaches that can help to protect sensitive information.
Abstract: The growing popularity and development of data mining technologies bring serious threats to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact, unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., the data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss their privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game-theoretical approaches proposed for analyzing the interactions among different users in a data mining scenario, each of whom has their own valuation of the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.

528 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper presents PrivBayes, a differentially private method for releasing high-dimensional data that circumvents the curse of dimensionality, and introduces a novel approach that uses a surrogate function for mutual information to build the model more accurately.
Abstract: Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
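
The flavor of the release step can be conveyed in a few lines. Below is a minimal sketch assuming a toy binary table and a hand-fixed chain network A -> B -> C; PrivBayes learns the network structure privately via a surrogate for mutual information, which this sketch skips, and the budget split and noise scale here are illustrative rather than the paper's exact calibration.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 3))   # toy table: binary attributes A, B, C

eps = 1.0          # total privacy budget
eps_m = eps / 2    # split across the two 2-way marginals of the chain A -> B -> C

def noisy_marginal(cols):
    """Noisy, normalized joint distribution over two binary columns."""
    counts = Counter(map(tuple, data[:, cols]))
    table = np.array([[counts.get((a, b), 0) for b in (0, 1)] for a in (0, 1)],
                     dtype=float)
    # Laplace noise on the counts; scale is illustrative, not the paper's.
    table = np.clip(table + rng.laplace(scale=2.0 / eps_m, size=table.shape), 0, None)
    return table / table.sum()

p_ab = noisy_marginal([0, 1])   # approximates P(A, B)
p_bc = noisy_marginal([1, 2])   # approximates P(B, C)

def sample_row():
    a = rng.choice(2, p=p_ab.sum(axis=1))           # A ~ P(A)
    b = rng.choice(2, p=p_ab[a] / p_ab[a].sum())    # B ~ P(B | A)
    c = rng.choice(2, p=p_bc[b] / p_bc[b].sum())    # C ~ P(C | B)
    return a, b, c

synthetic = [sample_row() for _ in range(len(data))]   # the data actually released
```

Even in this toy, the key idea survives: noise lands on 2-dimensional tables whose counts are large, not on the full joint distribution whose cells would be mostly empty.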

433 citations


Journal ArticleDOI
06 Aug 2014-PLOS ONE
TL;DR: The key need for the IPT is discussed, along with how it has developed in response to community input and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records.
Abstract: The planet is experiencing an ongoing global biodiversity crisis. Measuring the magnitude and rate of change more effectively requires access to organized, easily discoverable, and digitally-formatted biodiversity data, both legacy and new, from across the globe. Assembling this coherent digital representation of biodiversity requires the integration of data that have historically been analog, dispersed, and heterogeneous. The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT’s two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format. Here we discuss the key need for the IPT, how it has developed in response to community input, and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records. We close with a discussion of how the IPT has impacted the biodiversity research community, how it enhances data publishing in more traditional journal venues, new features implemented in the latest version of the IPT, and future plans for further enhancements.

198 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: This work quantitatively shows the conditions for perfect and (1-ε)-perfect structural data DA of 26 real-world structural datasets, including Social Networks, Collaboration Networks, Communication Networks, Autonomous Systems, and Peer-to-Peer networks, and designs a practical and novel single-phase cold-start Optimization-based DA (ODA) algorithm.
Abstract: In this paper, we study the quantification, practice, and implications of structural data (e.g., social data, mobility traces) De-Anonymization (DA). First, we address several open problems in structural data DA by quantifying perfect and (1-ε)-perfect structural data DA, where ε is the error tolerated by a DA scheme. To the best of our knowledge, this is the first work on quantifying structural data DA under a general data model, which closes the gap between structural data DA practice and theory. Second, we conduct the first large-scale study on the de-anonymizability of 26 real-world structural datasets, including Social Networks (SNs), Collaboration Networks, Communication Networks, Autonomous Systems, and Peer-to-Peer networks. We also quantitatively show the conditions for perfect and (1-ε)-perfect DA of the 26 datasets. Third, following our quantification, we design a practical and novel single-phase cold-start Optimization-based DA (ODA) algorithm. Experimental analysis of ODA shows that about 77.7% - 83.3% of the users in Gowalla (0.2M users and 1M edges) and 86.9% - 95.5% of the users in Google+ (4.7M users and 90.8M edges) are de-anonymizable in different scenarios, which implies optimization-based DA is implementable and powerful in practice. Finally, we discuss the implications of our DA quantification and ODA and provide some general suggestions for future secure data publishing.
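
The paper's ODA algorithm is considerably more sophisticated, but the shape of optimization-based structural DA can be conveyed with a toy sketch: match nodes of the anonymized graph against an auxiliary graph by greedily minimizing a structural-distance objective, here built only from degrees and sorted neighbor-degree profiles (an illustrative choice, not the paper's objective).

```python
import networkx as nx

# Toy flavor of optimization-based structural de-anonymization (NOT the
# paper's ODA algorithm): greedily pair nodes across the anonymized and
# auxiliary graphs by minimizing a degree-profile distance.

def profile(g, v):
    return g.degree(v), sorted(g.degree(u) for u in g.neighbors(v))

def distance(p, q):
    a, b = p[1], q[1]
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return abs(p[0] - q[0]) + sum(abs(x - y) for x, y in zip(a, b))

def greedy_deanonymize(g_anon, g_aux):
    pairs = sorted((distance(profile(g_anon, u), profile(g_aux, v)), u, v)
                   for u in g_anon for v in g_aux)
    mapping, used = {}, set()
    for _, u, v in pairs:                      # cheapest candidate pairs first
        if u not in mapping and v not in used:
            mapping[u] = v
            used.add(v)
    return mapping

# Sanity check: de-anonymize a small graph against a lightly perturbed copy.
g = nx.karate_club_graph()
g_noisy = nx.double_edge_swap(g.copy(), nswap=3, max_tries=500)
hits = sum(u == v for u, v in greedy_deanonymize(g, g_noisy).items())
print(f"{hits}/{g.number_of_nodes()} nodes correctly re-identified")
```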

122 citations


Book ChapterDOI
19 Oct 2014
TL;DR: The LOD Laundromat is presented, which removes stains from data without any human intervention and is able to make very large amounts of LOD more easily available for further processing right now.
Abstract: It is widely accepted that proper data publishing is difficult. The majority of Linked Open Data (LOD) does not meet even a core set of data publishing guidelines. Moreover, datasets that are clean at creation can get stains over time. As a result, the LOD cloud now contains a high level of dirty data that is difficult for humans to clean and for machines to process. Existing solutions for cleaning data (standards, guidelines, tools) are targeted towards human data creators, who can (and do) choose not to use them. This paper presents the LOD Laundromat, which removes stains from data without any human intervention. This fully automated approach is able to make very large amounts of LOD more easily available for further processing right now. LOD Laundromat is not a new dataset, but rather a uniform point of entry to a collection of cleaned siblings of existing datasets. It provides researchers and application developers a wealth of data that is guaranteed to conform to a specified set of best practices, thereby greatly improving the chance of data actually being (re)used.
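
As a hedged illustration of the kind of mechanical cleaning step such a service automates (the real Laundromat does far more: crawling, archiving, metadata, streaming access), here is a sketch using rdflib, assuming rdflib 6+ where serialize() returns a string:

```python
from rdflib import Graph

def launder(path):
    """Re-parse a possibly messy RDF file and emit canonical N-Triples."""
    g = Graph()
    for fmt in ("turtle", "xml", "nt"):        # guess the serialization
        try:
            g.parse(path, format=fmt)
            break
        except Exception:
            g = Graph()                        # reset, try the next format
    else:
        raise ValueError(f"could not parse {path} with any known format")
    # An rdflib Graph is a *set* of triples, so exact duplicates are already
    # gone; sorting the N-Triples lines yields a stable, canonical output.
    lines = g.serialize(format="nt").splitlines()
    return "\n".join(sorted(line for line in lines if line.strip()))
```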

113 citations


Journal ArticleDOI
13 May 2014
TL;DR: An overview of the LDBC project, including its goals and organization, is presented, introducing so-called "choke-point" based benchmark development, through which experts identify key technical challenges and introduce them into the benchmark workload.
Abstract: The Linked Data Benchmark Council (LDBC) is an EU project that aims to develop industry-strength benchmarks for graph and RDF data management systems. It includes the creation of a non-profit LDBC organization, where industry players and academia come together for managing the development of benchmarks as well as auditing and publishing official results. We present an overview of the project including its goals and organization, and describe its process and design methodology for benchmark development. We introduce so-called "choke-point" based benchmark development through which experts identify key technical challenges, and introduce them in the benchmark workload. Finally, we present the status of two benchmarks currently in development, one targeting graph data management systems using a social network data case, and the other targeting RDF systems using a data publishing case.

86 citations


Journal ArticleDOI
TL;DR: This paper provides an overview of the development of privacy-preserving data publishing, restricted to the scope of anonymity algorithms using generalization and suppression, and introduces the privacy-preserving attack models.
Abstract: Nowadays, information sharing appears as an indispensable part of our lives, bringing about a mass of discussion about methods and techniques of privacy-preserving data publishing, which are regarded as a strong guarantee against information disclosure and for the protection of individuals' privacy. Recent work focuses on proposing different anonymity algorithms for varying data publishing scenarios that satisfy privacy requirements while preserving data utility. K-anonymity has been proposed for privacy-preserving data publishing; it can prevent linkage attacks by means of anonymity operations such as generalization and suppression. Numerous anonymity algorithms have been utilized to achieve k-anonymity. This paper provides an overview of the development of privacy-preserving data publishing, restricted to the scope of anonymity algorithms using generalization and suppression. The privacy-preserving attack models are introduced first, followed by an overview of several anonymity operations. The most important part is the coverage of anonymity algorithms and information metrics, which are essential ingredients of these algorithms. Conclusions and perspectives are given last.
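
For readers new to the area, the k-anonymity condition these algorithms enforce is easy to state and check. The minimal sketch below, on an illustrative three-record table, also shows generalization restoring an anonymity level that exact values violate.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """A table is k-anonymous if every quasi-identifier combination
    is shared by at least k records."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"age": 34, "zip": "47677", "disease": "flu"},
    {"age": 36, "zip": "47677", "disease": "cold"},
    {"age": 35, "zip": "47677", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))   # False: exact ages are unique

for r in rows:
    decade = r["age"] // 10 * 10
    r["age"] = f"{decade}-{decade + 9}"            # generalize: 34 -> "30-39"
print(is_k_anonymous(rows, ["age", "zip"], k=2))   # True: one class of size 3
```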

76 citations


Journal ArticleDOI
TL;DR: A two-party algorithm for differentially private data release for vertically partitioned data between two parties in the semihonest adversary model is presented, and experimental results on real-life data suggest that the proposed algorithm can effectively preserve information for a data mining task.
Abstract: Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among the existing privacy models, ϵ-differential privacy provides one of the strongest privacy guarantees. In this paper, we address the problem of private data publishing, where different attributes for the same set of individuals are held by two parties. In particular, we present an algorithm for differentially private data release for vertically partitioned data between two parties in the semihonest adversary model. To achieve this, we first present a two-party protocol for the exponential mechanism. This protocol can be used as a subprotocol by any other algorithm that requires the exponential mechanism in a distributed setting. Furthermore, we propose a two-party algorithm that releases differentially private data in a secure way according to the definition of secure multiparty computation. Experimental results on real-life data suggest that the proposed algorithm can effectively preserve information for a data mining task.
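
The exponential mechanism at the heart of the paper's subprotocol is simple to state in the single-party setting; the sketch below shows that baseline (the paper's actual contribution, the secure two-party version, is not reproduced here).

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity=1.0):
    """Sample candidate c with probability proportional to
    exp(eps * score(c) / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    m = max(scores)                            # shift scores for numerical stability
    weights = [math.exp(eps * (s - m) / (2 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: privately pick the most frequent value in a column. One record
# changes any count by at most 1, so the score has sensitivity 1.
column = ["a", "a", "a", "b", "c"]
winner = exponential_mechanism(sorted(set(column)), column.count, eps=1.0)
print(winner)   # usually "a", occasionally "b" or "c" -- that is the privacy
```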

73 citations


Journal ArticleDOI
TL;DR: This paper considers the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers and introduces the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers.
Abstract: In this paper, we consider the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. We consider a new type of "insider attack" by colluding data providers who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The paper addresses this new threat, and makes several contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of the anonymized data efficiently. Finally, we implement the m-privacy anonymization and verification algorithms with a trusted third party (TTP), and propose secure multiparty computation protocols for scenarios without a TTP. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves utility and efficiency better than or comparable to existing and baseline algorithms while satisfying m-privacy.
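
A brute-force version of the m-privacy check makes the definition concrete. In this hedged sketch, plain k-anonymity stands in for the generic privacy constraint, and the exhaustive coalition loop is exactly the cost the paper's monotonicity-based heuristics avoid.

```python
from collections import Counter
from itertools import combinations

def is_k_anonymous(records, k):
    groups = Counter(r["qi"] for r in records)
    return all(size >= k for size in groups.values())

def is_m_private(records, providers, m, k):
    """The release must stay k-anonymous even after any coalition of up to
    m providers removes the records it contributed itself."""
    for size in range(m + 1):
        for coalition in combinations(providers, size):
            remaining = [r for r in records if r["provider"] not in coalition]
            if not is_k_anonymous(remaining, k):
                return False
    return True

records = [
    {"provider": "P1", "qi": ("30-39", "476**")},
    {"provider": "P2", "qi": ("30-39", "476**")},
    {"provider": "P2", "qi": ("40-49", "476**")},
    {"provider": "P3", "qi": ("40-49", "476**")},
]
# False: removing one provider's records leaves a lone, exposed record.
print(is_m_private(records, ["P1", "P2", "P3"], m=1, k=2))
```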

68 citations


Journal ArticleDOI
TL;DR: This paper presents the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention, and shows how to extend the approach to different privacy models and anti-discrimination legal concepts.
Abstract: Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also potential discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.
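
The paper defines its own formal discrimination measures; purely for illustration, one simple check on a published dataset is the disparate-impact ratio, with the common "four-fifths" threshold of 0.8 as a conventional reference point. Names and numbers below are made up.

```python
def disparate_impact(rows, protected_attr, outcome_attr):
    """P(positive outcome | protected) / P(positive outcome | unprotected)."""
    def positive_rate(group):
        outcomes = [r[outcome_attr] for r in rows if r[protected_attr] == group]
        return sum(outcomes) / len(outcomes)
    return positive_rate(True) / positive_rate(False)

rows = [
    {"protected": True,  "hired": 1}, {"protected": True,  "hired": 0},
    {"protected": True,  "hired": 0}, {"protected": False, "hired": 1},
    {"protected": False, "hired": 1}, {"protected": False, "hired": 0},
]
# Hiring rates 1/3 vs 2/3 give a ratio of 0.5, well under the 0.8 threshold.
print(disparate_impact(rows, "protected", "hired"))
```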

66 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the “publish” and “push” models of data dissemination need not be mutually exclusive; on the contrary, they can play complementary roles in sharing high quality data in support of research.
Abstract: We present a case study of data integration and reuse involving 12 researchers who published datasets in Open Context, an online data publishing platform, as part of collaborative archaeological research on early domesticated animals in Anatolia. Our discussion reports on how different editorial and collaborative review processes improved data documentation and quality, and created ontology annotations needed for comparative analyses by domain specialists. To prepare data for shared analysis, this project adapted editor-supervised review and revision processes familiar from conventional publishing, as well as more novel models of revision adapted from open-source software development and public version control. Preparing the datasets for publication and analysis required significant investment of effort and expertise, including archaeological domain knowledge and familiarity with key ontologies. To organize this work effectively, we emphasized these different models of collaboration at various stages of this data publication and analysis project. Collaboration first centered on data editors working with data contributors, then widened to include other researchers who provided additional peer-review feedback, and finally the widest research community, whose collaboration is facilitated by GitHub’s version control system. We demonstrate that the “publish” and “push” models of data dissemination need not be mutually exclusive; on the contrary, they can play complementary roles in sharing high quality data in support of research. This work highlights the value of combining multiple models in different stages of data dissemination.

Journal Article
TL;DR: A systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resource usage) and their effectiveness (in terms of data utility) demonstrates, through experimental evaluation, the conditions in which one algorithm outperforms the others for a particular metric, depending on the input dataset and privacy requirements.
Abstract: The vast amount of data being collected about individuals has brought new challenges in protecting their privacy when this data is disseminated. As a result, Privacy-Preserving Data Publishing has become an active research area, in which multiple anonymization algorithms have been proposed. However, given the large number of algorithms available and limited information regarding their performance, it is difficult to identify and select the most appropriate algorithm given a particular publishing scenario, especially for practitioners. In this paper, we perform a systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resource usage) and their effectiveness (in terms of data utility). We extend the scope of their original evaluation by employing a more comprehensive set of scenarios: different parameters, metrics and datasets. Using publicly available implementations of those algorithms, we conduct a series of experiments and a comprehensive analysis to identify the factors that influence their performance, in order to guide practitioners in the selection of an algorithm. We demonstrate through experimental evaluation the conditions in which one algorithm outperforms the others for a particular metric, depending on the input dataset and privacy requirements. Our findings motivate the necessity of creating methodologies that provide recommendations about the best algorithm given a particular publishing scenario.
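
As a taste of the effectiveness side of such a comparison, the sketch below computes average equivalence class size, one standard data-utility metric for k-anonymizations (illustrative; we do not claim it is exactly the metric set the authors used).

```python
from collections import Counter

def c_avg(quasi_id_tuples, k):
    """Average equivalence class size, C_avg = (n / number_of_classes) / k.
    Values near 1 mean classes barely larger than the required minimum k."""
    classes = Counter(quasi_id_tuples)
    return (len(quasi_id_tuples) / len(classes)) / k

release = [("30-39", "476**")] * 4 + [("40-49", "476**")] * 2
print(c_avg(release, k=2))   # (6 / 2) / 2 = 1.5
```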

Book ChapterDOI
25 May 2014
TL;DR: The “Linked Data Finland” platform LDF.fi is introduced, extending Tim Berners-Lee's famous 5-star model with a sixth star for providing the dataset with a schema that explains it, and a seventh star for validating the data against the schema.
Abstract: The idea of Linked Data is to aggregate, harmonize, integrate, enrich, and publish data for re-use on the Web in a cost-efficient way using Semantic Web technologies. We address two major hindrances to re-using Linked Data: it is often difficult for a re-user to (1) understand the characteristics of the dataset and (2) evaluate the quality of the data for the intended purpose. This paper introduces the “Linked Data Finland” platform LDF.fi, which addresses these issues. We extend Tim Berners-Lee's famous 5-star model with a sixth star for providing the dataset with a schema that explains it, and a seventh star for validating the data against that schema. LDF.fi also automates data publishing and provides data curation tools. The first prototype of the platform is available on the web as a service, hosting tens of datasets and supporting several applications.
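
The seventh star, validating data against a published schema, can be mechanized. The sketch below uses SHACL via pySHACL as one possible toolchain (the paper does not prescribe this stack, and the file names are hypothetical placeholders).

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical files: the dataset and a SHACL shapes graph describing it.
data = Graph().parse("dataset.ttl", format="turtle")
shapes = Graph().parse("schema-shapes.ttl", format="turtle")

# pySHACL returns (conforms, results_graph, results_text).
conforms, _, report_text = validate(data, shacl_graph=shapes)
print("schema-valid (7th star)" if conforms else report_text)
```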

Proceedings ArticleDOI
30 Jan 2014
TL;DR: It is shown that under Hamming distortions, the differential privacy risk is lower bounded for all nontrivial distortions, and that the lower bound grows logarithmically in the alphabet size.
Abstract: Local differential privacy is a model for privacy in which an untrusted statistician collects data from individuals who mask their data before revealing it. While randomized response has been shown to be a good strategy when the statistician's goal is to estimate a parameter of the population, we consider instead the problem of locally private data publishing, in which the data collector must publish a version of the data it has collected. We model utility by a distortion measure and consider privacy mechanisms that act via a memoryless channel operating on the data. If we take the source distribution to be unknown but in a class of distributions, we arrive at a robust rate-distortion model for the privacy-distortion tradeoff. We show that under Hamming distortions, the differential privacy risk is lower bounded for all nontrivial distortions, and that the lower bound grows logarithmically in the alphabet size.
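
Randomized response, which the paper takes as its point of comparison, is the canonical local-privacy mechanism and fits in a few lines; the sketch also shows the standard unbiased correction the statistician applies to the masked reports.

```python
import math
import random

def randomized_response(bit, eps):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    This satisfies eps-local differential privacy."""
    p_truth = math.exp(eps) / (1 + math.exp(eps))
    return bit if random.random() < p_truth else 1 - bit

def estimate_mean(reports, eps):
    # Unbiased correction: E[report] = mean*(2p - 1) + (1 - p); solve for mean.
    p = math.exp(eps) / (1 + math.exp(eps))
    return (sum(reports) / len(reports) + p - 1) / (2 * p - 1)

true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b, eps=1.0) for b in true_bits]
print(estimate_mean(reports, eps=1.0))   # close to the true mean 0.3
```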

Journal ArticleDOI
TL;DR: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.

Journal ArticleDOI
TL;DR: It is concluded that random perturbation cannot achieve meaningful levels of anonymity without degrading the graph features.

Journal ArticleDOI
TL;DR: It is shown that PPFSCADA effectively deals with multivariate traffic attributes, producing results comparable to the original data, substantially improving the performance of the five supervised approaches, and providing a high level of privacy protection.

Book ChapterDOI
15 Sep 2014
TL;DR: This work designs strategies and algorithms inspired by or based on Fréchet bounds attacks, attribute inference attacks, and minimality attacks for the purpose of unveiling hidden discriminatory practices, and shows that they can be effective tools in the hands of anti-discrimination authorities.
Abstract: Social discrimination discovery from data is an important task to identify illegal and unethical discriminatory patterns towards protected-by-law groups, e.g., ethnic minorities. We deploy privacy attack strategies as tools for discrimination discovery under hard assumptions that have rarely been tackled in the literature: indirect discrimination discovery, privacy-aware discrimination discovery, and discrimination data recovery. The intuition comes from the intriguing parallel between the role of the anti-discrimination authority in the three scenarios above and the role of an attacker in private data publishing. We design strategies and algorithms inspired by or based on Fréchet bounds attacks, attribute inference attacks, and minimality attacks for the purpose of unveiling hidden discriminatory practices. Experimental results show that they can be effective tools in the hands of anti-discrimination authorities.
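
One of the borrowed tools, Fréchet bounds, is simple enough to show directly: for a 2x2 contingency table with known margins, each cell count is provably squeezed into an interval, so even a suppressed discriminatory count can be cornered. A minimal sketch with made-up numbers:

```python
def frechet_bounds(row_total, col_total, n):
    """Bounds on one cell of a 2x2 table with margins row_total, col_total
    and grand total n: max(0, row + col - n) <= cell <= min(row, col)."""
    return max(0, row_total + col_total - n), min(row_total, col_total)

# 100 decisions; 40 involve the protected group; 70 are denials.
low, high = frechet_bounds(row_total=40, col_total=70, n=100)
print(low, high)   # 10 40: at least 10 protected-group denials are certain
```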

Journal ArticleDOI
TL;DR: A new anonymization algorithm is presented that, given minimal anonymity and diversity parameters along with an information loss measure, issues corresponding non-homogeneous anonymizations where the sensitive attribute is published as frequency distributions over the sensitive domain rather than in the usual form of exact sensitive values.

Book ChapterDOI
06 May 2014
TL;DR: This work shows that feature selection can be used to preserve the privacy of individuals without compromising the accuracy of data classification, and that when feature selection is combined with anonymization techniques, privacy-preserving datasets can be published.
Abstract: In this work we show that feature selection can be used to preserve the privacy of individuals without compromising the accuracy of data classification. Furthermore, when feature selection is combined with anonymization techniques, we are able to publish privacy-preserving datasets. We use several UCI datasets to empirically support our claim. The obtained results show that these privacy-preserving datasets provide classification accuracy comparable, and in some cases superior, to the accuracy of classification on the original datasets. We generalize the results with a paired t-test applied across different levels of anonymization.
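
A sketch of the experimental recipe, using scikit-learn and one UCI dataset (the specific dataset, selector, and k=10 are illustrative choices, not necessarily the authors'): does dropping most attributes, which removes potentially identifying detail, hurt classification accuracy?

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # a UCI dataset, 30 features
clf = DecisionTreeClassifier(random_state=0)

# Accuracy with all features vs. only the 10 most informative ones.
acc_full = cross_val_score(clf, X, y, cv=5).mean()
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
acc_sel = cross_val_score(clf, X_sel, y, cv=5).mean()
print(f"all 30 features: {acc_full:.3f}   top 10 features: {acc_sel:.3f}")
```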


Journal Article
TL;DR: This paper aims to offer a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis of privacy preserving technologies.
Abstract: This paper surveys the state of the art of privacy preserving for publishing social network data. First, the research background of privacy preserving for social network data is presented. Then, four important aspects of privacy preserving for social network data are summarized and analyzed in detail. The discussed topics include privacy of social networks, adversaries' background knowledge, privacy-preserving technologies for social networks, and data utility and experimental analysis. In addition, this paper points out the defects of privacy preserving in social networks and provides comparisons of privacy-preserving technologies. Finally, some potential future research directions are introduced. This paper aims to offer a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis.

Journal ArticleDOI
TL;DR: The drivers behind the formation of the Research Data Alliance (RDA), its current state, the lessons learned from its first full year of operation, and its anticipated impact on data publishing and sharing are discussed.
Abstract: This article discusses the drivers behind the formation of the Research Data Alliance (RDA), its current state, the lessons learned from its first full year of operation, and its anticipated impact on data publishing and sharing. One of the pressing challenges in data infrastructure (taken here to include issues relating to hardware, software and content format, as well as human actors) is how best to enable data interoperability across boundaries. This is particularly critical as the world deals with bigger and more complex problems that require data and insights from a range of disciplines. The RDA has been set up to enable more data to be shared across barriers to address these challenges. It does this through focused Working Groups and Interest Groups, formed of experts from around the world, and drawing from the academic, industry, and government sectors. Context: The amount of activity dealing with the importance of data to research has increased perceptibly over the last five years. This includes conferences specifically focused on research data issues [1-3], data-focused tracks at discipline conferences (too many to cite), national reports [4,5], funder requirements [6-10], special issues of journals [11,12], and new journals altogether. Those wishing to read further about some of these issues can consult a selective bibliography dealing with publications about this space [13]. The foci for this activity are quite diverse: researcher behaviour, incentives and rewards, changes in the ecology of scholarly communication, technical issues, and the challenges of building and operating data infrastructure. Role of data infrastructure: This paper will focus specifically on data infrastructure, interpreted broadly: hardware (storage and associated computer hardware), software, content and format standards, and human actors. As the bulk of the data needed by and generated by researchers is increasingly managed electronically, the role of this data infrastructure is becoming critical. One of the pressing challenges in data infrastructure is how best to enable data interoperability across boundaries. These boundaries include those between countries, between disciplines, and between producers of research data and the consumers of those data. A new organization, the Research Data Alliance (RDA), has been brought into existence specifically to address those boundaries from an infrastructure perspective.

Journal ArticleDOI
TL;DR: It is argued that the trust assumption on the central server is far too strong, and Met𝔸P, a generic fully distributed protocol, is proposed to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure.
Abstract: The goal of Privacy-Preserving Data Publishing (PPDP) is to generate a sanitized (i.e. harmless) view of sensitive personal data (e.g. a health survey), to be released to some agencies or simply the public. However, traditional PPDP practices all make the assumption that the process is run on a trusted central server. In this article, we argue that the trust assumption on the central server is far too strong. We propose Met𝔸P, a generic fully distributed protocol, to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure. We show that this protocol is both correct and secure against honest-but-curious or malicious adversaries. Finally, we provide an experimental validation showing that this protocol can support PPDP processes scaling up to nation-wide surveys.

Book ChapterDOI
07 Jul 2014
TL;DR: A novel empirical risk model for privacy is proposed which, in relation to the cost of privacy attacks, better reflects the practical risks associated with a privacy-preserving data release.
Abstract: Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, better reflects the practical risks associated with a privacy-preserving data release. We show a detailed evaluation of the proposed risk model using k-anonymised real-world mobility data.
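
The gap the paper highlights can be seen in a toy computation: the worst-case risk of a k-anonymized release is driven by the smallest equivalence class, while the empirical expectation averages over the classes records actually fall into. A simplified sketch (not the paper's full, attack-cost-aware model):

```python
from collections import Counter

def risks(equivalence_keys):
    """Worst-case vs. empirical (expected) re-identification risk."""
    sizes = Counter(equivalence_keys)
    worst_case = 1 / min(sizes.values())
    empirical = sum(1 / sizes[k] for k in equivalence_keys) / len(equivalence_keys)
    return worst_case, empirical

# Ten generalized trajectories falling into classes of sizes 4, 2 and 4.
keys = ["A"] * 4 + ["B"] * 2 + ["C"] * 4
print(risks(keys))   # (0.5, 0.3): the worst case overstates the typical risk
```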

Journal ArticleDOI
TL;DR: It is shown that when k-anonymization is preceded by feature selection, it is possible to obtain a contingency table with higher counts, and when noise is added to satisfy differential privacy, its distorting effect is minimized and high utility of the data is preserved.
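
A small numerical sketch of the underlying effect (sizes and budget are illustrative): spreading the same records over fewer contingency-table cells makes the fixed-scale Laplace noise required for ε-differential privacy relatively smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
laplace = lambda shape: rng.laplace(scale=1.0 / eps, size=shape)  # sensitivity 1

tables = {
    "64 cells (many features)": np.full(64, 2.0),   # 128 records, thin cells
    "4 cells (few features)": np.full(4, 32.0),     # same records, fat cells
}
for name, table in tables.items():
    noisy = np.clip(table + laplace(table.shape), 0, None)
    print(name, "relative error:", round(abs(noisy - table).sum() / table.sum(), 2))
```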

Proceedings ArticleDOI
13 Jul 2014
TL;DR: This paper proposes a privacy-preserving data publishing method, namely MNSACM, that uses the ideas of clustering and Multi-Sensitive Bucketization (MSB) to publish microdata with multiple numerical sensitive attributes, and shows the effectiveness of this method in protecting the privacy of multiple numerical sensitive attributes.
Abstract: Anonymized data publication has received considerable attention from the research community in recent years. For numerical sensitive attributes, most of the existing privacy preserving data publishing techniques concentrate on microdata with multiple categorical sensitive attributes or only one numerical sensitive attribute. However, many real-world applications may contain multiple numerical sensitive attributes. Directly applying the existing single-numerical-sensitive-attribute and multiple-categorical-sensitive-attributes privacy preserving techniques often causes unexpected private information disclosure. They are particularly prone to the proximity breach, a privacy threat specific to numerical sensitive attributes in data publication. In this paper we propose a privacy-preserving data publishing method, namely MNSACM, that uses the ideas of clustering and Multi-Sensitive Bucketization (MSB) to publish microdata with multiple numerical sensitive attributes. Through an example, we show the effectiveness of this method in protecting the privacy of multiple numerical sensitive attributes.

Proceedings Article
Robert Forkel
01 Jan 2014
TL;DR: The Cross-Linguistic Linked Data project helps record the world’s language diversity heritage by establishing an interoperable data publishing infrastructure with an emphasis on the datasets that are published within the project.
Abstract: The Cross-Linguistic Linked Data project (CLLD – http://clld.org) helps record the world’s language diversity heritage by establishing an interoperable data publishing infrastructure. I describe the project and the environment it operates in, with an emphasis on the datasets that are published within the project. The publishing infrastructure is built upon a custom software stack – the clld framework – which is described next. I then proceed to explain how Linked Data plays an important role in the strategy regarding interoperability and sustainability. Finally I gauge the impact the project may have on its environment.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: By embedding the mechanism of timed-release encryption into CP-ABE (Ciphertext-Policy Attribute-based Encryption), this paper proposes TAFC: a new time and attribute factors combined access control on time-sensitive data stored in cloud.
Abstract: The new paradigm of outsourcing data to the cloud is a double-edged sword. On one side, it frees up data owners from the technical management, and it is easier for data owners to share their data with intended recipients when data are stored in the cloud. On the other side, it brings about new challenges for privacy and security protection. To protect data confidentiality against the honest-but-curious cloud service provider, numerous works have been proposed to support fine-grained data access control. However, to date, no efficient scheme provides fine-grained access control together with the capacity for time-sensitive data publishing. In this paper, by embedding the mechanism of timed-release encryption into CP-ABE (Ciphertext-Policy Attribute-based Encryption), we propose TAFC: a new time and attribute factors combined access control on time-sensitive data stored in the cloud. Extensive security and performance analysis shows that our proposed scheme is highly efficient and satisfies the security requirements for time-sensitive data storage in the public cloud.

Proceedings ArticleDOI
27 Jun 2014
TL;DR: This paper revisits the basic activities related to (Web) data services that are impacted by uncertainty, including the service description, invocation and composition, and proposes a probabilistic approach to deal with uncertainty in all of these activities.
Abstract: Recent years have witnessed a growing interest in using Web Services as a powerful means for data publishing and sharing on top of the Web. This class of services is commonly known as DaaS (Data-as-a-Service), or also data services. The data returned by a data service is often subject to uncertainty for various reasons (e.g., privacy constraints, unreliable data collection instruments, etc.). In this paper, we revisit the basic activities related to (Web) data services that are impacted by uncertainty, including service description, invocation, and composition. We propose a probabilistic approach to deal with uncertainty in all of these activities.
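
A common way to make such uncertainty concrete is possible-worlds semantics over tuple-level probabilities; a minimal sketch follows (the paper's probabilistic model may differ in detail): each returned tuple exists independently with some probability, and an answer's probability is the total weight of the worlds in which it holds.

```python
# A data service result as (value, P(tuple exists)) pairs.
service_result = [("alice", 0.9), ("bob", 0.4)]

def prob_nonempty(result):
    """P(the query returns at least one tuple), assuming tuple independence."""
    p_none = 1.0
    for _, p in result:
        p_none *= 1 - p
    return 1 - p_none

print(prob_nonempty(service_result))   # 1 - 0.1 * 0.6 = 0.94
```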