
Showing papers on "Data publishing" published in 2014


Journal ArticleDOI
Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, Yong Ren
TL;DR: This paper identifies four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker, and examines various approaches that can help to protect sensitive information.
Abstract: The growing popularity and development of data mining technologies bring serious threats to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact, unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., the data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss their privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game-theoretical approaches proposed for analyzing the interactions among different users in a data mining scenario, each of whom has their own valuation of the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.

528 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper presents PrivBayes, a differentially private method for releasing high-dimensional data that circumvents the curse of dimensionality, and introduces a novel approach that uses a surrogate function for mutual information to build the model more accurately.
Abstract: Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
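
The flavor of the release step can be conveyed in a few lines. Below is a minimal sketch assuming a toy binary table and a hand-fixed chain network A -> B -> C; PrivBayes learns the network structure privately via a surrogate for mutual information, which this sketch skips, and the budget split and noise scale here are illustrative rather than the paper's exact calibration.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 3))   # toy table: binary attributes A, B, C

eps = 1.0          # total privacy budget
eps_m = eps / 2    # split across the two 2-way marginals of the chain A -> B -> C

def noisy_marginal(cols):
    """Noisy, normalized joint distribution over two binary columns."""
    counts = Counter(map(tuple, data[:, cols]))
    table = np.array([[counts.get((a, b), 0) for b in (0, 1)] for a in (0, 1)],
                     dtype=float)
    # Laplace noise on the counts; scale is illustrative, not the paper's.
    table = np.clip(table + rng.laplace(scale=2.0 / eps_m, size=table.shape), 0, None)
    return table / table.sum()

p_ab = noisy_marginal([0, 1])   # approximates P(A, B)
p_bc = noisy_marginal([1, 2])   # approximates P(B, C)

def sample_row():
    a = rng.choice(2, p=p_ab.sum(axis=1))           # A ~ P(A)
    b = rng.choice(2, p=p_ab[a] / p_ab[a].sum())    # B ~ P(B | A)
    c = rng.choice(2, p=p_bc[b] / p_bc[b].sum())    # C ~ P(C | B)
    return a, b, c

synthetic = [sample_row() for _ in range(len(data))]   # the data actually released
```

Even in this toy, the key idea survives: noise lands on 2-dimensional tables whose counts are large, not on the full joint distribution whose cells would be mostly empty.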

433 citations


Journal ArticleDOI
06 Aug 2014-PLOS ONE
TL;DR: The key need for the IPT is discussed, along with how it has developed in response to community input and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records.
Abstract: The planet is experiencing an ongoing global biodiversity crisis. Measuring the magnitude and rate of change more effectively requires access to organized, easily discoverable, and digitally-formatted biodiversity data, both legacy and new, from across the globe. Assembling this coherent digital representation of biodiversity requires the integration of data that have historically been analog, dispersed, and heterogeneous. The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT’s two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format. Here we discuss the key need for the IPT, how it has developed in response to community input, and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records. We close with a discussion of how the IPT has impacted the biodiversity research community, how it enhances data publishing in more traditional journal venues, new features implemented in the latest version of the IPT, and future plans for further enhancements.

198 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: This work quantitatively shows the conditions for perfect and (1-ε)-perfect structural data DA of 26 real-world structural datasets, including Social Networks, Collaboration Networks, Communication Networks, Autonomous Systems, and Peer-to-Peer networks, and designs a practical and novel single-phase cold-start Optimization-based DA (ODA) algorithm.
Abstract: In this paper, we study the quantification, practice, and implications of structural data (e.g., social data, mobility traces) De-Anonymization (DA). First, we address several open problems in structural data DA by quantifying perfect and (1-ε)-perfect structural data DA, where ε is the error tolerated by a DA scheme. To the best of our knowledge, this is the first work on quantifying structural data DA under a general data model, which closes the gap between structural data DA practice and theory. Second, we conduct the first large-scale study on the de-anonymizability of 26 real-world structural datasets, including Social Networks (SNs), Collaboration Networks, Communication Networks, Autonomous Systems, and Peer-to-Peer networks. We also quantitatively show the conditions for perfect and (1-ε)-perfect DA of the 26 datasets. Third, following our quantification, we design a practical and novel single-phase cold-start Optimization-based DA (ODA) algorithm. Experimental analysis of ODA shows that about 77.7% - 83.3% of the users in Gowalla (0.2M users and 1M edges) and 86.9% - 95.5% of the users in Google+ (4.7M users and 90.8M edges) are de-anonymizable in different scenarios, which implies optimization-based DA is implementable and powerful in practice. Finally, we discuss the implications of our DA quantification and ODA and provide some general suggestions for future secure data publishing.
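
The paper's ODA algorithm is considerably more sophisticated, but the shape of optimization-based structural DA can be conveyed with a toy sketch: match nodes of the anonymized graph against an auxiliary graph by greedily minimizing a structural-distance objective, here built only from degrees and sorted neighbor-degree profiles (an illustrative choice, not the paper's objective).

```python
import networkx as nx

# Toy flavor of optimization-based structural de-anonymization (NOT the
# paper's ODA algorithm): greedily pair nodes across the anonymized and
# auxiliary graphs by minimizing a degree-profile distance.

def profile(g, v):
    return g.degree(v), sorted(g.degree(u) for u in g.neighbors(v))

def distance(p, q):
    a, b = p[1], q[1]
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return abs(p[0] - q[0]) + sum(abs(x - y) for x, y in zip(a, b))

def greedy_deanonymize(g_anon, g_aux):
    pairs = sorted((distance(profile(g_anon, u), profile(g_aux, v)), u, v)
                   for u in g_anon for v in g_aux)
    mapping, used = {}, set()
    for _, u, v in pairs:                      # cheapest candidate pairs first
        if u not in mapping and v not in used:
            mapping[u] = v
            used.add(v)
    return mapping

# Sanity check: de-anonymize a small graph against a lightly perturbed copy.
g = nx.karate_club_graph()
g_noisy = nx.double_edge_swap(g.copy(), nswap=3, max_tries=500)
hits = sum(u == v for u, v in greedy_deanonymize(g, g_noisy).items())
print(f"{hits}/{g.number_of_nodes()} nodes correctly re-identified")
```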

122 citations


Book ChapterDOI
19 Oct 2014
TL;DR: The LOD Laundromat is presented, which removes stains from data without any human intervention and is able to make very large amounts of LOD more easily available for further processing right now.
Abstract: It is widely accepted that proper data publishing is difficult. The majority of Linked Open Data (LOD) does not meet even a core set of data publishing guidelines. Moreover, datasets that are clean at creation can get stains over time. As a result, the LOD cloud now contains a high level of dirty data that is difficult for humans to clean and for machines to process. Existing solutions for cleaning data (standards, guidelines, tools) are targeted towards human data creators, who can (and do) choose not to use them. This paper presents the LOD Laundromat, which removes stains from data without any human intervention. This fully automated approach is able to make very large amounts of LOD more easily available for further processing right now. LOD Laundromat is not a new dataset, but rather a uniform point of entry to a collection of cleaned siblings of existing datasets. It provides researchers and application developers a wealth of data that is guaranteed to conform to a specified set of best practices, thereby greatly improving the chance of data actually being (re)used.
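
As a hedged illustration of the kind of mechanical cleaning step such a service automates (the real Laundromat does far more: crawling, archiving, metadata, streaming access), here is a sketch using rdflib, assuming rdflib 6+ where serialize() returns a string:

```python
from rdflib import Graph

def launder(path):
    """Re-parse a possibly messy RDF file and emit canonical N-Triples."""
    g = Graph()
    for fmt in ("turtle", "xml", "nt"):        # guess the serialization
        try:
            g.parse(path, format=fmt)
            break
        except Exception:
            g = Graph()                        # reset, try the next format
    else:
        raise ValueError(f"could not parse {path} with any known format")
    # An rdflib Graph is a *set* of triples, so exact duplicates are already
    # gone; sorting the N-Triples lines yields a stable, canonical output.
    lines = g.serialize(format="nt").splitlines()
    return "\n".join(sorted(line for line in lines if line.strip()))
```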

113 citations


Journal ArticleDOI
13 May 2014
TL;DR: An overview of the LDBC project, including its goals and organization, is presented, introducing so-called "choke-point" based benchmark development, through which experts identify key technical challenges and introduce them into the benchmark workload.
Abstract: The Linked Data Benchmark Council (LDBC) is an EU project that aims to develop industry-strength benchmarks for graph and RDF data management systems. It includes the creation of a non-profit LDBC organization, where industry players and academia come together for managing the development of benchmarks as well as auditing and publishing official results. We present an overview of the project including its goals and organization, and describe its process and design methodology for benchmark development. We introduce so-called "choke-point" based benchmark development through which experts identify key technical challenges, and introduce them in the benchmark workload. Finally, we present the status of two benchmarks currently in development, one targeting graph data management systems using a social network data case, and the other targeting RDF systems using a data publishing case.

86 citations


Journal ArticleDOI
TL;DR: This paper provides an overview of the development of privacy-preserving data publishing, restricted to the scope of anonymity algorithms using generalization and suppression, and introduces the privacy-preserving attack models.
Abstract: Nowadays, information sharing appears as an indispensable part of our lives, bringing about a mass of discussion about methods and techniques of privacy-preserving data publishing, which are regarded as a strong guarantee against information disclosure and for the protection of individuals' privacy. Recent work focuses on proposing different anonymity algorithms for varying data publishing scenarios that satisfy privacy requirements while preserving data utility. K-anonymity has been proposed for privacy-preserving data publishing; it can prevent linkage attacks by means of anonymity operations such as generalization and suppression. Numerous anonymity algorithms have been utilized to achieve k-anonymity. This paper provides an overview of the development of privacy-preserving data publishing, restricted to the scope of anonymity algorithms using generalization and suppression. The privacy-preserving attack models are introduced first, followed by an overview of several anonymity operations. The most important part is the coverage of anonymity algorithms and information metrics, which are essential ingredients of these algorithms. Conclusions and perspectives are given last.
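
For readers new to the area, the k-anonymity condition these algorithms enforce is easy to state and check. The minimal sketch below, on an illustrative three-record table, also shows generalization restoring an anonymity level that exact values violate.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """A table is k-anonymous if every quasi-identifier combination
    is shared by at least k records."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"age": 34, "zip": "47677", "disease": "flu"},
    {"age": 36, "zip": "47677", "disease": "cold"},
    {"age": 35, "zip": "47677", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))   # False: exact ages are unique

for r in rows:
    decade = r["age"] // 10 * 10
    r["age"] = f"{decade}-{decade + 9}"            # generalize: 34 -> "30-39"
print(is_k_anonymous(rows, ["age", "zip"], k=2))   # True: one class of size 3
```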

76 citations


Journal ArticleDOI
TL;DR: A two-party algorithm for differentially private data release for vertically partitioned data between two parties in the semihonest adversary model is presented, and experimental results on real-life data suggest that the proposed algorithm can effectively preserve information for a data mining task.
Abstract: Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among the existing privacy models, ϵ-differential privacy provides one of the strongest privacy guarantees. In this paper, we address the problem of private data publishing, where different attributes for the same set of individuals are held by two parties. In particular, we present an algorithm for differentially private data release for vertically partitioned data between two parties in the semihonest adversary model. To achieve this, we first present a two-party protocol for the exponential mechanism. This protocol can be used as a subprotocol by any other algorithm that requires the exponential mechanism in a distributed setting. Furthermore, we propose a two-party algorithm that releases differentially private data in a secure way according to the definition of secure multiparty computation. Experimental results on real-life data suggest that the proposed algorithm can effectively preserve information for a data mining task.
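
The exponential mechanism at the heart of the paper's subprotocol is simple to state in the single-party setting; the sketch below shows that baseline (the paper's actual contribution, the secure two-party version, is not reproduced here).

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity=1.0):
    """Sample candidate c with probability proportional to
    exp(eps * score(c) / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    m = max(scores)                            # shift scores for numerical stability
    weights = [math.exp(eps * (s - m) / (2 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: privately pick the most frequent value in a column. One record
# changes any count by at most 1, so the score has sensitivity 1.
column = ["a", "a", "a", "b", "c"]
winner = exponential_mechanism(sorted(set(column)), column.count, eps=1.0)
print(winner)   # usually "a", occasionally "b" or "c" -- that is the privacy
```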

73 citations


Journal ArticleDOI
TL;DR: This paper considers the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers and introduces the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers.
Abstract: In this paper, we consider the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. We consider a new type of "insider attack" by colluding data providers who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The paper addresses this new threat, and makes several contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of the anonymized data efficiently. Finally, we implement the m-privacy anonymization and verification algorithms with a trusted third party (TTP), and propose secure multiparty computation protocols for scenarios without a TTP. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves utility and efficiency better than or comparable to existing and baseline algorithms while satisfying m-privacy.
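
A brute-force version of the m-privacy check makes the definition concrete. In this hedged sketch, plain k-anonymity stands in for the generic privacy constraint, and the exhaustive coalition loop is exactly the cost the paper's monotonicity-based heuristics avoid.

```python
from collections import Counter
from itertools import combinations

def is_k_anonymous(records, k):
    groups = Counter(r["qi"] for r in records)
    return all(size >= k for size in groups.values())

def is_m_private(records, providers, m, k):
    """The release must stay k-anonymous even after any coalition of up to
    m providers removes the records it contributed itself."""
    for size in range(m + 1):
        for coalition in combinations(providers, size):
            remaining = [r for r in records if r["provider"] not in coalition]
            if not is_k_anonymous(remaining, k):
                return False
    return True

records = [
    {"provider": "P1", "qi": ("30-39", "476**")},
    {"provider": "P2", "qi": ("30-39", "476**")},
    {"provider": "P2", "qi": ("40-49", "476**")},
    {"provider": "P3", "qi": ("40-49", "476**")},
]
# False: removing one provider's records leaves a lone, exposed record.
print(is_m_private(records, ["P1", "P2", "P3"], m=1, k=2))
```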

68 citations


Journal ArticleDOI
TL;DR: This paper presents the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention, and shows how to extend the approach to different privacy models and anti-discrimination legal concepts.
Abstract: Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also potential discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.
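
The paper defines its own formal discrimination measures; purely for illustration, one simple check on a published dataset is the disparate-impact ratio, with the common "four-fifths" threshold of 0.8 as a conventional reference point. Names and numbers below are made up.

```python
def disparate_impact(rows, protected_attr, outcome_attr):
    """P(positive outcome | protected) / P(positive outcome | unprotected)."""
    def positive_rate(group):
        outcomes = [r[outcome_attr] for r in rows if r[protected_attr] == group]
        return sum(outcomes) / len(outcomes)
    return positive_rate(True) / positive_rate(False)

rows = [
    {"protected": True,  "hired": 1}, {"protected": True,  "hired": 0},
    {"protected": True,  "hired": 0}, {"protected": False, "hired": 1},
    {"protected": False, "hired": 1}, {"protected": False, "hired": 0},
]
# Hiring rates 1/3 vs 2/3 give a ratio of 0.5, well under the 0.8 threshold.
print(disparate_impact(rows, "protected", "hired"))
```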

66 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the “publish” and “push” models of data dissemination need not be mutually exclusive; on the contrary, they can play complementary roles in sharing high quality data in support of research.
Abstract: We present a case study of data integration and reuse involving 12 researchers who published datasets in Open Context, an online data publishing platform, as part of collaborative archaeological research on early domesticated animals in Anatolia. Our discussion reports on how different editorial and collaborative review processes improved data documentation and quality, and created ontology annotations needed for comparative analyses by domain specialists. To prepare data for shared analysis, this project adapted editor-supervised review and revision processes familiar from conventional publishing, as well as more novel models of revision adapted from open-source software development and public version control. Preparing the datasets for publication and analysis required significant investment of effort and expertise, including archaeological domain knowledge and familiarity with key ontologies. To organize this work effectively, we emphasized these different models of collaboration at various stages of this data publication and analysis project. Collaboration first centered on data editors working with data contributors, then widened to include other researchers who provided additional peer-review feedback, and finally the widest research community, whose collaboration is facilitated by GitHub’s version control system. We demonstrate that the “publish” and “push” models of data dissemination need not be mutually exclusive; on the contrary, they can play complementary roles in sharing high quality data in support of research. This work highlights the value of combining multiple models in different stages of data dissemination.

Journal Article
TL;DR: A systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resource usage) and their effectiveness (in terms of data utility) demonstrates, through experimental evaluation, the conditions in which one algorithm outperforms the others for a particular metric, depending on the input dataset and privacy requirements.
Abstract: The vast amount of data being collected about individuals has brought new challenges in protecting their privacy when this data is disseminated. As a result, Privacy-Preserving Data Publishing has become an active research area, in which multiple anonymization algorithms have been proposed. However, given the large number of algorithms available and limited information regarding their performance, it is difficult to identify and select the most appropriate algorithm given a particular publishing scenario, especially for practitioners. In this paper, we perform a systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resource usage) and their effectiveness (in terms of data utility). We extend the scope of their original evaluation by employing a more comprehensive set of scenarios: different parameters, metrics and datasets. Using publicly available implementations of those algorithms, we conduct a series of experiments and a comprehensive analysis to identify the factors that influence their performance, in order to guide practitioners in the selection of an algorithm. We demonstrate through experimental evaluation the conditions in which one algorithm outperforms the others for a particular metric, depending on the input dataset and privacy requirements. Our findings motivate the necessity of creating methodologies that provide recommendations about the best algorithm given a particular publishing scenario.
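
As a taste of the effectiveness side of such a comparison, the sketch below computes average equivalence class size, one standard data-utility metric for k-anonymizations (illustrative; we do not claim it is exactly the metric set the authors used).

```python
from collections import Counter

def c_avg(quasi_id_tuples, k):
    """Average equivalence class size, C_avg = (n / number_of_classes) / k.
    Values near 1 mean classes barely larger than the required minimum k."""
    classes = Counter(quasi_id_tuples)
    return (len(quasi_id_tuples) / len(classes)) / k

release = [("30-39", "476**")] * 4 + [("40-49", "476**")] * 2
print(c_avg(release, k=2))   # (6 / 2) / 2 = 1.5
```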

Book ChapterDOI
25 May 2014
TL;DR: The “Linked Data Finland” platform LDF.fi is introduced, extending Tim Berners-Lee's famous 5-star model with a sixth star for providing the dataset with a schema that explains it, and a seventh star for validating the data against the schema.
Abstract: The idea of Linked Data is to aggregate, harmonize, integrate, enrich, and publish data for re-use on the Web in a cost-efficient way using Semantic Web technologies. We address two major hindrances to re-using Linked Data: it is often difficult for a re-user to (1) understand the characteristics of the dataset and (2) evaluate the quality of the data for the intended purpose. This paper introduces the “Linked Data Finland” platform LDF.fi, which addresses these issues. We extend Tim Berners-Lee's famous 5-star model with a sixth star for providing the dataset with a schema that explains it, and a seventh star for validating the data against that schema. LDF.fi also automates data publishing and provides data curation tools. The first prototype of the platform is available on the web as a service, hosting tens of datasets and supporting several applications.
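
The seventh star, validating data against a published schema, can be mechanized. The sketch below uses SHACL via pySHACL as one possible toolchain (the paper does not prescribe this stack, and the file names are hypothetical placeholders).

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical files: the dataset and a SHACL shapes graph describing it.
data = Graph().parse("dataset.ttl", format="turtle")
shapes = Graph().parse("schema-shapes.ttl", format="turtle")

# pySHACL returns (conforms, results_graph, results_text).
conforms, _, report_text = validate(data, shacl_graph=shapes)
print("schema-valid (7th star)" if conforms else report_text)
```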

Proceedings ArticleDOI
30 Jan 2014
TL;DR: It is shown that under Hamming distortions, the differential privacy risk is lower bounded for all nontrivial distortions, and that the lower bound grows logarithmically in the alphabet size.
Abstract: Local differential privacy is a model for privacy in which an untrusted statistician collects data from individuals who mask their data before revealing it. While randomized response has been shown to be a good strategy when the statistician's goal is to estimate a parameter of the population, we consider instead the problem of locally private data publishing, in which the data collector must publish a version of the data it has collected. We model utility by a distortion measure and consider privacy mechanisms that act via a memoryless channel operating on the data. If we take the source distribution to be unknown but in a class of distributions, we arrive at a robust rate-distortion model for the privacy-distortion tradeoff. We show that under Hamming distortions, the differential privacy risk is lower bounded for all nontrivial distortions, and that the lower bound grows logarithmically in the alphabet size.
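
Randomized response, which the paper takes as its point of comparison, is the canonical local-privacy mechanism and fits in a few lines; the sketch also shows the standard unbiased correction the statistician applies to the masked reports.

```python
import math
import random

def randomized_response(bit, eps):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    This satisfies eps-local differential privacy."""
    p_truth = math.exp(eps) / (1 + math.exp(eps))
    return bit if random.random() < p_truth else 1 - bit

def estimate_mean(reports, eps):
    # Unbiased correction: E[report] = mean*(2p - 1) + (1 - p); solve for mean.
    p = math.exp(eps) / (1 + math.exp(eps))
    return (sum(reports) / len(reports) + p - 1) / (2 * p - 1)

true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b, eps=1.0) for b in true_bits]
print(estimate_mean(reports, eps=1.0))   # close to the true mean 0.3
```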

Journal ArticleDOI
TL;DR: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.

Journal ArticleDOI
TL;DR: It is concluded that random perturbation cannot achieve meaningful levels of anonymity without degrading the graph features.

Journal ArticleDOI
TL;DR: It is shown that PPFSCADA effectively deals with multivariate traffic attributes, producing results comparable to the original data, substantially improving the performance of the five supervised approaches, and providing a high level of privacy protection.

Book ChapterDOI
15 Sep 2014
TL;DR: This work designs strategies and algorithms inspired by or based on Fréchet bounds attacks, attribute inference attacks, and minimality attacks for the purpose of unveiling hidden discriminatory practices, and shows that they can be effective tools in the hands of anti-discrimination authorities.
Abstract: Social discrimination discovery from data is an important task to identify illegal and unethical discriminatory patterns towards protected-by-law groups, e.g., ethnic minorities. We deploy privacy attack strategies as tools for discrimination discovery under hard assumptions that have rarely been tackled in the literature: indirect discrimination discovery, privacy-aware discrimination discovery, and discrimination data recovery. The intuition comes from the intriguing parallel between the role of the anti-discrimination authority in the three scenarios above and the role of an attacker in private data publishing. We design strategies and algorithms inspired by or based on Fréchet bounds attacks, attribute inference attacks, and minimality attacks for the purpose of unveiling hidden discriminatory practices. Experimental results show that they can be effective tools in the hands of anti-discrimination authorities.
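
One of the borrowed tools, Fréchet bounds, is simple enough to show directly: for a 2x2 contingency table with known margins, each cell count is provably squeezed into an interval, so even a suppressed discriminatory count can be cornered. A minimal sketch with made-up numbers:

```python
def frechet_bounds(row_total, col_total, n):
    """Bounds on one cell of a 2x2 table with margins row_total, col_total
    and grand total n: max(0, row + col - n) <= cell <= min(row, col)."""
    return max(0, row_total + col_total - n), min(row_total, col_total)

# 100 decisions; 40 involve the protected group; 70 are denials.
low, high = frechet_bounds(row_total=40, col_total=70, n=100)
print(low, high)   # 10 40: at least 10 protected-group denials are certain
```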

Journal ArticleDOI
TL;DR: A new anonymization algorithm is presented that, given minimal anonymity and diversity parameters along with an information loss measure, issues corresponding non-homogeneous anonymizations where the sensitive attribute is published as frequency distributions over the sensitive domain rather than in the usual form of exact sensitive values.

Book ChapterDOI
06 May 2014
TL;DR: This work shows that feature selection can be used to preserve the privacy of individuals without compromising the accuracy of data classification, and that when feature selection is combined with anonymization techniques, privacy-preserving datasets can be published.
Abstract: In this work we show that feature selection can be used to preserve the privacy of individuals without compromising the accuracy of data classification. Furthermore, when feature selection is combined with anonymization techniques, we are able to publish privacy-preserving datasets. We use several UCI datasets to empirically support our claim. The obtained results show that these privacy-preserving datasets provide classification accuracy comparable, and in some cases superior, to the accuracy of classification on the original datasets. We generalize the results with a paired t-test applied across different levels of anonymization.
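
A sketch of the experimental recipe, using scikit-learn and one UCI dataset (the specific dataset, selector, and k=10 are illustrative choices, not necessarily the authors'): does dropping most attributes, which removes potentially identifying detail, hurt classification accuracy?

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # a UCI dataset, 30 features
clf = DecisionTreeClassifier(random_state=0)

# Accuracy with all features vs. only the 10 most informative ones.
acc_full = cross_val_score(clf, X, y, cv=5).mean()
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
acc_sel = cross_val_score(clf, X_sel, y, cv=5).mean()
print(f"all 30 features: {acc_full:.3f}   top 10 features: {acc_sel:.3f}")
```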


Journal Article
TL;DR: This paper aims to offer a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis of privacy preserving technologies.
Abstract: This paper surveys the state of the art of privacy preserving for publishing social network data. First, the research background of privacy preserving for social network data is presented. Then, four important aspects of privacy preserving for social network data are summarized and analyzed in detail. The discussed topics include privacy of social networks, adversaries' background knowledge, privacy-preserving technologies for social networks, and data utility and experimental analysis. In addition, this paper points out the defects of privacy preserving in social networks and provides comparisons of privacy-preserving technologies. Finally, some potential future research directions are introduced. This paper aims to offer a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis.

Journal ArticleDOI
TL;DR: The drivers behind the formation of the Research Data Alliance (RDA), its current state, the lessons learned from its first full year of operation, and its anticipated impact on data publishing and sharing are discussed.
Abstract: This article discusses the drivers behind the formation of the Research Data Alliance (RDA), its current state, the lessons learned from its first full year of operation, and its anticipated impact on data publishing and sharing. One of the pressing challenges in data infrastructure (taken here to include issues relating to hardware, software and content format, as well as human actors) is how best to enable data interoperability across boundaries. This is particularly critical as the world deals with bigger and more complex problems that require data and insights from a range of disciplines. The RDA has been set up to enable more data to be shared across barriers to address these challenges. It does this through focused Working Groups and Interest Groups, formed of experts from around the world, and drawing from the academic, industry, and government sectors. Context: The amount of activity dealing with the importance of data to research has increased perceptibly over the last five years. This includes conferences specifically focused on research data issues [1-3], data-focused tracks at discipline conferences (too many to cite), national reports [4,5], funder requirements [6-10], special issues of journals [11,12], and new journals altogether. Those wishing to read further about some of these issues can consult a selective bibliography dealing with publications about this space [13]. The foci for this activity are quite diverse: researcher behaviour, incentives and rewards, changes in the ecology of scholarly communication, technical issues, and the challenges of building and operating data infrastructure. Role of data infrastructure: This paper will focus specifically on data infrastructure, interpreted broadly: hardware (storage and associated computer hardware), software, content and format standards, and human actors. As the bulk of the data needed by and generated by researchers is increasingly managed electronically, the role of this data infrastructure is becoming critical. One of the pressing challenges in data infrastructure is how best to enable data interoperability across boundaries. These boundaries include those between countries, between disciplines, and between producers of research data and the consumers of those data. A new organization, the Research Data Alliance (RDA), has been brought into existence specifically to address those boundaries from an infrastructure perspective.

Journal ArticleDOI
TL;DR: It is argued that the trust assumption on the central server is far too strong, and Met𝔸P, a generic fully distributed protocol, is proposed to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure.
Abstract: The goal of Privacy-Preserving Data Publishing (PPDP) is to generate a sanitized (i.e. harmless) view of sensitive personal data (e.g. a health survey), to be released to some agencies or simply the public. However, traditional PPDP practices all make the assumption that the process is run on a trusted central server. In this article, we argue that the trust assumption on the central server is far too strong. We propose Met𝔸P, a generic fully distributed protocol, to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure. We show that this protocol is both correct and secure against honest-but-curious or malicious adversaries. Finally, we provide an experimental validation showing that this protocol can support PPDP processes scaling up to nation-wide surveys.

Book ChapterDOI
07 Jul 2014
TL;DR: A novel empirical risk model for privacy is proposed which, in relation to the cost of privacy attacks, better reflects the practical risks associated with a privacy-preserving data release.
Abstract: Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, better reflects the practical risks associated with a privacy-preserving data release. We show a detailed evaluation of the proposed risk model using k-anonymised real-world mobility data.
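
The gap the paper highlights can be seen in a toy computation: the worst-case risk of a k-anonymized release is driven by the smallest equivalence class, while the empirical expectation averages over the classes records actually fall into. A simplified sketch (not the paper's full, attack-cost-aware model):

```python
from collections import Counter

def risks(equivalence_keys):
    """Worst-case vs. empirical (expected) re-identification risk."""
    sizes = Counter(equivalence_keys)
    worst_case = 1 / min(sizes.values())
    empirical = sum(1 / sizes[k] for k in equivalence_keys) / len(equivalence_keys)
    return worst_case, empirical

# Ten generalized trajectories falling into classes of sizes 4, 2 and 4.
keys = ["A"] * 4 + ["B"] * 2 + ["C"] * 4
print(risks(keys))   # (0.5, 0.3): the worst case overstates the typical risk
```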

Journal ArticleDOI
TL;DR: It is shown that when k-anonymization is preceded by feature selection, it is possible to obtain a contingency table with higher counts, and when noise is added to satisfy differential privacy, its distorting effect is minimized and high utility of the data is preserved.
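
A small numerical sketch of the underlying effect (sizes and budget are illustrative): spreading the same records over fewer contingency-table cells makes the fixed-scale Laplace noise required for ε-differential privacy relatively smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
laplace = lambda shape: rng.laplace(scale=1.0 / eps, size=shape)  # sensitivity 1

tables = {
    "64 cells (many features)": np.full(64, 2.0),   # 128 records, thin cells
    "4 cells (few features)": np.full(4, 32.0),     # same records, fat cells
}
for name, table in tables.items():
    noisy = np.clip(table + laplace(table.shape), 0, None)
    print(name, "relative error:", round(abs(noisy - table).sum() / table.sum(), 2))
```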

Proceedings ArticleDOI
13 Jul 2014
TL;DR: This paper proposes a privacy-preserving data publishing method, namely MNSACM, that uses the ideas of clustering and Multi-Sensitive Bucketization (MSB) to publish microdata with multiple numerical sensitive attributes, and shows the effectiveness of this method in protecting the privacy of multiple numerical sensitive attributes.
Abstract: Anonymized data publication has received considerable attention from the research community in recent years. For numerical sensitive attributes, most of the existing privacy preserving data publishing techniques concentrate on microdata with multiple categorical sensitive attributes or only one numerical sensitive attribute. However, many real-world applications may contain multiple numerical sensitive attributes. Directly applying the existing single-numerical-sensitive-attribute and multiple-categorical-sensitive-attributes privacy preserving techniques often causes unexpected private information disclosure. They are particularly prone to the proximity breach, a privacy threat specific to numerical sensitive attributes in data publication. In this paper we propose a privacy-preserving data publishing method, namely MNSACM, that uses the ideas of clustering and Multi-Sensitive Bucketization (MSB) to publish microdata with multiple numerical sensitive attributes. Through an example, we show the effectiveness of this method in protecting the privacy of multiple numerical sensitive attributes.

Proceedings Article
Robert Forkel
01 Jan 2014
TL;DR: The Cross-Linguistic Linked Data project helps record the world’s language diversity heritage by establishing an interoperable data publishing infrastructure with an emphasis on the datasets that are published within the project.
Abstract: The Cross-Linguistic Linked Data project (CLLD – http://clld.org) helps record the world’s language diversity heritage by establishing an interoperable data publishing infrastructure. I describe the project and the environment it operates in, with an emphasis on the datasets that are published within the project. The publishing infrastructure is built upon a custom software stack – the clld framework – which is described next. I then proceed to explain how Linked Data plays an important role in the strategy regarding interoperability and sustainability. Finally I gauge the impact the project may have on its environment.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: By embedding the mechanism of timed-release encryption into CP-ABE (Ciphertext-Policy Attribute-based Encryption), this paper proposes TAFC: a new time and attribute factors combined access control on time-sensitive data stored in cloud.
Abstract: The new paradigm of outsourcing data to the cloud is a double-edged sword. On one side, it frees up data owners from the technical management, and it is easier for data owners to share their data with intended recipients when data are stored in the cloud. On the other side, it brings about new challenges for privacy and security protection. To protect data confidentiality against the honest-but-curious cloud service provider, numerous works have been proposed to support fine-grained data access control. However, to date, no efficient scheme provides fine-grained access control together with the capacity for time-sensitive data publishing. In this paper, by embedding the mechanism of timed-release encryption into CP-ABE (Ciphertext-Policy Attribute-based Encryption), we propose TAFC: a new time and attribute factors combined access control on time-sensitive data stored in the cloud. Extensive security and performance analysis shows that our proposed scheme is highly efficient and satisfies the security requirements for time-sensitive data storage in the public cloud.

Proceedings ArticleDOI
27 Jun 2014
TL;DR: This paper revisits the basic activities related to (Web) data services that are impacted by uncertainty, including the service description, invocation and composition, and proposes a probabilistic approach to deal with uncertainty in all of these activities.
Abstract: Recent years have witnessed a growing interest in using Web Services as a powerful means for data publishing and sharing on top of the Web. This class of services is commonly known as DaaS (Data-as-a-Service), or also data services. The data returned by a data service is often subject to uncertainty for various reasons (e.g., privacy constraints, unreliable data collection instruments, etc.). In this paper, we revisit the basic activities related to (Web) data services that are impacted by uncertainty, including service description, invocation, and composition. We propose a probabilistic approach to deal with uncertainty in all of these activities.
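
A common way to make such uncertainty concrete is possible-worlds semantics over tuple-level probabilities; a minimal sketch follows (the paper's probabilistic model may differ in detail): each returned tuple exists independently with some probability, and an answer's probability is the total weight of the worlds in which it holds.

```python
# A data service result as (value, P(tuple exists)) pairs.
service_result = [("alice", 0.9), ("bob", 0.4)]

def prob_nonempty(result):
    """P(the query returns at least one tuple), assuming tuple independence."""
    p_none = 1.0
    for _, p in result:
        p_none *= 1 - p
    return 1 - p_none

print(prob_nonempty(service_result))   # 1 - 0.1 * 0.6 = 0.94
```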