
Showing papers on "Data publishing published in 2015"


Journal ArticleDOI
TL;DR: This paper proposes a practical solution for privacy-preserving medical record sharing in cloud computing, where statistical analysis and cryptography are innovatively combined to provide multiple paradigms of balance between medical data utilization and privacy protection.

246 citations


Journal ArticleDOI
TL;DR: It is emphasized that the future of open data will be driven by the negotiation of the ethical-economic tension that exists between provisioning governments, citizens, and private sector data users.

204 citations


Journal ArticleDOI
01 Sep 2015
TL;DR: This study of more than 100 currently existing data journals describes the approaches they promote for data set description, availability, citation, quality, and open access and identifies ways to expand and strengthen the data journals approach as a means to promote data set access and exploitation.
Abstract: Data occupy a key role in our information society. However, although the amount of published data continues to grow and terms such as data deluge and big data today characterize numerous research initiatives, much work is still needed in the direction of publishing data in order to make them effectively discoverable, available, and reusable by others. Several barriers hinder data publishing, from lack of attribution and rewards, vague citation practices, and quality issues to a rather general lack of a data-sharing culture. Lately, data journals have overcome some of these barriers. In this study of more than 100 currently existing data journals, we describe the approaches they promote for data set description, availability, citation, quality, and open access. We close by identifying ways to expand and strengthen the data journals approach as a means to promote data set access and exploitation.

117 citations


Proceedings ArticleDOI
24 Aug 2015
TL;DR: This paper proposes the first differentially-private generalization algorithm for trajectories, which leverages a carefully-designed exponential mechanism to probabilistically merge nodes based on trajectory distances, together with another efficient algorithm to release the generalized trajectories in a differentially private manner.
Abstract: Trajectory data, i.e., human mobility traces, is extremely valuable for a wide range of mobile applications. However, publishing raw trajectories without special sanitization poses serious threats to individual privacy. Recently, researchers have begun to leverage differential privacy to address this challenge. Nevertheless, existing mechanisms make an implicit assumption that the trajectories contain many identical prefixes or n-grams, which is not true in many applications. This paper aims to remove this assumption and proposes a differentially private publishing mechanism for more general time-series trajectories. One natural solution is to generalize the trajectories, i.e., merge the locations at the same time. However, trivial merging schemes may breach differential privacy. We therefore propose the first differentially-private generalization algorithm for trajectories, which leverages a carefully-designed exponential mechanism to probabilistically merge nodes based on trajectory distances. Afterwards, we propose another efficient algorithm to release trajectories after generalization in a differentially private manner. Our experiments with real-life trajectory data show that the proposed mechanism maintains high data utility and is scalable to large trajectory datasets.
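The exponential mechanism at the heart of this approach can be illustrated with a short sketch: candidate node pairs are scored by (negative) trajectory distance and one pair is sampled with probability exponential in its score. The distance function, candidate set, and parameter values below are illustrative assumptions, not the paper's actual algorithm.

```python
import math
import random

def exponential_mechanism_merge(candidate_pairs, distance, epsilon, sensitivity=1.0):
    """Pick one pair of trajectory nodes to merge.

    Pairs with smaller distance (higher utility) are exponentially more
    likely to be chosen, following the standard exponential mechanism;
    the concrete scoring here is only an illustration.
    """
    # Utility = negative distance: closer trajectories are better merge candidates.
    scores = [-distance(a, b) for a, b in candidate_pairs]
    weights = [math.exp(epsilon * s / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for pair, w in zip(candidate_pairs, weights):
        acc += w
        if r <= acc:
            return pair
    return candidate_pairs[-1]

# Example: average point-wise Euclidean distance between two aligned trajectories.
def traj_distance(t1, t2):
    return sum(math.dist(p, q) for p, q in zip(t1, t2)) / len(t1)

pairs = [(((0, 0), (1, 1)), ((0, 1), (1, 2))),
         (((0, 0), (1, 1)), ((5, 5), (6, 6)))]
print(exponential_mechanism_merge(pairs, traj_distance, epsilon=1.0))
```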

88 citations


Book ChapterDOI
31 Aug 2015
TL;DR: S2X, a SPARQL query processor for Hadoop, is introduced; the authors claim it is the first approach to combine graph-parallel and data-parallel computation for SPARQL querying of RDF data based on Hadoop.
Abstract: RDF has constantly gained attention for data publishing due to its flexible data model, raising the need for distributed querying. However, existing approaches using general-purpose cluster frameworks employ a record-oriented perception of RDF, ignoring its inherent graph-like structure. Recently, GraphX was published as a graph abstraction on top of Spark, an in-memory cluster computing system. It allows graph-parallel and data-parallel computation to be combined seamlessly in a single system, a unique feature not available in other systems. In this paper we introduce S2X, a SPARQL query processor for Hadoop, in which we leverage this unified abstraction by implementing basic graph pattern matching of SPARQL as a graph-parallel task while other operators are implemented in a data-parallel manner. To the best of our knowledge, this is the first approach to combine graph-parallel and data-parallel computation for SPARQL querying of RDF data based on Hadoop.
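Basic graph pattern matching, which S2X implements as a graph-parallel task on GraphX, can be sketched in a much simpler sequential form as successive joins of candidate variable bindings. The toy graph and patterns below are assumptions for illustration only, not S2X's implementation.

```python
def match_bgp(triples, patterns):
    """Match a SPARQL basic graph pattern against a list of triples by
    successively joining candidate variable bindings; terms starting
    with '?' are variables. A plain, non-distributed sketch."""
    def match_one(pattern, triple, binding):
        new = dict(binding)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if p in new and new[p] != t:
                    return None          # variable already bound to a different term
                new[p] = t
            elif p != t:
                return None              # constant term does not match
        return new

    bindings = [{}]
    for pattern in patterns:
        bindings = [b for binding in bindings for t in triples
                    if (b := match_one(pattern, t, binding)) is not None]
    return bindings

graph = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
         ("alice", "age", "30")]
print(match_bgp(graph, [("?x", "knows", "?y"), ("?y", "knows", "?z")]))
# -> [{'?x': 'alice', '?y': 'bob', '?z': 'carol'}]
```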

62 citations


Journal ArticleDOI
TL;DR: This paper proposes a privacy-preserving data publishing method, namely MNSACM, which uses the ideas of clustering and Multi-Sensitive Bucketization to publish microdata with multiple numerical sensitive attributes.

35 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose trusty URIs containing cryptographic hash values, which can be used for the verification of digital artifacts in a manner that is independent of the serialization format in the case of structured data files such as nanopublications.
Abstract: The current Web has no general mechanisms to make digital artifacts—such as datasets, code, texts, and images—verifiable and permanent. For digital artifacts that are supposed to be immutable, there is moreover no commonly accepted method to enforce this immutability. These shortcomings have a serious negative impact on the ability to reproduce the results of processes that rely on Web resources, which in turn heavily impacts areas such as science where reproducibility is important. To solve this problem, we propose trusty URIs containing cryptographic hash values. We show how trusty URIs can be used for the verification of digital artifacts, in a manner that is independent of the serialization format in the case of structured data files such as nanopublications. We demonstrate how the contents of these files become immutable, including dependencies to external digital artifacts and thereby extending the range of verifiability to the entire reference tree. Our approach sticks to the core principles of the Web, namely openness and decentralized architecture, and is fully compatible with existing standards and protocols. Evaluation of our reference implementations shows that these design goals are indeed accomplished by our approach, and that it remains practical even for very large files.
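A rough sketch of the hash-in-URI idea follows. It is not the actual trusty URI encoding, which defines specific modules (e.g. for RDF graphs, so that the hash is independent of serialization); it only illustrates how a content hash appended to a URI makes the referenced artifact verifiable.

```python
import base64
import hashlib

def make_hash_suffixed_uri(base_uri: str, content: bytes) -> str:
    """Append a hash of the artifact's content to its URI so that
    anyone holding the URI can verify the bytes they retrieved.
    Simplified illustration only, not the trusty URI specification."""
    digest = hashlib.sha256(content).digest()
    suffix = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{base_uri}.{suffix}"

def verify(uri: str, content: bytes) -> bool:
    # Recompute the suffix from the retrieved content and compare.
    base, _, _ = uri.rpartition(".")
    return make_hash_suffixed_uri(base, content) == uri

artifact = b"example nanopublication content"
uri = make_hash_suffixed_uri("http://example.org/np1", artifact)
print(uri, verify(uri, artifact))
```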

26 citations


Journal ArticleDOI
TL;DR: It is argued that big data can offer new opportunities and roles for educational researchers and the methodological consequences of these developments for research methods are explored.
Abstract: In this article, we argue that big data can offer new opportunities and roles for educational researchers. In the traditional model of evidence-gathering and interpretation in education, researchers are independent observers, who pre-emptively create instruments of measurement, and insert these into the educational process in specialized times and places (a pre-test or post-test, a survey, an interview, a focus group). The ‘big data’ approach is to collect data through practice-integrated research. If a record is kept of everything that happens, then it is possible to analyze what happened, ex post facto. Data collection is embedded. It is on-the-fly and ever-present. With the relevant analysis and presentation software, the data is readable in the form of data reports, analytics dashboards and visualizations. We explore the methodological consequences of these developments for research methods.

20 citations


Journal Article
TL;DR: This paper considers anonymisations of RDF graphs (and, more generally, relational datasets with labelled nulls), defines notions of safe and optimal anonymisations, and establishes the complexity of the underpinning decision problems both under the open-world semantics inherent to RDF and under a closed-world semantics, where an attacker has complete knowledge over some part of the original data.
Abstract: The widespread adoption of Linked Data has been driven by the increasing demand for information exchange between organisations, as well as by data publishing regulations in domains such as health care and governance. In this setting, sensitive information is at risk of disclosure since published data can be linked with arbitrary external data sources. In this paper we lay the foundations of privacy-preserving data publishing (PPDP) in the context of Linked Data. We consider anonymisations of RDF graphs (and, more generally, relational datasets with labelled nulls) and define notions of safe and optimal anonymisations. Safety ensures that the anonymised data can be published with provable protection guarantees against linking attacks, whereas optimality ensures that it preserves as much information from the original data as possible, while satisfying the safety requirement. We establish the complexity of the underpinning decision problems both under the open-world semantics inherent to RDF and under a closed-world semantics, where we assume that an attacker has complete knowledge over some part of the original data.

19 citations


Journal ArticleDOI
TL;DR: This paper discusses Linked Open Data (LOD) as an approach to improving data description, intelligibility and discoverability to facilitate reuse and presents examples of how annotating zooarchaeology datasets with LOD can facilitate data integration without forcing standardization.
Abstract: The inability of journals and books to accommodate data and to make it reusable has led to the gradual loss of vast amounts of information. The practice of disseminating selected sub-sets of data (usually in summary tables) permits only very limited types of reuse, and thus hampers scholarship. In recent years, largely in response to increasing government and institutional requirements for full data access, the scholarly community is giving data more attention, and solutions for data management are emerging. However, seeing data management primarily as a matter of compliance means that the research community faces continued data loss, as many datasets enter repositories without adequate description to enable their reuse. Furthermore, because many archaeologists do not yet have experience in data reuse, they lack understanding of what “good” data management means in terms of their own research practices. This paper discusses Linked Open Data (LOD) as an approach to improving data description, intelligibility and discoverability to facilitate reuse. I present examples of how annotating zooarchaeology datasets with LOD can facilitate data integration without forcing standardization. I conclude by recognizing that data sharing is not without its challenges. However, the research community’s careful attention and recognition of datasets as valuable scholarly outputs will go a long way toward ensuring that the products of our work are more widely useful.

17 citations


Book ChapterDOI
31 May 2015
TL;DR: The evaluation shows the potential of semantic applications for data publishing in a contextual environment, semantic search, visualization, and automated enrichment according to the needs and expectations of art experts and regular museum visitors.
Abstract: In this paper we present an architecture and approach to publishing open linked data in the cultural heritage domain. We demonstrate our approach for building a system both for data publishing and consumption and show how user benefits can be achieved with semantic technologies. For domain knowledge representation the CIDOC-CRM ontology is used. As a main source of trusted data, we use the data of the web portal of the Russian Museum. For data enrichment we selected DBpedia and the published Linked Data of the British Museum. The evaluation shows the potential of semantic applications for data publishing in a contextual environment, semantic search, visualization, and automated enrichment according to the needs and expectations of art experts and regular museum visitors.

Journal ArticleDOI
TL;DR: Performance evaluation of the AUA model using data sets proves that data anonymization can be done without compromising the quality of data mining results.

Book ChapterDOI
31 May 2015
TL;DR: In this paper, the authors combine a large-scale Linked data publication project LOD Laundromat with a low-cost server-side interface Triple Pattern Fragments, in order to bridge the gap between the Web of downloadable data documents and the web of live queryable data.
Abstract: Ad-hoc querying is crucial to access information from Linked Data, yet publishing queryable RDF datasets on the Web is not a trivial exercise. The most compelling argument to support this claim is that the Web contains hundreds of thousands of data documents, while only 260 queryable SPARQL endpoints are provided. Even worse, the SPARQL endpoints we do have are often unstable, may not comply with the standards, and may differ in supported features. In other words, hosting data online is easy, but publishing Linked Data via a queryable API such as SPARQL appears to be too difficult. As a consequence, in practice, there is no single uniform way to query the LOD Cloud today. In this paper, we therefore combine a large-scale Linked Data publication project, LOD Laundromat, with a low-cost server-side interface, Triple Pattern Fragments, in order to bridge the gap between the Web of downloadable data documents and the Web of live queryable data. The result is a repeatable, low-cost, open-source data publication process. To demonstrate its applicability, we made over 650,000 data documents available as data APIs, consisting of 30 billion triples.
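The Triple Pattern Fragments interface boils down to requesting one triple pattern at a time over plain HTTP. A hedged client sketch is shown below; the endpoint URL, parameter names, and Accept header are assumptions following the common convention, not details taken from the paper, and should be checked against the actual server.

```python
import requests

# Hypothetical Triple Pattern Fragments endpoint; LOD Laundromat exposes
# one TPF interface per cleaned document, but this exact URL is assumed.
FRAGMENT_URL = "http://example.org/fragments/some-dataset"

def fetch_triple_pattern(subject=None, predicate=None, obj=None):
    """Ask the server for all triples matching a single pattern.

    The TPF interface typically accepts the pattern as query-string
    parameters and returns matching triples plus paging metadata.
    """
    params = {k: v for k, v in
              {"subject": subject, "predicate": predicate, "object": obj}.items()
              if v is not None}
    response = requests.get(FRAGMENT_URL, params=params,
                            headers={"Accept": "text/turtle"})
    response.raise_for_status()
    return response.text

print(fetch_triple_pattern(predicate="http://xmlns.com/foaf/0.1/name"))
```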

Journal ArticleDOI
TL;DR: This paper proposes a utility-aware social network graph anonymization based on a new metric that calculates the utility impact of social network link modification and guarantees that the distances between vertices in the modified social network stay as close as possible to those in the original social network graph prior to the modification.

Journal ArticleDOI
TL;DR: Through proof and analysis, the new model is shown to prevent an adversary from using background knowledge about association rules to attack privacy, while still releasing high-quality information.
Abstract: At present, most studies on data publishing consider only a single sensitive attribute, and works on multiple sensitive attributes are still few. Moreover, almost all existing studies on multiple sensitive attributes do not take the inherent relationship between sensitive attributes into account, so an adversary can use background knowledge about this relationship to attack the privacy of users. This paper presents an attack model based on the association rules between the sensitive attributes and, accordingly, presents a data publication method for multiple sensitive attributes. Through proof and analysis, the new model can prevent an adversary from using background knowledge about association rules to attack privacy, and it is able to release high-quality information. Finally, this paper verifies the above conclusions with experiments.
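The attack the abstract describes can be sketched as follows: an adversary who knows one sensitive value of a victim and a high-confidence association rule between sensitive attributes can prune the published sensitive-value combinations for the victim's group. The attribute values, rule format, and threshold below are illustrative assumptions, not the paper's attack model.

```python
def infer_with_association_rules(candidate_tuples, known_value, rules, min_conf=0.8):
    """Prune the candidate sensitive-value combinations published for a
    victim's group using background association rules.

    Rules are (antecedent, consequent, confidence) triples; a rule with
    confidence above min_conf lets the adversary assume the consequent
    whenever the antecedent is known.
    """
    inferred = set()
    for antecedent, consequent, conf in rules:
        if antecedent == known_value and conf >= min_conf:
            inferred.add(consequent)
    # Keep only combinations consistent with the known value and the inferences.
    return [combo for combo in candidate_tuples
            if known_value in combo and inferred.issubset(set(combo))]

# Published sensitive-value combinations for the victim's equivalence class.
candidates = [("hypertension", "heart disease"), ("flu", "none"),
              ("hypertension", "none")]
rules = [("hypertension", "heart disease", 0.9)]
print(infer_with_association_rules(candidates, "hypertension", rules))
# -> [('hypertension', 'heart disease')]
```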

Book ChapterDOI
11 Oct 2015
TL;DR: This work proposes to design scientific data publishing as a Web-based bottom-up process, without top-down control of central authorities such as publishing companies, and presents a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data.
Abstract: Making available and archiving scientific results is for the most part still considered the task of classical publishing companies, despite the fact that classical forms of publishing centered around printed narrative articles no longer seem well-suited in the digital age. In particular, there exist currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. Here we propose to design scientific data publishing as a Web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used for the Semantic Web in general. Evaluation of the current small network shows that this system is efficient and reliable.

Journal ArticleDOI
TL;DR: An independent l-diversity principle is proposed, which can prevent attacks from attackers who know the data publishing algorithms and have corruption abilities and, when compared with other solutions against corruption attacks, results in less information loss.
Abstract: Datasets containing individuals’ information (stored in back-end databases) are often published and shared in social networks. The disclosure of sensitive individuals’ information in social networks is potentially a serious problem. When an attacker studies a published table in a social network, the attacker could infer valuable information of individuals if the attacker learnt some sensitive information of other related individuals from other sources which are different from the published table. This type of attack is referred to as corruption attack. Existing privacy-preserving data publication (PPDP) approaches have been developed against corruption attacks, however, they could cause severe information loss, and reduce the usefulness of the published data. In addition, PPDP models based on l-diversity and its variants may lead to individual sensitive information disclosure. Motivated by providing a solution to overcome these drawbacks, an independent l-diversity principle is proposed in this study. Based on this principle a PPDP model is presented. The model could prevent attacks from attackers who have known data publishing algorithms and have the corruption abilities. A new data utility measurement global loss penalty is also proposed in this study. Related algorithms to our approach have been developed and implemented. Extensive experiments have been performed and comparisons with other related methods have been made. The results have shown the effectiveness of our approaches. It has been noted that when compared with l-diversity model and its variant models, our model could resist corruption attacks more effectively; furthermore, when compared with other solutions against corruption attacks, our method would result in less information loss.
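For context, the baseline distinct l-diversity check that this line of work starts from can be sketched as below; the paper's independent l-diversity model adds further conditions against corruption attacks that are not reproduced here, and the table is a toy assumption.

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_attrs, sensitive_attr, l):
    """Check plain distinct l-diversity: every equivalence class (group
    of records sharing the same quasi-identifier values) must contain
    at least l distinct sensitive values."""
    groups = defaultdict(set)
    for rec in records:
        key = tuple(rec[a] for a in quasi_attrs)
        groups[key].add(rec[sensitive_attr])
    return all(len(values) >= l for values in groups.values())

table = [
    {"age": "2*", "zip": "130**", "disease": "flu"},
    {"age": "2*", "zip": "130**", "disease": "cancer"},
    {"age": "3*", "zip": "148**", "disease": "flu"},
    {"age": "3*", "zip": "148**", "disease": "hepatitis"},
]
print(distinct_l_diversity(table, ["age", "zip"], "disease", l=2))  # True
```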

Proceedings ArticleDOI
01 Oct 2015
TL;DR: In this approach the data is anonymized in two phases, a Map phase and a Reduce phase, using the Two-Phase Top-Down Specialization (Two-Phase TDS) algorithm, and the scalability and efficiency of Two-Phase TDS are experimentally evaluated.
Abstract: Nowadays data security is a major issue in cloud computing and remains a problem in data publishing. Many people share data over the cloud for business requirements, and its use for data analysis makes privacy a big concern. In order to protect privacy in data publishing, an anonymization technique is enforced on the data, in which the data can be either generalized or suppressed using various algorithms. Top-Down Specialization (TDS) under k-anonymity is the most widely used generalization algorithm for data anonymization. In the cloud, privacy for data publishing is provided through this algorithm, but another, bigger problem is the scalability of the data: when the data shared for analysis grows tremendously, the anonymization process becomes tedious. Big data techniques help here, since large-scale data can be partitioned using the MapReduce framework on the cloud. In our approach the data is anonymized in two phases, a Map phase and a Reduce phase, using the Two-Phase Top-Down Specialization (Two-Phase TDS) algorithm, and the scalability and efficiency of Two-Phase TDS are experimentally evaluated.
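A toy, in-memory imitation of the two phases is sketched below: mappers emit generalized quasi-identifier values with counts, the reducer aggregates them, and a specialization step is accepted only if every resulting group still holds at least k records. The age taxonomy and data partitions are assumptions for illustration; the real algorithm runs on Hadoop MapReduce over partitioned data.

```python
from collections import Counter
from itertools import chain

# Toy taxonomy: ages are either fully generalized ("Any") or put into
# decade bands; this taxonomy is an assumption made for illustration.
def generalize_age(age, level):
    if level == 0:
        return "Any"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def map_phase(partition, level):
    # Each mapper emits (generalized QI value, 1) for its data partition.
    return [(generalize_age(rec["age"], level), 1) for rec in partition]

def reduce_phase(mapped):
    # The reducer sums the emitted counts per generalized QI group.
    counts = Counter()
    for key, value in mapped:
        counts[key] += value
    return counts

def specialization_keeps_k_anonymity(partitions, level, k):
    mapped = chain.from_iterable(map_phase(p, level) for p in partitions)
    return all(c >= k for c in reduce_phase(mapped).values())

data = [[{"age": a} for a in (23, 27, 25)],
        [{"age": a} for a in (31, 34, 36, 38)]]
# Specialize from "Any" (level 0) to decade bands (level 1) only if
# every resulting group still contains at least k records.
print(specialization_keeps_k_anonymity(data, level=1, k=3))  # True
```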

Journal ArticleDOI
TL;DR: This survey will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish P PDP from other related problems, and propose future research directions.
Abstract: Privacy-preserving data publishing (PPDP) methods, a new class of privacy-preserving data mining (PPDM) technology, have been developed by the research community working on security and knowledge discovery. It is common to share data between two organizations in many application areas, and when data are to be shared between parties, there could be sensitive patterns which should not be disclosed to the other parties. PPDP methods aim to keep the underlying data useful while preserving privacy (utility-based privacy preservation), and they have created tremendous opportunities for knowledge- and information-based decision making. Recently, PPDP has received considerable attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this survey, we systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions. Key words: privacy preserving, privacy preserving data publishing, privacy preserving data mining, republishing, security, privacy, decision making, knowledge.

Journal ArticleDOI
01 Nov 2015
TL;DR: This work argues that the classical DBRL risk measure is insufficient, and introduces the novel Global Distance-Based Record Linkage (GDBRL) risk measure, which is recommended to be considered as an additional measure when analyzing the privacy offered by SDC protection algorithms.
Abstract: Statistical Disclosure Control (SDC, for short) studies the problem of privacy-preserving data publishing in cases where the data is expected to be used for statistical analysis. An original dataset T containing sensitive information is transformed into a sanitized version T' which is released to the public. Both utility and privacy aspects are very important in this setting. For utility, T' must allow data miners or statisticians to obtain similar results to those which would have been obtained from the original dataset T. For privacy, T' must significantly reduce the ability of an adversary to infer sensitive information on the data subjects in T. One of the main a-posteriori measures that the SDC community has considered up to now when analyzing the privacy offered by a given protection method is the Distance-Based Record Linkage (DBRL) risk measure. In this work, we argue that the classical DBRL risk measure is insufficient. For this reason, we introduce the novel Global Distance-Based Record Linkage (GDBRL) risk measure. We claim that this new measure must be evaluated alongside the classical DBRL measure in order to better assess the risk in publishing T' instead of T. After that, we describe how this new measure can be computed by the data owner and discuss the scalability of those computations. We conclude by extensive experimentation where we compare the risk assessments offered by our novel measure as well as by the classical one, using well-known SDC protection methods. Those experiments validate our hypothesis that the GDBRL risk measure issues, in many cases, higher risk assessments than the classical DBRL measure. In other words, relying solely on the classical DBRL measure for risk assessment might be misleading, as the true risk may in fact be higher. Hence, we strongly recommend that the SDC community considers the new GDBRL risk measure as an additional measure when analyzing the privacy offered by SDC protection algorithms.
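The classical DBRL measure that the paper argues is insufficient can be sketched as follows: for each original record, the risk counts how often its nearest sanitized record is its own counterpart. The proposed GDBRL measure aggregates linkage evidence globally and is not reproduced here; the toy numeric records are assumptions.

```python
import math

def dbrl_risk(original, sanitized):
    """Classical Distance-Based Record Linkage risk: the fraction of
    original records whose nearest sanitized record is their own
    counterpart (records are assumed aligned by index)."""
    linked = 0
    for i, orig in enumerate(original):
        distances = [math.dist(orig, san) for san in sanitized]
        if min(range(len(sanitized)), key=distances.__getitem__) == i:
            linked += 1
    return linked / len(original)

orig = [(1.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
san = [(1.2, 2.1), (4.8, 5.3), (8.7, 1.4)]   # noise-added release
print(dbrl_risk(orig, san))  # 1.0 -> every record re-linked
```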

Proceedings ArticleDOI
21 Jul 2015
TL;DR: A new mathematical model based on the Zipf distribution is proposed for evaluating a given anonymized dataset that is to be reidentified, and the theoretical bound for reidentification is defined, which yields the appropriate optimal level of anonymization.
Abstract: In this paper, we propose a new mathematical model for evaluating a given anonymized dataset that is to be reidentified. Many anonymization algorithms have been proposed in the area called privacy-preserving data publishing (PPDP), but no anonymization algorithm is suitable for all scenarios because many factors are involved. In order to address the issues of anonymization, we propose a new mathematical model based on the Zipf distribution. Our model is simple, but it fits well with the real distribution of trajectory data. We demonstrate the primary property of our model and we extend it to a more complex environment. Using our model, we define the theoretical bound for reidentification, which yields the appropriate optimal level of anonymization.
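One simple consequence of a Zipf rank-frequency model, sketched below, is an estimate of how many users carry a trajectory pattern shared by no one else and are thus trivially re-identifiable. This illustrates the modeling idea only; the exponent, pattern count, and user count are assumptions, not the paper's exact bound.

```python
def zipf_frequencies(n_patterns, s=1.0):
    """Rank-frequency distribution f(r) proportional to 1 / r**s over
    n_patterns distinct trajectory patterns (exponent s is assumed)."""
    weights = [1.0 / (r ** s) for r in range(1, n_patterns + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_unique_fraction(n_patterns, n_users, s=1.0):
    """Expected fraction of users whose pattern is shared by nobody
    else, i.e. who are trivially re-identifiable under this toy model."""
    probs = zipf_frequencies(n_patterns, s)
    # A user with pattern r is unique if none of the other n_users - 1
    # users drew the same pattern.
    return sum(p * (1.0 - p) ** (n_users - 1) for p in probs)

print(expected_unique_fraction(n_patterns=10_000, n_users=1_000))
```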

Proceedings ArticleDOI
27 Jun 2015
TL;DR: This paper proposes a privacy-preserving data publishing framework for publishing large datasets with the goals of providing different levels of utility to the users based on their access privileges, and designs and implements multi-level utility-controlled data anonymization schemes in the context of large association graphs.
Abstract: Conventional private data publication schemes are targeted at publication of sensitive datasets with the objective of retaining as much utility as possible for statistical (aggregate) queries while ensuring the privacy of individuals' information. However, such an approach to data publishing is no longer applicable in shared multi-tenant cloud scenarios where users often have different levels of access to the same data. In this paper, we present a privacy-preserving data publishing framework for publishing large datasets with the goals of providing different levels of utility to the users based on their access privileges. We design and implement our proposed multi-level utility-controlled data anonymization schemes in the context of large association graphs considering three levels of user utility namely: (i) users having access to only the graph structure (ii) users having access to graph structure and aggregate query results and (iii) users having access to graph structure, aggregate query results as well as individual associations. Our experiments on real large association graphs show that the proposed techniques are effective, scalable and yield the required level of privacy and utility for user-specific utility and access privilege levels.

Journal ArticleDOI
TL;DR: A novel empirical risk model for privacy is proposed which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release.
Abstract: Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper, we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release. We show detailed evaluation of the proposed risk model by using k-anonymised real-world mobility data and then, we show how the empirical evaluation of the privacy risk has a different trend in synthetic data describing random movements.

Journal ArticleDOI
TL;DR: Slicing preserves better data utility than bucketization and generalization, which lose a substantial amount of data, particularly for high-dimensional information.
Abstract: Background: The slicing method has been recommended as a way of protecting privacy in data publication and data publishing, while releasing transaction data to a third party and concealing certain customer-specific information. Statistical analysis: Many anonymization methods have been used for data publication and data publishing. The generalization method loses a substantial amount of data, particularly for high-dimensional information. The bucketization method does not provide a clear separation between quasi-identifying attributes and sensitive attributes. Result: Slicing preserves better data utility than bucketization and generalization. The method partitions the data both vertically and horizontally and can handle high-dimensional data.
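A toy version of slicing is sketched below: attributes are partitioned vertically into column groups, tuples horizontally into buckets, and each column group is permuted independently within a bucket so that exact cross-group linkage is broken. The column groups, bucket size, and table are arbitrary assumptions for illustration.

```python
import random

def slice_table(records, column_groups, bucket_size):
    """Toy slicing: vertical partition into column groups, horizontal
    partition into buckets, independent within-bucket permutation of
    each column group's values."""
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        shuffled_columns = []
        for group in column_groups:
            col_values = [tuple(rec[a] for a in group) for rec in bucket]
            random.shuffle(col_values)           # break linkage across groups
            shuffled_columns.append(col_values)
        for row_parts in zip(*shuffled_columns):
            sliced.append({attr: val
                           for group, part in zip(column_groups, row_parts)
                           for attr, val in zip(group, part)})
    return sliced

table = [{"age": 25, "zip": "13053", "disease": "flu"},
         {"age": 29, "zip": "13068", "disease": "cancer"},
         {"age": 41, "zip": "14850", "disease": "flu"},
         {"age": 47, "zip": "14853", "disease": "hepatitis"}]
print(slice_table(table, [("age", "zip"), ("disease",)], bucket_size=2))
```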

Journal ArticleDOI
TL;DR: This work presents techniques that explicitly address right-protection topics and provably preserve the outcome of certain mining operations and quantifies the tradeoff between obfuscation and utility for spatiotemporal datasets and discovers very favorable characteristics of the process.
Abstract: The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms to establish ownership in the event of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on dataset utility after the protection process. This work presents techniques that explicitly address this topic and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. Our approach considers all prevalent hierarchical clustering variants: single-, complete-, and average-linkage. We imprint the ownership in a dataset using watermarking principles, and we derive tight bounds on the expansion/contraction of distances incurred by the process. We leverage our analysis to design fast algorithms for right protection without exhaustively searching the vast design space. Finally, because the right-protection process introduces a user-tunable distortion on the dataset, we explore the possibility of using this mechanism for data obfuscation. We quantify the tradeoff between obfuscation and utility for spatiotemporal datasets and discover very favorable characteristics of the process. An additional advantage is that when one is interested in both right-protecting and obfuscating the original data values, the proposed mechanism can accomplish both tasks simultaneously.

Proceedings ArticleDOI
24 Aug 2015
TL;DR: This paper proposes a new privacy-preserving model which minimizes attacks and overcomes drawbacks experienced by existing popular anonymizing approaches, combining features of two well-known techniques to obtain an efficient model.
Abstract: Publishing data for analysis while maintaining individual privacy, when the data contains sensitive attributes, is a problem of increasing significance today. Preserving sensitive data without suffering serious attacks is therefore the greatest challenge in the field of privacy-preserving data publishing. The existing privacy models each experience one or another type of attack, which leads to privacy breaches. In this paper, we propose a new privacy-preserving model which minimizes attacks and overcomes drawbacks experienced by existing popular anonymizing approaches. Our technique is inspired by two popular techniques. The first is the t-closeness approach, proposed by Ninghui in 2012, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of that attribute in the overall table, i.e., that the distance between the two distributions be no more than a threshold t; it is, however, subject to its own limitations. The second is “p-sensitive k-anonymity”, proposed by Truta and Vinay in 2006, which reduces information loss by increasing utility. Our technique combines features of these two techniques, leading to an efficient model. Experimental results show that our privacy measure runs faster and significantly reduces privacy breaches by overcoming many of the attacks faced by existing models in the literature.
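The t-closeness condition that the model builds on can be sketched with a simplified distance check between the per-class and overall sensitive-value distributions. Total variation distance is used below purely for brevity, whereas the original t-closeness definition uses the Earth Mover's Distance; the toy table is an assumption.

```python
from collections import Counter, defaultdict

def t_closeness_holds(records, quasi_attrs, sensitive_attr, t):
    """Simplified t-closeness check: in every equivalence class, the
    sensitive-value distribution must be within distance t of the
    overall distribution (total variation distance used here)."""
    def distribution(rows):
        counts = Counter(r[sensitive_attr] for r in rows)
        n = len(rows)
        return {v: c / n for v, c in counts.items()}

    overall = distribution(records)
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_attrs)].append(r)

    for rows in classes.values():
        local = distribution(rows)
        tv = 0.5 * sum(abs(local.get(v, 0.0) - overall.get(v, 0.0))
                       for v in set(overall) | set(local))
        if tv > t:
            return False
    return True

table = [{"zip": "130**", "disease": "flu"},
         {"zip": "130**", "disease": "cancer"},
         {"zip": "148**", "disease": "flu"},
         {"zip": "148**", "disease": "cancer"}]
print(t_closeness_holds(table, ["zip"], "disease", t=0.2))  # True
```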

Proceedings ArticleDOI
01 Dec 2015
TL;DR: An approach that uses a clustering algorithm as a pre-process for privacy-preserving methods to improve the diversity of anonymized data is proposed, and the results are evaluated with respect to usability in scientific work.
Abstract: The data obtained by various organizations provide opportunities for generating solutions in the future. It is essential that accurate data be sharable with research communities and scientists in order to improve quality of life. However, accurate records of personal data include sensitive information about individuals; hence, sharing these records without applying any anonymization criteria paves the way for the disclosure of personal information. In an effort to protect personal privacy, Privacy-Preserving Data Mining (PPDM) and Privacy-Preserving Data Publishing (PPDP) approaches have been studied extensively. Numerous works have been dedicated to diversifying techniques for de-identification or anonymization of identifiable datasets, but there is an important trade-off between data loss and data privacy: when original data is anonymized, it is exposed to information loss, and in order to minimize information loss, anonymization algorithms tend to give up preserving diversity. In this study, we propose an approach that uses a clustering algorithm as a pre-process for privacy-preserving methods to improve the diversity of anonymized data. In addition, the effect of clustering on anonymization was evaluated by using the original and clustered forms of a real-world dataset. The results are evaluated with respect to usability in scientific work, and it was observed that a clustering algorithm and an effective anonymization algorithm must be used together in privacy preservation approaches in order to keep the diversity of the original datasets.
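The proposed pre-processing step can be sketched as: cluster the records first, then anonymize each cluster separately so that similar records are generalized together and diversity is better preserved. The use of k-means and the toy per-cluster anonymizer below are assumptions for illustration, not the paper's exact pipeline.

```python
from sklearn.cluster import KMeans

def cluster_then_anonymize(records, quasi_columns, n_clusters, anonymize):
    """Cluster records on their quasi-identifier columns, then apply an
    anonymization callback to each cluster independently."""
    features = [[rec[c] for c in quasi_columns] for rec in records]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    anonymized = []
    for cluster_id in range(n_clusters):
        group = [rec for rec, lab in zip(records, labels) if lab == cluster_id]
        anonymized.extend(anonymize(group))
    return anonymized

# Toy anonymizer: replace the exact age with the cluster's age range.
def generalize_ages(group):
    ages = [r["age"] for r in group]
    band = f"{min(ages)}-{max(ages)}"
    return [{**r, "age": band} for r in group]

people = [{"age": a, "income": i} for a, i in
          [(23, 30), (25, 32), (41, 80), (44, 78), (61, 50), (63, 55)]]
print(cluster_then_anonymize(people, ["age", "income"], 3, generalize_ages))
```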

Book ChapterDOI
12 Nov 2015
TL;DR: This paper presents the architecture and processes for RDF dataset management for Data.go.th based on the OAM Framework, which supports the entire process of RDF data publishing and data querying, and shows how this framework can simplify the user’s tasks in publishing datasets and creating applications for the datasets.
Abstract: Recently, Thailand has started the Thailand open government data project, which continuously increases the number of open datasets. Open data is valuable when the data is reused, shared, and integrated, and converting the existing datasets to the RDF format can increase the value of these datasets. In this paper, we present the architecture and processes for RDF dataset management for Data.go.th based on the OAM Framework, which supports the entire process of RDF data publishing and data querying. Our approach differs from other LOD platforms in that users do not require knowledge of RDF and SPARQL. Our platform facilitates the data publishing and querying process for novice users and makes it easier to use. The framework provides a common ontology-based search interface and RESTful APIs constructed automatically for the datasets when they are published. With the services provided for the datasets, it simplifies the user’s tasks in publishing datasets and creating applications for them. For consuming the RDF data, we implemented a sample mash-up application which accessed the published weather and reservoir datasets from the Data.go.th website via RESTful APIs.

Journal ArticleDOI
TL;DR: It is proposed that the decision analytic techniques of multicriteria decision analysis, value of information (VOI), weight of evidence (WOE), and portfolio decision analysis (PDA) can bridge the gap from current data collection and visualization efforts to present information relevant to specific decision needs.
Abstract: The increase in nanomaterial research has resulted in increased nanomaterial data. The next challenge is to meaningfully integrate and interpret these data for better and more efficient decisions. Due to the complex nature of nanomaterials, rapid changes in technology, and disunified testing and data publishing strategies, information regarding material properties is often illusive, uncertain, and/or of varying quality, which limits the ability of researchers and regulatory agencies to process and use the data. The vision of nanoinformatics is to address this problem by identifying the information necessary to support specific decisions (a top-down approach) and collecting and visualizing these relevant data (a bottom-up approach). Current nanoinformatics efforts, however, have yet to efficiently focus data acquisition efforts on the research most relevant for bridging specific nanomaterial data gaps. Collecting unnecessary data and visualizing irrelevant information are expensive activities that overwhelm decision makers. We propose that the decision analytic techniques of multicriteria decision analysis (MCDA), value of information (VOI), weight of evidence (WOE), and portfolio decision analysis (PDA) can bridge the gap from current data collection and visualization efforts to present information relevant to specific decision needs. Decision analytic and Bayesian models could be a natural extension of mechanistic and statistical models for nanoinformatics practitioners to master in solving complex nanotechnology challenges.

Journal ArticleDOI
TL;DR: Through an abstraction process presented in this paper, data publishers are provided with simplified descriptions of the generalization technique and its algorithms, which facilitate the understanding of those algorithms by data publishers with low programming skills.