
Showing papers on "Data anonymization published in 2023"


Journal ArticleDOI
TL;DR: Wang et al. propose a novel privacy-preserving data collection protocol that anonymizes healthcare data without using a third-party anonymizer or a private channel for data transmission.
Abstract: Digital health data collection is vital for healthcare and medical research, but it contains sensitive information about patients, which makes it challenging. To collect health data without privacy breaches, the data must be secured between the data owner and the collector. Existing data collection studies rely on overly stringent assumptions, such as a third-party anonymizer or a private channel between the data owner and the collector. These studies are more susceptible to privacy attacks due to the third-party involvement, which makes them less applicable for privacy-preserving healthcare data collection. This article proposes a novel privacy-preserving data collection protocol that anonymizes healthcare data without using a third-party anonymizer or a private channel for data transmission. A clustering-based k-anonymity model is adopted to efficiently prevent identity disclosure attacks, and communication between the data owners and the collector is restricted to elected representatives of each equivalent group of data owners. We also identify a privacy attack, termed “leader collusion”, in which the elected representatives may collaborate to violate an individual's privacy, and we propose solutions for such collusions and for sensitive attribute protection. A greedy heuristic method is devised to efficiently handle data owners who join or depart the anonymization process dynamically. Furthermore, we present the potential privacy attacks on the proposed protocol and a theoretical analysis. Extensive experiments are conducted on real-world datasets, and the results suggest that our solution outperforms state-of-the-art techniques in terms of privacy protection and computational complexity.
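To make the clustering step concrete, here is a minimal Python sketch of clustering-based k-anonymity under simplifying assumptions (sort-and-chunk clustering, ordered quasi-identifiers); it is illustrative only and not the authors' protocol:

```python
from typing import Dict, List

def k_anonymize_by_clustering(records: List[Dict], qi: List[str], k: int) -> List[Dict]:
    """Sort-and-chunk clustering, then generalize each cluster's QIs to ranges."""
    records = [dict(r) for r in records]                # avoid mutating the input
    records.sort(key=lambda r: tuple(r[a] for a in qi))  # similar records cluster together
    clusters = [records[i:i + k] for i in range(0, len(records), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[-2].extend(clusters.pop())             # fold an undersized tail cluster
    for cluster in clusters:
        for attr in qi:
            lo = min(r[attr] for r in cluster)
            hi = max(r[attr] for r in cluster)
            for r in cluster:
                r[attr] = f"[{lo}-{hi}]"                # members share one generalized value
    return [r for cluster in clusters for r in cluster]
```

Because every cluster has at least k members sharing identical generalized quasi-identifiers, each record is indistinguishable from at least k-1 others.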

3 citations


Journal ArticleDOI
31 Jan 2023-Sensors
TL;DR: Wang et al. propose a multi-dimensional sensitive data clustering algorithm based on improved African vultures optimization, which improves the initialization, fitness calculation, and solution update strategy of the clustering center.
Abstract: Significant attention has recently been devoted to privacy protection for multi-dimensional data publishing in various application scenarios, such as scientific research and policy-making. The clustering-based K-anonymity mechanism is the main method of shared-data desensitization, but it suffers from inconsistent clustering results and low clustering accuracy, and it cannot simultaneously defend against several common attacks, such as skewness and similarity attacks. To defend against these attacks, we propose a K-anonymity privacy protection algorithm for multi-dimensional data against skewness and similarity attacks (KAPP) combined with t-closeness. Firstly, we propose a multi-dimensional sensitive data clustering algorithm based on improved African vultures optimization. More specifically, we improve the initialization, fitness calculation, and solution update strategy of the clustering center. The improved African vultures optimization can provide the optimal solution with various dimensions and achieve highly accurate clustering of the multi-dimensional dataset based on multiple sensitive attributes, ensuring that multi-dimensional data of different clusters differ in their sensitive data. After dataset anonymization, similar sensitive data within the same equivalence class become scarce, so the preconditions for skewness and similarity attacks no longer hold. We also propose an equivalence class partition method based on measuring the difference between sensitive data distributions and on t-closeness: we calculate the difference value of each equivalence class's sensitive data distribution and then merge the equivalence classes with larger difference values, so that each equivalence class satisfies t-closeness. This method ensures that multi-dimensional data of the same equivalence class differ in multiple sensitive attributes and can thus effectively defend against skewness and similarity attacks. Moreover, we generalize sensitive attributes with significant weight and all quasi-identifier attributes to achieve anonymous protection of the dataset. The experimental results show that KAPP improves clustering accuracy, diversity, and anonymity compared to similar methods under skewness and similarity attacks.
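The t-closeness requirement used above can be checked per equivalence class. The sketch below uses total variation distance as a simple stand-in for the Earth Mover's Distance of the formal definition:

```python
from collections import Counter
from typing import Sequence

def t_closeness_ok(global_sa: Sequence, class_sa: Sequence, t: float) -> bool:
    """One equivalence class satisfies t-closeness if the distance between its
    sensitive-value distribution and the global distribution is at most t."""
    def dist(values):
        counts = Counter(values)
        n = len(values)
        return {v: c / n for v, c in counts.items()}
    g, q = dist(global_sa), dist(class_sa)
    # Total variation distance; the formal definition uses Earth Mover's Distance.
    tvd = 0.5 * sum(abs(g.get(v, 0.0) - q.get(v, 0.0)) for v in set(g) | set(q))
    return tvd <= t
```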

1 citation


Journal ArticleDOI
TL;DR: In this paper, the authors propose an anonymization algorithm called the utility-based hierarchical algorithm (UHRA) for producing k-anonymous t-closed data that can prevent background knowledge attacks.
Abstract: Recent studies have shown that data are some of the most valuable resources for making government policies and business decisions in different organizations. In privacy preservation, the challenge is to keep an individual's data protected and private while the modified data retain sufficient accuracy for answering data-mining queries. However, it is difficult to guarantee privacy when re-identification of a record is claimed to be impossible, because the adversary may hold background knowledge from different sources. The k-anonymity model is prone to attribute disclosure, while the t-closeness model does not prevent identity disclosure; moreover, neither model considers background knowledge attacks. This paper proposes an anonymization algorithm called the utility-based hierarchical algorithm (UHRA) for producing k-anonymous t-closed data that can prevent background knowledge attacks. The proposed framework satisfies the privacy requirements using a hierarchical approach. Finally, to enhance the utility of the anonymized data, records are moved between different anonymized groups without violating the requirements of the privacy model. Our experiments indicate that our proposed algorithm outperforms its counterparts in terms of data utility and privacy.

1 citation


Journal ArticleDOI
TL;DR: In this paper, a probabilistic model is presented to map the z-anonymity property onto the k-anonymity property, where the decision to publish a given attribute (atomic information) is made in real time.
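For context, a sketch of the z-anonymity decision rule as commonly defined in the literature: release a user's attribute in real time only if enough distinct users exposed the same attribute within a sliding window (the class interface is illustrative):

```python
import time
from collections import defaultdict, deque

class ZAnonymizer:
    """Publish an attribute only if at least z distinct users exposed it
    within the last delta_t seconds (sketch of the z-anonymity rule)."""
    def __init__(self, z: int, delta_t: float):
        self.z, self.delta_t = z, delta_t
        self.seen = defaultdict(deque)  # attribute -> deque of (timestamp, user)

    def observe(self, user: str, attribute: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        window = self.seen[attribute]
        window.append((now, user))
        while window and window[0][0] < now - self.delta_t:
            window.popleft()            # drop observations outside the sliding window
        users = {u for _, u in window}
        return len(users) >= self.z     # True -> safe to publish this attribute
```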

1 citation


Journal ArticleDOI
TL;DR: In this article, the authors conducted a systematic literature review on data anonymization in the context of the Internet of Things (IoT), focusing on grouping, analyzing, and classifying existing data security and privacy methods in IoT.
Abstract: The Internet of Things (IoT) has shown rapid growth in recent years. However, it presents challenges related to the lack of standardization of communication produced by different types of devices; another problem area is the security and privacy of data generated by IoT devices. Focusing on grouping, analyzing, and classifying existing data security and privacy methods in IoT based on data anonymization, we have conducted a Systematic Literature Review (SLR). We review the history of work developing solutions for security and privacy in the IoT, particularly data anonymization and the leading technologies used by researchers, and we discuss the challenges and future directions for research. The objective of the work is to organize the main approaches that promise to provide or facilitate data privacy using anonymization in the IoT area. The study's results can help us understand the best anonymization techniques for providing data security and privacy in IoT environments, as well as the limitations of existing approaches and areas for improvement. The results found in most of the studies analyzed indicate a lack of consensus in the following areas: (i) a solution with a standardized methodology applicable to all scenarios that encompass IoT; (ii) the use of different techniques to anonymize the data; and (iii) the resolution of privacy issues. On the other hand, the k-anonymity technique proved efficient in combination with other techniques. In this context, data privacy remains one of the main challenges in broadening the secure application of anonymization-based privacy.

1 citation


Proceedings ArticleDOI
19 Feb 2023
TL;DR: Li et al. propose a hierarchical DP-K anonymous data release model based on binary-tree clustering that builds on existing k-anonymity schemes while minimizing information loss.
Abstract: With the acceleration of data opening and sharing in the power industry, the risk of sensitive data leakage is gradually increasing. Privacy protection is a central issue in controlling privacy leakage during data release, and k-anonymity has been a hot topic of privacy protection research in recent years. In this paper, we propose a hierarchical DP-K anonymous data release model based on binary-tree clustering that builds on existing k-anonymity schemes and minimizes information loss. A binary tree-based clustering algorithm (BTCA) classifies similar data records into the same equivalence class, which improves the clustering effect, reduces the information loss caused by releasing the anonymized dataset, and improves data availability. The clustered anonymous datasets then redistribute different privacy budgets according to the privacy levels of the quasi-identifier attributes, realizing hierarchical protection of data with different degrees of sensitivity through a differential privacy noise mechanism, which enhances the privacy of the data.
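A minimal sketch of the budget-splitting idea described above, with attribute weights standing in for the per-attribute privacy levels; this is illustrative, not the paper's BTCA model:

```python
import numpy as np

def add_hierarchical_noise(values: np.ndarray, sensitivities, weights, total_epsilon: float):
    """Split a total privacy budget across attributes by weight and add
    Laplace noise scaled to each attribute's sensitivity (sketch only)."""
    weights = np.asarray(weights, dtype=float)
    budgets = total_epsilon * weights / weights.sum()   # per-attribute epsilon_i
    noisy = values.astype(float).copy()
    for j, (sens, eps) in enumerate(zip(sensitivities, budgets)):
        scale = sens / eps                              # Laplace scale b = Δf / ε
        noisy[:, j] += np.random.laplace(0.0, scale, size=len(values))
    return noisy
```

Attributes given larger weights receive larger epsilon shares, hence less noise and weaker protection; sensitive attributes get smaller shares and more noise.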

1 citation



Posted ContentDOI
09 Jun 2023
TL;DR: In this paper, a logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea, and 19 de-identified datasets were generated based on various de-identification configurations using ARX.
Abstract: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, can damage the dataset's utility, yet finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies have investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility. Predictive modeling of emergency department length of stay served as the data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset to observe the association between data privacy and utility, and to determine whether a viable tradeoff between the two can be identified. The findings demonstrate that securing data privacy results in some loss of data utility. Because of the complexity of ensuring data privacy while maintaining utility, understanding the purpose of data use may be required, and including the data user in the de-identification process may help in finding an acceptable tradeoff between data privacy and utility.
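A hedged sketch of the utility comparison performed in such studies: fit the same model on the original and a de-identified dataset and compare test AUC. X_original, X_deidentified, and y are hypothetical arrays; the study itself used ARX-generated configurations:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def utility_of(X, y):
    """Fit a logistic regression and report test AUC as a utility proxy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Utility loss from de-identification, on hypothetical arrays:
# auc_drop = utility_of(X_original, y) - utility_of(X_deidentified, y)
```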

Posted ContentDOI
27 Mar 2023
TL;DR: PADME uses a federated approach in which the model is implemented and deployed by all parties and visits each data location incrementally for training, enabling the analysis of data across locations while still allowing the model to be trained as if all data were in a single location.
Abstract: Data privacy and ownership are significant concerns in social data science, raising legal and ethical issues. Sharing and analyzing data is difficult when different parties own different parts of it. One approach to this challenge is to apply de-identification or anonymization techniques to the data before collecting it for analysis; however, this can reduce data utility, and residual re-identification risks remain. To address these limitations, we present PADME, a distributed analytics tool that federates model implementation and training. PADME uses a federated approach in which the model is implemented and deployed by all parties and visits each data location incrementally for training. This enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location. Training the model on data in its original location preserves data ownership. Furthermore, the results are not provided until the analysis is completed on all data locations, to ensure privacy and avoid bias in the results.
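A minimal sketch of the incremental visiting loop PADME describes, using scikit-learn's partial_fit as a stand-in for the paper's model deployment; site.load_local_data() is a hypothetical accessor:

```python
from sklearn.linear_model import SGDClassifier

def train_incrementally(locations, classes):
    """The model travels between sites; raw data never leaves its owner."""
    model = SGDClassifier(loss="log_loss")
    for site in locations:                        # visit each data location in turn
        X, y = site.load_local_data()             # hypothetical accessor; data stays on site
        model.partial_fit(X, y, classes=classes)  # update in place, as if centrally trained
    return model                                  # released only after all sites are visited
```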


Journal ArticleDOI
TL;DR: HeuristicMin applies generalisations to satisfy user-specified anonymity requirements while maximising data retention for analysis purposes; by exploiting the monotonicity property of generalisation and simple heuristics for pruning, it provides an efficient exhaustive search for optimal generalised data.
Abstract: The abundance of data makes privacy more vulnerable than ever, as it increases an attacker's ability to infer confidential data from multiple data sources. Anonymisation protects data privacy by ensuring that critical data are non-unique to any individual, concealing the individual's identity. Existing techniques aim to minimally alter the original data so that either the anonymised data or its analytical results (e.g., classification) will not disclose private information; our research aims at both. This paper presents HeuristicMin, an anonymisation approach that applies generalisations to satisfy user-specified anonymity requirements while maximising data retention (for analysis purposes). Unlike others, by exploiting the monotonicity property of generalisation and simple heuristics for pruning, HeuristicMin provides an efficient exhaustive search for optimal generalised data. The paper articulates different meanings of optimality in anonymisation and compares HeuristicMin with well-known approaches analytically and empirically. HeuristicMin produces competitive results on the classification obtained from the anonymised data.
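The monotonicity property HeuristicMin exploits can be sketched as follows: if a node in the generalisation lattice satisfies the anonymity requirement, every more general ancestor does too, so whole sub-lattices can be marked safe without being checked. The lattice interface below is hypothetical:

```python
def prune_with_monotonicity(lattice, satisfies_requirement):
    """Exhaustive lattice search with monotonicity pruning: a satisfying
    node implies all of its (more general) ancestors satisfy as well."""
    safe, checked = set(), set()
    for node in lattice.bottom_up():             # least to most generalised
        if node in checked:
            continue
        if satisfies_requirement(node):
            safe.add(node)
            for anc in lattice.ancestors(node):  # mark ancestors safe for free
                safe.add(anc)
                checked.add(anc)
        checked.add(node)
    return safe  # pick the minimally generalised safe node for maximal retention
```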

Journal ArticleDOI
TL;DR: Anonymeter is a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets, such as singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to data protection regulations.
Abstract: Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, synthetic data cannot entirely eliminate privacy risks. These residual privacy risks need instead to be uncovered and assessed ex post. However, quantifying the actual privacy risks of any synthetic dataset is a hard task, given the multitude of facets of data privacy. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to data protection regulations, such as the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, as well as to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating that one-to-one relationships between real and synthetic data records are not preserved. Finally, with a quantitative comparison we demonstrate that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in detecting privacy leaks and in computation speed. To contribute to a privacy-conscious usage of synthetic data, we publish Anonymeter as an open-source library (https://github.com/statice/anonymeter).
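A usage sketch based on the project's README at the time of writing; file names are placeholders, and the repository should be consulted for the current API:

```python
import pandas as pd
from anonymeter.evaluators import SinglingOutEvaluator

# Placeholder file names: original records, synthetic records,
# and a held-out control set of original records.
ori, syn, control = (pd.read_csv(f) for f in ("ori.csv", "syn.csv", "control.csv"))

evaluator = SinglingOutEvaluator(ori=ori, syn=syn, control=control, n_attacks=500)
evaluator.evaluate(mode="multivariate")
print(evaluator.risk())  # risk estimate with confidence interval
```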

Journal ArticleDOI
TL;DR: In this paper, the authors focus on the privacy concerns and commonly used techniques for data protection in smart cities, specifically addressing geolocation data and video surveillance, and they categorize the attacks into linking, predictive and inference, and side-channel attacks.
Abstract: Smart cities, leveraging IoT technologies, are revolutionizing the quality of life for citizens. However, the massive data generated in these cities also poses significant privacy risks, particularly de-anonymization and re-identification. This survey focuses on the privacy concerns and commonly used techniques for data protection in smart cities, specifically addressing geolocation data and video surveillance. We categorize the attacks into linking, predictive and inference, and side-channel attacks. Furthermore, we examine the most widely employed de-identification and anonymization techniques, highlighting privacy-preserving techniques and anonymization tools; while these methods can reduce the privacy risks, they are not enough to address all the challenges. In addition, we argue that de-identification must involve properties such as unlinkability, selective disclosure and self-sovereignty. This paper concludes by outlining future research challenges in achieving complete de-identification in smart cities.


Journal ArticleDOI
15 Mar 2023-PeerJ
TL;DR: In this article, the authors highlight the high utility loss of the θ-Sensitive k-Anonymity privacy model and its inapplicability to 1:M datasets.
Abstract: With the advent of modern information systems, sharing Electronic Health Records (EHRs) with different organizations for better medical treatment and analysis is beneficial for academic research as well as for business development. However, an individual's personal privacy is a big concern because of the trust issue across organizations, while the utility of the shared data, required for its favorable use, is also important. Studies show that plenty of conventional work exists for the case where an individual has only one record in a dataset (a 1:1 dataset), which is not the case in many applications; in a more realistic form, an individual may have more than one record in a dataset (1:M). In this article, we highlight the high utility loss of the θ-Sensitive k-Anonymity privacy model and its inapplicability to the 1:M dataset, as well as the high utility loss and low data privacy of (p, l)-angelization and (k, l)-diversity for the 1:M dataset. As a mitigation solution, we propose an improved (θ∗, k)-utility algorithm to preserve enhanced privacy and utility of the anonymized 1:M dataset. Experiments on a real-world dataset reveal that the proposed approach outperforms its counterparts in terms of utility and privacy for the 1:M dataset.

Journal ArticleDOI
TL;DR: In this article, the authors propose an intermediary, practical strategy to support linkage in studies that share de-identified data with Data Coordinating Centers, which can be extended to link data across multiple data hubs to support privacy-preserving record linkage.

Proceedings ArticleDOI
07 Apr 2023
TL;DR: In this article, the authors compared the performance impact of current data anonymization algorithms with the suggested k-anonymization methods, utilizing both original and anonymized data in order to assess correctness and execution time.
Abstract: Data anonymization is a supplementary method for ensuring that private data are inaccessible to outside parties. Anonymization might affect the outcomes of data mining procedures, since it may make it more difficult for commonly used algorithms to analyze the data. This practical experience report compares the performance impact of current data anonymization algorithms with the suggested k-anonymization methods, utilizing both original and anonymized data in order to assess correctness and execution time. Through the use of k-anonymization, l-diversity, t-closeness, and differential privacy techniques, a sample of genuine data produced by a healthcare facility was anonymized. Contrary to expectations, the Hadoop framework was able to handle the anonymization approaches, improving accuracy and performance while speeding up execution. These findings show that data anonymization techniques, when properly implemented through Hadoop ecosystems, can increase the effectiveness of data anonymization. Furthermore, the suggested method can deliver data anonymization with the necessary utility and protection trade-offs and with performance that scales to large datasets.

Posted ContentDOI
04 Apr 2023
TL;DR: In this paper, the state-of-the-art methods used to evaluate the performance of anonymization techniques for facial images and gait patterns are assessed, and a stronger adversary model is proposed, which is alert to the recognition scenario as well as to the anonymization scenario.
Abstract: Biometric data contains distinctive human traits such as facial features or gait patterns. The use of biometric data permits an individuation so exact that the data is utilized effectively in identification and authentication systems. But for this same reason, privacy protection becomes indispensable. Privacy protection is extensively afforded by anonymization. Anonymization techniques obfuscate or remove the sensitive personal data to achieve high levels of anonymity. However, the effectiveness of anonymization relies, in equal parts, on the effectiveness of the methods employed to evaluate anonymization performance. In this paper, we assess the state-of-the-art methods used to evaluate the performance of anonymization techniques for facial images and gait patterns. We demonstrate that the state-of-the-art evaluation methods have serious and frequent shortcomings; in particular, we find that their underlying assumptions are unwarranted. When a method evaluating anonymization performance assumes a weak adversary or a weak recognition scenario, the resulting evaluation will very likely be a gross overestimation of the anonymization performance. Therefore, we propose a stronger adversary model which is alert to the recognition scenario as well as to the anonymization scenario. Our adversary model implements an appropriate measure of anonymization performance. We improve the selection process for the evaluation dataset, and we reduce the number of identities contained in the dataset while ensuring that these identities remain easily distinguishable from one another. Our novel evaluation methodology surpasses the state-of-the-art because we measure worst-case performance and so deliver a highly reliable evaluation of biometric anonymization techniques.
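A sketch of the anonymization-aware adversary idea: the attacker enrolls an anonymized gallery so the recognizer operates in the same domain as the anonymized probes. The recognizer interface here is hypothetical:

```python
def anonymization_aware_risk(recognizer, gallery, probes, anonymize):
    """Evaluate anonymization against an adversary who knows the scheme:
    the gallery is anonymized the same way as the probes, so the
    recognizer is matched to the anonymized domain (worst-case sketch)."""
    recognizer.enroll([anonymize(img) for img, _ in gallery])  # adversary adapts
    hits = sum(recognizer.identify(anonymize(img)) == identity
               for img, identity in probes)
    return hits / len(probes)  # high accuracy => the anonymization is weak
```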

Posted ContentDOI
12 May 2023
TL;DR: In this paper, the authors study four machine learning methods currently used for classification purposes in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them.
Abstract: Anonymization techniques based on obfuscating the quasi-identifiers by means of value generalization hierarchies are widely used to achieve preset levels of privacy. To prevent different types of attacks against database privacy, it is necessary to apply several anonymization techniques beyond the classical k-anonymity or ℓ-diversity. However, applying these methods is directly connected to a reduction of their utility in prediction and decision-making tasks. In this work we study four classical machine learning methods currently used for classification purposes, in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them. The performance of these models is studied when varying the value of k for k-anonymity; additional tools such as ℓ-diversity, t-closeness, and δ-disclosure privacy are also deployed on the well-known Adult dataset.
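A hedged sketch of this kind of experiment: sweep k, anonymize, and measure classification accuracy. Here anonymize, encode, and adult_df are placeholders for the paper's generalization-hierarchy pipeline and dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# anonymize(df, k) and encode(df) are hypothetical helpers standing in for
# the generalization-hierarchy anonymizer and categorical encoding step;
# adult_df stands for the Adult dataset.
for k in (2, 5, 10, 25, 50):
    df_k = anonymize(adult_df, k=k)
    X, y = encode(df_k.drop(columns="income")), df_k["income"]
    acc = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()
    print(f"k={k}: mean accuracy={acc:.3f}")  # utility typically drops as k grows
```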

Journal ArticleDOI
TL;DR: In this article, the authors propose a method called Quasi Identification Based on Tree (QIBT) for automatic QI identification, based on the relationship between the numbers of distinct values assumed by the set of attributes.
Abstract: The fast advancement of information technology has resulted in more efficient information storage and retrieval. As a result, most organizations, businesses, and governments release and exchange a large amount of microdata among themselves for commercial or research purposes. However, incorrect data exchange can result in privacy breaches. Many methods and strategies have been developed to address privacy breaches, and anonymization is one that many companies use. To perform anonymization, identification of the Quasi-Identifier (QI) is significant; hence this paper proposes a method called Quasi Identification Based on Tree (QIBT) for automatic QI identification. The proposed method derives the QI based on the relationship between the numbers of distinct values assumed by the set of attributes, using a tree data structure to derive the unique and infrequent attribute values from the entire dataset at low computational cost. The proposed method consists of four phases: (i) unique attribute value computation, (ii) tree construction, (iii) computation of the quasi-identifier from the tree, and (iv) application of an anonymization technique to the identified QI. Attributes with a high risk of disclosure are identified using our proposed algorithm. Synthetic data are created exclusively for the detected QI using a partial synthetic data generation technique to improve usefulness. The suggested method's efficiency is tested with a subset of a UCI machine learning dataset and produces superior results compared to other current approaches.
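The intuition behind QI identification can be sketched without the paper's tree structure: an attribute combination is a QI candidate when its combined values single out many records. This brute-force stand-in is far more expensive than QIBT but shows the criterion:

```python
from itertools import combinations
import pandas as pd

def uniqueness(df: pd.DataFrame, attrs) -> float:
    """Fraction of records made unique by this attribute combination."""
    counts = df.groupby(list(attrs)).size()
    return (counts == 1).sum() / len(df)

def candidate_quasi_identifiers(df: pd.DataFrame, max_size: int = 3, threshold: float = 0.5):
    """Flag attribute sets whose combined values single out many records."""
    return [attrs
            for size in range(1, max_size + 1)
            for attrs in combinations(df.columns, size)
            if uniqueness(df, attrs) >= threshold]
```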

Journal ArticleDOI
TL;DR: Wang et al. propose a novel anonymization method using similarity- and diversity-based clustering that effectively preserves both the subjects' privacy and anonymous-data utility, and they identify influential attributes from the original data using a machine learning algorithm that assists in preserving a subject's privacy in imbalanced clusters.
Abstract: Most data owners publish personal data for information consumers, which is used for hidden knowledge discovery. But data published in its original form may be subject to unwanted disclosure of subjects' identities and their associated sensitive information, and therefore data are usually anonymized before publication. Many anonymization techniques have been proposed, but most of them sacrifice utility for privacy, or vice versa, and explicitly disclose sensitive information when the original data have skewed distributions. To address these technical problems, we propose a novel anonymization method using similarity- and diversity-based clustering that effectively preserves both the subjects' privacy and anonymous-data utility. We identify influential attributes from the original data using a machine learning algorithm that assists in preserving a subject's privacy in imbalanced clusters, an aspect that remained unexplored in previous research. The objective function of the clustering process considers both similarity and diversity in the attributes while assigning records to clusters, whereas most existing clustering-based anonymity techniques consider either similarity or diversity, thereby sacrificing either privacy or utility. Attribute values in each cluster set are minimally generalized to effectively achieve both competing goals. Extensive experiments were conducted on four real-world benchmark datasets to prove the feasibility of the proposed method. The experimental results showed that common and AI-based privacy risks were reduced by 13.01% and 24.3%, respectively, compared with existing methods. Data utility was augmented by 11.25% and 20.21% on two distinct metrics compared to its counterparts, and the complexity (e.g., number of iterations) of the clustering process was 2.25× lower than that of state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this article, the authors propose big data privacy preservation techniques using various privacy functions, such as data anonymization, generalization, random permutation, k-anonymity, bucketization, and l-diversity with a slicing approach.
Abstract: Various diagnostic health data formats and standards include both structured and unstructured data. Sensitive information contained in such metadata requires the development of specific approaches that can combine methods and techniques to extract and reconcile the information hidden in such data. However, when these data need to be processed and used for other purposes, many obstacles and concerns remain. Modern approaches based on machine learning, including big data analytics, assist in the information refinement process for later use of clinical evidence. These strategies consist of transforming various data into standard formats in specific scenarios; in fact, to conform to these rules, only de-identified diagnostic and personal data may be handled for secondary analysis, especially when information is distributed or transferred across institutions. This paper proposes big data privacy preservation techniques using various privacy functions. This research focuses on secure data distribution as well as access control to prevent malicious activity or similarity attacks by end users. Various privacy preservation techniques, such as data anonymization, generalization, random permutation, k-anonymity, bucketization, and l-diversity with a slicing approach, are applied during data distribution. The efficiency of the system was evaluated on the Hadoop Distributed File System (HDFS) with numerous experiments. The results show that the computational cost changes when the k-anonymity and l-diversity parameters change. As a result, the proposed system offers greater efficiency in Hadoop environments, reducing execution time by 15% to 18%, and provides a higher level of access control security than other security algorithms. Keywords: privacy preservation; data privacy; data distribution; anonymization; slicing; privacy attacks; HDFS
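A minimal sketch of the distinct l-diversity check underlying one of the techniques named above: every equivalence class must contain at least l distinct sensitive values:

```python
from collections import defaultdict

def is_l_diverse(records, qi_attrs, sensitive_attr, l):
    """Distinct l-diversity: each equivalence class (records sharing the same
    quasi-identifier values) must hold at least l distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in qi_attrs)
        classes[key].add(r[sensitive_attr])
    return all(len(values) >= l for values in classes.values())
```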

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a Security-Centric Enterprise Data Anonymization Governance Model, a structured framework for managing data privacy across healthcare, finance, and government industries.
Abstract: The increasing need for data privacy and the rising complexity of data environments necessitate robust data anonymization techniques to safeguard personal and sensitive information. A multi-model approach to data anonymization can strike an optimal balance between privacy protection and data utility, integrating techniques such as data masking, differential privacy, machine learning algorithms, blockchain technology, and data encryption. This article introduces a Security-Centric Enterprise Data Anonymization Governance Model, a structured framework for managing data privacy across healthcare, finance, and government industries. The model ensures adherence to best practices and compliance with legal and regulatory requirements. The article addresses challenges in implementing data anonymization techniques, including maintaining data utility and preventing re-identification, by advocating for a multi-model approach that combines various technologies and methods. We suggest that by adopting this holistic approach, organizations can enhance their data protection measures and foster a culture of data privacy.

Posted ContentDOI
18 Apr 2023
TL;DR: Wang et al. propose a new efficient and effective privacy preservation model based on aggregate query answers that can guarantee the confidence of the range and minimize the number of values that can be re-identified.
Abstract: Data utility and data privacy are both serious issues that must be considered when datasets are released for use in big data analytics, because the two trade off against each other: high-utility datasets generally carry high privacy-violation risks, and datasets engineered for strong privacy preservation often suffer from utility issues. To address these issues, several privacy preservation models have been proposed, such as k-Anonymity, l-Diversity, t-Closeness, Anatomy, k-Likeness, and (lp1, …, lpn)-Privacy, in which all users' explicit identifier values are removed and all unique quasi-identifier values in datasets are distorted. Unfortunately, these privacy preservation models are static data models and still have data utility issues that must be addressed; thus, they can be insufficient for addressing privacy violation issues in big data analytics. For this reason, a new efficient and effective privacy preservation model is proposed in this work, based on aggregate query answers, that can guarantee the confidence of the range and minimize the number of values that can be re-identified. Aside from the privacy preservation constraints, the proposed model also aims to maintain complexity and data utility as much as possible. Furthermore, extensive experiments show that the proposed model is efficient and effective.

Proceedings ArticleDOI
16 Feb 2023
TL;DR: In this article, the authors test analytical opportunities for a learning method while protecting the security of student data, and find that the decrease in the correlation coefficients must be balanced against a significant loss of information.
Abstract: Rapid technological advances have produced methods that can collect, store, and analyze data with extraordinary capability. More recently, academic institutions have also offered open and distance learning programs, through which they obtain big data related to the information and communication systems that involve their students; combining various modern big data analytical tools and techniques makes it possible to maintain student data privacy. The main purpose of this research is to test analytical opportunities for a learning method while protecting the security of student data. We can balance data mining efforts against data privacy through the K-anonymization method. The results of this study indicate that, in testing the correlation coefficient between each online activity forum and student achievement scores, the most significant result was 0.298, for the “Total Log” activity platform. After the anonymization process, we find that the decrease in the correlation coefficients must be balanced against a significant loss of information.

Journal ArticleDOI
TL;DR: In this article, a generic anonymization approach for person-specific data is proposed, which retains more information for data mining and analytical purposes while providing considerable privacy, taking into account the usefulness and uncertainty of attributes to significantly enhance data utility.
Abstract: This paper proposes a generic anonymization approach for person-specific data, which retains more information for data mining and analytical purposes while providing considerable privacy. The proposed approach takes into account the usefulness and uncertainty of attributes while anonymizing the data to significantly enhance data utility. We devised a method for determining the usefulness weight for each attribute item in a dataset, rather than manually deciding (or assuming based on domain knowledge) that a certain attribute might be more useful than another. We employed an information theory concept for measuring the uncertainty regarding sensitive attribute's value in equivalence classes to prevent unnecessary generalization of data. A flexible generalization scheme that simultaneously considers both attribute usefulness and uncertainty is suggested to anonymize person-specific data. The proposed methodology involves six steps: primitive analysis of the dataset, such as analyzing attribute availability in the data; arranging the attributes into relevant categories and sophisticated pre-processing; computing usefulness weights of attributes; ranking users based on similarities; computing uncertainty in sensitive attributes (SAs); and flexible data generalization. Our methodology offers the advantage of retaining higher truthfulness in data without losing guarantees of privacy. Experimental analysis on two real-life benchmark datasets with varying scales, and comparisons with prior state-of-the-art methods, demonstrate the potency of our anonymization approach. Specifically, our approach yielded better performance on three metrics, namely accuracy, information loss, and disclosure risk. The accuracy and information loss were improved by restraining heavier anonymization of data, and disclosure risk was improved by preserving higher uncertainty in the SA column. Lastly, our approach is generic and can be applied to any real-world person-specific tabular datasets encompassing both demographics and SAs of individuals.
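The uncertainty measure described above is information-theoretic; a minimal sketch using Shannon entropy over the sensitive values of one equivalence class (illustrative, not the authors' exact formulation):

```python
import math
from collections import Counter

def sa_uncertainty(sensitive_values) -> float:
    """Shannon entropy of the sensitive values in one equivalence class;
    higher entropy means more attacker uncertainty, so less generalization
    of the class's quasi-identifiers is needed."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: sa_uncertainty(["flu", "flu", "cancer", "hiv"]) -> 1.5 bits
```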

Book ChapterDOI
01 Jan 2023
TL;DR: In this paper, the authors compare different anonymization methods and algorithms, such as the 1:m-generalization algorithm and Mondrian, to show which of them maintains data privacy and high utility of analysis results at the same time.
Abstract: Today, many sources of data, such as IoT devices, produce massive amounts of data, particularly in the healthcare industry. This microdata needs to be published and shared for medical research purposes, data analysis, mining, learning analytics tasks, and decision-making. But published data contain sensitive and private information about individuals, and if the microdata are published in their original format, the privacy of individuals may be disclosed, putting them at risk, especially if an adversary has strong background knowledge about the target individual. When an individual has multiple records and multiple sensitive attributes (MSA), new privacy leakages or disclosures can arise. So the fundamental issue is how to protect the privacy of a 1:M dataset with MSA using anonymization techniques and methods, and how to balance utility and privacy for these data while reducing information loss and misuse. The objective of this paper is to apply different methods and anonymization algorithms, such as the 1:m-generalization algorithm and Mondrian, and compare them to show which of them maintains data privacy and high utility of analysis results at the same time. From this comparison, we found that the m-generalization algorithm and the (p, k)-angelization method perform well in terms of information loss and data utility compared to the other methods and algorithms.
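For reference, a minimal sketch of the Mondrian partitioning strategy compared in the chapter, assuming numeric quasi-identifiers (the strict variant that splits on the widest dimension's median):

```python
import pandas as pd

def mondrian(df: pd.DataFrame, qi, k):
    """Recursive median split on the widest QI dimension; stop when a
    partition can no longer be split into two halves of size >= k."""
    spans = {a: df[a].max() - df[a].min() for a in qi}   # numeric QIs assumed
    for attr in sorted(spans, key=spans.get, reverse=True):
        median = df[attr].median()
        left, right = df[df[attr] <= median], df[df[attr] > median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qi, k) + mondrian(right, qi, k)
    return [df]  # leaf partition: generalize its QI values to the partition's ranges
```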

Posted ContentDOI
22 Feb 2023
TL;DR: Wang et al. propose a new privacy preservation model, extending LKC-Privacy, that addresses privacy violation issues in high-dimensional datasets, such that released datasets satisfying it are not vulnerable to data comparison attacks and are highly efficient and effective in data maintenance.
Abstract: Abstract A major challenge is when datasets are released to utilize in the outside scope of data-collecting organizations, it is how to balance data utilities and data privacies. To achieve this aim in data collection (datasets), there are several privacy preservation models that have been proposed such as k-Anonymity and l-Diversity. Unfortunately, these privacy preservation models can be sufficient to address privacy violation issues in datasets that do not have high-dimensional attributes. For this reason, a privacy preservation model, LKC-Privacy, can address privacy violation issues in high-dimensional datasets to be proposed. With this privacy preservation model, datasets cannot have any concern of privacy violation issues when all L-size distinct quasi-identifier values are distorted (suppressed or generalized) to be at least K indistinguishable tuples. Moreover, every protected sensitive value relates to each group of indistinguishable quasi-identifier values, it must have the confidence of re-identifications to be at most C. Although LKC-Privacy is more efficient and effective than k-Anonymity and l-Diversity, it is generally efficient and effective to address privacy violation issues in location-based datasets. Moreover, we see that datasets satisfy LKC-Privacy constraints, they still have privacy violation issues from using data comparison attacks and they further have data utility issues that must be addressed. Therefore, a new privacy preservation model can address privacy violation issues in high-dimensional datasets such that its satiable released datasets do not have any concern of privacy violation issues from using data comparison attacks and are highly efficient and effective in data maintenance to be proposed in this work. Furthermore, we show that the proposed model is more efficient and effective by using extensive experiments.