
Showing papers by "Zhiyuan Chen published in 2010"


Journal ArticleDOI
10 Dec 2010
TL;DR: This article proposes a technique that not only protects privacy, but also guarantees that the same model, in the form of decision trees or regression trees, can be built from the sanitized data.
Abstract: Data mining techniques have been widely used in many research disciplines, such as medicine, the life sciences, and the social sciences, to extract useful knowledge (such as mining models) from research data. Research data often needs to be published along with the data mining model for verification or reanalysis, but the privacy of the published data needs to be protected because otherwise it is subject to misuse such as linking attacks. Employing privacy protection methods therefore becomes necessary. These methods, however, consider only privacy protection and do not guarantee that the same mining models can be built from the sanitized data, so the published models cannot be verified against it. This article proposes a technique that not only protects privacy but also guarantees that the same model, in the form of decision trees or regression trees, can be built from the sanitized data. We have also shown experimentally that other mining techniques can be used to reanalyze the sanitized data. This technique can be used to promote sharing of research data.
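The abstract does not spell out the construction, but the flavor of the guarantee can be illustrated with a minimal sketch: if a published tree splits numeric attributes at known thresholds, then permuting values only among records that fall in the same interval between consecutive thresholds leaves every record on the same side of every split, so a threshold-based tree learner partitions the sanitized column exactly as it did the original. The helper below (sanitize_column, Python/NumPy) is a hypothetical illustration under that assumption, not the paper's actual algorithm.

    import numpy as np

    def sanitize_column(values, thresholds, rng=None):
        """Shuffle a numeric column while preserving a published decision
        tree: values are permuted only within the intervals delimited by
        the tree's split thresholds, so every record stays on the same
        side of every split. Hypothetical sketch, not the paper's method."""
        rng = rng or np.random.default_rng()
        values = np.asarray(values, dtype=float)
        edges = np.concatenate(([-np.inf], np.sort(thresholds), [np.inf]))
        sanitized = values.copy()
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (values > lo) & (values <= hi)   # records in this interval
            if mask.sum() > 1:
                # Permuting within the interval breaks the record-value
                # linkage but cannot move a record across any threshold.
                sanitized[mask] = rng.permutation(values[mask])
        return sanitized

Because each record's interval membership is unchanged, the record partition induced by the tree, and hence the tree itself, is reproducible from the sanitized data.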

11 citations


Journal ArticleDOI
TL;DR: The authors present an approach to identify the optimal set of transactions that, if sanitized, would hide sensitive patterns while minimizing both the accidental hiding of legitimate patterns and the damage done to the database.
Abstract: While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, would hide sensitive patterns while minimizing the accidental hiding of legitimate patterns and the damage done to the database. Their methodology allows the user to adjust the weights assigned to the benefit, in terms of the number of restrictive patterns hidden; the cost, in terms of the number of legitimate patterns hidden; and the damage to the database, in terms of the difference between the marginal frequencies of items in the original and sanitized databases. Most approaches to this problem in the literature are heuristic, with no formal treatment of optimality. While integer linear programming (ILP) has been used before in a few works as a formal optimization approach, the novelty of this method is its extremely low-complexity cost model compared to the others. The authors implemented their methodology in C and C++ and ran several experiments with synthetic data generated with the IBM synthetic data generator. The experiments show excellent results when compared to those in the literature.

[…] 2002; Oliveira et al., 2003a, 2003b; Han et al., 2006). A number of cases have been reported in the literature where data mining has posed threats by enabling the discovery of sensitive knowledge and violating privacy. One typical problem is inference: deriving sensitive information from non-sensitive or unclassified data (Oliveira et al., 2002; Clifton, 2001). Data mining is part of the larger business intelligence initiatives taking place in organizations across government and industry sectors, many of which include medical applications. It is used for prediction as well as knowledge discovery, which can lead to cost reduction, business expansion, and detection of fraud or wastage of resources, among other things. With its many benefits, data mining has also given rise to increasingly complex and controversial privacy issues. For example, the privacy implications of data mining have led to high-profile controversies involving the use of data mining tools and techniques on data related to drug prescriptions. Two major health care data publishers filed a petition to the Supreme Court on whether commercial use of data mining is protected by the First Amendment, an appeal of a controversial ruling by the 1st U.S. Circuit Court of Appeals that upheld a 2006 New Hampshire law banning the use of doctors' prescription histories to increase drug sales. Privacy implications are a major roadblock to information sharing across organizations.

For example, sharing inventory data might reveal information that competitors can use to gain strategic advantages. Unless the actual or perceived implications of data mining methods for privacy are properly dealt with, they can lead to suboptimal decision making in organizations and reluctance among the general public to accept such tools. For example, there could be benefits in sharing prescription data from different pharmacy stores to mine for information such as generic drug use or socio-demographic and geographic patterns in prescription drugs; however, this requires moving the data from each store or site to a central location, which increases the risk of litigation. Several potential problems identified for privacy protection make the case for privacy-preserving data mining. These include legal requirements for protecting data (e.g., the HIPAA healthcare regulations in the US; Federal Register, 2002), liability from inadvertent disclosure of data, the risk of misuse of proprietary information (Atallah et al., 2003), and antitrust concerns (Vaidya et al., 2006). Thus it is of growing importance to devise efficient tradeoffs between knowledge discovery and knowledge hiding in databases, so that the cost to the parties involved is minimized while the benefit is maximized. The work presented in this article focuses on formulating a model for sanitizing databases against the discovery of restrictive association patterns while distorting the database and legitimate pattern discovery as little as possible. To illustrate the problem, consider a classic example given in Evfimievski et al. (2002) and Oliveira et al. (2002). There is a server and several clients, each having its own set of items. The clients want the server to provide them with recommendations based on statistical information about associations among items, but they do not want the server to learn certain restrictive patterns. If the clients send the raw database, the server will discover the restrictive patterns in the course of searching for frequent patterns. Each client therefore has to send a modified version of the raw database from which the restrictive patterns cannot be discovered, and the distortion should be minimal, hiding as few legitimate patterns as possible. Other examples of the problem are given in Verykios et al. (2004). The example shows the vulnerability of critical frequent patterns, but it bears directly on the problem of exposing critical association rules as well, since rules are built from patterns. Indeed, some research, such as Verykios et al. (2004), reduces the support of sensitive frequent patterns as one method of hiding the association rules that could be generated from them. All these methods are based on modifying the […]
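As a rough illustration of how such an ILP can be posed (not the authors' exact model, whose objective also weights legitimate-pattern loss and marginal-frequency damage), the sketch below uses the PuLP library, assumed available, with one binary variable per transaction and one constraint per sensitive itemset forcing its support below the mining threshold.

    import pulp

    def select_transactions(db, sensitive, min_support):
        """Choose a minimum set of transactions to sanitize so that every
        sensitive itemset falls below the absolute support threshold.
        db: list of item sets; sensitive: list of item sets.
        Hypothetical ILP sketch (PuLP + CBC), not the paper's exact model."""
        prob = pulp.LpProblem("sanitization", pulp.LpMinimize)
        # x[j] = 1 if transaction j is chosen for sanitization.
        x = pulp.LpVariable.dicts("x", range(len(db)), cat="Binary")
        # Objective: minimize database damage, here simply the number of
        # sanitized transactions (the paper weighs several cost terms).
        prob += pulp.lpSum(x.values())
        for s in sensitive:
            supporters = [j for j, t in enumerate(db) if s <= t]
            # After sanitization the remaining support must be < min_support.
            prob += len(supporters) - pulp.lpSum(x[j] for j in supporters) <= min_support - 1
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [j for j in range(len(db)) if x[j].value() == 1]

    # Toy usage: hide the pattern {a, b} from a four-transaction database.
    db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
    print(select_transactions(db, [{"a", "b"}], min_support=2))  # sanitizes one supporter

The appeal of an exact formulation over the heuristics cited above is that the solver certifies optimality of the chosen transaction set for whatever cost model is encoded in the objective.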

9 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper studies the impact of random perturbation on a popular data mining and analysis method, linear discriminant analysis, and finds that for large data sets the impact of perturbation is quite limited, while for small data sets the negative impact can be reduced by publishing additional statistics about the perturbation along with the perturbed data.
Abstract: The ubiquity of the internet not only makes it very convenient for individuals or organizations to share data for data mining or statistical analysis, but also greatly increases the chance of privacy breaches. Many techniques, such as random perturbation, exist to protect the privacy of such data sets. However, perturbation often has negative impacts on the quality of data mining or statistical analysis conducted over the perturbed data. This paper studies the impact of random perturbation on a popular data mining and analysis method: linear discriminant analysis. The contributions are twofold. First, we discover that for large data sets, the impact of perturbation is quite limited (i.e., high-quality results may be obtained directly from perturbed data) if the perturbation process satisfies certain conditions. Second, we discover that for small data sets, the negative impact of perturbation can be reduced by publishing additional statistics about the perturbation along with the perturbed data. We provide both theoretical derivations and experimental verification of these results.
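The second result can be made concrete for additive noise: if each record is perturbed as y = x + e with e drawn from N(0, Σe) and Σe is published, then class means estimated from the perturbed data remain unbiased, and the within-class covariance can be corrected by subtracting Σe before forming the Fisher discriminant. The NumPy snippet below is an illustrative two-class sketch under these assumptions, not the paper's derivation.

    import numpy as np

    def lda_direction_corrected(Y, labels, noise_cov):
        """Two-class Fisher LDA from perturbed data Y = X + E, E ~ N(0, noise_cov).
        Illustrative sketch: the published noise covariance is subtracted from
        the pooled within-class scatter before solving for the discriminant."""
        labels = np.asarray(labels)
        Y0, Y1 = Y[labels == 0], Y[labels == 1]
        m0, m1 = Y0.mean(axis=0), Y1.mean(axis=0)  # class means stay unbiased
        # The pooled within-class covariance of the perturbed data overestimates
        # the clean covariance by exactly noise_cov; subtract it back out.
        S = ((len(Y0) - 1) * np.cov(Y0, rowvar=False) +
             (len(Y1) - 1) * np.cov(Y1, rowvar=False)) / (len(Y) - 2)
        return np.linalg.solve(S - noise_cov, m1 - m0)  # discriminant direction

For very small samples the corrected matrix S - noise_cov may fail to be positive definite, so a practical implementation would need some regularization; publishing extra statistics about the perturbation, as the paper proposes, is what makes such a correction possible at all.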

4 citations