Author

Geetha Jagannathan

Bio: Geetha Jagannathan is an academic researcher from Rutgers University. The author has contributed to research in the topics of Medicine and Decision tree. The author has an h-index of 8 and has co-authored 15 publications receiving 863 citations. Previous affiliations of Geetha Jagannathan include Stevens Institute of Technology and Columbia University.

Papers
Proceedings ArticleDOI
21 Aug 2005
TL;DR: The concept of arbitrarily partitioned data is introduced, which is a generalization of both horizontally and vertically partitioned data, and an efficient privacy-preserving protocol for k-means clustering in the setting of arbitrarily partitioned data is provided.
Abstract: Advances in computer networking and database technologies have enabled the collection and storage of vast quantities of data. Data mining can extract valuable knowledge from this data, and organizations have realized that they can often obtain better results by pooling their data together. However, the collected data may contain sensitive or private information about the organizations or their customers, and privacy concerns are exacerbated if data is shared between multiple organizations. Distributed data mining is concerned with the computation of models from data that is distributed among multiple participants. Privacy-preserving distributed data mining seeks to allow for the cooperative computation of such models without the cooperating parties revealing any of their individual data items. Our paper makes two contributions in privacy-preserving data mining. First, we introduce the concept of arbitrarily partitioned data, which is a generalization of both horizontally and vertically partitioned data. Second, we provide an efficient privacy-preserving protocol for k-means clustering in the setting of arbitrarily partitioned data.

422 citations
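
To make the notion of arbitrarily partitioned data concrete, here is a minimal Python sketch of the setting: every (record, attribute) cell of the database belongs to exactly one of two parties, which generalizes both horizontal (rows split) and vertical (columns split) partitioning, and the squared distance from a record to a candidate center decomposes into two per-party shares. This is an illustration only, not the paper's protocol: the actual protocol combines such shares with cryptographic subprotocols (e.g., secure scalar product and secure comparison) rather than summing them in the clear, and all names below are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_attrs, k = 6, 4, 2
data = rng.normal(size=(n_records, n_attrs))

# Arbitrary partitioning generalizes horizontal (rows split) and vertical
# (columns split) partitioning: each cell belongs to exactly one party.
ownership = rng.integers(0, 2, size=(n_records, n_attrs))  # 0 = Alice, 1 = Bob

centers = data[rng.choice(n_records, size=k, replace=False)]

def party_share(record, center, party):
    """A party's share of the squared distance, computed only from the
    attribute cells that party owns; the two shares sum to the true value."""
    mask = ownership[record] == party
    diff = data[record, mask] - center[mask]
    return float(diff @ diff)

for i in range(n_records):
    # The real protocol would combine and compare these shares securely;
    # adding them in the clear here only demonstrates the decomposition.
    dists = [party_share(i, c, 0) + party_share(i, c, 1) for c in centers]
    print(f"record {i} -> cluster {int(np.argmin(dists))}")
```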

Journal ArticleDOI
TL;DR: In this paper, the problem of constructing private classifiers using decision trees, within the framework of differential privacy, was studied and a differentially private decision tree ensemble algorithm based on random decision trees was proposed.
Abstract: In this paper, we study the problem of constructing private classifiers using decision trees, within the framework of differential privacy. We first present experimental evidence that creating a differentially private ID3 tree using differentially private low-level queries does not simultaneously provide good privacy and good accuracy, particularly for small datasets. In search of better privacy and accuracy, we then present a differentially private decision tree ensemble algorithm based on random decision trees. We demonstrate experimentally that this approach yields good prediction accuracy while maintaining good privacy, even for small datasets. We also present differentially private extensions of our algorithm to two settings: (1) new data is periodically appended to an existing database and (2) the database is horizontally or vertically partitioned between multiple users.

166 citations
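
The key property a random-decision-tree ensemble exploits can be sketched compactly: the tree structure is chosen without looking at the data, so the only data-dependent quantities are the leaf class counts, and each record changes exactly one count per tree by 1. A Laplace mechanism on those counts therefore yields differential privacy. The Python sketch below is a hedged illustration; the depth, number of trees, and even split of the privacy budget across trees are assumptions of this sketch, not the paper's exact calibration.

```python
import numpy as np

rng = np.random.default_rng(1)

def build_random_tree(n_features, depth):
    """Structure only: a complete binary tree with a random feature and
    threshold at each internal node. Needs no access to the data."""
    if depth == 0:
        return None
    return {"feat": int(rng.integers(n_features)),
            "thr": float(rng.uniform(0, 1)),
            "left": build_random_tree(n_features, depth - 1),
            "right": build_random_tree(n_features, depth - 1)}

def leaf_index(tree, x, depth):
    idx, node = 0, tree
    for _ in range(depth):
        go_right = x[node["feat"]] > node["thr"]
        idx = 2 * idx + int(go_right)
        node = node["right"] if go_right else node["left"]
    return idx

def fit_private_counts(tree, X, y, depth, n_classes, eps):
    # Each record lands in exactly one leaf, so each count has
    # sensitivity 1 and Laplace(1/eps) noise gives eps-DP per tree.
    counts = np.zeros((2 ** depth, n_classes))
    for xi, yi in zip(X, y):
        counts[leaf_index(tree, xi, depth), yi] += 1
    return counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

# Toy data in [0, 1]^d; epsilon is divided evenly across trees (sequential
# composition over the same data -- an assumption of this sketch).
X, y = rng.uniform(size=(200, 5)), rng.integers(0, 2, size=200)
depth, n_trees, total_eps = 3, 5, 1.0
trees = [build_random_tree(5, depth) for _ in range(n_trees)]
counts = [fit_private_counts(t, X, y, depth, 2, total_eps / n_trees)
          for t in trees]

def predict(x):
    votes = sum(c[leaf_index(t, x, depth)] for t, c in zip(trees, counts))
    return int(np.argmax(votes))

print(predict(X[0]))
```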

Proceedings ArticleDOI
06 Dec 2009
TL;DR: This paper first constructs privacy-preserving ID3 decision trees using differentially private sum queries, then presents a differentially private decision tree ensemble algorithm using the random decision tree approach, which shows good prediction accuracy even when the size of the datasets is small.
Abstract: In this paper, we study the problem of constructing private classifiers using decision trees, within the framework of differential privacy. We first construct privacy-preserving ID3 decision trees using differentially private sum queries. Our experiments show that for many data sets a reasonable privacy guarantee can only be obtained via this method at a steep cost of accuracy in predictions. We then present a differentially private decision tree ensemble algorithm using the random decision tree approach. We demonstrate experimentally that our approach yields good prediction accuracy even when the size of the datasets is small. We also present a differentially private algorithm for the situation in which new data is periodically appended to an existing database. Our experiments show that our differentially private random decision tree classifier handles data updates in a way that maintains the same level of privacy guarantee.

119 citations
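
The data-append extension is commonly justified by parallel composition: successive batches are disjoint, so adding freshly noised counts for each new batch keeps the per-record guarantee at the same epsilon rather than degrading it. Below is a minimal, self-contained Python sketch of that argument on a class histogram; the histogram abstraction and all names are assumptions of this sketch, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_histogram(labels, n_classes, eps):
    """Laplace mechanism for a class histogram: one record changes one
    bin by 1, so sensitivity is 1 and scale 1/eps gives eps-DP."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts + rng.laplace(scale=1.0 / eps, size=n_classes)

running = np.zeros(2)
for batch in (rng.integers(0, 2, 50), rng.integers(0, 2, 30)):
    # Batches are disjoint, so noising only the new batch keeps the
    # per-record guarantee at the same epsilon (parallel composition).
    running += noisy_histogram(batch, 2, eps=1.0)
print(running)
```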

Proceedings Article
03 Jul 2006
TL;DR: A simple I/O-efficient k-clustering algorithm that was designed with the goal of enabling a privacy-preserving version of the algorithm and produces cluster centers that are, on average, more accurate than the ones produced by the well-known iterative k-means algorithm.
Abstract: We present a simple I/O-efficient k-clustering algorithm that was designed with the goal of enabling a privacy-preserving version of the algorithm. Our experiments show that this algorithm produces cluster centers that are, on average, more accurate than the ones produced by the well-known iterative k-means algorithm. We use our new algorithm as the basis for a communication-efficient privacy-preserving k-clustering protocol for databases that are horizontally partitioned between two parties. Unlike existing privacy-preserving protocols based on the k-means algorithm, this protocol does not reveal intermediate candidate cluster centers.

108 citations
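
For contrast with the claim about intermediate centers, here is the textbook iterative (Lloyd's) k-means baseline the abstract compares against: a naive distributed version of this loop exposes the candidate centers recomputed at every iteration, which is precisely the leakage the paper's protocol is designed to avoid. The code below is the standard algorithm, not the paper's own.

```python
import numpy as np

def lloyds_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest current center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # These per-iteration candidate centers are exactly the intermediate
        # values a naive distributed version would reveal to the parties.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

X = np.random.default_rng(4).normal(size=(100, 2))
centers, labels = lloyds_kmeans(X, k=3)
print(centers)
```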

Journal ArticleDOI
01 Apr 2008
TL;DR: A privacy-preserving protocol for filling in missing values using a lazy decision tree imputation algorithm for data that is horizontally partitioned between two parties.
Abstract: Handling missing data is a critical step to ensuring good results in data mining. Like most data mining algorithms, existing privacy-preserving data mining algorithms assume data is complete. In order to maintain privacy in the data mining process while cleaning data, privacy-preserving methods of data cleaning will be required. In this paper, we address the problem of privacy-preserving data imputation of missing data. Specifically, we present a privacy-preserving protocol for filling in missing values using a lazy decision tree imputation algorithm for data that is horizontally partitioned between two parties. The participants of the protocol learn only the imputed values; the computed decision tree is not learned by either party.

42 citations
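
A lazy decision tree grows only the single root-to-leaf path relevant to the record being queried, rather than a full tree. The plaintext Python sketch below illustrates that idea for imputation: repeatedly pick a split, keep only the training rows on the same side as the query record, and impute from the rows that survive. The median threshold, variance-reduction score, and stopping rule are simplifying assumptions of this sketch; the paper's protocol additionally runs the computation between two parties so that neither learns the tree or the other's data.

```python
import numpy as np

def lazy_impute(X, y, query, min_leaf=5):
    """Impute a value for `query` by lazily growing only the decision path
    the query would follow through the complete records (X, y)."""
    rows = np.arange(len(X))
    remaining = set(range(X.shape[1]))
    while remaining and len(rows) > min_leaf:
        best_f, best_var, best_side = None, np.inf, None
        for f in remaining:
            thr = np.median(X[rows, f])
            # Keep the side of the split that the query record falls on.
            side = X[rows, f] <= thr if query[f] <= thr else X[rows, f] > thr
            if side.sum() < min_leaf:
                continue
            v = y[rows[side]].var()  # variance reduction as a stand-in score
            if v < best_var:
                best_f, best_var, best_side = f, v, side
        if best_f is None:
            break
        rows = rows[best_side]
        remaining.remove(best_f)
    # "Leaf": the training records that share the query's entire path.
    return y[rows].mean()

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)  # y driven mainly by feature 0
print(lazy_impute(X, y, query=rng.uniform(size=4)))
```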


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: This book covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, approximate inference, sampling methods, sequential data, and the combining of models in the context of machine learning.
Abstract: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

01 Jan 2002

9,314 citations

Proceedings ArticleDOI
22 May 2017
TL;DR: This work quantitatively investigates how machine learning models leak information about the individual data records on which they were trained and empirically evaluates the inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon.
Abstract: We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model's training dataset. To perform membership inference against a target model, we make adversarial use of machine learning and train our own inference model to recognize differences in the target model's predictions on the inputs that it trained on versus the inputs that it did not train on. We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.

2,059 citations
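
The attack structure described above can be sketched in a few lines: train shadow models on data from (approximately) the same distribution as the target's training data, label their prediction vectors as member or non-member, fit an attack classifier on those labeled vectors, and apply it to the black-box target's outputs. The Python sketch below simplifies the paper's design (which, for instance, trains per-class attack models); all model choices and sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

def make_data(n):
    """Stand-in for samples from the target's data distribution."""
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

# Shadow phase: each shadow model labels its own prediction vectors as
# coming from training members (in) or held-out non-members (out).
attack_X, attack_y = [], []
for _ in range(5):
    X_in, y_in = make_data(100)
    X_out, _ = make_data(100)
    shadow = RandomForestClassifier(n_estimators=30).fit(X_in, y_in)
    for X_part, member in ((X_in, 1), (X_out, 0)):
        attack_X.append(shadow.predict_proba(X_part))
        attack_y.append(np.full(len(X_part), member))
attack = LogisticRegression().fit(np.vstack(attack_X), np.concatenate(attack_y))

# Attack phase: classify the black-box target's confidence vectors.
X_tr, y_tr = make_data(100)
target = RandomForestClassifier(n_estimators=30).fit(X_tr, y_tr)
X_non, _ = make_data(100)
print("members flagged:    ", attack.predict(target.predict_proba(X_tr)).mean())
print("non-members flagged:", attack.predict(target.predict_proba(X_non)).mean())
```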

Proceedings ArticleDOI
12 Oct 2015
TL;DR: This paper presents a practical system that enables multiple parties to jointly learn an accurate neural-network model for a given objective without sharing their input datasets, and exploits the fact that the optimization algorithms used in modern deep learning, namely, those based on stochastic gradient descent, can be parallelized and executed asynchronously.
Abstract: Deep learning based on artificial neural networks is a very popular approach to modeling, classifying, and recognizing complex data such as images, speech, and text. The unprecedented accuracy of deep learning methods has turned them into the foundation of new AI-based services on the Internet. Commercial companies that collect user data on a large scale have been the main beneficiaries of this trend, since the success of deep learning techniques is directly proportional to the amount of data available for training. Massive data collection required for deep learning presents obvious privacy issues. Users' personal, highly sensitive data such as photos and voice recordings is kept indefinitely by the companies that collect it. Users can neither delete it, nor restrict the purposes for which it is used. Furthermore, centrally kept data is subject to legal subpoenas and extra-judicial surveillance. Many data owners (for example, medical institutions that may want to apply deep learning methods to clinical records) are prevented by privacy and confidentiality concerns from sharing the data and thus benefitting from large-scale deep learning. In this paper, we design, implement, and evaluate a practical system that enables multiple parties to jointly learn an accurate neural-network model for a given objective without sharing their input datasets. We exploit the fact that the optimization algorithms used in modern deep learning, namely those based on stochastic gradient descent, can be parallelized and executed asynchronously. Our system lets participants train independently on their own datasets and selectively share small subsets of their models' key parameters during training. This offers an attractive point in the utility/privacy tradeoff space: participants preserve the privacy of their respective data while still benefitting from other participants' models and thus boosting their learning accuracy beyond what is achievable solely on their own inputs. We demonstrate the accuracy of our privacy-preserving deep learning on benchmark datasets.

1,836 citations
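
The selective sharing idea can be illustrated with a toy linear model: each participant downloads the current shared parameters, computes a gradient on its private data, and uploads only its largest-magnitude gradient coordinates. The Python sketch below is a simplified, hedged rendering of that scheme; the synchronous rounds, the linear model, and the shared fraction are assumptions of this sketch, while the paper's system is asynchronous and adds further machinery.

```python
import numpy as np

rng = np.random.default_rng(7)
dim, lr, k = 50, 0.1, 5  # k = 10% of coordinates shared per round

def local_gradient(w, X, y):
    """Least-squares gradient for a linear model: X^T (X w - y) / n."""
    return X.T @ (X @ w - y) / len(y)

# Two participants, each with a private dataset that is never shared directly.
datasets = [(rng.normal(size=(200, dim)), rng.normal(size=200)) for _ in range(2)]
global_params = np.zeros(dim)

for _ in range(100):
    for X, y in datasets:
        w = global_params.copy()           # download current shared parameters
        g = local_gradient(w, X, y)
        top = np.argsort(np.abs(g))[-k:]   # select largest-magnitude coordinates
        global_params[top] -= lr * g[top]  # upload only the selected updates

print("norm of shared model:", np.linalg.norm(global_params))
```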