Author

Flavio P. Calmon

Bio: Flavio P. Calmon is an academic researcher at Harvard University. His research focuses on topics including mutual information and computer science. He has an h-index of 24 and has co-authored 130 publications receiving 2,282 citations. Previous affiliations of Flavio P. Calmon include Maynooth University and IBM.


Papers
Proceedings Article
01 Jan 2017
TL;DR: This paper proposes a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. It also characterizes the impact of limited sample size on accomplishing this objective.
Abstract: Non-discrimination is a recognized objective in algorithmic decision making. In this paper, we introduce a novel probabilistic formulation of data pre-processing for reducing discrimination. We propose a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. We characterize the impact of limited sample size in accomplishing this objective. Two instances of the proposed optimization are applied to datasets, including one on real-world criminal recidivism. Results show that discrimination can be greatly reduced at a small cost in classification accuracy.

566 citations
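
The abstract above describes a convex program balancing three goals. The sketch below is a simplified illustration of that structure, not the paper's exact formulation: discrimination control is softened into a penalty term, the mapping acts on a toy finite alphabet, and all names and parameter values (the toy distributions, max_dist, lam) are assumptions for illustration.

```python
# Simplified sketch (not the paper's exact program): for each group d, learn a
# randomized mapping Q_d[x, xh] = Pr(Xh = xh | X = x, D = d) that
# (i) pulls the transformed distributions of the two groups together
#     (discrimination control, here as a penalty),
# (ii) bounds the expected distortion applied to each individual sample, and
# (iii) keeps each group's transformed distribution close to the original
#     (utility).
import cvxpy as cp
import numpy as np

n = 4                                    # toy alphabet size |X| = |Xh|
rng = np.random.default_rng(0)
p_x_given_d = rng.dirichlet(np.ones(n), size=2)  # p(x | d) for groups d in {0, 1}
dist = 1.0 - np.eye(n)                   # distortion: 0 to keep x, 1 to change it
max_dist, lam = 0.8, 10.0                # assumed distortion budget, penalty weight

Q = [cp.Variable((n, n), nonneg=True) for _ in range(2)]
out = [p_x_given_d[d] @ Q[d] for d in range(2)]   # p(xh | d), affine in Q_d

constraints = [cp.sum(Q[d], axis=1) == 1 for d in range(2)]            # rows are pmfs
constraints += [cp.sum(cp.multiply(Q[d], dist), axis=1) <= max_dist    # (ii)
                for d in range(2)]

gap = cp.norm1(out[0] - out[1])                                        # (i)
utility_loss = sum(cp.norm1(out[d] - p_x_given_d[d]) for d in range(2))  # (iii)
cp.Problem(cp.Minimize(utility_loss + lam * gap), constraints).solve()
print("discrimination gap after transform:", gap.value)
```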

Proceedings ArticleDOI
01 Oct 2012
TL;DR: It is proved that under both metrics the resulting design problem of finding the optimal mapping from the user's data to a privacy-preserving output can be cast as a modified rate-distortion problem which, in turn, can be formulated as a convex program.
Abstract: We propose a general statistical inference framework to capture the privacy threat incurred by a user that releases data to a passive but curious adversary, given utility constraints. We show that applying this general framework to the setting where the adversary uses the self-information cost function naturally leads to a non-asymptotic information-theoretic approach for characterizing the best achievable privacy subject to utility constraints. Based on these results we introduce two privacy metrics, namely average information leakage and maximum information leakage. We prove that under both metrics the resulting design problem of finding the optimal mapping from the user's data to a privacy-preserving output can be cast as a modified rate-distortion problem which, in turn, can be formulated as a convex program. Finally, we compare our framework with differential privacy.

315 citations
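
The "modified rate-distortion problem" in the abstract lends itself to a compact convex-programming sketch. The toy program below (alphabets, distortion measure, and parameter values are assumptions, not from the paper) minimizes average information leakage I(S; Y) over release channels p(y|x) subject to an expected-distortion constraint; since I(S; Y) is jointly convex in the pair of distributions (p(s,y), p(s)p(y)), both affine in the channel, it can be written with cvxpy's kl_div atom.

```python
# Sketch: choose a release channel Q[x, y] = Pr(Y = y | X = x) minimizing the
# average information leakage I(S; Y) subject to an expected distortion
# constraint between the released Y and the original X.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
nS, nX, nY = 2, 3, 3
p_sx = rng.dirichlet(np.ones(nS * nX)).reshape(nS, nX)  # toy joint pmf p(s, x)
p_s, p_x = p_sx.sum(axis=1), p_sx.sum(axis=0)
dist = 1.0 - np.eye(nX)                  # Hamming distortion d(x, y)
D_max = 0.3                              # assumed distortion budget

Q = cp.Variable((nX, nY), nonneg=True)   # the privacy-preserving mapping p(y | x)
p_sy = p_sx @ Q                          # p(s, y), affine in Q
p_y = p_x @ Q                            # p(y), affine in Q
p_s_py = cp.vstack([p_s[s] * p_y for s in range(nS)])   # product p(s)p(y)

# The sum of kl_div terms equals I(S; Y) in nats, since both arguments sum to 1.
leakage = cp.sum(cp.kl_div(p_sy, p_s_py))

constraints = [cp.sum(Q, axis=1) == 1,                                # rows are pmfs
               cp.sum(cp.multiply(Q, p_x[:, None] * dist)) <= D_max]  # E[d(X, Y)]
cp.Problem(cp.Minimize(leakage), constraints).solve()
print("minimum average leakage (nats):", leakage.value)
```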

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work reduces the optimization size by introducing a quantization step and shows how to generate privacy mappings under quantization; the method is evaluated on a dataset showing correlations between political views and TV viewing habits, demonstrating that good privacy properties can be achieved with limited distortion.
Abstract: We propose a practical methodology to protect a user's private data when he wishes to publicly release data that is correlated with his private data, in the hope of getting some utility. Our approach relies on a general statistical inference framework that captures the privacy threat under inference attacks, given utility constraints. Under this framework, data is distorted before it is released, according to a privacy-preserving probabilistic mapping. This mapping is obtained by solving a convex optimization problem, which minimizes information leakage under a distortion constraint. We address a practical challenge encountered when applying this theoretical framework to real-world data: the optimization may become intractable and face scalability issues when data assumes values in large alphabets or is high-dimensional. Our work makes two major contributions. First, we reduce the optimization size by introducing a quantization step, and show how to generate privacy mappings under quantization. Second, we evaluate our method on a dataset showing correlations between political views and TV viewing habits, and demonstrate that good privacy properties can be achieved with limited distortion, so as not to undermine the original purpose of the publicly released data, e.g., recommendations.

89 citations
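
The quantization step in the abstract shrinks the convex program from |X|² mapping variables to k². The sketch below illustrates the idea using k-means as the quantizer; the clustering choice, names, and sizes are assumptions for illustration, not the paper's exact scheme.

```python
# Sketch: collapse a large or continuous data alphabet into k representative
# cells, then solve the privacy mapping on the k-cell alphabet instead of the
# raw one, reducing the optimization to k x k variables.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
raw = rng.normal(size=(10_000, 5))      # high-dimensional public data X
k = 16                                  # quantized alphabet size

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(raw)
labels = km.labels_                     # each sample mapped to one of k cells

# Empirical pmf over the quantized alphabet; a k x k privacy mapping Q would
# now be optimized against this distribution, as in the convex programs above.
p_q = np.bincount(labels, minlength=k) / labels.size
print("quantized alphabet:", k, "cells; pmf:", np.round(p_q, 3))
```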

Patent
30 Jan 2013
TL;DR: This patent describes techniques, devices, systems, and protocols for data transfer between communication nodes via multiple heterogeneous paths, where network coding may be used to improve data flow and reliability in a multiple-path scenario.
Abstract: Techniques, devices, systems, and protocols are disclosed herein that relate to data transfer between communication nodes via multiple heterogeneous paths. In various embodiments, network coding may be used to improve data flow and reliability in a multiple path scenario. Transmission control protocol (TCP) may also be used within different paths to further enhance data transfer reliability. In some embodiments, multiple levels of network coding may be provided within a transmitter in a multiple path scenario, with one level being applied across all paths and another being applied within individual paths.

84 citations
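
As a rough illustration of the network-coding ingredient, the sketch below mixes k source packets into redundant coded packets using random GF(2) coefficients (practical codecs typically work over GF(2^8)); any k coded packets whose coefficient vectors are linearly independent suffice to decode, regardless of which path delivered them. All names and sizes here are illustrative assumptions, not drawn from the patent.

```python
# Sketch of random linear network coding over GF(2): coded packets are XOR
# combinations of the source packets, chosen by random binary coefficients.
import numpy as np

rng = np.random.default_rng(3)
k, plen = 4, 8                                        # packets and payload size
packets = rng.integers(0, 256, size=(k, plen), dtype=np.uint8)

def encode(n_coded):
    """Produce n_coded random XOR combinations of the k source packets."""
    coeffs = rng.integers(0, 2, size=(n_coded, k), dtype=np.uint8)
    coded = np.zeros((n_coded, plen), dtype=np.uint8)
    for i in range(n_coded):
        for j in range(k):
            if coeffs[i, j]:
                coded[i] ^= packets[j]                # XOR is addition in GF(2)
    return coeffs, coded

# Decoding is Gaussian elimination over GF(2) on [coeffs | coded]; a receiver
# can decode once it has collected k linearly independent coded packets,
# whichever paths they arrived on.
coeffs, coded = encode(6)    # 6 coded packets for k = 4 adds path redundancy
print(coeffs)
```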

Proceedings ArticleDOI
14 Jun 2015
TL;DR: It is proved that a non-trivial amount of useful information can be disclosed while not disclosing any private information if and only if the smallest principal inertia component of the joint distribution of S and X is 0.
Abstract: We investigate the problem of intentionally disclosing information about a set of measurement points X (useful information), while guaranteeing that little or no information is revealed about a private variable S (private information). Given that S and X are drawn from a finite set with joint distribution pS,X, we prove that a non-trivial amount of useful information can be disclosed while not disclosing any private information if and only if the smallest principal inertia component of the joint distribution of S and X is 0. This fundamental result characterizes when useful information can be privately disclosed for any privacy metric based on statistical dependence. We derive sharp bounds for the tradeoff between disclosure of useful and private information, and provide explicit constructions of privacy-assuring mappings that achieve these bounds.

80 citations
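
The quantity in the theorem above can be computed directly: the principal inertia components (PICs) of pS,X are the squared singular values, excluding the trivial one equal to 1, of the matrix B = diag(pS)^(-1/2) PS,X diag(pX)^(-1/2). The toy distribution below (an assumption for illustration, not from the paper) makes two symbols of X interchangeable with respect to S, so the smallest PIC is 0 and, per the theorem, some useful information can be disclosed with no private leakage.

```python
# Sketch: compute the principal inertia components (PICs) of a joint pmf
# p(S, X) and test the perfect-privacy condition from the paper.
import numpy as np

# Toy joint pmf (rows: s, columns: x). Columns 2 and 3 are identical, so the
# posterior of S is the same for x2 and x3: a disclosure distinguishing x2
# from x3 reveals information about X but nothing about S.
P = np.array([[0.28, 0.06, 0.06],
              [0.08, 0.12, 0.12],
              [0.04, 0.12, 0.12]])
p_s, p_x = P.sum(axis=1), P.sum(axis=0)

B = np.diag(p_s ** -0.5) @ P @ np.diag(p_x ** -0.5)
sigma = np.linalg.svd(B, compute_uv=False)   # sorted descending; sigma[0] == 1
pics = sigma[1:] ** 2                        # principal inertia components
print("PICs:", np.round(pics, 6))
print("non-trivial perfect-privacy disclosure possible:",
      np.isclose(pics[-1], 0.0))
```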


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are presented, along with a discussion of combining models in the context of machine learning.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

21 Jan 2018
TL;DR: In commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition, the highest error rates involve images of dark-skinned women, while the most accurate results are for light-skinned men.
Abstract: The paper "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification" by Joy Buolamwini and Timnit Gebru, which will be presented at the Conference on Fairness, Accountability, and Transparency (FAT*) in February 2018, evaluates three commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition. The study finds that these services have recognition capabilities that are not balanced across genders and skin tones [1]. In particular, the authors show that the highest error rates involve images of dark-skinned women, while the most accurate results are for light-skinned men.

2,528 citations
