Journal ArticleDOI

Discrimination-aware data mining: a survey

14 Mar 2017 - Journal of Data Science (Inderscience Publishers) - Vol. 2, Iss. 1, pp. 70-84
TL;DR: A detailed survey of discrimination discovery methods and discrimination prevention methods is presented, along with the list of datasets used for experiments in different discrimination-aware data mining (DADM) approaches.
Abstract: Data mining is a very important and useful technique to extract knowledge from raw data. However, data mining researchers face a challenge in the form of potential discrimination. Discrimination means giving unfair treatment to a person merely because he or she belongs to a minority group, without considering individual merit or qualification. The results extracted using data mining techniques may lead to discrimination if a biased historical/training dataset is used. It is therefore very important to prevent data mining techniques from becoming a source of discrimination. A detailed survey of discrimination discovery methods and discrimination prevention methods is presented in this paper. This paper also presents the list of datasets used for experiments in different discrimination-aware data mining (DADM) approaches. Some ideas for future research work that may help in preventing discrimination are also discussed.
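The abstract's notion of discrimination can be made concrete with a simple measure that recurs in the DADM literature: the gap in positive-outcome rates between groups. A minimal Python sketch on made-up data (illustrative only; the survey itself covers several formal measures):

    # Toy records of (group, decision); values are hypothetical.
    records = [
        ("minority", "deny"), ("minority", "deny"), ("minority", "grant"),
        ("majority", "grant"), ("majority", "grant"), ("majority", "deny"),
    ]

    def positive_rate(group):
        # Fraction of the group's records with the positive ('grant') decision.
        decisions = [d for g, d in records if g == group]
        return decisions.count("grant") / len(decisions)

    # Gap in positive rates; 0 would mean equal treatment on this measure.
    score = positive_rate("majority") - positive_rate("minority")
    print(f"discrimination score = {score:.2f}")  # prints 0.33

A classifier trained on such a biased table can reproduce or amplify this gap, which is what the discovery and prevention methods surveyed in the paper aim to detect and correct.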
Citations
21 Jan 2018
TL;DR: This position paper argues for applying recent research on ensuring that sociotechnical systems are fair and nondiscriminatory to the privacy protections those systems provide, in order to explain the disparate impact of privacy failure.
Abstract: In this position paper, we argue for applying recent research on ensuring sociotechnical systems are fair and nondiscriminatory to the privacy protections those systems may provide. Privacy literature seldom considers whether a proposed privacy scheme protects all persons uniformly, irrespective of membership in protected classes or particular risk in the face of privacy failure. Just as algorithmic decision-making systems may have discriminatory outcomes even without explicit or deliberate discrimination, so also privacy regimes may disproportionately fail to protect vulnerable members of their target population, resulting in disparate impact with respect to the effectiveness of privacy protections.

70 citations


Additional excerpts

  • ...Privacy and fairness have been addressed separately for many years, however, recent studies (Hajian et al., 2016; Hintoglu et al., 2005; Kashid et al., 2015, 2017; Pedreshi et al., 2008; Ruggieri et al., 2014) have expanded the application of methods to achieve both goals....


Posted Content
TL;DR: In this paper, the authors proposed an extended framework that augments fair classification algorithms formulated as optimization problems with a stability-focused regularization term; the accuracy guarantee they prove can be used to inform the selection of the regularization parameter in their framework.
Abstract: Fair classification has been a topic of intense study in machine learning, and several algorithms have been proposed towards this important task. However, in a recent study, Friedler et al. observed that fair classification algorithms may not be stable with respect to variations in the training dataset -- a crucial consideration in several real-world applications. Motivated by their work, we study the problem of designing classification algorithms that are both fair and stable. We propose an extended framework based on fair classification algorithms that are formulated as optimization problems, by introducing a stability-focused regularization term. Theoretically, we prove a stability guarantee, that was lacking in fair classification algorithms, and also provide an accuracy guarantee for our extended framework. Our accuracy guarantee can be used to inform the selection of the regularization parameter in our framework. To the best of our knowledge, this is the first work that combines stability and fairness in automated decision-making tasks. We assess the benefits of our approach empirically by extending several fair classification algorithms that are shown to achieve the best balance between fairness and accuracy over the Adult dataset. Our empirical results show that our framework indeed improves the stability at only a slight sacrifice in accuracy.
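Schematically, such optimization-based fair classifiers minimize a training loss subject to a fairness constraint, and the proposed extension adds a stability-oriented regularizer. The following is a sketch of the general shape only (the symbols \lambda, R and \tau are illustrative placeholders, not the paper's notation):

    \min_{\theta} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big) + \lambda\, R(\theta)
    \quad \text{subject to} \quad \mathrm{unfairness}(f_\theta) \le \tau

Per the abstract, the accuracy guarantee is what informs the choice of the regularization weight \lambda.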

26 citations

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the academic literature on quantum technologies (QT) using bibliometric tools, drawing on a set of 49,823 articles obtained from the Web of Science (WoS) database.
Abstract: In this study, we investigated the academic literature on quantum technologies (QT) using bibliometric tools. We used a set of 49,823 articles obtained from the Web of Science (WoS) database using a search query constructed through expert opinion.

9 citations

Journal ArticleDOI
TL;DR: The analyses of the datasets allowed us to cluster the literature into three distinct sets, construct the core corpus of the academic literature in QT, and identify the key players at the country and organization levels, thus offering insight into the current state of the field.
Abstract: In this study, we investigated the academic literature on quantum technologies (QT) using bibliometric tools. We used a set of 49,823 articles obtained from the Web of Science (WoS) database using a search query constructed through expert opinion. Analysis of this set revealed that QT is deeply rooted in physics, and the majority of the articles are published in physics journals. Keyword analysis revealed that the literature could be clustered into three distinct sets, which are (i) quantum communication/cryptography, (ii) quantum computation, and (iii) physical realizations of quantum systems. We performed a burst analysis that showed the emergence and fading away of certain key concepts in the literature. This was followed by co-citation analysis on the highly cited articles provided by the WoS, from which we devised a core corpus of 34 publications. Comparing the most highly cited articles in this set with those in the initial set, we found a clear difference in the most cited subjects. Finally, we performed co-citation analyses at the country and organization levels to find the central nodes in the literature. Overall, the analyses of the datasets allowed us to cluster the literature into three distinct sets, construct the core corpus of the academic literature in QT, and identify the key players at the country and organization levels, thus offering insight into the current state of the field. Search queries and access to figures are provided in the appendix.
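The co-citation analysis described here reduces each citing article to the set of references it cites and counts how often two references appear together; the most frequently co-cited pairs anchor the core corpus. A minimal Python sketch (reference identifiers are made-up, not the WoS export format):

    from itertools import combinations
    from collections import Counter

    # Each citing article reduced to the set of references it cites.
    reference_lists = [
        {"shor1994", "grover1996", "bennett1984"},
        {"shor1994", "bennett1984"},
        {"shor1994", "grover1996"},
    ]

    cocitations = Counter()
    for refs in reference_lists:
        # Each unordered pair cited together in one article is one co-citation.
        for pair in combinations(sorted(refs), 2):
            cocitations[pair] += 1

    print(cocitations.most_common(2))
    # [(('bennett1984', 'shor1994'), 2), (('grover1996', 'shor1994'), 2)]

The same counting applies at the country and organization levels by replacing reference identifiers with author affiliations.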

7 citations

Journal ArticleDOI
TL;DR: In this article, the authors summarized fairness protection methods in terms of three aspects (the problem settings, the models, and the challenges) and outlined the main challenges to producing fairer models.
Abstract: In recent years, it has been revealed that machine learning models can produce discriminatory predictions. Hence, fairness protection has come to play a pivotal role in machine learning. In the past, most studies on fairness protection have used traditional machine learning methods to enforce fairness. However, these studies focus on low-dimensional inputs, such as numerical inputs, whereas more recent deep learning technologies have encouraged fairness protection with image inputs through deep model methods. These approaches involve various objective functions and structural designs that break the spurious correlations between targets and sensitive features. With these connections broken, we are left with fairer predictions. To better understand the proposed methods and encourage further development in the field, this paper summarizes fairness protection methods in terms of three aspects: the problem settings, the models, and the challenges. Through this survey, we hope to reveal research trends in the field, discover the fundamentals of enforcing fairness, and summarize the main challenges to producing fairer models.
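A common structural design in this family (one illustration of "breaking the spurious correlation", not a method proposed by this survey) is adversarial debiasing with gradient reversal: an auxiliary head tries to recover the sensitive attribute from the learned representation, and the reversed gradient pushes the encoder to discard that information. A minimal PyTorch sketch with toy dimensions:

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity on the forward pass; flips the gradient sign on the backward pass.
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output

    encoder = nn.Linear(16, 8)    # stands in for a deep feature extractor
    task_head = nn.Linear(8, 2)   # predicts the target label
    adv_head = nn.Linear(8, 2)    # tries to predict the sensitive attribute

    x = torch.randn(4, 16)
    y = torch.randint(0, 2, (4,))   # target labels
    s = torch.randint(0, 2, (4,))   # sensitive attribute values

    z = encoder(x)
    loss = nn.functional.cross_entropy(task_head(z), y) \
         + nn.functional.cross_entropy(adv_head(GradReverse.apply(z)), s)
    loss.backward()  # encoder is trained to predict y while hiding s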

7 citations

References
01 Jan 1998

12,940 citations


"Discrimination-aware data mining: a..." refers background in this paper


  • ...The research in this field was started by Pedreschi et al. (2008). The problem was further extended by Ruggieri et al. (2010a) to discover discrimination in the dataset (shown in the left part of Figure 1)....


  • ...Kohavi and Becker (1996) constructed this dataset, which consists of 48,842 instances....


Journal ArticleDOI
TL;DR: This survey will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions.
Abstract: The collection of digital information by governments, corporations, and individuals has created tremendous opportunities for knowledge- and information-based decision making. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Data in its original form, however, typically contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing relies mainly on policies and guidelines as to what types of data can be published and on agreements on the use of published data. This approach alone may lead to excessive data distortion or insufficient protection. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing useful information while preserving data privacy. Recently, PPDP has received considerable attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this survey, we will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions.
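One canonical PPDP notion covered in surveys of this area is k-anonymity: every released record must be indistinguishable from at least k-1 others on its quasi-identifiers. A minimal Python sketch (attribute names are illustrative, not code from the paper):

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Every combination of quasi-identifier values must occur at least k times.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    # Toy table; 'age_range' and 'zip_prefix' stand in for generalized attributes.
    table = [
        {"age_range": "30-40", "zip_prefix": "481**", "disease": "flu"},
        {"age_range": "30-40", "zip_prefix": "481**", "disease": "cold"},
        {"age_range": "20-30", "zip_prefix": "482**", "disease": "flu"},
    ]
    print(is_k_anonymous(table, ["age_range", "zip_prefix"], k=2))  # False

Generalization and suppression are then applied until the check passes, trading data utility for privacy.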

1,669 citations


"Discrimination-aware data mining: a..." refers background in this paper

  • ...Fung et al. (2010) have presented a detailed survey in this area....


Journal ArticleDOI
TL;DR: This paper surveys and extends existing data preprocessing techniques, namely suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances, and presents the results of experiments on real-life data.
Abstract: Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination; e.g., toward sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy, but does not have this discrimination in its predictions on test data. This problem is relevant in many settings, such as when the data are generated by a biased decision process or when the sensitive attribute serves as a proxy for unobserved features. In this paper, we concentrate on the case with only one binary sensitive attribute and a two-class classification problem. We first study the theoretically optimal trade-off between accuracy and non-discrimination for pure classifiers. Then, we look at algorithmic solutions that preprocess the data to remove discrimination before a classifier is learned. We survey and extend our existing data preprocessing techniques, namely suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances. These preprocessing techniques have been implemented in a modified version of Weka, and we present the results of experiments on real-life data.
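Of the techniques listed, reweighing is the easiest to state precisely: each (sensitive value, class label) combination gets the weight expected-frequency-under-independence divided by observed frequency, so that a weight-aware learner sees the label as independent of the sensitive attribute. A minimal Python sketch of that computation (toy data; the paper's Weka implementation is the authoritative version):

    from collections import Counter

    # Toy records of (sensitive attribute value, class label).
    data = [("f", "+"), ("f", "-"), ("f", "-"), ("m", "+"), ("m", "+"), ("m", "-")]
    n = len(data)

    count_s = Counter(s for s, _ in data)   # marginal counts of the sensitive value
    count_y = Counter(y for _, y in data)   # marginal counts of the class label
    count_sy = Counter(data)                # joint counts

    # Weight = P(s) * P(y) / P(s, y): expected frequency under independence
    # divided by the observed frequency.
    weights = {
        (s, y): (count_s[s] / n) * (count_y[y] / n) / (count_sy[(s, y)] / n)
        for (s, y) in count_sy
    }
    print(weights)
    # {('f', '+'): 1.5, ('f', '-'): 0.75, ('m', '+'): 0.75, ('m', '-'): 1.5}

Under-represented combinations (here, positive labels in group 'f') are up-weighted and over-represented ones down-weighted, removing the dependence without relabeling any instance.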

905 citations


"Discrimination-aware data mining: a..." refers background or methods in this paper


  • ...Another technique called, Massaging, developed by Kamiran and Calders (2012), changes the class labels of some of the minority community data objects to remove discrimination....


  • ...E.g., in the pre-processing technique called Suppression developed by Kamiran and Calders (2012), all the discriminatory attributes are removed and then this transformed data are used to perform data mining tasks....


  • ...The research was done in parallel by Kamiran et al. (2010), Calders and Verwer (2010), Kamiran and Calders (2009, 2012), who identified different techniques for DADM (depicted in the right part of Figure 1)....


  • ...…Domingo-Ferrer (2013) for discrimination prevention are better than pre-processing techniques developed by Kamiran and Calders (2012), because of the following two reasons: • The discrimination prevention methods developed by Kamiran and Calders (2012) consider only one discriminatory attribute....


Journal ArticleDOI
TL;DR: Three approaches for making the naive Bayes classifier discrimination-free are presented: modifying the probability of the decision being positive, training one model for every sensitive attribute value and balancing them, and adding a latent variable to the Bayesian model that represents the unbiased label and optimizing the model parameters for likelihood using expectation maximization.
Abstract: In this paper, we investigate how to modify the naive Bayes classifier in order to perform classification that is restricted to be independent with respect to a given sensitive attribute. Such independency restrictions occur naturally when the decision process leading to the labels in the data-set was biased; e.g., due to gender or racial discrimination. This setting is motivated by many cases in which there exist laws that disallow a decision that is partly based on discrimination. Naive application of machine learning techniques would result in huge fines for companies. We present three approaches for making the naive Bayes classifier discrimination-free: (i) modifying the probability of the decision being positive, (ii) training one model for every sensitive attribute value and balancing them, and (iii) adding a latent variable to the Bayesian model that represents the unbiased label and optimizing the model parameters for likelihood using expectation maximization. We present experiments for the three approaches on both artificial and real-life data.
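As a toy illustration of approach (i) (a schematic of the idea only; the paper's actual algorithm iteratively updates the model's probabilities until the measured discrimination reaches zero), the positive-class prior can be corrected per sensitive group before the usual naive Bayes comparison:

    # One binary feature for brevity; all probabilities are made-up numbers.
    priors = {"+": 0.5, "-": 0.5}
    likelihood = {            # P(feature | class)
        ("+", 1): 0.7, ("+", 0): 0.3,
        ("-", 1): 0.4, ("-", 0): 0.6,
    }
    # Per-group multiplicative correction to the positive prior, tuned until
    # both groups receive positive decisions at the same overall rate.
    correction = {"f": 1.3, "m": 0.5}

    def classify(feature, group):
        score_pos = priors["+"] * correction[group] * likelihood[("+", feature)]
        score_neg = priors["-"] * likelihood[("-", feature)]
        return "+" if score_pos > score_neg else "-"

    print(classify(1, "f"), classify(1, "m"))  # + -  (the correction flips m's decision)

The group-dependent correction is what restores independence between the decision and the sensitive attribute even though the likelihoods fitted on biased data remain unchanged.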

750 citations


"Discrimination-aware data mining: a..." refers methods in this paper

  • ...(2010), Calders and Verwer (2010), Kamiran and Calders (2009, 2012), who identified different techniques for DADM (depicted in the right part of Figure 1). Custers et al. (2013) present detailed information about DADM such as discrimination discovery, discrimination prevention, privacy protection and conditional discrimination....


  • ...The research was done in parallel by Kamiran et al. (2010), Calders and Verwer (2010), Kamiran and Calders (2009, 2012), who identified different techniques for DADM (depicted in the right part of Figure 1)....


  • ...Calders and Verwer (2010) have developed the Naïve Bayes methods to detect and remove discrimination....


Proceedings ArticleDOI
24 Aug 2008
TL;DR: This approach leads to a precise formulation of the redlining problem along with a formal result relating discriminatory rules with apparently safe ones by means of background knowledge, and an empirical assessment of the results on the German credit dataset.
Abstract: In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership to a category or a minority, without regard to individual merit. Rules extracted from databases by data mining techniques, such as classification or association rules, when used for decision tasks such as benefit or credit approval, can be discriminatory in the above sense. In this paper, the notion of discriminatory classification rules is introduced and studied. Providing a guarantee of non-discrimination is shown to be a non trivial task. A naive approach, like taking away all discriminatory attributes, is shown to be not enough when other background knowledge is available. Our approach leads to a precise formulation of the redlining problem along with a formal result relating discriminatory rules with apparently safe ones by means of background knowledge. An empirical assessment of the results on the German credit dataset is also provided.
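The central measure in this line of work is the extended lift (elift) of a classification rule, which captures how much adding the potentially discriminatory item A to a premise B boosts the rule's confidence:

    \mathrm{elift}(A, B \rightarrow C) = \frac{\mathrm{conf}(A, B \rightarrow C)}{\mathrm{conf}(B \rightarrow C)}

A rule is considered protective when this ratio stays below a threshold fixed by the analyst. As a worked illustration with made-up numbers: if the rule "city = X implies credit = deny" has confidence 0.2 and adding a minority-group item to the premise raises the confidence to 0.6, the elift is 3, and the rule is flagged as potentially discriminatory at any threshold of 3 or below.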

631 citations


"Discrimination-aware data mining: a..." refers background or methods in this paper

  • ...Pedreschi et al. (2009a) presented a reference model, named LP2DD, for finding evidence of discrimination in automatic decision support system (DSS). Ruggieri et al. (2010b) developed a tool, called DCUBE, to discover discrimination....


  • ...The research in this field was started by Pedreschi et al. (2008)....
