
Showing papers on "Interpretability published in 2016"


Journal ArticleDOI
TL;DR: The basic ideas of PCA are introduced, discussing what it can and cannot do, and variants of the technique tailored to different data types and structures are described.
Abstract: Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Finding such new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem, and the new variables are defined by the dataset at hand, not a priori , hence making PCA an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to various different data types and structures. This article will begin by introducing the basic ideas of PCA, discussing what it can and cannot do. It will then describe some variants of PCA and their application.
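As an illustration of the eigenvalue/eigenvector formulation described above, the following minimal NumPy sketch computes principal components from the covariance matrix of a centered data matrix; variable names and the random example data are illustrative, not from the article.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA: principal components as eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center each variable
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]          # sort directions by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = Xc @ components                   # the new, uncorrelated variables
    explained = eigvals[order[:n_components]] / eigvals.sum()
    return scores, components, explained

# Example on random data: the first components capture the directions of maximal variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
scores, components, explained = pca(X, n_components=2)
```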

4,289 citations


Posted Content
TL;DR: This paper seeks to refine the discourse on interpretability, examining the diverse and sometimes discordant motivations for it, identifying transparency and post-hoc explanation as competing notions of what makes a model interpretable, and questioning the oft-made assertions that linear models are interpretable and that deep neural networks are not.
Abstract: Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.

1,423 citations


Proceedings ArticleDOI
13 Aug 2016
TL;DR: This work proposes interpretable decision sets, a framework for building predictive models that are highly accurate, yet also highly interpretable, and provides a new approach to interpretable machine learning that balances accuracy, interpretability, and computational efficiency.
Abstract: One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model's prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems. Here we propose interpretable decision sets, a framework for building predictive models that are highly accurate, yet also highly interpretable. Decision sets are sets of independent if-then rules. Because each rule can be applied independently, decision sets are simple, concise, and easily interpretable. We formalize decision set learning through an objective function that simultaneously optimizes accuracy and interpretability of the rules. In particular, our approach learns short, accurate, and non-overlapping rules that cover the whole feature space and pay attention to small but important classes. Moreover, we prove that our objective is a non-monotone submodular function, which we efficiently optimize to find a near-optimal set of rules. Experiments show that interpretable decision sets are as accurate at classification as state-of-the-art machine learning techniques. They are also three times smaller on average than rule-based models learned by other methods. Finally, results of a user study show that people are able to answer multiple-choice questions about the decision boundaries of interpretable decision sets and write descriptions of classes based on them faster and more accurately than with other rule-based models that were designed for interpretability. Overall, our framework provides a new approach to interpretable machine learning that balances accuracy, interpretability, and computational efficiency.
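To make the notion of independent if-then rules concrete, here is a small, hypothetical sketch of how a learned decision set might be represented and applied at prediction time; the rules, features, and default class are invented for illustration, and the learning itself is the submodular optimization described in the abstract, not shown here.

```python
# A decision set is an unordered collection of independent if-then rules.
# Each rule can be read and checked on its own, which is what makes the model interpretable.
rules = [
    ({"age": lambda v: v > 50, "bmi": lambda v: v >= 30},          "high_risk"),
    ({"smoker": lambda v: v is True},                               "high_risk"),
    ({"age": lambda v: v <= 30, "smoker": lambda v: v is False},    "low_risk"),
]
default_class = "low_risk"  # used when no rule fires

def predict(x, rules, default_class):
    """Apply each rule independently; ties are broken by rule order here for simplicity."""
    for conditions, label in rules:
        if all(check(x[feature]) for feature, check in conditions.items()):
            return label
    return default_class

patient = {"age": 58, "bmi": 32.5, "smoker": False}
print(predict(patient, rules, default_class))  # -> "high_risk"
```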

648 citations


Posted Content
TL;DR: The REverse Time AttentIoN model (RETAIN) is developed for application to Electronic Health Records (EHR) data and achieves high accuracy while remaining clinically interpretable and is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits.
Abstract: Accuracy and interpretability are two dominant features of successful predictive models. Typically, a choice must be made in favor of complex black box models such as recurrent neural networks (RNN) for accuracy versus less accurate but more interpretable traditional models such as logistic regression. This tradeoff poses challenges in medicine where both accuracy and interpretability are important. We addressed this challenge by developing the REverse Time AttentIoN model (RETAIN) for application to Electronic Health Records (EHR) data. RETAIN achieves high accuracy while remaining clinically interpretable and is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits (e.g. key diagnoses). RETAIN mimics physician practice by attending the EHR data in a reverse time order so that recent clinical visits are likely to receive higher attention. RETAIN was tested on a large health system EHR dataset with 14 million visits completed by 263K patients over an 8 year period and demonstrated predictive accuracy and computational scalability comparable to state-of-the-art methods such as RNN, and ease of interpretability comparable to traditional models.
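The following NumPy sketch illustrates the general shape of a two-level attention mechanism of the kind RETAIN describes: a scalar visit-level attention (alpha) and a per-variable attention vector (beta) weight the visit embeddings before they are summed into a patient representation. It is a schematic only; in the paper the attention weights come from RNNs run over the visits in reverse time order, whereas here simple linear maps stand in for them, and all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 5, 8                       # number of visits, embedding dimension
V = rng.normal(size=(T, m))       # visit embeddings v_1..v_T (most recent last)

# Stand-ins for the two attention-generating networks in the paper.
w_alpha = rng.normal(size=m)      # scores visits      -> scalar attention alpha_t
W_beta = rng.normal(size=(m, m))  # scores coordinates -> vector attention beta_t

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

alpha = softmax(V @ w_alpha)      # visit-level attention, one scalar per visit
beta = np.tanh(V @ W_beta.T)      # variable-level attention, one vector per visit

# Context vector: attention-weighted sum over visits, alpha_t * (beta_t * v_t).
c = (alpha[:, None] * beta * V).sum(axis=0)

# Inspecting alpha[t] and beta[t] shows which visits and which clinical variables
# drove the prediction, which is the source of the model's interpretability.
print(alpha.round(3), c.shape)
```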

643 citations


Proceedings Article
01 Jan 2016
TL;DR: Motivated by the Bayesian model criticism framework, MMD-critic is developed, which efficiently learns prototypes and criticism, designed to aid human interpretability.
Abstract: Example-based explanations are widely used in the effort to improve the interpretability of highly complex distributions. However, prototypes alone are rarely sufficient to represent the gist of the complexity. In order for users to construct better mental models and understand complex data distributions, we also need criticism to explain what is not captured by prototypes. Motivated by the Bayesian model criticism framework, we develop MMD-critic, which efficiently learns prototypes and criticism, designed to aid human interpretability. A human subject pilot study shows that MMD-critic selects prototypes and criticism that are useful to facilitate human understanding and reasoning. We also evaluate the prototypes selected by MMD-critic via a nearest prototype classifier, showing competitive performance compared to baselines.
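A rough sketch of the prototype-selection idea, assuming an RBF kernel and a plain greedy search that adds, at each step, the candidate that most reduces the squared MMD between the prototype set and the full dataset. This mirrors the spirit of the approach rather than the authors' exact objective or optimizations; criticisms (points the prototypes explain poorly) would be chosen separately via the MMD witness function.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=0.5):
    """Squared maximum mean discrepancy between dataset X and prototype set Z."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Z, gamma).mean()
            + rbf_kernel(Z, Z, gamma).mean())

def greedy_prototypes(X, k, gamma=0.5):
    """Greedily pick k data points whose kernel mean best matches that of the full data."""
    selected = []
    candidates = list(range(len(X)))
    for _ in range(k):
        best = min(candidates, key=lambda i: mmd2(X, X[selected + [i]], gamma))
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(greedy_prototypes(X, k=3))
```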

570 citations


Proceedings Article
01 Jan 2016
TL;DR: In this paper, a two-level neural attention model is proposed to detect influential past visits and significant clinical variables within those visits (e.g. key diagnoses) in reverse time order so that recent clinical visits are likely to receive higher attention.
Abstract: Accuracy and interpretability are two dominant features of successful predictive models. Typically, a choice must be made in favor of complex black box models such as recurrent neural networks (RNN) for accuracy versus less accurate but more interpretable traditional models such as logistic regression. This tradeoff poses challenges in medicine where both accuracy and interpretability are important. We addressed this challenge by developing the REverse Time AttentIoN model (RETAIN) for application to Electronic Health Records (EHR) data. RETAIN achieves high accuracy while remaining clinically interpretable and is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits (e.g. key diagnoses). RETAIN mimics physician practice by attending the EHR data in a reverse time order so that recent clinical visits are likely to receive higher attention. RETAIN was tested on a large health system EHR dataset with 14 million visits completed by 263K patients over an 8 year period and demonstrated predictive accuracy and computational scalability comparable to state-of-the-art methods such as RNN, and ease of interpretability comparable to traditional models.

566 citations


Posted Content
TL;DR: DeepLIFT (Learning Important FeaTures) is an efficient and effective method for computing importance scores in a neural network; it compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference.
Abstract: Note: This paper describes an older version of DeepLIFT. See https://arxiv.org/abs/1704.02685 for the newer version. Original abstract follows: The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Learning Important FeaTures), an efficient and effective method for computing importance scores in a neural network. DeepLIFT compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference. We apply DeepLIFT to models trained on natural images and genomic data, and show significant advantages over gradient-based methods.
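A minimal sketch of the difference-from-reference idea for a single linear unit: the contribution of each input is its weight times its deviation from the reference input, and the contributions sum exactly to the change in the unit's output. This shows only the simplest (linear) case; the rules DeepLIFT uses for nonlinear units are described in the paper and its newer version, and the reference input here is an assumption chosen for illustration.

```python
import numpy as np

w = np.array([0.8, -1.2, 0.3])       # weights of a single linear unit
b = 0.5                              # bias

x = np.array([1.0, 2.0, -1.0])       # actual input
x_ref = np.array([0.0, 0.0, 0.0])    # 'reference' input (e.g., all zeros or a background value)

y, y_ref = w @ x + b, w @ x_ref + b

# Contribution of each input = weight * (activation - reference activation).
contributions = w * (x - x_ref)

# For a linear unit the contributions sum exactly to the output difference.
assert np.isclose(contributions.sum(), y - y_ref)
print(contributions)                 # [ 0.8 -2.4 -0.3]
```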

532 citations


Posted Content
TL;DR: This paper argues for explaining machine learning predictions using model-agnostic approaches that treat the machine learning models as black-box functions; such approaches provide crucial flexibility in the choice of models, explanations, and representations, improving debugging, comparison, and interfaces for a variety of users and models.
Abstract: Understanding why machine learning models behave the way they do empowers both system designers and end-users in many ways: in model selection, feature engineering, in order to trust and act upon the predictions, and in more intuitive user interfaces. Thus, interpretability has become a vital concern in machine learning, and work in the area of interpretable models has found renewed interest. In some applications, such models are as accurate as non-interpretable ones, and thus are preferred for their transparency. Even when they are not accurate, they may still be preferred when interpretability is of paramount importance. However, restricting machine learning to interpretable models is often a severe limitation. In this paper we argue for explaining machine learning predictions using model-agnostic approaches. By treating the machine learning models as black-box functions, these approaches provide crucial flexibility in the choice of models, explanations, and representations, improving debugging, comparison, and interfaces for a variety of users and models. We also outline the main challenges for such methods, and review a recently-introduced model-agnostic explanation approach (LIME) that addresses these challenges.
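As a compact illustration of the model-agnostic idea (and of LIME, which the paper reviews): perturb the instance, query the black-box model, and fit a small linear surrogate weighted by proximity to the original instance, whose coefficients serve as the explanation. This is a simplified sketch under stated assumptions, not the LIME implementation; function and parameter names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(black_box_predict, x, n_samples=500, scale=0.5, kernel_width=1.0):
    """Fit a locally weighted linear surrogate around x; its coefficients are the explanation."""
    rng = np.random.default_rng(0)
    X_perturbed = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    y = black_box_predict(X_perturbed)                     # query the black box
    dists = np.linalg.norm(X_perturbed - x, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)    # proximity kernel
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X_perturbed, y, sample_weight=weights)
    return surrogate.coef_                                 # per-feature local importance

# Example with a nonlinear "black box": near x the surrogate highlights the dominant feature.
black_box = lambda X: np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2
print(explain_instance(black_box, np.array([0.0, 1.0])))
```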

531 citations


Posted Content
TL;DR: This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.
Abstract: While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. We present several approaches to analyzing the effects of such erasure, from computing the relative difference in evaluation metrics, to using reinforcement learning to erase the minimum set of input words in order to flip a neural model's decision. In a comprehensive analysis of multiple NLP tasks, including linguistic feature classification, sentence-level sentiment analysis, and document level sentiment aspect prediction, we show that the proposed methodology not only offers clear explanations about neural model decisions, but also provides a way to conduct error analysis on neural models.
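The core measurement is simple to state: erase one part of the input (here, a word), re-run the model, and record the relative change in the evaluation score. Below is a hedged sketch of that loop, with a toy lexicon-based scorer standing in for the neural model; the reinforcement-learning variant for finding a minimal decision-flipping erasure is not shown.

```python
import numpy as np

def erasure_importance(score_fn, words):
    """Importance of each word = relative drop in the model's score when that word is erased."""
    base = score_fn(words)
    importances = []
    for i in range(len(words)):
        erased = words[:i] + words[i + 1:]          # remove the i-th word
        importances.append((base - score_fn(erased)) / base)
    return importances

# Toy stand-in for a neural sentiment model's confidence in the correct label.
lexicon = {"great": 0.9, "film": 0.1, "boring": -0.8}
score_fn = lambda ws: 1 / (1 + np.exp(-sum(lexicon.get(w, 0.0) for w in ws)))

words = ["a", "great", "film"]
print(dict(zip(words, np.round(erasure_importance(score_fn, words), 3))))
```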

509 citations


Journal ArticleDOI
TL;DR: Multiple regression and machine learning approaches are used to simulate monthly streamflow in five highly seasonal rivers in the highlands of Ethiopia and compare their performance in terms of predictive accuracy, error structure and bias, model interpretability, and uncertainty when faced with extreme climate conditions.
Abstract: In the past decade, machine learning methods for empirical rainfall–runoff modeling have seen extensive development and been proposed as a useful complement to physical hydrologic models, particularly in basins where data to support process-based models are limited. However, the majority of research has focused on a small number of methods, such as artificial neural networks, despite the development of multiple other approaches for non-parametric regression in recent years. Furthermore, this work has often evaluated model performance based on predictive accuracy alone, while not considering broader objectives, such as model interpretability and uncertainty, that are important if such methods are to be used for planning and management decisions. In this paper, we use multiple regression and machine learning approaches (including generalized additive models, multivariate adaptive regression splines, artificial neural networks, random forests, and M5 cubist models) to simulate monthly streamflow in five highly seasonal rivers in the highlands of Ethiopia and compare their performance in terms of predictive accuracy, error structure and bias, model interpretability, and uncertainty when faced with extreme climate conditions. While the relative predictive performance of models differed across basins, data-driven approaches were able to achieve reduced errors when compared to physical models developed for the region. Methods such as random forests and generalized additive models may have advantages in terms of visualization and interpretation of model structure, which can be useful in providing insights into physical watershed function. However, the uncertainty associated with model predictions under extreme climate conditions should be carefully evaluated, since certain models (especially generalized additive models and multivariate adaptive regression splines) become highly variable when faced with high temperatures.

178 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this paper, the authors propose a Part-Stacked CNN architecture that explicitly explains the fine-grained recognition process by modeling subtle differences from object parts and adopts a set of sharing strategies between the computation of multiple object parts.
Abstract: In the context of fine-grained visual categorization, the ability to interpret models as human-understandable visual manuals is sometimes as important as achieving high classification accuracy. In this paper, we propose a novel Part-Stacked CNN architecture that explicitly explains the finegrained recognition process by modeling subtle differences from object parts. Based on manually-labeled strong part annotations, the proposed architecture consists of a fully convolutional network to locate multiple object parts and a two-stream classification network that encodes object-level and part-level cues simultaneously. By adopting a set of sharing strategies between the computation of multiple object parts, the proposed architecture is very efficient running at 20 frames/sec during inference. Experimental results on the CUB-200-2011 dataset reveal the effectiveness of the proposed architecture, from multiple perspectives of classification accuracy, model interpretability, and efficiency. Being able to provide interpretable recognition results in realtime, the proposed method is believed to be effective in practical applications.

Journal ArticleDOI
TL;DR: This work proposes a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types, and establishes the consistency and oracle properties of the proposed method under mild conditions.
Abstract: SUMMARY In modern experiments, functional and nonfunctional data are often encountered simultaneously when observations are sampled from random processes and high-dimensional scalar covariates. It is difficult to apply existing methods for model selection and estimation. We propose a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types. The new approach provides a unified and flexible framework that simultaneously takes into account multiple functional and ultrahigh-dimensional scalar predictors, enables us to identify important features, and offers improved interpretability of the estimators. The underlying processes of the functional predictors are considered to be infinite-dimensional, and one of our contributions is to characterize the effects of regularization on the resulting estimators. We establish the consistency and oracle properties of the proposed method under mild conditions, demonstrate its performance with simulation studies, and illustrate its application using air pollution data.
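In schematic form, a partially functional linear model of the kind described relates a scalar response to functional predictors through coefficient functions and to scalar predictors through ordinary coefficients; the notation below is generic rather than the authors' exact formulation.

```latex
Y_i = \alpha + \sum_{k=1}^{q} \int_{\mathcal{T}} X_{ik}(t)\,\beta_k(t)\,dt
      + \mathbf{Z}_i^{\top}\boldsymbol{\gamma} + \varepsilon_i
```

Here the X_{ik} are the functional predictors with coefficient functions beta_k, Z_i collects the ultrahigh-dimensional scalar covariates, and a sparsity-inducing penalty on gamma is what identifies the important scalar features while keeping the fitted model interpretable.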

01 Jan 2016
TL;DR: This review examines the measurement of academic motivation in college students and distinguishes pencil-and-paper, group-administered instruments according to their conceptions of academic motivation: academic motivation taken as a single general motivation, as single specific motivations, or as a complex of motivations.
Abstract: This review examines the measurement of academic motivation in college students. It distinguishes pencil-and-paper group-administered instruments according to their conceptions of academic motivation: academic motivation taken as a single general motivation, as single specific motivations, or as a complex of motivations. It evaluates these classes of instruments in terms of the interpretability and the utility of the information each type of instrument is likely to provide.

Proceedings ArticleDOI
16 May 2016
TL;DR: A unified probabilistic generative model, User-Community-Geo-Topic (UCGT), is proposed to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity.
Abstract: Social community detection is a growing field of interest in the area of social network applications, and many approaches have been developed, including graph partitioning, latent space model, block model and spectral clustering. Most existing work purely focuses on network structure information which is, however, often sparse, noisy and lack of interpretability. To improve the accuracy and interpretability of community discovery, we propose to infer users' social communities by incorporating their spatiotemporal data and semantic information. Technically, we propose a unified probabilistic generative model, User-Community-Geo-Topic (UCGT), to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity. With a well-designed multi-component model structure and a parallel inference implementation to leverage the power of multicores and clusters, our UCGT model is expressive while remaining efficient and scalable to growing large-scale geo-social networking data. We deploy UCGT to two application scenarios of user behavior predictions: check-in prediction and social interaction prediction. Extensive experiments on two large-scale geo-social networking datasets show that UCGT achieves better performance than existing state-of-the-art comparison methods.

Journal ArticleDOI
TL;DR: A decision tree based data-driven diagnostic strategy for air handling units (AHUs) is presented, in which the classification and regression tree (CART) algorithm is used for decision tree induction, and it is shown that this strategy can achieve a good diagnostic performance with an average F-measure of 0.97.
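As a generic illustration of such a CART-based diagnostic strategy, a decision tree classifier can be trained on labelled sensor data and then read off directly as diagnostic rules. The feature names, readings, and fault labels below are hypothetical stand-ins, not data from the paper.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical AHU sensor readings: [supply_air_temp, return_air_temp, fan_speed, valve_position]
X = [[13.0, 24.0, 0.8, 0.20],
     [19.5, 23.5, 0.4, 0.90],
     [12.5, 24.5, 0.9, 0.10],
     [20.0, 24.0, 0.3, 0.95]]
y = ["normal", "cooling_coil_fault", "normal", "cooling_coil_fault"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree prints as human-readable if-then rules, which is the
# interpretability argument for CART-based fault diagnosis.
print(export_text(tree, feature_names=[
    "supply_air_temp", "return_air_temp", "fan_speed", "valve_position"]))
```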

Proceedings Article
01 Jan 2016
TL;DR: A clustering-based regularization that encourages parsimonious representations is proposed; its k-means-style objective is easy to optimize and flexible, supporting various forms of clustering, including sample and spatial clustering as well as co-clustering.
Abstract: In this paper we aim at facilitating generalization for deep networks while supporting interpretability of the learned representations. Towards this goal, we propose a clustering based regularization that encourages parsimonious representations. Our k-means style objective is easy to optimize and flexible supporting various forms of clustering, including sample and spatial clustering as well as co-clustering. We demonstrate the effectiveness of our approach on the tasks of unsupervised learning, classification, fine grained categorization and zero-shot learning.
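A schematic of the kind of k-means-style regularizer described: the task loss is augmented with a penalty pulling each hidden representation toward its nearest cluster center. This is a simplified sketch of the idea rather than the authors' exact objective, and the symbols are generic.

```latex
\mathcal{L}(\theta, \{\mu_k\}) = \mathcal{L}_{\text{task}}(\theta)
  + \lambda \sum_{i=1}^{N} \min_{k} \lVert h_i(\theta) - \mu_k \rVert_2^2
```

Here h_i is the hidden representation of sample i (or of a spatial location, for spatial clustering), and the cluster centers mu_k are updated alternately with the network parameters theta.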

Journal ArticleDOI
TL;DR: This work presents an MIQO-based approach for designing high quality linear regression models that explicitly addresses various competing objectives and demonstrates the effectiveness of the approach on both real and synthetic data sets.
Abstract: Linear regression models are traditionally built through trial and error to balance many competing goals such as predictive power, interpretability, significance, robustness to error in data, and sparsity, among others. This problem lends itself naturally to a mixed integer quadratic optimization (MIQO) approach but has not been modeled this way because of the belief in the statistics community that MIQO is intractable for large scale problems. However, in the last 25 years (1991–2015), algorithmic advances in integer optimization combined with hardware improvements have resulted in an astonishing 450 billion factor speedup in solving mixed integer optimization problems. We present an MIQO-based approach for designing high quality linear regression models that explicitly addresses various competing objectives and demonstrate the effectiveness of our approach on both real and synthetic data sets.
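One standard way to write sparse linear regression as a mixed integer quadratic program, shown for illustration in the spirit of the approach described; this is a textbook big-M formulation, and the paper's full model adds further constraints addressing robustness, significance, and the other competing objectives mentioned above.

```latex
\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \ \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
\quad \text{s.t.} \quad |\beta_j| \le M z_j \ \ (j = 1, \dots, p), \qquad \sum_{j=1}^{p} z_j \le k
```

The binary variable z_j indicates whether feature j enters the model, M is a sufficiently large bound on the coefficients, and k caps the number of selected features, which is what keeps the resulting regression sparse and interpretable.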

Journal ArticleDOI
TL;DR: MediBoost is presented, a novel framework for constructing decision trees that retain interpretability while having accuracy similar to ensemble methods, and its performance is compared to that of conventional decision trees and ensemble methods on 13 medical classification problems.
Abstract: Machine learning algorithms that are both interpretable and accurate are essential in applications such as medicine where errors can have a dire consequence. Unfortunately, there is currently a tradeoff between accuracy and interpretability among state-of-the-art methods. Decision trees are interpretable and are therefore used extensively throughout medicine for stratifying patients. Current decision tree algorithms, however, are consistently outperformed in accuracy by other, less-interpretable machine learning models, such as ensemble methods. We present MediBoost, a novel framework for constructing decision trees that retain interpretability while having accuracy similar to ensemble methods, and compare MediBoost's performance to that of conventional decision trees and ensemble methods on 13 medical classification problems. MediBoost significantly outperformed current decision tree algorithms in 11 out of 13 problems, giving accuracy comparable to ensemble methods. The resulting trees are of the same type as decision trees used throughout clinical practice but have the advantage of improved accuracy. Our algorithm thus gives the best of both worlds: it grows a single, highly interpretable tree that has the high accuracy of ensemble methods.

Posted Content
TL;DR: This work shows how a model-agnostic additive representation of the importance of input features unifies current methods, and that this representation is optimal in the sense that it is the only set of additive values satisfying important properties.
Abstract: Understanding why a model made a certain prediction is crucial in many data science fields. Interpretable predictions engender appropriate trust and provide insight into how the model may be improved. However, with large modern datasets the best accuracy is often achieved by complex models even experts struggle to interpret, which creates a tension between accuracy and interpretability. Recently, several methods have been proposed for interpreting predictions from complex models by estimating the importance of input features. Here, we present how a model-agnostic additive representation of the importance of input features unifies current methods. This representation is optimal, in the sense that it is the only set of additive values that satisfies important properties. We show how we can leverage these properties to create novel visual explanations of model predictions. The thread of unity that this representation weaves through the literature indicates that there are common principles to be learned about the interpretation of model predictions that apply in many scenarios.
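The additive representation referred to here has a simple form: an explanation model g assigns each simplified input feature an importance value, and these values sum, together with a base value, to the model's prediction.

```latex
g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i \, z'_i, \qquad z' \in \{0,1\}^M
```

Here z'_i indicates the presence of feature i, phi_i is its importance, and phi_0 is the prediction for the all-absent baseline input; the paper's optimality claim is, informally, that only one assignment of the phi_i satisfies the stated properties.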

Journal ArticleDOI
TL;DR: The Clock Drawing Test (CDT) has been used for decades as a screening tool to differentiate normal individuals from those with cognitive impairment, and has proven useful in helping to diagnose cognitive dysfunction associated with neurological disorders such as Alzheimer's disease, Parkinson's disease, and other dementias and conditions; here the test is administered with a digitizing pen and scored with accurate, clinician-interpretable machine learning models.
Abstract: The Clock Drawing Test--a simple pencil and paper test--has been used for more than 50 years as a screening tool to differentiate normal individuals from those with cognitive impairment, and has proven useful in helping to diagnose cognitive dysfunction associated with neurological disorders such as Alzheimer's disease, Parkinson's disease, and other dementias and conditions. We have been administering the test using a digitizing ballpoint pen that reports its position with considerable spatial and temporal precision, making available far more detailed data about the subject's performance. Using pen stroke data from these drawings categorized by our software, we designed and computed a large collection of features, then explored the tradeoffs in performance and interpretability in classifiers built using a number of different subsets of these features and a variety of different machine learning techniques. We used traditional machine learning methods to build prediction models that achieve high accuracy. We operationalized widely used manual scoring systems so that we could use them as benchmarks for our models. We worked with clinicians to define guidelines for model interpretability, and constructed sparse linear models and rule lists designed to be as easy to use as scoring systems currently used by clinicians, but more accurate. While our models will require additional testing for validation, they offer the possibility of substantial improvement in detecting cognitive impairment earlier than currently possible, a development with considerable potential impact in practice.

Posted Content
TL;DR: In this article, the authors revisit the task of learning a Euclidean metric from data and formulate it as a surprisingly simple optimization problem that even admits a closed-form solution.
Abstract: We revisit the task of learning a Euclidean metric from data. We approach this problem from first principles and formulate it as a surprisingly simple optimization problem. Indeed, our formulation even admits a closed form solution. This solution possesses several very attractive properties: (i) an innate geometric appeal through the Riemannian geometry of positive definite matrices; (ii) ease of interpretability; and (iii) computational speed several orders of magnitude faster than the widely used LMNN and ITML methods. Furthermore, on standard benchmark datasets, our closed-form solution consistently attains higher classification accuracy.
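A sketch of the type of objective involved, written over a set S of similar pairs and a set D of dissimilar pairs; this is a schematic of the formulation as commonly presented, so consult the paper for the precise statement and the closed-form solution on the manifold of positive definite matrices.

```latex
\min_{A \succ 0} \ \sum_{(x_i, x_j) \in S} (x_i - x_j)^{\top} A \,(x_i - x_j)
  \; + \sum_{(x_i, x_j) \in D} (x_i - x_j)^{\top} A^{-1} (x_i - x_j)
```

Pulling similar pairs together under A while pushing dissimilar pairs apart through A^{-1} is what gives the problem its Riemannian-geometric character and allows the minimizer to be expressed in closed form from the similarity and dissimilarity scatter matrices.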

Journal ArticleDOI
01 Mar 2016
TL;DR: This paper proposes an approach for automatically designing fuzzy rule-based classifiers (FRBCs) from financial data using multi-objective evolutionary optimization algorithms (MOEOAs); the approach significantly outperforms the alternative methods in terms of the interpretability of the obtained financial data classifiers while remaining either competitive or superior in accuracy and speed of decision making.
Abstract: Highlights: Novel fuzzy rule-based classifiers (FRBCs) for financial data classification tasks are proposed. FRBCs are designed using multi-objective evolutionary optimization algorithms. A collection of FRBCs with various levels of accuracy-interpretability trade-off is generated. Benchmark financial data sets are used to evaluate the effectiveness of the proposed system. Highly accurate, interpretable and fast financial decision support is provided by our approach. Credit classification is an important component of critical financial decision making tasks such as credit scoring and bankruptcy prediction. Credit classification methods are usually evaluated in terms of their accuracy, interpretability, and computational efficiency. In this paper, we propose an approach for automatic designing of fuzzy rule-based classifiers (FRBCs) from financial data using multi-objective evolutionary optimization algorithms (MOEOAs). Our method generates, in a single experiment, an optimized collection of solutions (financial FRBCs) characterized by various levels of accuracy-interpretability trade-off. In our approach we address the complexity- and semantics-related interpretability issues, we introduce original genetic operators for the classifier's rule base processing, and we implement our ideas in the context of Non-dominated Sorting Genetic Algorithm II (NSGA-II), i.e., one of the presently most advanced MOEOAs. A significant part of the paper is devoted to an extensive comparative analysis of our approach and 24 alternative methods applied to three standard financial benchmark data sets, i.e., Statlog (Australian Credit Approval), Statlog (German Credit Approval), and Credit Approval (also referred to as Japanese Credit) sets available from the UCI repository of machine learning databases (http://archive.ics.uci.edu/ml). Several performance measures including accuracy, sensitivity, specificity, and a number of interpretability measures are employed in order to evaluate the obtained systems. Our approach significantly outperforms the alternative methods in terms of the interpretability of the obtained financial data classifiers while remaining either competitive or superior in terms of their accuracy and the speed of decision making.

Proceedings Article
01 Jan 2016
TL;DR: This short survey aims to clarify the concepts related to interpretability and emphasises the distinction between interpreting models and representations, as well as heuristic-based and user-based approaches.
Abstract: Interpretability is often a major concern in machine learning. Although many authors agree with this statement, interpretability is often tackled with intuitive arguments, distinct (yet related) terms and heuristic quantifications. This short survey aims to clarify the concepts related to interpretability and emphasises the distinction between interpreting models and representations, as well as heuristic-based and user-based approaches.

Journal ArticleDOI
TL;DR: In this paper, the structural risk minimization framework of lattice regression is extended to monotonic functions by adding linear inequality constraints and jointly learning interpretable calibrations of each feature to normalize continuous features and handle categorical or missing data.
Abstract: Real-world machine learning applications may have requirements beyond accuracy, such as fast evaluation times and interpretability. In particular, guaranteed monotonicity of the learned function with respect to some of the inputs can be critical for user confidence. We propose meeting these goals for low-dimensional machine learning problems by learning flexible, monotonic functions using calibrated interpolated look-up tables. We extend the structural risk minimization framework of lattice regression to monotonic functions by adding linear inequality constraints. In addition, we propose jointly learning interpretable calibrations of each feature to normalize continuous features and handle categorical or missing data, at the cost of making the objective non-convex. We address large-scale learning through parallelization, mini-batching, and random sampling of additive regularizer terms. Case studies on real-world problems with up to sixteen features and up to hundreds of millions of training samples demonstrate the proposed monotonic functions can achieve state-of-the-art accuracy in practice while providing greater transparency to users.
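In schematic terms, monotonicity of an interpolated look-up table with respect to selected inputs can be enforced through linear inequality constraints on adjacent lattice parameters, so the learning problem keeps its usual empirical-risk form; the notation below is generic rather than the authors' exact formulation.

```latex
\min_{\theta} \ \sum_{i=1}^{n} \ell\bigl(y_i, f(x_i; \theta)\bigr) + R(\theta)
\quad \text{s.t.} \quad \theta_{v + e_d} \ge \theta_{v}
\ \ \text{for every lattice vertex } v \text{ and every monotone input dimension } d
```

Here theta_v is the look-up table value at vertex v, e_d steps one knot along dimension d, f interpolates between vertices, and R is the lattice-regression regularizer; the constraints guarantee the learned function never decreases along the monotone inputs.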

Proceedings Article
19 Jun 2016
TL;DR: This work revisits the task of learning a Euclidean metric from data from first principles and forms a surprisingly simple optimization problem that possesses several very attractive properties, including an innate geometric appeal through the Riemannian geometry of positive definite matrices and ease of interpretability.
Abstract: We revisit the task of learning a Euclidean metric from data. We approach this problem from first principles and formulate it as a surprisingly simple optimization problem. Indeed, our formulation even admits a closed form solution. This solution possesses several very attractive properties: (i) an innate geometric appeal through the Riemannian geometry of positive definite matrices; (ii) ease of interpretability; and (iii) computational speed several orders of magnitude faster than the widely used LMNN and ITML methods. Furthermore, on standard benchmark datasets, our closed-form solution consistently attains higher classification accuracy.

Book ChapterDOI
19 Sep 2016
TL;DR: This paper finds that the simple Pearson's correlation coefficient has all the necessary properties, yet has somehow been overlooked in favour of more complex alternatives, and illustrates how using this measure in practice can provide better interpretability and more confidence in the model selection process.
Abstract: In feature selection algorithms, “stability” is the sensitivity of the chosen feature set to variations in the supplied training data. As such it can be seen as an analogous concept to the statistical variance of a predictor. However unlike variance, there is no unique definition of stability, with numerous proposed measures over 15 years of literature. In this paper, instead of defining a new measure, we start from an axiomatic point of view and identify what properties would be desirable. Somewhat surprisingly, we find that the simple Pearson’s correlation coefficient has all necessary properties, yet has somehow been overlooked in favour of more complex alternatives. Finally, we illustrate how the use of this measure in practice can provide better interpretability and more confidence in the model selection process. The data and software related to this paper are available at https://github.com/nogueirs/ECML2016.
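Concretely, the measure can be computed by encoding each run's selected feature set as a 0/1 vector and averaging Pearson's correlation over all pairs of runs. The following small sketch assumes the selections come from repeated resampling of the training data; the example selection vectors are invented for illustration.

```python
import numpy as np
from itertools import combinations

def stability(selections):
    """Average pairwise Pearson correlation between 0/1 feature-selection vectors."""
    Z = np.asarray(selections, dtype=float)        # shape: (n_runs, n_features)
    pairs = list(combinations(range(len(Z)), 2))
    return np.mean([np.corrcoef(Z[i], Z[j])[0, 1] for i, j in pairs])

# Three runs of a feature selector over resampled data, six candidate features each.
selections = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
]
print(round(stability(selections), 3))
```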

Journal ArticleDOI
TL;DR: A Convex Sparse Principal Component Analysis (CSPCA) algorithm is proposed and applied to feature learning, and experiments demonstrate that the proposed algorithm outperforms state-of-the-art unsupervised feature selection algorithms.
Abstract: Principal component analysis (PCA) has been widely applied to dimensionality reduction and data pre-processing for different applications in engineering, biology, social science, and the like. Classical PCA and its variants seek linear projections of the original variables to obtain the low-dimensional feature representations with maximal variance. One limitation is that it is difficult to interpret the results of PCA. Besides, the classical PCA is vulnerable to certain noisy data. In this paper, we propose a Convex Sparse Principal Component Analysis (CSPCA) algorithm and apply it to feature learning. First, we show that PCA can be formulated as a low-rank regression optimization problem. Based on the discussion, the l2,1-norm minimization is incorporated into the objective function to make the regression coefficients sparse and thereby robust to outliers. Also, based on the sparse model used in CSPCA, an optimal weight is assigned to each of the original features, which in turn provides the output with good interpretability. With the output of our CSPCA, we can effectively analyze the importance of each feature under the PCA criteria. Our new objective function is convex, and we propose an iterative algorithm to optimize it. We apply the CSPCA algorithm to feature selection and conduct extensive experiments on seven benchmark datasets. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art unsupervised feature selection algorithms.

Proceedings ArticleDOI
01 Jun 2016
TL;DR: In this article, the authors study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework and propose snapshot learning to facilitate learning from supervised sequential signals.
Abstract: Recently a variety of LSTM-based conditional language models (LM) have been applied across a range of language generation tasks. In this work we study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework. A method called snapshot learning is also proposed to facilitate learning from supervised sequential signals by applying a companion cross-entropy objective function to the conditioning vector. The experimental and analytical results demonstrate firstly that competition occurs between the conditioning vector and the LM, and the differing architectures provide different trade-offs between the two. Secondly, the discriminative power and transparency of the conditioning vector is key to providing both model interpretability and better performance. Thirdly, snapshot learning leads to consistent performance improvements independent of which architecture is used.

Posted Content
TL;DR: A post-processing method is proposed that improves the model interpretability of tree ensembles by approximating a complex model with a simpler model that is interpretable for humans.
Abstract: Tree ensembles, such as random forest and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the model interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable for humans. To obtain the simpler model, we derive an EM algorithm minimizing the KL divergence from the complex ensemble. A synthetic experiment showed that a complicated tree ensemble was approximated reasonably well by an interpretable model.

Journal ArticleDOI
Gang Luo
08 Mar 2016
TL;DR: This paper presents the first complete method for automatically explaining results for any machine learning predictive model without degrading accuracy, using the electronic medical record data set from the Practice Fusion diabetes classification competition containing patient records from all 50 states in the United States.
Abstract: Predictive modeling is a key component of solutions to many healthcare problems. Among all predictive modeling approaches, machine learning methods often achieve the highest prediction accuracy, but suffer from a long-standing open problem precluding their widespread use in healthcare. Most machine learning models give no explanation for their prediction results, whereas interpretability is essential for a predictive model to be adopted in typical healthcare settings. This paper presents the first complete method for automatically explaining results for any machine learning predictive model without degrading accuracy. We did a computer coding implementation of the method. Using the electronic medical record data set from the Practice Fusion diabetes classification competition containing patient records from all 50 states in the United States, we demonstrated the method on predicting type 2 diabetes diagnosis within the next year. For the champion machine learning model of the competition, our method explained prediction results for 87.4 % of patients who were correctly predicted by the model to have type 2 diabetes diagnosis within the next year. Our demonstration showed the feasibility of automatically explaining results for any machine learning predictive model without degrading accuracy.