RESEARCH ARTICLE
"What is relevant in a text document?": An
interpretable machine learning approach
Leila Arras 1, Franziska Horn 2, Grégoire Montavon 2, Klaus-Robert Müller 2,3,4*, Wojciech Samek 1*
1 Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin, Germany, 2 Machine Learning Group,
Technische Universität Berlin, Berlin, Germany, 3 Department of Brain and Cognitive Engineering, Korea
University, Seoul, Korea, 4 Max Planck Institute for Informatics, Saarbrücken, Germany
* klaus-robert.mueller@tu-berlin.de (KRM); wojciech.samek@hhi.fraunhofer.de (WS)
Abstract
Text documents can be described by a number of abstract concepts such as semantic category,
writing style, or sentiment. Machine learning (ML) models have been trained to automatically
map documents to these abstract concepts, making it possible to annotate very large text
collections, more than could be processed by a human in a lifetime. Besides predicting the
text’s category very accurately, it is also highly desirable to understand how and why the
categorization process takes place. In this paper, we demonstrate that such understanding can
be achieved by tracing the classification decision back to individual words using layer-wise
relevance propagation (LRP), a recently developed technique for explaining predictions of
complex non-linear classifiers. We train two word-based ML models, a convolutional neural
network (CNN) and a bag-of-words SVM classifier, on a topic categorization task and adapt
the LRP method to decompose the predictions of these models onto words. The resulting
scores indicate how much individual words contribute to the overall classification decision.
This enables one to distill relevant information from text documents without an explicit
semantic information extraction step. We further use the word-wise relevance scores to
generate novel vector-based document representations which capture semantic information.
Based on these document vectors, we introduce a measure of model explanatory power and show
that, although the SVM and CNN models perform similarly in terms of classification accuracy,
the latter exhibits a higher level of explainability, which makes it more comprehensible for
humans and potentially more useful for other applications.
PLOS ONE | https://doi.org/10.1371/journal.pone.0181142 | August 11, 2017
OPEN ACCESS
Citation: Arras L, Horn F, Montavon G, Müller K-R, Samek W (2017) "What is relevant in a text
document?": An interpretable machine learning approach. PLoS ONE 12(8): e0181142.
https://doi.org/10.1371/journal.pone.0181142
Editor: Grigori Sidorov, MEXICO
Received: December 23, 2016
Accepted: June 26, 2017
Published: August 11, 2017
Copyright: © 2017 Arras et al. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: Data are available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups.
Funding: This work was supported by the German Ministry for Education and Research as Berlin Big
Data Center BBDC, funding mark 01IS14013A, by the Institute for Information & Communications
Technology Promotion (IITP) grant funded by the Korea government (No. 2017-0-00451) and by
DFG. KRM thanks for partial funding by the National Research Foundation of Korea funded by
1 Introduction
A number of real-world problems related to text data have been studied under the framework
of natural language processing (NLP). Examples of such problems include topic categorization,
sentiment analysis, machine translation, structured information extraction, and automatic
summarization. Due to the overwhelming amount of text data available on the Internet from
various sources such as user-generated content or digitized books, methods to automatically
and intelligently process large collections of text documents are in high demand. For several
text applications, machine learning (ML) models based on global word statistics like TFIDF [1,
2] or linear classifiers are known to perform remarkably well, e.g. for unsupervised keyword
extraction [3] or document classification [4]. More recently, however, neural network models
based on vector space representations of words (like [5]) have been shown to be of great
benefit to a large number of tasks. The trend was initiated by the seminal work of [6] and [7],
who introduced word-based neural networks to perform various NLP tasks such as language
modeling, chunking, named entity recognition, and semantic role labeling. A number of recent
works (e.g. [7, 8]) also refined the basic neural network architecture by incorporating useful
structures such as convolution, pooling, and parse tree hierarchies, leading to further
improvements in model predictions. Overall, these ML models have made it possible to assign
concepts automatically and accurately to entire documents or to sub-document levels like
phrases; the assigned information can then be mined on a large scale.
In parallel, a set of techniques were developed in the context of image categorization to
explain the predictions of convolutional neural networks (a state-of-the-art ML model in this
field) or related models. These techniques were able to associate with each prediction of the
model a meaningful pattern in the space of input features [9–11] or to decompose the model
output onto the input pixels [12–14]. In this paper, we will make use of the layer-wise
relevance propagation (LRP) technique [13], which has already been substantially tested on
various datasets and ML models [15–18].
In the present work, we propose a method to identify which words in a text document are
important to explain the category associated with it. The approach consists of using an ML
classifier to predict the categories as accurately as possible and, in a second step,
decomposing the ML prediction onto the input domain, thus assigning to each word in the
document a relevance score. The ML model of study will be a word-embedding based
convolutional neural network that we train on a text classification task, namely topic
categorization of newsgroup documents. As a second ML model we consider a classical
bag-of-words support vector machine (BoW/SVM) classifier.
We contribute the following:
1. The LRP technique [13] is brought to the NLP domain and its suitability for identifying
relevant words in text documents is demonstrated.
2. LRP relevances are validated, at the document level, by building document heatmap
visualizations, and at the dataset level, by compiling representative words for a text
category. It is also shown quantitatively that LRP better identifies relevant words than
sensitivity analysis.
3. A novel way of generating vector-based document representations is introduced and it is
verified that these document vectors present semantic regularities within their original
feature space akin to word vector representations.
4. A measure for model explanatory power is proposed and it is shown that two ML models, a
neural network and a BoW/SVM classifier, although presenting similar classification
performance, may substantially differ in terms of explainability.
The work is organized as follows. In Section 2 we describe the related work for explaining
classifier decisions with respect to input space variables. In Section 3 we introduce our neural
network ML model for document classification, as well as the LRP decomposition procedure
associated with its predictions. We describe how LRP relevance scores can be used to identify
important words in documents and introduce a novel way of condensing the semantic
information of a text document into a single document vector. Likewise in Section 3 we
introduce a baseline ML model for document classification, as well as a gradient-based
alternative for assigning relevance scores to words. In Section 4 we define objective criteria
for evaluating word relevance scores, as well as for assessing model explanatory power. In
Section 5 we introduce the dataset and experimental setup, and in Section 6 we present the
results. Finally, Section 7 concludes our work.
the Ministry of Education, Science, and Technology in the BK21 program.
Competing interests: The authors have declared that no competing interests exist.
2 Related work
Explanation of individual classification decisions in terms of input variables has been studied
for a variety of machine learning classifiers such as additive classifiers [19], kernel-based
classifiers [20] or hierarchical networks [12]. Model-agnostic methods for explanations relying
on random sampling have also been proposed [21–23]. Despite their generality, however, the
latter incur an additional computational cost due to the need to process the whole sample to
provide a single explanation. Other methods are more specific to deep convolutional neural
networks used in computer vision: the authors of [9] proposed a network propagation technique
based on deconvolutions to reconstruct input image patterns that are linked to a particular
feature map activation or prediction. The work of [10] is aimed at revealing salient structures
within images related to a specific class by computing the corresponding prediction score
derivative with respect to the input image. The latter method is based on gradient magnitude,
and thus reveals the sensitivity of the classifier decision to some local variation of the input
image; this technique is related to sensitivity analysis [24, 25].
In contrast, the LRP method of [13] corresponds to a full decomposition of the classifier’s
actual prediction score value for the current input image. One can show that sensitivity analysis
decomposes the gradient square norm of the function f, i.e., $\sum_i R_i = \|\nabla_x f(x)\|^2$,
whereas LRP decomposes the function value itself, $\sum_i R_i = f(x)$. Intuitively, when the
classifier e.g. detects cars in images, then sensitivity analysis answers the question “what
makes this car image more or less a car?”, whereas LRP answers the more fundamental question
“what makes this image a car at all?”. Note that the LRP framework can be applied to various
models such as kernel support vector machines and deep neural networks [13, 18]. We refer the
reader to [15] for a comparison of the three explanation methods, and to [14] for a view of
particular instances of LRP as a “deep Taylor decomposition” of the decision function. A
tutorial on methods for interpreting and understanding deep neural networks can be found in
[26].
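The difference between the two decompositions can be illustrated on a toy case. The sketch below assumes a bias-free linear scoring function f(x) = w·x, for which the gradient is simply w and a natural exact decomposition of the score is R_i = w_i x_i. The function names, weights, and inputs are hypothetical, and the linear rule stands in for the layer-wise propagation rules that LRP uses on deep networks; it is not the general LRP procedure.

```python
def sensitivity_scores(w):
    """Sensitivity analysis for f(x) = w.x: R_i = (df/dx_i)^2 = w_i^2."""
    return [wi ** 2 for wi in w]


def linear_lrp_scores(w, x):
    """Toy decomposition of f(x) = w.x into R_i = w_i * x_i.

    Assumption: bias-free linear model; deep networks need layer-wise rules.
    """
    return [wi * xi for wi, xi in zip(w, x)]


# Hypothetical weights and input, chosen as exact binary fractions.
w = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]

f_x = sum(wi * xi for wi, xi in zip(w, x))  # prediction score
sa = sensitivity_scores(w)                  # sums to the squared gradient norm
lrp = linear_lrp_scores(w, x)               # sums to the prediction score

print("f(x)          =", f_x)
print("sum SA scores =", sum(sa))
print("sum LRP scores=", sum(lrp))
```

In this toy case the sensitivity scores are non-negative and do not depend on the input x, whereas the second feature receives a negative relevance because it speaks against the predicted score.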
In the context of neural networks for text classification, [27] proposed to extract salient
sentences from text documents using loss gradient magnitudes. In order to validate the
pertinence of the sentences extracted via the neural network classifier, the latter work
proposed to subsequently use these sentences as an input to an external classifier and compare
the resulting classification performance to random and heuristic sentence selection. The work
by [28] also employs gradient magnitudes to identify salient words within sentences,
analogously to the method proposed in computer vision by [10]. However, their analysis is
based on qualitative interpretation of saliency heatmaps for exemplary sentences. In addition
to the heatmap visualizations, we provide a classifier-intrinsic quantitative validation of the
word-level relevances. We furthermore extend previous work from [29] by adding a BoW/SVM
baseline to the experiments and proposing a new criterion for assessing model explanatory
power. Recent work from [30, 31] uses LRP to explain recurrent neural network predictions in
sentiment analysis and machine translation.
3 Interpretable text classification
In this Section we describe our method for identifying words in a text document that are
relevant with respect to a given category of a classification problem. For this, we assume that
we are given a vector-based word representation and a convolutional neural network that has
already been trained to accurately map documents to their actual category. Our method can be
divided into four steps: (1) Compute an input representation of a text document based on
word vectors. (2) Forward-propagate the input representation through the convolutional
neural network until the output is reached. (3) Backward-propagate the output through the
network using the layer-wise relevance propagation (LRP) method, until the input is reached.
(4) Pool the relevance scores associated with each input variable of the network onto the words
to which they belong. As a result of this four-step procedure, a decomposition of the prediction
score for a category onto the words of the document is obtained. The decomposed terms are
called relevance scores. These relevance scores can be viewed as highlighted text or can be used
to form a list of top-words in the document. The whole procedure is also described visually in
Fig 1. While we detail in this Section the LRP method for a specific network architecture and
with predefined choices of layers, the method can in principle be extended to any architecture
composed of a similar or larger number of layers.
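Step (4) and the resulting list of top-words can be sketched as follows. The sketch assumes that pooling means summing the relevances of all embedding dimensions belonging to a word; the function names and the toy relevance values are hypothetical and do not come from a trained model.

```python
def pool_word_relevances(R):
    """Pool a D x L relevance matrix onto words.

    R is a list of D rows (embedding dimensions), each with L entries (words).
    Assumption: pooling is a plain sum over the D dimensions of each word.
    """
    return [sum(column) for column in zip(*R)]


def top_words(words, scores, k=2):
    """Return the k words with the highest pooled relevance."""
    ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical document with L = 4 words and D = 2 embedding dimensions.
words = ["the", "rocket", "reached", "orbit"]
R = [
    [0.0, 0.5, 0.25, 0.25],
    [0.125, 0.5, -0.25, 0.5],
]

word_scores = pool_word_relevances(R)
print(list(zip(words, word_scores)))
print(top_words(words, word_scores, k=2))
```

The pooled scores can be mapped to colors to highlight the text, or sorted as above to list the most relevant words.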
At the end of this Section we introduce different methods which will serve as baselines for
comparison. A baseline for the convolutional neural network model is the BoW/SVM classifier,
with the LRP procedure adapted accordingly [13]. A baseline for the LRP relevance
decomposition procedure is gradient-based sensitivity analysis (SA), a technique which assigns
sensitivity scores to individual words. In the vector-based document representation
experiments, we will also compare LRP to uniform and TFIDF baselines.
3.1 Representing words and documents
Fig 1. Diagram of a CNN-based interpretable machine learning system. It consists of a forward processing that computes for each input document a
high-level concept (e.g. semantic category or sentiment), and a redistribution procedure that explains the prediction in terms of words.
https://doi.org/10.1371/journal.pone.0181142.g001
Prior to training the neural network and using it for prediction and explanation, we first derive
a numerical representation of the text documents that will serve as an input to the neural
classifier. To this end, we map each individual word in the document to a vector embedding, and
concatenate these embeddings to form a matrix whose size is the number of words in the document
times the dimension of the word embeddings. A distributed representation of words can be
learned from scratch, or fine-tuned simultaneously with the classification task of interest. In
the present work, we use only pre-training as it was shown that, even without fine-tuning, this
leads to good neural network classification performance for a variety of tasks such as
part-of-speech tagging or sentiment analysis [7, 32].
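The construction of the input matrix can be sketched in a few lines. The embedding table below is a hypothetical toy example with three dimensions, not real pre-trained vectors, and the sketch assumes every word of the document is in the vocabulary. The matrix is laid out with one row per embedding dimension and one column per word, i.e. with shape D × L as used in Section 3.2.

```python
# Hypothetical toy embedding table (D = 3); real embeddings would be pre-trained.
toy_embeddings = {
    "space": [0.5, -1.0, 0.25],
    "shuttle": [1.0, 0.0, -0.5],
    "launch": [0.0, 2.0, 1.5],
}


def document_matrix(words, embeddings):
    """Concatenate word vectors into a D x L matrix (rows: dimensions, columns: words).

    Assumption: every word is in the vocabulary; unknown words raise a KeyError.
    """
    columns = [embeddings[word] for word in words]  # L vectors of length D
    dim = len(columns[0])
    return [[column[i] for column in columns] for i in range(dim)]


X = document_matrix(["space", "launch"], toy_embeddings)
print(len(X), "x", len(X[0]))  # D x L
for row in X:
    print(row)
```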
One shallow neural network model for learning word embeddings from unlabeled text
sources is the continuous bag-of-words (CBOW) model of [33], which is similar to the
log-bilinear language model from [34, 35] but ignores the order of context words. In the CBOW
model, the objective is to predict a target middle word from the average of the embeddings of
the context words that are surrounding the middle word, by means of direct dot products
between word embeddings. During training, a set of word embeddings for context words $v$
and for target words $v'$ are learned separately. After training is completed, only the context
word embeddings $v$ will be retained for further applications. The CBOW objective has a simple
maximum likelihood formulation, where one maximizes over the training data the sum of the
logarithm of probabilities of the form:
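$$
P(w_t \mid w_{t-n:t+n}) = \frac{\exp\left( \left( \frac{1}{2n} \sum_{-n \le j \le n,\, j \ne 0} v_{w_{t+j}} \right)^{\top} v'_{w_t} \right)}{\sum_{w \in V} \exp\left( \left( \frac{1}{2n} \sum_{-n \le j \le n,\, j \ne 0} v_{w_{t+j}} \right)^{\top} v'_{w} \right)}
$$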
where the softmax normalization runs over all words $w$ in the vocabulary $V$, $2n$ is the number
of context words per training text window, $w_t$ represents the target word at the $t$-th
position in the training data, and $w_{t-n:t+n}$ represents the corresponding context words.
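The softmax probability above can be evaluated directly on a toy vocabulary. The sketch below uses hypothetical two-dimensional context embeddings `v` and target embeddings `v_prime` and computes the full softmax; it does not implement the negative sampling approximation used to train the embeddings in practice. Subtracting the maximum score before exponentiating is a numerical-stability choice that leaves the probabilities unchanged.

```python
import math


def cbow_probability(target, context, v, v_prime):
    """P(target | context) under the CBOW softmax.

    v: context-word embeddings, v_prime: target-word embeddings (dicts word -> list).
    The context list plays the role of the 2n words surrounding the target.
    """
    dim = len(next(iter(v.values())))
    # Average of the context embeddings: (1 / 2n) * sum_j v_{w_{t+j}}
    h = [sum(v[c][i] for c in context) / len(context) for i in range(dim)]
    # Dot product of the averaged context with every target embedding in V
    scores = {w: sum(hi * ui for hi, ui in zip(h, u)) for w, u in v_prime.items()}
    shift = max(scores.values())  # stability only; cancels in the ratio
    normalizer = sum(math.exp(s - shift) for s in scores.values())
    return math.exp(scores[target] - shift) / normalizer


# Hypothetical toy embeddings for a three-word vocabulary.
v = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
v_prime = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

context = ["a", "c"]  # 2n = 2 context words
probs = {w: cbow_probability(w, context, v, v_prime) for w in v_prime}
print(probs)
```

With this toy context the averaged embedding is (1.0, 0.5), so the word "c" receives the highest probability, followed by "a" and "b".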
In the present work, we utilize pre-trained word embeddings obtained with the CBOW
architecture and the negative sampling training procedure [5]. We will refer to these
embeddings as word2vec embeddings.
3.2 Predicting category with a convolutional neural network
Our ML model for classifying text documents is a word-embedding based convolutional
neural network (CNN) model similar to the one proposed in [32] for sentence classification,
which itself is a slight variant of the model introduced in [7] for semantic role labeling. This
architecture is depicted in Fig 1 (left) and is composed of several layers.
As previously described, in a first step we map each word in the document to its word2vec
vector. Denoting by $D$ the word embedding dimension and by $L$ the document length, our
input is a matrix of shape $D \times L$ (e.g., for the purpose of illustration, in Fig 1 we have
$D = 8$ and $L = 6$). We denote by $x_{i,t}$ the value of the $i$-th component of the word2vec
vector representing the $t$-th word in the document. The convolution/detection layer produces a
new representation composed of $F$ sequences indexed by $j$, where each element of the sequence
is computed as:
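$$
\forall j, t : \quad x_{j,t} = \max\Big( 0, \sum_{i,\tau} x_{i,t-\tau} \, w^{(1)}_{i,j,\tau} + b^{(1)}_{j} \Big) = \max\Big( 0, \sum_{i} \big( x_i \star w^{(1)}_{i,j} \big)_t + b^{(1)}_{j} \Big)
$$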
where $t$ indicates a position within the text sequence, $j$ designates a feature map, and
$\tau \in \{0, 1, \ldots, H - 1\}$ is a delay with range $H$, the filter size of the
one-dimensional convolutional operation $\star$. After the convolutional operation, which yields
$F$ feature maps of length $L - H + 1$, we apply the ReLU non-linearity element-wise (e.g., in
Fig 1, we have $F = 5$ feature maps and a filter size $H = 2$, hence we use $\tau \in \{0, 1\}$
and the resulting feature maps have a length of 5). Note that the trainable parameters $w^{(1)}$
and $b^{(1)}$ do not depend on the position $t$ in the text document, hence the convolutional
processing is equivariant with respect to this physical dimension. The next layer computes, for
each dimension $j$ of the previous representation, the