Proceedings ArticleDOI

Scalable and Interpretable Predictive Models for Electronic Health Records

TL;DR: This work considers the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients, and develops distributed models that are scalable and interpretable.
Abstract: Early identification of patients at risk of developing complications during their hospital stay is currently one of the most challenging issues in healthcare. Complications include hospital-acquired infections, admissions to intensive care units, and in-hospital mortality. Being able to accurately predict the patients' outcomes is a crucial prerequisite for tailoring the care that certain patients receive, if it is believed that they will do poorly without additional intervention. We consider the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients. We study the question of making predictions on the first day at the hospital, and of making updated mortality predictions day after day during the patient's stay. We develop distributed models that are scalable and interpretable. Key insights include analysing diagnoses known at admission and drugs served, which evolve during the hospital stay. We leverage a distributed architecture to learn interpretable models from training datasets of gigantic size. We test our analyses with more than one million patients from hundreds of hospitals, and report on the lessons learned from these experiments.

Summary (5 min read)

I. INTRODUCTION

  • One major expectation of data science in healthcare is the ability to leverage digitized health information and computer systems to better understand and improve care.
  • The availability of EHR data opens the way to the development of quantitative models for patients that can be used to predict health status, as well as to help prevent disease, adverse effects, and ultimately death.
  • These approaches often trade some model interpretability for more predictive accuracy.
  • The authors consider complication risk prediction and focus on two aspects of this problem: (i) how to make accurate predictions with interpretable models; and (ii) how to take into account evolving clinical information during hospital stay.
  • The rest of the paper is organized as follows: the authors first present the data and methods used in § II.

A. Data source

  • The authors used EHR data from the Premier healthcare database which is one of the largest clinical databases in the United States, gathering information from millions of patients over a period of 12 months from 417 hospitals in the USA [19] .
  • These hospitals are believed to be broadly representative of the United States hospital experience.
  • The database contains hospital discharge files that are dated records of all billable items (including therapeutic and diagnostic procedures, medication, and laboratory usage) which are all linked to a given admission [15] .
  • The authors focus on hospital admissions of adults hospitalized for at least 3 days, excluding elective admissions.
  • The snapshot of the database used in their work comprises the EHR data of 1,271,733 hospital admissions.

B. Outcomes

  • Patients who experienced a given outcome are considered positive cases for this outcome; those who did not are considered negative cases.
  • Table I presents the distribution of patients with respect to the considered outcomes.

C. Preparing the data for supervised learning

  • The authors' methodology assumes no a priori clinical knowledge.
  • The authors' models also use the list of admitting diagnoses known for a given patient as available in the EHR data at admission (encoded with ICD-9-CM identifiers), which the authors denote by A. Procedures can be performed during the hospital stay.
  • The authors filter out unused procedures and drugs, and use a perfect hash function to encode the features.
  • A small proportion of patients receive procedures during their stay (∼ 20% of patients receive procedures on the first day).
  • On the first day of stay, a patient is served 8.6 drugs on average.

D. Development of models

  • Following [4], the authors pay specific attention to the interpretability (or "intelligibility") of the predictive models they develop.
  • Accurate models such as deep neural nets and random forests are usually not interpretable, but more interpretable models such as logistic regression are usually less accurate.
  • y_i ∈ {0, 1} are the corresponding labels, which the authors want to predict (e.g. for the mortality case study, 0 means the patient survived and 1 means the patient died at the hospital), and R(w) is the regularizer that controls the complexity of the model.
  • The classes the authors consider are heavily imbalanced (as shown in Table I ): in-hospital death for instance can be considered as a rare event.
  • Scalability is achieved by implementing a distributed version of logistic regression, including a distributed version of the L-BFGS optimization algorithm, which the authors use to solve the aforementioned optimization problem.
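As a rough worked example of the class-imbalance weighting (using the mortality counts from Table I and the formula θ_i = τ·y_i + (1−τ)·(1−y_i) given in § II-D2 of the paper): the fraction of negative instances is τ = 857,005 / 885,241 ≈ 0.968, so each death (positive instance) receives weight θ_i ≈ 0.968 while each survivor receives θ_i ≈ 0.032; an error on a rare positive instance therefore weighs roughly 30 times more than an error on a negative one.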

E. Model evaluation and statistical analysis

  • Patients were randomly split into disjoint train and test subsets.
  • Model accuracy is reported in terms of several metrics on the (naturally imbalanced) test set, which is used exclusively for evaluation purposes.
  • The authors report on the receiver operating characteristic (ROC) curves and especially on the area under the ROC curve (AuROC).
  • For the sake of completeness, the authors also include the commonly used Accuracy metric [8] .
  • Since the authors deal with highly skewed datasets (as shown in Table I ), they also report on the precision-recall (PR) curves and on the area under the PR curve (AuPR), in order to give a more complete picture of the performance of the models [6] .

F. Prediction timing

  • The authors consider making predictions at different times.
  • First, the authors consider making predictions on the first day at the hospital.
  • The authors report on corresponding results, for all considered clinical outcomes, in § III.
  • The authors then report on how to make new mortality predictions, day after day, whenever new EHR information becomes available, and present corresponding results in § IV.

III. RESULTS ON PREDICTIONS ON THE FIRST DAY

  • On the first day, the authors consider predictive models built with different sets of features (that they later combine).
  • The authors name the models they consider after the sets of features they rely on.
  • For example the authors consider the model EA for making predictions at hospital admission time t_0 (i.e. at the moment when the patient arrives at the hospital).
  • This model uses the elementary features E and the diagnoses A known at admission.
  • The authors also consider making predictions whenever the set of drugs served on the first day is known (typically at t_0 + 24h).

A. Mortality

  • For predicting in-hospital mortality, AuROC was 77.8% and AuPR was 12.7% with the D_1 model, indicating significant predictive power of the drugs served on the first day (as already known from [9]).
  • The authors also consider the concatenated model C(A, D_1), in which all the features found in A and D_1 are concatenated.
  • The authors also use ensemble techniques and in particular the stacking technique [7] to create combined models.
  • Table II gives an overview of the AuROC, Accuracy and AuPR obtained with the basic and combined models considered, on the same population, having admitting diagnosis information.
  • This suggests that classifiers trained from large amounts of diagnoses and drugs served found in EHR data can produce valid predictions across a variety of clinical outcomes (not only mortality) on the first day at the hospital.

C. Benefits of interpretability and explainability of predictions

  • The authors investigate the stability and the consistency of the models when learned with different training sets.
  • For this purpose, the authors study to what extent the logistic regression weights vary when radically different training sets are randomly picked.
  • A systematic pairwise comparison of the lists of topmost weights for each run showed that the lowest proportion of common weights between two runs was 90%.
  • Table VI presents an excerpt of the most important weights in the logistic regression model along with their ranking and their impact (positive/negative) on the outcome.
  • The clinical interpretation is beyond the scope of this paper, but the point is that their model allows this vector to be given to medical experts for further clinical research.
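As an illustration of this stability check (a sketch with hypothetical variable names, not the authors' code), the overlap between the topmost weights of two models trained on different random training sets could be computed as follows:

    import numpy as np

    def top_features(weights, k=100):
        # indices of the k features with the largest absolute weights
        return set(np.argsort(-np.abs(weights))[:k])

    def overlap(weights_run_1, weights_run_2, k=100):
        common = top_features(weights_run_1, k) & top_features(weights_run_2, k)
        return len(common) / float(k)

    # e.g. an overlap of 0.90 or more between any two runs would match the
    # consistency reported above (weights_run_1 and weights_run_2 being the
    # coefficient vectors of two independently trained logistic regressions).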

IV. RESULTS WITH EVOLVING DATA

  • The authors consider making inpatient mortality predictions on a daily basis, while taking into account new clinical information becoming available since admission.
  • The authors investigate interpretable models that predict on day k using data available up to that day.

A. Preliminary observations

  • Figure 5 gives insights on the number of patients remaining hospitalized at a certain day (no matter how long they stay).
  • For each day i, it illustrates the subset of patients who have at least one drug served on that day (i.e. for which D_i ≠ ∅), and the subset of patients who have at least one procedure on that day (i.e. for which P_i ≠ ∅), respectively.
  • The vast majority of patients (more than 99.8%) are served drugs during their stay whereas only a small proportion of the population receive new procedures.
  • In particular, the authors created separate models using E and P_i as features for each day i; but their combinations with ensemble techniques did not yield any significant improvement in prediction accuracy over the global population.
  • The authors did not obtain significant improvements when restricting to the patients having new procedures on the last day either.

B. Daily mortality predictions

  • For making predictions on a certain day k, the authors consider a variety of models built from different sets of features, which they combine with ensemble techniques (in a similar manner as for the first day, except that the set of basic models is now much richer, as they can consider various models and several days).
  • To avoid running into the curse of dimensionality, the authors define a threshold for the maximum acceptable ratio between the number of features and the number of training instances (the latter decreases for higher values of k, as shown in Figure 5).
  • The authors arbitrarily set this ratio to 10, which allows them to conduct analyses with sliding windows until the 6th day.
  • This raises the question of how much historical data (since admission) is worth considering for making predictions, i.e., of identifying tradeoffs between predictive accuracy and model complexity. Results suggest that [D_k] models provide an interesting tradeoff (between accuracy and complexity) for predicting on day k, compared to all the other models. A sketch of how such day-k features can be assembled is given after this list.
  • The authors observe that for the majority of patients, the set of drugs served tends to change only slightly from one day to the next, a majority of drugs being continuously served day after day.
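A minimal sketch of how a [D_k] feature vector could be assembled for day-k predictions (the data layout, field names and indexing are assumptions for illustration; the paper's actual pipeline is distributed and uses hashed sparse features):

    def day_k_features(days_to_drugs, k, drug_index):
        # days_to_drugs: dict mapping day number -> list of (drug_code, quantity) pairs
        # drug_index: dict mapping drug_code -> feature column
        x = [0.0] * len(drug_index)
        for code, qty in days_to_drugs.get(k, []):
            if code in drug_index:
                x[drug_index[code]] += float(qty)
        return x

    # A sliding-window variant would aggregate the days k-w+1 .. k instead of day k only.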

C. Discussion

  • All the predictions the authors made in this section consider drugs-served data from the first day onwards, but do not take into account the admitting diagnoses A known at admission (as opposed to § III).
  • For patients having an admitting diagnosis (∼ 70% of the overall population), the authors study combinations of the predictions made at admission (from § III) with predictions made later during the stay.
  • This suggests that data known at admission still helps in improving the accuracy of mortality predictions made at a later stage during the hospital stay.
  • The authors observe that the most important weight is associated with the A basic model.
  • The authors observe that the AuROC of the predictions made with A decreases over time with the remaining population (as also illustrated by the blue line in Figure 8 ).

VI. CONCLUSION AND PERSPECTIVES

  • The authors develop a distributed supervised machine learning system for predicting clinical outcomes based on EHR data.
  • The authors propose interpretable models, based on the analysis of admitting diagnoses and drugs served during the hospital stay.
  • The authors' models can be used to make predictions concerning the risk of hospital-acquired infections, pressure ulcers, and inpatient mortality.
  • The authors use a distributed implementation to train models on millions of patient profiles.
  • The authors report on lessons learned with a large-scale experimental study with real data from US hospitals.


HAL Id: hal-01877742
https://hal.inria.fr/hal-01877742
Submitted on 20 Sep 2018
Scalable and Interpretable Predictive Models for
Electronic Health Records
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson
To cite this version:
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson. Scalable and Interpretable Predictive Models for Electronic Health Records. DSAA 2018 - 5th IEEE International Conference on Data Science and Advanced Analytics, Oct 2018, Turin, Italy. pp.1-10. ⟨hal-01877742⟩

Scalable and Interpretable Predictive Models for
Electronic Health Records
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
{amela.fejza, pierre.geneves, nabil.layaida}@inria.fr
Univ. Grenoble Alpes, CNRS, Public Health department CHU Grenoble Alpes,
Grenoble INP, TIMC-IMAG, 38000 Grenoble, France
JLBosson@chu-grenoble.fr
Abstract—Early identification of patients at risk of developing
complications during their hospital stay is currently one of the
most challenging issues in healthcare. Complications include
hospital-acquired infections, admissions to intensive care units,
and in-hospital mortality. Being able to accurately predict the
patients’ outcomes is a crucial prerequisite for tailoring the care
that certain patients receive, if it is believed that they will do
poorly without additional intervention. We consider the problem
of complication risk prediction, such as inpatient mortality, from
the electronic health records of the patients. We study the
question of making predictions on the first day at the hospital,
and of making updated mortality predictions day after day
during the patient’s stay. We develop distributed models that
are scalable and interpretable. Key insights include analysing
diagnoses known at admission and drugs served, which evolve
during the hospital stay. We leverage a distributed architecture
to learn interpretable models from training datasets of gigantic
size. We test our analyses with more than one million patients
from hundreds of hospitals, and report on the lessons learned
from these experiments.
I. INTRODUCTION
One major expectation of data science in healthcare is
the ability to leverage digitized health information and
computer systems to better understand and improve care. Over
the past few years the adoption of electronic health records
(EHRs) in hospitals has surged to an unprecedented level.
In the USA for example, more than 84% of hospitals have
adopted a basic EHR system, up from only 15% in 2010
[1], [12]. The availability of EHR data opens the way to the
development of quantitative models for patients that can be
used to predict health status, as well as to help prevent disease,
adverse effects, and ultimately death.
We consider the problem of predicting important clinical
outcomes such as inpatient mortality, based on EHR data. This
raises many challenges including dealing with the very high
number of potential predictor variables in EHRs. Traditional
approaches have overcome this complexity by extracting only
a very limited number of considered variables [5], [14]. These
approaches basically trade predictive accuracy for simplicity
and feasibility of model implementation. Other approaches have dealt with this complexity by developing black box machine learning models that retain predictor variables from a large set of possible inputs, especially with deep learning [3], [18], [20], [22]. These approaches often trade some model interpretability for more predictive accuracy.
(This research was partially supported by the ANR project CLEAR (ANR-16-CE25-0010).)
Predictive accuracy is crucial as wrong predictions might
have critical consequences. False positives might overwhelm
the hospital staff, and false negatives can fail to trigger
important alarms, exposing patients to poor clinical outcomes.
However, model interpretability is essential as it allows physi-
cians to get better insights on the factors that influence
the predictions, understand, edit and fix predictive models
when needed [4]. The search for tradeoffs between predictive
accuracy and model interpretability is challenging.
We consider complication risk prediction and focus on two
aspects of this problem: (i) how to make accurate predictions
with interpretable models; and (ii) how to take into account
evolving clinical information during hospital stay. Our main
contributions are the following:
• we show that with interpretable models it is possible to make accurate risk predictions, based on data concerning admitting diagnoses and drugs served on the first day.
• we further develop mortality risk prediction models to make updated predictions when new clinical information becomes available during hospitalization; in particular we analyze the evolution of drugs served.
• we report on lessons learned through practical experiments with real EHR data from more than one million patients admitted to US hospitals, which is, to the best of our knowledge, one of the largest such experimental studies conducted so far.
Outline: The rest of the paper is organized as follows:
we first present the data and methods used in § II. In § III we
present results obtained when making predictions of clinical
outcomes on the first day at the hospital. In § IV we investigate
to which extent the predictive models can benefit from the
availability of supplemental information becoming available
during the hospital stay to make updated predictions. We
finally review related works in § V before concluding in § VI.

II. METHODS
A. Data source
We used EHR data from the Premier healthcare database
which is one of the largest clinical databases in the United
States, gathering information from millions of patients over
a period of 12 months from 417 hospitals in the USA [19].
These hospitals are believed to be broadly representative of
the United States hospital experience. The database contains
hospital discharge files that are dated records of all billable
items (including therapeutic and diagnostic procedures, med-
ication, and laboratory usage) which are all linked to a given
admission [15]. We focus on hospital admissions of adults
hospitalized for at least 3 days, excluding elective admissions.
The snapshot of the database used in our work comprises the
EHR data of 1,271,733 hospital admissions.
B. Outcomes
For a given patient, we consider the problem of predicting the occurrence of several important clinical outcomes:
• death: in-hospital mortality, defined as a discharge disposition of “expired” [9], [20];
• hospital-acquired infections (HAI) developed during the stay [21];
• admissions to intensive care unit (ICU) on or after the second day, excluding direct admissions on the first day;
• pressure ulcers (PU) developed during the stay (not present at admission).
Patients who experienced a given outcome are considered
positive cases for this outcome; those who did not are consid-
ered negative cases. Table I presents the distribution of patients
with respect to the considered outcomes.
TABLE I: Number of instances for each case study.
Problem studied Positive cases Negative cases Ratio
Mortality 28,236 857,005 3.29%
HAI 22,402 862,839 2.59%
ICU Admission 32,310 852,931 3.78%
Pressure Ulcers 23,742 861,499 2.75%
C. Preparing the data for supervised learning
Our methodology assumes no a priori clinical knowledge.
For a given patient, we first extract a list E of elementary
features including the age, gender, and admission type. Our
models also use the list of admitting diagnoses known for a given patient as available in the EHR data at admission, which we denote by A (admitting diagnoses are encoded as unique identifiers from the International Classification of Diseases, Ninth Revision, Clinical Modification, known as ICD-9-CM). Procedures can be performed during the hospital stay. We denote the list of procedures performed on the i-th day of the stay (with i > 0) by P_i. We also consider the lists of drugs served, on a daily basis: D_i denotes the list of drug names (and their quantities) served on the i-th day.
We filter out unused procedures and drugs, and use a perfect
hash function to encode the features. The feature matrix is very
sparse so in the implementation we use a sparse representation
of feature vectors. Most patients are admitted at the hospital
with at least one admitting diagnosis (among 5,094 possible
diagnoses). A small proportion of patients receive procedures
during their stay (∼20% of patients receive procedures on the
first day). The total number of possible procedures is 11,338.
Furthermore, during the stay, a total of 10,739 possible drugs
can be served. On the first day of stay, a patient is served
8.6 drugs on average. Figure 1 shows the distribution of the
considered population in terms of the number of drugs received
on the first day. Figure 2 shows an excerpt of the data for a
sample patient.

Fig. 1: Population distribution in terms of the number of drugs received on the first day (|D_1|).
[(434456800,
(82, ’M’, 1,
[’A(0)’: [(’578.1’, ’A’, ’999’, 999)]],
[’D(1)’: [(’250258001120000’, ’2’),
(’460460947620000’, ’1’),
(’380381000310000’, ’2’),
(’300305857300000’, ’1’),
(’300305850250000’, ’1’),
...
(’250250043500000’, ’1’)]
],
[’D(2)’: [(’380381000310000’, ’1’),
(’320320721000000’, ’1’),
(’300301825750000’, ’1’),
(’250257025740000’, ’1.85’),
...
(’250250052970000’, ’1’)]
]))]
Fig. 2: Data excerpt for an 82-year-old male patient who was discharged alive on the third day.
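As an illustration of this encoding step (a sketch only: the paper uses a perfect hash over the retained codes, whereas this sketch builds an ordinary index from the training data, and the field names are assumptions):

    from scipy.sparse import csr_matrix

    def build_index(retained_codes):
        # one column per distinct diagnosis or drug code kept after filtering
        return {code: j for j, code in enumerate(sorted(retained_codes))}

    def encode_first_day(patient, index):
        # elementary features E (age, gender, admission type), then A and D_1 codes
        cols = [0, 1, 2]
        vals = [float(patient["age"]), float(patient["gender"]), float(patient["adm_type"])]
        for code in patient["A"]:                  # admitting diagnosis codes
            if code in index:
                cols.append(3 + index[code])
                vals.append(1.0)
        for code, qty in patient["D1"]:            # (drug code, quantity) pairs, first day
            if code in index:
                cols.append(3 + index[code])
                vals.append(float(qty))
        # sparse row vector, as in the paper's sparse feature representation
        return csr_matrix((vals, ([0] * len(cols), cols)), shape=(1, 3 + len(index)))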
D. Development of models
1) Interpretability: Following [4], we pay specific attention
to the interpretability of the predictive models we develop.
Model interpretability (or “intelligibility” as found in [4])
refers to the ability to understand, validate, edit, and trust

a learned model, which is particularly important in critical
applications in healthcare such as the one we consider here.
Accurate models such as deep neural nets and random forests
are usually not interpretable, but more interpretable models
such as logistic regression are usually less accurate. This
often imposes a tradeoff between accuracy and interpretability.
We choose to preserve interpretability and develop classifiers
based on logistic regression. Advantages of logistic regres-
sion include yielding insights on the factors that influence
the predictions, such as an interpretable vector of weights
associated to features, and predictions that can be interpreted
as probabilities.
2) Mathematical formulation: Logistic regression can be formulated as the optimization problem min_{w ∈ ℝ^d} f(w), in which the objective function is of the form:

f(w) = λ·R(w) + (1/n) · Σ_{i=1}^{n} θ_i · L(w; x_i, y_i)

where n is the number of instances in the training set, and for 1 ≤ i ≤ n:
• w is the vector of weights we are looking for.
• the vectors x_i ∈ ℝ^d are the instances of the training data set: each vector x_i is composed of the d values corresponding to the features retained for a given admission.
• y_i ∈ {0, 1} are their corresponding labels, which we want to predict (e.g. for the mortality case study, 0 means the patient survived and 1 means the patient died at the hospital).
• R(w) is the regularizer that controls the complexity of the model. For the purpose of favoring simple models and avoiding overfitting, in the reported experiments we used R(w) = (1/2)·||w||₂².
• λ is the regularization parameter that defines the tradeoff between the two goals of minimizing the loss (i.e., the training error) and minimizing model complexity (i.e., avoiding overfitting). In the reported experiments we used λ = 1/2.
• θ_i is the weight factor that we use to compensate for class imbalance. The classes we consider are heavily imbalanced (as shown in Table I): in-hospital death, for instance, can be considered as a rare event. Notice that we do not use downsampling (which would drastically reduce the set of negative instances for the purpose of rebalancing classes); instead we apply the weighting technique [13] that allows our models to learn from all instances of imbalanced training sets. θ_i is thus in charge of adjusting the impact of the error associated to each instance proportionally to class imbalance: θ_i = τ·y_i + (1−τ)·(1−y_i), where τ is the fraction of negative instances in the training set.
• the loss function L measures the error of the model on the training data set; we use the logistic loss:

L(w; x_i, y_i) = ln(1 + e^{(1 − 2·y_i)·wᵀx_i})
Given a new instance x of the test data set, the model makes a prediction by applying the logistic function:

f(z) = 1 / (1 + e^{−z})

where z = wᵀx. The raw output f(z) has a probabilistic interpretation: the probability that x is positive. In the sequel we rely on this probability to build further models (in particular using the stacking technique: see the meta-model built from such probabilities in § IV). We also use the common threshold t = 0.5 such that if f(z) > t, the outcome is predicted as positive (and negative otherwise); for computing ROC curves, we make t vary in [0, 1].
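For concreteness, the following minimal sketch (not the authors' implementation) evaluates this weighted, regularized objective and the logistic prediction with NumPy, for a dense feature matrix X of shape n×d and a binary label vector y:

    import numpy as np

    def objective(w, X, y, lam=0.5):
        tau = 1.0 - y.mean()                       # fraction of negative instances
        theta = tau * y + (1.0 - tau) * (1.0 - y)  # per-instance weights θ_i
        margins = (1.0 - 2.0 * y) * (X @ w)        # (1 − 2·y_i)·wᵀx_i
        losses = np.log1p(np.exp(margins))         # logistic loss of each instance
        return lam * 0.5 * w.dot(w) + np.mean(theta * losses)

    def predict_proba(w, X):
        return 1.0 / (1.0 + np.exp(-(X @ w)))      # probability that the outcome is positive

In the paper, the minimization of this objective is carried out with a distributed L-BFGS; on a single machine, any off-the-shelf optimizer (e.g. scipy.optimize.minimize) could be applied to this function instead.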
3) Scalability with distributed computations: A particularity of our study is that we want our models to be able to learn from very large amounts of training data (coming from many hospitals). We typically consider models for which both n and d are large: for instance n > 8·10⁵ and d > 16·10³ when we train models using features found in EAD_1. This is achieved by implementing a distributed version of logistic regression, including a distributed version of the L-BFGS optimization algorithm which we use to solve the aforementioned optimization problem. L-BFGS is known for often achieving faster convergence compared with other first-order optimization techniques [2]. We use a cluster composed of one driver machine and a set of worker machines (reported experiments were conducted with 5 machines, 1 driver and 4 workers, each equipped with two Intel Xeon CPUs (1.90 GHz-2.6 GHz), with 24 to 40 cores, 60-160 GB of RAM, and a 1GB Ethernet network). Each worker machine receives a fraction of the training data set. The driver machine then triggers several rounds of distributed computations, performed independently by the worker machines, until convergence is reached. The software was implemented using the Python programming language and the Apache Spark machine learning library [17].
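As an illustration only (this is not the authors' code, and the input path and column names are assumptions), a similarly weighted, L2-regularized logistic regression can be trained with Spark's built-in estimator, which also relies on an L-BFGS-type solver:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ehr-mortality-lr").getOrCreate()

    # Assumed layout: a sparse "features" column, a binary "label" column and a
    # precomputed "weight" column holding the per-instance weights θ_i.
    train = spark.read.parquet("hdfs:///ehr/train_features.parquet")  # hypothetical path

    lr = LogisticRegression(
        featuresCol="features",
        labelCol="label",
        weightCol="weight",      # compensates class imbalance, as in § II-D2
        regParam=0.5,            # plays the role of λ (Spark's exact scaling differs slightly)
        elasticNetParam=0.0,     # pure L2 penalty, as in R(w) = (1/2)·||w||²
        maxIter=100,
    )
    model = lr.fit(train)        # training is distributed over the worker machines
    print(model.coefficients)    # the interpretable weight vector over the features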
E. Model evaluation and statistical analysis
Patients were randomly split into disjoint train and test
subsets. We perform k-fold cross validation with k = 5 unless
indicated otherwise (k = 10 when indicated). Model accuracy
is reported in terms of several metrics on the (naturally
imbalanced) test set, which is used exclusively for evaluation
purposes. We report on the receiver operating characteristic
(ROC) curves and especially on the area under the ROC
curve (AuROC). For the sake of completeness, we also include
the commonly used Accuracy metric [8]. Since we deal with
highly skewed datasets (as shown in Table I), we also report
on the precision-recall (PR) curves and on the area under the
PR curve (AuPR), in order to give a more complete picture
of the performance of the models [6].
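For illustration, these metrics can be computed on a held-out test set along the following lines (a sketch with scikit-learn, not the authors' code; average_precision_score is used here as a standard estimator of the area under the PR curve):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

    def evaluate(y_true, y_score, threshold=0.5):
        y_pred = (y_score > threshold).astype(int)   # hard decisions at t = 0.5
        return {
            "AuROC": roc_auc_score(y_true, y_score),
            "AuPR": average_precision_score(y_true, y_score),
            "Accuracy": accuracy_score(y_true, y_pred),
        }

    # Toy usage with made-up scores, for illustration only:
    print(evaluate(np.array([0, 0, 1, 0, 1]), np.array([0.1, 0.4, 0.8, 0.2, 0.3])))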
F. Prediction timing
We consider making predictions at different times. First, we
consider making predictions on the first day at the hospital.
We report on corresponding results, for all considered clinical
outcomes, in § III. We then report on how to make new mortal-
ity predictions, day after day, whenever new EHR information
becomes available, and present corresponding results in § IV.
III. RESULTS ON PREDICTIONS ON THE FIRST DAY
On the first day, we consider predictive models built with
different sets of features (that we later combine). We name the
models we consider after the sets of features they rely on. For
example we consider the model EA for making predictions at hospital admission time t_0 (i.e. at the moment when the patient arrives at the hospital). This model uses the elementary features E and the diagnoses A known at admission. We also consider making predictions whenever the set of drugs served on the first day is known (typically at t_0 + 24h). For this purpose, we consider the model ED_1 of [9] that uses elementary features and drugs served on the first day. All the considered models systematically use the elementary features E, so we often omit E in model names in the sequel.
A. Mortality
For predicting in-hospital mortality, AuROC was 77.8% and AuPR was 12.7% with the D_1 model, indicating significant predictive power of the drugs served on the first day (as already known from [9]). Over the total considered population of 1,271,733 patients, 885,241 (∼70%) of them have non-empty admitting diagnosis information at admission time (A ≠ ∅). AuROC was 76.4% and AuPR was 10.9% with the A model, which is aimed to leverage this information for making predictions directly at admission time. This indicates predictive power of the admitting diagnoses as well. It thus makes sense to study how these models could be combined to obtain more accurate predictions for the concerned population of 885,241 patients. We study combinations of the predictions made at admission with predictions made at t_0 + 24h with the knowledge of the set of drugs served on the first day.
More generally, we consider different model combinations:
• we consider models obtained by the flattening and concatenation of features found in several basic models. In the sequel, we denote by C(B_1, B_2, ..., B_n) (or equivalently by B_1 B_2 ... B_n) the single model obtained from the concatenation of the features used in the basic models B_1, B_2, ..., B_n. For instance we consider the model C(A, D_1) in which all the features found in A and D_1 are concatenated.
• we also use ensemble techniques and in particular the stacking technique [7] to create combined models. The advantage of using logistic regressions as basic models to be combined with the stacking technique is that we can reuse not only their predictions, but also their raw output probabilities (which are more precise, as pointed out in § II-D2) as features for the meta-model. In the sequel, we denote by S(B_1, B_2, ..., B_n) the meta-model obtained from the raw probabilities of the basic models B_1, B_2, ..., B_n with the stacking technique. For example, we consider the model S(A, D_1) built from the stacking of the two models A and D_1.
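As a compact, single-machine sketch of the stacking construction S(A, D_1) (illustrative only; the authors' implementation is distributed, and out-of-fold probabilities would typically be used in practice to limit overfitting of the meta-model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stacked(X_A, X_D1, y):
        base_A = LogisticRegression(max_iter=1000).fit(X_A, y)    # basic model A
        base_D1 = LogisticRegression(max_iter=1000).fit(X_D1, y)  # basic model D_1
        # meta-features: the raw output probabilities of the basic models
        Z = np.column_stack([base_A.predict_proba(X_A)[:, 1],
                             base_D1.predict_proba(X_D1)[:, 1]])
        meta = LogisticRegression().fit(Z, y)                     # the meta-model S(A, D_1)
        return base_A, base_D1, meta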
Table II gives an overview of the AuROC, Accuracy and
AuPR obtained with the basic and combined models con-
sidered, on the same population, having admitting diagnosis
information. Table II indicates the average, minimum and
maximum values of each metric obtained with a 5-fold cross-
validation process.
TABLE II: Mortality risk predictions on the first day.
Model      AuROC %            Accuracy %         AuPR %
A          76.4 (76.0-76.8)   65.4 (65.2-65.6)   10.9 (10.6-11.2)
D_1 [9]    77.4 (77.2-77.7)   74.5 (74.5-74.8)   12.3 (12.0-12.5)
S(A,D_1)   80.1 (79.9-80.2)   69.2 (68.9-69.4)   14.0 (13.5-14.3)
C(A,D_1)   80.4 (80.2-80.7)   75.3 (75.2-75.5)   14.2 (13.8-14.6)
We observe that the combined models yield significantly more accurate predictions than the basic ones, improving over comparable earlier works. For predicting inpatient mortality, with the AD_1 model AuROC was 80.4% and AuPR was 14.2%, compared to respectively 77.4% and 12.3% obtained with the D_1 model of [9].
Figure 3 presents the ROC curve obtained for a run of the C(A, D_1) model on a given train and test set. The PR curve is shown in Figure 4. Table III presents the sizes of the train and test sets, and Table IV presents the confusion matrix and associated metrics.
TABLE III: Number of instances for train and test sets.
Mortality case study Train set Test set
Total size 708,373 176,868
Positive instances 22,660 5,576
Negative instances 685,713 171,292
Fig. 3: ROC curve for mortality prediction at t_0 + 24h (axes: False Positive Rate vs. True Positive Rate; AuROC = 80.7%).
B. HAI, ICU admission, and pressure ulcers
Table V presents results obtained when predicting all the other considered clinical outcomes using the C(A, D_1) model. To the best of our knowledge, our models outperform state-of-the-art interpretable models found in the literature for
Citations
Proceedings ArticleDOI
08 Apr 2021
TL;DR: In this paper, a bandwidth-efficient privacy-preserving federated learning that provides theoretical privacy guarantees based on differential privacy is proposed for in-hospital mortality prediction using a real dataset, containing Electronic Health Records of about one million patients.
Abstract: Machine Learning, and in particular Federated Machine Learning, opens new perspectives in terms of medical research and patient care. Although Federated Machine Learning improves over centralized Machine Learning in terms of privacy, it does not provide provable privacy guarantees. Furthermore, Federated Machine Learning is quite expensive in term of bandwidth consumption as it requires participant nodes to regularly exchange large updates. This paper proposes a bandwidth-efficient privacy-preserving Federated Learning that provides theoretical privacy guarantees based on Differential Privacy. We experimentally evaluate our proposal for in-hospital mortality prediction using a real dataset, containing Electronic Health Records of about one million patients. Our results suggest that strong and provable patient-level privacy can be enforced at the expense of only a moderate loss of prediction accuracy.

24 citations

Journal ArticleDOI
TL;DR: Early warning tools identify patients at risk of deterioration in general wards in hospitals and provide real-time, dynamic risk estimates as discussed by the authors, however, despite relative progress in the development of algorithms to predict patient deterioration, the literature has not shown that the deployment or implementation of such algorithms is reproducibly associated with improvements in patient outcomes.
Abstract: Background: Early warning tools identify patients at risk of deterioration in hospitals. Electronic medical records in hospitals offer real-time data and the opportunity to automate early warning tools and provide real-time, dynamic risk estimates. Objective: This review describes published studies on the development, validation, and implementation of tools for predicting patient deterioration in general wards in hospitals. Methods: An electronic database search of peer reviewed journal papers from 2008-2020 identified studies reporting the use of tools and algorithms for predicting patient deterioration, defined by unplanned transfer to the intensive care unit, cardiac arrest, or death. Studies conducted solely in intensive care units, emergency departments, or single diagnosis patient groups were excluded. Results: A total of 46 publications were eligible for inclusion. These publications were heterogeneous in design, setting, and outcome measures. Most studies were retrospective studies using cohort data to develop, validate, or statistically evaluate prediction tools. The tools consisted of early warning, screening, or scoring systems based on physiologic data, as well as more complex algorithms developed to better represent real-time data, deal with complexities of longitudinal data, and warn of deterioration risk earlier. Only a few studies detailed the results of the implementation of deterioration warning tools. Conclusions: Despite relative progress in the development of algorithms to predict patient deterioration, the literature has not shown that the deployment or implementation of such algorithms is reproducibly associated with improvements in patient outcomes. Further work is needed to realize the potential of automated predictions and update dynamic risk estimates as part of an operational early warning system for inpatient deterioration.

22 citations

Posted Content
TL;DR: Compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy and it is shown experimentally that this privacy-preserving proposal can reduce the communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.
Abstract: Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients. Differential Privacy is used to obtain theoretically sound privacy guarantees against such inference attacks by noising the exchanged update vectors. However, the added noise is proportional to the model size which can be very large with modern neural networks. This can result in poor model quality. In this paper, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy. We show experimentally, using 2 datasets, that our privacy-preserving proposal can reduce the communication costs by up to 95% with only a negligible performance penalty compared to traditional non-private federated learning schemes.

12 citations


Cites background or methods from "Scalable and Interpretable Predicti..."

  • ...We used EHR data from the Premier healthcare database which is one of the largest clinical databases in the United States, collecting information from millions of patients over a period of 12 months from 415 hospitals in the USA [30]....

  • ...These hospitals are supposedly representative of the United States hospital experience [30]....

  • ...As commonly found in the literature [30], for such predictions, we focus on hospital admissions of adults hospitalized for at least 3 days, excluding elective admissions....

  • ...The ability to accurately predict the risks in the patient's perspectives of evolution is a crucial prerequisite in order to adapt the care that certain patients receive [30]....

Proceedings ArticleDOI
06 Sep 2021
TL;DR: In this article, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy, which can reduce communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.
Abstract: Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients. Differential Privacy is used to obtain theoretically sound privacy guarantees against such inference attacks by noising the exchanged update vectors. However, the added noise is proportional to the model size which can be very large with modern neural networks. This can result in poor model quality. In this paper, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy. We show experimentally, using 2 datasets, that our privacy-preserving proposal can reduce the communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.

6 citations

Journal ArticleDOI
01 Nov 2020
TL;DR: It is proved that producing a representative cohort trajectory is NP-complete with a reduction in the multiple sequence alignment problem, and a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences is proposed.
Abstract: The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. The focus of existing work has been largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold: First, we formalize cohort representation as the problem of aggregating the trajectories of its patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete with a reduction in the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve cohort representation efficiency, we introduce “trajectory families” and “stratified sampling.” Our second contribution is formalizing the problem of cohort exploration as finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the potential number of similar cohorts is huge. We prove NP-completeness with a reduction in the maximum edge subgraph problem. To address complexity, we develop a multi-staged approach based on limiting the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets” that are inspired from the double dictionary encoding proposed for keyword search. Moreover, we explore the usefulness and efficiency of Core using an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. Moreover, we show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability on large and heterogeneous cohorts. We also show the benefit of event sets for cohort exploration in providing interactive performance.

3 citations

References
Journal ArticleDOI
TL;DR: The method of classifying comorbidity provides a simple, readily applicable and valid method of estimating risk of death fromComorbid disease for use in longitudinal studies and further work in larger populations is still required to refine the approach.

39,961 citations

Journal ArticleDOI
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...For the sake of completeness, we also include the commonly used Accuracy metric [8]....


Book ChapterDOI
21 Jun 2000
TL;DR: Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
Abstract: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.

5,679 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...• we also use ensemble techniques and in particular the stacking technique [7] to create combined models....


Proceedings ArticleDOI
25 Jun 2006
TL;DR: It is shown that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.
Abstract: Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.

5,063 citations


Additional excerpts

  • ...Since we deal with highly skewed datasets (as shown in Table I), we also report on the precision-recall (PR) curves and on the area under the PR curve (AuPR), in order to give a more complete picture of the performance of the models [6]....


Posted Content
TL;DR: It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.
Abstract: We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros ("nonevents"). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all variable events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.

3,170 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...Notice that we do not use downsampling (that would drastically reduce the set of negative instances for the purpose of rebalancing classes); instead we apply the weighting technique [13] that allows our models to learn from all instances of imbalanced training sets....


Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Scalable and interpretable predictive models for electronic health records"?

The authors consider the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients. The authors study the question of making predictions on the first day at the hospital, and of making updated mortality predictions day after day during the patient's stay. The authors test their analyses with more than one million patients from hundreds of hospitals, and report on the lessons learned from these experiments.

One perspective for further work would be to study to which extent the system generalizes for predicting other clinical outcomes such as long lengths of stay and hospital readmissions. 

For instance, the authors retained the 100 most important weights corresponding to A features (with the most significant absolute value) obtained for each random training set. 

This is one reason why simple linear models (such as logistic regression) might be preferred over DNNs even when their accuracy is significantly lower, as detailed in [4]. 

The models proposed in [20] typically achieve areas under the ROC curve within the 0.79-0.89 range for mortality prediction at admission on their dataset (they do not report on AuPR nor on accuracy though). 

It is also worth noticing that the results reported in [20] are achieved with an important sacrifice: a major drawback of DNNs is their lack of interpretability, as notoriously known.