Proceedings ArticleDOI

Scalable and Interpretable Predictive Models for Electronic Health Records

TL;DR: This work considers the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients, and develops distributed models that are scalable and interpretable.
Abstract: Early identification of patients at risk of developing complications during their hospital stay is currently one of the most challenging issues in healthcare. Complications include hospital-acquired infections, admissions to intensive care units, and in-hospital mortality. Being able to accurately predict the patients' outcomes is a crucial prerequisite for tailoring the care that certain patients receive, if it is believed that they will do poorly without additional intervention. We consider the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients. We study the question of making predictions on the first day at the hospital, and of making updated mortality predictions day after day during the patient's stay. We develop distributed models that are scalable and interpretable. Key insights include analysing diagnoses known at admission and drugs served, which evolve during the hospital stay. We leverage a distributed architecture to learn interpretable models from training datasets of gigantic size. We test our analyses with more than one million patients from hundreds of hospitals, and report on the lessons learned from these experiments.

Summary (5 min read)

I. INTRODUCTION

  • One major expectation of data science in healthcare is the ability to leverage digitized health information and computer systems to better understand and improve care.
  • The availability of EHR data opens the way to the development of quantitative models for patients that can be used to predict health status, as well as to help prevent disease, adverse effects, and ultimately death.
  • These approaches often trade some model interpretability for more predictive accuracy.
  • The authors consider complication risk prediction and focus on two aspects of this problem: (i) how to make accurate predictions with interpretable models; and (ii) how to take into account evolving clinical information during hospital stay.
  • The rest of the paper is organized as follows: the authors first present the data and methods used in § II.

A. Data source

  • The authors used EHR data from the Premier healthcare database which is one of the largest clinical databases in the United States, gathering information from millions of patients over a period of 12 months from 417 hospitals in the USA [19] .
  • These hospitals are believed to be broadly representative of the United States hospital experience.
  • The database contains hospital discharge files that are dated records of all billable items (including therapeutic and diagnostic procedures, medication, and laboratory usage) which are all linked to a given admission [15] .
  • The authors focus on hospital admissions of adults hospitalized for at least 3 days, excluding elective admissions.
  • The snapshot of the database used in their work comprises the EHR data of 1,271,733 hospital admissions.

B. Outcomes

  • Patients who experienced a given outcome are considered positive cases for this outcome; those who did not are considered negative cases.
  • Table I presents the distribution of patients with respect to the considered outcomes.

C. Preparing the data for supervised learning

  • The authors' methodology assumes no a priori clinical knowledge.
  • The authors' models also use the list of admitting diagnoses known for a given patient as available in the EHR data at admission (encoded with ICD-9-CM identifiers), which the authors denote by A. Procedures can be performed during the hospital stay.
  • The authors filter out unused procedures and drugs, and use a perfect hash function to encode the features.
  • A small proportion of patients receive procedures during their stay (∼ 20% of patients receive procedures on the first day).
  • On the first day of stay, a patient is served 8.6 drugs on average.

D. Development of models

  • Following [4], the authors pay specific attention to the interpretability (or "intelligibility") of the predictive models they develop.
  • Accurate models such as deep neural nets and random forests are usually not interpretable, but more interpretable models such as logistic regression are usually less accurate.
  • y_i ∈ {0, 1} are the corresponding labels, which the authors want to predict (e.g. for the mortality case study, 0 means the patient survived and 1 means the patient died at the hospital), and R(w) is the regularizer that controls the complexity of the model.
  • The classes the authors consider are heavily imbalanced (as shown in Table I ): in-hospital death for instance can be considered as a rare event.
  • Scalability is achieved by implementing a distributed version of logistic regression, including a distributed version of the L-BFGS optimization algorithm, which the authors use to solve the aforementioned optimization problem.
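As a rough worked example of the class-imbalance weighting (using the mortality counts from Table I and the formula θ_i = τ·y_i + (1−τ)·(1−y_i) given in § II-D2 of the paper): the fraction of negative instances is τ = 857,005 / 885,241 ≈ 0.968, so each death (positive instance) receives weight θ_i ≈ 0.968 while each survivor receives θ_i ≈ 0.032; an error on a rare positive instance therefore weighs roughly 30 times more than an error on a negative one.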

E. Model evaluation and statistical analysis

  • Patients were randomly split into disjoint train and test subsets.
  • Model accuracy is reported in terms of several metrics on the (naturally imbalanced) test set, which is used exclusively for evaluation purposes.
  • The authors report on the receiver operating characteristic (ROC) curves and especially on the area under the ROC curve (AuROC).
  • For the sake of completeness, the authors also include the commonly used Accuracy metric [8] .
  • Since the authors deal with highly skewed datasets (as shown in Table I ), they also report on the precision-recall (PR) curves and on the area under the PR curve (AuPR), in order to give a more complete picture of the performance of the models [6] .

F. Prediction timing

  • The authors consider making predictions at different times.
  • First, the authors consider making predictions on the first day at the hospital.
  • The authors report on corresponding results, for all considered clinical outcomes, in § III.
  • The authors then report on how to make new mortality predictions, day after day, whenever new EHR information becomes available, and present corresponding results in § IV.

III. RESULTS ON PREDICTIONS ON THE FIRST DAY

  • On the first day, the authors consider predictive models built with different sets of features (that they later combine).
  • The authors name the models they consider after the sets of features they rely on.
  • For example the authors consider the model EA for making predictions at hospital admission time t_0 (i.e. at the moment when the patient arrives at the hospital).
  • This model uses the elementary features E and the diagnoses A known at admission.
  • The authors also consider making predictions whenever the set of drugs served on the first day is known (typically at t_0 + 24h).

A. Mortality

  • For predicting in-hospital mortality, AuROC was 77.8% and AuPR was 12.7% with the D_1 model, indicating significant predictive power of the drugs served on the first day (as already known from [9]).
  • The authors also consider the concatenated model C(A, D_1), in which all the features found in A and D_1 are concatenated.
  • The authors also use ensemble techniques and in particular the stacking technique [7] to create combined models.
  • Table II gives an overview of the AuROC, Accuracy and AuPR obtained with the basic and combined models considered, on the same population, having admitting diagnosis information.
  • This suggests that classifiers trained from large amounts of diagnoses and drugs served found in EHR data can produce valid predictions across a variety of clinical outcomes (not only mortality) on the first day at the hospital.

C. Benefits of interpretability and explainability of predictions

  • The authors investigate the stability and the consistency of the models when learned with different training sets.
  • For this purpose, the authors study to what extent the logistic regression weights vary when radically different training sets are randomly picked.
  • A systematic pairwise comparison of the lists of topmost weights for each run showed that the lowest proportion of common weights between two runs was 90%.
  • Table VI presents an excerpt of the most important weights in the logistic regression model along with their ranking and their impact (positive/negative) on the outcome.
  • The clinical interpretation is beyond the scope of this paper, but the point is that their model allows this vector to be given to medical experts for further clinical research.
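As an illustration of this stability check (a sketch with hypothetical variable names, not the authors' code), the overlap between the topmost weights of two models trained on different random training sets could be computed as follows:

    import numpy as np

    def top_features(weights, k=100):
        # indices of the k features with the largest absolute weights
        return set(np.argsort(-np.abs(weights))[:k])

    def overlap(weights_run_1, weights_run_2, k=100):
        common = top_features(weights_run_1, k) & top_features(weights_run_2, k)
        return len(common) / float(k)

    # e.g. an overlap of 0.90 or more between any two runs would match the
    # consistency reported above (weights_run_1 and weights_run_2 being the
    # coefficient vectors of two independently trained logistic regressions).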

IV. RESULTS WITH EVOLVING DATA

  • The authors consider making inpatient mortality predictions on a daily basis, while taking into account new clinical information becoming available since admission.
  • The authors investigate interpretable models that predict on day k using data available up to that day.

A. Preliminary observations

  • Figure 5 gives insights on the number of patients remaining hospitalized at a certain day (no matter how long they stay).
  • For each day i, it illustrates the subset of patients who have at least one drug served on that day (i.e. for which D_i ≠ ∅), and the subset of patients who have at least one procedure on that day (i.e. for which P_i ≠ ∅), respectively.
  • The vast majority of patients (more than 99.8%) are served drugs during their stay whereas only a small proportion of the population receive new procedures.
  • In particular, the authors created separate models using E and P_i as features for each day i; but their combinations with ensemble techniques did not yield any significant improvement in prediction accuracy over the global population.
  • The authors did not obtain significant improvements when restricting to the patients having new procedures on the last day either.

B. Daily mortality predictions

  • For making predictions on a certain day k, the authors consider a variety of models built from different sets of features, which they combine with ensemble techniques (in a similar manner as for the first day, except that the set of basic models is now much richer, as they can consider various models and several days).
  • To avoid running into the curse of dimensionality, the authors define a threshold for the maximum acceptable ratio between the number of features and the number of training instances (the latter decreases for higher values of k, as shown in Figure 5).
  • The authors arbitrarily set this ratio to 10, which allows them to conduct analyses with sliding windows until the 6th day.
  • This raises the question of how much historical data (since admission) is worth considering for making predictions, i.e., of identifying tradeoffs between predictive accuracy and model complexity. Results suggest that [D_k] models provide an interesting tradeoff (between accuracy and complexity) for predicting on day k, compared to all the other models. A sketch of how such day-k features can be assembled is given after this list.
  • The authors observe that for the majority of patients, the set of drugs served tends to change only slightly from one day to the next, a majority of drugs being continuously served day after day.
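A minimal sketch of how a [D_k] feature vector could be assembled for day-k predictions (the data layout, field names and indexing are assumptions for illustration; the paper's actual pipeline is distributed and uses hashed sparse features):

    def day_k_features(days_to_drugs, k, drug_index):
        # days_to_drugs: dict mapping day number -> list of (drug_code, quantity) pairs
        # drug_index: dict mapping drug_code -> feature column
        x = [0.0] * len(drug_index)
        for code, qty in days_to_drugs.get(k, []):
            if code in drug_index:
                x[drug_index[code]] += float(qty)
        return x

    # A sliding-window variant would aggregate the days k-w+1 .. k instead of day k only.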

C. Discussion

  • All the predictions the authors made in this section consider drugs-served data from the first day onwards, but do not take into account the admitting diagnoses A known at admission (as opposed to § III).
  • For patients having an admitting diagnosis (∼ 70% of the overall population), the authors study combinations of the predictions made at admission (from § III) with predictions made later during the stay.
  • This suggests that data known at admission still helps in improving the accuracy of mortality predictions made at a later stage during the hospital stay.
  • The authors observe that the most important weight is associated with the A basic model.
  • The authors observe that the AuROC of the predictions made with A decreases over time with the remaining population (as also illustrated by the blue line in Figure 8 ).

VI. CONCLUSION AND PERSPECTIVES

  • The authors develop a distributed supervised machine learning system for predicting clinical outcomes based on EHR data.
  • The authors propose interpretable models, based on the analysis of admitting diagnoses and drugs served during the hospital stay.
  • The authors' models can be used to make predictions concerning the risk of hospital-acquired infections, pressure ulcers, and inpatient mortality.
  • The authors use a distributed implementation to train models on millions of patient profiles.
  • The authors report on lessons learned with a large-scale experimental study with real data from US hospitals.


HAL Id: hal-01877742
https://hal.inria.fr/hal-01877742
Submitted on 20 Sep 2018
Scalable and Interpretable Predictive Models for
Electronic Health Records
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson
To cite this version:
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson. Scalable and Interpretable Predictive Models for Electronic Health Records. DSAA 2018 - 5th IEEE International Conference on Data Science and Advanced Analytics, Oct 2018, Turin, Italy. pp.1-10. ⟨hal-01877742⟩

Scalable and Interpretable Predictive Models for
Electronic Health Records
Amela Fejza, Pierre Genevès, Nabil Layaïda, Jean-Luc Bosson
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
{amela.fejza, pierre.geneves, nabil.layaida}@inria.fr
Univ. Grenoble Alpes, CNRS, Public Health department CHU Grenoble Alpes,
Grenoble INP, TIMC-IMAG, 38000 Grenoble, France
JLBosson@chu-grenoble.fr
Abstract—Early identification of patients at risk of developing
complications during their hospital stay is currently one of the
most challenging issues in healthcare. Complications include
hospital-acquired infections, admissions to intensive care units,
and in-hospital mortality. Being able to accurately predict the
patients’ outcomes is a crucial prerequisite for tailoring the care
that certain patients receive, if it is believed that they will do
poorly without additional intervention. We consider the problem
of complication risk prediction, such as inpatient mortality, from
the electronic health records of the patients. We study the
question of making predictions on the first day at the hospital,
and of making updated mortality predictions day after day
during the patient’s stay. We develop distributed models that
are scalable and interpretable. Key insights include analysing
diagnoses known at admission and drugs served, which evolve
during the hospital stay. We leverage a distributed architecture
to learn interpretable models from training datasets of gigantic
size. We test our analyses with more than one million patients
from hundreds of hospitals, and report on the lessons learned
from these experiments.
I. INTRODUCTION
One major expectation of data science in healthcare is
the ability to leverage digitized health information and
computer systems to better understand and improve care. Over
the past few years the adoption of electronic health records
(EHRs) in hospitals has surged to an unprecedented level.
In the USA for example, more than 84% of hospitals have
adopted a basic EHR system, up from only 15% in 2010
[1], [12]. The availability of EHR data opens the way to the
development of quantitative models for patients that can be
used to predict health status, as well as to help prevent disease,
adverse effects, and ultimately death.
We consider the problem of predicting important clinical
outcomes such as inpatient mortality, based on EHR data. This
raises many challenges including dealing with the very high
number of potential predictor variables in EHRs. Traditional
approaches have overcome this complexity by extracting only
a very limited number of considered variables [5], [14]. These
approaches basically trade predictive accuracy for simplicity
and feasibility of model implementation. Other approaches have dealt with this complexity by developing black box machine learning models that retain predictor variables from a large set of possible inputs, especially with deep learning [3], [18], [20], [22]. These approaches often trade some model interpretability for more predictive accuracy.
(This research was partially supported by the ANR project CLEAR (ANR-16-CE25-0010).)
Predictive accuracy is crucial as wrong predictions might
have critical consequences. False positives might overwhelm
the hospital staff, and false negatives can fail to trigger
important alarms, exposing patients to poor clinical outcomes.
However, model interpretability is essential as it allows physi-
cians to get better insights on the factors that influence
the predictions, understand, edit and fix predictive models
when needed [4]. The search for tradeoffs between predictive
accuracy and model interpretability is challenging.
We consider complication risk prediction and focus on two
aspects of this problem: (i) how to make accurate predictions
with interpretable models; and (ii) how to take into account
evolving clinical information during hospital stay. Our main
contributions are the following:
• we show that with interpretable models it is possible to make accurate risk predictions, based on data concerning admitting diagnoses and drugs served on the first day.
• we further develop mortality risk prediction models to make updated predictions when new clinical information becomes available during hospitalization; in particular we analyze the evolution of drugs served.
• we report on lessons learned through practical experiments with real EHR data from more than one million patients admitted to US hospitals, which is, to the best of our knowledge, one of the largest such experimental studies conducted so far.
Outline: The rest of the paper is organized as follows:
we first present the data and methods used in § II. In § III we
present results obtained when making predictions of clinical
outcomes on the first day at the hospital. In § IV we investigate
to which extent the predictive models can benefit from the
availability of supplemental information becoming available
during the hospital stay to make updated predictions. We
finally review related works in § V before concluding in § VI.

II. METHODS
A. Data source
We used EHR data from the Premier healthcare database
which is one of the largest clinical databases in the United
States, gathering information from millions of patients over
a period of 12 months from 417 hospitals in the USA [19].
These hospitals are believed to be broadly representative of
the United States hospital experience. The database contains
hospital discharge files that are dated records of all billable
items (including therapeutic and diagnostic procedures, med-
ication, and laboratory usage) which are all linked to a given
admission [15]. We focus on hospital admissions of adults
hospitalized for at least 3 days, excluding elective admissions.
The snapshot of the database used in our work comprises the
EHR data of 1,271,733 hospital admissions.
B. Outcomes
For a given patient, we consider the problem of predicting the occurrence of several important clinical outcomes:
• death: in-hospital mortality, defined as a discharge disposition of “expired” [9], [20];
• hospital-acquired infections (HAI) developed during the stay [21];
• admissions to intensive care unit (ICU) on or after the second day, excluding direct admissions on the first day;
• pressure ulcers (PU) developed during the stay (not present at admission).
Patients who experienced a given outcome are considered
positive cases for this outcome; those who did not are consid-
ered negative cases. Table I presents the distribution of patients
with respect to the considered outcomes.
TABLE I: Number of instances for each case study.
Problem studied Positive cases Negative cases Ratio
Mortality 28,236 857,005 3.29%
HAI 22,402 862,839 2.59%
ICU Admission 32,310 852,931 3.78%
Pressure Ulcers 23,742 861,499 2.75%
C. Preparing the data for supervised learning
Our methodology assumes no a priori clinical knowledge.
For a given patient, we first extract a list E of elementary
features including the age, gender, and admission type. Our
models also use the list of admitting diagnoses known for a given patient as available in the EHR data at admission, which we denote by A (admitting diagnoses are encoded as unique identifiers from the International Classification of Diseases, Ninth Revision, Clinical Modification, known as ICD-9-CM). Procedures can be performed during the hospital stay. We denote the list of procedures performed on the i-th day of the stay (with i > 0) by P_i. We also consider the lists of drugs served, on a daily basis: D_i denotes the list of drug names (and their quantities) served on the i-th day.
We filter out unused procedures and drugs, and use a perfect
hash function to encode the features. The feature matrix is very
sparse so in the implementation we use a sparse representation
of feature vectors. Most patients are admitted at the hospital
with at least one admitting diagnosis (among 5,094 possible
diagnoses). A small proportion of patients receive procedures
during their stay (∼20% of patients receive procedures on the
first day). The total number of possible procedures is 11,338.
Furthermore, during the stay, a total of 10,739 possible drugs
can be served. On the first day of stay, a patient is served
8.6 drugs on average. Figure 1 shows the distribution of the
considered population in terms of the number of drugs received
on the first day. Figure 2 shows an excerpt of the data for a
sample patient.

Fig. 1: Population distribution in terms of the number of drugs received on the first day (|D_1|).
[(434456800,
(82, ’M’, 1,
[’A(0)’: [(’578.1’, ’A’, ’999’, 999)]],
[’D(1)’: [(’250258001120000’, ’2’),
(’460460947620000’, ’1’),
(’380381000310000’, ’2’),
(’300305857300000’, ’1’),
(’300305850250000’, ’1’),
...
(’250250043500000’, ’1’)]
],
[’D(2)’: [(’380381000310000’, ’1’),
(’320320721000000’, ’1’),
(’300301825750000’, ’1’),
(’250257025740000’, ’1.85’),
...
(’250250052970000’, ’1’)]
]))]
Fig. 2: Data excerpt for an 82-year-old male patient who was discharged alive on the third day.
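As an illustration of this encoding step (a sketch only: the paper uses a perfect hash over the retained codes, whereas this sketch builds an ordinary index from the training data, and the field names are assumptions):

    from scipy.sparse import csr_matrix

    def build_index(retained_codes):
        # one column per distinct diagnosis or drug code kept after filtering
        return {code: j for j, code in enumerate(sorted(retained_codes))}

    def encode_first_day(patient, index):
        # elementary features E (age, gender, admission type), then A and D_1 codes
        cols = [0, 1, 2]
        vals = [float(patient["age"]), float(patient["gender"]), float(patient["adm_type"])]
        for code in patient["A"]:                  # admitting diagnosis codes
            if code in index:
                cols.append(3 + index[code])
                vals.append(1.0)
        for code, qty in patient["D1"]:            # (drug code, quantity) pairs, first day
            if code in index:
                cols.append(3 + index[code])
                vals.append(float(qty))
        # sparse row vector, as in the paper's sparse feature representation
        return csr_matrix((vals, ([0] * len(cols), cols)), shape=(1, 3 + len(index)))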
D. Development of models
1) Interpretability: Following [4], we pay specific attention
to the interpretability of the predictive models we develop.
Model interpretability (or “intelligibility” as found in [4])
refers to the ability to understand, validate, edit, and trust

a learned model, which is particularly important in critical
applications in healthcare such as the one we consider here.
Accurate models such as deep neural nets and random forests
are usually not interpretable, but more interpretable models
such as logistic regression are usually less accurate. This
often imposes a tradeoff between accuracy and interpretability.
We choose to preserve interpretability and develop classifiers
based on logistic regression. Advantages of logistic regres-
sion include yielding insights on the factors that influence
the predictions, such as an interpretable vector of weights
associated to features, and predictions that can be interpreted
as probabilities.
2) Mathematical formulation: Logistic regression can be formulated as the optimization problem min_{w ∈ ℝ^d} f(w), in which the objective function is of the form:

f(w) = λ·R(w) + (1/n) · Σ_{i=1}^{n} θ_i · L(w; x_i, y_i)

where n is the number of instances in the training set, and for 1 ≤ i ≤ n:
• w is the vector of weights we are looking for.
• the vectors x_i ∈ ℝ^d are the instances of the training data set: each vector x_i is composed of the d values corresponding to the features retained for a given admission.
• y_i ∈ {0, 1} are their corresponding labels, which we want to predict (e.g. for the mortality case study, 0 means the patient survived and 1 means the patient died at the hospital).
• R(w) is the regularizer that controls the complexity of the model. For the purpose of favoring simple models and avoiding overfitting, in the reported experiments we used R(w) = (1/2)·||w||₂².
• λ is the regularization parameter that defines the tradeoff between the two goals of minimizing the loss (i.e., the training error) and minimizing model complexity (i.e., avoiding overfitting). In the reported experiments we used λ = 1/2.
• θ_i is the weight factor that we use to compensate for class imbalance. The classes we consider are heavily imbalanced (as shown in Table I): in-hospital death, for instance, can be considered as a rare event. Notice that we do not use downsampling (which would drastically reduce the set of negative instances for the purpose of rebalancing classes); instead we apply the weighting technique [13] that allows our models to learn from all instances of imbalanced training sets. θ_i is thus in charge of adjusting the impact of the error associated to each instance proportionally to class imbalance: θ_i = τ·y_i + (1−τ)·(1−y_i), where τ is the fraction of negative instances in the training set.
• the loss function L measures the error of the model on the training data set; we use the logistic loss:

L(w; x_i, y_i) = ln(1 + e^{(1 − 2·y_i)·wᵀx_i})
Given a new instance x of the test data set, the model makes a prediction by applying the logistic function:

f(z) = 1 / (1 + e^{−z})

where z = wᵀx. The raw output f(z) has a probabilistic interpretation: the probability that x is positive. In the sequel we rely on this probability to build further models (in particular using the stacking technique: see the meta-model built from such probabilities in § IV). We also use the common threshold t = 0.5 such that if f(z) > t, the outcome is predicted as positive (and negative otherwise); for computing ROC curves, we make t vary in [0, 1].
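For concreteness, the following minimal sketch (not the authors' implementation) evaluates this weighted, regularized objective and the logistic prediction with NumPy, for a dense feature matrix X of shape n×d and a binary label vector y:

    import numpy as np

    def objective(w, X, y, lam=0.5):
        tau = 1.0 - y.mean()                       # fraction of negative instances
        theta = tau * y + (1.0 - tau) * (1.0 - y)  # per-instance weights θ_i
        margins = (1.0 - 2.0 * y) * (X @ w)        # (1 − 2·y_i)·wᵀx_i
        losses = np.log1p(np.exp(margins))         # logistic loss of each instance
        return lam * 0.5 * w.dot(w) + np.mean(theta * losses)

    def predict_proba(w, X):
        return 1.0 / (1.0 + np.exp(-(X @ w)))      # probability that the outcome is positive

In the paper, the minimization of this objective is carried out with a distributed L-BFGS; on a single machine, any off-the-shelf optimizer (e.g. scipy.optimize.minimize) could be applied to this function instead.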
3) Scalability with distributed computations: A particularity of our study is that we want our models to be able to learn from very large amounts of training data (coming from many hospitals). We typically consider models for which both n and d are large: for instance n > 8·10⁵ and d > 16·10³ when we train models using features found in EAD_1. This is achieved by implementing a distributed version of logistic regression, including a distributed version of the L-BFGS optimization algorithm which we use to solve the aforementioned optimization problem. L-BFGS is known for often achieving faster convergence compared with other first-order optimization techniques [2]. We use a cluster composed of one driver machine and a set of worker machines (reported experiments were conducted with 5 machines, 1 driver and 4 workers, each equipped with two Intel Xeon CPUs (1.90 GHz-2.6 GHz), with 24 to 40 cores, 60-160 GB of RAM, and a 1GB Ethernet network). Each worker machine receives a fraction of the training data set. The driver machine then triggers several rounds of distributed computations, performed independently by the worker machines, until convergence is reached. The software was implemented using the Python programming language and the Apache Spark machine learning library [17].
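As an illustration only (this is not the authors' code, and the input path and column names are assumptions), a similarly weighted, L2-regularized logistic regression can be trained with Spark's built-in estimator, which also relies on an L-BFGS-type solver:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ehr-mortality-lr").getOrCreate()

    # Assumed layout: a sparse "features" column, a binary "label" column and a
    # precomputed "weight" column holding the per-instance weights θ_i.
    train = spark.read.parquet("hdfs:///ehr/train_features.parquet")  # hypothetical path

    lr = LogisticRegression(
        featuresCol="features",
        labelCol="label",
        weightCol="weight",      # compensates class imbalance, as in § II-D2
        regParam=0.5,            # plays the role of λ (Spark's exact scaling differs slightly)
        elasticNetParam=0.0,     # pure L2 penalty, as in R(w) = (1/2)·||w||²
        maxIter=100,
    )
    model = lr.fit(train)        # training is distributed over the worker machines
    print(model.coefficients)    # the interpretable weight vector over the features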
E. Model evaluation and statistical analysis
Patients were randomly split into disjoint train and test
subsets. We perform k-fold cross validation with k = 5 unless
indicated otherwise (k = 10 when indicated). Model accuracy
is reported in terms of several metrics on the (naturally
imbalanced) test set, which is used exclusively for evaluation
purposes. We report on the receiver operating characteristic
(ROC) curves and especially on the area under the ROC
curve (AuROC). For the sake of completeness, we also include
the commonly used Accuracy metric [8]. Since we deal with
highly skewed datasets (as shown in Table I), we also report
on the precision-recall (PR) curves and on the area under the
PR curve (AuPR), in order to give a more complete picture
of the performance of the models [6].
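For illustration, these metrics can be computed on a held-out test set along the following lines (a sketch with scikit-learn, not the authors' code; average_precision_score is used here as a standard estimator of the area under the PR curve):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

    def evaluate(y_true, y_score, threshold=0.5):
        y_pred = (y_score > threshold).astype(int)   # hard decisions at t = 0.5
        return {
            "AuROC": roc_auc_score(y_true, y_score),
            "AuPR": average_precision_score(y_true, y_score),
            "Accuracy": accuracy_score(y_true, y_pred),
        }

    # Toy usage with made-up scores, for illustration only:
    print(evaluate(np.array([0, 0, 1, 0, 1]), np.array([0.1, 0.4, 0.8, 0.2, 0.3])))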
F. Prediction timing
We consider making predictions at different times. First, we
consider making predictions on the first day at the hospital.
We report on corresponding results, for all considered clinical
outcomes, in § III. We then report on how to make new mortal-
ity predictions, day after day, whenever new EHR information
becomes available, and present corresponding results in § IV.
III. RESULTS ON PREDICTIONS ON THE FIRST DAY
On the first day, we consider predictive models built with
different sets of features (that we later combine). We name the
models we consider after the sets of features they rely on. For
example we consider the model EA for making predictions at hospital admission time t_0 (i.e. at the moment when the patient arrives at the hospital). This model uses the elementary features E and the diagnoses A known at admission. We also consider making predictions whenever the set of drugs served on the first day is known (typically at t_0 + 24h). For this purpose, we consider the model ED_1 of [9] that uses elementary features and drugs served on the first day. All the considered models systematically use the elementary features E, so we often omit E in model names in the sequel.
A. Mortality
For predicting in-hospital mortality, AuROC was 77.8% and AuPR was 12.7% with the D_1 model, indicating significant predictive power of the drugs served on the first day (as already known from [9]). Over the total considered population of 1,271,733 patients, 885,241 (∼70%) of them have non-empty admitting diagnosis information at admission time (A ≠ ∅). AuROC was 76.4% and AuPR was 10.9% with the A model, which is aimed to leverage this information for making predictions directly at admission time. This indicates predictive power of the admitting diagnoses as well. It thus makes sense to study how these models could be combined to obtain more accurate predictions for the concerned population of 885,241 patients. We study combinations of the predictions made at admission with predictions made at t_0 + 24h with the knowledge of the set of drugs served on the first day.
More generally, we consider different model combinations:
• we consider models obtained by the flattening and concatenation of features found in several basic models. In the sequel, we denote by C(B_1, B_2, ..., B_n) (or equivalently by B_1 B_2 ... B_n) the single model obtained from the concatenation of the features used in the basic models B_1, B_2, ..., B_n. For instance we consider the model C(A, D_1) in which all the features found in A and D_1 are concatenated.
• we also use ensemble techniques and in particular the stacking technique [7] to create combined models. The advantage of using logistic regressions as basic models to be combined with the stacking technique is that we can reuse not only their predictions, but also their raw output probabilities (which are more precise, as pointed out in § II-D2) as features for the meta-model. In the sequel, we denote by S(B_1, B_2, ..., B_n) the meta-model obtained from the raw probabilities of the basic models B_1, B_2, ..., B_n with the stacking technique. For example, we consider the model S(A, D_1) built from the stacking of the two models A and D_1.
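As a compact, single-machine sketch of the stacking construction S(A, D_1) (illustrative only; the authors' implementation is distributed, and out-of-fold probabilities would typically be used in practice to limit overfitting of the meta-model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stacked(X_A, X_D1, y):
        base_A = LogisticRegression(max_iter=1000).fit(X_A, y)    # basic model A
        base_D1 = LogisticRegression(max_iter=1000).fit(X_D1, y)  # basic model D_1
        # meta-features: the raw output probabilities of the basic models
        Z = np.column_stack([base_A.predict_proba(X_A)[:, 1],
                             base_D1.predict_proba(X_D1)[:, 1]])
        meta = LogisticRegression().fit(Z, y)                     # the meta-model S(A, D_1)
        return base_A, base_D1, meta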
Table II gives an overview of the AuROC, Accuracy and
AuPR obtained with the basic and combined models con-
sidered, on the same population, having admitting diagnosis
information. Table II indicates the average, minimum and
maximum values of each metric obtained with a 5-fold cross-
validation process.
TABLE II: Mortality risk predictions on the first day.
Model      AuROC %            Accuracy %         AuPR %
A          76.4 (76.0-76.8)   65.4 (65.2-65.6)   10.9 (10.6-11.2)
D_1 [9]    77.4 (77.2-77.7)   74.5 (74.5-74.8)   12.3 (12.0-12.5)
S(A,D_1)   80.1 (79.9-80.2)   69.2 (68.9-69.4)   14.0 (13.5-14.3)
C(A,D_1)   80.4 (80.2-80.7)   75.3 (75.2-75.5)   14.2 (13.8-14.6)
We observe that the combined models yield significantly more accurate predictions than the basic ones, improving over comparable earlier works. For predicting inpatient mortality, with the AD_1 model AuROC was 80.4% and AuPR was 14.2%, compared to respectively 77.4% and 12.3% obtained with the D_1 model of [9].
Figure 3 presents the ROC curve obtained for a run of the C(A, D_1) model on a given train and test set. The PR curve is shown in Figure 4. Table III presents the sizes of the train and test sets, and Table IV presents the confusion matrix and associated metrics.
TABLE III: Number of instances for train and test sets.
Mortality case study Train set Test set
Total size 708,373 176,868
Positive instances 22,660 5,576
Negative instances 685,713 171,292
Fig. 3: ROC curve for mortality prediction at t_0 + 24h (axes: False Positive Rate vs. True Positive Rate; AuROC = 80.7%).
B. HAI, ICU admission, and pressure ulcers
Table V presents results obtained when predicting all the other considered clinical outcomes using the C(A, D_1) model. To the best of our knowledge, our models outperform state-of-the-art interpretable models found in the literature for
Citations
Proceedings ArticleDOI
08 Apr 2021
TL;DR: In this paper, a bandwidth-efficient privacy-preserving federated learning that provides theoretical privacy guarantees based on differential privacy is proposed for in-hospital mortality prediction using a real dataset, containing Electronic Health Records of about one million patients.
Abstract: Machine Learning, and in particular Federated Machine Learning, opens new perspectives in terms of medical research and patient care. Although Federated Machine Learning improves over centralized Machine Learning in terms of privacy, it does not provide provable privacy guarantees. Furthermore, Federated Machine Learning is quite expensive in term of bandwidth consumption as it requires participant nodes to regularly exchange large updates. This paper proposes a bandwidth-efficient privacy-preserving Federated Learning that provides theoretical privacy guarantees based on Differential Privacy. We experimentally evaluate our proposal for in-hospital mortality prediction using a real dataset, containing Electronic Health Records of about one million patients. Our results suggest that strong and provable patient-level privacy can be enforced at the expense of only a moderate loss of prediction accuracy.

24 citations

Journal ArticleDOI
TL;DR: Early warning tools identify patients at risk of deterioration in general wards in hospitals and provide real-time, dynamic risk estimates as discussed by the authors, however, despite relative progress in the development of algorithms to predict patient deterioration, the literature has not shown that the deployment or implementation of such algorithms is reproducibly associated with improvements in patient outcomes.
Abstract: Background: Early warning tools identify patients at risk of deterioration in hospitals. Electronic medical records in hospitals offer real-time data and the opportunity to automate early warning tools and provide real-time, dynamic risk estimates. Objective: This review describes published studies on the development, validation, and implementation of tools for predicting patient deterioration in general wards in hospitals. Methods: An electronic database search of peer reviewed journal papers from 2008-2020 identified studies reporting the use of tools and algorithms for predicting patient deterioration, defined by unplanned transfer to the intensive care unit, cardiac arrest, or death. Studies conducted solely in intensive care units, emergency departments, or single diagnosis patient groups were excluded. Results: A total of 46 publications were eligible for inclusion. These publications were heterogeneous in design, setting, and outcome measures. Most studies were retrospective studies using cohort data to develop, validate, or statistically evaluate prediction tools. The tools consisted of early warning, screening, or scoring systems based on physiologic data, as well as more complex algorithms developed to better represent real-time data, deal with complexities of longitudinal data, and warn of deterioration risk earlier. Only a few studies detailed the results of the implementation of deterioration warning tools. Conclusions: Despite relative progress in the development of algorithms to predict patient deterioration, the literature has not shown that the deployment or implementation of such algorithms is reproducibly associated with improvements in patient outcomes. Further work is needed to realize the potential of automated predictions and update dynamic risk estimates as part of an operational early warning system for inpatient deterioration.

22 citations

Posted Content
TL;DR: Compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy and it is shown experimentally that this privacy-preserving proposal can reduce the communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.
Abstract: Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients. Differential Privacy is used to obtain theoretically sound privacy guarantees against such inference attacks by noising the exchanged update vectors. However, the added noise is proportional to the model size which can be very large with modern neural networks. This can result in poor model quality. In this paper, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy. We show experimentally, using 2 datasets, that our privacy-preserving proposal can reduce the communication costs by up to 95% with only a negligible performance penalty compared to traditional non-private federated learning schemes.

12 citations


Cites background or methods from "Scalable and Interpretable Predicti..."

  • ...We used EHR data from the Premier healthcare database which is one of the largest clinical databases in the United States, collecting information from millions of patients over a period of 12 months from 415 hospitals in the USA [30]....

  • ...These hospitals are supposedly representative of the United States hospital experience [30]....

  • ...As commonly found in the literature [30], for such predictions, we focus on hospital admissions of adults hospitalized for at least 3 days, excluding elective admissions....

  • ...The ability to accurately predict the risks in the patient's perspectives of evolution is a crucial prerequisite in order to adapt the care that certain patients receive [30]....

Proceedings ArticleDOI
06 Sep 2021
TL;DR: In this article, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy, which can reduce communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.
Abstract: Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients. Differential Privacy is used to obtain theoretically sound privacy guarantees against such inference attacks by noising the exchanged update vectors. However, the added noise is proportional to the model size which can be very large with modern neural networks. This can result in poor model quality. In this paper, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy. We show experimentally, using 2 datasets, that our privacy-preserving proposal can reduce the communication costs by up to 95 % with only a negligible performance penalty compared to traditional non-private federated learning schemes.

6 citations

Journal ArticleDOI
01 Nov 2020
TL;DR: It is proved that producing a representative cohort trajectory is NP-complete with a reduction in the multiple sequence alignment problem, and a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences is proposed.
Abstract: The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. The focus of existing work has been largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold: First, we formalize cohort representation as the problem of aggregating the trajectories of its patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete with a reduction in the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve cohort representation efficiency, we introduce “trajectory families” and “stratified sampling.” Our second contribution is formalizing the problem of cohort exploration as finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the potential number of similar cohorts is huge. We prove NP-completeness with a reduction in the maximum edge subgraph problem. To address complexity, we develop a multi-staged approach based on limiting the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets” that are inspired from the double dictionary encoding proposed for keyword search. Moreover, we explore the usefulness and efficiency of Core using an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. Moreover, we show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability on large and heterogeneous cohorts. We also show the benefit of event sets for cohort exploration in providing interactive performance.

3 citations

References
Journal ArticleDOI
TL;DR: The method of classifying comorbidity provides a simple, readily applicable and valid method of estimating risk of death fromComorbid disease for use in longitudinal studies and further work in larger populations is still required to refine the approach.

39,961 citations

Journal ArticleDOI
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...For the sake of completeness, we also include the commonly used Accuracy metric [8]....


Book ChapterDOI
21 Jun 2000
TL;DR: Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
Abstract: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.

5,679 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...• we also use ensemble techniques and in particular the stacking technique [7] to create combined models....


Proceedings ArticleDOI
25 Jun 2006
TL;DR: It is shown that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.
Abstract: Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.

5,063 citations


Additional excerpts

  • ...Since we deal with highly skewed datasets (as shown in Table I), we also report on the precision-recall (PR) curves and on the area under the PR curve (AuPR), in order to give a more complete picture of the performance of the models [6]....


Posted Content
TL;DR: It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.
Abstract: We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros ("nonevents"). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all variable events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.

3,170 citations


"Scalable and Interpretable Predicti..." refers methods in this paper

  • ...Notice that we do not use downsampling (that would drastically reduce the set of negative instances for the purpose of rebalancing classes); instead we apply the weighting technique [13] that allows our models to learn from all instances of imbalanced training sets....


Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Scalable and interpretable predictive models for electronic health records"?

The authors consider the problem of complication risk prediction, such as inpatient mortality, from the electronic health records of the patients. The authors study the question of making predictions on the first day at the hospital, and of making updated mortality predictions day after day during the patient's stay. The authors test their analyses with more than one million patients from hundreds of hospitals, and report on the lessons learned from these experiments.

One perspective for further work would be to study to which extent the system generalizes for predicting other clinical outcomes such as long lengths of stay and hospital readmissions. 

For instance, the authors retained the 100 most important weights corresponding to A features (with the most significant absolute value) obtained for each random training set. 

This is one reason why simple linear models (such as logistic regression) might be preferred over DNNs even when their accuracy is significantly lower, as detailed in [4]. 

The models proposed in [20] typically achieve areas under the ROC curve within the 0.79-0.89 range for mortality prediction at admission on their dataset (they do not report on AuPR nor on accuracy though). 

It is also worth noticing that the results reported in [20] are achieved with an important sacrifice: a major drawback of DNNs is their lack of interpretability, as notoriously known.