Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning



Adrian G. Zucco¹, Rudi Agius², Rebecka Svanberg², Kasper S. Moestrup¹, Ramtin Z. Marandi¹,
Cameron Ross MacPherson¹, Jens Lundgren¹,⁴, Sisse R. Ostrowski³,⁴*, Carsten U. Niemann²,⁴*

¹ PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
² Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
³ Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
⁴ Department of Clinical Medicine, University of Copenhagen, Denmark.
* Co-senior authors.

Correspondence should be addressed to: A.G.Z (adrian.gabriel.zucco@regionh.dk), S.R.O
(Sisse.Rye.Ostrowski@regionh.dk) or C.U.N (Carsten.Utoft.Niemann@regionh.dk).
medRxiv preprint doi: https://doi.org/10.1101/2021.10.28.21265598; this version posted October 29, 2021.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder,
who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a
CC-BY-NC-ND 4.0 International license.
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

ABSTRACT
Interpretable risk assessment of SARS-CoV-2 positive patients can help clinicians implement
precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of
a first positive SARS-CoV-2 test. By leveraging data on 33,928 confirmed SARS-CoV-2 cases in
eastern Denmark, we considered 2,723 variables extracted from electronic health records (EHR)
including demographics, diagnoses, medications, laboratory test results and vital parameters. A
discrete-time framework for survival modelling enabled us to predict personalized survival curves and
explain individual risk factors. On the test set, the model achieved a weighted concordance index of
0.95 and a precision-recall area under the curve of 0.71. Age, sex, number of medications, previous
hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable
survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk
factors. Upon further validation, this model may allow direct reporting of personalized survival
probabilities in routine care.
INTRODUCTION
Coronavirus disease 2019 (COVID-19), caused by infection with severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2), has by October 2021 claimed almost 5 million lives since its outbreak in
late 2019 [1]. Infected individuals present a variety of symptoms, ranging from asymptomatic
infection to life-threatening disease [2]. Although the majority of cases experience mild to moderate
disease, approximately 15% of confirmed SARS-CoV-2 positive cases are estimated to develop severe
disease [3]. Progression to severe disease seems to occur within 1-2 weeks from symptom onset, and
is characterized by clinical signs of pneumonia with dyspnea, increased respiratory rate, and
decreased blood oxygen saturation requiring supplemental oxygen [3-7]. Development of critical illness
is driven by systemic inflammation, leading to acute respiratory distress syndrome (ARDS),
respiratory failure, septic shock, multi-organ failure, and/or disseminated coagulopathy [4,5,8]. The
majority of these patients require mechanical ventilation, and mortality for patients admitted to an
intensive care unit (ICU) is reported to be 32-50% [3,8-10]. Despite the current vaccination program,
both vaccinated and unvaccinated individuals continue to develop critical COVID-19 [11]. Thus, the
pandemic still poses a great burden on health care systems worldwide, locally approaching the limit
of capacity due to high patient burden and challenging clinical management.
Several factors associated with an increased risk of a severe disease course have been established,
including old age, male gender, and lifestyle factors such as smoking and obesity [12,13].
Comorbidities including hypertension, type 2 diabetes, renal disease, as well as pre-existing
conditions of immune dysfunction and cancer, are also associated with a higher risk of severe disease
and COVID-19 related death [12,14-16]. Among hospitalized patients, risk factors for severe disease or
death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver
parameters indicating organ dysfunction [6]. However, many of these factors likely reflect an ongoing
progression of COVID-19. Thus, identification of high-risk patients at or prior to hospital admission
is warranted to facilitate personalized interventions.
Multiple COVID-19 prognostic models have been built on reduced sets of predictive features from
demographics, patient history, physical examination, and laboratory results [17], processed by
traditional statistical frameworks or machine learning (ML) algorithms. A systematic review of 50
prognostic models concluded that, overall, such models have been poorly reported and are at a high
risk of bias [18]. While great efforts have been put into providing prognostic models based on data
collected from health systems, traditional modelling approaches based solely on domain knowledge
risk missing novel markers and insights about the disease. Such insights could come from data-driven
models built in a hypothesis-free manner [19], which have been reported to outperform models based
on variables curated by domain experts [20].
Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model
explainability such as SHapley Additive exPlanations (SHAP) values [22]. Model explainability has
been developed mainly in the context of regression and binary classification, but in clinical
research, where censored observations are common, explainable time-to-event modelling is required to
avoid selection bias [23,24]. Multiple ML algorithms have been developed for time-to-event modelling,
either by building on top of existing models such as Cox proportional hazards or by defining new loss
functions that model time as continuous [25]. Here we used an alternative approach that considered
time in discrete intervals and performed binary classification at each interval [26]. This allowed us
to implement gradient boosting decision trees for binary classification to predict personalized
survival probabilities [27] and to explain predictions at the individual patient level using SHAP
values [22], including the temporal dynamics of risk factors over the course of the disease. This
approach not only allows prediction of personalized survival probabilities and risk factors for
SARS-CoV-2 positive patients but also provides a framework for precision medicine that can be applied
to other diseases based on routine electronic health records.
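The discrete-time idea described above can be illustrated with a minimal sketch. It assumes one
common formulation, in which each patient is expanded into one row per week at risk, a binary
classifier estimates the probability of death in each week, and the survival curve is the running
product of the weekly survival probabilities. The text does not state which exact formulation was
used, so the function names (to_person_period, survival_curve) and the toy numbers are hypothetical,
and the classifier itself is left out.

```python
import numpy as np


def to_person_period(time_weeks, event, n_intervals=12):
    """Expand one row per patient into one row per patient per week at risk.

    time_weeks: 1-based week of death or of last follow-up for each patient.
    event: 1 if the patient died in that week, 0 if follow-up ended (censored).
    Returns three aligned arrays: patient index, week number, binary label.
    """
    patient_idx, week, label = [], [], []
    for i, (t, e) in enumerate(zip(time_weeks, event)):
        last = min(int(t), n_intervals)  # rows stop at death, censoring or the horizon
        for w in range(1, last + 1):
            patient_idx.append(i)
            week.append(w)
            # The label is 1 only in the week where death is observed.
            label.append(int(e == 1 and w == t))
    return np.array(patient_idx), np.array(week), np.array(label)


def survival_curve(hazards):
    """Convert weekly death probabilities h_1..h_T into S(t) = prod_{k<=t} (1 - h_k)."""
    hazards = np.asarray(hazards, dtype=float)
    return np.cumprod(1.0 - hazards, axis=-1)


# Toy data (illustrative only): death in week 3, alive at week 12, censored in week 5.
idx, week, label = to_person_period([3, 12, 5], [1, 0, 0])

# Hypothetical weekly death probabilities for one patient, e.g. from a binary classifier.
curve = survival_curve([0.1, 0.2, 0.5])
```

In such a setup, a censored patient contributes rows only for the weeks actually observed, which is
how censoring is handled without discarding the patient.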
RESULTS
Patient cohort
Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, we
identified 33,938 patients who had at least one positive SARS-CoV-2 RT-PCR test among 963,265
individuals who had a test performed between the 17th of March 2020 and the 2nd of March 2021
(Fig. 1). In this cohort, 5,077 patients were hospitalized, of whom 502 were admitted to the ICU
(Supplementary Fig. 1). Overall, 1,803 (5.34%) deaths occurred among all individuals with a positive
SARS-CoV-2 RT-PCR test; 141 of these individuals died later than 12 weeks after the first positive
test (FPT) and were hence considered alive for this analysis. Right-censoring was only observed for
patients tested after the 8th of December 2020, who had less than 12 weeks of follow-up available.
Deaths that occurred on the same day as the FPT were not considered for training. For the initial
model, demographics, laboratory test results, hospitalizations, vital parameters, diagnoses,
medicines (ordered and administered) and summary features were included. Feature encoding resulted
in 2,723 features (Supplementary Table 2), which were reduced to 23 features after feature selection.
A summary of the cohort based on the final feature set can be found in Table 1. This cohort
represents an updated subset of individuals residing in Denmark characterized in a previous
publication [28].
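The outcome rules described above (12-week horizon, right-censoring for late tests, same-day deaths
left out of training) can be written as a small labelling function. This is a sketch under stated
assumptions: the data cut-off is assumed to be the 2nd of March 2021, the last test date mentioned
in the text, and the function name, status strings and return shape are hypothetical.

```python
from datetime import date, timedelta

HORIZON = timedelta(weeks=12)
# Assumption: follow-up ends on the last test date mentioned in the text.
DATA_CUTOFF = date(2021, 3, 2)


def label_outcome(first_positive_test, death_date=None, cutoff=DATA_CUTOFF):
    """Return (status, follow_up_days) for one patient.

    status is one of "died", "alive", "censored" or "excluded_from_training".
    """
    horizon_end = first_positive_test + HORIZON

    # Deaths on the day of the first positive test are not used for training.
    if death_date is not None and death_date == first_positive_test:
        return "excluded_from_training", 0

    # A death counts only if it falls inside the 12-week window and before the cut-off.
    if death_date is not None and death_date <= min(horizon_end, cutoff):
        return "died", (death_date - first_positive_test).days

    # Fewer than 12 weeks of follow-up available: right-censored at the cut-off.
    if horizon_end > cutoff:
        return "censored", (cutoff - first_positive_test).days

    # Full follow-up without death in the window, including deaths after week 12.
    return "alive", HORIZON.days
```

Under the assumed cut-off, a patient tested on the 8th of December 2020 has exactly 12 weeks of
follow-up, so only patients tested after that date are censored, in line with the text.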
Survival modelling with machine learning achieves high discriminative performance
To predict the risk of death within 12 weeks from FPT, we trained gradient boosting decision trees
considering time as discrete in a time-to-event framework. Performance was measured on 20% of the
data (test set), unblinded only for performance assessment. The weighted concordance index (C-index)
for predicting risk of death for all 12 weeks, with 95% confidence intervals (CI), was 0.946
(0.941-0.950). Binary metrics were calculated for each predicted week by excluding censored
individuals (Fig. 2). At week 12, the precision-recall area under the curve (PR-AUC) and Matthews
correlation coefficient (MCC) with 95% CI were 0.686 (0.651-0.720) and 0.580 (0.562-0.597)
respectively. The sensitivity was 99.3% and the specificity was 86.4%. The performance for subgroups
of patients displayed some differences. In patients tested outside the hospital (Fig. 2b), the
C-index was 0.955 (0.950-0.960), and the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585
(0.562-0.605) respectively. Sensitivity was 98.9% and specificity 89.9% in this group. For patients
already admitted to the hospital at the time of test (Fig. 2c), the C-index was 0.809 (0.787-0.829),
and the PR-AUC and MCC were 0.705 (0.640-0.760) and 0.357 (0.325-0.387) respectively. The sensitivity
was 100% and the specificity 31.0%, indicating a higher number of false positives when using a 0.5
probability threshold for this group (Supplementary Table 1).
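The binary metrics reported above all derive from a confusion matrix at a fixed probability
threshold. The sketch below shows the standard definitions of sensitivity, specificity and MCC at a
0.5 threshold; the function name and the toy labels and probabilities are hypothetical and are not
taken from the study data.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity and MCC from labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_death = p >= threshold
        if predicted_death and y == 1:
            tp += 1
        elif predicted_death and y == 0:
            fp += 1
        elif not predicted_death and y == 0:
            tn += 1
        else:
            fn += 1

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Toy example (illustrative only): every death is caught, with one false positive.
metrics = binary_metrics([1, 1, 0, 0, 0, 0], [0.9, 0.6, 0.7, 0.2, 0.1, 0.4])
```

Because MCC uses all four cells of the confusion matrix, it falls when false positives accumulate
even if sensitivity stays at 100%, which matches the pattern reported for the hospitalized subgroup.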
