Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning

29 Oct 2021, medRxiv (Cold Spring Harbor Laboratory Press)

Abstract: Interpretable risk assessment of SARS-CoV-2 positive patients can aid clinicians to implement precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive SARS-CoV-2 test. By leveraging data on 33,928 confirmed SARS-CoV-2 cases in eastern Denmark, we considered 2,723 variables extracted from electronic health records (EHR) including demographics, diagnoses, medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling enabled us to predict personalized survival curves and explain individual risk factors. Performances of weighted concordance index 0.95 and precision-recall area under the curve 0.71 were measured on the test set. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of personalized survival probabilities in routine care.

Summary

INTRODUCTION

  • Coronavirus disease 2019 (COVID-19), caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has by October 2021 claimed almost 5 million lives since its outbreak in late 2019 [1].
  • Both vaccinated and unvaccinated people continue to develop critical COVID-19 disease [11].
  • Among hospitalized patients, risk factors for severe disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver parameters indicating organ dysfunction [6].
  • While great efforts have been put into providing prognostic models based on data collected from health systems, traditional modelling approaches solely based on domain knowledge may fail.
  • Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model explainability such as SHapley Additive exPlanations (SHAP) values [22].

Patient cohort

  • Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, the authors identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1).
  • In patients tested outside the hospital (Fig 2b), the C-index was 0.955 (0.950-0.960), the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585 (0.562-0.605) respectively.
  • The median of the predicted cumulative death probabilities by survival status reflected the discriminative performance of the individual survival predictions (Fig. 3a).
  • From the original set of 2,723 features generated from routine EHR data (Supplementary Table 2), 22 features were selected.
  • As expected, patients with more hospitalizations and longer cumulative admission days prior to FPT exhibited a higher risk of death (Fig 5e-f).

DISCUSSION

  • The authors here developed an explainable Machine Learning model for predicting the risk of death within the first 12 weeks from a positive SARS-CoV-2 PCR test.
  • Additionally, instead of characterizing patients’ relevant history using a limited set of preselected variables, the set of 22 features in the final model were derived using a data-driven approach from an initial set of 2,723 features that encoded available demographics, laboratory test results, hospitalizations, vital parameters, diagnoses and medicines.
  • This has been the predominant modelling approach for COVID-19 related outcomes [18,34].
  • Multiple approaches have been proposed to open “black-box” models and allow explainability by, for example, removing features and measuring their impact on the model [43].
  • This suggests that predicting late deaths requires a different set of risk factors and consideration of their interactions than predicting early death.

CONCLUSION

  • The authors developed a data-driven machine learning model to identify SARS-CoV-2 positive patients with a high risk of death within 12 weeks of the first positive test.
  • The discrete-time modelling approach implemented not only allowed us to train survival models with high performance but also enabled model explainability through SHAP values.
  • By learning temporal dynamics and interactions between clinical features, the model was able to identify personalized risk factors and high-risk patients for early interventions while improving the understanding of the disease.
  • At the same time, the authors demonstrate that leveraging electronic health records with explainable ML models provides a framework for the implementation of precision medicine in routine care, which can be adapted to other diseases.

Adrian G. Zucco (1), Rudi Agius (2), Rebecka Svanberg (2), Kasper S. Moestrup (1), Ramtin Z. Marandi (1), Cameron Ross MacPherson (1), Jens Lundgren (1,4), Sisse R. Ostrowski (3,4)*, Carsten U. Niemann (2,4)*

(1) PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
(2) Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
(3) Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
(4) Department of Clinical Medicine, University of Copenhagen, Denmark.
*Co-senior authors.
Correspondence should be addressed to: A.G.Z. (adrian.gabriel.zucco@regionh.dk), S.R.O. (Sisse.Rye.Ostrowski@regionh.dk) or C.U.N. (Carsten.Utoft.Niemann@regionh.dk).
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2021.10.28.21265598; this version posted October 29, 2021.
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

INTRODUCTION
Coronavirus disease 2019 (COVID-19), caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has by October 2021 claimed almost 5 million lives since its outbreak in late 2019 [1]. Infected individuals present a variety of symptoms, ranging from asymptomatic infection to life-threatening disease [2]. Although the majority of cases experience mild to moderate disease, approximately 15% of confirmed SARS-CoV-2 positive cases are estimated to develop severe disease [3]. Progression to severe disease seems to occur within 1-2 weeks from symptom onset, and is characterized by clinical signs of pneumonia with dyspnea, increased respiratory rate, and decreased blood oxygen saturation requiring supplemental oxygen [3-7]. Development of critical illness is driven by systemic inflammation, leading to acute respiratory distress syndrome (ARDS), respiratory failure, septic shock, multi-organ failure, and/or disseminated coagulopathy [4,5,8]. The majority of these patients require mechanical ventilation, and mortality for patients admitted to an Intensive Care Unit (ICU) is reported to be 32-50% [3,8-10]. Despite the current vaccination program, both vaccinated and unvaccinated people continue to develop critical COVID-19 disease [11]. Thus, the pandemic still poses a great burden on health care systems worldwide, locally approaching the limit of capacity due to high patient burden and challenging clinical management.
Several factors associated with increased risk of severe disease course have been established, including old age, male gender, and lifestyle factors such as smoking and obesity [12,13]. Comorbidities including hypertension, type 2 diabetes, renal disease, as well as pre-existing conditions of immune dysfunction and cancer, are also associated with a higher risk of severe disease and COVID-19 related death [12,14-16]. Among hospitalized patients, risk factors for severe disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver parameters indicating organ dysfunction [6]. However, many of these factors likely reflect an ongoing progression of COVID-19. Thus, identification of high-risk patients at or prior to hospital admission is warranted to facilitate personalized interventions.
Multiple COVID-19 prognostic models have been built on reduced sets of predictive features from demographics, patient history, physical examination, and laboratory results [17], processed by traditional statistical frameworks or machine learning (ML) algorithms. A systematic review of 50 prognostic models has concluded that, overall, such models have been poorly reported and are at a high risk of bias [18]. While great efforts have been put into providing prognostic models based on data collected from health systems, traditional modelling approaches solely based on domain knowledge may fail. This represents a risk of missing novel markers and insights about the disease that could come from data-driven models in a hypothesis-free manner [19], which have been reported to outperform models based on curated variables from domain experts [20].
Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model explainability such as SHapley Additive exPlanations (SHAP) values [22]. Model explainability has been developed mainly in the context of regression and binary classification, but in clinical research, where censored observations are common, explainable time-to-event modelling is required to avoid selection bias [23,24]. Multiple ML algorithms have been developed for time-to-event modelling, either by building on top of existing models such as Cox proportional hazards or by defining new loss functions that model time as continuous [25]. Here we used an alternative approach that considered time in discrete intervals and performed binary classification at such time intervals [26]. This allowed us to implement gradient boosting decision trees for binary classification to predict personalized survival probabilities [27] and allow explainability at the individual patient level using SHAP values [22], including temporal dynamics of risk factors over the course of the disease. This approach not only allows prediction of personalized survival probabilities and risk factors for SARS-CoV-2 positive patients but also provides a framework for precision medicine that can be applied to other diseases based on routine electronic health records.
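
As a rough illustration of the discrete-time idea described above, the sketch below expands one patient into weekly rows that a binary classifier could be trained on, and converts predicted weekly death probabilities into a survival curve via S(t) = prod over k <= t of (1 - h_k). It assumes a common person-period encoding, which the text does not spell out; the function names and the example hazards are hypothetical, and no gradient boosting model is trained here.

```python
from typing import List, Tuple

N_WEEKS = 12  # prediction horizon used in the text


def expand_person_period(week_of_exit: int, died: bool,
                         n_weeks: int = N_WEEKS) -> List[Tuple[int, int]]:
    """Turn one patient into (week, label) rows for binary classification.

    week_of_exit: last week (1-based) in which the patient was observed.
    died: True if the patient died in that week, False if alive or censored.
    The label is 1 only in the week of death; all other weeks are 0.
    Weeks after censoring or death produce no rows.
    """
    if week_of_exit < 1:
        raise ValueError("week_of_exit must be >= 1")
    last = min(week_of_exit, n_weeks)
    rows = [(week, 0) for week in range(1, last + 1)]
    if died and week_of_exit <= n_weeks:
        rows[-1] = (last, 1)
    return rows


def survival_curve(hazards: List[float]) -> List[float]:
    """Cumulative survival from per-interval death probabilities.

    hazards[k] is the (hypothetical) predicted probability of dying in
    interval k+1, given survival up to the start of that interval.
    """
    curve = []
    alive = 1.0
    for h in hazards:
        alive *= 1.0 - h
        curve.append(alive)
    return curve
```

Under this encoding a censored patient contributes rows only for the weeks actually observed, which is how censored observations can be used without discarding them.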
RESULTS
Patient cohort
Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, we identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1). In this cohort, 5,077 patients were hospitalized, of whom 502 were admitted to the ICU (Supplementary Fig. 1). Overall, 1,803 (5.34%) deaths occurred among all individuals with a positive SARS-CoV-2 RT-PCR test; 141 of these patients died later than 12 weeks after the first positive test (FPT) and were hence considered alive for this analysis. Right-censoring was only observed for patients tested after the 8th of December 2020, who had less than 12 weeks of follow-up available, while deaths that occurred on the same day as the FPT were not considered for training. For the initial model, demographics, laboratory test results, hospitalizations, vital parameters, diagnoses, medicines (ordered and administered) and summary features were included. Feature encoding resulted in 2,723 features (Supplementary Table 2), which after feature selection were reduced to 23 features. A summary of the cohort based on the final feature set can be found in Table 1. This cohort represents an updated subset of individuals residing in Denmark characterized in a previous publication [28].
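
The outcome rules in this paragraph can be sketched as a small labelling function. This is only an illustration under stated assumptions: follow-up is assumed to end on the 2nd of March 2021 (the end of the testing period, which is exactly 12 weeks after the 8th of December 2020), a death at exactly 84 days is counted as within the horizon, and the names `label_outcome`, `HORIZON_DAYS` and `FOLLOW_UP_END` are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

HORIZON_DAYS = 84                 # 12 weeks
FOLLOW_UP_END = date(2021, 3, 2)  # assumption: follow-up ends with the testing period


def label_outcome(fpt: date, death: Optional[date],
                  follow_up_end: date = FOLLOW_UP_END) -> Optional[Tuple[str, int]]:
    """Return (status, days_observed) for one patient, or None if excluded.

    fpt: date of the first positive test.
    death: date of death, or None if no death was recorded.
    Assumes any recorded death falls on or after the FPT.
    """
    # Deaths on the same day as the FPT were not used for training.
    if death is not None and death == fpt:
        return None
    # Deaths within the 12-week horizon are events.
    if death is not None and death <= follow_up_end:
        days_to_death = (death - fpt).days
        if days_to_death <= HORIZON_DAYS:
            return ("dead", days_to_death)
    # Later deaths, or no death: alive if the full horizon was observed,
    # otherwise right-censored at the end of follow-up.
    available = (follow_up_end - fpt).days
    if available >= HORIZON_DAYS:
        return ("alive", HORIZON_DAYS)
    return ("censored", available)
```

With these assumptions, a patient tested on the 8th of December 2020 has exactly 84 days of follow-up and is not censored, which matches the statement that censoring only affects patients tested after that date.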
Survival modelling with machine learning achieves high discriminative performance
To predict the risk of death within 12 weeks from FPT, we trained gradient boosting decision trees considering time as discrete in a time-to-event framework. Performance was measured on 20% of the data (test set), unblinded only for performance assessment. The weighted concordance index (C-index) for predicting risk of death for all 12 weeks, with 95% confidence intervals (CI), was 0.946 (0.941-0.950). Binary metrics were calculated for each predicted week by excluding censored individuals (Fig. 2). At week 12, the precision-recall area under the curve (PR-AUC) and Matthews correlation coefficient (MCC) with 95% CI were 0.686 (0.651-0.720) and 0.580 (0.562-0.597), respectively. The sensitivity was 99.3% and the specificity was 86.4%. The performance for subgroups of patients displayed some differences. In patients tested outside the hospital (Fig. 2b), the C-index was 0.955 (0.950-0.960), and the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585 (0.562-0.605), respectively; sensitivity was 98.9% and specificity 89.9% in this group. For patients previously admitted to the hospital at the time of test (Fig. 2c), the C-index was 0.809 (0.787-0.829), and the PR-AUC and MCC were 0.705 (0.640-0.760) and 0.357 (0.325-0.387), respectively. The sensitivity was 100% and the specificity 31.0%, indicating a higher number of false positives when using a 0.5 probability threshold for this group (Supplementary Table 1).
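
To make the reported metrics concrete, the sketch below computes sensitivity, specificity and MCC from predicted probabilities at a 0.5 threshold, plus a plain Harrell-style concordance index. Note the assumptions: the concordance function is the simple unweighted variant, not the weighted C-index reported above, pairs with tied times are skipped, and the function names and toy inputs are hypothetical rather than taken from the study.

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity and MCC at a fixed probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


def harrell_c_index(times: Sequence[float], events: Sequence[int],
                    risks: Sequence[float]) -> float:
    """Unweighted concordance: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when i has the strictly shorter time and an
    observed event. It is concordant when i also has the higher risk score;
    tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A classifier that flags every true death but also many survivors reaches 100% sensitivity with low specificity and a modest MCC, which is the pattern described for the in-hospital group.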
References

Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

33,540 citations


Book Chapter
David Cox
Abstract: The analysis of censored failure times is considered. It is assumed that on each individual are available values of one or more explanatory variables. The hazard function (age-specific failure rate) is taken to be a function of the explanatory variables and unknown regression coefficients multiplied by an arbitrary and unknown function of time. A conditional likelihood is obtained, leading to inferences about the unknown regression coefficients. Some generalizations are outlined.

28,225 citations


Journal Article
Chaolin Huang, Yeming Wang, Xingwang Li, Lili Ren, et al.
24 Jan 2020-The Lancet
TL;DR: The epidemiological, clinical, laboratory, and radiological characteristics and treatment and clinical outcomes of patients with laboratory-confirmed 2019-nCoV infection in Wuhan, China, were reported.
Abstract: A recent cluster of pneumonia cases in Wuhan, China, was caused by a novel betacoronavirus, the 2019 novel coronavirus (2019-nCoV). We report the epidemiological, clinical, laboratory, and radiological characteristics and treatment and clinical outcomes of these patients. All patients with suspected 2019-nCoV were admitted to a designated hospital in Wuhan. We prospectively collected and analysed data on patients with laboratory-confirmed 2019-nCoV infection by real-time RT-PCR and next-generation sequencing. Data were obtained with standardised data collection forms shared by the International Severe Acute Respiratory and Emerging Infection Consortium from electronic medical records. Researchers also directly communicated with patients or their families to ascertain epidemiological and symptom data. Outcomes were also compared between patients who had been admitted to the intensive care unit (ICU) and those who had not.

26,390 citations


Journal Article
Wei-jie Guan, Zhengyi Ni, Yu Hu, Wenhua Liang, et al.
TL;DR: During the first 2 months of the current outbreak, Covid-19 spread rapidly throughout China and caused varying degrees of illness, and patients often presented without fever, and many did not have abnormal radiologic findings.
Abstract: Background Since December 2019, when coronavirus disease 2019 (Covid-19) emerged in Wuhan city and rapidly spread throughout China, data have been needed on the clinical characteristics of...

16,855 citations


Journal Article
Fei Zhou, Ting Yu, Ronghui Du, Guohui Fan, et al.
28 Mar 2020-The Lancet
Abstract: Summary Background Since December, 2019, Wuhan, China, has experienced an outbreak of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Epidemiological and clinical characteristics of patients with COVID-19 have been reported but risk factors for mortality and a detailed clinical course of illness, including viral shedding, have not been well described. Methods In this retrospective, multicentre cohort study, we included all adult inpatients (≥18 years old) with laboratory-confirmed COVID-19 from Jinyintan Hospital and Wuhan Pulmonary Hospital (Wuhan, China) who had been discharged or had died by Jan 31, 2020. Demographic, clinical, treatment, and laboratory data, including serial samples for viral RNA detection, were extracted from electronic medical records and compared between survivors and non-survivors. We used univariable and multivariable logistic regression methods to explore the risk factors associated with in-hospital death. Findings 191 patients (135 from Jinyintan Hospital and 56 from Wuhan Pulmonary Hospital) were included in this study, of whom 137 were discharged and 54 died in hospital. 91 (48%) patients had a comorbidity, with hypertension being the most common (58 [30%] patients), followed by diabetes (36 [19%] patients) and coronary heart disease (15 [8%] patients). Multivariable regression showed increasing odds of in-hospital death associated with older age (odds ratio 1·10, 95% CI 1·03–1·17, per year increase; p=0·0043), higher Sequential Organ Failure Assessment (SOFA) score (5·65, 2·61–12·23; p Interpretation The potential risk factors of older age, high SOFA score, and d-dimer greater than 1 μg/mL could help clinicians to identify patients with poor prognosis at an early stage. Prolonged viral shedding provides the rationale for a strategy of isolation of infected patients and optimal antiviral interventions in the future. 
Funding Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences; National Science Grant for Distinguished Young Scholars; National Key Research and Development Program of China; The Beijing Science and Technology Project; and Major Projects of National Science and Technology on New Drug Creation and Development.

15,279 citations