Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning

Adrian G. Zucco, Rudi Agius, Rebecka Svanberg, Kasper Sommerlund Moestrup, Ramtin Z. Marandi, Cameron Ross MacPherson, Jens D Lundgren¹, Sisse R. Ostrowski¹, Carsten Utoft Niemann¹

University of Copenhagen¹

29 Oct 2021 - medRxiv (Cold Spring Harbor Laboratory Press)

TL;DR: A machine learning model was trained to predict mortality within 12 weeks of the first positive SARS-CoV-2 test; such interpretable risk assessment can aid clinicians in implementing precision medicine.

Abstract: Interpretable risk assessment of SARS-CoV-2 positive patients can aid clinicians to implement precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive SARS-CoV-2 test. By leveraging data on 33,938 confirmed SARS-CoV-2 cases in eastern Denmark, we considered 2,723 variables extracted from electronic health records (EHR) including demographics, diagnoses, medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling enabled us to predict personalized survival curves and explain individual risk factors. Performances of weighted concordance index 0.95 and precision-recall area under the curve 0.71 were measured on the test set. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of personalized survival probabilities in routine care.

Summary

INTRODUCTION

Coronavirus disease 2019 (COVID-19) caused by infection with Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has by October 2021 claimed almost 5 million lives since its outbreak in late 2019 [1].
Both vaccinated and unvaccinated people continue to develop critical COVID-19 disease [11].
Among hospitalized patients, risk factors for severe disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver parameters indicating organ dysfunction [6].
While great efforts have been put into providing prognostic models based on data collected from health systems, traditional modelling approaches solely based on domain knowledge may fail, with a risk of missing novel markers and insights about the disease.
Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model explainability such as SHapley Additive exPlanations (SHAP) values [22].

Patient cohort

Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, the authors identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1).
The median of the predicted cumulative death probabilities by survival status reflected the discriminative performance of the individual survival predictions (Fig. 3a).
From the original set of 2,723 features generated from routine EHR data (Supplementary Table 2), 22 features were selected.
As expected, patients with more hospitalizations and longer cumulative admission days prior to FPT exhibited a higher risk of death (Fig. 5e-f).

DISCUSSION

The authors here developed an explainable Machine Learning model for predicting the risk of death within the first 12 weeks from a positive SARS-CoV-2 PCR test.
Additionally, instead of characterizing patients’ relevant history using a limited set of preselected variables, the set of 22 features in the final model was derived using a data-driven approach from an initial set of 2,723 features that encoded available demographics, laboratory test results, hospitalizations, vital parameters, diagnoses and medicines.
This has been the predominant modelling approach for COVID-19 related outcomes [18,34].
Multiple approaches have been proposed to open “black-box” models and allow explainability by, for example, removing features and measuring their impact on the model [43].
This suggests that predicting late deaths requires a different set of risk factors and consideration of their interactions than predicting early death.

CONCLUSION

The authors developed a data-driven machine learning model to identify SARS-CoV-2 positive patients with a high risk of death within 12 weeks of the first positive test.
The discrete-time modelling approach implemented not only allowed us to train survival models with high performance but also enabled model explainability through SHAP values.
By learning temporal dynamics and interactions between clinical features, the model was able to identify personalized risk factors and high-risk patients for early interventions while improving the understanding of the disease.
At the same time, the authors demonstrate that leveraging electronic health records with explainable ML models provides a framework for the implementation of precision medicine in routine care which can be adapted to other diseases.
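
One way to write down the discrete-time formulation referred to above is sketched here. The notation (the weekly index t, the feature vector x, and the symbols h, S and F) is an assumption made for illustration and is not taken from the paper.

```latex
h_i(t) = P\bigl(T_i = t \mid T_i \ge t,\ \mathbf{x}_i\bigr), \qquad
S_i(t) = \prod_{k=1}^{t} \bigl(1 - h_i(k)\bigr), \qquad
F_i(t) = 1 - S_i(t), \qquad t = 1, \dots, 12
```

Under this reading, h_i(t) is the probability that a binary classifier assigns to death in week t for patient i, given survival up to that week; S_i(t) is the resulting personalized survival curve and F_i(t) the cumulative death probability.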

Figures (6)

Figure 4. Global and local explanations of feature contributions to the risk of death in SARS-CoV-2 positive patients.

Figure 5. Individual feature explanations by survival status. Partial dependence plots (PDP) of SHAP values versus age (a), body mass index (b), sex (c), lymphocyte levels (d), cumulative days in hospital (e) and the number of admissions (f) in the last 3 years, admission status at the time of first positive test (g) and the number of ordered medicines (h). Each dot shows a patient-week value coloured by survival status indicating those patients who survived (green) or died (red). Total SHAP values are represented as explained contributions in terms of probability (y-axis) given all the feature values for a patient, whereas features (x-axis) are represented by their corresponding value. The top and left panels of each PDP plot depict letter-value plots of the distribution of the x and y axes by survival status. Top panels were substituted by bar plots for categorical variables. Additional PDPs for the remaining features can be found in Supplementary

Figure 1. Overview of the data sources, feature engineering and modelling approach for predicting 12-week mortality in SARS-CoV-2 positive patients. a, Electronic Health Records (EHR) of 33,938 patients from 17th of March 2020 to 2nd of March 2021 (incidence curve) in eastern Denmark (geographical region visualized in red) were used to predict 12-week mortality from the first positive SARS-CoV-2 test (FPT). b, Features were engineered as the last value observed prior to FPT within the last month for vitals and laboratory values. To encode hospital admissions, medications and diagnoses, the count of occurrences within three or one year(s) prior to FPT was used. c, Machine learning algorithms were trained for survival modelling using a discrete-time approach. Time-to-event data were transformed longitudinally into patient-weeks up to the loss of follow-up (0) or death (1). With the augmented data, binary classification was performed by gradient boosting decision trees to predict personalized survival distributions for each patient and provide explanations of individual risk factors using SHAP values.
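
The feature engineering described in panel b of Figure 1 could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 30-day reading of "last month", the 365-day reading of "one year" and the strict "before FPT" comparison are all assumptions.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def last_value_before(measurements: List[Tuple[date, float]], fpt: date,
                      window_days: int = 30) -> Optional[float]:
    """Most recent value recorded in the window before the FPT date.

    Assumption: "within the last month" is read as 30 days, and only
    measurements strictly before the FPT date are eligible.
    Returns None when no measurement falls inside the window.
    """
    start = fpt - timedelta(days=window_days)
    in_window = [(d, v) for d, v in measurements if start <= d < fpt]
    if not in_window:
        return None
    # Pick the measurement with the latest date.
    return max(in_window, key=lambda dv: dv[0])[1]


def count_before(event_dates: List[date], fpt: date,
                 window_days: int = 365) -> int:
    """Number of events (admissions, diagnoses, ...) in the window before FPT."""
    start = fpt - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d < fpt)


# Hypothetical toy records for one patient.
fpt = date(2020, 12, 1)
lymphocytes = [(date(2020, 9, 1), 2.0), (date(2020, 11, 1), 1.2), (date(2020, 11, 20), 0.8)]
admissions = [date(2019, 1, 1), date(2020, 6, 1), date(2020, 12, 1)]

latest_lymphocytes = last_value_before(lymphocytes, fpt)
admissions_last_year = count_before(admissions, fpt, window_days=365)
```

In this toy example the lymphocyte value from 20 November is the one carried forward, and only the June admission counts toward the one-year window, since the other two fall outside it or on the FPT date itself.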

Figure 2. Binary performance metrics for 12 weeks mortality prediction

Figure 6. Summary of relevant feature interactions in explaining early and late mortality in SARS-CoV-2 positive patients.

Figure 3. Predicted individual discrete and cumulative death probabilities.

Adrian G. Zucco, Rudi Agius, Rebecka Svanberg, Kasper S. Moestrup, Ramtin Z. Marandi, Cameron Ross MacPherson, Jens Lundgren (1,4), Sisse R. Ostrowski (3,4)*, Carsten U. Niemann (2,4)*

1. PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
2. Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
3. Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
4. Department of Clinical Medicine, University of Copenhagen, Denmark.

*Co-senior authors.

Correspondence should be addressed to: A.G.Z. (adrian.gabriel.zucco@regionh.dk), S.R.O. (Sisse.Rye.Ostrowski@regionh.dk) or C.U.N. (Carsten.Utoft.Niemann@regionh.dk).

medRxiv preprint doi: https://doi.org/10.1101/2021.10.28.21265598; this version posted October 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

INTRODUCTION

Coronavirus disease 2019 (COVID-19) caused by infection with Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has by October 2021 claimed almost 5 million lives since its outbreak in late 2019 [1]. Infected individuals present a variety of symptoms, ranging from asymptomatic to life-threatening disease. Although the majority of cases experience mild to moderate disease, approximately 15% of confirmed SARS-CoV-2 positive cases are estimated to develop severe disease. Progression to severe disease seems to occur within 1-2 weeks from symptom onset, and is characterized by clinical signs of pneumonia with dyspnea, increased respiratory rate, and decreased blood oxygen saturation requiring supplemental oxygen [3-7]. Development of critical illness is driven by systemic inflammation, leading to acute respiratory distress syndrome (ARDS), respiratory failure, septic shock, multi-organ failure, and/or disseminated coagulopathy [4,5,8]. The majority of these patients require mechanical ventilation, and mortality for patients admitted to an Intensive Care Unit (ICU) is reported to be 32-50% [3,8-10]. Despite the current vaccination program, both vaccinated and unvaccinated people continue to develop critical COVID-19 disease [11]. Thus, the pandemic still poses a great burden on health care systems worldwide, locally approaching the limit of capacity due to high patient burden and challenging clinical management.

Several factors associated with increased risk of severe disease course have been established, including old age, male gender, and lifestyle factors such as smoking and obesity [12,13]. Comorbidities including hypertension, type 2 diabetes, renal disease, as well as pre-existing conditions of immune dysfunction and cancer, are also associated with a higher risk of severe disease and COVID-19 related death [12,14-16]. Among hospitalized patients, risk factors for severe disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver parameters indicating organ dysfunction [6]. However, many of these factors likely reflect an ongoing progression of COVID-19. Thus, identification of high-risk patients at or prior to hospital admission is warranted to facilitate personalized interventions.

Multiple COVID-19 prognostic models have been built on reduced sets of predictive features from demographics, patient history, physical examination, and laboratory results, processed by traditional statistical frameworks or machine learning (ML) algorithms. A systematic review of 50 prognostic models has concluded that overall such models have been poorly reported and are at a high risk of bias. While great efforts have been put into providing prognostic models based on data collected from health systems, traditional modelling approaches solely based on domain knowledge may fail. This represents a risk of missing novel markers and insights about the disease that could come from data-driven models in a hypothesis-free manner, which have been reported to outperform models based on curated variables from domain experts.

Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model explainability such as SHapley Additive exPlanations (SHAP) values [22]. Model explainability has been developed mainly in the context of regression and binary classification, but in clinical research where censored observations are common, explainable time-to-event modelling is required to avoid selection bias [23,24]. Multiple ML algorithms have been developed for time-to-event modelling, either by building on top of existing models such as Cox proportional hazards or by defining new loss functions that model time as continuous. Here we used an alternative approach that considered time in discrete intervals and performed binary classification at such time intervals. This allowed us to implement gradient boosting decision trees for binary classification to predict personalized survival probabilities and allow explainability at the individual patient level using SHAP values, including temporal dynamics of risk factors over the course of the disease. This approach not only allows the prediction of personalized survival probabilities and risk factors for SARS-CoV-2 positive patients but also provides a framework for precision medicine that can be applied to other diseases based on routine electronic health records.
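
The discrete-time idea described above can be illustrated with a small sketch: each patient is expanded into one row per observed week, labelled 1 only in the week of death, and per-week death probabilities are then chained into a survival curve. The function names, the 1-based week index and the hazard values are hypothetical; the gradient boosting classifier itself is left out and replaced by made-up probabilities.

```python
from typing import List, Tuple


def to_patient_weeks(patient_id: str, weeks_followed: int, died: bool,
                     horizon: int = 12) -> List[Tuple[str, int, int]]:
    """Expand one patient into (patient_id, week, label) rows.

    Assumption: weeks_followed is the 1-based week of death or of loss of
    follow-up. The label is 1 only in the week of death; censored patients
    and patients who die after the horizon contribute only 0-labelled rows.
    """
    last_week = min(weeks_followed, horizon)
    died_in_window = died and weeks_followed <= horizon
    rows = []
    for week in range(1, last_week + 1):
        label = 1 if (died_in_window and week == last_week) else 0
        rows.append((patient_id, week, label))
    return rows


def survival_curve(weekly_hazards: List[float]) -> List[float]:
    """Chain per-week death probabilities into survival probabilities."""
    curve, alive = [], 1.0
    for hazard in weekly_hazards:
        alive *= (1.0 - hazard)  # survive this week given alive so far
        curve.append(alive)
    return curve


# A patient who died in week 3 yields three rows, the last labelled 1.
rows_died = to_patient_weeks("p1", weeks_followed=3, died=True)

# Hypothetical classifier outputs for weeks 1..4 of one patient.
curve = survival_curve([0.10, 0.20, 0.05, 0.05])
cumulative_death = [1.0 - s for s in curve]
```

Because every patient-week is an ordinary binary example, any binary classifier can be trained on the expanded rows, and censored patients still contribute the weeks for which they were observed.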

RESULTS

Patient cohort

Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, we identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1). In this cohort, 5,077 patients were hospitalized, of whom 502 were admitted to the ICU (Supplementary Fig. 1). Overall, 1,803 (5.34%) deaths occurred among all individuals with a positive SARS-CoV-2 RT-PCR test, of whom 141 died later than 12 weeks from the first positive test (FPT) and were hence considered alive for this analysis. Right-censoring was only observed for patients tested after the 8th of December 2020, who had less than 12 weeks of follow-up available, while deaths that occurred on the same day as the FPT were not considered for training. For the initial model, demographics, laboratory test results, hospitalizations, vital parameters, diagnoses, medicines (ordered and administered) and summary features were included. Feature encoding resulted in 2,723 features (Supplementary Table 2), which after feature selection were reduced to 22 features. A summary of the cohort based on the final feature set can be found in Table 1. This cohort represents an updated subset of individuals residing in Denmark characterized in a previous publication.

Survival modelling with machine learning achieves high discriminative performance

To predict the risk of death within 12 weeks from FPT, we trained gradient boosting decision trees considering time as discrete in a time-to-event framework. Performance was measured on 20% of the data (test set) unblinded only for performance assessment. The weighted concordance index (C-index) for predicting risk of death for all 12 weeks with 95% confidence intervals (CI) was 0.946 (0.941-0.950). Binary metrics were calculated for each predicted week by excluding censored individuals (Fig. 2). At week 12, the precision-recall area under the curve (PR-AUC) and Matthews correlation coefficient (MCC) with 95% CI were 0.686 (0.651-0.720) and 0.580 (0.562-0.597) respectively. The sensitivity was 99.3% and the specificity was 86.4%. The performance for subgroups of patients displayed some differences. In patients tested outside the hospital (Fig. 2b), the C-index was 0.955 (0.950-0.960), and the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585 (0.562-0.605) respectively. A sensitivity of 98.9% and a specificity of 89.9% were measured in this group. For patients already admitted to the hospital at the time of the test (Fig. 2c), the C-index was 0.809 (0.787-0.829), and the PR-AUC and MCC were 0.705 (0.640-0.760) and 0.357 (0.325-0.387) respectively. The sensitivity was 100% and the specificity 31.0%, indicating a higher number of false positives when using a 0.5 probability threshold for this group (Supplementary Table 1).
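
To make the binary metrics above concrete, the following sketch computes sensitivity, specificity and the Matthews correlation coefficient from predicted probabilities at a 0.5 threshold. The data are made-up toy values, the function name is hypothetical, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity and MCC at a fixed probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0  # assumption: ties count as positive
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Toy example: 2 deaths and 4 survivors, with one survivor scored above 0.5.
y_true = [1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.1, 0.4]
metrics = binary_metrics(y_true, y_prob)
```

The toy case mirrors the pattern reported for admitted patients on a small scale: every death is flagged (sensitivity of 1.0) while a false positive lowers specificity, and MCC summarizes both in a single value.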
