Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning

TL;DR: In this paper, a machine learning model was trained to predict mortality within 12 weeks of the first positive SARS-CoV-2 test, which can aid clinicians to implement precision medicine.
Abstract: Interpretable risk assessment of SARS-CoV-2 positive patients can aid clinicians to implement precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive SARS-CoV-2 test. By leveraging data on 33,928 confirmed SARS-CoV-2 cases in eastern Denmark, we considered 2,723 variables extracted from electronic health records (EHR) including demographics, diagnoses, medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling enabled us to predict personalized survival curves and explain individual risk factors. Performances of weighted concordance index 0.95 and precision-recall area under the curve 0.71 were measured on the test set. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of personalized survival probabilities in routine care.

Summary

INTRODUCTION

  • Coronavirus disease 2019 (COVID-19), caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has by October 2021 claimed almost 5 million lives since its outbreak in late 2019 [1].
  • Both vaccinated and unvaccinated individuals continue to develop critical COVID-19 disease [11].
  • Among hospitalized patients, risk factors for severe disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver parameters indicating organ dysfunction [6].
  • While great efforts have been put into providing prognostic models based on data collected from health systems, traditional modelling approaches solely based on domain knowledge may fail.
  • Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model explainability such as SHapley Additive exPlanations (SHAP) values [22].

Patient cohort

  • Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, the authors identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1).
  • The median of the predicted cumulative death probabilities by survival status reflected the discriminative performance of the individual survival predictions (Fig. 3a).
  • From the original set of 2,723 features generated from routine EHR data (Supplementary Table 2), 22 features were selected.
  • As expected, patients with more hospitalizations and longer cumulative admission days prior to the first positive test (FPT) exhibited a higher risk of death (Fig. 5e-f).

DISCUSSION

  • The authors developed an explainable machine learning model for predicting the risk of death within the first 12 weeks from a positive SARS-CoV-2 PCR test.
  • Additionally, instead of characterizing patients’ relevant history using a limited set of preselected variables, the set of 22 features in the final model was derived using a data-driven approach from an initial set of 2,723 features that encoded available demographics, laboratory test results, hospitalizations, vital parameters, diagnoses and medicines.
  • This has been the predominant modelling approach in COVID-19 related outcomes [18,34].
  • Multiple approaches have been proposed to open “black-box” models and allow explainability by, for example, removing features and measuring their impact on the model [43].
  • This suggests that predicting late deaths requires a different set of risk factors and consideration of their interactions than predicting early death.
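
The idea of attributing a prediction to individual features by adding and removing them can be made concrete with a small Python sketch of exact Shapley values. This is a brute-force illustration, not the SHAP library and not the authors' code: the function names, the two feature names and the toy risk numbers are all hypothetical. For each feature, the marginal change in model output is averaged over every subset of the other features, weighted by how often that subset precedes the feature in a random ordering.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values by enumerating all subsets (only feasible for few features)."""
    n = len(features)
    phi: Dict[str, float] = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to subset s
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(present: FrozenSet[str]) -> float:
    """Hypothetical risk score; all numbers are made up for illustration."""
    score = 0.05  # baseline risk with no feature information
    if "age" in present:
        score += 0.20
    if "lymphocytes" in present:
        score += 0.10
    if "age" in present and "lymphocytes" in present:
        score += 0.04  # interaction term, split between both features
    return score


phi = shapley_values(["age", "lymphocytes"], toy_risk)
```

In this toy case the attributions sum to the difference between the full prediction and the baseline, which is the additivity property that SHAP values rely on. Libraries such as SHAP use faster algorithms for tree models rather than this enumeration.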

CONCLUSION

  • The authors developed a data-driven machine learning model to identify SARS-CoV-2 positive patients with a high risk of death within 12 weeks of the first positive test.
  • The discrete-time modelling approach implemented not only allowed us to train survival models with high performance but also enabled model explainability through SHAP values.
  • By learning temporal dynamics and interactions between clinical features, the model was able to identify personalized risk factors and high-risk patients for early interventions while improving the understanding of the disease.
  • At the same time, the authors demonstrate that leveraging electronic health records with explainable ML models provides a framework for the implementation of precision medicine in routine care, which can be adapted to other diseases.

Adrian G. Zucco (1), Rudi Agius (2), Rebecka Svanberg (2), Kasper S. Moestrup (1), Ramtin Z. Marandi (1), Cameron Ross MacPherson (1), Jens Lundgren (1,4), Sisse R. Ostrowski (3,4)*, Carsten U. Niemann (2,4)*

(1) PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
(2) Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
(3) Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
(4) Department of Clinical Medicine, University of Copenhagen, Denmark.
*Co-senior authors.

Correspondence should be addressed to: A.G.Z (adrian.gabriel.zucco@regionh.dk), S.R.O (Sisse.Rye.Ostrowski@regionh.dk) or C.U.N (Carsten.Utoft.Niemann@regionh.dk).

medRxiv preprint doi: https://doi.org/10.1101/2021.10.28.21265598; this version posted October 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

INTRODUCTION
Coronavirus disease 2019 (COVID-19), caused by infection with severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2), has by October 2021 claimed almost 5 million lives since its outbreak in
late 2019 [1]. Infected individuals present a variety of symptoms, ranging from asymptomatic to
life-threatening disease [2]. Although the majority of cases experience mild to moderate disease,
approximately 15% of confirmed SARS-CoV-2 positive cases are estimated to develop severe
disease [3]. Progression to severe disease seems to occur within 1-2 weeks from symptom onset and
is characterized by clinical signs of pneumonia with dyspnea, increased respiratory rate, and
decreased blood oxygen saturation requiring supplemental oxygen [3-7]. Development of critical
illness is driven by systemic inflammation, leading to acute respiratory distress syndrome (ARDS),
respiratory failure, septic shock, multi-organ failure, and/or disseminated coagulopathy [4,5,8]. The
majority of these patients require mechanical ventilation, and mortality for patients admitted to an
Intensive Care Unit (ICU) is reported to be 32-50% [3,8-10]. Despite the current vaccination program,
both vaccinated and unvaccinated individuals continue to develop critical COVID-19 disease [11].
Thus, the pandemic still poses a great burden on health care systems worldwide, locally
approaching the limit of capacity due to high patient burden and challenging clinical management.
Several factors associated with increased risk of severe disease course have been established,
including old age, male gender, and lifestyle factors such as smoking and obesity [12,13].
Comorbidities including hypertension, type 2 diabetes, renal disease, as well as pre-existing
conditions of immune dysfunction and cancer, are also associated with a higher risk of severe disease
and COVID-19 related death [12,14-16]. Among hospitalized patients, risk factors for severe disease
or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver
parameters indicating organ dysfunction [6]. However, many of these factors likely reflect an ongoing
progression of COVID-19. Thus, identification of high-risk patients at or prior to hospital admission
is warranted to facilitate personalized interventions.
Multiple COVID-19 prognostic models have been built on reduced sets of predictive features from
demographics, patient history, physical examination, and laboratory results [17], processed by traditional
statistical frameworks or machine learning (ML) algorithms. A systematic review of 50 prognostic
models has concluded that overall such models have been poorly reported and are at a high risk of
bias [18]. While great efforts have been put into providing prognostic models based on data collected
from health systems, traditional modelling approaches solely based on domain knowledge may fail.
This represents a risk of missing novel markers and insights about the disease that could come from
data-driven models in a hypothesis-free manner [19], which have been reported to outperform models
based on curated variables from domain experts [20].
Furthermore, ML models facilitate clinical insights [21] when coupled with methods for model
explainability such as SHapley Additive exPlanations (SHAP) values [22]. Model explainability has been
developed mainly in the context of regression and binary classification, but in clinical research where
censored observations are common, explainable time-to-event modelling is required to avoid
selection bias [23,24]. Multiple ML algorithms have been developed for time-to-event modelling, either
by building on top of existing models such as Cox proportional hazards or by defining new loss
functions that model time as continuous [25]. Here we used an alternative approach that considered
time in discrete intervals and performed binary classification at such time intervals [26]. This allowed
us to implement gradient boosting decision trees for binary classification to predict personalized
survival probabilities [27] and allow explainability at the individual patient level using SHAP
values [22], including temporal dynamics of risk factors over the course of the disease. This approach
not only allows prediction of personalized survival probabilities and risk factors for SARS-CoV-2
positive patients but also provides a framework for precision medicine that can be applied to other
diseases based on routine electronic health records.
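
The discrete-time approach described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 1-based week indexing and the example hazards are assumptions made for illustration. Each patient is expanded into one row per week at risk, a binary classifier would then estimate the conditional probability of death in each week (the hazard), and the survival curve is the running product of one minus those hazards.

```python
from typing import List, Optional, Tuple


def expand_person_period(
    event_week: Optional[int], last_week: int, horizon: int = 12
) -> List[Tuple[int, int]]:
    """Expand one patient into (week, died_in_week) rows for binary classification.

    event_week: 1-based week of death, or None if no death was observed.
    last_week:  last week with follow-up, used when no death was observed.
    Rows stop at the event, at the end of follow-up, or at the horizon.
    """
    end = min(event_week if event_week is not None else last_week, horizon)
    return [(week, 1 if event_week == week else 0) for week in range(1, end + 1)]


def survival_curve(hazards: List[float]) -> List[float]:
    """Survival probability after each interval: S(t) = prod_{k<=t} (1 - h_k)."""
    curve = []
    surviving = 1.0
    for h in hazards:
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


# Patient who died in week 3: two "alive" rows, then one "death" row.
rows_death = expand_person_period(event_week=3, last_week=12)
# Patient censored after 5 weeks: five "alive" rows, nothing afterwards.
rows_censored = expand_person_period(event_week=None, last_week=5)
# Hand-written weekly hazards standing in for classifier outputs.
curve = survival_curve([0.10, 0.20, 0.05])
```

In a full pipeline the hazards would come from a classifier trained on the expanded rows, for example gradient boosted trees that take the week index as one of their inputs; here they are supplied by hand so that the arithmetic stays visible.
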
RESULTS
Patient cohort
Based on centralized EHR and SARS-CoV-2 test results from test centers in eastern Denmark, we
identified 33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265
individuals who had a test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1). In
this cohort, 5,077 patients were hospitalized, of whom 502 were admitted to the ICU (Supplementary
Fig. 1). Overall, 1,803 (5.34%) deaths occurred among all individuals with a positive SARS-CoV-2
RT-PCR test, of whom 141 died later than 12 weeks from the first positive test (FPT) and were hence
considered alive for this analysis. Right-censoring was only observed for patients tested after the
8th of December 2020, who had less than 12 weeks of follow-up available, while deaths that occurred
on the same day as the FPT were not considered for training. For the initial model, demographics,
laboratory test results, hospitalizations, vital parameters, diagnoses, medicines (ordered and
administered) and summary features were included. Feature encoding resulted in 2,723 features
(Supplementary Table 2) which after feature selection were reduced to 23 features. A summary of the
cohort based on the final feature set can be found in Table 1. This cohort represents an updated
subset of individuals residing in Denmark characterized in a previous publication [28].
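
The outcome rules in this paragraph can be written down as a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function name and status labels are hypothetical, the data cutoff of 2 March 2021 is inferred from the testing period given above, and counting a death on exactly day 84 as inside the window is an assumption about the boundary.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

HORIZON = timedelta(weeks=12)      # prediction window after the first positive test
DATA_CUTOFF = date(2021, 3, 2)     # assumed end of follow-up (end of the testing period)


def label_outcome(
    fpt: date, death: Optional[date], data_cutoff: date = DATA_CUTOFF
) -> Tuple[str, int]:
    """Return (status, observed_days) for one patient.

    fpt:   date of the first positive test.
    death: date of death, or None if no death was recorded.
    """
    horizon_end = fpt + HORIZON
    if death is not None and death == fpt:
        # Deaths on the day of the test were not used for training.
        return ("excluded_from_training", 0)
    if death is not None and death <= horizon_end:
        return ("died", (death - fpt).days)
    if data_cutoff < horizon_end:
        # Fewer than 12 weeks of follow-up available: right-censored.
        return ("censored", (data_cutoff - fpt).days)
    # Survived the window, including deaths that occurred after week 12.
    return ("alive", HORIZON.days)
```

With this cutoff, a patient tested on 8 December 2020 has exactly 12 weeks of follow-up and is labelled alive, while a patient tested one day later is censored after 83 days, which is consistent with the censoring date stated in the text.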
Survival modelling with machine learning achieves high discriminative performance
To predict the risk of death within 12 weeks from FPT, we trained gradient boosting decision trees
considering time as discrete in a time-to-event framework. Performance was measured on 20% of the
data (test set) unblinded only for performance assessment. The weighted concordance index
(C-index) for predicting risk of death for all 12 weeks with 95% confidence intervals (CI) was 0.946
(0.941-0.950). Binary metrics were calculated for each predicted week by excluding censored
individuals (Fig. 2). At week 12, the precision-recall area under the curve (PR-AUC) and Matthews
correlation coefficient (MCC) with 95% CI were 0.686 (0.651-0.720) and 0.580 (0.562-0.597)
respectively. The sensitivity was 99.3% and the specificity was 86.4%. The performance for subgroups
of patients displayed some differences. In patients tested outside the hospital (Fig. 2b), the C-index
was 0.955 (0.950-0.960), and the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585 (0.562-0.605)
respectively. In this group, sensitivity was 98.9% and specificity was 89.9%. For patients previously
admitted to the hospital at the time of test (Fig. 2c), the C-index was 0.809 (0.787-0.829), and the
PR-AUC and MCC were 0.705 (0.640-0.760) and 0.357 (0.325-0.387) respectively. The sensitivity was
100% and the specificity 31.0%, indicating a higher number of false positives when using a 0.5
probability threshold for this group (Supplementary Table 1).
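
The threshold-based metrics reported above (sensitivity, specificity and MCC) can be computed from a confusion matrix as in the following Python sketch. The function name and the six example patients are hypothetical and only serve to show the arithmetic; the snippet does not reproduce the study's numbers.

```python
import math
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Sensitivity, specificity and MCC after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Six made-up patients: two deaths, four survivors, one survivor flagged as high risk.
metrics = binary_metrics(
    y_true=[1, 1, 0, 0, 0, 0],
    y_prob=[0.9, 0.6, 0.7, 0.2, 0.1, 0.4],
)
```

Because MCC uses all four cells of the confusion matrix, a model can reach perfect sensitivity and still score a low MCC when many survivors are flagged, which is the pattern the text reports for patients already admitted at the time of test.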