Journal ArticleDOI

Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation.

TL;DR: Machine learning models for mortality and critical events in patients with COVID-19 were developed at different time horizons and validated externally and prospectively, and model interpretability was established to identify and rank the variables that drive model predictions.
Abstract:
Background: COVID-19 has infected millions of people worldwide and is responsible for several hundred thousand fatalities. The COVID-19 pandemic has necessitated thoughtful resource allocation and early identification of high-risk patients. However, effective methods to meet these needs are lacking.
Objective: The aims of this study were to analyze the electronic health records (EHRs) of patients who tested positive for COVID-19 and were admitted to hospitals in the Mount Sinai Health System in New York City; to develop machine learning models for making predictions about the hospital course of the patients over clinically meaningful time horizons based on patient characteristics at admission; and to assess the performance of these models at multiple hospitals and time points.
Methods: We used Extreme Gradient Boosting (XGBoost) and baseline comparator models to predict in-hospital mortality and critical events at time windows of 3, 5, 7, and 10 days from admission. Our study population included harmonized EHR data from five hospitals in New York City for 4098 COVID-19–positive patients admitted from March 15 to May 22, 2020. The models were first trained on patients from a single hospital (n=1514) before or on May 1, externally validated on patients from four other hospitals (n=2201) before or on May 1, and prospectively validated on all patients after May 1 (n=383). Finally, we established model interpretability to identify and rank variables that drive model predictions.
Results: Upon cross-validation, the XGBoost classifier outperformed baseline models, with an area under the receiver operating characteristic curve (AUC-ROC) for mortality of 0.89 at 3 days, 0.85 at 5 and 7 days, and 0.84 at 10 days. XGBoost also performed well for critical event prediction, with an AUC-ROC of 0.80 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. In external validation, XGBoost achieved an AUC-ROC of 0.88 at 3 days, 0.86 at 5 days, 0.86 at 7 days, and 0.84 at 10 days for mortality prediction. Similarly, the unimputed XGBoost model achieved an AUC-ROC of 0.78 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days for critical event prediction. Trends in performance on prospective validation sets were similar. At 7 days, acute kidney injury on admission, elevated LDH, tachypnea, and hyperglycemia were the strongest drivers of critical event prediction, while higher age, anion gap, and C-reactive protein were the strongest drivers of mortality prediction.
Conclusions: We externally and prospectively trained and validated machine learning models for mortality and critical events for patients with COVID-19 at different time horizons. These models identified at-risk patients and uncovered underlying relationships that predicted outcomes.
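The cohort design and the headline metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Admission` record, the hospital labels, and the function names are hypothetical, and the AUC-ROC is computed with the standard pairwise (Mann-Whitney) definition rather than any particular library.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record layout; the study's actual EHR schema is not given here.
@dataclass
class Admission:
    patient_id: str
    hospital: str       # site label (assumed), e.g. "A" for the training hospital
    admitted_on: date

# Admissions on or before May 1, 2020 feed training or external validation.
CUTOFF = date(2020, 5, 1)

def assign_cohort(adm, training_hospital):
    """Return 'train', 'external', or 'prospective' for one admission."""
    if adm.admitted_on > CUTOFF:
        return "prospective"   # all hospitals, admitted after May 1
    if adm.hospital == training_hospital:
        return "train"         # single training hospital, on or before May 1
    return "external"          # other hospitals, on or before May 1

def split_cohorts(admissions, training_hospital):
    """Group admissions into the three cohorts described in the abstract."""
    cohorts = {"train": [], "external": [], "prospective": []}
    for adm in admissions:
        cohorts[assign_cohort(adm, training_hospital)].append(adm)
    return cohorts

def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this definition, an AUC-ROC of 0.88 means that in roughly 88% of positive and negative patient pairs the model assigns the higher risk score to the patient who had the outcome. The pairwise loop is quadratic in cohort size, which is fine for illustration but would normally be replaced by a rank-based computation.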


Citations
Journal ArticleDOI
TL;DR: In this article, the authors proposed a machine learning (ML) method based on blood tests data to predict COVID-19 mortality risk using a powerful combination of five features: neutrophils, lymphocytes, lactate dehydrogenase (LDH), high-sensitivity C-reactive protein (hs-CRP), and age.
Abstract: The coronavirus disease 2019 (COVID-19), caused by the virus SARS-CoV-2, is an acute respiratory disease that has been classified as a pandemic by the World Health Organization (WHO). The sudden spike in the number of infections and high mortality rates have put immense pressure on the public healthcare systems. Hence, it is crucial to identify the key factors for mortality prediction to optimize patient treatment strategy. Different routine blood test results are widely available compared to other forms of data like X-rays, CT-scans, and ultrasounds for mortality prediction. This study proposes machine learning (ML) methods based on blood tests data to predict COVID-19 mortality risk. A powerful combination of five features: neutrophils, lymphocytes, lactate dehydrogenase (LDH), high-sensitivity C-reactive protein (hs-CRP), and age helps to predict mortality with 96% accuracy. Various ML models (neural networks, logistic regression, XGBoost, random forests, SVM, and decision trees) have been trained and their performance compared to determine the model that achieves consistently high accuracy across the days that span the disease. The best performing method, which uses XGBoost feature importance and neural network classification, predicts with an accuracy of 90% as early as 16 days before the outcome. Robust testing with three cases based on days to outcome confirms the strong predictive performance and practicality of the proposed model. A detailed analysis and identification of trends was performed using these key biomarkers to provide useful insights for intuitive application. This study provides solutions that would help accelerate the decision-making process in healthcare systems for focused medical treatments in an accurate, early, and reliable manner.

53 citations

Journal ArticleDOI
TL;DR: Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction.
Abstract: Between January and October of 2020, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus infected more than 34 million people in a pandemic that led to over one million deaths worldwide (data from the Johns Hopkins University). Since the virus began to spread, emergency departments were busy with COVID-19 patients for whom a quick decision regarding in- or outpatient care was required. The virus can cause characteristic abnormalities in chest radiographs (CXR), but, due to the low sensitivity of CXR, additional variables and criteria are needed to accurately predict risk. Here, we describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train a RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.

45 citations

Journal ArticleDOI
TL;DR: In this article, the authors built models to predict daily confirmed cases from time series data for America using both long short-term memory (LSTM) and extreme gradient boosting (XGBoost) algorithms, and used four error metrics (MAE, MSE, RMSE, and MAPE) to evaluate model fit.
Abstract: In this paper, we build models to predict daily confirmed cases from time series data for America by applying both the long short-term memory (LSTM) and extreme gradient boosting (XGBoost) algorithms, and use four error metrics (MAE, MSE, RMSE, and MAPE) to evaluate model fit. LSTM is applied to reliably estimate accuracy due to the long-term attribute and diversity of COVID-19 epidemic data. Using the XGBoost model, we conduct a sensitivity analysis to determine the robustness of the predictive model to parameter features. Our results reveal that reducing the contact rate between susceptible and infected individuals by isolating uninfected individuals can effectively reduce the number of daily confirmed cases. By combining restrictive social distancing and contact tracing, elimination of the ongoing COVID-19 pandemic is possible. Our predictions are based on real time series data with reasonable assumptions, whereas the accurate course of the epidemic heavily depends on how and when quarantine, isolation, and precautionary measures are enforced.
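The four error metrics named in this abstract have standard definitions, sketched below in plain Python. The function name and the sample numbers in the usage note are illustrative only and are not taken from the paper.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for two equal-length series.

    MAPE divides by the actual values, so it is undefined when any actual
    value is zero; this sketch raises an error in that case.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    mse = sum(e * e for e in errors) / n             # mean squared error
    rmse = math.sqrt(mse)                            # root mean squared error
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}
```

For example, with actual daily counts `[100, 200, 400]` and predictions `[110, 180, 400]`, the MAE is 10 cases and the MAPE is about 6.67%. MAE and RMSE are in the units of the series, while MAPE is scale-free, which is one reason forecasting studies often report several of these together.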

44 citations

Journal ArticleDOI
TL;DR: In this paper, the authors used automated machine learning (autoML) to train various machine learning algorithms to predict patients' chances of surviving a SARS-CoV-2 infection.
Abstract:
Background: During a pandemic, it is important for clinicians to stratify patients and decide who receives limited medical resources. Machine learning models have been proposed to accurately predict COVID-19 disease severity. Previous studies have typically tested only one machine learning algorithm and limited performance evaluation to area under the curve analysis. To obtain the best results possible, it may be important to test different machine learning algorithms to find the best prediction model.
Objective: In this study, we aimed to use automated machine learning (autoML) to train various machine learning algorithms. We selected the model that best predicted patients’ chances of surviving a SARS-CoV-2 infection. In addition, we identified which variables (ie, vital signs, biomarkers, comorbidities, etc) were the most influential in generating an accurate model.
Methods: Data were retrospectively collected from all patients who tested positive for COVID-19 at our institution between March 1 and July 3, 2020. We collected 48 variables from each patient within 36 hours before or after the index time (ie, real-time polymerase chain reaction positivity). Patients were followed for 30 days or until death. Patients’ data were used to build 20 machine learning models with various algorithms via autoML. The performance of machine learning models was measured by analyzing the area under the precision-recall curve (AUPRC). Subsequently, we established model interpretability via Shapley additive explanation and partial dependence plots to identify and rank variables that drove model predictions. Afterward, we conducted dimensionality reduction to extract the 10 most influential variables. AutoML models were retrained by only using these 10 variables, and the output models were evaluated against the model that used 48 variables.
Results: Data from 4313 patients were used to develop the models. The best model that was generated by using autoML and 48 variables was the stacked ensemble model (AUPRC=0.807). The two best independent models were the gradient boost machine and extreme gradient boost models, which had an AUPRC of 0.803 and 0.793, respectively. The deep learning model (AUPRC=0.73) was substantially inferior to the other models. The 10 most influential variables for generating high-performing models were systolic and diastolic blood pressure, age, pulse oximetry level, blood urea nitrogen level, lactate dehydrogenase level, D-dimer level, troponin level, respiratory rate, and Charlson comorbidity score. After the autoML models were retrained with these 10 variables, the stacked ensemble model still had the best performance (AUPRC=0.791).
Conclusions: We used autoML to develop high-performing models that predicted the survival of patients with COVID-19. In addition, we identified important variables that correlated with mortality. This is proof of concept that autoML is an efficient, effective, and informative method for generating machine learning–based clinical decision support tools.
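The area under the precision-recall curve used in this abstract can be estimated in several ways, and the abstract does not say which one the authors used. The sketch below uses average precision, one common step-wise estimator, in plain Python. The function name is hypothetical, and tied scores are assumed not to occur, since ties would be ordered arbitrarily here.

```python
def average_precision(labels, scores):
    """Average precision: the mean of the precision values measured at the
    rank of each positive case, after sorting cases by descending score.

    Assumes binary labels (1 = event, 0 = no event) and no tied scores.
    """
    total_pos = sum(1 for y in labels if y == 1)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision_sum += true_pos / rank   # precision at this recall step
    return precision_sum / total_pos
```

Unlike AUC-ROC, whose chance level is 0.5 regardless of class balance, the chance level of this metric is roughly the prevalence of the positive class, so AUPRC values are best compared between models evaluated on the same cohort.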

43 citations

Journal ArticleDOI
TL;DR: In this paper, the authors conducted a systematic literature review on published and preprint reports of Artificial Intelligence models developed and validated for screening, diagnosis and prognosis of the coronavirus disease 2019.
Abstract: The worldwide health crisis caused by the SARS-CoV-2 virus has resulted in more than 3 million deaths so far. Improving early screening, diagnosis and prognosis of the disease are critical steps in assisting healthcare professionals to save lives during this pandemic. Since WHO declared the COVID-19 outbreak as a pandemic, several studies have been conducted using Artificial Intelligence techniques to optimize these steps in clinical settings in terms of quality, accuracy and most importantly time. The objective of this study is to conduct a systematic literature review on published and preprint reports of Artificial Intelligence models developed and validated for screening, diagnosis and prognosis of the coronavirus disease 2019. We included 101 studies, published from January 1st, 2020 to December 30th, 2020, that developed AI prediction models which can be applied in the clinical setting. We identified in total 14 models for screening, 38 diagnostic models for detecting COVID-19 and 50 prognostic models for predicting ICU need, ventilator need, mortality risk, severity assessment or hospital length of stay. Moreover, 43 studies were based on medical imaging and 58 studies on the use of clinical parameters, laboratory results or demographic features. Several heterogeneous predictors derived from multimodal data were identified. Analysis of these multimodal data, captured from various sources, in terms of prominence for each category of the included studies, was performed. Finally, Risk of Bias (RoB) analysis was also conducted to examine the applicability of the included studies in the clinical setting and assist healthcare providers, guideline developers, and policymakers.

36 citations

References
Journal ArticleDOI
TL;DR: During the first 2 months of the current outbreak, Covid-19 spread rapidly throughout China and caused varying degrees of illness, and patients often presented without fever, and many did not have abnormal radiologic findings.
Abstract: Background Since December 2019, when coronavirus disease 2019 (Covid-19) emerged in Wuhan city and rapidly spread throughout China, data have been needed on the clinical characteristics of...

22,622 citations

Proceedings ArticleDOI
13 Aug 2016
TL;DR: XGBoost as discussed by the authors proposes a sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning to achieve state-of-the-art results on many machine learning challenges.
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

14,872 citations

Journal ArticleDOI
TL;DR: The findings reinforce the recommendation to strictly apply pharmacological thrombosis prophylaxis in all COVID-19 patients admitted to the ICU, and are strongly suggestive of increasing the prophylaxis towards high-prophylactic doses, even in the absence of randomized evidence.

3,886 citations


"Machine Learning to Predict Mortali..." refers background in this paper

  • ...With growing evidence of COVID-19–induced hypercoagulable states in these patients [41,45,46], it is promising that our model recognized the feature importance of coagulability markers such as D-dimer (Figure 4)....

    [...]

Journal ArticleDOI
TL;DR: This study conducted a retrospective multicenter study of 68 death cases and 82 discharged cases with laboratory-confirmed infection of SARS-CoV-2 and confirmed that some patients died of fulminant myocarditis, which is characterized by a rapid progress and a severe state of illness.
Abstract: Dear Editor, The rapid emergence of COVID-19 in Wuhan city, Hubei Province, China, has resulted in thousands of deaths [1]. Many infected patients, however, presented with mild flu-like symptoms and recovered quickly [2]. To effectively prioritize resources for patients with the highest risk, we identified clinical predictors of mild and severe patient outcomes. Using the database of Jin Yin-tan Hospital and Tongji Hospital, we conducted a retrospective multicenter study of 68 death cases (68/150, 45%) and 82 discharged cases (82/150, 55%) with laboratory-confirmed infection of SARS-CoV-2. Patients met the discharge criteria if they had no fever for at least 3 days, significantly improved respiratory function, and had negative SARS-CoV-2 laboratory test results twice in succession. Case data included demographics, clinical characteristics, laboratory results, treatment options and outcomes. For statistical analysis, we represented continuous measurements as means (SDs) or as medians (IQRs), which were compared using Student’s t test or the Mann–Whitney–Wilcoxon test. Categorical variables were expressed as numbers (%) and compared by the χ2 test or Fisher’s exact test. The distribution of the enrolled patients’ age is shown in Fig. 1a. There was a significant difference in age between the death group and the discharge group (p < 0.001) but no difference in the sex ratio (p = 0.43). A total of 63% (43/68) of patients in the death group and 41% (34/82) in the discharge group had underlying diseases (p = 0.0069). It should be noted that patients with cardiovascular diseases have a significantly increased risk of death when they are infected with SARS-CoV-2 (p < 0.001). A total of 16% (11/68) of the patients in the death group had secondary infections, and 1% (1/82) of the patients in the discharge group had secondary infections (p = 0.0018). 
Laboratory results showed that there were significant differences in white blood cell counts, absolute values of lymphocytes, platelets, albumin, total bilirubin, blood urea nitrogen, blood creatinine, myoglobin, cardiac troponin, C-reactive protein (CRP) and interleukin-6 (IL-6) between the two groups (Fig. 1b and Supplementary Table 1). The survival times of the enrolled patients in the death group were analyzed. The distribution of survival time from disease onset to death showed two peaks, with the first one at approximately 14 days (22 cases) and the second one at approximately 22 days (17 cases) (Fig. 1c). An analysis of the cause of death was performed. Among the 68 fatal cases, 36 patients (53%) died of respiratory failure, five patients (7%) with myocardial damage died of circulatory failure, 22 patients (33%) died of both, and the remaining five died of an unknown cause (Fig. 1d). Based on the analysis of the clinical data, we confirmed that some patients died of fulminant myocarditis. In this study, we first reported that the infection of SARS-CoV-2 may cause fulminant myocarditis. Given that fulminant myocarditis is characterized by a rapid progress and a severe state of illness [3], our results should alert physicians to pay attention not only to the symptoms of respiratory dysfunction but also the symptoms of cardiac injury.

3,868 citations


"Machine Learning to Predict Mortali..." refers background in this paper

  • ...At 7 days, age was the most important feature for mortality prediction in COVID-19–positive patients, with a notably rapid and nonlinear increase of feature contribution with increasing age (Figure 4) [33,34]....

    [...]

Journal ArticleDOI
23 Mar 2020-JAMA
TL;DR: Since then, the number of cases identified in Italy has rapidly increased, mainly in northern Italy, but all regions of the country have reported having patients with COVID-19, and Italy now has the second largest number of COVID-19 cases and also has a very high case-fatality rate.
Abstract: Only 3 cases of coronavirus disease 2019 (COVID-19) were identified in Italy in the first half of February 2020 and all involved people who had recently traveled to China. On February 20, 2020, a severe case of pneumonia due to SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) was diagnosed in northern Italy’s Lombardy region in a man in his 30s who had no history of possible exposure abroad. Within 14 days, many other cases of COVID-19 in the surrounding area were diagnosed, including a substantial number of critically ill patients [1]. On the basis of the number of cases and of the advanced stage of the disease it was hypothesized that the virus had been circulating within the population since January. Another cluster of patients with COVID-19 was simultaneously identified in Veneto, which borders Lombardy. Since then, the number of cases identified in Italy has rapidly increased, mainly in northern Italy, but all regions of the country have reported having patients with COVID-19. After China, Italy now has the second largest number of COVID-19 cases [2] and also has a very high case-fatality rate [3]. This Viewpoint reviews the Italian experience with COVID-19 with an emphasis on fatalities.

3,438 citations


"Machine Learning to Predict Mortali..." refers background in this paper

  • ...Despite substantial, organized efforts to prevent disease spread, over 23 million people have tested positive for SARS-CoV-2 worldwide, and the World Health Organization has reported more than 800,000 deaths from the virus to date [1-4]....

    [...]

Trending Questions (1)
How can machine learning be used to predict mortality in the intensive care unit?

Machine learning models can be trained on patient data to predict mortality in the intensive care unit based on patient characteristics at admission.