
Showing papers by "Greg S. Corrado published in 2021"


Journal ArticleDOI
01 Jan 2021
TL;DR: In this article, two versions of a deep-learning system were used to predict the development of diabetic retinopathy in patients with diabetes who had undergone teleretinal diabetic retinopathy screening in a primary care setting.
Abstract: Background: Diabetic retinopathy screening is instrumental to preventing blindness, but scaling up screening is challenging because of the increasing number of patients with all forms of diabetes. We aimed to create a deep-learning system to predict the risk of patients with diabetes developing diabetic retinopathy within 2 years.

Methods: We created and validated two versions of a deep-learning system to predict the development of diabetic retinopathy in patients with diabetes who had undergone teleretinal diabetic retinopathy screening in a primary care setting. The input for the two versions was either a set of three-field or one-field colour fundus photographs. Of the 575 431 eyes in the development set, 28 899 had known outcomes, with the remaining 546 532 eyes used to augment the training process via multitask learning. Validation was done on one eye (selected at random) per patient from two datasets: an internal validation set (from EyePACS, a teleretinal screening service in the USA) of 3678 eyes with known outcomes and an external validation set (from Thailand) of 2345 eyes with known outcomes.

Findings: The three-field deep-learning system had an area under the receiver operating characteristic curve (AUC) of 0·79 (95% CI 0·77–0·81) in the internal validation set. Assessment of the external validation set, which contained only one-field colour fundus photographs, with the one-field deep-learning system gave an AUC of 0·70 (0·67–0·74). In the internal validation set, the AUC of available risk factors was 0·72 (0·68–0·76), which improved to 0·81 (0·77–0·84) after combining the deep-learning system with these risk factors (p

Interpretation: The deep-learning systems predicted diabetic retinopathy development using colour fundus photographs, and the systems were independent of and more informative than available risk factors. Such a risk stratification tool might help to optimise screening intervals to reduce costs while improving vision-related outcomes.

Funding: Google.
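As a rough illustration of the combination step in the findings (a minimal sketch with synthetic data, not the authors' pipeline), a deep-learning score can be concatenated with clinical risk factors in a logistic regression and the AUCs compared; the variable names below are assumptions:

```python
# Minimal sketch (not the authors' code): combining a deep-learning risk
# score with clinical risk factors in a logistic regression, then comparing
# AUCs. All data are synthetic; feature names are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 3678  # size of the internal validation set quoted in the abstract

# Synthetic stand-ins: a DLS score, two risk factors, 2-year DR outcomes.
outcome = rng.binomial(1, 0.15, size=n)
dls_score = rng.normal(outcome * 0.8, 1.0)            # correlated with outcome
hba1c = rng.normal(7.0 + outcome * 0.7, 1.2)          # assumed risk factor
years_with_diabetes = rng.normal(8 + outcome * 2, 4)  # assumed risk factor

risk_factors = np.column_stack([hba1c, years_with_diabetes])
combined = np.column_stack([dls_score, hba1c, years_with_diabetes])

# For brevity we evaluate in-sample; a real analysis would use held-out data.
for name, X in [("risk factors only", risk_factors),
                ("DLS + risk factors", combined)]:
    model = LogisticRegression(max_iter=1000).fit(X, outcome)
    auc = roc_auc_score(outcome, model.predict_proba(X)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")
```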

88 citations


Journal ArticleDOI
19 Apr 2021
TL;DR: In this article, a deep learning system was developed for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides).
Abstract: Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R2 = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R2 of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially novel prognostic features that can be reliably identified by people for future validation studies.
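The variance-explanation analysis can be sketched as follows, under illustrative assumptions (synthetic embeddings, k-means clustering, per-case cluster proportions as the human-interpretable features); this is not the authors' pipeline:

```python
# Sketch: cluster patch embeddings with k-means, summarize each case by its
# cluster-membership proportions, and measure how much of the variance in
# case-level DLS scores those proportions explain (R^2). Synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_cases, patches_per_case, dim, n_clusters = 200, 50, 64, 10

embeddings = rng.normal(size=(n_cases * patches_per_case, dim))
case_ids = np.repeat(np.arange(n_cases), patches_per_case)

clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(embeddings)

# Per-case histogram of cluster membership (rows sum to 1).
props = np.zeros((n_cases, n_clusters))
np.add.at(props, (case_ids, clusters), 1.0)
props /= props.sum(axis=1, keepdims=True)

# Synthetic DLS score driven partly by one cluster's prevalence, plus noise.
dls_score = 2.0 * props[:, 0] + rng.normal(scale=0.1, size=n_cases)

r2 = LinearRegression().fit(props, dls_score).score(props, dls_score)
print(f"R^2 of cluster proportions vs DLS score: {r2:.2f}")
```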

67 citations


Journal ArticleDOI
14 Jul 2021
TL;DR: Interpretability analyses show known biomarker-histomorphology associations including associations of low-grade and lobular histology with ER/PR positivity, and increased inflammatory infiltrates with triple-negative staining.
Abstract: Breast cancer management depends on biomarkers including estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (ER/PR/HER2). Though existing scoring systems are widely used and well-validated, they can involve costly preparation and variable interpretation. Additionally, discordances between histology and expected biomarker findings can prompt repeat testing to address biological, interpretative, or technical reasons for unexpected results. We developed three independent deep learning systems (DLS) to directly predict ER/PR/HER2 status for both focal tissue regions (patches) and slides using hematoxylin-and-eosin-stained (H&E) images as input. Models were trained and evaluated using pathologist-annotated slides from three data sources. Areas under the receiver operating characteristic curve (AUCs) were calculated for test sets at both a patch level (>135 million patches, 181 slides) and slide level (n = 3274 slides, 1249 cases, 37 sites). Interpretability analyses were performed using Testing with Concept Activation Vectors (TCAV), saliency analysis, and pathologist review of clustered patches. The patch-level AUCs are 0.939 (95% CI 0.936–0.941), 0.938 (0.936–0.940), and 0.808 (0.802–0.813) for ER/PR/HER2, respectively. At the slide level, AUCs are 0.86 (95% CI 0.84–0.87), 0.75 (0.73–0.77), and 0.60 (0.56–0.64) for ER/PR/HER2, respectively. Interpretability analyses show known biomarker-histomorphology associations, including associations of low-grade and lobular histology with ER/PR positivity, and increased inflammatory infiltrates with triple-negative staining. This study presents rapid breast cancer biomarker estimation from routine H&E slides and builds on prior advances by prioritizing interpretability of computationally learned features in the context of existing pathological knowledge.

Breast cancer diagnosis and characterization involve evaluation of marker proteins found inside or on the surface of tumor cells. Three of the most important markers are estrogen receptor (ER), progesterone receptor (PR) and a receptor called HER2. The levels of these markers can influence how a person with breast cancer is treated in the clinic. This study explored the ability of machine learning (whereby computer software is trained to recognise and classify particular image features) to determine the status of these markers in digitized images, without the need for dedicated biomarker staining. Our results demonstrate that machine learning can automatically predict the status of ER, PR and HER2 in pathology images, and further testing identifies specific image features which enable these predictions. This type of approach may decrease costs and timelines and enable improved quality control in marker detection.

Gamble and Jaroensri et al. develop deep learning systems to predict breast cancer biomarker status using H&E images. Their models enable slide-level and patch-level predictions for ER, PR and HER2, with interpretability analyses highlighting specific histological features associated with these markers.
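One simple way to obtain the slide-level scores underlying the slide-level AUCs is to average patch-level predictions per slide; the sketch below uses that aggregation rule as an assumption (the paper's own aggregation may differ), with synthetic data:

```python
# Minimal sketch, not the published model: average patch scores per slide
# and compute the slide-level AUC for a binary biomarker label.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_slides, patches_per_slide = 300, 200

slide_label = rng.binomial(1, 0.7, size=n_slides)  # e.g. ER status (assumed)
# Patch scores loosely reflect the slide label, with heavy patch-level noise.
patch_scores = rng.beta(1 + 3 * slide_label[:, None],
                        1 + 3 * (1 - slide_label[:, None]),
                        size=(n_slides, patches_per_slide))

slide_scores = patch_scores.mean(axis=1)  # average-pooling aggregation
print(f"slide-level AUC: {roc_auc_score(slide_label, slide_scores):.2f}")
```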

40 citations


Journal ArticleDOI
01 Apr 2021
TL;DR: In this article, an artificial intelligence (AI)-based assistive tool for interpreting clinical images and associated medical history was evaluated for the diagnosis of 120 different skin conditions in a multiple-reader, multiple-case diagnostic study.
Abstract: Importance: Most dermatologic cases are initially evaluated by nondermatologists such as primary care physicians (PCPs) or nurse practitioners (NPs).

Objective: To evaluate an artificial intelligence (AI)-based tool that assists with diagnoses of dermatologic conditions.

Design, Setting, and Participants: This multiple-reader, multiple-case diagnostic study developed an AI-based tool and evaluated its utility. Primary care physicians and NPs retrospectively reviewed an enriched set of cases representing 120 different skin conditions. Randomization was used to ensure each clinician reviewed each case either with or without AI assistance; each clinician alternated between batches of 50 cases in each modality. The reviews occurred from February 21 to April 28, 2020. Data were analyzed from May 26, 2020, to January 27, 2021.

Exposures: An AI-based assistive tool for interpreting clinical images and associated medical history.

Main Outcomes and Measures: The primary analysis evaluated agreement with reference diagnoses provided by a panel of 3 dermatologists for PCPs and NPs. Secondary analyses included diagnostic accuracy for biopsy-confirmed cases, biopsy and referral rates, review time, and diagnostic confidence.

Results: Forty board-certified clinicians, including 20 PCPs (14 women [70.0%]; mean experience, 11.3 [range, 2-32] years) and 20 NPs (18 women [90.0%]; mean experience, 13.1 [range, 2-34] years), reviewed 1048 retrospective cases (672 female [64.2%]; median age, 43 [interquartile range, 30-56] years; 41 920 total reviews) from a teledermatology practice serving 11 sites and provided 0 to 5 differential diagnoses per case (mean [SD], 1.6 [0.7]). The PCPs were located across 12 states, and the NPs practiced in primary care without physician supervision across 9 states. Artificial intelligence assistance was significantly associated with higher agreement with reference diagnoses. For PCPs, the increase in diagnostic agreement was 10% (95% CI, 8%-11%; P

Conclusions and Relevance: Artificial intelligence assistance was associated with improved diagnoses by PCPs and NPs for 1 in every 8 to 10 cases, indicating potential for improving the quality of dermatologic care.
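The primary endpoint (agreement of a clinician's differential with the panel's reference diagnosis) can be approximated by a top-k agreement metric; this sketch is one reading of the abstract, with hypothetical diagnosis names:

```python
# Hedged sketch: a review "agrees" when the reference diagnosis appears in
# the clinician's (up to 5-item) differential. Data are illustrative.
def top_k_agreement(differentials, references, k=3):
    """Fraction of cases whose reference diagnosis is in the top-k differential."""
    hits = sum(ref in diff[:k] for diff, ref in zip(differentials, references))
    return hits / len(references)

differentials = [
    ["eczema", "psoriasis"],
    ["melanoma", "seborrheic keratosis", "nevus"],
    ["acne"],
]
references = ["psoriasis", "nevus", "rosacea"]

print(f"top-1 agreement: {top_k_agreement(differentials, references, k=1):.2f}")
print(f"top-3 agreement: {top_k_agreement(differentials, references, k=3):.2f}")
```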

37 citations


Journal ArticleDOI
01 Jun 2021-PLOS ONE
TL;DR: In this paper, the authors quantify the impact of COVID-19 social distancing policies across 27 European countries in spring 2020 on population mobility and the subsequent trajectory of disease, and find that mandatory stay-at-home orders and workplace closures had the largest impacts on population mobility and subsequent cases at the onset of the pandemic.
Abstract: Background: Social distancing measures have been widely used to mitigate community spread of SARS-CoV-2. We sought to quantify the impact of COVID-19 social distancing policies across 27 European countries in spring 2020 on population mobility and the subsequent trajectory of disease.

Methods: We obtained data on national social distancing policies from the Oxford COVID-19 Government Response Tracker and aggregated and anonymized mobility data from Google. We used a pre-post comparison and two linear mixed-effects models to first assess the relationship between implementation of national policies and observed changes in mobility, and then to assess the relationship between changes in mobility and rates of COVID-19 infections in subsequent weeks.

Results: Compared to a pre-COVID baseline, Spain saw the largest decrease in aggregate population mobility (~70%), as measured by the time spent away from residence, while Sweden saw the smallest decrease (~20%). The largest declines in mobility were associated with mandatory stay-at-home orders, followed by mandatory workplace closures, school closures, and non-mandatory workplace closures. While mandatory shelter-in-place orders were associated with 16.7% less mobility (95% CI: -23.7% to -9.7%), non-mandatory orders were only associated with an 8.4% decrease (95% CI: -14.9% to -1.8%). Large-gathering bans were associated with the smallest change in mobility compared with other policy types. Changes in mobility were in turn associated with changes in COVID-19 case growth. For example, a 10% decrease in time spent away from places of residence was associated with 11.8% (95% CI: 3.8%, 19.1%) fewer new COVID-19 cases.

Discussion: This comprehensive evaluation across Europe suggests that mandatory stay-at-home orders and workplace closures had the largest impacts on population mobility and subsequent COVID-19 cases at the onset of the pandemic. With a better understanding of policies' relative performance, countries can more effectively invest in, and target, early nonpharmacological interventions.
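The second of the two mixed-effects models (mobility to subsequent case growth) can be sketched as follows; this uses statsmodels on synthetic data with a random intercept per country and is not the authors' exact specification:

```python
# Rough sketch under stated assumptions: a linear mixed-effects model
# relating weekly COVID-19 case growth to the prior week's mobility change,
# with a random intercept per country. Synthetic data; names illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for i in range(27):  # 27 countries, as in the abstract
    country_effect = rng.normal(scale=0.05)
    for week in range(12):
        mobility_change = rng.uniform(-0.7, 0.0)  # fraction vs baseline
        case_growth = (0.3 + 1.1 * mobility_change + country_effect
                       + rng.normal(scale=0.1))   # assumed log weekly growth
        rows.append({"country": f"country_{i}",
                     "mobility_change": mobility_change,
                     "case_growth": case_growth})
df = pd.DataFrame(rows)

model = smf.mixedlm("case_growth ~ mobility_change", df, groups=df["country"])
result = model.fit()
print(result.summary())
```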

30 citations


Journal ArticleDOI
30 Jun 2021
TL;DR: In this article, Wulczyn et al. developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5-25 years of follow-up (median: 13, interquartile range 9-17).
Abstract: Gleason grading of prostate cancer is an important prognostic factor, but suffers from poor reproducibility, particularly among non-subspecialist pathologists. Although artificial intelligence (A.I.) tools have demonstrated Gleason grading on-par with expert pathologists, it remains an open question whether and to what extent A.I. grading translates to better prognostication. In this study, we developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5–25 years of follow-up (median: 13, interquartile range 9–17). Here, we show that the A.I.'s risk scores produced a C-index of 0.84 (95% CI 0.80–0.87) for prostate cancer-specific mortality. Upon discretizing these risk scores into risk groups analogous to pathologist Grade Groups (GG), the A.I. has a C-index of 0.82 (95% CI 0.78–0.85). On the subset of cases with a GG provided in the original pathology report (n = 1517), the A.I.'s C-indices are 0.87 and 0.85 for continuous and discrete grading, respectively, compared to 0.79 (95% CI 0.71–0.86) for GG obtained from the reports. These represent improvements of 0.08 (95% CI 0.01–0.15) and 0.07 (95% CI 0.00–0.14), respectively. Our results suggest that A.I.-based Gleason grading can lead to effective risk stratification, and warrants further evaluation for improving disease management.

Gleason grading is the process by which pathologists assess the morphology of prostate tumors. The assigned Grade Group tells us about the likely clinical course of people with prostate cancer and helps doctors to make decisions on treatment. The process is complex and subjective, with frequent disagreement amongst pathologists. In this study, we develop and evaluate an approach to Gleason grading based on artificial intelligence, rather than pathologists' assessment, to predict risk of dying of prostate cancer. Looking back at tumors and data from 2,807 people diagnosed with prostate cancer, we find that our approach is better at predicting outcomes compared to grading by pathologists alone. These findings suggest that artificial intelligence might help doctors to accurately determine the probable clinical course of people with prostate cancer, which, in turn, will guide treatment.

Wulczyn et al. utilise a deep learning-based Gleason grading model to predict prostate cancer-specific mortality in a retrospective cohort of radical prostatectomy patients. Their model enables improved risk stratification compared to pathologists' grading and demonstrates the potential for computational pathology in the management of prostate cancer.
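The C-index reported here is Harrell's concordance index for right-censored survival data; a textbook implementation (not the authors' code) looks like this:

```python
# Standard Harrell's C-index: the fraction of comparable patient pairs in
# which the patient with the higher predicted risk dies first.
import numpy as np

def c_index(times, events, risk_scores):
    """times: follow-up time; events: 1 if death observed, 0 if censored;
    risk_scores: higher = predicted higher mortality risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # pairs are anchored on an observed event
        for j in range(n):
            if times[j] > times[i]:  # j outlived i: a comparable pair
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties count half
    return concordant / comparable

times = np.array([5.0, 8.0, 3.0, 10.0, 7.0])
events = np.array([1, 0, 1, 0, 1])
scores = np.array([0.9, 0.2, 0.8, 0.1, 0.5])
print(f"C-index: {c_index(times, events, scores):.2f}")  # 0.89 on this toy data
```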

15 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed and evaluated an AI system to classify chest radiography (CXR) as normal or abnormal, using a de-identified dataset of 248,445 patients from a multi-city hospital network in India.
Abstract: Chest radiography (CXR) is the most widely used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to detect every possible condition by building multiple separate systems, each of which detects one or more pre-specified conditions. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For training and tuning the system, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system trained using a large dataset containing a diverse array of CXR abnormalities generalizes to new patient populations and unseen diseases. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases was reduced by 7-28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist. Lastly, to facilitate the continued development of AI models for CXR, we release our collected labels for the publicly available dataset.
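The simulated workflow can be illustrated with a small queueing sketch: AI-flagged cases jump to the front of a first-come-first-served worklist, and the mean turnaround for truly abnormal cases is compared. The flag sensitivity/specificity and read time below are assumptions, not the paper's numbers:

```python
# Illustrative triage simulation, not the paper's exact workflow.
import numpy as np

rng = np.random.default_rng(0)
n_cases, read_minutes = 200, 5.0

abnormal = rng.binomial(1, 0.3, size=n_cases)
# Imperfect AI flag: assumed 95% sensitivity, 80% specificity.
flag = np.where(abnormal == 1,
                rng.binomial(1, 0.95, n_cases),
                rng.binomial(1, 0.20, n_cases))

def mean_abnormal_turnaround(order):
    # finish[k] is the completion time of the k-th case read in this order.
    finish = (np.arange(n_cases) + 1) * read_minutes
    turnaround = np.empty(n_cases)
    turnaround[order] = finish
    return turnaround[abnormal == 1].mean()

fifo = np.arange(n_cases)
prioritized = np.concatenate([np.flatnonzero(flag == 1),
                              np.flatnonzero(flag == 0)])

t_fifo = mean_abnormal_turnaround(fifo)
t_ai = mean_abnormal_turnaround(prioritized)
print(f"FIFO: {t_fifo:.0f} min, AI-prioritized: {t_ai:.0f} min "
      f"({100 * (t_fifo - t_ai) / t_fifo:.0f}% reduction)")
```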

14 citations


Journal ArticleDOI
TL;DR: In this paper, the DEEP2 system was trained on 3,611 hours of colonoscopy videos derived from two sources and was validated on a set comprising 1,393 hours of videos from a third, unrelated source.

12 citations


Posted Content
TL;DR: In this paper, a hierarchical outlier detection (HOD) loss is proposed to assign multiple abstention classes for each training outlier class and jointly perform a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes.
Abstract: We develop and rigorously evaluate a deep learning based system that can accurately classify skin conditions while detecting rare conditions for which there is not enough data available for training a confident classifier. We frame this task as an out-of-distribution (OOD) detection problem. Our novel approach, hierarchical outlier detection (HOD), assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate the effectiveness of the HOD loss in conjunction with modern representation learning approaches (BiT, SimCLR, MICLe) and explore different ensembling strategies for further improving the results. We perform an extensive subgroup analysis over conditions of varying risk levels and different skin types to investigate how the OOD detection performance changes over each subgroup and demonstrate the gains of our framework in comparison to baselines. Finally, we introduce a cost metric to approximate downstream clinical impact. We use this cost metric to compare the proposed method against a baseline system, thereby making a stronger case for the overall system effectiveness in a real-world deployment scenario.
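A minimal PyTorch sketch of the HOD loss as described in the abstract (the weighting term `lam` is an assumption, not a published value): a fine-grained cross-entropy over all inlier and abstention classes, plus a coarse inlier-vs-outlier cross-entropy obtained by summing the abstention-class probabilities.

```python
import torch
import torch.nn.functional as F

def hod_loss(logits: torch.Tensor, labels: torch.Tensor,
             num_inlier_classes: int, lam: float = 0.1) -> torch.Tensor:
    """logits: [batch, num_inlier + num_abstention]; labels index all
    classes, with indices >= num_inlier_classes denoting outlier classes."""
    # Fine-grained term over every individual class.
    fine = F.cross_entropy(logits, labels)
    # Coarse term: probability mass on the abstention (outlier) classes.
    probs = logits.softmax(dim=-1)
    p_outlier = probs[:, num_inlier_classes:].sum(dim=-1).clamp(1e-7, 1 - 1e-7)
    is_outlier = (labels >= num_inlier_classes).float()
    coarse = F.binary_cross_entropy(p_outlier, is_outlier)
    return fine + lam * coarse

# Toy usage: 4 inlier classes, 3 abstention classes for training outliers.
logits = torch.randn(8, 7, requires_grad=True)
labels = torch.tensor([0, 3, 5, 1, 6, 2, 4, 0])
loss = hod_loss(logits, labels, num_inlier_classes=4)
loss.backward()
print(loss.item())
```

At inference, the summed abstention-class probability can serve as the OOD score for flagging unseen conditions.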

9 citations


Posted ContentDOI
22 Jun 2021-medRxiv
TL;DR: Results validate the accuracy of smartphone camera-based techniques to measure heart rate (HR) and respiratory rate (RR) across a range of pre-defined subgroups.
Abstract: Measuring vital signs plays a key role in both patient care and wellness, but can be challenging outside of medical settings due to the lack of specialized equipment. In this study, we prospectively evaluated smartphone camera-based techniques for measuring heart rate (HR) and respiratory rate (RR) for consumer wellness use. HR was measured by placing the finger over the rear-facing camera, while RR was measured via a video of the participants sitting still in front of the front-facing camera. In the HR study of 95 participants (with a protocol that included both measurements at rest and post exercise), the mean absolute percent error (MAPE) ± standard deviation of the measurement was 1.6% ± 4.3%, which was significantly lower than the pre-specified goal of 5%. No significant differences in the MAPE were present across colorimeter-measured skin-tone subgroups: 1.8% ± 4.5% for very light to intermediate, 1.3% ± 3.3% for tan and brown, and 1.8% ± 4.9% for dark. In the RR study of 50 participants, the mean absolute error (MAE) was 0.78 ± 0.61 breaths/min, which was significantly lower than the pre-specified goal of 3 breaths/min. The MAE was low in both healthy participants (0.70 ± 0.67 breaths/min) and participants with chronic respiratory conditions (0.80 ± 0.60 breaths/min). Our results validate that smartphone camera-based techniques can accurately measure HR and RR across a range of pre-defined subgroups.
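For reference, the two error metrics quoted above (MAPE for HR, MAE for RR) are computed as follows; the data below are illustrative, not the study's measurements:

```python
import numpy as np

def mape(estimates, references):
    """Mean absolute percent error, in percent."""
    return 100 * np.mean(np.abs(estimates - references) / references)

def mae(estimates, references):
    """Mean absolute error, in the units of the measurement."""
    return np.mean(np.abs(estimates - references))

hr_ref = np.array([62.0, 75.0, 110.0, 95.0])  # reference HR (bpm)
hr_est = np.array([63.0, 74.0, 112.0, 96.0])  # camera-based estimates
rr_ref = np.array([14.0, 18.0, 12.0, 22.0])   # reference RR (breaths/min)
rr_est = np.array([14.5, 17.5, 12.4, 21.2])

print(f"HR MAPE: {mape(hr_est, hr_ref):.1f}%")
print(f"RR MAE: {mae(rr_est, rr_ref):.2f} breaths/min")
```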

3 citations


Posted Content
TL;DR: In this article, a deep learning system was trained to detect active pulmonary TB using CXRs from 9 countries across Africa, Asia, and Europe, and utilized large-scale CXR pretraining, attention pooling, and noisy student semi-supervised learning.
Abstract: Tuberculosis (TB) is a top-10 cause of death worldwide. Though the WHO recommends chest radiographs (CXRs) for TB screening, the limited availability of CXR interpretation is a barrier. We trained a deep learning system (DLS) to detect active pulmonary TB using CXRs from 9 countries across Africa, Asia, and Europe, and utilized large-scale CXR pretraining, attention pooling, and noisy student semi-supervised learning. Evaluation was on (1) a combined test set spanning China, India, US, and Zambia, and (2) an independent mining population in South Africa. Given WHO targets of 90% sensitivity and 70% specificity, the DLS's operating point was prespecified to favor sensitivity over specificity. On the combined test set, the DLS's ROC curve was above the operating points of all 9 India-based radiologists, with an AUC of 0.90 (95% CI 0.87-0.92). The DLS's sensitivity (88%) was higher than the India-based radiologists' (75% mean sensitivity), p<0.001 for superiority; and its specificity (79%) was non-inferior to the radiologists' (84% mean specificity), p=0.004. Similar trends were observed within HIV-positive and sputum-smear-positive subgroups, and in the South Africa test set. We found that 5 US-based radiologists (where TB isn't endemic) were more sensitive and less specific than the India-based radiologists (where TB is endemic). The DLS also remained non-inferior to the US-based radiologists. In simulations, using the DLS as a prioritization tool for confirmatory testing reduced the cost per positive case detected by 40-80% compared to using confirmatory testing alone. To conclude, our DLS generalized to 5 countries, and merits prospective evaluation to assist cost-effective screening efforts in radiologist-limited settings. Operating point flexibility may permit customization of the DLS to account for site-specific factors such as TB prevalence, demographics, clinical resources, and customary practice patterns.
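Prespecifying an operating point that favors sensitivity can be sketched as a threshold sweep on a tuning set: take the highest score threshold whose sensitivity still meets the target, then read off the resulting specificity. The scores below are synthetic, and this is not the authors' exact procedure:

```python
# Hedged sketch of operating-point selection for a sensitivity target,
# in the spirit of the WHO 90%-sensitivity goal. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.2, size=2000)  # 1 = active TB (synthetic)
scores = np.clip(rng.normal(0.3 + 0.4 * labels, 0.18), 0, 1)

def pick_operating_point(scores, labels, target_sensitivity=0.90):
    # Sweep thresholds from high to low; the first that meets the target
    # is the highest such threshold, which maximizes specificity.
    for t in np.unique(scores)[::-1]:
        preds = scores >= t
        sensitivity = preds[labels == 1].mean()
        if sensitivity >= target_sensitivity:
            specificity = (~preds)[labels == 0].mean()
            return t, sensitivity, specificity
    return None  # unreachable: the lowest threshold flags everyone

t, sens, spec = pick_operating_point(scores, labels)
print(f"threshold={t:.3f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```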