
Showing papers in "Radiology in 2022"


Journal ArticleDOI
TL;DR: Transformer-based language models tailored to radiology had improved performance of radiology NLP tasks compared with baseline transformer language models.
Abstract: Purpose To investigate if tailoring a transformer-based language model to radiology is beneficial for radiology natural language processing (NLP) applications. Materials and Methods This retrospective study presents a family of bidirectional encoder representations from transformers (BERT)-based language models adapted for radiology, named RadBERT. Transformers were pretrained with either 2.16 or 4.42 million radiology reports from U.S. Department of Veterans Affairs health care systems nationwide on top of four different initializations (BERT-base, Clinical-BERT, robustly optimized BERT pretraining approach [RoBERTa], and BioMed-RoBERTa) to create six variants of RadBERT. Each variant was fine-tuned for three representative NLP tasks in radiology: (a) abnormal sentence classification: models classified sentences in radiology reports as reporting abnormal or normal findings; (b) report coding: models assigned a diagnostic code to a given radiology report for five coding systems; and (c) report summarization: given the findings section of a radiology report, models selected key sentences that summarized the findings. Model performance was compared by bootstrap resampling with five intensively studied transformer language models as baselines: BERT-base, BioBERT, Clinical-BERT, BlueBERT, and BioMed-RoBERTa. Results For abnormal sentence classification, all models performed well (accuracies above 97.5 and F1 scores above 95.0). RadBERT variants achieved significantly higher scores than corresponding baselines when given only 10% or less of 12 458 annotated training sentences. For report coding, all variants outperformed baselines significantly for all five coding systems. The variant RadBERT-BioMed-RoBERTa performed the best among all models for report summarization, achieving a Recall-Oriented Understudy for Gisting Evaluation-1 score of 16.18 compared with 15.27 by the corresponding baseline (BioMed-RoBERTa, P < .004). Conclusion Transformer-based language models tailored to radiology had improved performance of radiology NLP tasks compared with baseline transformer language models. Keywords: Translation, Unsupervised Learning, Transfer Learning, Neural Networks, Informatics Supplemental material is available for this article. © RSNA, 2022. See also the commentary by Wiggins and Tejani in this issue.
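For orientation, a minimal sketch of the fine-tuning pattern described for task (a), abnormal sentence classification, is shown below using the Hugging Face transformers Trainer; the bert-base-uncased checkpoint and the two toy sentences are stand-ins, not the RadBERT weights or the Veterans Affairs report data.

```python
# Sketch: fine-tune a BERT-family encoder for abnormal-vs-normal sentence
# classification, analogous to task (a) above. Checkpoint and sentences are
# placeholders; the actual RadBERT weights and data are not assumed here.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # a radiology-adapted checkpoint could be swapped in
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy labeled sentences: 1 = abnormal finding reported, 0 = normal.
data = Dataset.from_dict({
    "text": ["No acute intracranial hemorrhage.",
             "There is a 2.1 cm spiculated mass in the right upper lobe."],
    "label": [0, 1],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=128),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="radbert_demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```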

23 citations


Journal ArticleDOI
TL;DR: This report focuses on four aspects of model development where bias may arise: data augmentation, model and loss function, optimizers, and transfer learning.
Abstract: There are increasing concerns about the bias and fairness of artificial intelligence (AI) models as they are put into clinical practice. Among the steps for implementing machine learning tools into clinical workflow, model development is an important stage where different types of biases can occur. This report focuses on four aspects of model development where such bias may arise: data augmentation, model and loss function, optimizers, and transfer learning. This report emphasizes appropriate considerations and practices that can mitigate biases in radiology AI studies. Keywords: Model, Bias, Machine Learning, Deep Learning, Radiology © RSNA, 2022.
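As one concrete illustration of the "model and loss function" category, a class-weighted loss is a common way to keep an imbalanced training set from biasing a model toward the majority class; the counts below are illustrative, and this is a generic PyTorch sketch rather than a prescription from the report.

```python
# Sketch: reweighting underrepresented classes in the loss so the optimizer
# does not simply favor the majority class. Class counts are illustrative.
import torch
import torch.nn as nn

class_counts = torch.tensor([900.0, 100.0])      # e.g., 900 normal vs 100 abnormal
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                        # model outputs for a batch of 8
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
print(float(loss))
```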

18 citations


Journal ArticleDOI
TL;DR: This article compared pretraining on RadImageNet with pretraining on ImageNet, using the area under the receiver operating characteristic curve (AUC) for eight classification tasks and Dice scores for two segmentation problems.
Abstract: To demonstrate the value of pretraining with millions of radiologic images compared with ImageNet photographic images on downstream medical applications when using transfer learning. This retrospective study included patients who underwent a radiologic study between 2005 and 2020 at an outpatient imaging facility. Key images and associated labels from the studies were retrospectively extracted from the original study interpretation. These images were used for RadImageNet model training with random weight initialization. The RadImageNet models were compared with ImageNet models using the area under the receiver operating characteristic curve (AUC) for eight classification tasks and using Dice scores for two segmentation problems. The RadImageNet database consists of 1.35 million annotated medical images in 131 872 patients who underwent CT, MRI, and US for musculoskeletal, neurologic, oncologic, gastrointestinal, endocrine, abdominal, and pulmonary pathologic conditions. For transfer learning tasks on small datasets (thyroid nodules [US], breast masses [US], anterior cruciate ligament injuries [MRI], and meniscal tears [MRI]), the RadImageNet models demonstrated a significant advantage (P < .001) over ImageNet models (9.4%, 4.0%, 4.8%, and 4.5% AUC improvements, respectively). For larger datasets (pneumonia [chest radiography], COVID-19 [CT], SARS-CoV-2 [CT], and intracranial hemorrhage [CT]), the RadImageNet models also showed improved AUC (P < .001) by 1.9%, 6.1%, 1.7%, and 0.9%, respectively. Additionally, lesion localizations of the RadImageNet models were improved by 64.6% and 16.4% on the thyroid and breast US datasets, respectively. RadImageNet pretrained models demonstrated better interpretability compared with ImageNet models, especially for smaller radiologic datasets. Keywords: CT, MR Imaging, US, Head/Neck, Thorax, Brain/Brain Stem, Evidence-based Medicine, Computer Applications-General (Informatics) Supplemental material is available for this article. Published under a CC BY 4.0 license. See also the commentary by Cadrin-Chênevert in this issue.
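The transfer-learning pattern being compared can be sketched as follows with a recent torchvision: load a pretrained backbone, swap the classification head for the downstream task, and optionally freeze the backbone. ImageNet weights serve as a stand-in here; the RadImageNet weight files and their loading details are not assumed.

```python
# Sketch: transfer learning by replacing the classification head of a
# pretrained backbone. ImageNet weights are a stand-in for radiologic weights.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g., benign vs malignant nodule

# Optionally freeze the backbone and train only the new head on a small dataset.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

x = torch.randn(1, 3, 224, 224)                 # placeholder image tensor
print(model(x).shape)                            # torch.Size([1, 2])
```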

17 citations


Journal ArticleDOI
TL;DR: Artificial intelligence-based software can achieve noninferior image quality for 3D brain MRI sequences with a 45% scan time reduction, potentially improving the patient experience and scanner efficiency without sacrificing diagnostic quality.
Abstract: Artificial intelligence (AI)-based image enhancement has the potential to reduce scan times while improving signal-to-noise ratio (SNR) and maintaining spatial resolution. This study prospectively evaluated AI-based image enhancement in 32 consecutive patients undergoing clinical brain MRI. Standard-of-care (SOC) three-dimensional (3D) T1 precontrast, 3D T2 fluid-attenuated inversion recovery, and 3D T1 postcontrast sequences were performed along with 45% faster versions of these sequences using half the number of phase-encoding steps. Images from the faster sequences were processed by a Food and Drug Administration-cleared AI-based image enhancement software for resolution enhancement. Four board-certified neuroradiologists scored the SOC and AI-enhanced image series independently on a five-point Likert scale for image SNR, anatomic conspicuity, overall image quality, imaging artifacts, and diagnostic confidence. While interrater κ was low to fair, the AI-enhanced scans were noninferior for all metrics and actually demonstrated a qualitative SNR improvement. Quantitative analyses showed that the AI software restored the high spatial resolution of small structures, such as the septum pellucidum. In conclusion, AI-based software can achieve noninferior image quality for 3D brain MRI sequences with a 45% scan time reduction, potentially improving the patient experience and scanner efficiency without sacrificing diagnostic quality. Keywords: MR Imaging, CNS, Brain/Brain Stem, Reconstruction Algorithms © RSNA, 2022.

15 citations


Journal ArticleDOI
TL;DR: In this commentary, Wiggins and Tejani discuss the opportunities and risks of foundation models, such as radiology-adapted BERT variants, for natural language processing in radiology.
Abstract: Commentary: On the Opportunities and Risks of Foundation Models for Natural Language Processing in Radiology. Walter F. Wiggins and Ali S. Tejani (Department of Radiology and Duke Center for Artificial Intelligence in Radiology, Duke University, Durham, NC; and Department of Radiology, University of Texas Southwestern Medical Center, Dallas, Tex). Radiology: Artificial Intelligence, Vol. 4, No. 4. Published online July 20, 2022. https://doi.org/10.1148/ryai.220119. Accompanying article: Yan A, McAuley J, Lu X, et al. RadBERT: Adapting transformer-based language models to radiology. Radiol Artif Intell 2022;4(4):e210258.

14 citations


Journal ArticleDOI
TL;DR: AI-based tools have not yet reached full diagnostic potential for COVID-19 and underperform compared with radiologist prediction; the association of race and sex with AI model diagnostic accuracy was also evaluated.
Abstract: Purpose To conduct a prospective observational study across 12 U.S. hospitals to evaluate real-time performance of an interpretable artificial intelligence (AI) model to detect COVID-19 on chest radiographs. Materials and Methods A total of 95 363 chest radiographs were included in model training, external validation, and real-time validation. The model was deployed as a clinical decision support system, and performance was prospectively evaluated. There were 5335 total real-time predictions and a COVID-19 prevalence of 4.8% (258 of 5335). Model performance was assessed with use of receiver operating characteristic analysis, precision-recall curves, and F1 score. Logistic regression was used to evaluate the association of race and sex with AI model diagnostic accuracy. To compare model accuracy with the performance of board-certified radiologists, a third dataset of 1638 images was read independently by two radiologists. Results Participants positive for COVID-19 had higher COVID-19 diagnostic scores than participants negative for COVID-19 (median, 0.1 [IQR, 0.0–0.8] vs 0.0 [IQR, 0.0–0.1], respectively; P < .001). Real-time model performance was unchanged over 19 weeks of implementation (area under the receiver operating characteristic curve, 0.70; 95% CI: 0.66, 0.73). Model sensitivity was higher in men than women (P = .01), whereas model specificity was higher in women (P = .001). Sensitivity was higher for Asian (P = .002) and Black (P = .046) participants compared with White participants. The COVID-19 AI diagnostic system had worse accuracy (63.5% correct) compared with radiologist predictions (radiologist 1 = 67.8% correct, radiologist 2 = 68.6% correct; McNemar P < .001 for both). Conclusion AI-based tools have not yet reached full diagnostic potential for COVID-19 and underperform compared with radiologist prediction. Keywords: Diagnosis, Classification, Application Domain, Infection, Lung Supplemental material is available for this article. © RSNA, 2022
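The statistics named above (ROC AUC, F1 score, and the McNemar test for paired model-versus-radiologist correctness) can be computed as in this sketch with scikit-learn and statsmodels; the arrays are tiny placeholders, not study data.

```python
# Sketch: evaluation metrics for a probabilistic classifier plus McNemar's test
# comparing AI vs radiologist correctness on the same cases. Placeholder data.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.8, 0.1, 0.4, 0.7, 0.2, 0.3, 0.05, 0.6])   # AI COVID-19 scores
print("AUC:", roc_auc_score(y_true, y_score))
print("F1 :", f1_score(y_true, (y_score >= 0.5).astype(int)))

# Paired correctness (1 = correct) for AI vs a radiologist on the same images.
ai_correct = np.array([1, 1, 0, 1, 1, 0, 1, 0])
rad_correct = np.array([1, 1, 1, 1, 1, 0, 1, 1])
table = np.array([[np.sum((ai_correct == 1) & (rad_correct == 1)),
                   np.sum((ai_correct == 1) & (rad_correct == 0))],
                  [np.sum((ai_correct == 0) & (rad_correct == 1)),
                   np.sum((ai_correct == 0) & (rad_correct == 0))]])
print(mcnemar(table, exact=True))
```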

12 citations


Journal ArticleDOI
TL;DR: DETR demonstrated high specificity for detection, localization, and characterization of FLLs on abdominal US images and met or exceeded that of two experts and Faster R-CNN for these tasks.
Abstract: Purpose To train and assess the performance of a deep learning-based network designed to detect, localize, and characterize focal liver lesions (FLLs) in the liver parenchyma on abdominal US images. Materials and Methods In this retrospective, multicenter, institutional review board-approved study, two object detectors, Faster region-based convolutional neural network (Faster R-CNN) and Detection Transformer (DETR), were fine-tuned on a dataset of 1026 patients (n = 2551 B-mode abdominal US images obtained between 2014 and 2018). Performance of the networks was analyzed on a test set of 48 additional patients (n = 155 B-mode abdominal US images obtained in 2019) and compared with the performance of three caregivers (one nonexpert and two experts) blinded to the clinical history. The sign test was used to compare accuracy, specificity, sensitivity, and positive predictive value among all raters. Results DETR achieved a specificity of 90% (95% CI: 75, 100) and a sensitivity of 97% (95% CI: 97, 97) for the detection of FLLs. The performance of DETR met or exceeded that of the three caregivers for this task. DETR correctly localized 80% of the lesions, and it achieved a specificity of 81% (95% CI: 67, 91) and a sensitivity of 82% (95% CI: 62, 100) for FLL characterization (benign vs malignant) among lesions localized by all raters. The performance of DETR met or exceeded that of two experts and Faster R-CNN for these tasks. Conclusion DETR demonstrated high specificity for detection, localization, and characterization of FLLs on abdominal US images. Supplemental material is available for this article. © RSNA, 2022. Keywords: Computer-aided Diagnosis (CAD), Ultrasound, Abdomen/GI, Liver, Tissue Characterization, Supervised Learning, Transfer Learning, Convolutional Neural Network (CNN).
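The fine-tuning idea behind the Faster R-CNN baseline can be sketched with torchvision as below (DETR fine-tuning follows the same replace-the-head pattern in other libraries); the two-class setup and the toy image and box are assumptions for illustration, not the study's configuration.

```python
# Sketch: fine-tuning a torchvision Faster R-CNN detector for lesion detection.
# Classes: background + focal liver lesion. Image and target are placeholders.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One toy training step: a single image with one annotated lesion box.
images = [torch.rand(3, 448, 448)]
targets = [{"boxes": torch.tensor([[60.0, 80.0, 180.0, 200.0]]),
            "labels": torch.tensor([1])}]
model.train()
losses = model(images, targets)       # dict of classification/regression losses
print(sum(loss for loss in losses.values()))
```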

12 citations


Journal ArticleDOI
TL;DR: An artificial intelligence (AI)-based detection tool for intracranial hemorrhage (ICH) on noncontrast CT images was implemented into an emergent workflow, its diagnostic performance was evaluated, and clinical workflow metrics were compared with those before AI implementation.
Abstract: Authors implemented an artificial intelligence (AI)-based detection tool for intracranial hemorrhage (ICH) on noncontrast CT images into an emergent workflow, evaluated its diagnostic performance, and assessed clinical workflow metrics compared with pre-AI implementation. The finalized radiology report constituted the ground truth for the analysis, and CT examinations (n = 4450) before and after implementation were retrieved using various keywords for ICH. Diagnostic performance was assessed, and mean values with their respective 95% CIs were reported to compare workflow metrics (report turnaround time, communication time of a finding, consultation time of another specialty, and turnaround time in the emergency department). Although practicable diagnostic performance was observed for overall ICH detection with 93.0% diagnostic accuracy, 87.2% sensitivity, and 97.8% negative predictive value, the tool yielded lower detection rates for specific subtypes of ICH (eg, 69.2% [74 of 107] for subdural hemorrhage and 77.4% [24 of 31] for acute subarachnoid hemorrhage). Common false-positive findings included postoperative and postischemic defects (23.6%, 37 of 157), artifacts (19.7%, 31 of 157), and tumors (15.3%, 24 of 157). Although workflow metrics such as communicating a critical finding (70 minutes [95% CI: 54, 85] vs 63 minutes [95% CI: 55, 71]) were on average reduced after implementation, future efforts are necessary to streamline the workflow all along the workflow chain. It is crucial to define a clear framework and recognize limitations as AI tools are only as reliable as the environment in which they are deployed. Keywords: CT, CNS, Stroke, Diagnosis, Classification, Application Domain © RSNA, 2022.
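The headline diagnostic performance figures above (accuracy, sensitivity, negative predictive value) reduce to simple confusion-matrix arithmetic; the helper below uses placeholder counts, not the study's numbers.

```python
# Sketch: standard diagnostic metrics derived from confusion-matrix counts.
def diagnostic_metrics(tp, fp, fn, tn):
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

print(diagnostic_metrics(tp=185, fp=157, fn=27, tn=3500))  # placeholder counts
```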

12 citations


Journal ArticleDOI
TL;DR: This study demonstrates accurate and reliable fully automated multi-vertebral level quantification and characterization of muscle and adipose tissue on routine chest CT scans.
Abstract: Body composition on chest CT scans encompasses a set of important imaging biomarkers. This study developed and validated a fully automated analysis pipeline for multi-vertebral level assessment of muscle and adipose tissue on routine chest CT scans. This study retrospectively trained two convolutional neural networks on 629 chest CT scans from 629 patients (55% women; mean age, 67 years ± 10 [standard deviation]) obtained between 2014 and 2017 prior to lobectomy for primary lung cancer at three institutions. A slice-selection network was developed to identify an axial image at the level of the fifth, eighth, and 10th thoracic vertebral bodies. A segmentation network (U-Net) was trained to segment muscle and adipose tissue on an axial image. Radiologist-guided manual-level selection and segmentation generated ground truth. The authors then assessed the predictive performance of their approach for cross-sectional area (CSA) (in centimeters squared) and attenuation (in Hounsfield units) on an independent test set. For the pipeline, median absolute error and intraclass correlation coefficients for both tissues were 3.6% (interquartile range, 1.3%-7.0%) and 0.959-0.998 for the CSA and 1.0 HU (interquartile range, 0.0-2.0 HU) and 0.95-0.99 for median attenuation. This study demonstrates accurate and reliable fully automated multi-vertebral level quantification and characterization of muscle and adipose tissue on routine chest CT scans. Keywords: Skeletal Muscle, Adipose Tissue, CT, Chest, Body Composition Analysis, Convolutional Neural Network (CNN), Supervised Learning Supplemental material is available for this article. © RSNA, 2022.
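The two quantities reported per tissue and vertebral level, cross-sectional area in cm2 and attenuation in HU, follow from a binary segmentation mask and the DICOM pixel spacing roughly as sketched below; the arrays and spacing values are illustrative, not the pipeline's outputs.

```python
# Sketch: cross-sectional area (cm^2) and median attenuation (HU) from a binary
# muscle mask on one axial CT image, given pixel spacing from the DICOM header.
import numpy as np

hu_image = np.random.randint(-200, 200, size=(512, 512))   # placeholder axial slice in HU
muscle_mask = np.zeros((512, 512), dtype=bool)
muscle_mask[200:300, 150:350] = True                        # placeholder segmentation

pixel_spacing_mm = (0.75, 0.75)                             # row, column spacing (assumed)
pixel_area_cm2 = (pixel_spacing_mm[0] * pixel_spacing_mm[1]) / 100.0

csa_cm2 = muscle_mask.sum() * pixel_area_cm2
median_hu = np.median(hu_image[muscle_mask])
print(f"CSA = {csa_cm2:.1f} cm^2, median attenuation = {median_hu:.0f} HU")
```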

11 citations


Journal ArticleDOI
TL;DR: Three methodological pitfalls cannot be captured using internal model evaluation, and the inaccurate predictions made by such models may lead to wrong conclusions and interpretations; therefore, understanding and avoiding these pitfalls is necessary for developing generalizable models.
Abstract: Purpose To investigate the impact of the following three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. Materials and Methods The authors used retrospective CT, histopathologic analysis, and radiography datasets to develop machine learning models with and without the three methodological pitfalls to quantitatively illustrate their effect on model performance and generalizability. F1 score was used to measure performance, and differences in performance between models developed with and without errors were assessed using the Wilcoxon rank sum test when applicable. Results Violation of the independence assumption by applying oversampling, feature selection, and data augmentation before splitting data into training, validation, and test sets seemingly improved model F1 scores by 71.2% for predicting local recurrence and 5.0% for predicting 3-year overall survival in head and neck cancer and by 46.0% for distinguishing histopathologic patterns in lung cancer. Randomly distributing data points for a patient across datasets superficially improved the F1 score by 21.8%. High model performance metrics did not indicate high-quality lung segmentation. In the presence of a batch effect, a model built for pneumonia detection had an F1 score of 98.7% but correctly classified only 3.86% of samples from a new dataset of healthy patients. Conclusion Machine learning models developed with these methodological pitfalls, which are undetectable during internal evaluation, produce inaccurate predictions; thus, understanding and avoiding these pitfalls is necessary for developing generalizable models. Keywords: Random Forest, Diagnosis, Prognosis, Convolutional Neural Network (CNN), Medical Image Analysis, Generalizability, Machine Learning, Deep Learning, Model Evaluation Supplemental material is available for this article. Published under a CC BY 4.0 license.
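The first pitfall, violating the independence assumption, typically arises when oversampling or augmentation is applied before the train/test split; a leakage-free sketch with scikit-learn keeps those steps strictly inside the training split (patient-level grouping, for example with GroupShuffleSplit, would address the related per-patient leakage mentioned above). Data here are random placeholders.

```python
# Sketch: split first, then oversample the minority class only within the
# training set, so no duplicated/synthetic samples leak into the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X = np.random.rand(100, 20)
y = np.array([0] * 90 + [1] * 10)                 # imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

minority = X_tr[y_tr == 1]
extra = resample(minority, replace=True,
                 n_samples=len(X_tr[y_tr == 0]) - len(minority), random_state=0)
X_tr_bal = np.vstack([X_tr, extra])
y_tr_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])
print(X_tr_bal.shape, y_tr_bal.mean())
```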

Journal ArticleDOI
TL;DR: This study demonstrates the accuracy and reliability of fully automated, convolutional neural network-based software for BD classification, developed and externally validated on mammograms obtained between 2017 and 2020.
Abstract: Mammographic breast density (BD) is commonly visually assessed using the Breast Imaging Reporting and Data System (BI-RADS) four-category scale. To overcome inter- and intraobserver variability of visual assessment, the authors retrospectively developed and externally validated a software for BD classification based on convolutional neural networks from mammograms obtained between 2017 and 2020. The tool was trained using the majority BD category determined by seven board-certified radiologists who independently visually assessed 760 mediolateral oblique (MLO) images in 380 women (mean age, 57 years ± 6 [SD]) from center 1; this process mimicked training from a consensus of several human readers. External validation of the model was performed by the three radiologists whose BD assessment was closest to the majority (consensus) of the initial seven on a dataset of 384 MLO images in 197 women (mean age, 56 years ± 13) obtained from center 2. The model achieved an accuracy of 89.3% in distinguishing BI-RADS a or b (nondense breasts) versus c or d (dense breasts) categories, with an agreement of 90.4% (178 of 197 mammograms) and a reliability of 0.807 (Cohen κ) compared with the mode of the three readers. This study demonstrates accuracy and reliability of a fully automated software for BD classification. Keywords: Mammography, Breast, Convolutional Neural Network (CNN), Deep Learning Algorithms, Machine Learning Algorithms Supplemental material is available for this article. © RSNA, 2022.
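The agreement statistics quoted above (percentage agreement and Cohen κ against the readers' consensus) can be computed as in this small sketch; the labels are placeholders, with 1 = dense (BI-RADS c/d) and 0 = nondense (BI-RADS a/b).

```python
# Sketch: agreement and Cohen's kappa between software calls and reader consensus.
from sklearn.metrics import accuracy_score, cohen_kappa_score

consensus = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]   # placeholder reader consensus
software = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]    # placeholder software output
print("agreement:", accuracy_score(consensus, software))
print("kappa    :", cohen_kappa_score(consensus, software))
```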

Journal ArticleDOI
TL;DR: In this paper, a 3D U-Net with dynamic contrast-enhanced MRI as input and with intensity normalized for each examination was used for 2D segmentation of breast cancer.
Abstract: To develop a deep network architecture that would achieve fully automated radiologist-level segmentation of cancers at breast MRI. In this retrospective study, 38 229 examinations (composed of 64 063 individual breast scans from 14 475 patients) were performed in female patients (age range, 12-94 years; mean age, 52 years ± 10 [standard deviation]) who presented between 2002 and 2014 at a single clinical site. A total of 2555 breast cancers were selected that had been segmented on two-dimensional (2D) images by radiologists, as well as 60 108 benign breasts that served as examples of noncancerous tissue; all these were used for model training. For testing, an additional 250 breast cancers were segmented independently on 2D images by four radiologists. Authors selected among several three-dimensional (3D) deep convolutional neural network architectures, input modalities, and harmonization methods. The outcome measure was the Dice score for 2D segmentation, which was compared between the network and radiologists by using the Wilcoxon signed rank test and the two one-sided test procedure. The highest-performing network on the training set was a 3D U-Net with dynamic contrast-enhanced MRI as input and with intensity normalized for each examination. In the test set, the median Dice score of this network was 0.77 (interquartile range, 0.26). The performance of the network was equivalent to that of the radiologists (two one-sided test procedures with radiologist performance of 0.69-0.84 as equivalence bounds, P < .001 for both; n = 250). When trained on a sufficiently large dataset, the developed 3D U-Net performed as well as fellowship-trained radiologists in detailed 2D segmentation of breast cancers at routine clinical MRI. Keywords: MRI, Breast, Segmentation, Supervised Learning, Convolutional Neural Network (CNN), Deep Learning Algorithms, Machine Learning Algorithms. Published under a CC BY 4.0 license. Supplemental material is available for this article.
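The outcome measure above, the Dice score between a predicted 2D mask and a radiologist's mask, can be computed as in the following sketch with toy masks.

```python
# Sketch: Dice score between two binary segmentation masks.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + eps)

truth = np.zeros((64, 64), dtype=bool); truth[20:40, 20:40] = True
pred = np.zeros((64, 64), dtype=bool); pred[22:42, 22:42] = True
print(round(dice(pred, truth), 3))
```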

Journal ArticleDOI
TL;DR: Thoracic aortic aneurysms were accurately predicted at CT by using deep learning, and these predictions were compared with annotations made by two independent readers as well as radiology reports to evaluate system performance.
Abstract: Purpose To develop and validate a deep learning-based system that predicts the largest ascending and descending aortic diameters at chest CT through automatic thoracic aortic segmentation and identifies aneurysms in each segment. Materials and Methods In this retrospective study conducted from July 2019 to February 2021, a U-Net and a postprocessing algorithm for thoracic aortic segmentation and measurement were developed by using a dataset (dataset A) that included 315 CT studies split into training, hyperparameter-tuning, and testing sets. The U-Net and postprocessing algorithm were associated with a Digital Imaging and Communications in Medicine series filter and visualization interface and were further validated by using a dataset (dataset B) that included 1400 routine CT studies. In dataset B, system-predicted measurements were compared with annotations made by two independent readers as well as radiology reports to evaluate system performance. Results In dataset B, the mean absolute error between the automatic and reader-measured diameters was equal to or less than 0.27 cm for both the ascending aorta and the descending aorta. The intraclass correlation coefficients (ICCs) were greater than 0.80 for the ascending aorta and equal to or greater than 0.70 for the descending aorta, and the ICCs between readers were 0.91 (95% CI: 0.90, 0.92) and 0.82 (95% CI: 0.80, 0.84), respectively. Aneurysm detection accuracy was 88% (95% CI: 86, 90) and 81% (95% CI: 79, 83) compared with reader 1 and 90% (95% CI: 88, 91) and 82% (95% CI: 80, 84) compared with reader 2 for the ascending aorta and descending aorta, respectively. Conclusion Thoracic aortic aneurysms were accurately predicted at CT by using deep learning. Keywords: Aorta, Convolutional Neural Network, Machine Learning, CT, Thorax, Aneurysms. Supplemental material is available for this article. © RSNA, 2022.

Journal ArticleDOI
TL;DR: The CNN model detected and delineated vestibular schwannomas accurately on contrast-enhanced T1- and T2-weighted MRI scans and distinguished the clinically relevant difference between intrameatal and extrameatal tumor parts.
Abstract: Purpose To develop automated vestibular schwannoma measurements on contrast-enhanced T1- and T2-weighted MRI scans. Materials and Methods MRI data from 214 patients in 37 different centers were retrospectively analyzed between 2020 and 2021. Patients with hearing loss (134 positive for vestibular schwannoma [mean age ± SD, 54 years ± 12; 64 men] and 80 negative for vestibular schwannoma) were randomly assigned to a training and validation set and to an independent test set. A convolutional neural network (CNN) was trained using fivefold cross-validation for two models (T1 and T2). Quantitative analysis, including Dice index, Hausdorff distance, surface-to-surface distance (S2S), and relative volume error, was used to compare the computer and the human delineations. An observer study was performed in which two experienced physicians evaluated both delineations. Results The T1-weighted model showed state-of-the-art performance, with a mean S2S distance of less than 0.6 mm for the whole tumor and the intrameatal and extrameatal tumor parts. The whole tumor Dice index and Hausdorff distance were 0.92 and 2.1 mm in the independent test set, respectively. T2-weighted images had a mean S2S distance less than 0.6 mm for the whole tumor and the intrameatal and extrameatal tumor parts. The whole tumor Dice index and Hausdorff distance were 0.87 and 1.5 mm in the independent test set. The observer study indicated that the tool was similar to human delineations in 85%-92% of cases. Conclusion The CNN model detected and delineated vestibular schwannomas accurately on contrast-enhanced T1- and T2-weighted MRI scans and distinguished the clinically relevant difference between intrameatal and extrameatal tumor parts. Keywords: MRI, Ear, Nose, and Throat, Skull Base, Segmentation, Convolutional Neural Network (CNN), Deep Learning Algorithms, Machine Learning Algorithms Supplemental material is available for this article. © RSNA, 2022.
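Of the surface metrics listed (Dice index, Hausdorff distance, surface-to-surface distance), the Hausdorff distance can be sketched with SciPy as below; the boundary point sets are placeholders and would in practice be surface voxel coordinates scaled to millimeters.

```python
# Sketch: symmetric Hausdorff distance between two boundary point sets.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

computer_pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
human_pts = np.array([[0.1, 0.1], [0.0, 1.2], [1.1, 0.0], [0.9, 1.0]])

hd = max(directed_hausdorff(computer_pts, human_pts)[0],
         directed_hausdorff(human_pts, computer_pts)[0])
print(f"Hausdorff distance: {hd:.2f}")
```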

Journal ArticleDOI
TL;DR: This review focuses on six major categories for artificial intelligence applications: study selection and protocoling, image acquisition, worklist prioritization, study reporting, business applications, and resident education.
Abstract: Artificial intelligence has become a ubiquitous term in radiology over the past several years, and much attention has been given to applications that aid radiologists in the detection of abnormalities and diagnosis of diseases. However, there are many potential applications related to radiologic image quality, safety, and workflow improvements that present equal, if not greater, value propositions to radiology practices, insurance companies, and hospital systems. This review focuses on six major categories for artificial intelligence applications: study selection and protocoling, image acquisition, worklist prioritization, study reporting, business applications, and resident education. All of these categories can substantially affect different aspects of radiology practices and workflows. Each of these categories has different value propositions in terms of whether they could be used to increase efficiency, improve patient safety, increase revenue, or save costs. Each application is covered in depth in the context of both current and future areas of work. Keywords: Use of AI in Education, Application Domain, Supervised Learning, Safety © RSNA, 2022.

Journal ArticleDOI
TL;DR: The developed deep NLP model reached the performance level of medical students but not radiologists in curating oncologic outcomes from radiology FTOR.
Abstract: Purpose To train a deep natural language processing (NLP) model, using data mined structured oncology reports (SOR), for rapid tumor response category (TRC) classification from free-text oncology reports (FTOR) and to compare its performance with human readers and conventional NLP algorithms. Materials and Methods In this retrospective study, databases of three independent radiology departments were queried for SOR and FTOR dated from March 2018 to August 2021. An automated data mining and curation pipeline was developed to extract Response Evaluation Criteria in Solid Tumors-related TRCs for SOR for ground truth definition. The deep NLP bidirectional encoder representations from transformers (BERT) model and three feature-rich algorithms were trained on SOR to predict TRCs in FTOR. Models' F1 scores were compared against scores of radiologists, medical students, and radiology technologist students. Lexical and semantic analyses were conducted to investigate human and model performance on FTOR. Results Oncologic findings and TRCs were accurately mined from 9653 of 12 833 (75.2%) queried SOR, yielding oncology reports from 10 455 patients (mean age, 60 years ± 14 [SD]; 5303 women) who met inclusion criteria. On 802 FTOR in the test set, BERT achieved better TRC classification results (F1, 0.70; 95% CI: 0.68, 0.73) than the best-performing reference linear support vector classifier (F1, 0.63; 95% CI: 0.61, 0.66) and technologist students (F1, 0.65; 95% CI: 0.63, 0.67), had similar performance to medical students (F1, 0.73; 95% CI: 0.72, 0.75), but was inferior to radiologists (F1, 0.79; 95% CI: 0.78, 0.81). Lexical complexity and semantic ambiguities in FTOR influenced human and model performance, revealing maximum F1 score drops of -0.17 and -0.19, respectively. Conclusion The developed deep NLP model reached the performance level of medical students but not radiologists in curating oncologic outcomes from radiology FTOR. Keywords: Neural Networks, Computer Applications-Detection/Diagnosis, Oncology, Research Design, Staging, Tumor Response, Comparative Studies, Decision Analysis, Experimental Investigations, Observer Performance, Outcomes Analysis Supplemental material is available for this article. © RSNA, 2022.
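The kind of feature-rich baseline the BERT model was compared against, a linear support vector classifier over bag-of-words style features scored with F1, can be sketched as below; the reports, labels, and TF-IDF configuration are toy assumptions, not the study's data or exact setup.

```python
# Sketch: TF-IDF + linear SVC baseline for tumor response category classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = ["Target lesion decreased from 32 mm to 18 mm.",
           "New hepatic metastases are present.",
           "Stable appearance of the known pulmonary nodule.",
           "Further enlargement of the mediastinal mass."]
trc = ["partial_response", "progressive_disease", "stable_disease", "progressive_disease"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(reports, trc)

test_reports = ["Known nodule is unchanged in size.",
                "Interval increase in size of the hepatic lesion."]
test_trc = ["stable_disease", "progressive_disease"]
pred = clf.predict(test_reports)
print(pred, f1_score(test_trc, pred, average="macro"))
```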

Journal ArticleDOI
TL;DR: In this article, the authors discuss the challenges and obstacles of training a very large medical imaging transformer, including data needs, biases, training tasks, network architecture, privacy concerns, and computational requirements.
Abstract: Deep learning models are currently the cornerstone of artificial intelligence in medical imaging. While progress is still being made, the generic technological core of convolutional neural networks (CNNs) has had only modest innovations over the last several years, if at all. There is thus a need for improvement. More recently, transformer networks have emerged that replace convolutions with a complex attention mechanism, and they have already matched or exceeded the performance of CNNs in many tasks. Transformers need very large amounts of training data, even more than CNNs, but obtaining well-curated labeled data is expensive and difficult. A possible solution to this issue would be transfer learning with pretraining on a self-supervised task using very large amounts of unlabeled medical data. This pretrained network could then be fine-tuned on specific medical imaging tasks with relatively modest data requirements. The authors believe that the availability of a large-scale, three-dimension-capable, and extensively pretrained transformer model would be highly beneficial to the medical imaging and research community. In this article, authors discuss the challenges and obstacles of training a very large medical imaging transformer, including data needs, biases, training tasks, network architecture, privacy concerns, and computational requirements. The obstacles are substantial but not insurmountable for resourceful collaborative teams that may include academia and information technology industry partners. © RSNA, 2022 Keywords: Computer-aided Diagnosis (CAD), Informatics, Transfer Learning, Convolutional Neural Network (CNN).

Journal ArticleDOI
TL;DR: In this article, the authors present a balanced perspective on whether AI should be included in the medical school curriculum and, after laying out the point-counterpoint arguments, offer a compromise that reflects AI's potential to shape the future of medicine and medical education.
Abstract: Although artificial intelligence (AI) has immense potential to shape the future of medicine, its place in undergraduate medical education currently is unclear. Numerous arguments exist both for and against including AI in the medical school curriculum. AI likely will affect all medical specialties, perhaps radiology more so than any other. The purpose of this article is to present a balanced perspective on whether AI should be included officially in the medical school curriculum. After presenting the balanced point-counterpoint arguments, the authors provide a compromise. Keywords: Artificial Intelligence, Medical Education, Medical School Curriculum, Medical Students, Radiology, Use of AI in Education © RSNA, 2022.

Journal ArticleDOI
TL;DR: The CT-based DL model performed similarly to radiologists, and LSVR and splenic volume were predictive of advanced fibrosis and cirrhosis.
Abstract: Purpose To evaluate the performance of a deep learning (DL) model that measures the liver segmental volume ratio (LSVR) (ie, the volumes of Couinaud segments I-III/IV-VIII) and spleen volumes from CT scans to predict cirrhosis and advanced fibrosis. Materials and Methods For this Health Insurance Portability and Accountability Act-compliant, retrospective study, two datasets were used. Dataset 1 consisted of patients with hepatitis C who underwent liver biopsy (METAVIR F0-F4, 2000-2016). Dataset 2 consisted of patients who had cirrhosis from other causes who underwent liver biopsy (Ishak 0-6, 2001-2021). Whole liver, LSVR, and spleen volumes were measured with contrast-enhanced CT by radiologists and the DL model. Areas under the receiver operating characteristic curve (AUCs) for diagnosing advanced fibrosis (≥METAVIR F2 or Ishak 3) and cirrhosis (≥METAVIR F4 or Ishak 5) were calculated. Multivariable models were built on dataset 1 and tested on datasets 1 (hold out) and 2. Results Datasets 1 and 2 consisted of 406 patients (median age, 50 years [IQR, 44-56 years]; 297 men) and 207 patients (median age, 50 years [IQR, 41-57 years]; 147 men), respectively. In dataset 1, the prediction of cirrhosis was similar between the manual versus automated measurements for spleen volume (AUC, 0.86 [95% CI: 0.82, 0.9] vs 0.85 [95% CI: 0.81, 0.89]; significantly noninferior, P < .001) and LSVR (AUC, 0.83 [95% CI: 0.78, 0.87] vs 0.79 [95% CI: 0.74, 0.84]; P < .001). The best performing multivariable model achieved AUCs of 0.94 (95% CI: 0.89, 0.99) and 0.79 (95% CI: 0.71, 0.87) for cirrhosis and 0.8 (95% CI: 0.69, 0.91) and 0.71 (95% CI: 0.64, 0.78) for advanced fibrosis in datasets 1 and 2, respectively. Conclusion The CT-based DL model performed similarly to radiologists. LSVR and splenic volume were predictive of advanced fibrosis and cirrhosis. Keywords: CT, Liver, Cirrhosis, Computer Applications-Detection/Diagnosis Supplemental material is available for this article. © RSNA, 2022.
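The imaging biomarkers used here, LSVR (volume of Couinaud segments I-III divided by the volume of segments IV-VIII) and spleen volume, follow directly from a labeled segmentation and the voxel volume; the label scheme and voxel spacing in this sketch are assumptions for illustration only.

```python
# Sketch: LSVR and spleen volume from a labeled segmentation volume.
# Assumed labels: 1-8 = Couinaud segments I-VIII, 9 = spleen, 0 = background.
import numpy as np

labels = np.random.randint(0, 10, size=(64, 256, 256))      # placeholder label map
voxel_volume_ml = (3.0 * 0.8 * 0.8) / 1000.0                # slice thickness x pixel spacing, mm^3 -> mL

seg_volumes = {lab: np.sum(labels == lab) * voxel_volume_ml for lab in range(1, 10)}
lsvr = sum(seg_volumes[s] for s in (1, 2, 3)) / sum(seg_volumes[s] for s in range(4, 9))
spleen_ml = seg_volumes[9]
print(f"LSVR = {lsvr:.2f}, spleen volume = {spleen_ml:.0f} mL")
```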

Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network (CNN) was used for screening mammography images, and the results showed that the CNN was effective in detecting mammography abnormalities.
Abstract: Supplemental material is available for this article. Keywords: Mammography, Screening, Convolutional Neural Network (CNN) Published under a CC BY 4.0 license. See also the commentary by Cadrin-Chênevert in this issue.

Journal ArticleDOI
TL;DR: This report demonstrates how two recently proposed checklists, datasheets for datasets and model cards, can be adopted to increase the transparency of crucial stages of the ML lifecycle, using ChestX-ray8 and CheXNet as examples.
Abstract: Artificial intelligence applications for health care have come a long way. Despite the remarkable progress, there are several examples of unfulfilled promises and outright failures. There is still a struggle to translate successful research into successful real-world applications. Machine learning (ML) products diverge from traditional software products in fundamental ways. Particularly, the main component of an ML solution is not a specific piece of code that is written for a specific purpose; rather, it is a generic piece of code, a model, customized by a training process driven by hyperparameters and a dataset. Datasets are usually large, and models are opaque. Therefore, datasets and models cannot be inspected in the same, direct way as traditional software products. Other methods are needed to detect failures in ML products. This report investigates recent advancements that promote auditing, supported by transparency, as a mechanism to detect potential failures in ML products for health care applications. It reviews practices that apply to the early stages of the ML lifecycle, when datasets and models are created; these stages are unique to ML products. Concretely, this report demonstrates how two recently proposed checklists, datasheets for datasets and model cards, can be adopted to increase the transparency of crucial stages of the ML lifecycle, using ChestX-ray8 and CheXNet as examples. The adoption of checklists to document the strengths, limitations, and applications of datasets and models in a structured format leads to increased transparency, allowing early detection of potential problems and opportunities for improvement. Keywords: Artificial Intelligence, Machine Learning, Lifecycle, Auditing, Transparency, Failures, Datasheets, Datasets, Model Cards Supplemental material is available for this article. © RSNA, 2022.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an artificial intelligence-assisted contour editing (AIACE) method to assist clinicians in editing contours efficiently and effectively using deep learning models on CT images from three head-and-neck cancer datasets.
Abstract: To present a concept called artificial intelligence-assisted contour editing (AIACE) and demonstrate its feasibility. The conceptual workflow of AIACE is as follows: Given an initial contour that requires clinician editing, the clinician indicates where large editing is needed, and a trained deep learning model uses this input to update the contour. This process repeats until a clinically acceptable contour is achieved. In this retrospective, proof-of-concept study, the authors demonstrated the concept on two-dimensional (2D) axial CT images from three head-and-neck cancer datasets by simulating the interaction with the AIACE model to mimic the clinical environment. The input at each iteration was one mouse click on the desired location of the contour segment. Model performance is quantified with the Dice similarity coefficient (DSC) and 95th percentile of Hausdorff distance (HD95) based on three datasets with sample sizes of 10, 28, and 20 patients. The average DSCs and HD95 values of the automatically generated initial contours were 0.82 and 4.3 mm, 0.73 and 5.6 mm, and 0.67 and 11.4 mm for the three datasets, which were improved to 0.91 and 2.1 mm, 0.86 and 2.5 mm, and 0.86 and 3.3 mm, respectively, with three mouse clicks. Each deep learning-based contour update required about 20 msec. The authors proposed the newly developed AIACE concept, which uses deep learning models to assist clinicians in editing contours efficiently and effectively, and demonstrated its feasibility by using 2D axial CT images from three head-and-neck cancer datasets. Keywords: Segmentation, Convolutional Neural Network (CNN), CT, Deep Learning Algorithms Supplemental material is available for this article. © RSNA, 2022.

Journal ArticleDOI
TL;DR: The developed pipeline was able to accurately localize and classify brands of hardware implants using a weakly supervised learning framework and achieved an intersection over union of 86.8% and an F1 score of 94.9%.
Abstract: Purpose To develop an end-to-end pipeline to localize and identify cervical spine hardware brands on routine cervical spine radiographs. Materials and Methods In this single-center retrospective study, patients who received cervical spine implants between 2014 and 2018 were identified. Information on the implant model was retrieved from the surgical notes. The dataset was filtered for implants present in at least three patients, which yielded five anterior and five posterior hardware models for classification. Images for training were manually annotated with bounding boxes for anterior and posterior hardware. An object detection model was trained and implemented to localize hardware on the remaining images. An image classification model was then trained to differentiate between five anterior and five posterior hardware models. Model performance was evaluated on a holdout test set with 1000 iterations of bootstrapping. Results A total of 984 patients (mean age, 62 years ± 12 [standard deviation]; 525 women) were included for model training, validation, and testing. The hardware localization model achieved an intersection over union of 86.8% and an F1 score of 94.9%. For brand classification, an F1 score, sensitivity, and specificity of 98.7% ± 0.5, 98.7% ± 0.5, and 99.2% ± 0.3, respectively, were attained for anterior hardware, with values of 93.5% ± 2.0, 92.6% ± 2.0, and 96.1% ± 2.0, respectively, attained for posterior hardware. Conclusion The developed pipeline was able to accurately localize and classify brands of hardware implants using a weakly supervised learning framework. Keywords: Spine, Convolutional Neural Network, Deep Learning Algorithms, Machine Learning Algorithms, Prostheses, Semisupervised Learning Supplemental material is available for this article. © RSNA, 2022. See also the commentary by Huisman and Lessmann in this issue.
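Localization performance above is summarized with intersection over union (IoU); for axis-aligned bounding boxes it can be computed as in this short sketch with placeholder coordinates.

```python
# Sketch: IoU between two axis-aligned bounding boxes in (x1, y1, x2, y2) form.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

predicted_box = (120, 200, 260, 420)   # placeholder model output on a radiograph
annotated_box = (115, 210, 255, 430)   # placeholder manual annotation
print(round(box_iou(predicted_box, annotated_box), 3))
```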

Journal ArticleDOI
TL;DR: Artificial intelligence (AI) applications in stroke care are being used currently in clinical practice with multiple Food and Drug Administration–approved and Conformité Européenne mark–certified commercially available platforms and are no longer a mere academic exercise.
Abstract: Artificial Intelligence in "Code Stroke"—A Paradigm Shift: Do Radiologists Need to Change Their Practice? Ischemic stroke is one of the leading causes of mortality and severe disability worldwide (1). "Code stroke" is a time-sensitive and high-stakes clinical scenario alert for acute stroke that requires a rapid team approach to facilitate hyperacute evaluation and management of patients. For each acute stroke case not adequately interpreted, categorized, and treated, there is a high risk of mortality and disability. Management of acute ischemic stroke is evolving rapidly due to highly efficacious endovascular therapy (2–8). Regardless of the imaging modality (CT vs MRI) or type of hospital (tertiary "hub" vs outlying "spoke"), acute stroke management has one unifying need: to treat as quickly as possible. Optimal stroke treatment is a highly time-dependent process (9). Every small reduction in time to treatment helps bring lifetime benefits: every minute saved leads to 4 days of disability-free life (10,11). In this scenario, any technology is desirable that improves the diagnosis of stroke and rapidly informs treatment decisions. Artificial intelligence (AI) is rising as a leading component in stroke imaging and is expected to further change acute stroke care (12). There are three main use cases where AI offers potential benefits: 1. Detection of abnormality to improve the performance of the human reader and increase the accuracy, sensitivity, and specificity of imaging analysis. AI can identify morphologic, mathematical, and geometric characteristics that are hard to detect with the human eye. 2. Workflow efficiency to reduce the time to detect abnormalities and generate reports. AI can recognize the most severe conditions to expedite care and reduce treatment times. 3. Therapeutic decision support to make triage and treatment decisions based on clinical and imaging parameters. There are three primary imaging questions in acute ischemic stroke that need to be answered quickly: extent of acute ischemia, presence of large vessel occlusion (LVO), and extent of brain tissue at risk. The Alberta Stroke Program Early CT Score (ASPECTS) provides a prognostic approach for extent of ischemia, where noncontrast head CT is scored by dividing the middle cerebral artery territory into 10 regions of interest (13). LVO accounts for one-third of acute ischemic stroke and can be assessed quickly with CT arteriography or MR arteriography. Perfusion imaging is another powerful tool that can measure ischemic and at-risk brain tissue. The constellation of imaging findings helps physicians make time-sensitive triage and treatment decisions, such as intravenous thrombolysis, endovascular treatment, patient transfer, or a decision not to treat. There are several practical limitations of using imaging tools for acute stroke. First, analysis of imaging information requires time. Second, not all radiologists are trained to analyze and interpret these techniques. Third, there is significantly limited availability of radiologists in community-based and rural hospitals that receive patients with acute stroke. To address these limitations, AI solutions have been built that can automate ASPECTS (14), LVO detection (15), and perfusion analysis. These AI solutions are being used currently in clinical practice with multiple Food and Drug Administration–approved and Conformité Européenne mark–certified commercially available platforms and are no longer a mere academic exercise (16,17). 
AI applications in stroke care are now integrated into clinical care with level 1 evidence in the American Heart Association/American Stroke Association guidelines (18). Additionally, in September 2020, the U.S. Centers for Medicare & Medicaid Services approved the first reimbursement for such AI-augmented medical care (19). AI-powered software platforms have streamlined stroke care workflows, reduced treatment times, and improved outcomes (12,17,20–22). Moreover, the effect of differing levels of imaging expertise is mitigated by AI models that offer the analysis automatically, which is even more critical for community-based and rural stroke centers that may not have the necessary 24/7 availability of expert radiology interpretation. Delays to treatment are particularly prevalent when patients require a transfer from hospitals that lack endovascular therapy capability onsite. The AI tools can help expedite this workflow and be cost-effective. A recent cost-effectiveness study of automated LVO detection, applying a base case scenario (6% missed diagnoses, $40 per AI analysis, and 50% reduction of missed LVOs by AI), showed substantial cost savings and increased quality-adjusted life-years for patients with acute ischemic stroke over their projected lifetime (23).

Journal ArticleDOI
TL;DR: A deep learning tool to automatically quantify femoral component subsidence between two serial anteroposterior (AP) hip radiographs was developed and evaluated; automated and manual measurements showed no evidence of significant differences.
Abstract: Femoral component subsidence following total hip arthroplasty (THA) is a worrisome radiographic finding. This study developed and evaluated a deep learning tool to automatically quantify femoral component subsidence between two serial anteroposterior (AP) hip radiographs. The authors' institutional arthroplasty registry was used to retrospectively identify patients who underwent primary THA from 2000 to 2020. A deep learning dynamic U-Net model was trained to automatically segment femur, implant, and magnification markers on a dataset of 500 randomly selected AP hip radiographs from 386 patients with polished tapered cemented femoral stems. An image processing algorithm was then developed to measure subsidence by automatically annotating reference points on the femur and implant, calibrating that with respect to magnification markers. Algorithm and manual subsidence measurements by two independent orthopedic surgeon reviewers in 135 randomly selected patients were compared. The mean, median, and SD of measurement discrepancy between the automatic and manual measurements were 0.6, 0.3, and 0.7 mm, respectively, and did not demonstrate a systematic tendency between human and machine. Automatic and manual measurements were strongly correlated and showed no evidence of significant differences. In contrast to the manual approach, the deep learning tool needs no user input to perform subsidence measurements. Keywords: Total Hip Arthroplasty, Femoral Component Subsidence, Artificial Intelligence, Deep Learning, Semantic Segmentation, Hip, Joints Supplemental material is available for this article. © RSNA, 2022.

Journal ArticleDOI
TL;DR: In this paper, a neural network model was trained to locate six vertebral landmarks, which are used to measure vertebral body height and to output spine angle measurements across multiple modalities.
Abstract: To construct and evaluate the efficacy of a deep learning system to rapidly and automatically locate six vertebral landmarks, which are used to measure vertebral body heights, and to output spine angle measurements (lumbar lordosis angles [LLAs]) across multiple modalities. In this retrospective study, MR (n = 1123), CT (n = 137), and radiographic (n = 484) images were used from a wide variety of patient populations, ages, disease stages, bone densities, and interventions (n = 1744 total patients, 64 years ± 8, 76.8% women; images acquired 2005–2020). Trained annotators assessed images and generated data necessary for deformity analysis and for model development. A neural network model was then trained to output vertebral body landmarks for vertebral height measurement. The network was trained and validated on 898 MR, 110 CT, and 387 radiographic images and was then evaluated or tested on the remaining images for measuring deformities and LLAs. The Pearson correlation coefficient was used in reporting LLA measurements. On the holdout testing dataset (225 MR, 27 CT, and 97 radiographic images), the network was able to measure vertebral heights (mean height percentage of error ± 1 standard deviation: MR images, 1.5% ± 0.3; CT scans, 1.9% ± 0.2; radiographs, 1.7% ± 0.4) and produce other measures such as the LLA (mean absolute error: MR images, 2.90°; CT scans, 2.26°; radiographs, 3.60°) in less than 1.7 seconds across MR, CT, and radiographic imaging studies. The developed network was able to rapidly measure morphometric quantities in vertebral bodies and output LLAs across multiple modalities. Keywords: Computer Aided Diagnosis (CAD), MRI, CT, Spine, Demineralization-Bone, Feature Detection Supplemental material is available for this article. © RSNA, 2021.
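The two geometric outputs named in this abstract, vertebral body heights from six landmarks and a lordosis angle, are straightforward to compute once the landmarks are available. A hedged sketch follows; the landmark ordering and the Cobb-style angle between two endplate lines are assumptions chosen for illustration, not a description of the authors' implementation.

```python
import numpy as np

def vertebral_heights(landmarks: np.ndarray) -> dict:
    """Anterior, middle, and posterior heights of one vertebral body.

    `landmarks` is assumed to be a (6, 2) array ordered as superior
    anterior/middle/posterior then inferior anterior/middle/posterior,
    with coordinates already in millimeters.
    """
    superior, inferior = landmarks[:3], landmarks[3:]
    names = ("anterior", "middle", "posterior")
    return {n: float(np.linalg.norm(s - i)) for n, s, i in zip(names, superior, inferior)}

def endplate_angle_deg(endplate_a: np.ndarray, endplate_b: np.ndarray) -> float:
    """Cobb-style angle (degrees) between two endplate lines, each a (2, 2) array of
    two points; a lumbar lordosis angle can be taken between the relevant endplates."""
    va = endplate_a[1] - endplate_a[0]
    vb = endplate_b[1] - endplate_b[0]
    cosine = abs(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return float(np.degrees(np.arccos(np.clip(cosine, 0.0, 1.0))))

# Example: a roughly 25-mm-tall vertebral body and two endplates tilted about 40 degrees apart.
body = np.array([[0, 0], [12, -1], [24, 0], [1, 25], [13, 26], [25, 25]], dtype=float)
print(vertebral_heights(body))
print(endplate_angle_deg(np.array([[0.0, 0.0], [30.0, 0.0]]),
                         np.array([[0.0, 0.0], [30.0, 25.0]])))
```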

Journal ArticleDOI
TL;DR: The developed deep learning–based method for mouse lung segmentation performed well independently of disease state (healthy, fibrotic, emphysematous lungs) and CT resolution.
Abstract: Purpose To develop a model to accurately segment mouse lungs with varying levels of fibrosis and investigate its applicability to mouse images with different resolutions. Materials and Methods In this experimental retrospective study, a U-Net was trained to automatically segment lungs on mouse CT images. The model was trained (n = 1200), validated (n = 300), and tested (n = 154) on longitudinally acquired and semiautomatically segmented CT images, which included both healthy and irradiated mice (group A). A second independent group of 237 mice (group B) was used for external testing. The Dice score coefficient (DSC) and Hausdorff distance (HD) were used as metrics to quantify segmentation accuracy. Transfer learning was applied to adapt the model to high-spatial-resolution mouse micro-CT segmentation (n = 20; group C [n = 16 for training and n = 4 for testing]). Results The trained model yielded a high median DSC in both test datasets: 0.984 (interquartile range [IQR], 0.977–0.988) in group A and 0.966 (IQR, 0.955–0.972) in group B. The median HD in both test datasets was 0.47 mm (IQR, 0–0.51 mm [group A]) and 0.31 mm (IQR, 0.30–0.32 mm [group B]). Spatially resolved quantification of differences toward reference masks revealed two hot spots close to the air-tissue interfaces, which are particularly prone to deviation. Finally, for the higher-resolution mouse CT images, the median DSC was 0.905 (IQR, 0.902–0.929) and the median 95th percentile of the HD was 0.33 mm (IQR, 2.61–2.78 mm). Conclusion The developed deep learning–based method for mouse lung segmentation performed well independently of disease state (healthy, fibrotic, emphysematous lungs) and CT resolution. Keywords: Deep Learning, Lung Fibrosis, Radiation Therapy, Segmentation, Animal Studies, CT, Thorax, Lung Supplemental material is available for this article. Published under a CC BY 4.0 license.
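The two accuracy metrics reported in this abstract, the Dice score coefficient and the (percentile) Hausdorff distance, can be computed directly from binary masks. Below is a minimal sketch using NumPy and SciPy; it uses a set-based rather than surface-based Hausdorff definition and takes voxel spacing as an argument, both simplifying assumptions relative to dedicated evaluation libraries.

```python
import numpy as np
from scipy import ndimage

def dice_score(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice score coefficient (DSC) between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def hausdorff_distance(pred, ref, spacing=(1.0, 1.0, 1.0), percentile=100.0) -> float:
    """Symmetric (percentile) Hausdorff distance in physical units between two binary
    masks, computed from Euclidean distance transforms of the mask complements.
    Note: distances are taken over all foreground voxels (set-based definition)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    if not pred.any() or not ref.any():
        return float("nan")  # undefined when either mask is empty
    dist_to_ref = ndimage.distance_transform_edt(~ref, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~pred, sampling=spacing)
    return float(max(np.percentile(dist_to_ref[pred], percentile),
                     np.percentile(dist_to_pred[ref], percentile)))

# Example with two overlapping spheres on a 0.14-mm isotropic grid (made-up spacing).
zz, yy, xx = np.ogrid[:64, :64, :64]
a = (zz - 32) ** 2 + (yy - 32) ** 2 + (xx - 32) ** 2 < 20 ** 2
b = (zz - 32) ** 2 + (yy - 34) ** 2 + (xx - 32) ** 2 < 20 ** 2
print(dice_score(a, b), hausdorff_distance(a, b, spacing=(0.14, 0.14, 0.14), percentile=95))
```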

Journal ArticleDOI
TL;DR: Kahn et al. as mentioned in this paper presented a collection of articles on detecting and mitigating bias in Radiology Machine Learning (ML) systems, including three main areas of bias: data handling, model development and performance evaluation.
Abstract: Hitting the Mark: Reducing Bias in AI Systems. Charles E. Kahn, Jr (Department of Radiology, University of Pennsylvania, Philadelphia, PA). Published online August 24, 2022. https://doi.org/10.1148/ryai.220171. Bias can creep in at many stages of the deep-learning process, and the standard practices in computer science aren’t designed to detect it (1). Imagine that you’ve taken up the sport of archery. With practice, you achieve consistency, but when you aim at the center of the target, the arrows land 5 cm to the right of center. What would you do? Naturally, you’d adjust your aim. By aiming 5 cm to the left, you overcome your systematic bias, and your arrows now find their mark. Skilled archers know to adjust their aim to account for factors such as their own bias, the distance to the target, and the presence of crosswind. The machine learning (ML) models that have begun to show success in medical imaging require even more careful attention to address the effects of bias. It’s well known that ML models are supremely adept at recognizing patterns. Often, though, the patterns that they learn can incorporate features that the systems’ authors never intended. We’ve laughed—and despaired—over the systems that diagnosed pneumonia based on a radiographic marker (2) or those that detected pneumothorax based on confounding image features, such as an inserted chest tube (3). Such failures are examples of “shortcut learning,” where an ML system learns a pattern that fits its training data, without learning to generalize properly to the complexities of the real world (4). Because ML systems are trained on historical data, they also can learn shortcuts that are not only undesirable, but that incorporate historical prejudice. A résumé-screening system to select a technology company’s top job candidates was found to discriminate against women (5). A widely used algorithm to assign health care resources was found to be racially biased (6). Given that artificial intelligence (AI) systems can discern a patient’s race and sex from a chest radiograph (7), we must be attentive to the possibility that unintended biases in radiology ML systems could interact adversely with other health care AI models. Several proposed approaches can help make ML models—and the data from which they’re derived—more transparent. “Model cards” can describe how a model performs across a variety of demographic or phenotypic groups; they help disclose the context in which the model is intended to be used (8). “Datasheets for datasets” can document the motivation, composition, collection process, and recommended uses of a given dataset; these datasheets have potential to increase transparency and accountability (9). To address the particular challenges of bias in medical imaging AI systems, Dr Bradley Erickson and colleagues have produced a new collection of articles on “Mitigating Bias in Radiology Machine Learning” (https://pubs.rsna.org/page/ai/mitigating_bias). These articles describe approaches to help identify and reduce bias in radiology ML systems.
Their articles address three principal areas where bias can impact an ML model: data handling (10), model development (11), and performance evaluation (12). These articles will help us identify and mitigate potential biases in radiology ML systems, and thus better assure trustworthy results that are truly “on target” for the care of our patients. Disclosures of conflicts of interest: C.E.K. Salary support from RSNA, paid to employer, for editorial role (Editor of Radiology: Artificial Intelligence). The author declared no funding for this work.
References:
1. Hao K. This is how AI bias really happens—and why it’s so hard to fix. MIT Technology Review. https://www.technologyreview.com/2019/02/04/137602/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/. Published February 4, 2019.
2. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med 2018;15:e1002683.
3. Rueckel J, Trappmann L, Schachtner B, et al. Impact of confounding thoracic tubes and pleural dehiscence extent on artificial intelligence pneumothorax detection in chest radiographs. Invest Radiol 2020;55:792–798.
4. Geirhos R, Jacobsen JH, Michaelis C, et al. Shortcut learning in deep neural networks. Nat Mach Intell 2020;2:665–673.
5. Dastin J. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/idUSKCN1MK08G. Published October 10, 2018.
6. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447–453.
7. Gichoya JW, Banerjee I, Bhimireddy AR, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health 2022;4:e406–e414.
8. Mitchell M, Wu S, Zaldivar A, et al. Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3287560.3287596. Published January 29, 2019.
9. Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Daumé H III, Crawford K. Datasheets for datasets. Commun ACM 2021;64:86–92.
10. Rouzrokh P, Khosravi B, Faghani S, et al. Mitigating bias in radiology machine learning: 1. Data handling. Radiol Artif Intell 2022;4(5):e210290.
11. Zhang K, Khosravi B, Vahdati S, et al. Mitigating bias in radiology machine learning: 2. Model development. Radiol Artif Intell 2022;4(5):e220010.
12. Faghani S, Khosravi B, Zhang K, et al. Mitigating bias in radiology machine learning: 3. Performance metrics. Radiol Artif Intell 2022;4(5):e220061.
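One concrete, model-card-style check implied by this editorial is simply to report a model's performance separately for each demographic or phenotypic group and look for large gaps. The sketch below does that for AUC with scikit-learn; the group labels and the flat dictionary report format are illustrative assumptions, and such a check is a screening step, not a complete fairness audit.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc_report(y_true, y_score, groups) -> dict:
    """Per-group AUC for a binary classifier, in the spirit of a model card's
    disaggregated evaluation. Returns NaN for groups where AUC is undefined."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    report = {}
    for group in np.unique(groups):
        mask = groups == group
        if len(np.unique(y_true[mask])) < 2:   # only one class present in this group
            report[str(group)] = float("nan")
        else:
            report[str(group)] = float(roc_auc_score(y_true[mask], y_score[mask]))
    return report

# Example with synthetic labels, scores, and a made-up two-group attribute.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels * 0.4 + rng.random(200) * 0.6
sexes = rng.choice(["F", "M"], size=200)
print(subgroup_auc_report(labels, scores, sexes))
```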

Journal ArticleDOI
TL;DR: In this article, two-dimensional MultiResUNet and DenseNet121 networks based on T2-weighted images were used for spinal cord lesion segmentation and classification.
Abstract: Accurate differentiation of intramedullary spinal cord tumors from inflammatory demyelinating lesions, and of their subtypes, is warranted because these entities have overlapping characteristics at MRI but different treatments and prognoses. The authors aimed to develop a pipeline for spinal cord lesion segmentation and classification using two-dimensional MultiResUNet and DenseNet121 networks based on T2-weighted images. A retrospective cohort of 490 patients (118 patients with astrocytoma, 130 with ependymoma, 101 with multiple sclerosis [MS], and 141 with neuromyelitis optica spectrum disorders [NMOSD]) was used for model development, and a prospective cohort of 157 patients (34 patients with astrocytoma, 45 with ependymoma, 33 with MS, and 45 with NMOSD) was used for model testing. In the test cohort, the model achieved Dice scores of 0.77, 0.80, 0.50, and 0.58 for segmentation of astrocytoma, ependymoma, MS, and NMOSD, respectively, against manual labeling. Accuracies of 96% (area under the receiver operating characteristic curve [AUC], 0.99), 82% (AUC, 0.90), and 79% (AUC, 0.85) were achieved for the classifications of tumor versus demyelinating lesion, astrocytoma versus ependymoma, and MS versus NMOSD, respectively. In a subset of radiologically difficult cases, the classifier showed an accuracy of 79%–95% (AUC, 0.78–0.97). The established deep learning pipeline for segmentation and classification of spinal cord lesions can support an accurate radiologic diagnosis. Supplemental material is available for this article. © RSNA, 2022 Keywords: Spinal Cord MRI, Astrocytoma, Ependymoma, Multiple Sclerosis, Neuromyelitis Optica Spectrum Disorder, Deep Learning.
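To make the segment-then-classify pipeline in this abstract concrete, the sketch below crops a lesion region of interest from a binary segmentation mask and passes it to a DenseNet121 classifier adapted to single-channel T2-weighted input. The segmentation network itself is not reproduced, and the layer replacements, input size, and two-class head are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torchvision

def crop_lesion_roi(image: torch.Tensor, mask: torch.Tensor, margin: int = 8) -> torch.Tensor:
    """Crop the bounding box of the segmented lesion (plus a margin) from a 2D slice.
    `image` is (N, C, H, W); `mask` is a boolean (H, W) segmentation output."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = max(int(ys.min()) - margin, 0), min(int(ys.max()) + margin, image.shape[-2] - 1)
    x0, x1 = max(int(xs.min()) - margin, 0), min(int(xs.max()) + margin, image.shape[-1] - 1)
    return image[..., y0:y1 + 1, x0:x1 + 1]

# DenseNet121 backbone with a single-channel first convolution and a two-class head
# (e.g., tumor vs demyelinating lesion); these replacements are illustrative choices.
classifier = torchvision.models.densenet121(weights=None)
classifier.features.conv0 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
classifier.classifier = torch.nn.Linear(classifier.classifier.in_features, 2)

# Dummy T2-weighted slice and a dummy segmentation mask standing in for the
# MultiResUNet output; real inputs would come from the imaging pipeline.
slice_t2 = torch.randn(1, 1, 256, 256)
seg_mask = torch.zeros(256, 256, dtype=torch.bool)
seg_mask[100:140, 110:150] = True

roi = crop_lesion_roi(slice_t2, seg_mask)
roi = torch.nn.functional.interpolate(roi, size=(224, 224), mode="bilinear", align_corners=False)
logits = classifier(roi)
print(logits.shape)  # torch.Size([1, 2])
```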