Journal Article
Large Language Models Encode Clinical Knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, P. A. Payne, Martin G. Seneviratne, P. Gamble, Chris Kelly, Nathanael Schärli, Aakanksha Chowdhery, Philip Andrew Mansfield, Blaise Agüera y Arcas, Dale R. Webster, Greg S. Corrado, Yossi Matias, K. Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joëlle K. Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
TLDR
The authors propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias, and show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question-answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
Citations
Journal Article
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
TL;DR: This paper proposes BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature and evaluates it on six biomedical natural language processing tasks and demonstrates that the model outperforms previous models on most tasks.
Journal Article
Capabilities of GPT-4 on Medical Challenge Problems
TL;DR: This article presents a comprehensive evaluation of GPT-4, a general-purpose model neither specialized for medical problems through training nor engineered to solve clinical tasks, on medical competency examinations and benchmark datasets.
Journal Article
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts
TL;DR: This paper studies the design decisions behind publicly available instruction-tuning methods and breaks down the development of Flan 2022 (Chung et al., 2022) through careful ablation studies on the Flan Collection of tasks and methods, revealing that task balancing and enrichment techniques are critical to effective instruction tuning.
Journal Article
Foundation models for generalist medical artificial intelligence
Michael Moor, O. Banerjee, Zahra F. H. Abad, Harlan M. Krumholz, Jure Leskovec, Eric J. Topol, Pranav Rajpurkar
TL;DR: This paper proposes generalist medical AI (GMAI), a new paradigm for medical AI capable of carrying out a diverse set of tasks using very little or no task-specific labelled data.
Journal Article
BloombergGPT: A Large Language Model for Finance
Shijie Wu, Ozan Irsoy, Steven Lu, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, D. Rosenberg, Gideon Mann
TL;DR: The authors present a 50-billion-parameter language model trained on a wide range of financial data: a 363-billion-token financial dataset augmented with 345 billion tokens from general-purpose datasets.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article
Bleu: a Method for Automatic Evaluation of Machine Translation
TL;DR: This paper proposes a method of automatic machine-translation evaluation that is quick, inexpensive and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Posted Content
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches and other factors on dozens of language-understanding tasks, achieving state-of-the-art results on many benchmarks covering summarization, question answering, text classification and more.
Posted Content
Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter
TL;DR: This work proposes a simple modification that recovers the original formulation of weight-decay regularization by decoupling the weight decay from the optimization steps taken with respect to the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.