Journal Article

Large Language Models Encode Clinical Knowledge

TLDR
The authors proposed a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias, and showed that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and the Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
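The instruction prompt tuning described in the abstract adapts a frozen LLM by learning a small set of soft prompt vectors from a handful of exemplars, rather than updating the model's weights. Below is a minimal sketch of that general mechanism in PyTorch with Hugging Face Transformers; the stand-in model ("gpt2"), the prompt length and the learning rate are illustrative assumptions, since PaLM and Flan-PaLM are not publicly available.

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper tunes Flan-PaLM, which is not public
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze every model parameter: only the soft prompt below is trained.
for p in model.parameters():
    p.requires_grad = False

n_prompt_tokens = 20  # illustrative choice, not a value from the paper
embed = model.get_input_embeddings()
# Learnable "virtual token" embeddings, initialised from real token embeddings.
soft_prompt = nn.Parameter(embed.weight[:n_prompt_tokens].detach().clone())

def step(input_ids, labels):
    tok_embeds = embed(input_ids)                            # (batch, seq, dim)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)   # prepend prompt
    # Mask the prompt positions out of the loss with the ignore index -100.
    ignore = torch.full((input_ids.size(0), n_prompt_tokens), -100,
                        dtype=labels.dtype)
    out = model(inputs_embeds=inputs_embeds,
                labels=torch.cat([ignore, labels], dim=1))
    return out.loss

optimizer = torch.optim.AdamW([soft_prompt], lr=3e-4)

# One illustrative training step on a single exemplar.
batch = tokenizer("Question: ... Answer: ...", return_tensors="pt")
loss = step(batch.input_ids, batch.input_ids.clone())
loss.backward()
optimizer.step()

Because only the soft prompt receives gradients, the trainable parameter count is tiny relative to the model, which is what makes the approach parameter-efficient.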



Citations
Journal Article

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

TL;DR: This paper proposes BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature, evaluates it on six biomedical natural language processing tasks, and demonstrates that it outperforms previous models on most of them.
Journal Article

Capabilities of GPT-4 on Medical Challenge Problems

TL;DR: This article presents a comprehensive evaluation of GPT-4, a general-purpose model neither specialized for medical problems through training nor engineered to solve clinical tasks, on medical competency examinations and benchmark datasets.
Journal Article

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

TL;DR: This paper studies the design decisions behind publicly available instruction tuning methods and breaks down the development of Flan 2022 (Chung et al., 2022) through careful ablation studies on the Flan Collection of tasks and methods, revealing that task balancing and enrichment techniques are critical to effective instruction tuning.
Journal Article

Foundation models for generalist medical artificial intelligence

TL;DR: This paper proposes generalist medical AI (GMAI), a new paradigm for medical AI capable of carrying out a diverse set of tasks using very little or no task-specific labelled data.
Journal Article

BloombergGPT: A Large Language Model for Finance

TL;DR: The authors present BloombergGPT, a 50-billion-parameter language model trained on a wide range of financial data: a 363-billion-token financial dataset augmented with 345 billion tokens from general-purpose datasets.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-German and English-to-French translation.
Proceedings Article

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes BLEU, a method of automatic machine-translation evaluation that is quick, inexpensive and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Posted Content

Decoupled Weight Decay Regularization

TL;DR: This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.
Trending Questions
How can the accuracy of an LLM be improved?

To improve LLM accuracy, the study points to instruction prompt tuning, a parameter-efficient approach that aligns a model to a new domain using a few exemplars, alongside gains from model scale; a sketch of exemplar-based prompting follows.
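At the prompting level, aligning a model with a few exemplars can be as simple as assembling them into an instruction-plus-examples prompt before querying the model. The sketch below is illustrative only; the instruction wording and the exemplar questions and answers are invented placeholders, not content from Med-PaLM.

# A minimal sketch of building a few-shot instruction prompt from exemplars.
# All strings here are hypothetical placeholders, not from the paper.
EXEMPLARS = [
    ("What causes iron deficiency anaemia?",
     "Common causes include chronic blood loss, low dietary iron intake and "
     "impaired absorption; a clinician should identify the underlying cause."),
    ("Is a temperature of 38 C dangerous for an adult?",
     "A mild fever is usually not dangerous on its own, but persistent or "
     "very high fevers warrant medical attention."),
]

INSTRUCTION = ("You are a medical assistant. Answer the question helpfully, "
               "factually and safely, and recommend seeing a clinician when "
               "appropriate.")

def build_prompt(question: str) -> str:
    """Assemble the instruction, worked exemplars and the new question."""
    parts = [INSTRUCTION, ""]
    for q, a in EXEMPLARS:
        parts += [f"Question: {q}", f"Answer: {a}", ""]
    parts += [f"Question: {question}", "Answer:"]
    return "\n".join(parts)

print(build_prompt("What are common symptoms of dehydration?"))

The resulting string is sent to the model as-is; instruction prompt tuning goes one step further by learning soft prompt vectors from such exemplars while the model's weights stay frozen, as sketched after the abstract above.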