MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts

Open Access Proceedings Article

TLDR

The MedMentions corpus is a manually annotated resource for the recognition of biomedical concepts, comprising over 4,000 abstracts and over 350,000 linked mentions.

Abstract:

This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

Citations

Journal Article

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Jinhyuk Lee, +6 more

- 25 Jan 2019 -

Bioinformatics

TL;DR: This article proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.

Proceedings Article

Self-alignment pretraining for biomedical entity representations

Fangyu Liu, +4 more

TL;DR: SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets and being able to achieve SOTA even without task-specific supervision.

Posted Content

Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Zeljko Kraljevic, +17 more

- 02 Oct 2020 -

arXiv: Computation and Language

TL;DR: The open source Medical Concept Annotation Toolkit (MedCAT) is presented, providing a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT and a feature-rich annotation interface for customizing and training IE models.

Posted Content

COMETA: A Corpus for Medical Entity Linking in the Social Media

Marco Basaldella, +3 more

- 07 Oct 2020 -

arXiv: Computation and Language

TL;DR: A new corpus called COMETA is introduced, consisting of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT, a widely used medical knowledge graph. The corpus satisfies a combination of desirable properties that, to the best of the authors' knowledge, has not been met by any existing resource in the field.

Posted Content

CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset

Ting-Hao Kenneth Huang, +4 more

- 17 Aug 2020 -

arXiv: Computation and Language

TL;DR: CODA-19 is a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset; the work demonstrates that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.

References

Proceedings Article

The Stanford CoreNLP Natural Language Processing Toolkit

Christopher D. Manning, +5 more

TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis. The authors suggest that the toolkit's wide adoption follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.

Journal Article

The Unified Medical Language System (UMLS): integrating biomedical terminology

Olivier Bodenreider

- 01 Jan 2004 -

Nucleic Acids Research

TL;DR: The Unified Medical Language System is a repository of biomedical vocabularies developed by the US National Library of Medicine and includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap).

Proceedings Article

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

Erik Tjong Kim Sang, +1 more

TL;DR: The CoNLL-2003 shared task on language-independent named entity recognition (NER) is described, with background on the data sets and evaluation method and a general overview of the systems that participated in the task and their performance.

Journal Article

GENIA corpus—a semantically annotated corpus for bio-textmining

Jin-Dong Kim, +3 more

- 03 Jul 2003 -

Bioinformatics

TL;DR: The GENIA corpus is a large corpus of 2,000 MEDLINE abstracts with more than 400,000 words and almost 100,000 annotations for biological terms, intended for bio-text mining.

Proceedings Article

Zero-shot Learning with Semantic Output Codes

Mark Palatucci, +3 more

TL;DR: A semantic output code classifier is introduced, which uses a knowledge base of semantic properties of the output classes Y to extrapolate to novel classes. It can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
