Open Access Proceedings Article

MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts

TLDR
MedMentions is a manually annotated corpus for biomedical concept recognition, comprising over 4,000 abstracts and over 350,000 linked mentions.
Abstract
This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

