Open Access Proceedings Article
MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts
Sunil Mohan, Donghui Li
TL;DR: MedMentions is a manually annotated corpus for the recognition of biomedical concepts, comprising over 4,000 abstracts and over 350,000 linked mentions.
Abstract:
This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.
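The abstract mentions metrics for entity linking without spelling them out. As a rough illustration only, the following Python sketch computes mention-level precision, recall, and F1 under the assumption that a predicted mention counts as correct only when its document, character span, and linked concept all match a gold annotation exactly. The `Mention` type, the `linking_prf` function, and the identifiers in the example are hypothetical and are not taken from the paper or the corpus release.

```python
from typing import Iterable, NamedTuple, Tuple


class Mention(NamedTuple):
    """One linked mention (hypothetical layout, not the corpus file format)."""
    doc_id: str      # document identifier, e.g. a PubMed ID
    start: int       # character offset where the mention begins
    end: int         # character offset where the mention ends (exclusive)
    concept_id: str  # identifier of the linked concept


def linking_prf(gold: Iterable[Mention],
                predicted: Iterable[Mention]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring.

    Assumption: a prediction is a true positive only if document, span,
    and concept all equal those of some gold mention.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    gold = [Mention("doc1", 0, 7, "C0000001"), Mention("doc1", 12, 20, "C0000002")]
    pred = [Mention("doc1", 0, 7, "C0000001"), Mention("doc1", 12, 20, "C0000009")]
    print(linking_prf(gold, pred))
```

Exact matching is the strictest choice; a more lenient variant could credit overlapping spans, but that would be a different metric from the one sketched here.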
Citations
Journal Article
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
TL;DR: This article proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.
Proceedings Article
Self-alignment pretraining for biomedical entity representations
TL;DR: SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets and being able to achieve SOTA even without task-specific supervision.
Posted Content
Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
Zeljko Kraljevic, Thomas Searle, Anthony Shek, Lukasz Roguski, Kawsar Noor, Daniel Bean, Aurelie Mascio, Leilei Zhu, Amos Folarin, Angus Roberts, Rebecca Bendayan, Mark P. Richardson, Robert Stewart, Anoop D. Shah, Wai Keong Wong, Zina M. Ibrahim, James T. Teo, Richard Dobson
TL;DR: The open source Medical Concept Annotation Toolkit (MedCAT) is presented, providing a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT and a feature-rich annotation interface for customizing and training IE models.
Posted Content
COMETA: A Corpus for Medical Entity Linking in the Social Media
TL;DR: A new corpus called COMETA is introduced, consisting of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT, a widely used medical knowledge graph. The corpus satisfies a combination of desirable properties that, to the best of the authors' knowledge, no existing resource in the field has met.
Posted Content
CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset
TL;DR: CODA-19 is a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. The work demonstrates that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
References
Proceedings Article
The Stanford CoreNLP Natural Language Processing Toolkit
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky
TL;DR: The design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, is described. The authors suggest that its wide use follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good-quality analysis components, and not requiring a large amount of associated baggage.
Journal Article
The Unified Medical Language System (UMLS): integrating biomedical terminology
TL;DR: The Unified Medical Language System is a repository of biomedical vocabularies developed by the US National Library of Medicine and includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap).
Proceedings Article
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition
TL;DR: This paper describes the CoNLL-2003 shared task on language-independent named entity recognition (NER), giving background on the data sets and evaluation method, along with a general overview of the systems that participated in the task and their performance.
Journal Article
GENIA corpus—a semantically annotated corpus for bio-textmining
TL;DR: The GENIA corpus is a collection of 2,000 MEDLINE abstracts with more than 400,000 words and almost 100,000 annotations for biological terms, intended for bio-text mining.
Proceedings Article
Zero-shot Learning with Semantic Output Codes
TL;DR: A semantic output code classifier is introduced, which uses a knowledge base of semantic properties of the output classes Y to extrapolate to novel classes. It can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.