Open Access Journal Article

A dictionary to identify small molecules and drugs in free text

TLDR
A dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus is developed.
Abstract
Motivation: The scientific community has spent a lot of effort on the correct identification of gene and protein names in text, but less on the correct identification of chemical names. Dictionary-based term identification can recognize the diverse representations of chemical information in the literature and map the chemicals to their database identifiers.

Results: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, a manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary.

Availability: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.

Contact: k.hettne@erasmusmc.nl

Supplementary information: Supplementary data are available at Bioinformatics online.
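The dictionary-based identification and the precision/recall evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy dictionary, the `TOY:` identifiers, the length threshold, the blocked-term list and all function names are hypothetical, chosen only to show how a term filter can raise precision when an ambiguous term is removed.

```python
import re

# Hypothetical toy dictionary: term -> identifier (identifiers are made up).
TOY_DICTIONARY = {
    "aspirin": "TOY:0001",
    "acetylsalicylic acid": "TOY:0001",
    "caffeine": "TOY:0002",
    "lead": "TOY:0003",  # ambiguous: also a common English verb
}

# Hypothetical stand-in for a manually curated list of ambiguous terms.
AMBIGUOUS_TERMS = {"lead"}


def filter_terms(dictionary, min_length=3, blocked=AMBIGUOUS_TERMS):
    """Rule-based filter: drop very short terms and blocked ambiguous terms."""
    return {
        term: ident
        for term, ident in dictionary.items()
        if len(term) >= min_length and term not in blocked
    }


def find_mentions(text, dictionary):
    """Return sorted (start, end, identifier) tuples for whole-word,
    case-insensitive matches, preferring longer terms on overlap."""
    mentions = []
    taken = []  # spans already claimed by a longer term
    for term in sorted(dictionary, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.start(), match.end()
            if any(start < e and s < end for s, e in taken):
                continue
            taken.append((start, end))
            mentions.append((start, end, dictionary[term]))
    return sorted(mentions)


def precision_recall(predicted, gold):
    """Exact-match precision and recall over annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = "Aspirin and caffeine may lead to interactions."
gold = [(0, 7, "TOY:0001"), (12, 20, "TOY:0002")]  # hand-made annotations

raw = find_mentions(text, TOY_DICTIONARY)
filtered = find_mentions(text, filter_terms(TOY_DICTIONARY))
print(precision_recall(raw, gold))
print(precision_recall(filtered, gold))
```

In this toy case the unfiltered dictionary tags the verb "lead" as a chemical, which lowers precision; removing the ambiguous term fixes that without touching recall. On a real corpus, as the abstract notes, such filtering typically costs some recall.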


Citations
Journal Article

Deep learning with word embeddings improves biomedical named entity recognition.

TL;DR: This work shows that a completely generic method based on deep learning and statistical word embeddings [called long short‐term memory network‐conditional random field (LSTM‐CRF)] outperforms state‐of‐the‐art entity‐specific NER tools, and often by a large margin.
Proceedings Article

A Survey on Recent Advances in Named Entity Recognition from Deep Learning models

TL;DR: This work presents a comprehensive survey of deep neural network architectures for NER, and contrasts them with previous approaches to NER based on feature engineering and other supervised or semi-supervised learning algorithms.
Journal Article

An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition

TL;DR: A neural network approach to document‐level chemical NER, an attention‐based bidirectional Long Short‐Term Memory network with a conditional random field layer (Att‐BiLSTM‐CRF), that achieves better performance with less feature engineering than other state‐of‐the‐art methods.
Journal Article

tmChem: a high performance approach for chemical named entity recognition and normalization

TL;DR: tmChem is a state-of-the-art system for chemical named entity recognition that combines two independent machine learning models in an ensemble, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation).
References
Journal Article

Database resources of the National Center for Biotechnology Information

TL;DR: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI’s website.
Journal Article

KEGG for linking genomes to life and the environment

TL;DR: KEGG PATHWAY is now supplemented with a new global map of metabolic pathways, which is essentially a combined map of about 120 existing pathway maps, and the KEGG resource is being expanded to suit the needs for practical applications.
Journal Article

The Unified Medical Language System (UMLS): integrating biomedical terminology

TL;DR: The Unified Medical Language System is a repository of biomedical vocabularies developed by the US National Library of Medicine and includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap).
Journal Article

DrugBank: a knowledgebase for drugs, drug actions and drug targets

TL;DR: The latest version of DrugBank (release 2.0) has been expanded significantly over the previous release and contains 60% more FDA-approved small molecule and biotech drugs including 10% more ‘experimental’ drugs.