A dictionary to identify small molecules and drugs in free text
Kristina Hettne, Rob H. Stierum, Martijn J. Schuemie, Peter J. M. Hendriksen, Bob J. A. Schijvenaars, Erik M. van Mulligen, Jos C. S. Kleinjans, Jan A. Kors
TLDR
A dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus, is developed.
Abstract:
Motivation: The scientific community has invested considerable effort in the correct identification of gene and protein names in text, but far less in the correct identification of chemical names. Dictionary-based term identification can recognize the diverse representations of chemical information in the literature and map the chemicals to their database identifiers.
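The idea of dictionary-based term identification can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary, its placeholder identifiers and the function name `find_chemicals` are assumptions made for this example, and the matching is a plain case-insensitive, longest-term-first lookup.

```python
import re

# Hypothetical toy dictionary: lower-cased term -> database identifier.
# The identifiers are placeholders, not real database records.
TOY_DICTIONARY = {
    "aspirin": "ID:0001",
    "acetylsalicylic acid": "ID:0001",
    "caffeine": "ID:0002",
}


def find_chemicals(text, dictionary):
    """Return (start, end, matched_text, identifier) tuples for dictionary hits.

    Longer terms are tried first so that multi-word names take precedence
    over shorter terms that overlap with them.
    """
    hits = []
    taken = [False] * len(text)  # characters already covered by a match
    for term in sorted(dictionary, key=len, reverse=True):
        # Require that the term is not embedded in a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            start, end = match.start(), match.end()
            if not any(taken[start:end]):
                hits.append((start, end, match.group(0), dictionary[term]))
                for i in range(start, end):
                    taken[i] = True
    return sorted(hits)


if __name__ == "__main__":
    sentence = "Aspirin and acetylsalicylic acid are synonyms; caffeine is not."
    for hit in find_chemicals(sentence, TOY_DICTIONARY):
        print(hit)
```

Because synonyms share an identifier, different surface forms of the same compound are mapped to the same database entry.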
Results: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, a manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the different processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary.
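For readers unfamiliar with the metrics, the following sketch shows how precision and recall are commonly computed from predicted and gold-standard annotations, under the assumption of exact-match comparison (the abstract does not state the matching criterion used). The function names are hypothetical. The harmonic mean (F-score) is not reported in the abstract; for the reported precision of 0.67 and recall of 0.40 it works out to roughly 0.50.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Compare two collections of annotations by exact match.

    Each annotation can be any hashable value, for example a
    (start, end, identifier) tuple.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


if __name__ == "__main__":
    # F-score derived from the precision and recall quoted in the abstract.
    print(round(f_score(0.67, 0.40), 2))
```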
Availability: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.
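As a rough illustration of how a SKOS file might be read into a term lookup table, the sketch below parses a small inline RDF/XML sample with Python's standard library. The sample document, the `example.org` URI and the function name `load_terms` are assumptions for this example; the actual layout of the distributed file may differ and should be checked before reuse. Only the RDF and SKOS namespace URIs and the `prefLabel`/`altLabel` properties come from the SKOS standard.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Hypothetical sample; not taken from the published dictionary.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/chem/0001">
    <skos:prefLabel>aspirin</skos:prefLabel>
    <skos:altLabel>acetylsalicylic acid</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
"""


def load_terms(xml_text):
    """Map each lower-cased preferred or alternative label to its concept URI."""
    root = ET.fromstring(xml_text)
    about = "{" + NS["rdf"] + "}about"
    terms = {}
    for concept in root.findall("skos:Concept", NS):
        uri = concept.get(about)
        labels = concept.findall("skos:prefLabel", NS)
        labels += concept.findall("skos:altLabel", NS)
        for label in labels:
            if label.text:
                terms[label.text.strip().lower()] = uri
    return terms


if __name__ == "__main__":
    print(load_terms(SAMPLE))
```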
Contact: k.hettne@erasmusmc.nl
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations
Journal Article
Deep learning with word embeddings improves biomedical named entity recognition.
TL;DR: This work shows that a completely generic method based on deep learning and statistical word embeddings [called long short‐term memory network‐conditional random field (LSTM‐CRF)] outperforms state‐of‐the‐art entity‐specific NER tools, and often by a large margin.
Journal Article
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artières, Axel-Cyrille Ngonga Ngomo, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, Georgios Paliouras
TL;DR: Overall, BioASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.
Proceedings Article
A Survey on Recent Advances in Named Entity Recognition from Deep Learning models
Vikas Yadav, Steven Bethard
TL;DR: This work presents a comprehensive survey of deep neural network architectures for NER and contrasts them with previous approaches to NER based on feature engineering and other supervised or semi-supervised learning algorithms.
Journal Article
An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition
TL;DR: A neural network approach, i.e. attention‐based bidirectional Long Short‐Term Memory with a conditional random field layer (Att‐BiLSTM‐CRF), to document‐level chemical NER that achieves better performances with little feature engineering than other state‐of‐the‐art methods.
Journal Article
tmChem: a high performance approach for chemical named entity recognition and normalization
TL;DR: tmChem is a state-of-the-art system for chemical named entity recognition that combines two independent machine learning models in an ensemble, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation).
References
Journal Article
Database resources of the National Center for Biotechnology Information
David L. Wheeler, Deanna M. Church, Ron Edgar, Scott Federhen, Wolfgang Helmberg, Thomas L. Madden, Joan Pontius, Gregory D. Schuler, Lynn M. Schriml, Edwin Sequeira, Tugba O. Suzek, Tatiana Tatusova, Lukas Wagner
TL;DR: In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI’s website.
Journal Article
KEGG for linking genomes to life and the environment
Minoru Kanehisa, Michihiro Araki, Susumu Goto, Masahiro Hattori, Mika Hirakawa, Masumi Itoh, Toshiaki Katayama, Shuichi Kawashima, Shujiro Okuda, Toshiaki Tokimatsu, Yoshihiro Yamanishi
TL;DR: KEGG PATHWAY is now supplemented with a new global map of metabolic pathways, which is essentially a combined map of about 120 existing pathway maps, and the KEGG resource is being expanded to suit the needs for practical applications.
Journal Article
The Unified Medical Language System (UMLS): integrating biomedical terminology
TL;DR: The Unified Medical Language System is a repository of biomedical vocabularies developed by the US National Library of Medicine and includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap).
Journal Article
DrugBank: a knowledgebase for drugs, drug actions and drug targets
David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, Murtaza Hassanali
TL;DR: The latest version of DrugBank (release 2.0) has been expanded significantly over the previous release and contains 60% more FDA-approved small molecule and biotech drugs including 10% more ‘experimental’ drugs.
Journal Article
HMDB: a knowledgebase for the human metabolome
David S. Wishart, Craig Knox, An Chi Guo, Roman Eisner, Nelson Young, Bijaya Gautam, David Hau, Nick Psychogios, Edison Dong, Souhaila Bouatra, Rupasri Mandal, Igor Sinelnikov, Jianguo Xia, Leslie Jia, Joseph A. Cruz, Emilia L. Lim, Constance A. Sobsey, Savita Shrivastava, Paul Huang, Philip Liu, Lydia Fang, Jun Peng, Ryan Fradette, Dean Cheng, Dan Tzur, Melisa Clements, Avalyn Lewis, Andrea De Souza, Azaret Zuniga, Margot Dawe, Yeping Xiong, Derrick L. J. Clive, Russell Greiner, Alsu Nazyrova, Rustem Shaykhutdinov, Liang Li, Hans J. Vogel, Ian J. Forsythe
TL;DR: The most recent release of HMDB has been significantly expanded and enhanced over the previous release, with the number of fully annotated metabolite entries growing from 2180 to more than 6800, a 300% increase.