Open Access, Journal Article

BioInfer: a corpus for information extraction in the biomedical domain

TLDR
A corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers is introduced.
Abstract
Recently, there has been great interest in applying information extraction methods to the biomedical domain, in particular to extracting relationships between genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora. We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme that captures named entities and their relationships, along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles, annotated for relationships, named entities, and syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of its relationship annotation. We introduce a corpus targeted at protein, gene, and RNA relationships, which serves as a resource for the development of information extraction systems and their components, such as parsers and domain analyzers. The corpus will be maintained and further developed, with the current version available at http://www.it.utu.fi/BioInfer.

