Extraction of protein interaction information from unstructured text using a context-free grammar.

doi:10.1093/BIOINFORMATICS/BTG279

Joshua Michael Temkin, +1 more

- 01 Nov 2003 -

Bioinformatics

- Vol. 19, Iss: 16, pp 2046-2053

TLDR

This work describes a system for extracting protein, gene and small molecule (PGSM) interactions from unstructured text using a lexical analyzer and a context-free grammar, and demonstrates that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision.

Abstract:

Motivation: As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating these sources largely requires a manual process of interpreting and extracting interaction relationships from the biological research literature. Being able to efficiently and accurately automate the extraction of interactions from unstructured text would greatly improve the content of these databases and provide a method for managing the continued growth of new literature being published.

Results: In this paper, we describe a system for extracting PGSM interactions from unstructured text. By utilizing a lexical analyzer and context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexities of natural language processing by focusing on domain-specific structure as opposed to analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules for extracting other types of biological relationships beyond PGSM relationships.

Availability: The program and corpus are available by request from the authors.

Contact: gilder@research.ge.com; jtemkin1@comcast.net
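To make the lexical-analyzer-plus-CFG idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' system: the tiny lexicons (`ENTITY_NAMES`, `INTERACTION_VERBS`), the function names, and the two-rule toy grammar are all hypothetical, chosen only to illustrate the general technique of tokenizing text into domain-specific token types and then matching grammar rules over those types.

```python
import re
from typing import List, Tuple

# Hypothetical mini-lexicons (assumptions for illustration only);
# a real system would rely on much larger dictionaries and rules.
ENTITY_NAMES = {"raf1", "mek1", "p53", "mdm2", "atp"}
INTERACTION_VERBS = {"binds", "phosphorylates", "activates", "inhibits"}

Token = Tuple[str, str]  # (token type, original text)


def tokenize(sentence: str) -> List[Token]:
    """Lexical analysis: map each word to a domain-specific token type."""
    tokens: List[Token] = []
    for word in re.findall(r"[A-Za-z0-9\-]+", sentence):
        lower = word.lower()
        if lower in ENTITY_NAMES:
            tokens.append(("ENTITY", word))
        elif lower in INTERACTION_VERBS:
            tokens.append(("VERB", word))
        elif lower == "and":
            tokens.append(("AND", word))
        else:
            tokens.append(("OTHER", word))
    return tokens


def parse_entity_list(tokens: List[Token], pos: int) -> Tuple[List[str], int]:
    """Toy rule: entity_list -> ENTITY (AND ENTITY)*

    Returns the matched names and the position after the match;
    an empty list means the rule did not match at `pos`.
    """
    if pos >= len(tokens) or tokens[pos][0] != "ENTITY":
        return [], pos
    names = [tokens[pos][1]]
    pos += 1
    while (pos + 1 < len(tokens)
           and tokens[pos][0] == "AND"
           and tokens[pos + 1][0] == "ENTITY"):
        names.append(tokens[pos + 1][1])
        pos += 2
    return names, pos


def extract_interactions(sentence: str) -> List[Tuple[str, str, str]]:
    """Toy rule: interaction -> entity_list VERB entity_list

    Scans the token stream and tries the rule at each position,
    requiring the matched tokens to be contiguous.
    """
    tokens = tokenize(sentence)
    results: List[Tuple[str, str, str]] = []
    pos = 0
    while pos < len(tokens):
        subjects, after_subj = parse_entity_list(tokens, pos)
        if (subjects and after_subj < len(tokens)
                and tokens[after_subj][0] == "VERB"):
            verb = tokens[after_subj][1]
            objects, after_obj = parse_entity_list(tokens, after_subj + 1)
            if objects:
                results.extend((s, verb, o) for s in subjects for o in objects)
                pos = after_obj
                continue
        pos += 1  # no match here; try the next token
    return results


print(extract_interactions("In these cells, p53 binds MDM2 and ATP."))
```

The parser in this sketch never inspects general English syntax or semantics; it only looks for a domain-specific token pattern, which is the kind of simplification the abstract attributes to the CFG approach. Supporting another relationship type would, under these assumptions, mean adding a token type and a rule rather than changing the scanning loop.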

Citations

Journal Article

Literature mining for the biologist: from information retrieval to biological discovery.

Lars Juhl Jensen, +2 more

- 01 Feb 2006 -

Nature Reviews Genetics

TL;DR: This work states that literature mining is also becoming useful for both hypothesis generation and biological discovery; however, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.

Journal Article

BioInfer: a corpus for information extraction in the biomedical domain

Sampo Pyysalo, +6 more

- 09 Feb 2007 -

BMC Bioinformatics

TL;DR: A corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers is introduced.

Journal Article

Corpus annotation for mining biomedical events from literature

Jin-Dong Kim, +3 more

- 08 Jan 2008 -

BMC Bioinformatics

TL;DR: A new type of semantic annotation, event annotation, is completed, which is an addition to the existing annotations in the GENIA corpus, and is expected to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

Journal Article

Discovering patterns to extract protein-protein interactions from full texts

Minlie Huang, +5 more

- 12 Dec 2004 -

Bioinformatics

TL;DR: A robust and powerful methodology to mine protein-protein interactions from biomedical texts by using a dynamic programming algorithm to compute distinguishing patterns by aligning relevant sentences and key verbs that describe protein interactions.

Journal Article

VirusMINT: a viral protein interaction database

Andrew Chatr-aryamontri, +11 more

- 01 Jan 2009 -

Nucleic Acids Research

TL;DR: The VirusMINT database aims at collecting all protein interactions between viral and human proteins reported in the literature, and currently stores over 5000 interactions involving more than 490 unique viral proteins from more than 110 different viral strains.

References

Journal Article

Initial sequencing and analysis of the human genome.

Eric S. Lander, +248 more

- 15 Feb 2001 -

Nature

TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.

Journal Article

The sequence of the human genome.

J. Craig Venter, +272 more

- 16 Feb 2001 -

Science

TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.

Book

Compilers: Principles, Techniques, and Tools

Alfred V. Aho, +2 more

TL;DR: This book discusses the design of a Code Generator, the role of the Lexical Analyzer, and other topics related to code generation and optimization.

Journal Article

The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999.

Amos Marc Bairoch, +1 more

- 01 Jan 1998 -

Nucleic Acids Research

TL;DR: The Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT, is described.

Journal Article

Three models for the description of language

Noam Chomsky

- 01 Sep 1956 -

IEEE Transactions on Information Theory

TL;DR: It is found that no finite-state Markov process that produces symbols with transition from state to state can serve as an English grammar, and the particular subclass of such processes that produce n-order statistical approximations to English do not come closer, with increasing n, to matching the output of an English grammar.