Extraction of protein interaction information from unstructured text using a context-free grammar.
TLDR
This work describes a system for extracting protein, gene and small molecule (PGSM) interactions from unstructured text using a lexical analyzer and a context-free grammar, and demonstrates that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision.
Abstract:
Motivation: As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating these sources largely requires a manual process of interpreting and extracting interaction relationships from the biological research literature. Being able to efficiently and accurately automate the extraction of interactions from unstructured text would greatly improve the content of these databases and provide a method for managing the continued growth of new literature being published.
Results: In this paper, we describe a system for extracting PGSM interactions from unstructured text. By utilizing a lexical analyzer and context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexities of natural language processing by focusing on domain-specific structure as opposed to analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules for extracting other types of biological relationships beyond PGSM relationships.
Availability: The program and corpus are available by request from the authors.
Contact: gilder@research.ge.com; jtemkin1@comcast.net
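The pairing of a lexical analyzer with a context-free grammar described in the Results can be illustrated with a small sketch. This is not the authors' grammar or program: the entity lexicon, the verb list, the two grammar rules and all function names below are hypothetical and chosen only to show the general idea. The lexer assigns each word a token kind, and a recursive-descent parser then matches the rules `interaction -> entity_list VERB entity_list` and `entity_list -> ENTITY (AND ENTITY)*`.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    kind: str   # ENTITY, VERB, AND, or OTHER
    text: str


# Hypothetical lexicon; a real system would use a much larger name dictionary.
ENTITIES = {"RAF1", "MEK1", "ERK2", "p53", "MDM2"}
VERBS = {"binds", "activates", "inhibits", "phosphorylates"}

Triple = Tuple[str, str, str]


def tokenize(sentence: str) -> List[Token]:
    """Lexical analysis: classify each word into a token kind."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9\-]+", sentence):
        if word in ENTITIES:
            tokens.append(Token("ENTITY", word))
        elif word.lower() in VERBS:
            tokens.append(Token("VERB", word.lower()))
        elif word.lower() == "and":
            tokens.append(Token("AND", word))
        else:
            tokens.append(Token("OTHER", word))
    return tokens


def parse_entity_list(tokens: List[Token], pos: int) -> Optional[Tuple[List[str], int]]:
    """entity_list -> ENTITY (AND ENTITY)*"""
    if pos >= len(tokens) or tokens[pos].kind != "ENTITY":
        return None
    names = [tokens[pos].text]
    pos += 1
    while (pos + 1 < len(tokens)
           and tokens[pos].kind == "AND"
           and tokens[pos + 1].kind == "ENTITY"):
        names.append(tokens[pos + 1].text)
        pos += 2
    return names, pos


def parse_interaction(tokens: List[Token], pos: int) -> Optional[Tuple[List[Triple], int]]:
    """interaction -> entity_list VERB entity_list"""
    left = parse_entity_list(tokens, pos)
    if left is None:
        return None
    agents, pos = left
    if pos >= len(tokens) or tokens[pos].kind != "VERB":
        return None
    verb = tokens[pos].text
    right = parse_entity_list(tokens, pos + 1)
    if right is None:
        return None
    targets, pos = right
    # One (agent, verb, target) triple per pair of entities.
    return [(a, verb, t) for a in agents for t in targets], pos


def extract_interactions(sentence: str) -> List[Triple]:
    """Try the grammar at every token position and collect the matches."""
    tokens = tokenize(sentence)
    results: List[Triple] = []
    pos = 0
    while pos < len(tokens):
        parsed = parse_interaction(tokens, pos)
        if parsed is None:
            pos += 1          # no match here; move on one token
        else:
            triples, pos = parsed
            results.extend(triples)
    return results


if __name__ == "__main__":
    print(extract_interactions("We show that MDM2 binds p53 in vitro."))
    print(extract_interactions("RAF1 activates MEK1 and ERK2."))
```

Because the rules describe only the domain-specific shape of an interaction statement, surrounding words are tokenized as `OTHER` and never need to be analyzed, and a new relationship type could be supported in this sketch by adding a rule rather than changing the lexer.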