GENETAG: a tagged corpus for gene/protein named entity recognition.
Abstract:
Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE® sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition. To ensure heterogeneity of the corpus, MEDLINE sentences were first scored for term similarity to documents with known gene names, and 10K high- and 10K low-scoring sentences were chosen at random. The original 20K sentences were run through a gene/protein name tagger, and the results were modified manually to reflect a wide definition of gene/protein names subject to a specificity constraint, a rule that required the tagged entities to refer to specific entities. Each sentence in GENETAG was annotated with acceptable alternatives to the gene/protein names it contained, allowing for partial matching with semantic constraints. Semantic constraints are rules requiring the tagged entity to contain its true meaning in the sentence context. Application of these constraints results in a more meaningful measure of the performance of an NER system than unrestricted partial matching. The annotation of GENETAG required intricate manual judgments by annotators, which hindered tagging consistency. The data were pre-segmented into words to provide indices supporting comparison of system responses to the "gold standard". However, character-based indices would have been more robust than word-based indices. GENETAG Train, Test and Round1 data and ancillary programs are freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz. A newer version of GENETAG-05 will be released later this year.
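The scoring idea described in the abstract, in which a system response counts as correct if it matches either the gold-standard name or one of its annotated acceptable alternatives, can be illustrated with a minimal sketch. This is not the evaluation program distributed with GENETAG. The data layout, the function name `evaluate`, and the example sentence are assumptions made for illustration, and the sketch uses character offsets (the more robust choice noted above) instead of the word-based indices of the released data.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: a span is (start, end) in character offsets,
# end exclusive. Each gold mention is the set of spans accepted for it:
# the primary annotation plus its acceptable alternatives.
Span = Tuple[int, int]
Mention = FrozenSet[Span]


def evaluate(gold: Dict[str, List[Mention]],
             predicted: Dict[str, List[Span]]) -> Dict[str, float]:
    """Score predictions against gold mentions that carry alternatives."""
    tp = fp = fn = 0
    for sent_id, mentions in gold.items():
        unmatched = list(mentions)
        for span in predicted.get(sent_id, []):
            # A prediction is correct only if it equals an accepted span of
            # a gold mention that has not been matched yet.
            hit = next((m for m in unmatched if span in m), None)
            if hit is None:
                fp += 1
            else:
                tp += 1
                unmatched.remove(hit)
        fn += len(unmatched)
    # Predictions in sentences with no gold mentions are false positives.
    for sent_id, spans in predicted.items():
        if sent_id not in gold:
            fp += len(spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented example sentence (not taken from the corpus).
text = "Mutations in the p53 gene are common."
gold = {"s1": [frozenset({(17, 25), (17, 20)})]}  # "p53 gene" or "p53"

# "p53" is an accepted alternative, so it counts as a true positive.
accepted = evaluate(gold, {"s1": [(17, 20)]})
# "gene" overlaps the gold name but is not an accepted alternative, so it is
# both a false positive and leaves the gold mention unmatched.
rejected = evaluate(gold, {"s1": [(21, 25)]})
print(accepted)
print(rejected)
```

Under unrestricted partial matching the response "gene" would receive credit because it overlaps "p53 gene"; restricting credit to the listed alternatives is what keeps the measure tied to the name's meaning in context.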