Author

Yanpeng Li

Bio: Yanpeng Li is an academic researcher from Dalian University of Technology. The author has contributed to research in topics: Feature (machine learning) & Biomedical text mining. The author has an h-index of 16 and has co-authored 46 publications receiving 2,794 citations. Previous affiliations of Yanpeng Li include University of Massachusetts Medical School & Drexel University.

Papers
Journal ArticleDOI
TL;DR: Improvements and expansions to several branches of the Gene Ontology, as well as updates that have allowed us to more efficiently disseminate the GO and capture feedback from the research community are described.
Abstract: The Gene Ontology (GO; http://www.geneontology.org) is a community-based bioinformatics resource that supplies information about gene product function using ontologies to represent biological knowledge. Here we describe improvements and expansions to several branches of the ontology, as well as updates that have allowed us to more efficiently disseminate the GO and capture feedback from the research community. The Gene Ontology Consortium (GOC) has expanded areas of the ontology such as cilia-related terms, cell-cycle terms and multicellular organism processes. We have also implemented new tools for generating ontology terms based on a set of logical rules making use of templates, and we have made efforts to increase our use of logical definitions. The GOC has a new and improved web site summarizing new developments and documentation, serving as a portal to GO data. Users can perform GO enrichment analysis, and search the GO for terms, annotations to gene products, and associated metadata across multiple species using the all-new AmiGO 2 browser. We encourage and welcome the input of the research community in all biological areas in our continued effort to improve the Gene Ontology.

2,529 citations

Journal ArticleDOI
01 Jan 2014-Database
TL;DR: The state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation.
Abstract: Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

61 citations

Journal ArticleDOI
TL;DR: Experimental evaluations with other state-of-the-art PPI extraction systems tested on the DIP corpus indicate that BioPPISVMExtractor can substantially improve recall at the cost of a moderate decline in precision.

50 citations

Journal ArticleDOI
TL;DR: A general framework for gene named entity representation, called feature coupling generalization (FCG), is proposed; it generates higher-level features from the term frequency and co-occurrence information of highly indicative features in a large amount of unlabeled data, and its performance is examined in a named entity classification task.
Abstract: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher level features using term frequency and co-occurrence information of highly indicative features in huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries in a large dictionary derived from online resources. The results show that new features generated by FCG outperform lexical features by 5.97 F-score and 10.85 for OOV terms. Also in this framework each extension yields significant improvements and the sparse lexical features can be transformed into both a lower dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on BioCreative 2 GM test set. Then we combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner .

50 citations

Journal ArticleDOI
13 Jun 2013-PLOS ONE
TL;DR: This paper presents a Stacked Generalization-based approach for automatic DDI extraction that combines the feature-based, graph and tree kernels and reduces the risk of missing important features and achieves a better performance than each individual kernel.
Abstract: Drug-drug interaction (DDI) detection is particularly important for patient safety. However, the amount of biomedical literature regarding drug interactions is increasing rapidly. Therefore, there is a need to develop an effective approach for the automatic extraction of DDI information from the biomedical literature. In this paper, we present a Stacked Generalization-based approach for automatic DDI extraction. The approach combines the feature-based, graph and tree kernels and, therefore, reduces the risk of missing important features. In addition, it introduces some domain knowledge based features (the keyword, semantic type, and DrugBank features) into the feature-based kernel, which contribute to the performance improvement. More specifically, the approach applies Stacked generalization to automatically learn the weights from the training data and assign them to three individual kernels to achieve a much better performance than each individual kernel. The experimental results show that our approach can achieve a better performance of 69.24% in F-score compared with other systems in the DDI Extraction 2011 challenge task.

46 citations


Cited by
Journal ArticleDOI
TL;DR: The Perseus software platform was developed to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data and it is anticipated that Perseus's arsenal of algorithms and its intuitive usability will empower interdisciplinary analysis of complex large data sets.
Abstract: A main bottleneck in proteomics is the downstream biological analysis of highly multivariate quantitative protein abundance data generated using mass-spectrometry-based analysis. We developed the Perseus software platform (http://www.perseus-framework.org) to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. Perseus contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time-series analysis, cross-omics comparisons and multiple-hypothesis testing. A machine learning module supports the classification and validation of patient groups for diagnosis and prognosis, and it also detects predictive protein signatures. Central to Perseus is a user-friendly, interactive workflow environment that provides complete documentation of computational methods used in a publication. All activities in Perseus are realized as plugins, and users can extend the software by programming their own, which can be shared through a plugin store. We anticipate that Perseus's arsenal of algorithms and its intuitive usability will empower interdisciplinary analysis of complex large data sets.

5,165 citations

Journal ArticleDOI
TL;DR: The Reactome Knowledgebase provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations—an extended version of a classic metabolic map, in a single consistent data model.
Abstract: The Reactome Knowledgebase (www.reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations, an extended version of a classic metabolic map, in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression pattern surveys or somatic mutation catalogues from tumour cells. Over the last two years we redeveloped major components of the Reactome web interface to improve usability, responsiveness and data visualization. A new pathway diagram viewer provides a faster, clearer interface and smooth zooming from the entire reaction network to the details of individual reactions. Tool performance for analysis of user datasets has been substantially improved, now generating detailed results for genome-wide expression datasets within seconds. The analysis module can now be accessed through a RESTful interface, facilitating its inclusion in third party applications. A new overview module allows the visualization of analysis results on a genome-wide Reactome pathway hierarchy using a single screen page. The search interface now provides auto-completion as well as a faceted search to narrow result lists efficiently.

5,065 citations

01 Aug 2001
TL;DR: This work concentrates on the study of distributed systems that bring to life the vision of ubiquitous computing systems, also known as ambient intelligence.
Abstract: With digital equipment becoming increasingly networked, either on wired or wireless networks, for personal and professional use alike, distributed software systems have become a crucial element in information and communications technologies. The study of these systems forms the core of the ARLES' work, which is specifically concerned with defining new system software architectures, based on the use of emerging networking technologies. In this context, we concentrate on the study of distributed systems which bring to life the vision of ubiquitous computing systems, also known as ambient intelligence.

2,774 citations

Journal ArticleDOI
TL;DR: A historical archive covering the past 15 years of GO data, with a consistent format and file structure for both the ontology and annotations, is made available to support traceability and reproducibility.
Abstract: The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings, and, to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.

1,988 citations