
Showing papers by "Yi-Ping Phoebe Chen" published in 2011


Journal ArticleDOI
TL;DR: Reviews in silico strategies and modules applied at the level of hit identification, and considers the challenges, with possible solutions, in enhancing the success rate of the 'hit-to-lead' phase, which could eventually help the progress of SBDD in the drug discovery arena.

217 citations


Journal ArticleDOI
TL;DR: Experimental results illustrate that Apriori is the most useful association rule-mining algorithm to be used in the discovery of prevention factors.
Abstract: Cancer is increasing the total number of unexpected deaths around the world. Until now, cancer research could not significantly contribute to a proper solution for the cancer patient, and as a result, the high death rate is uncontrolled. The present research aims to extract the significant prevention factors for particular types of cancer. To find the prevention factors, we first constructed a prevention factor data set from an extensive literature review on bladder, breast, cervical, lung, prostate and skin cancer. We subsequently employed three association rule mining algorithms (Apriori, Predictive Apriori and Tertius) in order to discover most of the significant prevention factors against these specific types of cancer. Experimental results illustrate that Apriori is the most useful association rule-mining algorithm to be used in the discovery of prevention factors.

52 citations
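
For readers unfamiliar with the Apriori algorithm named in the entry above, the following is a minimal sketch of frequent-itemset mining and rule confidence. It illustrates the general technique only; it is not the authors' implementation or data set, and the function names `apriori` and `confidence` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: build (k+1)-itemsets whose k-subsets are all frequent.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from mined supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

Calling `apriori` on a list of item sets returns the frequent itemsets with their supports, from which `confidence` scores candidate rules.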


Journal ArticleDOI
TL;DR: Presents the DTW-Significance (DTW-S) algorithm, an extended version of the original DTW algorithm that determines the significance of time shift estimates in time series alignments and can provide accurate and robust time shift estimates for each time point on a gene-by-gene basis.
Abstract: Comparing biological time series data across different conditions, or different specimens, is a common but still challenging task. Algorithms aligning two time series represent a valuable tool for such comparisons. While many powerful computation tools for time series alignment have been developed, they do not provide significance estimates for time shift measurements. Here, we present an extended version of the original DTW algorithm that allows us to determine the significance of time shift estimates in time series alignments, the DTW-Significance (DTW-S) algorithm. The DTW-S combines important properties of the original algorithm and other published time series alignment tools: DTW-S calculates the optimal alignment for each time point of each gene, it uses interpolated time points for time shift estimation, and it does not require alignment of the time-series end points. As a new feature, we implement a simulation procedure based on parameters estimated from real time series data, on a series-by-series basis, allowing us to determine the false positive rate (FPR) and the significance of the estimated time shift values. We assess the performance of our method using simulation data and real expression time series from two published primate brain expression datasets. Our results show that this method can provide accurate and robust time shift estimates for each time point on a gene-by-gene basis. Using these estimates, we are able to uncover novel features of the biological processes underlying human brain development and maturation. The DTW-S provides a convenient tool for calculating accurate and robust time shift estimates at each time point for each gene, based on time series data. The estimates can be used to uncover novel biological features of the system being studied. The DTW-S is freely available as an R package TimeShift at http://www.picb.ac.cn/Comparative/data.html .

27 citations
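
As background for the entry above, here is a minimal sketch of classic dynamic time warping (DTW), the algorithm that DTW-S extends. It does not implement the DTW-S additions described in the abstract (interpolated time points, open end points, simulation-based significance); the function name `dtw` and the absolute-difference cost are assumptions made for illustration.

```python
def dtw(x, y):
    """Classic DTW: return (total alignment cost, warping path of index pairs)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover one optimal path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path
```

For each pair `(i, j)` on the returned path, `j - i` is a crude per-point shift in index units; estimating its significance is the part that DTW-S contributes and that this sketch omits.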


Journal ArticleDOI
TL;DR: This paper focuses on re-sorting partially sorted data by taking advantage of the partially sorted nature of the data to speed up the run generation phase of the traditional external merge sort, and proposes a fast heuristic solution that can halve the external sorting time.
Abstract: The increasing popularity of flash memory means more database systems will run on flash memory in the future. One of the most important database operations is the external sort. Hence, this paper is focused on studying the problem of efficient external sorting on flash memory. In contrast to most previous work, we target the situation where previously sorted data have become progressively unsorted due to data updates. Accordingly, we call this ‘partially’ sorted data. We focus on re-sorting partially sorted data by taking advantage of the partially sorted nature of the data to speed up the run generation phase of the traditional external merge sort. We do this by finding ‘naturally occurring’ page runs in the partially sorted data. Our algorithm can perform up to a factor of 1024 less write IO compared with a traditional external merge sort during the run generation phase. We map the problem of finding naturally occurring runs into the shortest distance problem in a directed acyclic graph (DAG). Accordingly, we propose an optimal solution to the problem using the well-known DAG-Shortest-Paths algorithm. However, we found that the optimal solution was too slow for even moderate-sized data sets and accordingly propose a fast heuristic solution that, as we show experimentally, finds a high percentage of page runs using a minimum of computational overhead. Experiments using both real and synthetic data sets show that our heuristic algorithm can halve the external sorting time when compared with three likely competing external sorting algorithms.

14 citations
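
The idea of exploiting naturally occurring runs, mentioned in the entry above, can be illustrated in memory with a natural merge sort. This is a simplified record-level sketch, not the paper's page-level DAG formulation or its heuristic, and it ignores flash I/O entirely; `natural_runs` and `resort` are hypothetical names.

```python
import heapq


def natural_runs(keys):
    """Split keys into maximal non-decreasing runs, preserving order."""
    runs, current = [], []
    for k in keys:
        # A drop in key value ends the current run.
        if current and k < current[-1]:
            runs.append(current)
            current = []
        current.append(k)
    if current:
        runs.append(current)
    return runs


def resort(keys):
    """Re-sort by merging the existing runs instead of building new ones."""
    return list(heapq.merge(*natural_runs(keys)))
```

The more sorted the input already is, the fewer and longer the runs, so the merge has less work to do; that is the intuition behind skipping a full run generation phase.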


Journal ArticleDOI
TL;DR: The results provide further evidence that the pathogenic changes in the H9N2 subtype are due mainly to re‐assortment with other highly pathogenic avian influenza viruses.
Abstract: Avian influenza virus H9N2 has become the dominant subtype of influenza which is endemic in poultry. The hemagglutinin (HA), one of eight protein-coding genes, plays an important role during the early stage of infection. The adaptive evolution and the positively selected sites of the HA (the glycoprotein molecule) of H9N2 subtype viruses were investigated. Phylogenetic analysis of 68 hemagglutinin sequences from H9N2 avian influenza virus isolates in China indicated that these isolates had been distributed geographically since 1994 and were all derived from the Eurasian lineage. H9N2 avian influenza virus isolates from domestic poultry in China were distinct phylogenetically from those isolated in Hong Kong, including viruses which had infected humans. Seven amino acid substitutions (2T, 3T, 14T, 165D, 197A, 233Q, 380R) were identified in the HA, possibly due to positive selection pressure. Apart from the 380R site, the other positively selected sites detected were all located near the receptor-binding site of the HA1 strain. Based on epidemiological and phylogenetic analysis, the H9N2 epidemic in China was divided into three groups: the 1994-1997 group, the 1998-1999 group, and the 2000-2007 group. Investigating these three groups using the maximum likelihood estimation method showed more positively selected sites in the 1994-1997 and 1998-1999 epidemic groups than in the 2000-2007 group. This indicates that the detected selected sites changed across epidemic periods and that the evolution of H9N2 is currently slow. The antigenic determinant or other key functional amino acid sites should be of concern because their adjacent sites have been under positive selection pressure. The results provide further evidence that the pathogenic changes in the H9N2 subtype are due mainly to re-assortment with other highly pathogenic avian influenza viruses.

11 citations


Journal ArticleDOI
TL;DR: A detailed comparative study of methods for computing Co-Sets from four aspects (sensitivity, completeness and soundness, flexibility, and scalability), applied to the Escherichia coli core metabolic network.
Abstract: Correlated reaction sets (Co-Sets) are mathematically defined modules in biochemical reaction networks which facilitate the study of biological processes by decomposing complex reaction networks into conceptually simple units. According to the degree of association, Co-Sets can be classified into three types: perfect, partial and directional. Five approaches have been developed to calculate Co-Sets, including network-based pathway analysis, Monte Carlo sampling, linear optimization, enzyme subsets and hard-coupled reaction sets. However, differences in the design and implementation of these methods lead to discrepancies in the resulting Co-Sets, as well as in their use in biotechnology, which need careful interpretation. In this paper, we provide a detailed comparative study of the methods for computing Co-Sets from four aspects: (i) sensitivity, (ii) completeness and soundness, (iii) flexibility and (iv) scalability. By applying them to the Escherichia coli core metabolic network, the differences and relationships among these methods are clearly articulated, which may be useful for potential users.

10 citations


Journal ArticleDOI
TL;DR: A novel interval-based distance (Hausdorff) measure for computing the similarity between characterized structures is developed and demonstrated by analyzing a data set of RNA secondary structures from the Rfam database.

8 citations


Journal ArticleDOI
TL;DR: This paper proposes a BN-based framework to discover the dependency correlations of kinase regulation by applying the Markov Chain Monte Carlo method to generate a sequence of samples from a probability distribution, by which to approximate the distribution.
Abstract: Abnormal kinase activity is a frequent cause of diseases, which makes kinases a promising pharmacological target. Thus, it is critical to identify the characteristics of protein kinase regulation by studying the activation and inhibition of kinase subunits in response to varied stimuli. The Bayesian network (BN) is a formalism for probabilistic reasoning that has been widely used for learning dependency models. However, for high-dimensional discrete random vectors the set of plausible models becomes large and a full comparison of all the posterior probabilities related to the competing models becomes infeasible. A solution to this problem is based on the Markov Chain Monte Carlo (MCMC) method. This paper proposes a BN-based framework to discover the dependency correlations of kinase regulation. Our approach is to apply the MCMC method to generate a sequence of samples from a probability distribution, by which to approximate the distribution. The frequent connections (edges) are identified from the obtained sampling graphical models. Our results point to a number of novel candidate regulation patterns that are interesting in biology and include inferred associations that were unknown.

5 citations
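
The sampling step described in the entry above can be sketched with a toy Metropolis sampler over a handful of candidate graph structures, followed by counting how often each edge appears. Everything here is an illustrative assumption: the candidate structures, their unnormalised scores, the uniform proposal, and the names `sample_structures` and `edge_frequencies`. The paper's scoring function and proposal moves are not given in the abstract.

```python
import random
from collections import Counter


def sample_structures(weights, n_samples, seed=0):
    """Metropolis sampling over structures; weights maps structure -> unnormalised score."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for _ in range(n_samples):
        # Symmetric proposal: pick any candidate structure uniformly.
        proposal = rng.choice(states)
        # Accept with probability min(1, score ratio).
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        samples.append(current)
    return samples


def edge_frequencies(samples):
    """Fraction of sampled structures that contain each edge."""
    counts = Counter(edge for graph in samples for edge in graph)
    return {edge: c / len(samples) for edge, c in counts.items()}
```

Edges whose frequency exceeds a chosen threshold would be reported as frequent connections. A realistic sampler would propose local edge additions, deletions and reversals and discard a burn-in period, both omitted here.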


Journal ArticleDOI
TL;DR: This article proposes a method called ‘mode pattern + mutual information’ to rank the inter-relationship between clusters, where the mode pattern is used to find outstanding objects from each cluster, and the mutual information criterion measures the close proximity of a pair of clusters.
Abstract: The evaluation of the relationships between clusters is important to identify vital unknown information in many real-life applications, such as in the fields of crime detection, evolution trees, metallurgical industry and biology engraftment. This article proposes a method called 'mode pattern + mutual information' to rank the inter-relationship between clusters. The idea of the mode pattern is used to find outstanding objects from each cluster, and the mutual information criterion measures the close proximity of a pair of clusters. Our approach is different from the conventional algorithms of classifying and clustering, because our focus is not to classify objects into different clusters, but instead, we aim to rank the inter-relationship between clusters when the clusters are given. We conducted experiments on a wide range of real-life datasets, including image data and cancer diagnosis data. The experimental results show that our algorithm is effective and promising.

4 citations
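
The mutual information criterion mentioned in the entry above can be computed for two discrete variables as follows. This is a generic sketch of the standard definition, in bits; how the paper derives the variables from mode patterns is not specified in the abstract, and the function name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in pxy.items():
        joint = c / n
        # Each term compares the joint probability with the product of marginals.
        total += joint * log2(joint / ((px[x] / n) * (py[y] / n)))
    return total
```

A value near zero indicates that the two variables are close to independent, while larger values indicate stronger association, which is what makes the measure usable for ranking pairs.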


Journal ArticleDOI
TL;DR: A novel interval-based distance metric for structure-based RNA function assignment is introduced; it not only offers sequence distance criteria to measure the similarity of secondary structures but also aids the functional classification of RNA structures with pseudoknots.
Abstract: A large amount of raw biological sequence data has been generated by the human genome project and related efforts. Understanding the structural information encoded by biological sequences is important for acquiring knowledge of their biochemical functions, but remains a fundamental challenge. Recent interest in RNA regulation has resulted in a rapid growth of deposited RNA secondary structures in varied databases. However, the functional classification and characterization of RNA structures have only been partially addressed. This article aims to introduce a novel interval-based distance metric for structure-based RNA function assignment. The characterization of RNA structures relies on distance vectors learned from a collection of predicted structures. The distance measure considers intersection, disjointness, and inclusion between intervals. A set of RNA pseudoknotted structures with known function is used, and the function of the query structure is determined by measuring structure similarity. This not only offers sequence distance criteria to measure the similarity of secondary structures but also aids the functional classification of RNA structures with pseudoknots.

2 citations
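
One way to picture an interval-based Hausdorff distance, as named in the two RNA structure entries above, is to treat each structure as a set of `(start, end)` intervals. The sketch below assumes that representation and uses the Hausdorff distance between two closed intervals, max(|a - c|, |b - d|), as the element-level distance. Both choices are assumptions for illustration; the papers' actual measure, which distinguishes intersecting, disjoint and included intervals, is not given in the abstracts.

```python
def interval_distance(p, q):
    """Hausdorff distance between closed intervals p=(a, b) and q=(c, d)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def directed_hausdorff(A, B):
    """Largest distance from an interval in A to its nearest interval in B."""
    return max(min(interval_distance(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty sets of intervals."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Under these assumptions a query structure could be assigned the function of the known structure at the smallest distance.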


Journal ArticleDOI
TL;DR: A wide range of topics in Bioinformatics were covered at the conference and were categorised as: • Phylogenetics and Evolution • Molecular binding and modelling • Genome Sequencing and Assembly • Network and system biology • Structural Bioinformatics • Pathway analysis • Proteomics
Abstract: The Asia Pacific Bioinformatics Conference (APBC) is a leading conference in the Bioinformatics community and has grown rapidly since its inception in 2003. The goal of the annual conference series is to enable high quality interaction on bioinformatics research. The past APBC conferences were held in:
1. APBC2003, 4-7 Feb 2003: Adelaide, Australia
2. APBC2004, 18-22 Jan 2004: Dunedin, New Zealand
3. APBC2005, 17-21 Jan 2005: Singapore
4. APBC2006, 13-16 Feb 2006: Taipei, Taiwan
5. APBC2007, 15-17 Jan 2007: Hong Kong
6. APBC2008, 14-17 Jan 2008: Kyoto, Japan
7. APBC2009, 13-16 Jan 2009: Beijing, China
8. APBC2010, 18-21 Jan 2010: Bangalore, India
The Ninth Asia Pacific Bioinformatics Conference (APBC2011) was held in Incheon, South Korea, the first time in this dynamic country. The conference, held from 11 to 14 January, brought together more than 300 researchers, professionals, industry leaders and students from all over the globe. The participants came from institutions in the following 19 countries and regions (in alphabetical order): Australia, Bangladesh, Belgium, Brazil, Canada, China, Germany, India, Italy, Poland, Portugal, Saudi Arabia, Singapore, South Korea, Switzerland, Taiwan, Thailand, UK and USA. The conference program included 6 keynote speakers (Drs. Steven Jones, Luonan Chen, Peer Bork, Sang Yup Lee, Hong Gil Nam and Kenta Nakai), 55 selected talks, 8 tutorials and more than 118 posters.
The titles of the keynote talks are:
• Steven Jones, “Bioinformatics and Cancer Genomics”
• Luonan Chen, “Modeling and Analyzing Nonlinear Biomolecular Networks”
• Peer Bork, “Systemic analysis of the human gut: connecting chemicals, proteins, cells, communities and phenotypes”
• Sang Yup Lee, “Systems metabolic engineering”
• Hong Gil Nam, “Understanding and controlling plant growth and development: Genetic, Systems, and chemical genomic approaches”
• Kenta Nakai, “Information for Transcriptional Regulation”
The tutorial topics for APBC2011 included WebLab: a web-based bioinformatics platform (Jingchu Luo), Promenade through the web programming for biological research using Perl language (Kyung-Hoon Kwon), Bioinformatics analysis of genome & exome by next generation sequencing (Namshim Kim), Probabilistic Models for Multiple Motif Discovery (Jong Kyoung), Molecular Modeling: Analysis of Protein Structure and Function (Jinhyuk Lee), OASIS: Traditional Korean Medicine Information Portal (Sang-Jun Yea), Information Retrieval and Text Mining Opportunities in Bioinformatics (Jeyakumar Natarajan) and Bioworks: Bioinformation Analysis Pipeline (Seungyoon Nam and Youngmahn Hahn). The emphasis of APBC has been algorithmic development and innovation in Bioinformatics, and this year that theme continued. Reflecting the ever-changing nature of Bioinformatics and its adaptation to advances in technology, a session was held on next-generation sequencing for the first time.
A wide range of topics in Bioinformatics were covered at the conference and were categorised as:
• Phylogenetics and Evolution
• Molecular binding and modelling
• Genome Sequencing and Assembly
• Network and system biology
• Structural Bioinformatics
• Pathway analysis
• Proteomics
* Correspondence: Phoebe.Chen@latrobe.edu.au, La Trobe University, Melbourne, Australia
Chen and Cho, BMC Bioinformatics 2011, 12(Suppl 1):I1, http://www.biomedcentral.com/1471-2105/12/S1/I1

Proceedings ArticleDOI
26 Oct 2011
TL;DR: A scheduling method is proposed to reduce the waiting time for selective contents broadcasting when clients fast-forward, by separating the parts played at double speed from the parts played at normal speed and producing an effective schedule.
Abstract: Due to the recent popularization of digital broadcasting systems, selective contents broadcasting, which depends on viewers' preferences, has attracted attention. For example, in a quiz program, a user selects his answer and watches the video content for the answer. When the server delivers programs reflecting users' preferences, clients have to wait until their selected contents start playing. Therefore, we previously proposed a scheduling method to reduce the waiting time. However, we did not consider the case in which clients play contents with fast-forwarding. In this paper, we propose a method to reduce the waiting time for selective contents broadcasting considering fast-forwarding. In our proposed method, by separating the parts for double playing speed from the parts for normal playing speed and producing an effective schedule, we effectively reduce the waiting time.

Proceedings ArticleDOI
05 Dec 2011
TL;DR: This paper proposes a scheduling method to reduce the waiting time for close-range broadcasting by dividing two types of data and producing an effective broadcast schedule considering the available bandwidth.
Abstract: Due to the recent popularization of digital webcast systems, close-range broadcasting using continuous media data, i.e. audio and video, has attracted great attention. For example, in a movie program, after a user watches interesting content such as a highlight scene, he/she will watch the main program continuously. In close-range broadcasting, the necessary bandwidth for continuously playing the two types of data increases. Conventional methods reduce the necessary bandwidth by producing an effective broadcast schedule for continuous media data. However, these methods do not consider the broadcast schedule for two types of continuous media data. When two types of continuous media data are scheduled, the waiting time between finishing the highlight scene and starting the main scene may increase. In this paper, we propose a scheduling method to reduce the waiting time for close-range broadcasting. In our proposed method, by dividing two types of data and producing an effective broadcast schedule considering the available bandwidth, we can effectively reduce the waiting time.

Proceedings ArticleDOI
12 Nov 2011
TL;DR: The proposed method provides a powerful way to recognize real pre-miRNAs with multiple stem-loops, achieving a sensitivity of 75.76% and a specificity of 95.16% on the human test set.
Abstract: Pre-miRNAs with multiple loops are usually excluded by most existing prediction methods. However, as more and more miRNAs have been identified, a number of miRNA precursors with multiple loops have been found. Therefore, determining how to effectively distinguish real pre-miRNAs with multiple loops from the large number of pseudo pre-miRNAs with multiple loops is an imperative problem. Some features of the main branch are extracted to describe intrinsic pre-miRNA features, and an SVM classifier is implemented to recognize real pre-miRNAs with multiple stem-loops. Trained and tested on a dataset from miRBase12.0, the SVM classifier achieves a sensitivity of 75.76% and a specificity of 95.16% on the human test set, and when applied to pre-miRNAs of all other species, it correctly identifies 86.71% of them. The proposed method provides a powerful way to recognize real pre-miRNAs with multiple stem-loops.
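
The two metrics reported in the entry above follow the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch of computing them from binary labels is given below; the function name and the label encoding (1 for real pre-miRNA, 0 for pseudo) are assumptions, and no data from the paper is reproduced.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # real, predicted real
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # real, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # pseudo, rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # pseudo, accepted
    return tp / (tp + fn), tn / (tn + fp)
```

Reporting both values matters here because a classifier can reach a high specificity while still missing a sizeable share of real precursors.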