
Showing papers in "Journal of Bioinformatics and Computational Biology in 2013"


Journal ArticleDOI
TL;DR: A comprehensive survey of computational methods for protein complex detection can be found in this article, where the authors review, classify and evaluate some of the key computational methods developed to date for the identification of protein complexes from PPI networks.
Abstract: Complexes of physically interacting proteins are one of the fundamental functional units responsible for driving key biological mechanisms within the cell. Their identification is therefore necessary to understand not only complex formation but also the higher-level organization of the cell. With the advent of "high-throughput" techniques in molecular biology, a significant amount of physical interaction data has been cataloged from organisms such as yeast, which has in turn fueled computational approaches to systematically mine complexes from the network of physical interactions among proteins (PPI network). In this survey, we review, classify and evaluate some of the key computational methods developed to date for the identification of protein complexes from PPI networks. We present two insightful taxonomies that reflect how these methods have evolved over the years toward improving automated complex prediction. We also discuss some open challenges facing accurate reconstruction of complexes, the crucial ones being the high proportion of errors and noise in current high-throughput datasets and some key aspects overlooked by current complex detection methods. We hope this review will not only help to condense the history of computational complex detection for easy reference but also provide valuable insights to drive further research in this area.

93 citations
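
For a flavor of one family of methods such surveys cover, the sketch below runs clique-percolation clustering on an invented toy PPI graph using networkx; it is an illustrative baseline, not any specific algorithm evaluated in the paper.

```python
# Minimal sketch: overlapping communities of adjacent k-cliques serve as
# candidate protein complexes. Toy graph; real input would be a PPI network.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

ppi = nx.Graph()
ppi.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"),                                   # bridge between modules
    ("E", "F"), ("E", "G"), ("F", "G"), ("F", "H"), ("G", "H"),
])

for candidate in k_clique_communities(ppi, 3):    # k = 3 (triangles)
    print(sorted(candidate))                      # two dense modules
```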


Journal ArticleDOI
TL;DR: This paper produces the first complete RNA inverse folding approach that allows for the specification of a wide range of design constraints, and introduces a Large Neighborhood Search approach that allows larger instances to be tackled at the cost of losing completeness, while retaining the advantage of meeting design constraints.
Abstract: Synthetic biology is a rapidly emerging discipline with long-term ramifications that range from single-molecule detection within cells to the creation of synthetic genomes and novel life forms. Truly phenomenal results have been obtained by pioneering groups--for instance, the combinatorial synthesis of genetic networks, genome synthesis using BioBricks, hybridization chain reaction (HCR), in which stable DNA monomers assemble only upon exposure to a target DNA fragment, and biomolecular self-assembly pathways. Such work strongly suggests that nanotechnology and synthetic biology together are poised to constitute the most transformative development of the 21st century. In this paper, we present a Constraint Programming (CP) approach to solve the RNA inverse folding problem. Given a target RNA secondary structure, we determine an RNA sequence which folds into the target structure; i.e. whose minimum free energy structure is the target structure. Our approach represents a step forward in RNA design--we produce the first complete RNA inverse folding approach which allows for the specification of a wide range of design constraints. We also introduce a Large Neighborhood Search approach which allows us to tackle larger instances at the cost of losing completeness, while retaining the advantages of meeting design constraints (motif, GC-content, etc.). Results demonstrate that our software, RNAiFold, performs as well as or better than all state-of-the-art approaches; nevertheless, our approach is unique in terms of completeness, flexibility, and the support of various design constraints. The algorithms presented in this paper are publicly available via the interactive webserver http://bioinformatics.bc.edu/clotelab/RNAiFold; additionally, the source code can be downloaded from that site.

84 citations
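
To make the problem statement concrete, here is the classic adaptive-walk baseline for inverse folding (what RNAinverse-style tools do), not RNAiFold's constraint-programming method; it assumes the ViennaRNA Python bindings (import RNA) are available.

```python
# Mutate a random sequence, keeping changes that bring its MFE structure
# closer to the target; a toy target keeps the walk short.
import random
import RNA  # ViennaRNA Python bindings (assumed installed)

TARGET = "(((....)))"
BASES = "ACGU"

def distance(seq):
    predicted, _mfe = RNA.fold(seq)      # MFE structure in dot-bracket form
    return sum(a != b for a, b in zip(predicted, TARGET))

seq = "".join(random.choice(BASES) for _ in range(len(TARGET)))
d = distance(seq)
for _ in range(10000):                   # cap the walk; this target is easy
    if d == 0:
        break
    i = random.randrange(len(seq))
    mutant = seq[:i] + random.choice(BASES) + seq[i + 1:]
    dm = distance(mutant)
    if dm <= d:                          # accept non-worsening mutations
        seq, d = mutant, dm
print(seq, d)
```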


Journal ArticleDOI
TL;DR: diChIPMunk constructs TFBS models as optimal dinucleotide PWMs, thus accounting for correlations between neighboring nucleotides in input sequences; the authors demonstrate that the diPWMs constructed by diChIPMunk outperform traditional PWMs trained on the same ChIP-Seq data.
Abstract: Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) became a method of choice to locate DNA segments bound by different regulatory proteins. ChIP-Seq produces extremely valuable information to study transcriptional regulation. The wet-lab workflow is often supported by downstream computational analysis including construction of models of nucleotide sequences of transcription factor binding sites in DNA, which can be used to detect binding sites in ChIP-Seq data at single base pair resolution. The most popular TFBS model is the positional weight matrix (PWM) with statistically independent positional weights of nucleotides in different columns; such PWMs are constructed from a gapless multiple local alignment of sequences containing experimentally identified TFBSs. Modern high-throughput techniques, including ChIP-Seq, provide enough data for careful training of advanced models containing more parameters than PWMs. Yet, many suggested multiparametric models often provide only incremental improvement of TFBS recognition quality compared to traditional PWMs trained on ChIP-Seq data. We present a novel computational tool, diChIPMunk, that constructs TFBS models as optimal dinucleotide PWMs, thus accounting for correlations between neighboring nucleotides in input sequences. diChIPMunk utilizes many advantages of ChIPMunk, its ancestor algorithm, accounting for ChIP-Seq base coverage profiles ("peak shape") and using an effective subsampling-based core procedure which allows processing of large datasets. We demonstrate that diPWMs constructed by diChIPMunk outperform traditional PWMs constructed by ChIPMunk from the same ChIP-Seq data. Software website: http://autosome.ru/dichipmunk/

66 citations
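
The scoring idea behind a dinucleotide PWM is easy to state in code: each column holds weights for the 16 dinucleotides formed by adjacent positions, so neighbor correlations are captured. The weights below are random stand-ins, not a trained diChIPMunk model.

```python
import numpy as np

DINUCS = [a + b for a in "ACGT" for b in "ACGT"]
IDX = {d: i for i, d in enumerate(DINUCS)}

# A site of length 4 has 3 overlapping dinucleotides -> a 3 x 16 matrix.
rng = np.random.default_rng(0)
dipwm = rng.normal(size=(3, 16))         # stand-in for trained weights

def dipwm_score(site):
    # Sum the weight of each adjacent dinucleotide along the site.
    return sum(dipwm[j, IDX[site[j:j + 2]]] for j in range(len(site) - 1))

print(dipwm_score("ACGT"))
```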


Journal ArticleDOI
TL;DR: A polynomial space and time algorithm is described to build a minimum reconciliation graph--a graph that summarizes the set of all most parsimonious reconciliations--and, amongst numerous applications, it is shown how this graph allows counting the number of non-equivalent most parsimonious reconciliations.
Abstract: Comparative genomic studies are often conducted by reconciliation analyses comparing gene and species trees. One of the issues with reconciliation approaches is that an exponential number of optimal scenarios is possible. The resulting complexity is masked by the fact that most reconciliation software picks a single arbitrary optimal solution that is returned to the end-user. However, the alternative solutions should not be ignored since they tell different stories that parsimony considers as viable as the returned solution. In this paper, we describe a polynomial space and time algorithm to build a minimum reconciliation graph -- a graph that summarizes the set of all most parsimonious reconciliations. Amongst numerous applications, it is shown how this graph allows counting the number of non-equivalent most parsimonious reconciliations.

48 citations
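
As background for what a single reconciliation is, the sketch below performs the classic LCA mapping, which yields one most parsimonious duplication-based reconciliation on toy rooted trees; the paper's contribution, the graph summarizing all such reconciliations, is not reproduced here.

```python
# Trees are nested tuples; labels and the species map are invented.
def leaf_set(t):
    return {t} if isinstance(t, str) else leaf_set(t[0]) | leaf_set(t[1])

def lca(species_tree, targets):
    # Smallest subtree of the species tree containing every target species.
    if isinstance(species_tree, str):
        return species_tree
    for child in species_tree:
        if targets <= leaf_set(child):
            return lca(child, targets)
    return species_tree

def count_duplications(gene_tree, species_tree, species_of):
    if isinstance(gene_tree, str):
        return 0, {species_of[gene_tree]}
    dups, species, child_maps = 0, set(), []
    for child in gene_tree:
        d, s = count_duplications(child, species_tree, species_of)
        dups, species = dups + d, species | s
        child_maps.append(lca(species_tree, s))
    mapping = lca(species_tree, species)
    dups += mapping in child_maps        # mapped like a child => duplication
    return dups, species

species_tree = (("A", "B"), "C")
gene_tree = (("a1", "b1"), ("a2", "c1"))
species_of = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
print(count_duplications(gene_tree, species_tree, species_of)[0])  # -> 1
```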


Journal ArticleDOI
TL;DR: A systematic summary, comparison and discussion of computational studies on host–pathogen interactions, including prediction and analysis of host–pathogen protein–protein interactions; basic principles revealed from host–pathogen interactions; and database and software tools for host–pathogen interaction data collection, integration and analysis are provided.
Abstract: Host–pathogen interactions are important for understanding infection mechanism and developing better treatment and prevention of infectious diseases. Many computational studies on host–pathogen interactions have been published. Here, we review recent progress and results in this field and provide a systematic summary, comparison and discussion of computational studies on host–pathogen interactions, including prediction and analysis of host–pathogen protein–protein interactions; basic principles revealed from host–pathogen interactions; and database and software tools for host–pathogen interaction data collection, integration and analysis.

44 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a next generation of biomedical similarity measures that efficiently and fully explore the semantics present in biomedical ontologies, such as disjointness and disjunctive axioms.
Abstract: There is a prominent trend to augment and improve the formality of biomedical ontologies. For example, this is shown by the current effort on adding description logic axioms, such as disjointness. One of the key ontology applications that can take advantage of this effort is conceptual (functional) similarity measurement. The presence of description logic axioms in biomedical ontologies makes the current structural or extensional approaches weaker and further away from providing sound semantics-based similarity measures. Although beneficial in small ontologies, the exploration of description logic axioms by semantics-based similarity measures is computationally expensive. This limitation is critical for biomedical ontologies, which normally contain thousands of concepts. Thus, in the process of gaining their rightful place, biomedical functional similarity measures have to take the journey of finding how this rich and powerful knowledge can be fully explored while keeping computational costs feasible. This manuscript aims at promoting and guiding the development of compelling tools that deliver what the biomedical community will require in the near future: a next generation of biomedical similarity measures that efficiently and fully explore the semantics present in biomedical ontologies.

41 citations
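
For context, a minimal sketch of the kind of measure the paper wants to move beyond: Resnik's information-content similarity over a tiny invented DAG (terms, edges and probabilities are all hypothetical).

```python
import math

parents = {"binding": {"root"}, "catalysis": {"root"},
           "dna_binding": {"binding"}, "rna_binding": {"binding"}}

def ancestors(term):
    out, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):   # walk is-a edges upward
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

# p(term): fraction of annotations at or below the term (toy numbers).
p = {"root": 1.0, "binding": 0.5, "catalysis": 0.5,
     "dna_binding": 0.2, "rna_binding": 0.3}

def resnik(t1, t2):
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(p[c]) for c in common)  # IC of best common ancestor

print(resnik("dna_binding", "rna_binding"))      # IC of "binding", ~0.69
```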


Journal ArticleDOI
Min Li, Jianxin Wang, Huan Wang, Yi Pan
TL;DR: A new method for evaluating the confidence of each interaction based on the combination of a logistic regression-based model and functional similarity is proposed, and the experimental results show that the weighting method improved the performance of centrality measures considerably.
Abstract: Identifying essential proteins is very important for understanding the minimal requirements of cellular survival and development. Fast growth in the amount of available protein–protein interactions has produced unprecedented opportunities for detecting protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on network topology. Unfortunately, the protein–protein interactions produced by high-throughput experiments generally have high false-positive rates. Moreover, most centrality measures based on network topology are sensitive to false positives. We therefore propose a new method for evaluating the confidence of each interaction based on the combination of a logistic regression-based model and functional similarity. Nine standard centrality measures in the weighted network were redefined in this paper. The experimental results on a yeast protein interaction network show that the weighting method improved the performance of centrality measures considerably. More essential proteins were discovered by the weighted centrality measures than by the original centrality measures used in the unweighted network. Improvements of about 20% were obtained for closeness centrality and subgraph centrality.

37 citations
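
A hedged sketch of the general idea (not the authors' exact logistic-regression weighting): attach a confidence weight to every interaction and compute centralities on the weighted graph. Edge confidences below are invented.

```python
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("P1", "P2", 0.9), ("P1", "P3", 0.4), ("P2", "P3", 0.8),
    ("P3", "P4", 0.2), ("P4", "P5", 0.7),
])

# Weighted degree: sum of interaction confidences per protein.
weighted_degree = dict(g.degree(weight="weight"))

# Weighted closeness: treat high confidence as short distance.
for _u, _v, d in g.edges(data=True):
    d["dist"] = 1.0 - d["weight"] + 1e-6
closeness = nx.closeness_centrality(g, distance="dist")

print(sorted(weighted_degree, key=weighted_degree.get, reverse=True))
print(closeness)
```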


Journal ArticleDOI
TL;DR: It is demonstrated that VBBMM offers significant improvements in inference and feature selection in this type of data compared to an Expectation-Maximization (EM) algorithm, at a significantly reduced computational cost.
Abstract: An increasing number of studies are using beadarrays to measure DNA methylation on a genome-wide basis. The purpose is to identify novel biomarkers in a wide range of complex genetic diseases including cancer. A common difficulty encountered in these studies is distinguishing true biomarkers from false positives. While statistical methods aimed at improving the feature selection step have been developed for gene expression, relatively few methods have been adapted to DNA methylation data, which is naturally beta-distributed. Here we explore and propose an innovative application of a recently developed variational Bayesian beta-mixture model (VBBMM) to the feature selection problem in the context of DNA methylation data generated from a highly popular beadarray technology. We demonstrate that VBBMM offers significant improvements in inference and feature selection in this type of data compared to an Expectation-Maximization (EM) algorithm, at a significantly reduced computational cost. We further demonstrate the added value of VBBMM as a feature selection and prioritization step in the context of identifying prognostic markers in breast cancer. A variational Bayesian approach to feature selection of DNA methylation profiles should thus be of value to any study undergoing large-scale DNA methylation profiling in search of novel biomarkers.

34 citations
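
Not the authors' variational algorithm: a plain EM-style loop for a two-component beta mixture with moment-matching M-steps, included only to show the model class (methylation beta-values live in (0,1)). Data and initial parameters are synthetic.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
x = np.concatenate([rng.beta(2, 8, 500), rng.beta(8, 2, 200)])  # toy data

params = [(1.5, 6.0), (6.0, 1.5)]            # initial (a, b) per component
weights = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibility of each component for each data point.
    dens = np.stack([w * beta.pdf(x, a, b)
                     for (a, b), w in zip(params, weights)])
    resp = dens / dens.sum(axis=0)
    # M-step: moment-matching beta fit, weighted by responsibilities.
    params = []
    for r in resp:
        m = np.average(x, weights=r)
        v = np.average((x - m) ** 2, weights=r)
        c = m * (1 - m) / v - 1.0            # implied a + b
        params.append((m * c, (1 - m) * c))
    weights = resp.mean(axis=1)

print(params, weights)
```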


Journal ArticleDOI
TL;DR: The development of just such an algorithm and of MinimalMarker, its accompanying Perl-based computer program, is described, and it is expected that this program will prove useful not only to genomics researchers but also to government agencies that use DNA markers to support a variety of food-inspection and -labeling regulations.
Abstract: DNA markers are frequently used to analyze crop varieties, with the coded marker data summarized in a computer-generated table. Such summary tables often provide extraneous data about individual crop genotypes, needlessly complicating and prolonging DNA-based differentiation between crop varieties. At present, it is difficult to identify minimal marker sets--the smallest sets that can distinguish between all crop varieties listed in a marker-summary table--due to the absence of algorithms capable of such characterization. Here, we describe the development of just such an algorithm and MinimalMarker, its accompanying Perl-based computer program. MinimalMarker has been validated in variety identification of fruit trees using published datasets and is available for use with both dominant and co-dominant markers, regardless of the number of alleles, including SSR markers with numeric notation. We expect that this program will prove useful not only to genomics researchers but also to government agencies that use DNA markers to support a variety of food-inspection and -labeling regulations.

30 citations
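
Finding a minimal marker set is a set-cover-style problem; MinimalMarker searches exhaustively, while the sketch below shows only the simpler greedy variant on an invented genotype table: repeatedly pick the marker separating the most still-indistinguishable variety pairs.

```python
from itertools import combinations

# Rows: varieties; columns: coded marker genotypes (hypothetical data).
table = {
    "cv_A": ("1", "2", "1", "3"),
    "cv_B": ("1", "2", "2", "3"),
    "cv_C": ("2", "2", "1", "1"),
    "cv_D": ("1", "1", "1", "3"),
}
n_markers = 4
pairs = set(combinations(sorted(table), 2))

chosen = []
while pairs:
    def separated(m):
        return {p for p in pairs if table[p[0]][m] != table[p[1]][m]}
    best = max(range(n_markers), key=lambda m: len(separated(m)))
    pairs -= separated(best)
    chosen.append(best)
print("greedy marker set (0-based columns):", chosen)
```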


Journal ArticleDOI
TL;DR: It has been shown that an increase in modality of the functions describing the control of gene expression efficiency allows for a decrease in the dimensionality of these systems with retention of their chaotic dynamics.
Abstract: The methods for constructing "chaotic" nonlinear systems of differential equations modeling gene networks of arbitrary structure and dimensionality with various types of symmetry are considered. It has been shown that an increase in modality of the functions describing the control of gene expression efficiency allows for a decrease in the dimensionality of these systems with retention of their chaotic dynamics. Three-dimensional "chaotic" cyclic systems are considered. Symmetrical and asymmetrical attractors with "narrow" chaos having a Moebius-like structure have been detected in such systems. As has been demonstrated, a complete symmetry of the systems with respect to permutation of variables does not prevent the emergence of their chaotic dynamics.

29 citations
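
A sketch of how such a three-dimensional cyclic system can be set up and integrated numerically; the regulation function and parameters below are invented, and locating actually chaotic regimes requires the multi-modal control functions and parameter tuning described in the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

def regulation(u, k=10.0):
    # Non-monotone control of expression efficiency (toy single-hump form;
    # the paper raises the modality of such functions to obtain chaos).
    return k * u**2 * np.exp(-u)

def rhs(_t, x):
    # Cyclic coupling: gene i is controlled by gene i-1, with linear decay.
    return [regulation(x[2]) - x[0],
            regulation(x[0]) - x[1],
            regulation(x[1]) - x[2]]

sol = solve_ivp(rhs, (0, 500), [0.5, 1.0, 1.5], max_step=0.1)
print(sol.y[:, -3:])   # inspect the attractor's tail, or plot y[0] vs y[1]
```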


Journal ArticleDOI
TL;DR: Fast dynamic programming algorithms are described that solve the GTP problems and the RF supertree problem exactly, and it is demonstrated that these algorithms can solve instances with data sets consisting of as many as 22 taxa.
Abstract: Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst-case species trees, their numbers, and the optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
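
As a building block for the RF supertree objective, the sketch below computes a plain Robinson-Foulds distance between two rooted toy trees from their clade sets (the unrooted, bipartition-based version is analogous); it is not the authors' dynamic programming algorithm.

```python
def clades(tree, out):
    # Collect the leaf set below every internal node of a nested-tuple tree.
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= clades(child, out)
    out.add(leaves)
    return leaves

def rf_distance(t1, t2):
    s1, s2 = set(), set()
    all1, all2 = clades(t1, s1), clades(t2, s2)
    assert all1 == all2, "trees must share one leaf set"
    trivial = {all1}
    return len((s1 - trivial) ^ (s2 - trivial))  # symmetric difference

t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(t1, t2))   # {a,b} vs {a,c} disagree -> distance 2
```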

Journal ArticleDOI
TL;DR: The CELLmicrocosmos PathwayIntegration (CmPI) was developed to support and visualize the subcellular localization prediction of protein-related data such as protein-interaction networks, and with the current release the workflow has been dramatically improved and simplified.
Abstract: The CELLmicrocosmos PathwayIntegration (CmPI) was developed to support and visualize the subcellular localization prediction of protein-related data such as protein-interaction networks. From the start it was possible to manually analyze the localizations by using an interactive table. It was, however, quite complicated to compare and analyze the different localization results derived from data integration as well as text-mining-based databases. The current software release provides a new interactive visual workflow, the Subcellular Localization Charts. As an application case, a MUPP1-related protein-protein interaction network is localized and semi-automatically analyzed. It will be shown that the workflow was dramatically improved and simplified. In addition, it is now possible to use custom protein-related data by using the SBML format and get a view of predicted protein localizations mapped onto a virtual cell model.

Journal ArticleDOI
TL;DR: This JBCB special issue contains 9 papers based on selected presentations at the BGRS\SB-2014 conference, focusing mainly on algorithmic problems in bioinformatics; the organizers aim to diversify post-conference publications across different journals covering wider bioinformatics areas.
Abstract: The international conference series Bioinformatics of Genome Regulation and Structure\Systems Biology, known as "BGRS\SB" or just "BGRS", is a traditional biennial event that brings together biologists, computer scientists, mathematicians and biochemists working in the interdisciplinary fields of systems biology, biotechnology and genetics. Initiated by Prof. Nikolay A. Kolchanov in 1998, BGRS\SB has been held every two years in Novosibirsk, Russia, organized by the Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences. The ninth conference, BGRS\SB-2014, held 23-28 June 2014 (http://conf.nsc.ru/BGRSSB2014), was associated with the multi-conference "Bioinformatics and Systems Biology" and the traditional Young Scientists' School "Systems Biology and Bioinformatics" (SBB-2014). The BGRS\SB-2014 event hosted more than 400 participants from more than 20 countries, confirming the conference as the largest international systems biology meeting series in Russia and in Northern Asia. Next year, the jubilee tenth BGRS\SB-2016 conference will be held in summer 2016, again in Novosibirsk Akademgorodok. The BGRS series has a long tradition of special post-conference issue publications in the Journal of Bioinformatics and Computational Biology (JBCB), starting from 2006. Selected materials presented at the multi-conference were recommended for full-text publication in several special journal issues besides JBCB. BioMed Central journals publishing BGRS-2014 supplements include BMC Genomics, BMC Systems Biology, BMC Evolutionary Biology, and BMC Genetics. A special issue on bioinformatics using BGRS\SB-2014 presentations was printed in 2014 in the "Vavilov Journal of Genetics and Breeding" in Russian (published in English as "Russian Journal of Genetics: Applied Research"). Additionally, some separate papers related to the conference materials have been submitted for publication in JBSD (Journal of Biomolecular Structure and Dynamics) and JIB (Journal of Integrative Bioinformatics). Due to the multi-disciplinary nature of the conference, the organizing committees decided to diversify post-conference publications across different journals covering wider bioinformatics areas. This JBCB special issue contains 9 papers based on selected presentations at the conference, focusing mainly on algorithmic problems in bioinformatics. The papers were chosen from roughly twice as many submissions to this journal.

Journal ArticleDOI
TL;DR: Among several target candidates for cofactor engineering, glyceraldehyde-3-phosphate dehydrogenase (GAPD) is the most promising enzyme; its cofactor modification enhanced both the desired product and biomass yields significantly.
Abstract: Cofactors, such as NAD(H) and NADP(H), play important roles in energy transfer within the cells by providing the necessary redox carriers for a myriad of metabolic reactions, both anabolic and catabolic. Thus, it is crucial to establish the overall cellular redox balance for achieving the desired cellular physiology. Of several methods to manipulate the intracellular cofactor regeneration rates, altering the cofactor specificity of a particular enzyme is a promising one. However, the identification of relevant enzyme targets for such cofactor specificity engineering (CSE) is often very difficult and labor intensive. Therefore, it is necessary to develop more systematic approaches to find the cofactor engineering targets for strain improvement. Presented herein is a novel mathematical framework, cofactor modification analysis (CMA), developed based on the well-established constraints-based flux analysis, for the systematic identification of suitable CSE targets while exploring the global metabolic effects. The CMA algorithm was applied to E. coli using its genome-scale metabolic model, iJO1366, thereby identifying the growth-coupled cofactor engineering targets for overproducing four of its native products: acetate, formate, ethanol, and lactate, and three non-native products: 1-butanol, 1,4-butanediol, and 1,3-propanediol. Notably, among several target candidates for cofactor engineering, glyceraldehyde-3-phosphate dehydrogenase (GAPD) is the most promising enzyme; its cofactor modification enhanced both the desired product and biomass yields significantly. Finally, given the identified target, we further discussed potential mutational strategies for modifying cofactor specificity of GAPD in E. coli as suggested by in silico protein docking experiments.
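
A hedged miniature of the constraints-based backbone that CMA builds on: maximize a biomass-like flux subject to steady state S v = 0 and bounds, via linear programming. The three-reaction network is invented; the actual work uses the genome-scale iJO1366 model.

```python
import numpy as np
from scipy.optimize import linprog

# Metabolites x reactions: uptake -> A, A -> B, B -> "biomass".
S = np.array([
    [1, -1,  0],     # metabolite A
    [0,  1, -1],     # metabolite B
])
bounds = [(0, 10), (0, None), (0, None)]   # uptake capped at 10 units
c = np.array([0, 0, -1.0])                 # linprog minimizes, so negate

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print("biomass flux:", res.x[2])           # 10.0, limited by uptake
```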

Journal ArticleDOI
TL;DR: The basic concepts of cloud computing and MapReduce are introduced, their applications in bioinformatics are reviewed, and some problems challenging these applications are highlighted.
Abstract: In the past decades, with the rapid development of high-throughput technologies, biology research has generated an unprecedented amount of data. In order to store and process such a great amount of data, cloud computing and MapReduce were applied to many fields of bioinformatics. In this paper, we first introduce the basic concepts of cloud computing and MapReduce, and their applications in bioinformatics. We then highlight some problems challenging the applications of cloud computing and MapReduce to bioinformatics. Finally, we give a brief guideline for using cloud computing in biology research.
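
A toy single-process rendering of the MapReduce pattern the review discusses, applied to k-mer counting (a common bioinformatics use case); a real deployment would distribute the map, shuffle and reduce phases across a cluster.

```python
from collections import defaultdict

reads = ["ACGTAC", "GTACGT"]          # stand-in for a sequencing dataset

def mapper(read, k=3):
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1        # emit (key, value) pairs

# Shuffle: group values by key, as the framework does between phases.
groups = defaultdict(list)
for read in reads:
    for kmer, one in mapper(read):
        groups[kmer].append(one)

def reducer(kmer, ones):
    return kmer, sum(ones)            # aggregate each key's values

print(dict(reducer(k, v) for k, v in groups.items()))
```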

Journal ArticleDOI
TL;DR: Interestingly, gene responsiveness is most intimately correlated with DNA structural features and promoter architecture, and a few of the variability measures of gene expression are linked to DNA structural properties, nucleosome occupancy, TATA-box presence, and bidirectionality of promoter regions.
Abstract: Gene expression is the most fundamental biological process, which is essential for phenotypic variation. It is regulated by various external (environment and evolution) and internal (genetic) factors. The level of gene expression depends on promoter architecture, along with other external factors. The presence of sequence motifs, such as transcription factor binding sites (TFBSs) and the TATA-box, or of DNA methylation in vertebrates, has been implicated in the regulation of expression of some genes in eukaryotes, but a large number of genes lack these sequences. On the other hand, several experimental and computational studies have shown that promoter sequences possess some special structural properties, such as low stability, less bendability, low nucleosome occupancy, and more curvature, which are prevalent across all organisms. These structural features may play a role in transcription initiation and regulation of gene expression. We have studied the relationship between the structural features of promoter DNA, promoter directionality and gene expression variability in S. cerevisiae. This relationship has been analyzed for seven different measures of gene expression variability, along with two different regulatory-effect measures. We find that a few of the variability measures of gene expression are linked to DNA structural properties, nucleosome occupancy, TATA-box presence, and bidirectionality of promoter regions. Interestingly, gene responsiveness is most intimately correlated with DNA structural features and promoter architecture.
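
The analysis is correlational at heart; a minimal stand-in with invented numbers: rank-correlate one promoter structural property against one expression-variability measure across genes.

```python
from scipy.stats import spearmanr

dna_stability = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]     # one value per gene (toy)
responsiveness = [0.8, 0.4, 0.9, 0.1, 0.6, 0.3]    # matching toy measure

rho, pval = spearmanr(dna_stability, responsiveness)
print(rho, pval)   # strong negative rank correlation on these toy numbers
```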

Journal ArticleDOI
TL;DR: An automated alignment algorithm was developed based on dynamic programming to align multiple-peak time-series data both globally and locally and yielded robust analysis of challenging SHAPE probing data.
Abstract: Alignment of peaks in electropherograms or chromatograms obtained from experimental techniques such as capillary electrophoresis remains a significant challenge. Accurate alignment is critical for accurate interpretation of various classes of nucleic acid analysis technologies, including conventional DNA sequencing and new RNA structure probing technologies. An automated alignment algorithm was developed based on dynamic programming to align multiple-peak time-series data both globally and locally. This algorithm relies on a new peak similarity measure and other features such as time penalties, global constraints, and minimum-similarity scores, and results in rapid, highly accurate comparisons of complex time-series datasets. As a demonstrative case study, the developed algorithm was applied to analysis of capillary electrophoresis data from a Selective 2'-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) evaluation of RNA secondary structure. The algorithm yielded robust analysis of challenging SHAPE probing data. Experimental results show that the peak alignment algorithm efficiently corrects retention time variation caused by the presence of fluorescent tags on fragments and by differences between capillaries. The tools can be readily adapted for the analysis of other biological datasets in which peak retention times vary.
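
A minimal dynamic-programming alignment of two peak lists, in the spirit of (but much simpler than) the authors' method; the peak similarity measure and gap penalty are invented placeholders.

```python
import numpy as np

def peak_sim(p, q, sigma=2.0):
    # Peaks as (retention_time, intensity); reward close times, similar sizes.
    return np.exp(-((p[0] - q[0]) ** 2) / (2 * sigma**2)) * min(p[1], q[1])

def align(a, b, gap=-0.5):
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1))
    dp[1:, 0] = gap * np.arange(1, n + 1)
    dp[0, 1:] = gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + peak_sim(a[i - 1], b[j - 1]),
                           dp[i - 1, j] + gap,     # skip a peak in run 1
                           dp[i, j - 1] + gap)     # skip a peak in run 2
    return dp[n, m]   # a traceback (omitted) recovers the peak pairing

run1 = [(10.1, 5.0), (14.9, 2.0), (20.3, 7.5)]
run2 = [(10.6, 4.8), (15.4, 2.2), (21.0, 7.1)]
print(align(run1, run2))
```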

Journal ArticleDOI
TL;DR: It is demonstrated that considering mRNA and near siRNA binding site features helps improve siRNA design accuracy, and the findings may also be helpful in understanding binding efficacy between microRNA and mRNA.
Abstract: Design of small interference RNA (siRNA) is one of the most important steps in effectively applying the RNA interference (RNAi) technology. The current siRNA design often produces inconsistent design results, which often fail to reliably select siRNA with clear silencing effects. We propose that when designing siRNA, one should consider mRNA global features and near siRNA-binding site local features. By a linear regression study, we discovered strong correlations between inhibitory efficacy and both mRNA global features and neighboring local features. This paper shows that, on average, less GC content, fewer stem secondary structures, and more loop secondary structures of mRNA at both global and local flanking regions of the siRNA binding sites lead to stronger inhibitory efficacy. Thus, the use of mRNA global features and near siRNA-binding site local features are essential to successful gene silencing and hence, a better siRNA design. We use a random forest model to predict siRNA efficacy using siRNA features, mRNA features, and near siRNA binding site features. Our prediction method achieved a correlation coefficient of 0.7 in 10-fold cross validation in contrast to 0.63 when using siRNA features only. Our study demonstrates that considering mRNA and near siRNA binding site features helps improve siRNA design accuracy. The findings may also be helpful in understanding binding efficacy between microRNA and mRNA.
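
A sketch of the modeling setup only (random forest regression on sequence-derived features); the three features and the tiny dataset are invented stand-ins for the authors' siRNA, mRNA-global and binding-site-local features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def features(sirna, flank):
    def gc(s):
        return (s.count("G") + s.count("C")) / len(s)
    return [gc(sirna), gc(flank), len(flank)]     # toy feature set

X = np.array([features("GCAUGCAUGCAUGCAUGCA", "AUGC" * 10),
              features("AUAUAUAUAUAUAUAUAUA", "GCGC" * 10),
              features("GGGGCCCCGGGGCCCCGGG", "AUAU" * 10)])
y = np.array([0.7, 0.5, 0.2])                     # invented efficacies

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.predict(X[:1]))
```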

Journal ArticleDOI
Lin He, Xi Han, Bin Ma
TL;DR: An efficient de novo sequencing algorithm, DeNovoPTM, is presented that includes a large number of post-translational modification (PTM) types yet limits the number of PTM occurrences in each peptide to increase accuracy.
Abstract: De novo sequencing derives the peptide sequence from a tandem mass spectrum without the assistance of protein databases. This analysis has been indispensable for the identification of novel or modified peptides in a biological sample. Currently, the speed of de novo sequencing algorithms is not heavily affected by the number of post-translational modification (PTM) types in consideration. However, the accuracy of the algorithms can be degraded due to the increased search space. Most peptides in proteomics research contain only a small number of PTMs per peptide, yet the types of PTMs can come from a large number of choices. Therefore, it is desirable to include a large number of PTM types in a de novo sequencing algorithm, yet to limit the number of PTM occurrences in each peptide to increase the accuracy. In this paper, we present an efficient de novo sequencing algorithm, DeNovoPTM, for such a purpose. The implemented software is downloadable from http://www.cs.uwaterloo.ca/~l22he/denovo_ptm.

Journal ArticleDOI
TL;DR: This research relaxes the pedigree structure to allow ungenotyped founders and presents a cubic-time whole-genome haplotyping algorithm that minimizes the number of zero-recombination haplotype blocks, implemented as the computer program iBDD.
Abstract: High-throughput single nucleotide polymorphism genotyping assays conveniently produce genotype data for genome-wide genetic linkage and association studies. For pedigree datasets, the unphased genotype data is used to infer the haplotypes for individuals, according to Mendelian inheritance rules. Linkage studies can then locate putative chromosomal regions based on the haplotype allele sharing among the pedigree members and their disease status. Most existing haplotyping programs require rather strict pedigree structures and return a single inferred solution for downstream analysis. In this research, we relax the pedigree structure to allow ungenotyped founders and present a cubic-time whole-genome haplotyping algorithm to minimize the number of zero-recombination haplotype blocks. With or without explicitly enumerating all the haplotyping solutions, the algorithm determines all distinct haplotype allele identity-by-descent (IBD) sharings among the pedigree members, in time linear in the total number of haplotyping solutions. Our algorithm is implemented as a computer program, iBDD. Extensive simulation experiments using two sets of 16 pedigree structures from previous studies showed that, in general, there are trillions of haplotyping solutions, but only up to a few thousand distinct haplotype allele IBD sharings. iBDD is able to return all these sharings for downstream genome-wide linkage and association studies.

Journal ArticleDOI
TL;DR: This work describes here an approach to the automatic identification of discourse causality triggers in the biomedical domain using machine learning, and evaluates the impact of lexical, syntactic, and semantic features on each of the algorithms, showing that semantics improves the performance in all cases.
Abstract: Current domain-specific information extraction systems represent an important resource for biomedical researchers, who need to process vast amounts of knowledge in a short time. Automatic discourse causality recognition can further reduce their workload by suggesting possible causal connections and aiding in the curation of pathway models. We describe here an approach to the automatic identification of discourse causality triggers in the biomedical domain using machine learning. We create several baselines and experiment with and compare various parameter settings for three algorithms, i.e. Conditional Random Fields (CRF), Support Vector Machines (SVM) and Random Forests (RF). We also evaluate the impact of lexical, syntactic, and semantic features on each of the algorithms, showing that semantics improves the performance in all cases. We test our comprehensive feature set on two corpora containing gold standard annotations of causal relations, and demonstrate the need for more gold standard data. The best performance of 79.35% F-score is achieved by CRFs when using all three feature types.
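
A sketch of the CRF setup (the best-performing learner in the paper) using the third-party sklearn-crfsuite package, which is an assumption here, as the authors do not name their toolkit; the token features and two-sentence training set are invented.

```python
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {"lower": w.lower(), "is_title": w.istitle(),
            "suffix3": w[-3:],
            "prev": sent[i - 1].lower() if i else "<s>"}

sents = [["Stress", "induces", "apoptosis"],
         ["Binding", "results", "in", "activation"]]
labels = [["O", "B-CAUSE", "O"],          # BIO tags for causality triggers
          ["O", "B-CAUSE", "I-CAUSE", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```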

Journal ArticleDOI
TL;DR: The developed method analyzes promoter activity assay datasets that pair promoter sequences (the number and sequences of their AuxREs) with measured auxin induction levels; it provides the possibility to investigate the AuxRE structure-activity relationship and may serve as the basis for a novel approach to AuxRE recognition.
Abstract: The plant hormone auxin is a key regulator of growth and development. Auxin affects gene expression through ARF transcription factors, which bind specifically to auxin responsive elements (AuxREs). Auxin responsive genes usually have more than one AuxRE; for example, the widely used auxin sensor DR5 contains seven AuxREs. Auxin responsive regions of several plant genes have been studied using sets of transgenic constructions in which the activity of one or several AuxREs was abolished. Here we present a method for the analysis of datasets from promoter activity assays that pair promoter sequences, namely the number and sequences of their AuxREs, with the measured auxin induction level. The reverse-problem solution considers two extreme models of AuxRE cooperation. The additive model describes the auxin induction level of a gene as the sum of the individual AuxREs' impacts. The multiplicative model assumes pure cooperation between the AuxREs, where the combined effect is the product of the individual AuxRE impacts. Solving the reverse problem allows estimating the impact of an individual AuxRE on the induction level and the model of their cooperation. For the promoters of three genes belonging to different plant species we showed that the multiplicative model fits better than the additive one. The reverse-problem solution also suggests a repressed state of auxin responsive promoters before auxin induction. The developed method provides the possibility to investigate the AuxRE structure-activity relationship and may be used as the basis for a novel approach to AuxRE recognition.
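
The two extreme models are easy to compare on data of this shape: the additive model is a linear least-squares fit, and the multiplicative model becomes linear after taking logs. Counts and induction levels below are invented.

```python
import numpy as np

# Rows: constructs; columns: copies of each of two AuxRE variants present.
counts = np.array([[1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
induction = np.array([2.0, 3.1, 6.0, 12.5])        # invented fold changes

additive, *_ = np.linalg.lstsq(counts, induction, rcond=None)
log_impacts, *_ = np.linalg.lstsq(counts, np.log(induction), rcond=None)

pred_add = counts @ additive
pred_mult = np.exp(counts @ log_impacts)           # product of impacts
for name, pred in (("additive", pred_add), ("multiplicative", pred_mult)):
    print(name, "RSS =", float(((pred - induction) ** 2).sum()))
```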

Journal ArticleDOI
TL;DR: A novel approach to biological classification based on molecular biology instead of traditional morphology is proposed; classification using region (domain) names achieves the best accuracies, and the datasets were constructed to increase the robustness and comparability of the experimental results.
Abstract: In this paper, three genomic materials--DNA sequences, protein sequences, and regions (domains)--are used to compare methods of virus classification. Virus classes (categories) are divided by taxonomic level into three datasets of 6 orders, 42 families, and 33 genera. To increase the robustness and comparability of the experimental results of virus classification, only classes that contain at least 10 instances were selected, and each instance was required to contain at least one region name. Experimental results show that the approach using region names achieved the best accuracies--reaching 99.9%, 97.3%, and 99.0% for the 6 orders, 42 families, and 33 genera, respectively. This paper not only reports exhaustive experiments comparing virus classification using different genomic materials, but also proposes a novel approach to biological classification based on molecular biology instead of traditional morphology.

Journal ArticleDOI
TL;DR: This work focuses on developing heuristics that provide an improved approximate solution to the transposition distance problem; the approach outperforms other algorithms on small permutations and keeps its good performance on longer permutations.
Abstract: Transpositions are large-scale mutational events that occur when a block of genes moves from one region of a chromosome to another region within the same chromosome. The transposition distance problem asks for the minimum number of transpositions required to transform one genome into another. Recently, Bulteau et al. [Bulteau L, Fertin G, Rusu I, Automata, Languages and Programming, Vol. 6755 of Lecture Notes in Computer Science, pp. 654–665, Springer Berlin, Heidelberg, 2011] proved that finding the transposition distance is an NP-hard problem. Several approximation algorithms for this problem have been presented to date [Bafna V, Pevzner PA, SIAM J Discr Math 11(2):224–240, 1998; Elias I, Hartman T, IEEE/ACM Trans Comput Biol Bioinform 3(4):369–379, 2006; Mira CVG, Dias Z, Santos HP, Pinto GA, Walter ME, Proc 3rd Brazilian Symp Bioinformatics (BSB'2008), pp. 115–126, Santo Andre, Brazil, 2008; Walter MEMT, Dias Z, Meidanis J, Proc String Processing and Information Retrieval (SPIRE'2000), pp. 199–208, Coruna, Spain, 2000]. Here we focus on developing heuristics to provide an improved approximate solution. Our approach outperforms other algorithms on small permutations, and we also show that our algorithm keeps its good performance on longer permutations.
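
Not the authors' heuristic: a naive greedy baseline that repeatedly applies the transposition removing the most breakpoints, shown only to make the problem concrete (it carries no approximation guarantee and can stall in local optima).

```python
def breakpoints(p):
    ext = [0] + list(p) + [len(p) + 1]            # frame with 0 and n+1
    return sum(ext[i + 1] - ext[i] != 1 for i in range(len(ext) - 1))

def transpose(p, i, j, k):
    # Move block p[i:j] so it lands after p[j:k] (swap adjacent blocks).
    return p[:i] + p[j:k] + p[i:j] + p[k:]

def greedy_sort(p):
    p, dist = tuple(p), 0
    while breakpoints(p):
        best = min((transpose(p, i, j, k)
                    for i in range(len(p))
                    for j in range(i + 1, len(p) + 1)
                    for k in range(j + 1, len(p) + 1)),
                   key=breakpoints)
        if breakpoints(best) >= breakpoints(p):
            break                                  # stuck in a local optimum
        p, dist = best, dist + 1
    return dist

print(greedy_sort((3, 1, 2)))   # one transposition sorts this permutation
```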

Journal ArticleDOI
TL;DR: The primary focus is simulation, which includes fidelity to the clinical data in terms of clotting-factor concentrations and elapsed time; reproduction of known clotting pathologies; and fine-grained predictions which may be used to refine clinical understanding of blood clotting.
Abstract: The process of human blood clotting involves a complex interaction of continuous-time/continuous-state processes and discrete-event/discrete-state phenomena, where the former comprise the various chemical rate equations and the latter comprise both threshold-limited behaviors and binary states (presence/absence of a chemical). Whereas previous blood-clotting models used only continuous dynamics and perforce addressed only portions of the coagulation cascade, we capture both continuous and discrete aspects by modeling it as a hybrid dynamical system. The model was implemented as a hybrid Petri net, a graphical modeling language that extends ordinary Petri nets to cover continuous quantities and continuous-time flows. The primary focus is simulation: (1) fidelity to the clinical data in terms of clotting-factor concentrations and elapsed time; (2) reproduction of known clotting pathologies; and (3) fine-grained predictions which may be used to refine clinical understanding of blood clotting. Next we examine sensitivity to rate-constant perturbation. Finally, we propose a method for titrating between reliance on the model and on prior clinical knowledge. For simplicity, we confine these last two analyses to a critical purely-continuous subsystem of the model.
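
A tiny illustration of the hybrid continuous/discrete flavor (not the authors' Petri net model): a factor concentration evolves continuously until it crosses a threshold, at which point a discrete switch turns on a second reaction. All quantities are invented.

```python
from scipy.integrate import solve_ivp

THRESHOLD = 1.0

def phase1(_t, y):                 # factor 0 accumulates at a constant rate
    return [0.2, 0.0]

def crossing(_t, y):
    return y[0] - THRESHOLD
crossing.terminal = True           # stop integration at the discrete event

sol1 = solve_ivp(phase1, (0, 50), [0.0, 0.0], events=crossing)
t_switch = sol1.t[-1]

def phase2(_t, y):                 # switched-on conversion of 0 into 1
    return [-0.5 * y[0], 0.5 * y[0]]

sol2 = solve_ivp(phase2, (t_switch, t_switch + 20), sol1.y[:, -1])
print(t_switch, sol2.y[:, -1])
```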

Journal ArticleDOI
TL;DR: This paper develops sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all comparison.
Abstract: Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. In this paper, we present a parallel algorithm for taxonomic clustering of large metagenomic samples with support for overlapping clusters. We develop sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all comparison. We formulate the metagenomic classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud ready implementation. We show that the resulting framework can produce high quality clustering of metagenomic samples consisting of millions of reads, in reasonable time limits, when executed on a modest size cluster.
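
A small sketch of the sketching idea itself: MinHash signatures estimate the Jaccard similarity of two reads' k-mer sets without comparing the sets directly. The seeded-MD5 hashing is a simplification, not the paper's implementation.

```python
import hashlib

def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(items, n_hashes=64):
    # One "hash function" per seed; keep the minimum value per function.
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(n_hashes)]

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

r1, r2 = "ACGTACGTGGTT", "ACGTACGTGGAA"
print(estimate_jaccard(minhash(kmers(r1)), minhash(kmers(r2))))
```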

Journal ArticleDOI
TL;DR: Simulation showed that despite the same auxin distribution pattern, provascular tissues in the root tip differ in the dynamics of auxin transport; the model predicts that shoot-derived auxin flow to protophloem is lower than that to protoxylem, and that the efficiency of PIN-mediated auxin transport in protophloem is higher than in protoxylem.
Abstract: The phytohormone auxin is the main regulator of plant growth and development. Nonuniform auxin distribution in plant tissue sets positional information, which determines morphogenesis. Auxin is transported in tissue by means of diffusion and active transport through the cell membrane. There are a number of auxin carriers performing its influx into a cell (AUX/LAX family) or efflux from a cell (PIN, PGP families). The paper presents mathematical models for auxin transport in vascular tissues of the Arabidopsis thaliana L. root tip, namely protophloem and protoxylem. Tissue specificity of auxin active transport was considered in these models. There is PIN-mediated auxin efflux in both protoxylem and protophloem, but AUX1-mediated influx exists only in protophloem. Optimal parameter values were adjusted so that model solutions fit the experimentally observed auxin distributions in the root tip. Based on the simulation results, we predicted that shoot-derived auxin flow to protophloem is lower than that to protoxylem, and that the efficiency of PIN-mediated auxin transport in protophloem is higher than in protoxylem. In summary, our simulation showed that despite the same auxin distribution pattern, provascular tissues in the root tip differ in the dynamics of auxin transport.
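
A schematic one-dimensional cell-file version of such transport models: passive diffusion between neighbors plus directional PIN-like efflux toward the tip, with decay. All rates are invented rather than fitted, and the protophloem/protoxylem distinction (AUX1-mediated influx) is omitted.

```python
import numpy as np
from scipy.integrate import solve_ivp

N, D, PIN, DECAY, SOURCE = 20, 0.05, 0.5, 0.02, 1.0

def rhs(_t, a):
    da = np.zeros_like(a)
    da[0] += SOURCE                     # shoot-derived auxin enters cell 0
    for i in range(N - 1):
        flux = PIN * a[i] + D * (a[i] - a[i + 1])   # net flow toward tip
        da[i] -= flux
        da[i + 1] += flux
    return da - DECAY * a

profile = solve_ivp(rhs, (0, 1000), np.zeros(N)).y[:, -1]
print(profile.round(2))                 # auxin accumulates toward the tip
```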

Journal ArticleDOI
TL;DR: The model that has been developed herein represents one of the first attempts to model AD from a systems approach to study physiologically relevant parameters that may prove useful to physicians in the future.
Abstract: Alzheimer's disease (AD) is the most common form of dementia. Even with its well-known symptoms of memory loss and well-characterized pathology of beta amyloid (Aβ) plaques and neurofibrillary tangles, the disease pathogenesis and initiating factors are still not well understood. To tackle this problem, a systems biology model has been developed and used to study the varying effects of variations in the ApoE allele present, as well as the effects of short-term and periodic inflammation at low to moderate levels. Simulations showed a late-onset peak of Aβ in the ApoE4 case that led to localized neuron loss, which could be ameliorated in part by application of short-term pro-inflammatory mediators. The model developed herein represents one of the first attempts to model AD from a systems approach to study physiologically relevant parameters that may prove useful to physicians in the future.

Journal ArticleDOI
TL;DR: The spectral bipartitioning method is able to efficiently identify a biologically meaningful minimal set of proteins whose removal causes a massive disruption of protein complexes in an organism.
Abstract: Protein complexes are a cornerstone of many biological processes and, together, they form various types of molecular machinery that perform a vast array of biological functions. Different complexes perform different functions, and the same complex can perform very different functions that depend on a variety of factors. Thus, disruption of protein complexes can be lethal to an organism. It is interesting to identify a minimal set of proteins whose removal would lead to a massive disruption of protein complexes, and to understand the biological properties of these proteins. A method is presented for identifying a minimum number of proteins from a given set of complexes so that a maximum number of these complexes are disrupted when these proteins are removed. The method is based on spectral bipartitioning. This method is applied to yeast protein complexes. The identified proteins participate in a large number of biological processes and functional modules. A large proportion of them are essential proteins. Moreover, removing these identified proteins causes a large number of the yeast protein complexes to break into two fragments of nearly equal size, which minimizes the chance of either fragment being functional. The method is also superior in these aspects to alternative methods based on proteins with high connection degree, proteins whose neighbors have high average degree, and proteins that connect to lots of proteins of high connection degree. Our spectral bipartitioning method is able to efficiently identify a biologically meaningful minimal set of proteins whose removal causes a massive disruption of protein complexes in an organism.
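
A minimal spectral bipartitioning sketch: split a graph by the sign of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). The toy graph is invented, and the paper's objective of maximizing complex disruption adds machinery on top of this core step.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian

vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
fiedler = vecs[:, 1]
side = fiedler >= 0
print("side A:", np.where(side)[0], "side B:", np.where(~side)[0])
```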

Journal ArticleDOI
TL;DR: A heuristic cluster-based EM algorithm, CEM, which refines the cluster subsets in EM method to explore the best local optimal solution and demonstrates significant improvements in identifying the motif instances and performs better than current widely used algorithms.
Abstract: The planted motif search problem arises from locating transcription factor binding sites (TFBSs), which are crucial for understanding gene regulatory relationships. Many past attempts at using expectation maximization (EM) for TFBS discovery have been successful. However, identifying highly degenerate motifs and reducing the effect of local optima remain arduous tasks. To alleviate the vulnerability of EM to trapping in local optima, we present a heuristic cluster-based EM algorithm, CEM, which refines the cluster subsets in the EM method to explore the best local optimal solution. Based on experiments using both synthetic and real datasets, our algorithm demonstrates significant improvements in identifying motif instances and performs better than current widely used algorithms. CEM is a novel planted motif finding algorithm, which is able to solve challenging instances and is easy to parallelize since the process of solving each cluster subset is independent.
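
To show the E/M steps that CEM's cluster refinement wraps around, here is a bare-bones OOPS-style EM motif finder (one occurrence per sequence) on synthetic data; this is generic EM, not the CEM algorithm itself.

```python
import numpy as np

SEQS = ["ACGTTGCA", "TTGCAACG", "GATTGCAT"]   # toy data, planted "TTGCA"
W = 5
IDX = {c: i for i, c in enumerate("ACGT")}
rng = np.random.default_rng(0)
pwm = rng.dirichlet(np.ones(4), size=W)       # random start, rows sum to 1
bg = np.full(4, 0.25)                         # uniform background

for _ in range(100):
    counts = np.zeros((W, 4)) + 0.1           # pseudocounts
    for s in SEQS:
        # E-step: posterior over motif start positions in this sequence.
        like = np.array([np.prod([pwm[j, IDX[s[o + j]]] / bg[IDX[s[o + j]]]
                                  for j in range(W)])
                         for o in range(len(s) - W + 1)])
        post = like / like.sum()
        # M-step contribution: expected letter counts under the posterior.
        for o, p in enumerate(post):
            for j in range(W):
                counts[j, IDX[s[o + j]]] += p
    pwm = counts / counts.sum(axis=1, keepdims=True)

best = max((s[o:o + W] for s in SEQS for o in range(len(s) - W + 1)),
           key=lambda w: np.prod([pwm[j, IDX[w[j]]] for j in range(W)]))
print(best)   # likely recovers the planted motif on this toy input
```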