Showing papers in "Journal of Bioinformatics and Computational Biology in 2016"

PDF

Open Access

Journal Article•DOI•

Evaluating feature-selection stability in next-generation proteomics.

[...]

Wilson Wen Bin Goh¹, Wilson Wen Bin Goh², Limsoon Wong², Limsoon Wong¹•Institutions (2)

National University of Singapore¹, Tianjin University²

03 Aug 2016-Journal of Bioinformatics and Computational Biology

TL;DR: It is demonstrated here for the first time the utility of a novel suite of network-based algorithms called ranked-based network algorithms (RBNAs) on proteomics, which are shown to be highly stable, reproducible and select relevant features when applied to proteomics data.

...read moreread less

Abstract: Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue exists also in proteomics for commonly used feature-selection methods, e.g. [Formula: see text]-test and recursive feature elimination. Moreover, due to high test variability, selecting the top proteins based on [Formula: see text]-value ranks - even when restricted to high-abundance proteins - does not improve reproducibility. Statistical testing based on networks are believed to be more robust, but this does not always hold true: The commonly used hypergeometric enrichment that tests for enrichment of protein subnets performs abysmally due to its dependence on unstable protein pre-selection steps. We demonstrate here for the first time the utility of a novel suite of network-based algorithms called ranked-based network algorithms (RBNAs) on proteomics. These have originally been introduced and tested extensively on genomics data. We show here that they are highly stable, reproducible and select relevant features when applied to proteomics data. It is also evident from these results that use of statistical feature testing on protein expression data should be executed with due caution. Careless use of networks does not resolve poor-performance issues, and can even mislead. We recommend augmenting statistical feature-selection methods with concurrent analysis on stability and reproducibility to improve the quality of the selected features prior to experimental validation.

...read moreread less

54 citations

Journal Article•DOI•

Issues in performance evaluation for host-pathogen protein interaction prediction.

[...]

Wajid Arshad Abbasi¹, Fayyaz ul Amir Afsar Minhas¹•Institutions (1)

Pakistan Institute of Engineering and Applied Sciences¹

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: This paper questions the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions and proposes simpler and more directly interpretable metrics for this purpose.

...read moreread less

Abstract: The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein–protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine learning based methods for predicting host–pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between its performance and the performance of an alternative evaluation scheme called leave one pathogen protein out (LOPO) cross-validation. LOPO is more effective in modeling the real world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics such as areas under the precision-recall or receiver operating characteristic curves are not intuitive to biologists and propose simpler and more directly interpretable metrics for this purpose.

...read moreread less

31 citations

Journal Article•DOI•

Meta-analysis of transcriptome data identified TGTCNN motif variants associated with the response to plant hormone auxin in Arabidopsis thaliana L.

[...]

Elena V. Zemlyanskaya¹, Daniil S. Wiebe¹, Nadezhda A. Omelyanchuk¹, Victor G. Levitsky¹, Victoria V. Mironova¹ - Show less +1 more•Institutions (1)

Novosibirsk State University¹

27 Apr 2016-Journal of Bioinformatics and Computational Biology

TL;DR: Meta-analysis of microarray data is performed to reveal TGTCNN variants essential for auxin response and to characterize their functional features, providing an idea that variousTGTCNN motifs may play distinct roles in the auxin regulation of gene expression.

...read moreread less

Abstract: Auxin is the major regulator of plant growth and development. It regulates gene expression via a family of transcription factors (ARFs) that bind to auxin responsive elements (AuxREs) in the gene promoters. The canonical AuxREs found in regulatory regions of many auxin responsive genes contain the TGTCTC core motif, whereas ARF binding site is a degenerate TGTCNN with TGTCGG strongly preferred. Thereby two questions arise: which TGTCNN variants are functional AuxRE cores and whether different TGTCNN variants have distinct functional roles? In this study, we performed meta-analysis of microarray data to reveal TGTCNN variants essential for auxin response and to characterize their functional features. Our results indicate that four TGTCNN motifs (TGTCTC, TGTCCC, TGTCGG, and TGTCTG) are associated with auxin up-regulation and two (TGTCGG, TGTCAT) with auxin down-regulation, but to a lesser extent. The genes having some of these motifs in their regulatory regions showed time-specific auxin response. Functional annotation of auxin up- and down-regulated genes also revealed GO terms specific for the auxin-regulated genes with certain TGTCNN variants in their promoters. Our results provide an idea that various TGTCNN motifs may play distinct roles in the auxin regulation of gene expression.

...read moreread less

27 citations

Journal Article•DOI•

lncRNATargets: A platform for lncRNA target prediction based on nucleic acid thermodynamics.

[...]

Ruifeng Hu¹, Xiaobo Sun¹, Xiaobo Sun²•Institutions (2)

Peking Union Medical College¹, Chinese Ministry of Education²

25 Aug 2016-Journal of Bioinformatics and Computational Biology

TL;DR: Liu et al. as discussed by the authors used the nearest-neighbor (NN) model to calculate binging-free energy of long noncoding RNAs (lncRNAs) target dimers.

...read moreread less

Abstract: Many studies have supported that long noncoding RNAs (lncRNAs) perform various functions in various critical biological processes. Advanced experimental and computational technologies allow access to more information on lncRNAs. Determining the functions and action mechanisms of these RNAs on a large scale is urgently needed. We provided lncRNATargets, which is a web-based platform for lncRNA target prediction based on nucleic acid thermodynamics. The nearest-neighbor (NN) model was used to calculate binging-free energy. The main principle of NN model for nucleic acid assumes that identity and orientation of neighbor base pairs determine stability of a given base pair. lncRNATargets features the following options: setting of a specific temperature that allow use not only for human but also for other animals or plants; processing all lncRNAs in high throughput without RNA size limitation that is superior to any other existing tool; and web-based, user-friendly interface, and colored result displays that allow easy access for nonskilled computer operators and provide better understanding of results. This technique could provide accurate calculation on the binding-free energy of lncRNA-target dimers to predict if these structures are well targeted together. lncRNATargets provides high accuracy calculations, and this user-friendly program is available for free at http://www.herbbol.org:8001/lrt/ .

...read moreread less

25 citations

Journal Article•DOI•

Quality assurance of the gene ontology using abstraction networks

[...]

Christopher Ochs¹, Yehoshua Perl¹, Michael Halper¹, James Geller¹, Jane Lomax² - Show less +1 more•Institutions (2)

New Jersey Institute of Technology¹, Wellcome Trust Sanger Institute²

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: From the results of this QA effort, it is observed that different kinds of inconsistencies in the modeling of GO can be exposed with the use of the proposed heuristics, and time and effort will be saved during manual reviews of GO's content.

...read moreread less

Abstract: The gene ontology (GO) is used extensively in the field of genomics. Like other large and complex ontologies, quality assurance (QA) efforts for GO's content can be laborious and time consuming. Abstraction networks (AbNs) are summarization networks that reveal and highlight high-level structural and hierarchical aggregation patterns in an ontology. They have been shown to successfully support QA work in the context of various ontologies. Two kinds of AbNs, called the area taxonomy and the partial-area taxonomy, are developed for GO hierarchies and derived specifically for the biological process (BP) hierarchy. Within this framework, several QA heuristics, based on the identification of groups of anomalous terms which exhibit certain taxonomy-defined characteristics, are introduced. Such groups are expected to have higher error rates when compared to other terms. Thus, by focusing QA efforts on anomalous terms one would expect to find relatively more erroneous content. By automatically identifying these potential problem areas within an ontology, time and effort will be saved during manual reviews of GO's content. BP is used as a testbed, with samples of three kinds of anomalous BP terms chosen for a taxonomy-based QA review. Additional heuristics for QA are demonstrated. From the results of this QA effort, it is observed that different kinds of inconsistencies in the modeling of GO can be exposed with the use of the proposed heuristics. For comparison, the results of QA work on a sample of terms chosen from GO's general population are presented.

...read moreread less

24 citations

Journal Article•DOI•

Advances in high throughput DNA sequence data compression.

[...]

Muhammad Sardaraz¹, Muhammad Tahir¹, Ataul Aziz Ikram•Institutions (1)

University of Wah¹

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: This article presents a comprehensive review of compression methods for genome and reads compression, and highlights key challenges and research directions in DNA sequence data compression.

...read moreread less

Abstract: Advances in high throughput sequencing technologies and reduction in cost of sequencing have led to exponential growth in high throughput DNA sequence data. This growth has posed challenges such as storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genome and reads compression. Algorithms are categorized as referential or reference free. Experimental results and comparative analysis of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.

...read moreread less

23 citations

Journal Article•DOI•

GI-SVM: A sensitive method for predicting genomic islands based on unannotated sequence of a single genome.

[...]

Bingxin Lu¹, Hon Wai Leong¹•Institutions (1)

National University of Singapore¹

24 Feb 2016-Journal of Bioinformatics and Computational Biology

TL;DR: GI-SVM provides a more sensitive method for researchers interested in a first-pass detection of GI in newly sequenced genomes, based on one-class support vector machine (SVM), utilizing composition bias in terms of k-mer content.

...read moreread less

Abstract: Genomic islands (GIs) are clusters of functionally related genes acquired by lateral genetic transfer (LGT), and they are present in many bacterial genomes. GIs are extremely important for bacterial research, because they not only promote genome evolution but also contain genes that enhance adaption and enable antibiotic resistance. Many methods have been proposed to predict GI. But most of them rely on either annotations or comparisons with other closely related genomes. Hence these methods cannot be easily applied to new genomes. As the number of newly sequenced bacterial genomes rapidly increases, there is a need for methods to detect GI based solely on sequences of a single genome. In this paper, we propose a novel method, GI-SVM, to predict GIs given only the unannotated genome sequence. GI-SVM is based on one-class support vector machine (SVM), utilizing composition bias in terms of k-mer content. From our evaluations on three real genomes, GI-SVM can achieve higher recall compared with current methods, without much loss of precision. Besides, GI-SVM allows flexible parameter tuning to get optimal results for each genome. In short, GI-SVM provides a more sensitive method for researchers interested in a first-pass detection of GI in newly sequenced genomes.

...read moreread less

21 citations

Journal Article•DOI•

Accurate refinement of docked protein complexes using evolutionary information and deep learning.

[...]

Bahar Akbal-Delibas¹, Roshanak Farhoodi¹, Marc Pomplun¹, Nurit Haspel¹•Institutions (1)

University of Massachusetts Boston¹

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A properly trained deep learning network can accurately predict the root-mean-square-deviation (RMSD) of a docked complex with 1.40 Å error margin on average, by approximating the complex relationship between a wide set of scoring function terms and the RMSD of adocked structure.

...read moreread less

Abstract: One of the major challenges for protein docking methods is to accurately discriminate native-like structures from false positives. Docking methods are often inaccurate and the results have to be refined and re-ranked to obtain native-like complexes and remove outliers. In a previous work, we introduced AccuRefiner, a machine learning based tool for refining protein-protein complexes. Given a docked complex, the refinement tool produces a small set of refined versions of the input complex, with lower root-mean-square-deviation (RMSD) of atomic positions with respect to the native structure. The method employs a unique ranking tool that accurately predicts the RMSD of docked complexes with respect to the native structure. In this work, we use a deep learning network with a similar set of features and five layers. We show that a properly trained deep learning network can accurately predict the RMSD of a docked complex with 1.40 A error margin on average, by approximating the complex relationship between a wide set of scoring function terms and the RMSD of a docked structure. The network was trained on 35000 unbound docking complexes generated by RosettaDock. We tested our method on 25 different putative docked complexes produced also by RosettaDock for five proteins that were not included in the training data. The results demonstrate that the high accuracy of the ranking tool enables AccuRefiner to consistently choose the refinement candidates with lower RMSD values compared to the coarsely docked input structures.

...read moreread less

17 citations

Journal Article•DOI•

Reverse engineering of gene regulatory networks based on S-systems and Bat algorithm.

[...]

Sudip Mandal, Abhinandan Khan¹, Goutam Saha², Rajat Kumar Pal¹•Institutions (2)

University of Calcutta¹, North Eastern Hill University²

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: Bat algorithm, based on the echolocation of bats, has been used to optimize the S-system model parameters and significant improvements in the detection of a greater number of true regulations, and in the minimization of false detections compared to other existing methods are shown.

...read moreread less

Abstract: The correct inference of gene regulatory networks for the understanding of the intricacies of the complex biological regulations remains an intriguing task for researchers. With the availability of large dimensional microarray data, relationships among thousands of genes can be simultaneously extracted. Among the prevalent models of reverse engineering genetic networks, S-system is considered to be an efficient mathematical tool. In this paper, Bat algorithm, based on the echolocation of bats, has been used to optimize the S-system model parameters. A decoupled S-system has been implemented to reduce the complexity of the algorithm. Initially, the proposed method has been successfully tested on an artificial network with and without the presence of noise. Based on the fact that a real-life genetic network is sparsely connected, a novel Accumulative Cardinality based decoupled S-system has been proposed. The cardinality has been varied from zero up to a maximum value, and this model has been implemented for the reconstruction of the DNA SOS repair network of Escherichia coli. The obtained results have shown significant improvements in the detection of a greater number of true regulations, and in the minimization of false detections compared to other existing methods.

...read moreread less

16 citations

Journal Article•DOI•

A method to predict different mechanisms for blood-brain barrier permeability of CNS activity compounds in Chinese herbs using support vector machine.

[...]

Ludi Jiang¹, Jiahua Chen¹, Yusu He¹, Yanling Zhang¹, Gongyu Li¹ - Show less +1 more•Institutions (1)

Beijing University of Chinese Medicine¹

24 Feb 2016-Journal of Bioinformatics and Computational Biology

TL;DR: Four types of prediction models, referring to CNS activity, BBB permeability, passive diffusion and efflux transport, were obtained in the experiment process and discrimination models were utilized to study the BBB properties of the known CNS activity compounds in Chinese herbs and this may guide the CNS drug development.

...read moreread less

Abstract: The blood-brain barrier (BBB), a highly selective barrier between central nervous system (CNS) and the blood stream, restricts and regulates the penetration of compounds from the blood into the brain. Drugs that affect the CNS interact with the BBB prior to their target site, so the prediction research on BBB permeability is a fundamental and significant research direction in neuropharmacology. In this study, we combed through the available data and then with the help of support vector machine (SVM), we established an experiment process for discovering potential CNS compounds and investigating the mechanisms of BBB permeability of them to advance the research in this field four types of prediction models, referring to CNS activity, BBB permeability, passive diffusion and efflux transport, were obtained in the experiment process. The first two models were used to discover compounds which may have CNS activity and also cross the BBB at the same time; the latter two were used to elucidate the mechanism of BBB permeability of those compounds. Three optimization parameter methods, Grid Search, Genetic Algorithm (GA), and Particle Swarm Optimization (PSO), were used to optimize the SVM models. Then, four optimal models were selected with excellent evaluation indexes (the accuracy, sensitivity and specificity of each model were all above 85%). Furthermore, discrimination models were utilized to study the BBB properties of the known CNS activity compounds in Chinese herbs and this may guide the CNS drug development. With the relatively systematic and quick approach, the application rationality of traditional Chinese medicines for treating nervous system disease in the clinical practice will be improved.

...read moreread less

15 citations

Journal Article•DOI•

phraSED-ML: A paraphrased, human-readable adaptation of SED-ML.

[...]

Kiri Choi¹, Lucian P. Smith¹, J. Kyle Medley¹, Herbert M. Sauro¹•Institutions (1)

University of Washington¹

30 Sep 2016-Journal of Bioinformatics and Computational Biology

TL;DR: Model simulation exchange has been standardized with the Simulation Experiment Description Markup Language (SED-ML), but specialized software is needed to generate simulations in this format, and the language specification, libphrasedml library, and source code are available.

...read moreread less

Abstract: Motivation: Model simulation exchange has been standardized with the Simulation Experiment Description Markup Language (SED-ML), but specialized software is needed to generate simulations in this format. Text-based languages allow researchers to create and modify experimental protocols quickly and easily, and export them to a common machine-readable format. Results: phraSED-ML language allows modelers to use simple text commands to encode various elements of SED-ML (models, tasks, simulations, and results) in a format easy to read and modify. The library can translate this script to SED-ML for use in other softwares. Availability: phraSED-ML language specification, libphrasedml library, and source code are available under BSD license from http://phrasedml.sourceforge.net/.

...read moreread less

Journal Article•DOI•

Basis set dependence using DFT/B3LYP calculations to model the Raman spectrum of thymine

[...]

Jakub Bielecki¹, Ewelina Lipiec¹•Institutions (1)

Polish Academy of Sciences¹

24 Feb 2016-Journal of Bioinformatics and Computational Biology

TL;DR: This work provides a set of new scaling factors for Raman spectra calculation in the framework of DFT/B3LYP method and serves as a benchmark for further research on the interaction of ionizing radiation with DNA molecules by means of ab initio calculations and Raman Spectroscopy.

...read moreread less

Abstract: Raman spectroscopy (including surface enhanced Raman spectroscopy (SERS) and tip enhanced Raman spectroscopy (TERS)) is a highly promising experimental method for investigations of biomolecule damage induced by ionizing radiation. However, proper interpretation of changes in experimental spectra for complex systems is often difficult or impossible, thus Raman spectra calculations based on density functional theory (DFT) provide an invaluable tool as an additional layer of understanding of underlying processes. There are many works that address the problem of basis set dependence for energy and bond length consideration, nevertheless there is still lack of consistent research on basis set influence on Raman spectra intensities for biomolecules. This study fills this gap by investigating of the influence of basis set choice for the interpretation of Raman spectra of the thymine molecule calculated using the DFT/B3LYP framework and comparing these results with experimental spectra. Among 19 selected Pople's basis sets, the best agreement was achieved using 6-31[Formula: see text](d,p), 6-31[Formula: see text](d,p) and 6-11[Formula: see text]G(d,p) sets. Adding diffuse functions or polarized functions for small basis set or use of a medium or large basis set without diffuse or polarized functions is not sufficient to reproduce Raman intensities correctly. The introduction of the diffuse functions ([Formula: see text]) on hydrogen atoms is not necessary for gas phase calculations. This work serves as a benchmark for further research on the interaction of ionizing radiation with DNA molecules by means of ab initio calculations and Raman spectroscopy. Moreover, this work provides a set of new scaling factors for Raman spectra calculation in the framework of DFT/B3LYP method.

...read moreread less

Journal Article•DOI•

Parallel workflow manager for non-parallel bioinformatic applications to solve large-scale biological problems on a supercomputer.

[...]

Dmitry A. Suplatov¹, Nina Popova¹, Sergey A. Zhumatiy¹, Vladimir V. Voevodin¹, Vytas K. Švedas¹ - Show less +1 more•Institutions (1)

Moscow State University¹

27 Apr 2016-Journal of Bioinformatics and Computational Biology

TL;DR: The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure and is used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer.

...read moreread less

Abstract: Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes due to systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station, however, due to multiple invocations for a large number of subtasks the full task requires a significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods as well as a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software mpiWrapper has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .

...read moreread less

Journal Article•DOI•

Is the average shortest path length of gene set a reflection of their biological relatedness

[...]

Varsha Embar¹, Adam Handen², Madhavi K. Ganapathiraju²•Institutions (2)

Carnegie Mellon University¹, University of Pittsburgh²

01 Dec 2016-Journal of Bioinformatics and Computational Biology

TL;DR: It is found that disease associated genes and their degree-matched random genes have comparable ASPL, which is a characteristic of the degree of the genes and the network topology, and not that of functional coherence.

...read moreread less

Abstract: When a set of genes are identified to be related to a disease, say through gene expression analysis, it is common to examine the average distance among their protein products in the human interactome as a measure of biological relatedness of these genes. The reasoning for this is that, genes associated with a disease would tend to be functionally related, and that functionally related genes would be closely connected to each other in the interactome. Typically, average shortest path length (ASPL) of disease genes (although referred to as genes in the context of disease-associations, the interactions are among protein-products of these genes) is compared to ASPL of randomly selected genes or to ASPL in a randomly permuted network. We examined whether the ASPL of a set of genes is indeed a good measure of biological relatedness or whether it is simply a characteristic of the degree distribution of those genes. We examined the ASPL of genes sets of some disease and pathway associations and compared them to ASPL of three types of randomly selected control sets: uniform selection, from entire proteome, degree-matched selection, and random permutation of the network. We found that disease associated genes and their degree-matched random genes have comparable ASPL. In other words, ASPL is a characteristic of the degree of the genes and the network topology, and not that of functional coherence.

...read moreread less

Journal Article•DOI•

Stochastic modeling for dynamics of HIV-1 infection using cellular automata: A review.

[...]

Monamorn Precharattana¹•Institutions (1)

Mahidol University¹

24 Feb 2016-Journal of Bioinformatics and Computational Biology

TL;DR: The studies of the dynamics of HIV-1 infection using CA in memory have been presented to recognize how CA have been developed for HIV- 1 dynamics, which issues have been studied already and which issues still are objectives in future studies.

...read moreread less

Abstract: Recently, the description of immune response by discrete models has emerged to play an important role to study the problems in the area of human immunodeficiency virus type 1 (HIV-1) infection, leading to AIDS. As infection of target immune cells by HIV-1 mainly takes place in the lymphoid tissue, cellular automata (CA) models thus represent a significant step in understanding when the infected population is dispersed. Motivated by these, the studies of the dynamics of HIV-1 infection using CA in memory have been presented to recognize how CA have been developed for HIV-1 dynamics, which issues have been studied already and which issues still are objectives in future studies.

...read moreread less

Journal Article•DOI•

Normalization of transposon-mutant library sequencing datasets to improve identification of conditionally essential genes

[...]

Michael A. DeJesus¹, Thomas R. Ioerger¹•Institutions (1)

Texas A&M University¹

14 Jan 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A novel method for normalization of TnSeq datasets that corrects for the skew of read-count distributions by fitting them to a Beta-Geometric distribution is proposed and shown to reduce the number of false positives when comparing replicate datasets grown under the same conditions.

...read moreread less

Abstract: Sequencing of transposon-mutant libraries using next-generation sequencing (TnSeq) has become a popular method for determining which genes and non-coding regions are essential for growth under various conditions in bacteria. For methods that rely on quantitative comparison of counts of reads at transposon insertion sites, proper normalization of TnSeq datasets is vitally important. Real TnSeq datasets are often noisy and exhibit a significant skew that can be dominated by high counts at a small number of sites (often for non-biological reasons). If two datasets that are not appropriately normalized are compared, it might cause the artifactual appearance of Differentially Essential (DE) genes in a statistical test, constituting type I errors (false positives). In this paper, we propose a novel method for normalization of TnSeq datasets that corrects for the skew of read-count distributions by fitting them to a Beta-Geometric distribution. We show that this read-count correction procedure reduces the number of false positives when comparing replicate datasets grown under the same conditions (for which no genuine differences in essentiality are expected). We compare these results to results obtained with other normalization procedures, and show that it results in greater reduction in the number of false positives. In addition we investigate the effects of normalization on the detection of DE genes.

...read moreread less

Journal Article•DOI•

Chromosome evolution in malaria mosquitoes inferred from physically mapped genome assemblies.

[...]

Igor V. Sharakhov¹, Igor V. Sharakhov², Gleb N. Artemov¹, Maria V. Sharakhova², Maria V. Sharakhova¹ - Show less +1 more•Institutions (2)

Tomsk State University¹, Virginia Tech²

27 Apr 2016-Journal of Bioinformatics and Computational Biology

TL;DR: The advances in understanding chromosome evolution in malaria vectors are highlighted and possible future directions in studying mechanisms and biological roles of genome rearrangements are discussed.

...read moreread less

Abstract: Polymorphic inversions in mosquitoes are distributed nonrandomly among chromosomes and are associated with ecological, behavioral, and physiological adaptations related to pathogen transmission. Despite their significance, the patterns and mechanism of genome rearrangements are not well understood. Recent sequencing and physical mapping of the genomes for 16 Anopheles mosquito species provided an opportunity to study chromosome evolution at the highest resolution. New studies revealed that fixed rearrangement accumulated [Formula: see text]3 times faster on the X chromosome than on autosomes. The highest densities of transposable elements (TEs) and satellites of different sizes have also been found on the X chromosome, suggesting a mechanism for the inversion generation. The high rate of X chromosome rearrangements is in sharp contrast with the paucity of polymorphic inversions on the X in the majority of anopheline species. This paper highlights the advances in understanding chromosome evolution in malaria vectors and discusses possible future directions in studying mechanisms and biological roles of genome rearrangements.

...read moreread less

Journal Article•DOI•

Consensus properties and their large-scale applications for the gene duplication problem.

[...]

Jucheol Moon¹, Harris T. Lin¹, Oliver Eulenstein¹•Institutions (1)

Iowa State University¹

06 Mar 2016-Journal of Bioinformatics and Computational Biology

TL;DR: It is shown that the gene duplication problem satisfies a weaker version of the Pareto property where the strict consensus is found in at least one solution (rather than all solutions).

...read moreread less

Abstract: Solving the gene duplication problem is a classical approach for species tree inference from gene trees that are confounded by gene duplications. This problem takes a collection of gene trees and seeks a species tree that implies the minimum number of gene duplications. Wilkinson et al. posed the conjecture that the gene duplication problem satisfies the desirable Pareto property for clusters. That is, for every instance of the problem, all clusters that are commonly present in the input gene trees of this instance, called strict consensus, will also be found in every solution to this instance. We prove that this conjecture does not generally hold. Despite this negative result we show that the gene duplication problem satisfies a weaker version of the Pareto property where the strict consensus is found in at least one solution (rather than all solutions). This weaker property contributes to our design of an efficient scalable algorithm for the gene duplication problem. We demonstrate the performance of our algorithm in analyzing large-scale empirical datasets. Finally, we utilize the algorithm to evaluate the accuracy of standard heuristics for the gene duplication problem using simulated datasets.

...read moreread less

Journal Article•DOI•

Identifying cancer type specific oncogenes and tumor suppressors using limited size data.

[...]

Ana B. Pavel¹, Cristian-Ioan Vasile²•Institutions (2)

Boston University¹, Massachusetts Institute of Technology²

31 Aug 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A method which classifies genes into oncogenes (ONGs) and tumor suppressors for breast cancer, lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and colon adenOCarcinomas (COAD), using data from the cancer genome atlas (TCGA).

...read moreread less

Abstract: Cancer is a complex and heterogeneous genetic disease. Different mutations and dysregulated molecular mechanisms alter the pathways that lead to cell proliferation. In this paper, we explore a method which classifies genes into oncogenes (ONGs) and tumor suppressors. We optimize this method to identify specific (ONGs) and tumor suppressors for breast cancer, lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and colon adenocarcinoma (COAD), using data from the cancer genome atlas (TCGA). A set of genes were previously classified as ONGs and tumor suppressors across multiple cancer types (Science 2013). Each gene was assigned an ONG score and a tumor suppressor score based on the frequency of its driver mutations across all variants from the catalogue of somatic mutations in cancer (COSMIC). We evaluate and optimize this approach within different cancer types from TCGA. We are able to determine known driver genes for each of the four cancer types. After establishing the baseline parameters for...

...read moreread less

Journal Article•DOI•

Global peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using point matching algorithms

[...]

Beichuan Deng¹, Seongho Kim¹, Hengguang Li¹, Elisabeth I. Heath¹, Xiang Zhang² - Show less +1 more•Institutions (2)

Wayne State University¹, University of Louisville²

09 Sep 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A global comparison-based peak alignment method using point matching algorithm (PMA-PA) for both homogeneous and heterogeneous data, which first extracts feature points (peaks) in the chromatography and then searches globally the matching peaks in the consecutive chromatography by adopting the projection of rigid and nonrigid transformation.

...read moreread less

Abstract: Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC[Formula: see text][Formula: see text][Formula: see text]GC-MS) has been used to analyze multiple samples in a metabolomics study. However, due to some uncontrollable experimental conditions, such as the differences in temperature or pressure, matrix effects on samples and stationary phase degradation, there is always a shift of retention times in the two GC columns between samples. In order to correct the retention time shifts in GC[Formula: see text][Formula: see text][Formula: see text]GC-MS, the peak alignment is a crucial data analysis step to recognize the peaks generated by the same metabolite in different samples. Two approaches have been developed for GC[Formula: see text][Formula: see text][Formula: see text]GC-MS data alignment: profile alignment and peak matching alignment. However, these existing alignment methods are all based on a local alignment, resulting that a peak may not be correctly aligned in a dense chromatographic region where many peaks are present in a small region. False alignment will result in false discovery in the downstream statistical analysis. We, therefore, develop a global comparison-based peak alignment method using point matching algorithm (PMA-PA) for both homogeneous and heterogeneous data. The developed algorithm PMA-PA first extracts feature points (peaks) in the chromatography and then searches globally the matching peaks in the consecutive chromatography by adopting the projection of rigid and nonrigid transformation. PMA-PA is further applied to two real experimental data sets, showing that PMA-PA is a promising peak alignment algorithm for both homogenous and heterogeneous data in terms of [Formula: see text]1 score, although it uses only peak location information.

...read moreread less

Journal Article•DOI•

A hybrid model of cell cycle in mammals

[...]

Jonathan Behaegel¹, Jean-Paul Comet¹, Gilles Bernot¹, Emilien Cornillon¹, Franck Delaunay¹ - Show less +1 more•Institutions (1)

University of Nice Sophia Antipolis¹

24 Feb 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A new hybrid modeling framework is presented that extends René Thomas' discrete modeling, to associate with each qualitative state "celerities" allowing us to compute the time spent in each state.

...read moreread less

Abstract: Time plays an essential role in many biological systems, especially in cell cycle. Many models of biological systems rely on differential equations, but parameter identification is an obstacle to use differential frameworks. In this paper, we present a new hybrid modeling framework that extends Rene Thomas' discrete modeling. The core idea is to associate with each qualitative state "celerities" allowing us to compute the time spent in each state. This hybrid framework is illustrated by building a 5-variable model of the mammalian cell cycle. Its parameters are determined by applying formal methods on the underlying discrete model and by constraining parameters using timing observations on the cell cycle. This first hybrid model presents the most important known behaviors of the cell cycle, including quiescent phase and endoreplication.

...read moreread less

Journal Article•DOI•

Metabolic potential of microbial mats and microbialites: Autotrophic capabilities described by an in silico stoichiometric approach from shared genomic resources.

[...]

Daniel Cerqueda-García¹, Luisa I. Falcón¹•Institutions (1)

National Autonomous University of Mexico¹

25 Aug 2016-Journal of Bioinformatics and Computational Biology

TL;DR: It is suggested that microbial mats and microbialites are "metabolic pumps" for the incorporation of inorganic gases and formation of organic matter, and highlighting the relevance of reducing power source.

...read moreread less

Abstract: Microbialites and microbial mats are complex communities with high phylogenetic diversity. These communities are mostly composed of bacteria and archaea, which are the earliest living forms on Earth and relevant to biogeochemical evolution. In this study, we identified the shared metabolic pathways for uptake of inorganic C and N in microbial mats and microbialites based on metagenomic data sets. An in silico analysis for autotrophic pathways was used to trace the paths of C and N to the system, following an elementary flux modes (EFM) approach, resulting in a stoichiometric model. The fragility was analyzed by the minimal cut sets method. We found four relevant pathways for the incorporation of CO2 (Calvin cycle, reverse tricarboxylic acid cycle, reductive acetyl-CoA pathway, and dicarboxylate/4-hydroxybutyrate cycle), some of them present only in archaea, while nitrogen fixation was the most important source of N to the system. The metabolic potential to incorporate nitrate to biomass was also relevant. The fragility of the network was low, suggesting a high redundancy of the autotrophic pathways due to their broad metabolic diversity, and highlighting the relevance of reducing power source. This analysis suggests that microbial mats and microbialites are "metabolic pumps" for the incorporation of inorganic gases and formation of organic matter.

...read moreread less

Journal Article•DOI•

Gene-set activity toolbox (GAT): A platform for microarray-based cancer diagnosis using an integrative gene-set analysis approach.

[...]

Worrawat Engchuan¹, Asawin Meechai¹, Sissades Tongsima, Narumol Doungpan¹, Jonathan H. Chan¹ - Show less +1 more•Institutions (1)

King Mongkut's University of Technology Thonburi¹

25 Aug 2016-Journal of Bioinformatics and Computational Biology

TL;DR: The results show that GAT can be used to build a reasonable disease diagnostic model and the predicted markers have biological relevance.

...read moreread less

Abstract: Cancer is a complex disease that cannot be diagnosed reliably using only single gene expression analysis. Using gene-set analysis on high throughput gene expression profiling controlled by various environmental factors is a commonly adopted technique used by the cancer research community. This work develops a comprehensive gene expression analysis tool (gene-set activity toolbox: (GAT)) that is implemented with data retriever, traditional data pre-processing, several gene-set analysis methods, network visualization and data mining tools. The gene-set analysis methods are used to identify subsets of phenotype-relevant genes that will be used to build a classification model. To evaluate GAT performance, we performed a cross-dataset validation study on three common cancers namely colorectal, breast and lung cancers. The results show that GAT can be used to build a reasonable disease diagnostic model and the predicted markers have biological relevance. GAT can be accessed from http://gat.sit.kmutt.ac.th where GAT's java library for gene-set analysis, simple classification and a database with three cancer benchmark datasets can be downloaded.

...read moreread less

Journal Article•DOI•

Sequential construction of a model for modular gene expression control, applied to spatial patterning of the Drosophila gene hunchback

[...]

Alexander V. Spirov¹, Ekaterina Myasnikova², David M. Holloway³•Institutions (3)

Stony Brook University¹, Saint Petersburg State Polytechnic University², British Columbia Institute of Technology³

27 Apr 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A modeling framework is presented to address some of the gene regulatory dynamics implied by this biological complexity, including spatial patterning of the hunchback gene in Drosophila development and a differential equations model for transcription which takes into account the cis-regulatory architecture of the genes.

...read moreread less

Abstract: Gene network simulations are increasingly used to quantify mutual gene regulation in biological tissues. These are generally based on linear interactions between single-entity regulatory and target genes. Biological genes, by contrast, commonly have multiple, partially independent, cis-regulatory modules (CRMs) for regulator binding, and can produce variant transcription and translation products. We present a modeling framework to address some of the gene regulatory dynamics implied by this biological complexity. Spatial patterning of the hunchback (hb) gene in Drosophila development involves control by three CRMs producing two distinct mRNA transcripts. We use this example to develop a differential equations model for transcription which takes into account the cis-regulatory architecture of the gene. Potential regulatory interactions are screened by a genetic algorithms (GAs) approach and compared to biological expression data.

...read moreread less

Journal Article•DOI•

Analysis of multiple related phenotypes in genome-wide association studies.

[...]

Sohee Oh¹, Iksoo Huh¹, Seung Yeoun Lee², Taesung Park¹•Institutions (2)

Seoul National University¹, Sejong University²

03 Nov 2016-Journal of Bioinformatics and Computational Biology

TL;DR: It is demonstrated through computer simulation that the multivariate approach can improve power for detecting disease-predisposing genetic variants and pleiotropic variants that have simultaneous effects on multiple related phenotypes.

...read moreread less

Abstract: Most genome-wide association studies (GWAS) have been conducted by focusing on one phenotype of interest for identifying genetic variants associated with common complex phenotypes. However, despite many successful results from GWAS, only a small number of genetic variants tend to be identified and replicated given a very stringent genome-wide significance criterion, and explain only a small fraction of phenotype heritability. In order to improve power by using more information from data, we propose an alternative multivariate approach, which considers multiple related phenotypes simultaneously. We demonstrate through computer simulation that the multivariate approach can improve power for detecting disease-predisposing genetic variants and pleiotropic variants that have simultaneous effects on multiple related phenotypes. We apply the multivariate approach to a GWA dataset of 8,842 Korean individuals genotyped for 327,872 SNPs, and detect novel genetic variants associated with metabolic syndrome related phenotypes. Considering several related phenotype simultaneously, the multivariate approach provides not only more powerful results than the conventional univariate approach but also clue to identify pleiotropic genes that are important to the pathogenesis of many related complex phenotypes.

...read moreread less

Journal Article•DOI•

Computational analysis and enzyme assay of inhibitor response to disease single nucleotide polymorphisms (SNPs) in lipoprotein lipase

[...]

Deyong He¹, Ling Huang¹, Yaping Xu¹, Xiaoliang Pan¹, Lijun Liu¹ - Show less +1 more•Institutions (1)

Jinggangshan University¹

03 Nov 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A systematic profile of the lipase inhibitor response of three anti-obesity agents to clinical LPL missense mutations arising from disease single nucleotide polymorphisms (SNPs) was established by integrating complex structure modeling, virtual mutagenesis, molecular dynamics simulations, and binding energy analysis.

...read moreread less

Abstract: Lipoprotein lipase (LPL) is the rate-limiting enzyme for the hydrolysis of the triglyceride (TG) core of circulating TG-rich lipoproteins, chylomicrons, and very low-density lipoproteins. The enzyme has been established as an efficacious and safe therapeutic target for the management of obesity. Here, a systematic profile of the lipase inhibitor response of three anti-obesity agents (Orlistat, Lipstatin, and Cetilistat) to clinical LPL missense mutations arising from disease single nucleotide polymorphisms (SNPs) was established by integrating complex structure modeling, virtual mutagenesis, molecular dynamics (MD) simulations, binding energy analysis, and radiolabeled TG hydrolysis assays. The profile was then used to characterize the resistance and sensitivity of systematic mutation-inhibitor pairs. It is suggested that the Orlistat and Lipstatin have a similar response profile to the investigated mutations due to their homologous chemical structures, but exhibit a distinct profile to that of Cetilistat. Most mutations were predicted to have a modest or moderate effect on inhibitor binding; they are located far away from the enzyme active site and thus can only influence the binding limitedly. A number of mutations were found to sensitize or cause resistance for lipase inhibitors by directly interacting with the inhibitor ligands or by indirectly addressing allosteric effect on enzyme active site. Long-term MD simulations revealed a different noncovalent interaction network at the complex interfaces of Orlistat with wild-type LPL as well as its sensitized mutant H163R and resistant mutant I221T.

...read moreread less

Journal Article•DOI•

A Bayesian approach to analyzing phenotype microarray data enables estimation of microbial growth parameters.

[...]

Matthias Gerstgrasser¹, Sarah Michelle Nicholls², Michael Stout², Katherine A. Smart, Chris D. Powell², Theodore Kypraios², Dov J. Stekel² - Show less +3 more•Institutions (2)

University of Oxford¹, University of Nottingham²

14 Jun 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A novel, Bayesian approach to estimating parameters from Phenotype Microarray data, fitting growth models using Markov Chain Monte Carlo methods to enable high throughput estimation of important information, including length of lag phase, maximal "growth" rate and maximum output is presented.

...read moreread less

Abstract: Biolog phenotype microarrays enable simultaneous, high throughput analysis of cell cultures in different environments. The output is high-density time-course data showing redox curves (approximating growth) for each experimental condition. The software provided with the Omnilog incubator/reader summarizes each time-course as a single datum, so most of the information is not used. However, the time courses can be extremely varied and often contain detailed qualitative (shape of curve) and quantitative (values of parameters) information. We present a novel, Bayesian approach to estimating parameters from Phenotype Microarray data, fitting growth models using Markov Chain Monte Carlo methods to enable high throughput estimation of important information, including length of lag phase, maximal ``growth'' rate and maximum output. We find that the Baranyi model for microbial growth is useful for fitting Biolog data. Moreover, we introduce a new growth model that allows for diauxic growth with a lag phase, which is particularly useful where Phenotype Microarrays have been applied to cells grown in complex mixtures of substrates, for example in industrial or biotechnological applications, such as worts in brewing. Our approach provides more useful information from Biolog data than existing, competing methods, and allows for valuable comparisons between data series and across different models.

...read moreread less

Journal Article•DOI•

Reverse engineering of gene regulatory network using restricted gene expression programming

[...]

Bin Yang, Sanrong Liu, Wei Zhang

03 Nov 2016-Journal of Bioinformatics and Computational Biology

TL;DR: A novel representation of S-system model, named restricted gene expression programming (RGEP), is presented to infer gene regulatory network and a new hybrid evolutionary algorithm based on structure-based evolutionary algorithm and cuckoo search is proposed to optimize the architecture and corresponding parameters of model, respectively.

...read moreread less

Abstract: Inference of gene regulatory networks has been becoming a major area of interest in the field of systems biology over the past decade. In this paper, we present a novel representation of S-system model, named restricted gene expression programming (RGEP), to infer gene regulatory network. A new hybrid evolutionary algorithm based on structure-based evolutionary algorithm and cuckoo search (CS) is proposed to optimize the architecture and corresponding parameters of model, respectively. Two synthetic benchmark datasets and one real biological dataset from SOS DNA repair network in E. coli are used to test the validity of our method. Experimental results demonstrate that our proposed method performs better than previously proposed popular methods.

...read moreread less

Journal Article•DOI•

History of chromosome rearrangements reflects the spatial organization of yeast chromosomes.

[...]

Ekaterina Khrameeva¹, Geoffrey Fudenberg², Mikhail S. Gelfand¹, Mikhail S. Gelfand³, Leonid A. Mirny² - Show less +1 more•Institutions (3)

Russian Academy of Sciences¹, Massachusetts Institute of Technology², Moscow State University³

28 Jan 2016-Journal of Bioinformatics and Computational Biology

TL;DR: In yeast, chromosomes are organized into a Rabl-like configuration, with clustered centromeres and telomeres tethered to the nuclear periphery as discussed by the authors, and the authors show that a consequence of this Rabllike organization is that regions equally distant from Centromeres are more frequently in contact with each other, between arms of both the same and different chromosomes.

...read moreread less

Abstract: Three-dimensional (3D) organization of genomes affects critical cellular processes such as transcription, replication, and deoxyribo nucleic acid (DNA) repair. While previous studies have investigated the natural role, the 3D organization plays in limiting a possible set of genomic rearrangements following DNA repair, the influence of specific organizational principles on this process, particularly over longer evolutionary time scales, remains relatively unexplored. In budding yeast S.cerevisiae, chromosomes are organized into a Rabl-like configuration, with clustered centromeres and telomeres tethered to the nuclear periphery. Hi-C data for S.cerevisiae show that a consequence of this Rabl-like organization is that regions equally distant from centromeres are more frequently in contact with each other, between arms of both the same and different chromosomes. Here, we detect rearrangement events in Saccharomyces species using an automatic approach, and observe increased rearrangement frequency between regions with higher contact frequencies. Together, our results underscore how specific principles of 3D chromosomal organization can influence evolutionary events.

...read moreread less

Journal Article•DOI•

Systematic investigation of metabolic reprogramming in different cancers based on tissue-specific metabolic models.

[...]

Fangzhou Shen¹, Jian Li¹, Ying Zhu¹, Zhuo Wang¹•Institutions (1)

Shanghai Jiao Tong University¹

03 Nov 2016-Journal of Bioinformatics and Computational Biology

TL;DR: This study compared the well-reconstructed tissue-specific models of five cancers, including breast, liver, lung, renal, and urothelial cancer, and their corresponding normal cells to find the tissue- specific metabolic model is more suitable to investigate the cancer metabolism.

...read moreread less

Abstract: Cancer cells have different metabolism in contrast to normal cells. The advancement in omics measurement technology enables the genome-wide characterization of altered cellular processes in cancers, but the metabolic flux landscape of cancer is still far from understood. In this study, we compared the well-reconstructed tissue-specific models of five cancers, including breast, liver, lung, renal, and urothelial cancer, and their corresponding normal cells. There are similar patterns in majority of significantly regulated pathways and enriched pathways in correlated reaction sets. But the differences among cancers are also explicit. The renal cancer demonstrates more dramatic difference with other cancer models, including the smallest number of reactions, flux distribution patterns, and specifically correlated pathways. We also validated the predicted essential genes and revealed the Warburg effect by in silico simulation in renal cancer, which are consistent with the measurements for renal cancer. In conclusion, the tissue-specific metabolic model is more suitable to investigate the cancer metabolism. The similarity and heterogenicity of metabolic reprogramming in different cancers are crucial for understanding the aberrant mechanisms of cancer proliferation, which is fundamental for identifying drug targets and biomarkers.

...read moreread less