
Showing papers in "Journal of Computational Biology in 2018"


Journal ArticleDOI
TL;DR: This study proposed a support vector machine-based model to predict 2'-O-methylation sites in H. sapiens, and the RNA sequences were encoded with the optimal features obtained from feature selection.
Abstract: 2'-O-methylation plays an important biological role in gene expression. Owing to the explosive increase in genomic sequencing data, it is necessary to develop a method for quickly and efficiently identifying whether a sequence contains a 2'-O-methylation site. As a complement to experimental techniques, a computational method may help to identify 2'-O-methylation sites. In this study, based on experimental 2'-O-methylation data from Homo sapiens, we proposed a support vector machine-based model to predict 2'-O-methylation sites in H. sapiens. In this model, the RNA sequences were encoded with the optimal features obtained from feature selection. In the fivefold cross-validation test, the accuracy reached 97.95%.

124 citations
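
A minimal sketch of this kind of workflow: an RBF-kernel SVM evaluated by fivefold cross-validation. The one-hot encoding and random toy labels below are stand-ins for the paper's feature-selected encoding and real data, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def one_hot(seq):
    """Flat one-hot encoding of an RNA sequence (a naive stand-in for the
    paper's feature-selected encoding)."""
    table = {"A": 0, "C": 1, "G": 2, "U": 3}
    vec = np.zeros((len(seq), 4))
    for i, nt in enumerate(seq):
        vec[i, table[nt]] = 1.0
    return vec.ravel()

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGU"), size=41)) for _ in range(200)]
X = np.array([one_hot(s) for s in seqs])
y = rng.integers(0, 2, size=200)            # toy labels: site / non-site

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)   # fivefold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```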


Journal ArticleDOI
TL;DR: An indexing scheme called split sequence bloom trees (SSBTs) is introduced to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments, and is an improvement over the sequence bloom tree (SBT) data structure for the same task.
Abstract: Enormous databases of short-read RNA-seq experiments such as the NIH Sequencing Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.

56 citations
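
The idea behind the split is that each internal node stores a "sim" filter holding k-mers present in every leaf below it (kept once, as high in the tree as possible) and a "rem" filter for k-mers present in only some leaves. A query k-mer found in "sim" is settled for the whole subtree; one found in neither filter is absent from the whole subtree. A toy sketch of that pruning logic, using exact Python sets in place of bloom filters:

```python
# Exact sets stand in for bloom filters; the pruning logic is the point.
class Node:
    def __init__(self, left=None, right=None, kmers=None, name=None):
        self.left, self.right, self.name = left, right, name
        if kmers is not None:                       # leaf
            self.sim, self.rem = set(kmers), set()
        else:                                       # internal node
            a, b = left.sim | left.rem, right.sim | right.rem
            self.sim, self.rem = a & b, (a | b) - (a & b)
            left.sim -= self.sim                    # the parent now owns these
            right.sim -= self.sim

def leaves(node):
    return [node.name] if node.left is None else leaves(node.left) + leaves(node.right)

def query(node, kmers, need):
    """Leaves containing at least `need` of the still-active query k-mers."""
    settled = sum(1 for k in kmers if k in node.sim)   # present in ALL leaves below
    active = [k for k in kmers if k not in node.sim and k in node.rem]
    need -= settled
    if need <= 0:
        return leaves(node)                            # whole subtree passes
    if len(active) < need or node.left is None:
        return []                                      # subtree cannot reach theta
    return query(node.left, active, need) + query(node.right, active, need)

A = Node(kmers={"ACG", "CGT", "GTT"}, name="exp_A")
B = Node(kmers={"ACG", "CGT"}, name="exp_B")
root = Node(left=A, right=B)
print(query(root, ["ACG", "CGT", "GTT"], need=3))       # -> ['exp_A']
```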


Journal ArticleDOI
TL;DR: In this article, a fast approximate read mapping algorithm based on minimizers was combined with a novel MinHash identity estimation technique to achieve both scalability and precision, in contrast to alignment-based seed-and-extend methods with limited scalability and alignment-free methods that trade precision for efficiency.
Abstract: Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM, with a lower memory footprint and a recall of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.

51 citations
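
The identity-estimation side of this approach is easy to sketch. Below is a toy bottom-s MinHash with the Mash-style conversion from an estimated Jaccard index to per-base identity; the minimizer index and the p-value machinery from the paper are omitted, and k and s are illustrative choices:

```python
import hashlib, math, random

def sketch(seq, k=16, s=128):
    """Bottom-s MinHash sketch of the sequence's k-mer set."""
    hashes = {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                             digest_size=8).digest(), "big")
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:s])

def jaccard_estimate(s1, s2, s=128):
    merged = sorted(s1 | s2)[:s]               # bottom-s of the union
    return sum(1 for h in merged if h in s1 and h in s2) / len(merged)

def identity_estimate(j, k=16):
    """Mash-style estimate: identity = 1 + (1/k) * ln(2j / (1 + j))."""
    return 0.0 if j == 0 else 100.0 * (1.0 + math.log(2 * j / (1 + j)) / k)

random.seed(1)
a = "".join(random.choice("ACGT") for _ in range(2000))
b = "".join(c if random.random() > 0.05 else random.choice("ACGT") for c in a)
print("%.1f%% identity" % identity_estimate(jaccard_estimate(sketch(a), sketch(b))))
```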


Journal ArticleDOI
TL;DR: This study defines snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and gives an efficient algorithm for the detection of these more general structures, using the cactus graph.
Abstract: A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].

48 citations


Journal ArticleDOI
TL;DR: Results show that the new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39%–85%, at the price of up to 3x memory consumption during queries.
Abstract: The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive. Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, with a price of up to 3x memory consumption during queries. Notably, it can query a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in under 11 min.

44 citations
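
The "AllSome" split stores, at each node, the bits set in all experiments below it separately from the bits set in only some of them, so a query can accept or reject whole subtrees from the "all" part alone. A minimal sketch of that decomposition at a single node, with numpy boolean arrays standing in for bloom filters:

```python
import numpy as np

# Rows: bloom-filter bit arrays of the experiments under one tree node.
leaf_bits = np.array([[1, 1, 0, 1, 0],
                      [1, 1, 1, 0, 0],
                      [1, 0, 1, 0, 1]], dtype=bool)

b_all = leaf_bits.all(axis=0)              # set in every experiment below
b_some = leaf_bits.any(axis=0) & ~b_all    # set in some, but not all
print(b_all.astype(int))                   # [1 0 0 0 0]
print(b_some.astype(int))                  # [0 1 1 1 1]
```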


Journal ArticleDOI
TL;DR: With the emergence of droplet-based technologies, it has become possible to profile the transcriptomes of several thousands of cells in a day.
Abstract: With the emergence of droplet-based technologies, it has now become possible to profile transcriptomes of several thousands of cells in a day. Although such a large single-cell cohort may ...

33 citations


Journal ArticleDOI
TL;DR: This work introduces the Copy-Number Tree Mixture Deconvolution (CNTMD) problem, which aims to find the phylogenetic tree with the fewest CNAs that explain the copy-number data from multiple samples of a tumor, and designs an algorithm for solving the CNTMD problem and applies the algorithm to both simulated and real data.
Abstract: Cancer is an evolutionary process driven by somatic mutations. This process can be represented as a phylogenetic tree. Constructing such a phylogenetic tree from genome sequencing data is a challenging task due to the many types of mutations in cancer and the fact that nearly all cancer sequencing is of a bulk tumor, measuring a superposition of somatic mutations present in different cells. We study the problem of reconstructing tumor phylogenies from copy-number aberrations (CNAs) measured in bulk-sequencing data. We introduce the Copy-Number Tree Mixture Deconvolution (CNTMD) problem, which aims to find the phylogenetic tree with the fewest CNAs that explain the copy-number data from multiple samples of a tumor. We design an algorithm for solving the CNTMD problem and apply the algorithm to both simulated and real data. On simulated data, we find that our algorithm outperforms existing approaches that either perform deconvolution/factorization of mixed tumor samples or build phylogene...

33 citations
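
CNTMD couples two subproblems: choosing a tree whose leaves are clone copy-number profiles, and deconvolving each bulk sample into a mixture of those clones. The sketch below isolates only the second step, recovering mixing proportions when candidate clone profiles are fixed; all numbers are made up for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical clone copy-number profiles (rows) over 6 genomic bins,
# e.g. the leaves of a candidate phylogenetic tree.
C = np.array([[2, 2, 2, 2, 2, 2],      # diploid ancestor
              [3, 3, 2, 2, 2, 2],      # gain of bins 0-1
              [3, 3, 1, 1, 2, 2]])     # subsequent loss of bins 2-3
U_true = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.2, 0.7]])
F = U_true @ C                          # observed bulk fractional copy numbers

# Non-negative least squares per sample, then renormalize to proportions.
U_hat = np.array([nnls(C.T, f)[0] for f in F])
U_hat /= U_hat.sum(axis=1, keepdims=True)
print(np.round(U_hat, 3))               # recovers U_true on this toy input
```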


Journal ArticleDOI
TL;DR: A novel viral quasispecies reconstruction algorithm, aBayesQR, that employs a maximum-likelihood framework to infer individual sequences in a mixture from high-throughput sequencing data and generally outperforms state-of-the-art methods.
Abstract: RNA viruses replicate with high mutation rates, creating closely related viral populations. The heterogeneous virus populations, referred to as viral quasispecies, rapidly adapt to environmental changes, thus adversely affecting the efficiency of antiviral drugs and vaccines. Therefore, studying the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided an invaluable opportunity for uncovering the structure of quasispecies populations. However, accurate reconstruction of viral quasispecies remains difficult due to limited read lengths and the presence of sequencing errors. The problem is particularly challenging when the strains in a population are highly similar, i.e., the sequences are characterized by low mutual genetic distances, and further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. In this paper, we present a novel viral quasispecies reconstruction algorithm, aBayesQR, that employs a maximum-likelihood framework to infer individual sequences in a mixture from high-throughput sequencing data. The search for the most likely quasispecies is conducted on long contigs that our method constructs from the set of short reads via agglomerative hierarchical clustering; operating on contigs rather than short reads enables identification of close strains in a population and provides computational tractability of the Bayesian method. Results on both simulated and real HIV-1 data demonstrate that the proposed algorithm generally outperforms state-of-the-art methods; aBayesQR particularly stands out when reconstructing a set of closely related viral strains (e.g., quasispecies characterized by low diversity).

29 citations
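
The contig-construction step rests on agglomerative hierarchical clustering of reads by genetic distance. A toy version over reads already aligned to the same window, encoding A/C/G/T as 0-3 and using per-site Hamming distance; the merge threshold is illustrative, not the paper's:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy reads aligned to the same window; rows are reads, columns positions.
reads = np.array([[0, 1, 2, 3, 0, 1],
                  [0, 1, 2, 3, 0, 2],   # 1 mismatch vs read 0
                  [3, 2, 1, 0, 3, 2],
                  [3, 2, 1, 0, 3, 2]])

d = pdist(reads, metric="hamming")      # pairwise per-site mismatch rate
Z = linkage(d, method="average")        # agglomerative clustering
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels)                           # e.g. [1 1 2 2]: two strains
```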


Journal ArticleDOI
TL;DR: This work develops a free interactive web software platform, MixProTool, for processing multigroup proteomics data sets, which provides an integrated data analysis workflow, including quality control assessment, normalization, soft independent modeling of class analogy, statistics, gene ontology enrichment, and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis.
Abstract: Deciphering and visualizing proteomics data is a big challenge for high-throughput proteomics research. In this work, we develop a free interactive web software platform, MixProTool, for processing multigroup proteomics data sets. This tool provides an integrated data analysis workflow, including quality control assessment, normalization, soft independent modeling of class analogy, statistics, gene ontology enrichment, and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis. This software is also highly compatible with the identification and quantification results of various frequently used search engines, such as MaxQuant, Proteome Discoverer, or Mascot. Moreover, all analyzed results can be visualized as vector graphs and tables for further analysis. MixProTool can be conveniently operated by users, even those without bioinformatics training, and it is extremely useful for mining the most relevant features among different samples. MixProTool is deployed at the public shinyapps.io server.

25 citations


Journal ArticleDOI
TL;DR: BBK* is the first provable, ensemble-based CPD algorithm to run in time sublinear in the number of sequences and efficiently performs designs that are too large for previous methods.
Abstract: Computational protein design (CPD) algorithms that compute binding affinity, Ka, search for sequences with an energetically favorable free energy of binding. Recent work shows that three principles improve the biological accuracy of CPD: ensemble-based design, continuous flexibility of backbone and side-chain conformations, and provable guarantees of accuracy with respect to the input. However, previous methods that use all three design principles are single-sequence (SS) algorithms, which are very costly: linear in the number of sequences and thus exponential in the number of simultaneously mutable residues. To address this computational challenge, we introduce BBK*, a new CPD algorithm whose key innovation is the multisequence (MS) bound: BBK* efficiently computes a single provable upper bound to approximate Ka for a combinatorial number of sequences, and avoids SS computation for all provably suboptimal sequences. Thus, to our knowledge, BBK* is the first provable, ensemble-based CPD algorithm to run in time sublinear in the number of sequences. Computational experiments on 204 protein design problems show that BBK* finds the tightest binding sequences while approximating Ka for up to 10^5-fold fewer sequences than the previous state-of-the-art algorithms, which require exhaustive enumeration of sequences. Furthermore, for 51 protein-ligand design problems, BBK* provably approximates Ka up to 1982-fold faster than the previous state-of-the-art iMinDEE/A*/K* algorithm. Therefore, BBK* not only accelerates protein designs that are possible with previous provable algorithms, but also efficiently performs designs that are too large for previous methods.

23 citations
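
The multisequence bound is the heart of BBK*. The toy below replaces its partition-function machinery with a simple additive per-residue score, so that a partial sequence's bound (assigned contributions plus each free position's maximum) provably covers all of its completions; best-first search then returns the optimum without scoring every sequence. All values are hypothetical, and this is only a cartoon of the MS-bound idea:

```python
import heapq, math

# Hypothetical per-position contributions; score(seq) = sum of entries,
# and the "K*-like" objective is exp(score).
contrib = [
    {"A": 1.0, "V": 0.2, "L": 0.4},
    {"A": 0.1, "K": 0.9},
    {"D": 0.5, "E": 0.6, "N": 0.1},
]

def upper_bound(partial):
    # Assigned contributions plus every free position's maximum: one bound
    # that covers the combinatorial set of all completions of `partial`.
    fixed = sum(contrib[i][aa] for i, aa in enumerate(partial))
    free = sum(max(contrib[i].values()) for i in range(len(partial), len(contrib)))
    return fixed + free

def best_sequence():
    # Best-first search; because the bound is admissible, the first fully
    # assigned sequence popped is provably optimal.
    heap = [(-upper_bound(()), ())]
    while heap:
        bound, partial = heapq.heappop(heap)
        if len(partial) == len(contrib):
            return partial, math.exp(-bound)
        for aa in contrib[len(partial)]:
            nxt = partial + (aa,)
            heapq.heappush(heap, (-upper_bound(nxt), nxt))

seq, kstar = best_sequence()
print("".join(seq), kstar)   # AKE, exp(2.5)
```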


Journal ArticleDOI
TL;DR: This investigation has demonstrated that the selected SNPs were effective in the detection of obesity risk, and the ML-based method provides a feasible means of conducting preliminary analyses of the genetic characteristics of obesity.
Abstract: Obesity is a major risk factor for many metabolic diseases. To understand the genetic characteristics of obese individuals, single-nucleotide polymorphisms (SNPs) derived from next-generat...

Journal ArticleDOI
TL;DR: Execution of the BLAST algorithm on high performance computing (HPC) clusters and supercomputers in a massively parallel manner using thousands of processors is presented.
Abstract: Basic Local Alignment Search Tool (BLAST) is an essential algorithm that researchers use for sequence alignment analysis. The National Center for Biotechnology Information (NCBI)-BLAST application is the most popular implementation of the BLAST algorithm. It can run on a single multithreading node. However, the volume of nucleotide and protein data is fast growing, making a single node insufficient. It is increasingly important to develop high-performance computing solutions that could help researchers analyze genetic data in a fast and scalable way. This article presents execution of the BLAST algorithm on high performance computing (HPC) clusters and supercomputers in a massively parallel manner using thousands of processors. The Parallel Computing in Java (PCJ) library has been used to implement the optimal splitting up of the input queries, the work distribution, and search management. It is used with the nonmodified NCBI-BLAST package, which is an additional advantage for the users. The resulting application, PCJ-BLAST, is responsible for reading the sequence for comparison, splitting it up, and starting multiple NCBI-BLAST executables. Since I/O performance could limit sequence analysis performance, the article contains an investigation of this problem. The obtained results show that using Java and the PCJ library it is possible to perform sequence analysis using hundreds of nodes in parallel. We have achieved excellent performance and efficiency, and we have significantly reduced the time required for sequence analysis. Our work also proved that the PCJ library can be used as an effective tool for fast development of scalable applications.
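
PCJ-BLAST itself is written in Java on the PCJ library; a single-machine Python sketch of the same query-splitting pattern around the unmodified blastn binary could look like this. The file names and database name are hypothetical, and blastn must already be installed and on PATH:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def split_fasta(path, n_chunks):
    """Split a multi-FASTA query file into roughly equal chunks."""
    records = Path(path).read_text().split(">")[1:]
    paths = []
    for i in range(n_chunks):
        p = Path(f"chunk_{i}.fa")
        p.write_text("".join(">" + r for r in records[i::n_chunks]))
        paths.append(p)
    return paths

def run_blast(chunk_path):
    # One unmodified NCBI blastn process per chunk; tabular output.
    out = chunk_path.with_suffix(".tsv")
    subprocess.run(["blastn", "-query", str(chunk_path), "-db", "nt_local",
                    "-outfmt", "6", "-out", str(out)], check=True)
    return out

if __name__ == "__main__":
    chunks = split_fasta("queries.fa", n_chunks=8)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_blast, chunks))
    print("partial results:", results)
```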

Journal ArticleDOI
TL;DR: Upcyte hepatocytes represent the most recent technical advancement, combining some features of primary human hepatocytes, such as physiological activity and donor variability, with the ability of cell lines for extended proliferation; nevertheless, more work is needed to develop and validate appropriate in vitro systems for precise prediction of DILI risk.
Abstract: Drug-induced liver injury (DILI) is still the leading single cause of drug failure during clinical phases and after market approval. Currently, many laboratories aim to develop appropriate in vitro systems to predict drug hepatotoxicity. Primary human hepatocytes are still the gold standard, but they have substantial disadvantages such as rapid dedifferentiation in vitro and lack of cell proliferation. In addition to primary human hepatocytes, liver cancer-derived cell lines such as HepG2, cytochrome P450 (CYP450)-overexpressing HepG2 cell clones, and HepaRG have been studied intensively. In contrast to HepG2, HepaRG cells show promising characteristics of differentiated primary human hepatocytes, but they represent only one donor. There is some hope that this lack of donor variability can be solved by the use of iPS-derived hepatocytes. However, iPS technology still seems to need some improvement to produce physiologically relevant hepatocytes. Upcyte hepatocytes represent the most recent technical advancement, combining some features of primary human hepatocytes, such as physiological activity and donor variability, with the ability of cell lines for extended proliferation. Altogether, more work is needed to develop and validate appropriate in vitro systems for precise prediction of DILI risk.

Journal ArticleDOI
TL;DR: REDO is a comprehensive application tool for identifying RNA editing events in plant organelles based on variant call format files from RNA-sequencing data, and it provides several functions such as detailed annotations, statistics, figures, and detection of significantly differential proportions of RNA editing sites among samples.
Abstract: RNA editing is a post-transcriptional or cotranscriptional process that changes the sequence of the precursor transcript by substitutions, insertions, or deletions. Almost all of the land plants undergo RNA editing in organelles (plastids and mitochondria). Although several software tools have been developed to identify RNA editing events, it remains a great challenge to distinguish true RNA editing events from genome variation, sequencing errors, and other factors. Here we introduce REDO, a comprehensive application tool for identifying RNA editing events in plant organelles based on variant call format files from RNA-sequencing data. REDO is a suite of Perl scripts that illustrate a range of attributes of RNA editing events in figures and tables. REDO can also detect RNA editing events in multiple samples simultaneously and identify significantly differential proportions of RNA editing loci. Compared with similar tools, such as REDItools, REDO runs faster, with higher accuracy and more specificity, at the cost of slightly lower sensitivity. Moreover, REDO annotates each RNA editing site in RNAs, whereas REDItools reports only possible RNA editing sites in the genome, which need additional steps to obtain RNA editing profiles for RNAs. Overall, REDO can identify potential RNA editing sites easily and provides several functions such as detailed annotations, statistics, figures, and significantly differential proportions of RNA editing sites among different samples.
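
REDO itself layers annotation and extensive filtering on top of detection, but the core step (reading variant records and keeping strandless C-to-U candidates with their editing proportions) can be sketched as below. This assumes a standard per-sample AD allele-depth field in the VCF; the input file name is hypothetical:

```python
def editing_candidates(vcf_path):
    """Yield (chrom, pos, change, editing proportion) for C-to-U candidates
    from a VCF of RNA-seq variants called against an organellar genome.
    Assumes the per-sample AD field (ref_depth,alt_depth) is present."""
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt, _, _, _, fmt, sample = line.split("\t")[:10]
            if (ref, alt) not in {("C", "T"), ("G", "A")}:
                continue                      # keep C-to-U (either strand) only
            fields = dict(zip(fmt.split(":"), sample.strip().split(":")))
            ref_d, alt_d = map(int, fields["AD"].split(",")[:2])
            if ref_d + alt_d == 0:
                continue
            yield chrom, int(pos), f"{ref}>{alt}", alt_d / (ref_d + alt_d)

for site in editing_candidates("mito_rnaseq.vcf"):   # hypothetical input
    print(site)
```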

Journal ArticleDOI
TL;DR: This work describes an active learning strategy for selecting optimal interventions that reduces the detection error of validated edges as compared with an unguided choice of interventions and avoids redundant interventions, thereby increasing the effectiveness of the experiment.
Abstract: Machine learning methods for learning network structure are applied to quantitative proteomics experiments to reverse-engineer intracellular signal transduction networks. They provide insight into the rewiring of signaling within the context of a disease or a phenotype. To learn the causal patterns of influence between proteins in the network, the methods require experiments that include targeted interventions that fix the activity of specific proteins. However, the interventions are costly and add experimental complexity. We describe an active learning strategy for selecting optimal interventions. Our approach takes as inputs pathway databases and historic data sets, expresses them in the form of prior probability distributions on network structures, and selects interventions that maximize their expected contribution to structure learning. Evaluations on simulated and real data show that the strategy reduces the detection error of validated edges as compared with an unguided choice of interventions and avoids redundant interventions, thereby increasing the effectiveness of the experiment.

Journal ArticleDOI
TL;DR: The performance evaluation results showed that the proposed method can outperform the previous methods, and the patterns learned by the DNNs were visualized as position frequency matrices (PFMs), some of which were very similar to the consensus sequence.
Abstract: Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create a system to predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives an input sequence and returns an answer as to whether it is a splice site. The length of the input is 140 nucleotides, with the consensus sequence (i.e., "GT" and "AG" for the donor and acceptor sites, respectively) in the middle. Each input sequence is applied to the pretrained DNN model, which determines the probability that the input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the patterns learned by the DNNs were visualized as position frequency matrices (PFMs). Some of the PFMs were very similar to the consensus sequence. The trained DNN model and brief source code for the prediction system have been uploaded. Further improvement will be achieved following the further development of DNNs.
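
A sketch of such an architecture in PyTorch: a one-hot 4 x 140 input, a convolutional block feeding a bidirectional LSTM, and a sigmoid output. The layer sizes and kernel width below are guesses for illustration, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn

class SpliceNet(nn.Module):
    """Conv + bidirectional LSTM over a one-hot 140-nt window, emitting the
    probability that the window holds a true splice site."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(32, 32, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 32, 1)

    def forward(self, x):                     # x: (batch, 4, 140) one-hot
        h = self.conv(x)                      # (batch, 32, 70)
        h, _ = self.lstm(h.transpose(1, 2))   # (batch, 70, 64)
        return torch.sigmoid(self.head(h[:, -1]))   # (batch, 1)

model = SpliceNet()
dummy = torch.zeros(8, 4, 140)
print(model(dummy).shape)                     # torch.Size([8, 1])
```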

Journal ArticleDOI
TL;DR: FIESTA is based on parametric bootstrap sampling, and, therefore, avoids unjustified assumptions on the distribution of the heritability estimator, and uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude, compared with previous approaches.
Abstract: Estimation of heritability is an important task in genetics. The use of linear mixed models (LMMs) to determine narrow-sense single-nucleotide polymorphism (SNP)-heritability and related quantities has received much recent attention, due to its ability to account for variants with small effect sizes. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. The common way to report the uncertainty in REML estimation uses standard errors (SEs), which rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals (CIs). In addition, for larger data sets (e.g., tens of thousands of individuals), the construction of SEs itself may require considerable time, as it requires expensive matrix inversions and multiplications. Here, we present FIESTA (Fast confidence IntErvals using STochastic Approximation), a method for constructing accurate CIs. FIESTA is based on parametric bootstrap sampling, and, therefore, avoids unjustified assumptions on the distribution of the heritability estimator. FIESTA uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude compared with previous approaches, as well as with the analytical approximation used by SEs. FIESTA builds accurate CIs rapidly, for example, requiring only several seconds for data sets of tens of thousands of individuals, making FIESTA a very fast solution to the problem of building accurate CIs for heritability for all data set sizes.
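
The percentile form of a parametric bootstrap CI is easy to sketch; FIESTA's contribution is doing this for REML heritability and using stochastic approximation to locate the interval endpoints cheaply rather than brute-force resampling. A generic toy with a bounded parameter, standing in for h^2 in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n):
    """Toy generative model: data ~ Normal(theta, 1)."""
    return rng.normal(theta, 1.0, size=n)

def estimator(x):
    return np.clip(x.mean(), 0.0, 1.0)   # bounded parameter, like h^2

x_obs = simulate(0.15, n=50)
theta_hat = estimator(x_obs)

# Parametric bootstrap: resample from the FITTED model, re-estimate, and
# read the CI off the quantiles of the resampled estimates. Near the
# boundary this remains sensible where the asymptotic SE interval fails.
boot = np.array([estimator(simulate(theta_hat, 50)) for _ in range(5000)])
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"estimate {theta_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```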

Journal ArticleDOI
TL;DR: A new method called non-negative matrix factorization for microbe-disease associations (NMFMDA) is proposed, which uses the Gaussian interaction profile kernel similarity measure to calculate microbe similarity and disease similarity, and applies a logistic function to regulate disease similarity.
Abstract: More and more evidence shows that microbes play crucial roles in human health and disease. The exploration of the relationship between microbes and diseases will help people to better unde...
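
The Gaussian interaction profile (GIP) kernel computes similarity directly from the rows (or columns) of the binary association matrix. A small sketch, with an illustrative logistic adjustment of the disease similarity; the logistic constants are placeholders, not the paper's values:

```python
import numpy as np

def gip_kernel(profiles):
    """Gaussian interaction profile kernel: K[i, j] =
    exp(-gamma * ||IP(i) - IP(j)||^2), with gamma normalized by the
    average squared profile norm."""
    n = profiles.shape[0]
    gamma = n / (profiles ** 2).sum()
    sq = ((profiles[:, None, :] - profiles[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Binary microbe-disease association matrix (rows: microbes, cols: diseases).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]], dtype=float)

microbe_sim = gip_kernel(A)        # similarity between microbes
disease_sim = gip_kernel(A.T)      # similarity between diseases
disease_sim = 1 / (1 + np.exp(-12 * disease_sim + 6))   # logistic regulation
print(np.round(microbe_sim, 2))
```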

Journal ArticleDOI
TL;DR: The results show that although most TE classes are primarily associated with reduced gene expression, Alu elements are associated with upregulated gene expression and suggest a general model where clade-specific short interspersed elements (SINEs) may contribute more to gene regulation than ancient/ancestral TEs.
Abstract: Nearly half of the human genome is made up of transposable elements (TEs), and there is evidence that TEs are involved in gene regulation. In this study, we have integrated publicly available genomic, epigenetic, and transcriptomic data to investigate this in a genome-wide manner. A bootstrapping statistical method was applied to minimize confounder effects from different repeat types. Our results show that although most TE classes are primarily associated with reduced gene expression, Alu elements are associated with upregulated gene expression. Furthermore, Alu elements had the highest probability of any TE class contributing to regulatory regions of any type defined by chromatin state. This suggests a general model where clade-specific short interspersed elements (SINEs) may contribute more to gene regulation than ancient/ancestral TEs. Our exhaustive analysis has extended and updated our understanding of TEs in terms of their global impact on gene regulation and suggests that the most recently derived types of TEs, that is, clade- or species-specific SINEs, have the greatest overall impact on gene regulation.
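
The resampling skeleton behind such an analysis is simple, even though the paper's version additionally matches repeat types to control confounders. A generic bootstrap of an expression shift on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: log-expression of genes with an Alu element nearby vs. genes
# with a matched repeat background (values are synthetic).
expr_alu = rng.normal(1.2, 1.0, size=400)
expr_background = rng.normal(1.0, 1.0, size=400)

obs = expr_alu.mean() - expr_background.mean()

# Bootstrap the difference in means for a distribution-agnostic CI.
diffs = np.array([
    rng.choice(expr_alu, 400).mean() - rng.choice(expr_background, 400).mean()
    for _ in range(10_000)
])
lo, hi = np.quantile(diffs, [0.025, 0.975])
print(f"mean shift {obs:.3f}, 95% bootstrap CI ({lo:.3f}, {hi:.3f})")
```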


Journal ArticleDOI
TL;DR: A functional analysis of the pathways shows their close relationship at the level of gene regulation, indicating that the identified signature genes play an important role in the pathogenesis of cancer and are important for understanding that pathogenesis and enabling early diagnosis.
Abstract: The aim of this study was to identify signature genes for the pathogenesis of cancer, providing theoretical support for its prevention and early diagnosis. A pattern recognition method was used to analyze genome-wide gene expression data collected from The Cancer Genome Atlas (TCGA) database. For the transcriptomes of seven cancers (invasive breast carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, colon adenocarcinoma, renal clear-cell carcinoma, thyroid carcinoma, and hepatocellular carcinoma), signature genes were selected by a combination of statistical methods, such as correlation, the t-test, and confidence intervals. Modeling with an artificial neural network, the accuracy reached as high as 98% on the TCGA data and 92% on independent Gene Expression Omnibus (GEO) data, and the recognition accuracy for stage I exceeded 95%, higher than in previous studies. Two genes, PID1 and SPTBN2, were common to the signature genes of five of the seven cancers. At the same time, we obtained three common cancer pathways using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. A functional analysis of the pathways shows their close relationship at the level of gene regulation, indicating that the identified signature genes play an important role in the pathogenesis of cancer and are important for understanding that pathogenesis and enabling early diagnosis.

Journal ArticleDOI
TL;DR: In this paper, magnetite nanoparticles were synthesized with U. tomentosa leaf aqueous extract within 5 min at room temperature and were characterized by ultraviolet-visible spectrophotometry (UV-vis), Fourier transform infrared spectroscopy (FT-IR), transmission electron microscopy (TEM), and X-ray diffraction (XRD).
Abstract: The present study deals with a rapid, green, large-scale synthesis of magnetite nanoparticles (Fe3O4NPs) using Uncaria tomentosa leaf aqueous extract at ambient temperature. Crystal growth of magnetite nanoparticles from the precursors ferrous chloride (FeCl2) and ferric chloride (FeCl3) with U. tomentosa leaf aqueous extract occurred within 5 min at room temperature. Fe3O4NPs synthesized from the FeCl2 precursor have an advantage in that U. tomentosa acts first as a partial oxidizing agent, converting Fe(II) to Fe(III), and then as a reducing agent to form magnetite nanoparticles (FeO·Fe2O3), while also giving a smaller average particle size of 20 nm. The synthesized Fe3O4NPs were characterized by ultraviolet-visible spectrophotometry (UV-vis), Fourier transform infrared spectroscopy (FT-IR), transmission electron microscopy (TEM), and X-ray diffraction (XRD). The magnetite nanoparticles showed better activity against the green peach aphid than the reference.

Journal ArticleDOI
TL;DR: The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT; this is the first time that the 2-base periodicity of introns has been discovered through a signal processing method.
Abstract: The Ramanujan Fourier transform (RFT) is becoming a popular signal processing method. In this article, RFT is used to detect periodicities in the exons and introns of eukaryotic genomes. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal. Accordingly, the periodicity corresponding to the significant signal can also be regarded as a valuable periodicity. Exons and introns exhibit different periodic phenomena. The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT. It is the first time that the 2-base periodicity of introns has been discovered through a signal processing method. The frequencies of occurrence of the 2-base periodicity and the 3-base periodicity are polar opposites between the exons and the introns. Given the cyclic nature of the Ramanujan sums, which are the basis functions of the transform, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q-nucleotides.
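
Ramanujan sums c_q(n) sum cosines over the residues coprime to q, and correlating a nucleotide-indicator signal against them exposes q-base periodicities. A simplified sketch (a crude correlation profile, not the full RFT inversion):

```python
import math
import numpy as np

def ramanujan_sum(q, n):
    """c_q(n) = sum over a coprime to q of cos(2*pi*a*n/q)."""
    return sum(math.cos(2 * math.pi * a * n / q)
               for a in range(1, q + 1) if math.gcd(a, q) == 1)

def rft_energy(signal, qmax=8):
    """Simplified periodicity profile: correlation of the signal with each
    Ramanujan sequence c_q, normalized by length."""
    N = len(signal)
    return {q: abs(sum(s * ramanujan_sum(q, n) for n, s in enumerate(signal))) / N
            for q in range(2, qmax + 1)}

# Binary indicator of G in a toy sequence with an exact 3-base periodicity.
seq = "GAT" * 20
signal = np.array([1.0 if c == "G" else 0.0 for c in seq])
signal -= signal.mean()                    # remove the DC component
print(rft_energy(signal))                  # q = 3 stands out
```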

Journal ArticleDOI
Weitao Sun
TL;DR: An algorithm for community structure partition is proposed by integrating Miyazawa-Jernigan empirical potential energy as edge weight and a sensitivity parameter is defined to measure the effect of local residue interaction on low-frequency movement, showing that community structure is a more fundamental feature of residue contact networks.
Abstract: The global shape of a protein molecule is believed to be dominant in determining low-frequency deformational motions. However, how structure dynamics relies on residue interactions remains largely unknown. The global residue community structure and the local residue interactions are two important coexisting factors imposing significant effects on low-frequency normal modes. In this work, an algorithm for community structure partition is proposed by integrating Miyazawa-Jernigan empirical potential energy as edge weight. A sensitivity parameter is defined to measure the effect of local residue interaction on low-frequency movement. We show that community structure is a more fundamental feature of residue contact networks. Moreover, we surprisingly find that low-frequency normal mode eigenvectors are sensitive to some local critical residue interaction pairs (CRIPs). A fair amount of CRIPs act as bridges and hold distributed structure components into a unified tertiary structure by bonding nearby communities. Community structure analysis and CRIP detection of 116 catalytic proteins reveal that breaking up of a CRIP can cause low-frequency allosteric movement of a residue at the far side of protein structure. The results imply that community structure and CRIP may be the structural basis for low-frequency motions.
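
The community-partition step amounts to weighted community detection on the residue contact network. A toy with made-up edge weights standing in for Miyazawa-Jernigan contact energies (larger = stronger interaction), using a standard modularity heuristic rather than the paper's algorithm:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy residue contact network; weights are illustrative, not MJ values.
G = nx.Graph()
G.add_weighted_edges_from([
    ("R1", "R2", 5.0), ("R2", "R3", 4.5), ("R1", "R3", 4.0),   # community A
    ("R4", "R5", 5.2), ("R5", "R6", 4.8), ("R4", "R6", 4.1),   # community B
    ("R3", "R4", 0.5),               # weak bridge: a CRIP-like interaction
])

communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])   # two communities joined by R3-R4
```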

Journal ArticleDOI
TL;DR: The results of computational experiments using artificially generated networks and real-world biological networks suggest that the proposed algorithm is useful for identifying these three kinds of vertices for relatively large-scale networks, and that the fraction of critical and intermittent vertices is considerably small.
Abstract: Controlling complex networks through a small number of controller vertices is of great importance in wide-ranging research fields. Recently, a new approach based on the minimum feedback vertex set (MFVS) has been proposed to find such vertices in directed networks in which the target states are restricted to steady states. However, multiple MFVS configurations may exist and thus the selection of vertices may depend on algorithms and input data representations. Our attempts to address this ambiguity led us to adopt an existing approach that classifies vertices into three categories. This approach has been successfully applied to maximum matching-based and minimum dominating set-based controllability analysis frameworks. In this article, we present an algorithm as well as its implementation to compute and evaluate the critical, intermittent, and redundant vertices under the MFVS-based framework, where these three categories include vertices belonging to all MFVSs, some (but not all) MFVSs, and none of the MFVSs, respectively. The results of computational experiments using artificially generated networks and real-world biological networks suggest that the proposed algorithm is useful for identifying these three kinds of vertices for relatively large-scale networks, and that the fraction of critical and intermittent vertices is considerably small. Moreover, an analysis of the signal pathways indicates that critical and intermittent MFVSs tend to be enriched by essential genes.
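
For graphs small enough to enumerate, the critical/intermittent/redundant classification can be read directly off the set of all MFVSs; the paper's algorithm achieves the same classification without exhaustive enumeration. A brute-force sketch:

```python
import itertools
import networkx as nx

def all_mfvs(G):
    """All minimum feedback vertex sets of a small digraph, by brute force."""
    nodes = list(G.nodes)
    for size in range(len(nodes) + 1):
        hits = [set(c) for c in itertools.combinations(nodes, size)
                if nx.is_directed_acyclic_graph(G.subgraph(set(nodes) - set(c)))]
        if hits:
            return hits
    return []

# Two directed cycles sharing vertex 'b'.
G = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")])
mfvss = all_mfvs(G)
critical = set.intersection(*mfvss)              # in every MFVS
intermittent = set.union(*mfvss) - critical      # in some, but not all
redundant = set(G.nodes) - set.union(*mfvss)     # in none
print(mfvss, critical, intermittent, redundant)  # [{'b'}] {'b'} set() {'a', 'c'}
```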

Journal ArticleDOI
TL;DR: Two conflicting aspects of sampling capability are introduced and quantified via statistical (and graphical) analysis tools; this work is expected to advance the adoption of sample-based models as reliable tools for modeling slow protein structural rearrangements.
Abstract: Proteins often undergo slow structural rearrangements that involve several angstroms and surpass the nanosecond timescale. These spatiotemporal scales challenge physics-based simulations and open the way to sample-based models of structural dynamics. This article improves an understanding of current capabilities and limitations of sample-based models of dynamics. Borrowing from widely used concepts in evolutionary computation, this article introduces two conflicting aspects of sampling capability and quantifies them via statistical (and graphical) analysis tools. This allows not only conducting a principled comparison of different sample-based algorithms but also understanding which algorithmic ingredients to use as knobs via which to control sampling and, in turn, the accuracy and detail of modeled structural rearrangements. We demonstrate the latter by proposing two powerful variants of a recently published sample-based algorithm. We believe that this work will advance the adoption of sample-based models as reliable tools for modeling slow protein structural rearrangements.

Journal ArticleDOI
TL;DR: In this article, a dual partition function filtered by Hamming distance is presented, together with a Boltzmann sampler using novel dynamic programming routines for the loop-based energy model.
Abstract: Recently, a framework considering ribonucleic acid (RNA) sequences and their RNA secondary structures as pairs has led to new information theoretic perspectives on how the semantics encoded in RNA sequences can be inferred. In this context, the pairing arises naturally from the energy model of RNA secondary structures. Fixing the sequence in the pairing produces the RNA energy landscape, whose partition function was discovered by McCaskill. Dually, fixing the structure induces the energy landscape of sequences. The latter has been considered for designing more efficient inverse folding algorithms. In this work, we present the dual partition function filtered by Hamming distance, together with a Boltzmann sampler using novel dynamic programming routines for the loop-based energy model. The time complexity of the algorithm is O(h^2 n^2), where h and n denote the Hamming distance and the sequence length, respectively, reducing the time complexity of samplers reported in the literature by O(n). We then present two applications, the first in the context of the evolution of natural sequence-structure pairs of microRNAs and the second in constructing neutral paths. The former studies the inverse folding rate (IFR) of sequence-structure pairs, filtered by Hamming distance, observing that such pairs evolve toward higher levels of robustness, that is, increasing IFR. The latter is an algorithm that constructs neutral paths: given two sequences in a neutral network, we employ the sampler to construct short paths connecting them, consisting of sequences all contained in the neutral network.
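
The "dual" direction fixes the structure and sums Boltzmann weights over sequences, stratified by Hamming distance from a reference. A brute-force toy with a made-up energy model (the paper instead uses dynamic programming over the loop decomposition, which is what makes realistic lengths feasible):

```python
import itertools, math

# Toy energy model: each base pair of the fixed structure contributes -1
# if the two letters can pair (AU, GC, GU wobble), else +2.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def toy_energy(seq, pairs):
    return sum(-1.0 if (seq[i], seq[j]) in PAIRS else 2.0 for i, j in pairs)

def dual_partition_by_distance(sigma0, pairs, RT=0.6):
    """Z_d = sum over sequences at Hamming distance d from sigma0 of
    exp(-E/RT); brute force, so only usable for tiny n."""
    Z = {}
    for seq in itertools.product("ACGU", repeat=len(sigma0)):
        d = sum(a != b for a, b in zip(seq, sigma0))
        Z[d] = Z.get(d, 0.0) + math.exp(-toy_energy(seq, pairs) / RT)
    return Z

# Reference sequence GCAU with structure pairing (0,3) and (1,2).
print(dual_partition_by_distance("GCAU", [(0, 3), (1, 2)]))
```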


Journal ArticleDOI
TL;DR: A computational pipeline whose only input is a Protein Data Bank file containing the 3D coordinates of the atoms of a biomolecule is motivated and presented, and a Mutation Sensitivity Map is developed to permit identifying residues that are most sensitive to mutations.
Abstract: Understanding how an amino acid substitution affects a protein's structure can aid in the design of pharmaceutical drugs that aim at countering diseases caused by protein mutants. Unfortunately, performing even a few amino acid substitutions in vitro is both time and cost prohibitive, whereas an exhaustive analysis that involves systematically mutating all amino acids in the physical protein is infeasible. Computational methods have been developed to predict the effects of mutations, but many of them are computationally intensive or depend on homology or experimental data that may not be available for the protein being studied. In this work, we motivate and present a computational pipeline whose only input is a Protein Data Bank file containing the 3D coordinates of the atoms of a biomolecule. Our high-throughput approach uses our ProMuteHT algorithm to exhaustively generate in silico amino acid substitutions at each residue, and it also includes an energy minimization option. This is in contrast to our previous work, where we analyzed the effects of in silico mutations to Alanine, Serine, and Glycine only. We exploit the speed of a fast rigidity analysis approach to analyze our protein variants, and develop a Mutation Sensitivity (MuSe) Map to permit identifying residues that are most sensitive to mutations. We present a case study to show the degree to which a MuSe Map and whisker plots are able to locate amino acids whose mutations most affect a protein's structure, as inferred from a rigidity analysis approach.

Journal ArticleDOI
TL;DR: A new temporal structure auto-learning model is proposed to automatically reveal longitudinal genotype-phenotype interrelations and exploit the learned structure to enhance phenotype prediction.
Abstract: With rapid progress in high-throughput genotyping and neuroimaging, imaging genetics has gained significant attention in the research of complex brain disorders, such as Alzheimer's Disease (AD). The genotype-phenotype association study using imaging genetic data has the potential to reveal the genetic basis and biological mechanism of brain structure and function. AD is a progressive neurodegenerative disease; thus, it is crucial to look into the relations between SNPs and longitudinal variations of neuroimaging phenotypes. Although some machine learning models were newly presented to capture the longitudinal patterns in genotype-phenotype association study, most of them required fixed longitudinal structures of prediction tasks and could not automatically learn the interrelations among longitudinal prediction tasks. To address this challenge, we proposed a novel temporal structure auto-learning model to automatically uncover longitudinal genotype-phenotype interrelations and utilized such interrelated structures to enhance phenotype prediction in the meantime. We conducted longitudinal phenotype prediction experiments on the ADNI cohort including 3,123 SNPs and two types of imaging markers, VBM and FreeSurfer. Empirical results demonstrated advantages of our proposed model over the counterparts. Moreover, available literature was identified for our top selected SNPs, which demonstrated the rationality of our prediction results. An executable program is available online at https://github.com/littleq1991/sparse_lowRank_regression