
Showing papers in "Journal of Computational Biology in 2022"


Journal ArticleDOI
TL;DR: SCOT is presented, an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets and performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters.
Abstract: Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.

37 citations


Journal ArticleDOI
TL;DR: A binary hybrid metaheuristic-based algorithm for selecting the optimal feature subset that efficiently reduces and selects the feature subset and at the same time results in higher classification accuracy than other methods in the literature.
Abstract: A large number of features lead to very high-dimensional data. Feature selection reduces the dimension of the data, increases the performance of prediction, and reduces the computation time. Feature selection is the process of selecting the optimal set of input features from a given data set in order to reduce the noise in the data and keep the relevant features. The optimal feature subset contains all useful and relevant features and excludes any irrelevant feature, which allows machine learning models to better understand and more efficiently differentiate the patterns in data sets. In this article, we propose a binary hybrid metaheuristic-based algorithm for selecting the optimal feature subset. Concretely, the brain storm optimization algorithm is hybridized with the firefly algorithm and adopted as a wrapper method for feature selection problems on classification data sets. The proposed algorithm is evaluated on 21 data sets and compared with 11 metaheuristic algorithms. In addition, the proposed method is applied to the coronavirus disease data set. The obtained experimental results substantiate the robustness of the proposed hybrid algorithm. It efficiently reduces and selects the feature subset and at the same time results in higher classification accuracy than other methods in the literature.
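As a rough illustration of the wrapper idea behind such methods (a minimal sketch with an illustrative data set and classifier, not the BSO-firefly hybrid itself), a candidate binary feature mask can be scored by the cross-validated accuracy of a classifier restricted to the selected columns, lightly penalized by subset size:

```python
# Hedged sketch of a wrapper-style fitness function for binary feature selection.
# The classifier (k-NN) and data set are illustrative stand-ins, not the paper's setup.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99):
    """Score a binary feature mask: high accuracy, few features."""
    if mask.sum() == 0:          # an empty subset is invalid
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask == 1], y, cv=5).mean()
    # Weighted objective: favor accuracy, lightly penalize subset size.
    return alpha * acc + (1 - alpha) * (1 - mask.sum() / mask.size)

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
mask = rng.integers(0, 2, size=X.shape[1])   # a random candidate solution
print(round(fitness(mask, X, y), 4))
```

A metaheuristic such as the proposed hybrid would then evolve a population of such masks toward higher fitness.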

33 citations


Journal ArticleDOI
TL;DR: A novel algorithm is presented that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse, a major advance in the ability to perform MEM finding against very large collections of related references.
Abstract: Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.

24 citations


Journal ArticleDOI
TL;DR: This work proposes GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-Sequencing (scRNA-Seq) data, which incorporates two intertwined models and designs an unrolled algorithm technique for this framework.
Abstract: We propose GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-Sequencing (scRNA-Seq) data. Our framework incorporates two intertwined models. First, we leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the corresponding genes they regulate, by developing a multitask learning framework. Second, to capture sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-Seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real data sets. Our study also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the simple model where a sequence S undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches.
Abstract: k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
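For the expectation, the model gives a closed form: a k-mer survives only if none of its k positions mutate, so each k-mer is mutated with probability 1 - (1 - r)^k. A minimal sketch (our own function and parameter names, covering only this first-moment result, not the variance or confidence intervals) that checks the formula against simulation:

```python
# Sketch of the expectation under the paper's model: each nucleotide mutates
# independently with probability r, so a k-mer stays unmutated with prob (1-r)^k.
import random

def expected_mutated_kmers(L, k, r):
    """E[number of mutated k-mers] among the L = |S|-k+1 k-mers of S."""
    return L * (1 - (1 - r) ** k)

def simulate(L, k, r, trials=500):
    n = L + k - 1                      # sequence length
    total = 0
    for _ in range(trials):
        mutated = [random.random() < r for _ in range(n)]
        total += sum(any(mutated[i:i + k]) for i in range(L))
    return total / trials

L, k, r = 1000, 21, 0.05
print(expected_mutated_kmers(L, k, r))   # analytic expectation
print(simulate(L, k, r))                 # Monte Carlo estimate agrees closely
```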

10 citations


Journal ArticleDOI
TL;DR: An experimental comparison shows that the CS-ABC approach with the ICA algorithm performs a deeper search in the iterative process, which can avoid premature convergence and produce better results compared with the previously published feature selection algorithm for the NB classifier.
Abstract: The design of an optimal framework for the prediction of cancer from high-dimensional and imbalanced microarray data is a challenging job in the fields of bioinformatics and machine learning. There are many techniques for dimensionality reduction, but it is unclear which of these techniques performs best with different classifiers and datasets. This article focuses on the independent component analysis (ICA) feature (gene) extraction method for Naïve Bayes (NB) classification of microarray data, because ICA extracts independent components from the datasets that satisfy the classification criteria of the NB classifier. A novel hybrid method based on nature-inspired metaheuristic algorithms is proposed in this article for resolving the optimization problem of selecting among the ICA-extracted genes. A hybrid of the cuckoo search (CS) algorithm and artificial bee colony (ABC), which finds the best subset of features to increase the performance of ICA for the NB classifier, is designed and executed. According to our investigation, CS-ABC with ICA was implemented for the first time to resolve the dimensionality reduction problem in high-dimensional microarray biomedical datasets. The CS algorithm improves the local search process of the ABC algorithm, and the hybrid CS-ABC algorithm thus provides better optimal gene sets that improve the classification accuracy of the NB classifier. The experimental comparison shows that the CS-ABC approach with the ICA algorithm performs a deeper search in the iterative process, which avoids premature convergence and produces better results than previously published feature selection algorithms for the NB classifier.
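A minimal sketch of the underlying ICA-then-Naive-Bayes pipeline that the metaheuristic search wraps around (the CS-ABC feature search itself is omitted, and the data set and component count are illustrative stand-ins):

```python
# Hedged sketch of the ICA -> Naive Bayes pipeline the hybrid search is built around.
# The CS-ABC metaheuristic itself is not shown; the data set is an illustrative stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Extract independent components, then classify them with Naive Bayes.
model = make_pipeline(FastICA(n_components=10, random_state=0), GaussianNB())
model.fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 3))
```

In the full method, the components fed to the classifier would be chosen by the CS-ABC search rather than used wholesale.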

9 citations


Journal ArticleDOI
TL;DR: This new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k>1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account.
Abstract: Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k>1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.

8 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the metagenomic Hi-C contact maps normalized by HiCzin result in lower biases, higher capability to detect spurious contacts, and better performance in metagenomic contig clustering.
Abstract: High-throughput chromosome conformation capture (Hi-C) has recently been applied to natural microbial communities and revealed great potential to study multiple genomes simultaneously. Several extraneous factors may influence chromosomal contacts, rendering the normalization of Hi-C contact maps essential for downstream analyses. However, the current paucity of metagenomic Hi-C normalization methods and the disregard of spurious interspecies contacts weaken the interpretability of the data. Here, we report on two types of biases in metagenomic Hi-C experiments: explicit biases and implicit biases, and introduce HiCzin, a parametric model to correct both types of biases and remove spurious interspecies contacts. We demonstrate that the metagenomic Hi-C contact maps normalized by HiCzin result in lower biases, higher capability to detect spurious contacts, and better performance in metagenomic contig clustering.

8 citations


Journal ArticleDOI
TL;DR: This article introduces automatic brain tumor detection from magnetic resonance images (MRIs) and provides novel algorithms for patch extraction and segmentation, trained with Convolutional Neural Networks (CNNs), to identify brain tumors.
Abstract: This article introduces automatic brain tumor detection from magnetic resonance images (MRIs). It provides novel algorithms for patch extraction and segmentation, trained with Convolutional Neural Networks (CNNs), to identify brain tumors. Further, this study combines deep learning and image segmentation using CNN algorithms. The contribution proposes two similar segmentation algorithms: one for Higher Grade Gliomas (HGG) and the other for Lower Grade Gliomas (LGG) in brain tumor patients. The proposed algorithms (intensity normalization, patch extraction, selection of the best patch, segmentation of HGG, and segmentation of LGG) take an MRI as input, identify the gliomas, detect the stage of the tumor, and segment the tumor from the MRIs; the segmentation of both HGG and LGG is performed with CNNs. The segmentation algorithm is compared with different existing algorithms and performs automatic identification with reasonably high accuracy, as reflected in the accuracy and loss curves over training epochs. This article also describes how transfer learning helped in extracting image features, handling image resolution, and increasing segmentation accuracy for LGG patients.

8 citations


Journal ArticleDOI
TL;DR: In this paper, a fractional mathematical model of the human immunodeficiency virus (HIV)/AIDS spread with a fractional derivative of the Caputo type is presented, which includes five compartments corresponding to the variables describing the susceptible patients, HIV-infected patients, people with AIDS but not receiving antiretroviral treatment, patients being treated, and individuals who are immune to HIV infection by sexual contact.
Abstract: This article presents a fractional mathematical model of the human immunodeficiency virus (HIV)/AIDS spread with a fractional derivative of the Caputo type. The model includes five compartments corresponding to the variables describing the susceptible patients, HIV-infected patients, people with AIDS but not receiving antiretroviral treatment, patients being treated, and individuals who are immune to HIV infection by sexual contact. Moreover, it is assumed that the total population is constant. We construct an optimization technique supported by a class of basis functions, consisting of the generalized shifted Jacobi polynomials (GSJPs). The solution of the fractional HIV/AIDS epidemic model is approximated by means of GSJPs with coefficients and parameters in the matrix form. After calculating and combining the operational matrices with the Lagrange multipliers, we obtain the optimization method. Theorems on the existence, uniqueness, and convergence of the method are proved. Several illustrative examples show the performance of the proposed method. Mathematics Subject Classification: 97M60; 41A58; 92C42.
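For reference, the Caputo fractional derivative of order α used in such compartmental models (stated here for 0 < α < 1, as the standard definition rather than anything specific to this paper) is:

```latex
% Caputo fractional derivative of order \alpha, with 0 < \alpha < 1
{}^{C}\!D^{\alpha}_{t}\, f(t) \;=\; \frac{1}{\Gamma(1-\alpha)}
\int_{0}^{t} \frac{f'(\tau)}{(t-\tau)^{\alpha}}\, d\tau
```

In a fractional epidemic model, the integer-order time derivative of each compartment (for example, the susceptible class S(t)) is replaced by this operator.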

Journal ArticleDOI
TL;DR: This model presents a novel setup for predicting gene expression by integrating multimodal data sets in a graph convolutional framework and is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions contributing to a gene's predicted expression.
Abstract: Long-range regulatory interactions among genomic regions are critical for controlling gene expression, and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying genomic organization. To address these limitations, we present a Graph Convolutional Model for Epigenetic Regulation of Gene Expression (GC-MERGE). Using a graph-based framework, the model incorporates important information about long-range interactions via a natural encoding of genomic spatial interactions into the graph representation. It integrates measurements of both the global genomic organization and the local regulatory factors, specifically histone modifications, to not only predict the expression of a given gene of interest but also quantify the importance of its regulatory factors. We apply GC-MERGE to data sets for three cell lines, GM12878 (lymphoblastoid), K562 (myelogenous leukemia), and HUVEC (human umbilical vein endothelial), and demonstrate its state-of-the-art predictive performance. Crucially, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions contributing to a gene's predicted expression. We provide model explanations for multiple exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal data sets in a graph convolutional framework. More importantly, it enables interpretation of the biological mechanisms driving the model's predictions.
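To make the graph-convolutional ingredient concrete, here is a minimal, generic GCN propagation layer (the standard Kipf-Welling normalized-adjacency rule, not GC-MERGE's exact architecture), where nodes stand for genomic regions, edges for spatial interactions such as Hi-C contacts, and node features for histone-modification signals; all names and sizes are illustrative:

```python
# Minimal single GCN layer (normalized-adjacency propagation), illustrative only.
import numpy as np

def gcn_layer(A, H, W):
    """H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) > 0.6).astype(float)      # toy contact adjacency
A = np.triu(A, 1); A = A + A.T                    # symmetric, no self-loops
H = rng.random((6, 5))                            # e.g., 5 histone-mark features per region
W = rng.random((5, 8))                            # layer weights (random here, learned in practice)
print(gcn_layer(A, H, W).shape)                   # (6, 8)
```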

Journal ArticleDOI
TL;DR: By using unbalanced Gromov-Wasserstein optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements, this method gives state-of-the-art alignment performance across five non-coassay data sets (simulated and real world).
Abstract: Multiomic single-cell data allow us to perform integrated analysis to understand genomic regulation of biological processes. However, most single-cell sequencing assays are performed on separately sampled cell populations, as applying them to the same single cell is challenging. Existing unsupervised single-cell alignment algorithms have been primarily benchmarked on coassay experiments. Our investigation revealed that these methods do not perform well for noncoassay single-cell experiments when there is disproportionate cell-type representation across measurement domains. Therefore, we extend our previous work, Single Cell alignment using Optimal Transport (SCOT), by using unbalanced Gromov-Wasserstein optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. Our method, SCOTv2, gives state-of-the-art alignment performance across five non-coassay data sets (simulated and real world). It can also integrate multiple (M≥2) single-cell measurements while preserving the self-tuning capabilities and computational tractability of its original version.

Journal ArticleDOI
TL;DR: Pfeature, a method for computing a wide range of protein features, allows users to compute more than 200,000 features required for predicting the overall function of a protein, residue-level annotation of a protein, and the function of chemically modified peptides.
Abstract: In the last three decades, a wide range of protein features have been discovered to annotate a protein. Numerous attempts have been made to integrate these features in a software package/platform so that the user may compute a wide range of features from a single source. To complement the existing methods, we developed a method, Pfeature, for computing a wide range of protein features. Pfeature allows users to compute more than 200,000 features required for predicting the overall function of a protein, residue-level annotation of a protein, and the function of chemically modified peptides. It has six major modules, namely, composition, binary profiles, evolutionary information, structural features, patterns, and model building. The composition module facilitates computing most of the existing compositional features, plus novel features. The binary profile of amino acid sequences allows computing the fraction of each type of residue as well as its position. The evolutionary information module allows computing the evolutionary information of a protein in the form of a position-specific scoring matrix profile generated using the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), which is fit for annotation of a protein and its residues. A structural module was developed for computing structural features/descriptors from the tertiary structure of a protein. These features are suitable to predict the therapeutic potential of a protein containing non-natural or chemically modified residues. The model-building module allows the implementation of various machine learning techniques for developing classification and regression models as well as feature selection. Pfeature also allows the generation of overlapping patterns and features from a protein. Pfeature is available as a user-friendly web server, Python library, and stand-alone package.
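As a flavor of the simplest feature type such a package computes, here is a hedged sketch of per-residue amino acid composition (our own minimal code, not Pfeature's implementation):

```python
# Hedged sketch of a basic composition feature (per-residue frequency), one of the
# simplest feature types a package like Pfeature computes; not its actual code.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Return the fraction of each of the 20 standard residues in a protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}

print(aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```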

Journal ArticleDOI
TL;DR: This article gives a high-level view of the main components of the data structure and shows how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
Abstract: Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in O(r) space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in O(r) space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
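To illustrate what is being computed (not how MONI computes it), here is a naive quadratic-time enumeration of MEMs between a read and a reference; MONI produces the same kind of output in O(r) space using the r-index and thresholds:

```python
# Illustrative (naive, quadratic-time) definition of maximal exact matches (MEMs);
# this only shows what a MEM is, not MONI's r-index-based algorithm.
def naive_mems(read, ref, min_len=3):
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # skip matches that can be extended to the left (not left-maximal)
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            k = 0
            while i + k < len(read) and j + k < len(ref) and read[i + k] == ref[j + k]:
                k += 1                      # extend to the right as far as possible
            if k >= min_len:
                mems.append((i, j, k))      # (read position, ref position, length)
    return mems

print(naive_mems("ACGTACGGT", "TTACGTACGTT"))
```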

Journal ArticleDOI
TL;DR: A partial least squares-based method (spatial modular patterns [SpaMOD]) was adopted to simultaneously integrate the two data modalities, as well as the networks related to cells and spots, to identify the cell-spot comodules for deciphering the SpaMOD of tissues.
Abstract: Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to analyze the expression level of tissues at cellular resolution. However, it cannot capture the spatial organization of cells in a tissue. Spatially resolved transcriptomics (ST) technologies have been developed to address this issue. However, the emerging STs are still inefficient at single-cell resolution and/or fail to capture sufficient reads. To this end, we adopted a partial least squares-based method (spatial modular patterns [SpaMOD]) to simultaneously integrate the two data modalities, as well as the networks related to cells and spots, to identify the cell-spot comodules for deciphering the spatial modular patterns of tissues. We applied SpaMOD to three paired scRNA-seq and ST datasets, derived from the mouse brain, granuloma, and pancreatic ductal adenocarcinoma, respectively. The identified cell-spot comodules provide detailed biological insights into the spatial relationships between cell populations and their spatial locations in the tissue.

Journal ArticleDOI
TL;DR: In this article, the authors give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths.
Abstract: Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking and transportation planning to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021), and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm, although reporting better coverage, also reports significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms greedy-width (by ≈20%) on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has superior time (4-5×) and space (1.2-2.2×) performance, resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
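For contrast with the safe-path approach, the greedy-width heuristic mentioned above can be sketched in a few lines: repeatedly peel off the source-to-sink path with the largest bottleneck flow. The toy DAG, dict encoding, and function names below are ours:

```python
# Hedged sketch of the greedy-width heuristic discussed above (not the paper's
# safe-and-complete algorithm): repeatedly peel off the widest source-to-sink path.
def widest_path(flow, source, sink, order):
    """Widest (maximum-bottleneck) source-to-sink path via DP over a topological order."""
    best = {v: 0 for v in order}
    pred = {}
    best[source] = float("inf")
    for u in order:
        for v, f in flow[u].items():
            w = min(best[u], f)
            if f > 0 and w > best[v]:
                best[v], pred[v] = w, u
    if best[sink] == 0:
        return None, 0
    path, v = [sink], sink
    while v != source:
        v = pred[v]
        path.append(v)
    return path[::-1], best[sink]

def greedy_width(flow, source, sink, order):
    """Decompose a flow into weighted paths, widest first."""
    paths = []
    while True:
        path, w = widest_path(flow, source, sink, order)
        if path is None:
            return paths
        paths.append((path, w))
        for u, v in zip(path, path[1:]):
            flow[u][v] -= w            # remove the peeled path's weight from the flow

# A small flow of value 6 from s to t, decomposable into 3 weighted paths.
flow = {"s": {"a": 4, "b": 2}, "a": {"t": 3, "b": 1}, "b": {"t": 3}, "t": {}}
for path, w in greedy_width(flow, "s", "t", ["s", "a", "b", "t"]):
    print(path, w)
```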

Journal ArticleDOI
TL;DR: In this article, the authors present an algorithm to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors.
Abstract: Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.
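As a baseline for intuition (not the paper's subquadratic algorithms, and ignoring the anchor overlaps and gap costs they handle), the classic quadratic colinear chaining DP over anchors looks like this; the anchor values are made up:

```python
# Hedged sketch: the classic O(n^2) colinear chaining DP (maximize total anchor length
# over chains of non-overlapping, co-directional anchors). The paper's algorithms are
# subquadratic and handle overlaps and gap costs, which this toy version does not.
def chain(anchors):
    """anchors: list of (x, y, length) exact matches between two sequences."""
    anchors = sorted(anchors)                      # by x, then y
    best = [l for _, _, l in anchors]              # best chain score ending at anchor i
    for i, (xi, yi, li) in enumerate(anchors):
        for j in range(i):
            xj, yj, lj = anchors[j]
            if xj + lj <= xi and yj + lj <= yi:    # anchor j precedes i in both sequences
                best[i] = max(best[i], best[j] + li)
    return max(best)

anchors = [(0, 0, 5), (3, 10, 4), (7, 6, 6), (14, 13, 5)]
print(chain(anchors))   # best colinear chain score for the toy anchors
```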

Journal ArticleDOI
TL;DR: In this article, the authors investigate the transposition and indel distance and present a structure called labeled cycle graph, representing an instance of rearrangement distance problems for genomes with unequal gene content.
Abstract: In the comparative genomics field, one way to infer the evolutionary distance between two organisms of related species is by finding the minimum number of large-scale mutations, called genome rearrangements, that transform one genome into the other. This number is referred to as the rearrangement distance. Since problems in this area emerged in the mid-1990s, several genome rearrangements have been proposed. Rearrangements that do not alter the genome content are called conservative, and in this group we have the following: the reversal, which inverts a segment of the genome; the transposition, which exchanges two consecutive segments; and the double cut and join, which cuts two different pairs of adjacent blocks and joins them differently. Seminal works compared genomes sharing the same set of conserved blocks, but nowadays, researchers started looking at genomes with unequal gene content, by allowing the use of nonconservative rearrangements such as insertion and deletion (jointly called indel). The transposition distance and the transposition and indel distance are both NP-hard. We investigate the transposition and indel distance and present a structure called labeled cycle graph, representing an instance of rearrangement distance problems for genomes with unequal gene content. This structure is used to devise a lower bound and a 2-approximation algorithm for the transposition and indel distance.

Journal ArticleDOI
TL;DR: The authors proposed methods to calculate a confidence range of expression for each transcript, representing its possible abundance across equally optimal estimates for both quantification models, and applied these methods to the Human Body Map data, finding that 35% to 50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability.
Abstract: Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization to transcript quantification, accounting for the reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a “confidence range of expression” for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range informs both whether a transcript has potential estimation error due to nonidentifiability and the extent of the error. Applying our methods to the Human Body Map data, we observe that 35%–50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy of 20%–47% transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that the majority of detected differentially expressed transcripts are reliable with a few exceptions after considering the ranges of the optimal expression estimates.

Journal ArticleDOI
TL;DR: MetaCoAG uses assembly graphs with the composition and coverage information of contigs, estimates the number of initial bins using single-copy marker genes, assigns contigs into bins iteratively, and adjusts the number of bins dynamically throughout the binning process; it significantly outperforms state-of-the-art binning tools by producing similar or more high-quality bins than the second-best binning tool on both simulated and real datasets.
Abstract: Metagenomics enables the recovery of various genetic materials from different species, thus providing valuable insights into microbial communities. Metagenomic binning groups sequences that belong to different organisms, an important step in the early stages of metagenomic analysis pipelines. The classic pipeline followed in metagenomic binning is to assemble short reads into longer contigs and then bin these resulting contigs into groups representing different taxonomic groups in the metagenomic sample. Most of the currently available binning tools are designed to bin metagenomic contigs, but they do not make use of the assembly graphs that produce such assemblies. In this study, we propose MetaCoAG, a metagenomic binning tool that uses assembly graphs with the composition and coverage information of contigs. MetaCoAG estimates the number of initial bins using single-copy marker genes, assigns contigs into bins iteratively, and adjusts the number of bins dynamically throughout the binning process. We show that MetaCoAG significantly outperforms state-of-the-art binning tools by producing similar or more high-quality bins than the second-best binning tool on both simulated and real datasets. To the best of our knowledge, MetaCoAG is the first stand-alone contig-binning tool that directly makes use of the assembly graph information along with other features of the contigs.

Journal ArticleDOI
Liang Chen, Hui Wan, Qiuyan He, Shun He, Min Deng 
TL;DR: This article provides a comprehensive review of emerging microbiome interaction network inference methods and points out several feasible directions of microbial network inference analysis and highlights that future research requires the joint promotion of statistical computation methods and experimental techniques.
Abstract: Microbes can be found almost everywhere in the world. They are not isolated, but rather interact with each other and establish connections with their living environments. Studying these interactions is essential to an understanding of the organization and complex interplay of microbial communities, as well as the structure and dynamics of various ecosystems. A widely used approach toward this objective involves the inference of microbiome interaction networks. However, owing to the compositional, high-dimensional, sparse, and heterogeneous nature of observed microbial data, applying network inference methods to estimate their associations is challenging. In addition, external environmental interference and biological concerns also make it more difficult to deal with the network inference. In this article, we provide a comprehensive review of emerging microbiome interaction network inference methods. According to various research targets, estimated networks are divided into four main categories: correlation networks, conditional correlation networks, mixture networks, and differential networks. Their assumptions, high-level ideas, advantages, as well as limitations, are presented in this review. Since real microbial interactions can be complex and dynamic, no unifying method has, to date, captured all the aspects of interest. In addition, we discuss the challenges now confronting current microbial interaction study and future prospects. Finally, we point out several feasible directions of microbial network inference analysis and highlight that future research requires the joint promotion of statistical computation methods and experimental techniques.

Journal ArticleDOI
TL;DR: The Set-Min sketch is a sketching technique for representing associative maps, inspired by Count-Min, and is applied to the problem of representing k-mer count tables.
Abstract: k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing in output very large count tables explicitly representing k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce the Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that the Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. The space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
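For orientation, a minimal Count-Min sketch applied to k-mer counts is shown below (this is the structure Set-Min is inspired by and improves upon, not the Set-Min construction itself; the hash choice and table sizes are arbitrary):

```python
# Hedged sketch of a basic Count-Min sketch for k-mer counts (the structure Set-Min
# builds on; this is not the Set-Min construction from the paper).
import hashlib

class CountMin:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, kmer, row):
        # One independent-looking hash per row via BLAKE2b personalization.
        h = hashlib.blake2b(kmer.encode(), person=bytes([row]) * 16).digest()
        return int.from_bytes(h[:8], "little") % self.width

    def add(self, kmer, count=1):
        for row in range(self.depth):
            self.table[row][self._index(kmer, row)] += count

    def query(self, kmer):
        # Count-Min never underestimates: the minimum over rows bounds the true count.
        return min(self.table[row][self._index(kmer, row)] for row in range(self.depth))

cm = CountMin()
seq = "ACGTACGTGACGT"
k = 4
for i in range(len(seq) - k + 1):
    cm.add(seq[i:i + k])
print(cm.query("ACGT"))   # upper bound on the number of occurrences of ACGT
```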

Journal ArticleDOI
TL;DR: It is suggested that sharing rewards might simply be a byproduct of hunting, instead of a design strategy aimed at facilitating group coordination, and that current artificial intelligence modeling of human-like coordination in a group setting that assumes rewards sharing as a motivator might not be adequately capturing what is truly necessary for successful coordination.
Abstract: Coordinated hunting is widely observed in animals, and sharing rewards is often considered a major incentive for its success. While current theories about the role played by sharing in coordinated hunting are based on correlational evidence, we reveal the causal roles of sharing rewards through computational modeling with a state-of-the-art Multi-agent Reinforcement Learning (MARL) algorithm. We show that counterintuitively, while selfish agents reach robust coordination, sharing rewards undermines coordination. Hunting coordination modeled through sharing rewards (1) suffers from the free-rider problem, (2) plateaus at a small group size, and (3) is not a Nash equilibrium. Moreover, individually rewarded predators outperform predators that share rewards, especially when the hunting is difficult, the group size is large, and the action cost is high. Our results shed new light on the actual importance of prosocial motives for successful coordination in nonhuman animals and suggest that sharing rewards might simply be a byproduct of hunting, instead of a design strategy aimed at facilitating group coordination. This also highlights that current artificial intelligence modeling of human-like coordination in a group setting that assumes rewards sharing as a motivator (e.g., MARL) might not be adequately capturing what is truly necessary for successful coordination.

Journal ArticleDOI
TL;DR: In this paper , the authors formulates antiviral repositioning as a matrix completion problem, where the antiviral drugs are along the rows and the viruses are on the columns, and uses a novel optimization method called HyPALM (Hybrid Proximal Alternating Linearized Minimization).
Abstract: This study formulates antiviral repositioning as a matrix completion problem wherein the antiviral drugs are along the rows and the viruses are along the columns. The input matrix is partially filled, with ones in positions where the antiviral drug has been known to be effective against a virus. The curated metadata for antivirals (chemical structure and pathways) and viruses (genomic structure and symptoms) are encoded into our matrix completion framework as graph Laplacian regularization. We then frame the resulting multiple graph regularized matrix completion (GRMC) problem as deep matrix factorization. This is solved by using a novel optimization method called HyPALM (Hybrid Proximal Alternating Linearized Minimization). Results of our curated RNA drug–virus association data set show that the proposed approach excels over state-of-the-art GRMC techniques. When applied to in silico prediction of antivirals for COVID-19, our approach returns antivirals that are either used for treating patients or are under trials for the same.
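Schematically (our notation, not the paper's exact HyPALM or deep-factorization formulation), a graph-regularized matrix completion objective of this kind couples a low-rank fit of the observed drug-virus associations with Laplacian penalties built from the drug and virus similarity graphs:

```latex
% Schematic graph-regularized matrix completion objective (illustrative only).
% P_Omega keeps only observed entries of the association matrix X; L_d = D_d - A_d
% and L_v = D_v - A_v are the drug and virus graph Laplacians.
\min_{U,V}\; \left\| P_{\Omega}\!\left(X - U V^{\top}\right) \right\|_F^2
\;+\; \lambda_d\,\mathrm{tr}\!\left(U^{\top} L_d\, U\right)
\;+\; \lambda_v\,\mathrm{tr}\!\left(V^{\top} L_v\, V\right)
```

The Laplacian terms encourage similar drugs (and similar viruses) to receive similar latent factors, so that predicted scores for unobserved drug-virus pairs are informed by the metadata graphs.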

Journal ArticleDOI
TL;DR: This article describes how ML can be used to aid in detecting and studying natural selection patterns using population genomic data with higher predictive accuracy and better resolution.
Abstract: Natural selection has been given a lot of attention because it relates to the adaptation of populations to their environments, both biotic and abiotic. An allele is selected when it is favored by natural selection. Consequently, the favored allele increases in frequency in the population and neighboring linked variation diminishes, causing so-called selective sweeps. High-throughput genomic sequencing allows one to disentangle the evolutionary forces at play in populations. With the development of high-throughput genome sequencing technologies, it has become easier to detect these selective sweeps/selection signatures. Various methods can be used to detect selective sweeps, from simple implementations using summary statistics to complex statistical approaches. One of the important problems of these statistical models is the potential to provide inaccurate results when their assumptions are violated. The use of machine learning (ML) in population genetics has been introduced as an alternative method of detecting selection by treating the detection of selection signatures as a classification problem. Since the availability of population genomics data is increasing, researchers may incorporate ML into these statistical models to infer signatures of selection with higher predictive accuracy and better resolution. This article describes how ML can be used to aid in detecting and studying natural selection patterns using population genomic data.

Journal ArticleDOI
TL;DR: This article describes how to use the Resistor algorithm to predict escape mutations; Resistor employs Pareto optimization on four resistance-conferring criteria (positive and negative design, mutational probability, and hotspot cardinality) to assign a Pareto rank to each prospective mutant.
Abstract: Computational, in silico prediction of resistance-conferring escape mutations could accelerate the design of therapeutics less prone to resistance. This article describes how to use the Resistor algorithm to predict escape mutations. Resistor employs Pareto optimization on four resistance-conferring criteria (positive and negative design, mutational probability, and hotspot cardinality) to assign a Pareto rank to each prospective mutant. It also predicts the mechanism of resistance, that is, whether a mutant ablates binding to a drug, strengthens binding to the endogenous ligand, or a combination of these two factors, and provides structural models of the mutants. Resistor is part of the free and open-source computational protein design software OSPREY.
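A Pareto rank can be computed by non-dominated sorting: rank 1 contains every candidate not dominated by any other, rank 2 contains what remains after removing rank 1, and so on. A minimal sketch (criteria values are invented, larger is treated as better; this is not Resistor's code):

```python
# Hedged sketch of Pareto ranking by non-dominated sorting over criteria vectors
# (here, larger is better); the mutant scores below are made up, not Resistor's.
def dominates(a, b):
    """a dominates b if it is at least as good in every criterion and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points):
    remaining = dict(enumerate(points))
    ranks, rank = {}, 1
    while remaining:
        front = [i for i, p in remaining.items()
                 if not any(dominates(q, p) for j, q in remaining.items() if j != i)]
        for i in front:
            ranks[i] = rank
            del remaining[i]
        rank += 1
    return ranks

# Toy mutants scored on four criteria (e.g., positive design, negative design,
# mutational probability, hotspot count); values are purely illustrative.
mutants = [(3.1, 2.0, 0.8, 2), (2.5, 2.2, 0.9, 1), (3.1, 1.0, 0.5, 1), (1.0, 0.5, 0.2, 0)]
print(pareto_ranks(mutants))
```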

Journal ArticleDOI
TL;DR: In this article, it is shown that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph, even when the alphabet size is restricted to four.
Abstract: The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph G=(V,E) and a pattern P of length m, a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in G exactly matching P significantly faster than O(|E|m) time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in O(|E|m) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of the de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on the de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make the de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than O(|E|m) is unlikely for the de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, such as on the de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic Õ(n√m) time, where n is the text's length.

Journal ArticleDOI
TL;DR: This work provides the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP), and extends the ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors.
Abstract: Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.

Journal ArticleDOI
TL;DR: SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario; it requires tuning only two hyperparameters and is robust to the choice of one.
Abstract: Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.
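A simplified end-to-end sketch of that recipe using the POT optimal transport library (k-NN geodesic distances, an entropic Gromov-Wasserstein coupling, then barycentric projection); the library call, epsilon, k, and toy data are our illustrative choices, not the authors' implementation:

```python
# Hedged, simplified sketch of the SCOT-style recipe described above, using the POT
# library (https://pythonot.github.io); not the authors' code, parameters illustrative.
import numpy as np
import ot
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def intra_domain_distances(X, k=10):
    """k-NN graph, then graph shortest-path distances capture the local geometry."""
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    D = shortest_path(G, directed=False)
    D[np.isinf(D)] = D[~np.isinf(D)].max()   # cap distances across disconnected parts
    return D / D.max()

def scot_like_align(X1, X2, epsilon=0.05, k=10):
    D1, D2 = intra_domain_distances(X1, k), intra_domain_distances(X2, k)
    p = np.full(len(X1), 1 / len(X1))        # uniform marginals over cells
    q = np.full(len(X2), 1 / len(X2))
    # Entropically regularized Gromov-Wasserstein coupling between the two geometries.
    T = ot.gromov.entropic_gromov_wasserstein(D1, D2, p, q, "square_loss", epsilon=epsilon)
    # Barycentric projection of domain 1 onto domain 2.
    return (T / T.sum(axis=1, keepdims=True)) @ X2

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 20)), rng.normal(size=(120, 15))   # toy "omics" matrices
print(scot_like_align(X1, X2).shape)   # (100, 15): domain 1 cells in domain 2's space
```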