
Showing papers in "BMC Bioinformatics in 2020"


Journal ArticleDOI
TL;DR: ’PACVr’ is introduced, an R package that visualizes the coverage depth of a plastid genome assembly in relation to the circular, quadripartite structure of the genome as well as the individual plastome genes, and confirms sequence equality of, and visualizes gene synteny between, the inverted repeat regions of the input genome.
Abstract: Plastid genomes typically display a circular, quadripartite structure with two inverted repeat regions, which challenges automatic assembly procedures. The correct assembly of plastid genomes is a prerequisite for the validity of subsequent analyses on genome structure and evolution. The average coverage depth of a genome assembly is often used as an indicator of assembly quality. Visualizing coverage depth across a draft genome is a critical step, which allows users to inspect the quality of the assembly and, where applicable, identify regions of reduced assembly confidence. Despite the interplay between genome structure and assembly quality, no contemporary, user-friendly software tool can visualize the coverage depth of a plastid genome assembly while taking its quadripartite genome structure into account. A software tool is needed that fills this void. We introduce ’PACVr’, an R package that visualizes the coverage depth of a plastid genome assembly in relation to the circular, quadripartite structure of the genome as well as the individual plastome genes. By using a variable window approach, the tool allows visualizations on different calculation scales. It also confirms sequence equality of, as well as visualizes gene synteny between, the inverted repeat regions of the input genome. As a tool for plastid genomics, PACVr provides the functionality to identify regions of coverage depth above or below user-defined threshold values and helps to identify non-identical IR regions. To allow easy integration into bioinformatic workflows, PACVr can be invoked from a Unix shell, facilitating its use in automated quality control. We illustrate the application of PACVr on four empirical datasets and compare visualizations generated by PACVr with those of alternative software tools. PACVr provides a user-friendly tool to visualize (a) the coverage depth of a plastid genome assembly on a circular, quadripartite plastome map and in relation to individual plastome genes, and (b) gene synteny across the inverted repeat regions. It contributes to optimizing plastid genome assemblies and increasing the reliability of publicly available plastome sequences. The software, example datasets, technical documentation, and a tutorial are available with the package at https://cran.r-project.org/package=PACVr .

161 citations


Journal ArticleDOI
TL;DR: It is demonstrated that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector.
Abstract: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to its close relative from which it was originally derived, namely Principal Component Analysis (PCA). We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper as well as the supplementary ones can be viewed interactively at http://biorg.cs.fiu.edu/plsda. Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.
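The contrast between the two approaches can be illustrated with a minimal sketch on synthetic data (not the authors' experiments): features are ranked by the absolute loadings of the first principal component for PCA, and by the absolute weights of the first PLS component for PLS-DA.

```python
# Minimal sketch (not the authors' code): ranking features with unsupervised PCA
# loadings versus supervised PLS-DA weights on synthetic two-class data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[:, 0] += 2.0 * y          # feature 0 carries the class signal

# PCA ignores y; rank features by |loading| on the first component.
pca = PCA(n_components=1).fit(X)
pca_rank = np.argsort(-np.abs(pca.components_[0]))

# PLS-DA: regress a dummy-coded class label; rank by |weight| of component 1.
pls = PLSRegression(n_components=1).fit(X, y.astype(float))
pls_rank = np.argsort(-np.abs(pls.x_weights_[:, 0]))

print("Top 5 features by PCA:   ", pca_rank[:5])
print("Top 5 features by PLS-DA:", pls_rank[:5])
```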

120 citations


Journal ArticleDOI
TL;DR: The reconstructed metabolic network of D. salina CCAP 19/18 based on the recently published nuclear genome is able to predict the biological behavior under light and nutrient stress and will lead to an improved process understanding for the optimized production of high-value products in microalgae.
Abstract: The green microalga Dunaliella salina accumulates a high proportion of β-carotene during abiotic stress conditions. To better understand the intracellular flux distribution leading to carotenoid accumulation, this work aimed at reconstructing a carbon core metabolic network for D. salina CCAP 19/18 based on the recently published nuclear genome and its validation with experimental observations and literature data. The reconstruction resulted in a network model with 221 reactions and 212 metabolites within three compartments: cytosol, chloroplast and mitochondrion. The network was implemented in the MATLAB toolbox CellNetAnalyzer and checked for feasibility. Furthermore, a flux balance analysis was carried out for different light and nutrient uptake rates. The comparison of the experimental knowledge with the model prediction revealed that the results of the stoichiometric network analysis are plausible and in good agreement with the observed behavior. Accordingly, our model provides an excellent tool for investigating the carbon core metabolism of D. salina. The reconstructed metabolic network of D. salina presented in this work is able to predict the biological behavior under light and nutrient stress and will lead to an improved process understanding for the optimized production of high-value products in microalgae.

86 citations


Journal ArticleDOI
TL;DR: The benefit of having a reliable feature selection method for HD prediction using a minimal number of attributes, instead of having to consider all available ones, is concluded.
Abstract: Heart disease (HD) is one of the most common diseases nowadays, and an early diagnosis of such a disease is a crucial task for many health care providers to prevent their patients from such a disease and to save lives. In this paper, a comparative analysis of different classifiers was performed for the classification of the Heart Disease dataset in order to correctly classify and/or predict HD cases with minimal attributes. The set contains 76 attributes including the class attribute, for 1025 patients collected from Cleveland, Hungary, Switzerland, and Long Beach, but in this paper, only a subset of 14 attributes is used, and each attribute has a given set value. The algorithms used K-Nearest Neighbor (K-NN), Naive Bayes, Decision tree J48, JRip, SVM, Adaboost, Stochastic Gradient Descent (SGD) and Decision Table (DT) classifiers to show the performance of the selected classification algorithms to best classify, and/or predict, the HD cases. It was shown that using different classification algorithms for the classification of the HD dataset gives very promising results in terms of the classification accuracy for the K-NN (K = 1), Decision tree J48 and JRip classifiers, with classification accuracies of 99.7073%, 98.0488% and 97.2683% respectively. A feature extraction method was performed using Classifier Subset Evaluator on the HD dataset, and results show enhanced performance in terms of the classification accuracy for the K-NN (N = 1) and Decision Table classifiers, reaching 100% and 93.8537% respectively after using the selected features, by applying a combination of only up to 4 attributes instead of 13 attributes for the prediction of the HD cases. Different classifiers were used and compared to classify the HD dataset, and we concluded the benefit of having a reliable feature selection method for HD prediction using a minimal number of attributes instead of having to consider all available ones.
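As an illustration of the experiment described above, the following scikit-learn sketch compares K-NN (k = 1) on all predictor attributes versus a small hand-picked subset; the file name and column names are hypothetical stand-ins, not the exact attributes or tooling used in the paper.

```python
# Illustrative sketch only: K-NN (k = 1) on a heart-disease table with 13
# predictor attributes, then again on a hand-picked 4-attribute subset.
# File name and column names are hypothetical stand-ins for the dataset used
# in the paper.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("heart.csv")                    # hypothetical path
X_full = df.drop(columns=["target"])             # 13 predictor attributes
X_sub = df[["cp", "thalach", "oldpeak", "ca"]]   # example 4-attribute subset
y = df["target"]

knn = KNeighborsClassifier(n_neighbors=1)
acc_full = cross_val_score(knn, X_full, y, cv=10).mean()
acc_sub = cross_val_score(knn, X_sub, y, cv=10).mean()
print(f"10-fold CV accuracy, 13 attributes: {acc_full:.3f}")
print(f"10-fold CV accuracy,  4 attributes: {acc_sub:.3f}")
```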

83 citations


Journal ArticleDOI
TL;DR: Signed variational graph auto-encoder (S-VGAE), an improved graph representation learning method, is introduced to automatically learn to encode graph structure into low-dimensional embeddings for PPI prediction.
Abstract: Protein-protein interactions (PPIs) are central to many biological processes. Considering that the experimental methods for identifying PPIs are time-consuming and expensive, it is important to develop automated computational methods to better predict PPIs. Various machine learning methods have been proposed, including a sequence-based deep learning technique that has achieved promising results. However, it only focuses on sequence information while ignoring the structural information of PPI networks. Structural information of PPI networks, such as the degree, position, and neighboring nodes of proteins in a graph, has been proved to be informative in PPI prediction. Facing the challenge of representing graph information, we introduce an improved graph representation learning method. Our model can study PPI prediction based on both sequence information and graph structure. Moreover, our study takes advantage of a representation learning model and employs a graph-based deep learning method for PPI prediction, which shows superiority over existing sequence-based methods. Statistically, our method achieves a state-of-the-art accuracy of 99.15% on the Human Protein Reference Database (HPRD) dataset and also obtains the best results on the Database of Interacting Proteins (DIP) Human, Drosophila, Escherichia coli (E. coli), and Caenorhabditis elegans (C. elegans) datasets. Here, we introduce the signed variational graph auto-encoder (S-VGAE), an improved graph representation learning method, to automatically learn to encode graph structure into low-dimensional embeddings. Experimental results demonstrate that our method outperforms other existing sequence-based methods on several datasets. We also prove the robustness of our model for very sparse networks and the generalization for a new dataset that consists of four datasets: HPRD, E. coli, C. elegans, and Drosophila.

77 citations


Journal ArticleDOI
TL;DR: This work proposed an effective and robust method DPDDI to predict the potential DDIs by utilizing the DDI network information without considering the drug properties, which should also be useful in other DDI-related scenarios, such as the detection of unexpected side effects, and the guidance of drug combination.
Abstract: The treatment of complex diseases by taking multiple drugs has become increasingly popular. However, drug-drug interactions (DDIs) may give rise to the risk of unanticipated adverse effects and even unknown toxicity. DDI detection in the wet lab is expensive and time-consuming. Thus, it is highly desirable to develop computational methods for predicting DDIs. Generally, most of the existing computational methods predict DDIs by extracting the chemical and biological features of drugs from diverse drug-related properties; however, some drug properties are costly to obtain and not available in many cases. In this work, we presented a novel method (namely DPDDI) to predict DDIs by extracting the network structure features of drugs from the DDI network with a graph convolutional network (GCN), using a deep neural network (DNN) model as the predictor. The GCN learns the low-dimensional feature representations of drugs by capturing the topological relationships of drugs in the DDI network. The DNN predictor concatenates the latent feature vectors of any two drugs as the feature vector of the corresponding drug pair to train a DNN for predicting potential drug-drug interactions. Experimental results show that the newly proposed DPDDI method outperforms four other state-of-the-art methods; the GCN-derived latent features include more DDI information than other features derived from chemical, biological or anatomical properties of drugs; and the concatenation feature aggregation operator is better than two other feature aggregation operators (i.e., inner product and summation). The results of case studies confirm that DPDDI achieves reasonable performance in predicting new DDIs. We proposed an effective and robust method, DPDDI, to predict potential DDIs by utilizing DDI network information without considering drug properties (i.e., drug chemical and biological properties). The method should also be useful in other DDI-related scenarios, such as the detection of unexpected side effects and the guidance of drug combinations.
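The predictor stage can be sketched as follows; the random vectors stand in for GCN-derived drug embeddings, and the network is a generic feed-forward classifier rather than the authors' DNN architecture.

```python
# Conceptual sketch of the predictor stage described above (not the DPDDI code):
# latent drug vectors (random placeholders standing in for GCN embeddings) are
# concatenated per drug pair and fed to a feed-forward network.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n_drugs, dim = 100, 32
emb = rng.normal(size=(n_drugs, dim))              # stand-in for GCN embeddings

pairs = rng.integers(0, n_drugs, size=(500, 2))    # candidate drug pairs
labels = rng.integers(0, 2, size=500)              # toy interaction labels

X = np.hstack([emb[pairs[:, 0]], emb[pairs[:, 1]]])  # concatenation aggregator
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, labels)
print("Training accuracy:", clf.score(X, labels))
```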

71 citations


Journal ArticleDOI
TL;DR: ATLAS provides a user-friendly, modular and customizable Snakemake workflow for metagenome data processing; it is easily installable with conda and maintained as open-source on GitHub at https://github.com/metagenome-atlas/atlas.
Abstract: Metagenomics studies provide valuable insight into the composition and function of microbial populations from diverse environments; however, the data processing pipelines that rely on mapping reads to gene catalogs or genome databases for cultured strains yield results that underrepresent the genes and functional potential of uncultured microbes. Recent improvements in sequence assembly methods have eased the reliance on genome databases, thereby allowing the recovery of genomes from uncultured microbes. However, configuring these tools, linking them with advanced binning and annotation tools, and maintaining provenance of the processing continues to be challenging for researchers. Here we present ATLAS, a software package for customizable data processing from raw sequence reads to functional and taxonomic annotations using state-of-the-art tools to assemble, annotate, quantify, and bin metagenome data. Abundance estimates at genome resolution are provided for each sample in a dataset. ATLAS is written in Python and the workflow implemented in Snakemake; it operates in a Linux environment, and is compatible with Python 3.5+ and Anaconda 3+ versions. The source code for ATLAS is freely available, distributed under a BSD-3 license. ATLAS provides a user-friendly, modular and customizable Snakemake workflow for metagenome data processing; it is easily installable with conda and maintained as open-source on GitHub at https://github.com/metagenome-atlas/atlas.

70 citations


Journal ArticleDOI
TL;DR: MethylNet is described, a DNAm deep learning method that can construct embeddings, make predictions, generate new data, and uncover unknown heterogeneity with minimal user supervision that can study cellular differences, grasp higher order information of cancer sub-types, and capture factors associated with smoking in concordance with known differences.
Abstract: DNA methylation (DNAm) is an epigenetic regulator of gene expression programs that can be altered by environmental exposures, aging, and in pathogenesis. Traditional analyses that associate DNAm alterations with phenotypes suffer from multiple hypothesis testing and multi-collinearity due to the high-dimensional, continuous, interacting and non-linear nature of the data. Deep learning analyses have shown much promise to study disease heterogeneity. DNAm deep learning approaches have not yet been formalized into user-friendly frameworks for execution, training, and interpreting models. Here, we describe MethylNet, a DNAm deep learning method that can construct embeddings, make predictions, generate new data, and uncover unknown heterogeneity with minimal user supervision. The results of our experiments indicate that MethylNet can study cellular differences, grasp higher order information of cancer sub-types, estimate age and capture factors associated with smoking in concordance with known differences. The ability of MethylNet to capture nonlinear interactions presents an opportunity for further study of unknown disease, cellular heterogeneity and aging processes.

70 citations


Journal ArticleDOI
TL;DR: The results preliminarily demonstrate the potential of the proposed U-Net+ in correctly spotting microscopy cell nuclei with resource-constrained computing.
Abstract: Cell nuclei segmentation is a fundamental task in microscopy image analysis, based on which multiple biology-related analyses can be performed. Although deep learning (DL) based techniques have achieved state-of-the-art performance in image segmentation tasks, these methods are usually complex and require the support of powerful computing resources. In addition, it is impractical to allocate advanced computing resources to each dark- or bright-field microscope, which are widely employed in clinical institutions, considering the cost of medical exams. Thus, it is essential to develop accurate DL-based segmentation algorithms that work with resource-constrained computing. An enhanced, lightweight U-Net (called U-Net+) with a modified encoder branch is proposed to potentially work with low-resource computing. Through strictly controlled experiments, the average IoU and precision of U-Net+ predictions are confirmed to outperform other prevalent competing methods with a 1.0% to 3.0% gain on the first stage test set of the 2018 Kaggle Data Science Bowl cell nuclei segmentation contest, with shorter inference time. Our results preliminarily demonstrate the potential of the proposed U-Net+ in correctly spotting microscopy cell nuclei with resource-constrained computing.
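For reference, the IoU metric used in the evaluation above can be computed as in this small helper on toy masks (not the authors' evaluation code).

```python
# Small helper illustrating the evaluation metric referred to above:
# intersection over union (IoU) between a predicted and a ground-truth
# binary nucleus mask, shown on toy rectangles.
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
truth = np.zeros((64, 64), dtype=bool); truth[15:45, 15:45] = True
print(f"IoU = {iou(pred, truth):.3f}")
```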

67 citations


Journal ArticleDOI
TL;DR: It is argued and illustrated that the CPI corresponds to a more partial quantification of variable importance and several improvements in its methodology and implementation are suggested that enhance its practical value.
Abstract: Random forest based variable importance measures have become popular tools for assessing the contributions of the predictor variables in a fitted random forest. In this article we reconsider a frequently used variable importance measure, the Conditional Permutation Importance (CPI). We argue and illustrate that the CPI corresponds to a more partial quantification of variable importance and suggest several improvements in its methodology and implementation that enhance its practical value. In addition, we introduce the threshold value in the CPI algorithm as a parameter that can make the CPI more partial or more marginal. By means of extensive simulations, where the original version of the CPI is used as the reference, we examine the impact of the proposed methodological improvements. The simulation results show how the improved CPI methodology increases the interpretability and stability of the computations. In addition, the newly proposed implementation decreases the computation times drastically and is more widely applicable. The improved CPI algorithm is made freely available as an add-on package to the open-source software R. The proposed methodology and implementation of the CPI is computationally faster and leads to more stable results. It has a beneficial impact on practical research by making random forest analyses more interpretable.
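As context, a plain (unconditional) permutation importance can be sketched in a few lines; the conditional variant (CPI) and its threshold parameter are provided by the authors' R add-on package, so this Python baseline only illustrates the underlying permutation idea.

```python
# Concept sketch only: unconditional permutation importance for a random forest.
# The conditional variant (CPI) with its threshold parameter is implemented in
# the authors' R add-on package; this baseline just shows the permutation idea.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
for idx in np.argsort(-result.importances_mean):
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```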

64 citations


Journal ArticleDOI
TL;DR: This work demonstrates that NMF can handle missing values naturally and this property leads to a novel method to determine the rank hyperparameter and argues that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods.
Abstract: Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, existing algorithms and R packages cannot be applied to large matrices due to their slow convergence, or to matrices with missing entries. Besides, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameters also raises concerns about the derived results. We adopt the idea of sequential coordinate-wise descent for NMF to increase the convergence rate. We demonstrate that NMF can handle missing values naturally and this property leads to a novel method to determine the rank hyperparameter. Further, we demonstrate some novel applications of NMF and show how to use masking to inject prior knowledge and desirable properties to achieve a more meaningful decomposition. We show through complexity analysis and experiments that our implementation converges faster than well-known methods. We also show that using NMF for tumour content deconvolution can achieve results similar to existing methods like ISOpure. Our proposed missing value imputation is more accurate than conventional methods like multiple imputation and comparable to missForest while achieving significantly better computational efficiency. Finally, we argue that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods. All algorithms are implemented in the R package NNLM, which is freely available on CRAN and Github.
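A minimal numpy sketch of the two ideas highlighted above, masked NMF that ignores missing entries and rank selection by held-out imputation error; it uses simple multiplicative updates, whereas the NNLM package relies on the faster sequential coordinate-wise descent.

```python
# Minimal numpy sketch (assumptions: multiplicative updates, squared error) of
# NMF that ignores missing entries, and rank selection by imputation error on
# deliberately held-out cells. The NNLM package implements this far more
# efficiently via sequential coordinate-wise descent.
import numpy as np

def nmf_masked(X, mask, rank, n_iter=500, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W, H = rng.random((n, rank)), rng.random((rank, m))
    Xm = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        WH = np.where(mask, W @ H, 0.0)
        W *= (Xm @ H.T) / (WH @ H.T + eps)      # update W on observed cells only
        WH = np.where(mask, W @ H, 0.0)
        H *= (W.T @ Xm) / (W.T @ WH + eps)      # update H on observed cells only
    return W, H

rng = np.random.default_rng(1)
X_true = rng.random((60, 3)) @ rng.random((3, 40))   # true rank is 3
holdout = rng.random(X_true.shape) < 0.1             # hide 10% of the entries
mask = ~holdout
for rank in (1, 2, 3, 4, 5):
    W, H = nmf_masked(X_true, mask, rank)
    err = np.mean((X_true[holdout] - (W @ H)[holdout]) ** 2)
    print(f"rank {rank}: held-out imputation MSE = {err:.4f}")
```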

Journal ArticleDOI
TL;DR: It is shown that the enzyme constraints improve flux predictions and demonstrate, for the first time, that these constraints can markedly change the spectrum of metabolic engineering strategies for different target products.
Abstract: In order to improve the accuracy of constraint-based metabolic models, several approaches have been developed which intend to integrate additional biological information. Two of these methods, MOMENT and GECKO, incorporate enzymatic (kcat) parameters and enzyme mass constraints to further constrain the space of feasible metabolic flux distributions. While both methods have been proven to deliver useful extensions of metabolic models, they may considerably increase size and complexity of the models and there is currently no tool available to fully automate generation and calibration of such enzyme-constrained models from given stoichiometric models. In this work we present three major developments. We first conceived short MOMENT (sMOMENT), a simplified version of the MOMENT approach, which yields the same predictions as MOMENT but requires significantly fewer variables and enables direct inclusion of the relevant enzyme constraints in the standard representation of a constraint-based model. When measurements of enzyme concentrations are available, these can be included as well leading in the extreme case, where all enzyme concentrations are known, to a model representation that is analogous to the GECKO approach. Second, we developed the AutoPACMEN toolbox which allows an almost fully automated creation of sMOMENT-enhanced stoichiometric metabolic models. In particular, this includes the automatic read-out and processing of relevant enzymatic data from different databases and the reconfiguration of the stoichiometric model with embedded enzymatic constraints. Additionally, tools have been developed to adjust (kcat and enzyme pool) parameters of sMOMENT models based on given flux data. We finally applied the new sMOMENT approach and the AutoPACMEN toolbox to generate an enzyme-constrained version of the E. coli genome-scale model iJO1366 and analyze its key properties and differences with the standard model. In particular, we show that the enzyme constraints improve flux predictions (e.g., explaining overflow metabolism and other metabolic switches) and demonstrate, for the first time, that these constraints can markedly change the spectrum of metabolic engineering strategies for different target products. The methodological and tool developments presented herein pave the way for a simplified and routine construction and analysis of enzyme-constrained metabolic models.
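The effect of an enzyme constraint can be illustrated on a toy flux balance problem (not the iJO1366 model or the AutoPACMEN code): capping total enzyme usage, i.e. flux times MW/kcat summed over reactions, shifts the optimum from a single efficient pathway to an overflow-like flux mixture.

```python
# Toy illustration (not AutoPACMEN) of an enzyme-pool constraint in flux
# balance analysis: total enzyme usage, flux * (MW / kcat), is capped, so the
# optimum shifts from pure "respiration" to an overflow-like flux mixture.
from scipy.optimize import linprog

# variables: v_uptake, v_respiration, v_fermentation (arbitrary units, toy values)
c = [0.0, -1.0, -0.2]                 # maximize 1.0*v_resp + 0.2*v_ferm
A_eq = [[1.0, -1.0, -1.0]]            # steady state for internal metabolite A
b_eq = [0.0]
A_ub = [[0.0, 1.0, 0.05]]             # enzyme cost per unit flux (MW / kcat)
b_ub = [4.0]                          # total enzyme pool
bounds = [(0, 10), (0, None), (0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("fluxes (uptake, respiration, fermentation):", res.x.round(3))
print("objective (biomass proxy):", round(-res.fun, 3))
```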

Journal ArticleDOI
TL;DR: Feature selection through ensemble classifiers helps to select important variables and thus is applicable to different sample distributions; experiments demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles.
Abstract: Various methods for differential expression analysis have been widely used to identify features which best distinguish between different categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing holds a non-mainstream position, considering the large computational overhead of large-scale matrix operations. Random forest provides a classification strategy for the calculation of variable importance. However, it may be unsuitable for different distributions of samples. Based on the idea of using an ensemble classifier, we develop ECFS-DEA, a feature selection tool for differential expression analysis on expression profiles. Considering the differences in sample distribution, a graphical user interface is designed to allow the selection of different base classifiers. Inspired by random forest, a common measure which is applicable to any base classifier is proposed for the calculation of variable importance. After an interactive selection of a feature on sorted individual variables, a projection heatmap is presented using k-means clustering. An ROC curve is also provided; both can intuitively demonstrate the effectiveness of the selected feature. Feature selection through ensemble classifiers helps to select important variables and thus is applicable to different sample distributions. Experiments on simulated and realistic data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.

Journal ArticleDOI
TL;DR: PyClone-VI, a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers, is described; it is 10–100× faster than existing methods while providing results that are as accurate.
Abstract: At diagnosis, tumours are typically composed of a mixture of genomically distinct malignant cell populations. Bulk sequencing of tumour samples coupled with computational deconvolution can be used to identify these populations and study cancer evolution. Existing computational methods for population deconvolution are slow and/or potentially inaccurate when applied to large datasets generated by whole genome sequencing. We describe PyClone-VI, a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers. We demonstrate the utility of the method by analyzing data from 1717 patients from the PCAWG study and 100 patients from the TRACERx study. Our proposed method is 10–100× faster than existing methods, while providing results which are as accurate. Software implementing our method is freely available at https://github.com/Roth-Lab/pyclone-vi .

Journal ArticleDOI
TL;DR: A set of currently available, state-of-the-art metagenomics hybrid binning tools is tested, providing a guide for selecting tools for metagenomic binning by comparing the range of purity, completeness, adjusted Rand index, and the number of high-quality reconstructed bins.
Abstract: Shotgun metagenomics based on untargeted sequencing can explore the taxonomic profile and the function of unknown microorganisms in samples, and complement the shortcomings of amplicon sequencing. Binning assembled sequences into individual groups, which represent microbial genomes, is the key step and a major challenge in metagenomic research. Both supervised and unsupervised machine learning methods have been employed in binning. Genome binning, an unsupervised method, clusters contigs into individual genome bins by machine learning methods without the assistance of any reference databases. So far, many genome binning tools have emerged. Evaluating these binning tools is of great significance to microbiological research. In this study, we evaluate 15 genome binning tools, comprising 12 original binning tools and 3 refining binning tools, by comparing their performance on chicken gut metagenomic datasets and the first CAMI challenge datasets. For the chicken gut metagenomic datasets, the original genome binners MetaBat, Groopm2 and Autometa performed better than the other original binners, and MetaWrap, which combined their binning results, generated the most high-quality genome bins. For the CAMI datasets, Groopm2 achieved the highest purity (> 0.9) with good completeness (> 0.8), and reconstructed the most high-quality genome bins among the original genome binners. Compared with Groopm2, MetaBat2 had similar performance with higher completeness and lower purity. The genome refining binner DASTool generated the most high-quality genome bins among all genome binners. Most genome binners performed well for unique strains. Nonetheless, reconstructing common strains is still a substantial challenge for all genome binners. In conclusion, we tested a set of currently available, state-of-the-art metagenomics hybrid binning tools and provide a guide for selecting tools for metagenomic binning by comparing the range of purity, completeness, adjusted Rand index, and the number of high-quality reconstructed bins. Furthermore, available information for future binning strategies is summarized.
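The evaluation criteria mentioned above can be sketched on toy labels: purity and completeness of bins against the genomes of origin, plus the adjusted Rand index (the data below are illustrative, not the chicken gut or CAMI sets).

```python
# Sketch of the evaluation idea: compare predicted bin labels of contigs against
# their true genomes of origin with purity, completeness and the adjusted Rand
# index. Labels below are toy data, not the chicken gut or CAMI datasets.
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = np.array(["gA", "gA", "gA", "gB", "gB", "gC", "gC", "gC"])
bins  = np.array([ 0,    0,    1,    1,    1,    2,    2,    2 ])

def majority_fraction(assigned, reference):
    # fraction of contigs that agree with the majority reference label of their group
    total = 0
    for g in np.unique(assigned):
        members = reference[assigned == g]
        total += np.max(np.unique(members, return_counts=True)[1])
    return total / len(reference)

print("purity      :", round(majority_fraction(bins, truth), 3))
print("completeness:", round(majority_fraction(truth, bins), 3))  # symmetric counterpart
print("ARI         :", round(adjusted_rand_score(truth, bins), 3))
```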

Journal ArticleDOI
TL;DR: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text, which underpins SemMedDB, a literature-scale knowledge graph based on semantic relations.
Abstract: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep’s performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
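For reference, the F1 values reported above follow directly from the harmonic mean of precision (P) and recall (R); for the strict evaluation:

\[ F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.55 \times 0.34}{0.55 + 0.34} \approx 0.42 \]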

Journal ArticleDOI
TL;DR: The results indicate that DR-A significantly enhances clustering performance over state-of-the-art methods, and is well-suited for unsupervised learning tasks for the scRNA-seq data, where labels for cell types are costly and often impossible to acquire.
Abstract: Single-cell RNA sequencing (scRNA-seq) is an emerging technology that can assess the function of an individual cell and cell-to-cell variability at the single cell level in an unbiased manner. Dimensionality reduction is an essential first step in downstream analysis of the scRNA-seq data. However, the scRNA-seq data are challenging for traditional methods due to their high dimensional measurements as well as an abundance of dropout events (that is, zero expression measurements). To overcome these difficulties, we propose DR-A (Dimensionality Reduction with Adversarial variational autoencoder), a data-driven approach to fulfill the task of dimensionality reduction. DR-A leverages a novel adversarial variational autoencoder-based framework, a variant of generative adversarial networks. DR-A is well-suited for unsupervised learning tasks for the scRNA-seq data, where labels for cell types are costly and often impossible to acquire. Compared with existing methods, DR-A is able to provide a more accurate low dimensional representation of the scRNA-seq data. We illustrate this by utilizing DR-A for clustering of scRNA-seq data. Our results indicate that DR-A significantly enhances clustering performance over state-of-the-art methods.

Journal ArticleDOI
TL;DR: Insight is provided on the expected accuracy for metagenomic analyses for different taxonomic groups, and the point at which read length becomes more important than error rate for assigning the correct taxon is established.
Abstract: The first step in understanding ecological community diversity and dynamics is quantifying community membership. An increasingly common method for doing so is through metagenomics. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, the majority of these tools have been designed and benchmarked using highly accurate short read data (i.e. Illumina), with few studies benchmarking classification accuracy for long error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities. Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences (PacBio) with high accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that very generally, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and this is most pronounced for non-microbial communities. This work provides insight on the expected accuracy for metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.

Journal ArticleDOI
TL;DR: A new DTI prediction method is proposed in which ensembles of multi-output bi-clustering trees (eBICT) are learned on reconstructed networks; output space reconstruction can boost the predictive performance of tree-ensemble learning methods, yielding more accurate DTI predictions.
Abstract: Computational prediction of drug-target interactions (DTI) is vital for drug discovery. The experimental identification of interactions between drugs and target proteins is very onerous. Modern technologies have mitigated the problem, leveraging the development of new drugs. However, drug development remains extremely expensive and time consuming. Therefore, in silico DTI predictions based on machine learning can alleviate the burdensome task of drug development. Many machine learning approaches have been proposed over the years for DTI prediction. Nevertheless, prediction accuracy and efficiency are persisting problems that still need to be tackled. Here, we propose a new learning method which addresses DTI prediction as a multi-output prediction task by learning ensembles of multi-output bi-clustering trees (eBICT) on reconstructed networks. In our setting, the nodes of a DTI network (drugs and proteins) are represented by features (background information). The interactions between the nodes of a DTI network are modeled as an interaction matrix and compose the output space in our problem. The proposed approach integrates background information from both drug and target protein spaces into the same global network framework. We performed an empirical evaluation, comparing the proposed approach to state of the art DTI prediction methods and demonstrated the effectiveness of the proposed approach in different prediction settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein networks. We show that output space reconstruction can boost the predictive performance of tree-ensemble learning methods, yielding more accurate DTI predictions. We proposed a new DTI prediction method where bi-clustering trees are built on reconstructed networks. Building tree-ensemble learning models with output space reconstruction leads to superior prediction results, while preserving the advantages of tree-ensembles, such as scalability, interpretability and inductive setting.

Journal ArticleDOI
TL;DR: It is shown that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities.
Abstract: The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially, when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction – a process called ‘end-to-end learning’ – has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
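A minimal PyTorch sketch of the comparison described above: a learnable amino acid embedding trained end-to-end versus a frozen random embedding of the same dimension; the toy classifier is only a placeholder for the RNN/CNN architectures used in the study.

```python
# Minimal PyTorch sketch of the comparison described above: a learnable,
# end-to-end trained amino acid embedding versus a frozen random embedding of
# the same dimension (toy model; architecture details differ from the paper).
import torch
import torch.nn as nn

n_amino_acids, dim = 20, 8

learned = nn.Embedding(n_amino_acids, dim)                # trained end-to-end
random_fixed = nn.Embedding(n_amino_acids, dim)
random_fixed.weight.requires_grad_(False)                 # random vectors, frozen

class TinyClassifier(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        self.embedding = embedding
        self.head = nn.Linear(dim, 2)                     # toy binary prediction task
    def forward(self, seq):                               # seq: (batch, length)
        return self.head(self.embedding(seq).mean(dim=1)) # mean-pool over residues

seqs = torch.randint(0, n_amino_acids, (4, 30))           # 4 toy sequences
for name, emb in [("end-to-end", learned), ("random fixed", random_fixed)]:
    logits = TinyClassifier(emb)(seqs)
    print(name, "logits shape:", tuple(logits.shape))
```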

Journal ArticleDOI
TL;DR: DTI-CNN, a learning-based method built on feature representation learning and a deep neural network, is proposed to predict drug-target interactions and obtains better performance than three existing state-of-the-art methods.
Abstract: Drug-target interaction prediction is of great significance for narrowing down the scope of candidate medications, and thus is a vital step in drug discovery. Because of the particularity of biochemical experiments, the development of new drugs is not only costly, but also time-consuming. Therefore, the computational prediction of drug-target interactions has become an essential way in the process of drug discovery, aiming to greatly reduce the experimental cost and time. We propose a learning-based method, named DTI-CNN, based on feature representation learning and a deep neural network to predict drug-target interactions. We first extract the relevant features of drugs and proteins from heterogeneous networks by using the Jaccard similarity coefficient and a restart random walk model. Then, we adopt a denoising autoencoder model to reduce the dimension and identify the essential features. Third, based on the features obtained from the last step, we construct a convolutional neural network model to predict the interaction between drugs and proteins. The evaluation results show that the average AUROC score and AUPR score of DTI-CNN were 0.9416 and 0.9499, respectively, which is better performance than the other three existing state-of-the-art methods. All the experimental results show that the performance of DTI-CNN is better than that of the three existing methods and that the proposed method is appropriately designed.
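Two of the ingredients named above, the Jaccard similarity coefficient and a random walk with restart over a network, can be sketched as follows (illustrative only, not the DTI-CNN implementation).

```python
# Sketch of two ingredients mentioned above (not the DTI-CNN implementation):
# Jaccard similarity between association profiles and a random walk with
# restart (RWR) over a similarity network to obtain diffusion-based features.
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

print("Jaccard:", jaccard({"P1", "P2", "P3"}, {"P2", "P3", "P4"}))  # 0.5

def rwr(W, seed_idx, restart=0.5, tol=1e-8):
    """Random walk with restart on a column-normalized adjacency matrix."""
    W = W / W.sum(axis=0, keepdims=True)
    e = np.zeros(W.shape[0]); e[seed_idx] = 1.0
    p = e.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print("RWR profile from node 0:", rwr(A, 0).round(3))
```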

Journal ArticleDOI
TL;DR: This work parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures and demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC).
Abstract: Nanopore sequencing enables portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these outcomes requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. However, comparing raw nanopore signals to a biological reference sequence is a computationally complex task. The dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) is a crucial step in polishing sequencing data and identifying non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. By optimising memory, computations and load balancing between CPU and GPU, we demonstrate how f5c can perform ∼3-5 × faster than an optimised version of the original CPU-only implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with GPU optimised ABEA is available at https://github.com/hasindu2008/f5c .

Journal ArticleDOI
TL;DR: The multiGSEA package is presented, a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection, pathway database availability, organism selection and the mapping of omic feature identifiers.
Abstract: Gaining biological insights into molecular responses to treatments or diseases from omics data can be accomplished by gene set or pathway enrichment methods. A plethora of different tools and algorithms have been developed so far. Among those, the gene set enrichment analysis (GSEA) proved to control both type I and II errors well. In recent years the call for a combined analysis of multiple omics layers became prominent, giving rise to a few multi-omics enrichment tools. Each of these has its own drawbacks and restrictions regarding its universal application. Here, we present the multiGSEA package aiding to calculate a combined GSEA-based pathway enrichment on multiple omics layers. The package queries 8 different pathway databases and relies on the robust GSEA algorithm for a single-omics enrichment analysis. In a final step, those scores will be combined to create a robust composite multi-omics pathway enrichment measure. multiGSEA supports 11 different organisms and includes a comprehensive mapping of transcripts, proteins, and metabolite IDs. With multiGSEA we introduce a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection, pathway database availability, organism selection and the mapping of omics feature identifiers. multiGSEA is publicly available under the GPL-3 license at https://github.com/yigbt/multiGSEA and at bioconductor: https://bioconductor.org/packages/multiGSEA .
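Conceptually, the final combination step can be illustrated by merging per-layer pathway p-values into one composite value; Stouffer's method is shown here purely as an example, and the actual combination schemes offered by multiGSEA may differ.

```python
# Illustrative sketch only: combining per-layer pathway enrichment p-values
# (transcriptome, proteome, metabolome) into one composite value. The scheme
# shown (Stouffer's method) is an example; multiGSEA's own options may differ.
from scipy.stats import combine_pvalues

p_values = {"transcriptome": 0.004, "proteome": 0.030, "metabolome": 0.210}
stat, p_combined = combine_pvalues(list(p_values.values()), method="stouffer")
print(f"combined pathway p-value: {p_combined:.4f}")
```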

Journal ArticleDOI
TL;DR: Keras R-CNN is a Python package that performs automated cell identification for both brightfield and fluorescence images and can process large image sets and is demonstrated on two important biological problems, nucleus detection and malaria stage classification.
Abstract: A common yet still manual task in basic biology research, high-throughput drug screening and digital pathology is identifying the number, location, and type of individual cells in images. Object detection methods can be useful for identifying individual cells as well as their phenotype in one step. State-of-the-art deep learning for object detection is poised to improve the accuracy and efficiency of biological image analysis. We created Keras R-CNN to bring leading computational research to the everyday practice of bioimage analysts. Keras R-CNN implements deep learning object detection techniques using Keras and Tensorflow ( https://github.com/broadinstitute/keras-rcnn ). We demonstrate the command line tool’s simplified Application Programming Interface on two important biological problems, nucleus detection and malaria stage classification, and show its potential for identifying and classifying a large number of cells. For malaria stage classification, we compare results with expert human annotators and find comparable performance. Keras R-CNN is a Python package that performs automated cell identification for both brightfield and fluorescence images and can process large image sets. Both the package and image datasets are freely available on GitHub and the Broad Bioimage Benchmark Collection.

Journal ArticleDOI
TL;DR: A comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, diabetes, to many cancer subtypes for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks, suggesting that using l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provides the best predictive analyses overall.
Abstract: The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, diabetes, to many cancer subtypes for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve as both an up-to-date reference on attainable performance, and as a benchmarking resource for further research. Approaches that combine large numbers of genes outperformed single gene methods consistently and with a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that using l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provide the best predictive analyses overall. Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.
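The best-performing recipe reported above can be sketched as follows on synthetic counts; the pseudocount and regression settings are assumptions, not the exact pipeline from the paper.

```python
# Minimal sketch of the recipe described above: centered log-ratio (CLR)
# transform of transcript abundances followed by l2-regularized logistic
# regression (synthetic data; pseudocount and settings are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
counts = rng.poisson(lam=20, size=(100, 500)).astype(float)   # samples x genes
y = rng.integers(0, 2, size=100)                               # phenotype labels

def clr(x, pseudocount=1.0):
    logs = np.log(x + pseudocount)
    return logs - logs.mean(axis=1, keepdims=True)             # center per sample

model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
print("CV accuracy:", cross_val_score(model, clr(counts), y, cv=5).mean())
```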

Journal ArticleDOI
TL;DR: DeepSuccinylSite, a novel prediction tool that uses deep learning methodology along with embedding to identify succinylation sites in proteins based on their primary structure, is developed, and results suggest that the method represents a robust and complementary technique for advanced exploration of protein succinylation.
Abstract: Protein succinylation has recently emerged as an important and common post-translation modification (PTM) that occurs on lysine residues. Succinylation is notable both in its size (e.g., at 100 Da, it is one of the larger chemical PTMs) and in its ability to modify the net charge of the modified lysine residue from + 1 to − 1 at physiological pH. The gross local changes that occur in proteins upon succinylation have been shown to correspond with changes in gene activity and to be perturbed by defects in the citric acid cycle. These observations, together with the fact that succinate is generated as a metabolic intermediate during cellular respiration, have led to suggestions that protein succinylation may play a role in the interaction between cellular metabolism and important cellular functions. For instance, succinylation likely represents an important aspect of genomic regulation and repair and may have important consequences in the etiology of a number of disease states. In this study, we developed DeepSuccinylSite, a novel prediction tool that uses deep learning methodology along with embedding to identify succinylation sites in proteins based on their primary structure. Using an independent test set of experimentally identified succinylation sites, our method achieved efficiency scores of 79%, 68.7% and 0.48 for sensitivity, specificity and MCC respectively, with an area under the receiver operator characteristic (ROC) curve of 0.8. In side-by-side comparisons with previously described succinylation predictors, DeepSuccinylSite represents a significant improvement in overall accuracy for prediction of succinylation sites. Together, these results suggest that our method represents a robust and complementary technique for advanced exploration of protein succinylation.

Journal ArticleDOI
TL;DR: Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs and is better than several state-of-the-art LDA prediction models.
Abstract: Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs is helpful to study the mechanism of lncRNAs in diseases and explore new therapies of diseases. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most of the existing models ignore the interference of noisy and redundancy information among these data resources. To improve the ability of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA in short). First, the RFLDA integrates the experiment-supported miRNA-disease associations (MDAs) and LDAs, the disease semantic similarity (DSS), the lncRNA functional similarity (LFS) and the lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA chooses the most useful features to train prediction model by feature selection based on the random forest variable importance score that takes into account not only the effect of individual feature on prediction results but also the joint effects of multiple features on prediction results. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. In terms of the area under the receiver operating characteristic curve (AUC) of 0.976 and the area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the performance of the RFLDA is better than several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models. Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.
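The two-stage scheme, feature selection by random forest variable importance followed by a random forest regression scorer, can be sketched on toy data (the features here are synthetic, not the MDA/DSS/LFS/LMI features used by RFLDA).

```python
# Conceptual sketch of the two-stage scheme described above (toy data, not the
# RFLDA features): rank features by random forest importance, keep the top
# ones, then train a random forest regressor to score candidate associations.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=60, n_informative=10, random_state=0)

ranker = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(-ranker.feature_importances_)[:10]            # feature selection

scorer = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:, top], y)
print("selected features:", sorted(top.tolist()))
print("training R^2 on selected features:", round(scorer.score(X[:, top], y), 3))
```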

Journal ArticleDOI
TL;DR: Decon2 provides a method to detect cell type interaction effects from bulk blood eQTLs that is useful for pinpointing the most relevant cell type for a given complex disease.
Abstract: Expression quantitative trait loci (eQTL) studies are used to interpret the function of disease-associated genetic risk factors. To date, most eQTL analyses have been conducted in bulk tissues, such as whole blood and tissue biopsies, which are likely to mask the cell type-context of the eQTL regulatory effects. Although this context can be investigated by generating transcriptional profiles from purified cell subpopulations, current methods to do this are labor-intensive and expensive. We introduce a new method, Decon2, as a framework for estimating cell proportions using expression profiles from bulk blood samples (Decon-cell) followed by deconvolution of cell type eQTLs (Decon-eQTL). The estimated cell proportions from Decon-cell agree with experimental measurements across cohorts (R ≥ 0.77). Using Decon-cell, we could predict the proportions of 34 circulating cell types for 3194 samples from a population-based cohort. Next, we identified 16,362 whole-blood eQTLs and deconvoluted cell type interaction (CTi) eQTLs using the predicted cell proportions from Decon-cell. CTi eQTLs show excellent allelic directional concordance with eQTL (≥ 96–100%) and chromatin mark QTL (≥87–92%) studies that used either purified cell subpopulations or single-cell RNA-seq, outperforming the conventional interaction effect. Decon2 provides a method to detect cell type interaction effects from bulk blood eQTLs that is useful for pinpointing the most relevant cell type for a given complex disease. Decon2 is available as an R package and Java application (https://github.com/molgenis/systemsgenetics/tree/master/Decon2) and as a web tool (www.molgenis.org/deconvolution).

Journal ArticleDOI
TL;DR: CnnCrispr automatically trains the sequence features of sgRNA-DNA pairs with GloVe model, and embeds the trained word vector matrix into the deep learning model including biLSTM and CNN with five hidden layers, and has better classification and regression performance than the existing states-of-art models.
Abstract: The CRISPR/Cas9 system, as the third-generation genome editing technology, has been widely applied in target gene repair and gene expression regulation. Selection of an appropriate sgRNA can improve the on-target knockout efficacy of the CRISPR/Cas9 system with high sensitivity and specificity. However, when the CRISPR/Cas9 system is operating, unexpected cleavage may occur at some sites, known as off-targets. Presently, a number of prediction methods have been developed to predict the off-target propensity of sgRNA at specific DNA fragments. Most of them use artificial feature extraction operations and machine learning techniques to obtain off-target scores. With the rapid expansion of off-target data and the rapid development of deep learning theory, the existing prediction methods can no longer satisfy the prediction accuracy at the clinical level. Here, we propose a prediction method named CnnCrispr to predict the off-target propensity of sgRNA at specific DNA fragments. CnnCrispr automatically trains the sequence features of sgRNA-DNA pairs with the GloVe model, and embeds the trained word vector matrix into a deep learning model including a biLSTM and a CNN with five hidden layers. We conducted performance verification on the data set provided by DeepCrispr, and found that the auROC and auPRC in the “leave-one-sgRNA-out” cross validation could reach 0.957 and 0.429 respectively (the Pearson and Spearman values could reach 0.495 and 0.151 respectively under the same settings). Our results show that CnnCrispr has better classification and regression performance than existing state-of-the-art models. The code for CnnCrispr can be freely downloaded from https://github.com/LQYoLH/CnnCrispr.

Journal ArticleDOI
TL;DR: NASQAR (Nucleic Acid SeQuence Analysis Resource), as discussed by the authors, is a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization.
Abstract: As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource). NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology. NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.