
Showing papers in "BMC Bioinformatics in 2015"


Journal ArticleDOI
TL;DR: InteractiVenn is a more flexible tool for interacting with Venn diagrams of up to six sets; it offers a clean interface for Venn diagram construction and enables analysis of set unions while preserving the shape of the diagram.
Abstract: Set comparisons permeate a large number of data analysis workflows, in particular workflows in the biological sciences. Venn diagrams are frequently employed for such analysis, but current tools are limited. We have developed InteractiVenn, a more flexible tool for interacting with Venn diagrams of up to six sets. It offers a clean interface for Venn diagram construction and enables analysis of set unions while preserving the shape of the diagram. Set unions are useful to reveal differences and similarities among sets, and in our tool they may be guided by a tree or by a list of set unions. The tool also allows users to obtain the elements of subsets, to save and load sets for further analyses, and to export the diagram in vector and image formats. InteractiVenn has been used to analyze two biological datasets, but it may serve set analysis in a broad range of domains. InteractiVenn allows set unions in Venn diagrams to be explored thoroughly, thereby extending the ability to analyze combinations of sets with additional observations yielded by novel interactions between joined sets. InteractiVenn is freely available online at: www.interactivenn.net .
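
The set algebra behind such a tool is simple enough to sketch. The Python fragment below is an independent illustration (not InteractiVenn's actual JavaScript implementation): it computes the exclusive regions of an n-set Venn diagram and shows how a set union merges two inputs before the regions are recomputed.

```python
def venn_regions(sets):
    """Map each non-empty membership pattern (a frozenset of set names) to
    the elements belonging to exactly those sets and no others."""
    names = list(sets)
    universe = set().union(*sets.values())
    regions = {}
    for element in universe:
        members = frozenset(n for n in names if element in sets[n])
        regions.setdefault(members, set()).add(element)
    return regions

sets = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
for pattern, elems in sorted(venn_regions(sets).items(),
                             key=lambda kv: sorted(kv[0])):
    print("&".join(sorted(pattern)), "only:", sorted(elems))

# A "set union" in the InteractiVenn sense: merge B and C, then recompute.
merged = {"A": sets["A"], "B|C": sets["B"] | sets["C"]}
print(venn_regions(merged))
```

Because each region is keyed by a membership pattern, a diagram of n sets has at most 2^n - 1 regions, which is why interactive tools of this kind typically stop at six sets.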

1,398 citations


Journal ArticleDOI
TL;DR: Overall, BioASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.
Abstract: This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. The 2013 BioASQ competition comprised two tasks, Task 1a and Task 1b. In Task 1a participants were asked to automatically annotate new PubMed documents with MeSH headings. Twelve teams participated in Task 1a, with a total of 46 system runs submitted, and one of the teams performed consistently better than the MTI indexer used by NLM to suggest MeSH headings to curators. Task 1b used benchmark datasets containing 29 development and 282 test English questions, along with gold standard (reference) answers, prepared by a team of biomedical experts from around Europe; participants had to produce answers automatically. Three teams participated in Task 1b, with 11 system runs. The BioASQ infrastructure, including benchmark datasets, evaluation mechanisms, and the results of the participants and baseline methods, is publicly available. A publicly available evaluation infrastructure for biomedical semantic indexing and QA has been developed, which includes benchmark datasets, and can be used to evaluate systems that: assign MeSH headings to published articles or to English questions; retrieve relevant RDF triples from ontologies, relevant articles and snippets from PubMed Central; produce “exact” and paragraph-sized “ideal” answers (summaries). The results of the systems that participated in the 2013 BioASQ competition are promising. In Task 1a one of the systems performed consistently better than the NLM’s MTI indexer. In Task 1b the systems received high scores in the manual evaluation of the “ideal” answers; hence, they produced high-quality summaries as answers. Overall, BioASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.

471 citations


Journal ArticleDOI
TL;DR: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment; since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins.
Abstract: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and accuracy then diminishes steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information, under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Two predictors based on local sequence context were assessed: (i) single-sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.
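
As a concrete illustration of the second predictor, the hedged sketch below modulates gap-opening costs by the hydropathy of the surrounding residues; the window size, Kyte-Doolittle scale, and weighting are assumptions chosen for illustration, not DECIPHER's actual scheme.

```python
# Idea: opening a gap is cheaper in hydrophilic stretches (likely surface
# loops) and pricier in hydrophobic stretches (likely structural core).
KD = {  # Kyte-Doolittle hydropathy values
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
    'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
    'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
    'Y': -1.3, 'V': 4.2,
}

def gap_open_costs(seq, base_cost=10.0, window=3, scale=0.5):
    """Per-position gap-opening costs modulated by the mean hydropathy of
    a small window of residues around each position."""
    costs = []
    for i in range(len(seq)):
        ctx = seq[max(0, i - window): i + window + 1]
        hydropathy = sum(KD.get(aa, 0.0) for aa in ctx) / len(ctx)
        costs.append(base_cost + scale * hydropathy)
    return costs

print([round(c, 1) for c in gap_open_costs("MKTLLVVAVDDSSNQKQ")])
```

A position-specific cost vector like this can then feed any affine-gap dynamic programming aligner in place of a single constant gap-open penalty.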

245 citations


Journal ArticleDOI
TL;DR: The software package IPO (‘Isotopologue Parameter Optimization’) was successfully applied to data from liquid chromatography coupled to high-resolution mass spectrometry in three studies with different sample types, chromatographic methods, and devices, demonstrating the potential of IPO to increase the reliability of metabolomics data.
Abstract: Background Untargeted metabolomics generates a huge amount of data. Software packages for automated data processing are crucial to successfully process these data. A variety of such software packages exist, but the outcome of data processing strongly depends on algorithm parameter settings. If they are not carefully chosen, suboptimal parameter settings can easily lead to biased results. Therefore, parameter settings also require optimization. Several parameter optimization approaches have already been proposed, but a software package for parameter optimization which is free of intricate experimental labeling steps, fast and widely applicable is still missing.

240 citations


Journal ArticleDOI
TL;DR: A heuristic approach based on the analysis of variation in statistical perplexity during topic modelling is proposed to estimate the most appropriate number of topics, with the rate of perplexity change (RPC) as a function of the number of topics proposed as a suitable selector.
Abstract: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach. Based on analysis of the variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with the number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.
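
A sketch of the RPC computation follows; the toy corpus, the topic grid, and the "first point where RPC stops decreasing" decision rule are illustrative assumptions rather than a verbatim reimplementation of the paper's selector.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["gene expression cancer", "cancer drug response", "drug side effect",
        "topic model text", "text mining pubmed", "pubmed abstract corpus"] * 10
X = CountVectorizer().fit_transform(docs)

topic_grid = [2, 4, 6, 8, 10]
perplexity = [LatentDirichletAllocation(n_components=t, random_state=0)
              .fit(X).perplexity(X) for t in topic_grid]

# RPC(i) = |P(t_i) - P(t_{i-1})| / (t_i - t_{i-1})
rpc = [abs(perplexity[i] - perplexity[i - 1]) / (topic_grid[i] - topic_grid[i - 1])
       for i in range(1, len(topic_grid))]

# Assumed decision rule: take the t at which RPC stops decreasing.
chosen = next((topic_grid[j] for j in range(1, len(rpc)) if rpc[j] > rpc[j - 1]),
              topic_grid[-1])
print(list(zip(topic_grid[1:], [round(r, 2) for r in rpc])), "->", chosen)
```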

230 citations


Journal ArticleDOI
TL;DR: The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels and is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels.
Abstract: Protein function prediction assigns biological or biochemical functions to proteins; it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e. no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction. In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels. The empirical study shows that it is important to account for incomplete annotations in protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.

226 citations


Journal ArticleDOI
TL;DR: QoRTs generates an unmatched variety of quality control metrics, and can provide cross-comparisons of replicates contrasted by batch, biological sample, or experimental condition, revealing any outliers and/or systematic issues that could drive false associations or otherwise compromise downstream analyses.
Abstract: Background High-throughput next-generation RNA sequencing has matured into a viable and powerful method for detecting variations in transcript expression and regulation. Proactive quality control is of critical importance as unanticipated biases, artifacts, or errors can potentially drive false associations and lead to flawed results.

223 citations


Journal ArticleDOI
TL;DR: tcR is a new R package providing a platform for the advanced analysis of T cell receptor repertoires after primary TR sequence extraction from raw sequencing reads; it includes diversity measures, shared T cell receptor sequence identification, gene usage statistics computation and other widely used methods.
Abstract: The immunoglobulins (IG) and the T cell receptors (TR) play a key role in antigen recognition during the adaptive immune response. Recent progress in next-generation sequencing technologies has provided an opportunity for deep T cell receptor repertoire profiling. However, specialised software is required for the rational analysis of the massive data generated by next-generation sequencing. Here we introduce tcR, a new R package providing a platform for the advanced analysis of T cell receptor repertoires, which includes diversity measures, shared T cell receptor sequence identification, gene usage statistics computation and other widely used methods. The tool has proven its utility in recent research studies. tcR is an R package for the advanced analysis of T cell receptor repertoires after primary TR sequence extraction from raw sequencing reads. The stable version can be installed directly from The Comprehensive R Archive Network ( http://cran.r-project.org/mirrors.html ). The source code and development version are available at the tcR GitHub ( http://imminfo.github.io/tcr/ ) along with the full documentation and typical usage examples.
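
Two of the analyses named above are easy to illustrate generically; the Python sketch below (tcR's own R API differs, and the clonotype sequences and counts are invented) computes an inverse Simpson diversity index over clonotype frequencies and identifies CDR3 sequences shared between two repertoires.

```python
from collections import Counter

def inverse_simpson(clonotype_counts):
    """Inverse Simpson diversity: 1 / sum(p_i^2) over clonotype frequencies."""
    total = sum(clonotype_counts.values())
    return 1.0 / sum((c / total) ** 2 for c in clonotype_counts.values())

rep_a = Counter({"CASSLGQGYEQYF": 120, "CASSPDRGEQFF": 30, "CASRQGAYEQYF": 5})
rep_b = Counter({"CASSLGQGYEQYF": 80, "CASSIRSSYEQYF": 60})

print("diversity A:", round(inverse_simpson(rep_a), 2))
print("shared clonotypes:", sorted(set(rep_a) & set(rep_b)))
```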

211 citations


Journal ArticleDOI
TL;DR: The predictive power of MAESTRO for single point mutations and stabilizing disulfide bonds is comparable to similar methods, making it a versatile tool in the field of stability change prediction upon point mutations.
Abstract: Point mutations can have a strong impact on protein stability. A change in stability may subsequently lead to dysfunction and finally cause diseases. Moreover, protein engineering approaches aim to deliberately modify protein properties, where stability is a major constraint. In order to support basic research and protein design tasks, several computational tools for predicting the change in stability upon mutations have been developed. Comparative studies have shown the usefulness but also limitations of such programs. We aim to contribute a novel method for predicting changes in stability upon point mutation in proteins called MAESTRO. MAESTRO is structure based and distinguishes itself from similar approaches in the following points: (i) MAESTRO implements a multi-agent machine learning system. (ii) It also provides predicted free energy change (ΔΔG) values and a corresponding prediction confidence estimation. (iii) It provides high throughput scanning for multi-point mutations where sites and types of mutation can be comprehensively controlled. (iv) Finally, the software provides a specific mode for the prediction of stabilizing disulfide bonds. The predictive power of MAESTRO for single point mutations and stabilizing disulfide bonds is comparable to similar methods. MAESTRO is a versatile tool in the field of stability change prediction upon point mutations. Executables for the Linux and Windows operating systems are freely available to non-commercial users from http://biwww.che.sbg.ac.at/MAESTRO .

200 citations


Journal ArticleDOI
TL;DR: GelJ is an easy to use tool for analyzing DNA fingerprint gel images that combines the best characteristics of both free and commercial tools; it is light and simple to use, but it also includes the necessary features to obtain precise results.
Abstract: DNA fingerprinting is a technique for comparing DNA patterns that has applications in a wide variety of contexts. Several commercial and freely-available tools can be used to analyze DNA fingerprint gel images; however, commercial tools are expensive and usually difficult to use, while free tools support the basic functionality for DNA fingerprint analysis but lack some instrumental features needed to obtain accurate results. In this paper, we present GelJ, a feather-weight, user-friendly, platform-independent, open-source and free tool for analyzing DNA fingerprint gel images. Some of the outstanding features of GelJ are mechanisms for accurate lane- and band-detection, several options for computing migration models, a number of band- and curve-based similarity methods, different techniques for generating dendrograms, comparison of banding patterns from different experiments, and database support. GelJ is an easy-to-use tool for analyzing DNA fingerprint gel images. It combines the best characteristics of both free and commercial tools: GelJ is light and simple to use (like free programs), but it also includes the features necessary to obtain precise results (like commercial programs). In addition, GelJ incorporates new functionality that is not supported by any other tool.
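
The band-based similarity and dendrogram steps can be sketched as follows; GelJ itself is a Java GUI application, so this Python fragment, with its made-up band positions and matching tolerance, is only an independent illustration of the underlying computation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def dice_similarity(bands_a, bands_b, tol=2.0):
    """Dice coefficient on matched bands: 2*matches / (|A| + |B|).
    Two bands match if their positions differ by at most `tol`."""
    matches = sum(1 for a in bands_a if any(abs(a - b) <= tol for b in bands_b))
    return 2.0 * matches / (len(bands_a) + len(bands_b))

lanes = {"s1": [10, 40, 75], "s2": [11, 41, 76], "s3": [25, 60]}
names = list(lanes)
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1 - dice_similarity(lanes[names[i]],
                                                      lanes[names[j]])

tree = linkage(squareform(dist), method="average")  # UPGMA-style dendrogram
print(dendrogram(tree, labels=names, no_plot=True)["ivl"])
```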

194 citations


Journal ArticleDOI
TL;DR: Joint analysis of text-mined data with data curated by experts is proposed as a suitable approach to both assess data quality and highlight novel and interesting information.
Abstract: Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system, aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text-mined data with data curated by experts is a suitable approach to both assess data quality and highlight novel and interesting information.

Journal ArticleDOI
TL;DR: diffHic, as discussed by the authors, is a software package for the detection of differential interactions from Hi-C data; it uses the statistical framework of the edgeR package to model biological variability and test for significant differences between conditions.
Abstract: Chromatin conformation capture with high-throughput sequencing (Hi-C) is a technique that measures the in vivo intensity of interactions between all pairs of loci in the genome. Most conventional analyses of Hi-C data focus on the detection of statistically significant interactions. However, an alternative strategy involves identifying significant changes in the interaction intensity (i.e., differential interactions) between two or more biological conditions. This is more statistically rigorous and may provide more biologically relevant results. Here, we present the diffHic software package for the detection of differential interactions from Hi-C data. diffHic provides methods for read pair alignment and processing, counting into bin pairs, filtering out low-abundance events and normalization of trended or CNV-driven biases. It uses the statistical framework of the edgeR package to model biological variability and to test for significant differences between conditions. Several options for the visualization of results are also included. The use of diffHic is demonstrated with real Hi-C data sets. Performance against existing methods is also evaluated with simulated data. On real data, diffHic is able to successfully detect interactions with significant differences in intensity between biological conditions. It also compares favourably to existing software tools on simulated data sets. These results suggest that diffHic is a viable approach for differential analyses of Hi-C data.

Journal ArticleDOI
TL;DR: Spearman correlation analysis revealed that RC, UQ, Med, TMM, DESeq, and Q did not noticeably improve gene expression normalization, regardless of read length; the authors also suggest ignoring the poly-A tail during differential gene expression analysis.
Abstract: Recently, rapid improvements in technology and decreases in sequencing costs have made RNA-Seq a widely used technique to quantify gene expression levels. Various normalization approaches have been proposed, owing to the importance of normalization in the analysis of RNA-Seq data. A comparison of recently proposed normalization methods is required to generate suitable guidelines for the selection of the most appropriate approach for future experiments. In this paper, we compared eight non-abundance (RC, UQ, Med, TMM, DESeq, Q, RPKM, and ERPKM) and two abundance estimation normalization methods (RSEM and Sailfish). The experiments were based on real Illumina high-throughput RNA-Seq data of 35- and 76-nucleotide sequences produced in the MAQC project, as well as simulated reads. Reads were mapped to the human genome obtained from the UCSC Genome Browser Database. For precise evaluation, we investigated the Spearman correlation between the normalization results from RNA-Seq and MAQC qRT-PCR values for 996 genes. Based on this work, we showed that out of the eight non-abundance estimation normalization methods, RC, UQ, Med, TMM, DESeq, and Q gave similar normalization results for all data sets. For RNA-Seq of a 35-nucleotide sequence, RPKM showed the highest correlation results, but for RNA-Seq of a 76-nucleotide sequence it showed the lowest correlation of all the methods. ERPKM did not improve on RPKM. Between the two abundance estimation normalization methods, for RNA-Seq of a 35-nucleotide sequence, higher correlation was obtained with Sailfish than with RSEM, which in turn was better than not using abundance estimation at all. However, for RNA-Seq of a 76-nucleotide sequence, the results achieved by RSEM were similar to those obtained without abundance estimation, and were much better than those with Sailfish. Furthermore, we found that adding a poly-A tail increased alignment numbers, but did not improve normalization results. Spearman correlation analysis revealed that RC, UQ, Med, TMM, DESeq, and Q did not noticeably improve gene expression normalization, regardless of read length. Other normalization methods were more efficient when alignment accuracy was low; Sailfish with RPKM gave the best normalization results. When alignment accuracy was high, RC was sufficient for gene expression calculation. We also suggest ignoring the poly-A tail during differential gene expression analysis.
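
The evaluation procedure reduces to a normalization step followed by a rank correlation against the qRT-PCR reference; the sketch below illustrates this with RPKM and invented numbers.

```python
import numpy as np
from scipy.stats import spearmanr

counts = np.array([500, 1200, 80, 3000])        # mapped reads per gene
lengths_kb = np.array([2.0, 4.0, 1.0, 10.0])    # gene length in kilobases
total_reads = counts.sum()

# RPKM: reads per kilobase of transcript per million mapped reads
rpkm = counts / (lengths_kb * (total_reads / 1e6))
qrtpcr = np.array([2.1, 2.5, 0.4, 2.8])         # reference expression values

rho, pval = spearmanr(rpkm, qrtpcr)
print("RPKM:", np.round(rpkm, 1), "Spearman rho:", round(rho, 3))
```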

Journal ArticleDOI
TL;DR: The accuracy of shotgun metagenomics classification methods varies widely, and no one program clearly outperformed others in all evaluation scenarios; rather, the results illustrate the strengths of different methods for different purposes.
Abstract: The field of metagenomics (study of genetic material recovered directly from an environment) has grown rapidly, with many bioinformatics analysis methods being developed. To ensure appropriate use of such methods, robust comparative evaluation of their accuracy and features is needed. For taxonomic classification of sequence reads, such evaluation should include use of clade exclusion, which better evaluates a method’s accuracy when identical sequences are not present in any reference database, as is common in metagenomic analysis. To date, relatively small evaluations have been performed, with evaluation approaches like clade exclusion limited to assessment of new methods by the authors of the given method. What is needed is a rigorous, independent comparison between multiple major methods, using the same in silico and in vitro test datasets, with and without approaches like clade exclusion, to better characterize accuracy under different conditions. An overview of the features of 38 bioinformatics methods is provided, evaluating accuracy with a focus on 11 programs that have reference databases that can be modified and therefore most robustly evaluated with clade exclusion. Taxonomic classification of sequence reads was evaluated using both in silico and in vitro mock bacterial communities. Clade exclusion was used at taxonomic levels from species to class—identifying how well methods perform in progressively more difficult scenarios. A wide range of variability was found in the sensitivity, precision, overall accuracy, and computational demand for the programs evaluated. In experiments where distilled water was spiked with only 11 bacterial species, frequently dozens to hundreds of species were falsely predicted by the most popular programs. The different features of each method (forces predictions or not, etc.) are summarized, and additional analysis considerations discussed. The accuracy of shotgun metagenomics classification methods varies widely. No one program clearly outperformed others in all evaluation scenarios; rather, the results illustrate the strengths of different methods for different purposes. Researchers must appreciate method differences, choosing the program best suited for their particular analysis to avoid very misleading results. Use of standardized datasets for method comparisons is encouraged, as is use of mock microbial community controls suitable for a particular metagenomic analysis.

Journal ArticleDOI
TL;DR: An R package for the study of prokaryotic pan-genomes is presented, and parts of the package are demonstrated on a data set for the gram-positive bacterium Enterococcus faecalis.
Abstract: A pan-genome is defined as the set of all unique gene families found in one or more strains of a prokaryotic species. Due to the extensive within-species diversity in the microbial world, the pan-genome is often many times larger than a single genome. Studies of pan-genomes have become popular due to the easy access to whole-genome sequence data for prokaryotes. A pan-genome study reveals species diversity and gene families that may be of special interest, e.g. because of their role in bacterial survival or their ability to discriminate strains. We present an R package for the study of prokaryotic pan-genomes. The R computing environment harbors endless possibilities with respect to statistical analyses and graphics. External free software is used for the heavy computations involved, and the R package provides functions for building a computational pipeline. We demonstrate parts of the package on a data set for the gram-positive bacterium Enterococcus faecalis. The package is free to download and install from The Comprehensive R Archive Network.
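
The basic pan-genome bookkeeping is straightforward set algebra; the package itself is written in R, so the Python sketch below, with hypothetical strain and gene-family names, is only a language-neutral illustration of the pan- and core-genome computations.

```python
strains = {
    "strain1": {"famA", "famB", "famC"},
    "strain2": {"famA", "famB", "famD"},
    "strain3": {"famA", "famE"},
}

pan = set().union(*strains.values())           # all gene families seen
core = set.intersection(*strains.values())     # families present in every strain
print("pan-genome size:", len(pan), "core:", sorted(core))

# Pan-genome growth as strains are added (the basis of rarefaction curves):
seen, growth = set(), []
for name, fams in strains.items():
    seen |= fams
    growth.append((name, len(seen)))
print(growth)
```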

Journal ArticleDOI
TL;DR: A novel ‘feature selection-based multi-label k-nearest neighbor method’ (FS-MLKNN) is proposed, which can simultaneously determine critical feature dimensions and construct high-accuracy multi-label prediction models; FS-MLKNN and its ensemble extension are promising tools for side effect prediction.
Abstract: Predicting drug side effects is an important topic in drug discovery. Although several machine learning methods have been proposed to predict side effects, there is still room for improvement. Firstly, side effect prediction is a multi-label learning task, and we can adopt multi-label learning techniques for it. Secondly, drug-related features are associated with side effects, and feature dimensions have specific biological meanings. Recognizing critical dimensions and reducing irrelevant dimensions may help to reveal the causes of side effects. In this paper, we propose a novel method, the ‘feature selection-based multi-label k-nearest neighbor method’ (FS-MLKNN), which can simultaneously determine critical feature dimensions and construct high-accuracy multi-label prediction models. Computational experiments demonstrate that FS-MLKNN leads to good performance as well as explainable results. To achieve better performance, we further develop an ensemble learning model by integrating individual feature-based FS-MLKNN models. When compared with other state-of-the-art methods, the ensemble method produces better performance on benchmark datasets. In conclusion, FS-MLKNN and the ensemble method are promising tools for side effect prediction. The source code and datasets are available in the Additional file 1.

Journal ArticleDOI
TL;DR: Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome; it is sensitive to both transposons and simple repeats, whereas available tools such as RepeatScout and RECON are sensitive to transposons, and WindowMasker to simple repeats.
Abstract: With rapid advancements in technology, the sequences of thousands of species’ genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de novo. However, the currently available de novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate. To address these limitations, I designed and developed Red, applying machine learning. Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome. Red is easy to install and use. It is sensitive to both transposons and simple repeats; in contrast, available tools such as RepeatScout and RECON are sensitive to transposons, and WindowMasker to simple repeats. Red performed consistently well on seven genomes; the other tools performed well only on some genomes. Red is much faster than RepeatScout and RECON and has a much lower false positive rate than WindowMasker. On human genes with five or more copies, Red was more specific than RepeatScout by a wide margin. When tested on genomes of unusual nucleotide compositions, Red located repeats with high sensitivities and maintained moderate false positive rates. Red outperformed the related tools on a bacterial genome. Red identified 46,405 novel repetitive segments in the human genome. Finally, Red is capable of processing assembled and unassembled genomes. Red’s innovative methodology and its excellent performance on seven different genomes represent a valuable advancement in the field of repeat discovery.

Journal ArticleDOI
TL;DR: Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including those of very large genomes, and is expected to have broad utility for finishing genomes across the tree of life.
Abstract: Background While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment “gaps” – uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes.

Journal ArticleDOI
TL;DR: In biological systems with low noise, i.e. where periodic signals with interesting shapes are more likely to occur, SW1PerS can be used as a powerful tool in exploratory analyses, and is shown to be the most shape-agnostic of the evaluated methods.
Abstract: Identifying periodically expressed genes across different processes (e.g. the cell and metabolic cycles, circadian rhythms, etc) is a central problem in computational biology. Biological time series may contain (multiple) unknown signal shapes of systemic relevance, imperfections like noise, damping, and trending, or limited sampling density. While there exist methods for detecting periodicity, their design biases (e.g. toward a specific signal shape) can limit their applicability in one or more of these situations. We present in this paper a novel method, SW1PerS, for quantifying periodicity in time series in a shape-agnostic manner and with resistance to damping. The measurement is performed directly, without presupposing a particular pattern, by evaluating the circularity of a high-dimensional representation of the signal. SW1PerS is compared to other algorithms using synthetic data and performance is quantified under varying noise models, noise levels, sampling densities, and signal shapes. Results on biological data are also analyzed and compared. On the task of periodic/not-periodic classification, using synthetic data, SW1PerS outperforms all other algorithms in the low-noise regime. SW1PerS is shown to be the most shape-agnostic of the evaluated methods, and the only one to consistently classify damped signals as highly periodic. On biological data, and for several experiments, the lists of top 10% genes ranked with SW1PerS recover up to 67% of those generated with other popular algorithms. Moreover, the list of genes from data on the Yeast metabolic cycle which are highly-ranked only by SW1PerS, contains evidently non-cosine patterns (e.g. ECM33, CDC9, SAM1,2 and MSH6) with highly periodic expression profiles. In data from the Yeast cell cycle SW1PerS identifies genes not preferred by other algorithms, hence not previously reported as periodic, but found in other experiments such as the universal growth rate response of Slavov. These genes are BOP3, CDC10, YIL108W, YER034W, MLP1, PAC2 and RTT101. In biological systems with low noise, i.e. where periodic signals with interesting shapes are more likely to occur, SW1PerS can be used as a powerful tool in exploratory analyses. Indeed, by having an initial set of periodic genes with a rich variety of signal types, pattern/shape information can be included in the study of systems and the generation of hypotheses regarding the structure of gene regulatory networks.

Journal ArticleDOI
TL;DR: A software package, dms_tools, to infer the impacts of mutations from deep mutational scanning data using a likelihood-based treatment of the mutation counts, which yields more accurate inferences on simulated data than simply calculating ratios of counts pre- and post-selection.
Abstract: Deep mutational scanning is a technique to estimate the impacts of mutations on a gene by using deep sequencing to count mutations in a library of variants before and after imposing a functional selection. The impacts of mutations must be inferred from changes in their counts after selection. I describe a software package, dms_tools, to infer the impacts of mutations from deep mutational scanning data using a likelihood-based treatment of the mutation counts. I show that dms_tools yields more accurate inferences on simulated data than simply calculating ratios of counts pre- and post-selection. Using dms_tools, one can infer the preference of each site for each amino acid given a single selection pressure, or assess the extent to which these preferences change under different selection pressures. The preferences and their changes can be intuitively visualized with sequence-logo-style plots created using an extension to weblogo. dms_tools implements a statistically principled approach for the analysis and subsequent visualization of deep mutational scanning data.
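
The naive baseline that dms_tools improves upon is easy to state concretely: the enrichment of each mutation is the ratio of its post-selection to pre-selection frequency. The sketch below, with invented counts and a pseudocount to avoid division by zero, illustrates that baseline; dms_tools itself replaces it with the likelihood-based treatment described above.

```python
pre_counts = {"wt": 10000, "A24G": 120, "A24T": 30, "A24C": 0}
post_counts = {"wt": 9000, "A24G": 300, "A24T": 3, "A24C": 1}

def enrichment(mut, pseudocount=1.0):
    """Ratio of post- to pre-selection frequency, with a pseudocount so
    that mutations unobserved in one library remain finite."""
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    pre_f = (pre_counts[mut] + pseudocount) / (pre_total + pseudocount)
    post_f = (post_counts[mut] + pseudocount) / (post_total + pseudocount)
    return post_f / pre_f

for mut in ("A24G", "A24T", "A24C"):
    print(mut, round(enrichment(mut), 2))
```

The weakness of this estimator, which motivates the likelihood-based alternative, is that low-count mutations (like A24C here) get noisy, pseudocount-dominated estimates.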

Journal ArticleDOI
TL;DR: SPINGO is an accurate, flexible and fast method for low-level taxonomic assignment that compares favourably to other methods in terms of classification accuracy, and is as fast or faster than those that have higher error rates.
Abstract: Taxonomic classification is a cornerstone for the characterisation and comparison of microbial communities. Currently, most existing methods are either slow, restricted to specific communities, highly sensitive to taxonomic inconsistencies, or limited to genus-level classification. As crucial microbiota information hinges on high-resolution classification, it is imperative to increase taxonomic resolution to species level wherever possible. In response to this need we developed SPINGO, a flexible and stand-alone software tool dedicated to high-resolution assignment of sequences to species level using partial 16S rRNA gene sequences from any environment. SPINGO compares favourably to other methods in terms of classification accuracy, and is as fast or faster than those that have higher error rates. As a demonstration of its flexibility for other types of target genes we successfully applied SPINGO also to cpn60 amplicon sequences. SPINGO is an accurate, flexible and fast method for low-level taxonomic assignment. This combination is becoming increasingly important for rapid and accurate processing of amplicon data generated by newer next-generation sequencing technologies.

Journal ArticleDOI
TL;DR: This work addresses three major limitations of prior biomedical KB construction with a versatile and scalable approach to automatic KB construction, and is the first method to use consistency checking for biomedical relations.
Abstract: Background Biomedical knowledge bases (KBs) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like protein-protein interactions or gene-drug relationships, or solely addresses highly specific topics such as drug effects.

Journal ArticleDOI
TL;DR: The new coMET software allows visualisation of EWAS results with functional genomic annotations and with estimation of co-methylation patterns, and can be a valuable resource to interpret results in the fast growing field of epigenetics.
Abstract: Epigenome-wide association scans (EWAS) are an increasingly powerful and widely-used approach to assess the role of epigenetic variation in human complex traits. However, this rapidly emerging field lacks dedicated visualisation tools that can display features specific to epigenetic datasets. We developed coMET, an R package and online tool for visualisation of EWAS results in a genomic region of interest. coMET generates a regional plot of epigenetic-phenotype association results and the estimated DNA methylation correlation between CpG sites (co-methylation), with further options to visualise genomic annotations based on ENCODE data, gene tracks, reference CpG-sites, and user-defined features. The tool can be used to display phenotype association signals and correlation patterns of microarray or sequencing-based DNA methylation data, such as Illumina Infinium 450k, WGBS, or MeDIP-seq, as well as other types of genomic data, such as gene expression profiles. The software is available as a user-friendly online tool from http://epigen.kcl.ac.uk/comet and as an R Bioconductor package. Source code, examples, and full documentation are also available from GitHub. Our new software allows visualisation of EWAS results with functional genomic annotations and with estimation of co-methylation patterns. coMET is available to a wide audience as an online tool and R package, and can be a valuable resource to interpret results in the fast growing field of epigenetics. The software is designed for epigenetic data, but can also be applied to genomic and functional genomic datasets in any species.

Journal ArticleDOI
TL;DR: A combination of boosting and stability selection is assessed for detecting influential predictors in high-dimensional settings while controlling a given error bound; it did so successfully in various simulation scenarios, and the approach was applied to phenotype measurements in patients with autism spectrum disorders using a log-linear interaction model fitted by boosting.
Abstract: Modern biotechnologies often result in high-dimensional data sets with many more variables than observations (n≪p). These data sets pose new challenges to statistical analysis: variable selection becomes one of the most important tasks in this setting. Similar challenges arise in modern data sets from observational studies, e.g. in ecology, where flexible, non-linear models are fitted to high-dimensional data. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds finite-sample error control to high-dimensional variable selection procedures such as Lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provide insights into the usefulness of this combination. The interpretation of the error bounds used is elaborated and insights for practical data analysis are given. Stability selection with boosting was able to detect influential predictors in high-dimensional settings while controlling the given error bound in various simulation scenarios. The dependence on various parameters such as the sample size, the number of truly influential variables, and tuning parameters of the algorithm was investigated. The approach was applied to investigate phenotype measurements in patients with autism spectrum disorders using a log-linear interaction model fitted by boosting. Stability selection identified five differentially expressed amino acid pathways. Stability selection is implemented in the freely available R package stabs ( http://CRAN.R-project.org/package=stabs ). It proved to work well in high-dimensional settings with more predictors than observations for both linear and additive models. The original version of stability selection, which controls the per-family error rate, is quite conservative; this is much less the case for its improvement, complementary pairs stability selection. Nevertheless, care should be taken to appropriately specify the error bound.
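
The core resampling loop of stability selection can be sketched independently of the stabs package; the subsample count, Lasso penalty, and 0.6 selection threshold below are illustrative assumptions, and the formal per-family error bound of the published method is not computed here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                      # n << p setting
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)  # 2 true predictors

n_subsamples, freq = 100, np.zeros(p)
for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # random half-sample
    model = Lasso(alpha=0.2).fit(X[idx], y[idx])
    freq += model.coef_ != 0          # count selections per variable

stable = np.where(freq / n_subsamples >= 0.6)[0]      # selection threshold
print("stable variables:", stable)
```

In the published framework the threshold is not free: together with the number of variables selected per subsample it determines the bound on the expected number of false selections.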

Journal ArticleDOI
TL;DR: Sequencing of restriction enzyme based reduced representation libraries, generally referred to as Genotyping By Sequencing (GBS), is a cost-effective approach for variant discovery and genotyping, but dedicated bioinformatic tools are needed for GBS experimental design and initial data processing.
Abstract: Background Massive parallel sequencing is a powerful tool for variant discovery and genotyping. To reduce costs, sequencing of restriction enzyme based reduced representation libraries can be utilized. This technology is generally referred to as Genotyping By Sequencing (GBS). To deal with GBS experimental design and initial processing, specific bioinformatic tools are needed.

Journal ArticleDOI
TL;DR: A novel reference-free method is presented for compressing data produced by high-throughput sequencing technologies; it is based on a reference probabilistic de Bruijn Graph and yields higher compression rates without losing information pertinent for downstream analyses.
Abstract: The data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip. We present a novel reference-free method to compress data produced by high-throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent for downstream analyses. Leon was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, Leon showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, Leon divided the original file size by more than 20. Leon is an open source software, distributed under the GNU Affero GPL License, available for download at http://gatb.inria.fr/software/leon/ .
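
The central data structure is easy to sketch: read k-mers are inserted into a Bloom filter, giving a probabilistic de Bruijn graph that can be walked from an anchoring k-mer. The toy Python below (filter size, hash count, and sequences are arbitrary assumptions, not Leon's tuned implementation) shows the walk stopping at a bifurcation, which Leon would record explicitly.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1 << 16, n_hashes=3):
        self.size, self.n_hashes, self.bits = size, n_hashes, bytearray(size)
    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1
    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

k, read = 5, "ACGTACGTGA"
bf = BloomFilter()
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])            # insert every k-mer of the read

# Walk the graph from the anchor k-mer: query the four possible successors
# at each step; a unique hit extends the path, while >1 hits is a
# bifurcation that must be recorded to resume decoding.
path = read[:k]
while True:
    nxt = [b for b in "ACGT" if path[-(k - 1):] + b in bf]
    if len(nxt) != 1:
        break
    path += nxt[0]
print(path)  # extends from the anchor until the first bifurcation
```

The Bloom filter trades a small false-positive rate for a very compact graph representation, which is what makes the de Bruijn graph "probabilistic".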

Journal ArticleDOI
TL;DR: This work demonstrates that reference-free algorithms for cell-mixture adjustment can produce biologically valid results, separating cell-mediated epigenetic effects from those that are not cell mediated, and that in general the interpretation of associations evident from DNA methylation should be carefully considered.
Abstract: The impact of cell-composition effects in analysis of DNA methylation data is now widely appreciated. With the availability of a reference data set consisting of DNA methylation measurements on isolated cell types, it is possible to impute cell proportions and adjust for them, but there is increasing interest in methods that adjust for cell composition effects when reference sets are incomplete or unavailable. In this article we present a theoretical basis for one such method, showing that the total effect of a phenotype on DNA methylation can be decomposed into orthogonal components, one representing the effect of phenotype on proportions of major cell types, the other representing either subtle effects in composition or global effects at focused loci, and that it is possible to separate these two types of effects in a finite data set. We demonstrate this principle empirically on nine DNA methylation data sets, showing that the first few principal components generally contain a majority of the information on cell-type present in the data, but that later principal components nevertheless contain information about a small number of loci that may represent more focused associations. We also present a new method for determining the number of linear terms to interpret as cell-mixture effects and demonstrate robustness to the choice of this parameter. Taken together, our work demonstrates that reference-free algorithms for cell-mixture adjustment can produce biologically valid results, separating cell-mediated epigenetic effects (i.e. apparent effects arising from differences in cell composition) from those that are not cell mediated, and that in general the interpretation of associations evident from DNA methylation should be carefully considered.

Journal ArticleDOI
TL;DR: Mass-Up brings knowledge discovery within reach of MALDI-TOF-MS researchers by allowing data preprocessing, as well as subsequent analysis including biomarker discovery, clustering, biclustering and three-dimensional PCA visualization.
Abstract: Mass spectrometry is one of the most important techniques in the field of proteomics. MALDI-TOF mass spectrometry has become popular during the last decade due to its high speed and sensitivity for detecting proteins and peptides. MALDI-TOF-MS can also be used in combination with machine learning techniques and statistical methods for knowledge discovery. Although there are many software libraries and tools that can be combined for this kind of analysis, there is still a need for all-in-one solutions with user-friendly graphical interfaces that avoid the need for programming skills. Mass-Up, an open, multiplatform software application for MALDI-TOF-MS knowledge discovery, is herein presented. Mass-Up software allows data preprocessing, as well as subsequent analysis including (i) biomarker discovery, (ii) clustering, (iii) biclustering, (iv) three-dimensional PCA visualization and (v) classification of large sets of spectra data. Mass-Up brings knowledge discovery within reach of MALDI-TOF-MS researchers. Mass-Up is distributed under license GPLv3 and it is open and free to all users at http://sing.ei.uvigo.es/mass-up .

Journal ArticleDOI
TL;DR: A broad analysis of the impact of five well-known missing value imputations methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets, using a statistical framework to assess whether different imputation methods improve the performance of the clustering/classification methods.
Abstract: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by the mean or median performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
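
The two simple baselines highlighted in the conclusion are easy to make concrete; the sketch below performs per-gene mean and median imputation on a made-up expression matrix.

```python
import numpy as np

expr = np.array([[1.0, 2.0, np.nan],
                 [0.5, np.nan, 3.0],
                 [np.nan, 1.5, 2.5]])   # genes x samples, NaN = missing

def impute(matrix, stat=np.nanmean):
    """Replace each missing value with a per-gene statistic (mean by
    default) computed over the observed samples of that gene."""
    filled = matrix.copy()
    row_stats = stat(matrix, axis=1)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = row_stats[rows]
    return filled

print(impute(expr))                  # mean imputation
print(impute(expr, np.nanmedian))    # median imputation
```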

Journal ArticleDOI
TL;DR: ProteinVolume is a highly accurate and fast application to interrogate geometric volumes of proteins and uses an innovative flood-fill algorithm to calculate the individual components of the molecular surface volume, van der Waals and intramolecular void volumes.
Abstract: Voids and cavities in the native protein structure determine the pressure unfolding of proteins. In addition, the volume changes due to the interaction of newly exposed atoms with solvent upon protein unfolding also contribute to the pressure unfolding of proteins. Quantitative understanding of these effects is important for predicting and designing proteins with predefined response to changes in hydrostatic pressure using computational approaches. The molecular surface volume is a useful metric that describes contribution of geometrical volume, which includes van der Waals volume and volume of the voids, to the total volume of a protein in solution, thus isolating the effects of hydration for separate calculations. We developed ProteinVolume, a highly robust and easy-to-use tool to compute geometric volumes of proteins. ProteinVolume generates the molecular surface of a protein and uses an innovative flood-fill algorithm to calculate the individual components of the molecular surface volume, van der Waals and intramolecular void volumes. ProteinVolume is user friendly and is available as a web-server or a platform-independent command-line version. ProteinVolume is a highly accurate and fast application to interrogate geometric volumes of proteins. ProteinVolume is a free web server available on http://gmlab.bio.rpi.edu . Free-standing platform-independent Java-based ProteinVolume executable is also freely available at this web site.
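
The flood-fill idea can be sketched on a voxel grid: flood-filling the empty space from the box boundary labels the exterior solvent, and any empty voxels left unreached are interior voids. The toy grid below is an assumption for illustration; ProteinVolume builds its grid from the molecular surface rather than from a hand-made block.

```python
import numpy as np
from collections import deque

grid = np.zeros((6, 6, 6), dtype=int)
grid[1:5, 1:5, 1:5] = 1        # a solid 4x4x4 "protein" block...
grid[2:4, 2:4, 2:4] = 0        # ...with a 2x2x2 internal cavity

exterior = np.zeros_like(grid, dtype=bool)
queue = deque([(0, 0, 0)])     # a boundary voxel known to be solvent
exterior[0, 0, 0] = True
while queue:                   # breadth-first flood fill of exterior space
    x, y, z = queue.popleft()
    for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        nx, ny, nz = x + dx, y + dy, z + dz
        if (0 <= nx < 6 and 0 <= ny < 6 and 0 <= nz < 6
                and grid[nx, ny, nz] == 0 and not exterior[nx, ny, nz]):
            exterior[nx, ny, nz] = True
            queue.append((nx, ny, nz))

void_voxels = np.sum((grid == 0) & ~exterior)   # empty but unreachable
print("void volume:", void_voxels, "voxels")    # 8 = the 2x2x2 cavity
```

Summing the protein voxels and the void voxels, scaled by the voxel volume, gives the geometric volume components the abstract describes.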