
Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2013"


Journal ArticleDOI
TL;DR: The GenomeTools, a convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs, strictly follow the annotation graph approach, offering a unified graph-based representation.
Abstract: Genome annotations are often published as plain text files describing genomic features and their subcomponents by an implicit annotation graph. In this paper, we present the GenomeTools, a convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs. The GenomeTools strictly follow the annotation graph approach, offering a unified graph-based representation. This gives the developer intuitive and immediate access to genomic features and tools for their manipulation. To process large annotation sets with low memory overhead, we have designed and implemented an efficient pull-based approach for sequential processing of annotations. This allows even the largest annotation sets, such as a complete catalogue of human variations, to be handled. Our object-oriented C-based software library enables a developer to conveniently implement their own functionality on annotation graphs and to integrate it into larger workflows, simultaneously accessing compressed sequence data if required. The careful C implementation of the GenomeTools not only ensures a light-weight memory footprint while allowing full sequential as well as random access to the annotation graph, but also facilitates the creation of bindings to a variety of scripting languages (like Python and Ruby) sharing the same interface.

330 citations


Journal ArticleDOI
TL;DR: The proposed NFV-AAA approach performs well in the field of sequence similarity analysis; its results are compared with those from other related works.
Abstract: Based on all kinds of adjacent amino acids (AAA), we map each protein primary sequence into a 400 by (L-1) matrix M. In addition, we further derive a normalized 400-tuple mathematical descriptor D, which is extracted from the primary protein sequence via singular value decomposition (SVD) of the matrix. The obtained 400-D normalized feature vectors (NFVs) further facilitate our quantitative analysis of protein sequences. Using the normalized representation of the primary protein sequences, we analyze the similarity of different sequences on two data sets: 1) ND5 sequences from nine species and 2) transferrin sequences of 24 vertebrates. We also compare the results of this study with those from other related works. These two experiments illustrate that our proposed NFV-AAA approach performs well in the field of sequence similarity analysis.
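
The pipeline described in the abstract (an adjacent-pair indicator matrix, then SVD, then normalization) can be sketched in a few lines. This is a hedged illustration, not the authors' code: the exact construction of the matrix M and the way the 400-D descriptor is read off the SVD are assumptions made for this sketch.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index the 400 possible adjacent amino acid pairs.
PAIR_INDEX = {
    a + b: 20 * i + j
    for i, a in enumerate(AMINO_ACIDS)
    for j, b in enumerate(AMINO_ACIDS)
}

def aaa_matrix(seq):
    """400 x (L-1) indicator matrix: column k marks the pair at positions k, k+1."""
    M = np.zeros((400, len(seq) - 1))
    for k in range(len(seq) - 1):
        M[PAIR_INDEX[seq[k:k + 2]], k] = 1.0
    return M

def nfv(seq):
    """Normalized 400-D feature vector from the leading left singular vector
    (a hypothetical reading of the paper's descriptor D)."""
    U, s, _ = np.linalg.svd(aaa_matrix(seq), full_matrices=False)
    v = np.abs(U[:, 0]) * s[0]
    return v / np.linalg.norm(v)
```

Two sequences can then be compared by, for example, the Euclidean distance between their NFVs.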

138 citations


Journal ArticleDOI
TL;DR: An efficient method is proposed to select initial prototypes of different gene clusters, which enables the proposed c-means algorithm to converge to an optimal or near-optimal solution and helps to discover coexpressed gene clusters.
Abstract: Gene expression data clustering is one of the important tasks of functional genomics as it provides a powerful tool for studying functional relationships of genes in a biological process. Identifying coexpressed groups of genes represents the basic challenge in the gene clustering problem. In this regard, a gene clustering algorithm, termed robust rough-fuzzy c-means, is proposed, judiciously integrating the merits of rough sets and fuzzy sets. While the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in cluster definition, the integration of probabilistic and possibilistic memberships of fuzzy sets enables efficient handling of overlapping partitions in a noisy environment. The concept of the possibilistic lower bound and probabilistic boundary of a cluster, introduced in robust rough-fuzzy c-means, enables efficient selection of gene clusters. An efficient method is proposed to select initial prototypes of different gene clusters, which enables the proposed c-means algorithm to converge to an optimal or near-optimal solution and helps to discover coexpressed gene clusters. The effectiveness of the algorithm, along with a comparison with other algorithms, is demonstrated both qualitatively and quantitatively on 14 yeast microarray data sets.

95 citations


Journal ArticleDOI
TL;DR: A new ligand-specific template-free predictor called TargetS for targeting protein-ligand binding sites from primary sequences that achieves high performance and outperforms many existing predictors.
Abstract: Accurately identifying protein-ligand binding sites or pockets is of significant importance for both protein function analysis and drug design. Although much progress has been made, challenges remain, especially when the 3D structures of target proteins are not available or no homology templates can be found in the library, in which case template-based methods are difficult to apply. In this paper, we report a new ligand-specific template-free predictor called TargetS for targeting protein-ligand binding sites from primary sequences. TargetS first predicts the binding residues along the sequence with a ligand-specific strategy and then further identifies the binding sites from the predicted binding residues through a recursive spatial clustering algorithm. Protein evolutionary information, predicted protein secondary structure, and ligand-specific binding propensities of residues are combined to construct discriminative features; an improved AdaBoost classifier ensemble scheme based on random undersampling is proposed to deal with the serious imbalance between positive (binding) and negative (nonbinding) samples. Experimental results demonstrate that TargetS achieves high performance and outperforms many existing predictors. The TargetS web server and data sets are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetS/ for academic use.

92 citations


Journal ArticleDOI
TL;DR: An extensive comparison of various colon cancer detection categories, and of multiple techniques within each category, is provided; most of the techniques have been evaluated on a similar data set to provide a fair performance comparison.
Abstract: Colon cancer causes the deaths of about half a million people every year. The most common method of detection is histopathological tissue analysis which, though it leads to vital diagnoses, is significantly affected by the fatigue, experience, and workload of the pathologist. Researchers have been working for decades to move beyond manual inspection and to develop trustworthy systems for detecting colon cancer. Several techniques, based on spectral/spatial analysis of colon biopsy images and on serum and gene analysis of colon samples, have been proposed in this regard. Due to the rapid evolution of colon cancer detection techniques, an up-to-date review of recent research in this field is highly desirable. The aim of this paper is to discuss various colon cancer detection techniques. In this survey, we categorize the techniques on the basis of the adopted methodology and underlying data set, and provide a detailed description of the techniques in each category. Additionally, this study provides an extensive comparison of various colon cancer detection categories, and of multiple techniques within each category. Further, most of the techniques have been evaluated on a similar data set to provide a fair performance comparison. Analysis reveals that none of the techniques is perfect; however, the research community is progressively inching toward the best possible solution.

88 citations


Journal ArticleDOI
TL;DR: A new approach is proposed in the form of controlled search space to stabilize the randomness of swarm intelligence techniques especially for the EEG signal, which is found to be more accurate and powerful.
Abstract: This paper explores the migration of adaptive filtering with swarm intelligence/evolutionary techniques employed in the field of electroencephalogram/event-related potential noise cancellation or extraction. A new approach is proposed in the form of a controlled search space to stabilize the randomness of swarm intelligence techniques, especially for the EEG signal. Swarm-based algorithms such as Particle Swarm Optimization, Artificial Bee Colony, and the Cuckoo Optimization Algorithm, with their variants, are implemented to design optimized adaptive noise cancelers. The proposed controlled search space technique is tested on each of the swarm intelligence techniques and is found to be more accurate and powerful. Adaptive noise cancelers with traditional algorithms such as the least-mean-square, normalized least-mean-square, and recursive least-squares algorithms are also implemented to compare the results. ERP signals such as simulated visual evoked potential, real visual evoked potential, and real sensorimotor evoked potential are used due to their physiological importance in various EEG studies. The average computational time and shape measure of the evolutionary techniques are observed to be 8.21E-01 s and 1.73E-01, respectively. Although the traditional algorithms consume negligible time, they are unable to offer good shape preservation of the ERP, as reflected in their average computational time and shape measure of 1.41E-02 s and 2.60E+00, respectively.

70 citations


Journal ArticleDOI
TL;DR: A general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO), and a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression).
Abstract: In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios, while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios far beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
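
As a toy illustration of the referential idea (not FRESCO's algorithm, which relies on precomputed index structures for speed), a greedy compressor can encode the input as (reference offset, match length, literal) triples:

```python
def ref_compress(target, reference):
    """Greedy referential encoding: longest reference match, then one literal."""
    triples, i = [], 0
    while i < len(target):
        best_off, best_len = 0, 0
        for off in range(len(reference)):
            l = 0
            while (off + l < len(reference) and i + l < len(target)
                   and reference[off + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        nxt = target[i + best_len] if i + best_len < len(target) else ""
        triples.append((best_off, best_len, nxt))
        i += best_len + 1
    return triples

def ref_decompress(triples, reference):
    """Rebuild the target by copying reference spans and appending literals."""
    return "".join(reference[o:o + l] + c for o, l, c in triples)
```

For highly similar sequences, a handful of triples replaces the whole input, which is the effect FRESCO exploits at genome scale.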

69 citations


Journal ArticleDOI
TL;DR: A reference-point-based nondominated sorting composite differential evolution (RP-NSCDE) is developed to tackle the multiobjective identification of controlling areas in the neuronal network of a cat's brain by considering two measures of controllability simultaneously.
Abstract: In this paper, we investigate the multiobjective identification of controlling areas in the neuronal network of a cat's brain by considering two measures of controllability simultaneously. By utilizing nondominated sorting mechanisms and composite differential evolution (CoDE), a reference-point-based nondominated sorting composite differential evolution (RP-NSCDE) is developed to tackle the multiobjective identification of controlling areas in the neuronal network. The proposed RP-NSCDE shows promising performance in terms of accuracy and convergence speed in comparison to the nondominated sorting genetic algorithm II (NSGA-II). The proposed method is also compared with other representative statistical methods in complex network theory, as well as single-objective and constrained optimization methods, to illustrate its effectiveness and reliability. It is shown that there exists a tradeoff between minimizing the two objectives, and therefore Pareto fronts (PFs) can be plotted. The developed approaches and findings can also be applied to the coordination control of various kinds of real-world complex networks, including biological and social networks.

63 citations


Journal ArticleDOI
TL;DR: The introduction of SuperQ, a new method for constructing phylogenetic supernetworks, which has the advantage of producing a planar network, and an analysis of some published data sets as an illustration of its applicability.
Abstract: Supertrees are a commonly used tool in phylogenetics to summarize collections of partial phylogenetic trees. As a generalization of supertrees, phylogenetic supernetworks allow, in addition, the visual representation of conflict between the trees that is not possible to observe with a single tree. Here, we introduce SuperQ, a new method for constructing such supernetworks (SuperQ is freely available at www.uea.ac.uk/computing/superq). It works by first breaking the input trees into quartet trees, and then stitching these together to form a special kind of phylogenetic network, called a split network. This stitching process is performed using an adaptation of the QNet method for split network reconstruction, employing a novel approach that uses the branch lengths from the input trees to estimate the branch lengths in the resulting network. Compared with previous supernetwork methods, SuperQ has the advantage of producing a planar network. We compare the performance of SuperQ to the Z-closure and Q-imputation supernetwork methods, and also present an analysis of some published data sets as an illustration of its applicability.

61 citations


Journal ArticleDOI
TL;DR: A novel feature extraction model that incorporates physicochemical and evolutionary-based information simultaneously is proposed, and enhancement of the protein structural class prediction accuracy is shown for four popular benchmarks.
Abstract: Better understanding of structural class of a given protein reveals important information about its overall folding type and its domain. It can also be directly used to provide critical information on general tertiary structure of a protein which has a profound impact on protein function determination and drug design. Despite tremendous enhancements made by pattern recognition-based approaches to solve this problem, it still remains as an unsolved issue for bioinformatics that demands more attention and exploration. In this study, we propose a novel feature extraction model that incorporates physicochemical and evolutionary-based information simultaneously. We also propose overlapped segmented distribution and autocorrelation-based feature extraction methods to provide more local and global discriminatory information. The proposed feature extraction methods are explored for 15 most promising attributes that are selected from a wide range of physicochemical-based attributes. Finally, by applying an ensemble of different classifiers namely, Adaboost.M1, LogitBoost, naive Bayes, multilayer perceptron (MLP), and support vector machine (SVM) we show enhancement of the protein structural class prediction accuracy for four popular benchmarks.

60 citations


Journal ArticleDOI
TL;DR: The proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.
Abstract: Cancer class discovery using biomolecular data is one of the most important tasks for cancer diagnosis and treatment. Tumor clustering from gene expression data provides a new way to perform cancer class discovery. Most existing research works adopt single clustering algorithms to perform tumor clustering from biomolecular data, which lack robustness, stability, and accuracy. To further improve the performance of tumor clustering from biomolecular data, we introduce the fuzzy theory into the cluster ensemble framework for tumor clustering from biomolecular data, and propose four kinds of hybrid fuzzy cluster ensemble frameworks (HFCEF), named HFCEF-I, HFCEF-II, HFCEF-III, and HFCEF-IV, respectively, to identify samples that belong to different types of cancers. The difference between HFCEF-I and HFCEF-II is that they adopt different ensemble generator approaches to generate a set of fuzzy matrices in the ensemble. Specifically, HFCEF-I applies the affinity propagation algorithm (AP) to perform clustering on the sample dimension and generates a set of fuzzy matrices in the ensemble based on the fuzzy membership function and base samples selected by AP. HFCEF-II adopts AP to perform clustering on the attribute dimension, generates a set of subspaces, and obtains a set of fuzzy matrices in the ensemble by performing fuzzy c-means on subspaces. Compared with HFCEF-I and HFCEF-II, HFCEF-III and HFCEF-IV consider the characteristics of HFCEF-I and HFCEF-II. HFCEF-III combines HFCEF-I and HFCEF-II in a serial way, while HFCEF-IV integrates HFCEF-I and HFCEF-II in a concurrent way. HFCEFs adopt suitable consensus functions, such as the fuzzy c-means algorithm or the normalized cut algorithm (Ncut), to summarize generated fuzzy matrices, and obtain the final results.
The experiments on real data sets from UCI machine learning repository and cancer gene expression profiles illustrate that 1) the proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and 2) the proposed approaches are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.

Journal ArticleDOI
TL;DR: A transductive multilabel classifier (TMC) is developed to predict multiple functions of proteins using several unlabeled proteins, and a method for integrating the different data sources using an ensemble approach is proposed.
Abstract: High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. To computationally annotate proteins, it is necessary and promising to integrate these heterogeneous data sources. Some methods transform these data sources into different kernels or feature representations. Next, these kernels are linearly (or nonlinearly) combined into a composite kernel. The composite kernel is utilized to develop a predictive model to infer the function of proteins. A protein can have multiple roles and functions (or labels). Therefore, multilabel learning methods are also adapted for protein function prediction. We develop a transductive multilabel classifier (TMC) to predict multiple functions of proteins using several unlabeled proteins. We also propose a method called transductive multilabel ensemble classifier (TMEC) for integrating the different data sources using an ensemble approach. The TMEC trains a graph-based multilabel classifier on each single data source, and then combines the predictions of the individual classifiers. We use a directed birelational graph to capture the relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the effectiveness of the TMC and TMEC to predict the functions of proteins on three benchmarks. We show that our approaches perform better than recently proposed protein function prediction methods on composite and multiple kernels. The code, data sets used in this paper and supplemental material are available at https://sites.google.com/site/guoxian85/tmec.

Journal ArticleDOI
TL;DR: The results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance.
Abstract: Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
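
For reference, minimal versions of three of the compared measures (Pearson correlation, Spearman correlation without tie handling, and Euclidean distance) look like this; the benchmark itself evaluates 16 measures and is not reproduced here:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Positions in sorted order (ties not averaged, for brevity)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
```

A monotone but nonlinear relationship gives a perfect Spearman score while Pearson drops below 1, which is one reason the measures behave differently on expression data.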

Journal ArticleDOI
TL;DR: The study systematically evaluates the joint effect of 23 SNP combinations from six steroid hormone metabolism- and signaling-related genes involved in breast carcinogenesis pathways, with the IGA successfully detecting significant ratio differences between breast cancer cases and noncancer cases.
Abstract: Genetic association is a challenging task for the identification and characterization of genes that increase the susceptibility to common complex multifactorial diseases. To fully execute genetic studies of complex diseases, modern geneticists face the challenge of detecting interactions between loci. A genetic algorithm (GA) is developed to detect the association of genotype frequencies of cancer cases and noncancer cases based on statistical analysis. An improved genetic algorithm (IGA) is proposed to improve the reliability of the GA method for high-dimensional SNP-SNP interactions. The strategy feeds the top five results back into the random population process, where they guide the GA toward a significant search course. The IGA increases the likelihood of quickly detecting the maximum ratio difference between cancer cases and noncancer cases. The study systematically evaluates the joint effect of 23 SNP combinations from six steroid hormone metabolism- and signaling-related genes involved in breast carcinogenesis pathways, with the IGA successfully detecting significant ratio differences between breast cancer cases and noncancer cases. The possible breast cancer risks were subsequently analyzed by odds-ratio (OR) and risk-ratio analysis. The estimated OR of the best SNP barcode is significantly higher than 1 (between 1.15 and 7.01) for specific combinations of two to 13 SNPs. Analysis results support that the IGA provides higher ratio-difference values than the GA between breast cancer cases and noncancer cases over 3-SNP to 13-SNP interactions. A more specific SNP-SNP interaction profile for the risk of breast cancer is also provided.
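
The odds-ratio step mentioned above reduces, for a fixed SNP barcode, to a 2x2 case-control contingency table. A minimal sketch with hypothetical counts (not values from the study):

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """OR = (a/b) / (c/d) for a 2x2 case-control contingency table."""
    return (case_with / case_without) / (control_with / control_without)

# Hypothetical counts: barcode present in 30/100 cases vs 15/100 controls.
or_value = odds_ratio(30, 70, 15, 85)
```

An OR significantly above 1, as reported for the best barcodes (1.15 to 7.01), indicates elevated risk among carriers of the barcode.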

Journal ArticleDOI
TL;DR: With preliminary energy parameters, it is found that the overwhelming majority of putative quadruplex-forming sequences in the human genome are likely to fold into canonical secondary structures instead.
Abstract: G-quadruplexes are abundant, locally stable structural elements in nucleic acids. The combinatorial theory of RNA structures and the dynamic programming algorithms for RNA secondary structure prediction are extended here to incorporate G-quadruplexes using a simple but plausible energy model. With preliminary energy parameters, we find that the overwhelming majority of putative quadruplex-forming sequences in the human genome are likely to fold into canonical secondary structures instead. Stable G-quadruplexes are strongly enriched, however, in the 5'UTRs of protein-coding mRNAs.

Journal ArticleDOI
TL;DR: A relative comparison among different techniques in predicting 12 known RNA secondary structures is presented as an example, and future challenging issues are then mentioned.
Abstract: Prediction of RNA structure is invaluable in creating new drugs and understanding genetic diseases. Several deterministic algorithms and soft computing-based techniques have been developed for more than a decade to determine the structure from a known RNA sequence. Soft computing gained importance with the need to get approximate solutions for RNA sequences by considering the issues related with kinetic effects, cotranscriptional folding, and estimation of certain energy parameters. A brief description of some of the soft computing-based techniques, developed for RNA secondary structure prediction, is presented along with their relevance. The basic concepts of RNA and its different structural elements like helix, bulge, hairpin loop, internal loop, and multiloop are described. These are followed by different methodologies, employing genetic algorithms, artificial neural networks, and fuzzy logic. The role of various metaheuristics, like simulated annealing, particle swarm optimization, ant colony optimization, and tabu search is also discussed. A relative comparison among different techniques, in predicting 12 known RNA secondary structures, is presented, as an example. Future challenging issues are then mentioned.

Journal ArticleDOI
TL;DR: Experimental results seem to indicate that attributes of the proteins in the same complex do have some association with each other and, therefore, that protein complexes can be more accurately identified when protein attributes are taken into consideration.
Abstract: Many computational approaches developed to identify protein complexes in protein-protein interaction (PPI) networks perform their tasks based only on network topologies. The attributes of the proteins in the networks are usually ignored. As protein attributes within a complex may also be related to each other, we have developed a PCIA algorithm to take into consideration both such information and network topology in the identification process of protein complexes. Given a PPI network, PCIA first finds information about the attributes of the proteins in a PPI network in the Gene Ontology databases and uses such information for the identification of protein complexes. PCIA then computes a Degree of Association measure for each pair of interacting proteins to quantitatively determine how much their attribute values associate with each other. Based on this association measure, PCIA is able to discover dense graph clusters consisting of proteins whose attribute values are significantly closer associated with each other. PCIA has been tested with real data and experimental results seem to indicate that attributes of the proteins in the same complex do have some association with each other and, therefore, that protein complexes can be more accurately identified when protein attributes are taken into consideration.

Journal ArticleDOI
TL;DR: This work introduces the consistency condition that is sufficient for a general function to satisfy the plateau property, and introduces general linear-time solutions that identify optimal rootings and all rooting costs.
Abstract: Tree comparison functions are widely used in phylogenetics for comparing evolutionary trees. Unrooted trees can be compared with rooted trees by identifying all rootings of the unrooted tree that minimize some provided comparison function between two rooted trees. The plateau property is satisfied by the provided function if all optimal rootings form a subtree, or plateau, in the unrooted tree, from which the rootings along every path toward a leaf have monotonically increasing costs. This property is sufficient for the linear-time identification of all optimal rootings and rooting costs. However, the plateau property has only been proven for a few rooted comparison functions, requiring individual proofs for each function without benefitting from inherent structural features of such functions. Here, we introduce the consistency condition that is sufficient for a general function to satisfy the plateau property. For consistent functions, we introduce general linear-time solutions that identify optimal rootings and all rooting costs. Further, we identify novel relationships between consistent functions in terms of plateaus; in particular, the plateau of the well-studied duplication-loss function is part of a plateau of every other consistent function. We introduce a novel approach for identifying consistent cost functions by defining a formal language of Boolean costs. Formulas in this language can be interpreted as cost functions. Finally, we demonstrate the performance of our general linear-time solutions in practice using empirical and simulation studies.

Journal ArticleDOI
TL;DR: A novel signal processing-based algorithm is developed by enabling the window length adaptation in STFT of DNA sequences for improving the identification of three-base periodicity.
Abstract: Signal processing-based algorithms for identification of coding sequences (CDS) in eukaryotes are non-data-driven and exploit the presence of three-base periodicity in these regions for their detection. Three-base periodicity is commonly detected using the short-time Fourier transform (STFT), which uses a window function of fixed length. As the lengths of the protein coding and noncoding regions vary widely, the identification accuracy of STFT-based algorithms is poor. In this paper, a novel signal processing-based algorithm is developed by enabling window length adaptation in the STFT of DNA sequences to improve the identification of three-base periodicity. The length of the window function is made adaptive in coding regions to maximize the magnitude of the period-3 measure, whereas in noncoding regions, the window length is tailored to minimize this measure. Simulation results on benchmark data sets demonstrate the advantage of this algorithm when compared with other non-data-driven methods for CDS prediction.
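
The period-3 measure underlying such methods is the DFT magnitude of a base-indicator sequence at frequency k = N/3. A minimal fixed-window sketch (the window-length adaptation is the paper's contribution and is not reproduced here):

```python
import cmath

def base_indicator(seq, base):
    """Binary indicator sequence for one nucleotide."""
    return [1.0 if b == base else 0.0 for b in seq]

def period3_measure(indicator):
    """Squared DFT magnitude at k = N/3, the period-3 frequency.
    Assumes the window length N is a multiple of 3."""
    N = len(indicator)
    k = N // 3
    X = sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
            for n, x in enumerate(indicator))
    return abs(X) ** 2
```

A perfectly codon-periodic stretch such as "ATG" repeated gives a strong peak, while an indicator with no period-3 component gives a value near zero.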

Journal ArticleDOI
TL;DR: A survey of how wavelet analysis has been applied to cancer bioinformatics questions is provided, and several approaches to representing biological sequence data numerically, along with methods of using wavelet analysis on the numerical sequences, are discussed.
Abstract: With the rapid development of next-generation sequencing technology, the amount of biological sequence data of the cancer genome increases exponentially, which calls for efficient and effective algorithms that may identify patterns hidden underneath the raw data that distinguish cancer's Achilles' heels. From a signal processing point of view, biological units of information, including DNA and protein sequences, have been viewed as one-dimensional signals. Therefore, researchers have been applying signal processing techniques to mine the potentially significant patterns within these sequences. More specifically, in recent years, wavelet transforms have become an important mathematical analysis tool, with a wide and ever-increasing range of applications. The versatility of wavelet analytic techniques has forged new interdisciplinary bonds by offering common solutions to apparently diverse problems and providing a new unifying perspective on problems of cancer genome research. In this paper, we provide a survey of how wavelet analysis has been applied to cancer bioinformatics questions. Specifically, we discuss several approaches to representing biological sequence data numerically and methods of using wavelet analysis on the numerical sequences.
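
As a concrete instance of the workflow surveyed here, the snippet below maps bases to numbers (one simple choice among several representations discussed in this literature) and applies one level of the Haar wavelet, the simplest wavelet transform, splitting the signal into coarse averages and local details:

```python
# One simple base-to-number mapping; many other representations exist.
NUM = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def dna_to_signal(seq):
    """Numerical representation of a DNA sequence."""
    return [NUM[b] for b in seq]

def haar_step(signal):
    """One Haar level: pairwise averages (approximation) and
    half-differences (detail). Assumes an even-length signal."""
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return approx, detail
```

The step is invertible (each pair is recovered as average plus or minus detail), so repeating it on the approximation yields the usual multiresolution decomposition.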

Journal ArticleDOI
TL;DR: A novel family of nonnegative least-squares classifiers for high-dimensional microarray gene expression and comparative genomic hybridization data is proposed based on combining the advantages of using local learning, transductive learning, and ensemble learning, for better prediction performance.
Abstract: Microarray data can be used to detect diseases and predict responses to therapies through classification models. However, the high dimensionality and low sample size of such data result in many computational problems, such as reduced prediction accuracy and slow classification speed. In this paper, we propose a novel family of nonnegative least-squares classifiers for high-dimensional microarray gene expression and comparative genomic hybridization data. Our approaches are based on combining the advantages of local learning, transductive learning, and ensemble learning for better prediction performance. To study the performance of our methods, we performed computational experiments on 17 well-known data sets with diverse characteristics. We also performed statistical comparisons with many classification techniques, including the well-performing SVM approach and two related but recent methods proposed in the literature. Experimental results show that our approaches are faster and generally achieve better prediction performance than the compared methods.

Journal ArticleDOI
TL;DR: Testing demonstrates that a soft energy bias steers sampling toward a diverse decoy ensemble less prone to exploiting energetic artifacts and thus more likely to facilitate retention of near-native conformations by selection techniques.
Abstract: Adequate sampling of the conformational space is a central challenge in ab initio protein structure prediction. In the absence of a template structure, a conformational search procedure guided by an energy function explores the conformational space, gathering an ensemble of low-energy decoy conformations. If the sampling is inadequate, the native structure may be missed altogether. Even if reproduced, a subsequent stage that selects a subset of decoys for further structural detail and energetic refinement may discard near-native decoys if they are high energy or insufficiently represented in the ensemble. Sampling should produce a decoy ensemble that facilitates the subsequent selection of near-native decoys. In this paper, we investigate a robotics-inspired framework that allows directly measuring the role of energy in guiding sampling. Testing demonstrates that a soft energy bias steers sampling toward a diverse decoy ensemble less prone to exploiting energetic artifacts and thus more likely to facilitate retention of near-native conformations by selection techniques. We employ two different energy functions, the associative memory Hamiltonian with water and Rosetta. Results show that enhanced sampling provides a rigorous testing of energy functions and exposes different deficiencies in them, thus promising to guide development of more accurate representations and energy functions.

Journal ArticleDOI
TL;DR: This paper constructed ontology attributed PPI networks with PPI data and GO resource and proposed a novel approach called CSO (clustering based on network structure and ontology attribute similarity), which showed that CSO was valuable in predicting protein complexes and achieved state-of-the-art performance.
Abstract: Protein complexes are important for unraveling the secrets of cellular organization and function. Many computational approaches have been developed to predict protein complexes in protein-protein interaction (PPI) networks. However, most existing approaches focus mainly on the topological structure of PPI networks, and largely ignore the gene ontology (GO) annotation information. In this paper, we constructed ontology attributed PPI networks with PPI data and GO resource. After constructing ontology attributed networks, we proposed a novel approach called CSO (clustering based on network structure and ontology attribute similarity). Structural information and GO attribute information are complementary in ontology attributed networks. CSO can effectively take advantage of the correlation between frequent GO annotation sets and the dense subgraph for protein complex prediction. Our proposed CSO approach was applied to four different yeast PPI data sets and predicted many well-known protein complexes. The experimental results showed that CSO was valuable in predicting protein complexes and achieved state-of-the-art performance.
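The idea of combining topological and GO-annotation similarity can be sketched as follows (a generic blend, not CSO's actual scoring; the Jaccard choice and the `alpha` weighting are assumptions for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(neighbors, go_terms, u, v, alpha=0.5):
    """Blend topological similarity (shared interaction partners) with
    GO-annotation similarity; alpha weights the two components."""
    topo = jaccard(neighbors[u], neighbors[v])
    onto = jaccard(go_terms[u], go_terms[v])
    return alpha * topo + (1 - alpha) * onto

# Two proteins with identical partners but partially overlapping GO terms.
neighbors = {'p1': {'p3', 'p4'}, 'p2': {'p3', 'p4'}}
go_terms = {'p1': {'GO:0006915'}, 'p2': {'GO:0006915', 'GO:0008283'}}
print(combined_similarity(neighbors, go_terms, 'p1', 'p2'))  # 0.75
```

A clustering step over such a combined similarity matrix would then group proteins into candidate complexes.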

Journal ArticleDOI
TL;DR: Current algorithms for assembling transcripts and genes from next generation sequencing reads aligned to a reference genome are reviewed, and areas for future improvements are laid out.
Abstract: Next generation sequencing technologies provide unprecedented power to explore the repertoire of genes and their alternative splice variants, collectively defining the transcriptome of a species in great detail. However, assembling the short reads into full-length gene and transcript models presents significant computational challenges. We review current algorithms for assembling transcripts and genes from next generation sequencing reads aligned to a reference genome, and lay out areas for future improvements.

Journal ArticleDOI
TL;DR: This paper applies HCPN to model a tissue comprising multiple cells hexagonally packed in a honeycomb formation in order to describe the phenomenon of Planar Cell Polarity (PCP) signaling in Drosophila wing and constructed a family of related models, permitting different hypotheses to be explored regarding the mechanisms underlying PCP.
Abstract: Modeling across multiple scales is a current challenge in Systems Biology, especially when applied to multicellular organisms. In this paper, we present an approach to model at different spatial scales, using the new concept of Hierarchically Colored Petri Nets (HCPN). We apply HCPN to model a tissue comprising multiple cells hexagonally packed in a honeycomb formation in order to describe the phenomenon of Planar Cell Polarity (PCP) signaling in Drosophila wing. We have constructed a family of related models, permitting different hypotheses to be explored regarding the mechanisms underlying PCP. In addition our models include the effect of well-studied genetic mutations. We have applied a set of analytical techniques including clustering and model checking over time series of primary and secondary data. Our models support the interpretation of biological observations reported in the literature.

Journal ArticleDOI
TL;DR: A general framework based on bipartite network projections by which homogeneous pharmacological networks can be constructed and integrated from heterogeneous and complementary sources of chemical, biomolecular and clinical information is designed.
Abstract: Drug repositioning is a challenging computational problem involving the integration of heterogeneous sources of biomolecular data and the design of label ranking algorithms able to exploit the overall topology of the underlying pharmacological network. In this context, we propose a novel semisupervised drug ranking problem: prioritizing drugs in integrated biochemical networks according to specific DrugBank therapeutic categories. Algorithms for drug repositioning usually perform the inference step in an inhomogeneous similarity space induced by the relationships between drugs and a second type of entity (e.g., disease, target, ligand set), making drug ranking within a homogeneous pharmacological space unfeasible. To deal with this problem, we designed a general framework based on bipartite network projections by which homogeneous pharmacological networks can be constructed and integrated from heterogeneous and complementary sources of chemical, biomolecular, and clinical information. Moreover, we present a novel algorithmic scheme based on kernelized score functions that adopts both local and global learning strategies to effectively rank drugs in the integrated pharmacological space using different network combination methods. Detailed experiments with more than 80 DrugBank therapeutic categories involving about 1,300 FDA-approved drugs show the effectiveness of the proposed approach.
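The bipartite-projection step can be sketched in a few lines: given drug-target associations, a homogeneous drug-drug network is obtained by linking drugs that share targets (a simple shared-neighbor weight for illustration; the paper's projections and kernelized score functions are more elaborate, and the toy names are assumptions):

```python
def project_bipartite(drug_targets):
    """Project a drug-target bipartite graph onto a homogeneous
    drug-drug network: two drugs are linked with a weight equal to
    the number of targets they share."""
    drugs = list(drug_targets)
    net = {}
    for i, u in enumerate(drugs):
        for v in drugs[i + 1:]:
            shared = len(drug_targets[u] & drug_targets[v])
            if shared:
                net[(u, v)] = shared
    return net

toy = {'drugA': {'t1', 't2'}, 'drugB': {'t2', 't3'}, 'drugC': {'t4'}}
print(project_bipartite(toy))  # {('drugA', 'drugB'): 1}
```

Ranking algorithms can then operate directly on this homogeneous weighted network instead of the original two-type graph.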

Journal ArticleDOI
TL;DR: The homogeneous ribosome flow model (HRFM) is analyzed when n goes to infinity and a simple expression for the steady-state protein synthesis rate is derived and bounds are derived that show that the behavior of the HRFM for finite, and relatively small, values of n is already in good agreement with the closed-form result in the infinite-dimensional case.
Abstract: Gene translation is a central stage in the intracellular process of protein synthesis. Gene translation proceeds in three major stages: initiation, elongation, and termination. During the elongation step, ribosomes (intracellular macromolecules) link amino acids together in the order specified by messenger RNA (mRNA) molecules. The homogeneous ribosome flow model (HRFM) is a mathematical model of translation-elongation under the assumption of constant elongation rate along the mRNA sequence. The HRFM includes $n$ first-order nonlinear ordinary differential equations, where $n$ represents the length of the mRNA sequence, and two positive parameters: ribosomal initiation rate and the (constant) elongation rate. Here, we analyze the HRFM when $n$ goes to infinity and derive a simple expression for the steady-state protein synthesis rate. We also derive bounds that show that the behavior of the HRFM for finite, and relatively small, values of $n$ is already in good agreement with the closed-form result in the infinite-dimensional case. For example, for $n = 15$, the relative error is already less than 4 percent. Our results can, thus, be used in practice for analyzing the behavior of finite-dimensional HRFMs that model translation. To demonstrate this, we apply our approach to estimate the mean initiation rate in M. musculus, finding it to be around 0.17 codons per second.
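For readers who want to experiment with finite-$n$ behavior, the HRFM ODEs can be integrated numerically. The sketch below assumes the standard ribosome flow model equations with initiation rate `lam` and uniform elongation rate `lam_c` (a plain Euler integration written for this note, not the paper's closed-form analysis):

```python
def hrfm_steady_rate(n, lam, lam_c, dt=0.01, steps=50000):
    """Euler-integrate the HRFM to steady state and return the protein
    production rate lam_c * x_n, where x[i] is the occupancy of site i.
    Requires n >= 2."""
    x = [0.0] * n
    for _ in range(steps):
        dx = [0.0] * n
        # Inflow from initiation, outflow to the next site.
        dx[0] = lam * (1 - x[0]) - lam_c * x[0] * (1 - x[1])
        for i in range(1, n - 1):
            dx[i] = (lam_c * x[i - 1] * (1 - x[i])
                     - lam_c * x[i] * (1 - x[i + 1]))
        # Last site: ribosomes exit at rate lam_c * x[n-1].
        dx[n - 1] = lam_c * x[n - 2] * (1 - x[n - 1]) - lam_c * x[n - 1]
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return lam_c * x[-1]

# Steady-state rate for a short chain; occupancies settle inside (0, 1).
rate = hrfm_steady_rate(15, 1.0, 1.0)
print(0.0 < rate < 1.0)  # True
```

Sweeping `n` in such a simulation is one way to observe the rapid convergence toward the infinite-dimensional limit that the abstract reports.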

Journal ArticleDOI
TL;DR: A novel method named random label selection (RALS), which extends the simple binary relevance (BR) method, is proposed to learn from multilocation proteins in an effective and efficient way.
Abstract: Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods can only deal with single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they only adopt a simple strategy, that is, transforming the multilocation proteins to multiple proteins with a single location, which does not take correlations among different subcellular locations into account. In this paper, a novel method named random label selection (RALS), which extends the simple binary relevance (BR) method, is proposed to learn from multilocation proteins in an effective and efficient way. RALS does not explicitly find the correlations among labels, but rather implicitly attempts to learn the label correlations from data by augmenting the original feature space with randomly selected labels as additional input features. Through a fivefold cross-validation test on a benchmark data set, we demonstrate that our proposed method, which takes label correlations into consideration, clearly outperforms the baseline BR method that ignores them, indicating that correlations among different subcellular locations really exist and contribute to the improvement of prediction performance. Experimental results on two benchmark data sets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multilocations of proteins. The prediction web server is available at http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for public use.
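The label-augmentation idea is easy to sketch: randomly chosen label columns are appended to each training instance's feature vector, so a binary-relevance base learner can absorb label correlations implicitly (a simplified sketch with hypothetical names; the paper's protocol for filling in these augmented features at prediction time is not shown):

```python
import random

def augment_with_labels(X, Y, k, seed=0):
    """Append k randomly chosen label columns of Y to each feature
    vector in X (RALS-style augmentation of the input space)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(Y[0])), k)  # which labels become features
    X_aug = [x + [y[j] for j in chosen] for x, y in zip(X, Y)]
    return X_aug, chosen

X = [[0.1, 0.2], [0.3, 0.4]]
Y = [[1, 0, 1], [0, 1, 1]]  # three labels per instance
X_aug, chosen = augment_with_labels(X, Y, k=2)
print(len(X_aug[0]))  # 4: two original features + two label features
```

Each per-label classifier is then trained on the augmented space, which is how correlations between subcellular locations leak into otherwise independent binary models.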

Journal ArticleDOI
Mark Howison
TL;DR: A new storage model called SeqDB is presented, which offers high-throughput compression of sequence data with minimal sacrifice in compression ratio by combining the existing multithreaded Blosc compressor with a new data-parallel byte-packing scheme, called SeqPack, which interleaves sequence data and quality scores.
Abstract: Compression has become a critical step in storing next-generation sequencing (NGS) data sets because of both the increasing size and decreasing costs of such data. Recent research into efficiently compressing sequence data has focused largely on improving compression ratios. Yet, the throughputs of current methods now lag far behind the I/O bandwidths of modern storage systems. As biologists move their analyses to high-performance systems with greater I/O bandwidth, low-throughput compression becomes a limiting factor. To address this gap, we present a new storage model called SeqDB, which offers high-throughput compression of sequence data with minimal sacrifice in compression ratio. It achieves this by combining the existing multithreaded Blosc compressor with a new data-parallel byte-packing scheme, called SeqPack, which interleaves sequence data and quality scores.
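A byte-packing scheme that interleaves bases with quality scores can be sketched as follows (an illustrative layout, not SeqDB's actual SeqPack format: here 2 bits encode the base and 6 bits a clamped quality score, one byte per base call):

```python
# Pack each (base, quality) pair into one byte: 2 bits for the base,
# 6 bits for a quality score clamped to [0, 63].
BASE_CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
BASES = 'ACGT'

def pack(seq, quals):
    """Interleave a DNA string and its Phred qualities into raw bytes."""
    return bytes((BASE_CODE[b] << 6) | min(q, 63) for b, q in zip(seq, quals))

def unpack(packed):
    """Recover the sequence string and quality list from packed bytes."""
    seq = ''.join(BASES[byte >> 6] for byte in packed)
    quals = [byte & 63 for byte in packed]
    return seq, quals

print(unpack(pack("ACGT", [30, 40, 2, 63])))  # ('ACGT', [30, 40, 2, 63])
```

Keeping base and quality adjacent in memory is what makes such a layout friendly to block compressors like Blosc, since correlated bytes end up in the same compression block.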

Journal ArticleDOI
TL;DR: A novel model is proposed for recognizing miRNA-binding residues in proteins from sequences using a cost-sensitive extension of Laplacian support vector machines (CS-LapSVM) with a hybrid feature that consists of evolutionary information of the amino acid sequence, the conservation information about three biochemical properties and mutual interaction propensities in protein-miRNA complex structures.
Abstract: The recognition of microRNA (miRNA)-binding residues in proteins is helpful for understanding how miRNAs silence their target genes. It is difficult to use existing computational methods to predict miRNA-binding residues in proteins due to the lack of training examples. To address this issue, unlabeled data may be exploited to help construct a computational model. Semisupervised learning deals with methods for automatically exploiting unlabeled data in addition to labeled data to improve learning performance, where no human intervention is assumed. In addition, miRNA-binding proteins almost always contain a much smaller number of binding than nonbinding residues, and cost-sensitive learning has been deemed a good solution to this class imbalance problem. In this work, a novel model is proposed for recognizing miRNA-binding residues in proteins from sequences using a cost-sensitive extension of Laplacian support vector machines (CS-LapSVM) with a hybrid feature. The hybrid feature consists of evolutionary information of the amino acid sequence (position-specific scoring matrices), conservation information about three biochemical properties (HKM), and mutual interaction propensities in protein-miRNA complex structures. The CS-LapSVM achieves good performance, with an F1 score of $26.23 \pm 2.55\%$ and an AUC value of $0.805 \pm 0.020$, superior to existing approaches for the recognition of RNA-binding residues. A web server called SARS is built and freely available for academic usage.