
Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2012"


Journal ArticleDOI
TL;DR: This survey focuses on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery, and presents them in a unified framework.
Abstract: A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. The focus in this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities.

500 citations


Journal ArticleDOI
TL;DR: The experimental results on the three different networks show that the number of essential proteins discovered by NC universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC, and IC.
Abstract: Identification of essential proteins is key to understanding the minimal requirements for cellular life and important for drug design. The rapid increase of available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on network topology. However, most of them tend to focus only on the location of a single protein and ignore the relevance between interactions and protein essentiality. In this paper, a new centrality measure for identifying essential proteins based on edge clustering coefficient, named NC, is proposed. Different from previous centrality measures, NC considers both the centrality of a node and the relationship between it and its neighbors. For each interaction in the network, we calculate its edge clustering coefficient. A node's essentiality is determined by the sum of the edge clustering coefficients of the interactions connecting it and its neighbors. The new centrality measure NC takes into account the modular nature of protein essentiality. NC is applied to three different types of yeast protein-protein interaction networks, which are obtained from the DIP database, the MIPS database, and the BioGRID database, respectively. The experimental results on the three different networks show that the number of essential proteins discovered by NC universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC, and IC. Moreover, the essential proteins discovered by NC show a significant cluster effect.
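To make the measure concrete, below is a minimal sketch of an edge-clustering-coefficient-based centrality in Python. It assumes the common definition ECC(u, v) = (number of triangles containing edge (u, v)) / min(deg(u) - 1, deg(v) - 1); the function names, the use of networkx, and the example graph are illustrative and not taken from the paper.

```python
# Hypothetical sketch of an edge-clustering-coefficient (ECC) based centrality.
# Assumes ECC(u, v) = z(u, v) / min(deg(u)-1, deg(v)-1), where z(u, v) is the
# number of triangles that contain the edge (u, v).
import networkx as nx

def edge_clustering_coefficient(g: nx.Graph, u, v) -> float:
    triangles = len(set(g[u]) & set(g[v]))          # common neighbors = triangles on (u, v)
    denom = min(g.degree(u) - 1, g.degree(v) - 1)
    return triangles / denom if denom > 0 else 0.0

def nc_centrality(g: nx.Graph) -> dict:
    # A node's score is the sum of the ECCs of its incident edges.
    return {n: sum(edge_clustering_coefficient(g, n, nb) for nb in g[n]) for n in g}

if __name__ == "__main__":
    ppi = nx.karate_club_graph()                    # placeholder network, not a PPI data set
    ranked = sorted(nc_centrality(ppi).items(), key=lambda kv: kv[1], reverse=True)
    print(ranked[:5])
```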

266 citations


Journal ArticleDOI
TL;DR: The proposed algorithm first divides genes into relatively small subsets, then selects an informative smaller subset of genes from a subset and merges the chosen genes with another gene subset to update the gene subset.
Abstract: Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm for sample classification in gene expression data analysis. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the relevance of the selected genes in terms of their biological functions.
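A minimal sketch of the divide-and-merge loop described above, with an ANOVA F-score standing in for the paper's actual selection criterion; the subset sizes h and r and the scoring function are assumptions for illustration.

```python
# Illustrative sketch of a divide-and-merge gene selection loop.
# The scoring function (ANOVA F-score) and the sizes h and r are placeholders.
import numpy as np
from sklearn.feature_selection import f_classif

def divide_and_merge_selection(X, y, h=50, r=10):
    genes = list(range(X.shape[1]))
    # Partition gene indices into subsets of roughly size h.
    subsets = [genes[i:i + h] for i in range(0, len(genes), h)]
    current = subsets[0]
    for nxt in subsets[1:]:
        # Keep the r most informative genes of the current pool...
        scores, _ = f_classif(X[:, current], y)
        keep = [current[i] for i in np.argsort(scores)[::-1][:r]]
        # ...and merge them with the next subset before repeating.
        current = keep + nxt
    scores, _ = f_classif(X[:, current], y)
    return [current[i] for i in np.argsort(scores)[::-1][:r]]
```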

186 citations


Journal ArticleDOI
TL;DR: This paper gives a comprehensive review of the application of metaheuristics to optimization problems in systems biology, mainly focusing on the parameter estimation problem (also called the inverse problem or model calibration).
Abstract: This paper gives a comprehensive review of the application of metaheuristics to optimization problems in systems biology, mainly focusing on the parameter estimation problem (also called the inverse problem or model calibration). It is intended both for the systems biologist who wishes to learn more about the various optimization techniques available and for the metaheuristic optimizer who is interested in applying such techniques to problems in systems biology. First, the parameter estimation problems emerging from different areas of systems biology are described from the point of view of machine learning. Brief descriptions of various metaheuristics developed for these problems follow, along with outlines of their advantages and disadvantages. Several important issues in applying metaheuristics to the systems biology modeling problem are addressed, including the reliability and identifiability of model parameters, optimal design of experiments, and so on. Finally, we highlight some possible future research directions in this field.

163 citations


Journal ArticleDOI
TL;DR: This work has studied several feature extraction approaches for representing proteins and proposes a novel bacterial virulent protein prediction method, based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein.
Abstract: The availability of a reliable method for predicting bacterial virulent proteins has several important applications in research efforts aimed at finding novel drug targets, vaccine candidates, and understanding virulence mechanisms in pathogens. In this work, we have studied several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method, based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. We have evaluated and compared several ensembles obtained by combining six feature extraction methods and several classification approaches based on two general purpose classifiers (i.e., Support Vector Machine and a variant of input decimated ensemble) and their random subspace version. An extensive evaluation was performed according to a blind testing protocol, where the parameters of the system are optimized using the training set and the system is validated on three different independent data sets, allowing selection of the best-performing system and demonstrating the validity of the proposed method. Based on the results obtained using the blind test protocol, it is interesting to note that, even though the best-performing stand-alone method is not the same in each independent data set, the fusion of different methods enhances prediction efficiency in all the tested independent data sets.

153 citations


Journal ArticleDOI
TL;DR: Many biological networks, which have been previously classified as disassortative, are shown to be assortative with respect to these new measures.
Abstract: We analyze assortative mixing patterns of biological networks, which are typically directed. We develop a theoretical background for analyzing mixing patterns in directed networks before applying it to specific biological networks. Two new quantities are introduced, namely the in-assortativity and the out-assortativity, which are shown to be useful in quantifying assortative mixing in directed networks. We also introduce the local (node-level) assortativity quantities for in- and out-assortativity. Local assortativity profiles are the distributions of these local quantities over node degrees and can be used to analyze both canonical and real-world directed biological networks. Many biological networks, which have been previously classified as disassortative, are shown to be assortative with respect to these new measures. Finally, we demonstrate the use of local assortativity profiles in analyzing the functionalities of particular nodes and groups of nodes in real-world biological networks.
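As an illustration of the kind of quantity involved, the sketch below computes global in- and out-assortativity as the Pearson correlation of the corresponding degrees across the two endpoints of every directed edge. This follows a common convention and is an assumption here; the paper's exact formulation, including the local, node-level profiles, is not reproduced.

```python
# Hedged sketch: global in-/out-assortativity as a Pearson correlation of
# (source, target) degrees over all directed edges. This follows a common
# convention and is not guaranteed to match the paper's exact definition.
import networkx as nx
import numpy as np

def directed_assortativity(g: nx.DiGraph, mode: str = "out") -> float:
    deg = dict(g.out_degree()) if mode == "out" else dict(g.in_degree())
    src = np.array([deg[u] for u, v in g.edges()], dtype=float)
    dst = np.array([deg[v] for u, v in g.edges()], dtype=float)
    return float(np.corrcoef(src, dst)[0, 1])

if __name__ == "__main__":
    g = nx.gnp_random_graph(200, 0.05, directed=True, seed=1)   # toy directed network
    print("out-assortativity:", directed_assortativity(g, "out"))
    print("in-assortativity:", directed_assortativity(g, "in"))
```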

116 citations


Journal ArticleDOI
TL;DR: A recently developed SPSO algorithm is used to cope with the constrained optimization problem by converting it into an unconstrained optimization one through adding a penalty term to the objective function.
Abstract: In this paper, a hybrid extended Kalman filter (EKF) and switching particle swarm optimization (SPSO) algorithm is proposed for jointly estimating both the parameters and states of the lateral flow immunoassay model from available short time-series measurements. Our proposed method generalizes the well-known EKF algorithm by imposing physical constraints on the system states. Note that state constraints are encountered very often in practice and give rise to considerable difficulties in system analysis and design. The main purpose of this paper is to handle the dynamic modeling problem with state constraints by combining the extended Kalman filtering and constrained optimization algorithms via the maximization probability method. More specifically, a recently developed SPSO algorithm is used to cope with the constrained optimization problem by converting it into an unconstrained optimization problem through adding a penalty term to the objective function. The proposed algorithm is then employed to simultaneously identify the parameters and states of a lateral flow immunoassay model. It is shown that the proposed algorithm gives much improved performance over the traditional EKF method.
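The conversion mentioned above, turning a constrained problem into an unconstrained one by adding a penalty term, is a standard device; a toy sketch follows. The objective, constraint, and penalty weight mu are placeholders, not the paper's immunoassay model.

```python
# Generic sketch of converting a constrained objective into an unconstrained one
# by adding a penalty term, as done before handing the problem to a swarm optimizer.
# The objective, constraint, and penalty weight below are toy placeholders.
import numpy as np

def objective(x):
    return np.sum((x - 1.0) ** 2)                   # placeholder fitting error

def constraint_violation(x):
    # Example state constraint x >= 0: measure how far we are outside it.
    return np.sum(np.maximum(0.0, -x) ** 2)

def penalized_objective(x, mu=1e3):
    # Minimizing this unconstrained function discourages constraint violations.
    return objective(x) + mu * constraint_violation(x)
```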

103 citations


Journal ArticleDOI
TL;DR: This paper represents the first attempt to include two measures of controllability into one unified framework; it finds that controlling σ becomes more important in controlling a cortical network as l increases, and unveils the dependence of the controlling regions on the number of driver nodes l and the constraint r.
Abstract: Controlling regions in cortical networks, which serve as key nodes to control the dynamics of networks to a desired state, can be detected by minimizing the eigenratio R and the maximum imaginary part σ of an extended connection matrix. Until now, optimal selection of the set of controlling regions is still an open problem, and this paper represents the first attempt to include two measures of controllability into one unified framework. The detection problem of controlling regions in cortical networks is converted into a constrained optimization problem (COP), where the objective function R is minimized and σ is regarded as a constraint. Then, the detection of controlling regions of a weighted and directed complex network (e.g., a cortical network of a cat) is thoroughly investigated. The controlling regions of cortical networks are successfully detected by means of an improved dynamic hybrid framework (IDyHF). Our experiments verify that the proposed IDyHF outperforms two recently developed evolutionary computation methods in the constrained optimization field as well as some traditional methods in control theory and graph theory. Based on the IDyHF, the controlling regions are detected in a microscopic and macroscopic way. Our results unveil the dependence of the controlling regions on the number of driver nodes l and the constraint r. The controlling regions are largely selected from the regions with a large in-degree and a small out-degree. When r = +∞, there exists a concave shape of the mean degrees of the driver nodes, i.e., the regions with a large degree are of great importance to the control of the networks when l is small, and the regions with a small degree are helpful to control the networks when l increases. When r = 0, the mean degrees of the driver nodes increase as a function of l. We find that controlling σ becomes more important in controlling a cortical network as l increases. The methods and results of detecting controlling regions in this paper would promote the coordination and information consensus of various kinds of real-world complex networks, including transportation networks, genetic regulatory networks, and social networks.

101 citations


Journal ArticleDOI
TL;DR: An algorithm called ClassAMP has been developed to predict the propensity of a protein sequence to have antibacterial, antifungal, or antiviral activity.
Abstract: Antimicrobial peptides (AMPs) are gaining popularity as anti-infective agents. Information on sequence features that contribute to target specificity of AMPs will aid in accelerating drug discovery programs involving them. In this study, an algorithm called ClassAMP using Random Forests (RFs) and Support Vector Machines (SVMs) has been developed to predict the propensity of a protein sequence to have antibacterial, antifungal, or antiviral activity. ClassAMP is available at http://www.bicnirrh.res.in/classamp/.

97 citations


Journal ArticleDOI
TL;DR: A general framework of sample weighting is proposed to improve the stability of feature selection methods under sample variations and leads to more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.
Abstract: Feature selection from gene expression microarray data is a widely used technique for selecting candidate genes in various cancer studies. Besides the predictive ability of the selected genes, an important aspect in evaluating a selection method is the stability of the selected genes. Experts instinctively have high confidence in the result of a selection method that selects similar sets of genes under some variations to the samples. However, a common problem of existing feature selection methods for gene expression data is that the genes selected by the same method often vary significantly with sample variations. In this work, we propose a general framework of sample weighting to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then provides the weighted training set to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the stability of representative feature selection algorithms such as SVM-RFE and ReliefF, without sacrificing their classification performance. Moreover, the proposed algorithm also leads to more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.

96 citations


Journal ArticleDOI
TL;DR: The dynamical behavior of the ribosome flow model is relatively simple: there exists a unique equilibrium point e, every trajectory converges to e, and convergence is monotone in the sense that the distance to e can never increase.
Abstract: Gene translation is a central process in all living organisms. Developing a better understanding of this complex process may have ramifications for almost every biomedical discipline. Recently, Reuveni et al. proposed a new computational model of this process called the ribosome flow model (RFM). In this study, we show that the dynamical behavior of the RFM is relatively simple. There exists a unique equilibrium point e and every trajectory converges to e. Furthermore, convergence is monotone in the sense that the distance to e can never increase. This qualitative behavior is maintained for any feasible set of parameter values, suggesting that the RFM is highly robust. Our analysis is based on a contraction principle and the theory of monotone dynamical systems. These analysis tools may prove useful in studying other properties of the RFM as well as additional intracellular biological processes.
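For reference, the RFM is a set of nonlinear ODEs for normalized site occupancies x_i in [0, 1]. Up to notation (and as an approximate reproduction from the general RFM literature rather than from this paper), it takes the form:

```latex
\begin{aligned}
\dot{x}_1 &= \lambda_0 (1 - x_1) - \lambda_1 x_1 (1 - x_2),\\
\dot{x}_i &= \lambda_{i-1} x_{i-1} (1 - x_i) - \lambda_i x_i (1 - x_{i+1}), \qquad i = 2, \dots, n-1,\\
\dot{x}_n &= \lambda_{n-1} x_{n-1} (1 - x_n) - \lambda_n x_n .
\end{aligned}
```

Here λ_0 is the initiation rate and λ_1, ..., λ_n are the transition (elongation) rates; the equilibrium point e referred to above is a fixed point of this system.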

Journal ArticleDOI
TL;DR: Comrad, an adaptation of an existing disk-based method, identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences, and allows for random access to individual sequences and subsequences without decompressing the whole data set.
Abstract: Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Comrad, our adaptation of an existing disk-based method, identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. Comrad compresses the data over multiple passes, which is an expensive process, but this allows Comrad to compress large data sets within reasonable time and space. Comrad allows for random access to individual sequences and subsequences without decompressing the whole data set. Comrad has no competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.

Journal ArticleDOI
TL;DR: A new method of defining distances between unrooted binary phylogenetic trees is proposed that is especially useful for relatively large phylogenetic trees, and the properties of one example of these metrics, called the Matching Split distance, are investigated in detail.
Abstract: The reconstruction of evolutionary trees is one of the primary objectives in phylogenetics. Such a tree represents the historical evolutionary relationship between different species or organisms. Tree comparisons are used for multiple purposes, from unveiling the history of species to deciphering evolutionary associations among organisms and geographical areas. In this paper, we propose a new method of defining distances between unrooted binary phylogenetic trees that is especially useful for relatively large phylogenetic trees. Next, we investigate in detail the properties of one example of these metrics, called the Matching Split distance, and describe how the general method can be extended to nonbinary trees.

Journal ArticleDOI
TL;DR: In this article, two support vector machine algorithms under the Multi-Instance Multi-Label (MIML) framework were proposed for the automated annotation of Drosophila embryo images.
Abstract: In the studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time. Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions, interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which term corresponds to which region of which image in the group. In this paper, we address this problem using a new machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal that the exploitation of MIML framework leads to significant performance improvement over state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A range of novel SFs employing different machine-learning approaches in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes are explored, finding that ML SFs benefit more than their conventional counterparts from increases in the number of features and the size of the training data set.
Abstract: Accurately predicting the binding affinities of large sets of protein-ligand complexes efficiently is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts in developing conventional SFs, which are either force-field based, knowledge-based, or empirical, their limited ranking accuracy has been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we explore a range of novel SFs employing different machine-learning (ML) approaches in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes. We assess the ranking accuracies of these new ML-based SFs as well as those of conventional SFs in the context of the 2007 and 2010 PDBbind benchmark data sets on both diverse and protein-family-specific test sets. We also investigate the influence of the size of the training data set and the type and number of features used on ranking accuracy. Within clusters of protein-ligand complexes with different ligands bound to the same target protein, we find that the best ML-based SF is able to rank the ligands correctly based on their experimentally determined binding affinities 62.5 percent of the time and identify the top binding ligand 78.1 percent of the time. For this SF, the Spearman correlation coefficient between ranks of ligands ordered by predicted and experimentally determined binding affinities is 0.771. Given the challenging nature of the ranking problem and that SFs are used to screen millions of ligands, this represents a significant improvement over the best conventional SF we studied, for which the corresponding ranking performance values are 57.8 percent, 73.4 percent, and 0.677.
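A minimal sketch of the general recipe described above: fit a nonlinear regressor on protein-ligand descriptors and assess ranking quality with Spearman's rank correlation. The random-forest regressor and the synthetic features are stand-ins; the paper's feature sets and the PDBbind benchmark data are not reproduced.

```python
# Hedged sketch of a machine-learning scoring function: regress binding affinity
# on physicochemical/geometrical descriptors, then assess ranking with Spearman's rho.
# Features and targets here are synthetic placeholders, not PDBbind data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 36))                                 # descriptor vectors per complex
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)     # surrogate binding affinities

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
sf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
rho, _ = spearmanr(sf.predict(X_te), y_te)
print(f"Spearman correlation of predicted vs. measured affinity ranks: {rho:.3f}")
```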

Journal ArticleDOI
TL;DR: This paper develops a novel method based on random forests to identify a set of prognostic genes that incorporates multivariate correlations in microarray data for survival outcomes, and shows the advantages of the approach over single-gene-based approaches.
Abstract: Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.

Journal ArticleDOI
TL;DR: Two new binarization approaches are introduced which determine thresholds based on limited numbers of samples and additionally provide a measure of threshold validity, which reduces the complexity of network inference.
Abstract: Network inference algorithms can assist life scientists in unraveling gene-regulatory systems on a molecular level. In recent years, great attention has been drawn to the reconstruction of Boolean networks from time series. These need to be binarized, as such networks model genes as binary variables (either "expressed" or "not expressed"). Common binarization methods often cluster measurements or separate them according to statistical or information theoretic characteristics and may require many data points to determine a robust threshold. Yet, time series measurements frequently comprise only a small number of samples. To overcome this limitation, we propose a binarization that incorporates measurements at multiple resolutions. We introduce two such binarization approaches which determine thresholds based on limited numbers of samples and additionally provide a measure of threshold validity. Thus, network reconstruction and further analysis can be restricted to genes with meaningful thresholds. This reduces the complexity of network inference. The performance of our binarization algorithms was evaluated in network reconstruction experiments using artificial data as well as real-world yeast expression time series. The new approaches yield considerably improved correct network identification rates compared to other binarization techniques by effectively reducing the number of candidate networks.
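For context, the sketch below shows the kind of clustering-based binarization the authors improve upon: a 2-means split of a gene's measurements with the threshold placed between the two cluster centers. The proposed multi-resolution thresholding and its validity measure are not reproduced here.

```python
# Baseline sketch: binarize a gene's time series by 2-means clustering and
# thresholding halfway between the two cluster centers. This illustrates the
# "common binarization methods" the paper improves on, not the proposed algorithm.
import numpy as np
from sklearn.cluster import KMeans

def binarize_two_means(values):
    v = np.asarray(values, dtype=float).reshape(-1, 1)
    centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(v).cluster_centers_.ravel()
    threshold = centers.mean()
    return (v.ravel() >= threshold).astype(int), threshold

series = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85, 0.2]      # toy expression time series
bits, thr = binarize_two_means(series)
print(bits, thr)
```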

Journal ArticleDOI
TL;DR: Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast space of network architectures.
Abstract: In this paper, we investigate the problem of reverse engineering the topology of gene regulatory networks from temporal gene expression data. We adopt a computational intelligence approach comprising swarm intelligence techniques, namely particle swarm optimization (PSO) and ant colony optimization (ACO). In addition, the recurrent neural network (RNN) formalism is employed for modeling the dynamical behavior of gene regulatory systems. More specifically, ACO is used for searching the discrete space of network architectures and PSO for searching the corresponding continuous space of RNN model parameters. We propose a novel solution construction process in the context of ACO for generating biologically plausible candidate architectures. The objective is to concentrate the search effort into areas of the structure space that contain architectures which are feasible in terms of their topological resemblance to real-world networks. The proposed framework is initially applied to the reconstruction of a small artificial network that has previously been studied in the context of gene network reverse engineering. Subsequently, we consider an artificial data set with added noise for reconstructing a subnetwork of the genetic interaction network of S. cerevisiae (yeast). Finally, the framework is applied to a real-world data set for reverse engineering the SOS response system of the bacterium Escherichia coli. Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast space of network architectures.
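As a rough illustration of the RNN formalism mentioned above, the sketch below integrates a sigmoidal recurrent model of the form dx_i/dt = (sigmoid((Wx)_i + beta_i) - x_i) / tau_i. This is one common variant of the formalism, assumed here rather than taken from the paper, with toy parameters; the ACO/PSO search that would estimate W, beta, and tau is omitted.

```python
# Hedged sketch of a recurrent-neural-network model of gene regulation:
# dx_i/dt = (sigmoid((W x)_i + beta_i) - x_i) / tau_i, integrated with Euler steps.
# Weights, biases, and time constants are placeholders; no parameter search is shown.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_rnn_grn(W, beta, tau, x0, dt=0.05, steps=200):
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        dx = (sigmoid(W @ x + beta) - x) / tau      # Euler integration step
        x = x + dt * dx
        trajectory.append(x.copy())
    return np.array(trajectory)

W = np.array([[0.0, 2.0], [-2.0, 0.0]])             # toy 2-gene regulatory weights
traj = simulate_rnn_grn(W, beta=np.zeros(2), tau=np.ones(2), x0=[0.2, 0.8])
print(traj[-1])
```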

Journal ArticleDOI
TL;DR: Experiments illustrate that the proposed approaches not only achieve good performance on gene expression profiles, but also outperform most of the existing approaches in the process of class discovery from these profiles.
Abstract: In order to perform successful diagnosis and treatment of cancer, discovering and classifying cancer types correctly is essential. One of the challenging properties of class discovery from cancer data sets is that cancer gene expression profiles not only include a large number of genes, but also contain a lot of noisy genes. In order to reduce the effect of noisy genes in cancer gene expression profiles, we propose two new consensus clustering frameworks in this paper, named triple spectral clustering-based consensus clustering (SC³) and double spectral clustering-based consensus clustering (SC²Ncut), for cancer discovery from gene expression profiles. SC³ integrates the spectral clustering (SC) algorithm multiple times into the ensemble framework to process gene expression profiles. Specifically, spectral clustering is applied to perform clustering on the gene dimension and the cancer sample dimension, and is also used as the consensus function to partition the consensus matrix constructed from multiple clustering solutions. Compared with SC³, SC²Ncut adopts the normalized cut algorithm, instead of spectral clustering, as the consensus function. Experiments on both synthetic data sets and real cancer gene expression profiles illustrate that the proposed approaches not only achieve good performance on gene expression profiles, but also outperform most of the existing approaches in the process of class discovery from these profiles.
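A minimal sketch of the generic consensus-clustering pattern described above: generate several base spectral clusterings, accumulate them into a co-association (consensus) matrix, and partition that matrix with spectral clustering. The perturbation scheme and parameters are illustrative assumptions; the gene-dimension clustering and the normalized-cut variant of the paper are not reproduced.

```python
# Hedged sketch of consensus clustering: build a co-association matrix from several
# base spectral clusterings and cluster it again. Sampling scheme and parameters
# are illustrative; this is not the SC3 / SC2Ncut pipeline itself.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=120, centers=3, random_state=0)   # stand-in for samples x genes
n, k, runs = X.shape[0], 3, 10
consensus = np.zeros((n, n))

rng = np.random.default_rng(0)
for _ in range(runs):
    # Perturb the data slightly so the base clusterings differ from run to run.
    noisy = X + rng.normal(scale=0.1, size=X.shape)
    labels = SpectralClustering(n_clusters=k, random_state=int(rng.integers(1_000_000))).fit_predict(noisy)
    consensus += (labels[:, None] == labels[None, :]).astype(float)
consensus /= runs

# Use the consensus matrix itself as the affinity for a final spectral partition.
final_labels = SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(consensus)
print(np.bincount(final_labels))
```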

Journal ArticleDOI
Xin Ma, Jing Guo, Hongde Liu, Jianming Xie, Xiao Sun
TL;DR: Comparisons with other approaches clearly show that DNABR has excellent prediction performance for detecting binding residues in putative DNA-binding proteins, and two types of novel sequence features contribute most to the improvement in predictive ability.
Abstract: The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions and gene expression, and for guiding drug design. Therefore, a prediction method DNABR (DNA Binding Residues) is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids, while the second reflects the dependency effect of amino acids with regard to polarity-charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflects the characteristics of 20 types of amino acids are used to build DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for the Matthews correlation coefficient (MCC), 93.04 percent overall accuracy (ACC), 68.47 percent sensitivity (SE), and 98.16 percent specificity (SP). The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has excellent prediction performance for detecting binding residues in putative DNA-binding proteins. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.

Journal ArticleDOI
TL;DR: A novel tumor classification method based on correlation filters is proposed to identify the overall pattern of a tumor subtype hidden in differentially expressed genes; it can achieve better performance when balanced training sets are exploited to synthesize the templates.
Abstract: Tumor classification based on Gene Expression Profiles (GEPs), which is of great benefit to the accurate diagnosis and personalized treatment of different types of tumor, has drawn great attention in recent years. This paper proposes a novel tumor classification method based on correlation filters to identify the overall pattern of a tumor subtype hidden in differentially expressed genes. Concretely, two correlation filters, i.e., Minimum Average Correlation Energy (MACE) and Optimal Tradeoff Synthetic Discriminant Function (OTSDF), are introduced to determine whether a test sample matches the templates synthesized for each subclass. The experiments on six publicly available data sets indicate that the proposed method is robust to noise and can more effectively avoid the effects of the curse of dimensionality. Compared with many model-based methods, the correlation filter-based method can achieve better performance when balanced training sets are exploited to synthesize the templates. Particularly, the proposed method can detect the similarity of the overall pattern while ignoring small mismatches between the test sample and the synthesized template. It also performs well even if only a few training samples are available. More importantly, the experimental results can be visually represented, which is helpful for further analysis of the results.

Journal ArticleDOI
TL;DR: This paper proposes a new biclustering algorithm based on evolutionary learning that demonstrates a significant improvement in discovering additive biclusters and is able to discover bicluster seeds within a limited computing time.
Abstract: The analysis of gene expression data obtained from microarray experiments is important for discovering the biological process of genes. Biclustering algorithms have been proven to be able to group the genes with similar expression patterns under a number of experimental conditions. In this paper, we propose a new biclustering algorithm based on evolutionary learning. By converting the biclustering problem into a common clustering problem, the algorithm can be applied in a search space constructed by the conditions. To further reduce the size of the search space, we randomly separate the full conditions into a number of condition subsets (subspaces), each of which has a smaller number of conditions. The algorithm is applied to each subspace and is able to discover bicluster seeds within a limited computing time. Finally, an expanding and merging procedure is employed to combine the bicluster seeds into larger biclusters according to a homogeneity criterion. We test the performance of the proposed algorithm using synthetic and real microarray data sets. Compared with several previously developed biclustering algorithms, our algorithm demonstrates a significant improvement in discovering additive biclusters.

Journal ArticleDOI
TL;DR: This work proposes a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality and does not take any input parameters, a feature missing in many existing algorithms.
Abstract: Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods on the same data set, or the same method with varying input parameters, or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. The DICLENS implementation is scalable, consumes very little memory and CPU, and runs on standard personal computers.

Journal ArticleDOI
TL;DR: This paper presents a novel algorithm for parameter synthesis based on parallel model checking that is conceptually universal with respect to the modeling approach employed and its applicability on several biological models is examined.
Abstract: An important problem in current computational systems biology is to analyze models of biological systems dynamics under parameter uncertainty. This paper presents a novel algorithm for parameter synthesis based on parallel model checking. The algorithm is conceptually universal with respect to the modeling approach employed. We introduce the algorithm, show its scalability, and examine its applicability on several biological models.

Journal ArticleDOI
TL;DR: In this paper, a very fast Markov clustering algorithm using CUDA (CUDA-MCL) was introduced to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalization.
Abstract: Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However, with the rapidly increasing amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses the CUDA toolkit to implement a massively parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers the latency time, thus circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform the parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations that are at the heart of MCL. We utilized the ELLPACK-R sparse format to allow effective, fine-grained massively parallel processing to cope with the sparse nature of interaction network data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on a CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, which was previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
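For reference, the core MCL iteration that CUDA-MCL parallelizes alternates expansion (matrix powering) and inflation (elementwise powering followed by column normalization). A dense, CPU-only sketch follows; the sparse ELLPACK-R representation and the CUDA kernels are omitted.

```python
# Minimal dense CPU sketch of Markov clustering (MCL): alternate expansion and
# inflation until the matrix stops changing. CUDA-MCL performs the same steps with
# sparse matrices on the GPU; that machinery is omitted here.
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, tol=1e-6, max_iter=100):
    M = adjacency + np.eye(adjacency.shape[0])      # add self-loops
    M = M / M.sum(axis=0)                           # column-normalize -> Markov matrix
    for _ in range(max_iter):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)    # expansion step
        M = M ** inflation                          # inflation step (elementwise power)
        M = M / M.sum(axis=0)                       # re-normalize columns
        if np.abs(M - prev).max() < tol:
            break
    # Rows with remaining mass act as attractors; their nonzero columns form clusters.
    clusters = {tuple(np.nonzero(M[i] > 1e-8)[0]) for i in range(M.shape[0]) if M[i].max() > 1e-8}
    return sorted(clusters)

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]], dtype=float)
print(mcl(A))   # toy graph: one triangle plus an isolated node
```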

Journal ArticleDOI
TL;DR: This paper presents a fast fixed-parameter algorithm for constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees, and uses the algorithm together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal type-II reticulate network of the input trees.
Abstract: A reticulate network N of multiple phylogenetic trees may have nodes with two or more parents (called reticulation nodes). There are two ways to define the reticulation number of N. One way is to define it as the number of reticulation nodes in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate network of the trees. The better way is to define it as the total number of parents of reticulation nodes in N minus the number of reticulation nodes in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-II reticulate network of the trees. In this paper, we first present a fast fixed-parameter algorithm for constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal type-II reticulate network of the input trees. To our knowledge, these are the first fixed-parameter algorithms for the problems. We have implemented the algorithms in ANSI C, obtaining programs CMPT and MaafB. Our experimental data show that CMPT can construct optimal type-I reticulate networks rapidly and that MaafB can compute better lower bounds for optimal type-II reticulate networks within shorter time than the previously best program PIRN designed by Wu.

Journal ArticleDOI
TL;DR: The analysis of biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
Abstract: Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution instead of the normal distribution is used as the conditional distribution of the samples, because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, whose optimal value can be evaluated simply at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of the biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
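The breakpoint evaluation mentioned above can be illustrated with a toy penalized objective of the form sum_i |x_i - mu| + lambda * |mu - mu0|, which is piecewise linear in mu with breakpoints at the sample values and at mu0, so its minimum can be found by enumeration. This specific objective and the reference value mu0 are assumptions for illustration; the paper's exact model may weight the terms differently.

```python
# Hedged sketch of the breakpoint trick for a piecewise-linear penalized mean:
# minimize sum_i |x_i - mu| + lam * |mu - mu0| by evaluating the objective only at
# its breakpoints (the sample values and mu0). No numerical solver is needed.
import numpy as np

def shrunken_class_mean(x, lam, mu0=0.0):
    x = np.asarray(x, dtype=float)
    candidates = np.append(x, mu0)                  # all breakpoints of the objective
    objective = lambda mu: np.abs(x - mu).sum() + lam * abs(mu - mu0)
    return min(candidates, key=objective)

print(shrunken_class_mean([2.1, 1.9, 2.4, 2.0], lam=0.5))   # stays near the data
print(shrunken_class_mean([2.1, 1.9, 2.4, 2.0], lam=10.0))  # shrinks to mu0 = 0
```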

Journal ArticleDOI
TL;DR: This work introduces a novel strategy based on building a contact-window-based motif library from the protein structural data, discovering and extending common alignment seeds from this library, and optimally superimposing multiple structures according to these alignment seeds by an enhanced partial order curve comparison method.
Abstract: Availability of an effective tool for protein multiple structural alignment (MSTA) is essential for the discovery and analysis of biologically significant structural motifs that can help solve functional annotation and drug design problems. Existing MSTA methods collect residue correspondences mostly through pairwise comparison of consecutive fragments, which can lead to suboptimal alignments, especially when the similarity among the proteins is low. We introduce a novel strategy based on building a contact-window-based motif library from the protein structural data, discovering and extending common alignment seeds from this library, and optimally superimposing multiple structures according to these alignment seeds by an enhanced partial order curve comparison method. The ability of our strategy to detect multiple correspondences simultaneously, to catch alignments globally, and to support flexible alignments endorses a sensitive and robust automated algorithm that can expose similarities among protein structures even under low similarity conditions. Our method yields better alignment results compared to other popular MSTA methods, on several protein structure data sets that span various structural folds and represent different protein similarity levels. A web-based alignment tool, a downloadable executable, and detailed alignment results for the data sets used here are available at http://sacan.biomed.drexel.edu/Smolign and http://bio.cse.ohio-state.edu/Smolign.

Journal ArticleDOI
TL;DR: An improved version of Vina, dubbed QVina, achieved a maximum acceleration of about 25 times, with an average speed-up of 8.34 times compared to the original Vina when tested on a set of 231 protein-ligand complexes, while keeping the optimal scores mostly identical.
Abstract: Predicting binding between a macromolecule and a small molecule is a crucial phase in the field of rational drug design. AutoDock Vina, one of the most widely used docking programs, released in 2009, uses an empirical scoring function to evaluate the binding affinity between the molecules and employs the iterated local search global optimizer for global optimization, achieving significantly improved speed and better accuracy of binding mode prediction compared to its predecessor, AutoDock 4. In this paper, we propose a further improvement in the local search algorithm of Vina by heuristically preventing some intermediate points from undergoing local search. Our improved version of Vina, dubbed QVina, achieved a maximum acceleration of about 25 times, with an average speed-up of 8.34 times compared to the original Vina when tested on a set of 231 protein-ligand complexes, while keeping the optimal scores mostly identical. Using our heuristics, a larger number of different ligands can be quickly screened against a given receptor within the same time frame.

Journal ArticleDOI
TL;DR: This work proposes a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network.
Abstract: Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
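For orientation, the local and global consistency (LGC) scheme that MCSL extends has the closed-form solution F* = (1 - alpha) (I - alpha S)^{-1} Y, where S is the symmetrically normalized affinity matrix. A small sketch of plain LGC follows; MCSL's intraclass/interclass consistency terms over the PPI and functional-class networks are not reproduced.

```python
# Hedged sketch of the basic LGC label propagation that MCSL builds on:
# F* = (1 - alpha) * (I - alpha * S)^(-1) * Y, with S = D^(-1/2) W D^(-1/2).
# The multi-label, interclass-consistency machinery of MCSL itself is omitted.
import numpy as np

def lgc_propagate(W, Y, alpha=0.9):
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = d_inv_sqrt @ W @ d_inv_sqrt                 # symmetric normalization
    n = W.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)

# Toy example: 4 proteins, 2 functional classes, labels known for proteins 0 and 3.
W = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(lgc_propagate(W, Y))
```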