
Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2016"


Journal ArticleDOI
TL;DR: The basic taxonomy of feature selection is presented, and state-of-the-art gene selection methods are reviewed by grouping the literature into three categories: supervised, unsupervised, and semi-supervised.
Abstract: Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprises up to hundreds of thousands of features with relatively small sample size. Because learning algorithms usually do not work well with this kind of data, a challenge to reduce the data dimensionality arises. A huge number of gene selection methods have been applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection and reviews the state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. The comparison of experimental results on five representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with supervised feature selection.

402 citations


Journal ArticleDOI
TL;DR: Using functions of several bioinformatics tools, 143 differentially expressed genes (DEGs) associated with cervical cancer are selected; analysis shows that these DEGs play important roles in the development of cervical cancer.
Abstract: Cervical cancer is the third most common malignancy in women worldwide. It remains a leading cause of cancer-related death for women in developing countries. In order to contribute to the treatment of cervical cancer, in our work we try to find a few key genes implicated in cervical cancer. Employing functions of several bioinformatics tools, we selected 143 differentially expressed genes (DEGs) associated with cervical cancer. The results of bioinformatics analysis show that these DEGs play important roles in the development of cervical cancer. By comparing two differential co-expression networks (DCNs) at two different states, we found a common sub-network and two differential sub-networks, as well as some hub genes in the three sub-networks. Moreover, some of the hub genes have been reported to be related to cervical cancer. Those hub genes were analyzed from three aspects: Gene Ontology function enrichment, pathway enrichment, and protein binding. The results can help us understand the development of cervical cancer and guide further experiments on it.

116 citations


Journal ArticleDOI
TL;DR: The principles of the CAVER 3.0 algorithms for the identification and analysis of properties of transport pathways both in static and dynamic structures are described and the improved clustering solution for finding tunnels in macromolecules is introduced.
Abstract: The biological function of a macromolecule often requires that a small molecule or ion is transported through its structure. The transport pathway often leads through void spaces in the structure. The properties of transport pathways change significantly in time; therefore, the analysis of a trajectory from molecular dynamics rather than of a single static structure is needed for understanding the function of pathways. The identification and analysis of transport pathways are challenging because of the high complexity and diversity of macromolecular shapes, the thermal motion of their atoms, and the large number of conformations needed to properly describe the conformational space of a protein structure. In this paper, we describe the principles of the CAVER 3.0 algorithms for the identification and analysis of properties of transport pathways both in static and dynamic structures. Moreover, we introduce the improved clustering solution for finding tunnels in macromolecules, which is included in the latest CAVER 3.02 version. Voronoi diagrams are used to identify potential pathways in each snapshot of a molecular dynamics trajectory, and clustering is then used to find the correspondence between tunnels from different snapshots. Furthermore, the geometrical properties of pathways and their evolution in time are computed and visualized.

109 citations


Journal ArticleDOI
TL;DR: This paper depicts the general architecture of an MCPS consisting of four layers: data acquisition, data aggregation, cloud processing, and action, and surveys conventional and emerging encryption schemes based on their ability to provide secure storage, data sharing, and secure computation.
Abstract: The following decade will witness a surge in remote health-monitoring systems that are based on body-worn monitoring devices. These Medical Cyber Physical Systems (MCPS) will be capable of transmitting the acquired data to a private or public cloud for storage and processing. Machine learning algorithms running in the cloud and processing this data can provide decision support to healthcare professionals. There is no doubt that the security and privacy of the medical data is one of the most important concerns in designing an MCPS. In this paper, we depict the general architecture of an MCPS consisting of four layers: data acquisition, data aggregation, cloud processing, and action. Due to the differences in hardware and communication capabilities of each layer, different encryption schemes must be used to guarantee data privacy within that layer. We survey conventional and emerging encryption schemes based on their ability to provide secure storage, data sharing, and secure computation. Our detailed experimental evaluation of each scheme shows that while the emerging encryption schemes enable exciting new features such as secure sharing and secure computation, they introduce several orders-of-magnitude computational and storage overhead. We conclude our paper by outlining future research directions to improve the usability of the emerging encryption schemes in an MCPS.

107 citations


Journal ArticleDOI
TL;DR: This paper surveys algorithms that perform global alignment of networks or graphs, highlighting various proposed approaches and classifying them based on their methodology.
Abstract: In this paper, we survey algorithms that perform global alignment of networks or graphs. Global network alignment aligns two or more given networks to find the best mapping from nodes in one network to nodes in other networks. Since graphs are a common method of data representation, graph alignment has become important, with many significant applications. Protein-protein interactions can be modeled as networks, and aligning these networks of protein interactions has many applications in biological research. In this survey, we review algorithms for global pairwise alignment, highlighting various proposed approaches, and classify them based on their methodology. Evaluation metrics that are used to measure the quality of the resulting alignments are also surveyed. We present a comparison between selected aligners on the same datasets, evaluated using the same metrics. Finally, a quick overview of the most popular databases of protein interaction networks is presented, focusing on datasets that have been used recently.

93 citations


Journal ArticleDOI
TL;DR: The Brouwer's fixed point theorem is employed to obtain sufficient conditions such that the kind of GRNs under consideration here has at least one nonnegative equilibrium point which is globally asymptotically stable.
Abstract: This paper deals with the problem of globally asymptotic stability for nonnegative equilibrium points of genetic regulatory networks (GRNs) with mixed delays (i.e., time-varying discrete delays and constant distributed delays). Up to now, all existing stability criteria for equilibrium points of this kind of GRN have been in the form of linear matrix inequalities (LMIs). In this paper, Brouwer's fixed point theorem is employed to obtain sufficient conditions such that the kind of GRNs under consideration has at least one nonnegative equilibrium point. Then, by using the nonsingular M-matrix theory and the functional differential equation theory, M-matrix-based sufficient conditions are proposed to guarantee that this kind of GRN has a unique nonnegative equilibrium point which is globally asymptotically stable. The M-matrix-based sufficient conditions derived here amount to checking whether a constant matrix is a nonsingular M-matrix, which can be easily verified, as there are many equivalent characterizations of nonsingular M-matrices. So, in terms of computational complexity, the M-matrix-based stability criteria established in this paper are superior to the LMI-based ones in the literature. To illustrate the effectiveness of the proposed approach, several numerical examples and their simulations are given.
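The M-matrix test the abstract refers to is indeed easy to verify numerically. Below is a minimal sketch (the function name is ours, and this is one standard characterization rather than the paper's own criteria): a nonsingular M-matrix is a Z-matrix (nonpositive off-diagonal entries) all of whose eigenvalues have positive real part.

```python
import numpy as np

def is_nonsingular_m_matrix(A, tol=1e-12):
    """Check whether A is a nonsingular M-matrix: a Z-matrix
    (off-diagonal entries <= 0) whose eigenvalues all have
    positive real part (one of many equivalent characterizations)."""
    A = np.asarray(A, dtype=float)
    off_diag = A - np.diag(np.diag(A))
    if np.any(off_diag > tol):          # not a Z-matrix
        return False
    return bool(np.all(np.linalg.eigvals(A).real > tol))

# A diagonally dominant Z-matrix is a nonsingular M-matrix.
A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])
B = np.array([[ 1.0, -2.0],
              [-2.0,  1.0]])           # eigenvalues 3 and -1: not an M-matrix
print(is_nonsingular_m_matrix(A))      # True
print(is_nonsingular_m_matrix(B))      # False
```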

78 citations


Journal ArticleDOI
TL;DR: An effective method called BiCliques Merging (BCM) is developed to predict MRMs based on bicliques merging and it is shown that the modules identified by this method are more densely connected and functionally enriched.
Abstract: MicroRNAs (miRNAs) are post-transcriptional regulators that repress the expression of their targets. They are known to work cooperatively with genes and play important roles in numerous cellular processes. Identification of miRNA regulatory modules (MRMs) would aid in deciphering the combinatorial effects derived from the many-to-many regulatory relationships in complex cellular systems. Here, we develop an effective method called BiCliques Merging (BCM) to predict MRMs based on bicliques merging. By integrating the miRNA/mRNA expression profiles from The Cancer Genome Atlas (TCGA) with computational target predictions, we construct a weighted miRNA regulatory network for module discovery. The maximal bicliques detected in the network are statistically evaluated and filtered accordingly. We then employ a greedy-based strategy to iteratively merge the remaining bicliques according to their overlaps together with edge weights and the gene-gene interactions. Compared with existing methods on two cancer datasets from TCGA, we showed that the modules identified by our method are more densely connected and functionally enriched. Moreover, our predicted modules are more enriched for miRNA families, and the miRNA-mRNA pairs within the modules are more negatively correlated. Finally, several potential prognostic modules are revealed by Kaplan-Meier survival analysis and breast cancer subtype analysis. Availability: BCM is implemented in Java and available for download in the supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2462370 .
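The merging step can be illustrated with a much-simplified sketch. The overlap score below is a plain Jaccard-style measure and the input bicliques are toy data; the actual BCM additionally uses edge weights, statistical evaluation, and gene-gene interactions, so treat this as an idealized illustration only.

```python
def overlap(b1, b2):
    """Jaccard-style overlap between two bicliques (miRNA-set, mRNA-set)."""
    m1, g1 = b1
    m2, g2 = b2
    inter = len(m1 & m2) * len(g1 & g2)
    union = len(m1 | m2) * len(g1 | g2)
    return inter / union if union else 0.0

def merge_bicliques(bicliques, threshold=0.25):
    """Greedily merge the most-overlapping pair until no pair exceeds threshold."""
    modules = [(set(m), set(g)) for m, g in bicliques]
    while True:
        best, pair = threshold, None
        for i in range(len(modules)):
            for j in range(i + 1, len(modules)):
                s = overlap(modules[i], modules[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            return modules
        i, j = pair
        merged = (modules[i][0] | modules[j][0], modules[i][1] | modules[j][1])
        modules = [b for k, b in enumerate(modules) if k not in pair] + [merged]

# Hypothetical bicliques: the first two overlap heavily and get merged.
bicliques = [({'miR-1', 'miR-2'}, {'TP53', 'BRCA1'}),
             ({'miR-2', 'miR-3'}, {'TP53', 'BRCA1'}),
             ({'miR-9'}, {'MYC'})]
print(merge_bicliques(bicliques))
```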

75 citations


Journal ArticleDOI
TL;DR: The goal is to establish an integrated model that can predict GBM prognosis with high accuracy by taking advantage of the minimum redundancy feature selection method (mRMR) and Multiple Kernel Learning (MKL).
Abstract: Glioblastoma multiforme (GBM) is a highly aggressive type of brain cancer with a very low median survival. In order to predict a patient's prognosis, researchers have proposed rules to classify different glioma cancer cell subtypes. However, survival times within a GBM subtype often vary from individual to individual. Recent developments in gene testing have evolved classic subtype rules into more specific classification rules based on single biomolecular features. These classification methods are proven to perform better than traditional simple rules in GBM prognosis prediction. However, the real power behind the massive data remains untapped. We believe a combined prediction model based on more than one data type could perform better, which would contribute further to the clinical treatment of GBM. The Cancer Genome Atlas (TCGA) database provides a huge dataset with various data types for many cancers, which enables us to inspect this aggressive cancer in a new way. In this research, we have further improved GBM prognosis prediction accuracy by taking advantage of the minimum redundancy feature selection method (mRMR) and Multiple Kernel Learning (MKL). Our goal is to establish an integrated model that can predict GBM prognosis with high accuracy.

62 citations


Journal ArticleDOI
TL;DR: An efficient first-order method based on extensions of the coordinate descent method is proposed to learn the optimal solution of ChIP-PIT, which makes it particularly suitable for the analysis of massive-scale ChIP-seq data.
Abstract: In recent years, thanks to the efforts of individual scientists and research consortiums, a huge amount of chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experimental data has been accumulated. Instead of investigating them independently, several recent studies have convincingly demonstrated that a wealth of scientific insights can be gained by integrative analysis of these ChIP-seq data. However, when used for the purpose of integrative analysis, a serious drawback of current ChIP-seq techniques is that it is still expensive and time-consuming to generate ChIP-seq datasets of a high standard. Most researchers are therefore unable to obtain complete ChIP-seq data for several TFs in a wide variety of cell lines, which considerably limits the understanding of transcriptional regulation patterns. In this paper, we propose a novel method called ChIP-PIT to overcome the aforementioned limitation. In ChIP-PIT, ChIP-seq data corresponding to a diverse collection of cell types, TFs, and genes are fused together using the three-mode pair-wise interaction tensor (PIT) model, and the prediction of unperformed ChIP-seq experimental results is formulated as a tensor completion problem. Computationally, we propose an efficient first-order method based on extensions of the coordinate descent method to learn the optimal solution of ChIP-PIT, which makes it particularly suitable for the analysis of massive-scale ChIP-seq data. Experimental evaluation on the ENCODE data illustrates the usefulness of the proposed model.

61 citations


Journal ArticleDOI
TL;DR: miRTDL, a new miRNA target prediction algorithm based on a convolutional neural network (CNN), is presented; the CNN automatically extracts essential information from the input data rather than relying entirely on an artificially generated input dataset when the precise miRNA target mechanisms are poorly known.
Abstract: MicroRNAs (miRNAs) regulate genes that are associated with various diseases. To better understand miRNAs, the miRNA regulatory mechanism needs to be investigated and the real targets identified. Here, we present miRTDL, a new miRNA target prediction algorithm based on convolutional neural network (CNN). The CNN automatically extracts essential information from the input data rather than completely relying on the input dataset generated artificially when the precise miRNA target mechanisms are poorly known. In this work, the constraint relaxing method is first used to construct a balanced training dataset to avoid inaccurate predictions caused by the existing unbalanced dataset. The miRTDL is then applied to 1,606 experimentally validated miRNA target pairs. Finally, the results show that our miRTDL outperforms the existing target prediction algorithms and achieves significantly higher sensitivity, specificity and accuracy of 88.43, 96.44, and 89.98 percent, respectively. We also investigate the miRNA target mechanism, and the results show that the complementation features are more important than the others.

60 citations


Journal ArticleDOI
TL;DR: This paper identifies result-manipulation attacks on a DMFB that maliciously alter the assay outcomes and identifies denial-of-service attacks, where the attacker can disrupt the assay operation by tampering either with the droplet-routing algorithm or with the actuation sequence.
Abstract: A digital microfluidic biochip (DMFB) is an emerging technology that enables miniaturized analysis systems for point-of-care clinical diagnostics, DNA sequencing, and environmental monitoring. A DMFB reduces the rate of sample and reagent consumption, and automates the analysis of assays. In this paper, we provide the first assessment of the security vulnerabilities of DMFBs. We identify result-manipulation attacks on a DMFB that maliciously alter the assay outcomes. Two practical result-manipulation attacks are shown on a DMFB platform performing enzymatic glucose assay on serum. In the first attack, the attacker adjusts the concentration of the glucose sample and thereby modifies the final result. In the second attack, the attacker tampers with the calibration curve of the assay operation. We then identify denial-of-service attacks, where the attacker can disrupt the assay operation by tampering either with the droplet-routing algorithm or with the actuation sequence. We demonstrate these attacks using a digital microfluidic synthesis simulator. The results show that the attacks are easy to implement and hard to detect. Therefore, this work highlights the need for effective protections against malicious modifications in DMFBs.

Journal ArticleDOI
TL;DR: A hybrid framework composed of two stages for gene selection and classification of DNA microarray data is proposed, and it is observed that the proposed approach works better than other methods reported in the literature.
Abstract: A hybrid framework composed of two stages for gene selection and classification of DNA microarray data is proposed. At the first stage, five traditional statistical methods are combined for preliminary gene selection (Multiple Fusion Filter). Then, different relevant gene subsets are selected by using an embedded Genetic Algorithm (GA), Tabu Search (TS), and Support Vector Machine (SVM). A gene subset, consisting of the most relevant genes, is obtained from this process by analyzing the frequency of each gene in the different gene subsets. Finally, the most frequent genes are evaluated by the embedded approach to obtain a final small, relevant gene subset with high performance. The proposed method is tested on four DNA microarray datasets. From the simulation study, it is observed that the proposed approach works better than other methods reported in the literature.
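The frequency-analysis step lends itself to a short sketch. The subsets and the `min_count` threshold below are hypothetical stand-ins; the embedded GA/TS/SVM evaluation that produces and scores the subsets is not reproduced here.

```python
from collections import Counter

def frequent_genes(subsets, min_count=2):
    """Keep genes that appear in at least `min_count` of the candidate subsets."""
    counts = Counter(g for s in subsets for g in set(s))
    return sorted(g for g, c in counts.items() if c >= min_count)

# Hypothetical gene subsets, as if produced by separate GA/TS/SVM runs:
subsets = [['TP53', 'EGFR', 'MYC'],
           ['TP53', 'EGFR', 'KRAS'],
           ['TP53', 'MYC', 'BRCA1']]
print(frequent_genes(subsets))   # ['EGFR', 'MYC', 'TP53']
```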

Journal ArticleDOI
TL;DR: This paper proposes a method, called SimNet, to semantically integrate multiple functional association networks derived from heterogeneous data sources, and shows that SimNet not only achieves better (or comparable) results than other related competitive approaches, but also takes much less time.
Abstract: Determining the biological functions of proteins is one of the key challenges in the post-genomic era. The rapidly accumulating large volumes of proteomic and genomic data have driven the development of computational models for automatically predicting protein function at large scale. Recent approaches focus on integrating multiple heterogeneous data sources, and they often get better results than methods that use a single data source alone. In this paper, we investigate how to integrate multiple biological data sources with biological knowledge, i.e., the Gene Ontology (GO), for protein function prediction. We propose a method, called SimNet, to Semantically integrate multiple functional association Networks derived from heterogeneous data sources. SimNet first utilizes GO annotations of proteins to capture the semantic similarity between proteins and introduces a semantic kernel based on the similarity. Next, SimNet constructs a composite network, obtained as a weighted summation of individual networks, and aligns the network with the kernel to get the weights assigned to individual networks. Then, it applies a network-based classifier on the composite network to predict protein function. Experimental results on heterogeneous proteomic data sources of Yeast, Human, Mouse, and Fly show that SimNet not only achieves better (or comparable) results than other related competitive approaches, but also takes much less time. The Matlab code of SimNet is available at https://sites.google.com/site/guoxian85/simnet .

Journal ArticleDOI
TL;DR: This paper proposes and experimentally studies a scheme that keeps the training samples private while enabling accurate construction of predictive models and shows that the scheme is highly efficient and scalable to a large number of mHealth users.
Abstract: Advances in biomedical sensors and mobile communication technologies have fostered the rapid growth of mobile health (mHealth) applications in the past years. Users generate a high volume of biomedical data during health monitoring, which can be used by the mHealth server for training predictive models for disease diagnosis and treatment. However, the biomedical sensing data raise serious privacy concerns because they reveal sensitive information such as health status and lifestyles of the sensed subjects. This paper proposes and experimentally studies a scheme that keeps the training samples private while enabling accurate construction of predictive models. We specifically consider logistic regression models which are widely used for predicting dichotomous outcomes in healthcare, and decompose the logistic regression problem into small subproblems over two types of distributed sensing data, i.e., horizontally partitioned data and vertically partitioned data. The subproblems are solved using individual private data, and thus mHealth users can keep their private data locally and only upload (encrypted) intermediate results to the mHealth server for model training. Experimental results based on real datasets show that our scheme is highly efficient and scalable to a large number of mHealth users.
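The horizontally partitioned case described above can be sketched in a few lines: each party evaluates the logistic-loss gradient on its own samples, and the server only combines the aggregated gradients. This omits the encryption of the intermediate results described in the paper, and the data below are synthetic; it is only meant to show the decomposition.

```python
import numpy as np

def local_gradient(w, X, y):
    """Logistic-loss gradient computed on one party's private samples."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y)

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])

# Two parties, each holding its own horizontally partitioned samples.
parts = []
for _ in range(2):
    X = rng.normal(size=(200, 2))
    y = (1.0 / (1.0 + np.exp(-X @ w_true)) > rng.uniform(size=200)).astype(float)
    parts.append((X, y))

# The server iterates on the model; parties ship only (in the paper,
# encrypted) gradient sums, never their raw samples.
w = np.zeros(2)
n = sum(len(X) for X, _ in parts)
for _ in range(500):
    grad = sum(local_gradient(w, X, y) for X, y in parts) / n
    w -= 0.5 * grad
print(np.round(w, 2))   # should land near w_true = [1.5, -2.0]
```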

Journal ArticleDOI
TL;DR: A deep performance evaluation of GO-WAR is conducted by mining publicly available GO-annotated datasets, showing how GO-WAR outperforms current state-of-the-art approaches.
Abstract: Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. Among the different approaches of analysis is the use of association rules (AR), which provides useful knowledge by discovering biologically relevant associations between terms of GO that were not previously known. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. Here we adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct a deep performance evaluation of GO-WAR by mining publicly available GO-annotated datasets, showing how GO-WAR outperforms current state-of-the-art approaches.

Journal ArticleDOI
TL;DR: This paper proposes a new MI-based feature selection approach for microarray data that relies on two strategies: one is relevance boosting, which requires a desirable feature to show substantially additional relevance with class labeling beyond the already selected features, and the other is feature interaction enhancing, which probabilistically compensates for feature interaction missing from simple aggregation-based evaluation.
Abstract: Mutual information (MI) is a powerful concept for correlation-centric applications. It has been used for feature selection from microarray gene expression data in many works. One of the merits of MI is that, unlike many other heuristic methods, it is based on a mature theoretic foundation. When applied to microarray data, however, it faces some challenges. First, due to the large number of features (i.e., genes) present in microarray data, the true distributions of the expression values of some genes may be distorted by noise. Second, evaluating inter-group mutual information requires estimating multi-variate distributions, which is quite difficult if not impossible. To address these problems, in this paper we propose a new MI-based feature selection approach for microarray data. Our approach relies on two strategies: one is relevance boosting, which requires a desirable feature to show substantial additional relevance with the class labeling beyond the already selected features; the other is feature interaction enhancing, which probabilistically compensates for feature interaction missing from simple aggregation-based evaluation. We justify our approach from both a theoretical perspective and experimental results. We use a synthetic dataset to show the statistical significance of the proposed strategies, and real-life datasets to show the improved performance of our approach over existing methods.
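The relevance-boosting idea can be approximated with a small mRMR-style sketch: a candidate's mutual information with the labels is discounted by its redundancy with features already selected. This is not the paper's exact criterion (and omits its interaction-enhancing term), and the toy features below are constructed by hand.

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """I(X;Y) in nats, estimated from two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def greedy_select(features, labels, k=2):
    """Greedy pick: relevance to labels minus redundancy with the
    features already chosen (a crude stand-in for relevance boosting)."""
    chosen, remaining = [], list(features)
    while len(chosen) < k and remaining:
        def gain(name):
            rel = mutual_info(features[name], labels)
            red = max((mutual_info(features[name], features[s]) for s in chosen),
                      default=0.0)
            return rel - red
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen

labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {'A': [0, 0, 0, 1, 1, 1, 1, 1],   # informative (one flip of labels)
            'B': [0, 0, 0, 1, 1, 1, 1, 1],   # redundant copy of A
            'C': [0, 1, 0, 0, 0, 1, 1, 1]}   # weaker but complementary to A
print(greedy_select(features, labels))        # B is skipped as redundant
```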

Journal ArticleDOI
TL;DR: This paper proposes a novel algorithm for the (l, d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA), designed to take advantage of the Micron Automata Processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel.
Abstract: Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26, 11). We propose a novel algorithm for the (l, d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the Micron Automata Processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability of solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39, 18) and (40, 17). The paper serves as a useful guide to solving problems using this new accelerator technology.
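For intuition, a brute-force CPU version of the (l, d) search is easy to write, and its candidate space (exponential in l) makes clear why hardware NFA acceleration matters. The sequences below are toy data, not from the paper.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs_within_d(motif, seq, d):
    """True if some length-l window of seq is within d substitutions of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))

def find_motifs(seqs, l, d, q):
    """Exhaustive (l, d) motif search: every candidate over {A,C,G,T}
    occurring (within d substitutions) in at least q of the sequences.
    4**l candidates, so this only scales to toy instances."""
    hits = []
    for cand in product('ACGT', repeat=l):
        motif = ''.join(cand)
        if sum(occurs_within_d(motif, s, d) for s in seqs) >= q:
            hits.append(motif)
    return hits

seqs = ['ACGTACGT', 'AAGTTCGT', 'ACGTTTTT']
print(len(find_motifs(seqs, l=4, d=1, q=3)))
```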

Journal ArticleDOI
TL;DR: A novel method named robust graph-regularized non-negative matrix factorization for characteristic gene selection using gene expression data, whose key ingredient is enforcing L21-norm minimization on the error function, which is robust to outliers and noise in the data points.
Abstract: Many methods have been considered for gene selection and analysis of gene expression data. Nonetheless, there still exists considerable room for improving the explicitness and reliability of gene selection. To this end, this paper proposes a novel method named robust graph-regularized non-negative matrix factorization for characteristic gene selection using gene expression data, which mainly contains two aspects: first, it enforces L21-norm minimization on the error function, which is robust to outliers and noise in the data points; second, it considers that the samples lie on a low-dimensional manifold embedded in a high-dimensional ambient space, and reveals the data geometric structure embedded in the original data. To demonstrate the validity of the proposed method, we apply it to gene expression datasets involving various human normal and tumor tissue samples, and the results demonstrate that the method is effective and feasible.
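The L21 loss mentioned above differs from the Frobenius loss in that each sample contributes its unsquared column norm, so a single outlying sample is not squared into dominance. The sketch below shows the L21 norm alongside plain (unregularized, Frobenius-loss) multiplicative-update NMF as a baseline; it is not the paper's robust graph-regularized algorithm, and the data are random.

```python
import numpy as np

def l21_norm(E):
    """L2,1 norm: sum of the L2 norms of the columns (data points)."""
    return np.linalg.norm(E, axis=0).sum()

def nmf(X, r, iters=200, seed=0):
    """Plain Lee-Seung multiplicative updates minimizing ||X - WH||_F^2.
    (The paper replaces this loss with the L2,1 loss and adds a
    graph-regularization term; this is only the standard baseline.)"""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.random((m, r)), rng.random((r, n))
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

X = np.random.default_rng(1).random((20, 12))   # synthetic nonnegative data
W, H = nmf(X, r=4)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(round(float(err), 3))
```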

Journal ArticleDOI
TL;DR: This paper investigates controller design for the disturbance decoupling problem (DDP) of singular Boolean control networks (SBCNs), using the semi-tensor product (STP) of matrices and the Implicit Function Theorem to solve the DDP of the SBCN.
Abstract: This paper investigates controller design for the disturbance decoupling problem (DDP) of singular Boolean control networks (SBCNs). Using the semi-tensor product (STP) of matrices and the Implicit Function Theorem, an SBCN is converted into a standard BCN. Based on the redundant variable separation technique, both state feedback and output feedback controllers are designed to solve the DDP of the SBCN. Sufficient conditions are also given to analyze the invariance of the controllers concerning the DDP of the SBCN with function perturbation. Two illustrative examples are presented to support the effectiveness of the obtained results.

Journal ArticleDOI
TL;DR: A dynamic ensemble approach to identify protein-ligand binding residues using sequence information only is proposed, and it is demonstrated that the proposed method compares favorably with the state of the art.
Abstract: Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of protein residues that physically bind to ligands is important for drug design and protein docking studies. Most successful protein-ligand binding predictions were based on known structures. However, structural information is largely unavailable in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures. Results: This paper proposes a dynamic ensemble approach to identify protein-ligand binding residues by using sequence information only. To avoid problems resulting from the highly imbalanced samples between ligand-binding sites and non-ligand-binding sites, we constructed several balanced data sets and trained a random forest classifier for each of them. We dynamically selected a subset of classifiers according to the similarity between the target protein and the proteins in the training data set. Combining the predictions of the classifier subset for each query protein target yielded the final predictions. The ensemble of these classifiers formed a sequence-based predictor to identify protein-ligand binding sites. Conclusions: Experimental results on two Critical Assessment of protein Structure Prediction datasets and the ccPDB dataset demonstrated that our proposed method compared favorably with the state of the art. Availability: http://www2.ahu.edu.cn/pchen/web/LigandDSES.htm

Journal ArticleDOI
TL;DR: The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods over a wide range of window lengths and is dominant in the prediction of both short and long exons.
Abstract: Prediction of protein coding regions is an important topic in the field of genomic sequence analysis. Several spectrum-based techniques for the prediction of protein coding regions have been proposed. However, the outstanding issue in most of the proposed techniques is that they depend on an experimentally selected, predefined value of the window length. In this paper, we propose a new Wide-Range Wavelet Window (WRWW) method for the prediction of protein coding regions. The analysis of the proposed wavelet window shows that its frequency response can adapt its width to accommodate changes in the window length, so that it can allow or suppress frequencies other than the basic frequency in the analysis of DNA sequences. This feature makes the proposed window capable of analyzing DNA sequences over a wide range of window lengths without degradation in performance. The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods over a wide range of window lengths. In addition, the experimental analysis has shown that the proposed method is dominant in the prediction of both short and long exons.
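The basic frequency these window-based spectral methods rely on is 1/3: the power of the four base-indicator sequences at that frequency peaks in protein-coding regions because of codon periodicity. A minimal sketch of that classic measure (not the WRWW window itself), on toy sequences:

```python
import numpy as np

def period3_power(seq):
    """Spectral content of a DNA window at frequency 1/3 (codon
    periodicity), summed over the four binary indicator sequences."""
    k = np.arange(len(seq))
    w = np.exp(-2j * np.pi * k / 3)          # DFT basis vector at f = 1/3
    total = 0.0
    for base in 'ACGT':
        u = np.array([1.0 if c == base else 0.0 for c in seq])
        total += abs(np.dot(u, w)) ** 2
    return total

coding_like = 'ATGATGATGATGATGATG'   # strong 3-periodicity
random_like = 'ATTGCCAGTCAAGTCGGA'   # no particular periodicity
print(period3_power(coding_like) > period3_power(random_like))   # True
```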

Journal ArticleDOI
TL;DR: A novel focal stacking technique, FocusALL, which is based on the modified Harris Corner Response Measure is introduced, which outperforms other methods on protein crystallization images and performs comparably well on other datasets such as retinal epithelial images and simulated datasets.
Abstract: Automated analysis of microscopic images, such as protein crystallization images and cellular images, is an important research area. If objects in a scene appear at different depths with respect to the camera's focal point, objects outside the depth of field usually appear blurred. Therefore, scientists capture a collection of images with different depths of field. Focal stacking is a technique of creating a single focused image from a stack of images collected with different depths of field. In this paper, we introduce a novel focal stacking technique, FocusALL, based on our modified Harris Corner Response Measure. We also propose enhanced FocusALL for application to images collected under high resolution and varying illumination. FocusALL resolves problems related to the assumption that in-focus regions have high contrast and high intensity. In particular, FocusALL generates sharper boundaries around protein crystal regions and well-focused composites for high-resolution images in reasonable time. FocusALL outperforms other methods on protein crystallization images and performs comparably well on other datasets, such as retinal epithelial images and simulated datasets.
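The core of any focal stacking method is a per-pixel focus measure used to pick, at each location, the sharpest image in the stack. The sketch below uses a simple Laplacian magnitude as a stand-in for the paper's modified Harris Corner Response Measure; it conveys the fusion step only, not FocusALL itself.

```python
def local_sharpness(img, r, c):
    """Laplacian-magnitude focus measure at one pixel: large where local
    intensity varies sharply (edges, corners), small in blurred regions."""
    return abs(4 * img[r][c] - img[r - 1][c] - img[r + 1][c]
               - img[r][c - 1] - img[r][c + 1])

def focal_stack(stack):
    """Fuse a stack of equally sized grayscale images (2-D lists) by taking,
    at each interior pixel, the value from the image that is sharpest there.
    Border pixels are copied from the first image for simplicity."""
    h, w = len(stack[0]), len(stack[0][0])
    out = [row[:] for row in stack[0]]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            best = max(stack, key=lambda img: local_sharpness(img, r, c))
            out[r][c] = best[r][c]
    return out
```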

Journal ArticleDOI
TL;DR: This work proposes a new regression based method named bLARS that permits a variety of regulatory interactions from a predefined but otherwise arbitrary family of functions and offers the best performance among currently available similar algorithms.
Abstract: Inferring gene regulatory networks (GRNs) from high-throughput gene-expression data is an important and challenging problem in systems biology. Several existing algorithms formulate GRN inference as a regression problem, under the assumption that all regulatory interactions are linear. However, nonlinear transcription regulation mechanisms are common in biology. In this work, we propose a new regression-based method named bLARS that permits a variety of regulatory interactions drawn from a predefined but otherwise arbitrary family of functions. On three DREAM benchmark datasets, namely gene expression data from E. coli, Yeast, and a synthetic data set, bLARS outperforms state-of-the-art algorithms in terms of the overall score. On the individual networks, bLARS offers the best performance among currently available similar algorithms, namely algorithms that do not use perturbation information and are not meta-algorithms. Moreover, the presented approach can also be utilized for general feature selection problems in domains other than biology, provided they have a similar structure.
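The bLARS idea of allowing nonlinear but pre-specified regulatory response shapes can be illustrated by expanding each candidate regulator through a dictionary of functions and selecting the transformed columns most correlated with the target gene's expression. This greedy marginal-correlation sketch omits the LARS residual updates and coefficient path entirely; it is illustrative only.

```python
import math

def forward_select(X, y, funcs, k):
    """Greedily pick k (regulator, function-name) pairs whose transformed
    column f(x_j) has the largest absolute Pearson correlation with the
    target expression y, over a predefined family of response shapes."""
    def corr(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        du = math.sqrt(sum((a - mu) ** 2 for a in u))
        dv = math.sqrt(sum((b - mv) ** 2 for b in v))
        return num / (du * dv) if du and dv else 0.0
    chosen = []
    for _ in range(k):
        best = max(((j, name, abs(corr([f(v) for v in col], y)))
                    for j, col in enumerate(X)
                    for name, f in funcs.items()
                    if (j, name) not in chosen),
                   key=lambda t: t[2])
        chosen.append((best[0], best[1]))
    return chosen
```

A quadratic regulatory response, invisible to a purely linear fit, is caught as soon as a square term is in the dictionary.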

Journal ArticleDOI
TL;DR: Results show that MMHO-DBN is more accurate than current time-delayed GRN learning methods, and has an intermediate computing performance, and it is able to learn long time-Delayed relationships between genes.
Abstract: Accurately reconstructing gene regulatory networks (GRNs) from gene expression data is a challenging task in systems biology. Although some progress has been made, the performance of GRN reconstruction still has much room for improvement. Because many regulatory events are asynchronous, learning gene interactions with multiple time delays is an effective way to improve the accuracy of GRN reconstruction. Here, we propose a new approach, called Max-Min high-order dynamic Bayesian network (MMHO-DBN), by extending the Max-Min hill-climbing Bayesian network technique originally devised for learning a Bayesian network's structure from static data. Our MMHO-DBN can explicitly model the time lags between regulators and targets in an efficient manner. It first uses constraint-based ideas to limit the space of potential structures, and then applies search-and-score ideas to find an optimal HO-DBN structure. The performance of MMHO-DBN on GRN reconstruction was evaluated using both synthetic and real gene expression time-series data. Results show that MMHO-DBN is more accurate than current time-delayed GRN learning methods and has intermediate computing performance. Furthermore, it is able to learn long time-delayed relationships between genes. We applied sensitivity analysis to our model to study the performance variation across different parameter settings; the results provide hints on setting the parameters of MMHO-DBN.
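A toy version of time-delayed interaction scoring: evaluate each candidate lag by lagged correlation and keep the best one. MMHO-DBN does this inside a constraint-based plus search-and-score Bayesian network framework; the stand-in below conveys only the multi-delay idea, with correlation as an assumed scoring function.

```python
import math

def _corr(u, v):
    """Pearson correlation of two equal-length series (0.0 if degenerate)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = math.sqrt(sum((a - mu) ** 2 for a in u))
    dv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return num / (du * dv) if du and dv else 0.0

def best_lag(reg, tgt, max_lag):
    """Return (lag, score): the time delay in 1..max_lag at which the
    regulator series best explains the target series, scored by the
    absolute lagged correlation."""
    scored = [(lag, abs(_corr(reg[:-lag], tgt[lag:])))
              for lag in range(1, max_lag + 1)]
    return max(scored, key=lambda t: t[1])
```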

Journal ArticleDOI
TL;DR: This paper develops novel haplotype assembly schemes that rely on the bit-flipping and belief propagation algorithms often used in communication systems, and demonstrates on both simulated and experimental data that the proposed algorithms compare favorably with state-of-the-art haplotype assembly methods in terms of accuracy, while being scalable and computationally efficient.
Abstract: High-throughput DNA sequencing technologies allow fast and affordable sequencing of individual genomes and thus enable unprecedented studies of genetic variations. Information about variations in the genome of an individual is provided by haplotypes, ordered collections of single nucleotide polymorphisms. Knowledge of haplotypes is instrumental in finding disease-associated genes, in drug development, and in evolutionary studies. Haplotype assembly from high-throughput sequencing data is challenging due to errors and limited lengths of sequencing reads. The key observation made in this paper is that the minimum error-correction formulation of the haplotype assembly problem is identical to the task of deciphering a coded message received over a noisy channel, a classical problem in the mature field of communication theory. Exploiting this connection, we develop novel haplotype assembly schemes that rely on the bit-flipping and belief propagation algorithms often used in communication systems. The latter algorithm is then adapted to the haplotype assembly of polyploids. We demonstrate on both simulated and experimental data that the proposed algorithms compare favorably with state-of-the-art haplotype assembly methods in terms of accuracy, while being scalable and computationally efficient.
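The decoding view of minimum-error-correction haplotype assembly can be sketched for a diploid as alternating read-to-haplotype assignment and majority-vote re-estimation, much as one iteratively decodes a corrupted codeword. The details below (initialization, tie-breaking, stopping rule) are assumptions for illustration, not the paper's algorithm.

```python
def mec_bitflip(reads, n, iters=10):
    """Toy bit-flipping heuristic for minimum-error-correction diploid
    haplotype assembly. Reads are strings over '0','1','-' of length n,
    where '-' marks a position the read does not cover. Alternates
    (i) assigning each read to the closer of the current haplotype and its
    complement, and (ii) re-estimating each position by majority vote."""
    def dist(read, hap):
        return sum(1 for r, h in zip(read, hap) if r != '-' and r != h)
    def flip(read):
        return ''.join('1' if c == '0' else '0' if c == '1' else '-'
                       for c in read)
    hap = ['0'] * n
    for _ in range(iters):
        comp = ['1' if b == '0' else '0' for b in hap]
        aligned = [r if dist(r, hap) <= dist(r, comp) else flip(r)
                   for r in reads]
        new = []
        for j in range(n):
            ones = sum(1 for a in aligned if a[j] == '1')
            zeros = sum(1 for a in aligned if a[j] == '0')
            new.append('1' if ones > zeros else '0')
        if new == hap:
            break
        hap = new
    return ''.join(hap)
```

Since the two haplotypes of a diploid are interchangeable, recovering either member of the complementary pair counts as success.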

Journal ArticleDOI
TL;DR: A novel computational method, called non-negative sparse singular value decomposition (NN-SSVD), is proposed to address the RCNA discovery problem in complex patterns; applied to a breast cancer dataset, it identifies a number of genomic regions that are strongly correlated with previous studies and harbor many known breast cancer associated genes.
Abstract: Recurrent copy number aberrations (RCNAs) in multiple cancer samples are strongly associated with tumorigenesis, and RCNA discovery is helpful to cancer research and treatment. Despite the emergence of numerous RCNA discovery methods, most are unable to detect RCNAs in complex patterns influenced by complicating factors, including aberrations present in only part of the samples, co-existence of gains and losses, and normal-like tumor samples. Here, we propose a novel computational method, called non-negative sparse singular value decomposition (NN-SSVD), to address the RCNA discovery problem in complex patterns. In NN-SSVD, the measurement of RCNA is based on the aberration frequency in a part of the samples rather than all samples, which circumvents the complexity of the different RCNA patterns. We evaluate NN-SSVD on a synthetic dataset by comparing detection scores and Receiver Operating Characteristic curves, and the results show that NN-SSVD outperforms existing methods in RCNA discovery and is more robust to RCNA complicating factors. Applying our approach to a breast cancer dataset, we successfully identify a number of genomic regions that are strongly correlated with previous studies and that harbor many known breast cancer associated genes.
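The key NN-SSVD intuition, scoring a region by its aberration frequency in part of the samples rather than all of them, can be illustrated with a simple top-fraction mean. The fraction parameter and scoring function here are assumptions for illustration, not the paper's SVD-based formulation.

```python
def rcna_score(matrix, frac=0.3):
    """Score each genomic region (row) by the mean |aberration| over the top
    `frac` of samples at that region, instead of over all samples, so a
    recurrent aberration confined to a subset of tumors is not diluted by
    normal-like samples."""
    scores = []
    for region in matrix:  # rows = regions, columns = samples
        k = max(1, int(len(region) * frac))
        top = sorted((abs(v) for v in region), reverse=True)[:k]
        scores.append(sum(top) / k)
    return scores
```

A region aberrant in 3 of 10 samples keeps its full amplitude under this score, whereas an all-sample mean would shrink it toward the noise floor.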

Journal ArticleDOI
TL;DR: This study comprehensively evaluates the performance of five types of probabilistic hierarchical classification methods used for predicting Gene Ontology (GO) terms related to ageing and concludes that the LHC-PCT algorithm ranks better across several tests.
Abstract: This study comprehensively evaluates the performance of five types of probabilistic hierarchical classification methods used for predicting Gene Ontology (GO) terms related to ageing. Of those tested, a new hybrid of a Local Hierarchical Classifier (LHC) and the Predictive Clustering Tree algorithm (LHC-PCT) had the best predictive accuracy results. We also tested the impact of two types of variations common to most hierarchical classification algorithms: (a) changing the base algorithm (we tested Naive Bayes and Support Vector Machines), and (b) using or not using the Correlation-based Feature Selection (CFS) algorithm in a pre-processing step. In total, we evaluated the predictive performance of 17 variations of hierarchical classifiers across 15 datasets of ageing- and longevity-related genes. We conclude that the LHC-PCT algorithm ranks best across several tests (seven out of 12). In addition, we interpreted the models generated by the PCT algorithm to show how hierarchical classification algorithms can be used to extract biological insights from the ageing-related datasets that we compiled.
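Local hierarchical classifiers such as LHC must keep per-term probabilities consistent with the GO hierarchy: under the true-path rule, a term cannot be more probable than its ancestors. A minimal consistency-enforcement sketch (one common post-processing choice, not necessarily the one used in the paper):

```python
def propagate(probs, parents):
    """Enforce the true-path rule on per-term GO probabilities: cap each
    term's probability by the minimum over its parents, relaxing until
    stable. Fine for the small DAGs of a prediction's candidate terms."""
    out = dict(probs)
    changed = True
    while changed:
        changed = False
        for term, ps in parents.items():
            cap = min((out[p] for p in ps), default=1.0)
            if out[term] > cap:
                out[term] = cap
                changed = True
    return out
```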

Journal ArticleDOI
TL;DR: This approach provides a systematic way of searching for the feasible topologies and corresponding parameter sets that give gene regulatory networks robust adaptation, and can be used to address more complex issues in biological networks.
Abstract: Robust adaptation plays a key role in gene regulatory networks, and it is thought to be an important attribute enabling organisms or cells to survive in fluctuating conditions. In this paper, a simplified three-node enzyme network is modeled by Michaelis-Menten rate equations for all possible topologies, and a family of topologies and corresponding parameter sets of networks with satisfactory adaptation are obtained using a multi-objective genetic algorithm. The proposed approach improves computation efficiency significantly compared to the time-consuming exhaustive search method. It provides a systematic way of searching for the feasible topologies and corresponding parameter sets that give gene regulatory networks robust adaptation. Owing to its universality and simplicity, the proposed methodology can be used to address more complex issues in biological networks.
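Screening topologies for adaptation typically uses two objectives: sensitivity (how strongly the output responds transiently to an input step) and precision (how completely it returns to its pre-stimulus level). The sketch below uses the widely cited relative-change definitions of these metrics; they are an assumption here, not quoted from the paper.

```python
def adaptation_metrics(output, o0, input_change):
    """Sensitivity: peak relative output change divided by the relative
    input change. Precision: reciprocal of the relative steady-state
    error, so perfect adaptation (output returns exactly to its
    pre-stimulus level o0) yields infinite precision."""
    peak = max(output, key=lambda v: abs(v - o0))
    sensitivity = abs((peak - o0) / o0) / input_change
    error = abs((output[-1] - o0) / o0) / input_change
    precision = float('inf') if error == 0 else 1.0 / error
    return sensitivity, precision
```

A multi-objective genetic algorithm would then retain the topology/parameter pairs whose simulated traces score well on both objectives simultaneously.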

Journal ArticleDOI
TL;DR: A software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information and uses a flexible, modular, and multilevel global data schema based on abstraction and generalization of integrated data features.
Abstract: Understanding complex biological phenomena involves answering complex biomedical questions about multiple biomolecular data simultaneously, which are expressed through multiple genomic and proteomic semantic annotations scattered across many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper biologists’ ability to ask global queries and perform global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome, and OMIM). Our solution is general, as it uses a flexible, modular, and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures that ease data integration and maintenance, even when the integrated data sources evolve in content, structure, and number. These procedures also assure consistency, quality, and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/ , a Web interface allows easy graphical composition of queries, even complex ones, on the knowledge base, and also supports semantic query expansion and comprehensive exploratory search of the integrated data to better sustain biomedical knowledge extraction.
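The "semantic closure of the hierarchical relationships" mentioned above is the transitive closure of the ontologies' is_a relations, so that an annotation to a term implicitly annotates all of its ancestors. A minimal sketch of that closure step:

```python
def semantic_closure(is_a):
    """Transitive closure of an ontology's is_a relations: after closure,
    each term maps to the set of all its ancestors, so queries on a
    general term also retrieve data annotated to its descendants."""
    closure = {t: set(ps) for t, ps in is_a.items()}
    changed = True
    while changed:
        changed = False
        for t in closure:
            extra = set()
            for p in closure[t]:
                extra |= closure.get(p, set())
            if not extra <= closure[t]:
                closure[t] |= extra
                changed = True
    return closure
```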

Journal ArticleDOI
TL;DR: A chemometric passport-based approach to improve the security of the pharmaceutical supply chain based on applying nuclear quadrupole resonance (NQR) spectroscopy to authenticate the contents of medicine packets is described.
Abstract: The production and sale of counterfeit and substandard pharmaceutical products, such as essential medicines, is an important global public health problem. We describe a chemometric passport-based approach to improve the security of the pharmaceutical supply chain. Our method is based on applying nuclear quadrupole resonance (NQR) spectroscopy to authenticate the contents of medicine packets. NQR is a non-invasive, non-destructive, and quantitative radio frequency (RF) spectroscopic technique. It is sensitive to subtle features of the solid-state chemical environment and thus generates unique chemical fingerprints that are intrinsically difficult to replicate. We describe several advanced NQR techniques, including two-dimensional measurements, polarization enhancement, and spin density imaging, that further improve the security of our authentication approach. We also present experimental results that confirm the specificity and sensitivity of NQR and its ability to detect counterfeit medicines.