Showing papers in "Journal of Bioinformatics and Computational Biology in 2019"


Journal ArticleDOI
TL;DR: The developed spectral-convolutional neural network based method achieves success in integrating protein interaction network data and gene expression profiles to classify lung cancer.
Abstract: Deep learning technologies are permeating every field from image and speech recognition to computational and systems biology. However, the application of convolutional neural networks (CNNs) to "omics" data poses some difficulties, such as the processing of complex network structures and their integration with transcriptome data. Here, we propose a CNN approach that combines spectral clustering information processing to classify lung cancer. The developed spectral-convolutional neural network based method achieves success in integrating protein interaction network data and gene expression profiles to classify lung cancer. The performed computational experiments suggest that in terms of accuracy the predictive performance of our proposed method was better than those of other machine learning methods such as SVM or Random Forest. Moreover, the computational results also indicate that the underlying protein network structure helps to enhance the predictions. Data and CNN code can be downloaded from the link: https://sites.google.com/site/nacherlab/analysis.

19 citations


Journal ArticleDOI
TL;DR: The cutting-edge machine learning method of artificial intelligence is adopted to develop a powerful model for improving MoRFs prediction, and the accuracy of the novel proposed method was comparable with that of state-of-the-art methods.
Abstract: Molecular recognition features (MoRFs) are key functional regions of intrinsically disordered proteins (IDPs), which play important roles in the molecular interaction network of cells and are implicated in many serious human diseases. Identifying MoRFs is essential for both functional studies of IDPs and drug design. This study adopts the cutting-edge machine learning method of artificial intelligence to develop a powerful model for improving MoRFs prediction. We proposed a method, named as en_DCNNMoRF (ensemble deep convolutional neural network-based MoRF predictor). It combines the outcomes of two independent deep convolutional neural network (DCNN) classifiers that take advantage of different features. The first, DCNNMoRF1, employs position-specific scoring matrix (PSSM) and 22 types of amino acid-related factors to describe protein sequences. The second, DCNNMoRF2, employs PSSM and 13 types of amino acid indexes to describe protein sequences. For both single classifiers, DCNN with a novel two-dimensional attention mechanism was adopted, and an average strategy was added to further process the output probabilities of each DCNN model. Finally, en_DCNNMoRF combined the two models by averaging their final scores. When compared with other well-known tools applied to the same datasets, the accuracy of the novel proposed method was comparable with that of state-of-the-art methods. The related web server can be accessed freely via http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/en_MoRFs.php .

14 citations
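
The final combination step described in the en_DCNNMoRF abstract above (averaging the final scores of the two single classifiers) can be sketched as follows. This is only an illustration under stated assumptions: the function name `average_scores`, the example score values, and the 0.5 decision threshold are hypothetical and do not come from the paper, and the two DCNN models themselves are not reproduced.

```python
from typing import List


def average_scores(scores_a: List[float], scores_b: List[float]) -> List[float]:
    """Average two per-residue score lists position by position."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same residues")
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


# Hypothetical per-residue MoRF propensities from the two single classifiers.
model1_scores = [0.2, 0.8, 0.6, 0.1]
model2_scores = [0.4, 0.6, 0.8, 0.1]

ensemble_scores = average_scores(model1_scores, model2_scores)

# Hypothetical decision threshold; the abstract does not state one.
THRESHOLD = 0.5
predicted_morf = [score >= THRESHOLD for score in ensemble_scores]
```

Plain averaging gives both models equal weight, which matches the wording of the abstract; any weighting scheme would be an additional assumption.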


Journal ArticleDOI
TL;DR: Experimental work shows that the proposed approach, WIT, performs best for protein sequence indexing; other alignment parameters, accuracy and confidentiality, are also experimentally shown to be better than those of Minimap2.
Abstract: New generation sequencing machines: Illumina and Solexa can generate millions of short reads from a given genome sequence on a single run. Alignment of these reads to a reference genome is a core s...

14 citations


Journal ArticleDOI
TL;DR: This work uses a dynamic protein-protein interaction network (PPIN), time course gene expression data and protein sequence information to predict the functional annotation of proteins, and achieves an average precision, recall and F-Score of 0.59, 0.76 and 0.61 respectively, which is significantly higher than the reported state-of-the-art methods.
Abstract: Computational prediction of functional annotation of proteins is an uphill task. There is an ever increasing gap between functional characterization of protein sequences and deluge of protein sequences generated by large-scale sequencing projects. The dynamic nature of protein interactions is frequently observed which is mostly influenced by any new change of state or change in stimuli. Functional characterization of proteins can be inferred from their interactions with each other, which is dynamic in nature. In this work, we have used a dynamic protein-protein interaction network (PPIN), time course gene expression data and protein sequence information for prediction of functional annotation of proteins. During progression of a particular function, it has also been observed that not all the proteins are active at all time points. For unannotated active proteins, our proposed methodology explores the dynamic PPIN consisting of level-1 and level-2 neighboring proteins at different time points, filtered by Damerau-Levenshtein edit distance to estimate the similarity between two protein sequences and coefficient variation methods to assess the strength of an edge in a network. Finally, from the filtered dynamic PPIN, at each time point, functional annotations of the level-2 proteins are assigned to the unknown and unannotated active proteins through the level-1 neighbor, following a bottom-up strategy. Our proposed methodology achieves an average precision, recall and F-Score of 0.59, 0.76 and 0.61 respectively, which is significantly higher than the reported state-of-the-art methods.

13 citations
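
The abstract above filters network neighbors by Damerau-Levenshtein edit distance between protein sequences. A minimal sketch of such a distance is given below. Note the assumptions: this implements the restricted variant (optimal string alignment, where each substring is edited at most once), which may differ from the exact variant used in the paper, and the normalized `sequence_similarity` helper and the example peptides are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def sequence_similarity(a: str, b: str) -> float:
    """Hypothetical normalization of the distance to a [0, 1] similarity."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


# Two toy peptides that differ by one adjacent swap (AY vs. YA).
similarity = sequence_similarity("MKTAYIAK", "MKTYAIAK")
```

A transposition-aware distance treats an adjacent swap as a single edit, whereas plain Levenshtein distance would count it as two substitutions.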


Journal ArticleDOI
TL;DR: MetaTox is a web application for generating xenobiotic metabolic pathways in the human organism; metabolite generation is based on fragment datasets that describe transformations of substrate structures into metabolite structures.
Abstract: Xenobiotic biotransformation in humans is a process of chemical modification, which may lead to the formation of toxic metabolites. The prediction of such metabolites is very important for drug development and ecotoxicology studies. We created the web-application MetaTox ( http://way2drug.com/mg ) for the generation of xenobiotic metabolic pathways in the human organism. For each generated metabolite, estimations of the acute toxicity (based on GUSAR software prediction), organ-specific carcinogenicity and adverse effects (based on PASS software prediction) are performed. Generation of metabolites by MetaTox is based on fragment datasets, which describe transformations of substrate structures into metabolite structures. We added three new classes of biotransformation reactions: Dehydrogenation, Glutathionation, and Hydrolysis, and metabolite generation is now available for the 15 most frequent classes of xenobiotic biotransformation reactions. MetaTox calculates the probability of formation of each generated metabolite, an integrated assessment of the probabilities of the biotransformation reactions and their sites using the algorithm of PASS ( http://way2drug.com/passonline ). The prediction accuracy estimated by the leave-one-out cross-validation (LOO-CV) procedure, calculated separately for the probabilities of biotransformation reactions and their sites, is about 0.9 on average for all reactions.

13 citations


Journal ArticleDOI
TL;DR: Bioinformatic analysis of the RNA-seq data deposited in The Cancer Genome Atlas consortium database revealed at least six genes that could serve as prognostic markers of locally advanced lymph node-positive PCa.
Abstract: Prostate cancer (PCa) is one of the primary causes of cancer-related mortality in men worldwide. Patients with locally advanced PCa with metastases in regional lymph nodes are usually marked as a h...

12 citations


Journal ArticleDOI
TL;DR: This work provides an effective model for predicting GTP binding sites in Rab proteins and a basis for further research applying deep learning in bioinformatics, especially in nucleotide binding site prediction.
Abstract: Deep learning has been increasingly and widely used to solve numerous problems in various fields with state-of-the-art performance. It can also be applied in bioinformatics to reduce the requiremen...

12 citations


Journal ArticleDOI
TL;DR: It has been shown that the inclusion of structurally similar templates with ample conformational diversity is crucial for the modeling algorithm to maximally as well as reliably span the target sequence and construct its near-native model.
Abstract: In contrast to ab-initio protein modeling methodologies, comparative modeling is considered as the most popular and reliable algorithm to model protein structure. However, the selection of the best set of templates is still a major challenge. An effective template-ranking algorithm is developed to efficiently select only the reliable hits for predicting the protein structures. The algorithm employs the pairwise as well as multiple sequence alignments of template hits to rank and select the best possible set of templates. It captures several key sequences and structural information of template hits and converts into scores to effectively rank them. This selected set of templates is used to model a target. Modeling accuracy of the algorithm is tested and evaluated on TBM-HA domain containing CASP8, CASP9 and CASP10 targets. On an average, this template ranking and selection algorithm improves GDT-TS, GDT-HA and TM_Score by 3.531, 4.814 and 0.022, respectively. Further, it has been shown that the inclusion of structurally similar templates with ample conformational diversity is crucial for the modeling algorithm to maximally as well as reliably span the target sequence and construct its near-native model. The optimal model sampling also holds the key to predict the best possible target structure.

12 citations


Journal ArticleDOI
TL;DR: It is observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods, especially when expanding to the 100 most statistically significant reported gene sets.
Abstract: Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but raises the question of the extent to which results are affected by the choice of method. Our systematic analysis of 13 popular methods using 6 different datasets, from both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. We observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods. Furthermore, there was substantial disagreement between the 20 most statistically significant gene sets reported by the methods. This was also observed when expanding to the 100 most statistically significant reported gene sets. For different datasets of the same phenotype/condition, the top 20 and top 100 most significant results also showed little to no agreement even when using the same method. GAGE, PAGE, and ORA were the only methods able to achieve relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in terms of the relevance of the top 20 and top 100 most significant gene sets to known biology of the disease, where GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.

11 citations
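
The comparison described in the abstract above (agreement between the top 20 or top 100 gene sets reported by different methods) can be illustrated with a simple overlap measure. The sketch below uses the Jaccard index as an assumed agreement measure; the abstract does not say which measure the authors used, and the method names and gene set identifiers are made up for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def top_k_jaccard(ranked_a: List[str], ranked_b: List[str], k: int) -> float:
    """Jaccard index between the k most significant gene sets of two rankings."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def pairwise_agreement(
    results: Dict[str, List[str]], k: int
) -> Dict[Tuple[str, str], float]:
    """Top-k agreement for every pair of methods, keyed by sorted method names."""
    return {
        (m1, m2): top_k_jaccard(results[m1], results[m2], k)
        for m1, m2 in combinations(sorted(results), 2)
    }


# Hypothetical ranked outputs (most significant gene set first).
results = {
    "method_A": ["set1", "set2", "set3", "set4"],
    "method_B": ["set2", "set1", "set5", "set6"],
    "method_C": ["set7", "set8", "set9", "set1"],
}
agreement = pairwise_agreement(results, k=2)
```

In practice `k` would be 20 or 100, as in the abstract; a value of 0.0 means two methods share none of their top-ranked gene sets.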


Journal ArticleDOI
TL;DR: Comprehensive analysis of eight cancer types demonstrates that the evolutionary conservation-based models represent a valid and helpful approach for identifying cancer subtypes and the core gene set offers distinguishable clues of cancer subtypes.
Abstract: Cancer subtype identification is an unmet need in precision diagnosis. Recently, evolutionary conservation has been indicated to contain informative signatures for functional significance in cancers. However, the importance of evolutionary conservation in distinguishing cancer subtypes remains largely unclear. Here, we identified the evolutionarily conserved genes (i.e. core genes) and observed that they are primarily involved in cellular pathways relevant to cell growth and metabolisms. By using these core genes, we developed two novel strategies, namely a feature-based strategy (FES) and an image-based strategy (IMS) by integrating their evolutionary and genomic profiles with the deep learning algorithm. In comparison with the FES using the random set and the strategy using the PAM50 classifier, the core gene set-based FES achieved a higher accuracy for identifying breast cancer subtypes. The IMS and FES using the core gene set yielded better performances than the other strategies, in terms of classifying both breast cancer subtypes and multiple cancer types. Moreover, the IMS is reproducible even using different gene expression data (i.e. RNA-seq and microarray). Comprehensive analysis of eight cancer types demonstrates that our evolutionary conservation-based models represent a valid and helpful approach for identifying cancer subtypes and the core gene set offers distinguishable clues of cancer subtypes.

10 citations



Journal ArticleDOI
TL;DR: A Python-based standalone tool, called PyPredT6, is designed, and in silico prediction of T6 effector proteins in Vibrio cholerae and Yersinia pestis is performed to establish the applicability of PyPredT6.
Abstract: Prediction of effector proteins is of paramount importance due to their crucial role as first-line invaders while establishing a pathogen-host interaction, often leading to infection of the host. Prediction of T6 effector proteins is a new challenge since the discovery of T6 Secretion System and the unique nature of the particular secretion system. In this paper, we have first designed a Python-based standalone tool, called PyPredT6, to predict T6 effector proteins. A total of 873 unique features has been extracted from the peptide and nucleotide sequences of the experimentally verified effector proteins. Based on these features and using machine learning algorithms, we have performed in silico prediction of T6 effector proteins in Vibrio cholerae and Yersinia pestis to establish the applicability of PyPredT6. PyPredT6 is available at http://projectphd.droppages.com/PyPredT6.html .

Journal ArticleDOI
TL;DR: This study proposes an ensemble learning strategy, named MoRFPred_en, to predict MoRFs from protein sequences, which combines four submodels that utilize different sequence-derived features for the prediction, including a multichannel one-dimensional convolutional neural network (CNN_1D multichannel) based model.
Abstract: Molecular recognition features (MoRFs) usually act as "hub" sites in the interaction networks of intrinsically disordered proteins (IDPs). Because an increasing number of serious diseases have been found to be associated with disordered proteins, identifying MoRFs has become increasingly important. In this study, we propose an ensemble learning strategy, named MoRFPred_en, to predict MoRFs from protein sequences. This approach combines four submodels that utilize different sequence-derived features for the prediction, including a multichannel one-dimensional convolutional neural network (CNN_1D multichannel) based model, two deep two-dimensional convolutional neural network (DCNN_2D) based models, and a support vector machine (SVM) based model. When compared with other methods on the same datasets, the MoRFPred_en approach produced better results than existing state-of-the-art MoRF prediction methods, achieving an AUC of 0.762 on the VALIDATION419 dataset, 0.795 on the TEST45 dataset, and 0.776 on the TEST49 dataset. Availability: http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/MoRFPred_en.php.

Journal ArticleDOI
TL;DR: A new method is proposed, Identification of Protein Complex based on Refined Protein Interaction Network (IPC-RPIN), which integrates the topology, gene expression profiles and GO functional annotation information to predict protein complexes from the reconstructed networks.
Abstract: The prediction of protein complexes based on the protein interaction network is a fundamental task for the understanding of cellular life as well as the mechanisms underlying complex disease. A great number of methods have been developed to predict protein complexes based on protein-protein interaction (PPI) networks in recent years. However, because the high throughput data obtained from experimental biotechnology are incomplete, and usually contain a large number of spurious interactions, most of the network-based protein complex identification methods are sensitive to the reliability of the PPI network. In this paper, we propose a new method, Identification of Protein Complex based on Refined Protein Interaction Network (IPC-RPIN), which integrates the topology, gene expression profiles and GO functional annotation information to predict protein complexes from the reconstructed networks. To demonstrate the performance of the IPC-RPIN method, we evaluated the IPC-RPIN on three PPI networks of Saccharomyces cerevisiae and compared it with four state-of-the-art methods. The simulation results show that the IPC-RPIN achieved a better result than the other methods on most of the measurements and is able to discover small protein complexes which have traditionally been neglected.

Journal ArticleDOI
TL;DR: The sensitivity operator, composed of the independent adjoint problem solutions ensemble, allows transforming the inverse problem to the family of nonlinear ill-posed operator equations, and is applied to the morphogen synthesis region identification problem for the model of regulation of the renewing zone size in biological tissue.
Abstract: Diffusion-reaction models are used to describe development processes in the framework of morphogen theory. The images of the concentration fields for the subset of the interacting morphogens are available. In order to interpret this data in terms of the model parameters, the inverse source problem is stated. The sensitivity operator, composed of the independent adjoint problem solutions ensemble, allows transforming the inverse problem to the family of nonlinear ill-posed operator equations. The equations are solved with the Newton-Kantorovich-type algorithm. The approach is applied to the morphogen synthesis region identification problem for the model of regulation of the renewing zone size in biological tissue.

Journal ArticleDOI
TL;DR: In this study, efforts are made to develop a quantitative structure-activity relationship (QSAR)-based model, which is used for the prediction of toxicities to reduce testing in animals, time, and money in the early stages of drug development.
Abstract: In this study, efforts are made to develop a quantitative structure–activity relationship (QSAR)-based model, which is used for the prediction of toxicities to reduce testing in animals, time, ...

Journal ArticleDOI
TL;DR: A new classification model is presented to identify more effective anti-cancer drug pairs using an in silico network biology approach, based on the hypothesis that drug synergy comes from collective effects on the biological network.
Abstract: Identification of effective drug combinations for patients is an expensive and time-consuming procedure, especially for in vitro experiments. To accelerate the synergistic drug discovery process, we present a new classification model to identify more effective anti-cancer drug pairs using an in silico network biology approach. Based on the hypothesis that drug synergy comes from collective effects on the biological network, we developed six network biology features, including overlap and distance of drug perturbation networks, that were derived by using individual drug-perturbed transcriptome profiles and the relevant biological network analysis. Using publicly available drug synergy databases and three machine-learning (ML) methods, the model was trained to discriminate the positive (synergistic) and negative (nonsynergistic) drug combinations. The proposed models were evaluated on the test cases to predict the most promising network biology feature, which is the network degree activity, i.e. the synergistic effect between drug pairs is mainly accounted for by the complementary signaling pathways or molecular networks from two drugs.

Journal ArticleDOI
TL;DR: GEREDB is a publicly available, manually curated biological database that stores the data regarding relationships between expression and regulation of human genes and has the ability to analyze user-supplied gene expression data in a causal analysis oriented manner using the GEREA bioinformatics tool.
Abstract: Understanding how genes are expressed and regulated in different biological processes are fundamental and challenging issues. Considerable progress has been made in studying the relationship betwee...

Journal ArticleDOI
TL;DR: A new inference method is developed by modifying the existing random-forest-based inference method to take advantage of its ability to analyze both time-series and static gene expression data, which can be similarly applied to many of the other existing inference methods.
Abstract: In using gene expression levels for genetic network inference, we believe that two measurements that are similar to each other are less informative than two measurements that differ from each other. Given, for example, that gene expression levels measured at two adjacent time points in a time-series experiment are often similar to each other, we assume that each measurement in the time-series experiment will be less informative than each measurement in a steady-state experiment. Based on this idea, we propose a new inference method that relies heavily on informative gene expression data. Through numerical experiments, we prove that the quality of an inferred genetic network is slightly improved by heavily weighting informative gene expression data. In this study, we develop a new method by modifying the existing random-forest-based inference method to take advantage of its ability to analyze both time-series and static gene expression data. The idea we propose can be similarly applied to many of the other existing inference methods, as well.

Journal ArticleDOI
TL;DR: A gene expression prediction model, L-GEPM, based on long short-term memory (LSTM) neural networks, which captures the nonlinear features affecting gene expression and uses learned features to predict the target genes is proposed.
Abstract: Molecular biology combined with in silico machine learning and deep learning has facilitated the broad application of gene expression profiles for gene function prediction, optimal crop breeding, disease-related gene discovery, and drug screening. Although the acquisition cost of genome-wide expression profiles has been steadily declining, the cost of generating a compendium of expression profiles using thousands of samples remains high. The Library of Integrated Network-Based Cellular Signatures (LINCS) program used approximately 1000 landmark genes to predict the expression of the remaining target genes by linear regression; however, this approach ignored the nonlinear features influencing gene expression relationships, limiting the accuracy of the experimental results. We herein propose a gene expression prediction model, L-GEPM, based on long short-term memory (LSTM) neural networks, which captures the nonlinear features affecting gene expression and uses learned features to predict the target genes. By comparing and analyzing experimental errors and fitting the effects of different prediction models, the LSTM neural network-based model, L-GEPM, can achieve low error and a superior fitting effect.

Journal ArticleDOI
TL;DR: PVsiRNAPred is the first bioinformatics algorithm for predicting plant vsiRNAs based on vsiRNA sequence composition and has favorable generalization capabilities, which are hoped to allow efficient discovery of new vsiRNAs.
Abstract: Plant exclusive virus-derived small interfering RNAs (vsiRNAs) regulate various biological processes, especially important in antiviral immunity. The identification of plant vsiRNAs is important fo...

Journal ArticleDOI
TL;DR: A new idea is proposed for the realization of mutated proteins, on the surface of which more spacious transient pockets are formed and, therefore, are more suitable for hosting drugs.
Abstract: Nowadays, it is well established that most of the human diseases which are not related to pathogen infections have their origin from DNA disorders. Thus, DNA mutations, waiting for the availability...

Journal ArticleDOI
TL;DR: The results of the evaluation indicate that the proposed method recognized regulatory relations in Bayesian modeling process well, due to using of biological knowledge which is hidden in the data collection, and is able to recognize gene regulatory networks align with important methods in this field.
Abstract: In this study, in order to deal with the noise and uncertainty in gene expression data, learning networks, especially Bayesian networks, that have the ability to use prior knowledge, were used to i...

Journal ArticleDOI
TL;DR: This work evaluates the hypothesis that basins, directly tied to stable and semi-stable states, lead to better models of dynamics; the analysis indicates that basins lead to MSMs of better quality and thus can be useful to further advance this widely-used technology for summarization of molecular equilibrium dynamics.
Abstract: Molecular dynamics (MD) simulation software allows probing the equilibrium structural dynamics of a molecule of interest, revealing how a molecule navigates its structure space one structure at a time. To obtain a broader view of dynamics, typically one needs to launch many such simulations, obtaining many trajectories. A summarization of the equilibrium dynamics requires integrating the information in the various trajectories, and Markov State Models (MSM) are increasingly being used for this task. At its core, the task involves organizing the structures accessed in simulation into structural states, and then constructing a transition probability matrix revealing the transitions between states. While now considered a mature technology and widely used to summarize equilibrium dynamics, the underlying computational process in the construction of an MSM ignores energetics even though the transition of a molecule between two nearby structures in an MD trajectory is governed by the corresponding energies. In this paper, we connect theory with simulation and analysis of equilibrium dynamics. A molecule navigates the energy landscape underlying the structure space. The structural states that are identified via off-the-shelf clustering algorithms need to be connected to thermodynamically-stable and semi-stable (macro)states among which transitions can then be quantified. Leveraging recent developments in the analysis of energy landscapes that identify basins in the landscape, we evaluate the hypothesis that basins, directly tied to stable and semi-stable states, lead to better models of dynamics. Our analysis indicates that basins lead to MSMs of better quality and thus can be useful to further advance this widely-used technology for summarization of molecular equilibrium dynamics.
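
The core MSM construction step named in the abstract above (organizing structures into states and then building a transition probability matrix) can be sketched as follows. This is a deliberately naive illustration, assuming that each trajectory has already been discretized into integer state labels; it uses plain count normalization rather than the estimators found in dedicated MSM software, and the function name, lag time, and toy trajectories are hypothetical. It does not reproduce the basin-based state definition studied in the paper.

```python
from typing import List, Sequence


def transition_matrix(
    trajectories: Sequence[Sequence[int]], n_states: int, lag: int = 1
) -> List[List[float]]:
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # Count transitions within each trajectory only, never across trajectories.
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that are never left get an all-zero row in this simple sketch.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two hypothetical discretized trajectories over three states (0, 1, 2).
trajectories = [[0, 0, 1, 1, 0], [1, 2, 2, 1]]
T = transition_matrix(trajectories, n_states=3)
```

Entry `T[i][j]` estimates the probability of moving from state `i` to state `j` in one lag step; how the states are defined (clusters versus energy basins) is exactly the choice the abstract examines.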

Journal ArticleDOI
TL;DR: IntaRNAhelix, a dynamic programming algorithm that length-restricts the runs of consecutive inter-molecular base pairs (perfect canonical stackings), which is hypothesized to implicitly model the steric and kinetic effects of interaction prediction models compared to the current state-of-the-art approach, is implemented.
Abstract: Efficient computational tools for the identification of putative target RNAs regulated by prokaryotic sRNAs rely on thermodynamic models of RNA secondary structures. While they typically predict RN...

Journal ArticleDOI
TL;DR: Using this method to predict the categories of the 6 major types of enzymes effectively improves its prediction accuracy to 94.54%, indicating that this method has general applicability to other protein problems.
Abstract: Oxidoreductase is an enzyme that widely exists in organisms. It plays an important role in cellular energy metabolism and biotransformation processes. Oxidoreductases have many subclasses with different functions, creating an important classification task in bioinformatics. In this paper, a dataset of 2640 oxidoreductase sequences was used to perform an analysis and comparison. The idea of dipeptides was introduced to process the Position Specific Score Matrix (PSSM), since each dipeptide consists of two amino acids and each column of PSSM corresponds to the information of one amino acid. Two kinds of dipeptide scores were proposed, the Standardization Normal Distribution PSSM (SND-PSSM) and the Correlation Coefficient PSSM (CC-PSSM). Recursive Feature Elimination (RFE) is used to extract features from the SND-PSSM and CC-PSSM, and the two sets of extracted features are combined to form a new feature matrix, the RFE-SND-CC-PSSM. The results show that, with the proposed method and a kernel-based nonlinear SVM classifier, the accuracy can reach 95.56% by the Jackknife test. Our method greatly improves the accuracy of oxidoreductase subclass prediction. Using this method to predict the categories of the 6 major types of enzymes effectively improves its prediction accuracy to 94.54%, indicating that this method has general applicability to other protein problems. The results show that our method is effective and universally applicable, and might be complementary to the existing methods.

Journal ArticleDOI
TL;DR: While FCS is a powerful approach, blind reliance on its non-objective p-value is ill-advised and it is found that FCS works best with big complexes.
Abstract: Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from n...

Journal ArticleDOI
Lei Gao, Cong Wu, Lin Liu
TL;DR: This pipeline encompasses quality control, adaptor trimming, collapsing of reads, structural RNA removal, length selection, read mapping, and normalized wiggle file creation and is therefore a powerful tool for the steps before meta-analysis.
Abstract: There are many short-read aligners that can map short reads to a reference genome/sequence, and most of them can directly accept a FASTQ file as the input query file. However, the raw data usually need to be pre-processed. Few software programs specialize in pre-processing raw data generated by a variety of next-generation sequencing (NGS) technologies. Here, we present AUSPP, a Perl script-based pipeline for pre-processing and automatic mapping of NGS short reads. This pipeline encompasses quality control, adaptor trimming, collapsing of reads, structural RNA removal, length selection, read mapping, and normalized wiggle file creation. It facilitates the processing from raw data to genome mapping and is therefore a powerful tool for the steps before meta-analysis. Most importantly, since AUSPP has default processing pipeline settings for many types of NGS data, most of the time, users will simply need to provide the raw data and genome. AUSPP is portable and easy to install, and the source codes are freely available at https://github.com/highlei/AUSPP.
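
Two of the pre-processing steps listed in the abstract above, collapsing of reads and length selection, can be illustrated with a short sketch. This is not AUSPP code (AUSPP is a Perl pipeline); the function names, the 18 to 26 nt length window, and the toy reads are assumptions chosen only for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def collapse_reads(reads: Iterable[str]) -> Dict[str, int]:
    """Collapse identical read sequences into sequence -> copy number."""
    return dict(Counter(reads))


def select_by_length(
    collapsed: Dict[str, int], min_len: int, max_len: int
) -> Dict[str, int]:
    """Keep only sequences whose length lies in [min_len, max_len]."""
    return {
        seq: count
        for seq, count in collapsed.items()
        if min_len <= len(seq) <= max_len
    }


# Hypothetical adaptor-trimmed reads.
reads = [
    "ACGTACGTACGTACGTACGTA",     # 21 nt, appears twice
    "ACGTACGTACGTACGTACGTA",
    "ACGT",                      # 4 nt, too short for the window below
    "TTGACCGATTGACCGATTGACCGA",  # 24 nt
]
collapsed = collapse_reads(reads)
kept = select_by_length(collapsed, min_len=18, max_len=26)
```

Collapsing before mapping means each distinct sequence is aligned once while its copy number is retained for later normalization.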

Journal ArticleDOI
TL;DR: A novel graph kernel named vertex-edge similarity kernel (VES kernel) based on mixed matrix is proposed, the innovation point of which is taking the adjacency matrix of the graph as the sample vector of each vertex and calculating kernel values by finding the most similar vertex pair of two graphs.
Abstract: At present, most research on protein classification is based on graph kernels. The essence of graph kernels is to extract substructures and use the similarity of substructures as the kernel values. In this paper, we propose a novel graph kernel named vertex-edge similarity kernel (VES kernel) based on mixed matrix, the innovation of which is taking the adjacency matrix of the graph as the sample vector of each vertex and calculating kernel values by finding the most similar vertex pair of two graphs. In addition, we combine the novel kernel with a neural network, and the experimental results show that the combination is better than the existing advanced methods.

Journal ArticleDOI
TL;DR: This paper proposes a new DNA motif discovery algorithm that has better time performance for large datasets and better accuracy of identifying infrequent motifs than the compared algorithms, and designs a new initial motif generation method with the utilization of the entire dataset.
Abstract: DNA motif discovery plays an important role in understanding the mechanisms of gene regulation. Most existing motif discovery algorithms can identify motifs in an efficient and effective manner when dealing with small datasets. However, large datasets generated by high-throughput sequencing technologies pose a huge challenge: it is too time-consuming to process the entire dataset, but if only a small sample sequence set is processed, it is difficult to identify infrequent motifs. In this paper, we propose a new DNA motif discovery algorithm: first divide the input dataset into multiple sample sequence sets, then refine initial motifs of each sample sequence set with the expectation maximization method, and finally combine all the results from each sample sequence set. Besides, we design a new initial motif generation method with the utilization of the entire dataset, which helps to identify infrequent motifs. The experimental results on the simulated data show that the proposed algorithm has better time performance for large datasets and better accuracy of identifying infrequent motifs than the compared algorithms. Also, we have verified the validity of the proposed algorithm on the real data.