scispace - formally typeset
Search or ask a question
Author

Kun Niu

Bio: Kun Niu is an academic researcher from Northeast Forestry University. The author has contributed to research in topics: Enhancer. The author has an hindex of 1, co-authored 1 publications receiving 1 citations.
Topics: Enhancer

Papers
More filters
Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors used bidirectional LSTM (EBLSTM) to extract subsequences by sliding a 3-mer window along the DNA sequence as features.
Abstract: Enhancers are regulatory DNA sequences that could be bound by specific proteins named transcription factors (TFs). The interactions between enhancers and TFs regulate specific genes by increasing the target gene expression. Therefore, enhancer identification and classification have been a critical issue in the enhancer field. Unfortunately, so far there has been a lack of suitable methods to identify enhancers. Previous research has mainly focused on the features of the enhancer's function and interactions, which ignores the sequence information. As we know, the recurrent neural network (RNN) and long short-term memory (LSTM) models are currently the most common methods for processing time series data. LSTM is more suitable than RNN to address the DNA sequence. In this paper, we take the advantages of LSTM to build a method named iEnhancer-EBLSTM to identify enhancers. iEnhancer-ensembles of bidirectional LSTM (EBLSTM) consists of two steps. In the first step, we extract subsequences by sliding a 3-mer window along the DNA sequence as features. Second, EBLSTM model is used to identify enhancers from the candidate input sequences. We use the dataset from the study of Quang H et al. as the benchmarks. The experimental results from the datasets demonstrate the efficiency of our proposed model.

14 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: In this paper, an integrative machine learning (ML)-based framework called Enhancer-IF was proposed for identifying cell-specific enhancers, which comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting).
Abstract: Enhancers are deoxyribonucleic acid (DNA) fragments which when bound by transcription factors enhance the transcription of related genes. Due to its sporadic distribution and similar fractions, identification of enhancers from the human genome seems a daunting task. Compared to the traditional experimental approaches, computational methods with easy-to-use platforms could be efficiently applied to annotate enhancers' functions and physiological roles. In this aspect, several bioinformatics tools have been developed to identify enhancers. Despite their spectacular performances, existing methods have certain drawbacks and limitations, including fixed length of sequences being utilized for model development and cell-specificity negligence. A novel predictor would be beneficial in the context of genome-wide enhancer prediction by addressing the above-mentioned issues. In this study, we constructed new datasets for eight different cell types. Utilizing these data, we proposed an integrative machine learning (ML)-based framework called Enhancer-IF for identifying cell-specific enhancers. Enhancer-IF comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting). Specifically, these five classifiers were trained with seven encodings and obtained 35 baseline models. The output of these baseline models was integrated and again inputted to five classifiers for the construction of five meta-models. Finally, the integration of five meta-models through ensemble learning improved the model robustness. Our proposed approach showed an excellent prediction performance compared to the baseline models on both training and independent datasets in different cell types, thus highlighting the superiority of our approach in the identification of the enhancers. We assume that Enhancer-IF will be a valuable tool for screening and identifying potential enhancers from the human DNA sequences.

27 citations

Journal ArticleDOI
TL;DR: In this article , a two-stage deep learning-based framework leveraging DNA structural features, natural language processing, convolutional neural network, and long short-term memory was proposed to predict the enhancer elements accurately in the genomics data.
Abstract: Enhancer, a distal cis-regulatory element controls gene expression. Experimental prediction of enhancer elements is time-consuming and expensive. Consequently, various inexpensive deep learning-based fast methods have been developed for predicting the enhancers and determining their strength. In this paper, we have proposed a two-stage deep learning-based framework leveraging DNA structural features, natural language processing, convolutional neural network, and long short-term memory to predict the enhancer elements accurately in the genomics data. In the first stage, we extracted the features from DNA sequence data by using three feature representation techniques viz., k-mer based feature extraction along with word2vector based interpretation of underlined patterns, one-hot encoding, and the DNAshape technique. In the second stage, strength of enhancers is predicted from the extracted features using a hybrid deep learning model. The method is capable of adapting itself to varying sizes of datasets. Also, as proposed model can capture long-range sequencing patterns, the robustness of the method remains unaffected against minor variations in the genomics sequence. The method outperforms the other state-of-the-art methods at both stages in terms of performance metrics of prediction accuracy, specificity, Mathew's correlation coefficient, and area under the ROC curve. In summary, the proposed method is a reliable method for enhancer prediction.

9 citations

Journal ArticleDOI
TL;DR: This work proposes a novel deep learning model, called CLNN-loop, to predict chromatin loops in different cell lines and CTCF-binding sites (CBS) pair types by fusing multiple sequence-based features, and applies the SHAP framework to interpret the predictions of different models.
Abstract: MOTIVATION Three-dimensional (3D) genome organization is of vital importance in gene regulation and disease mechanisms. Previous studies have shown that CTCF-mediated chromatin loops are crucial to studying the 3D structure of cells. Although various experimental techniques have been developed to detect chromatin loops, they have been found to be time-consuming and costly. Nowadays, various sequence-based computational methods can capture significant features of 3D genome organization and help predict chromatin loops. However, these methods have low performance and poor generalization ability in predicting chromatin loops. RESULTS Here, we propose a novel deep learning model, called CLNN-loop, to predict chromatin loops in different cell lines and CTCF-binding sites (CBS) pair types by fusing multiple sequence-based features. The analysis of a series of examinations based on the datasets in the previous study shows that CLNN-loop has satisfactory performance and is superior to the existing methods in terms of predicting chromatin loops. In addition, we apply the SHAP framework to interpret the predictions of different models, and find that CTCF motif and sequence conservation are important signs of chromatin loops in different cell lines and CBS pair types. The source code of CLNN-loop is freely available at https://github.com/HaoWuLab-Bioinformatics/CLNN-loop and the webserver of CLNN-loop is freely available at http://hwclnn.sdu.edu.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

7 citations

Journal ArticleDOI
01 Oct 2022-Methods
TL;DR: Li et al. as mentioned in this paper proposed a two-layer model called iEnhancer-MRBF, wherein the first layer is used to identify enhancers, and the identified enhancers are divided into strong enhancers and weak enhancers according to their strength in the second layer.

4 citations

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation.
Abstract: Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have a specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning–based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.

2 citations