scispace - formally typeset
Search or ask a question
Author

Zi-Jie Sun

Bio: Zi-Jie Sun is an academic researcher from University of Electronic Science and Technology of China. The author has contributed to research in topics: Composite number & Deep learning. The author has an hindex of 2, co-authored 3 publications receiving 8 citations.

Papers
More filters
Journal ArticleDOI
TL;DR: In this paper, a gradient boost decision tree (GBDT) classifier was trained on the optimal features to identify cyclins with an accuracy of 93.06% and AUC value of 0.971.
Abstract: Cyclin proteins are capable to regulate the cell cycle by forming a complex with cyclin-dependent kinases to activate cell cycle. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, which results in poor prediction for sequence similarity-based methods. Thus, it is urgent to construct a machine learning model to identify cyclin proteins. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features that are amino acid composition, composition of k-spaced amino acid pairs, tri peptide composition, pseudo amino acid composition, geary correlation, normalized moreau-broto autocorrelation and composition/transition/distribution. Afterward, these features were optimized by using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validated results showed that our model would identify cyclins with an accuracy of 93.06% and AUC value of 0.971, which are higher than the two recent studies on the same data.

31 citations

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper developed a deep learning-based algorithm to identify whether an unknown sequence region would be potential hypersensitive site (DHS), which showed high prediction performance on both training datasets and independent datasets in different cell types and developmental stages, demonstrating that the method has excellent superiority in the identification of DHSs.
Abstract: DNase I hypersensitive site (DHS) refers to the hypersensitive region of chromatin for the DNase I enzyme. It is an important part of the noncoding region and contains a variety of regulatory elements, such as promoter, enhancer, and transcription factor-binding site, etc. Moreover, the related locus of disease (or trait) are usually enriched in the DHS regions. Therefore, the detection of DHS region is of great significance. In this study, we develop a deep learning-based algorithm to identify whether an unknown sequence region would be potential DHS. The proposed method showed high prediction performance on both training datasets and independent datasets in different cell types and developmental stages, demonstrating that the method has excellent superiority in the identification of DHSs. Furthermore, for the convenience of related wet-experimental researchers, the user-friendly web-server iDHS-Deep was established at http://lin-group.cn/server/iDHS-Deep/, by which users can easily distinguish DHS and non-DHS and obtain the corresponding developmental stage ofDHS.

22 citations

Journal ArticleDOI
02 Aug 2021-Methods
TL;DR: Wang et al. as discussed by the authors developed a deep learning-based model to predict 4mC sites in the Escherichia coli genome, where DNA sequences were encoded by word embedding technique 'word2vec'.

17 citations

Journal ArticleDOI
TL;DR: In this article , the fire resistance of steel reinforced concrete (SRC) columns was analyzed under different axial compression ratios (0.4, 0.6) and eccentricity (0, 40, 60, 80 mm).

Cited by
More filters
Journal ArticleDOI
TL;DR: In this paper, a machine learning-based meta-predictor called NeuroPred-FRL was developed by employing the feature representation learning approach, where the predicted probability scores of NPs based on the 66 baseline models were combined as the input feature vector.
Abstract: Neuropeptides (NPs) are the most versatile neurotransmitters in the immune systems that regulate various central anxious hormones. An efficient and effective bioinformatics tool for rapid and accurate large-scale identification of NPs is critical in immunoinformatics, which is indispensable for basic research and drug development. Although a few NP prediction tools have been developed, it is mandatory to improve their NPs' prediction performances. In this study, we have developed a machine learning-based meta-predictor called NeuroPred-FRL by employing the feature representation learning approach. First, we generated 66 optimal baseline models by employing 11 different encodings, six different classifiers and a two-step feature selection approach. The predicted probability scores of NPs based on the 66 baseline models were combined to be deemed as the input feature vector. Second, in order to enhance the feature representation ability, we applied the two-step feature selection approach to optimize the 66-D probability feature vector and then inputted the optimal one into a random forest classifier for the final meta-model (NeuroPred-FRL) construction. Benchmarking experiments based on both cross-validation and independent tests indicate that the NeuroPred-FRL achieves a superior prediction performance of NPs compared with the other state-of-the-art predictors. We believe that the proposed NeuroPred-FRL can serve as a powerful tool for large-scale identification of NPs, facilitating the characterization of their functional mechanisms and expediting their applications in clinical therapy. Moreover, we interpreted some model mechanisms of NeuroPred-FRL by leveraging the robust SHapley Additive exPlanation algorithm.

48 citations

Journal ArticleDOI
TL;DR: The findings suggest that the wearable smart T-shirt based on the DT classifier may be used in big data applications and health monitoring, and may help assess cardiovascular and related risk factors in the initial stage based on machine learning techniques.

27 citations

Journal ArticleDOI
TL;DR: In this paper, an integrative machine learning (ML)-based framework called Enhancer-IF was proposed for identifying cell-specific enhancers, which comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting).
Abstract: Enhancers are deoxyribonucleic acid (DNA) fragments which when bound by transcription factors enhance the transcription of related genes. Due to its sporadic distribution and similar fractions, identification of enhancers from the human genome seems a daunting task. Compared to the traditional experimental approaches, computational methods with easy-to-use platforms could be efficiently applied to annotate enhancers' functions and physiological roles. In this aspect, several bioinformatics tools have been developed to identify enhancers. Despite their spectacular performances, existing methods have certain drawbacks and limitations, including fixed length of sequences being utilized for model development and cell-specificity negligence. A novel predictor would be beneficial in the context of genome-wide enhancer prediction by addressing the above-mentioned issues. In this study, we constructed new datasets for eight different cell types. Utilizing these data, we proposed an integrative machine learning (ML)-based framework called Enhancer-IF for identifying cell-specific enhancers. Enhancer-IF comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting). Specifically, these five classifiers were trained with seven encodings and obtained 35 baseline models. The output of these baseline models was integrated and again inputted to five classifiers for the construction of five meta-models. Finally, the integration of five meta-models through ensemble learning improved the model robustness. Our proposed approach showed an excellent prediction performance compared to the baseline models on both training and independent datasets in different cell types, thus highlighting the superiority of our approach in the identification of the enhancers. We assume that Enhancer-IF will be a valuable tool for screening and identifying potential enhancers from the human DNA sequences.

27 citations

Journal ArticleDOI
TL;DR: Deepm5C as discussed by the authors is a bioinformatics method for identifying RNA m5C sites throughout the human genome, which uses a mixture of three conventional feature-encoding algorithms and a feature derived from word embedding approaches.

25 citations

Journal ArticleDOI
TL;DR: This work proposed the Phage_UniR_LGBM, a model that utilizes the UniRep as the feature and the LightGBM algorithm as the classification model to classify the virion proteins.
Abstract: Phage, the most prevalent creature on the planet, serves a variety of critical roles. Phage's primary role is to facilitate gene-to-gene communication. The phage proteins can be defined as the virion proteins and the nonvirion ones. Nowadays, experimental identification is a difficult process that necessitates a significant amount of laboratory time and expense. Considering such situation, it is critical to design practical calculating techniques and develop well-performance tools. In this work, the Phage_UniR_LGBM has been proposed to classify the virion proteins. In detailed, such model utilizes the UniRep as the feature and the LightGBM algorithm as the classification model. And then, the training data train the model, and the testing data test the model with the cross-validation. The Phage_UniR_LGBM was compared with the several state-of-the-art features and classification algorithms. The performances of the Phage_UniR_LGBM are 88.51% in Sp,89.89% in Sn, 89.18% in Acc, 0.7873 in MCC, and 0.8925 in F1 score.

21 citations