Author

Ying Ju

Bio: Ying Ju is an academic researcher from Xiamen University. The author has contributed to research in topics: Support vector machine & Computer science. The author has an h-index of 12 and has co-authored 13 publications receiving 711 citations.

Papers
Journal Article
TL;DR: Two methods to predict microRNA-disease association are introduced: KATZ, which integrates the social network analysis method with machine learning and is based on networks derived from known microRNA-disease, disease-disease, and microRNA-microRNA associations, and CATAPULT, a supervised machine learning method.
Abstract: MicroRNAs constitute an important class of noncoding, single-stranded, ~22 nucleotide long RNA molecules encoded by endogenous genes. They play an important role in regulating gene transcription and the regulation of normal development. MicroRNAs can be associated with disease; however, only a few microRNA-disease associations have been confirmed by traditional experimental approaches. We introduce two methods to predict microRNA-disease association. The first method, KATZ, focuses on integrating the social network analysis method with machine learning and is based on networks derived from known microRNA-disease associations, disease-disease associations, and microRNA-microRNA associations. The other method, CATAPULT, is a supervised machine learning method. We applied the two methods to 242 known microRNA-disease associations and evaluated their performance using leave-one-out cross-validation and 3-fold cross-validation. Experiments proved that our methods outperformed the state-of-the-art methods.
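
The abstract does not spell out how KATZ scores a candidate pair. As a rough illustration only, the sketch below computes the standard truncated Katz index from social network analysis, which sums walks between two nodes and damps longer walks by a factor `beta`. The toy network, the node names, and the parameters `beta` and `max_len` are assumptions made for this example and are not taken from the paper.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_scores(adj, beta=0.1, max_len=3):
    """Truncated Katz index: sum over l = 1..max_len of beta**l * A**l."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds A**1 at the start
    for length in range(1, max_len + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)  # advance to A**(length + 1)
    return scores


# Toy network with made-up data: nodes 0-1 are microRNAs, nodes 2-3 are diseases.
adj = [
    [0, 1, 1, 0],  # miR-a: similar to miR-b, associated with disease-x
    [1, 0, 0, 1],  # miR-b: similar to miR-a, associated with disease-y
    [1, 0, 0, 1],  # disease-x: associated with miR-a, similar to disease-y
    [0, 1, 1, 0],  # disease-y: associated with miR-b, similar to disease-x
]
scores = katz_scores(adj, beta=0.1, max_len=3)
known_pair = scores[0][2]      # miR-a / disease-x, already linked
candidate_pair = scores[0][3]  # miR-a / disease-y, not yet linked
```

In this toy network the unlinked pair still receives a nonzero score because two walks of length 2 connect it, which is the kind of indirect evidence a Katz-style ranking relies on.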

149 citations

Journal Article
Quan Zou, Sifa Xie, Ziyu Lin, Meihong Wu, Ying Ju
TL;DR: The drawbacks of using ROC as the sole performance measure in imbalanced data classification problems are analyzed, and a novel framework for finding the best classification threshold is proposed.

148 citations

Journal Article
TL;DR: The paper proposes an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently, which could guide the identification of specific proteins with computational intelligence strategies.
Abstract: It is essential to discover protein function from novel primary sequences. Wet lab experimental procedures are not only time-consuming but also costly, so predicting protein structure and function reliably based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA-binding protein that plays a key role in transcription regulation. Our study proposed an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method could guide the identification of specific proteins with computational intelligence strategies. First, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Second, hierarchical feature dimensionality reduction strategies were employed to further improve performance. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods. The experiments demonstrate that our method could greatly improve the prediction accuracy and speed, thus making large-scale NGS data prediction practical. A web server has been developed to assist other researchers and can be accessed at http://server.malab.cn/preTata/.

144 citations

Journal Article
TL;DR: tRNA-Predict is a machine learning predictor designed to improve the results of tRNAscan-SE, a tRNA detection program that is widely used for tRNA annotation but whose false positive rate is unacceptable for large sequences.
Abstract: tRNAscan-SE is a tRNA detection program that is widely used for tRNA annotation; however, the false positive rate of tRNAscan-SE is unacceptable for large sequences. Here, we used a machine learning method to try to improve the tRNAscan-SE results. A new predictor, tRNA-Predict, was designed. We obtained real and pseudo-tRNA sequences as training data sets using tRNAscan-SE and constructed three different tRNA feature sets. We then set up an ensemble classifier, LibMutil, to predict tRNAs from the training data. The positive data set of 623 tRNA sequences was obtained from tRNAdb 2009, and the negative data set was the false positive tRNAs predicted by tRNAscan-SE. Our in silico experiments revealed a prediction accuracy rate of 95.1% for tRNA-Predict using 10-fold cross-validation. tRNA-Predict was developed to distinguish functional tRNAs from pseudo-tRNAs rather than to predict tRNAs from a genome-wide scan. However, tRNA-Predict can work with the output of tRNAscan-SE, which is a genome-wide scanning method, to improve the tRNAscan-SE annotation results. The tRNA-Predict web server is accessible at http://datamining.xmu.edu.cn/~gjs/tRNA-Predict.

58 citations


Cited by
Journal Article
TL;DR: An in-depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of imbalanced learning are provided.
Abstract: 527 articles related to imbalanced data and rare events are reviewed. Viewing reviewed papers from both technical and practical perspectives. Summarizing existing methods and corresponding statistics by a new taxonomy idea. Categorizing 162 application papers into 13 domains and giving introduction. Some open questions are discussed at the end of this manuscript. Rare events, especially those that could potentially negatively impact society, often require human decision-making responses. Detecting rare events can be viewed as a prediction task in the data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in-depth review of rare event detection from an imbalanced learning perspective. Five hundred and seventeen related papers that have been published in the past decade were collected for the study. The initial statistics suggested that rare event detection and imbalanced learning are of concern across a wide range of research areas, from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then we detail the applications for each category. Finally, some suggestions from the reviewed papers are combined with our experiences and judgments to offer further research directions for the imbalanced learning and rare event detection fields.

1,448 citations

Journal Article
02 Jun 2017-PLOS ONE
TL;DR: An optimal Bayes classifier for the MCC metric is derived using an approach based on the Frechet derivative; the proposed MCC-classifier performs close to SVM-imba while being simpler and more efficient.
Abstract: Data imbalance is frequently encountered in biomedical applications. Resampling techniques can be used in binary classification to tackle this issue. However, such solutions are not desired when the number of samples in the small class is limited. Moreover, the use of inadequate performance metrics, such as accuracy, leads to poor generalization results because the classifiers tend to predict the largest class. One good approach to this issue is to optimize performance metrics that are designed to handle data imbalance. The Matthews Correlation Coefficient (MCC) is widely used in bioinformatics as a performance metric. We are interested in developing a new classifier based on the MCC metric to handle imbalanced data. We derive an optimal Bayes classifier for the MCC metric using an approach based on the Frechet derivative. We show that the proposed algorithm has the nice theoretical property of consistency. Using simulated data, we verify the correctness of our optimality result by searching in the space of all possible binary classifiers. The proposed classifier is evaluated on 64 datasets with a wide range of data imbalance. We compare both classification performance and CPU efficiency for three classifiers: 1) the proposed algorithm (MCC-classifier), 2) the Bayes classifier with a default threshold (MCC-base), and 3) imbalanced SVM (SVM-imba). The experimental evaluation shows that MCC-classifier has a close performance to SVM-imba while being simpler and more efficient.
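
To make the metric concrete, the sketch below computes MCC from confusion-matrix counts and then picks the score cut-off with the highest MCC by brute force. This is only an illustration of why a default threshold can fail on imbalanced data; it is not the paper's Frechet-derivative-based classifier. The function names and the toy scores are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def confusion(scores, labels, threshold):
    """Count (tp, tn, fp, fn) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, tn, fp, fn


def best_mcc_threshold(scores, labels):
    """Try each distinct score as a cut-off; return (threshold, mcc) with the highest MCC."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        value = mcc(*confusion(scores, labels, t))
        if value > best[1]:
            best = (t, value)
    return best


# Made-up imbalanced example: 2 positives, 8 negatives, all scores below 0.5.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
default_mcc = mcc(*confusion(scores, labels, 0.5))  # default 0.5 cut-off
threshold, tuned_mcc = best_mcc_threshold(scores, labels)
```

With these toy scores the default 0.5 cut-off predicts every sample as negative, so its MCC is 0 even though its accuracy is 80%, while a lower cut-off separates the two classes.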

850 citations

Journal Article
TL;DR: Findings provide direct evidence that let-7 acts as a tumor suppressor gene in the lung and indicate that this miRNA may be useful as a novel therapeutic agent in lung cancer.
Abstract: LB-194 Lung cancer is the most prevalent form of cancer worldwide and accounts for the most cancer deaths. MicroRNAs (miRNAs) are small, non-protein coding RNAs that have recently emerged as important regulators of gene expression and direct proper cellular growth, differentiation and cell death - all mechanisms that go awry in cancer. The let-7 miRNA is postulated to function as a tumor suppressor gene in a variety of human tissues, particularly in the lung, by negatively regulating the post-transcriptional expression of multiple oncogenes including RAS, MYC, and HMGA2, as well as other cell cycle progression genes. Here we have used both in vitro and in vivo approaches to show that let-7 directly represses cancer growth in the lung. We show that let-7 inhibits the growth of multiple human lung cancer cell lines in culture, as well as the growth of lung cancer cell xenografts in immunodeficient mice. Using the established Lox-Stop-Lox K-ras mouse lung cancer model, we find that intranasal let-7 administration can reduce tumor formation in vivo in the lungs of animals expressing a G12D activating mutation for the K-ras oncogene. These findings support the notion that let-7 functions as a tumor suppressor in the lung and indicate that this miRNA could be used as a therapeutic agent to treat lung cancer.

556 citations

Journal Article
Quan Zou, Kaiyang Qu, Yamei Luo, Dehui Yin, Ying Ju, Hua Tang
TL;DR: Principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) were used to reduce the dimensionality, and the results showed that prediction with random forest could reach the highest accuracy (ACC = 0.8084) when all the attributes were used.
Abstract: Diabetes mellitus is a chronic disease characterized by hyperglycemia. It may cause many complications. Given the growing morbidity in recent years, the number of diabetic patients worldwide will reach 642 million in 2040, which means that one in ten adults will be suffering from diabetes. There is no doubt that this alarming figure needs great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest and neural network to predict diabetes mellitus. The dataset is the hospital physical examination data in Luzhou, China. It contains 14 attributes. In this study, five-fold cross validation was used to examine the models. In order to verify the universal applicability of the methods, we chose some of the better-performing methods to conduct independent test experiments. We randomly selected 68994 healthy people and diabetic patients' data, respectively, as the training set. Because the data were unbalanced, we randomly extracted data five times, and the result is the average of these five experiments. In this study, we used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest could reach the highest accuracy (ACC = 0.8084) when all the attributes were used.
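
For readers unfamiliar with the evaluation protocol, the sketch below shows one plain way to generate five-fold cross-validation splits: shuffle the sample indices, cut them into five folds, and hold each fold out once. It is a generic illustration, not the study's code. The helper name `k_fold_indices`, the seed, and the sample count are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 23 samples split into 5 folds.
splits = list(k_fold_indices(n_samples=23, k=5))
```

Each sample lands in exactly one test fold, so averaging a metric over the five folds uses every sample for testing once.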

468 citations

Journal Article
TL;DR: A novel model of Inductive Matrix Completion for MiRNA-Disease Association prediction (IMCMDA) is presented to complete the missing miRNA-disease associations based on the known associations and the integrated miRNA similarity and disease similarity.
Abstract: Motivation: It has been shown that microRNAs (miRNAs) play key roles in a variety of biological processes associated with human diseases. In consideration of the cost and complexity of biological experiments, computational methods for predicting potential associations between miRNAs and diseases would be an effective complement. Results: This paper presents a novel model of Inductive Matrix Completion for MiRNA-Disease Association prediction (IMCMDA). The integrated miRNA similarity and disease similarity are calculated based on miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. The main idea is to complete the missing miRNA-disease associations based on the known associations and the integrated miRNA similarity and disease similarity. IMCMDA achieves an AUC of 0.8034 based on leave-one-out cross-validation and improves on previous models. In addition, IMCMDA was applied to five common human diseases in three types of case studies. In the first type, 42, 44, and 45 out of the top 50 predicted miRNAs of Colon Neoplasms, Kidney Neoplasms, and Lymphoma, respectively, were confirmed by experimental reports. In the second type of case study, for new diseases without any known miRNAs, we chose Breast Neoplasms as the test example by hiding the association information between the miRNAs and Breast Neoplasms. As a result, 50 out of the top 50 predicted Breast Neoplasms-related miRNAs were verified. In the third type of case study, IMCMDA was tested on HMDD V1.0 to assess its robustness, and 49 out of the top 50 predicted Esophageal Neoplasms-related miRNAs were verified. Availability and implementation: The code and dataset of IMCMDA are freely available at https://github.com/IMCMDAsourcecode/IMCMDA. Supplementary information: Supplementary data are available at Bioinformatics online.

362 citations