
Showing papers by "Volkan Atalay" published in 2019


Journal Article
TL;DR: The objective of this study is to examine and discuss the recent applications of machine learning techniques in VS, including deep learning, which became highly popular after giving rise to epochal developments in the fields of computer vision and natural language processing.
Abstract: The identification of interactions between drugs/compounds and their targets is crucial for the development of new drugs. In vitro screening experiments (i.e. bioassays) are frequently used for this purpose; however, experimental approaches are insufficient to explore novel drug-target interactions, mainly because of feasibility problems, as they are labour intensive, costly and time consuming. A computational field known as 'virtual screening' (VS) has emerged in the past decades to aid experimental drug discovery studies by statistically estimating unknown bio-interactions between compounds and biological targets. These methods use the physico-chemical and structural properties of compounds and/or target proteins along with the experimentally verified bio-interaction information to generate predictive models. Lately, sophisticated machine learning techniques have been applied in VS to improve predictive performance. The objective of this study is to examine and discuss the recent applications of machine learning techniques in VS, including deep learning, which became highly popular after giving rise to epochal developments in the fields of computer vision and natural language processing. The past 3 years have witnessed an unprecedented amount of research studies considering the application of deep learning in biomedicine, including computational drug discovery. In this review, we first describe the main instruments of VS methods, including compound and protein features (i.e. representations and descriptors), frequently used libraries and toolkits for VS, bioactivity databases and gold-standard data sets for system training and benchmarking. We subsequently review recent VS studies with a strong emphasis on deep learning applications. Finally, we discuss the present state of the field, including the current challenges, and suggest future directions. We believe that this survey will provide insight to researchers working in the field of computational drug discovery in terms of comprehending and developing novel bio-prediction methods.

298 citations


Journal Article
Naihui Zhou, Yuxiang Jiang, Timothy Bergquist, Alexandra J. Lee, +185 more; Institutions (71)
TL;DR: The third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed, concluded that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not.
Abstract: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

227 citations


Posted Content
Naihui Zhou, Yuxiang Jiang, Timothy Bergquist, Alexandra J. Lee, +178 more; Institutions (67)
29 May 2019-bioRxiv
TL;DR: It is reported that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bioontologies, working together to improve functional annotation, computational function prediction, and the ability to manage big data in the era of large experimental screens.
Abstract: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility (P. aeruginosa only). We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that, while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. We finally report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bioontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

121 citations


Journal Article
TL;DR: DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, is proposed as a solution to Gene Ontology based protein function prediction and the neural network architecture of DEEPred can also be applied to the prediction of the other types of ontological associations.
Abstract: Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning-based methods have outperformed conventional algorithms in computer vision and natural language processing due to the prevention of overfitting and efficient training. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests, and benchmarked using three types of protein descriptors, training datasets with varying sizes and GO terms from different levels. Furthermore, in order to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using CAFA2 and CAFA3 challenge datasets, in comparison with the state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the 'biofilm formation process' in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of the other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred .

85 citations


Proceedings Article
20 Nov 2019
TL;DR: The Root Mean Square Error of each model's predictions is calculated to assess the performance of these deep learning methods for forecasting based on time-series data.
Abstract: The dramatic increase in data size has enabled researchers to study the analysis and prediction of big data. Big data can be formed in many ways, and one alternative is through the use of sensors. An important aspect of data coming from sensors is that they are time-series data. Although forecasting based on time-series data has been studied widely, it is still possible to advance the state of the art by constructing new hybrid deep learning models. In this study, Random Forest, Convolutional Neural Network, Long Short-Term Memory and hybrid Convolutional Neural Network-Long Short-Term Memory models are applied and assessed on meteorological time-series data. A Vector Auto-regression model and a Multi-layer Perceptron model are used as the baseline forecasting methods for comparison purposes. The Root Mean Square Error of each model's predictions is calculated for performance assessment, which reveals the performance of these deep learning methods for forecasting based on time-series data.

3 citations
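
For readers unfamiliar with the performance measure named in the entry above, the following is a minimal sketch of how Root Mean Square Error is computed for a set of forecasts. The function name `rmse` and the sample values are illustrative assumptions; they are not taken from the paper, and the paper's own evaluation code is not shown here.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error between observed and forecast values.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Mean of the squared errors, then the square root of that mean.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Made-up temperature readings and forecasts (assumed values).
    observed = [20.0, 21.0, 19.0]
    forecast = [21.0, 21.0, 17.0]
    print(f"RMSE: {rmse(observed, forecast):.3f}")
```

A lower value indicates forecasts closer to the observations. Because the errors are squared before averaging, large individual misses weigh more heavily than they would under a mean absolute error.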


Proceedings Article
20 Nov 2019
TL;DR: This study develops a new method to apply UMAP to data streams, adapt to concept drift and cluster the embedded data instances using any distance-based clustering algorithm.
Abstract: The number of connected devices is steadily increasing, and these devices continuously generate data streams. These data streams are often high dimensional and contain concept drift. Real-time processing of data streams is attracting interest despite many challenges. Clustering is a method that does not need labeled instances (it is unsupervised), and it can be applied with less prior information about the data. These properties make clustering one of the most suitable methods for real-time data stream processing. Moreover, data embedding is a process that may simplify clustering and makes visualization of high-dimensional data possible. There exist several data stream clustering algorithms in the literature; however, no data stream embedding method exists. UMAP is a data embedding algorithm that is suitable for application to data streams, but it cannot adapt to concept drift. In this study, we have developed a new method to apply UMAP to data streams, adapt to concept drift and cluster the embedded data instances using any distance-based clustering algorithm.

1 citation
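
The abstract above does not describe how the method handles drift, so the following is only a generic sketch of one common idea: keep a sliding window of recent stream points and periodically re-fit the embedding on that window, so that old points stop influencing the result. The class `WindowedEmbedder`, its parameters, and the stand-in embedder `first_two_coords` are all hypothetical and are not the authors' method. In practice, the stand-in could be swapped for a real embedder such as the `fit_transform` method of `umap.UMAP` from the umap-learn package.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Point = Sequence[float]
Embedded = Tuple[float, float]


def first_two_coords(points: List[Point]) -> List[Embedded]:
    """Stand-in embedder (assumption): keeps the first two coordinates.

    A real pipeline would call a dimensionality-reduction method here.
    """
    return [(p[0], p[1]) for p in points]


class WindowedEmbedder:
    """Hypothetical sliding-window wrapper around a batch embedding function."""

    def __init__(
        self,
        fit_embed: Callable[[List[Point]], List[Embedded]],
        window_size: int,
        refit_every: int,
    ) -> None:
        self.fit_embed = fit_embed
        self.refit_every = refit_every
        # deque with maxlen silently drops the oldest point when full.
        self.window: Deque[Point] = deque(maxlen=window_size)
        self.seen = 0
        self.embedding: List[Embedded] = []

    def push(self, point: Point) -> bool:
        """Add one stream point; return True if the embedding was re-fitted."""
        self.window.append(point)
        self.seen += 1
        if self.seen % self.refit_every == 0:
            # Re-fit on recent points only, so older data no longer counts.
            self.embedding = self.fit_embed(list(self.window))
            return True
        return False


if __name__ == "__main__":
    embedder = WindowedEmbedder(first_two_coords, window_size=3, refit_every=2)
    for i in range(5):
        refitted = embedder.push((float(i), float(i * 2), float(i * 3)))
        print(i, refitted, embedder.embedding)
```

The embedded points held in `embedding` could then be passed to any distance-based clustering routine. Re-fitting on a fixed schedule is the simplest policy; triggering a re-fit only when a drift detector fires is an alternative design.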


Book Chapter
16 Sep 2019
TL;DR: The ultimate aim is to design a system that automates the detection of defective units among the sampled freezer units manufactured in high volumes in a factory of one of the leading home appliances manufacturers.
Abstract: Forecasting of product quality by means of anomaly detection is crucial in real-world applications such as manufacturing systems. In manufacturing systems, quality is assured through tests performed on sample units randomly chosen from a batch of manufactured units. One of the major issues is to detect defective units among the sample test units as early as possible in terms of test time and, of course, as accurately as possible. The traditional way of detecting defective units is to make use of human experts during testing. However, human intervention is prone to errors and is time consuming. On the other hand, automated systems are efficient alternatives and can assist human experts. There are on-line and off-line approaches for automated systems. Our ultimate aim is to design a system that automates the detection of defective units among the sampled freezer units manufactured in high volumes in a factory of one of the leading home appliances manufacturers. We start by analyzing the data of the test units sampled from the batches of freezer units. For analysis, we first embedded the data in two-dimensional space to observe whether any structure exists in the data. Clustering was then applied to see whether the data can be grouped into two classes without their labels. As off-line approaches, state-of-the-art classification methods, including a one-class classifier, are employed. Finally, a deep learning method for time-series analysis combined with a classifier is applied as an on-line method.