SciSpace - formerly Typeset
Author

Anita Rácz

Bio: Anita Rácz is an academic researcher from the Hungarian Academy of Sciences. The author has contributed to research in the topics of Similarity (network science) and Medicine, has an h-index of 12, and has co-authored 41 publications receiving 914 citations. Previous affiliations of Anita Rácz include the University of Debrecen and Corvinus University of Budapest.

Papers
Journal ArticleDOI
TL;DR: Eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints using sum of ranking differences (SRD) and ANOVA; the Tanimoto index, Dice index, Cosine coefficient and Soergel distance are identified as the best metrics for similarity calculations.
Abstract: Cheminformaticians are equipped with a very rich toolbox when carrying out molecular similarity calculations. A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of molecular representations. In this work, eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints with sum of ranking differences (SRD) and ANOVA analysis. The effects of molecular size, selection methods and data pretreatment methods on the outcome of the comparison are also assessed. A supplier database ( https://mcule.com/ ) was used as the source of compounds for the similarity calculations in this study. A large number of datasets, each consisting of one hundred compounds, were compiled, molecular fingerprints were generated and similarity values between a randomly chosen reference compound and the rest were calculated for each dataset. Similarity metrics were compared based on their ranking of the compounds within one experiment (one dataset) using sum of ranking differences (SRD), while the results of the entire set of experiments were summarized on box and whisker plots. Finally, the effects of various factors (data pretreatment, molecule size, selection method) were evaluated with analysis of variance (ANOVA). This study complements previous efforts to examine and rank various metrics for molecular similarity calculations. Here, however, an entirely general approach was taken to neglect any a priori knowledge on the compounds involved, as well as any bias introduced by examining only one or a few specific scenarios. The Tanimoto index, Dice index, Cosine coefficient and Soergel distance were identified to be the best (and in some sense equivalent) metrics for similarity calculations, i.e. these metrics could produce the rankings closest to the composite (average) ranking of the eight metrics. The similarity metrics derived from Euclidean and Manhattan distances are not recommended on their own, although their variability and diversity from other similarity metrics might be advantageous in certain cases (e.g. for data fusion). Conclusions are also drawn regarding the effects of molecule size, selection method and data pretreatment on the ranking behavior of the studied metrics.
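For binary fingerprints, the metrics recommended above reduce to simple operations on the sets of "on" bits. The following minimal Python sketch is illustrative only (it is not the code used in the study, and the toy fingerprints are made up); it shows the four best-performing metrics side by side:

```python
# Minimal sketch: similarity/distance metrics on binary fingerprints,
# represented here as Python sets of "on" bit indices (toy data).
import math

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dice(a: set, b: set) -> float:
    """Dice index: 2|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def cosine(a: set, b: set) -> float:
    """Cosine coefficient: |A & B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if (a and b) else 0.0

def soergel_distance(a: set, b: set) -> float:
    """Soergel distance; for binary data it equals 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)

# Two toy fingerprints sharing three bits
fp1, fp2 = {1, 3, 5, 8, 13}, {1, 3, 5, 21}
print(tanimoto(fp1, fp2), dice(fp1, fp2), cosine(fp1, fp2), soergel_distance(fp1, fp2))
```

For binary data the Soergel distance is simply the complement of the Tanimoto index, which is one reason these two metrics behave equivalently in such comparisons.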

770 citations

Journal ArticleDOI
TL;DR: The capabilities of sum of ranking differences (SRD) in model selection and ranking are demonstrated, and the best performance indicators and models are identified.
Abstract: Recent implementations of QSAR modelling software provide the user with numerous models and a wealth of information. In this work, we provide some guidance on how one should interpret the results of QSAR modelling, compare and assess the resulting models, and select the best and most consistent ones. Two QSAR datasets are applied as case studies for the comparison of model performance parameters and model selection methods. We demonstrate the capabilities of sum of ranking differences (SRD) in model selection and ranking, and identify the best performance indicators and models. While the exchange of the original training and (external) test sets does not affect the ranking of performance parameters, it provides improved models in certain cases (despite the lower number of molecules in the training set). Performance parameters for external validation are substantially separated from the other merits in SRD analyses, highlighting their value in data fusion.
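To make the SRD procedure concrete, here is a rough sketch that assumes the reference (consensus) ranking is the row-wise average of all methods, as in the typical SRD setup; the normalization and randomization test (CRRN) used in the published method are omitted, and the data are placeholders:

```python
# Sketch of sum of ranking differences (SRD): each column (method) is ranked,
# compared to the ranking of a consensus (row-average) reference column,
# and the absolute rank differences are summed. Lower SRD = closer to consensus.
import numpy as np
from scipy.stats import rankdata

def srd(values: np.ndarray) -> np.ndarray:
    """values: (n_objects, n_methods) matrix. Returns one SRD value per method."""
    reference = values.mean(axis=1)                          # consensus reference
    ref_ranks = rankdata(reference)                          # object ranks under the reference
    method_ranks = np.apply_along_axis(rankdata, 0, values)  # object ranks per method
    return np.abs(method_ranks - ref_ranks[:, None]).sum(axis=0)

# Toy example: 5 objects (e.g. models) evaluated by 3 performance merits
X = np.array([[0.9, 0.8, 0.3],
              [0.7, 0.6, 0.9],
              [0.4, 0.5, 0.2],
              [0.2, 0.3, 0.8],
              [0.1, 0.2, 0.1]])
print(srd(X))
```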

78 citations

Journal ArticleDOI
TL;DR: In this paper, the authors compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification.
Abstract: Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.
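The kind of experiment described above can be illustrated with a short, hedged sketch: a multiclass toy dataset, a few train/test split ratios, and two classifiers compared on several performance merits. The dataset, algorithms and merits below are placeholders, not those used in the paper:

```python
# Sketch: vary the train/test split ratio for a multiclass dataset and
# compare classifiers on a few performance merits (illustrative settings only).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

X, y = load_wine(return_X_y=True)  # small multiclass example dataset
models = {"rf": RandomForestClassifier(random_state=0),
          "logreg": LogisticRegression(max_iter=5000)}

for test_size in (0.2, 0.3, 0.5):                        # different split ratios
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    for name, model in models.items():
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        print(f"test_size={test_size} {name}: "
              f"acc={accuracy_score(y_te, y_pred):.3f} "
              f"bal_acc={balanced_accuracy_score(y_te, y_pred):.3f} "
              f"macro_f1={f1_score(y_te, y_pred, average='macro'):.3f}")
```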

72 citations

Journal ArticleDOI
TL;DR: This work assessed the effect of similarity metrics and IFP configurations in a number of virtual screening scenarios with ten different protein targets and thousands of molecules, and identified metrics that are viable alternatives to the commonly used Tanimoto coefficient.
Abstract: Interaction fingerprints (IFP) have been repeatedly shown to be valuable tools in virtual screening to identify novel hit compounds that can subsequently be optimized to drug candidates. As a complementary method to ligand docking, IFPs can be applied to quantify the similarity of predicted binding poses to a reference binding pose. For this purpose, a large number of similarity metrics can be applied, and various parameters of the IFPs themselves can be customized. In a large-scale comparison, we have assessed the effect of similarity metrics and IFP configurations to a number of virtual screening scenarios with ten different protein targets and thousands of molecules. Particularly, the effect of considering general interaction definitions (such as Any Contact, Backbone Interaction and Sidechain Interaction), the effect of filtering methods and the different groups of similarity metrics were studied. The performances were primarily compared based on AUC values, but we have also used the original similarity data for the comparison of similarity metrics with several statistical tests and the novel, robust sum of ranking differences (SRD) algorithm. With SRD, we can evaluate the consistency (or concordance) of the various similarity metrics to an ideal reference metric, which is provided by data fusion from the existing metrics. Different aspects of IFP configurations and similarity metrics were examined based on SRD values with analysis of variance (ANOVA) tests. A general approach is provided that can be applied for the reliable interpretation and usage of similarity measures with interaction fingerprints. Metrics that are viable alternatives to the commonly used Tanimoto coefficient were identified based on a comparison with an ideal reference metric (consensus). A careful selection of the applied bits (interaction definitions) and IFP filtering rules can improve the results of virtual screening (in terms of their agreement with the consensus metric). The open-source Python package FPKit was introduced for the similarity calculations and IFP filtering; it is available at: https://github.com/davidbajusz/fpkit .
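As an illustration of the similarity-based scoring step (a sketch only; it does not use or reproduce the FPKit package), the following computes Tanimoto similarities of toy interaction fingerprints to a reference binding pose and summarizes the resulting ranking with an AUC value. The fingerprints and activity labels below are invented:

```python
# Sketch: score docking poses by Tanimoto similarity of their interaction
# fingerprints (IFPs) to a reference pose, then evaluate the ranking with AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def tanimoto(a, b):
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

reference_ifp = [1, 0, 1, 1, 0, 0, 1, 0]        # IFP of the reference binding pose
pose_ifps = [[1, 0, 1, 0, 0, 0, 1, 0],          # predicted poses of screened molecules
             [0, 1, 0, 0, 1, 1, 0, 0],
             [1, 0, 1, 1, 0, 0, 0, 0],
             [0, 0, 0, 1, 1, 0, 0, 1]]
is_active = [1, 0, 1, 0]                        # hypothetical activity labels

scores = [tanimoto(reference_ifp, ifp) for ifp in pose_ifps]
print("AUC:", roc_auc_score(is_active, scores))
```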

59 citations

Journal ArticleDOI
TL;DR: In consensus-based comparisons, the shake-flask method performed best, closely followed by computational estimates, while the chromatographic estimates often overlapped with in silico assessments, mostly with methods involving octadecyl-modified silica stationary phases.

53 citations


Cited by
Journal ArticleDOI
TL;DR: The indiscriminate use of principal component analysis (PCA) and hierarchical cluster analysis (HCA) to assess the association between bioactive compounds and in vitro functional properties is criticized.
Abstract: Background: The development of statistical software has enabled food scientists to perform a wide variety of mathematical/statistical analyses and solve problems. Therefore, not only sophisticated analytical methods but also the application of multivariate statistical methods has increased considerably. Here, principal component analysis (PCA) and hierarchical cluster analysis (HCA) are the most widely used tools to explore similarities and hidden patterns among samples where the relationships in the data and the groupings are still unclear. Usually, larger chemical data sets, bioactive compounds and functional properties are the targets of these methodologies. Scope and approach: In this article, we criticize the use of these methods in situations where correlation analysis should be calculated and its results analyzed. Key findings and conclusions: The use of PCA and HCA in food chemistry studies has increased because the results are easy to interpret and discuss. However, their indiscriminate use to assess the association between bioactive compounds and in vitro functional properties is criticized, as they provide only a qualitative view of the data. When appropriate, one should bear in mind that the correlation between the content of chemical compounds and bioactivity could be duly discussed using correlation coefficients.
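The contrast drawn in the abstract can be sketched in a few lines: PCA and HCA give an exploratory, qualitative picture of a data matrix, whereas the association between a compound's content and a bioactivity measure is quantified by a correlation coefficient. The data below are random placeholders:

```python
# Sketch: PCA and HCA as exploratory views of a data matrix, versus a
# correlation coefficient as the quantitative measure of association.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))          # 30 samples x 6 measured variables (placeholder)

scores = PCA(n_components=2).fit_transform(X)                               # PCA overview
clusters = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")   # HCA grouping

# Quantitative association between one "compound content" and one "bioactivity" column
r, p = pearsonr(X[:, 0], X[:, 1])
print(scores.shape, clusters, f"r={r:.2f}, p={p:.3f}")
```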

535 citations

Journal ArticleDOI
TL;DR: The Therapeutic Target Database (TTD) is constructed with expanded information about target-regulating microRNAs and transcription factors, target-interacting proteins, and patented agents and their targets, which can be conveniently retrieved and is further enriched with regulatory mechanisms or biochemical classes.
Abstract: Knowledge of therapeutic targets and early drug candidates is useful for improved drug discovery. In particular, information about target regulators and the patented therapeutic agents facilitates research regarding druggability, systems pharmacology, new trends, molecular landscapes, and the development of drug discovery tools. To complement other databases, we constructed the Therapeutic Target Database (TTD) with expanded information about (i) target-regulating microRNAs and transcription factors, (ii) target-interacting proteins, and (iii) patented agents and their targets (structures and experimental activity values if available), which can be conveniently retrieved and is further enriched with regulatory mechanisms or biochemical classes. We also updated the TTD with the recently released International Classification of Diseases ICD-11 codes and additional sets of successful, clinical trial, and literature-reported targets that emerged since the last update. TTD is accessible at http://bidd.nus.edu.sg/group/ttd/ttd.asp. In case of possible web connectivity issues, two mirror sites of TTD are also constructed (http://db.idrblab.org/ttd/ and http://db.idrblab.net/ttd/).

523 citations

Journal ArticleDOI
18 Jul 2018 - Nature
TL;DR: An organic synthesis robot is presented that can perform chemical reactions and analysis faster than they can be performed manually, as well as predict the reactivity of possible reagent combinations after conducting a small number of experiments, thus effectively navigating chemical reaction space.
Abstract: The discovery of chemical reactions is an inherently unpredictable and time-consuming process1. An attractive alternative is to predict reactivity, although relevant approaches, such as computer-aided reaction design, are still in their infancy2. Reaction prediction based on high-level quantum chemical methods is complex3, even for simple molecules. Although machine learning is powerful for data analysis4,5, its applications in chemistry are still being developed6. Inspired by strategies based on chemists’ intuition7, we propose that a reaction system controlled by a machine learning algorithm may be able to explore the space of chemical reactions quickly, especially if trained by an expert8. Here we present an organic synthesis robot that can perform chemical reactions and analysis faster than they can be performed manually, as well as predict the reactivity of possible reagent combinations after conducting a small number of experiments, thus effectively navigating chemical reaction space. By using machine learning for decision making, enabled by binary encoding of the chemical inputs, the reactions can be assessed in real time using nuclear magnetic resonance and infrared spectroscopy. The machine learning system was able to predict the reactivity of about 1,000 reaction combinations with accuracy greater than 80 per cent after considering the outcomes of slightly over 10 per cent of the dataset. This approach was also used to calculate the reactivity of published datasets. Further, by using real-time data from our robot, these predictions were followed up manually by a chemist, leading to the discovery of four reactions.
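A speculative sketch of the core machine-learning step described above (binary encoding of reagent combinations plus a classifier trained on an initial fraction of experiments) is given below; the encoding size, classifier and random outcome labels are placeholders, not the authors' setup:

```python
# Sketch: encode each reagent combination as a binary vector (which reagents
# are present), train on the outcomes of an initial batch of experiments,
# and predict the reactivity of the unexplored combinations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_reagents, n_combinations = 18, 1000
X = rng.integers(0, 2, size=(n_combinations, n_reagents))   # binary encodings
y = rng.integers(0, 2, size=n_combinations)                 # reactive / unreactive (placeholder)

n_explored = n_combinations // 10          # roughly 10% of the space explored first
model = RandomForestClassifier(random_state=0).fit(X[:n_explored], y[:n_explored])
predicted_reactivity = model.predict_proba(X[n_explored:])[:, 1]
print(predicted_reactivity[:5])
```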

430 citations

Journal ArticleDOI
23 Jul 2018 - Analyst
TL;DR: The aim of the article is to review, outline and describe the contemporary PLS-DA modelling practice strategies, and to critically discuss the respective knowledge gaps that have emerged in response to the present big data era.
Abstract: Partial least squares-discriminant analysis (PLS-DA) is a versatile algorithm that can be used for predictive and descriptive modelling as well as for discriminative variable selection. However, versatility is both a blessing and a curse and the user needs to optimize a wealth of parameters before reaching reliable and valid outcomes. Over the past two decades, PLS-DA has demonstrated great success in modelling high-dimensional datasets for diverse purposes, e.g. product authentication in food analysis, diseases classification in medical diagnosis, and evidence analysis in forensic science. Despite that, in practice, many users have yet to grasp the essence of constructing a valid and reliable PLS-DA model. As the technology progresses, across every discipline, datasets are evolving into a more complex form, i.e. multi-class, imbalanced and colossal. Indeed, the community is welcoming a new era called big data. In this context, the aim of the article is two-fold: (a) to review, outline and describe the contemporary PLS-DA modelling practice strategies, and (b) to critically discuss the respective knowledge gaps that have emerged in response to the present big data era. This work could complement other available reviews or tutorials on PLS-DA, to provide a timely and user-friendly guide to researchers, especially those working in applied research.
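One common way to realize PLS-DA is sketched below: PLS regression on dummy-coded (one-hot) class labels, with the predicted class taken as the column with the largest predicted response. The dataset is illustrative, and the number of latent variables is arbitrary here; as the review stresses, it must be optimized (e.g. by cross-validation) in a real application:

```python
# Sketch of PLS-DA: PLS regression against a dummy-coded class matrix,
# prediction by taking the class column with the largest response.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

Y_tr = np.eye(int(y.max()) + 1)[y_tr]                 # dummy-coded (one-hot) class matrix
pls = PLSRegression(n_components=5).fit(X_tr, Y_tr)   # 5 latent variables, arbitrary choice
y_pred = pls.predict(X_te).argmax(axis=1)
print("test accuracy:", (y_pred == y_te).mean())
```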

357 citations

Journal ArticleDOI
TL;DR: This work explores the use of neural networks for predicting reaction types using a new reaction fingerprinting method, and combines this predictor with SMARTS transformations to build a system that, given a set of reagents and reactants, predicts the likely products.
Abstract: Reaction prediction remains one of the major challenges for organic chemistry and is a prerequisite for efficient synthetic planning. It is desirable to develop algorithms that, like humans, “learn” from being exposed to examples of the application of the rules of organic chemistry. We explore the use of neural networks for predicting reaction types, using a new reaction fingerprinting method. We combine this predictor with SMARTS transformations to build a system which, given a set of reagents and reactants, predicts the likely products. We test this method on problems from a popular organic chemistry textbook.
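One widely used form of reaction fingerprint, sketched here as an illustration (not necessarily the authors' exact fingerprint), is the difference between product and reactant Morgan fingerprints; such vectors could then be fed to a neural-network reaction-type classifier. RDKit is assumed as the cheminformatics toolkit, and the example reaction is a toy case:

```python
# Sketch: a difference reaction fingerprint built from Morgan fingerprints
# of products and reactants (RDKit), as one possible classifier input.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_bits(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Morgan (ECFP-like) fingerprint of one molecule as a 0/1 numpy vector."""
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.asarray(list(bv), dtype=int)

def reaction_fingerprint(reactant_smiles, product_smiles):
    """Difference fingerprint: sum of product bits minus sum of reactant bits."""
    return (sum(morgan_bits(s) for s in product_smiles)
            - sum(morgan_bits(s) for s in reactant_smiles))

# Toy esterification: acetic acid + ethanol -> ethyl acetate + water
fp = reaction_fingerprint(["CC(=O)O", "CCO"], ["CC(=O)OCC", "O"])
print(fp.shape, int(np.abs(fp).sum()))
```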

331 citations