Journal ArticleDOI

Predicting glycosylation stereoselectivity using machine learning.

04 Mar 2021 - Chemical Science (The Royal Society of Chemistry) - Vol. 12, Iss: 8, pp 2931-2939
TL;DR: A random forest algorithm trained on a highly reproducible, concise dataset accurately predicts the stereoselective outcome of glycosylations, including previously unknown means of stereocontrol.
Abstract: Predicting the stereochemical outcome of chemical reactions is challenging in mechanistically ambiguous transformations. The stereoselectivity of glycosylation reactions is influenced by at least eleven factors across four chemical participants and temperature. A random forest algorithm was trained using a highly reproducible, concise dataset to accurately predict the stereoselective outcome of glycosylations. The steric and electronic contributions of all chemical reagents and solvents were quantified by quantum mechanical calculations. The trained model accurately predicts stereoselectivities for unseen nucleophiles, electrophiles, acid catalysts, and solvents across a wide temperature range (overall root mean square error 6.8%). All predictions were validated experimentally on a standardized microreactor platform. The model helped to identify novel ways to control glycosylation stereoselectivity and accurately predicted previously unknown means of stereocontrol. By quantifying the degree of influence of each variable, we begin to gain a better general understanding of the transformation, for example that environmental factors influence the stereoselectivity of glycosylations more than the coupling partners in this area of chemical space.
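
The abstract summarizes model accuracy as an overall root mean square error (RMSE). As a minimal sketch of how such a figure is computed, the snippet below assumes selectivity is expressed as a percentage of one anomer; the function name `rmse` and all numbers are hypothetical illustrations, not data or code from the paper.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical selectivities (% of one anomer); NOT values from the paper.
predicted_selectivity = [80.0, 55.0, 20.0, 95.0]
observed_selectivity = [75.0, 60.0, 30.0, 90.0]

error = rmse(predicted_selectivity, observed_selectivity)
print(f"RMSE: {error:.1f} percentage points")
```

Because the errors are squared before averaging, a single badly mispredicted reaction raises the RMSE more than several small misses would.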


Citations
Journal ArticleDOI
TL;DR: It is shown that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases; the findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
Abstract: Applications of machine learning (ML) to synthetic chemistry rely on the assumption that large numbers of literature-reported examples should enable construction of accurate and predictive models of chemical reactivity. This paper demonstrates that abundance of carefully curated literature data may be insufficient for this purpose. Using an example of Suzuki–Miyaura coupling with heterocyclic building blocks—and a carefully selected database of >10,000 literature examples—we show that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases. This result holds irrespective of the ML model applied (from simple feed-forward to state-of-the-art graph-convolution neural networks) or the representation to describe the reaction partners (various fingerprints, chemical descriptors, latent representations, etc.). In all cases, the ML methods fail to perform significantly better than naive assignments based on the sheer frequency of certain reaction conditions reported in the literature. These unsatisfactory results likely reflect subjective preferences of various chemists to use certain protocols, other biasing factors as mundane as availability of certain solvents/reagents, and/or a lack of negative data. These findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
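
The "naive assignments based on the sheer frequency" mentioned in the abstract can be read as a most-frequent-class baseline: always predict the conditions reported most often in the training data. The sketch below illustrates that idea only; the helper names and the solvent/base pairs are hypothetical and are not taken from the paper's database.

```python
from collections import Counter


def most_frequent_baseline(training_conditions):
    """Return the condition label reported most often in the training set."""
    if not training_conditions:
        raise ValueError("training set must not be empty")
    return Counter(training_conditions).most_common(1)[0][0]


def accuracy(predictions, truths):
    """Fraction of predictions that exactly match the recorded label."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


# Hypothetical (solvent, base) labels; NOT records from the literature set.
train = [
    ("dioxane", "K2CO3"),
    ("dioxane", "K2CO3"),
    ("DMF", "Cs2CO3"),
    ("dioxane", "K2CO3"),
    ("toluene", "K3PO4"),
]
test = [
    ("dioxane", "K2CO3"),
    ("DMF", "Cs2CO3"),
    ("dioxane", "K2CO3"),
    ("toluene", "K3PO4"),
]

baseline = most_frequent_baseline(train)
baseline_accuracy = accuracy([baseline] * len(test), test)
print(baseline, baseline_accuracy)
```

A trained model is only informative if it clearly beats this number on held-out reactions, which is the comparison the abstract reports the ML methods failing.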

35 citations

Journal ArticleDOI
TL;DR: A computational approach to evaluate the reaction mechanisms of glycosylation using ab initio molecular dynamics (AIMD) simulations in explicit solvent is presented. The profile of the free energy surface, the synchronicity of the transition state structure, and the time gap between leaving group dissociation and nucleophile association serve as three complementary indicators of the mechanism within the SN1/SN2 continuum.
Abstract: We report a computational approach to evaluate the reaction mechanisms of glycosylation using ab initio molecular dynamics (AIMD) simulations in explicit solvent. The reaction pathways are simulated via free energy calculations based on metadynamics and trajectory simulations using Born-Oppenheimer molecular dynamics. We applied this approach to investigate the mechanisms of the glycosylation of glucosyl α-trichloroacetimidate with three acceptors (EtOH, i-PrOH, and t-BuOH) in three solvents (ACN, DCM, and MTBE). The reactants and the solvents are treated explicitly using density functional theory. We show that the profile of the free energy surface, the synchronicity of the transition state structure, and the time gap between leaving group dissociation and nucleophile association can be used as three complementary indicators to describe the glycosylation mechanism within the SN1/SN2 continuum for a given reaction. This approach provides a reliable means to rationalize and predict reaction mechanisms and to estimate lifetimes of oxocarbenium intermediates and their dependence on the glycosyl donor, acceptor, and solvent environment.

26 citations

Journal ArticleDOI
TL;DR: In this paper, 12 guidelines for the choice of concentration, temperature, and counterions are adumbrated with a view to reducing the complexity and irreproducibility of glycosylation reactions.
Abstract: With a view to reducing the notorious complexity and irreproducibility of glycosylation reactions, 12 guidelines for the choice of concentration, temperature, and counterions are adumbrated.

25 citations

Journal ArticleDOI
TL;DR: In this article, the authors address the recent development of data-driven technologies for chemical reaction tasks, including forward reaction prediction, retrosynthesis, reaction optimization, catalysts design, inference of experimental procedures, and reaction classification.
Abstract: Discovering new reactions, optimizing their performance, and extending the synthetically accessible chemical space are critical drivers for major technological advances and more sustainable processes. The current wave of machine intelligence is revolutionizing all data‐rich disciplines. Machine intelligence has emerged as a potential game‐changer for chemical reaction space exploration and the synthesis of novel molecules and materials. Herein, we will address the recent development of data‐driven technologies for chemical reaction tasks, including forward reaction prediction, retrosynthesis, reaction optimization, catalysts design, inference of experimental procedures, and reaction classification. Accurate predictions of chemical reactivity are changing the R&D processes and, at the same time, promoting an accelerated discovery scheme both in academia and across chemical and pharmaceutical industries. This work will help to clarify the key contributions in the fields and the open challenges that remain to be addressed.

19 citations

Journal ArticleDOI
TL;DR: This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model.
Abstract: Transfer and active learning have the potential to accelerate the development of new chemical reactions, using prior data and new experiments to inform models that adapt to the target area of interest. This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model. First, model transfer is shown to be effective when reaction mechanisms and substrates are closely related, even when models are trained on relatively small numbers of data points. Then, a model simplification scheme is tested and found to provide comparative predictivity on reactions of new nucleophiles that include unseen reagent combinations. Lastly, for a challenging target where model transfer only provides a modest benefit over random selection, an active transfer learning strategy is introduced to improve model predictions. Simple models, composed of a small number of decision trees with limited depths, are crucial for securing generalizability, interpretability, and performance of active transfer learning.

12 citations

References
Journal ArticleDOI
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

14,509 citations

Journal ArticleDOI
TL;DR: Physical structure is known to contribute to the appearance of bird plumage through structural color and specular reflection, but a third mechanism, structural absorption, leads to low reflectance and super black color in birds of paradise feathers.
Abstract: Many studies have shown how pigments and internal nanostructures generate color in nature. External surface structures can also influence appearance, such as by causing multiple scattering of light (structural absorption) to produce a velvety, super black appearance. Here we show that feathers from five species of birds of paradise (Aves: Paradisaeidae) structurally absorb incident light to produce extremely low-reflectance, super black plumages. Directional reflectance of these feathers (0.05-0.31%) approaches that of man-made ultra-absorbent materials. SEM, nano-CT, and ray-tracing simulations show that super black feathers have tilted arrays of highly modified barbules, which cause more multiple scattering, resulting in more structural absorption, than normal black feathers. Super black feathers have an extreme directional reflectance bias and appear darkest when viewed from the distal direction. We hypothesize that structurally absorbing, super black plumage evolved through sensory bias to enhance the perceived brilliance of adjacent color patches during courtship display.

5,916 citations

Journal ArticleDOI
26 Jul 2018 - Nature
TL;DR: A future in which the design, synthesis, characterization and application of molecules and materials is accelerated by artificial intelligence is envisaged.
Abstract: Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning techniques that are suitable for addressing research questions in this domain, as well as future directions for the field. We envisage a future in which the design, synthesis, characterization and application of molecules and materials is accelerated by artificial intelligence.

2,295 citations

Journal ArticleDOI
TL;DR: A general index of predictive discrimination is used to measure the ability of a model developed on training samples of varying sizes to predict survival in an independent test sample of patients suspected of having coronary artery disease.
Abstract: Regression models such as the Cox proportional hazards model have had increasing use in modelling and estimating the prognosis of patients with a variety of diseases. Many applications involve a large number of variables to be modelled using a relatively small patient sample. Problems of overfitting and of identifying important covariates are exacerbated in analysing prognosis because the accuracy of a model is more a function of the number of events than of the sample size. We used a general index of predictive discrimination to measure the ability of a model developed on training samples of varying sizes to predict survival in an independent test sample of patients suspected of having coronary artery disease. We compared three methods of model fitting: (1) standard ‘step-up’ variable selection, (2) incomplete principal components regression, and (3) Cox model regression after developing clinical indices from variable clusters. We found regression using principal components to offer superior predictions in the test sample, whereas regression using indices offers easily interpretable models nearly as good as the principal components models. Standard variable selection has a number of deficiencies.

1,657 citations