Journal Article

Assigning the Origin of Microbial Natural Products by Chemical Space Map and Machine Learning.

28 Sep 2020, Vol. 10, Iss. 10, p. 1385
TL;DR: The recently reported MinHashed Atom Pair fingerprint with diameter of four bonds (MAP4), a fingerprint suitable for molecules across very different sizes, is used to analyze the Natural Products Atlas (NPAtlas), a database of 25,523 NPs of bacterial or fungal origin. The resulting TMAP chemical space map separates bacterial and fungal NPs from one another.
Abstract: Microbial natural products (NPs) are an important source of drugs; however, their structural diversity remains poorly understood. Here we used our recently reported MinHashed Atom Pair fingerprint with diameter of four bonds (MAP4), a fingerprint suitable for molecules across very different sizes, to analyze the Natural Products Atlas (NPAtlas), a database of 25,523 NPs of bacterial or fungal origin. To visualize NPAtlas by MAP4 similarity, we used the dimensionality reduction method tree map (TMAP). The resulting interactive map organizes molecules by physico-chemical properties and compound families such as peptides and glycosides. Remarkably, the map separates bacterial and fungal NPs from one another, revealing that these two compound families are intrinsically different despite their related biosynthetic pathways. We used these differences to train a machine learning model capable of distinguishing between NPs of bacterial or fungal origin.
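As a rough illustration of the MinHash idea that gives MAP4 its name, the sketch below estimates the Jaccard similarity of two sets of substructure tokens from fixed-length signatures. It is not the MAP4 implementation: the token strings, the function names, and the use of seeded BLAKE2b hashes are all assumptions made for this example, and a real fingerprint would derive its tokens from the molecular structure.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; the seed selects one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=256):
    """Keep, for each seeded hash function, the minimum hash over all tokens."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical token sets standing in for the substructure pairs of two molecules.
mol_a = {"C|1|N", "C|2|O", "N|3|O", "C|4|C"}
mol_b = {"N|3|O", "C|4|C", "C|2|S", "O|5|S"}

sig_a = minhash_signature(mol_a)
sig_b = minhash_signature(mol_b)

print("exact    :", exact_jaccard(mol_a, mol_b))
print("estimate :", estimate_jaccard(sig_a, sig_b))
```

The signature length stays fixed no matter how many tokens a molecule produces, which is one reason a MinHashed representation can be compared across molecules of very different sizes.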
Citations
Journal Article
TL;DR: This article introduces NPClassifier, a deep-learning tool for the automated structural classification of natural products (NPs) from their counted Morgan fingerprints, motivated by the need to handle the massive amounts of data appearing for NP structures.
Abstract: Computational approaches such as genome and metabolome mining are becoming essential to natural products (NPs) research. Consequently, a need exists for an automated structure-type classification system to handle the massive amounts of data appearing for NP structures. An ideal semantic ontology for the classification of NPs should go beyond the simple presence/absence of chemical substructures, but also include the taxonomy of the producing organism, the nature of the biosynthetic pathway, and/or their biological properties. Thus, a holistic and automatic NP classification framework could have considerable value to comprehensively navigate the relatedness of NPs, and especially so when analyzing large numbers of NPs. Here, we introduce NPClassifier, a deep-learning tool for the automated structural classification of NPs from their counted Morgan fingerprints. NPClassifier is expected to accelerate and enhance NP discovery by linking NP structures to their underlying properties.

69 citations

Journal Article
TL;DR: This article reviews and discusses the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.
Abstract: Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to marvel scientists for developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.

44 citations

Journal Article
TL;DR: An overview of the role of alkaloids in drug discovery, the application of more sustainable chemicals, and biological approaches, and the implementation of information systems to address the current challenges faced in meeting global disease needs is presented in this article.
Abstract: An overview is presented of the well-established role of alkaloids in drug discovery, the application of more sustainable chemicals, and biological approaches, and the implementation of information systems to address the current challenges faced in meeting global disease needs. The necessity for a new international paradigm for natural product discovery and development for the treatment of multidrug resistant organisms, and rare and neglected tropical diseases in the era of the Fourth Industrial Revolution and the Quintuple Helix is discussed.

21 citations

Journal Article
TL;DR: In this article, the authors discuss the recent progress on chemoinformatic tools developed to expand and characterize the chemical space of compound data sets using different types of molecular representations, generate visual representations of such spaces, and explore structure-property relationships in the context of chemical spaces.
Abstract: The concept of chemical space is a cornerstone in chemoinformatics, and it has broad conceptual and practical applicability in many areas of chemistry, including drug design and discovery. One of the most considerable impacts is in the study of structure-property relationships where the property can be a biological activity or any other characteristic of interest to a particular chemistry discipline. The chemical space is highly dependent on the molecular representation that is also a cornerstone concept in computational chemistry. Herein, we discuss the recent progress on chemoinformatic tools developed to expand and characterize the chemical space of compound data sets using different types of molecular representations, generate visual representations of such spaces, and explore structure-property relationships in the context of chemical spaces. We emphasize the development of methods and freely available tools focusing on drug discovery applications. We also comment on the general advantages and shortcomings of using freely available and easy-to-use tools and discuss the value of using such open resources for research, education, and scientific dissemination.

17 citations

Journal Article
01 Mar 2021
TL;DR: An overview of the known peptide chemical space is presented in the form of an interactive map representing 40,531 peptides collected from eleven open-access peptide and peptide-containing databases, accessible at https://tm.gdb.tools/map4/peptide_databases_tmap/.
Abstract: Peptides, defined as sequences of amino acids up to approximately 50 residues in length, represent an extremely large reservoir of potentially bioactive compounds, referred to here as the peptide chemical space. Recent advances in computer hardware and software have led to a wide application of computational methods to explore this chemical space. Here, we review different in silico approaches including structure-based design, genetic algorithms, and machine learning. We also review the use of molecular fingerprints to sample virtual libraries and to visualize the peptide chemical space. Finally, we present an overview of the known peptide chemical space in the form of an interactive map representing 40,531 peptides collected from eleven open-access peptide and peptide-containing databases, accessible at https://tm.gdb.tools/map4/peptide_databases_tmap/. These peptides are displayed as a TMAP (Tree-Map) according to their molecular fingerprint similarity computed using MAP4, a MinHashed atom pair fingerprint well suited to analyzing large molecules.

14 citations

References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.

30,124 citations
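For orientation, the heavy-tailed similarity that the abstract credits with reducing crowding can be written out in the standard formulation of t-SNE. The notation is introduced here for illustration and is not taken from the listing: $x_i$ are the high-dimensional datapoints, $y_i$ their map locations, $\sigma_i$ a per-point Gaussian bandwidth, and $n$ the number of datapoints.

```latex
\begin{align}
p_{j\mid i} &= \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, &
p_{ij} &= \frac{p_{j\mid i} + p_{i\mid j}}{2n}, \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}, &
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{align}
```

Because $q_{ij}$ follows a Student t-distribution with one degree of freedom, moderately dissimilar points can sit far apart in the map without a large penalty in $C$, which is what relieves the crowding in the center.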

Posted Content
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

28,898 citations

Journal Article
TL;DR: Experimental and computational approaches to estimate solubility and permeability in discovery and development settings are described. The rule of 5 predicts that poor absorption or permeation is more likely when there are more than 5 H-bond donors, more than 10 H-bond acceptors, a molecular weight greater than 500, and a calculated Log P (CLogP) greater than 5 (or MlogP > 4.15).

14,026 citations
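The rule of 5 thresholds can be expressed as a small check. This is a minimal sketch rather than any library's API: the `Descriptors` container, the function name, and the example values are hypothetical, and the descriptors are assumed to have been computed elsewhere. The molecular weight cutoff of 500 is part of the published rule and is included here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float
    clogp: float


def rule_of_five_violations(d: Descriptors) -> list:
    """Return the rule of 5 criteria that the descriptors exceed."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    if d.molecular_weight > 500:
        violations.append("molecular weight above 500")
    if d.clogp > 5:
        violations.append("CLogP above 5")
    return violations


# Illustrative values only; they do not describe a measured compound.
small_molecule = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                             molecular_weight=180.2, clogp=1.2)
large_molecule = Descriptors(h_bond_donors=6, h_bond_acceptors=12,
                             molecular_weight=750.0, clogp=6.1)

print(rule_of_five_violations(small_molecule))
print(rule_of_five_violations(large_molecule))
```

Returning the list of exceeded criteria, instead of a single pass/fail flag, leaves the decision of how many violations to tolerate to the caller.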

Journal Article
TL;DR: SciPy is an open source scientific computing library for the Python programming language that includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.
Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

12,774 citations