Journal Article

Persistence images: a stable vector representation of persistent homology

TL;DR: In this article, a persistence diagram (PD) is converted into a finite-dimensional vector representation called a persistence image (PI), and the stability of this transformation with respect to small perturbations in the inputs is proved.
Abstract: Many data sets can be viewed as a noisy sampling of an underlying space, and tools from topological data analysis can characterize this structure for the purpose of knowledge discovery. One such tool is persistent homology, which provides a multiscale description of the homological features within a data set. A useful representation of this homological information is a persistence diagram (PD). Efforts have been made to map PDs into spaces with additional structure valuable to machine learning tasks. We convert a PD to a finite-dimensional vector representation which we call a persistence image (PI), and prove the stability of this transformation with respect to small perturbations in the inputs. The discriminatory power of PIs is compared against existing methods, showing significant performance gains. We explore the use of PIs with vector-based machine learning tools, such as linear sparse support vector machines, which identify features containing discriminating topological information. Finally, high-accuracy inference of parameter values from the dynamic output of a discrete dynamical system (the linked twist map) and a partial differential equation (the anisotropic Kuramoto-Sivashinsky equation) provides a novel application of the discriminatory power of PIs.
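As a rough illustration of the construction described in the abstract, the sketch below (not the authors' reference implementation; the grid size, Gaussian spread `sigma`, and the linear persistence weighting are illustrative choices) maps a diagram's points to (birth, persistence) coordinates, spreads a persistence-weighted Gaussian at each, and samples the resulting surface on a fixed pixel grid:

```python
import numpy as np

def persistence_image(diagram, resolution=10, sigma=0.1, max_val=1.0):
    """Map a persistence diagram (list of (birth, death) pairs) to a
    fixed-length vector: move points to (birth, persistence) coordinates,
    place a persistence-weighted Gaussian at each, and sample on a grid."""
    ticks = np.linspace(0.0, max_val, resolution)
    xs, ys = np.meshgrid(ticks, ticks)          # pixel centers
    image = np.zeros((resolution, resolution))
    for birth, death in diagram:
        pers = death - birth                    # persistence coordinate
        weight = pers / max_val                 # weight vanishes at the diagonal
        image += weight * np.exp(
            -((xs - birth) ** 2 + (ys - pers) ** 2) / (2.0 * sigma ** 2))
    return image.flatten()                      # the persistence image vector

vec = persistence_image([(0.1, 0.5), (0.2, 0.9)])
```

Because the output length depends only on `resolution`, diagrams with different numbers of points land in the same feature space, which is what makes PIs usable with SVMs and other vector-based learners.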


Citations
Posted Content
TL;DR: This paper is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of TDA for non-experts.
Abstract: Topological Data Analysis is a recent and fast-growing field providing a set of new topological and geometric tools to infer relevant features from possibly complex data. This paper is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of TDA for non-experts.

324 citations

Posted Content
TL;DR: In this paper, the authors propose a technique that enables inputting topological signatures to deep neural networks and learning a task-optimal representation during training, realized as a novel input layer with favorable theoretical properties.
Abstract: Inferring topological and geometrical information from data can offer an alternative perspective on machine learning problems. Methods from topological data analysis, e.g., persistent homology, enable us to obtain such information, typically in the form of summary representations of topological features. However, such topological signatures often come with an unusual structure (e.g., multisets of intervals) that is highly impractical for most machine learning techniques. While many strategies have been proposed to map these topological signatures into machine learning compatible representations, they suffer from being agnostic to the target learning task. In contrast, we propose a technique that enables us to input topological signatures to deep neural networks and learn a task-optimal representation during training. Our approach is realized as a novel input layer with favorable theoretical properties. Classification experiments on 2D object shapes and social network graphs demonstrate the versatility of the approach and, in case of the latter, we even outperform the state-of-the-art by a large margin.
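The core idea — a layer that consumes a multiset of intervals and emits a fixed-length, order-independent vector — can be sketched as a forward pass (the function name `topo_input_layer`, the Gaussian form of the "structure elements", and the parameters `mu`/`sigma` are illustrative assumptions, not the paper's exact layer):

```python
import numpy as np

def topo_input_layer(diagram, mu, sigma):
    """Forward pass of a permutation-invariant input layer: evaluate each
    (birth, persistence) point against m learnable Gaussian "structure
    elements" (centers mu, widths sigma) and sum the responses over the
    multiset, giving a fixed-length vector regardless of diagram size."""
    pts = np.array([(b, d - b) for b, d in diagram])         # (n, 2)
    sq = ((pts[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, m) squared dists
    responses = np.exp(-sq / (2.0 * sigma[None, :] ** 2))    # per-point activations
    return responses.sum(axis=0)                             # order-independent sum

rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 1.0, size=(4, 2))   # 4 structure elements in the plane
sigma = np.full(4, 0.2)
out = topo_input_layer([(0.1, 0.4), (0.3, 0.9)], mu, sigma)
```

In a real network `mu` and `sigma` would be trained by backpropagation, which is what makes the representation task-optimal rather than fixed in advance.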

151 citations

Journal ArticleDOI
TL;DR: This review shows that having so many materials allows researchers to use big-data methods as a powerful technique to study these materials and to discover complex correlations.
Abstract: By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal organic frameworks (MOFs). At present, we have libraries of over ten thousand synthesized materials and millions of in-silico predicted materials. The fact that we have so many materials opens many exciting avenues to tailor-make a material that is optimal for a given application. However, from an experimental and computational point of view we simply have too many materials to screen using brute-force techniques. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We emphasize the importance of data collection, methods to augment small data sets, and how to select appropriate training sets. An important part of this review is the different approaches that are used to represent these materials in feature space. The review also includes a general overview of the different ML techniques, but as most applications in porous materials use supervised ML, our review is focused on the different approaches for supervised ML. In particular, we review the different methods to optimize the ML process and how to quantify the performance of the different methods. In the second part, we review how the different approaches of ML have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. The range of topics illustrates the large variety of topics that can be studied with big-data science. Given the increasing interest of the scientific community in ML, we expect this list to rapidly expand in the coming years.

93 citations

Journal ArticleDOI
TL;DR: A molecular representation derived from persistent homology is demonstrated through an active-learning approach for predicting CO2/N2 interaction energies at the density functional theory (DFT) level.
Abstract: Machine learning and high-throughput computational screening have been valuable tools in accelerated first-principles screening for the discovery of the next generation of functionalized molecules and materials. The application of machine learning for chemical applications requires the conversion of molecular structures to a machine-readable format known as a molecular representation. The choice of such representations impacts the performance and outcomes of chemical machine learning methods. Herein, we present a new concise molecular representation derived from persistent homology, an applied branch of mathematics. We have demonstrated its applicability in a high-throughput computational screening of a large molecular database (GDB-9) with more than 133,000 organic molecules. Our target is to identify novel molecules that selectively interact with CO2. The methodology and performance of the novel molecular fingerprinting method is presented, and the new chemically-driven persistence image representation is used to screen the GDB-9 database to suggest molecules and/or functional groups with enhanced properties. The choice of molecular representations can severely impact the performance of machine-learning methods. Here the authors demonstrate a persistent-homology-based molecular representation through an active-learning approach for predicting CO2/N2 interaction energies at the density functional theory (DFT) level.

77 citations

Journal ArticleDOI
TL;DR: The objective classification of PCs can be achieved with methods from algebraic topology; the dendritic arborization is sufficient for the reliable identification of distinct types of cortical PCs, and this approach helps settle the long-standing debate on whether cell types are discrete or continuous morphological variations of each other.
Abstract: A consensus on the number of morphologically different types of pyramidal cells (PCs) in the neocortex has not yet been reached, despite over a century of anatomical studies, due to the lack of agreement on the subjective classifications of neuron types, which are based on expert analyses of neuronal morphologies. Even for neurons that are visually distinguishable, there is no common ground to consistently define morphological types. The objective classification of PCs can be achieved with methods from algebraic topology, and the dendritic arborization is sufficient for the reliable identification of distinct types of cortical PCs. Therefore, we objectively identify 17 types of PCs in the rat somatosensory cortex. In addition, we provide a solution to the challenging problem of whether 2 similar neurons belong to different types or to a continuum of the same type. Our topological classification does not require expert input, is stable, and helps settle the long-standing debate on whether cell types are discrete or continuous morphological variations of each other.

71 citations

References
Book
03 Dec 2001

6,660 citations

Journal ArticleDOI
Tin Kam Ho1
TL;DR: A method to construct a decision-tree-based classifier is proposed that maintains the highest accuracy on training data and improves on generalization accuracy as it grows in complexity.
Abstract: Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.
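The random subspace idea in the abstract — build each tree on a pseudorandomly chosen subset of feature components, then vote — can be sketched as follows. For brevity this toy uses depth-1 threshold stumps in place of full decision trees, and all function names (`fit_stump`, `random_subspace_forest`, etc.) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Best single-feature threshold classifier (a depth-1 stand-in for a tree)."""
    best = None
    for j in range(X.shape[1]):
        for t in X[:, j]:
            for sign in (1, -1):
                pred = (sign * (X[:, j] - t) > 0).astype(int)
                acc = (pred == y).mean()
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    return best[1:]                               # (feature, threshold, sign)

def predict_stump(stump, X):
    j, t, sign = stump
    return (sign * (X[:, j] - t) > 0).astype(int)

def random_subspace_forest(X, y, n_trees=11, subspace_frac=0.5):
    """Train each base classifier on a pseudorandomly chosen feature subset."""
    d = X.shape[1]
    k = max(1, int(subspace_frac * d))
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(d, size=k, replace=False)  # the random subspace
        forest.append((feats, fit_stump(X[:, feats], y)))
    return forest

def predict_forest(forest, X):
    """Majority vote over the ensemble."""
    votes = np.mean([predict_stump(s, X[:, f]) for f, s in forest], axis=0)
    return (votes >= 0.5).astype(int)

X = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [5, 5, 5, 5], [6, 6, 6, 6]], float)
y = np.array([0, 0, 1, 1])
forest = random_subspace_forest(X, y)
```

Training each member in a different subspace decorrelates the trees, which is the independence property the paper relates to combined classification accuracy.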

5,984 citations

Journal ArticleDOI
TL;DR: This paper will discuss how geometry and topology can be applied to make useful contributions to the analysis of various kinds of data, particularly high throughput data from microarray or other sources.
Abstract: An important feature of modern science and engineering is that data of various kinds is being produced at an unprecedented rate. This is so in part because of new experimental methods, and in part because of the increase in the availability of high-powered computing technology. It is also clear that the nature of the data we are obtaining is significantly different. For example, it is now often the case that we are given data in the form of very long vectors, where all but a few of the coordinates turn out to be irrelevant to the questions of interest, and further that we don’t necessarily know which coordinates are the interesting ones. A related fact is that the data is often very high-dimensional, which severely restricts our ability to visualize it. The data obtained is also often much noisier than in the past and has more missing information (missing data). This is particularly so in the case of biological data, particularly high-throughput data from microarray or other sources. Our ability to analyze this data, both in terms of quantity and the nature of the data, is clearly not keeping pace with the data being produced. In this paper, we will discuss how geometry and topology can be applied to make useful contributions to the analysis of various kinds of data. Geometry and topology are very natural tools to apply in this direction, since geometry can be regarded as the study of distance functions, and what one often works with are distance functions on large finite sets of data. The mathematical formalism which has been developed for incorporating geometric and topological techniques deals with point clouds, i.e. finite sets of points equipped with a distance function. It then adapts tools from the various branches of geometry to the study of point clouds. The point clouds are intended to be thought of as finite samples taken from a geometric object, perhaps with noise. Here are some of the key points which come up when applying these geometric methods to data analysis:
• Qualitative information is needed: One important goal of data analysis is to allow the user to obtain knowledge about the data, i.e. to understand how it is organized on a large scale. For example, if we imagine that we are looking at a data set constructed somehow from diabetes patients, it would be important to develop the understanding that there are two types of the disease, namely the juvenile and adult onset forms. Once that is established, one of course wants to develop quantitative methods for distinguishing them, but the first insight about the distinct forms of the disease is key.

2,203 citations

Journal ArticleDOI
TL;DR: It is shown that CLBP_S preserves more information of the local structure than CLBP_M, which explains why the simple LBP operator can extract texture features reasonably well, and that combining the CLBP components yields significant improvement in rotation-invariant texture classification.
Abstract: In this correspondence, a completed modeling of the local binary pattern (LBP) operator is proposed and an associated completed LBP (CLBP) scheme is developed for texture classification. A local region is represented by its center pixel and a local difference sign-magnitude transform (LDSMT). The center pixels represent the image gray level and they are converted into a binary code, namely CLBP-Center (CLBP_C), by global thresholding. LDSMT decomposes the image local differences into two complementary components: the signs and the magnitudes, and two operators, namely CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are proposed to code them. The traditional LBP is equivalent to the CLBP_S part of CLBP, and we show that CLBP_S preserves more information of the local structure than CLBP_M, which explains why the simple LBP operator can extract the texture features reasonably well. By combining CLBP_S, CLBP_M, and CLBP_C features into joint or hybrid distributions, significant improvement can be made for rotation invariant texture classification.
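The sign/magnitude decomposition can be sketched on a square 8-neighborhood (a simplification: the paper uses circular sampling and rotation-invariant code mappings, which are omitted here, and the global thresholds below are illustrative choices):

```python
import numpy as np

def clbp(img):
    """Minimal CLBP sketch: decompose each interior pixel's 8 local
    differences into sign and magnitude parts (the LDSMT), threshold each
    into an 8-bit code, and threshold the center gray level globally."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:h-1, 1:w-1]
    # Local differences d_p = neighbor_p - center for the 8 neighbors.
    diffs = np.stack([img[1+dy:h-1+dy, 1+dx:w-1+dx] - center
                      for dy, dx in offsets])
    c = np.abs(diffs).mean()                      # global magnitude threshold
    s_code = np.zeros(center.shape, dtype=int)    # CLBP_S: the classic LBP
    m_code = np.zeros(center.shape, dtype=int)    # CLBP_M: magnitude code
    for p in range(8):
        s_code += (diffs[p] >= 0).astype(int) << p
        m_code += (np.abs(diffs[p]) >= c).astype(int) << p
    c_code = (center >= img.mean()).astype(int)   # CLBP_C: center code
    return s_code, m_code, c_code

img = np.arange(16, dtype=float).reshape(4, 4)
s_code, m_code, c_code = clbp(img)
```

Histograms of these three codes, taken jointly or hybridly, would then serve as the texture descriptor.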

1,981 citations

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm requires significantly less computation time, with performance comparable to partitioning around medoids.
Abstract: This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare with the results of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm requires significantly less computation time, with performance comparable to partitioning around medoids.
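The described loop — compute the distance matrix once, then alternate K-means-style assignment and medoid-update steps — can be sketched as follows (the function name and the random initialization are illustrative; the paper also compares several initial-medoid selection rules not shown here):

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """K-medoids run like K-means: the distance matrix is computed once and
    reused both to assign points and to pick each cluster's new medoid."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # computed once
    medoids = rng.choice(n, size=k, replace=False)              # random init
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assignment step
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid minimizes total distance within its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):                # converged
            break
        medoids = new_medoids
    return medoids, labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
medoids, labels = k_medoids(X, 2)
```

Reusing the precomputed matrix `D` at every iteration is what gives the method its speed advantage over repeatedly evaluating distances, as in partitioning around medoids.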

1,629 citations