Author
Alex Rodriguez
Other affiliations: University of Barcelona, International School for Advanced Studies, Polytechnic University of Catalonia
Bio: Alex Rodriguez is an academic researcher from International Centre for Theoretical Physics. The author has contributed to research in topics: Intrinsic dimension & Cluster analysis. The author has an hindex of 14, co-authored 32 publications receiving 3018 citations. Previous affiliations of Alex Rodriguez include University of Barcelona & International School for Advanced Studies.
Papers
More filters
••
TL;DR: A method in which the cluster centers are recognized as local density maxima that are far away from any points of higher density, and the algorithm depends only on the relative densities rather than their absolute values.
Abstract: Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.
3,441 citations
••
TL;DR: This Review provides a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicates likely directions for further developments in the field.
Abstract: Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms of dimensionality reduction, density estimation, and clustering, and kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used-or can be used-to analyze molecular simulation data.
144 citations
••
TL;DR: In this article, the authors propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample, which is theoretically exact in uniformly distributed datasets, and provides consistent measures in general.
Abstract: Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved; in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
131 citations
••
TL;DR: A new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample is proposed, which enables us to reduce the effects of curvature, of density variation, and the resulting computational cost.
Abstract: Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved, in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
99 citations
••
TL;DR: This work introduces an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules, that exploits the presence of correlations between the coordinates induced by the chemical nature of molecules.
Abstract: We introduce an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules The approach exploits the presence of correlations between the coordinates, induced, in molecular dynamics, by the chemical nature of the molecules Due to these correlations, the data points lay on a manifold that can be highly curved and twisted, but whose dimension is normally small We estimate the free energies by finding, with a statistical test, the largest neighborhood in which the free energy in the embedding manifold can be considered constant Importantly, this procedure does not require defining explicitly the manifold and provides an estimate of the error that is approximately unbiased up to large dimensions We test this approach on artificial and real data sets, demonstrating that the free energy estimates are reliable for data sets on manifolds of dimension up to ∼10, embedded in an arbitrarily large space
46 citations
Cited by
More filters
••
TL;DR: Monocle 2, an algorithm that uses reversed graph embedding to describe multiple fate decisions in a fully unsupervised manner, is applied to two studies of blood development and found that mutations in the genes encoding key lineage transcription factors divert cells to alternative fates.
Abstract: Single-cell trajectories can unveil how gene regulation governs cell fate decisions. However, learning the structure of complex trajectories with multiple branches remains a challenging computational problem. We present Monocle 2, an algorithm that uses reversed graph embedding to describe multiple fate decisions in a fully unsupervised manner. We applied Monocle 2 to two studies of blood development and found that mutations in the genes encoding key lineage transcription factors divert cells to alternative fates.
2,257 citations
•
TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance of the farthest data point.
Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!.
1,992 citations
••
University of KwaZulu-Natal1, University of Massachusetts Medical School2, Ragon Institute of MGH, MIT and Harvard3, Harvard University4, Broad Institute5, Massachusetts Institute of Technology6, Boston Children's Hospital7, Aix-Marseille University8, Centre national de la recherche scientifique9, Vanderbilt University Medical Center10, Brigham and Women's Hospital11, University of California, Berkeley12, University of Washington13, Fred Hutchinson Cancer Research Center14, Seattle Children's15, University of Pittsburgh16, University of Sheffield17, United States Department of Veterans Affairs18, University College London19, Scripps Research Institute20
TL;DR: The data suggest that SARS-CoV-2 could exploit species-specific interferon-driven upregulation of ACE2, a tissue-protective mediator during lung injury, to enhance infection.
1,911 citations