Author

Alex Rodriguez

Bio: Alex Rodriguez is an academic researcher from the International Centre for Theoretical Physics. The author has contributed to research on the topics of intrinsic dimension and cluster analysis. The author has an h-index of 14 and has co-authored 32 publications receiving 3018 citations. Previous affiliations of Alex Rodriguez include the University of Barcelona and the International School for Advanced Studies.

Papers
Journal ArticleDOI
27 Jun 2014-Science
TL;DR: A method in which the cluster centers are recognized as local density maxima that are far away from any points of higher density, and the algorithm depends only on the relative densities rather than their absolute values.
Abstract: Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.

3,441 citations
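
The idea in the abstract above (cluster centers have a higher density than their neighbors and lie far from any point of higher density) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the cutoff-count density, the parameter name `d_c`, the index-based tie-breaking, the `rho * delta` ranking, and the user-supplied number of centers `k` are all choices made here for brevity.

```python
import math


def density_and_separation(points, d_c):
    """Return (rho, delta) for every point.

    rho[i]   : number of other points closer than the cutoff d_c (assumed density).
    delta[i] : distance to the nearest point of higher density; for the densest
               point, the largest distance to any other point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]

    # Assumption: ties in density are broken by index so that every point
    # except the first in this ordering has at least one "denser" point.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])
    return rho, delta


def candidate_centers(points, d_c, k):
    """Indices of the k points with the largest rho * delta (hypothetical ranking)."""
    rho, delta = density_and_separation(points, d_c)
    ranked = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(ranked[:k])
```

Only the ordering of the densities enters the computation of `delta`, which is consistent with the TL;DR's remark that the method depends on relative rather than absolute densities. Remaining points would then be assigned to the cluster of their nearest higher-density neighbor; that step is omitted here.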

Journal ArticleDOI
TL;DR: This Review provides a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicates likely directions for further developments in the field.
Abstract: Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms of dimensionality reduction, density estimation, and clustering, and kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.

144 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample, which is theoretically exact in uniformly distributed datasets, and provides consistent measures in general.
Abstract: Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved; in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.

131 citations
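
The estimator described in the abstract above uses only the distances to the first and second nearest neighbors of each point. A minimal sketch follows, assuming that the ratio of the two distances follows a Pareto law with exponent equal to the intrinsic dimension when the density is locally uniform. The maximum-likelihood formula used here is a simplification swapped in for brevity and is not taken from the abstract; the function names and the synthetic data set are hypothetical.

```python
import math
import random


def two_nn_id(points):
    """Estimate the intrinsic dimension from first and second neighbor distances.

    For each point, mu = r2 / r1, where r1 and r2 are the distances to its
    first and second nearest neighbors. Assuming mu is Pareto distributed with
    exponent d, the maximum-likelihood estimate is d = N / sum(log(mu)).
    """
    log_mu_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = d[0], d[1]
        if r1 > 0:  # skip exact duplicates, for which the ratio is undefined
            log_mu_sum += math.log(r2 / r1)
            used += 1
    return used / log_mu_sum


def sample_plane_in_5d(n, seed=0):
    """Hypothetical test data: a flat 2-D sheet linearly embedded in 5-D."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        pts.append((x, y, x + y, x - y, 2 * x))
    return pts
```

Calling `two_nn_id(sample_plane_in_5d(500))` should return a value close to 2 rather than 5, since the five coordinates depend on only two free parameters. The brute-force neighbor search is quadratic in the number of points and is meant only for small examples.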

Journal ArticleDOI
TL;DR: A new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample is proposed, which enables us to reduce the effects of curvature, of density variation, and the resulting computational cost.
Abstract: Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved; in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.

99 citations

Journal ArticleDOI
TL;DR: This work introduces an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules, that exploits the presence of correlations between the coordinates induced by the chemical nature of molecules.
Abstract: We introduce an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules. The approach exploits the presence of correlations between the coordinates, induced, in molecular dynamics, by the chemical nature of the molecules. Due to these correlations, the data points lay on a manifold that can be highly curved and twisted, but whose dimension is normally small. We estimate the free energies by finding, with a statistical test, the largest neighborhood in which the free energy in the embedding manifold can be considered constant. Importantly, this procedure does not require defining explicitly the manifold and provides an estimate of the error that is approximately unbiased up to large dimensions. We test this approach on artificial and real data sets, demonstrating that the free energy estimates are reliable for data sets on manifolds of dimension up to ∼10, embedded in an arbitrarily large space.

46 citations


Cited by
Journal ArticleDOI
TL;DR: Monocle 2, an algorithm that uses reversed graph embedding to describe multiple fate decisions in a fully unsupervised manner, is applied to two studies of blood development and found that mutations in the genes encoding key lineage transcription factors divert cells to alternative fates.
Abstract: Single-cell trajectories can unveil how gene regulation governs cell fate decisions. However, learning the structure of complex trajectories with multiple branches remains a challenging computational problem. We present Monocle 2, an algorithm that uses reversed graph embedding to describe multiple fate decisions in a fully unsupervised manner. We applied Monocle 2 to two studies of blood development and found that mutations in the genes encoding key lineage transcription factors divert cells to alternative fates.

2,257 citations

Reference EntryDOI
15 Oct 2004

2,118 citations

Journal Article
TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!

1,992 citations
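
The effect described in the abstract above can be observed numerically. The sketch below, an illustration under simple assumptions and not the authors' experimental setup, draws independent uniform coordinates (the easiest case covered by the stated conditions) and reports the ratio between the nearest and the farthest distance from a random query point. The function name, sample size, and seed are hypothetical.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point.

    Coordinates are drawn independently and uniformly from [0, 1). A ratio
    close to 1 means the nearest neighbor is barely closer than the farthest
    point, so a nearest-neighbor query carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)
```

Evaluating the function for increasing values of `dim` (for example 2, 10, 100, 1000) should show the ratio rising toward 1, which is the concentration of distances the abstract refers to.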

Journal ArticleDOI
Carly G. K. Ziegler, Samuel J. Allon, Sarah K. Nyquist, Ian M. Mbano, Vincent N. Miao, Constantine N. Tzouanas, Yuming Cao, Ashraf S. Yousif, Julia Bals, Blake M. Hauser, Jared Feldman, Christoph Muus, Marc H. Wadsworth, Samuel W. Kazer, Travis K. Hughes, Benjamin Doran, G. James Gatter, Marko Vukovic, Faith Taliaferro, Benjamin E. Mead, Zhiru Guo, Jennifer P. Wang, Delphine Gras, Magali Plaisant, Meshal Ansari, Ilias Angelidis, Heiko Adler, Jennifer M.S. Sucre, Chase J. Taylor, Brian M. Lin, Avinash Waghray, Vanessa Mitsialis, Daniel F. Dwyer, Kathleen M. Buchheit, Joshua A. Boyce, Nora A. Barrett, Tanya M. Laidlaw, Shaina L. Carroll, Lucrezia Colonna, Victor Tkachev, Christopher W. Peterson, Alison Yu, Hengqi Betty Zheng, Hannah P. Gideon, Caylin G. Winchell, Philana Ling Lin, Colin D. Bingle, Scott B. Snapper, Jonathan A. Kropski, Fabian J. Theis, Herbert B. Schiller, Laure-Emmanuelle Zaragosi, Pascal Barbry, Alasdair Leslie, Hans-Peter Kiem, JoAnne L. Flynn, Sarah M. Fortune, Bonnie Berger, Robert W. Finberg, Leslie S. Kean, Manuel Garber, Aaron G. Schmidt, Daniel Lingwood, Alex K. Shalek, Jose Ordovas-Montanes, Nicholas E. Banovich, Alvis Brazma, Tushar J. Desai, Thu Elizabeth Duong, Oliver Eickelberg, Christine S. Falk, Michael Farzan, Ian A. Glass, Muzlifah Haniffa, Peter Horvath, Deborah T. Hung, Naftali Kaminski, Mark A. Krasnow, Malte Kühnemund, Robert Lafyatis, Haeock Lee, Sylvie Leroy, Sten Linnarson, Joakim Lundeberg, Kerstin B. Meyer, Alexander V. Misharin, Martijn C. Nawijn, Marko Nikolic, Dana Pe'er, Joseph E. Powell, Stephen R. Quake, Jay Rajagopal, Purushothama Rao Tata, Emma L. Rawlins, Aviv Regev, Paul A. Reyfman, Mauricio Rojas, Orit Rosen, Kourosh Saeb-Parsy, Christos Samakovlis, Herbert B. Schiller, Joachim L. Schultze, Max A. Seibold, Douglas P. Shepherd, Jason R. Spence, Avrum Spira, Xin Sun, Sarah A. Teichmann, Fabian J. Theis, Alexander M. Tsankov, Maarten van den Berge, Michael von Papen, Jeffrey A. Whitsett, Ramnik J. Xavier, Yan Xu, Kun Zhang
28 May 2020-Cell
TL;DR: The data suggest that SARS-CoV-2 could exploit species-specific interferon-driven upregulation of ACE2, a tissue-protective mediator during lung injury, to enhance infection.

1,911 citations