Top 14 papers published by José Luis Ambite from University of Southern California in 2021

Journal Article•DOI•

Toward a fine-scale population health monitoring system.

Gillian M. Belbin¹, Sinead Cullina¹, Stephane Wenric¹, Emily R. Soper¹, Benjamin S. Glicksberg¹, Denis Torre¹, Arden Moscati¹, Genevieve L. Wojcik², Ruhollah Shemirani³, Noam D. Beckmann¹, Ariella Cohain¹, Elena P. Sorokin², Danny S. Park¹, José Luis Ambite³, Steve Ellis¹, Adam Auton⁴, Cbipm Genomics Team¹, Erwin P. Bottinger¹, Judy H. Cho¹, Ruth J. F. Loos¹, Noura S. Abul-Husn¹, Noah Zaitlen⁵, Christopher R. Gignoux⁶, Eimear E. Kenny¹•Institutions (6)

Icahn School of Medicine at Mount Sinai¹, Stanford University², University of Southern California³, Albert Einstein College of Medicine⁴, University of California, Los Angeles⁵, Anschutz Medical Campus⁶

15 Apr 2021-Cell

TL;DR: This paper presents a framework for repurposing data from EHRs in concert with genomic data to explore the demographic ties that can impact disease burdens, and demonstrates that fine-scale population structure can impact the prediction of complex disease risk within groups.

55 citations

Proceedings Article•DOI•

Scaling Neuroscience Research Using Federated Learning

Dimitris Stripelis¹, José Luis Ambite¹, Pradeep K. Lam, Paul M. Thompson•Institutions (1)

University of Southern California¹

13 Apr 2021

TL;DR: In this article, a semi-synchronous federated learning architecture is proposed to learn a joint model over data silos; it does not share any subject data across sites, only aggregated parameters, often in encrypted environments.

Abstract: The amount of biomedical data continues to grow rapidly. However, the ability to analyze these data is limited due to privacy and regulatory concerns. Machine learning approaches that require data to be copied to a single location are hampered by the challenges of data sharing. Federated Learning is a promising approach to learn a joint model over data silos. This architecture does not share any subject data across sites, only aggregated parameters, often in encrypted environments, thus satisfying privacy and regulatory requirements. Here, we describe our Federated Learning architecture and training policies. We demonstrate our approach on a brain age prediction model on structural MRI scans distributed across multiple sites with diverse amounts of data and subject (age) distributions. In these heterogeneous environments, our Semi-Synchronous protocol provides faster convergence.
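
The abstract says that sites share only aggregated parameters, never subject data. A minimal sketch of one common way to combine such parameters is a sample-weighted average; this weighting rule, the `aggregate` function, and the toy parameter names are assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one flat list of floats per named parameter.
Params = Dict[str, List[float]]


def aggregate(updates: List[Tuple[Params, int]]) -> Params:
    """Average per-site parameters, weighting each site by its sample count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one training example across sites")
    first_params, _ = updates[0]
    merged: Params = {name: [0.0] * len(vals) for name, vals in first_params.items()}
    for params, n in updates:
        weight = n / total
        for name, vals in params.items():
            for i, v in enumerate(vals):
                merged[name][i] += weight * v
    return merged


# Two hypothetical sites with different amounts of local data.
site_a = ({"w": [1.0, 2.0], "b": [0.0]}, 100)
site_b = ({"w": [3.0, 4.0], "b": [1.0]}, 300)
global_model = aggregate([site_a, site_b])
```

Only the parameter dictionaries and sample counts cross site boundaries in this sketch; the raw records stay local, which is the property the abstract emphasizes.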

16 citations

Journal Article•DOI•

Rapid detection of identity-by-descent tracts for mega-scale datasets.

Ruhollah Shemirani¹, Gillian M. Belbin², Christy L. Avery³, Eimear E. Kenny, Christopher R. Gignoux⁴, José Luis Ambite¹•Institutions (4)

University of Southern California¹, Icahn School of Medicine at Mount Sinai², University of North Carolina at Chapel Hill³, Anschutz Medical Campus⁴

10 Jun 2021-Nature Communications

TL;DR: This paper presents iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals.

Abstract: The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections. Traditional methods to identify genomic regions identical-by-descent (IBD) do not scale well to biobank-level datasets. Here, the authors describe a new IBD algorithm, iLASH, which uses LocAlity-Sensitive Hashing to provide rapid IBD estimation when applied to the PAGE and UK Biobank datasets.
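
The abstract credits locality-sensitive hashing for avoiding all-pairs comparison. The sketch below shows the generic MinHash-with-banding idea on toy sequences: similar inputs tend to land in the same bucket, so only bucket-mates need a detailed comparison. It is not iLASH itself; the shingle length, hash count, banding parameters, and function names are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shingles(sequence: str, k: int = 4) -> Set[str]:
    """All overlapping length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def minhash(items: Set[str], num_hashes: int) -> List[int]:
    """MinHash signature: for each seeded hash, keep the smallest value."""
    if not items:
        raise ValueError("sequence is shorter than the shingle length")
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        )
        for seed in range(num_hashes)
    ]


def candidate_pairs(sequences: List[str], bands: int = 5, rows: int = 4) -> Set[Tuple[int, int]]:
    """Index pairs whose signatures agree on at least one band."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], List[int]] = defaultdict(list)
    for idx, seq in enumerate(sequences):
        sig = minhash(shingles(seq), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs: Set[Tuple[int, int]] = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


# Toy sequences (hypothetical): the first two are identical, the third shares no shingle.
haplotypes = [
    "ACGTTGCAAGGCTTAACCGGATAT",
    "ACGTTGCAAGGCTTAACCGGATAT",
    "TTTTTTTTTTTTTTTTTTTTTTTT",
]
pairs = candidate_pairs(haplotypes)
```

The cost is driven by bucket sizes rather than by the number of possible pairs, which is why hashing-based candidate generation can scale where exhaustive pairwise comparison does not.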

11 citations

Posted Content•

Secure Neuroimaging Analysis using Federated Learning with Homomorphic Encryption.

Dimitris Stripelis, Hamza Saleem, Tanmay Ghai, Nikhil J. Dhinagar, Umang Gupta, Chrysovalantis Anastasiou, Greg Ver Steeg, Srivatsan Ravi, Muhammad Naveed, Paul M. Thompson, José Luis Ambite

07 Aug 2021-arXiv: Cryptography and Security

TL;DR: In this article, the authors proposed a secure federated learning framework using fully-homomorphic encryption (FHE) to train a deep learning model to predict a person's age from distributed MRI scans, a common benchmarking task.

Abstract: Federated learning (FL) enables distributed computation of machine learning models over various disparate, remote data sources, without requiring to transfer any individual data to a centralized location. This results in an improved generalizability of models and efficient scaling of computation as more sources and larger datasets are added to the federation. Nevertheless, recent membership attacks show that private or sensitive personal data can sometimes be leaked or inferred when model parameters or summary statistics are shared with a central site, requiring improved security solutions. In this work, we propose a framework for secure FL using fully-homomorphic encryption (FHE). Specifically, we use the CKKS construction, an approximate, floating point compatible scheme that benefits from ciphertext packing and rescaling. In our evaluation on large-scale brain MRI datasets, we use our proposed secure FL framework to train a deep learning model to predict a person's age from distributed MRI scans, a common benchmarking task, and demonstrate that there is no degradation in the learning performance between the encrypted and non-encrypted federated models.

10 citations

Posted Content•DOI•

3D Convolutional Neural Networks for Classification of Alzheimer's and Parkinson's Disease with T1-Weighted Brain MRI

Nikhil J. Dhinagar¹, Sophia I. Thomopoulos¹, Conor Owens-Walton¹, Dimitris Stripelis¹, José Luis Ambite¹, Greg Ver Steeg¹, Daniel Weintraub², Philip A. Cook², Corey T. McMillan², Paul M. Thompson¹•Institutions (2)

University of Southern California¹, University of Pennsylvania²

27 Jul 2021-bioRxiv

TL;DR: In this paper, a 3D convolutional neural network (CNN) is proposed to classify Parkinson's disease and Alzheimer's disease based on 3D T1-weighted brain MRI.

Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are progressive neurodegenerative disorders that affect millions of people worldwide. In this work, we propose a deep learning approach to classify these diseases based on 3D T1-weighted brain MRI. We analyzed several datasets including the Parkinson's Progression Markers Initiative (PPMI), an independent dataset from the University of Pennsylvania School of Medicine (UPenn), the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the Open Access Series of Imaging Studies (OASIS) dataset. PPMI and ADNI were partitioned to train (70%), validate (20%), and test (10%) a 3D convolutional neural network (CNN) for PD and AD classification. The UPenn and OASIS datasets were used as independent test sets to evaluate the model performance during inference. We also implemented a random forest classifier as a baseline model by extracting key radiomics features from the same T1-weighted MRI scans. The proposed 3D CNN model was trained from scratch for the classification tasks. For AD classification, the 3D CNN model achieved an ROC-AUC of 0.878 on the ADNI test set and an average ROC-AUC of 0.789 on the OASIS dataset. For PD classification, the proposed 3D CNN model achieved an ROC-AUC of 0.667 on the PPMI test set and an average ROC-AUC of 0.743 on the UPenn dataset. We also found that model performance was largely maintained when using only 25% of the training dataset. The 3D CNN outperformed the random forest classifier for both the PD and AD tasks. The 3D CNN also generalized better on unseen MRI data from different imaging centers. Our results show that the proposed 3D CNN model was less prone to overfitting for AD than for PD classification. This approach shows promise for screening of PD and AD patients using only T1-weighted brain MRI, which is relatively widely available. This model with additional validation could also be used to help differentiate between challenging cases of AD and PD when they present with similarly subtle motor and non-motor symptoms.
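
The 70/20/10 train/validate/test partition mentioned in the abstract can be sketched as a seeded shuffle followed by slicing. The abstract does not say how the partition was drawn, so splitting by subject identifier, the fixed seed, and the `split_subjects` helper are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(
    ids: Sequence[str],
    seed: int = 0,
    fractions: Tuple[float, float] = (0.7, 0.2),
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle subject IDs reproducibly, then slice into train/validate/test."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 10%)
    return train, val, test


# Hypothetical subject identifiers.
subjects = [f"sub-{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_subjects(subjects)
```

Splitting on subject identifiers rather than on individual scans keeps all scans from one person in a single partition, which avoids leakage between training and evaluation when a subject has several scans.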

8 citations

Journal Article•DOI•

BD2K Training Coordinating Center's ERuDIte: The Educational Resource Discovery Index for Data Science

José Luis Ambite¹, Lily Fierro¹, Jonathan Gordon², Gully A. P. C. Burns¹, Florian Geigl, Kristina Lerman¹, John D. Van Horn¹•Institutions (2)

University of Southern California¹, Vassar College²

01 Jan 2021-IEEE Transactions on Emerging Topics in Computing

TL;DR: The BD2K TCC web portal is powered by ERuDIte, the Educational Resource Discovery Index, which collects training resources for data science, including online courses, videos of tutorials and research talks, textbooks, and other web-based materials.

Abstract: Data science is a field that has developed to enable efficient integration and analysis of increasingly large data sets in many domains. In particular, big data in genetics, neuroimaging, mobile health, and other subfields of biomedical science, promises new insights, but also poses challenges. To address these challenges, the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, including a Training Coordinating Center (TCC) tasked with developing a resource for personalized data science training for biomedical researchers. The BD2K TCC web portal is powered by ERuDIte, the Educational Resource Discovery Index, which collects training resources for data science, including online courses, videos of tutorials and research talks, textbooks, and other web-based materials. While the availability of so many potential learning resources is exciting, they are highly heterogeneous in quality, difficulty, format, and topic, making the field intimidating to enter and difficult to navigate. Moreover, data science is rapidly evolving, so there is a constant influx of new materials and concepts. We leverage data science techniques to build ERuDIte itself, using data extraction, data integration, machine learning, information retrieval, and natural language processing to automatically collect, integrate, describe, and organize existing online resources for learning data science.

6 citations

Membership Inference Attacks on Deep Regression Models for Neuroimaging

Umang Gupta, Dimitris Stripelis, Pradeep K. Lam, Paul M. Thompson, José Luis Ambite, Greg Ver Steeg

11 Feb 2021

TL;DR: This paper studies membership inference attacks on deep learning models for 3D neuroimaging tasks, showing that it is possible to infer whether a sample was used to train the model given only access to the model prediction (black-box), or access to the model itself (white-box) and some leaked samples from the training data distribution.

Abstract: Ensuring the privacy of research participants is vital, even more so in healthcare environments. Deep learning approaches to neuroimaging require large datasets, and this often necessitates sharing data between multiple sites, which is antithetical to the privacy objectives. Federated learning is a commonly proposed solution to this problem. It circumvents the need for data sharing by sharing parameters during the training process. However, we demonstrate that allowing access to parameters may leak private information even if data is never directly shared. In particular, we show that it is possible to infer if a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution. Such attacks are commonly referred to as Membership Inference attacks. We show realistic Membership Inference attacks on deep learning models trained for 3D neuroimaging tasks in a centralized as well as decentralized setup. We demonstrate feasible attacks on brain age prediction models (deep learning models that predict a person's age from their brain MRI scan). We correctly identified whether an MRI scan was used in model training with a 60% to over 80% success rate depending on model complexity and security assumptions.
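
To make the black-box setting concrete, here is a minimal sketch of a generic error-threshold membership test: a scan is guessed to be a training member when the model's age prediction for it is unusually accurate. This is a common baseline and not necessarily the attack used in the paper; the threshold, the function names, and every number below are made up for illustration.

```python
from typing import List, Sequence


def infer_membership(
    predicted_ages: Sequence[float],
    true_ages: Sequence[float],
    threshold: float,
) -> List[bool]:
    """Guess 'member' when the absolute prediction error is below the threshold."""
    return [abs(p - t) < threshold for p, t in zip(predicted_ages, true_ages)]


def attack_accuracy(guesses: Sequence[bool], is_member: Sequence[bool]) -> float:
    """Fraction of samples whose membership was guessed correctly."""
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)


# Toy values, invented for this sketch (not results from the paper).
predicted = [61.5, 34.2, 70.9, 48.0, 55.0, 29.0]
actual = [62.0, 34.0, 71.0, 55.0, 47.5, 36.0]
is_member = [True, True, True, False, False, False]

guesses = infer_membership(predicted, actual, threshold=2.0)
accuracy = attack_accuracy(guesses, is_member)
```

The sketch relies on the tendency of models to fit their training samples more closely than unseen ones; on real models that gap is much less clean than in these toy numbers.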

5 citations

Posted Content•

Semi-Synchronous Federated Learning.

Dimitris Stripelis¹, José Luis Ambite¹•Institutions (1)

Information Sciences Institute¹

04 Feb 2021-arXiv: Learning

TL;DR: In this paper, a Semi-Synchronous Federated Learning (SFL) protocol is proposed to learn a joint model over all the available data across silos in federated learning.

Abstract: There are situations where data relevant to a machine learning problem are distributed among multiple locations that cannot share the data due to regulatory, competitiveness, or privacy reasons. For example, data present in users' cellphones, manufacturing data of companies in a given industrial sector, or medical records located at different hospitals. Federated Learning (FL) provides an approach to learn a joint model over all the available data across silos. In many cases, participating sites have different data distributions and computational capabilities. In these heterogeneous environments previous approaches exhibit poor performance: synchronous FL protocols are communication efficient, but have slow learning convergence; conversely, asynchronous FL protocols have faster convergence, but at a higher communication cost. Here we introduce a novel Semi-Synchronous Federated Learning protocol that mixes local models periodically with minimal idle time and fast convergence. We show through extensive experiments that our approach significantly outperforms previous work in data and computationally heterogeneous environments.
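
One way to read "mixes local models periodically with minimal idle time" is that every site trains until a shared deadline and faster sites simply process more batches. The sketch below computes such a per-site batch budget; the deadline rule, the `local_steps` helper, and the timing numbers are assumptions for illustration and are not taken from the abstract.

```python
from typing import Dict


def local_steps(seconds_per_batch: Dict[str, float], sync_period: float) -> Dict[str, int]:
    """Batches each site can finish before the shared deadline (at least one)."""
    steps = {}
    for site, t in seconds_per_batch.items():
        if t <= 0:
            raise ValueError(f"non-positive batch time for {site}")
        steps[site] = max(1, int(sync_period // t))
    return steps


# Hypothetical sites with different hardware speeds.
speeds = {"fast_gpu_site": 0.5, "slow_cpu_site": 2.0}
steps = local_steps(speeds, sync_period=10.0)
```

Under this reading no site waits for a slower peer to finish a fixed amount of work, yet all sites still exchange models at the same moments, which sits between the fully synchronous and fully asynchronous protocols the abstract contrasts.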

3 citations

Posted Content•DOI•

Selecting Clustering Algorithms for IBD Mapping

Ruhollah Shemirani¹, Gillian M. Belbin², Keith Burghardt¹, Kristina Lerman¹, Christy L. Avery³, Eimear E. Kenny², Christopher R. Gignoux⁴, José Luis Ambite¹•Institutions (4)

University of Southern California¹, Icahn School of Medicine at Mount Sinai², University of North Carolina at Chapel Hill³, Anschutz Medical Campus⁴

12 Aug 2021-bioRxiv

TL;DR: In this paper, the authors designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power and investigated the effectiveness of common clustering metrics as replacements for statistical power.

Abstract: Background: Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results: We simulated 3.4 million clusters across 850 experiments with varying cluster counts, false-positive, and false-negative rates. Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study with 51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters across three different populations in PAGE. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions: Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach, while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.

2 citations

Posted Content•DOI•

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Kanix Wang¹, Robert Stevens², Halima Alachram³, Yu Li⁴, Larisa N. Soldatova⁵, Ross D. King⁶,⁷,⁸, Sophia Ananiadou², Annika Marie Schoene², Maolin Li², Fenia Christopoulou², José Luis Ambite⁹, Joel Matthew⁹, Sahil Garg⁹, Ulf Hermjakob⁹, Daniel Marcu⁹, Emily Sheng⁹, Tim Beißbarth³, Edgar Wingender, Aram Galstyan⁹, Xin Gao⁴, Brendan Chambers¹, Weidi Pan¹, Bohdan B. Khomtchouk¹, James A. Evans¹, Andrey Rzhetsky•Institutions (9)

University of Chicago¹, University of Manchester², University of Göttingen³, King Abdullah University of Science and Technology⁴, Goldsmiths, University of London⁵, The Turing Institute⁶, Chalmers University of Technology⁷, University of Cambridge⁸, University of Southern California⁹

20 Oct 2021

TL;DR: The Named Entity Recognition Ontology (NERO) is an ontology developed specifically for describing textual entities in biomedical texts; it accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine.

Abstract: Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

2 citations

Posted Content•

Scaling Neuroscience Research using Federated Learning

Dimitris Stripelis¹, José Luis Ambite¹, Pradeep K. Lam, Paul M. Thompson•Institutions (1)

University of Southern California¹

16 Feb 2021-arXiv: Learning

TL;DR: In this article, a semi-synchronous federated learning architecture is proposed to learn a joint model over data silos; it does not share any subject data across sites, only aggregated parameters, often in encrypted environments.

Abstract: The amount of biomedical data continues to grow rapidly. However, the ability to analyze these data is limited due to privacy and regulatory concerns. Machine learning approaches that require data to be copied to a single location are hampered by the challenges of data sharing. Federated Learning is a promising approach to learn a joint model over data silos. This architecture does not share any subject data across sites, only aggregated parameters, often in encrypted environments, thus satisfying privacy and regulatory requirements. Here, we describe our Federated Learning architecture and training policies. We demonstrate our approach on a brain age prediction model on structural MRI scans distributed across multiple sites with diverse amounts of data and subject (age) distributions. In these heterogeneous environments, our Semi-Synchronous protocol provides faster convergence.

Journal Article•DOI•

W-TSS: A Wavelet-Based Algorithm for Discovering Time Series Shapelets.

Kenan Li¹, Huiyu Deng², John Morrison¹, Rima Habre¹, Meredith Franklin¹, Yao-Yi Chiang³, Katherine A. Sward⁴, Frank D. Gilliland¹, José Luis Ambite¹, Sandrah P. Eckel¹•Institutions (4)

University of Southern California¹, City of Hope National Medical Center², University of Minnesota³, University of Utah⁴

28 Aug 2021-Sensors

TL;DR: In this article, W-TSS, a novel method for identifying candidate shapelets in time-series shapelets (TSS) using wavelet transformation discovery, is proposed; the method does not require pre-specification of shapelet length.

Abstract: Many approaches to time series classification rely on machine learning methods. However, there is growing interest in going beyond black box prediction models to understand discriminatory features of the time series and their associations with outcomes. One promising method is time-series shapelets (TSS), which identifies maximally discriminative subsequences of time series. For example, in environmental health applications TSS could be used to identify short-term patterns in exposure time series (shapelets) associated with adverse health outcomes. Identification of candidate shapelets in TSS is computationally intensive. The original TSS algorithm used exhaustive search. Subsequent algorithms introduced efficiencies by trimming/aggregating the set of candidates or training candidates from initialized values, but these approaches have limitations. In this paper, we introduce Wavelet-TSS (W-TSS), a novel intelligent method for identifying candidate shapelets in TSS using wavelet transformation discovery. We tested W-TSS on two datasets: (1) a synthetic example used in previous TSS studies and (2) a panel study relating exposures from residential air pollution sensors to symptoms in participants with asthma. Compared to previous TSS algorithms, W-TSS was more computationally efficient, more accurate, and able to discover more discriminative shapelets. W-TSS does not require pre-specification of shapelet length.

Journal Article•DOI•

EPS: Automated Feature Selection in Case-Control Studies using Extreme Pseudo-Sampling.

Ruhollah Shemirani¹, Stephane Wenric², Eimear E. Kenny², José Luis Ambite¹•Institutions (2)

University of Southern California¹, Icahn School of Medicine at Mount Sinai²

11 Oct 2021-Bioinformatics

TL;DR: The Extreme Pseudo-Sampling (EPS) algorithm uses a combination of deep learning and linear regression models to find informative predictive features in high dimensional biological case-control datasets.

Abstract: SUMMARY: Finding informative predictive features in high dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use, and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. AVAILABILITY: The software package for Python is available online at https://github.com/roohy/eps.

Posted Content•

Membership Inference Attacks on Deep Regression Models for Neuroimaging

Umang Gupta¹, Dimitris Stripelis², Pradeep K. Lam¹, Paul M. Thompson¹, José Luis Ambite², Greg Ver Steeg²•Institutions (2)

University of Southern California¹, Information Sciences Institute²

06 May 2021-arXiv: Quantitative Methods

TL;DR: This paper studies membership inference attacks on deep learning models for 3D neuroimaging tasks, showing that it is possible to infer whether a sample was used to train the model given only access to the model prediction (black-box), or access to the model itself (white-box) and some leaked samples from the training data distribution.

Abstract: Ensuring the privacy of research participants is vital, even more so in healthcare environments. Deep learning approaches to neuroimaging require large datasets, and this often necessitates sharing data between multiple sites, which is antithetical to the privacy objectives. Federated learning is a commonly proposed solution to this problem. It circumvents the need for data sharing by sharing parameters during the training process. However, we demonstrate that allowing access to parameters may leak private information even if data is never directly shared. In particular, we show that it is possible to infer if a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution. Such attacks are commonly referred to as Membership Inference attacks. We show realistic Membership Inference attacks on deep learning models trained for 3D neuroimaging tasks in a centralized as well as decentralized setup. We demonstrate feasible attacks on brain age prediction models (deep learning models that predict a person's age from their brain MRI scan). We correctly identified whether an MRI scan was used in model training with a 60% to over 80% success rate depending on model complexity and security assumptions.
