
Showing papers by "José Luis Ambite" published in 2021


Journal ArticleDOI
15 Apr 2021-Cell
TL;DR: This work presents a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens, and demonstrates that fine-scale population structure can impact the prediction of complex disease risk within groups.

55 citations


Proceedings ArticleDOI
13 Apr 2021
TL;DR: In this article, a semi-synchronized federated learning architecture is proposed to learn a joint model over data silos, which does not share any subject data across sites, only aggregated parameters, often in encrypted environments.
Abstract: The amount of biomedical data continues to grow rapidly. However, the ability to analyze these data is limited due to privacy and regulatory concerns. Machine learning approaches that require data to be copied to a single location are hampered by the challenges of data sharing. Federated Learning is a promising approach to learn a joint model over data silos. This architecture does not share any subject data across sites, only aggregated parameters, often in encrypted environments, thus satisfying privacy and regulatory requirements. Here, we describe our Federated Learning architecture and training policies. We demonstrate our approach on a brain age prediction model on structural MRI scans distributed across multiple sites with diverse amounts of data and subject (age) distributions. In these heterogeneous environments, our Semi-Synchronous protocol provides faster convergence.
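Illustrative sketch (not taken from the paper): the abstract states that sites share only aggregated parameters, never subject data. One common way to combine such parameters is a weighted average in which each site's weight is proportional to its number of local training examples. That weighting rule, the function name `federated_average`, and the flat-list parameter layout are assumptions made for this example; the authors' actual training policies may differ.

```python
def federated_average(site_params, site_sizes):
    """Combine per-site parameter vectors into one joint vector.

    site_params: list of equal-length lists of floats, one list per site.
    site_sizes:  number of local training examples held by each site.
    Only parameters and counts are exchanged; no subject data leaves a site.
    """
    total = sum(site_sizes)
    merged = [0.0] * len(site_params[0])
    for params, size in zip(site_params, site_sizes):
        weight = size / total  # larger sites contribute more (assumed policy)
        for i, value in enumerate(params):
            merged[i] += weight * value
    return merged
```

For example, `federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])` weights the second site three times as heavily as the first.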

16 citations


Journal ArticleDOI
TL;DR: This paper presents iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals.
Abstract: The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections. Traditional methods to identify genomic regions identical-by-descent (IBD) do not scale well to biobank-level datasets. Here, the authors describe a new IBD algorithm, iLASH, which uses LocAlity-Sensitive Hashing to provide rapid IBD estimation when applied to the PAGE and UK Biobank datasets.
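Illustrative sketch (not the iLASH implementation): the abstract says iLASH relies on Locality-Sensitive Hashing to avoid comparing all pairs of individuals. The generic MinHash-with-banding idea can be shown on toy haplotype strings within a single genomic window. The shingle length, number of hash functions, band size, and the use of `zlib.crc32` as a hash family are all assumptions chosen for this example.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def shingles(haplotype, k=4):
    """Set of overlapping length-k substrings of a 0/1 allele string."""
    return {haplotype[i:i + k] for i in range(len(haplotype) - k + 1)}


def minhash_signature(items, num_hashes=20):
    """One minimum per seeded hash function; similar sets share many minima."""
    return [min(zlib.crc32(f"{seed}:{item}".encode()) for item in items)
            for seed in range(num_hashes)]


def candidate_pairs(haplotypes, k=4, num_hashes=20, band_size=4):
    """Return pairs of names whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for name, hap in haplotypes.items():
        sig = minhash_signature(shingles(hap, k), num_hashes)
        for start in range(0, num_hashes, band_size):
            band = (start, tuple(sig[start:start + band_size]))
            buckets[band].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Only pairs that land in a shared bucket need a detailed comparison, which is what lets this family of methods scale past all-pairs search.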

11 citations


Posted Content
TL;DR: In this article, the authors proposed a secure federated learning framework using fully-homomorphic encryption (FHE) to train a deep learning model to predict a person's age from distributed MRI scans, a common benchmarking task.
Abstract: Federated learning (FL) enables distributed computation of machine learning models over various disparate, remote data sources, without requiring the transfer of any individual data to a centralized location. This results in an improved generalizability of models and efficient scaling of computation as more sources and larger datasets are added to the federation. Nevertheless, recent membership attacks show that private or sensitive personal data can sometimes be leaked or inferred when model parameters or summary statistics are shared with a central site, requiring improved security solutions. In this work, we propose a framework for secure FL using fully-homomorphic encryption (FHE). Specifically, we use the CKKS construction, an approximate, floating point compatible scheme that benefits from ciphertext packing and rescaling. In our evaluation on large-scale brain MRI datasets, we use our proposed secure FL framework to train a deep learning model to predict a person's age from distributed MRI scans, a common benchmarking task, and demonstrate that there is no degradation in the learning performance between the encrypted and non-encrypted federated models.

10 citations


Posted ContentDOI
27 Jul 2021-bioRxiv
TL;DR: In this paper, a 3D convolutional neural network (CNN) was proposed to classify Parkinson's disease and Alzheimer's disease based on 3D T1-weighted brain MRI.
Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are progressive neurodegenerative disorders that affect millions of people worldwide. In this work, we propose a deep learning approach to classify these diseases based on 3D T1-weighted brain MRI. We analyzed several datasets including the Parkinson's Progression Markers Initiative (PPMI), an independent dataset from the University of Pennsylvania School of Medicine (UPenn), the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the Open Access Series of Imaging Studies (OASIS) dataset. PPMI and ADNI were partitioned to train (70%), validate (20%), and test (10%) a 3D convolutional neural network (CNN) for PD and AD classification. The UPenn and OASIS datasets were used as independent test sets to evaluate the model performance during inference. We also implemented a random forest classifier as a baseline model by extracting key radiomics features from the same T1-weighted MRI scans. The proposed 3D CNN model was trained from scratch for the classification tasks. For AD classification, the 3D CNN model achieved an ROC-AUC of 0.878 on the ADNI test set and an average ROC-AUC of 0.789 on the OASIS dataset. For PD classification, the proposed 3D CNN model achieved an ROC-AUC of 0.667 on the PPMI test set and an average ROC-AUC of 0.743 on the UPenn dataset. We also found that model performance was largely maintained when using only 25% of the training dataset. The 3D CNN outperformed the random forest classifier for both the PD and AD tasks. The 3D CNN also generalized better on unseen MRI data from different imaging centers. Our results show that the proposed 3D CNN model was less prone to overfitting for AD than for PD classification. This approach shows promise for screening of PD and AD patients using only T1-weighted brain MRI, which is relatively widely available. This model with additional validation could also be used to help differentiate between challenging cases of AD and PD when they present with similarly subtle motor and non-motor symptoms.
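Illustrative sketch (not the authors' evaluation code): the results above are reported as ROC-AUC, which can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the 0/1 label encoding are assumptions for this example.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) ROC-AUC for binary labels encoded as 0 and 1."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

Under this reading, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.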

8 citations


Journal ArticleDOI
TL;DR: The BD2K TCC web portal is powered by ERuDIte, the Educational Resource Discovery Index, which collects training resources for data science, including online courses, videos of tutorials and research talks, textbooks, and other web-based materials.
Abstract: Data science is a field that has developed to enable efficient integration and analysis of increasingly large data sets in many domains. In particular, big data in genetics, neuroimaging, mobile health, and other subfields of biomedical science, promises new insights, but also poses challenges. To address these challenges, the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, including a Training Coordinating Center (TCC) tasked with developing a resource for personalized data science training for biomedical researchers. The BD2K TCC web portal is powered by ERuDIte, the Educational Resource Discovery Index, which collects training resources for data science, including online courses, videos of tutorials and research talks, textbooks, and other web-based materials. While the availability of so many potential learning resources is exciting, they are highly heterogeneous in quality, difficulty, format, and topic, making the field intimidating to enter and difficult to navigate. Moreover, data science is rapidly evolving, so there is a constant influx of new materials and concepts. We leverage data science techniques to build ERuDIte itself, using data extraction, data integration, machine learning, information retrieval, and natural language processing to automatically collect, integrate, describe, and organize existing online resources for learning data science.

6 citations


11 Feb 2021
TL;DR: This work studies membership inference attacks on deep learning models for 3D neuroimaging tasks, showing that it is possible to infer whether a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution.
Abstract: Ensuring the privacy of research participants is vital, even more so in healthcare environments. Deep learning approaches to neuroimaging require large datasets, and this often necessitates sharing data between multiple sites, which is antithetical to the privacy objectives. Federated learning is a commonly proposed solution to this problem. It circumvents the need for data sharing by sharing parameters during the training process. However, we demonstrate that allowing access to parameters may leak private information even if data is never directly shared. In particular, we show that it is possible to infer if a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution. Such attacks are commonly referred to as Membership Inference attacks. We show realistic Membership Inference attacks on deep learning models trained for 3D neuroimaging tasks in a centralized as well as decentralized setup. We demonstrate feasible attacks on brain age prediction models (deep learning models that predict a person's age from their brain MRI scan). We correctly identified whether an MRI scan was used in model training with a 60% to over 80% success rate depending on model complexity and security assumptions.
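Illustrative sketch (a generic error-threshold attack, not necessarily the attack used in the paper): in the black-box setting an attacker sees only model predictions. A simple baseline exploits the fact that models often make smaller errors on training samples, so a cutoff on prediction error can be fitted on a few samples whose membership is known. The function names and the use of absolute age-prediction error as the signal are assumptions for this example.

```python
def fit_threshold(member_errors, nonmember_errors):
    """Pick the error cutoff that best separates known members from non-members.

    Returns (cutoff, accuracy) measured on the samples supplied.
    """
    candidates = sorted(set(member_errors) | set(nonmember_errors))
    total = len(member_errors) + len(nonmember_errors)
    best_cutoff, best_accuracy = candidates[0], 0.0
    for cutoff in candidates:
        correct = sum(e <= cutoff for e in member_errors)
        correct += sum(e > cutoff for e in nonmember_errors)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy


def is_member(error, cutoff):
    """Guess that a sample was in the training set if its error is small."""
    return error <= cutoff
```

An accuracy near 0.5 on balanced data would mean the attack learns nothing; the 60% to over 80% success rates quoted above indicate measurable leakage.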

5 citations


Posted Content
TL;DR: In this paper, a Semi-Synchronous Federated Learning (SFL) protocol is proposed to learn a joint model over all the available data across silos in federated learning.
Abstract: There are situations where data relevant to a machine learning problem are distributed among multiple locations that cannot share the data due to regulatory, competitiveness, or privacy reasons. For example, data present in users' cellphones, manufacturing data of companies in a given industrial sector, or medical records located at different hospitals. Federated Learning (FL) provides an approach to learn a joint model over all the available data across silos. In many cases, participating sites have different data distributions and computational capabilities. In these heterogeneous environments, previous approaches exhibit poor performance: synchronous FL protocols are communication efficient, but have slow learning convergence; conversely, asynchronous FL protocols have faster convergence, but at a higher communication cost. Here we introduce a novel Semi-Synchronous Federated Learning protocol that mixes local models periodically with minimal idle time and fast convergence. We show through extensive experiments that our approach significantly outperforms previous work in data and computationally heterogeneous environments.
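Illustrative sketch (one possible reading of "mixes local models periodically with minimal idle time", not the authors' protocol): if every learner trains for the same wall-clock budget before models are mixed, faster learners simply complete more local batches and nobody waits. By contrast, a synchronous round with a fixed batch count leaves fast learners idle until the slowest one finishes. The fixed time budget, the function names, and the per-batch timing inputs are assumptions for this example.

```python
def semi_sync_steps(seconds_per_batch, budget_seconds):
    """Local batches each learner completes within a shared time budget."""
    return {name: int(budget_seconds // spb)
            for name, spb in seconds_per_batch.items()}


def sync_idle_time(seconds_per_batch, batches_per_round):
    """Idle seconds per learner when all must finish the same batch count."""
    durations = {name: spb * batches_per_round
                 for name, spb in seconds_per_batch.items()}
    slowest = max(durations.values())
    return {name: slowest - d for name, d in durations.items()}
```

With a learner at 0.5 s per batch and another at 2.0 s per batch, a 10 s budget yields 20 and 5 local batches respectively, whereas a synchronous round of 10 batches would leave the fast learner idle for 15 s.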

3 citations


Posted ContentDOI
12 Aug 2021-bioRxiv
TL;DR: In this paper, the authors designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power and investigated the effectiveness of common clustering metrics as replacements for statistical power.
Abstract: Background: Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results: We simulated 3.4 million clusters across 850 experiments with varying cluster counts, false-positive, and false-negative rates. Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study with 51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters across three different populations in PAGE. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions: Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach, while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
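Illustrative sketch (a textbook version of Markov Clustering, not the efficient implementation the authors provide): MCL simulates random walks on a graph by alternating expansion (squaring the column-stochastic matrix) and inflation (raising entries to a power and renormalizing), which strengthens flow inside dense groups. Here a local IBD graph is assumed to be a 0/1 adjacency matrix whose nodes are individuals and whose edges mark pairwise IBD sharing at a locus; the inflation value, iteration count, and pruning tolerance are assumptions for this example.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def markov_clustering(adjacency, inflation=2.0, iterations=50):
    """Return a set of clusters (frozensets of node indices)."""
    n = len(adjacency)
    # Add self-loops so every column has mass, then normalize.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: two-step random walk (matrix square).
        m = [[sum(m[i][k] * m[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
        # Inflation: boost strong flows, damp weak ones, renormalize.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:  # rows that retain flow are attractors; their columns form a cluster
            clusters.add(members)
    return clusters
```

This dense-matrix version costs cubic time per iteration and is only meant to show the mechanics; graphs at biobank scale need sparse matrices and pruning.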

2 citations


Posted ContentDOI
20 Oct 2021
TL;DR: The Named Entity Recognition Ontology (NERO) is an ontology developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine.
Abstract: Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

2 citations


Posted Content
TL;DR: In this article, a semi-synchronized federated learning architecture is proposed to learn a joint model over data silos, which does not share any subject data across sites, only aggregated parameters, often in encrypted environments.
Abstract: The amount of biomedical data continues to grow rapidly. However, the ability to analyze these data is limited due to privacy and regulatory concerns. Machine learning approaches that require data to be copied to a single location are hampered by the challenges of data sharing. Federated Learning is a promising approach to learn a joint model over data silos. This architecture does not share any subject data across sites, only aggregated parameters, often in encrypted environments, thus satisfying privacy and regulatory requirements. Here, we describe our Federated Learning architecture and training policies. We demonstrate our approach on a brain age prediction model on structural MRI scans distributed across multiple sites with diverse amounts of data and subject (age) distributions. In these heterogeneous environments, our Semi-Synchronous protocol provides faster convergence.

Journal ArticleDOI
28 Aug 2021-Sensors
TL;DR: In this article, a novel intelligent method for identifying candidate shapelets in TSS using wavelet transformation discovery is proposed; the method does not require pre-specification of shapelet length.
Abstract: Many approaches to time series classification rely on machine learning methods. However, there is growing interest in going beyond black box prediction models to understand discriminatory features of the time series and their associations with outcomes. One promising method is time-series shapelets (TSS), which identifies maximally discriminative subsequences of time series. For example, in environmental health applications TSS could be used to identify short-term patterns in exposure time series (shapelets) associated with adverse health outcomes. Identification of candidate shapelets in TSS is computationally intensive. The original TSS algorithm used exhaustive search. Subsequent algorithms introduced efficiencies by trimming/aggregating the set of candidates or training candidates from initialized values, but these approaches have limitations. In this paper, we introduce Wavelet-TSS (W-TSS), a novel intelligent method for identifying candidate shapelets in TSS using wavelet transformation discovery. We tested W-TSS on two datasets: (1) a synthetic example used in previous TSS studies and (2) a panel study relating exposures from residential air pollution sensors to symptoms in participants with asthma. Compared to previous TSS algorithms, W-TSS was more computationally efficient, more accurate, and was able to discover more discriminative shapelets. W-TSS does not require pre-specification of shapelet length.
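Illustrative sketch (the generic shapelet distance used across TSS methods, not the wavelet-based candidate discovery of W-TSS): a shapelet is scored against a time series by sliding it along the series and keeping the smallest Euclidean distance to any window of the same length. A series that contains the pattern gets a distance near zero, which is what makes the shapelet usable as a classification feature. The function name and the plain-list inputs are assumptions for this example.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and any window of a series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        # Distance between the shapelet and the window beginning at `start`.
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2
                          for i in range(m)))
        best = min(best, d)
    return best
```

Exhaustive search evaluates this distance for every candidate subsequence against every series, which is why the choice of candidates dominates the cost.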

Journal ArticleDOI
TL;DR: The Extreme Pseudo-Sampling (EPS) algorithm uses a combination of deep learning and linear regression models to find informative predictive features in high dimensional biological case-control datasets.
Abstract: Summary: Finding informative predictive features in high dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use, and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. Availability: The software package for Python is available online at https://github.com/roohy/eps.

Posted Content
TL;DR: This work studies membership inference attacks on deep learning models for 3D neuroimaging tasks, showing that it is possible to infer whether a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution.
Abstract: Ensuring the privacy of research participants is vital, even more so in healthcare environments. Deep learning approaches to neuroimaging require large datasets, and this often necessitates sharing data between multiple sites, which is antithetical to the privacy objectives. Federated learning is a commonly proposed solution to this problem. It circumvents the need for data sharing by sharing parameters during the training process. However, we demonstrate that allowing access to parameters may leak private information even if data is never directly shared. In particular, we show that it is possible to infer if a sample was used to train the model given only access to the model prediction (black-box) or access to the model itself (white-box) and some leaked samples from the training data distribution. Such attacks are commonly referred to as Membership Inference attacks. We show realistic Membership Inference attacks on deep learning models trained for 3D neuroimaging tasks in a centralized as well as decentralized setup. We demonstrate feasible attacks on brain age prediction models (deep learning models that predict a person's age from their brain MRI scan). We correctly identified whether an MRI scan was used in model training with a 60% to over 80% success rate depending on model complexity and security assumptions.