
Showing papers by "José Luis Ambite" published in 2019


Journal ArticleDOI
Genevieve L. Wojcik, Mariaelisa Graff, Katherine K. Nishimura, Ran Tao, Jeffrey Haessler, Christopher R. Gignoux, Heather M. Highland, Yesha Patel, Elena P. Sorokin, Christy L. Avery, Gillian M. Belbin, Stephanie A. Bien, Iona Cheng, Sinead Cullina, Chani J. Hodonsky, Yao Hu, Laura M. Huckins, Janina M. Jeff, Anne E. Justice, Jonathan M. Kocarnik, Unhee Lim, Bridget M. Lin, Yingchang Lu, Sarah C. Nelson, Sungshim L. Park, Hannah Poisner, Michael Preuss, Melissa A. Richard, Claudia Schurmann, Veronica Wendy Setiawan, Alexandra Sockell, Karan Vahi, Marie Verbanck, Abhishek Vishnu, Ryan W. Walker, Kristin L. Young, Niha Zubair, Victor Acuña-Alonso, José Luis Ambite, Kathleen C. Barnes, Eric Boerwinkle, Erwin P. Bottinger, Carlos Bustamante, Christian Caberto, Samuel Canizales-Quinteros, Matthew P. Conomos, Ewa Deelman, Ron Do, Kimberly F. Doheny, Lindsay Fernández-Rhodes, Myriam Fornage, Benyam Hailu, Gerardo Heiss, Brenna M. Henn, Lucia A. Hindorff, Rebecca D. Jackson, Cecelia A. Laurie, Cathy C. Laurie, Yuqing Li, Danyu Lin, Andrés Moreno-Estrada, Girish N. Nadkarni, Paul Norman, Loreall Pooler, Alexander P. Reiner, Jane Romm, Chiara Sabatti, Karla Sandoval, Xin Sheng, Eli A. Stahl, Daniel O. Stram, Timothy A. Thornton, Christina L. Wassel, Lynne R. Wilkens, Cheryl A. Winkler, Sachi Yoneyama, Steven Buyske, Christopher A. Haiman, Charles Kooperberg, Loic Le Marchand, Ruth J. F. Loos, Tara C. Matise, Kari E. North, Ulrike Peters, Eimear E. Kenny, Christopher S. Carlson
27 Jun 2019-Nature
TL;DR: The value of diverse, multi-ethnic participants in large-scale genomic studies is demonstrated and evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits for fine-mapping using diverse cohorts and insights into clinical implications are shown.
Abstract: Genome-wide association studies (GWAS) have laid the foundation for investigations into the biology of complex traits, drug development and clinical guidelines. However, the majority of discovery efforts are based on data from populations of European ancestry [1-3]. In light of the differential genetic architecture that is known to exist between populations, bias in representation can exacerbate existing disease and healthcare disparities. Critical variants may be missed if they have a low frequency or are completely absent in European populations, especially as the field shifts its attention towards rare variants, which are more likely to be population-specific [4-10]. Additionally, effect sizes and the risk prediction scores derived from them in one population may not accurately extrapolate to other populations [11,12]. Here we demonstrate the value of diverse, multi-ethnic participants in large-scale genomic studies. The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals. Using strategies tailored for analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, as well as replicate 1,444 GWAS catalogue associations across these traits. Our data show evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits for fine-mapping using diverse cohorts and insights into clinical implications. In the United States, where minority populations have a disproportionately higher burden of chronic conditions [13], the lack of representation of diverse populations in genetic research will result in inequitable access to precision medicine for those with the highest burden of disease.
We strongly advocate for continued, large genome-wide efforts in diverse populations to maximize genetic discovery and reduce health disparities.

591 citations


Journal ArticleDOI
TL;DR: An HAR framework adapted to variable duration activity bouts is created by detecting the change points of activity bouts in a multivariate time series and predicting activity for each homogeneous window defined by these change points.
Abstract: Background: Time-resolved quantification of physical activity can contribute to both personalized medicine and epidemiological research studies, for example, managing and identifying triggers of asthma exacerbations. A growing number of reportedly accurate machine learning algorithms for human activity recognition (HAR) have been developed using data from wearable devices (eg, smartwatch and smartphone). However, many HAR algorithms depend on fixed-size sampling windows that may poorly adapt to real-world conditions in which activity bouts are of unequal duration. A small sliding window can produce noisy predictions under stable conditions, whereas a large sliding window may miss brief bursts of intense activity. Objective: We aimed to create an HAR framework adapted to variable duration activity bouts by (1) detecting the change points of activity bouts in a multivariate time series and (2) predicting activity for each homogeneous window defined by these change points. Methods: We applied standard fixed-width sliding windows (4-6 different sizes) or greedy Gaussian segmentation (GGS) to identify break points in filtered triaxial accelerometer and gyroscope data. After standard feature engineering, we applied an XGBoost model to predict physical activity within each window and then converted windowed predictions to instantaneous predictions to facilitate comparison across segmentation methods. We applied these methods in 2 datasets: the human activity recognition using smartphones (HARuS) dataset, where a total of 30 adults performed activities of approximately equal duration (approximately 20 seconds each) while wearing a waist-worn smartphone, and the Biomedical REAl-Time Health Evaluation for Pediatric Asthma (BREATHE) dataset, where a total of 14 children performed 6 activities for approximately 10 min each while wearing a smartwatch.
To mimic a real-world scenario, we generated artificial unequal activity bout durations in the BREATHE data by randomly subdividing each activity bout into 10 segments and randomly concatenating the 60 activity bouts. Each dataset was divided into ~90% training and ~10% holdout testing. Results: In the HARuS data, GGS produced the least noisy predictions of 6 physical activities and had the second highest accuracy rate of 91.06% (the highest accuracy rate was 91.79% for the sliding window of size 0.8 second). In the BREATHE data, GGS again produced the least noisy predictions and had the highest accuracy rate of 79.4% of predictions for 6 physical activities. Conclusions: In a scenario with variable duration activity bouts, GGS multivariate segmentation produced smart-sized windows with more stable predictions and a higher accuracy rate than traditional fixed-size sliding window approaches. Overall, accuracy was good in both datasets but, as expected, it was slightly lower in the more real-world study using wrist-worn smartwatches in children (BREATHE) than in the more tightly controlled study using waist-worn smartphones in adults (HARuS). We implemented GGS in an offline setting, but it could be adapted for real-time prediction with streaming data.
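
The step of converting windowed predictions to instantaneous (per-sample) predictions can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses non-overlapping windows instead of sliding ones, the change point is supplied by hand rather than found by GGS, and a hypothetical `majority_label` oracle stands in for the trained classifier. All function names and the toy data are assumptions made for illustration.

```python
from collections import Counter

def fixed_windows(n_samples, width):
    """Return (start, end) index pairs for non-overlapping fixed-width windows."""
    return [(s, min(s + width, n_samples)) for s in range(0, n_samples, width)]

def windows_from_breakpoints(n_samples, breakpoints):
    """Return (start, end) pairs for the segments delimited by change points."""
    edges = [0] + sorted(b for b in breakpoints if 0 < b < n_samples) + [n_samples]
    return list(zip(edges[:-1], edges[1:]))

def to_instantaneous(windows, window_labels, n_samples):
    """Expand one predicted label per window into one label per sample."""
    labels = [None] * n_samples
    for (start, end), label in zip(windows, window_labels):
        for i in range(start, end):
            labels[i] = label
    return labels

def accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def majority_label(truth, start, end):
    """Hypothetical stand-in for a classifier: the most common true label in the window."""
    return Counter(truth[start:end]).most_common(1)[0][0]

# Toy signal (made up): 7 samples of walking followed by 3 samples of running.
truth = ["walk"] * 7 + ["run"] * 3

fixed = fixed_windows(len(truth), width=4)
fixed_pred = to_instantaneous(
    fixed, [majority_label(truth, s, e) for s, e in fixed], len(truth))

segmented = windows_from_breakpoints(len(truth), breakpoints=[7])
seg_pred = to_instantaneous(
    segmented, [majority_label(truth, s, e) for s, e in segmented], len(truth))

print(accuracy(fixed_pred, truth), accuracy(seg_pred, truth))
```

In this toy case the fixed window that straddles the activity change mislabels one sample, while windows aligned to the change point do not. Scoring both methods per sample is what makes segmentations with different window counts comparable.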

30 citations


Journal ArticleDOI
TL;DR: The authors' genetic study identified a novel association at NMT2 for CKD and showed for the first time strong associations of the APOL1 variants with ESKD across multi-ethnic populations.
Abstract: Background: Chronic kidney disease (CKD) is common and disproportionately burdens United States ethnic minorities. Its genetic determinants may differ by disease severity and clinical stage. To uncover genetic factors associated with CKD severity among high-risk ethnic groups, we performed genome-wide association studies (GWAS) in diverse populations within the Population Architecture using Genomics and Epidemiology (PAGE) study. Methods: We assembled multi-ethnic genome-wide imputed data on non-overlapping CKD cases [4,150 mild to moderate CKD, 1,105 end-stage kidney disease (ESKD)] and non-CKD controls for up to 41,041 PAGE participants (African Americans, Hispanics/Latinos, East Asians, Native Hawaiians, and American Indians). We implemented a generalized estimating equation approach for GWAS using ancestry-combined data while adjusting for age, sex, principal components, study, and ethnicity. Results: The GWAS identified a novel genome-wide significant locus for mild to moderate CKD near NMT2 (rs10906850, p = 3.7 × 10^-8) that replicated in the United Kingdom Biobank white British (p = 0.008). Several variants at the APOL1 locus were associated with ESKD, including the APOL1 G1 rs73885319 (p = 1.2 × 10^-9). There was no overlap among associated loci for CKD and ESKD traits, even at the previously reported APOL1 locus (p = 0.76 for CKD). Several additional loci were associated with CKD or ESKD at p-values below the genome-wide threshold. These loci were often driven by variants more common in non-European ancestries. Conclusion: Our genetic study identified a novel association at NMT2 for CKD and showed for the first time strong associations of the APOL1 variants with ESKD across multi-ethnic populations. Our findings suggest differences in genetic effects across CKD severity and provide information for the design of genetic studies of CKD in diverse populations.

27 citations


Posted ContentDOI
08 Sep 2019-bioRxiv
TL;DR: iLASH enables fast and accurate detection of IBD, an upstream step in applications of IBD for population genetics and trait mapping, making IBD estimation tractable for hundreds of thousands to millions of individuals.
Abstract: The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, IBD by LocAlity-Sensitive Hashing, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to the current leading method and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for hundreds of thousands to millions of individuals. We applied iLASH to the Population Architecture using Genomics and Epidemiology (PAGE) dataset of ∼52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, which identified IBD segments on a single machine in an hour (∼3 minutes per chromosome compared to over 6 days per chromosome for a state-of-the-art algorithm). iLASH is able to efficiently estimate IBD tracts in very large-scale datasets, as demonstrated via IBD estimation across the entire UK Biobank (∼500,000 individuals), detecting nearly 13 billion pairwise IBD tracts shared between ∼11% of participants. In summary, iLASH enables fast and accurate detection of IBD, an upstream step in applications of IBD for population genetics and trait mapping.
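
The abstract does not describe iLASH's internals beyond its use of locality-sensitive hashing, so the following is only a generic MinHash-with-banding sketch of why LSH avoids all-pairs comparison, not the iLASH implementation. The shingle encoding, the hash family, the toy haplotype strings, and every function name are assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, modulus for the hash family

def make_hashes(n_hashes, seed=0):
    """Draw n_hashes random linear hash functions h(t) = (a*t + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_hashes)]

def shingles(haplotype, k=3):
    """Encode each (position, k-allele pattern) of a 0/1 string as a unique integer."""
    return {i * (1 << k) + int(haplotype[i:i + k], 2)
            for i in range(len(haplotype) - k + 1)}

def minhash(tokens, hashes):
    """MinHash signature: the minimum hash value of the token set under each function."""
    return tuple(min((a * t + b) % PRIME for t in tokens) for a, b in hashes)

def candidate_pairs(signatures, bands, rows):
    """Report pairs whose signatures agree on all rows of at least one band."""
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(name)
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
    return pairs

# Toy haplotype windows (made up): A and B are identical, C is the complement of A.
haplotypes = {
    "A": "0110100111010010",
    "B": "0110100111010010",
    "C": "1001011000101101",
}
hashes = make_hashes(n_hashes=8)
signatures = {name: minhash(shingles(h), hashes) for name, h in haplotypes.items()}
print(candidate_pairs(signatures, bands=4, rows=2))
```

Only individuals that land in the same bucket are compared further, so the cost scales with the number of bucket collisions rather than with the number of all possible pairs. A real IBD tool would additionally have to handle genotyping error, phasing, and genetic-map distances, none of which this sketch attempts.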

24 citations


Proceedings ArticleDOI
13 May 2019
TL;DR: This work presents a new method for identifying prerequisite relations based on naturally occurring data, namely the navigation patterns of users on the Wikipedia online encyclopedia, and shows that the navigation network structure can be used to identify dependencies among concepts in several domains.
Abstract: The increased availability of online learning resources in the form of courses, videos, and tutorials has created new opportunities for independent learners, but it has also increased the difficulty of planning a course of study. Where should the learner start? What should the learner know before tackling a new course? Manually identifying these prerequisite relations between learning resources or concepts is expensive in terms of time and expertise, and it is particularly difficult to do so for new or rapidly changing areas of knowledge. To address this challenge, we present a new method for identifying prerequisite relations based on naturally occurring data, namely the navigation patterns of users on the Wikipedia online encyclopedia. Our supervised learning approach shows that the navigation network structure can be used to identify dependencies among concepts in several domains.

20 citations


Journal ArticleDOI
31 Dec 2019-PLOS ONE
TL;DR: A hypothesis-generating phenome-wide association study to identify and characterize cross-phenotype associations, where one SNP is associated with two or more phenotypes, between thousands of genetic variants assayed on the Metabochip and hundreds of phenotypes in 5,897 African Americans as part of the Population Architecture using Genomics and Epidemiology (PAGE) I study.
Abstract: We performed a hypothesis-generating phenome-wide association study (PheWAS) to identify and characterize cross-phenotype associations, where one SNP is associated with two or more phenotypes, between thousands of genetic variants assayed on the Metabochip and hundreds of phenotypes in 5,897 African Americans as part of the Population Architecture using Genomics and Epidemiology (PAGE) I study. The PAGE I study was a National Human Genome Research Institute-funded collaboration of four study sites accessing diverse epidemiologic studies genotyped on the Metabochip, a custom genotyping chip that has dense coverage of regions in the genome previously associated with cardio-metabolic traits and outcomes in mostly European-descent populations. Here we focus on identifying novel phenome-genome relationships, where SNPs are associated with more than one phenotype. To do this, we performed a PheWAS, testing each SNP on the Metabochip for an association with up to 273 phenotypes in the participating PAGE I study sites. We identified 133 putative pleiotropic variants, defined as SNPs associated at an empirically derived p-value threshold of p<0.01 in two or more PAGE study sites for two or more phenotype classes. We further annotated these PheWAS-identified variants using publicly available functional data and local genetic ancestry. Amongst our novel findings is SPARC rs4958487, associated with increased glucose levels and hypertension. SPARC has been implicated in the pathogenesis of diabetes and is also known to have a potential role in fibrosis, a common consequence of multiple conditions including hypertension. The SPARC example and others highlight the potential that PheWAS approaches have in improving our understanding of complex disease architecture by identifying novel relationships between genetic variants and an array of common human phenotypes.
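
The replication rule for putative pleiotropic variants can be written as a small filter. This is a sketch under one reading of the abstract: a phenotype class counts for a SNP only if it reaches p < 0.01 in two or more study sites, and a SNP is flagged when two or more classes count. The record layout, function name, SNP identifiers, and p-values below are all hypothetical.

```python
from collections import defaultdict

def putative_pleiotropic(records, p_threshold=0.01, min_sites=2, min_classes=2):
    """Flag SNPs replicated across sites for at least min_classes phenotype classes.

    records: iterable of (snp, site, phenotype_class, p_value) tuples.
    """
    # (snp, phenotype_class) -> set of sites where the association passes the threshold
    passing_sites = defaultdict(set)
    for snp, site, pheno_class, p_value in records:
        if p_value < p_threshold:
            passing_sites[(snp, pheno_class)].add(site)

    # snp -> set of phenotype classes replicated in enough sites
    replicated_classes = defaultdict(set)
    for (snp, pheno_class), sites in passing_sites.items():
        if len(sites) >= min_sites:
            replicated_classes[snp].add(pheno_class)

    return sorted(snp for snp, classes in replicated_classes.items()
                  if len(classes) >= min_classes)

# Made-up association results for two hypothetical SNPs.
records = [
    ("snp1", "site_a", "glucose", 0.004),
    ("snp1", "site_b", "glucose", 0.008),
    ("snp1", "site_a", "hypertension", 0.002),
    ("snp1", "site_c", "hypertension", 0.009),
    ("snp2", "site_a", "glucose", 0.001),
    ("snp2", "site_b", "glucose", 0.003),
    ("snp2", "site_a", "lipids", 0.005),   # only one site, so not replicated
    ("snp2", "site_b", "lipids", 0.200),
]
print(putative_pleiotropic(records))
```

Here `snp1` is flagged because both of its phenotype classes replicate in two sites, while `snp2` is not because its second class passes the threshold in a single site only.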

16 citations


Posted Content
TL;DR: A weakly-supervised data augmentation approach to improve Named Entity Recognition (NER) in a challenging domain: extracting biomedical entities from the scientific literature.
Abstract: We present a weakly-supervised data augmentation approach to improve Named Entity Recognition (NER) in a challenging domain: extracting biomedical entities (e.g., proteins) from the scientific literature. First, we train a neural NER (NNER) model over a small seed of fully-labeled examples. Second, we use a reference set of entity names (e.g., proteins in UniProt) to identify entity mentions with high precision, but low recall, on an unlabeled corpus. Third, we use the NNER model to assign weak labels to the corpus. Finally, we retrain our NNER model iteratively over the augmented training set, including the seed, the reference-set examples, and the weakly-labeled examples, which improves model performance. We show empirically that this augmented bootstrapping process significantly improves NER performance, and discuss the factors impacting the efficacy of the approach.
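
The four steps above can be sketched as a data-flow skeleton. The `train` function here is a toy stand-in that merely memorises entity tokens; it is not a neural tagger and does not reproduce the authors' NNER model. The label scheme, function names, and example sentences are assumptions for illustration.

```python
def dictionary_labels(tokens, reference_set):
    """High-precision, low-recall labels: exact match against a reference set of names."""
    return ["ENT" if token in reference_set else "O" for token in tokens]

def train(examples):
    """Toy stand-in for a neural tagger: memorise every token ever labelled ENT."""
    vocab = {token
             for tokens, labels in examples
             for token, label in zip(tokens, labels)
             if label == "ENT"}
    return lambda tokens: ["ENT" if token in vocab else "O" for token in tokens]

def bootstrap(seed, unlabeled, reference_set, rounds=2):
    """Seed training, reference-set matching, weak labelling, then iterative retraining."""
    model = train(seed)                                        # step 1
    reference_examples = [(s, dictionary_labels(s, reference_set))
                          for s in unlabeled]                  # step 2
    for _ in range(rounds):
        weak_examples = [(s, model(s)) for s in unlabeled]     # step 3
        model = train(seed + reference_examples + weak_examples)  # step 4
    return model

# Made-up data: one fully labelled seed sentence, one unlabelled sentence.
seed = [(["BRCA1", "binds", "DNA"], ["ENT", "O", "O"])]
unlabeled = [["TP53", "regulates", "BRCA1"]]
reference_set = {"TP53"}

model = bootstrap(seed, unlabeled, reference_set)
print(model(["TP53", "and", "BRCA1"]))
```

After bootstrapping, the toy model tags both `TP53` (learned from the reference set) and `BRCA1` (learned from the seed). A real tagger would also have to reconcile conflicting labels for the same token across the three sources, which this sketch sidesteps.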

16 citations


Book ChapterDOI
16 Sep 2019
TL;DR: A general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations, which is used for fast mapping of new entities to large reference sets, and empirically shows the effectiveness of the framework in challenging bio-entity normalization datasets.
Abstract: Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. However, entity normalization is a challenging problem; there often are many textual forms for a canonical entity that may not be captured in the reference set, and entities mentioned in text may include many syntactic variations, or errors. The problem is particularly acute in scientific domains, such as biology. To address this problem, we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.
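
The mapping step, in which a noisy mention is embedded and matched to its nearest reference entity, can be illustrated as follows. Character-trigram count vectors are swapped in here for the learned Siamese embeddings, which the abstract does not specify in detail; the function names and the three-entry reference set are hypothetical.

```python
import math
from collections import Counter

def trigram_vector(text):
    """Stand-in embedding: counts of lower-cased character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def normalize(mention, reference_set):
    """Map a mention to the reference entity with the most similar vector."""
    query = trigram_vector(mention)
    return max(reference_set, key=lambda name: cosine(query, trigram_vector(name)))

# Made-up reference set of canonical names.
reference_set = ["interleukin 2", "insulin", "tumor necrosis factor"]
print(normalize("Interleukin-2", reference_set))
```

Surface-form vectors like these only capture syntactic variation. A trained embedding model would be needed for synonyms that share no characters with the canonical name, and an approximate nearest-neighbour index would replace the linear scan for large reference sets.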

15 citations


Posted ContentDOI
24 Sep 2019-bioRxiv
TL;DR: In this article, a framework for repurposing data from Electronic Health Records (EHRs) in concert with genomic data to explore enrichment of disease within sub-populations was proposed.
Abstract: Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens specific to sub-populations. Here we propose a framework for repurposing data from Electronic Health Records (EHRs) in concert with genomic data to explore enrichment of disease within sub-populations. Using data from a diverse biobank in New York City, we genetically identified 17 sub-populations, and noted the presence of genetic founder effects in 7. By then linking community membership to the EHR, we were able to identify over 600 health outcomes that were statistically enriched within a specific population, with many representing known associations, and many others being novel. This work reinforces the utility of linking genomic data to EHRs, and provides a framework towards fine-scale monitoring of population health.

15 citations


Journal ArticleDOI
17 Jul 2019
TL;DR: The construction of ERuDIte, the Educational Resource Discovery Index for Data Science, and its release as linked data are described, which are hoped to provide a framework to foster open linked educational resources on the Web.
Abstract: The availability of massive datasets in genetics, neuroimaging, mobile health, and other subfields of biology and medicine promises new insights but also poses significant challenges. To realize the potential of big data in biomedicine, the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, funding several centers of excellence in biomedical data analysis and a Training Coordinating Center (TCC) tasked with facilitating online and in-person training of biomedical researchers in data science. A major initiative of the BD2K TCC is to automatically identify, describe, and organize data science training resources available on the Web and provide personalized training paths for users. In this paper, we describe the construction of ERuDIte, the Educational Resource Discovery Index for Data Science, and its release as linked data. ERuDIte contains over 11,000 training resources including courses, video tutorials, conference talks, and other materials. The metadata for these resources is described uniformly using Schema.org. We use machine learning techniques to tag each resource with concepts from the Data Science Education Ontology, which we developed to further describe resource content. Finally, we map references to people and organizations in learning resources to entities in DBpedia, DBLP, and ORCID, embedding our collection in the web of linked data. We hope that ERuDIte will provide a framework to foster open linked educational resources on the Web.

5 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: A vision of a general machine learning framework for explainable predictive analytics for location-dependent time-series data is presented that will enable fine spatial and temporal scale environmental exposure assessment and allow researchers to carry out unprecedented inquiries, such as understanding relationships between health outcomes and long-term air pollution exposures.
Abstract: There are increasing numbers of online sources of real-time and historical location-dependent time-series data describing various types of environmental phenomena, e.g., traffic conditions and air quality levels. When coupled with the information that characterizes the natural and built environments, these location-dependent time-series data can help better understand interactions between and within human social systems and the ecosystem. Nevertheless, these data are still limited by their spatial and temporal resolution for downstream use (e.g., generating residential-level environmental exposures for human health studies). In this paper, we present a vision of a general machine learning framework for explainable predictive analytics for location-dependent time-series data. The framework will effectively deal with data- and model-related challenges for general scientific predictive analytics on spatiotemporal environmental phenomena. The challenges include how to identify the main features driving the phenomena, how to handle complex spatiotemporal variations in the phenomena, and how to utilize sparse ground truth measurements for training and validation. The resulting framework will enable fine spatial and temporal scale environmental exposure assessment and allow researchers to carry out unprecedented inquiries, such as understanding relationships between health outcomes and long-term air pollution exposures.

Journal ArticleDOI
TL;DR: A two-day meeting between the BD2K Training Coordinating Center (TCC), ELIXIR Training/TeSS, GOBLET, H3ABioNet, EMBL-ABR, bioCADDIE and the CSIRO, in Huntington Beach, California, to compare and contrast their respective activities, and how these might be leveraged for wider impact on an international scale.
Abstract: The increasing richness and diversity of biomedical data types creates major organizational and analytical impediments to rapid translational impact in the context of training and education. As biomedical datasets increase in size, variety and complexity, they challenge conventional methods for sharing, managing and analyzing those data. In May 2017, we convened a two-day meeting between the BD2K Training Coordinating Center (TCC), ELIXIR Training/TeSS, GOBLET, H3ABioNet, EMBL-ABR, bioCADDIE and the CSIRO, in Huntington Beach, California, to compare and contrast our respective activities, and how these might be leveraged for wider impact on an international scale. Discussions focused on i) the role of training for biomedical data science; ii) the need to promote core competencies; and iii) the development of career paths. These led to specific conversations about i) the value of standardizing and sharing data science training resources; ii) challenges in encouraging adoption of training material standards; iii) strategies and best practices for the personalization and customization of learning experiences; iv) processes of identifying stakeholders and determining how they should be accommodated; and v) joint partnerships to lead the world on data science training in ways that benefit all stakeholders. Generally, international cooperation was viewed as essential for accommodating the widest possible participation in the modern bioscience enterprise, providing skills in a truly “FAIR” manner, and addressing the importance of data science understanding worldwide. Several recommendations for the exchange of educational frameworks are made, along with potential sources for support, and plans for further cooperative efforts are presented.