
Showing papers by "Carlos Bustamante" published in 2019


Journal ArticleDOI
Genevieve L. Wojcik1, Mariaelisa Graff2, Katherine K. Nishimura3, Ran Tao4, Jeffrey Haessler3, Christopher R. Gignoux1, Christopher R. Gignoux5, Heather M. Highland2, Yesha Patel6, Elena P. Sorokin1, Christy L. Avery2, Gillian M. Belbin7, Stephanie A. Bien3, Iona Cheng8, Sinead Cullina7, Chani J. Hodonsky2, Yao Hu3, Laura M. Huckins7, Janina M. Jeff7, Anne E. Justice2, Jonathan M. Kocarnik3, Unhee Lim9, Bridget M Lin2, Yingchang Lu7, Sarah C. Nelson10, Sungshim L. Park6, Hannah Poisner7, Michael Preuss7, Melissa A. Richard11, Claudia Schurmann7, Claudia Schurmann12, Veronica Wendy Setiawan6, Alexandra Sockell1, Karan Vahi6, Marie Verbanck7, Abhishek Vishnu7, Ryan W. Walker7, Kristin L. Young2, Niha Zubair3, Victor Acuña-Alonso, José Luis Ambite6, Kathleen C. Barnes5, Eric Boerwinkle11, Erwin P. Bottinger7, Erwin P. Bottinger12, Carlos Bustamante1, Christian Caberto9, Samuel Canizales-Quinteros, Matthew P. Conomos10, Ewa Deelman6, Ron Do7, Kimberly F. Doheny13, Lindsay Fernández-Rhodes14, Lindsay Fernández-Rhodes2, Myriam Fornage11, Benyam Hailu15, Gerardo Heiss2, Brenna M. Henn16, Lucia A. Hindorff15, Rebecca D. Jackson17, Cecelia A. Laurie10, Cathy C. Laurie10, Yuqing Li18, Yuqing Li8, Danyu Lin2, Andrés Moreno-Estrada, Girish N. Nadkarni7, Paul Norman5, Loreall Pooler6, Alexander P. Reiner10, Jane Romm13, Chiara Sabatti1, Karla Sandoval, Xin Sheng6, Eli A. Stahl7, Daniel O. Stram6, Timothy A. Thornton10, Christina L. Wassel19, Lynne R. Wilkens9, Cheryl A. Winkler, Sachi Yoneyama2, Steven Buyske20, Christopher A. Haiman6, Charles Kooperberg3, Loic Le Marchand9, Ruth J. F. Loos7, Tara C. Matise20, Kari E. North2, Ulrike Peters3, Eimear E. Kenny7, Christopher S. Carlson3 
27 Jun 2019-Nature
TL;DR: The value of diverse, multi-ethnic participants in large-scale genomic studies is demonstrated and evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits for fine-mapping using diverse cohorts and insights into clinical implications are shown.
Abstract: Genome-wide association studies (GWAS) have laid the foundation for investigations into the biology of complex traits, drug development and clinical guidelines. However, the majority of discovery efforts are based on data from populations of European ancestry [1-3]. In light of the differential genetic architecture that is known to exist between populations, bias in representation can exacerbate existing disease and healthcare disparities. Critical variants may be missed if they have a low frequency or are completely absent in European populations, especially as the field shifts its attention towards rare variants, which are more likely to be population-specific [4-10]. Additionally, effect sizes and the risk prediction scores derived from them in one population may not accurately extrapolate to other populations [11,12]. Here we demonstrate the value of diverse, multi-ethnic participants in large-scale genomic studies. The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals. Using strategies tailored for analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, as well as replicate 1,444 GWAS catalogue associations across these traits. Our data show evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits for fine-mapping using diverse cohorts and insights into clinical implications. In the United States, where minority populations have a disproportionately higher burden of chronic conditions [13], the lack of representation of diverse populations in genetic research will result in inequitable access to precision medicine for those with the highest burden of disease.
We strongly advocate for continued, large genome-wide efforts in diverse populations to maximize genetic discovery and reduce health disparities.

591 citations


Journal ArticleDOI
TL;DR: Nine groups are presented that meet a key need in pharmacogenetics research by enabling consistent communication of the scale of variability in global allele frequencies and are now used by Pharmacogenomics Knowledgebase (PharmGKB).
Abstract: The varying frequencies of pharmacogenetic alleles among populations have important implications for the impact of these alleles in different populations. Current population grouping methods to communicate these patterns are insufficient as they are inconsistent and fail to reflect the global distribution of genetic variability. To facilitate and standardize the reporting of variability in pharmacogenetic allele frequencies, we present seven geographically defined groups: American, Central/South Asian, East Asian, European, Near Eastern, Oceanian, and Sub-Saharan African, and two admixed groups: African American/Afro-Caribbean and Latino. These nine groups are defined by global autosomal genetic structure and based on data from large-scale sequencing initiatives. We recognize that broadly grouping global populations is an oversimplification of human diversity and does not capture complex social and cultural identity. However, these groups meet a key need in pharmacogenetics research by enabling consistent communication of the scale of variability in global allele frequencies and are now used by Pharmacogenomics Knowledgebase (PharmGKB).

74 citations


Journal ArticleDOI
TL;DR: Water-miscible ethylene glycol ethers are found to modify structure, dynamics, and reactivity of DNA by mechanisms possibly related to a biologically relevant hydrophobic catalysis and it is proposed that a modulated chemical potential of water can promote “longitudinal breathing” and the formation of unstacked holes while base unpairing is suppressed.
Abstract: Hydrophobic base stacking is a major contributor to DNA double-helix stability. We report the discovery of specific unstacking effects in certain semihydrophobic environments. Water-miscible ethylene glycol ethers are found to modify structure, dynamics, and reactivity of DNA by mechanisms possibly related to a biologically relevant hydrophobic catalysis. Spectroscopic data and optical tweezers experiments show that base-stacking energies are reduced while base-pair hydrogen bonds are strengthened. We propose that a modulated chemical potential of water can promote “longitudinal breathing” and the formation of unstacked holes while base unpairing is suppressed. Flow linear dichroism in 20% diglyme indicates a 20 to 30% decrease in persistence length of DNA, supported by an increased flexibility in single-molecule nanochannel experiments in poly(ethylene glycol). A limited (3 to 6%) hyperchromicity but unaffected circular dichroism is consistent with transient unstacking events while maintaining an overall average B-DNA conformation. Further information about unstacking dynamics is obtained from the binding kinetics of large thread-intercalating ruthenium complexes, indicating that the hydrophobic effect provides a 10 to 100 times increased DNA unstacking frequency and an “open hole” population on the order of 10⁻² compared to 10⁻⁴ in normal aqueous solution. Spontaneous DNA strand exchange catalyzed by poly(ethylene glycol) leads us to propose that hydrophobic residues in the L2 loop of recombination enzymes RecA and Rad51 may assist gene recombination via modulation of water activity near the DNA helix by hydrophobic interactions, in the manner described here. We speculate that such hydrophobic interactions may have catalytic roles also in other biological contexts, such as in polymerases.

73 citations


Journal ArticleDOI
TL;DR: A genome-wide meta-analysis from the Consortium on Asthma among African Ancestry Populations (CAAPA) finds strong evidence for association at four previously reported asthma loci and identifies two potentially African ancestry-specific loci that may be specific to asthma risk in African ancestry populations.
Abstract: Asthma is a complex disease with striking disparities across racial and ethnic groups. Despite its relatively high burden, representation of individuals of African ancestry in asthma genome-wide association studies (GWAS) has been inadequate, and true associations in these underrepresented minority groups have been inconclusive. We report the results of a genome-wide meta-analysis from the Consortium on Asthma among African Ancestry Populations (CAAPA; 7009 asthma cases, 7645 controls). We find strong evidence for association at four previously reported asthma loci whose discovery was driven largely by non-African populations, including the chromosome 17q12-q21 locus and the chr12q13 region, a novel (and not previously replicated) asthma locus recently identified by the Trans-National Asthma Genetic Consortium (TAGC). An additional seven loci reported by TAGC show marginal evidence for association in CAAPA. We also identify two novel loci (8p23 and 8q24) that may be specific to asthma risk in African ancestry populations.

63 citations


Journal ArticleDOI
31 Jul 2019-eLife
TL;DR: Topographic and transcriptional maps of canonical, H2A.Z, and monoubiquitinated H2B (uH2B) nucleosomes are obtained at near base-pair resolution and accuracy and suggest a mechanism for selective control of gene expression.
Abstract: Nucleosomes represent mechanical and energetic barriers that RNA Polymerase II (Pol II) must overcome during transcription. A high-resolution description of the barrier topography, its modulation by epigenetic modifications, and their effects on Pol II nucleosome crossing dynamics, is still missing. Here, we obtain topographic and transcriptional (Pol II residence time) maps of canonical, H2A.Z, and monoubiquitinated H2B (uH2B) nucleosomes at near base-pair resolution and accuracy. Pol II crossing dynamics are complex, displaying pauses at specific loci, backtracking, and nucleosome hopping between wrapped states. While H2A.Z widens the barrier, uH2B heightens it, and both modifications greatly lengthen Pol II crossing time. Using the dwell times of Pol II at each nucleosomal position we extract the energetics of the barrier. The orthogonal barrier modifications of H2A.Z and uH2B, and their effects on Pol II dynamics rationalize their observed enrichment in +1 nucleosomes and suggest a mechanism for selective control of gene expression.

56 citations


Journal ArticleDOI
Li Qigang, Keyan Zhao, Carlos Bustamante1, Xin Ma1, Wing Hung Wong1 
TL;DR: The Xrare model is learned from a large database of clinical variants, and derives its strength from the tight integration of medical genetics features and phenotypic features similarity scores.

53 citations


Journal ArticleDOI
TL;DR: It is found that hairpin opening occurs during EF-G-catalyzed translocation and is driven by the forward rotation of the small subunit head, and that ribosomes occasionally open the hairpin in two successive sub-codon steps, revealing a previously unobserved translocation intermediate.

49 citations


Journal ArticleDOI
20 Mar 2019-PLOS ONE
TL;DR: This phylogeographic analysis of Canarian ancient mitogenomes, the first of its kind, shows that some lineages are restricted to Central North Africa, while others have a wider distribution, including both West and Central North Africa, and, in some cases, Europe and the Near East.
Abstract: The Canary Islands’ indigenous people have been the subject of substantial archaeological, anthropological, linguistic and genetic research pointing to a most probable North African Berber source. However, neither agreement about the exact point of origin nor a model for the indigenous colonization of the islands has been established. To shed light on these questions, we analyzed 48 ancient mitogenomes from 25 archaeological sites from the seven main islands. Most lineages observed in the ancient samples have a Mediterranean distribution, and belong to lineages associated with the Neolithic expansion in the Near East and Europe (T2c, J2a, X3a…). This phylogeographic analysis of Canarian ancient mitogenomes, the first of its kind, shows that some lineages are restricted to Central North Africa (H1cf, J2a2d and T2c1d3), while others have a wider distribution, including both West and Central North Africa, and, in some cases, Europe and the Near East (U6a1a1, U6a7a1, U6b, X3a, U6c1). In addition, we identify four new Canarian-specific lineages (H1e1a9, H4a1e, J2a2d1a and L3b1a12) whose coalescence dates correlate with the estimated time for the colonization of the islands (1st millennium CE). Additionally, we observe an asymmetrical distribution of mtDNA haplogroups in the ancient population, with certain haplogroups appearing more frequently in the islands closer to the continent. This reinforces results based on modern mtDNA and Y-chromosome data, and archaeological evidence suggesting the existence of two distinct migrations. Comparisons between insular populations show that some populations had high genetic diversity, while others were probably affected by genetic drift and/or bottlenecks. Although we observe interinsular differences in the survival of indigenous lineages, modern populations, with the sole exception of La Gomera, are homogenous across the islands, supporting the theory of extensive human mobility after the European conquest.

46 citations


Journal ArticleDOI
TL;DR: The utility of a theoretical framework, recently formulated in which a generalized friction coefficient quantifies the energetic efficiency in nonequilibrium processes, is demonstrated by rapidly unfolding and folding single DNA hairpins by designing efficient driving processes (“protocols”).
Abstract: Cells must operate far from equilibrium, utilizing and dissipating energy continuously to maintain their organization and to avoid stasis and death. However, they must also avoid unnecessary waste of energy. Recent studies have revealed that molecular machines are extremely efficient thermodynamically compared with their macroscopic counterparts. However, the principles governing the efficient out-of-equilibrium operation of molecular machines remain a mystery. A theoretical framework has been recently formulated in which a generalized friction coefficient quantifies the energetic efficiency in nonequilibrium processes. Moreover, it posits that, to minimize energy dissipation, external control should drive the system along the reaction coordinate with a speed inversely proportional to the square root of that friction coefficient. Here, we demonstrate the utility of this theory for designing and understanding energetically efficient nonequilibrium processes through the unfolding and folding of single DNA hairpins.
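The speed rule stated in this abstract (drive the control parameter at a speed inversely proportional to the square root of the generalized friction coefficient) can be illustrated with a minimal Python sketch. The function name, the discretized friction values, and the trapezoidal integration are illustrative assumptions, not the authors' implementation or data.

```python
import math


def optimal_protocol_times(lambdas, friction, total_time):
    """Return the time at which each control value should be reached.

    Assumption: speed along the control coordinate is proportional to
    friction ** -0.5, so the time spent on each step is proportional to
    sqrt(friction) times the step length (trapezoidal rule).
    """
    if len(lambdas) != len(friction) or len(lambdas) < 2:
        raise ValueError("need two or more matching grid points")
    if any(z <= 0 for z in friction):
        raise ValueError("friction values must be positive")

    weights = [math.sqrt(z) for z in friction]
    cumulative = [0.0]
    for i in range(1, len(lambdas)):
        step = abs(lambdas[i] - lambdas[i - 1])
        cumulative.append(cumulative[-1] + 0.5 * (weights[i] + weights[i - 1]) * step)

    # Rescale so the whole protocol lasts exactly total_time.
    scale = total_time / cumulative[-1]
    return [c * scale for c in cumulative]


# Hypothetical friction profile: high friction early, low friction late.
times = optimal_protocol_times([0, 1, 2, 3], [4.0, 4.0, 1.0, 1.0], total_time=4.5)
```

With this made-up profile the schedule spends more time on the high-friction first step than on the low-friction last step, which is the qualitative behavior the theory describes.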

37 citations


Journal ArticleDOI
TL;DR: It is found that genes involved in heart and bone development and immune responses are enriched in both selection signals and local hunter-gatherer ancestry in admixed populations, suggesting that selection has maintained adaptive variation in the face of recent gene flow from farmers.

37 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compare the folding kinetics of a ribosome-bound, multi-domain calcium-binding protein stalled at different points in translation with the nascent chain as it is being synthesized in real time, via optical tweezers.
Abstract: Protein folding can begin co-translationally. Due to the difference in timescale between folding and synthesis, co-translational folding is thought to occur at equilibrium for fast-folding domains. In this scenario, the folding kinetics of stalled ribosome-bound nascent chains should match the folding of nascent chains in real time. To test whether this assumption is true, we compare the folding of a ribosome-bound, multi-domain calcium-binding protein stalled at different points in translation with the nascent chain as it is being synthesized in real time, via optical tweezers. On stalled ribosomes, a misfolded state forms rapidly (1.5 s). However, during translation, this state is only attained after a long delay (63 s), indicating that, unexpectedly, the growing polypeptide is not equilibrated with its ensemble of accessible conformations. Slow equilibration on the ribosome can delay premature folding until adequate sequence is available and/or allow time for chaperone binding, thus promoting productive folding.

Journal ArticleDOI
TL;DR: Ancestry at 18q21 was significantly associated with asthma in Latinos and implicated multiple ancestry-informative noncoding variants upstream of SMAD2 with asthma susceptibility, and decreased SMAD2 expression in blood was strongly associated with increased asthma risk and increased exacerbations.
Abstract: Background Asthma is a common but complex disease with racial/ethnic differences in prevalence, morbidity, and response to therapies. Objective We sought to perform an analysis of genetic ancestry to identify new loci that contribute to asthma susceptibility. Methods We leveraged the mixed ancestry of 3902 Latinos and performed an admixture mapping meta-analysis for asthma susceptibility. We replicated associations in an independent study of 3774 Latinos, performed targeted sequencing for fine mapping, and tested for disease correlations with gene expression in the whole blood of more than 500 subjects from 3 racial/ethnic groups. Results We identified a genome-wide significant admixture mapping peak at 18q21 in Latinos (P = 6.8 × 10⁻⁶), where Native American ancestry was associated with increased risk of asthma (odds ratio [OR], 1.20; 95% CI, 1.07-1.34; P = .002) and European ancestry was associated with protection (OR, 0.86; 95% CI, 0.77-0.96; P = .008). Our findings were replicated in an independent childhood asthma study in Latinos (P = 5.3 × 10⁻³, combined P = 2.6 × 10⁻⁷). Fine mapping of 18q21 in 1978 Latinos identified a significant association with multiple variants 5′ of SMAD family member 2 (SMAD2) in Mexicans, whereas a single rare variant in the same window was the top association in Puerto Ricans. Low versus high SMAD2 blood expression was correlated with case status (13.4% lower expression; OR, 3.93; 95% CI, 2.12-7.28; P …). Conclusion Ancestry at 18q21 was significantly associated with asthma in Latinos and implicated multiple ancestry-informative noncoding variants upstream of SMAD2 with asthma susceptibility. Furthermore, decreased SMAD2 expression in blood was strongly associated with increased asthma risk and increased exacerbations.

Journal ArticleDOI
TL;DR: A bioinspired, catassembly-like isothermal chain-growth approach to programmably copolymerize DNA hairpin tiles into 1D nanofilaments with desirable composition, chain length and function is devised.
Abstract: Formation of biological filaments via intracellular supramolecular polymerization of proteins or protein/nucleic acid complexes is under programmable and spatiotemporal control to maintain cellular and genomic integrity. Here we devise a bioinspired, catassembly-like isothermal chain-growth approach to copolymerize DNA hairpin tiles (DHTs) into nanofilaments with desirable composition, chain length and function. By designing metastable DNA hairpins with shape-defining intramolecular hydrogen bonds, we generate two types of DHT monomers for copolymerization with high cooperativity and low dispersity indexes. Quantitative single-molecule dissection methods reveal that catalytic opening of a DHT motif harbouring a toehold triggers successive branch migration, which autonomously propagates to form copolymers with alternate tile units. We find that these shape-defined supramolecular nanostructures become substrates for efficient endocytosis by living mammalian cells in a stiffness-dependent manner. Hence, this catassembly-like in-vitro reconstruction approach provides clues for understanding structure-function relationship of biological filaments under physiological and pathological conditions.

Journal ArticleDOI
20 Dec 2019
TL;DR: The power of applying genetic mapping to hibernation is highlighted and new insight is presented into genetics driving its onset, as well as high heritability for hibernation onset.
Abstract: Hibernation in sciurid rodents is a dynamic phenotype timed by a circannual clock. When housed in an animal facility, 13-lined ground squirrels exhibit variation in seasonal onset of hibernation, which is not explained by environmental or biological factors. We hypothesized that genetic factors instead drive variation in timing. After increasing genome contiguity, here, we employ a genotype-by-sequencing approach to characterize genetic variation in 153 ground squirrels. Combined with datalogger records (n = 72), we estimate high heritability (61–100%) for hibernation onset. Applying a genome-wide scan with 46,996 variants, we identify 2 loci significantly (p < 7.14 × 10⁻⁶), and 12 loci suggestively (p < 2.13 × 10⁻⁴), associated with onset. At the most significant locus, whole-genome resequencing reveals a putative causal variant in the promoter of FAM204A. Expression quantitative trait loci (eQTL) analyses further reveal gene associations for 8/14 loci. Our results highlight the power of applying genetic mapping to hibernation and present new insight into genetics driving its onset. Katherine Grabek et al. use genotype-by-sequencing to characterize genetic variation in 153 ground squirrels, finding high heritability for hibernation onset. They find 14 loci associated with hibernation onset, and a putative causal variant in the promoter of FAM204A.

Posted ContentDOI
30 Jan 2019-bioRxiv
TL;DR: The coding fraction of the genome is targeted and its full site frequency spectrum is characterized by sequencing 76 exomes from five indigenous populations across Mexico to find BCL2L13 and KBTBD8 genes as potential candidates for adaptive evolution in Rarámuris and Triquis, respectively.
Abstract: Native American genetic variation remains underrepresented in most catalogs of human genome sequencing data. Previous genotyping efforts have revealed that Mexico’s indigenous population is highly differentiated and substructured, thus potentially harboring higher proportions of private genetic variants of functional and biomedical relevance. Here we have targeted the coding fraction of the genome and characterized its full site frequency spectrum by sequencing 76 exomes from five indigenous populations across Mexico. Using diffusion approximations, we modeled the demographic history of indigenous populations from Mexico with northern and southern ethnic groups splitting 7.2 kya and subsequently diverging locally 6.5 kya and 5.7 kya, respectively. Selection scans for positive selection revealed BCL2L13 and KBTBD8 genes as potential candidates for adaptive evolution in Raramuris and Triquis, respectively. BCL2L13 is highly expressed in skeletal muscle and could be related to physical endurance, a well-known phenotype of the northern Mexico Raramuri. The KBTBD8 gene has been associated with idiopathic short stature and we found it to be highly differentiated in Triqui, a southern indigenous group from Oaxaca whose height is extremely low compared to other native populations.
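As a rough illustration of the site frequency spectrum mentioned in this abstract, the following Python sketch tallies how many variable sites carry each derived-allele count. The function name and the toy input are hypothetical; the study's own pipeline and its diffusion-approximation modelling are not reproduced here.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry k-1 is the number of sites whose derived
    allele appears on exactly k of the n sampled chromosomes.

    Sites with count 0 or n are not polymorphic in the sample and are
    excluded from the spectrum.
    """
    tally = Counter(derived_counts)
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]


# Toy example with 4 sampled chromosomes (made-up counts, one per site).
sfs = site_frequency_spectrum([1, 1, 2, 3, 0, 4], n_chromosomes=4)
```

For a sample of 76 diploid exomes analyzed jointly, `n_chromosomes` would be 152 for autosomal sites; a per-population spectrum would use only that population's chromosomes.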

Posted ContentDOI
12 Sep 2019-bioRxiv
TL;DR: Three mtDNA haplotypes from pre-contact Puerto Rico persist among Puerto Ricans and other Caribbean islanders, indicating that present-day populations are reservoirs of pre-contact mtDNA diversity.
Abstract: Indigenous peoples have occupied the island of Puerto Rico since at least 3000 B.C. Due to the demographic shifts that occurred after European contact, the origin(s) of these ancient populations, and their genetic relationship to present-day islanders, are unclear. We use ancient DNA to characterize the population history and genetic legacies of pre-contact Indigenous communities from Puerto Rico. Bone, tooth and dental calculus samples were collected from 124 individuals from three pre-contact archaeological sites: Tibes, Punta Candelero and Paso del Indio. Despite poor DNA preservation, we used target enrichment and high-throughput sequencing to obtain complete mitochondrial genomes (mtDNA) from 45 individuals and autosomal genotypes from two individuals. We found a high proportion of Native American mtDNA haplogroups A2 and C1 in the pre-contact Puerto Rico sample (40% and 44%, respectively). This distribution, as well as the haplotypes represented, support a primarily Amazonian South American origin for these populations, and mirrors the Native American mtDNA diversity patterns found in present-day islanders. Three mtDNA haplotypes from pre-contact Puerto Rico persist among Puerto Ricans and other Caribbean islanders, indicating that present-day populations are reservoirs of pre-contact mtDNA diversity. Lastly, we find similarity in autosomal ancestry patterns between pre-contact individuals from Puerto Rico and the Bahamas, suggesting a shared component of Indigenous Caribbean ancestry with close affinity to South American populations. Our findings contribute to a more complete reconstruction of pre-contact Caribbean population history and explore the role of Indigenous peoples in shaping the biocultural diversity of present-day Puerto Ricans and other Caribbean islanders.

Journal ArticleDOI
TL;DR: While genetic counseling has expanded globally, Mexico has not adopted it as a separate profession and understanding the current genetic counseling landscape in Mexico is crucial to improving healthcare outcomes.
Abstract: Background While genetic counseling has expanded globally, Mexico has not adopted it as a separate profession. Given the rapid expansion of genetic and genomic services, understanding the current genetic counseling landscape in Mexico is crucial to improving healthcare outcomes. Methods Our needs assessment strategy has two components. First, we gathered quantitative data about genetics education and medical geneticists' geographic distribution through an exhaustive compilation of available information across several medical schools and public databases. Second, we conducted semi-structured interviews of 19 key-informants from 10 Mexican states remotely with digital recording and transcription. Results Across 32 states, ~54% of enrolled medical students receive no medical genetics training, and only Mexico City averages at least one medical geneticist per 100,000 people. Barriers to genetic counseling services include: geographic distribution of medical geneticists, lack of access to diagnostic tools, patient health literacy and cultural beliefs, and education in medical genetics/genetic counseling. Participants reported generally positive attitudes towards a genetic counseling profession; concerns regarding a current shortage of available jobs for medical geneticists persisted. Conclusion To create a foundation that can support a genetic counseling profession in Mexico, the clinical significance of medical genetics must be promoted nationwide. Potential approaches include: requiring medical genetics coursework, developing community genetics services, and increasing jobs for medical geneticists.

Posted Content
TL;DR: This work presents a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms and evaluates the quality of the generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.
Abstract: Local ancestry inference (LAI) allows identification of the ancestry of all chromosomal segments in admixed individuals, and it is a critical step in the analysis of human genomes with applications from pharmacogenomics and precision medicine to genome-wide association studies. In recent years, many LAI techniques have been developed in both industry and academic research. However, these methods require large training data sets of human genomic sequences from the ancestries of interest. Such reference data sets are usually limited, proprietary, protected by privacy restrictions, or otherwise not accessible to the public. Techniques to generate training samples that resemble real haploid sequences from ancestries of interest can be useful tools in such scenarios, since a generalized model can often be shared, but the unique human sample sequences cannot. In this work we present a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms. We evaluate the quality of our generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.

Journal ArticleDOI
TL;DR: A high-throughput chromosome conformation capture approach for FFPE samples that is similar in concept to Hi-C and will enable detailed resolution of global genome rearrangement events during cancer progression from FFPEs and will inform the development of targeted molecular diagnostic assays for patient care.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This work proposes the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by specific evidence types used by curators to assess for pathogenicity, and uses semi-supervised deep learning to predict the type of evidence provided by each paper.
Abstract: As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetics data. A major rate-limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. What makes curation particularly time-consuming is that the curator needs to identify papers that study variant pathogenicity using different types of approaches and evidence, e.g. biochemical assays or case-control analysis. In collaboration with the Clinical Genome Resource (ClinGen), the flagship NIH program for clinical curation, we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by specific evidence types used by curators to assess pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework to improve clinical variant curation.

Posted ContentDOI
13 Apr 2019-bioRxiv
TL;DR: Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives, and the use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies.
Abstract: Objective Unstructured clinical narratives are continuously being recorded as part of delivery of care in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes; despite these efforts, label availability and accuracy are both suboptimal. Materials and Methods In this retrospective study, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baselines by training Decision Trees (DT) and Random Forests (RF), and using MetaMap Lite, a clinical natural language processing tool. Results We show that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories with an average weighted macro F1 score of 0.74 and 0.68 respectively. In the “neoplasia” category, the model built with veterinary data has a high accuracy in veterinary data, and moderate accuracy in human data, with F1 scores of 0.91 and 0.70 respectively. Our LSTM method scored slightly higher than that of the DT and RF models. Discussion The use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies. Conclusion Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. Our approach is a step forward for these two domains to learn from, and inform, one another.

Journal ArticleDOI
TL;DR: Even with the best reimbursement reforms, these market entry rewards will be needed to restore the antibacterial R&D ecosystem to health without driving inappropriate overuse of novel antibiotics.
Abstract: It remains to be seen whether the legislative or regulatory path to a carve-out is more achievable. In the coming year, both will be pursued. And although reimbursement reforms are welcome, antibiotic stewardship must remain a central feature, which will inevitably depress the volume of sales. For this reason, every major policy group that has looked at this problem has called for a market entry reward payment that is unlinked to sales, payable when the FDA or European Medicines Agency approves a high-quality new antibiotic (refs. 13–19). Even with the best reimbursement reforms, these market entry rewards will be needed to restore the antibacterial R&D ecosystem to health without driving inappropriate overuse of novel antibiotics.

Journal ArticleDOI
TL;DR: A novel method to identify sex-bias from genetic sequence data that models population size changes and estimates the female fraction of the effective population size during each time epoch is presented.
Abstract: Sex-biased demographic events (“sex-bias”) involve unequal numbers of females and males. These events are typically inferred from the relative amount of X-chromosomal to autosomal genetic variation and have led to conflicting conclusions about human demographic history. Though population size changes alter the relative amount of X-chromosomal to autosomal genetic diversity even in the absence of sex-bias, this has generally not been accounted for in sex-bias estimators to date. Here, we present a novel method to identify sex-bias from genetic sequence data that models population size changes and estimates the female fraction of the effective population size during each time epoch. Compared to recent sex-bias inference methods, our approach can detect sex-bias that changes on a single population branch without requiring data from an outgroup or knowledge of divergence events. When applied to simulated data, conventional sex-bias estimators are biased by population size changes, especially recent growth or bottlenecks, while our estimator is unbiased. We next apply our method to high-coverage exome data from the 1000 Genomes Project and estimate a male bias in Yorubans (47% female) and Europeans (44%), possibly due to stronger background selection on the X chromosome than on the autosomes. Finally, we apply our method to the 1000 Genomes Project Phase 3 high-coverage Complete Genomics whole-genome data and estimate a female bias in Yorubans (63% female), Europeans (84%), Punjabis (82%), as well as Peruvians (56%), and a male bias in the Southern Han Chinese (45%). Our method additionally identifies a male-biased migration out of Africa based on data from Europeans (20% female). Our results demonstrate that modeling population size change is necessary to estimate sex-bias parameters accurately. 
Our approach gives insight into signatures of sex-bias in sexual species, and the demographic models it produces can serve as more accurate null models for tests of selection.
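To see why the X-to-autosome diversity ratio carries information about sex-bias, it helps to look at the classical expectation for a constant-size population. The sketch below is not the estimator described in the abstract (which additionally models population size changes); it only encodes the textbook relationship N_X / N_A = 9 / (8 (2 - p)), where p is the female fraction of the breeding population. The function names are illustrative.

```python
def x_to_autosome_ratio(female_fraction: float) -> float:
    """Expected ratio of X-linked to autosomal effective population size
    for a constant-size population with female fraction p:
    N_X / N_A = 9 / (8 * (2 - p))."""
    if not 0.0 < female_fraction < 1.0:
        raise ValueError("female_fraction must lie strictly between 0 and 1")
    return 9.0 / (8.0 * (2.0 - female_fraction))


def female_fraction_from_ratio(ratio: float) -> float:
    """Invert the relationship above: p = 2 - 9 / (8 * ratio)."""
    p = 2.0 - 9.0 / (8.0 * ratio)
    if not 0.0 < p < 1.0:
        raise ValueError("ratio cannot be explained by sex-bias alone in this model")
    return p


for p in (0.25, 0.5, 0.75):
    print(f"female fraction {p:.2f} -> X/A ratio {x_to_autosome_ratio(p):.3f}")
```

With equal numbers of females and males the ratio is 0.75, and under this simple model it can only range between 9/16 and 9/8. Observed ratios are also shifted by population growth or bottlenecks, which is the effect the abstract argues must be modeled explicitly.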

Journal ArticleDOI
TL;DR: An amendment to this paper has been published and can be accessed via a link at the top of the paper.
Author(s): Daya, Michelle; Rafaels, Nicholas; Brunetti, Tonya M; Chavan, Sameer; Levin, Albert M; Shetty, Aniket; Gignoux, Christopher R; Boorgula, Meher Preethi; Wojcik, Genevieve; Campbell, Monica; Vergara, Candelaria; Torgerson, Dara G; Ortega, Victor E; Doumatey, Ayo; Johnston, Henry Richard; Acevedo, Nathalie; Araujo, Maria Ilma; Avila, Pedro C; Belbin, Gillian; Bleecker, Eugene; Bustamante, Carlos; Caraballo, Luis; Cruz, Alvaro; Dunston, Georgia M; Eng, Celeste; Faruque, Mezbah U; Ferguson, Trevor S; Figueiredo, Camila; Ford, Jean G; Gan, Weiniu; Gourraud, Pierre-Antoine; Hansel, Nadia N; Hernandez, Ryan D; Herrera-Paz, Edwin Francisco; Jimenez, Silvia; Kenny, Eimear E; Knight-Madden, Jennifer; Kumar, Rajesh; Lange, Leslie A; Lange, Ethan M; Lizee, Antoine; Maul, Pissamai; Maul, Trevor; Mayorga, Alvaro; Meyers, Deborah; Nicolae, Dan L; O'Connor, Timothy D; Oliveira, Ricardo Riccio; Olopade, Christopher O; Olopade, Olufunmilayo; Qin, Zhaohui S; Rotimi, Charles; Vince, Nicolas; Watson, Harold; Wilks, Rainford J; Wilson, James G; Salzberg, Steven; Ober, Carole; Burchard, Esteban G; Williams, L Keoki; Beaty, Terri H; Taub, Margaret A; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen C; CAAPA
Abstract: An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Posted ContentDOI
17 May 2019-bioRxiv
TL;DR: Topographic and transcriptional maps of canonical, H2A.Z, and monoubiquitinated H2B (uH2B) nucleosomes are obtained at near base-pair resolution and accuracy and a unified mechanical model links position-dependent dwell times of Pol II on the nucleosome with energetics of the barrier.
Abstract: Nucleosomes represent mechanical and energetic barriers that RNA Polymerase II (Pol II) must overcome during transcription. A high-resolution description of the barrier topography, its modulation by epigenetic modifications, and their effects on Pol II nucleosome crossing dynamics, is still missing. Here, we obtain topographic and transcriptional (Pol II residence time) maps of canonical, H2A.Z, and monoubiquitinated H2B (uH2B) nucleosomes at near base-pair resolution and accuracy. Pol II crossing dynamics are complex, displaying pauses at specific loci, backtracking, and nucleosome hopping between wrapped states. While H2A.Z widens the barrier, uH2B heightens it, and both modifications greatly lengthen Pol II crossing time. Using the dwell times of Pol II at each nucleosomal position we extract the energetics of the barrier. The orthogonal barrier modifications of H2A.Z and uH2B, and their effects on Pol II dynamics rationalize their observed enrichment in +1 nucleosomes and suggest a mechanism for selective control of gene expression.
Highlights: A single-molecule unzipping assay mimics DNA unwinding by Pol II and maps the topography of human canonical, H2A.Z and uH2B nucleosome barriers at high resolution. Real-time dynamics and full molecular trajectories of Pol II crossing the nucleosomal barrier reveal the transcriptional landscape of the barrier at high accuracy. H2A.Z enhances the width and uH2B the height of the barrier. A unified mechanical model links position-dependent dwell times of Pol II on the nucleosome with energetics of the barrier.

Posted ContentDOI
10 Dec 2019-bioRxiv
TL;DR: A hybrid-capture RNAseq assay for FFPE tissue (Fusion-STAMP) that fully targets the transcript isoforms of 43 genes selected for their known impact as actionable targets of existing and emerging anti-cancer therapies, prognostic features, and/or utility as diagnostic cancer biomarkers.
Abstract: RNA sequencing is emerging as a powerful technique to detect a diverse array of fusions in human neoplasia, but few clinically validated assays have been described to date. We designed and validated a hybrid-capture RNAseq assay for FFPE tissue (Fusion-STAMP). It fully targets the transcript isoforms of 43 genes selected for their known impact as actionable targets of existing and emerging anti-cancer therapies (especially in lung adenocarcinomas), prognostic features, and/or utility as diagnostic cancer biomarkers (especially in sarcomas). 57 fusion results across 34 samples were evaluated. Fusion-STAMP demonstrated high overall accuracy with 98% sensitivity and 94% specificity for fusion detection. There was high intra- and inter-run reproducibility. Detection was sensitive to approximately 10% tumor, though this is expected to be impacted by fusion transcript expression levels, hybrid capture efficiency, and RNA quality. Challenges of clinically validating RNA sequencing for fusion detection include a low average RNA quality in FFPE specimens, and variable RNA total content and expression profile per cell. These challenges contribute to highly variable on-target rates, total read pairs, and total mapped read pairs. False positive results may be caused by intergenic splicing, barcode hopping / index hopping, or misalignment. Despite this, Fusion-STAMP demonstrates high overall performance metrics for qualitative fusion detection and is expected to provide clinical utility in identifying actionable fusions.

Journal ArticleDOI
TL;DR: Development and characterization of 17 novel microsatellite markers for the reef manta ray (Mobula alfredi) are reported, increasing the total number of microsatellite markers available for this species to 27.
Abstract: Limited sample sizes are often a problem for species of conservation concern when using genetic tools to make population assessments. Lack of analytical power from small sample sizes can be compensated for by use of a large marker set. Here we report on development and characterization of 17 novel microsatellite markers for the reef manta ray (Mobula alfredi). Loci were screened on 60 reef manta rays (M. alfredi) sampled from the east coast of Australia. The number of alleles per locus varied from 2 to 13 with observed heterozygosities ranging between 0.300 and 0.917. The development of these 17 additional markers increases the total number of microsatellite markers available for this species to 27.


Posted ContentDOI
17 Aug 2019-medRxiv
TL;DR: An event alignment algorithm, Medal, is proposed; it uses a dynamic programming approach for pairwise alignment of medication histories and identified four clusters in PANS with distinct medication usage histories, driven primarily by penicillin.
Abstract: Objective: Pediatric acute-onset neuropsychiatric syndrome (PANS) is a complex neuropsychiatric syndrome characterized by an abrupt onset of obsessive-compulsive symptoms and/or severe eating restrictions, along with at least two concomitant debilitating cognitive, behavioral, or neurological symptoms. A wide range of pharmacological interventions, along with behavioral and environmental modifications and psychotherapies, have been adopted to treat symptoms and underlying etiologies. Our goal was to develop a data-driven approach to identify treatment patterns in this cohort. Materials and Methods: In this cohort study, we extracted medical prescription histories from electronic health records. We developed a modified dynamic programming approach to perform global alignment of those medication histories. Our approach is unique since it considers time gaps in prescription patterns as part of the similarity strategy. Results: This study included 43 consecutive new-onset pre-pubertal patients who had at least 3 clinic visits. Our algorithm identified six clusters with distinct medication usage histories, which may represent clinicians' practice of treating PANS of different severities and etiologies, i.e., two most severe groups requiring high-dose intravenous steroids; two arthritic or inflammatory groups requiring prolonged nonsteroidal anti-inflammatory drug (NSAID) treatment; and two mild relapsing/remitting groups treated with a short course of NSAIDs. The psychometric scores as outcomes in each cluster generally improved within the first two years. Discussion and Conclusion: Our algorithm shows potential to improve our knowledge of treatment patterns in the PANS cohort, while helping clinicians understand how patients respond to a combination of drugs.
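The global alignment of medication histories mentioned above can be pictured with a standard Needleman-Wunsch style dynamic program. The sketch below is not the Medal algorithm itself; it is a minimal illustration under stated assumptions: each history is a list of tokens, one per time period, a hypothetical `IDLE` token marks a period with no prescription (so time gaps enter the comparison), and all scores and penalties are arbitrary illustrative values.

```python
from typing import List

IDLE = "idle"  # hypothetical token for a period with no prescription


def score_pair(a: str, b: str) -> float:
    """Illustrative scoring: matching drugs score 1, matching idle periods 0.5,
    any mismatch scores -1. These values are assumptions, not from the paper."""
    if a == b:
        return 0.5 if a == IDLE else 1.0
    return -1.0


def global_alignment_score(x: List[str], y: List[str], indel: float = -1.0) -> float:
    """Needleman-Wunsch global alignment score between two medication histories."""
    n, m = len(x), len(y)
    # dp[i][j] holds the best score for aligning x[:i] with y[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score_pair(x[i - 1], y[j - 1]),  # align both periods
                dp[i - 1][j] + indel,  # skip a period in x
                dp[i][j - 1] + indel,  # skip a period in y
            )
    return dp[n][m]


# Toy histories with made-up drug classes (not patient data).
history_a = ["nsaid", "nsaid", IDLE, "steroid"]
history_b = ["nsaid", IDLE, "steroid"]
print(global_alignment_score(history_a, history_b))
print(global_alignment_score(history_a, history_a))
```

Pairwise scores of this kind can be converted into a distance matrix and passed to a clustering method, which is one plausible way to arrive at groups of patients with similar treatment trajectories.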