Author

Katayoon Darvishi

Bio: Katayoon Darvishi is an academic researcher from Harvard University. The author has contributed to research on the topics of single-nucleotide polymorphism and copy-number variation. The author has an h-index of 13 and has co-authored 14 publications receiving 4,550 citations. Previous affiliations of Katayoon Darvishi include Brigham and Women's Hospital and Jawaharlal Nehru University.

Papers
Journal Article
02 Sep 2010-Nature
TL;DR: An expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.
Abstract: Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of

2,863 citations

Journal Article
TL;DR: Birdsuite is presented, a four-stage analytical framework instantiated in software for deriving integrated and mutually consistent copy number and SNP genotypes that more accurately depict the underlying sequence of each individual, reducing the rate of apparent mendelian inconsistencies.
Abstract: Accurate and complete measurement of single nucleotide (SNP) and copy number (CNV) variants, both common and rare, will be required to understand the role of genetic variation in disease. We present Birdsuite, a four-stage analytical framework instantiated in software for deriving integrated and mutually consistent copy number and SNP genotypes. The method sequentially assigns copy number across regions of common copy number polymorphisms (CNPs), calls genotypes of SNPs, identifies rare CNVs via a hidden Markov model (HMM), and generates an integrated sequence and copy number genotype at every locus (for example, including genotypes such as A-null, AAB and BBB in addition to AA, AB and BB calls). Such genotypes more accurately depict the underlying sequence of each individual, reducing the rate of apparent mendelian inconsistencies. The Birdsuite software is applied here to data from the Affymetrix SNP 6.0 array. Additionally, we describe a method, implemented in PLINK, to utilize these combined SNP and CNV genotypes for association testing with a phenotype.

835 citations

Journal Article
TL;DR: The striking differences between CNV calls from different platforms and analytic tools highlight the importance of careful assessment of experimental design in discovery and association studies and of strict data curation and filtering in diagnostics.
Abstract: We have systematically compared copy number variant (CNV) detection on eleven microarrays to evaluate data quality and CNV calling, reproducibility, concordance across array platforms and laboratory sites, breakpoint accuracy and analysis tool variability. Different analytic tools applied to the same raw data typically yield CNV calls with <50% concordance. Moreover, reproducibility in replicate experiments is <70% for most platforms. Nevertheless, these findings should not preclude detection of large CNVs for clinical diagnostic purposes because large CNVs with poor reproducibility are found primarily in complex genomic regions and would typically be removed by standard clinical data curation. The striking differences between CNV calls from different platforms and analytic tools highlight the importance of careful assessment of experimental design in discovery and association studies and of strict data curation and filtering in diagnostics. The CNV resource presented here allows independent data evaluation and provides a means to benchmark new algorithms.

418 citations

Journal Article
TL;DR: A new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals and discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs.
Abstract: Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3× coverage) and two Asian genomes (AK1, with 27.8× coverage and AK2, with 32.0× coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine.

224 citations

Journal Article
TL;DR: This study attempts to validate the exclusive presence of the mtG10398A (Ala→Thr) polymorphism in a haplotype constituting mtDNA haplogroup N and its sublineages, conferring on this group a higher risk for breast cancer, based on re-analyses of approximately 1000 complete human mtDNA sequences worldwide and collated information on 2334 individuals belonging to 18 regions in India.

161 citations


Cited by
Journal Article
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations

Journal Article
28 Oct 2010-Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. This paper presents the results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

7,538 citations

Journal Article
TL;DR: This unit describes how to use BWA and the Genome Analysis Toolkit to map genome sequencing data to a reference and produce high‐quality variant calls that can be used in downstream analyses.
Abstract: This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK.

5,150 citations

Journal Article
Adam J. Bass, Vesteinn Thorsson, Ilya Shmulevich, Sheila Reynolds, and 254 more authors (32 institutions)
11 Sep 2014-Nature
TL;DR: A comprehensive molecular evaluation of 295 primary gastric adenocarcinomas as part of The Cancer Genome Atlas (TCGA) project is described and a molecular classification dividing gastric cancer into four subtypes is proposed.
Abstract: Gastric cancer was the world’s third leading cause of cancer mortality in 2012, responsible for 723,000 deaths [1]. The vast majority of gastric cancers are adenocarcinomas, which can be further subdivided into intestinal and diffuse types according to the Lauren classification [2]. An alternative system, proposed by the World Health Organization, divides gastric cancer into papillary, tubular, mucinous (colloid) and poorly cohesive carcinomas [3]. These classification systems have little clinical utility, making the development of robust classifiers that can guide patient therapy an urgent priority. The majority of gastric cancers are associated with infectious agents, including the bacterium Helicobacter pylori [4] and Epstein–Barr virus (EBV). The distribution of histological subtypes of gastric cancer and the frequencies of H. pylori and EBV-associated gastric cancer vary across the globe [5]. A small minority of gastric cancer cases are associated with germline mutation in E-cadherin (CDH1) [6] or mismatch repair genes [7] (Lynch syndrome), whereas sporadic mismatch repair-deficient gastric cancers have epigenetic silencing of MLH1 in the context of a CpG island methylator phenotype (CIMP) [8]. Molecular profiling of gastric cancer has been performed using gene expression or DNA sequencing [9–12], but has not led to a clear biologic classification scheme. The goals of this study by The Cancer Genome Atlas (TCGA) were to develop a robust molecular classification of gastric cancer and to identify dysregulated pathways and candidate drivers of distinct classes of gastric cancer.

4,583 citations

Journal Article
Shaun Purcell, Naomi R. Wray, Jennifer Stone, Peter M. Visscher, Michael Conlon O'Donovan, Patrick F. Sullivan, Pamela Sklar, Douglas M. Ruderfer, Andrew McQuillin, Derek W. Morris, Colm O'Dushlaine, Aiden Corvin, Peter Holmans, Stuart MacGregor, Hugh Gurling, Douglas Blackwood, Nicholas John Craddock, Michael Gill, Christina M. Hultman, George Kirov, Paul Lichtenstein, Walter J. Muir, Michael John Owen, Carlos N. Pato, Edward M. Scolnick, David St Clair, Nigel Williams, Lyudmila Georgieva, Ivan Nikolov, Nadine Norton, Hywel Williams, Draga Toncheva, Vihra Milanova, Emma Flordal Thelander, Patrick Sullivan, Elaine Kenny, Emma M. Quinn, Khalid Choudhury, Susmita Datta, Jonathan Pimm, Srinivasa Thirumalai, Vinay Puri, Robert Krasucki, Jacob Lawrence, Digby Quested, Nicholas Bass, Caroline Crombie, Gillian Fraser, Soh Leh Kuan, Nicholas Walker, Kevin A. McGhee, Ben S. Pickard, P. Malloy, Alan W Maclean, Margaret Van Beck, Michele T. Pato, Helena Medeiros, Frank A. Middleton, Célia Barreto Carvalho, Christopher P. Morley, Ayman H. Fanous, David V. Conti, James A. Knowles, Carlos Ferreira, António Macedo, M. Helena Azevedo, Andrew Kirby, Manuel A. R. Ferreira, Mark J. Daly, Kimberly Chambert, Finny G Kuruvilla, Stacey Gabriel, Kristin G. Ardlie, Jennifer L. Moran
06 Aug 2009-Nature
TL;DR: Using two analytic approaches, this study shows the extent to which common genetic variation underlies the risk of schizophrenia: it implicates the major histocompatibility complex and provides evidence for a substantial polygenic component involving thousands of common alleles of very small effect.
Abstract: Schizophrenia is a severe mental disorder with a lifetime risk of about 1%, characterized by hallucinations, delusions and cognitive deficits, with heritability estimated at up to 80% [1,2]. We performed a genome-wide association study of 3,322 European individuals with schizophrenia and 3,587 controls. Here we show, using two analytic approaches, the extent to which common genetic variation underlies the risk of schizophrenia. First, we implicate the major histocompatibility complex. Second, we provide molecular genetic evidence for a substantial polygenic component to the risk of schizophrenia involving thousands of common alleles of very small effect. We show that this component also contributes to the risk of bipolar disorder, but not to several non-psychiatric diseases.

4,573 citations