scispace - formally typeset
Search or ask a question
Author

Tanya Sokolsky

Bio: Tanya Sokolsky is an academic researcher from Life Technologies. The author has contributed to research in topics: Saccharomyces cerevisiae & MSH2. The author has an hindex of 10, co-authored 17 publications receiving 3844 citations. Previous affiliations of Tanya Sokolsky include Cornell University & Massachusetts Institute of Technology.

Papers
More filters
Journal ArticleDOI
21 Jul 2011-Nature
TL;DR: A DNA sequencing technology in which scalable, low-cost semiconductor manufacturing techniques are used to make an integrated circuit able to directly perform non-optical DNA sequencing of genomes, showing its robustness and scalability by producing ion chips with up to 10 times as many sensors and sequencing a human genome.
Abstract: The seminal importance of DNA sequencing to the life sciences, biotechnology and medicine has driven the search for more scalable and lower-cost solutions. Here we describe a DNA sequencing technology in which scalable, low-cost semiconductor manufacturing techniques are used to make an integrated circuit able to directly perform non-optical DNA sequencing of genomes. Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis using all-natural nucleotides on this massively parallel semiconductor-sensing device or ion chip. The ion chip contains ion-sensitive, field-effect transistor-based sensors in perfect register with 1.2 million wells, which provide confinement and allow parallel, simultaneous detection of independent sequencing reactions. Use of the most widely used technology for constructing integrated circuits, the complementary metal-oxide semiconductor (CMOS) process, allows for low-cost, large-scale production and scaling of the device to higher densities and larger array sizes. We show the performance of the system by sequencing three bacterial genomes, its robustness and scalability by producing ion chips with up to 10 times as many sensors and sequencing a human genome.

2,246 citations

Journal ArticleDOI
17 Aug 2007-Science
TL;DR: It is concluded that aneuploidy causes not only a proliferative disadvantage but also a set of phenotypes that is independent of the identity of the individual extra chromosomes.
Abstract: Aneuploidy is a condition frequently found in tumor cells, but its effect on cellular physiology is not known. We have characterized one aspect of aneuploidy: the gain of extra chromosomes. We created a collection of haploid yeast strains that each bear an extra copy of one or more of almost all of the yeast chromosomes. Their characterization revealed that aneuploid strains share a number of phenotypes, including defects in cell cycle progression, increased glucose uptake, and increased sensitivity to conditions interfering with protein synthesis and protein folding. These phenotypes were observed only in strains carrying additional yeast genes, which indicates that they reflect the consequences of additional protein production as well as the resulting imbalances in cellular protein composition. We conclude that aneuploidy causes not only a proliferative disadvantage but also a set of phenotypes that is independent of the identity of the individual extra chromosomes.

862 citations

Journal ArticleDOI
TL;DR: Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual.
Abstract: We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.

595 citations

Journal ArticleDOI
TL;DR: It is proposed that the MSH2 helix-turn-helix domain mediates changes in Msh2p-Msh6p interactions that are induced by ATP hydrolysis; the net result of these changes is a modulation of mismatch recognition.
Abstract: Recent studies have shown that Saccharomyces cerevisiae Msh2p and Msh6p form a complex that specifically binds to DNA containing base pair mismatches. In this study, we performed a genetic and biochemical analysis of the Msh2p-Msh6p complex by introducing point mutations in the ATP binding and putative helix-turn-helix domains of MSH2. The effects of these mutations were analyzed genetically by measuring mutation frequency and biochemically by measuring the stability, mismatch binding activity, and ATPase activity of msh2p (mutant msh2p)-Msh6p complexes. A mutation in the ATP binding domain of MSH2 did not affect the mismatch binding specificity of the msh2p-Msh6p complex; however, this mutation conferred a dominant negative phenotype when the mutant gene was overexpressed in a wild-type strain, and the mutant protein displayed biochemical defects consistent with defects in mismatch repair downstream of mismatch recognition. Helix-turn-helix domain mutant proteins displayed two different properties. One class of mutant proteins was defective in forming complexes with Msh6p and also failed to recognize base pair mismatches. A second class of mutant proteins displayed properties similar to those observed for the ATP binding domain mutant protein. Taken together, these data suggested that the proposed helix-turn-helix domain of Msh2p was unlikely to be involved in mismatch recognition. We propose that the MSH2 helix-turn-helix domain mediates changes in Msh2p-Msh6p interactions that are induced by ATP hydrolysis; the net result of these changes is a modulation of mismatch recognition.

136 citations

Journal ArticleDOI
TL;DR: In UV cross-linking, filter binding, and gel retardation assays, the MSH2-msh6-F337A complex displayed a mismatch recognition defect and observations, in conjunction with ATPase and dissociation rate analysis, suggested that MSH 2-msH6- F337A formed an unproductive complex that was unable to stably bind to mismatch DNA.

91 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations

Journal ArticleDOI
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results To align our large (>80 billon reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations

Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations

Journal ArticleDOI
TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.

10,056 citations

Journal ArticleDOI
TL;DR: A technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments is presented.
Abstract: Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.

7,023 citations