Author

Joel A. Malek

Bio: Joel A. Malek is an academic researcher from Cornell University. The author has contributed to research on the topics Population and Genome, has an h-index of 40, and has co-authored 114 publications receiving 12,447 citations. Previous affiliations of Joel A. Malek include the J. Craig Venter Institute and the Indian Ministry of Environment and Forests.
Topics: Population, Genome, Gene, Exome sequencing, Circular RNA
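
The h-index quoted in the bio follows the standard definition: the largest h such that h publications each have at least h citations. As a minimal sketch (the function name `h_index` is illustrative, and the input is only the five citation counts shown under Papers on this page, not the author's full record), it can be computed like this:

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Citation counts of the five papers listed on this page (a subset only).
listed = [2184, 2033, 1486, 641, 630]
print(h_index(listed))
```

With only five papers in the input the result cannot exceed five; reproducing the reported value of 40 would require the citation counts of all 114 publications.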


Papers
Journal Article•DOI•
Robert L. Strausberg, Elise A. Feingold1, Lynette H. Grouse1, Jeffery G. Derge2, Richard D. Klausner1, Francis S. Collins1, Lukas Wagner1, Carolyn M. Shenmen1, Gregory D. Schuler1, Stephen F. Altschul1, Barry R. Zeeberg1, Kenneth H. Buetow1, Carl F. Schaefer1, Narayan K. Bhat1, Ralph F. Hopkins1, Heather Jordan1, Troy Moore3, Steve I Max3, Jun Wang3, Florence Hsieh, Luda Diatchenko, Kate Marusina, Andrew A Farmer, Gerald M. Rubin4, Ling Hong4, Mark Stapleton4, M. Bento Soares5, Maria de Fatima Bonaldo5, Thomas L. Casavant5, Todd E. Scheetz5, Michael J. Brownstein1, Ted B. Usdin1, Shiraki Toshiyuki, Piero Carninci, Christa Prange6, Sam S Raha7, Naomi A Loquellano7, Garrick J Peters7, Rick D Abramson7, Sara J Mullahy7, Stephanie Bosak, Paul J. McEwan, Kevin McKernan, Joel A. Malek, Preethi H. Gunaratne8, Stephen Richards8, Kim C. Worley8, Sarah Hale8, Angela M. Garcia8, Stephen W. Hulyk8, Debbie K Villalon8, Donna M. Muzny8, Erica Sodergren8, Xiuhua Lu8, Richard A. Gibbs8, Jessica Fahey9, Erin Helton9, Mark Ketteman9, Anuradha Madan9, Stephanie Rodrigues9, Amy Sanchez9, Michelle Whiting9, Anup Madan9, Alice C. Young1, Yuriy O. Shevchenko1, Gerard G. Bouffard1, Robert W. Blakesley1, Jeffrey W. Touchman1, Eric D. Green1, Mark Dickson10, Alex Rodriguez10, Jane Grimwood10, Jeremy Schmutz10, Richard M. Myers10, Yaron S.N. Butterfield11, Martin Krzywinski11, Ursula Skalska11, Duane E. Smailus11, Angelique Schnerch11, Jacqueline E. Schein11, Steven J.M. Jones11, Marco A. Marra11 •
TL;DR: The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene.
Abstract: The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).

2,184 citations

Journal Article•DOI•
Robert A. Holt1, G. Mani Subramanian1, Aaron L. Halpern1, Granger G. Sutton1, Rosane Charlab1, Deborah R. Nusskern1, Patrick Wincker2, Andrew G. Clark3, José M. C. Ribeiro4, Ron Wides5, Steven L. Salzberg6, Brendan J. Loftus6, Mark Yandell1, William H. Majoros6, William H. Majoros1, Douglas B. Rusch1, Zhongwu Lai1, Cheryl L. Kraft1, Josep F. Abril, Véronique Anthouard2, Peter Arensburger7, Peter W. Atkinson7, Holly Baden1, Véronique de Berardinis2, Danita Baldwin1, Vladimir Benes, Jim Biedler8, Claudia Blass, Randall Bolanos1, Didier Boscus2, Mary Barnstead1, Shuang Cai1, Kabir Chatuverdi1, George K. Christophides, Mathew A. Chrystal9, Michele Clamp10, Anibal Cravchik1, Val Curwen10, Ali N Dana9, Arthur L. Delcher1, Ian M. Dew1, Cheryl A. Evans1, Michael Flanigan1, Anne Grundschober-Freimoser11, Lisa Friedli7, Zhiping Gu1, Ping Guan1, Roderic Guigó, Maureen E. Hillenmeyer9, Susanne L. Hladun1, James R. Hogan9, Young S. Hong9, Jeffrey Hoover1, Olivier Jaillon2, Zhaoxi Ke9, Zhaoxi Ke1, Chinnappa D. Kodira1, Kokoza Eb, Anastasios C. Koutsos12, Ivica Letunic, Alex Levitsky1, Yong Liang1, Jhy-Jhu Lin1, Jhy-Jhu Lin6, Neil F. Lobo9, John Lopez1, Joel A. Malek6, Tina C. McIntosh1, Stephan Meister, Jason R. Miller1, Clark M. Mobarry1, Emmanuel Mongin13, Sean D. Murphy1, David A. O'Brochta11, Cynthia Pfannkoch1, Rong Qi1, Megan A. Regier1, Karin A. Remington1, Hongguang Shao8, Maria V. Sharakhova9, Cynthia Sitter1, Jyoti Shetty6, Thomas J. Smith1, Renee Strong1, Jingtao Sun1, Dana Thomasova, Lucas Q. Ton9, Pantelis Topalis12, Zhijian Tu8, Maria F. Unger9, Brian P. Walenz1, Aihui Wang1, Jian Wang1, Mei Wang1, X. Wang9, Kerry J. Woodford1, Jennifer R. Wortman6, Jennifer R. Wortman1, Martin Wu6, Alison Yao1, Evgeny M. Zdobnov, Hongyu Zhang1, Qi Zhao1, Shaying Zhao6, Shiaoping C. Zhu1, Igor F. Zhimulev, Mario Coluzzi14, Alessandra della Torre14, Charles Roth15, Christos Louis12, Francis Kalush1, Richard J. Mural1, Eugene W. Myers1, Mark Raymond Adams1, Hamilton O. 
Smith1, Samuel Broder1, Malcolm J. Gardner6, Claire M. Fraser6, Ewan Birney13, Peer Bork, Paul T. Brey15, J. Craig Venter1, J. Craig Venter6, Jean Weissenbach2, Fotis C. Kafatos, Frank H. Collins9, Stephen L. Hoffman1 •
04 Oct 2002-Science
TL;DR: Analysis of the PEST strain of A. gambiae revealed strong evidence for about 14,000 protein-encoding transcripts, and prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted.
Abstract: Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.

2,033 citations

Journal Article•DOI•
27 May 1999-Nature
TL;DR: Genome analysis reveals numerous pathways involved in degradation of sugars and plant polysaccharides, and 108 genes that have orthologues only in the genomes of other thermophilic Eubacteria and Archaea.
Abstract: The 1,860,725-base-pair genome of Thermotoga maritima MSB8 contains 1,877 predicted coding regions, 1,014 (54%) of which have functional assignments and 863 (46%) of which are of unknown function. Genome analysis reveals numerous pathways involved in degradation of sugars and plant polysaccharides, and 108 genes that have orthologues only in the genomes of other thermophilic Eubacteria and Archaea. Of the Eubacteria sequenced to date, T. maritima has the highest percentage (24%) of genes that are most similar to archaeal genes. Eighty-one archaeal-like genes are clustered in 15 regions of the T. maritima genome that range in size from 4 to 20 kilobases. Conservation of gene order between T. maritima and Archaea in many of the clustered regions suggests that lateral gene transfer may have occurred between thermophilic Eubacteria and Archaea.
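
The counts in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the genome length per predicted coding region is a derived number that the abstract itself does not state.

```python
# Figures quoted in the Thermotoga maritima MSB8 abstract.
genome_bp = 1_860_725
coding_regions = 1_877
assigned = 1_014   # coding regions with functional assignments
unknown = 863      # coding regions of unknown function

# The two categories should account for every predicted coding region.
total = assigned + unknown

# Percentages as reported in the abstract (54% and 46%).
pct_assigned = round(100 * assigned / coding_regions)
pct_unknown = round(100 * unknown / coding_regions)

# Derived, not stated in the source: genome length per predicted coding region.
bp_per_region = genome_bp / coding_regions

print(total, pct_assigned, pct_unknown, round(bp_per_region, 1))
```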

1,486 citations

Journal Article•DOI•
Daniela S. Gerhard1, Lukas Wagner1, Elise A. Feingold1, Carolyn M. Shenmen1, Lynette H. Grouse1, Greg Schuler1, Steven L. Klein1, Susan Old1, Rebekah S. Rasooly1, Peter J. Good1, Mark S. Guyer1, Allison M. Peck1, Jeffery G. Derge2, David J. Lipman1, Francis S. Collins1, Wonhee Jang1, Steven Sherry1, Mike Feolo1, Leonie Misquitta1, Eduardo Lee1, Kirill Rotmistrovsky1, Susan F. Greenhut1, Carl F. Schaefer1, Kenneth H. Buetow1, Tom I. Bonner1, David Haussler3, Jim Kent3, Mark Diekhans3, Terry Furey3, Michael R. Brent4, Christa Prange5, Kirsten Schreiber5, Nicole Shapiro5, Narayan K. Bhat2, Ralph F. Hopkins2, Florence Hsie, Tom Driscoll, M. Bento Soares6, Maria de Fatima Bonaldo6, Thomas L. Casavant6, Todd E. Scheetz6, Michael J. Brownstein1, Ted B. Usdin1, Shiraki Toshiyuki, Piero Carninci, Yulan Piao1, Dawood B. Dudekula1, Minoru S.H. Ko1, Koichi Kawakami7, Yutaka Suzuki8, Sumio Sugano8, C. E. Gruber, M. R. Smith, Blake A. Simmons, Troy Moore, Richard C. Waterman4, Stephen L. Johnson4, Yijun Ruan9, Chia-Lin Wei9, Sinnakaruppan Mathavan9, Preethi H. Gunaratne10, Jia Qian Wu10, Angela M. Garcia10, Stephen W. Hulyk10, Edwin Fuh10, Ye Yuan10, Anna Sneed10, Carla Kowis10, Anne Hodgson10, Donna M. Muzny10, John Douglas Mcpherson10, Richard A. Gibbs10, Jessica Fahey6, Jessica Fahey11, Erin Helton11, Mark Ketteman11, Anuradha Madan11, Anuradha Madan6, Stephanie Rodrigues11, Stephanie Rodrigues6, Amy Sanchez11, Michelle Whiting11, Anup Madan6, Anup Madan11, Alice C. Young1, Keith Wetherby1, Steven J. Granite1, Peggy N. Kwong1, Charles P. Brinkley1, Russell L. Pearson1, Gerard G. Bouffard1, Robert W. Blakesly1, Eric D. Green1, Mark Dickson12, Alex Rodriguez12, Jane Grimwood12, Jeremy Schmutz12, Richard M. Myers12, Yaron S.N. Butterfield13, Malachi Griffith13, Obi L. Griffith13, Martin Krzywinski13, Nancy Y. Liao13, Ryan Morrin13, Diana L. Palmquist13, Anca Petrescu13, Ursula Skalska13, Duane E. Smailus13, Jeff M. Stott13, Angelique Schnerch13, Jacqueline E. 
Schein13, Steven J.M. Jones13, Robert A. Holt13, Agnes Baross13, Marco A. Marra13, Sandra W. Clifton4, Kathryn A. Makowski, Stephanie Bosak, Joel A. Malek •
TL;DR: Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors.
Abstract: The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.

641 citations

Journal Article•DOI•
TL;DR: These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution and release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.
Abstract: Using the massively parallel technique of sequencing by oligonucleotide ligation and detection (SOLiD; Applied Biosystems), we have assessed the in vivo positions of more than 44 million putative nucleosome cores in the multicellular genetic model organism Caenorhabditis elegans. These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution. While we observe some degree of reproducible positioning throughout the genome in our mixed stage population of animals, we note that the major chromatin feature in the worm is a diversity of allowed nucleosome positions at the vast majority of individual loci. While absolute positioning of nucleosomes can vary substantially, relative positioning of nucleosomes (in a repeated array structure likely to be maintained at least in part by steric constraints) appears to be a significant property of chromatin structure. The high density of nucleosomal reads enabled a substantial extension of previous analysis describing the usage of individual oligonucleotide sequences along the span of the nucleosome core and linker. We release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.

630 citations


Cited by
Journal Article•DOI•
Eric S. Lander1, Lauren Linton1, Bruce W. Birren1, Chad Nusbaum1  +245 more•Institutions (29)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article•DOI•
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations

Journal Article•DOI•
TL;DR: Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranscribed regions, as well as new candidate microRNA precursors.
Abstract: We have mapped and quantified mouse transcriptomes by deeply sequencing them and recording how frequently each gene is represented in the sequence sample (RNA-Seq). This provides a digital measure of the presence and prevalence of transcripts from known and previously unknown genes. We report reference measurements composed of 41–52 million mapped 25-base-pair reads for poly(A)-selected RNA from adult mouse brain, liver and skeletal muscle tissues. We used RNA standards to quantify transcript prevalence and to test the linear range of transcript detection, which spanned five orders of magnitude. Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranscribed regions, as well as new candidate microRNA precursors. RNA splice events, which are not readily measured by standard gene expression microarray or serial analysis of gene expression methods, were detected directly by mapping splice-crossing sequence reads. We observed 1.45 × 10^5 distinct splices, and alternative splices were prominent, with 3,500 different genes expressing one or more alternate internal splices. The mRNA population specifies a cell’s identity and helps to govern its present and future activities. This has made transcriptome analysis a general phenotyping method, with expression microarrays of many kinds in routine use. Here we explore the possibility that transcriptome analysis, transcript discovery and transcript refinement can be done effectively in large and complex mammalian genomes by ultra-high-throughput sequencing. Expression microarrays are currently the most widely used methodology for transcriptome analysis, although some limitations persist.
These include hybridization and cross-hybridization artifacts, dye-based detection issues and design constraints that preclude or seriously limit the detection of RNA splice patterns and previously unmapped genes. These issues have made it difficult for standard array designs to provide full sequence comprehensiveness (coverage of all possible genes, including unknown ones, in large genomes) or transcriptome comprehensiveness (reliable detection of all RNAs of all prevalence classes, including the least abundant ones that are physiologically relevant). Other

12,293 citations

Journal Article•DOI•
J. Craig Venter1, Mark Raymond Adams1, Eugene W. Myers1, Peter W. Li1  +269 more•Institutions (12)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
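
Several of the figures in this abstract are related by simple arithmetic. The sketch below is a rough consistency check using only the rounded numbers quoted above, with illustrative variable names. The abstract does not say which genome size underlies its 5.11-fold figure, so dividing the sequenced bases by the 2.91-billion bp consensus gives a slightly different ratio. The mean read length and the expected number of differences between two haploid genomes are derived values, not numbers from the source.

```python
# Rounded figures quoted in the abstract.
reads = 27_271_853
sequenced_bp = 14_800_000_000      # "14.8-billion bp"
consensus_bp = 2_910_000_000       # "2.91-billion base pair" consensus
celera_coverage = 5.11             # fold coverage from Celera reads
shredded_public_coverage = 2.9     # fold coverage from shredded public data
bp_per_difference = 1250           # "1 bp per 1250 on average"

# Derived: average length of a sequence read.
mean_read_len = sequenced_bp / reads

# Fold coverage = sequenced bases / genome size (here, the consensus length).
coverage_vs_consensus = sequenced_bp / consensus_bp

# The two data sources together give the "eightfold" effective coverage.
effective_coverage = celera_coverage + shredded_public_coverage

# Derived: expected differences between a random pair of haploid genomes.
pairwise_differences = consensus_bp / bp_per_difference

print(round(mean_read_len), round(coverage_vs_consensus, 2),
      round(effective_coverage, 2), round(pairwise_differences))
```

The combined coverage of about 8.01 matches the "eightfold" effective coverage stated in the abstract.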

12,098 citations

Journal Article•DOI•
TL;DR: The RNA-Seq approach to transcriptome profiling that uses deep-sequencing technologies provides a far more precise measurement of levels of transcripts and their isoforms than other methods.
Abstract: RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.

11,528 citations