Home
/
Authors
/
Michael Roberts

Author

Michael Roberts

Other affiliations: Johns Hopkins University School of Medicine

Bio: Michael Roberts is an academic researcher from University of Maryland, College Park. The author has contributed to research in topics: Sequence assembly & Shotgun sequencing. The author has an hindex of 13, co-authored 15 publications receiving 4070 citations. Previous affiliations of Michael Roberts include Johns Hopkins University School of Medicine.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

A whole-genome assembly of the domestic cow, Bos taurus

[...]

Aleksey V. Zimin¹, Arthur L. Delcher¹, Liliana Florea¹, David R. Kelley¹, Michael C. Schatz¹, Daniela Puiu¹, Finnian Hanrahan¹, Geo Pertea¹, Curtis P. Van Tassell², Tad S. Sonstegard², Guillaume Marçais¹, Michael Roberts¹, Poorani Subramanian¹, James A. Yorke¹, Steven L. Salzberg¹ - Show less +11 more•Institutions (2)

University of Maryland, College Park¹, United States Department of Agriculture²

24 Apr 2009-Genome Biology

TL;DR: By using independent mapping data and conserved synteny between the cow and human genomes, this work was able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes.

...read moreread less

Abstract: Background: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.

...read moreread less

1,097 citations

Journal Article•DOI•

The MaSuRCA genome assembler

[...]

Aleksey V. Zimin¹, Guillaume Marçais¹, Daniela Puiu¹, Michael Roberts¹, Steven L. Salzberg¹, James A. Yorke¹ - Show less +2 more•Institutions (1)

Johns Hopkins University School of Medicine¹

01 Nov 2013-Bioinformatics

TL;DR: A new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error is described.

...read moreread less

Abstract: Motivation. Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer “super-reads.” The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced “mazurka”). Results. We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two data sets from organisms for which highquality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. Availability. MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Contact. Aleksey Zimin, alekseyz@ipst.umd.edu

...read moreread less

1,032 citations

Journal Article•DOI•

GAGE: A critical evaluation of genome assemblies and assembly algorithms

[...]

Steven L. Salzberg¹, Adam M. Phillippy², Aleksey V. Zimin³, Daniela Puiu⁴, Tanja Magoc⁴, Sergey Koren³, Sergey Koren², Todd J. Treangen⁴, Michael C. Schatz⁵, Arthur L. Delcher, Michael Roberts³, Guillaume Marçais³, Mihai Pop³, James A. Yorke³ - Show less +10 more•Institutions (5)

Johns Hopkins University School of Medicine¹, Battelle Memorial Institute², University of Maryland, College Park³, Johns Hopkins University⁴, Cold Spring Harbor Laboratory⁵

01 Mar 2012-Genome Research

TL;DR: Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.

...read moreread less

Abstract: New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

...read moreread less

751 citations

Journal Article•DOI•

Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies

[...]

David B. Neale¹, Jill L. Wegrzyn¹, Kristian Stevens¹, Aleksey V. Zimin², Daniela Puiu³, Marc W. Crepeau¹, Charis Cardeno¹, Maxim Koriabine⁴, Ann E. Holtz-Morris⁴, John D. Liechty¹, Pedro J. Martínez-García¹, Hans A. Vasquez-Gross¹, Brian Y. Lin¹, Jacob J. Zieve¹, William M. Dougherty¹, Sara Fuentes-Soriano⁵, Le-Shin Wu⁵, Don Gilbert⁵, Guillaume Marçais², Michael Roberts², Carson Holt⁶, Mark Yandell⁶, John M. Davis⁷, Katherine E. Smith⁸, Jeffrey F. D. Dean⁹, W. Walter Lorenz⁹, Ross W. Whetten¹⁰, Ronald R. Sederoff¹⁰, Nicholas C. Wheeler¹, Patrick E. McGuire¹, Doreen Main¹¹, Carol A. Loopstra¹², Keithanne Mockaitis⁵, Pieter J. deJong⁴, James A. Yorke², Steven L. Salzberg³, Charles H. Langley¹ - Show less +33 more•Institutions (12)

University of California, Davis¹, University of Maryland, College Park², Johns Hopkins University³, Children's Hospital Oakland Research Institute⁴, Indiana University⁵, University of Utah⁶, University of Florida⁷, United States Forest Service⁸, University of Georgia⁹, North Carolina State University¹⁰, Washington State University¹¹, Texas A&M University¹²

04 Mar 2014-Genome Biology

TL;DR: In this paper, the authors used a whole genome shotgun approach relying on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding.

...read moreread less

Abstract: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.

...read moreread less

420 citations

Journal Article•DOI•

Reducing storage requirements for biological sequence comparison

[...]

Michael Roberts¹, Wayne B. Hayes¹, Brian R. Hunt¹, Stephen M. Mount¹, James A. Yorke¹ - Show less +1 more•Institutions (1)

University of Maryland, College Park¹

12 Dec 2004-Bioinformatics

TL;DR: A simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored, which can speed up string-matching computations by a large factor while missing only aSmall fraction of the matches found using all seeds.

...read moreread less

Abstract: Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. Results: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.

...read moreread less

357 citations

Cited by

PDF

Open Access

More filters

SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing ( 7th Annual SFAF Meeting, 2012)

[...]

Glenn Tesler

01 Jun 2012

TL;DR: SPAdes as mentioned in this paper is a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler and on popular assemblers Velvet and SoapDeNovo (for multicell data).

...read moreread less

Abstract: The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online ( http://bioinf.spbau.ru/spades ). It is distributed as open source software.

...read moreread less

10,124 citations

Journal Article•DOI•

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

[...]

Mihaela Pertea¹, Geo Pertea¹, Corina Antonescu¹, Tsung Cheng Chang², Joshua T. Mendell², Steven L. Salzberg¹ - Show less +2 more•Institutions (2)

Johns Hopkins University¹, University of Texas Southwestern Medical Center²

01 Mar 2015-Nature Biotechnology

TL;DR: StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts produces more complete and accurate reconstructions of genes and better estimates of expression levels.

...read moreread less

Abstract: Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

...read moreread less

6,594 citations

Journal Article•DOI•

Minimap2: pairwise alignment for nucleotide sequences

[...]

Heng Li¹•Institutions (1)

Broad Institute¹

15 Sep 2018-Bioinformatics

TL;DR: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database and is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mapper at higher accuracy, surpassing most aligners specialized in one type of alignment.

...read moreread less

Abstract: Motivation Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation https://github.com/lh3/minimap2. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

6,264 citations

Journal Article•DOI•

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

[...]

Donovan H. Parks¹, Michael Imelfort¹, Connor T. Skennerton¹, Philip Hugenholtz¹, Gene W. Tyson¹ - Show less +1 more•Institutions (1)

University of Queensland¹

01 Jul 2015-Genome Research

TL;DR: An objective measure of genome quality is proposed that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities and is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches.

...read moreread less

Abstract: Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.

...read moreread less

5,788 citations

Journal Article•DOI•

QUAST: quality assessment tool for genome assemblies

[...]

Alexey Gurevich¹, Vladislav Saveliev¹, Nikolay Vyahhi¹, Glenn Tesler¹•Institutions (1)

University of California, San Diego¹

15 Apr 2013-Bioinformatics

TL;DR: This tool improves on leading assembly comparison software with new ideas and quality metrics, and can evaluate assemblies both with a reference genome, as well as without a reference.

...read moreread less

Abstract: Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST—a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website.

...read moreread less

5,757 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse