MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale
Mikhail Karasikov,Harun Mustafa,Daniel Danciu,Marc Zimmermann,Christopher Barber,Gunnar Rätsch,André Kahles +6 more
TLDR
This work presents MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories, and introduces the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set.
Abstract:
The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by an index and its query performance. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph indexes can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Notably, processing of data sets ranging from 1 TB of raw WGS reads to 20 TB of human RNA-sequencing data results in indexes whose memory footprints are small enough to host on standard desktop workstations. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 40,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries.
All indexes will be available for download and in the cloud. In total, indexes comprising more than 1 million sequencing records are available for download. As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
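The idea of differential assembly described in the abstract can be illustrated with a toy k-mer index. The following is only a minimal sketch and not MetaGraph's implementation or API: it uses a plain Python dictionary instead of a compressed annotated graph, the function names (`kmers`, `build_index`, `differential_kmers`) and sample labels are hypothetical, and it assumes that "present in the foreground" means present in at least `min_foreground` foreground samples. It stops at the k-mer level; joining the selected k-mers into contigs would require an additional graph traversal that is omitted here.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample labels that contain it.

    `samples` maps a (hypothetical) sample label to a list of sequences.
    """
    index = defaultdict(set)
    for label, sequences in samples.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(label)
    return index


def differential_kmers(index, foreground, background, min_foreground=1):
    """Return k-mers found in at least `min_foreground` foreground samples
    and in none of the background samples (an assumed criterion)."""
    fg, bg = set(foreground), set(background)
    return {
        kmer
        for kmer, labels in index.items()
        if len(labels & fg) >= min_foreground and not (labels & bg)
    }


# Toy data: the second sample shares a prefix with the first one.
samples = {
    "sample_after": ["ACGTACGGTT"],
    "sample_before": ["ACGTACG"],
}
index = build_index(samples, k=4)

# Experiment discovery in miniature: which samples contain a given k-mer?
print(sorted(index["ACGT"]))  # ['sample_after', 'sample_before']

# k-mers unique to the foreground sample.
diff = differential_kmers(index, ["sample_after"], ["sample_before"])
print(sorted(diff))  # ['ACGG', 'CGGT', 'GGTT']
```

Storing the label sets explicitly, as done here, is what a real index has to avoid at scale; the space and query-time trade-offs mentioned in the abstract concern how compactly that k-mer-to-label relation can be represented.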
Citations
SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing ( 7th Annual SFAF Meeting, 2012)
TL;DR: SPAdes, as mentioned in this paper, is a new assembler for both single-cell and standard (multicell) assembly; the authors demonstrate that it improves on the recently released E+V-SC assembler and on the popular assemblers Velvet and SoapDeNovo (for multicell data).
Journal ArticleDOI
Petabase-scale sequence alignment catalyses viral discovery
TL;DR: This article developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude.
Posted ContentDOI
Petabase-scale sequence alignment catalyses viral discovery
Robert C. Edgar,Jeff Taylor,Tomer Altman,Pierre Barbera,Dmitry Meleshko,Victor Lin,Dan Lohr,Gherman Novakovsky,Basem Al-Shayeb,Jillian F. Banfield,Anton Korobeynikov,Rayan Chikhi,Artem Babaian +13 more
TL;DR: Using a free and comprehensive database of nucleic acid sequences, the authors characterised novel viruses related to coronaviruses and to hepatitis δ virus and explored their environmental reservoirs.
Journal ArticleDOI
A global metagenomic map of urban microbiomes and antimicrobial resistance
David Danko,Daniela Bezdan,Evan E. Afshin,Sofia Ahsanuddin,Chandrima Bhattacharya,Daniel Butler,Kern Rei Chng,Daisy Donnellan,Jochen Hecht,Katelyn Jackson,Katerina Kuchin,Mikhail Karasikov,Mikhail Karasikov,Mikhail Karasikov,Abigail Lyons,Lauren Mak,Dmitry Meleshko,Harun Mustafa,Harun Mustafa,Harun Mustafa,Beth Mutai,Russell Y. Neches,Amanda Ng,Olga Nikolayeva,Tatyana Nikolayeva,Eileen Png,Krista Ryon,Jorge L. Sanchez,Heba Shaaban,Maria A. Sierra,Dominique Thomas,Ben Young,Omar O. Abudayyeh,Josue Alicea,Malay Bhattacharyya,Ran Blekhman,Eduardo Castro-Nallar,Ana M. Cañas,Aspassia D. Chatziefthimiou,Robert W. Crawford,Francesca De Filippis,Youping Deng,Christelle Desnues,Emmanuel Dias-Neto,Marius Dybwad,Eran Elhaik,Danilo Ercolini,Alina Frolova,Dennis Gankin,Jonathan S. Gootenberg,Alexandra B. Graf,David C. Green,Iman Hajirasouliha,Jaden J.A. Hastings,Mark Hernandez,Gregorio Iraola,Gregorio Iraola,Gregorio Iraola,Soojin Jang,André Kahles,André Kahles,Frank J. Kelly,Kaymisha Knights,Nikos C. Kyrpides,Paweł P. Łabaj,Paweł P. Łabaj,Patrick K. H. Lee,Marcus H. Y. Leung,Per O. Ljungdahl,Gabriella Mason-Buck,Ken McGrath,Cem Meydan,Emmanuel F. Mongodin,Milton Ozório Moraes,Niranjan Nagarajan,Marina Nieto-Caballero,Houtan Noushmehr,Manuela Oliveira,Stephan Ossowski,O. Osuolale,Orhan Özcan,David Paez-Espino,Nicolás Rascovan,Hugues Richard,Hugues Richard,Gunnar Rätsch,Gunnar Rätsch,Gunnar Rätsch,Lynn M. Schriml,Torsten Semmler,Osman U. Sezerman,Leming Shi,Tieliu Shi,Rania Siam,Le Huu Song,Haruo Suzuki,Denise Syndercombe Court,Scott Tighe,Xinzhao Tong,Klas I. Udekwu,Klas I. Udekwu,Juan A. Ugalde,Brandon Valentine,Dimitar Vassilev,Elena M. Vayndorf,Thirumalaisamy P. Velavan,Jun Wu,María Mercedes Zambrano,Jifeng Zhu,Sibo Zhu,Christopher E. Mason,Natasha Abdullah,Marcos Abraao,Ait-hamlat Adel,Muhammad Afaq,Faisal Alquaddoomi,Ireen Alam,Gabriela E. Albuquerque,Alex Alexiev,Kalyn Ali,Lucia E. 
Alvarado-Arnez,Sarh Aly,Jennifer Amachee,Maria João Amorim,Majelia Ampadu,Muhammad Al-Fath Amran,Nala An,Watson Andrew,Harilanto Andrianjakarivony,Michael Angelov,Verónica Antelo,Catharine Aquino,Álvaro Aranguren,Luiza Ferreira de Araújo,Hitler Francois Vasquez Arevalo,Jenny Arevalo,Carme Arnan,Lucia Elena Alvarado Arnez,Fernanda Arredondo,Matthew Arthur,Freddy Asenjo,Thomas Saw Aung,Juliette Auvinet,Nuria Aventin,Sadaf Ayaz,Silva Baburyan,Abd-Manaaf Bakere,Katrin Bakhl,Thais Fernanda Bartelli,Erdenetsetseg Batdelger,François Baudon,Kevin Becher,Carla Bello,Médine Benchouaia,Hannah Benisty,Anne-Sophie Benoiston,Joseph Benson,Diego Benítez,Juliana S Bernardes,Denis Bertrand,Silvia Beurmann,Tristan Bitard-Feildel,Lucie Bittner,Christina Black,Guillaume Blanc,Brittany Blyther,Toni Bode,Julia Boeri,Bazartseren Boldgiv,Kevin Bolzli,Alexia Bordigoni,Ciro Borrelli,Sonia Bouchard,Jean-Pierre Bouly,Alicia Boyd,Gabriela P. Branco,Alessandra Breschi,Björn Brindefalk,Christian Brion,Alan Briones,Paulina Buczansla,Catherine Burke,Aszia Burrell,Alina Butova,Irvind Buttar,Jalia Bynoe,Sven Bönigk,Kari Oline Bøifot,Hiram Caballero,Xiao Wen Cai,Dayana Calderon,Angela Cantillo,Miguel Carbajo,Alessandra Carbone,Anais Cardenas,Katerine Carrillo,Laurie Casalot,Sofia Castro,Ana Valeria B Castro,Astred Castro,Ana Valeria Castro,Simone Cawthorne,Jonathan Cedillo,Salama Chaker,Jasna Chalangal,Allison Chan,Anastasia Chasapi,Starr Chatziefthimiou,Sreya Ray Chaudhuri,Akash Keluth Chavan,Francisco Chavez,Gregory Chem,Xiaoqing Chen,Michelle B. Chen,Jenn-Wei Chen,Ariel Chernomoretz,Allaeddine Chettouh,Daisy Cheung,Diana Chicas,Shirley Chiu,Hira Choudhry,Carl Chrispin,Kianna Ciaramella,Erika Cifuentes,Jake Cohen,David A. Coil,Sylvie Collin,Colleen Conger,Romain Conte,Flavia Corsi,Cecilia N. Cossio,Ana Flávia Costa,Delisia Cuebas,Bruno D'Alessandro,Katherine E. Dahlhausen,Aaron E. Darling,Pujita Das,Lucinda B. Davenport,Laurent David,Natalie R. Davidson,Gargi Dayama,Stéphane Delmas,Chris K. 
Deng,Chloé Dequeker,Alexandre Desert,Monika Devi,Felipe Segato Dezem,Clara N. Dias,Timothy Donahoe,Sonia Dorado,LaShonda Dorsey,Valeriia Dotsenko,Steven Du,Alexandra Dutan,Naya Eady,Jonathan A. Eisen,Miar Elaskandrany,Lennard Epping,Juan P. Escalera-Antezana,Cassie L. Ettinger,Iqra Faiz,Luice Fan,Nadine Farhat,Emile Faure,Fazlina Fauzi,Charlie Feigin,Skye Felice,Laís Pereira Ferreira,Gabriel Figueroa,Aubin Fleiss,Denisse Flores,Jhovana L. Velasco Flores,Marcos A. S. Fonseca,Jonathan Foox,Juan Carlos Forero,Aaishah Francis,Kelly French,Pablo Fresia,Jacob Friedman,Jaime J. Fuentes,Josephine Galipon,Mathilde Garcia,Laura Garcia,Catalina García,Annie Geiger,Samuel M. Gerner,Sonia L. Ghose,Dao Phuong Giang,Matías Giménez,Donato Giovannelli,Dedan Githae,Spyridon Gkotzis,Liliana Godoy,Samantha L. Goldman,Gaston H. Gonnet,Juana Gonzalez,Andrea Gonzalez,Camila Gonzalez-Poblete,Andrew N. Gray,Tranette Gregory,Charlotte Greselle,Sophie Guasco,Juan Guerra,Nika Gurianova,Wolfgang Haehr,Sebastien Halary,Felix Hartkopf,Arya Hawkins-Zafarnia,Nur Hazlin Hazrin-Chong,Eric Helfrich,Eva Hell,Tamera Henry,Samuel Hernandez,Pilar Lopez Hernandez,David Hess-Homeier,Lauren E. Hittle,Nghiem Xuan Hoan,Aliaksei Holik,Chiaki Homma,Irene Hoxie,Michael Huber,Elizabeth Humphries,Stephanie L. Hyland,Andrea Hässig,Roland Häusler,Nathalie Hüsser,Robert A. Petit,Badamnyambuu Iderzorig,Mizuki Igarashi,Shaikh B. Iqbal,Shino Ishikawa,Sakura Ishizuka,Sharah Islam,Riham Islam,Kohei Ito,Sota Ito,Takayuki Ito,Tomislav Ivankovic,Tomoki Iwashiro,Sarah S. Jackson,JoAnn Jacobs,Marisano James,Marianne Jaubert,Marie-Laure Jerier,Esmeralda Jiminez,Ayantu Jinfessa,Ymke De Jong,Hyun Woo Joo,Guilllaume Jospin,Takema Kajita,Affifah Saadah Ahmad Kassim,Nao Kato,Amrit Kaur,Inderjit Kaur,Fernanda de Souza Gomes Kehdy,Vedbar S. Khadka,Shaira Khan,Mahshid Khavari,Michelle Ki,Gina Kim,Hyung Jun Kim,Sangwan Kim,Ryan J. 
King,Giuseppe KoLoMonaco,Ellen Koag,Nadezhda Kobko-Litskevitch,Maryna Korshevniuk,Michael Kozhar,Jonas Krebs,Nanami Kubota,Andrii Kuklin,Sheelta S. Kumar,Rachel Kwong,Lawrence Kwong,Ingrid Lafontaine,Juliana Lago,Tsoi Ying Lai,Elodie Laine,Manolo Laiola,Olha Lakhneko,Isha Lamba,Gerardo de Lamotte,Romain Lannes,Eleonora De Lazzari,Madeline Leahy,Hyun Jung Lee,Yunmi Lee,Lucy Lee,Vincent Lemaire,Emily Leong,Dagmara Lewandowska,Chenhao Li,Weijun Liang,Moses Lin,Priscilla Lisboa,Anna Litskevitch,Eric Minwei Liu,Tracy W. Liu,Mayra Arauco Livia,Yui Him Lo,Sonia Losim,Manon Loubens,Jennifer Q. Lu,Olexandr Lykhenko,Simona Lysakova,Salah Mahmoud,Sara Abdul Majid,Natalka Makogon,Denisse Maldonado,Krizzy Mallari,Tathiane M. Malta,Maliha Mamun,Dimitri Manoir,German Marchandon,Natalia Marciniak,Sonia Marinovic,Brunna Marques,Nicole Mathews,Yuri Matsuzaki,Vincent Matthys,Madelyn May,Elias McComb,Annabelle Meagher,Adiell Melamed,Wayne Menary,Katterinne N. Mendez,Ambar Mendez,Irène Mauricette Mendy,Irene Meng,Ajay Menon,Mark Menor,Roy Meoded,Nancy Merino,Karishma Miah,Mathilde Mignotte,Tanja Miketic,Wilson Miranda,Athena Mitsios,Ryusei Miura,Kunihiko Miyake,Maria Domenica Moccia,Natasha Mohan,Mohammed Mohsin,Karobi Moitra,Mauricio Moldes,Laura Molina,Jennifer Molinet,Orgil-Erdene Molomjamts,Eftar Moniruzzaman,Sookwon Moon,Isabelle de Oliveira Moraes,Mario Moreno,Maritza S Mosella,Josef W. Moser,Christopher Mozsary,Amanda L. Muehlbauer,Oasima Muner,Muntaha Munia,Naimah Munim,Maureen Muscat,Tatjana Mustac,Cristina Muñoz,Francesca Nadalin,Areeg Naeem,Dorottya Nagy-Szakal,Mayuko Nakagawa,Ashanti Narce,Masaki Nasu,Irene González Navarrete,Hiba Naveed,Bryan Nazario,Narasimha Rao Nedunuri,Thomas Neff,Aida Nesimi,Wan Chiew Ng,Synti Ng,Gloria Nguyen,Elsy Mankah Ngwa,Agier Nicolas,Pierre Nicolas,Abdollahi Nika,Hosna Noorzi,Avigdor Nosrati,Diana N. Nunes,Kathryn O’Brien,Niamh B. O’Hara,Gabriella Oken,Rantimi A. Olawoyin,Javier Quilez Oliete,Kiara Olmeda,Tolulope Oluwadare,Itunu A. 
Oluwadare,Nils Ordioni,Jenessa Orpilla,Jacqueline Orrego,Melissa Ortega,Princess Osma,Israel O. Osuolale,Oluwatosin M. Osuolale,Mitsuki Ota,Francesco Oteri,Yuya Oto,Rachid Ounit,Christos A. Ouzounis,Subhamitra Pakrashi,Rachel Paras,Coral Pardo-Esté,Youngja Park,Paulina Pastuszek,Suraj Patel,Jananan Pathmanathan,Andrea Patrignani,Manuel Perez,Ante Peros,Sabrina Persaud,Anisia Peters,Adam Phillips,Lisbeth Pineda,Melissa Pool Pizzi,Alma Plaku,Alketa Plaku,Brianna Pompa-Hogan,María Gabriela Portilla,Leonardo Posada,Max Priestman,Bharath Prithiviraj,Sambhawa Priya,Phanthira Pugdeethosal,Catherine E. Pugh,Benjamin Pulatov,Angelika Pupiec,Kyrylo Pyrshev,Tao Qing,Saher Rahiel,Savlatjon Rahmatulloev,Kannan Rajendran,Aneisa Ramcharan,Adan Ramirez-Rojas,Shahryar Rana,Prashanthi Ratnanandan,Timothy D. Read,Hubert Rehrauer,Renee Richer,Alexis Rivera,Michelle Rivera,Alessandro Robertiello,Courtney Robinson,Paula Rodríguez,Nayra Aguilar Rojas,Paul Roldán,Anyelic Rosario,Sandra Roth,Maria del Mar Vivanco Ruiz,Stephen Eduard Boja Ruiz,Kaitlan Russell,Mariia Rybak,Thais S. Sabedot,Mahfuza Sabina,Ikuto Saito,Yoshitaka Saito,Gustavo Adolfo Malca Salas,Cecilia Salazar,Kaung Myat San,Jorge Sanchez,Khaliun Sanchir,Ryan Sankar,Paulo Thiago de Souza Santos,Zulena Saravi,Kai Sasaki,Yuma Sato,Masaki Sato,Seisuke Sato,Ryo Sato,Kaisei Sato,Nowshin Sayara,Steffen Schaaf,Oli Schacher,Anna-Lena M. Schinke,Ralph Schlapbach,Christian Schori,Jason R. Schriml,Felipe Segato,Felipe Sepúlveda,Marianna S. Serpa,Paola Florez de Sessions,Juan C. Severyn,Maheen Shakil,Sarah Shalaby,Aliyah Shari,Hyenah Shim,Hikaru Shirahata,Yuh Shiwa,Ophélie Da Silva,Jordana M. Silva,Gwenola Simon,Shaleni K. Singh,Kasia Sluzek,Rebecca Smith,Eunice So,Núria Andreu Somavilla,Yuya Sonohara,Nuno Rufino de Sousa,Camila P. E. de Souza,Jason Sperry,Nicolas Sprinsky,Stefan G. 
Stark,Antonietta La Storia,Kiyoshi Suganuma,Hamood Suliman,Jill Sullivan,Arif Asyraf Md Supie,Chisato Suzuki,Sora Takagi,Fumie Takahara,Naoya Takahashi,Kou Takahashi,Tomoki Takeda,Isabella Kuniko T. Takenaka,Soma Tanaka,Anyi Tang,Yuk Man Tang,Emilio Tarcitano,Andrea Tassinari,Mahdi Taye,Alexis Terrero,Eunice Thambiraja,Antonin Thiébaut,Sade Thomas,Andrew Maltez Thomas,Yuto Togashi,Takumi Togashi,Anna Tomaselli,Masaru Tomita,Itsuki Tomita,Oliver Toth,Nora C. Toussaint,Jennifer M. Tran,Catalina Truong,Stefan I. Tsonev,Kazutoshi Tsuda,Takafumi Tsurumaki,Michelle Tuz,Yelyzaveta Tymoshenko,Carmen Urgiles,Mariko Usui,Sophie Vacant,Laura E. Vann,Fabienne Velter,Valeria Ventorino,Patricia Vera-Wolf,Riccardo Vicedomini,Michael A. Suarez-Villamil,Sierra Vincent,Renee Vivancos-Koopman,Andrew Wan,Cindy Wang,Tomoro Warashina,Ayuki Watanabe,Samuel Weekes,Johannes Werner,David A. Westfall,Lothar H. Wieler,Michelle D. Williams,Silver A. Wolf,Brian W. Wong,Yan Ling Wong,Tyler Wong,Rasheena Wright,Tina Wunderlin,Ryota Yamanaka,Jingcheng Yang,Hirokazu Yano,George C. Yeh,Olena Yemets,Tetiana Yeskova,Shusei Yoshikawa,Laraib Zafar,Yang Zhang,Shu Zhang,Amy Zhang,Yuanting Zheng,Stas Zubenko +681 more
TL;DR: This paper presented a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over three years, representing the first systematic, worldwide catalog of the urban microbial ecosystem.
Journal ArticleDOI
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Geraldo F. Oliveira,Juan Gómez-Luna,Lois Orosa,Saugata Ghose,Nandita Vijaykumar,Ivan Fernandez,Mohammad Sadrosadati,Onur Mutlu +7 more
TL;DR: In this paper, the authors perform a large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory.
References
Journal ArticleDOI
Basic Local Alignment Search Tool
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
Journal ArticleDOI
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Stephen F. Altschul,Thomas L. Madden,Alejandro A. Schäffer,Jinghui Zhang,Zheng Zhang,Webb Miller,David J. Lipman +6 more
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Journal ArticleDOI
STAR: ultrafast universal RNA-seq aligner
Alexander Dobin,Carrie A. Davis,Felix Schlesinger,Jorg Drenkow,Chris Zaleski,Sonali Jha,Philippe Batut,Mark Chaisson,Thomas R. Gingeras +8 more
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Journal ArticleDOI
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
Anton Bankevich,Sergey Nurk,Dmitry Antipov,Alexey Gurevich,Mikhail Dvorkin,Alexander S. Kulikov,Valery M. Lesin,Sergey I. Nikolenko,Son Pham,Andrey D. Prjibelski,Alexey V. Pyshkin,Alexander Sirotkin,Nikolay Vyahhi,Glenn Tesler,Max A. Alekseyev,Pavel A. Pevzner +15 more
TL;DR: SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.
Related Papers (5)
Space-efficient and exact de Bruijn graph representation based on a Bloom filter
Rayan Chikhi,Guillaume Rizk +1 more