Showing papers by "Zhong Wang" published in 2017
••
Bielefeld University1, BRICS2, University of Düsseldorf3, Oregon State University4, University of California, San Diego5, University of Copenhagen6, Aarhus University7, Roskilde University8, Joint Genome Institute9, Pittsburgh Supercomputing Center10, Saint Petersburg State University11, Max Planck Society12, University of Vienna13, University of Technology, Sydney14, Centre national de la recherche scientifique15, Genome Institute of Singapore16, University of Warwick17, University of Tübingen18, Intel19, French Institute for Research in Computer Science and Automation20, Taipei Medical University21, Joint BioEnergy Institute22, Lawrence Berkeley National Laboratory23, Georgia Institute of Technology24, University of Calgary25, University of Göttingen26, National Health Research Institutes27, San Diego State University28, Boyce Thompson Institute for Plant Research29, Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior30, Robert Koch Institute31, University of Maryland, College Park32, Newcastle University33, Leibniz Association34, ETH Zurich35
TL;DR: The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups.
Abstract: Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
593 citations
••
Bielefeld University1, University of Düsseldorf2, Oregon State University3, University of California, Berkeley4, BRICS5, University of Copenhagen6, Joint Genome Institute7, Pittsburgh Supercomputing Center8, Saint Petersburg State University9, Chinese Academy of Sciences10, University of Vienna11, University of Technology, Sydney12, Centre national de la recherche scientifique13, Genome Institute of Singapore14, University of Warwick15, Aarhus University16, Intel17, French Institute for Research in Computer Science and Automation18, Taipei Medical University19, Lawrence Berkeley National Laboratory20, Max Planck Society21, University of Calgary22, University of Göttingen23, National Health Research Institutes24, San Diego State University25, Boyce Thompson Institute for Plant Research26, Robert Koch Institute27, University of Maryland, College Park28, Newcastle University29, Leibniz Association30, ETH Zurich31
TL;DR: Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups.
Abstract: In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performance, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
59 citations
••
01 Jan 2017
TL;DR: This paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud and documents the key parameters that could lead to significant performance benefits.
Abstract: The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this study, we aim to minimize the cloud computing cost spent on bioinformatics data analysis by optimizing the Hadoop parameters identified as significant. When using MapReduce-based bioinformatics tools in the cloud, the default settings often lead to resource underutilization and wasteful expenses. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our case study. Experimental results show that, with the fine-tuned parameters, we achieve a total speedup of 4× compared with the original performance (using the default settings). This paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud, and documents the key parameters that could lead to significant performance benefits.
15 citations
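For readers unfamiliar with the k-mer counting workload named in the abstract above, the following is a minimal single-process sketch of how it decomposes into a map step and a reduce step. It is an illustration only, not the paper's implementation: it does not use Hadoop, and the function names (`map_kmers`, `reduce_counts`, `count_kmers`) and the sample reads are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step (hypothetical helper): emit a (k-mer, 1) pair for every
    length-k window in a single read. Reads shorter than k emit nothing."""
    read = read.upper()
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step (hypothetical helper): sum the counts emitted per k-mer."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals


def count_kmers(reads, k):
    """Run the map step over all reads, then reduce the emitted pairs."""
    emitted = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(emitted)


# Made-up example reads, not data from the paper.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

In a real MapReduce job the emitted pairs would be shuffled across the cluster between the two steps. That shuffle stage is where parameters such as buffer sizes and the number of reducers typically matter, but this sketch makes no claim about which parameters the paper tuned.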