scispace - formally typeset
Open AccessJournal ArticleDOI

SCALCE: boosting sequence compression algorithms using locally consistent encoding

Faraz Hach, +3 more
- 01 Dec 2012 - 
- Vol. 28, Iss: 23, pp 3051-3057
TLDR
SCALCE, a 'boosting' scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome is presented.
Abstract
Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome. Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19—when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly even the running time of SCALCE þ gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE þ gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip through the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 improvement in the compression rate and 1.26 improvement in running time.

read more

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI

Computational solutions for omics data

TL;DR: A review of state-of-the-art techniques for analyzing omics data can be found in this article, where specific examples have facilitated and enriched analyses of sequence, transcriptomic and network data sets.

Computational solutions for omics data

TL;DR: This Review samples the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data.
Journal ArticleDOI

Compression of FASTQ and SAM format sequencing data.

TL;DR: Several compression entries from the SequenceSqueeze contest are presented, including the winning entry, and the tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs.
Journal ArticleDOI

MFCompress: a compression tool for fasta and multi-fasta data

TL;DR: MFCompress is described, specially designed for the compression of FASTA and multi-FASTA files, which can provide additional average compression gains of almost 50%, and potentially doubles the available storage, although at the cost of some more computation time.
Journal ArticleDOI

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

TL;DR: A novel reference-free method meant to compress data issued from high throughput sequencing technologies, based on a reference probabilistic de Bruijn Graph, which allows to obtain higher compression rates without losing pertinent information for downstream analyses.
References
More filters
Journal ArticleDOI

The Sequence Alignment/Map format and SAMtools

TL;DR: SAMtools as discussed by the authors implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Journal ArticleDOI

Fast and accurate short read alignment with Burrows–Wheeler transform

TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Journal ArticleDOI

Base-calling of automated sequencer traces using Phred. I. accuracy assessment

TL;DR: In this article, a base-calling program for automated sequencer traces, phred, with improved accuracy was proposed. But it was not shown to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.
Related Papers (5)