Open Access Journal Article (DOI)

Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks.

TLDR
In this article, Li et al. developed two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals.
Abstract
Motivation: Large-scale, high-dimensional genome sequencing data pose computational challenges, and general-purpose optimization tools are usually suboptimal in terms of computational and memory performance for genetic data.

Results: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double-precision floating-point representation. Using this representation, we implement an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices; we name this solver snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix, and we exploit this sparsity to further reduce the memory requirement and to speed up computation. Our sparse genetic matrix implementation combines the compact 2-bit representation with a simplified compressed sparse block format, so that matrix-vector multiplications can be effectively parallelized across multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver, named sparse-snpnet, will also be included in the snpnet R package. Our implementation solves Lasso and group Lasso for linear, logistic, and Cox regression on sparse genetic matrices containing 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory.

Availability: https://github.com/rivas-lab/snpnet/tree/compact
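
The factor-of-32 saving follows directly from the encoding: each value in {0, 1, 2, NA} needs only two bits, so four genotypes fit in one byte, versus eight bytes each at double precision. The sketch below is a minimal illustration of such a packing scheme, not code from snpnet itself; the function names and the choice of 3 as the NA code are assumptions.

```python
import numpy as np

# Illustrative 2-bit packing of genotypes {0, 1, 2, NA}; NA is mapped to code 3.
# Four genotypes per byte vs. 8 bytes each as float64: a 32x memory reduction.

NA_CODE = 3

def pack_genotypes(g):
    """Pack a vector of genotype codes (0, 1, 2, or 3 for NA) into bytes."""
    g = np.asarray(g, dtype=np.uint8)
    pad = (-len(g)) % 4                      # pad to a multiple of 4 entries
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # entry i of each group of four occupies bits 2*i and 2*i+1 of the byte
    return (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte array."""
    shifted = packed[:, None] >> np.array([0, 2, 4, 6], dtype=np.uint8)
    return (shifted & 0b11).reshape(-1)[:n]

g = np.array([0, 1, 2, NA_CODE, 2, 0])
packed = pack_genotypes(g)
assert np.array_equal(unpack_genotypes(packed, len(g)), g)
print(f"{len(g)} genotypes in {packed.nbytes} bytes "
      f"(float64 would take {g.size * 8} bytes)")
```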
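
The abstract does not spell out the simplified compressed sparse block (CSB) format, so the following is only a generic illustration of the CSB idea, with all blocking details assumed: the matrix is tiled into fixed-size blocks, each block stores its nonzeros with small block-local indices, and since each block row writes to a disjoint slice of the output vector, block rows can be handed to different CPU cores without synchronization. A serial sketch:

```python
import numpy as np

# Generic compressed-sparse-block matvec: the n x m genotype matrix is tiled
# into b x b blocks; each block keeps only its nonzeros (rare-variant columns
# are mostly 0) with block-local row/col indices that fit in small integer
# types. Blocks in the same block row write to disjoint slices of y, so block
# rows can be processed on different CPU cores without locks.

def to_blocks(X, b):
    """Tile a dense matrix into {(bi, bj): (rows, cols, vals)} of nonzeros."""
    n, m = X.shape
    blocks = {}
    for bi in range(0, n, b):
        for bj in range(0, m, b):
            sub = X[bi:bi + b, bj:bj + b]
            r, c = np.nonzero(sub)
            if r.size:
                blocks[(bi, bj)] = (r.astype(np.uint16), c.astype(np.uint16),
                                    sub[r, c].astype(np.int8))
    return blocks

def csb_matvec(blocks, n, v):
    """y = X @ v, block by block; this loop is what a parallel version splits."""
    y = np.zeros(n)
    for (bi, bj), (r, c, vals) in blocks.items():
        np.add.at(y, bi + r, vals * v[bj + c])   # accumulate partial products
    return y

rng = np.random.default_rng(0)
X = rng.choice([0, 0, 0, 0, 0, 0, 1, 2], size=(8, 12)).astype(float)
v = rng.standard_normal(12)
assert np.allclose(csb_matvec(to_blocks(X, b=4), X.shape[0], v), X @ v)
```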
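
For squared-error loss, an accelerated proximal gradient method for group Lasso is the FISTA iteration with group-wise soft-thresholding as the proximal step. Below is a dense-matrix sketch under that reading; the group structure, step-size rule, and all names are illustrative, and the actual sparse-snpnet solver runs on the sparse 2-bit matrices and also handles logistic and Cox losses.

```python
import numpy as np

def prox_group_lasso(beta, groups, t, lam):
    """Prox of t * lam * sum_g ||beta_g||_2: group-wise soft-thresholding."""
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= t * lam else (1 - t * lam / norm) * beta[g]
    return out

def fista_group_lasso(X, y, groups, lam, n_iter=500):
    """Minimize (1/2n)||y - X beta||^2 + lam * sum_g ||beta_g||_2 by FISTA."""
    n, p = X.shape
    t = n / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of grad
    beta, z, s = np.zeros(p), np.zeros(p), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n                     # smooth-part gradient
        beta_new = prox_group_lasso(z - t * grad, groups, t, lam)
        s_new = (1 + np.sqrt(1 + 4 * s ** 2)) / 2
        z = beta_new + (s - 1) / s_new * (beta_new - beta)  # momentum step
        beta, s = beta_new, s_new
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
groups = [np.arange(i, i + 5) for i in range(0, 30, 5)]  # six groups of five
beta_true = np.zeros(30)
beta_true[:5] = 1.0                                      # one active group
y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = fista_group_lasso(X, y, groups, lam=0.1)
print([round(float(np.linalg.norm(beta_hat[g])), 2) for g in groups])
```

The group norms printed at the end should come out clearly nonzero for the first group and at or near zero for the rest, which is the variable-selection behavior the group penalty is used for.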


Citations
Journal Article (DOI)

Significant sparse polygenic risk scores across 813 traits in UK Biobank

TL;DR: The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank.
Posted Content (DOI)

Significant Sparse Polygenic Risk Scores across 428 traits in UK Biobank

TL;DR: In this article, a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank is presented.
Journal Article (DOI)

Construction and validation of prognostic prediction established on N6-methyladenosine related genes in cervical squamous cell carcinoma

TL;DR: Wang et al. used data from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) to construct and validate a prognostic prediction model based on m6A-related genes in cervical cancer.
Journal Article (DOI)

Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity

TL;DR: In this paper, a compression algorithm for sparse asymmetric gene mutations (CA_SAGM), based on the characteristics of sparse genomic mutation data, is proposed; the data are first sorted on a row-first basis so that neighboring non-zero elements are as close to each other as possible.
References
Journal Article (DOI)

Regularization and variable selection via the elastic net

TL;DR: It is shown that the elastic net often outperforms the lasso while enjoying a similar sparsity of representation, and an algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like the LARS algorithm does for the lasso.
Journal Article (DOI)

Regularization Paths for Generalized Linear Models via Coordinate Descent

TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods, can handle large problems, and can also deal efficiently with sparse features.
Journal Article (DOI)

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) is presented that preserves the computational simplicity of ISTA but has a global rate of convergence proven to be significantly better, both theoretically and practically.
Journal Article (DOI)

Model selection and estimation in regression with grouped variables

TL;DR: In this paper, instead of selecting factors by stepwise backward elimination, the authors focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection.
Journal Article (DOI)

Second-generation PLINK: rising to the challenge of larger and richer datasets

TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.