Journal ArticleDOI

Discretized Gaussian mixture for genotyping of microsatellite loci containing homopolymer runs

01 Mar 2014-Bioinformatics (Oxford University Press)-Vol. 30, Iss: 5, pp 652-659
TL;DR: GenoTan, a program using a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation of microsatellite loci from short sequence reads without paired-end information, effectively distinguishes length variants from noise including insertion/deletion errors in homopolymer runs by addressing the bidirectional aspect of insertion and deletion errors in sequence reads.
Abstract: Motivation: Inferring lengths of inherited microsatellite alleles with single base pair resolution from short sequence reads is challenging due to several sources of noise caused by the repetitive nature of microsatellites and the technologies used to generate raw sequence data. Results: We have developed a program, GenoTan, using a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation of microsatellite loci from short sequence reads without paired-end information. It effectively distinguishes length variants from noise including insertion/deletion errors in homopolymer runs by addressing the bidirectional aspect of insertion and deletion errors in sequence reads. Here we first introduce a homopolymer decomposition method which estimates error bias toward insertion or deletion in homopolymer sequence runs. Combining these approaches, GenoTan was able to genotype 94.9% of microsatellite loci accurately from simulated data with 40× sequence coverage quickly, while the other programs showed <90% correct calls for the same data and required 5-30× more computational time than GenoTan. It also showed the highest true-positive rate for real data using mixed sequence data of two Drosophila inbred lines, which was a novel validation approach for genotyping. Availability: GenoTan is open-source software available at http://genotan.sourceforge.net.
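The abstract's central device, a mixture of discretized Gaussians over per-read repeat-tract lengths, can be illustrated with a small sketch. This is a toy model for intuition only, not GenoTan's implementation: the fixed standard deviation, the equal mixing weights and the candidate-allele enumeration are assumptions made for the example.

```python
from itertools import combinations_with_replacement
from math import erf, sqrt, log

def discretized_gaussian_pmf(k, mu, sigma):
    """P(observed length == k) under a Gaussian centred on allele length mu,
    integrated over the unit-width bin [k - 0.5, k + 0.5]."""
    cdf = lambda x: 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))
    return cdf(k + 0.5) - cdf(k - 0.5)

def genotype_log_likelihood(observed_lengths, allele_a, allele_b, sigma=0.7):
    """Log-likelihood of a diploid genotype (allele_a, allele_b) given per-read
    repeat-tract lengths; equal 50/50 mixing weights assumed."""
    ll = 0.0
    for k in observed_lengths:
        p = 0.5 * discretized_gaussian_pmf(k, allele_a, sigma) \
          + 0.5 * discretized_gaussian_pmf(k, allele_b, sigma)
        ll += log(max(p, 1e-12))  # guard against log(0) for distant outliers
    return ll

def call_genotype(observed_lengths, sigma=0.7):
    """Pick the pair of observed lengths that maximises the mixture likelihood."""
    candidates = sorted(set(observed_lengths))
    return max(combinations_with_replacement(candidates, 2),
               key=lambda g: genotype_log_likelihood(observed_lengths, *g, sigma=sigma))

# Toy example: reads supporting a 14/16 heterozygote, with one-unit indel noise.
reads = [14, 14, 13, 14, 16, 16, 15, 16, 14, 16]
print(call_genotype(reads))  # (14, 16)
```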


Citations
Journal ArticleDOI
TL;DR: It is demonstrated that exSTRa can be effectively utilized as a screening tool for detecting repeat expansions in WES and WGS data, although the best performance would be produced by consensus calling, wherein at least two out of the four currently available screening methods call an expansion.
Abstract: Repeat expansions cause more than 30 inherited disorders, predominantly neurogenetic. These can present with overlapping clinical phenotypes, making molecular diagnosis challenging. Single-gene or small-panel PCR-based methods can help to identify the precise genetic cause, but they can be slow and costly and often yield no result. Researchers are increasingly performing genomic analysis via whole-exome and whole-genome sequencing (WES and WGS) to diagnose genetic disorders. However, until recently, analysis protocols could not identify repeat expansions in these datasets. We developed exSTRa (expanded short tandem repeat algorithm), a method that uses either WES or WGS to identify repeat expansions. Performance of exSTRa was assessed in a simulation study. In addition, four retrospective cohorts of individuals with eleven different known repeat-expansion disorders were analyzed with exSTRa. We assessed results by comparing the findings to known disease status. Performance was also compared to three other analysis methods (ExpansionHunter, STRetch, and TREDPARSE), which were developed specifically for WGS data. Expansions in the assessed STR loci were successfully identified in WES and WGS datasets by all four methods with high specificity and sensitivity. Overall, exSTRa demonstrated more robust and superior performance for WES data than did the other three methods. We demonstrate that exSTRa can be effectively utilized as a screening tool for detecting repeat expansions in WES and WGS data, although the best performance would be produced by consensus calling, wherein at least two out of the four currently available screening methods call an expansion.
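The consensus-calling recommendation at the end of the abstract, report a locus only when at least two of the four screening methods flag an expansion, reduces to a simple vote. The method names, locus identifiers and threshold below are placeholders for illustration.

```python
def consensus_expansion_calls(calls_by_method, min_support=2):
    """calls_by_method maps a method name to the set of locus IDs it flags as
    expanded. A locus is reported when at least `min_support` methods agree."""
    support = {}
    for loci in calls_by_method.values():
        for locus in loci:
            support[locus] = support.get(locus, 0) + 1
    return {locus for locus, n in support.items() if n >= min_support}

# Hypothetical per-method calls for a handful of disease loci.
calls = {
    "exSTRa":          {"FMR1", "HTT", "ATXN1"},
    "ExpansionHunter": {"FMR1", "HTT"},
    "STRetch":         {"HTT"},
    "TREDPARSE":       {"FMR1", "DMPK"},
}
print(consensus_expansion_calls(calls))  # {'FMR1', 'HTT'} (set order may vary)
```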

89 citations

Journal ArticleDOI
TL;DR: A strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression is presented, and large numbers of variants detected by NGS are distilled to a limited set of variants prioritized as potentially deleterious changes.
Abstract: Sequencing of both healthy and disease singletons yields many novel and low frequency variants of uncertain significance (VUS). Complete gene and genome sequencing by next generation sequencing (NGS) significantly increases the number of VUS detected. While prior studies have emphasized protein coding variants, non-coding sequence variants have also been proven to significantly contribute to high penetrance disorders, such as hereditary breast and ovarian cancer (HBOC). We present a strategy for analyzing different functional classes of non-coding variants based on information theory (IT) and prioritizing patients with large intragenic deletions. We captured and enriched for coding and non-coding variants in genes known to harbor mutations that increase HBOC risk. Custom oligonucleotide baits spanning the complete coding, non-coding, and intergenic regions 10 kb up- and downstream of ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, and TP53 were synthesized for solution hybridization enrichment. Unique and divergent repetitive sequences were sequenced in 102 high-risk, anonymized patients without identified mutations in BRCA1/2. Aside from protein coding and copy number changes, IT-based sequence analysis was used to identify and prioritize pathogenic non-coding variants that occurred within sequence elements predicted to be recognized by proteins or protein complexes involved in mRNA splicing, transcription, and untranslated region (UTR) binding and structure. This approach was supplemented by in silico and laboratory analysis of UTR structure. 15,311 unique variants were identified, of which 245 occurred in coding regions. With the unified IT-framework, 132 variants were identified and 87 functionally significant VUS were further prioritized. An intragenic 32.1 kb interval in BRCA2 that was likely hemizygous was detected in one patient. We also identified 4 stop-gain variants and 3 reading-frame altering exonic insertions/deletions (indels). We have presented a strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression. This approach distills large numbers of variants detected by NGS to a limited set of variants prioritized as potential deleterious changes.
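A minimal sketch of the information-theory (IT) scoring idea, roughly in the spirit of individual-information (R_i) weight matrices used to rank binding-site variants: per-position weights of about 2 + log2 f(b, l), and a variant prioritized by how much it lowers the site's total information. The toy alignment, the pseudocount and the omission of the small-sample correction are assumptions for illustration, not the authors' actual pipeline.

```python
from math import log2

def ri_weight_matrix(aligned_sites, pseudocount=0.5):
    """Per-position weights R_iw(b, l) ~ 2 + log2(f(b, l)); the small-sample
    correction term is omitted for brevity."""
    length = len(aligned_sites[0])
    matrix = []
    for pos in range(length):
        counts = {b: pseudocount for b in "ACGT"}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: 2.0 + log2(counts[b] / total) for b in "ACGT"})
    return matrix

def ri(sequence, matrix):
    """Individual information of one candidate site, in bits."""
    return sum(matrix[i][base] for i, base in enumerate(sequence))

# Toy donor-site-like alignment (made up), then a reference vs variant site.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = ri_weight_matrix(sites)
ref, var = "CAGGTAAGT", "CAGGCAAGT"   # change at the otherwise invariant +1 position
print(ri(ref, pwm), ri(var, pwm), ri(var, pwm) - ri(ref, pwm))  # delta R_i < 0
```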

25 citations


Cites background from "Discretized Gaussian mixture for ge..."

  • ...As previously reported [147], we noted that false positive variant calls within intronic and intergenic regions were the most common consequence of dephasing in low complexity, pyrimidine-enriched intervals....

  • ...3’ SSs and SRFBSs), it may prove essential to adopt or develop alignment software that explicitly and correctly identifies variants in these regions [147]....

  • ...Intronic and intergenic variants proximate to low complexity sequences tend to generate false positive variants due to ambiguous alignment, a well known technical issue in short read sequence analysis [146, 147], contributing to this discrepancy....

Journal ArticleDOI
TL;DR: An algorithm to automatically and efficiently genotype microsatellites from a collection of reads sorted by individual, which can be used to genotype any microsatellite locus from any organism and has been tested on 454 pyrosequencing data of several loci from fruit flies and red deer.
Abstract: Microsatellites are widely used in population genetics to uncover recent evolutionary events. They are typically genotyped using a capillary sequencer, whose capacity is usually limited to 9, at most 12, loci per run, and whose analysis is a tedious task performed by hand. With the rise of next-generation sequencing (NGS), a much larger number of loci and individuals are available from sequencing: for example, on a single run of a GS Junior, 28 loci from 96 individuals are sequenced with 30× coverage. We have developed an algorithm to automatically and efficiently genotype microsatellites from a collection of reads sorted by individual (e.g. specific PCR amplifications of a locus or a collection of reads that encompass a locus of interest). As the sequencing and the PCR amplification introduce artefactual insertions or deletions, the set of reads from a single microsatellite allele shows several length variants. The algorithm infers, without alignment, the true unknown allele(s) of each individual from the observed distributions of microsatellite lengths of all individuals. MicNeSs, a Python implementation of the algorithm, can be used to genotype any microsatellite locus from any organism and has been tested on 454 pyrosequencing data of several loci from fruit flies (a model species) and red deer (a non-model species). Without any parallelization, it automatically genotypes 22 loci from 441 individuals in 11 hours on a standard computer. The comparison of MicNeSs inferences to the standard method shows an excellent agreement, with some differences illustrating the pros and cons of both methods.
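The inference step, recovering the true allele(s) of an individual from a noisy distribution of per-read repeat lengths, can be caricatured with a mode-based rule. This naive sketch is only a stand-in for intuition: MicNeSs models the length distributions jointly across individuals, whereas here the second-allele support threshold is an arbitrary assumption.

```python
from collections import Counter

def call_alleles(read_lengths, het_fraction=0.35):
    """Naive genotype call from per-read microsatellite lengths. The most
    frequent length is the first allele; a second length is accepted as a real
    allele only if its read support is at least `het_fraction` of the first
    allele's support, otherwise it is treated as stutter noise."""
    counts = Counter(read_lengths).most_common()
    best_len, best_n = counts[0]
    for length, n in counts[1:]:
        if n >= het_fraction * best_n:
            return tuple(sorted((best_len, length)))  # heterozygote
    return (best_len, best_len)                       # homozygote

# Reads from one PCR amplicon: a 20/23 heterozygote with one-unit stutter.
reads = [20, 20, 19, 20, 20, 23, 23, 22, 23, 20, 23]
print(call_alleles(reads))  # (20, 23)
```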

22 citations


Cites background from "Discretized Gaussian mixture for ge..."

  • ...Alternatively, it is possible to study their variability directly from the NGS whole genome sequence, once the reads are mapped to a reference genome (Fondon et al. 2012; Gymrek et al. 2012; Tae et al. 2014; Ummat & Bashir 2014)....

  • ...The theoretical distribution we have chosen derives from (Tae et al. 2014)....

  • ...For example, as the intensities add up, when the two alleles of an heterozygote have similar length, the resulting distribution can show one single mode which would inevitably lead to a false assignment (Tae et al. 2014)....

Journal ArticleDOI
TL;DR: The Pheno2Geno package makes use of genome-wide molecular profiling and provides a tool for high-throughput de novo map construction and saturation of existing genetic maps.
Abstract: Background: Genetic markers and maps are instrumental in quantitative trait locus (QTL) mapping in segregating populations. The resolution of QTL localization depends on the number of informative recombinations in the population and how well they are tagged by markers. Larger populations and denser marker maps are better for detecting and locating QTLs. Marker maps that are initially too sparse can be saturated or derived de novo from high-throughput omics data (e.g. gene expression, protein or metabolite abundance). If these molecular phenotypes are affected by genetic variation due to a major QTL, they will show a clear multimodal distribution. Using this information, phenotypes can be converted into genetic markers. Results: The Pheno2Geno tool uses mixture modeling to select phenotypes and transform them into genetic markers suitable for construction and/or saturation of a genetic map. Pheno2Geno excludes candidate genetic markers that show evidence for multiple, possibly epistatically interacting, QTLs and/or interaction with the environment, in order to provide a set of robust markers for follow-up QTL mapping. We demonstrate the use of Pheno2Geno on gene expression data of 370,000 probes in 148 A. thaliana recombinant inbred lines. Pheno2Geno is able to saturate the existing genetic map, decreasing the average distance between markers from 7.1 cM to 0.89 cM, close to the theoretical limit of 0.68 cM (with 148 individuals we expect a recombination every 100/148=0.68 cM); this pinpointed almost all of the informative recombinations in the population. Conclusion: The Pheno2Geno package makes use of genome-wide molecular profiling and provides a tool for high-throughput de novo map construction and saturation of existing genetic maps. Processing of the showcase dataset takes less than 30 minutes on an average desktop PC. Pheno2Geno improves QTL mapping results at no additional laboratory cost and with minimum computational effort. Its results are formatted for direct use in R/qtl, the leading R package for QTL studies. Pheno2Geno is freely available on CRAN under the GNU GPL v3. The Pheno2Geno package as well as the tutorial can also be found at: http://pheno2geno.nl.
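The phenotype-to-marker conversion rests on the bimodality argument above: fit a two-component mixture to one molecular phenotype across the population and assign each individual the genotype of its dominant component. The sketch below uses scikit-learn's GaussianMixture on simulated data as a generic stand-in; Pheno2Geno itself is an R package and applies additional filters (multi-QTL and environment interactions) not shown here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def phenotype_to_marker(expression, min_posterior=0.8):
    """Fit a 2-component Gaussian mixture to one expression trait measured in a
    RIL population and return per-individual genotype calls ('A', 'B' or 'NA'),
    keeping only calls whose posterior exceeds `min_posterior`."""
    x = np.asarray(expression, dtype=float).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(x)
    order = np.argsort(gm.means_.ravel())              # low mean -> 'A', high -> 'B'
    post = gm.predict_proba(x)[:, order]
    calls = np.where(post[:, 1] > post[:, 0], "B", "A").astype(object)
    calls[post.max(axis=1) < min_posterior] = "NA"      # ambiguous individuals
    return calls

# Simulated bimodal expression for 148 RILs (two genotype classes).
rng = np.random.default_rng(1)
expr = np.concatenate([rng.normal(5.0, 0.4, 74), rng.normal(8.0, 0.4, 74)])
print(phenotype_to_marker(expr)[:10])
```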

15 citations

Journal ArticleDOI
TL;DR: It is shown that LT-RPA improves the limit of detection of MSI compared to PCR up to four times, notably for small deletions, and simplifies the identification of the mutant alleles.
Abstract: Microsatellites are polymorphic short tandem repeats of 1-6 nucleotides ubiquitously present in the genome that are extensively used in living organisms as genetic markers and in oncology to detect microsatellite instability (MSI). While the standard analysis method of microsatellites is based on PCR followed by capillary electrophoresis, it generates undesirable frameshift products known as 'stutter peaks', caused by polymerase slippage, that can greatly complicate the analysis and interpretation of the data. Here we present an easy multiplexable approach replacing PCR that is based on low-temperature isothermal amplification using recombinase polymerase amplification (LT-RPA), which drastically reduces and sometimes completely abolishes the formation of stutter artifacts, thus greatly simplifying the calling of the alleles. Using HT17, a mononucleotide DNA repeat that was previously proposed as an optimal marker to detect MSI in tumor DNA, we showed that LT-RPA improves the limit of detection of MSI by up to four-fold compared to PCR, notably for small deletions, and simplifies the identification of the mutant alleles. It was successfully applied to clinical colorectal cancer samples and enabled detection of MSI. This easy-to-handle, rapid and cost-effective approach may deeply improve the analysis of microsatellites in several biological and clinical applications.

14 citations

References
Journal ArticleDOI
TL;DR: As discussed by the authors, SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, a variant caller and an alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
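For scripting the same post-processing steps from Python, pysam (a wrapper around the htslib/samtools C library) exposes indexing and region-wise traversal of alignments. The BAM path and region below are placeholders, and this is an illustrative usage sketch rather than part of the cited work.

```python
import pysam

bam_path = "sample.sorted.bam"   # placeholder: a coordinate-sorted BAM file
pysam.index(bam_path)            # equivalent to `samtools index`

with pysam.AlignmentFile(bam_path, "rb") as bam:
    # Iterate alignments overlapping a region, as an alignment viewer would.
    for read in bam.fetch("chr1", 100000, 101000):
        if read.is_unmapped or read.is_duplicate:
            continue
        print(read.query_name, read.reference_start, read.cigarstring)
```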

45,957 citations

Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
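The backward search over the Burrows-Wheeler Transform that BWA builds on can be demonstrated at toy scale: build the BWT of the reference via its suffix array, then narrow an interval of the index one query character at a time, right to left, to count exact occurrences. This sketch covers exact matching only; BWA's contribution is doing this with mismatches and gaps on a compressed index of a whole genome.

```python
def bwt_index(text):
    """Build the BWT of `text` plus the C table and occurrence counts."""
    text += "$"                                            # unique terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # toy suffix array
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in set(text)}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ, len(text)

def backward_search(query, c_table, occ, n):
    """Count exact occurrences of `query` by shrinking the BWT interval."""
    lo, hi = 0, n
    for ch in reversed(query):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

c_table, occ, n = bwt_index("GATTACAGATTACA")
print(backward_search("ATTA", c_table, occ, n))   # 2
```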

43,862 citations


"Discretized Gaussian mixture for ge..." refers methods in this paper

  • ...The reads were aligned to the human genome reference NCBI build 37 by BWA and realigned by GATK....

  • ...The performance of genotyping programs were compared for different mapping results generated by two different mapping programs, BWA and Novoalign (http://novocraft.com)....

  • ...After BWA mapping and GATK realignment, microsatellite loci satisfying the following three conditions were chosen for the comparison....

  • ...To create the input for GenoTan, BWA (Li and Durbin, 2009) and GATK were used to map the sequence reads to the reference and to realign the reads, respectively....

  • ...GATK, DIndel, GenoTan and RepeatSeq had correct percentages of 79.8%, 92.4%, 91.8% and 53.7% with BWA mapping, respectively, and 84.3%, 95.6%, 95.4% and 55.0% with Novoalign mapping....

Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
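The MapReduce philosophy the framework is organised around, emit a small record per read and fold the records into a running summary, can be mimicked in a few lines. The depth-of-coverage toy below illustrates that pattern generically; it is not GATK's API (GATK walkers are written in Java against the toolkit's traversal engine).

```python
from collections import Counter
from functools import reduce

# Each read is (reference_start, read_length); positions are 0-based.
reads = [(100, 5), (102, 5), (103, 4), (110, 3)]

def map_read(read):
    """Map step: emit one (position, 1) record per base the read covers."""
    start, length = read
    return Counter({pos: 1 for pos in range(start, start + length)})

def reduce_counts(acc, counts):
    """Reduce step: merge per-read records into a running coverage summary."""
    acc.update(counts)
    return acc

coverage = reduce(reduce_counts, map(map_read, reads), Counter())
print(sorted(coverage.items()))  # per-position depth, e.g. (103, 3), (104, 3), ...
```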

20,557 citations

Journal ArticleDOI
TL;DR: In this article, a base-calling program for automated sequencer traces, phred, with improved accuracy is presented; it is shown to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined, independent of position in read, machine running conditions, or sequencing chemistry.
Abstract: The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.
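The reliable accuracy measure this work is associated with is the phred quality score, a log transform of the per-base error probability (Q = -10 log10 P) that is still used to annotate base calls in sequencing output today; the two-way conversion is shown below for reference.

```python
from math import log10

def phred_quality(error_probability):
    """Phred quality score Q = -10 * log10(P_error)."""
    return -10.0 * log10(error_probability)

def error_probability(q):
    """Inverse transform: P_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

print(phred_quality(0.001))    # 30.0  (1 error in 1,000 bases)
print(error_probability(20))   # 0.01  (1 error in 100 bases)
```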

7,627 citations

Journal ArticleDOI
TL;DR: A new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size is presented and its ability to detect tandem repeats that have undergone extensive mutational change is demonstrated.
Abstract: A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm’s speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.
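As a contrast to the statistically grounded algorithm described above, the underlying signal, adjacent copies of a short pattern matching each other, can be found naively for exact repeats only. The sketch below ignores mismatches, indels and significance testing entirely, so it is a toy illustration of the concept rather than a reimplementation of TRF.

```python
def exact_tandem_repeats(seq, max_period=6, min_copies=3):
    """Report (start, period, copies) for runs of exact adjacent tandem copies."""
    hits = []
    for period in range(1, max_period + 1):
        i = 0
        while i + 2 * period <= len(seq):
            copies = 1
            while seq[i + copies * period: i + (copies + 1) * period] == seq[i:i + period]:
                copies += 1
            if copies >= min_copies:
                hits.append((i, period, copies))
                i += copies * period          # skip past the reported run
            else:
                i += 1
    return hits

print(exact_tandem_repeats("GGCACACACACATTTTTTAGT"))
# finds the CA x5 run and the T homopolymer (reported at more than one period)
```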

6,577 citations


"Discretized Gaussian mixture for ge..." refers methods in this paper

  • ...To create a list of microsatellite loci, TRF (Benson, 1999) was used to search repeat sequences including incomplete repeat sets....

  • ...For users who want to use TRF (Benson 1999), an additional PERL script to convert the TRF results to the microsatellite list is available in our software package....
