Author

Minzhu Xie

Bio: Minzhu Xie is an academic researcher from Hunan Normal University. The author has contributed to research in topics: Chromosome (genetic algorithm) & Genome-wide association study. The author has an h-index of 6 and has co-authored 10 publications receiving 222 citations. Previous affiliations of Minzhu Xie include University of California, Riverside.

Papers
Journal ArticleDOI
TL;DR: A prediction framework named XGBFEMF for identifying essential proteins, which includes a SUB-EXPAND-SHRINK method for constructing composite features from the original features and obtaining a better subset of features for essential protein prediction, and also includes a model fusion method for obtaining a more effective prediction model.
Abstract: Essential proteins, as a vital part of maintaining the cells’ life, play an important role in the study of biology and drug design. With the generation of large amounts of biological data related to essential proteins, an increasing number of computational methods have been proposed. Unlike methods that adopt a single machine learning method or an ensemble machine learning method, this paper proposes a prediction framework named XGBFEMF for identifying essential proteins, which includes a SUB-EXPAND-SHRINK method for constructing composite features from the original features and obtaining a better subset of features for essential protein prediction, and also includes a model fusion method for obtaining a more effective prediction model. We carry out experiments on Yeast data to assess the performance of XGBFEMF with ROC analysis, accuracy analysis, and top analysis. We also set up experiments on E. coli data to validate its performance. The test results show that the XGBFEMF framework can effectively improve many essential indicators. In addition, we analyze each step in the XGBFEMF framework; our results show that each step of the SUB-EXPAND-SHRINK method, as well as the multi-model fusion step, can improve prediction performance.

104 citations

Journal ArticleDOI
TL;DR: This article develops a simple, fast and effective algorithm to detect genome-wide multi-locus epistatic interactions based on the clustering of relatively frequent items that is fast and more powerful in general than some recently proposed methods.
Abstract: Motivation: In genome-wide association studies (GWAS), up to millions of single nucleotide polymorphisms (SNPs) are genotyped for thousands of individuals. However, conventional single locus-based approaches are usually unable to detect gene–gene interactions underlying complex diseases. Due to the huge search space for complicated high order interactions, many existing multi-locus approaches are slow and may suffer from low detection power for GWAS. Results: In this article, we develop a simple, fast and effective algorithm to detect genome-wide multi-locus epistatic interactions based on the clustering of relatively frequent items. Extensive experiments on simulated data show that our algorithm is fast and more powerful in general than some recently proposed methods. On a real genome-wide case–control dataset for age-related macular degeneration (AMD), the algorithm has identified genotype combinations that are significantly enriched in the cases. Availability: http://www.cs.ucr.edu/~minzhux/EDCF.zip Contact: minzhux@cs.ucr.edu; jingli@cwru.edu Supplementary information: Supplementary data are available at Bioinformatics online.

69 citations

Journal ArticleDOI
TL;DR: H-PoP and H-PoPG are proposed, based on dynamic programming and a strategy of limiting the number of intermediate solutions at each iteration, to solve the two models, respectively, and are much faster and more accurate than the recent state-of-the-art polyploid haplotyping algorithms.
Abstract: Author(s): Xie, Minzhu; Wu, Qiong; Wang, Jianxin; Jiang, Tao. Motivation: Some economically important plants including wheat and cotton have more than two copies of each chromosome. With the decreasing cost and increasing read length of next-generation sequencing technologies, reconstructing the multiple haplotypes of a polyploid genome from its sequence reads becomes practical. However, the computational challenge in polyploid haplotyping is much greater than that in diploid haplotyping, and there are few related methods. Results: This article models the polyploid haplotyping problem as an optimal poly-partition problem of the reads, called the Polyploid Balanced Optimal Partition model. For the reads sequenced from a k-ploid genome, the model tries to divide the reads into k groups such that the difference between the reads of the same group is minimized while the difference between the reads of different groups is maximized. When the genotype information is available, the model is extended to the Polyploid Balanced Optimal Partition with Genotype constraint problem. These models are all NP-hard. We propose two heuristic algorithms, H-PoP and H-PoPG, based on dynamic programming and a strategy of limiting the number of intermediate solutions at each iteration, to solve the two models, respectively. Extensive experimental results on simulated and real data show that our algorithms can solve the models effectively, and are much faster and more accurate than the recent state-of-the-art polyploid haplotyping algorithms. The experiments also show that our algorithms can deal with long reads and deep read coverage effectively and accurately. Furthermore, H-PoP might be applied to help determine the ploidy of an organism. Availability and implementation: https://github.com/MinzhuXie/H-PoPG Contact: xieminzhu@hotmail.com Supplementary information: Supplementary data are available at Bioinformatics online.

50 citations

Journal ArticleDOI
TL;DR: A new optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping, which generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and reduces to either model under extreme parameter values.
Abstract: Due to the difficulty in separating two (paternal and maternal) copies of a chromosome, most published human genome sequences only provide genotype information, i.e., the mixed information of the underlying two haplotypes. However, phased haplotype information is needed to completely understand complex genetic polymorphisms and to increase the power of genome-wide association studies for complex diseases. With the rapid development of DNA sequencing technologies, reconstructing a pair of haplotypes from an individual's aligned DNA fragments by computer algorithms (i.e., Single Individual Haplotyping) has become a practical haplotyping approach. In the paper, we combine two measures, "errors corrected" and "fragments cut", and propose a new optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping. The model generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and reduces to either model under extreme parameter values. To solve the model, we design a heuristic dynamic programming algorithm, H-BOP. By limiting the number of intermediate solutions at each iteration to an appropriately chosen small integer k, H-BOP is able to solve the model efficiently. Extensive experimental results on simulated and real data show that when k = 8, H-BOP is generally faster and more accurate than a recent state-of-the-art algorithm, ReFHap, in haplotype reconstruction. The running time of H-BOP is linearly dependent on some of the key parameters controlling the input size, and H-BOP scales well to large input data. The code of H-BOP is available to the public for free upon request to the corresponding author.
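For readers unfamiliar with the Minimum Error Correction (MEC) measure named in this abstract, the following is a minimal sketch of how an MEC score is commonly computed for a given split of fragments into two groups. It is not the authors' BOP model or the H-BOP algorithm. The fragment encoding (strings over `0`, `1`, and `-` for an uncovered site) and the function names `consensus`, `errors`, and `mec_score` are assumptions made for illustration only.

```python
from collections import Counter


def consensus(fragments, n_sites):
    """Majority allele per site within one group ('-' if no fragment covers the site)."""
    hap = []
    for j in range(n_sites):
        counts = Counter(f[j] for f in fragments if f[j] != "-")
        hap.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(hap)


def errors(fragment, hap):
    """Number of covered sites where the fragment disagrees with the haplotype."""
    return sum(
        1 for a, h in zip(fragment, hap) if a != "-" and h != "-" and a != h
    )


def mec_score(group_a, group_b, n_sites):
    """Total corrections needed so every fragment matches its group's consensus."""
    total = 0
    for group in (group_a, group_b):
        hap = consensus(group, n_sites)
        total += sum(errors(f, hap) for f in group)
    return total


# Hypothetical toy input: five aligned fragments over four SNP sites.
group_a = ["0101", "01-1", "0001"]
group_b = ["1010", "10-0"]
print(consensus(group_a, 4), consensus(group_b, 4), mec_score(group_a, group_b, 4))
```

A lower score means the two groups are internally more consistent. Finding the partition that minimizes this score is the hard part, and this sketch does not attempt it.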

28 citations

Journal ArticleDOI
TL;DR: An accurate HLA gene type inference algorithm is designed by utilizing SNP genotype data from pedigrees, known HLA gene types of some individuals and the relationship between inferred SNP haplotypes and HLA gene types to achieve higher accuracy.
Abstract: Background: The human leukocyte antigen system (HLA) contains many highly variable genes. HLA genes play an important role in the human immune system, and HLA gene matching is crucial for the success of human organ transplantations. Numerous studies have demonstrated that variation in HLA genes is associated with many autoimmune, inflammatory and infectious diseases. However, typing HLA genes by serology or PCR is time consuming and expensive, which limits large-scale studies involving HLA genes. Since it is much easier and cheaper to obtain single nucleotide polymorphism (SNP) genotype data, accurate computational algorithms to infer HLA gene types from SNP genotype data are in need. To infer HLA types from SNP genotypes, the first step is to infer SNP haplotypes from genotypes. However, for the same SNP genotype data set, the haplotype configurations inferred by different methods are usually inconsistent, and it is often difficult to decide which one is true. Results: In this paper, we design an accurate HLA gene type inference algorithm by utilizing SNP genotype data from pedigrees, known HLA gene types of some individuals and the relationship between inferred SNP haplotypes and HLA gene types. Given a set of haplotypes inferred from the genotypes of a population consisting of many pedigrees, the algorithm first constructs a weighted similarity graph based on a new haplotype similarity measure and derives constraint edges from known HLA gene types. Based on the principle that different HLA gene alleles should have different background haplotypes, the algorithm searches for an optimal labeling of all the haplotypes with unknown HLA gene types such that the total weight among the same HLA gene types is maximized. To deal with ambiguous haplotype solutions, we use a genetic algorithm to select haplotype configurations that tend to maximize the same optimization criterion. 
Our experiments on a previously typed subset of the HapMap data show that the algorithm is highly accurate, achieving an accuracy of 96% for gene HLA-A, 95% for HLA-B, 97% for HLA-C, 84% for HLA-DRB1, 98% for HLA-DQA1 and 97% for HLA-DQB1 in a leave-one-out test. Conclusions: Our algorithm can infer HLA gene types from neighboring SNP genotype data accurately. Compared with a recent approach on the same input data, our algorithm achieved a higher accuracy. The code of our algorithm is available to the public for free upon request to the corresponding authors.

27 citations


Cited by
Journal ArticleDOI
06 Jun 2013-PLOS ONE
TL;DR: Describes a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I and class II HLA loci, and shows how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals.
Abstract: DNA sequence variation within human leukocyte antigen (HLA) genes mediates susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N=918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.

576 citations

Journal ArticleDOI
TL;DR: The purpose of this Review is to summarize recent directions in methodology for detecting epistasis and to discuss evidence of the role of epistasis in human complex trait variation.
Abstract: Genome-wide association studies (GWASs) have become the focus of the statistical analysis of complex traits in humans, successfully shedding light on several aspects of genetic architecture and biological aetiology. Single-nucleotide polymorphisms (SNPs) are usually modelled as having additive, cumulative and independent effects on the phenotype. Although evidently a useful approach, it is often argued that this is not a realistic biological model and that epistasis (that is, the statistical interaction between SNPs) should be included. The purpose of this Review is to summarize recent directions in methodology for detecting epistasis and to discuss evidence of the role of epistasis in human complex trait variation. We also discuss the relevance of epistasis in the context of GWASs and potential hazards in the interpretation of statistical interaction terms.

391 citations

Journal ArticleDOI
TL;DR: In this paper, the authors describe new single-molecule protein sequencing and identification technologies alongside innovations in mass spectrometry that will eventually enable broad sequence coverage in single-cell profiling.
Abstract: Single-cell profiling methods have had a profound impact on the understanding of cellular heterogeneity. While genomes and transcriptomes can be explored at the single-cell level, single-cell profiling of proteomes is not yet established. Here we describe new single-molecule protein sequencing and identification technologies alongside innovations in mass spectrometry that will eventually enable broad sequence coverage in single-cell profiling. These technologies will in turn facilitate biological discovery and open new avenues for ultrasensitive disease diagnostics. This Perspective describes new single-molecule protein sequencing and identification technologies alongside innovations in mass spectrometry that will eventually enable broad sequence coverage in single-cell proteomics.

142 citations

15 Nov 2016
TL;DR: Liu et al. as discussed by the authors presented TopPIC, a tool that identifies and characterizes complex proteoforms with unknown primary structure alterations, such as amino acid mutations and post-translational modifications, by searching top-down tandem mass spectra against a protein database.
Abstract: Top-down mass spectrometry enables the observation of whole complex proteoforms in biological samples and provides crucial information complementary to bottom-up mass spectrometry. Because of the complexity of top-down mass spectra and proteoforms, it is a challenging problem to efficiently interpret top-down tandem mass spectra in high-throughput proteome-level proteomics studies. We present TopPIC, a tool that efficiently identifies and characterizes complex proteoforms with unknown primary structure alterations, such as amino acid mutations and post-translational modifications, by searching top-down tandem mass spectra against a protein database. Availability and implementation: http://proteomics.informatics.iupui.edu/software/toppic/ Contact: xwliu@iupui.edu Supplementary information: Supplementary data are available at Bioinformatics online.

109 citations

Journal ArticleDOI
TL;DR: RC shear walls are commonly used as lateral load-resisting elements in seismic regions, and the estimation of their shear strengths can become simultaneously design-critical and complex.
Abstract: RC shear walls are commonly used as lateral load-resisting elements in seismic regions, and the estimation of their shear strengths can become simultaneously design-critical and complex whe...

103 citations