Author
Liqing Zhang
Other affiliations: University of Chicago, University of California, Irvine, Virginia Bioinformatics Institute
Bio: Liqing Zhang is an academic researcher at Virginia Tech. The author has contributed to research on the topics Genome and Gene, has an h-index of 30, and has co-authored 120 publications receiving 3566 citations. Previous affiliations of Liqing Zhang include University of Chicago & University of California, Irvine.
Topics: Genome, Gene, Indel, Population, Metagenomics
Papers
••
Virginia Tech1, United States Department of Agriculture2, University of Maryland, College Park3, Wageningen University and Research Centre4, European Bioinformatics Institute5, Roche Applied Science6, University of Edinburgh7, Virginia Bioinformatics Institute8, Utah State University9, National Institutes of Health10, University of California, Davis11, Michigan State University12, Texas A&M University13, Leipzig University14, Children's Hospital Oakland Research Institute15, Institute for Animal Health16, Seoul National University17, University of Marburg18, Wellcome Trust Sanger Institute19, University of Delaware20, University of Vienna21, University of Minnesota22
TL;DR: The combined application of next-generation sequencing platforms has provided an economical approach to unlocking the potential of the turkey genome.
Abstract: A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
415 citations
••
TL;DR: The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice, and DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs.
Abstract: Growing concerns about increasing rates of antibiotic resistance call for expanded and comprehensive global monitoring. Advancing methods for monitoring of environmental media (e.g., wastewater, agricultural waste, food, and water) is especially needed for identifying potential resources of novel antibiotic resistance genes (ARGs), hot spots for gene exchange, and as pathways for the spread of ARGs and human exposure. Next-generation sequencing now enables direct access and profiling of the total metagenomic DNA pool, where ARGs are typically identified or predicted based on the “best hits” of sequence searches against existing databases. Unfortunately, this approach produces a high rate of false negatives. To address such limitations, we propose here a deep learning approach, taking into account a dissimilarity matrix created using all known categories of ARGs. Two deep learning models, DeepARG-SS and DeepARG-LS, were constructed for short read sequences and full gene length sequences, respectively. Evaluation of the deep learning models over 30 antibiotic resistance categories demonstrates that the DeepARG models can predict ARGs with both high precision (> 0.97) and recall (> 0.90). The models displayed an advantage over the typical best hit approach, yielding consistently lower false negative rates and thus higher overall recall (> 0.9). As more data become available for under-represented ARG categories, the DeepARG models’ performance can be expected to be further enhanced due to the nature of the underlying neural networks. Our newly developed ARG database, DeepARG-DB, encompasses ARGs predicted with a high degree of confidence and extensive manual inspection, greatly expanding current ARG repositories. The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice. DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs. 
The DeepARG models and database are available as a command line version and as a Web service at http://bench.cs.vt.edu/deeparg.
402 citations
••
TL;DR: The results show that, in comparison to tissue-specific genes, housekeeping genes on average evolve more slowly and are under stronger selective constraints as reflected by significantly smaller values of Ka/Ks, and contrary to the old textbook concept, approximately 74% of the housekeeping genes in this study belong to multigene families, not significantly different from that of the tissue-specific genes.
Abstract: Do housekeeping genes, which are turned on most of the time in almost every tissue, evolve more slowly than genes that are turned on only at specific developmental times or tissues? Recent large-scale gene expression studies enable us to have a better definition of housekeeping genes and to address the above question in detail. In this study, we examined 1581 human-mouse orthologous gene pairs for their patterns of sequence evolution, contrasting housekeeping genes with tissue-specific genes. Our results show that, in comparison to tissue-specific genes, housekeeping genes on average evolve more slowly and are under stronger selective constraints as reflected by significantly smaller values of Ka/Ks. Besides stronger purifying selection, we explored several other factors that can possibly slow down nonsynonymous rates in housekeeping genes. Although mutational bias might slightly slow the nonsynonymous rates in housekeeping genes, it is unlikely to be the major cause of the rate difference between the two types of genes. The codon usage pattern of housekeeping genes does not seem to differ from that of tissue-specific genes. Moreover, contrary to the old textbook concept, we found that approximately 74% of the housekeeping genes in our study belong to multigene families, not significantly different from that of the tissue-specific genes (approximately 70%). Therefore, the stronger selective constraints on housekeeping genes are not due to a lower degree of genetic redundancy.
356 citations
••
TL;DR: Insight is provided into the evolution and distribution of SSRs in the two sequenced model plant genomes of monocots and dicots and reveals that the distributions appear highly non-random and vary a great deal in different regions of the genes in the genomes.
Abstract: Simple sequence repeats (SSRs) in DNA have been traditionally thought of as functionally unimportant and have been studied mainly as genetic markers. A recent handful of studies have shown, however, that SSRs in different positions of a gene can play important roles in determining protein function, genetic development, and regulation of gene expression. We have performed a detailed comparative study of the distribution of SSRs in the sequenced genomes of Arabidopsis thaliana and rice. SSRs in different genic regions - 5'untranslated region (UTR), 3'UTR, exon, and intron - show distinct patterns of distribution both within and between the two genomes. Especially notable is the much higher density of SSRs in 5'UTRs compared to the other regions and a strong affinity towards trinucleotide repeats in these regions for both rice and Arabidopsis. On a genomic level, mononucleotide repeats are the most prevalent type of SSRs in Arabidopsis and trinucleotide repeats are the most prevalent type in rice. Both plants have the same most common mononucleotide (A/T) and dinucleotide (AT and AG) repeats, but have little in common for the other types of repeats. Our work provides insight into the evolution and distribution of SSRs in the two sequenced model plant genomes of monocots and dicots. Our analyses reveal that the distributions of SSRs appear highly non-random and vary a great deal in different regions of the genes in the genomes.
221 citations
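For readers unfamiliar with how SSRs are located in a sequence, the following is a minimal sketch of a perfect-repeat scan. The function name `find_ssrs` and the thresholds (motifs up to 3 bp, at least 4 tandem copies) are illustrative assumptions and are not the criteria used in the paper above.

```python
import re

def find_ssrs(seq, max_motif_len=3, min_repeats=4):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats.

    Thresholds are hypothetical defaults chosen for illustration only.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif_len + 1):
        # A k-bp motif followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are just a shorter unit repeated (e.g. "AA" is "A" twice).
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)
```

Counting the returned hits by motif length, separately for each annotated region (5'UTR, 3'UTR, exon, intron), would give the kind of per-region distribution the abstract describes; the region annotation itself is outside this sketch.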
••
TL;DR: PseKNC-General (the general form of pseudo k-tuple nucleotide composition) is developed, that allows for fast and accurate computation of all the widely used nucleotide structural and physicochemical properties of both DNA and RNA sequences.
Abstract: Associate Editor: John Hancock. Summary: The avalanche of genomic sequences generated in the post-genomic age requires efficient computational methods for rapidly and accurately identifying biological features from sequence information. Towards this goal, we developed a freely available and open-source package, called PseKNC-General (the general form of pseudo k-tuple nucleotide composition), that allows for fast and accurate computation of all the widely used nucleotide structural and physicochemical properties of both DNA and RNA sequences. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau–Broto autocorrelation coefficient, Moran autocorrelation coefficient, Geary autocorrelation coefficient, Type I PseKNC and Type II PseKNC. In every mode, >100 physicochemical properties are available for choosing. Moreover, it is flexible enough to allow the users to calculate PseKNC with user-defined properties. The package can be run on Linux, Mac and Windows systems and also provides a graphical user interface. Availability and implementation: The package is freely available at: http://lin.uestc.edu.cn/server/pseknc. Contact: chenweiimu@gmail.com or lqzhang@vt.edu or kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online. Received on July 22, 2014; revised on August 19, 2014; accepted on August 31, 2014.
198 citations
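As an illustration of the simplest mode named in the abstract above, the conventional k-tuple nucleotide composition can be sketched as the normalized frequency of every length-k word in a sequence. This is not the PseKNC-General package or its interface; the function name `ktuple_composition` and its parameters are hypothetical, and the pseudo components based on physicochemical properties are not covered.

```python
from itertools import product

def ktuple_composition(seq, k=2, alphabet="ACGT"):
    """Return a dict mapping each k-tuple to its relative frequency in seq.

    Illustrative helper only; windows containing characters outside
    `alphabet` (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}
```

For a DNA sequence and k = 2 this yields a 16-dimensional feature vector; an RNA sequence would need `alphabet="ACGU"`.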
Cited by
28 Jul 2005
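TL;DR: PfEMP1 interacts with one or more receptors on infected erythrocytes, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion.
Abstract: Antigenic variation allows many pathogenic microorganisms to evade host immune responses. Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), expressed on the surface of infected erythrocytes, interacts with one or more receptors on infected erythrocytes, endothelial cells, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion. The var gene family encodes about 60 members per haploid genome, and switching transcription among different var gene variants provides the molecular basis for antigenic variation.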
18,940 citations
••
[...]
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
13,246 citations
•
TL;DR: It is suggested that natural selection against large insertions and deletions is so weak that a large amount of variation is maintained in a population.
11,521 citations
01 Jun 2012
TL;DR: SPAdes is a new assembler for both single-cell and standard (multicell) assembly that improves on the recently released E+V-SC assembler and on the popular assemblers Velvet and SoapDeNovo (for multicell data).
Abstract: The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online ( http://bioinf.spbau.ru/spades ). It is distributed as open source software.
10,124 citations