Author

Yanxiang Chen

Bio: Yanxiang Chen is an academic researcher from the University of Hong Kong. The author has contributed to research on the topics Hybrid genome assembly and Sequence assembly. The author has an h-index of 7 and has co-authored 13 publications receiving 4,504 citations. Previous affiliations of Yanxiang Chen include Peking University.

Papers

Journal Article•DOI•

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Ruibang Luo¹, Binghang Liu¹, Yinlong Xie¹, Yinlong Xie², Zhenyu Li¹, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, Jingbo Tang, Gengxiong Wu, Hao Zhang, Yujian Shi, Yong Liu, Chang Yu, Bo Wang, Yao Lu, Changlei Han, David W. Cheung¹, Siu-Ming Yiu¹, Shaoliang Peng³, Zhu Xiao-qian³, Guangming Liu³, Xiangke Liao³, Yingrui Li¹, Huanming Yang, Jian Wang, Tak-Wah Lam¹, Jun Wang

University of Hong Kong¹, South China University of Technology², National University of Defense Technology³

27 Dec 2012-GigaScience

TL;DR: This work provides an updated assembly version of the 2008 Asian genome using SOAPdenovo2, a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genome.

Abstract: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genome. Benchmark using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive to other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower during the point of largest memory consumption.
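
The contig and scaffold N50 values quoted in this abstract follow the usual definition of N50: the length L such that sequences of length L or longer together cover at least half of the total assembly length. A minimal sketch of that calculation is given here for orientation only. The function name `n50` and the toy lengths are assumptions made for illustration and are not taken from SOAPdenovo2.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical values, not from the paper).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # prints 70
```

With these toy lengths the total is 290, and the two longest contigs (80 + 70 = 150) already cover more than half of it, so the N50 is 70.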

4,284 citations

Posted Content•

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu, Yujian Shi, Jianying Yuan, Xuesong Hu, Hao Zhang, Nan Li, Zhenyu Li, Yanxiang Chen, Desheng Mu, Wei Fan

09 Aug 2013-arXiv: Genomics

TL;DR: The k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve the understanding of a species genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms.

Abstract: Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmental and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics. Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating the genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data. Conclusion: Based on this research, we show that the k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++, and freely accessible at Github URL (this https URL) or BGI ftp ( this ftp URL).
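
For orientation, the simplest form of k-mer based genome size estimation divides the total number of k-mers observed in the reads by the depth at the main peak of the k-mer frequency histogram. The sketch below shows only that basic peak-depth estimate. It is not the modeling framework described in the abstract, it does not handle heterozygosity or coverage bias, and the names `kmer_counts`, `estimate_genome_size`, and `min_depth` are assumptions introduced for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(reads, k, min_depth=2):
    """Estimate genome size as total k-mers divided by peak k-mer depth.

    k-mers seen fewer than `min_depth` times are treated as likely
    sequencing errors and ignored (a crude assumption of this sketch).
    """
    counts = kmer_counts(reads, k)
    # Histogram: depth -> number of distinct k-mers seen at that depth.
    histogram = Counter(counts.values())
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth shared by the most distinct k-mers.
    peak_depth = max(kept, key=lambda d: (kept[d], d))
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy input: five error-free reads that each span a 20 bp sequence.
toy_genome = "ACGTTGCAAGCTAGGATCCA"
reads = [toy_genome] * 5
print(estimate_genome_size(reads, k=4))
```

Note that the estimate is expressed in k-mer positions, so for a single linear sequence of length G it approaches G - k + 1 rather than G itself.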

317 citations

Journal Article•DOI•

Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph.

Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, Bicheng Yang, Wei Fan

01 Jan 2012-Briefings in Functional Genomics

TL;DR: A detailed comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph is made, from how they match the Lander-Waterman model, to the required sequencing depth and reads length.

Abstract: Since the completion of the cucumber and panda genome projects using Illumina sequencing in 2009, the global scientific community has had to pay much more attention to this new cost-effective approach to generate the draft sequence of large genomes. To allow new users to more easily understand the assembly algorithms and the optimum software packages for their projects, we make a detailed comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph, from how they match the Lander-Waterman model, to the required sequencing depth and reads length. We also discuss the computational efficiency of each class of algorithm, the influence of repeats and heterozygosity and points of note in the subsequent scaffold linkage and gap closure steps. We hope this review can help further promote the application of second-generation de novo sequencing, as well as aid the future development of assembly algorithms.

238 citations

Journal Article•DOI•

Erratum to "SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler" [GigaScience, (2012), 1, 18]

Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, Jingbo Tang, Gengxiong Wu, Hao Zhang, Yujian Shi, Yong Liu, Chang Yu, Bo Wang, Yao Lu, Changlei Han, David W. Cheung, Siu-Ming Yiu, Shaoliang Peng, Zhu Xiao-qian, Guangming Liu, Xiangke Liao, Yingrui Li, Huanming Yang, Jian Wang, Tak-Wah Lam, Jun Wang

01 Jan 2015

TL;DR: This is an erratum to the article "SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler", published in GigaScience (2012), 1, 18.

200 citations

Journal Article•DOI•

pIRS: Profile-based Illumina pair-end reads simulator

Xuesong Hu¹, Jianying Yuan¹, Yujian Shi¹, Jianliang Lu¹, Binghang Liu¹, Zhenyu Li¹, Yanxiang Chen¹, Desheng Mu¹, Hao Zhang¹, Nan Li¹, Zhen Yue¹, Fan Bai¹, Heng Li¹, Wei Fan¹

Peking University¹

01 Jun 2012-Bioinformatics

TL;DR: A software package, pIRS (profile-based Illumina pair-end reads simulator), which simulates Illumina reads with empirical Base-Calling and GC%-depth profiles trained from real re-sequencing data, fits the properties of real sequencing data better than existing simulators.

Abstract: Motivation: The next generation high-throughput sequencing technologies, especially from Illumina, have been widely used in re-sequencing and de novo assembly studies. However, there is not yet any software that can simulate Illumina reads with real error and quality distributions and coverage bias, which would be very useful for developing relevant software and for designing sequencing projects. Results: We provide a software package, pIRS (profile based Illumina pair-end Reads Simulator), which simulates Illumina reads with empirical Base-Calling and GC%-depth profiles trained from real re-sequencing data. The error and quality distributions as well as coverage bias patterns of simulated reads using pIRS fit the properties of real sequencing data better than existing simulators. In addition, pIRS also comes with a tool to simulate heterozygous diploid genomes. Availability: pIRS is written in C++ and Perl, and is freely available at ftp://ftp.genomics.org.cn/pub/pIRS/ .

178 citations

Cited by

Journal Article•DOI•

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Daehwan Kim¹, Joseph M. Paggi², Chanhee Park¹, Christopher Bennett¹, Steven L. Salzberg³

University of Texas Southwestern Medical Center¹, Stanford University², Johns Hopkins University³

01 Aug 2019-Nature Biotechnology

TL;DR: This work presents a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index, and uses it to represent and search an expanded model of the human reference genome.

Abstract: The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays. A graph-based genome indexing scheme enables variant-aware alignment of sequences with very low memory requirements.

4,855 citations

Journal Article•DOI•

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li¹, Chi-Man Liu¹, Ruibang Luo¹, Kunihiko Sadakane¹, Tak-Wah Lam¹

National Institute of Informatics¹

15 May 2015-Bioinformatics

TL;DR: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner; on a soil dataset it generated an assembly three times larger than previous methods, with longer contig N50 and average contig length.

Abstract: Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252 Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., no pre-processing like partitioning and normalization was needed. When compared with previous methods (Chikhi and Rizk, 2012; Howe, et al., 2014) on assembling the soil data, MEGAHIT generated an assembly three times larger, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a 4-fold improvement. Availability: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under GPLv3 license. Contact: rb@l3-bioinfo.com, twlam@cs.hku.hk

3,634 citations

Journal Article•DOI•

PEAR: a fast and accurate Illumina Paired-End reAd mergeR

Jiajie Zhang¹, Kassian Kobert¹, Tomáš Flouri¹, Alexandros Stamatakis¹

Heidelberg Institute for Theoretical Studies¹

01 Mar 2014-Bioinformatics

TL;DR: The PEAR software merges raw Illumina paired-end reads from target fragments of varying length. It evaluates all possible paired-end read overlaps, does not require the target fragment size as input, and implements a statistical test for minimizing false-positive results.

Abstract: Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths. Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR. Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.

3,270 citations

Posted Content•

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li¹, Chi-Man Liu¹, Ruibang Luo¹, Kunihiko Sadakane¹, Tak-Wah Lam¹

National Institute of Informatics¹

25 Sep 2014-arXiv: Genomics

TL;DR: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner; it assembles the data as a whole, avoiding pre-processing such as partitioning and normalization, which might compromise result integrity.

Abstract: MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing like partitioning and normalization, which might compromise on result integrity. MEGAHIT generates 3 times larger assembly, with longer contig N50 and average contig length than the previous assembly. 55.8% of the reads were aligned to the assembly, which is 4 times higher than the previous. The source code of MEGAHIT is freely available at this https URL under GPLv3 license.

2,673 citations

Journal Article•DOI•

Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life.

Fredrik Bäckhed¹, Fredrik Bäckhed², Josefine Roswall¹, Yangqing Peng, Qiang Feng², Huijue Jia, Petia Kovatcheva-Datchary¹, Yin Li, Yan Xia, Hailiang Xie, Huanzi Zhong, Muhammad Tanweer Khan¹, Jianfeng Zhang, Junhua Li, Liang Xiao, Jumana Y. Al-Aama³, Dongya Zhang, Ying Shiuan Lee¹, Dorota Ewa Kotowska², Camilla Colding², Valentina Tremaroli¹, Ye Yin, Stefan Bergman¹, Xun Xu, Lise Madsen⁴, Lise Madsen², Karsten Kristiansen², Jovanna Dahlgren¹, Jun Wang

University of Gothenburg¹, University of Copenhagen², King Abdulaziz University³, National Institute of Nutrition, Hyderabad⁴

13 May 2015-Cell Host & Microbe

TL;DR: The gut microbiota of infants delivered by C-section showed significantly less resemblance to their mothers and nutrition had a major impact on early microbiota composition and function, with cessation of breast-feeding, rather than introduction of solid food, being required for maturation into an adult-like microbiota.

2,227 citations
