Weizhong Li

Other affiliations: Sanford-Burnham Institute for Medical Research, University of California, San Diego, Peking University

Bio: Weizhong Li is an academic researcher at the J. Craig Venter Institute. The author has contributed to research on the topics Microbiome and Metagenomics. The author has an h-index of 43 and has co-authored 98 publications receiving 20,927 citations. Previous affiliations of Weizhong Li include the Sanford-Burnham Institute for Medical Research and the University of California, San Diego.

Topics: Microbiome, Metagenomics, Cluster analysis, Workflow, Butyrate

Papers published on a yearly basis (chart omitted; years shown: 1999–2003 and 2005–2023)

Papers

Journal Article

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li¹, Adam Godzik¹

Sanford-Burnham Institute for Medical Research¹

01 Jul 2006-Bioinformatics

TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.

Abstract: Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering; here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: [email protected]

8,306 citations

Journal Article

Cd-hit

Limin Fu¹, Beifang Niu¹, Zhengwei Zhu¹, Sitao Wu¹, Weizhong Li¹

University of California, San Diego¹

01 Dec 2012-Bioinformatics

TL;DR: A new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets to reduce sequence redundancy and improve the performance of other sequence analyses is developed.

Abstract: Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. Availability: http://cd-hit.org. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

5,959 citations

Journal Article

CD-HIT Suite

Ying Huang¹, Beifang Niu¹, Ying Gao¹, Limin Fu¹, Weizhong Li¹

University of California, San Diego¹

01 Mar 2010-Bioinformatics

TL;DR: A new web server, CD-HIT Suite, is developed for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels and users can now interactively explore the clusters within web browsers.

Abstract: Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. In order to further assist the CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels. Availability: Free access at http://cd-hit.org Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

2,084 citations

Journal Article

Clustering of highly homologous sequences to reduce the size of large protein databases

Weizhong Li¹, Lukasz Jaroszewski, Adam Godzik

San Diego Supercomputer Center¹

01 Mar 2001-Bioinformatics

TL;DR: A fast and flexible program for clustering large protein databases at different sequence identity levels takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC.

Abstract: Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560 000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches. Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi

884 citations

Journal Article

The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Shibu Yooseph¹, Granger G. Sutton¹, Douglas B. Rusch¹, Aaron L. Halpern¹, Shannon J. Williamson¹, Karin A. Remington¹, Jonathan A. Eisen¹,², Karla B. Heidelberg¹, Gerard Manning³, Weizhong Li⁴, Lukasz Jaroszewski⁴, Piotr Cieplak⁴, Christopher S. Miller⁵, Huiying Li⁵, Susan T. Mashiyama⁶, Marcin P. Joachimiak⁶, Christopher van Belle⁶, John-Marc Chandonia⁶,⁷, David A W Soergel⁶, Yufeng Zhai³, Kannan Natarajan⁸, Shaun W. Lee⁸, Benjamin J. Raphael⁹, Vineet Bafna⁸, Robert Friedman¹, Steven E. Brenner⁶, Adam Godzik⁴, David Eisenberg⁵, Jack E. Dixon⁸, Susan S. Taylor⁸, Robert L. Strausberg¹, Marvin Frazier¹, J. Craig Venter¹

J. Craig Venter Institute¹, University of California, Davis², Salk Institute for Biological Studies³, Sanford-Burnham Institute for Medical Research⁴, University of California, Los Angeles⁵, University of California, Berkeley⁶, Lawrence Berkeley National Laboratory⁷, University of California, San Diego⁸, Brown University⁹

01 Mar 2007-PLOS Biology

TL;DR: This work used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling sequences to add a great deal of diversity to known protein families and shed light on their evolution.

Abstract: Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

871 citations

Cited by

Journal Article

I and i

Kevin Barraclough

08 Dec 2001-BMJ

TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.

Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Journal Article

QIIME allows analysis of high-throughput community sequencing data.

J. Gregory Caporaso¹, Justin Kuczynski¹, Jesse Stombaugh¹, Kyle Bittinger², Frederic D. Bushman², Elizabeth K. Costello¹, Noah Fierer³, Antonio Gonzalez Peña¹, Julia K. Goodrich¹, Jeffrey I. Gordon⁴, Gavin A. Huttley⁵, Scott T. Kelley⁶, Dan Knights¹, Jeremy E. Koenig⁷, Ruth E. Ley⁷, Catherine A. Lozupone¹, Daniel McDonald¹, Brian D. Muegge⁴, Meg Pirrung¹, Jens Reeder¹, Joel Sevinsky, Peter J. Turnbaugh⁴, William A. Walters¹, Jeremy Widmann¹, Tanya Yatsunenko⁴, Jesse R. Zaneveld¹, Rob Knight¹,⁸

University of Colorado Boulder¹, University of Pennsylvania², Cooperative Institute for Research in Environmental Sciences³, Washington University in St. Louis⁴, Australian National University⁵, San Diego State University⁶, Cornell University⁷, Howard Hughes Medical Institute⁸

11 Apr 2010-Nature Methods

TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.

Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations

Journal Article

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

Patrick D. Schloss¹,², Sarah L. Westcott¹,², Thomas Ryabin², Justine R. Hall³, Martin Hartmann⁴, Emily B. Hollister⁵, Ryan A. Lesniewski⁶, Brian B. Oakley⁷, Donovan H. Parks⁸, Courtney J. Robinson¹, Jason W. Sahl⁹, Blaz Stres¹⁰, Gerhard G. Thallinger¹¹, David J. Van Horn¹, Carolyn F. Weber¹²

University of Michigan¹, University of Massachusetts Amherst², University of New Mexico³, University of British Columbia⁴, Texas A&M University⁵, University of Minnesota⁶, University of Warwick⁷, Dalhousie University⁸, Colorado School of Mines⁹, University of Ljubljana¹⁰, Graz University of Technology¹¹, Louisiana State University¹²

01 Dec 2009-Applied and Environmental Microbiology

TL;DR: As a case study, mothur is used to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments.

Abstract: mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the alpha and beta diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.

17,350 citations

Journal Article

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

01 Oct 2010-Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

17,301 citations

Journal Article

Machine learning

Thomas G. Dietterich¹

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations
