Cd-hit

doi:10.1093/BIOINFORMATICS/BTS565

Home
/
Papers
/
Cd-hit

Journal Article•DOI•

Cd-hit

Limin Fu¹, Beifang Niu¹, Zhengwei Zhu¹, Sitao Wu¹, Weizhong Li¹ - Show less +1 more•Institutions (1)

University of California, San Diego¹

01 Dec 2012-Bioinformatics (Oxford University Press)-Vol. 28, Iss: 23, pp 3150-3152

TL;DR: A new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets to reduce sequence redundancy and improve the performance of other sequence analyses is developed.

read less

Abstract: Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. Availability: http://cd-hit.org. Contact: [email protected] Supplementary information:Supplementary data are available at Bioinformatics online.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Ribosomal Database Project: data and tools for high throughput rRNA analysis

[...]

James R. Cole¹, Qiong Wang¹, Jordan A. Fish¹, Benli Chai¹, Donna M. McGarrell¹, Yanni Sun¹, C. Titus Brown¹, Andrea Porras-Alfaro¹, Cheryl R. Kuske¹, James M. Tiedje¹ - Show less +6 more•Institutions (1)

Los Alamos National Laboratory¹

01 Jan 2014-Nucleic Acids Research

TL;DR: RDP now includes a collection of fungal large subunit rRNA genes, and most tools are now available as open source packages for download and local use by researchers with high-volume needs or who would like to develop custom analysis pipelines.

...read moreread less

Abstract: Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/) provides the research community with aligned and annotated rRNA gene sequence data, along with tools to allow researchers to analyze their own rRNA gene sequences in the RDP framework. RDP data and tools are utilized in fields as diverse as human health, microbial ecology, environmental microbiology, nucleic acid chemistry, taxonomy and phylogenetics. In addition to aligned and annotated collections of bacterial and archaeal small subunit rRNA genes, RDP now includes a collection of fungal large subunit rRNA genes. RDP tools, including Classifier and Aligner, have been updated to work with this new fungal collection. The use of high-throughput sequencing to characterize environmental microbial populations has exploded in the past several years, and as sequence technologies have improved, the sizes of environmental datasets have increased. With release 11, RDP is providing an expanded set of tools to facilitate analysis of high-throughput data, including both single-stranded and paired-end reads. In addition, most tools are now available as open source packages for download and local use by researchers with high-volume needs or who would like to develop custom analysis pipelines.

...read moreread less

3,443 citations

Journal Article•DOI•

Roary: Rapid large-scale prokaryote pan genome analysis

[...]

Andrew J. Page¹, Carla A. Cummins¹, Martin Hunt¹, Vanessa K. Wong¹, Sandra Reuter², Matthew T. G. Holden³, Maria Fookes¹, Daniel Falush⁴, Jacqueline A. Keane¹, Julian Parkhill¹ - Show less +6 more•Institutions (4)

Wellcome Trust Sanger Institute¹, University of Cambridge², University of St Andrews³, Swansea University⁴

15 Nov 2015-Bioinformatics

TL;DR: Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes, is introduced, making construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results.

...read moreread less

Abstract: Summary: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors. Availability and implementation: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary Contact: ku.ca.regnas@yraor Supplementary information: Supplementary data are available at Bioinformatics online.

...read moreread less

3,147 citations

Journal Article•DOI•

SignalP 5.0 improves signal peptide predictions using deep neural networks

[...]

Jose Juan Almagro Armenteros¹, Konstantinos D. Tsirigos, Casper Kaae Sønderby², Thomas Nordahl Petersen¹, Ole Winther¹, Ole Winther², Søren Brunak², Søren Brunak¹, Gunnar von Heijne³, Gunnar von Heijne⁴, Henrik Nielsen¹ - Show less +7 more•Institutions (4)

Technical University of Denmark¹, University of Copenhagen², Stockholm University³, Science for Life Laboratory⁴

18 Feb 2019-Nature Biotechnology

TL;DR: A deep neural network-based approach that improves SP prediction across all domains of life and distinguishes between three types of prokaryotic SPs is presented.

...read moreread less

Abstract: Signal peptides (SPs) are short amino acid sequences in the amino terminus of many newly synthesized proteins that target proteins into, or across, membranes. Bioinformatic tools can predict SPs from amino acid sequences, but most cannot distinguish between various types of signal peptides. We present a deep neural network-based approach that improves SP prediction across all domains of life and distinguishes between three types of prokaryotic SPs.

...read moreread less

2,732 citations

Journal Article•DOI•

PHASTER: a better, faster version of the PHAST phage search tool

[...]

David Arndt, Jason R. Grant, Ana Marcu, Tanvir Sajed, Allison Pon, Yongjie Liang, David S. Wishart¹, David S. Wishart² - Show less +4 more•Institutions (2)

National Institute for Nanotechnology¹, University of Alberta²

08 Jul 2016-Nucleic Acids Research

TL;DR: PHASTER (PHAge Search Tool – Enhanced Release) is a significant upgrade to the popular PHAST web server for the rapid identification and annotation of prophage sequences within bacterial genomes and plasmids.

...read moreread less

Abstract: PHASTER (PHAge Search Tool - Enhanced Release) is a significant upgrade to the popular PHAST web server for the rapid identification and annotation of prophage sequences within bacterial genomes and plasmids. Although the steps in the phage identification pipeline in PHASTER remain largely the same as in the original PHAST, numerous software improvements and significant hardware enhancements have now made PHASTER faster, more efficient, more visually appealing and much more user friendly. In particular, PHASTER is now 4.3× faster than PHAST when analyzing a typical bacterial genome. More specifically, software optimizations have made the backend of PHASTER 2.7X faster than PHAST, while the addition of 80 CPUs to the PHASTER compute cluster are responsible for the remaining speed-up. PHASTER can now process a typical bacterial genome in 3 min from the raw sequence alone, or in 1.5 min when given a pre-annotated GenBank file. A number of other optimizations have also been implemented, including automated algorithms to reduce the size and redundancy of PHASTER's databases, improvements in handling multiple (metagenomic) queries and higher user traffic, along with the ability to perform automated look-ups against 14 000 previously PHAST/PHASTER annotated bacterial genomes (which can lead to complete phage annotations in seconds as opposed to minutes). PHASTER's web interface has also been entirely rewritten. A new graphical genome browser has been added, gene/genome visualization tools have been improved, and the graphical interface is now more modern, robust and user-friendly. PHASTER is available online at www.phaster.ca.

...read moreread less

2,454 citations

Cites methods from "Cd-hit"

...For PHASTER, we reduced the size of the bacterial sequence database by removing sequences with >70% sequence identity to any other sequence in the database, using CD-HIT (18)....
[...]

Journal Article•DOI•

MetaSPAdes: A new versatile metagenomic assembler

[...]

Sergey Nurk¹, Dmitry Meleshko¹, Anton Korobeynikov¹, Pavel A. Pevzner², Pavel A. Pevzner¹ - Show less +1 more•Institutions (2)

Saint Petersburg State University¹, University of California, San Diego²

01 May 2017-Genome Research

TL;DR: MetaSPAdes as mentioned in this paper addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.

...read moreread less

Abstract: While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.

...read moreread less

2,295 citations

Cites background from "Cd-hit"

...6 (Li and Godzik 2006; Fu et al. 2012) clustering software (with 99% similarity) to correct for potential advantage of more redundant assemblies and to retain only the longest predicted gene in a cluster....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Search and clustering orders of magnitude faster than BLAST

[...]

Robert C. Edgar

01 Oct 2010-Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

...read moreread less

Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information:Supplementary data are available at Bioinformatics online.

...read moreread less

17,301 citations

Additional excerpts

...221) from (Edgar, 2010)....
[...]
...1.221) from Edgar (2010)....
[...]

Journal Article•DOI•

A human gut microbial gene catalogue established by metagenomic sequencing

[...]

Junjie Qin¹, Ruiqiang Li¹, Jeroen Raes², Manimozhiyan Arumugam, Kristoffer Sølvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, Nicolas Pons³, Florence Levenez³, Takuji Yamada, Daniel R. Mende, Junhua Li¹, Junming Xu¹, Shaochuan Li¹, Dongfang Li¹, Jianjun Cao¹, Bo Wang¹, Huiqing Liang¹, Huisong Zheng¹, Yinlong Xie¹, Julien Tap³, Patricia Lepage³, Marcelo Bertalan, Jean-Michel Batto³, Torben Hansen, Denis Le Paslier, Allan Linneberg, H. Bjørn Nielsen, Eric Pelletier, Pierre Renault³, Thomas Sicheritz-Pontén, Keith Turner⁴, Hongmei Zhu¹, Chang Yu¹, Shengting Li¹, Min Jian¹, Yan Zhou¹, Yingrui Li¹, Xiuqing Zhang¹, Songgang Li¹, Nan Qin¹, Huanming Yang¹, Jian Wang¹, Søren Brunak, Joël Doré³, Francisco Guarner⁵, Karsten Kristiansen, Oluf Pedersen, Julian Parkhill, Jean Weissenbach, Peer Bork, S. Dusko Ehrlich³, Jun Wang¹ - Show less +49 more•Institutions (5)

Beijing Genomics Institute¹, Vrije Universiteit Brussel², Institut national de la recherche agronomique³, Wellcome Trust Sanger Institute⁴, Hebron University⁵

04 Mar 2010-Nature

TL;DR: The Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals are described, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species.

...read moreread less

Abstract: To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively

...read moreread less

9,268 citations

"Cd-hit" refers background in this paper

...Different computation data buffers are allocated for different threads....
[...]

Journal Article•DOI•

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

[...]

Weizhong Li¹, Adam Godzik¹•Institutions (1)

Sanford-Burnham Institute for Medical Research¹

01 Jul 2006-Bioinformatics

TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd- Hit-est clusters a DNA/RNA sequence database and cd- hit-est-2D compares two nucleotide datasets.

...read moreread less

Abstract: Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282--283, Bioinformatics, 18, 77--82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: [email protected]

...read moreread less

8,306 citations

"Cd-hit" refers background or methods in this paper

..., 2001) and was then extended to support clustering nucleotide sequences (Li and Godzik, 2006)....
[...]
...…computational time and noise interference in some analysis methods, etc. CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li et al., 2001) and was then extended to support clustering nucleotide sequences (Li and Godzik, 2006)....
[...]

Journal Article•DOI•

A core gut microbiome in obese and lean twins

[...]

Peter J. Turnbaugh¹, Micah Hamady², Tanya Yatsunenko¹, Brandi L. Cantarel³, Alexis E. Duncan¹, Ruth E. Ley¹, Mitchell L. Sogin⁴, William J. Jones⁵, Bruce A. Roe⁶, Jason P. Affourtit, Michael Egholm, Bernard Henrissat³, Andrew C. Heath¹, Rob Knight², Jeffrey I. Gordon¹ - Show less +11 more•Institutions (6)

Washington University in St. Louis¹, University of Colorado Boulder², Centre national de la recherche scientifique³, Marine Biological Laboratory⁴, University of South Carolina⁵, University of Oklahoma⁶

22 Jan 2009-Nature

TL;DR: The faecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and their mothers are characterized to address how host genotype, environmental exposure and host adiposity influence the gut microbiome.

...read moreread less

Abstract: The human distal gut harbours a vast ensemble of microbes (the microbiota) that provide important metabolic capabilities, including the ability to extract energy from otherwise indigestible dietary polysaccharides. Studies of a few unrelated, healthy adults have revealed substantial diversity in their gut communities, as measured by sequencing 16S rRNA genes, yet how this diversity relates to function and to the rest of the genes in the collective genomes of the microbiota (the gut microbiome) remains obscure. Studies of lean and obese mice suggest that the gut microbiota affects energy balance by influencing the efficiency of calorie harvest from the diet, and how this harvested energy is used and stored. Here we characterize the faecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and their mothers, to address how host genotype, environmental exposure and host adiposity influence the gut microbiome. Analysis of 154 individuals yielded 9,920 near full-length and 1,937,461 partial bacterial 16S rRNA sequences, plus 2.14 gigabases from their microbiomes. The results reveal that the human gut microbiome is shared among family members, but that each person's gut microbial community varies in the specific bacterial lineages present, with a comparable degree of co-variation between adult monozygotic and dizygotic twin pairs. However, there was a wide array of shared microbial genes among sampled individuals, comprising an extensive, identifiable 'core microbiome' at the gene, rather than at the organismal lineage, level. Obesity is associated with phylum-level changes in the microbiota, reduced bacterial diversity and altered representation of bacterial genes and metabolic pathways. These results demonstrate that a diversity of organismal assemblages can nonetheless yield a core microbiome at a functional level, and that deviations from this core are associated with different physiological states (obese compared with lean).

...read moreread less

6,970 citations

Journal Article•DOI•

UniRef: comprehensive and non-redundant UniProt reference clusters.

[...]

Baris E. Suzek¹, Hongzhan Huang¹, Peter B. McGarvey¹, Raja Mazumder¹, Cathy H. Wu¹ - Show less +1 more•Institutions (1)

Georgetown University Medical Center¹

15 May 2007-Bioinformatics

TL;DR: The UniRef (UniProt Reference Clusters) provides clustered sets of sequences from the UniProt Knowledgebase and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences.

...read moreread less

Abstract: Motivation Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. Results The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. Availability UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

1,155 citations