
Weizhong Li, liwz@sdsc.edu, at Adam Godzik's lab
The CD-HIT manual and downloads are available from
bioinformatics.org. If you have a special request,
please discuss it with the author.
If you find CD-HIT useful, please cite:
1. "Clustering of highly homologous sequences to reduce the
size of large protein database",
Weizhong Li, Lukasz Jaroszewski & Adam Godzik
Bioinformatics, (2001) 17:282-283.
2. "Tolerating some redundancy significantly speeds up clustering of large
protein databases",
Weizhong Li, Lukasz Jaroszewski & Adam Godzik
Bioinformatics, (2002) 18:77-82.
3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences",
Weizhong Li & Adam Godzik
Bioinformatics, (2006) 22:1658-9.
Who is using CD-HIT
The CD-HIT program is currently used by hundreds of research and
educational groups, including some of the world's best-known
institutions such as UniProt, PDB, EBI, and TIGR.
UniProt is the world's most comprehensive catalog of information on proteins. At UniProt, the CD-HIT program is used to generate the UniRef reference data sets UniRef90 and UniRef50. CD-HIT is also used at the PDB to handle redundant sequences.
Related resources:
NRDB90 and nrdb90.pl,
a nonredundant sequence database and the Perl script used to generate it.
RSDB,
Representative protein Sequence DataBases.