Is there any way to trace which sequences are included in each cd-hit cluster?
I'm aware of the .clstr file that gets generated, but the names are truncated, and I can't figure out a way to get the whole names.
For example, here is a snippet of a cluster file from a recent cd-hit run:
>Cluster 100
0 570aa, >ncbi|1353254|Penici... at 86.67%
1 570aa, >ncbi|500485|Penicil... at 87.02%
2 486aa, >ncbi|1170229|Penici... at 91.36%
3 486aa, >ncbi|1170230|Penici... at 91.36%
4 486aa, >ncbi|1170230|Penici... at 91.36%
5 572aa, >ncbi|27334|Penicill... at 99.65%
6 1967aa, >ncbi|27334|Penicill... *
7 570aa, >ncbi|5078|Penicilli... at 87.54%
8 570aa, >ncbi|5078|Penicilli... at 87.54%
9 1967aa, >ncbi|40296|Penicill... at 96.24%
10 1967aa, >ncbi|40296|Penicill... at 96.24%
11 570aa, >ncbi|1346256|Penici... at 85.09%
12 570aa, >ncbi|1439352|Penici... at 86.67%
13 572aa, >ncbi|60172|Penicill... at 92.13%
14 572aa, >ncbi|60172|Penicill... at 92.48%
15 571aa, >ncbi|2136024|Penici... at 90.72%
16 572aa, >ncbi|293382|Penicil... at 91.43%
I would like to make a fasta file from just the sequences in this cluster, but I can't because the part of the name shown is not enough to uniquely identify the sequences.
The obvious solution would be to rename the sequences before running cd-hit, then rename them back afterwards, but it seems like there should be a more direct way.
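For what it's worth, the rename-and-restore workaround could be sketched roughly as below. This is only a minimal sketch under some assumptions: the function names (`rename_fasta`, `parse_clstr`, `cluster_fasta`) and the `seq` prefix are made up for illustration, and it assumes each member line in the .clstr file carries the sequence name after a `>` and before a trailing `...`, as in the snippet above. The short IDs stay well under the truncation length, so they survive intact in the cluster file.

```python
import re

def rename_fasta(lines, prefix="seq"):
    """Swap each FASTA header for a short unique ID.

    Returns the renamed lines and a dict mapping short ID -> original header.
    The "seq" prefix is an arbitrary choice for this sketch.
    """
    id_to_name = {}
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            short = f"{prefix}{len(id_to_name)}"
            id_to_name[short] = line[1:]  # keep the full original header
            out.append(">" + short)
        else:
            out.append(line)
    return out, id_to_name

# Assumed member-line shape: "0 570aa, >seq12... at 86.67%" or "6 1967aa, >seq3... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")

def parse_clstr(lines):
    """Group the names found in a .clstr file by cluster label."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 100"
            clusters[current] = []
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters

def cluster_fasta(renamed_lines, id_to_name, member_ids):
    """Return FASTA lines for the given short IDs, with original headers restored."""
    wanted = set(member_ids)
    out = []
    keep = False
    for line in renamed_lines:
        if line.startswith(">"):
            short = line[1:]
            keep = short in wanted
            if keep:
                out.append(">" + id_to_name[short])
        elif keep:
            out.append(line)
    return out
```

The idea would be to write the output of `rename_fasta` to a file, run cd-hit on that file, then feed the resulting .clstr lines to `parse_clstr` and pass something like `clusters["Cluster 100"]` to `cluster_fasta` to get the members back under their full names.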
Where did you get the sequences from? If 1353254 refers to a gi number, you should be able to get the sequence from NCBI using Entrez Direct.

They aren't from GenBank. They are gene models from genome assemblies. The ncbi|[0-9]+| prefix indicates the NCBI Taxonomy ID. I do that to be compatible with the bbmap taxonomy tools.