Question

Help in understanding cd-hit cluster results

7 months ago

Nilavrah • 0

Hi all

I am trying to cluster my protein sequence data, downloaded from UniProt (release 2023_05), at a 30% identity threshold. I know that CD-HIT is not reliable at such low thresholds in a single step, so I used hierarchical clustering: first clustering at 90%, then at 60%, and finally using psi-cd-hit to cluster at 30% sequence identity (exactly as shown in their documentation here: https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST).

When parsing the cluster file to pick sequences within each cluster and filter them out based on an identity threshold, I can see two or more sequence identities for some cluster members. For example: 377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%. Any help in understanding these three sequence identities would be appreciated.

I understand from the documentation that one of these values is the global sequence identity, but I don't understand which is which.
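
To make the question concrete, here is a minimal Python sketch of splitting such a member line into its fields. The regular expression is an assumption based only on the single example line in this post, and the names `MEMBER_RE` and `parse_member` are hypothetical. It deliberately does not assign a meaning to the individual percentages; it only keeps them in the order they appear.

```python
import re

# Pattern assumed from the one example line in this post:
#   "<length>aa, ><id>... at <field>/<field>/<identities>"
# It is not taken from the CD-HIT documentation.
MEMBER_RE = re.compile(
    r"(?P<length>\d+)aa,\s+>(?P<seq_id>\S+?)\.\.\.\s+at\s+(?P<detail>\S+)"
)


def parse_member(line):
    """Split one cluster member line into its parts.

    Returns None for lines without an "at ..." part (for example a
    cluster header or a representative sequence marked with "*").
    """
    match = MEMBER_RE.search(line)
    if match is None:
        return None
    # Fields after "at" are separated by "/"; the last one holds the
    # comma-separated percentages.
    fields = match.group("detail").split("/")
    identities = [float(p.rstrip("%")) for p in fields[-1].split(",")]
    return {
        "length": int(match.group("length")),
        "seq_id": match.group("seq_id"),
        "other_fields": fields[:-1],   # kept as raw strings
        "identities": identities,      # order preserved, meaning not assumed
    }


example = "377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%"
print(parse_member(example))
```

Which of the values in `identities` should be compared against the filtering threshold is exactly the open question here.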

If I have done the clustering incorrectly, please let me know.

Thanks

CD-HIT • 322 views
