Hi all,
I am trying to cluster protein sequences downloaded from UniProt (2023_05 release) at a 30% identity threshold. I know that CD-HIT is not reliable at such low thresholds in a single step, so I clustered hierarchically: first at 90%, then at 60%, and finally at 30% sequence identity using psi-cd-hit (exactly as shown in the documentation here: https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST).

When I parse the cluster file to decide which sequences in each cluster to filter out by an identity threshold, I see that some cluster members have two or more sequence identities. For example: 377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%. Any help in understanding these three sequence identities would be appreciated.
I understand from the documentation that one of these values is the global sequence identity, but I don't understand which one it is or what the others represent.
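To make the question concrete, here is a minimal Python sketch of splitting a member line into its parts. It assumes only the layout visible in the example line above, plus the usual `*` marker for a cluster representative; the function name `parse_member_line` and the dictionary keys are hypothetical. It deliberately does not say what each percentage means, since that is exactly what I am unsure about.

```python
import re

# Assumed layout (inferred from the example line in this post, not from a spec):
#   [index] <length>aa, >ID... at <field>/<field>/<pct>%[,<pct>%...]
# A representative line is assumed to end with "*" instead of "at ...".
MEMBER_RE = re.compile(r"^\s*(?:\d+\s+)?(\d+)aa,\s*>(.+?)\.\.\.\s*(.*)$")


def parse_member_line(line):
    """Split one cluster-member line into length, ID and its raw score fields."""
    m = MEMBER_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")

    record = {
        "length": int(m.group(1)),
        "id": m.group(2),
        "representative": False,
        "other_fields": [],   # whatever precedes the percentages, kept as raw text
        "identities": [],     # every percentage on the line, in file order
    }

    rest = m.group(3).strip()
    if rest == "*":
        record["representative"] = True
        return record

    if rest.startswith("at"):
        rest = rest[2:].strip()

    parts = rest.split("/")
    record["other_fields"] = parts[:-1]
    # The last "/"-separated field holds one or more comma-separated percentages.
    record["identities"] = [float(p.rstrip("%")) for p in parts[-1].split(",")]
    return record


if __name__ == "__main__":
    example = "377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%"
    print(parse_member_line(example))
```

With this, the example member yields three values in `identities`, and I do not know which of them (if any single one) I should compare against my threshold.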
If I have done the clustering incorrectly, please let me know.
Thanks