Help understanding CD-HIT cluster results
3 months ago
Nilavrah • 0

Hi all

I am trying to cluster protein sequences downloaded from UniProt (release 2023_05) at a 30% identity threshold. I know that CD-HIT is not reliable at such low thresholds in a single step, so I used hierarchical clustering: first at 90%, then at 60%, and finally at 30% with psi-cd-hit, exactly as shown in the documentation (https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST).

I am now parsing the cluster file so that I can filter the sequences within each cluster by an identity threshold, and I see two or more sequence identities for some cluster members. For example:

377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%

Any help in understanding these three sequence identities would be appreciated.

I understand from the documentation that one of these values is the global sequence identity, but I can't tell which one it is or what the other two represent.
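For reference, here is a minimal sketch of how a member line like the one above could be split into its fields while parsing the cluster file. The function name `parse_member` and the regular expression are illustrative assumptions based only on the example line, not part of CD-HIT. The sketch simply collects every percentage in the order it appears and makes no claim about which value is the global identity.

```python
import re

# Pattern inferred from the example line; an assumption, not an official
# specification of the .clstr format.
MEMBER_RE = re.compile(
    r"(?P<length>\d+)aa,\s+>(?P<name>.+?)\.\.\.\s+(?:\*|at\s+(?P<info>\S+))"
)


def parse_member(line):
    """Split one cluster member line into length, name, and identities."""
    match = MEMBER_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised cluster member line: {line!r}")

    info = match.group("info")  # None for the representative ("*") line
    identities = []
    if info is not None:
        # Collect every number followed by '%', in order of appearance.
        identities = [float(x) for x in re.findall(r"(\d+(?:\.\d+)?)%", info)]

    return {
        "length": int(match.group("length")),
        "name": match.group("name"),
        "is_representative": info is None,
        "identities": identities,
        "raw_info": info,
    }


if __name__ == "__main__":
    example = "377aa, >Q64563... at 6.06e-148/380aa/58.15%,72.41%,92.57%"
    print(parse_member(example)["identities"])  # [58.15, 72.41, 92.57]
```

Keeping all of the values in a list, rather than picking one, means the filtering step can be decided once it is clear which identity each position refers to.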

If I have done the clustering incorrectly, please let me know.

Thanks

CD-HIT • 228 views