I have a list of proteins that I want to divide into clusters that are homogeneous in terms of likely function.
As a first approach, I clustered them with h-cd-hit, using three iterations at 90%, 80% and 70% similarity and allowing a difference of at most 75 aa among the proteins. These parameters were chosen because they resolve my data best.
The results are decent, but when I look at the domain composition of the representative sequence of each cluster, I can see that in some cases the domain architectures are highly similar. I would say that a similar domain architecture suggests a similar function. Therefore I would like to perform a second clustering based on similarity of domain architecture.
For example:
1) ----domA-------domB---domC
2) ---domB---domC-------
3) ---domA--domW---domE
In this case, 1) and 2) should cluster together, because both contain domB followed by domC.
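To make the idea concrete, here is a rough sketch of what I have in mind, not an existing tool. It assumes each representative has already been annotated with an ordered list of domains; the names `architectures`, `adjacent_pairs`, `jaccard` and `cluster_by_architecture`, the IDs `rep1` to `rep3`, and the 0.5 threshold are all hypothetical choices for illustration. Two architectures are compared by the Jaccard index of their neighbouring domain pairs, and proteins are linked (single linkage) when the score reaches the threshold.

```python
from itertools import combinations

# Hypothetical input: representative ID -> ordered list of domain names,
# mirroring the three example architectures above.
architectures = {
    "rep1": ["domA", "domB", "domC"],
    "rep2": ["domB", "domC"],
    "rep3": ["domA", "domW", "domE"],
}


def adjacent_pairs(domains):
    """Return the set of ordered neighbouring domain pairs.

    Single-domain proteins yield an empty set and would need separate handling.
    """
    return set(zip(domains, domains[1:]))


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_architecture(archs, threshold=0.5):
    """Single-linkage clustering on domain-pair Jaccard similarity.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    parent = {name: name for name in archs}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    features = {name: adjacent_pairs(doms) for name, doms in archs.items()}
    for a, b in combinations(archs, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in archs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    print(cluster_by_architecture(architectures))
```

With the example above, `rep1` and `rep2` share one of two distinct domain pairs (score 0.5) and are grouped, while `rep3` shares none and stays alone. Using ordered pairs rather than plain domain sets is a design choice so that domain order matters; other similarity measures would work in the same framework.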
Is there any available tool for doing that?
Today I came across this tool:
MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information
I do not really see the advantage of performing the comparison between proteins based on Pfam domains at the DNA level. Any comments?