This follows on from my previous question about discovering protein homology. After finding sequences of interest by searching against a profile of a family, I want to determine whether these sequences can be assigned to that family or not. How can one score proteins against each other so that they can be grouped this way?
Originally, this "family" was determined via simple statistics (pairwise scoring via z-score, with alignments calculated from shuffling of these sequences), but I'm not convinced this is sophisticated enough to determine membership, so I'm looking for a more rigorous scoring method. There are important secondary structures that I am adding to my scoring function, but beyond this, I can't seem to find much on Google about this type of scoring.
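For readers unfamiliar with the shuffling approach mentioned above, here is a minimal sketch of how such a z-score could be computed. It is not the exact procedure used to define the family: the function names `align_score` and `shuffle_z_score` are hypothetical, and the match/mismatch/linear-gap scheme is a placeholder assumption (a real analysis would more likely use a substitution matrix such as BLOSUM62 with affine gap penalties).

```python
import random
import statistics


def align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not a recommended scheme.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the observed score against scores for shuffled copies of b.

    Shuffling preserves the composition of b but destroys its residue order,
    giving a null distribution for "unrelated sequence of the same makeup".
    """
    rng = random.Random(seed)
    observed = align_score(a, b)
    residues = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        null_scores.append(align_score(a, "".join(residues)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        # No spread in the null distribution: the z-score is undefined,
        # so this sketch reports 0.0 rather than dividing by zero.
        return 0.0
    return (observed - mean) / sd
```

A high z-score only says that the pair aligns better than composition-matched random sequences; it does not by itself establish family membership, which is the limitation the question is about.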
Well, if you look at the SUPERFAMILY database, you'll find that it is in fact a collection of HMMs, just like Pfam, SMART, and InterPro. The difference lies in how the underlying multiple sequence alignments (MSAs) were built.
Knowing how the MSAs are created is indeed critical. Many people overlook Gene3D and SUPERFAMILY because they are derived from structure rather than function. Note that although Gene3D and SUPERFAMILY are InterPro member databases, you need to check InterPro's release notes to see how many of their HMMs have been integrated (only about half of Gene3D has been so far). Hence I recommend going to each database's own site directly to get the latest data.