Any multiple sequence alignment (MSA) can be converted to a profile HMM (pHMM). And I DO understand that mathematical modeling of the diversity at each alignment position in an MSA can be used to score matches using something like HMMER2 / HMMER3 / HHpred etc.
However, I am curious to know if there are established guidelines for what the % identity among the sequences should ideally be, in order to balance signal and noise in the pHMM, so that both the sensitivity and the specificity of detecting sequence homologs are as high as possible.
I could argue that an MSA composed of sequences that are < 20% pair-wise identity would be hard to justify without solid evidence of structural or functional equivalence despite poor sequence conservation. So where should I stop in terms of diversity of sequences during MSA inference, if I am going to build pHMMs from these MSAs?
Links to any published literature on this topic would be much appreciated. Thanks folks!
I've analyzed the seed sequences for 14,831 Pfam profiles, and indeed, as you suspect, there is no uniform average pairwise sequence % identity across these profiles. Some of them are really low (< 20%). How can you infer an accurate MSA when % identity is so low? False positive rates in such cases are very high. So I might question the validity of these MSAs and the pHMMs inferred from them. It doesn't matter whether Pfam builds them or I build them!
At least that is my current stance. But I would love for someone to correct me or educate me on this aspect. Thanks for your reply.
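For anyone who wants to repeat this kind of check on their own alignments, here is a minimal Python sketch of an average pairwise % identity calculation. It is not the exact script used for the Pfam analysis above. The function names and the toy alignment are made up for illustration, and it assumes identity is counted only over columns where both sequences have a residue (other definitions use a different denominator and give different numbers).

```python
from itertools import combinations

GAP_CHARS = set("-.")  # gap symbols commonly seen in aligned sequences


def pairwise_identity(a, b):
    """Percent identity over columns where both sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average percent identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
toy_msa = [
    "MKV-LLAG",
    "MKVALLSG",
    "MRV-LIAG",
]

print(round(mean_pairwise_identity(toy_msa), 1))
```

The all-against-all loop is quadratic in the number of sequences, which is fine for seed alignments but slow for very large full alignments.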
If I remember correctly, they control the false discovery/positive rate with a curated, family-specific bit-score cutoff rather than one global threshold, i.e., each pHMM has its own adjustment associated. Search for "gathering threshold"...
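To make the "gathering threshold" idea a bit more concrete, here is a rough Python sketch. It assumes the profile is stored in HMMER3 text format, where, as far as I know, an optional GA header line carries two bit-score cutoffs (per-sequence and per-domain). The header text, the hit names and the scores below are all invented for illustration, and both function names are hypothetical.

```python
def parse_gathering_threshold(hmm_header_text):
    """Return (sequence_cutoff, domain_cutoff) from a GA line, or None if absent."""
    for line in hmm_header_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "GA":
            # Pfam-style files may end the line with a semicolon; strip it.
            seq_cutoff = float(fields[1].rstrip(";"))
            dom_cutoff = float(fields[2].rstrip(";"))
            return seq_cutoff, dom_cutoff
    return None


def filter_hits(hits, seq_cutoff):
    """Keep only hits whose bit score reaches the family-specific cutoff."""
    return {name: score for name, score in hits.items() if score >= seq_cutoff}


# Invented header fragment and invented hit scores, for illustration only.
toy_header = """HMMER3/f [3.1b2 | February 2015]
NAME  Toy_domain
LENG  10
GA    25.00 25.00;
"""
toy_hits = {"seqA": 31.2, "seqB": 24.9, "seqC": 12.0}

seq_cutoff, dom_cutoff = parse_gathering_threshold(toy_header)
print(filter_hits(toy_hits, seq_cutoff))
```

In practice you probably would not parse this by hand: hmmsearch in HMMER3 has a --cut_ga option that applies the cutoffs stored in the model.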
However, for mathematical modelling in general, there is no need for good overall similarity. It is enough to identify the features that are unique to a particular family/group/whatever. Hypothetically, imagine that a particular stretch of 10 amino acids in a 1000 AA protein is shared by all proteins carrying out a specific function, while no other protein happens to have it... then you need to train your profile to target exactly these 10 AA... no more, no less. The rest of the sequence does not matter, so to answer your question: in that case even 1% overall sequence similarity would be enough.
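A toy version of this hypothetical in Python: build a position-specific log-odds profile from a few aligned copies of a 10-residue motif and slide it along a sequence. Note that this is a simplified ungapped scoring matrix, not a real pHMM (no insert/delete states) and not what HMMER does internally. The motif instances, the test sequences, the uniform background and the pseudocount are all assumptions made up for the sketch.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def build_log_odds(motif_instances, pseudocount=0.1):
    """Per-column log2-odds scores from ungapped, equal-length motif instances."""
    width = len(motif_instances[0])
    if any(len(m) != width for m in motif_instances):
        raise ValueError("motif instances must have equal length")
    profile = []
    for col in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for motif in motif_instances:
            counts[motif[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2(counts[aa] / total / BACKGROUND) for aa in AMINO_ACIDS}
        )
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best


# Hypothetical motif instances and sequences, invented for illustration.
motifs = ["GDSGGPLVCK", "GDSGGPLMCQ", "GDSGGPVVCN"]
carrier = "MKTAYIAKQR" + "GDSGGPLVCK" + "LLNWEAAHRT"
decoy = "MKTAYIAKQRLLNWEAAHRTAVLKEYFQNS"

profile = build_log_odds(motifs)
print(best_window(profile, carrier))
print(best_window(profile, decoy))
```

The score depends only on the 10 motif columns, so the flanking residues can be as diverged as you like without changing whether the carrier is found.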
By the way, this kind of dimensionality reduction happens all over bioinformatics, from biomarker discovery (ignore genes that do not differ between control and experiment) to sequence classification (remove uninformative parts of the sequence)...