Hello Biostar community,
I have a sequence analysis question. I am using a database of protein fingerprints (a fingerprint is a set of several distinct sequence motifs excised from multiple sequence alignments). I am investigating functionally important areas (ligand binding, protein-protein and protein-ion interactions) and how they relate to these motifs, which are function/structure-agnostic sequence descriptors: whether functional residues lie within motifs and, if so, how frequently. Although some interesting correlations are appearing, e.g. 60% of protein-ion interaction residues fall within motifs, I would need to apply a statistical significance test to ascertain whether the coincidence of motifs with functional residues is meaningful or due to chance.
In practice this would go as follows: if 30% of a sequence is composed of motifs, then any functional residue (like any residue picked at random from the primary sequence of the polypeptide) has a 30% probability of lying within a motif by chance alone. Is there a probabilistic model (a probability distribution with an associated test, e.g. the binomial test) that can be used for statistical significance testing?
Many thanks and apologies for the lengthy text!
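To make the chance model in the question concrete, here is a minimal sketch of a one-sided binomial test in plain Python. It assumes each functional residue falls within a motif independently with probability equal to the motif coverage of the sequence. The function name `binomial_upper_tail` and all the numbers are hypothetical and only serve as an illustration; they are not taken from the actual data.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p).

    k: functional residues observed inside motifs
    n: total number of functional residues
    p: fraction of the sequence covered by motifs (chance of a hit)
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical example: 20 functional residues, 12 of them inside motifs,
# and motifs covering 30% of the sequence.
n_functional = 20
n_in_motif = 12
motif_fraction = 0.30

p_value = binomial_upper_tail(n_in_motif, n_functional, motif_fraction)
print(f"one-sided p-value = {p_value:.4g}")
```

If SciPy (1.7 or later) is available, `scipy.stats.binomtest(k, n, p, alternative="greater")` should give the same one-sided p-value. Note that the independence assumption is a strong one, since functional residues often cluster along the sequence.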
@Michael: Thank you for the input. When you say "shuffle" the data, do you mean taking a multiple sequence alignment (from which the motifs are excised) and randomly redistributing the amino acid residues? And does "observing the background distribution" refer to consensus sequences/sequence logos? I'm not sure how I would go about doing that programmatically... Thanks again for your ideas!
I don't understand your approach well enough to suggest how to shuffle correctly. Perhaps you could use a random set of starting sequences and create the multiple alignment from those (this would of course only work if you start with a certain subset of proteins).
The approach is not really a formal one; I just simplify amino acid residue occurrence to "fit" a binomial distribution: if len(fingerprint) = 20% of my protein sequence, then each functional residue has a 20% chance of coinciding with a fingerprint residue purely by chance. I then assume independent trials (the positions of motifs and functional residues are statistically independent) and apply the binomial test. Thank you for your input, I'll think about how to programmatically implement such a randomized shuffling of my alignment sequences.
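As a rough illustration of what a shuffling-based check could look like, the sketch below uses a simpler stand-in than re-shuffling the alignment itself: it keeps the motif positions fixed and repeatedly re-draws the functional residue positions at random along the sequence, then counts how often the random overlap is at least as large as the observed one. This is only one possible reading of "shuffle", not necessarily what was suggested above. The function name `shuffle_test`, the coordinate convention (0-based, end-exclusive motif intervals) and the example numbers are all assumptions made for illustration.

```python
import random

def shuffle_test(seq_len, motifs, functional, n_perm=10000, seed=0):
    """Monte Carlo test for overlap between functional residues and motifs.

    seq_len:    length of the protein sequence
    motifs:     list of (start, end) intervals, 0-based, end exclusive
    functional: list of 0-based positions of functional residues
    Returns (observed overlap, one-sided empirical p-value).
    """
    rng = random.Random(seed)

    # Set of all sequence positions covered by any motif.
    in_motif = set()
    for start, end in motifs:
        in_motif.update(range(start, end))

    observed = sum(pos in in_motif for pos in functional)

    # Null model: the same number of residues placed uniformly at random.
    at_least = 0
    for _ in range(n_perm):
        random_positions = rng.sample(range(seq_len), len(functional))
        if sum(pos in in_motif for pos in random_positions) >= observed:
            at_least += 1

    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical example: a 100-residue protein with one motif covering
# positions 0-29, and 20 functional residues, 12 of which lie in the motif.
functional_positions = list(range(0, 12)) + list(range(50, 58))
observed, p_value = shuffle_test(100, [(0, 30)], functional_positions, n_perm=2000)
print(observed, p_value)
```

Because positions are drawn without replacement, this null model corresponds to a hypergeometric rather than a binomial distribution; for short sequences the two can differ noticeably. A null model that preserves the clustering of functional residues (for example, by shifting them as a block) would be stricter, but is not shown here.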