Question

Statistical Significance Of Sequence-Based Descriptor Data

1

13.0 years ago

Spyros ▴ 10

Hello Biostar community,

I have a sequence analysis question. I am using a database of protein fingerprints (fingerprint = a set of several distinct sequence motifs excised from multiple sequence alignments). I am investigating functionally important areas (ligand-binding, protein-protein and protein-ion interaction sites) and how they relate to these motifs, which are function- and structure-agnostic sequence descriptors: specifically, whether functional residues lie within motifs, and how frequently. Although some interesting correlations are appearing (e.g. 60% of protein-ion interaction residues fall within motifs), I need a statistical significance test to determine whether the coincidence of motifs with functional residues is meaningful or due to chance.

In practice, this would go as follows: if 30% of a sequence is made up of motifs, then any functional residue (like any residue picked at random from the primary sequence of the polypeptide) has a 30% probability of lying within a motif by chance alone. Is there a probabilistic model (a probability distribution or test, e.g. the binomial test) that can be used for statistical significance testing?

Many thanks and apologies for the lengthy text!

sequence statistics • 2.4k views

ADD COMMENT • link updated 13.0 years ago by Michael Kuhn 5.0k • written 13.0 years ago by Spyros ▴ 10

score 1 · Answer 1 · 2011-12-07

1

13.0 years ago

Michael Kuhn 5.0k

I'm not sure if I understand your approach, but if in doubt: Shuffle the data, and observe the background distribution. From this you can calculate empirical p-values.

You have to be careful, though, about what you shuffle: if there are hidden correlations in the data and the shuffling destroys them, you would get low p-values, but these might be an artifact of the destroyed correlations rather than a real signal.
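As a rough illustration of the shuffling idea, here is a minimal Python sketch for a single protein. Everything in it is an assumption made for the example: the function names, the toy motif positions and the toy functional residues are hypothetical, and the choice of null model (circularly shifting the motif mask, so that motif lengths and spacing are kept intact) is only one possible way to shuffle without breaking up contiguous motifs.

```python
import random

def count_overlap(motif_mask, functional_positions):
    """Number of functional residues that fall inside a motif."""
    return sum(motif_mask[pos] for pos in functional_positions)

def circular_shift_p_value(motif_mask, functional_positions,
                           n_permutations=10000, seed=0):
    """Empirical one-sided p-value from random circular shifts of the motif mask.

    Shifting the whole mask keeps motif lengths and spacing intact, so only
    the placement of motifs relative to functional residues is randomized.
    """
    rng = random.Random(seed)
    length = len(motif_mask)
    observed = count_overlap(motif_mask, functional_positions)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        offset = rng.randrange(length)
        shifted = motif_mask[offset:] + motif_mask[:offset]
        if count_overlap(shifted, functional_positions) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical toy protein: 100 residues, motifs at 10-19 and 50-59 (20% coverage).
motif_mask = [(10 <= i < 20) or (50 <= i < 60) for i in range(100)]
functional_positions = [12, 14, 15, 52, 55, 80]  # hypothetical functional residues

observed, p_value = circular_shift_p_value(motif_mask, functional_positions)
print(f"{observed} of {len(functional_positions)} functional residues in motifs, "
      f"empirical p = {p_value:.4f}")
```

For one protein there are only as many distinct circular shifts as there are residues, so the shifts could also be enumerated exactly; random sampling becomes more useful when the statistic is pooled over many proteins, each shifted independently.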

ADD COMMENT • link 13.0 years ago by Michael Kuhn 5.0k

0

@Michael: Thank you for the input. When you say "shuffle" the data, do you mean taking a multiple sequence alignment (from which the motifs are excised) and randomly redistributing the amino acid residues? And does "observing the background distribution" refer to consensus sequences/sequence logos? I'm not sure how I would go about doing that programmatically... Thanks again for your ideas!

ADD REPLY • link 13.0 years ago by Spyros ▴ 10

0

I don't understand your approach well enough to suggest how to shuffle correctly. Perhaps you could use a random set of starting sequences and create the multiple alignment from those (this would of course only work if you start with a certain subset of proteins).

ADD REPLY • link 13.0 years ago by Michael Kuhn 5.0k

0

The approach is not really a formal one: I simplify residue occurrence to fit a binomial distribution. If the fingerprint covers 20% of my protein sequence, then each functional residue has a 20% chance of falling within a fingerprint motif purely by chance. I then assume independent trials (that is, the positions of motifs and functional residues are statistically independent) and apply the binomial test. Thank you for your input; I'll think about how to programmatically implement such a randomized shuffling of my alignment sequences.
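A minimal sketch of that binomial test in Python, using only the standard library. The counts below are hypothetical numbers chosen for illustration (they are not results from the fingerprint database), and the helper name `binomial_upper_tail` is made up for this example. It treats each functional residue as an independent trial, which is exactly the assumption stated above and may not hold if functional residues cluster along the sequence.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example values:
n_functional = 50        # functional residues in the protein
n_in_motif = 30          # of those, how many lie within fingerprint motifs (60%)
motif_fraction = 0.20    # fraction of the sequence covered by motifs

p_value = binomial_upper_tail(n_in_motif, n_functional, motif_fraction)
print(f"P(at least {n_in_motif} of {n_functional} in motifs by chance) = {p_value:.3g}")
```

The test is one-sided because the question is whether functional residues fall within motifs more often than the motif coverage alone would predict.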

ADD REPLY • link 13.0 years ago by Spyros ▴ 10