I have some transposons identified using their terminal DNA sequences, let's say 10nt each.
These have been identified based on sequence similarity to a set of pre-defined terminal DNA sequences, so discovery is based on a match to a profile / HMM / PSSM, if you will.
Note that the terminal lengths are not always exactly 10nt; they vary a little. Also, the length of the element, i.e. the separation between the terminals, is quite variable. I'm just throwing in these details to give a more complete picture of the problem. Which is:
I want to calculate a statistic that conveys how non-random the combination of these two terminal DNA sequences is, given the whole genome sequence, their lengths, their composition, their separation, etc.
For example, if transposon #1 is flanked by terminal sequences that occur far more often in the genome at large than those of transposon #2 do, I envision this statistic conveying higher confidence in transposon #2 than in #1.
The question is: what should this statistic be, and how would you advise me to go about calculating it?
Some authors have approached this by randomly shuffling the genome and reporting a false discovery rate based on the ratio of loci discovered in the shuffled versus the intact genome. I am not too convinced by this approach. I am not against it, however; I just want advice on a different statistic to describe the chance, or alternatively the non-random likelihood, of these elements.
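For concreteness, here is a minimal R sketch of my understanding of that shuffling approach. Everything in it is a simplification on my part: the terminals are matched as fixed strings rather than profiles, and the shuffle is mononucleotide (a dinucleotide shuffle would preserve composition better):

```r
# Sketch of the shuffle-based FDR idea; terminals, genome, and parameters
# here are all made up for illustration.

count_loci <- function(genome, left, right, max_sep) {
  hits_l <- as.integer(gregexpr(left,  genome, fixed = TRUE)[[1]])
  hits_r <- as.integer(gregexpr(right, genome, fixed = TRUE)[[1]])
  if (hits_l[1] == -1 || hits_r[1] == -1) return(0)
  # a "locus" = a left-terminal hit with a right-terminal hit within max_sep downstream
  sum(vapply(hits_l, function(s) any(hits_r > s & hits_r - s <= max_sep), logical(1)))
}

shuffle_genome <- function(genome) {
  # mononucleotide shuffle: permute the bases of the genome
  paste(sample(strsplit(genome, "")[[1]]), collapse = "")
}

set.seed(1)
genome <- paste(sample(c("A", "C", "G", "T"), 1e5, replace = TRUE), collapse = "")
left   <- "TGACGTTACA"   # hypothetical 10-nt left terminal
right  <- "ACGTTGCAAT"   # hypothetical 10-nt right terminal

observed    <- count_loci(genome, left, right, max_sep = 5000)
null_counts <- replicate(20, count_loci(shuffle_genome(genome), left, right, max_sep = 5000))
mean(null_counts) / max(observed, 1)   # shuffled-to-intact ratio, the FDR-style estimate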
I am comfortable with Perl coding and some R coding. If I have multiple options for calculating / reporting a suitable statistic, I request that you keep my skill-set in mind. Thanks folks!
THANK YOU! It makes total sense. Just 2 follow-up questions:
1. I wonder if I should count n for each matched sequence in the genome, rather than the total match count to the profile. In other words, if profile 1 yields 20 different sequence matches, cumulatively with 100 occurrences, then I think you are saying n1 = 100. But I am asking whether n1 should instead be split into counts for each of the 20 sequences matching profile 1 (in my made-up example), and likewise into sub-categories for matches to profile 2. The combinations quickly blow up, but still... is that better, worse, or something else?
2. Is there any way (or need) to factor in the variable scores of the matching sequences identified by these profiles?
I don't quite understand what #1 would tell you, so you'd have to come up with a good reason why the particular sequences are important. What's the null hypothesis? It's already exceedingly rare to observe any given pair of 10bp sequences in close proximity in the genome, but the relevant question, I presume, is whether the number of hits is surprising, not whether their particular sequences are rare.
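To put a number on "surprising": under a simple uniform-placement null (my shorthand here, so adapt it to whatever model you actually use), the expected count of nearby hit pairs works out as below, and the observed pair count can be compared against a Poisson with that mean:

```r
# Expected number of profile-1/profile-2 hit pairs landing within d bp of each
# other, assuming hits fall uniformly at random along the genome.
# All numbers are made up.
n1 <- 100          # hits to terminal profile 1
n2 <- 80           # hits to terminal profile 2
G  <- 3e9          # genome length in bp
d  <- 5000         # maximum separation between terminals
m  <- n1 * n2 * 2 * d / G             # expected pair count (Poisson mean)
k  <- 12                              # observed pair count
ppois(k - 1, m, lower.tail = FALSE)   # P(>= k pairs by chance)
```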
As for #2, you could get specific FDRs for any pair of variable scores by changing n1 and n2 above to the number of hits for each profile that have a score greater than or equal to the scores of a given pair, then recalculating the E-value (m) and the number of pairs in the new, smaller set (k). This would tell you what the E-value would be if you used the specific scores for that pair as a threshold.
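If it helps, here is a rough R sketch of that recalculation. The m formula is the uniform-placement version I sketched above, so substitute your own if it differs, and all scores here are hypothetical:

```r
# For each candidate pair, use its two scores as thresholds, shrink the hit
# sets to hits scoring at least that well, and recompute the E-value m.
pair_evalue <- function(s1, s2, scores1, scores2, G, d) {
  n1 <- sum(scores1 >= s1)   # profile-1 hits at or above this pair's score
  n2 <- sum(scores2 >= s2)   # profile-2 hits at or above this pair's score
  n1 * n2 * 2 * d / G        # E-value for the thresholded hit sets
}

scores1 <- c(18.2, 15.1, 22.7, 12.9, 19.5)   # hypothetical profile-1 hit scores
scores2 <- c(20.4, 14.8, 17.3, 21.1)         # hypothetical profile-2 hit scores

# three candidate pairs, each defined by the scores of its two terminals
pairs <- data.frame(s1 = c(18.2, 15.1, 22.7), s2 = c(20.4, 17.3, 21.1))
pairs$m <- mapply(pair_evalue, pairs$s1, pairs$s2,
                  MoreArgs = list(scores1 = scores1, scores2 = scores2,
                                  G = 3e9, d = 5000))
# k for a given pair would be the number of observed pairs clearing both
# score thresholds; the pair-specific FDR estimate would then be m / k.
pairs
```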