Question

Conditional motif enrichment in sequence and selex data


9.4 years ago

t.candelli ▴ 70

I am working with sequence data and want to determine the enrichment of flanking nucleotides of a motif.

Let XXXX be the motif I'm interested in. I want to know if, for example, (A|XXXX) is enriched or not over the background.

To do this, I took the frequency of (A|XXXX) and compared it to the frequency of A in the overall set of sequences I'm working with. Specifically, I tested the observed count of (A|XXXX) against a binomial distribution with p equal to the frequency of A in the overall set of sequences.

Is this approach correct?
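A minimal sketch of the comparison described above, using only the Python standard library. The toy sequences, the motif `GATC` standing in for XXXX, and the helper names `flank_counts` and `binom_upper_tail` are all hypothetical, and only the nucleotide immediately upstream of the motif is considered. Note that the background frequency here is computed over every position, including the motif itself.

```python
from math import comb

def flank_counts(seqs, motif, base="A"):
    """Return (k, n): n motif hits that have an upstream base, k of which are `base`."""
    n = k = 0
    for s in seqs:
        start = s.find(motif, 1)  # start at index 1 so an upstream base exists
        while start != -1:
            n += 1
            k += s[start - 1] == base
            start = s.find(motif, start + 1)
    return k, n

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): a one-sided test for enrichment."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy data (an assumption for illustration only).
seqs = ["AGATCA", "AGATCC", "TGATCA", "AGATCG"]
motif = "GATC"

k, n = flank_counts(seqs, motif)
# Overall frequency of A across all positions of all sequences.
p_bg = sum(s.count("A") for s in seqs) / sum(len(s) for s in seqs)
p_value = binom_upper_tail(k, n, p_bg)
print(f"A upstream of motif: {k}/{n}, background p={p_bg:.3f}, p-value={p_value:.4f}")
```

With real data, a two-sided test or a correction for testing all four nucleotides at several flanking positions may be more appropriate; this sketch only shows the one-sided binomial comparison.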

Also, let's assume I have a background pool of sequences from which the ones I referred to above were derived (through a selection process such as a SELEX experiment). Is it imperative for me to use the background pool even in the case of conditional enrichment? My reasoning went like this:

Let us assume two datasets: a background starting dataset and a selected dataset derived from it. Let us also assume the motif XXXX mentioned before was indeed selected, but its flanking nucleotides were not. Now, if I compare the frequency of (A|XXXX) in the selected pool with the frequency of (A|XXXX) in the background pool under these conditions, I am effectively comparing freq(A) in the selected pool to freq(A) in the background pool, because the flanking nucleotides did not undergo selection. However, because some OTHER sequences were selected (namely XXXX itself, but possibly also YYYY or WWWW), the nucleotide composition of the selected pool is most likely going to differ from that of the background pool. Hence, under these conditions, I would detect spurious differences in flanking nucleotide enrichment despite there being no selection, whereas comparing freq(A|XXXX) in the selected pool with freq(A) in the selected pool would have produced no such confusion.

motif enrichment conditional-probability selex • 1.9k views

updated 2.2 years ago by Ram 44k • written 9.4 years ago by t.candelli ▴ 70

Ram · Answer 1 · 2015-08-13

I think I understand what you are getting at, but there are a few phrases that I am not sure I fully understand. I went ahead and provided an answer, hoping my interpretation is correct.

The part I am unsure of is:

...because some OTHER sequences were selected, however, the nucleotide composition of the selected pool is most likely going to be different from that of the background pool...

Here it would be helpful to know exactly what you mean by "OTHER sequences".

I think what you mean is something like this: you constructed the "selected dataset" by choosing segments that have the motif XXXX PLUS some other motif, and you are worried that this other motif co-occurs with the A in AXXXX.

To me this seems like something you can test.

Let us call the other motif YYYY. Could you not first determine whether the frequency of A | (XXXX) differs from the frequency of A | (XXXX and YYYY)? If it does, I think you could potentially still generate valid test statistics by controlling for this, once you know the degree of covariation, if there is any.

If there is no difference between the frequency of A based on the presence or absence of YYYY, the "OTHER" motif, then I would be comfortable assuming that at least that one aspect of the selection process did not inflate the test statistic...
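One way to sketch this check, assuming a 2x2 table of motif-containing sequences split by whether YYYY is present and whether the upstream flank is A, tested with Fisher's exact test. The choice of Fisher's test, the helper names `flank_table` and `fisher_exact_two_sided`, the motifs `GATC` and `TTTT`, and the toy sequences are all my own assumptions for illustration, not something taken from your data.

```python
from math import comb

def flank_table(seqs, motif, other, base="A"):
    """Rows: `other` present / absent. Columns: upstream flank is `base` / is not."""
    table = [[0, 0], [0, 0]]
    for s in seqs:
        pos = s.find(motif, 1)  # first motif hit that has an upstream base
        if pos == -1:
            continue  # no usable motif occurrence in this sequence
        row = 0 if other in s else 1
        col = 0 if s[pos - 1] == base else 1
        table[row][col] += 1
    return table

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme (no more probable) as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Toy data (an assumption for illustration only).
seqs = ["AGATCTTTT", "CGATCTTTT", "AGATCCCCC", "TGATCCCCC"]
(a, b), (c, d) = flank_table(seqs, motif="GATC", other="TTTT")
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"table=[[{a}, {b}], [{c}, {d}]], p-value={p_value:.3f}")
```

A large p-value here would be consistent with the flanking A not depending on the presence of YYYY, although with small counts the test has little power to detect a difference.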

I hope that helps. This was interesting to think about. I think you are doing a great job of looking for potential issues... wish everyone did that.