Conditional motif enrichment in sequence and selex data
9.4 years ago
t.candelli ▴ 70

I am working with sequence data and want to determine the enrichment of flanking nucleotides of a motif.

Let XXXX be the motif I'm interested in. I want to know if, for example, (A|XXXX) is enriched or not over the background.

To do this, I took the frequency of (A|XXXX) and compared it to the frequency of A in the overall set of sequences I'm working with. Specifically, I tested the observed count of (A|XXXX) against a binomial distribution with p equal to the frequency of A in the overall set of sequences.
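A minimal sketch of the test described above, using only the Python standard library. The helper names, the choice of the position immediately upstream of the motif as "the flank", the one-sided alternative, and the toy sequences are all assumptions made for illustration; they are not taken from the actual analysis.

```python
from math import comb

def upstream_counts(seqs, motif, base="A"):
    """Return (k, n): motif hits preceded by `base`, and all motif hits
    that have an upstream position (overlapping hits are counted)."""
    k = n = 0
    for s in seqs:
        i = s.find(motif, 1)  # start at 1 so an upstream base exists
        while i != -1:
            n += 1
            k += s[i - 1] == base
            i = s.find(motif, i + 1)
    return k, n

def base_frequency(seqs, base="A"):
    """Frequency of `base` over all positions of all sequences."""
    return sum(s.count(base) for s in seqs) / sum(len(s) for s in seqs)

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): one-sided enrichment p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Toy sequences (hypothetical), with "CGT" standing in for XXXX.
seqs = ["AACGTA", "TACGTC", "GACGTT", "CCCGTG"]
k, n = upstream_counts(seqs, "CGT")
p0 = base_frequency(seqs)  # null probability: overall frequency of A
p_value = binom_upper_tail(k, n, p0)
print(k, n, p0, p_value)
```

Note that in this sketch the overall frequency of A also includes the motif positions themselves, so a motif that is rich or poor in A shifts the null probability.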

Is this approach correct?

Also, let's assume I have the background pool of sequences from which the ones I referred to above were derived (through a selection process such as a SELEX experiment). Is it imperative for me to use the background pool even in the case of conditional enrichment? My reasoning went like this:

Let us assume two datasets: a background starting dataset and a selected dataset derived from it. Let us also assume that the motif XXXX mentioned before was indeed selected, but its flanking nucleotides were not. If I compare the frequency of (A|XXXX) in the selected pool with the frequency of (A|XXXX) in the background pool under these conditions, I am effectively comparing freq(A) in the selected pool to freq(A) in the background pool, because the flanking nucleotides did not undergo selection. However, because some OTHER sequences were selected (namely XXXX itself, but it could also be YYYY or WWWW), the nucleotide composition of the selected pool is most likely going to differ from that of the background pool. Hence, under these conditions, I would detect spurious differences in flanking nucleotide enrichment despite there being no selection on the flanks, whereas comparing freq(A|XXXX) in the selected pool with freq(A) in the selected pool would not suffer from this confusion.
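To check this reasoning on real or simulated pools, the two candidate baselines can be computed side by side. The sketch below is only an illustration: the function names, the log2-ratio summary, and the toy pools are hypothetical, and it does not decide which baseline is the right one.

```python
import re
from math import log2

def flank_freq(seqs, motif, base="A"):
    """freq(base | motif): fraction of motif hits whose upstream base is `base`.
    The lookahead lets overlapping hits be counted."""
    pattern = re.compile(f"(.)(?={re.escape(motif)})")
    flanks = [m.group(1) for s in seqs for m in pattern.finditer(s)]
    return flanks.count(base) / len(flanks)  # assumes at least one hit

def overall_freq(seqs, base="A"):
    """freq(base) over all positions of the pool."""
    return sum(s.count(base) for s in seqs) / sum(len(s) for s in seqs)

def flank_log2_ratios(selected, background, motif, base="A"):
    """Log2 enrichment of the flank in the selected pool against each baseline.
    Assumes all frequencies involved are non-zero."""
    f_sel = flank_freq(selected, motif, base)
    return {
        "vs_selected_composition": log2(f_sel / overall_freq(selected, base)),
        "vs_background_flank": log2(f_sel / flank_freq(background, motif, base)),
    }

# Toy pools (hypothetical), with "CGT" standing in for XXXX.
selected = ["ACGTT", "ACGTA", "CCGTA"]
background = ["ACGTT", "TCGTG", "GGGGA", "TTTTA"]
ratios = flank_log2_ratios(selected, background, "CGT")
print(ratios)
```

If the flanks are truly unselected, the baseline that is appropriate should give a ratio near zero on a large enough pool; running both on simulated data with known selection is one way to see which comparison behaves as expected.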

motif enrichment conditional-probability selex • 1.9k views
9.4 years ago
LauferVA 4.5k

I think I understand what you are getting at but there are a few phrases that I am not sure I totally understand. I went ahead and provided an answer, hoping my interpretation was correct.

The part I am unsure of is:

...because some OTHER sequences were selected, however, the nucleotide composition of the selected pool is most likely going to be different from that of the background pool...

Here it would be helpful to know exactly what you mean by "OTHER sequences".

I think that what you mean is something like, you constructed the "selected dataset" by choosing segments that have the motif XXXX PLUS some other motif, and you are worried that the motif is co-incident with the A in AXXXX.

To me this seems like something you can test.

Let us call the other motif YYYY. Could you not first determine whether the frequency of A | (XXXX) differs from the frequency of A | (XXXX and YYYY)? If it does, I think you could potentially still generate valid test statistics by controlling for this, once you know the degree of covariation, if there is any.

If there is no difference between the frequency of A based on the presence or absence of YYYY, the "OTHER" motif, then I would be comfortable assuming that at least that one aspect of the selection process did not inflate the test statistic...
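One way to run that check is a 2x2 contingency test on the flank counts, split by whether YYYY is present. The sketch below uses a hand-rolled two-sided Fisher's exact test so that it needs only the standard library; the function names, the rule that "YYYY present" means anywhere in the same sequence, and the toy sequences are assumptions made for illustration.

```python
from math import comb

def flank_table(seqs, motif, other, base="A"):
    """2x2 counts of motif hits.
    Rows: `other` present / absent in the sequence.
    Columns: upstream base is `base` / is not `base`."""
    table = [[0, 0], [0, 0]]
    for s in seqs:
        row = 0 if other in s else 1
        i = s.find(motif, 1)  # start at 1 so an upstream base exists
        while i != -1:
            col = 0 if s[i - 1] == base else 1
            table[row][col] += 1
            i = s.find(motif, i + 1)
    return table

def fisher_two_sided(table):
    """Two-sided Fisher's exact p-value: sum of the probabilities of all
    tables with the same margins that are no more likely than the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Toy sequences (hypothetical): "CGT" stands in for XXXX, "GGG" for YYYY.
seqs = ["ACGTGGG", "TCGTGGG", "ACGTAAA"]
table = flank_table(seqs, "CGT", "GGG")
print(table, fisher_two_sided(table))
```

A large p-value here only says that no dependence on YYYY was detected at this sample size; it does not rule out other effects of the selection process.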

I hope that helps. This was interesting to think about. I think you are doing a great job of looking for potential issues... wish everyone did that.


Hey Vincent, thanks for your answer. By "other sequences" I actually meant motif XXXX, although it can easily be extended to motif YYYY. What I mean is that the process of selection itself (independently of what it selects) is going to change the nucleotide frequencies. Therefore, comparing the frequency of A at a position that was NOT selected to the frequency of A in the background pool of sequences is going to generate spurious results. I think this is because the frequency of A at an unselected position depends on the current frequency of A in the overall pool of sequences. Your method of conditioning on XXXX and YYYY is theoretically good but practically very complex, as the complete dynamics of a SELEX experiment are very difficult to define.

I will correct this in the question text.

As for the first question, do you think the approach of comparing nucleotide frequency at one particular position with the overall nucleotide frequency in the pool is sound?
