I am working with sequence data and want to determine the enrichment of flanking nucleotides of a motif.
Let XXXX be the motif I'm interested in. I want to know if, for example, (A|XXXX) is enriched or not over the background.
To do this I took the frequency of (A|XXXX) and compared it to the frequency of A in the overall set of sequences I'm working with (in particular, I compared the frequency of (A|XXXX) with the binomial distribution with p = the frequency of A in the overall set of sequences).
Is this approach correct?
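To make the comparison concrete, here is a minimal sketch of what I mean, in plain Python. The assumptions are mine and only for illustration: the flanking position is the single nucleotide immediately 5' of the motif, the test is one-sided (enrichment only), and the sequences and motif are made-up toy data. The function names are hypothetical helpers, not part of any library.

```python
from math import comb

def flank_counts(seqs, motif, base="A"):
    """Return (k, n): n motif occurrences that have a 5' neighbour,
    k of which have `base` at that neighbouring position."""
    n = k = 0
    for s in seqs:
        # Start searching at index 1 so that a 5' neighbour always exists.
        start = s.find(motif, 1)
        while start != -1:
            n += 1
            k += s[start - 1] == base
            # Step by one so overlapping occurrences are also counted.
            start = s.find(motif, start + 1)
    return k, n

def base_frequency(seqs, base="A"):
    """Overall frequency of `base` across all sequences (motif positions included)."""
    total = sum(len(s) for s in seqs)
    return sum(s.count(base) for s in seqs) / total

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): one-sided enrichment p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Toy data, purely illustrative.
seqs = ["TACGTG", "GACGTA", "CCCGTT", "AACGTC"]
motif = "CGT"

k, n = flank_counts(seqs, motif)       # A's observed 5' of the motif
p = base_frequency(seqs)               # overall frequency of A
p_value = binom_upper_tail(k, n, p)    # chance of seeing >= k A's under p
print(k, n, p, p_value)
```

Note that in this sketch the overall frequency of A also counts the nucleotides inside the motif itself, which is part of what I am unsure about. If SciPy is available, `scipy.stats.binomtest` could replace the hand-written tail sum.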
Also, let's assume I have the background pool of sequences from which the ones I referred to above were derived (through a selection process such as a SELEX experiment). Is it imperative for me to use the background pool even in the case of conditional enrichment? My reasoning went like this:
Let us assume two datasets, a background starting dataset and a selected dataset derived from the background one. Let us also assume the motif XXXX mentioned before was indeed selected, but its flanking nucleotides were not. Now, if I compare the frequency of (A|XXXX) in the selected pool with the frequency of (A|XXXX) in the background pool under these conditions, I am effectively comparing freq(A) in the selected pool to freq(A) in the background pool, because the flanking nucleotides did not undergo selection. However, because some sequences were selected (namely XXXX itself, but it could equally be YYYY or WWWW), the nucleotide composition of the selected pool is most likely going to differ from that of the background pool. Hence, under these conditions, I would detect spurious differences in flanking nucleotide enrichment despite there being no selection, whereas comparing freq(A|XXXX) in the selected pool with freq(A) in the selected pool would have yielded no such confusion.
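The four quantities involved in this argument can be put side by side with a small helper like the one below. This is only a sketch under my own assumptions: the flanking position is the nucleotide immediately 5' of the motif, the function names are hypothetical, and the two pools are made-up toy sequences with no biological meaning, so nothing should be concluded from the numbers themselves.

```python
def freq_base_given_motif(seqs, motif, base="A"):
    """Frequency of `base` immediately 5' of each motif occurrence."""
    hits = [
        s[i - 1]
        for s in seqs
        for i in range(1, len(s) - len(motif) + 1)  # i >= 1 so s[i - 1] exists
        if s.startswith(motif, i)
    ]
    return hits.count(base) / len(hits) if hits else float("nan")

def freq_base(seqs, base="A"):
    """Overall frequency of `base` in a pool."""
    return sum(s.count(base) for s in seqs) / sum(len(s) for s in seqs)

def summarize(selected, background, motif, base="A"):
    """Collect conditional and overall frequencies for both pools."""
    return {
        "selected_cond": freq_base_given_motif(selected, motif, base),
        "selected_overall": freq_base(selected, base),
        "background_cond": freq_base_given_motif(background, motif, base),
        "background_overall": freq_base(background, base),
    }

# Toy pools, purely illustrative: "selected" keeps the motif-containing reads.
background = ["TACGTG", "GATTAA", "CCCGTT", "GGGGGG"]
selected = ["TACGTG", "CCCGTT"]

result = summarize(selected, background, "CGT")
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The question is then which pair to compare: `selected_cond` against `background_cond`, or `selected_cond` against `selected_overall`.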
Hey Vincent, thanks for your answer. By "other sequences" I actually meant motif XXXX, although it can easily be extended to motif YYYY. What I mean is that the process of selection itself (independently of what it selects) is going to change the nucleotide frequencies. Therefore, comparing the frequency of As at a position that was NOT selected to the frequency of A in the background pool of sequences is going to generate spurious results. I think this is because the frequency of As at an unselected position depends on the current frequency of As in the overall pool of sequences (your method of conditioning on XXXXYYYY is theoretically good, but practically very complex, as the complete dynamics of a SELEX experiment are very difficult to define). I will correct this in the question text.
As for the first question, do you think the approach of comparing nucleotide frequency at one particular position with the overall nucleotide frequency in the pool is sound?