Hello,
I have several questions about using FIMO for motif scanning. I have used it for a while, but there are still parts of it I don't understand.
First, regarding low-complexity and repetitive regions: I noticed that a large share of FIMO matches fall in repetitive regions. Does this mean anything for predicting protein binding sites? These low-complexity matches seem to skew the q-value calculation, so that almost all significant results come from low-complexity regions. After masking these regions, I got a completely different result with many more matches. With a q-value threshold of 0.1, I got ~2,299 significant matches before masking and ~46,000 after masking the FASTA file.
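To make clear what I mean by the q-value depending on the other matches, here is a minimal sketch of the generic Benjamini-Hochberg adjustment. I am assuming this is close in spirit to what FIMO does, but I have not checked its implementation, and the p-values below are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Generic Benjamini-Hochberg adjusted p-values (not FIMO's own code)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values: the first entry is the match of interest,
# scanned once alongside weak matches and once alongside strong ones.
with_weak_neighbours = [0.001, 0.2, 0.5, 0.8]
with_strong_neighbours = [0.001, 0.0001, 0.0002, 0.0003]

print(bh_qvalues(with_weak_neighbours)[0])
print(bh_qvalues(with_strong_neighbours)[0])
```

The same p-value of 0.001 gets a different q-value in the two sets, which is why I suspect that adding or removing the repeat matches changes which of the other matches pass the threshold.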
Second, regarding the background frequencies for FIMO: I have tested several promoter definitions, for example 1000 nt, 2000 nt, and up to 5000 nt upstream of the first exon. Because the regions differ in length, the background frequencies of A, C, G, and T (from fasta-get-markov) also differ. As a result, the same PWM scanned against the same target sequence gets different p-values and q-values. Should I use the same background frequencies for all target sequence sets so that the p-values and q-values are consistent?
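To illustrate the effect I am describing, here is a minimal sketch under my own assumptions: a zero-order, single-strand background estimated by simple counting, and a log-odds score of one site against a toy motif. The sequences, the motif, and the function names are hypothetical, and fasta-get-markov may compute its model differently (for example in how it treats the reverse strand).

```python
import math
from collections import Counter


def background_frequencies(sequences):
    """Zero-order A/C/G/T frequencies counted on the given strand only."""
    counts = Counter()
    for seq in sequences:
        for base in seq.upper():
            if base in "ACGT":  # skip N and other ambiguity codes
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T characters found")
    return {base: counts[base] / total for base in "ACGT"}


def log_odds_score(site, motif_probs, background):
    """Sum over positions of log2(motif probability / background probability).

    Assumes every base has a nonzero background frequency.
    """
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site, motif_probs)
    )


# Toy motif with three columns (hypothetical probabilities).
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Toy "short" and "long" upstream regions with different composition.
short_regions = ["AGCGCGCCGGAT", "GCGCTAGCGGCC"]
long_regions = ["AGCGCGCCGGATATATTTAAATTA", "GCGCTAGCGGCCTTATAAATATTT"]

uniform = {base: 0.25 for base in "ACGT"}
for name, bg in [
    ("uniform", uniform),
    ("short", background_frequencies(short_regions)),
    ("long", background_frequencies(long_regions)),
]:
    print(name, round(log_odds_score("AGC", motif, bg), 3))
```

The site "AGC" and the motif are identical in all three cases; only the background changes, and the score (and therefore, as I understand it, the p-value derived from it) changes with it.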
Third, regarding a uniform background (A, C, G, and T each at a frequency of 0.25): What is the justification for using a uniform background rather than whole-genome frequencies or the frequencies of the target sequences? What is the best practice for choosing the background? And how should the p-value be interpreted under a uniform background, a whole-genome background, and a target-sequence background?
Thank you.