Hi,
Do repeats affect the results of motif discovery? Are programs like MEME and Weeder affected? If so, will masking the repeats solve the problem?
Repeats do affect de novo motif-finding algorithms. They often present the strongest signal and can overwhelm the signal from other motifs. One way to mitigate this is to choose a proper background set, so that a repeat is only reported if it is overrepresented compared to the background. Another is to increase the number of motifs you search for: your top motifs may come from repeats, but you will see other motifs ranked below them.
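To make the background idea concrete, here is a rough sketch (not how MEME or Weeder work internally) that scores each k-mer by its log2 frequency ratio between a foreground and a background set. The function names and the pseudocount are my own assumptions for illustration. A repeat-derived k-mer that is equally common in both sets scores near or below zero, while a k-mer found only in the foreground scores above zero.

```python
from collections import Counter
from math import log2


def count_kmers(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def kmer_log2_enrichment(foreground, background, k, pseudocount=1.0):
    """Return {kmer: log2(foreground freq / background freq)}.

    Hypothetical helper: the pseudocount keeps k-mers that are absent
    from the background from causing a division by zero.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k  # number of possible DNA k-mers
    scores = {}
    for kmer, c in fg.items():
        fg_freq = (c + pseudocount) / (fg_total + pseudocount * n_kmers)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        scores[kmer] = log2(fg_freq / bg_freq)
    return scores
```

Sorting the returned scores in descending order gives a crude ranking in which simple repeats shared with the background drop down the list.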
This really comes down to the type of motif you are expecting to find. I guess you can speed up motif discovery by masking with RepeatMasker (assuming you are working with the human genome or some other higher eukaryote). However, if your motif resides in repeats, you will lose information. You can test with and without RepeatMasker. You might start with a small chromosome.
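If you want to run the with/without comparison, one cheap way is to start from soft-masked sequence (repeats in lowercase, which is an assumption about your input) and derive a hard-masked copy from it. The helpers below are hypothetical names of mine, just a sketch:

```python
def hard_mask(seq, mask_char="N"):
    """Replace soft-masked (lowercase) bases with mask_char.

    Assumes repeats are marked in lowercase, as in soft-masked genomes.
    """
    return "".join(mask_char if c.islower() else c for c in seq)


def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked (lowercase)."""
    if not seq:
        return 0.0
    return sum(c.islower() for c in seq) / len(seq)
```

Running your motif finder once on `seq.upper()` and once on `hard_mask(seq)` gives the two conditions to compare, and `masked_fraction` tells you how much sequence you are throwing away by masking.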
I routinely use Weeder to analyse the 200 bp centred on the summits of my MACS ChIP-seq binding regions. At these sequence lengths I do not mask repeats, primarily because ChIP-seq can report binding within repeat regions that contain functional motifs. Masking would therefore potentially hide useful information.
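For anyone wanting to reproduce the 200 bp windows, this is roughly how the coordinates can be derived. It is a sketch that assumes a tab-separated BED-style summits file where column 2 is the 0-based summit position and column 4 is the peak name; check this against your own MACS output. `summit_windows` is a name I made up.

```python
def summit_windows(bed_lines, width=200):
    """Turn summit records into (chrom, start, end, name) windows.

    Assumes tab-separated lines with the summit position in column 2
    (0-based). Windows are clipped at the chromosome start, so a summit
    near position 0 yields a shorter window.
    """
    half = width // 2
    windows = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, summit = fields[0], int(fields[1])
        name = fields[3] if len(fields) > 3 else "."
        start = max(0, summit - half)
        windows.append((chrom, start, summit + half, name))
    return windows
```

The resulting intervals can then be passed to whatever tool you use to pull sequence out of the genome FASTA before running Weeder.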
When I used to analyse ChIP-chip data, I looked at larger regions and did not expect regions covering repeats, so in that case I did mask out repeats.
Have you tried GimmeMotifs yet? :) It uses a consensus of whatever motif discovery tools you like. For some jobs I still stick with Weeder, though.
I absolutely agree here - I also use the genome-specific motif counts when running Weeder!
There aren't any standards (as far as I know), but ideally the background should be from the same organism and similarly filtered. E.g., if you are looking for motifs in promoters, then your background would be random promoters from the same organism.
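As a minimal illustration of "random promoters from the same organism", the sketch below draws a reproducible random background from the full promoter list while excluding your foreground genes. The function name and the ID-based interface are assumptions of mine; real pipelines often also match GC content or length, which this does not do.

```python
import random


def sample_background(all_ids, foreground_ids, n, seed=0):
    """Pick n random background IDs that are not in the foreground.

    Hypothetical helper: all_ids would be every promoter in the organism,
    filtered the same way as the foreground set.
    """
    foreground = set(foreground_ids)
    # Sort so the result depends only on the seed, not on input order.
    candidates = sorted(set(all_ids) - foreground)
    if n > len(candidates):
        raise ValueError("not enough non-foreground sequences to sample from")
    rng = random.Random(seed)
    return rng.sample(candidates, n)
```

Fixing the seed makes the background reproducible, so reruns of the motif search are comparable.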
Thanks for your answer. But how do I know which background set is the proper one? Is there any standard for choosing it?
For future reference, HOMER's motif-finding commands use this philosophy automatically.