I have a PWM whose accuracy I'd like to assess. The problem with scanning the entire genome (other than the fact that it's an overnight code run) is that it yields a lot of false positives compared to true positives.
Going through some posts on Biostars, it may make more sense to scan only the promoter regions with my PWM. Is there an easy way to extract promoter regions for hg18 and mm9?
Would I have to deal with gene information? The transcription factor my PWM is based on is for muscle genes. Would I have to look only at promoters around these genes?
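For what it's worth, here is a minimal sketch of how promoter windows could be derived once you have transcript coordinates and strands for your genes. It assumes 0-based, half-open transcript intervals (the convention used by UCSC-style gene tables, as far as I know). The gene names, the coordinates, and the 1000 bp default are made-up placeholders, not recommendations.

```python
def tss_from_transcript(tx_start, tx_end, strand):
    """Return the TSS as a 0-based position.

    Assumes a 0-based, half-open transcript interval [tx_start, tx_end).
    """
    if strand == "+":
        return tx_start
    if strand == "-":
        return tx_end - 1
    raise ValueError(f"unexpected strand: {strand!r}")


def promoter_interval(chrom, tss, strand, upstream=1000, downstream=0):
    """Return a strand-aware promoter window as (chrom, start, end), half-open.

    'upstream' lies at lower coordinates on '+' and higher coordinates on '-'.
    The start is clipped at 0; clipping at the chromosome end is not handled.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss + 1 - downstream, tss + 1 + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return chrom, max(0, start), end


# Hypothetical records: (gene, chrom, tx_start, tx_end, strand)
transcripts = [
    ("GeneA", "chr1", 5000, 9000, "+"),
    ("GeneB", "chr2", 12000, 20000, "-"),
    ("GeneC", "chr3", 300, 4000, "+"),
]

# Optional filter: keep only the genes you care about (e.g. a muscle gene list).
genes_of_interest = {"GeneA", "GeneB"}

promoters = {
    gene: promoter_interval(chrom, tss_from_transcript(s, e, strand), strand)
    for gene, chrom, s, e, strand in transcripts
    if gene in genes_of_interest
}
```

Dropping the `genes_of_interest` filter gives windows for every transcript in the table, which would be the choice if you don't want to restrict the scan to muscle genes.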
I am working with bacteria, and several papers have reported binding sites inside ORFs, so on the one hand it would make sense to scan the whole genome. On the other hand, as you said, this will create a lot of false hits. A first and relatively conservative approach would be to scan only the upstream regions. This is commonly done, and it should recover at least the more conserved boxes.
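To make the "scan only the upstream regions" idea concrete, here is a rough sketch of a log-odds PWM scan over a single sequence on both strands. The uniform background (0.25 per base), the pseudocount of 1, the toy count matrix, and the score threshold are all assumptions for illustration; real scans usually use a genome-specific background and a calibrated threshold.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds scores.

    counts: list of dicts, one per motif position, e.g. {"A": 10, "C": 0, ...}
    """
    pwm = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((col[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def revcomp(seq):
    """Reverse complement; characters other than ACGT are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(seq, pwm, threshold):
    """Return (start, strand, score) for every window scoring >= threshold.

    'start' is always a 0-based position on the forward strand of seq.
    Windows containing anything other than A/C/G/T are skipped.
    """
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(b not in BASES for b in window):
                continue
            score = sum(pwm[j][b] for j, b in enumerate(window))
            if score >= threshold:
                start = i if strand == "+" else len(s) - width - i
                hits.append((start, strand, score))
    return hits


# Toy motif with consensus AAAC (made-up counts, 10 sites).
toy_counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_pwm = log_odds_matrix(toy_counts)
hits = scan("TTTAAACTTT", toy_pwm, threshold=6.0)
```

Running `scan` over each upstream sequence instead of the whole genome is what cuts the search space; the scoring itself is the same in both cases.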
So I've built my PWM based on TF data. I already have the start and end coordinates for the TF binding site. Would it be sufficient to just grab a 1000 bp neighbourhood around this binding site? Is 1000 bp enough to cover the entire promoter region for this particular binding site?
For prokaryotes 300 to 500 bp are commonly used.
I'm working with mm9, so would a 1000 bp (or 2000 bp) neighbourhood be sufficient to calculate the accuracy?
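One way to put a number on "accuracy" here, sketched under assumptions: build a window around each known site, scan only those windows, and then compare the predicted hits with the known sites by overlap. The function names, the flank of 1000 bp, and the example coordinates below are hypothetical; intervals are taken to be 0-based and half-open. Whether 1000 or 2000 bp is enough is not something this code can answer, but it makes it easy to repeat the evaluation with both and compare.

```python
def flank_window(chrom, site_start, site_end, flank=1000):
    """Extend a site by 'flank' bp on each side; the start is clipped at 0."""
    return chrom, max(0, site_start - flank), site_end + flank


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def precision_recall(predicted, known):
    """Compare predicted hits with known sites; both are (chrom, start, end).

    precision = fraction of predicted hits that overlap a known site
    recall    = fraction of known sites overlapped by at least one hit
    """
    def touches(x, others):
        return any(
            x[0] == o[0] and overlaps(x[1], x[2], o[1], o[2]) for o in others
        )

    true_hits = sum(1 for p in predicted if touches(p, known))
    found_sites = sum(1 for k in known if touches(k, predicted))
    precision = true_hits / len(predicted) if predicted else 0.0
    recall = found_sites / len(known) if known else 0.0
    return precision, recall


# Made-up example: two known sites, three predicted hits.
known_sites = [("chr1", 100, 110), ("chr1", 500, 510)]
predicted_hits = [("chr1", 105, 115), ("chr1", 900, 910), ("chr2", 100, 110)]

precision, recall = precision_recall(predicted_hits, known_sites)
window = flank_window("chr1", 1500, 1510, flank=1000)
```

Note that every window built this way contains a true site by construction, so the false-positive rate measured inside these windows may look better than it would on promoters chosen without that knowledge.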