Question

how to get promoter regions from a genome

0

10.2 years ago

Affan ▴ 310

I have a PWM whose accuracy I'd like to assess. The problem with scanning the ENTIRE genome (other than the fact that it's an overnight code run) is that it yields a lot of false positives compared to true positives.
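For readers unfamiliar with what "scanning with a PWM" involves, here is a minimal sketch, not the poster's actual code. It assumes a toy 3-column PWM (`TOY_PWM`), a uniform background of 0.25 per base, and forward-strand scanning only; all names and numbers are hypothetical.

```python
import math

# Hypothetical toy PWM: one dict of base probabilities per motif position.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def log_odds_score(pwm, kmer):
    """Sum of log2(probability / background) over the motif positions."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, kmer))


def scan(pwm, seq, threshold):
    """Return (offset, score) for each forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        kmer = seq[i:i + width].upper()
        if any(base not in "ACGT" for base in kmer):
            continue  # skip windows containing N or other ambiguity codes
        score = log_odds_score(pwm, kmer)
        if score >= threshold:
            hits.append((i, score))
    return hits


hits = scan(TOY_PWM, "TTAGCTT", threshold=4.0)
```

Every window in the sequence gets a score, so the number of chance hits above the threshold grows with the length of the sequence scanned, which is why restricting the search space helps.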

From some posts on Biostars, it seems it may make more sense to scan only the promoter regions with my PWM. Is there an easy way to extract/get promoter regions for hg18 and mm9?

promoter pwm • 4.7k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Affan ▴ 310

0

10.2 years ago

EagleEye 7.6k

Check this out:

A: How can I find the location of promoter of a gene?

ADD COMMENT • link 10.2 years ago by EagleEye 7.6k

0

I don't really know the genes. I just know the coordinates of where the transcription factor binds. Is it sufficient for me to take a 2000bp neighbourhood around this coordinate? I would imagine that encapsulates the binding site.

ADD REPLY • link 10.2 years ago by Affan ▴ 310

0

In that case you cannot define the range for binding sites (which could have more variation). But you can start with ± 1kb and extend the search up to ± 5kb.
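A minimal sketch of taking such a ± window around a known binding site. It assumes 0-based, half-open (BED-style) coordinates and that the chromosome length is supplied by the caller; the function name and the example numbers are hypothetical.

```python
def window_around_site(chrom, start, end, flank, chrom_size):
    """Extend a binding site by `flank` bp on each side, clamped to the chromosome.

    Coordinates are assumed to be 0-based and half-open, as in BED files.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return (chrom, max(0, start - flank), min(chrom_size, end + flank))


# Start with +/- 1 kb, then widen to +/- 5 kb if needed (hypothetical site).
narrow = window_around_site("chr1", 20000, 20012, 1000, 100000)
wide = window_around_site("chr1", 20000, 20012, 5000, 100000)
```

The resulting intervals can then be used to pull the corresponding sequence from the genome FASTA with whichever tool you prefer.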

ADD REPLY • link 10.2 years ago by EagleEye 7.6k

0

Okay, thanks. I am not sure if taking a range will help me though. Whether it's a 1kb or 5kb neighbourhood, there is only one true site in there, given by my coordinate. However, the larger my neighbourhood, the more false positives I get. I think I'll take a 2kb neighbourhood, as promoter regions are up to 1000bp anyway.
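One way to make this trade-off concrete is to count, for each window size, how many hits overlap the known site and how many do not. The sketch below assumes hit positions and the true site are given as 0-based, half-open coordinates within the same window; the function name and the example hit positions are made up for illustration.

```python
def classify_hits(hit_starts, motif_width, true_start, true_end):
    """Split PWM hits into those overlapping the known site and the rest.

    Returns (true_positives, false_positives, precision).
    """
    def overlaps(hit_start):
        # Half-open interval overlap between the hit and the known site.
        return hit_start < true_end and hit_start + motif_width > true_start

    true_hits = [h for h in hit_starts if overlaps(h)]
    false_hits = [h for h in hit_starts if not overlaps(h)]
    total = len(hit_starts)
    precision = len(true_hits) / total if total else 0.0
    return len(true_hits), len(false_hits), precision


# Hypothetical hits in a 1 kb window with the known site at 495-510.
tp, fp, precision = classify_hits([10, 500, 990], 10, 495, 510)
```

Running this for the 1kb, 2kb and 5kb windows shows how precision drops as the neighbourhood grows while the single true site stays the same.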

ADD REPLY • link 10.2 years ago by Affan ▴ 310

0

Great, so you finally came to a conclusion. I still do not understand what your actual question is about! If you are looking for the exact transcription factor binding site, the length is usually between 5 and 31 nucleotides. The question you started with was how to extract promoter regions, which already include TF binding sites. Anyway, you found a solution. Best of luck.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by EagleEye 7.6k

Ram · Accepted Answer · 2015-02-20

2

10.2 years ago

dago ★ 2.8k

A few useful sites with lists of many nice tools:

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

0

Would I have to deal with gene information? Obviously the transcription factor my PWM is based on is for muscle genes. Would I have to look only for promoters around these genes?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Affan ▴ 310

0

I am working with bacteria, and several papers have reported the presence of binding sites inside ORFs. So on the one hand it would make sense to scan the whole genome. On the other hand, as you said, this will create a lot of false hits. A first and relatively conservative approach, I would say, is to scan only the upstream regions. This is commonly done, and I would say that you will certainly find the more conserved boxes.
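As a rough sketch of what "upstream region" means in coordinates: given a gene annotation with transcript start, end and strand, the upstream window sits before the start on the + strand and after the end on the - strand. The code assumes 0-based, half-open coordinates and a caller-supplied chromosome length; the function name and example values are hypothetical.

```python
def upstream_region(chrom, tx_start, tx_end, strand, upstream, chrom_size):
    """Return the `upstream` bp immediately 5' of a gene, respecting strand.

    Coordinates are assumed 0-based, half-open: on '+' the TSS is tx_start,
    on '-' the TSS is at tx_end, so upstream lies at higher coordinates.
    """
    if strand == "+":
        return (chrom, max(0, tx_start - upstream), tx_start)
    if strand == "-":
        return (chrom, tx_end, min(chrom_size, tx_end + upstream))
    raise ValueError("strand must be '+' or '-'")


plus_gene = upstream_region("chr1", 1000, 2000, "+", 300, 5000)
minus_gene = upstream_region("chr1", 1000, 2000, "-", 300, 5000)
```

Sequence extracted for a - strand gene would also need to be reverse-complemented before a forward-strand scan, or the scan should cover both strands.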

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

0

So I've built my PWM based on TF data. I already have the start and end coordinates for the TF binding site. Would it be sufficient to just grab a 1000bp neighbourhood around this binding site? Is 1000bp enough to encapsulate the entire promoter region for this particular binding site?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Affan ▴ 310

0

For prokaryotes, upstream regions of 300 to 500 bp are commonly used.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

0

I'm working with mm9, so would a 1000bp (or 2000bp) neighbourhood be sufficient to calculate the accuracy?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Affan ▴ 310