Question

What Is A Whole Genome Background In Analysis Of Motifs Or Peaks?

7

13.5 years ago

Curiosity ▴ 130

Why does peak analysis or motif analysis most often use a whole-genome background when there is no control sample to compare against?

When I ran motif analysis on 20k peaks, I picked 5000 target sequences and 40k background sequences. Why are the numbers different? Does the difference affect the p-values (% of target sequences that have motif X versus % of background sequences that have motif X)?

motif motif genome sequence • 4.4k views

ADD COMMENT • link updated 13.5 years ago by Ian 6.1k • written 13.5 years ago by Curiosity ▴ 130

score 2 · Answer 1 · 2011-10-24

Yes, the number of sequences analyzed will affect the p-value, because a p-value is a measure of confidence, and confidence changes with the number of tests run and with the size of the set you compare against. That you "picked" 5000 targets and 40000 background sequences may mean that you have introduced a bias. Can you satisfactorily show that those sequences were selected at random? Using the whole genome as background removes that bias. It can be argued that a peak or motif could occur anywhere in the genome. After all, results from the last few years on the control of transcription, and on binding sites for the proteins that regulate that process, indicate that binding sites can exist anywhere, probably because much more of the genome is transcribed than was once thought.
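To make the dependence on set sizes concrete, here is a minimal sketch that assumes enrichment is scored with a one-sided hypergeometric test, which is one common choice and not necessarily what your tool uses. The function name, its arguments, and the counts in the demo are all hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_target, k_target, n_background, k_background):
    """One-sided hypergeometric p-value for motif enrichment.

    Assumption: target and background sequences are pooled, and we ask how
    likely it is to see at least k_target motif-containing sequences when
    n_target sequences are drawn from that pool without replacement.
    """
    total = n_target + n_background          # size of the pooled set
    with_motif = k_target + k_background     # pooled sequences containing the motif
    upper = min(n_target, with_motif)        # largest possible overlap
    numerator = sum(
        comb(with_motif, i) * comb(total - with_motif, n_target - i)
        for i in range(k_target, upper + 1)
    )
    return numerator / comb(total, n_target)


if __name__ == "__main__":
    # Hypothetical counts: the motif is in 20% of targets and 10% of background
    # in both runs; only the set sizes differ.
    small = hypergeom_enrichment_p(100, 20, 200, 20)
    large = hypergeom_enrichment_p(1000, 200, 2000, 200)
    print(f"small sets: p = {small:.3g}")
    print(f"large sets: p = {large:.3g}")
```

With the same percentages in both runs, the larger sets give the smaller p-value, so the sizes of the target and background sets matter as well as the percentages.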

score 1 · Answer 2 · 2011-10-28

1

13.5 years ago

Ian 6.1k

I realise this has already been answered, but I have found that using 'mappable' regions of the genome is a good way of selecting control sequences. This is because not all areas of the genome can be sequenced.

HERE is an example using the UCSC hg18 human genome.

ADD COMMENT • link 13.5 years ago by Ian 6.1k

0

I don't quite understand your answer. Could you please elaborate? Or do you mean that there are sequences that are missed by sequencing machines, and that we can use them as a good background?

ADD REPLY • link 13.5 years ago by Curiosity ▴ 130

0

It is a bit of an assumption, but if a conservative mapping of NGS data results in uniquely mapping reads, then 'mappable' regions of the genome are good for constraining the genome space, when selecting "random" regions, to those areas that can be sequenced. It seems to me that selecting regions from the entire genome is wrong, as there are parts of the genome that will never be satisfactorily sequenced or correctly mapped. Sorry for the ramble; I hope that helps explain my thinking.
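As a rough illustration of that idea, the sketch below draws fixed-width "random" regions only from a list of mappable intervals. It assumes the intervals are already loaded as BED-style half-open `(chrom, start, end)` tuples; the function name, parameters, and example coordinates are hypothetical and not taken from any particular tool.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_background_regions(mappable, width, n, seed=None):
    """Draw n regions of the given width that lie entirely inside mappable intervals.

    mappable: list of (chrom, start, end) tuples, half-open like BED (assumed format).
    Every valid start position across all intervals is equally likely.
    """
    rng = random.Random(seed)
    # Intervals shorter than the requested width cannot hold a region.
    usable = [(c, s, e) for c, s, e in mappable if e - s >= width]
    if not usable:
        raise ValueError("no mappable interval is at least `width` long")
    # Number of valid start positions in each usable interval.
    weights = [e - s - width + 1 for _, s, e in usable]
    cumulative = list(accumulate(weights))
    regions = []
    for _ in range(n):
        pick = rng.randrange(cumulative[-1])      # index over all valid starts
        idx = bisect_right(cumulative, pick)      # interval holding that start
        chrom, start, _ = usable[idx]
        offset = pick - (cumulative[idx - 1] if idx else 0)
        regions.append((chrom, start + offset, start + offset + width))
    return regions


if __name__ == "__main__":
    # Hypothetical mappable intervals, for illustration only.
    intervals = [("chr1", 100, 200), ("chr2", 0, 30), ("chr3", 500, 1000)]
    for region in sample_background_regions(intervals, width=50, n=5, seed=0):
        print(region)
```

Sampling by valid start position, not by interval, keeps long mappable stretches from being under-represented compared with short ones.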

ADD REPLY • link 13.5 years ago by Ian 6.1k