What Is A Whole Genome Background In Analysis Of Motifs Or Peaks?
13.1 years ago
Curiosity ▴ 130

Why do peak analysis and motif analysis most often use a whole-genome background when there is no control to compare against?

When I ran motif analysis on 20k peaks, I picked 5000 target sequences and 40k background sequences. Why are the numbers different? Does it affect the p-values (% of target sequences that have motif X versus % of background sequences that have motif X)?

13.1 years ago

Yes, the numbers analyzed will affect the p-value, because a p-value reflects statistical confidence, and that confidence changes with the number of sequences you test and the number you compare them against. That you "picked" 5000 targets and 40000 background sequences may mean that you have introduced a bias. Can you satisfactorily show that those sequences were selected at random? Using the whole genome as background removes that bias. It can be argued that a peak or motif could occur anywhere in the genome. After all, the last few years of results on the control of transcription, and on binding sites for the proteins that regulate that process, indicate that binding sites can exist anywhere, probably because much more of the genome is transcribed than was once thought.
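To illustrate how the set sizes alone move the p-value, here is a minimal sketch that assumes a one-sided hypergeometric enrichment test, which is a common choice for this kind of target-versus-background comparison but is not necessarily what your tool does. The function name `enrichment_p_value` and the counts are made up for illustration.

```python
from math import comb

def enrichment_p_value(target_hits, target_total, background_hits, background_total):
    """One-sided hypergeometric p-value (illustrative, not any tool's exact method).

    Pools targets and background, then asks: if target_total sequences were
    drawn at random from the pool, how likely is it that at least target_hits
    of them would contain the motif?
    """
    pool = target_total + background_total
    pool_hits = target_hits + background_hits
    upper = min(target_total, pool_hits)
    # Number of ways to draw target_total sequences with at least target_hits motif carriers.
    tail = sum(
        comb(pool_hits, k) * comb(pool - pool_hits, target_total - k)
        for k in range(target_hits, upper + 1)
    )
    # Exact integer arithmetic; only the final ratio is converted to a float.
    return tail / comb(pool, target_total)

# Same percentages (20% of targets vs 10% of background), different set sizes.
p_small = enrichment_p_value(10, 50, 40, 400)
p_large = enrichment_p_value(100, 500, 400, 4000)
print(f"50 targets / 400 background:   p = {p_small:.3g}")
print(f"500 targets / 4000 background: p = {p_large:.3g}")
```

The percentages are identical in both calls, yet the larger sets give a much smaller p-value. For sets as large as 5000 and 40k, exact integer arithmetic gets slow, so a library routine such as `scipy.stats.hypergeom` would probably be the more practical choice.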

Thanks, Larry. So is picking 5000 targets and 40k background sequences normal? I used HOMER for this analysis.

13.1 years ago
Ian 6.1k

I realise this has already been answered, but I have found that using 'mappable' regions of the genome is a good way of selecting control sequences. This is because not all areas of the genome can be sequenced.

HERE is an example using the UCSC hg18 human genome.

I don't quite understand your answer. Could you please elaborate? Or do you mean that there are sequences that are missed by sequencing machines, so that we can use them as a good background?

It is a bit of an assumption, but if a conservative mapping of NGS data yields uniquely mapping reads, then 'mappable' regions of the genome are a good way to constrain the genome space when selecting 'random' regions, limiting them to areas that can be sequenced. It seems to me that selecting regions from the entire genome is wrong, as there are parts of the genome that will never be satisfactorily sequenced or correctly mapped. Sorry for the ramble; I hope that helps explain my thinking.
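As a rough sketch of what constraining "random" regions to mappable space could look like, the snippet below samples fixed-width background regions that lie entirely inside a list of mappable intervals. The function name `sample_mappable_regions`, the interval coordinates, and the BED-style half-open convention are all assumptions for illustration; a real analysis would load the intervals from a mappability track.

```python
import random

def sample_mappable_regions(mappable, width, n, seed=None):
    """Sample n regions of the given width that fit fully inside mappable intervals.

    `mappable` is a list of (chrom, start, end) tuples using half-open,
    BED-style coordinates (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    # Intervals shorter than the region width cannot contain a full region.
    usable = [(c, s, e) for c, s, e in mappable if e - s >= width]
    if not usable:
        raise ValueError("no mappable interval is long enough for the requested width")
    # Weight each interval by its number of valid start positions, so every
    # valid region in the mappable space is equally likely to be drawn.
    weights = [e - s - width + 1 for _, s, e in usable]
    regions = []
    for chrom, start, end in rng.choices(usable, weights=weights, k=n):
        region_start = rng.randint(start, end - width)  # randint is inclusive at both ends
        regions.append((chrom, region_start, region_start + width))
    return regions

# Hypothetical mappable intervals; the second one is too short to hold a 200 bp region.
example_mappable = [("chr1", 100, 1100), ("chr1", 5000, 5050), ("chr2", 0, 400)]
for region in sample_mappable_regions(example_mappable, width=200, n=5, seed=1):
    print(region)
```

Sampling is done with replacement, so regions can overlap or repeat; whether that matters depends on how the background is used downstream.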
