Question

When Normalizing Chip-Seq Replicates, Is It Better To Normalize All Reads, Or Only Reads In A Window Of Interst?

2

Entering edit mode

11.1 years ago

bede.portz ▴ 540

I want to determine the best way to normalize ChIP-seq replicates that differ in total reads. I am analyzing ChIP-seq data for a factor that is found near transcription start sites (TSSs), and focusing my analysis on a relatively small window around TSSs. Such experiments may yield 10 million mappable reads, but only <1million map to a window around TSSs in which I am interested, say +/- 1000bp. I want to normalize for differences in tag counts between technical replicates, and replicates generated from different conditions.

It seems I could normalize by total reads between replicates, i.e. make the total reads in each replicate equal to 10 million and proceed to mapping the reads to a window around TSSs. This method actually takes ALL the data, greater than >90% of which I will discard early in the analysis, for normalization. So I would be normalizing the signal of interest by what amounts to a great deal of excess noise.

Alternatively, I could map the reads to the TSSs within the window of interest, and normalize the data that lies within this window. In this case I can first see how the proportion of tags within the window for each replicate compares to the total number of tags in each replicate. If an equal proportion of total reads from each replicate maps to +/- 1kb around TSSs, both methods should yield similar results. However to me it makes sense to refine the data first, isolating those data you will ultimately analyze, than do the normalization between replicates to adjust for read counts. Especially for cases where the biology predicts replicates from different cellular conditions will differ in a narrow window around a subset of genes: a small percentage or a large dataset.

Does a consensus exist as to the best approach?

Thanks,

Bede

chip-seq rna-seq normalization • 4.0k views

ADD COMMENT • link updated 11.1 years ago by Ying W ★ 4.3k • written 11.1 years ago by bede.portz ▴ 540

score 0 · Answer 1 · 2013-11-13

0

Entering edit mode

11.1 years ago

Ying W ★ 4.3k

I don't think a consensus exists.

I've had good experience normalizing to all reads (raw library size or full library size) in conditions where the total amount of protein bound is changing. Ideally, reads in window (effective library size) should give similar results to all reads and this has been the case for some other chip-seq datasets I've analyzed. Since you pre-define your regions (+/- 1kb around TSS), you do not have the complication of different number of binding regions between samples.

ADD COMMENT • link 11.1 years ago by Ying W ★ 4.3k

0

Entering edit mode

Thanks Ying. This has been my experience as well, but my experience is extremely limited, so I thought I would pose the question.

ADD REPLY • link 11.1 years ago by bede.portz ▴ 540