I wanted to add an example to my comment above about normalization with edgeR. I'm posting it as an answer for ease of reading. Please check that I'm doing things right.
Say you have 100 peaks that are systematically up-regulated in condition 2 vs condition 1. Each library has 100k reads sequenced, even if only a minority of reads actually fall in peaks (as is often the case with ChIP-Seq):
set.seed(1234) # make the simulated counts reproducible
counts<- data.frame(cond1_1= rpois(n= 100, lambda= 10),
                    cond1_2= rpois(n= 100, lambda= 10),
                    cond1_3= rpois(n= 100, lambda= 10),
                    cond2_1= rpois(n= 100, lambda= 30),
                    cond2_2= rpois(n= 100, lambda= 30),
                    cond2_3= rpois(n= 100, lambda= 30))
libSize<- rep(100000, 6)
group<- c('cond1', 'cond1', 'cond1', 'cond2', 'cond2', 'cond2')
Let's run a standard differential expression/binding analysis with edgeR:
library(edgeR)
y<- DGEList(counts= counts, group= group)
y<- calcNormFactors(y)
y<- estimateDisp(y)
et<- exactTest(y, pair= levels(y$samples$group))
detable<- data.frame(topTags(et, n= Inf)$table)
We can see nothing is detected as different:
plot(x= detable$logFC, y= -log10(detable$PValue), xlab= 'logFC', ylab= '-log10(p-value)',
ylim= c(0, 10), xlim= c(-3, 3))
abline(h= -log10(0.01), col= 'red', lty= 'dotted')
abline(v= 0, col= 'blue', lty= 'dotted')
## Nothing is significant!
nrow(detable[detable$logFC > 0 & detable$FDR < 0.05,])
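To see why nothing comes out significant, here is a minimal base-R sketch of the same setup (hypothetical seed and variable names; no edgeR needed). Scaling by in-peak totals, which is roughly what the default normalization sees in this all-peaks-shifted scenario, absorbs the genuine global shift, while scaling by the true sequencing depth preserves it:

```r
set.seed(42)  # hypothetical seed, for reproducibility
cnts <- cbind(matrix(rpois(300, lambda = 10), ncol = 3),
              matrix(rpois(300, lambda = 30), ncol = 3))

# Scale by in-peak totals (roughly what the default normalization sees):
cpm_peaks <- t(t(cnts) / colSums(cnts)) * 1e6
fc_peaks  <- log2(rowMeans(cpm_peaks[, 4:6]) / rowMeans(cpm_peaks[, 1:3]))

# Scale by the true sequencing depth (100k reads per library):
cpm_lib <- cnts / 1e5 * 1e6
fc_lib  <- log2(rowMeans(cpm_lib[, 4:6]) / rowMeans(cpm_lib[, 1:3]))

median(fc_peaks)  # ~0: the global 3-fold shift is normalized away
median(fc_lib)    # ~log2(3): the shift is preserved
```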
Let's repeat the analysis, this time using the actual library sizes and skipping the normalization step. Now all peaks are correctly detected as differentially bound:
y<- DGEList(counts= counts, group= group)
y$samples$lib.size<- libSize
y<- calcNormFactors(y, method= 'none')
y<- estimateDisp(y)
et<- exactTest(y, pair= levels(y$samples$group))
detable<- data.frame(topTags(et, n= Inf)$table)
plot(x= detable$logFC, y= -log10(detable$PValue), xlab= 'logFC', ylab= '-log10(p-value)',
ylim= c(0, 10), xlim= c(-3, 3))
abline(h= -log10(0.01), col= 'red', lty= 'dotted')
abline(v= 0, col= 'blue', lty= 'dotted')
## All peaks detected as significantly up:
nrow(detable[detable$logFC > 0 & detable$FDR < 0.05,])
Does it make sense?
Seeing as the "T" in TPM stands for Transcripts, I would say no :P
There are many normalisation methods for ChIP-Seq, all of which have some sort of downside, since ChIP signal is noisy and has multiple components. "Negative signal" when normalising against input is the one that frustrates people most, since it makes the maths much harder. Dividing by some constant (like total signal) is fine/good, but not particularly interesting -- although people frequently mess even that up and divide by total reads instead. Quantile normalization has been shown to have very positive effects, although in practice you don't see people doing it a lot. You raise a great point though -- I would love to see more research done on a standardized method for calculating ChIP-Seq signal.
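For what it's worth, quantile normalization is easy to sketch by hand. A toy base-R version (it ignores tie handling; real analyses would typically use preprocessCore::normalize.quantiles or limma::normalizeQuantiles):

```r
# Hand-rolled quantile normalization: every column is forced to share
# the same distribution (the row-wise mean of the sorted columns).
quantile_normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  ref   <- rowMeans(apply(m, 2, sort))
  apply(ranks, 2, function(r) ref[r])
}

m <- cbind(s1 = c(5, 2, 3, 4), s2 = c(50, 20, 30, 40))
quantile_normalize(m)  # both columns become 27.5, 11, 16.5, 22
```

After normalization the two samples are distributionally identical, so any global scale difference between them is gone.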
@john Even though the 'T' stands for Transcript, this normalization factor is calculated from tag counts and transcript length. I don't get why you said no to TPM. Do you have any other reasons? I would like to understand this better.
Ok, well, the problem here lies in the differences between RNA-Seq and ChIP-Seq at the assay level. For RNA-Seq, a read pair that maps uniquely to a particular transcript tells you that you have 1 count for that transcript. It's a very 1:1 kind of thing. A ChIP read pair, on the other hand, tells you that you may - or may not - have the protein of interest somewhere on the genome between the two sequenced reads. This amount of uncertainty, and the fact that the assay is genome-centric, means it doesn't work the same way as RNA-Seq. Let's say we have a gene with three exons and 2 possible transcripts: transcript A is exons 1 and 3, and transcript B is exons 2 and 3.
And you have a single read pair in your sequencing file, which I've marked with >>>> and <<<<. In an RNA-Seq assay, that read pair is specific to a single transcript (A), so that transcript gets a +1. For ChIP however, when looking at the signal from a transcript point of view, the "signal" is counted twice - for both transcript A and B. Furthermore, in RNA-Seq nearly all reads can be attributed to a transcript; RNA mappers like TopHat frequently just discard the unmappable reads as if they were never there. For ChIP however, a large amount of signal is just background noise which isn't associated with any transcript at all. These two issues mean TPM is not a useful calculation for ChIP, and for transcript-centric applications dariober's tool is a much better way of going about it. I hope that helps, but if you have more questions please ask.
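The double counting above can be sketched with a toy example (the exon and fragment assignments are hypothetical, just for illustration):

```r
# Exon 3 is shared by transcript A (exons 1 and 3) and transcript B
# (exons 2 and 3), so any ChIP fragment over exon 3 is credited to both.
transcripts    <- list(A = c(1, 3), B = c(2, 3))
fragment_exons <- c(3, 3, 1)  # exon hit by each of three sequenced fragments

tx_counts <- sapply(transcripts, function(ex) sum(fragment_exons %in% ex))
tx_counts       # A gets 3, B gets 2: the two exon-3 fragments count twice
sum(tx_counts)  # 5 "transcript counts" from only 3 fragments
```

With the per-transcript totals inflated like this, a TPM-style denominator no longer reflects what was actually sequenced.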
This paper may be of interest. I've found that adapting RNA-seq approaches (e.g. edgeR or DESeq2) with appropriate considerations is reasonable for comparing ChIP-seq data. One obvious consideration would be if you have greatly increased binding affinity in one condition - then you probably would want to avoid "normalizing" that signal away.
@fanli.gcb Nice suggestion. I have used DESeq for normalization in some analyses, but as you said, DESeq relies on the assumption that most (more than 50%) of the sites do not change in signal. This becomes a problem if you have a condition where binding is reduced at more than 50% of the sites.
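That failure mode can be illustrated with a hand-rolled DESeq-style median-of-ratios calculation in base R (hypothetical seed; here ALL sites drop ~3-fold, violating the "most sites unchanged" assumption):

```r
set.seed(1)  # hypothetical seed, for reproducibility
cnt <- cbind(c1 = rpois(100, lambda = 30),  # condition 1
             c2 = rpois(100, lambda = 10))  # condition 2: all sites down ~3x

# DESeq-style size factors: median ratio to the per-site geometric mean
geo_mean <- exp(rowMeans(log(cnt)))
keep <- is.finite(geo_mean) & geo_mean > 0
sf <- apply(cnt[keep, ], 2, function(x) median(x / geo_mean[keep]))
sf  # roughly sqrt(3) and 1/sqrt(3): the factors soak up the global drop

norm <- t(t(cnt) / sf)
median(log2(norm[, "c2"] / norm[, "c1"]))  # ~0: the real 3-fold change vanishes
```

Because every site moved, the median ratio tracks the global shift itself, and "normalizing" by it cancels the very signal you wanted to detect.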