TPM for ChIP-seq normalization
4
6
Entering edit mode
8.6 years ago
SP ▴ 300

After reading about discrepancies in RPKM method and solution as TPM. This question came to my mind would TPM be a good tag normalization method for ChIP-seq. During ChIP normalization I am always worried about the ChIP-enrichment strength when comparing ChIP-seq in two different condition after normalizing with tag per million. Is TPM a good method for ChIP-seq normalization? All suggestions are helpful.


Thanks

ChIP-Seq TPM • 10k views
ADD COMMENT
4
Entering edit mode

Seeing as the "T" in TPM stands for Transcripts, I would say no :P

There are many normalisations methods for ChIP-Seq, all of which have some sort of downside as signal in ChIP is noisy and has multiple components. "Negative signal" when normalising against input is the one that frustrates most people, since it makes the maths much harder. Dividing by some constant (like total signal) is fine/good, but not particularly interesting -- although people frequently mess even that up and divide by total reads. Doing something like quantile normalization has been shown to have very positive effects, although in practise you don't see people doing that a lot. You raise a great point though -- I would love to see more research done on a standardized method for calculating ChIP-Seq signal.

ADD REPLY
2
Entering edit mode

@john Even though 'T' stands for Transcript, this normalization factor is calculated on tags and length of transcript. I don't get it why you said no to TPM? Do you have any other reasons? I would like to understand better.

ADD REPLY
2
Entering edit mode

Ok, well, so the problem here lies between the differences between RNA-Seq and ChIP-Seq at the assay level. For RNA-Seq, a read-pair that maps uniquely to a particular transcript tells you that you have 1 count for that transcript. It's a very 1:1 kind of thing. A ChIP read-pair on the other hand tells you that you may - or may not - have the protein of interest, somewhere on the genome between the two sequenced read-pairs. This amount of uncertainty, and the fact that the assay is genome-centric, means it doesn't work the same way as RNA-Seq. Lets say we have a gene with three exons and 2 possible transcripts. Transcript A is exon 1 and 3, and transcript B which is 2 and 3:

[====1====]------------[===============2===============]--------------[========3=======]
    >>>>                                                                 <<<<

And you have a single read pair in your sequencing file, which i've marked with >>>> and <<<<. In an RNA-Seq assay, that read pair is specific to a single transcript (A), so that transcript gets a +1. For ChIP however, when looking at the signal from a transcript point of view, the "signal" is counted twice - for both transcript A and B.

Furthermore, in RNA-Seq, nearly all reads can be attributed to a transcript. In fact RNA-mappers like Tophat frequently just discard the unmappable reads as if they were never there. For ChIP however, a large amount of signal is just background noise which isn't associated with any transcript at all. These two issues mean TPM is not a useful calculation for ChIP, and for transcript-centric applications, dariober's tool is a much better way of going about it. I hope that helps, but if you have more questions please ask.

ADD REPLY
0
Entering edit mode

This paper may be of interest. I've found that adapting RNA-seq approaches (e.g. edgeR or DESeq2) with appropriate considerations is reasonable for comparing ChIP-seq data. One obvious consideration would be if you have greatly increased binding affinity in one condition - then you probably would want to avoid "normalizing" that signal away.

ADD REPLY
0
Entering edit mode

@fanli.gcb Nice suggestion, I have used DESeq for normalization in some analysis but as you said DESeq relies on 'hope' that a lot (more than 50%) of signal or expression of genes does not change. So this will make a problem if you have condition where binding is being reduced for more than 50% of the sites.

ADD REPLY
8
Entering edit mode
8.6 years ago

Not sure it answers your question... One strategy I used to quantify "peak strength" is to count reads in the target region (e.g. from peak callers or some other regions of interest) and compare this count to the count in the flanking regions, left and right of the target.

By contrasting target vs flanking you don't have to worry about library size and you capture local biases that, supposedly, are conserved between libraries.

You can get the script from here:

wget 'https://github.com/dariober/bioinformatics-cafe/blob/master/localEnrichmentBed.py?raw=true' -O localEnrichmentBed.py

This is from the help:

localEnrichmentBed.py -h
usage: localEnrichmentBed.py [-h] --target TARGET --bam BAM --genome GENOME
                             [--slop SLOP] [--blacklist BLACKLIST]
                             [--tmpdir TMPDIR] [--keeptmp] [--verbose]
                             [--version]

DESCRIPTION
    Compute the read enrichment in target intervals relative to local background.

    Typical use case: A ChIP-Seq experiment on a sample returns a number of regions
    of enrichment. We want to know how enriched these regions are in a *different*
    sample. Note that enrichment is quantified relative to the local background
    not relative to an input control.

    See also localEnrichmentScore.R to combine replicates and compare
    treatment vs control.

OUTPUT:
    bed file with header and columns:
1. chrom
2. start
3. end
4. targetID
5. flank_cnt
6. target_cnt
7. flank_len
8. target_len
9. log10_pval
10. log2fc

EXAMPLE
    localEnrichmentBed.py -b rhh047.bam -t rhh047.macs_peaks.bed -g genome.fa.fai -bl blacklist.bed > out.bed

Useful tip: Get genome file from bam file:

    samtools view -H rhh047.bam \
    | grep -P "@SQ\tSN:" \
    | sed 's/@SQ\tSN://' \
    | sed 's/\tLN:/\t/' > genome.txt

REQUIRES:
    - bedtools 2.25+
    - numpy, scipy

NOTES:
For PE reads, the second read in pair is excluded (by samtools view -F 128) 
since coverageBed double counts pairs.

optional arguments:
  -h, --help            show this help message and exit
  --target TARGET, -t TARGET
                        Target bed file where enrichment is to be computed.
                        Use - to read from stdin.

  --bam BAM, -b BAM     Bam file of the library for which enrichment is to be
                        computed.

  --genome GENOME, -g GENOME
                        A genome file giving the length of the chromosomes.
                        A tab separated file with columns <chrom> <chrom lenght>.
                        NB: It can be created from the header of the bam file (see tip above).

  --slop SLOP, -S SLOP  Option passed to slopBed to define the flanking region (aka background).
                        If `int` each target will be extended left and right this many bases.
                        If `float` each target is extended left and right this many times its size.
                        E.g. 5.0 (default) extends each target regions 5 times its length left and right.

  --blacklist BLACKLIST, -bl BLACKLIST
                        An optional bed file of regions to ignore to compute
                        the local background. These might be unmappable regions with 0-counts which would
                        inflate the target enrichment.

  --tmpdir TMPDIR       Temp dir to use for intermediate files. If not set
                        python will get one. A subdir will be created here.

  --keeptmp             If set, the tmp dir is not deleted at the end of the
                        job (useful for debugging).

  --verbose, -V         Print to stderr the commands that are executed. 

  --version             show program's version number and exit
ADD COMMENT
0
Entering edit mode

Looks interesting, I will test and ask for your help if I need. Thanks

ADD REPLY
0
Entering edit mode

@dariober I have a question regarding your tool. If i want to compare peaks in samples from different condition (e.g. YY1 peaks in non-leukemic and leukemic) and what to know if peak-strength increased or decreased. Do you think, I should use localEnrichmentBed?

ADD REPLY
0
Entering edit mode

If you want to compare conditions, each condition with N replicates, then my first choice is to use the methods from RNA-Seq based on linear models, namely edgeR, limma-voom, DESeq or diffBind (which is a wrapper around edgeR and DEseq). As for RNA-Seq there is the issue of conditions where most peaks change in one direction, i.e. the assumption that most genes/peaks is not satisfied.

With respect to edgeR: When many peaks change in one direction (e.g. the leukemic group has many more and stronger peaks that the non-leukemic) I usually use the library size as normalization factor, like:

y<- DGEList(counts= myPeakCountMatrix, group=group)
y$samples$lib.size<- librarySizes
y<- calcNormFactors(y, method= 'none')
...

It's not ideal but it seems to give sensible results also because in contrast to RNA-Seq, most ChIP-Seq reads are background (not in peaks) and one or few peaks do not make most of the sequencing amount.

ADD REPLY
0
Entering edit mode

If you have input or IP control, you can use that as a normalization factor as well. Another variant is to use only the signal coming from the peak intervals you are testing instead of the whole genome.

ADD REPLY
0
Entering edit mode

Mmhhh not sure how you can make use of the input as normalization factor since that is going to be the same in all libraries. If you use the signal only from the peak intervals then you normalize away the differences if one condition has more binding than the other(s) (see my example below).

ADD REPLY
0
Entering edit mode

Sorry, I should clarify. If you have multiple conditions, each of which has an associated input or IP control, I think that is suitable to use as a normalization factor. I like to think of this as binding affinity in each peak region (e.g. IP over background). Now whether this violates any assumptions of RNA-seq models...I don't know.

ADD REPLY
3
Entering edit mode
8.6 years ago

I wanted to add an example to my comment above about normalization with edgeR. I put this as an answer for ease of reading. Everybody, check I'm doing things right.

Say you have 100 peaks that are systematically up-regulated in condition 2 vs condition 1. Each library has 100k reads sequenced even if a minority of reads are actually in peaks (that's often the case with ChIP-Seq):

counts<- data.frame(cond1_1= rpois(n= 100, lambda= 10),
                    cond1_2= rpois(n= 100, lambda= 10),
                    cond1_3= rpois(n= 100, lambda= 10),
                    cond2_1= rpois(n= 100, lambda= 30),
                    cond2_2= rpois(n= 100, lambda= 30),
                    cond2_3= rpois(n= 100, lambda= 30))
libSize<- rep(100000, 6)
group<- c('cond1', 'cond1', 'cond1', 'cond2', 'cond2', 'cond2')

Let's run a standard differential expression/binding with edgeR:

R
library(edgeR)
y<- DGEList(counts= counts, group= group)
y<- calcNormFactors(y)
y<- estimateDisp(y)

et<- exactTest(y, pair= levels(y$samples$group))
detable<- data.frame(topTags(et, n= Inf)$table)

We can see nothing is detected as different:

plot(x= detable$logFC, y= -log10(detable$PValue), xlab= 'logFC', ylab= '-log10(p-value)', 
    ylim= c(0, 10), xlim= c(-3, 3))
abline(h= -log10(0.01), col= 'red', lty= 'dotted')
abline(v= 0, col= 'blue', lty= 'dotted')

## Nothing is significant!
nrow(detable[detable$logFC > 0 & detable$FDR < 0.05,])

enter image description here

Let's repeat the analysis using the library size and skipping the normalization step. Now all peaks are correctly classified as different:

y<- DGEList(counts= counts, group= group)

y$samples$lib.size<- libSize
y<- calcNormFactors(y, method= 'none')

y<- estimateDisp(y)
et<- exactTest(y, pair= levels(y$samples$group))
detable<- data.frame(topTags(et, n= Inf)$table)

plot(x= detable$logFC, y= -log10(detable$PValue), xlab= 'logFC', ylab= '-log10(p-value)', 
    ylim= c(0, 10), xlim= c(-3, 3))
abline(h= -log10(0.01), col= 'red', lty= 'dotted')
abline(v= 0, col= 'blue', lty= 'dotted')

## All peaks detected has significantly up:
nrow(detable[detable$logFC > 0 & detable$FDR < 0.05,])

enter image description here

Does it make sense?

ADD COMMENT
0
Entering edit mode

That makes absolute sense, I have worked like this using DESeq and this was my problem. if I normalize, I will loose the difference and if not normalize than people might ask question why didn't you normalize the data. The problem becomes very big if you have only one replicate. To overcome this normalization of DESeq, I have used tag per million very simple normalization = (tag*1mill)/library size. But I think it is not the best method and as you mentioned in the first post that people should not use whole library size instead use combined signal strength. Whereas total signal strength will jeopardise calculation in the case of genome wide reduction.

ADD REPLY
2
Entering edit mode
8.6 years ago
Ryan Thompson ★ 3.6k

If you have large differences in ChIP efficiency between samples, then you simply cannot detect global changes in binding, since these will be indistinguishable from efficiency bias. There is no normalization that can get around this problem. You would need to show using a separate data set that your IP protocol does not show large efficiency biases, and then you will be able to justify normalizing to the background and detect global binding changes.

The way to demonstrate consistent ChIP efficiency would be to run your ChIP-seq protocol several times on the same sample and show that the ratio of signal to background reads is consistent across all the replicates. (See the csaw User's Guide Section 4 for more discussion, and also here for some similar diagnostic plots).

ADD COMMENT
0
Entering edit mode

Hi @Ryan, I'm very interested in the similar diagnostic plots you linked. The file has been moved since posting. Could you update? Thank you

ADD REPLY
1
Entering edit mode
8.6 years ago
SP ▴ 300

Just to show how different normalization would look like for TPM, RPKM, CPM, DEseq


Inital data: Peak strength of different peaks with width of peaks (many people might not like the example for using RPKM and TPM for normalization)

Sample-a    Sample-b    Sample-c    length
50  150 20  100
154 165 65  500
70  75  25  100
45  35  17  150
30  40  12  100
56  34  24  100

TPM norm

TPM-Norm-a  TPM-Norm-b  TPM-Norm-c
187.4062968516  422.138836773   189.8734177215
115.4422788606  92.8705440901   123.417721519
262.3688155922  211.0694183865  237.3417721519
112.4437781109  65.6660412758   107.5949367089
112.4437781109  112.5703564728  113.9240506329
209.8950524738  95.6848030019   227.8481012658

RPKM norm

Rpkm-a  Rpkm-b  Rpkm-c
1.2345679012    3.006012024 1.226993865
0.7604938272    0.6613226453    0.7975460123
1.7283950617    1.503006012 1.5337423313
0.7407407407    0.4676018704    0.6952965235
0.7407407407    0.8016032064    0.736196319
1.3827160494    0.6813627255    1.472392638

CPM norm

CPM-a   CPM-b   CPM-c
123.4567901235  300.6012024048  122.6993865031
380.2469135802  330.6613226453  398.773006135
172.8395061728  150.3006012024  153.3742331288
111.1111111111  70.1402805611   104.2944785276
74.0740740741   80.1603206413   73.6196319018
138.2716049383  68.1362725451   147.2392638037

DESeq norm

Deseq-n-a   Deseq-n-b   Deseq-n-c
41.5584415584   113.3333333333  43.6400817996
128 124.6666666667  141.8302658487
58.1818181818   56.6666666667   54.5501022495
37.4025974026   26.4444444444   37.0940695297
24.9350649351   30.2222222222   26.1840490798
46.5454545455   25.6888888889   52.3680981595
336.6233766234  377.0222222222  355.6666666667

All of these are per 1000 normalized instead of per million. One thing I observe that all of them nearly show the same structure of normalization. My concern was if the sample-c is truly a reduction then none of them can capture it. I think what @dariober suggested could be a nice way to find enrichment using flanking regions. I have tried my best to normalize them properly, If discrepancies please point them out.

ADD COMMENT
1
Entering edit mode

My concern was if the sample-c is truly a reduction then none of them can capture it

The issue, I think, is that you are using the column sums as library size, i.e. only the alignments in peaks. In this case sample C will be incorrectly boosted upwards. If instead you use the library size from all the valid alignments in the genome (e.g. all alignments with mapq > 10), then you should get the expected answer.

As I mentioned before, for ChIP-Seq the number of reads in peaks is small compared to the reads genome wide so I think it is fine to use the total read count as library size. For RNA-Seq this is not at all the case since in a well prepared library >95% of reads are in transcripts. Effectively, my script localEnrichmentBed.py relies on the fact that the flanking regions do have reads which is pretty much always the case in my experience. Since pull downs are so noisy, at least let's try to make some use of this noise to quantify the signal!

ADD REPLY

Login before adding your answer.

Traffic: 2445 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6