Question

Can too high of a ChIPQC RiP% be indicative of insufficient stringency in peak calling?

0

3.6 years ago

gkunz ▴ 30

Hello,

I was trying to make sense of the RiP% value that is returned by ChIPQC. I understand that this is basically just a percentage of how many reads are located within the called peaks.

The vignettes indicate that a RiP% of 5% or greater is typically indicative of good enrichment. For some of the samples I processed I am seeing values as high as 22%. For the 30 samples I am analyzing, the RiP% values range from 13.4 to 22.9. My first instinct was, "Great! Higher RiP% value = greater enrichment!". However, as I think about this a little more, I have become slightly concerned that the high RiP% values I am observing could be the result of insufficient stringency in peak calling. Less stringent peak calling parameters might lead to a greater number of reads occurring in peaks simply because a greater number of peaks are being called. Is this a reasonable train of thought? Are these high values something to be concerned about? My thought is that this would definitely have a substantial impact on the identification of differential peaks down the line.
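To make the concern concrete, here is a minimal sketch in base R. Everything in it is invented for illustration: the peak table, its `score` and `reads` columns, and the `rip_percent()` helper are hypothetical and are not ChIPQC output or API. It assumes peaks do not overlap, so no read is counted twice.

```r
# Toy peak table: all values are invented for illustration.
peaks <- data.frame(
  score = c(50, 40, 30, 8, 6, 5),              # hypothetical peak-caller score
  reads = c(9000, 7000, 5000, 600, 500, 400)   # reads overlapping each peak
)
total_reads <- 100000  # total mapped reads in the toy sample

# RiP% = reads falling in the retained peaks / total mapped reads * 100
rip_percent <- function(peaks, total_reads, min_score) {
  kept <- peaks[peaks$score >= min_score, ]
  100 * sum(kept$reads) / total_reads
}

rip_strict  <- rip_percent(peaks, total_reads, min_score = 20)  # keeps 3 peaks
rip_lenient <- rip_percent(peaks, total_reads, min_score = 5)   # keeps all 6
```

In this toy case the lenient threshold does raise RiP%, but how much depends entirely on how many reads the extra, weaker peaks carry, which is what I cannot judge from the single number ChIPQC reports.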

Any information on how to best go about thinking about this or if this is even a problem is appreciated!

Thanks

R ChIP-seq ChIPQC • 2.0k views

ADD COMMENT • link updated 3.6 years ago by dariober 15k • written 3.6 years ago by gkunz ▴ 30

0

3.6 years ago

dariober 15k

RiP% [...] is basically just a percentage of how many reads are located within the called peaks.

I may be missing something in this back-of-the-envelope calculation... Say we aim to sequence the human genome at 10x depth. With read length 100 bp we need 300M reads. In R:

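genome_size <- 3e9   # approximate human genome size, bp
depth <- 10          # target average coverage
rlen <- 100          # read length, bp
(nreads <- (genome_size * depth) / rlen)  # reads needed for that coverage
# [1] 3e+08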

Now, say there are 10000 "true" binding sites and we detect all of them with peaks of size 300 bp and peak height 100x. I would say this number of binding sites is pretty high considering ~20-30k genes and 100x is good enrichment over the background of 10x.

However, the number of reads-in-peaks is just 3M or 1% of the total number of reads:

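npeaks <- 10000      # number of "true" binding sites
peak_size <- 300     # peak width, bp
peak_height <- 100   # coverage within peaks
(reads_in_peaks <- npeaks * peak_size * peak_height / rlen)
# [1] 3e+06

reads_in_peaks / nreads  # fraction of reads in peaks
# [1] 0.01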

This makes me wonder how you should interpret the RiP%...
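As a rough sketch of the same model: dividing reads-in-peaks by total reads cancels the read length, leaving `npeaks * peak_size * (peak_height / depth) / genome_size`. So under these assumptions the expected RiP depends only on the fraction of the genome covered by peaks and on the fold enrichment over average coverage. The helper `expected_rip()` below is a hypothetical name, and the second set of numbers (40000 peaks of 2 kb) is invented purely to show how wider, more numerous peaks move the estimate. The approximation only makes sense while peaks cover a small part of the genome.

```r
# Back-of-the-envelope RiP under the simple model above.
# fold = peak height relative to the average genome-wide coverage.
expected_rip <- function(npeaks, peak_size, fold, genome_size = 3e9) {
  npeaks * peak_size * fold / genome_size
}

# Same numbers as above: 10000 peaks of 300 bp, 100x over a 10x average
rip_tf <- expected_rip(npeaks = 10000, peak_size = 300, fold = 100 / 10)

# Hypothetical broader mark: 40000 peaks of 2 kb (invented numbers)
rip_broad <- expected_rip(npeaks = 40000, peak_size = 2000, fold = 10)
```

Here `rip_tf` reproduces the 1% from above, while the invented broader case comes out at roughly 27%.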

EDIT in reply to jared.andrews07's comment:

who's getting 300M reads for ChIP-seq [...] that many reads is overkill

I don't know... Considering the size of the genome and the height of even good ChIP peaks, I don't think 10x average coverage (300M reads) is overkill. I would say it is still pretty low, in fact. I think it seems like a lot because sequencing is still expensive and ChIP-seq is traditionally done with a few tens of millions of reads, but in statistical terms it is not much.

Anyway, with fewer reads you detect fewer genuine binding sites, so my ballpark 1% is something of an upper-bound estimate. So, depending on circumstances, I would be more worried about a high RiP (> 5%) than a low one. I agree that RiP is just one of possibly many QCs, but I wonder how useful it is... I mean, it feels like measuring the depth of a swimming pool to see if someone pinched a cupful from it...

ADD COMMENT • link 3.6 years ago by dariober 15k

0

This can occur for certain TFs and be fine, as ATpoint mentioned. RiP varies widely and QC should not be based solely on one metric. I realize it's just an example, but who's getting 300M reads for ChIP-seq? Considering your enrichment should only cover a small fraction of the genome, that many reads is overkill unless you have a really pathetic antibody that you're forced to use and are determined to get some sort of result.

ADD REPLY • link 3.6 years ago by jared.andrews07 ★ 18k

0

Hi Jared - See edit to my post... but basically I agree with you...

ADD REPLY • link 3.6 years ago by dariober 15k

score 3 · Accepted Answer · 2021-04-20

3

3.6 years ago

jared.andrews07 ★ 18k

This is going to depend very much on the protein being immunoprecipitated. These ranges are not outside the norm for histone marks - I've seen up to 35% for very high quality ChIPs for H3K27ac for instance. TFs may see a range of 2-10% while still having effective enrichment.

Note that some peak callers do struggle with very high enrichments and increased sequencing depths. The original MACS implementation would sometimes call very wide peaks instead of individual peaks in regions with high enrichment. This could be mitigated by using the --call-subpeaks argument, but the resulting subpeaks usually required additional filtering as well to remove valley regions.

You can usually tell if that is occurring in your data by simply viewing them in IGV and looking at min/max/avg peak sizes. For TFs, you can consider setting a max or fixed peak size to help overcome this.
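A minimal sketch of that width check in base R, using an invented BED-like table so it runs on its own. With real data you would instead load the peak caller's BED or narrowPeak output, e.g. with `read.table()`. The 10 kb cutoff is an arbitrary choice for this sketch, not a recommended value.

```r
# Toy peak table in BED-like form; coordinates are invented.
peaks <- data.frame(
  chr   = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  start = c(1000, 50000, 2000, 80000, 10000),
  end   = c(1400, 52500, 2300, 95000, 11200)
)
peaks$width <- peaks$end - peaks$start   # BED intervals are half-open

# Min / mean / max peak width
width_stats <- c(min  = min(peaks$width),
                 mean = mean(peaks$width),
                 max  = max(peaks$width))

# Flag unusually wide peaks for a closer look in IGV
max_width <- 10000   # arbitrary cutoff, assumption for this sketch
too_wide <- peaks[peaks$width > max_width, ]
```

If `too_wide` holds more than a handful of regions, those are the ones worth inspecting by eye first.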

Alternatively, you can use methods that don't rely on peaks to identify differential regions, e.g. csaw.

ADD COMMENT • link 3.6 years ago by jared.andrews07 ★ 18k

0

Hi Jared,

Thank you for the thoughtful response! This is the second time you have provided a clear and helpful answer to a question I have posed. It is greatly appreciated!

The ChIP performed was for H3K27ac, so it sounds like these RiP% values are well within reason, which is reassuring! I have done a fair bit of IGV visualization to look at the called peaks and identified differential peaks. There is definitely major variation in the height / strength of called peaks, but they do appear to align with the bam files well enough.

I was curious whether you know the average size of H3K27ac peaks that tend to validate. I have never been able to find a clear enough answer regarding expected peak sizes (i.e. sufficient read pile-up for a called peak). I see such variation among the peaks that have been called in the data set that I have struggled to reach any sort of meaningful conclusion. I would imagine that there is a lot of variability out there depending on a variety of factors like cell type, organism, xyz... but was curious if you had any input.

I have never looked at csaw before but will aim to give it a look and see if it validates or invalidates the differential peaks I have identified up to this point.

Thanks again! The help and input are greatly appreciated!

ADD REPLY • link 3.6 years ago by gkunz ▴ 30

0

As you said, it's going to vary, but I'd say the average H3K27ac peak is probably 1-5kb in size. If you're getting lots of peaks that are 10kb+ wide, you may want to use PeakSplitter or something similar to break them up. MACS2 is generally pretty good at avoiding huge peaks like that unless you're using the broad peaks setting though. I would be rather surprised if it was a significant issue in your data if you used MACS2.

ADD REPLY • link 3.6 years ago by jared.andrews07 ★ 18k

0

TFs can well have FRiPs in the 20% or 30% range; it all depends on the antibody, the expression level of the protein and the quality of the chromatin. I have seen both published data and our own achieve this, while for some TFs you can be happy to get around 1%. As Jared said, just look at the browser tracks by eye; that is the best diagnostic.

ADD REPLY • link 3.6 years ago by ATpoint 85k