I have read a lot of documentation on strand cross-correlation for ChIP-seq quality assessment, where a good library is expected to show a higher cross-coverage score at the fragment length than at the read length. I see mixed comments indicating this metric might not be appropriate for paired-end sequencing. Can anyone explain why? My dataset, generated from paired-end ChIP-seq, shows an extremely high Pearson correlation at the read length and no distinct fragment-length peak. I wonder if this is possible for a ChIPseq library with heterogeneous fragment lengths in the range of 200-700bp? Attached is my CC plot from ChIPQC.
What does this mean? Can you show some plots for this and a screenshot from IGV? My personal opinion on all these ENCODE-derived quality metrics is that you can ignore them all and simply look at the data in IGV. This, together with the number of peaks and the fraction of reads in peaks, will tell you whether the data are good or not.
Thanks for the reply! This QC plot typically produces a peak at the strand_shift value matching the dominant fragment length. My question is what could explain this peak not appearing when my ChIP peaks look pretty good (see image). In the image, from the top down, the tracks are: input, H3K27me3 ChIP, transcription factor ChIP replicate 1, transcription factor ChIP replicate 2. Based on your input, one option could be to validate the peaks of interest with ChIP-qPCR. I am still curious about the lack of a CC_score peak at the ChIP fragment length.
I see mixed comments indicating this metric might not be appropriate for paired-end sequencing. Can anyone explain why?
I have no idea why people would think this metric is inappropriate. Of course, in paired-end sequencing the fragment sizes are known directly (in effect, the insert size), so you don't need to estimate them using cross-correlation. The additional sequence from longer reads (or from paired-end alignment) will also reduce the artifactual cross-correlation at the read length, which is due to alignment issues.
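For anyone unsure what the plot is measuring, here is a minimal toy sketch of a strand cross-correlation profile. It assumes you already have the 5' positions of plus-strand reads and the matching fragment end positions on the minus strand for one chromosome. The function names and inputs are hypothetical, and this is not how ChIPQC computes its cross-coverage score; it only illustrates the idea of correlating the two strands at each shift.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus_positions, minus_positions, chrom_len, max_shift):
    """Correlate plus-strand counts with minus-strand counts at each shift.

    Toy version: positions are 0-based integers on a single chromosome,
    and counts are per-base tag counts rather than smoothed coverage.
    """
    plus = [0] * chrom_len
    minus = [0] * chrom_len
    for pos in plus_positions:
        plus[pos] += 1
    for pos in minus_positions:
        minus[pos] += 1

    profile = {}
    for shift in range(max_shift + 1):
        # Compare plus[i] against minus[i + shift] over the overlapping region.
        profile[shift] = pearson(plus[:chrom_len - shift], minus[shift:])
    return profile
```

With a single fragment length, the profile peaks at that length. With a broad mix of fragment lengths, one would expect the signal to be spread over many shifts, giving a flatter, less distinct peak.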
I wonder if this is possible for a ChIPseq library with heterogeneous fragment lengths in the range of 200-700bp?
While it's true that your peak at ~150bp coincides with your read length, it also corresponds to the length of DNA wrapped around a nucleosome (~146bp). This is often the dominant peak in digestion-based or tagmentation-based assays, and the nucleosome may partially shield the bound DNA from sonication breaks, leading to a similar bias in ChIP.
Anyway, just plot the aligned fragment length distribution.
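A minimal sketch of how that distribution could be tabulated, assuming the alignments are available as SAM-format text lines (for example, streamed from `samtools view`). It reads the observed template length (TLEN, column 9) once per properly paired fragment. The function names, `max_len` and `bin_size` are illustrative choices, not part of any tool mentioned above.

```python
from collections import Counter


def fragment_length_histogram(sam_lines, max_len=2000):
    """Count fragment lengths from SAM text lines using TLEN (column 9)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # Keep properly paired (0x2) first mates (0x40) so each fragment counts once.
        if not (flag & 0x2) or not (flag & 0x40):
            continue
        # Skip secondary (0x100) and supplementary (0x800) alignments.
        if flag & 0x900:
            continue
        size = abs(tlen)
        if 0 < size <= max_len:
            counts[size] += 1
    return counts


def binned(counts, bin_size=50):
    """Collapse per-base counts into bins keyed by the lower bin edge."""
    bins = Counter()
    for size, n in counts.items():
        bins[(size // bin_size) * bin_size] += n
    return dict(sorted(bins.items()))
```

Passing `sys.stdin` as `sam_lines` lets a script built on this read directly from a `samtools view` pipe. The binned counts can then go into any plotting tool; a 200-700bp library should show up as a broad hump, and a sharp mode near 150bp would point to the mononucleosome-sized fragments mentioned above.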