Hello all :)
Some histone marks are said to generate 'broad' peaks, while others generate 'narrow' peaks. Others still, like Pol II, might be somewhere in between. Personally, I have had a really hard time keeping straight which marks produce which kinds of peaks, and what those distributions look like genome-wide.
This got me wondering whether there was a way to represent the distribution of signal graphically, to illustrate the differences in IP efficiency between two antibodies, two experiments, two chromosomes, etc. Perhaps one experiment produces 'sharper', cleaner peaks than the other (new data compared to previously published data, for example).
You can do something like this with a FRiP score (fraction of reads in peaks), but it relies heavily on what the peak caller defines as a peak. When comparing two experiments, say in vivo to in vitro, the problem of pre-defined or overlapping peaks can become messy.
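To make the dependence on peak calls concrete, here is a minimal, hypothetical FRiP sketch: the fraction of reads whose midpoint falls inside a called peak. The peak coordinates and read positions below are toy values I made up for illustration; in a real analysis the peaks would come from a peak caller, which is exactly the dependence being flagged above.

```python
import numpy as np

# Toy peak intervals (start, end) -- in practice these come from a peak caller
peaks = np.array([[100, 500], [2_000, 2_600]])
# Toy read midpoints along one chromosome
read_mids = np.array([150, 300, 900, 2_100, 2_550, 5_000])

# Mark each read whose midpoint lands inside any peak interval
in_peak = np.zeros(read_mids.shape, dtype=bool)
for start, end in peaks:
    in_peak |= (read_mids >= start) & (read_mids < end)

frip = in_peak.mean()  # fraction of reads in peaks; here 4 of 6 reads
```

Change the peak boundaries even slightly and the score changes too, which is why comparing FRiP across experiments with different peak sets gets messy.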
These charts are just my first attempt at plotting genome-wide signal distribution - no doubt there is some existing method/tool to do this which I have not yet come across, in which case I'd be very grateful to hear about how this is supposed to be done!
I couldn't decide between the following two representations of the same data (exact and cumulative), so I thought I would ask here to see which you all prefer. The data is made by simply counting the frequency of signal-depth for every base in the genome. Signal-depth is just like read-depth, except it includes the regions between read-pairs and excludes soft-clips, duplicates, etc. - since this is ChIP data.
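In case it helps to see the idea in code, here is a rough sketch of how I imagine the counting works, using a synthetic genome and synthetic fragments (everything here is made up for illustration - fragment spans stand in for full read-pairs with duplicates already removed). It builds a per-base signal-depth track with a difference-array trick, then derives the exact distribution (% of total signal at each depth) and its cumulative form:

```python
import numpy as np

# Synthetic stand-ins: a 1 Mb "genome" and 50k fragments of 300 bp,
# where each fragment spans the whole read-pair (ChIP signal, not reads)
genome_len = 1_000_000
rng = np.random.default_rng(0)
starts = rng.integers(0, genome_len - 300, size=50_000)
fragments = np.stack([starts, starts + 300], axis=1)

# Difference-array trick: +1 at each fragment start, -1 past each end,
# then a cumulative sum gives per-base signal-depth in O(n) time
diff = np.zeros(genome_len + 1, dtype=np.int32)
np.add.at(diff, fragments[:, 0], 1)
np.add.at(diff, fragments[:, 1], -1)
depth = np.cumsum(diff[:-1])

# Exact distribution: % of total signal sitting at each depth value
depths, counts = np.unique(depth, return_counts=True)
signal_per_depth = depths * counts          # bases at that depth x depth
exact = 100 * signal_per_depth / signal_per_depth.sum()

# Cumulative distribution: % of signal at depth <= x
cumulative = np.cumsum(exact)
```

The `exact` array is what the first plot shows and `cumulative` the second; plotting is then just `plt.plot(depths, exact)` or `plt.plot(depths, cumulative)` per sample.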
If you have any suggestions or criticisms, please please please let me know :)
UPDATE - fixed some of the text above, and added plots below for each chromosome. Note that the lines for X and Y look weird because they have a significantly lower chance of receiving signal, since all mice were male. MT is also a very small contig.
And below are plots for different replicates which used the exact same antibodies/etc.
This is an interesting idea - I don't think I've seen any software packages that do this. In plot #2 showing the cumulative distribution, how would you interpret H3K9me3 (green) vs H3K27me3 (red)? Could this just be a better antibody for H3K27me3 with the same underlying pattern? Or, if these plots are not normalised to library size, H3K27me3 could simply be more deeply sequenced.
Thanks Ryan :)
So looking at those two particular lines, I would say both are very broad peaks (H3K9me3 being broader), as they both use up around 90% of their signal in 'peaks' no more than 50 reads high. This could either reflect the biology of those marks (the epitope covers broad domains of chromatin), or suggest that the antibody is poor, with high background noise that pulls down any old thing 90% of the time.
If I dare say so, I know that it is unlikely to be the latter, as the data going into these plots are "reference quality" epigenomes - loads of reads and the best antibodies the consortium could get. However, I will admit it would be difficult to tell the difference from just these plots. In fact, I'll go one step further and say that these plots could not tell you if an IP worked or not, unless you already have a line for what 'worked' looks like :( I guess 'peakier' doesn't always mean the IP worked better...
Regarding your last point, I realise now that although the signal (y-axis) is expressed as a % of total signal (and thus normalised between samples), the x-axis - signal depth, or peak height - is obviously very dependent on the total number of reads sequenced. If you throw more reads at the problem, you'll get deeper/taller signal/peaks and stretch the plot to the right. This is why the Poor Quality Data, the 2013 data, and the 2015 data all look similar but out of phase - the different number of reads used (increasing each year) causes this. So I will re-plot now with depth normalised and see how that changes things :) Thank you!
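One way I could imagine doing that re-plot (a sketch only, with toy data - the function name and the Poisson depths are mine, not from any real sample) is to express each depth value as a multiple of the sample's mean genome-wide depth, so libraries sequenced to different totals land on the same x-axis:

```python
import numpy as np

def depth_normalised_curve(depth_track):
    """Return (depth / mean depth, cumulative % of signal) for one sample."""
    depths, counts = np.unique(depth_track, return_counts=True)
    signal = depths * counts                     # signal contributed per depth
    cumulative = 100 * np.cumsum(signal) / signal.sum()
    mean_depth = depth_track.mean()              # total signal / genome length
    return depths / mean_depth, cumulative

# Toy check: the same sample sequenced "twice as deep" (every base doubled)
rng = np.random.default_rng(1)
sample = rng.poisson(10, size=100_000)
x1, y1 = depth_normalised_curve(sample)
x2, y2 = depth_normalised_curve(sample * 2)
# After normalisation the two curves overlap exactly, as hoped
```

If that works on the real data, the Poor Quality, 2013, and 2015 curves should collapse onto one another wherever the only difference really is sequencing depth.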