Hello All,
I want to generate a pie chart of protein occupancy across different genomic features. I considered using bedtools to query a BAM file against an annotation in BED format to retrieve the number of tags aligning to those features. Is this OK, or is it more appropriate to compare the peak caller's BED output against the BED annotation? I suspect that querying the BAM file directly could pick up considerable noise from non-specific reads.
Thanks
Please don't make pie charts, they're terrible at conveying information.
It doesn't matter whether it is a pie chart or some other way of presenting % distributions. But why would pie charts be terrible?
It turns out that humans are really bad at accurately estimating and comparing percentages represented in pie charts. A table is typically preferred, though if you have time course or other longitudinal data then there are other graphical options.
Your data are not categorical (an occupancy site could intersect with many genomic features) - so a pie chart (rarely the right approach) is definitely wrong here.
This is the first time I've seen somebody comment on data he hasn't even seen. Anyway, what you said doesn't make sense and you're mistaken. Even if the protein intersects with several genomic features, you can still quantify the number of tags specific to these features and get an idea of the protein's genome-wide occupancy. Actually, % distributions are quite common in novel ChIP-seq analyses, which is the case for me.
If the pie chart you're after is like the ones shown in the link provided in Ido's answer below, then I am not mistaken. It is perfectly possible for a ChIP region to be both in a promoter of one gene, and downstream of another gene - so the data are not categorical & a pie chart is wrong.
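To make the overlap point concrete, here is a minimal Python sketch with made-up coordinates. The peaks, the feature labels, and the `overlaps` / `count_by_feature` helpers are all hypothetical and not taken from any library or from the poster's data. A single peak that sits in the promoter of one gene and downstream of another gets counted under both labels, so the per-feature counts no longer add up to the number of peaks.

```python
from collections import Counter

# Hypothetical toy data: (start, end) for peaks, (start, end, label) for features.
peaks = [(900, 1100), (5000, 5200), (9000, 9100)]
features = [
    (0, 1000, "promoter"),       # promoter of one gene
    (1000, 1500, "downstream"),  # region downstream of another gene
    (4800, 6000, "gene_body"),
]

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def count_by_feature(peaks, features):
    """Count peaks per feature label; a peak is counted once per label it hits."""
    counts = Counter()
    for p_start, p_end in peaks:
        labels = {
            label
            for f_start, f_end, label in features
            if overlaps(p_start, p_end, f_start, f_end)
        }
        if not labels:
            labels = {"intergenic"}  # no annotated feature overlaps this peak
        counts.update(labels)
    return counts

counts = count_by_feature(peaks, features)
print(dict(counts))
print(sum(counts.values()), "labels assigned to", len(peaks), "peaks")
```

With these toy numbers the first peak lands in two classes, so four labels are assigned to three peaks, and any pie chart drawn from those counts would represent more than 100% of the peaks.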
Just because it is common, doesn't mean it's correct.
It is not about it being common or not, but about what kind of information you are after. I agree that a ChIP region can be at the promoter of one gene and in the gene body of another, but that doesn't exclude the validity of estimating % distributions. What you want to know is whether the protein has a preference for promoters, gene bodies, or whatever region you are interested in, which is a perfectly valid question to answer. You can actually generate average profiles for gene bodies, starting from TSS -1000 and ending at TES +1000, which would give you an idea of the most common occupancy.
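A rough idea of what such an average profile involves, as a Python sketch on toy data. The coverage vector, the TSS list, and the `average_profile` helper are hypothetical. The sketch only covers a fixed window around each TSS; a full TSS-to-TES profile would also need genes rescaled to a common length, which is not shown here.

```python
# Hypothetical toy inputs: per-base coverage on one chromosome and two TSSs.
coverage = [0] * 50
for pos in range(20, 23):   # signal at and just after position 20
    coverage[pos] = 4
for pos in range(30, 33):   # signal at and just after position 30
    coverage[pos] = 2

tss_list = [(20, "+"), (30, "-")]  # (position, strand)

def average_profile(coverage, tss_list, flank):
    """Mean coverage at each offset from -flank to +flank around the TSSs.

    Minus-strand windows are reversed so that upstream is always on the left.
    TSSs whose window runs off either end of the coverage vector are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos, strand in tss_list:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(coverage):
            continue
        window = coverage[start:end]
        if strand == "-":
            window = window[::-1]
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    if used == 0:
        return []
    return [total / used for total in totals]

profile = average_profile(coverage, tss_list, flank=3)
print(profile)
```

Note the strand flip: the same genomic signal ends up downstream of the plus-strand TSS but upstream of the minus-strand one, which is exactly the kind of detail that matters when two genes are transcribed in opposite directions from nearby start sites.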
Unless you're after the wrong kind of information, any collating statistic (sum, mean, median, mode, max, min) based on overlapping regions and expressed in a pie chart will cause issues, my man.
I know it can seem like a non-issue at first, but since many marks sit right at the beginning of two genes transcribed in opposite directions, TSS +- anything is going to really mess things up - I'm telling you, overlapping intersections are the bane of epigenetics because they're so easy to do wrong and are rarely documented in the methods properly :/
EDIT: to make the comment a little more positive - you can do pie charts if you only collate signal on regions without overlaps. However, it's almost always better to plot signal distributions (not just a sum or a mean) and look at those. It tells you a lot more.
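One way this could look in practice, as a hedged Python sketch: keep only peaks that hit exactly one feature class, set the ambiguous ones aside, and keep the full list of per-peak signal values for each class instead of a single sum. All names and numbers here (`labels_for`, `signal_by_unambiguous_label`, the toy peaks and features) are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy data: (start, end, signal) peaks and (start, end, label) features.
peaks = [
    (900, 1100, 12.0),   # spans promoter and downstream -> ambiguous
    (200, 400, 30.0),
    (5000, 5200, 8.0),
    (5500, 5600, 3.0),
]
features = [
    (0, 1000, "promoter"),
    (1000, 1500, "downstream"),
    (4800, 6000, "gene_body"),
]

def labels_for(peak_start, peak_end, features):
    """Set of feature labels whose half-open interval overlaps the peak."""
    return {
        label
        for f_start, f_end, label in features
        if peak_start < f_end and f_start < peak_end
    }

def signal_by_unambiguous_label(peaks, features):
    """Group per-peak signal by label, keeping only peaks with a single label."""
    by_label = defaultdict(list)
    ambiguous = []
    for start, end, signal in peaks:
        labels = labels_for(start, end, features) or {"intergenic"}
        if len(labels) == 1:
            by_label[next(iter(labels))].append(signal)
        else:
            ambiguous.append((start, end, signal))
    return dict(by_label), ambiguous

by_label, ambiguous = signal_by_unambiguous_label(peaks, features)
for label, values in sorted(by_label.items()):
    print(label, "n =", len(values), "median =", median(values), "values =", values)
print("set aside as ambiguous:", ambiguous)
```

The per-label value lists can go straight into a box plot or histogram, and the number of peaks set aside as ambiguous is worth reporting in the methods alongside them.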