I am trying to plot a heatmap of read density around a feature of interest (TSSs) very common in genomics papers. something like this (B):
However, I am struggling a bit in getting to look "right". A bit of background: I have mapped ChIP-seq reads for pol2 and calculate the coverage, per nucleotide, using bedtools.
coverageBed -d -abam $bamFile -b $TSSs > $coverage.bed
# output:
chr1 67108226 67110226 uc001dct.3 16 + 1 10
chr1 67108226 67110226 uc001dct.3 16 + 2 10
chr1 67108226 67110226 uc001dct.3 16 + 3 10
chr1 67108226 67110226 uc001dct.3 16 + 4 10
chr1 67108226 67110226 uc001dct.3 16 + 5 8
chr1 67108226 67110226 uc001dct.3 16 + 6 8
chr1 67108226 67110226 uc001dct.3 16 + 7 8
chr1 67108226 67110226 uc001dct.3 16 + 8 8
chr1 67108226 67110226 uc001dct.3 16 + 9 8
chr1 67108226 67110226 uc001dct.3 16 + 10 8
Then in R, the genomic position, in column 7, is converted to relative position to the TSS and read counts normalized to the library size. This is converted to a numeric matrix with each row being a TSS and each column the relative nucleotide position. For the plotting the matrix is ordered number of reads per TSS, and the values logged. This is the outcome:
heatmap(cov.mlog, Rowv=NA, Colv=NA, scale="column", labCol = FALSE, labRow = FALSE, col=brewer.pal(9, "Greens"), margins = c(5, 5))
Although the average coverage plot looks as one would expect for pol2, the heatmap is, well, not great. My question is what am I doing wrong and how to improve it? In this and this paper, the coverage is calculate for a bin of nucleotides and not per nucleotide as I am doing. Would that improve visualization? How to do it in bedtools? Should the sorting of the matrix be done differently?
I am aware that could use ngsplot for this, but I am trying to avoid it because my implementation would fit better with my other analysis.
Thank you!
I have bigwig files for input and IP - the input seems to have a lot more peaks and the IP almost looks flat. Can I actually compare these two files ?
Since the input has more reads 77578695 compared to 52746408 in the IP. I would expect the IP samples to have significantly enriched regions compared to the input, and as you can see the opposite is true in my samples.
https://s3.postimg.org/7ixq74p37/all_genomewide3.jpg
all_genomewide3.jpg