From start to finish, this approach renders a plot for each chromosome, across all chromosomes in your regions of interest. Presumably it could be more interesting to know how overlaps are distributed within a chromosome.
As an example, we'll use reference genome hg38
, along with the Kent tools kit, the BEDOPS toolkit, and R with ggplot2
:
$ fetchChromSizes hg38 | awk '{print $1"\t0\t"$2;}' | sort-bed - > hg38.bed
$ bedmap --echo --echo-overlap-size --skip-unmapped hg38.bed myRegions.bed | cut -f1,4 | awk '{ n=split($2, a, ";"); for(i=1; i <= n; i++){ print $1"\t"a[i]; } }' > overlapSizes.txt
$ plotHistogram.Rscript --inFn=overlapSizes.txt --outFn=overlaps.pdf
The script plotHistogram.Rscript
could use ggplot2
to render a column of per-chromosome histograms, and might look like this:
#!/usr/bin/env Rscript
library("optparse")
option_list = list(
make_option(c("-i", "--inFn"), type="character", default=NULL, help="input file name", metavar="character"),
make_option(c("-o", "--outFn"), type="character", default=NULL, help="output file name", metavar="character")
)
opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)
d <- read.table(opt$inFn, header=F, col.names=c("chr", "overlap"), sep="\t")
d$chr <- as.factor(d$chr)
library(ggplot2)
p <- ggplot(data = df, aes(x=overlap))
p <- p + geom_histogram(aes(fill=chr))
p <- p + scale_fill_brewer(palette="Set3")
p <- p + facet_wrap( ~ chr, ncol=1)
pdf(opt$outFn, width=5, height=30)
print(p)
dev.off()
Thanks Pierre! This works very nicely, but I think I am going to have to bin things a bit differently actually. At the moment, the data basically looks like a very single large bar over 0, with a very small bar at 100 and no visible bars from 200 - 3300. I've been trying to get a hold of some of the gnuplot documentation, but if you have a quick fix so that I could make my bin size smaller, that would be great.
in the awk , change '100' to a lower binning value... e.g:
Ah, excellent, I'm not very familiar with awk so I wasn't sure where the binning value was. Thanks a lot!
Is there any way to change the scale of the x-axis? Say I want to focus in on sizes ranging from 1-300 bp or so. Again, apologies, I'm just unfamiliar with awk.