Question

Output bed file region sizes as histogram

0

Entering edit mode

8.0 years ago

mmmmcandrew ▴ 200

Hey all-

I have a bed file containing millions of regions that range in size from 1 bp - ~3300 bp, but I would like to know the frequency distribution of these sizes. Note that the sizes themselves are not included in the bed file, it simply contains the coordinates.

Is there a tool that could take this bed file as the input and output a histogram (preferably an image file, though I'm not terribly picky) of size frequencies? I don't care whether binning is performed or not.

Thanks!

bed bedtools bedops histogram • 4.7k views

ADD COMMENT • link updated 8.0 years ago by Alex Reynolds 36k • written 8.0 years ago by mmmmcandrew ▴ 200

score 2 · Accepted Answer · 2017-04-27

2

Entering edit mode

8.0 years ago

Pierre Lindenbaum 166k

using gnuplot

$ awk '{printf("%d\n",100*int((int($3)-int($2))/100.0));}' in.bed | \
   sort -n | uniq -c  |\
   awk 'BEGIN {printf("set terminal png; set output \"out.png\"; set xtics rotate; set yrange [0:]; set style data histogram; plot \"-\" using 1:xtic(2) title \"\"\n");} {print}' |\
   gnuplot

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks Pierre! This works very nicely, but I think I am going to have to bin things a bit differently actually. At the moment, the data basically looks like a very single large bar over 0, with a very small bar at 100 and no visible bars from 200 - 3300. I've been trying to get a hold of some of the gnuplot documentation, but if you have a quick fix so that I could make my bin size smaller, that would be great.

ADD REPLY • link 8.0 years ago by mmmmcandrew ▴ 200

0

Entering edit mode

like a very single large bar over 0,

in the awk , change '100' to a lower binning value... e.g:

10*int((int($3)-int($2))/10.0)

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Ah, excellent, I'm not very familiar with awk so I wasn't sure where the binning value was. Thanks a lot!

ADD REPLY • link 8.0 years ago by mmmmcandrew ▴ 200

0

Entering edit mode

Is there any way to change the scale of the x-axis? Say I want to focus in on sizes ranging from 1-300 bp or so. Again, apologies, I'm just unfamiliar with awk.

ADD REPLY • link 8.0 years ago by mmmmcandrew ▴ 200

score 2 · Accepted Answer · 2017-04-27

2

Entering edit mode

8.0 years ago

Ryan Dale 5.0k

If you have enough RAM and are OK with Python:

import pybedtools
from matplotlib import pyplot as plt
sizes = [len(i) for i in pybedtools.BedTool('my.bed')]
plt.hist(sizes)
plt.savefig('hist.png')
plt.show()

ADD COMMENT • link 8.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

At the "plt.hist(sizes)" step, this gives me the following error:

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2954, in hist ax = gca() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 936, in gca return gcf().gca(*kwargs) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 586, in gcf return figure() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 535, in figure *kwargs) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 84, in new_figure_manager return new_figure_manager_given_figure(num, figure) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 92, in new_figure_manager_given_figure window = Tk.Tk() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/lib-tk/Tkinter.py", line 1820, in __init__ self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use) _tkinter.TclError: no display name and no $DISPLAY environment variable

I'm not great with python. Any ideas?

ADD REPLY • link 8.0 years ago by mmmmcandrew ▴ 200

1

Entering edit mode

Sounds like you're running on a headless server with no X11 windows. In that case add these two lines to the very beginning:

import matplotilb
matplotlib.use('agg')

see http://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined for more details and https://matplotlib.org/faq/howto_faq.html#matplotlib-in-a-web-application-server for background

ADD REPLY • link 8.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

Thanks! This works great now.

ADD REPLY • link 8.0 years ago by mmmmcandrew ▴ 200

score 2 · Accepted Answer · 2017-04-27

From start to finish, this approach renders a plot for each chromosome, across all chromosomes in your regions of interest. Presumably it could be more interesting to know how overlaps are distributed within a chromosome.

As an example, we'll use reference genome hg38, along with the Kent tools kit, the BEDOPS toolkit, and R with ggplot2:

$ fetchChromSizes hg38 | awk '{print $1"\t0\t"$2;}' | sort-bed - > hg38.bed
$ bedmap --echo --echo-overlap-size --skip-unmapped hg38.bed myRegions.bed | cut -f1,4 | awk '{ n=split($2, a, ";"); for(i=1; i <= n; i++){ print $1"\t"a[i]; } }' > overlapSizes.txt
$ plotHistogram.Rscript --inFn=overlapSizes.txt --outFn=overlaps.pdf

The script plotHistogram.Rscript could use ggplot2 to render a column of per-chromosome histograms, and might look like this:

#!/usr/bin/env Rscript
library("optparse")
option_list = list(
  make_option(c("-i", "--inFn"), type="character", default=NULL, help="input file name", metavar="character"),
  make_option(c("-o", "--outFn"), type="character", default=NULL, help="output file name", metavar="character")
)
opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)
d <- read.table(opt$inFn, header=F, col.names=c("chr", "overlap"), sep="\t")
d$chr <- as.factor(d$chr)
library(ggplot2)
p <- ggplot(data = df, aes(x=overlap))
p <- p + geom_histogram(aes(fill=chr))
p <- p + scale_fill_brewer(palette="Set3")
p <- p + facet_wrap( ~ chr, ncol=1)
pdf(opt$outFn, width=5, height=30)
print(p)
dev.off()