There may already be a recipe for this, so asking first before reinventing the wheel:
I would like to create a bed file where the score is the average mapQ from the reads of the input.bam
file. I think bedtools
or bedops
are the way to go:
http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html
Other than simply running bamtobed
/bam2bed
, I would like to be able to define a sliding window size and step for the windows, of say, size=1000 and step=200.
I also would like to generate the bam2bed information only from a list of regions in regions.bed
. E.g., something like:
mapq_sliding_windows --bam input.bam --wsize 1000 -wstep 200 --regions regions.bed > mapq_sliding_windows.bed
EDITED:
Thank you Aaron for you answer. I got it working but it's slow for my 30x WGS bams:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
bedtools makewindows -g hg19.genome -w 1000 -s 200 > hg19.windows.bed
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed
Last step is taking 6 hours and still running on an 86G bam. Any possible ways to make it faster?
hmmm, where is the bam file here? Maybe
aln.bed
should beinput.bam
? I am getting lots of '.' instead of scores in the output ofbedtools map
...bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam) -c 5 -o mean | head
chr1 0 1000 .
chr1 200 1200 .
chr1 400 1400 .
chr1 600 1600 .
chr1 800 1800 .
chr1 1000 2000 .
chr1 1200 2200 .
chr1 1400 2400 .
chr1 1600 2600 .
chr1 1800 2800 .
bedtools --version bedtools v2.16.2
Sorry, I thought
aln.bed
would be clear. Replaced withinput.bed
. The "." is the default NULL value for themap
tool. If no reads align to a window, then there is no average MAPQ. The snippet you are showing is at the telomere.Thanks, I think my problem was that a large part of the results were '.', but skipping the chrM seems to do the trick:
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed
Adding the final commands in the question. It works but it's slow.
Not sure how fast you were expecting given the size of BAM files, but the 2.19.0 version of bedtools provides significant speed ups for the
map
tool. Based on a 86Gb file, I would still expect it to take on the order of an hour of two. https://github.com/arq5x/bedtools2/releases/tag/v2.19.0Thanks for the info. Compiling with gcc 4.7.2, I get about 30,000 lines per minute with bedtools version 2.17.0, and about 100,000 lines per minute with bedtools2 version 2.19.0. This is the command-line (rounding up value with awk and compressing output):
/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools map -a /illumina/development/stripes/misc/hg19.windows.bed -b <(/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools bamtobed -i bad.bam | grep -v chrM) -c 5 -o mean | awk -F. '{print $1".0"}' | gzip -c > bad.mean.mapq.bed.gz
That is about the speedup I would have expected. The forthcoming 2.20.0 version will provide another 2X speedup. Probably a week or two away.
Excellent! Looking forward to it!