Question

Bed File Of Mapq Sliding Window On A Bam File?

5

Entering edit mode

10.7 years ago

14134125465346445 ★ 3.6k

There may already be a recipe for this, so asking first before reinventing the wheel:

I would like to create a bed file where the score is the average mapQ from the reads of the input.bam file. I think bedtools or bedops are the way to go:

http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html

Other than simply running bamtobed/bam2bed, I would like to be able to define a sliding window size and step for the windows, of say, size=1000 and step=200.

I also would like to generate the bam2bed information only from a list of regions in regions.bed. E.g., something like:

mapq_sliding_windows --bam input.bam --wsize 1000 -wstep 200 --regions regions.bed > mapq_sliding_windows.bed

EDITED:

Thank you Aaron for you answer. I got it working but it's slow for my 30x WGS bams:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome    
bedtools makewindows -g hg19.genome -w 1000 -s 200 > hg19.windows.bed    
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed

Last step is taking 6 hours and still running on an 86G bam. Any possible ways to make it faster?

bed bam bedtools • 6.7k views

ADD COMMENT • link 10.7 years ago by 14134125465346445 ★ 3.6k

score 4 · Answer 1 · 2014-02-27

4

Entering edit mode

10.7 years ago

Aaronquinlan 12k

Assuming your BAM file is sorted by chromosome and start position, and the chromosome ordering in the BAM and BED file are the same, the following bedtools solution will work.

Step 1. Make a BED file of sliding windows.

bedtools makewindows -g genomes/human.hg19.genome -w 1000 -s 200 > hg19.windows.bed
head hg19.windows.bed
chr1    0    1000
chr1    200    1200
chr1    400    1400
chr1    600    1600
chr1    800    1800
chr1    1000    2000
chr1    1200    2200
chr1    1400    2400
chr1    1600    2600
chr1    1800    2800

Step 2. Compute the mean MAPQ (column 5) for each window.

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i aln.bed) -c 5 -o mean

ADD COMMENT • link 10.7 years ago by Aaronquinlan 12k

0

Entering edit mode

hmmm, where is the bam file here? Maybe aln.bed should be input.bam? I am getting lots of '.' instead of scores in the output of bedtools map...

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam) -c 5 -o mean | head
chr1 0 1000 .
chr1 200 1200 .
chr1 400 1400 .
chr1 600 1600 .
chr1 800 1800 .
chr1 1000 2000 .
chr1 1200 2200 .
chr1 1400 2400 .
chr1 1600 2600 .
chr1 1800 2800 .

bedtools --version bedtools v2.16.2

ADD REPLY • link 10.7 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

Sorry, I thought aln.bed would be clear. Replaced with input.bed. The "." is the default NULL value for the map tool. If no reads align to a window, then there is no average MAPQ. The snippet you are showing is at the telomere.

ADD REPLY • link 10.7 years ago by Aaronquinlan 12k

0

Entering edit mode

Thanks, I think my problem was that a large part of the results were '.', but skipping the chrM seems to do the trick:

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed

ADD REPLY • link 10.7 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

Adding the final commands in the question. It works but it's slow.

ADD REPLY • link 10.7 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

Not sure how fast you were expecting given the size of BAM files, but the 2.19.0 version of bedtools provides significant speed ups for the map tool. Based on a 86Gb file, I would still expect it to take on the order of an hour of two. https://github.com/arq5x/bedtools2/releases/tag/v2.19.0

ADD REPLY • link 10.7 years ago by Aaronquinlan 12k

1

Entering edit mode

Thanks for the info. Compiling with gcc 4.7.2, I get about 30,000 lines per minute with bedtools version 2.17.0, and about 100,000 lines per minute with bedtools2 version 2.19.0. This is the command-line (rounding up value with awk and compressing output):

/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools map -a /illumina/development/stripes/misc/hg19.windows.bed -b <(/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools bamtobed -i bad.bam | grep -v chrM) -c 5 -o mean | awk -F. '{print $1".0"}' | gzip -c > bad.mean.mapq.bed.gz

ADD REPLY • link 10.7 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

That is about the speedup I would have expected. The forthcoming 2.20.0 version will provide another 2X speedup. Probably a week or two away.

ADD REPLY • link 10.7 years ago by Aaronquinlan 12k

0

Entering edit mode

Excellent! Looking forward to it!

ADD REPLY • link 10.7 years ago by 14134125465346445 ★ 3.6k

score 3 · Answer 2 · 2014-02-27

3

Entering edit mode

10.7 years ago

sjneph ▴ 690

bam2bed < your-file.bam | cut -f1-4,7 > your-file.bed
make-regions 1000 200 regions.bed | bedmap --echo --mean - your-file.bed > answer.bed

Where make-regions is an awk statement, perhaps buried in a shell script

awk -v winsize=$1 -v stepsize=$2 \
'{ \
  i = $2; \
  while ( i < $3 ) { \
    j = i+winsize; \
    if ( j > $3 ) { j = $3; } \
    print $1"\t"i"\t"j; \
    i += stepsize; \
  } \
}' $3

bam2bed will convert your bam file with no loss of information. The regions.bed file needs to be sorted with the sort-bed utility.

ADD COMMENT • link 10.7 years ago by sjneph ▴ 690

0

Entering edit mode

updated to put MAPQ in field 5 of your-file.bed since that's what you want to average over.

ADD REPLY • link 10.7 years ago by sjneph ▴ 690

score 2 · Answer 3 · 2014-02-27

I wrote a tool (whammy) that calculates the average mapping quality of a position. With some window smoothing you have the answer you want.

https://github.com/jewmanchue/wham

A tab delimited text file. Columns:

Seqid.
Position.
Number of unmapped mates (target)
Number of unmapped mates (background)
Number of same strand mates (target)
Number of same strand mates (background)
Number of cross seqid mapped mates (target)
Number of cross seqid mapped mates (background)
Depth - Number of reads covering position (target)
Depth - Number of reads covering position (background)
Average insert length (target)
Average insert length (background)
Average mapping quality (target)
Average mapping quality (background)
Euclidean distances based on columns (3,4;5,6;7,8;13,14)

Since you probably don't have a target and background you can just put the same bam into option -t and option -b