Starting with a generic BAM file, it can be converted to BED via bam2bed
.
The 13th column (QUAL) can be converted from its printable characters back to an average Phred quality score, and then swapped with the fifth (score) column, per UCSC specification.
This numerical value can subsequently be used for mapping, applying score statistical functions with bedmap
(shown further below):
$ bam2bed < reads.bam | awk -v FS="\t" -v OFS="\t" 'BEGIN{ for (n=0; n<256; n++) ord[sprintf("%c",n)]=n }{ old=$5; l=split($13,q,""); s=0; for(n=1; n<=l; n++) { s+=ord[q[n]]-33; } $5=s/l; $13=old; print $0; }' > reads.bed
You might need to sort your regions-of-interest (ROI: roi.bed
). If so:
$ sort-bed < unsorted.roi.bed > roi.bed
Then map your sorted BED file containing regions-of-interest to the reads from the BAM file:
$ bedmap --skip-unmapped --delim '\t' --echo --count --bases-uniq --echo-ref-size --bases-uniq-f --mean roi.bed reads.bed > answer.bed
The file answer.bed
will contain summary statistics for reads mapped to ROI and look something like this, unique to the makeup of your ROI and reads:
$ head answer.bed
chr1 199064 200374 404 1310 1310 1.000000 25
chr1 6853080 6870040 204 9044 16960 0.533255 20
chr1 8004008 8027135 507 22308 23127 0.964587 20
chr1 8061658 8098833 366 37175 37175 1.000000 25
chr1 8174215 8211962 369 34253 37747 0.907436 35
chr1 8873384 8890703 434 17319 17319 1.000000 50
chr1 8874708 8890922 420 16214 16214 1.000000 45
chr1 15833039 15850691 271 17652 17652 1.000000 50
chr1 16643023 16645543 266 2520 2520 1.000000 45
chr1 16665212 16667822 281 2610 2610 1.000000 30
The first three columns are the regions of interest. These come from using --echo
.
The last five columns are:
- The number of reads in the BAM file that overlap an interval in the BED file by one or more bases
- The number of bases in the interval in the BED file that have coverage by reads from the BAM file
- The length of the interval in the BED file
- The fraction of bases in the interval in the BED file that have coverage by reads from the BAM file
- The (arithmetic) mean of Phred quality scores of reads over the region
The order of the specified bedmap
operands --count
, --bases-uniq
, --echo-ref-size
, --bases-uniq-f
, --mean
write the aforementioned five column values in that same order.
Other operations are available and can be added to the bedmap
command, if useful. Additionally, if one base of read coverage is too lenient, additional operands are available to further constrain that criterion.
Run bedmap --help
and take a look at the "Overlap Options" and other sections, or take a look at the online documentation for a more complete walkthrough.
Once validated, to automate this process, iterate over a bash array of BAM files, or use another scripted approach for running conversion and mapping commands (GNU make, NF, etc.).