Question

getting a set of coverage/depth metrics for multiple regions

0

Entering edit mode

3.7 years ago

vaushev ▴ 20

I have a set of bam files and a list of my regions of interest in a bed file. For each region I want to get, for each bam file, such metrics as number of reads, number of bases, mean depth, mean quality - essentially what is provided by samtools coverage command - but unfortunately samtools coverage can accept only one defined region (as a command line parameter, it can't take bed file).
I tried other tools but they don't give what I need, for example samtools bedcov and bedtools genomecov give only the number of bases or number of reads, respectively.
For now I'll probably write a script that parses bed file and runs samtools coverage separately for each region - but maybe there's some better solution?

bedtools samtools • 1.5k views

ADD COMMENT • link updated 3.7 years ago by Alex Reynolds 36k • written 3.7 years ago by vaushev ▴ 20

score 2 · Answer 1 · 2021-06-22

Starting with a generic BAM file, it can be converted to BED via bam2bed.

The 13th column (QUAL) can be converted from its printable characters back to an average Phred quality score, and then swapped with the fifth (score) column, per UCSC specification.

This numerical value can subsequently be used for mapping, applying score statistical functions with bedmap (shown further below):

$ bam2bed < reads.bam | awk -v FS="\t" -v OFS="\t" 'BEGIN{ for (n=0; n<256; n++) ord[sprintf("%c",n)]=n }{ old=$5; l=split($13,q,""); s=0; for(n=1; n<=l; n++) { s+=ord[q[n]]-33; } $5=s/l; $13=old; print $0; }' > reads.bed

You might need to sort your regions-of-interest (ROI: roi.bed). If so:

$ sort-bed < unsorted.roi.bed > roi.bed

Then map your sorted BED file containing regions-of-interest to the reads from the BAM file:

$ bedmap --skip-unmapped --delim '\t' --echo --count --bases-uniq --echo-ref-size --bases-uniq-f --mean roi.bed reads.bed > answer.bed

The file answer.bed will contain summary statistics for reads mapped to ROI and look something like this, unique to the makeup of your ROI and reads:

$ head answer.bed
chr1    199064  200374  404 1310    1310    1.000000    25
chr1    6853080 6870040 204 9044    16960   0.533255    20
chr1    8004008 8027135 507 22308   23127   0.964587    20
chr1    8061658 8098833 366 37175   37175   1.000000    25
chr1    8174215 8211962 369 34253   37747   0.907436    35
chr1    8873384 8890703 434 17319   17319   1.000000    50
chr1    8874708 8890922 420 16214   16214   1.000000    45
chr1    15833039    15850691    271 17652   17652   1.000000    50
chr1    16643023    16645543    266 2520    2520    1.000000    45
chr1    16665212    16667822    281 2610    2610    1.000000    30

The first three columns are the regions of interest. These come from using --echo.

The last five columns are:

The number of reads in the BAM file that overlap an interval in the BED file by one or more bases
The number of bases in the interval in the BED file that have coverage by reads from the BAM file
The length of the interval in the BED file
The fraction of bases in the interval in the BED file that have coverage by reads from the BAM file
The (arithmetic) mean of Phred quality scores of reads over the region

The order of the specified bedmap operands --count, --bases-uniq, --echo-ref-size, --bases-uniq-f, --mean write the aforementioned five column values in that same order.

Other operations are available and can be added to the bedmap command, if useful. Additionally, if one base of read coverage is too lenient, additional operands are available to further constrain that criterion.

Run bedmap --help and take a look at the "Overlap Options" and other sections, or take a look at the online documentation for a more complete walkthrough.

Once validated, to automate this process, iterate over a bash array of BAM files, or use another scripted approach for running conversion and mapping commands (GNU make, NF, etc.).