The bedtools 2.25 command I use for coverageBed is:
coverageBed -d -g genomefile.txt -a target.bed -b input.bam > output.txt
That works for the vast majority of BAM files; however, when a BAM exceeds 60 million mapped reads (~60 GB), the run errors out. I have been reading about recent memory-leak fixes and plan to upgrade to bedtools 2.27. Is there a better command to use, or is it enough to upgrade and re-run? Thank you :).
My approach in this case, assuming I didn't have better hardware available, would be to extract the target regions from the BAM using the BED file and samtools view -L. I would then supply that extracted BAM to bedtools. That gives you a much smaller BAM and may help you avoid the memory issues you're seeing. I would spot-check a few of the regions in IGV to be 100% sure I'm not losing reads at the interval boundaries.
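A minimal sketch of that extraction step (file names are placeholders matching the question):

# keep only reads overlapping the target intervals; -b writes BAM output
samtools view -b -L target.bed input.bam > subset.bam
samtools index subset.bam

# re-run the original command on the much smaller BAM
coverageBed -d -g genomefile.txt -a target.bed -b subset.bam > output.txt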
One "hack" (I hesitate to call it a solution) is to use samtools depth. This command will print out the number of reads spanning a nucleotide for a genome or a given region (samtools depth in.bam -r chr1:1-1000000). However, samtools depth does not give you the same output as coverageBed, but will give you coverage estimates
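If you want per-base depth over just your target intervals (closer to what coverageBed -d reports), samtools depth can also take the BED file directly via -b; a minimal sketch, with file names assumed:

# columns of output: chromosome, position, depth; -a also prints zero-depth positions
samtools depth -a -b target.bed input.bam > depth.txt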
These are options to BEDOPS bedmap, which is memory-efficient on sorted inputs; you run it on the target BED and on reads converted from the BAM (e.g. with bam2bed):

- The --echo option reports the elements in target.bed.
- The --count option reports the number of features in the map file (the reads converted from the BAM) that overlap the reference element (from target.bed) by one or more bases.
- The --bases option reports the number of bases of overlap between the target element and the reads.
- The --echo-ref-size option reports the length of each target element.
Piping the output to awk reproduces the same calculations coverageBed performs; see the sketch below.
This approach should scale up to whole-genome work.
To make things faster, if the BAM file is indexed (i.e. already coordinate-sorted), you can add the --do-not-sort and --reduced options to the BAM-to-BED conversion step, to bypass sorting with sort-bed and to print only a subset of columns as BED.
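Putting the pieces together, a sketch of the full pipeline (file names assumed; the awk step appends the fraction of each target covered, i.e. overlapping bases divided by target length, which mirrors coverageBed's summary output rather than its per-base -d output):

# convert the BAM to sorted BED; add --do-not-sort here if the BAM is already coordinate-sorted
bam2bed < input.bam > reads.bed

# map reads onto targets; --delim '\t' keeps the output tab-separated;
# the last field is the target length, the second-to-last the bases of overlap
bedmap --echo --count --bases --echo-ref-size --delim '\t' target.bed reads.bed \
    | awk -v OFS='\t' '{ frac = ($NF > 0) ? $(NF-1)/$NF : 0; print $0, frac }' > output.txt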