Hello,
I'm just wondering if there is a quick and simple way of simply getting the 'length' of a BAM alignment? By length, I just mean the number of nucleotides of the reference genome that are covered by an aligned read. Maybe there is something in BedTools that might be able to do it from the BED perspective?
Thought it might be worth asking before writing some perl that could do it...
Cheers!
Sorry, my question is a bit ambiguous. I mean the number of nucleotides of the reference covered by any read in the bam alignment. Basically, I have a bam alignment created by filtering all reads below a certain mapping quality threshold (
bamtools filter -in in.bam -out out.bam -mapQuality "<=someNumber"
). I can convert it to a bed file with bamtobed etc, and I would like to know how much of my reference is covered by such reads. Hope that's a bit clearer... I'm using perl to try and calculate it from the bed file but its a bit tricky because of all overlapping regions and multiple regions per chromosome. Thanks!Ah, yes, your question was rather poorly worded then. You can just
samtools depth foo.bam | awk '{if($3>0) total+=1}END{print total}'
.That'll do it - cheers :)
I guess I get no points for understanding poorly-worded questions ;)
After seeing Pierre's code dump for samtool's depth, I guess our depth/awk solution wont work because depth doesn't print nucleotides where depth is 0. This is a really good example of why that if statement i mentioned over there should be togglable ^_^
Points for everyone! Except me for hastily writing poorly worded questions. But, since I only want those sites in the reference genome that are covered at least 1 times, the fact that samtools depth doesn't count zero-covered sites shouldn't matter. No?
Ah, well, I imagined you want [bases with at least 1 read] / [total bases that could have had 1 read]
Note that [total bases that could have had 1 read] is not the same as [genome size], since a genome has gaps and "filler". There is no base 9999 in chromosome 1 for example, since the first 10000 are all "N". The sum of the N's probably won't affect your outcome all that much, but it should get you thinking about your mapping strategy and how that affects coverage. What did/do you do with reads that map to multiple locations? It's important, because if you can't map reads to repeats, should your coverage statistics be counting repeat-region DNA?
I feel like i'm not being very helpful - in short, calculating absolute coverage is messy business. Picking one coverage calculating method, applying it to all your samples, and seeing which maps the most is much easier. For this, Devon's command will work perfectly :)
No no, it's all extremely helpful I think! I'm doing variant calling in bacterial genomes, and have excluded any putative SNP that falls into a region of low-qual mapping (ie a repeat region). I found that applying this filter got rid of a lot of false positive / rubbish I was seeing in the output VCF. My question stemmed from wanting to know what proportion of my reference assembly I was excluding at certain MapQual thresholds - here, I have approximated genome size with assembly length, since you are right, true genome size is unknowable especially for unclosed genomes. But a ballpark figure is fine for me :)
Also, I found that both Devon's and Bert's approaches:
or
gave the same answer, so take your pick! Thanks again everyone :)