Hi,
I'm looking to calculate some coverage statistics on the alignment in my targeted sequencing experiment. I have a BAM file and a BED file with the target regions. I wish to determine the percentage of bases covered above 30X, 40X, 50X etc. How can I best do this?
Also, we sequenced a LOT of samples in a single lane, so we had to repeat the experiment multiple times to get sufficient coverage. Will the above procedure help me determine if I have reached sufficient coverage? I wish to have at least 70X across the target region, but I can make do with a small percentage of low coverage regions.
I already have stats on average coverage per sample (by dividing number of bases sequenced on target by total length of target region)
Me to samtools: "I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve."
Thank you, Devon. I'll explore samtools!
Devon,
samtools depth | awk
seems like a really speedy option in my case. I have 30 BAM files and whatsamtools depth
did under 24 hours (all samples put together), GATK's DepthOfCoverage is unable to do for even 5 chromosomes of a single sample.I haven't tried Brian's tool yet, because I'm not sure if it works with targeted sequencing. I'd prefer if I did not have to do the filtration step after the depth analysis.
Thanks for getting back to us on that, it's good to know. I'm not surprised that GATK's tool is slow, their walker methods are generally absurdly slow. BisSNP is a modified version of GATK for methylation realignment/calling and it's around 10x slower on processing a file than my tool, which uses htslib, for example.
Let us know if you end up using Brian's tool too. At the end of the day, the speed here should just be limited by disk IO.
I'm using my org's HPC, and I/O is great + not in my control, but yeah, GATK seems quite slow and consumes time that I do not have.
I might try Brian's tool today and check its performance for my specific use case, but I must also focus on getting the work done. samtools depth has given me granular (per base depth) data, and I need to stratify this + check across multiple sequencing runs for consistently low covered regions.
My ultimate aim is to determine if another run of sequencing can reduce the fraction of low covered regions, and I guess I should consult with my sequencing team on how the biology and sample prep affects this as well.
Focus on getting the work done? Pfft, what's that about :P
Exploration is good, but not at the cost of timely results, no? "I'm working on it" can buy only so much time :P
Agreed :) If only there were infinite time!