2.8 years ago
elcortegano
I am wondering if anybody knows a faster alternative to samtools calmd. I know that sambamba is faster than samtools for many operations such as view and sort, but it seems that it does not have a dedicated tool for setting the MD tag in BAM files.
Right now it looks like samtools calmd is going to take on the order of weeks for each BAM file, even with multi-threading enabled. That seems like far too long. Any ideas on how to process this faster? Thanks
If you say "weeks" then I would assume that there is some kind of I/O or other bottleneck at your end. Can you confirm that you checked CPU usage, is it at 100% or is it in some sleepy state?
This is running on a cluster in our group, and I am now using all 64 cores. I cannot see any other user or program running that would make me suspect this kind of bottleneck.
It might be worth mentioning that this is long-read data (PacBio HiFi). Input BAM files are ~40 GB in size. The command run is
samtools calmd -@64 input.bam reference.fa > output.bam
After 24 h of running, the output BAM file is only ~1 GB in size, which makes me think this is going to take a very long time.
If you use top -d1 -u $USER, what is the %CPU usage of the samtools process? Is it at 100% or higher, or just crawling along at a few percent? The State can also be informative: does it happen to be D?
CPU usage goes from 99% to 101%. The process status is R.
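For readers repeating this check without an interactive top session, the same information can be pulled from ps. This is a minimal sketch, not part of the original exchange: the function names describe_state and show_samtools_procs are made up for illustration, and the ps options assume a Linux (procps) system.

```shell
#!/usr/bin/env bash
# Sketch: inspect the state and CPU usage of running samtools processes.
# Function names here are hypothetical helpers, not samtools features.

# Map a ps/top state code (e.g. "R", "D+", "Sl") to a short description.
describe_state() {
    case "$1" in
        R*) echo "running or runnable (CPU-bound)" ;;
        D*) echo "uninterruptible sleep (usually waiting on I/O)" ;;
        S*) echo "interruptible sleep (waiting for an event)" ;;
        *)  echo "other state: $1" ;;
    esac
}

# List the current user's samtools processes: PID, state, %CPU, command name.
show_samtools_procs() {
    ps -u "$USER" -o pid=,stat=,pcpu=,comm= | awk '$4 ~ /samtools/'
}
```

A process that sits in R at about 100% is keeping one core busy, whereas a process that is mostly in D is more likely waiting on storage.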
I think I'll try changing the partition where the output is written, or submitting this to another cluster. I cannot find other processes on this cluster, but I know this partition is shared with other users, and the suggestion that this might relate to I/O does make a lot of sense.
Hmm, but R at 100% means it is running at full capacity, so this argues against an I/O bottleneck. That said, I am not familiar with calmd in particular, so maybe there are bottlenecks elsewhere.
After another 24 h, again ~1 GB of output, so no change at all.
But also no speedup with 64 threads. So using e.g. 12 and running all your BAMs in parallel might be the way to go? I haven't had this much HiFi data, and also haven't run this on our similarly scaled nanopore datasets, so I can't comment much further. Maybe check the samtools GitHub and report that HiFi data is very slow? Maybe some chunking or other technical parameters might help?
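The suggestion of using fewer threads per file and running several BAMs at once could look roughly like the sketch below. It is an illustration only: the helper names (calmd_output_name, calmd_one_file, calmd_many) and the output naming scheme are assumptions, and it relies on bash and an xargs that supports -0 and -P. Note that calmd prints SAM unless -b is given, so the sketch adds -b to write BAM.

```shell
#!/usr/bin/env bash
# Sketch: process several BAMs concurrently, each with a modest thread count.
# Helper names and the *.calmd.bam naming scheme are illustrative assumptions.

# Derive the output name: sample.bam -> sample.calmd.bam
calmd_output_name() {
    printf '%s\n' "${1%.bam}.calmd.bam"
}

# Run calmd on one file. -b writes BAM (calmd prints SAM by default).
calmd_one_file() {
    local bam=$1 ref=$2 threads=$3
    samtools calmd -b -@ "$threads" "$bam" "$ref" > "$(calmd_output_name "$bam")"
}

# Run calmd on every listed BAM, at most $jobs files at a time.
# Usage: calmd_many reference.fa 12 5 a.bam b.bam ...
calmd_many() {
    local ref=$1 threads=$2 jobs=$3
    shift 3
    export -f calmd_output_name calmd_one_file   # make helpers visible to bash -c
    printf '%s\0' "$@" \
        | xargs -0 -P "$jobs" -I{} bash -c 'calmd_one_file "$@"' _ {} "$ref" "$threads"
}
```

With 64 cores, something like `calmd_many reference.fa 12 5 *.bam` would keep roughly all cores allocated. This improves throughput across files but does not make any single file finish sooner.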
Try running with fewer cores (it might run slower or the same with more than 8 or 12 cores). I run samtools calmd all the time as part of our pipelines, albeit never with PacBio HiFi data. It's pretty quick, but I've never benchmarked it in detail. I/O can be assessed pretty easily with programs like glances, bashtop, gotop, etc.
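One way to try the "chunking" idea raised in this thread is to run calmd separately on each reference contig and process the contigs in parallel. Nobody in the thread tested this, so treat it as a sketch under assumptions: the input must be coordinate-sorted and indexed, the function names are hypothetical, contig names are assumed to be safe as file names and as region strings, and unmapped reads (the `*` row in idxstats) are not carried over.

```shell
#!/usr/bin/env bash
# Sketch: run calmd per contig in parallel on a sorted, indexed BAM.
# Function names are illustrative; unmapped reads are NOT included.

# Read `samtools idxstats` output on stdin and print contigs that have
# mapped reads, skipping the "*" row that counts unmapped reads.
contigs_from_idxstats() {
    awk -F'\t' '$1 != "*" && $3 > 0 { print $1 }'
}

# Extract one contig and add MD/NM tags; -b makes calmd write BAM.
calmd_one_contig() {
    local bam=$1 ref=$2 contig=$3 outdir=$4
    samtools view -b "$bam" "$contig" \
        | samtools calmd -b - "$ref" > "$outdir/$contig.calmd.bam"
}

# Process all contigs with at most $jobs concurrent calmd processes.
# Usage: calmd_by_contig input.bam reference.fa outdir 12
calmd_by_contig() {
    local bam=$1 ref=$2 outdir=$3 jobs=${4:-12}
    mkdir -p "$outdir"
    export -f calmd_one_contig   # make the helper visible to bash -c
    samtools idxstats "$bam" | contigs_from_idxstats \
        | xargs -P "$jobs" -I{} bash -c 'calmd_one_contig "$@"' _ "$bam" "$ref" {} "$outdir"
}
```

The per-contig files share the original header, so they can afterwards be joined in reference order with samtools cat or combined with samtools merge. Whether this helps depends on where the time is actually going, which the thread did not establish.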