Question

Faster alternative to samtools calmd

0

2.8 years ago

elcortegano ▴ 200

I am wondering if anybody knows of a faster alternative to samtools calmd. I know that sambamba is faster than samtools for many operations such as view and sort, but it does not seem to have a dedicated tool for setting the MD tag in BAM files.

Right now it looks like samtools calmd is going to take on the order of weeks for each BAM file, even with multi-threading enabled. That seems far too long. Any ideas on how to process this faster? Thanks

alignment BAM MD • 2.0k views

updated 2.8 years ago by colindaven 7.0k • written 2.8 years ago by elcortegano ▴ 200

1

If you say "weeks", then I would assume that there is some kind of I/O or other bottleneck at your end. Can you confirm that you checked CPU usage? Is it at 100%, or is the process in some sleeping state?

2.8 years ago by ATpoint 85k

0

This is running on a cluster in our group, and I am now using all cores (64). I cannot see any other user or program running that would make me suspect this kind of problem.

It might be worth mentioning that this is long-read data (PacBio HiFi). Input BAM files are ~40 GB in size. The command run is samtools calmd -@64 input.bam reference.fa > output.bam. After 24 h of running, the output BAM file is only ~1 GB in size, which makes me think this is going to take very long.

2.8 years ago by elcortegano ▴ 200
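
Two details of the command above may be worth checking. As far as I know, calmd writes SAM unless -b is given, so output.bam here would be uncompressed SAM text, and -@ only adds compression/decompression threads while the MD computation itself runs in a single thread. My understanding is also that calmd is much slower on input that is not coordinate-sorted, because the reference sequence has to be reloaded whenever the reference name changes. A minimal sketch for checking the sort order recorded in the header; the helper name sort_order and the file names in the usage comments are illustrative assumptions:

```shell
# Print the SO: (sort order) value from a SAM header read on stdin.
# Prints "unknown" when the @HD line or its SO field is missing.
sort_order() {
    awk -F '\t' '
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) so = substr($i, 4) }
        END    { print (so == "" ? "unknown" : so) }
    '
}

# Usage (not run here; needs samtools and the real files):
#   samtools view -H input.bam | sort_order
#   # only if the result is not "coordinate":
#   samtools sort -@ 8 -o input.sorted.bam input.bam
#   # -b requests BAM output instead of the default SAM:
#   samtools calmd -b -@ 8 input.sorted.bam reference.fa > output.bam
```

If the header reports "unsorted" or "queryname", sorting first is probably the cheapest thing to try before looking for a different tool.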

0

If you use top -d1 -u $USER, what is the %CPU usage of the samtools process? Is it at 100% or higher, or just crawling along at a few percent? The process state (the S column) can also be informative: does it happen to be D?

2.8 years ago by ATpoint 85k
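
For readers unfamiliar with the state column, here is a small sketch that maps the first letter of the STAT/S code shown by ps or top to its usual meaning. The function name describe_state is a hypothetical helper, not part of any tool, and the ps invocation in the usage comment assumes a procps-style ps:

```shell
# Translate a process state code (as shown by ps or top) into plain words.
# Only the first letter is examined; modifiers such as "+" or "l" are ignored.
describe_state() {
    case "$1" in
        R*) echo "running or runnable (CPU-bound if %CPU is near 100)" ;;
        D*) echo "uninterruptible sleep (usually waiting on I/O)" ;;
        S*) echo "interruptible sleep (waiting for an event)" ;;
        T*) echo "stopped" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

# Usage (not run here): list your processes with state and CPU share, e.g.
#   ps -o pid= -o stat= -o pcpu= -o comm= -u "$USER"
# and pass the STAT value of the samtools process to describe_state.
```

Note that top reports %CPU per core, so a process that really kept 64 threads busy would show far more than 100%.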

0

CPU usage goes from 99% to 101%. The process status is R.

I think I'll try changing the partition where the output is written, or submitting this to another cluster. I cannot find other processes on this cluster, but I know this partition is shared with other users, and the suggestion that this might relate to I/O does make a lot of sense.

2.8 years ago by elcortegano ▴ 200

0

Hmm, but R at 100% means it is running at full capacity, so this argues against an I/O bottleneck. I am not familiar with calmd in particular, though, so maybe there are bottlenecks elsewhere.

2.8 years ago by ATpoint 85k

1

After 24 h again, ~1 GB of output, so no change at all.

2.8 years ago by elcortegano ▴ 200

0

But also no speedup with 64 threads. So using e.g. 12 threads and running all your BAMs in parallel might be the way to go? I haven't had this much HiFi data, and also haven't run this on our similarly scaled nanopore datasets, so I can't comment much further. Maybe check the samtools GitHub and report that HiFi is very slow? Maybe some chunking or other technical parameters might help?

2.8 years ago by colindaven 7.0k
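
The "fewer threads per file, several files at once" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the .md.bam output suffix, and the defaults for THREADS, JOBS and REF are made up for the example, and the batching is deliberately simple (each batch waits for its slowest file). Setting DRY_RUN prints the commands instead of running them.

```shell
THREADS=${THREADS:-12}      # samtools -@ value per file (assumed default)
JOBS=${JOBS:-5}             # files processed at once, e.g. 64 cores / 12 threads
REF=${REF:-reference.fa}    # hypothetical reference path

# Run (or, with DRY_RUN set, just print) calmd for a single BAM file.
calmd_one() {
    bam=$1
    out="${bam%.bam}.md.bam"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "samtools calmd -b -@ $THREADS $bam $REF > $out"
    else
        samtools calmd -b -@ "$THREADS" "$bam" "$REF" > "$out"
    fi
}

# Process all given BAM files, at most $JOBS at a time, in simple batches.
run_all() {
    running=0
    for bam in "$@"; do
        calmd_one "$bam" &
        running=$((running + 1))
        if [ "$running" -ge "$JOBS" ]; then
            wait            # block until the whole batch has finished
            running=0
        fi
    done
    wait                    # wait for the final, possibly partial, batch
}

# Usage (not run here):
#   DRY_RUN=1 run_all sample1.bam sample2.bam   # preview the commands
#   run_all *.bam                               # actually run them
```

On a cluster with a scheduler, submitting one job per BAM file would achieve the same thing without the batching logic.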

0

Try running with fewer cores (it might run slower, or no faster, with more than 8 or 12 cores). I run samtools calmd all the time as part of our pipelines, albeit never with PacBio HiFi data. It's pretty quick, but I've never benchmarked it in detail. I/O can be assessed pretty easily with programs like glances, bashtop, gotop, etc.

2.8 years ago by colindaven 7.0k