Question

samtools sort operation very slow, bottle-neck is external hard drive?

2

22 months ago

blizard.wizard ▴ 20

Pretty new to this, so bear with me.

I'm running a couple dozen files (~250GB of paired fastq files) on an RNASeq alignment pipeline through a bash script while-do loop. Essentially bowtie2 for mapping, then samtools for sam-to-bam, sort-bam, index-bam. Not enough space on the internal HD of a 2020 macbook pro, so I'm running this on a new fast 5TB external (usb 3.2 gen 2).

Alignment proceeds swimmingly, as well as the sam-to-bam, but things slow down significantly on the sorting. I get the typical output "merging XXX files, using 6 in-memory blocks", but it takes forever... looking at ~15hrs the sorted file is only 2.8GB and growing slowly, original BAM file is 19GB. Typical sorted BAMs in this data set are ~6-8GB.

I had set this pipe up to run over a 2-week vacation, expecting it to be done... it's about 60% right now. When I run other alignments on the native hard drive, the aligning is by far the longest part of the process. Usually 6-8hrs per paired-alignment.

Is this an I/O bottle-neck? Am I being impatient? I'm stuck with the external for the time being...is there a way to improve the speed of the sort operation? I'll move to the cloud when funds/time are available.

Thanks in advance, and let me know if I need to add more info!

samtools BAM • 3.6k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 22 months ago by blizard.wizard ▴ 20

1

If you aren't committed to using bowtie2 for alignment, use STAR. It's likely faster for alignment, and can directly write a sorted bam file.
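As a rough sketch of what that single command might look like: the wrapper name `star_align_sorted`, the thread count, and all paths are placeholders, and a STAR genome index is assumed to exist already.

```shell
# Align one pair of FASTQ files with STAR and write a coordinate-sorted BAM.
# Output lands at "<prefix>Aligned.sortedByCoord.out.bam".
# For gzipped input, STAR also needs: --readFilesCommand gunzip -c
star_align_sorted() {
    local genome_dir="$1" r1="$2" r2="$3" prefix="$4"
    STAR --runThreadN 8 \
         --genomeDir "$genome_dir" \
         --readFilesIn "$r1" "$r2" \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "$prefix"
}

# Example call (hypothetical paths):
# star_align_sorted star_index sample_R1.fastq sample_R2.fastq out/sample_
```

STAR still sorts through temporary files on disk, so where the output prefix points matters for speed here too.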

ADD REPLY • link 22 months ago by noodle ▴ 590

2

It still does sorting, and once memory is full it has to put the files somewhere. I cannot imagine that this would be any different. STAR is splice-aware though, and generally a great aligner for RNA-seq, so using it is generally a good idea.

ADD REPLY • link 22 months ago by ATpoint 85k

1

I cannot imagine that this would be any different

This is not clear. I strongly encourage blizard.wizard to use STAR; after all, it's a single command as opposed to the bowtie2/samtools -> sam -> bam -> sort pipeline. If blizard.wizard can afford to write everything locally first and move it afterwards, this might be much faster (only one way to find out). In any case, blizard.wizard should downsize the fastq files to test a pipeline before running it on everything.
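One simple way to downsize for a test run is to take the first N reads from each mate file. This is only a sketch: `head_fastq` is a made-up helper name, and the first reads of a file are not a random sample, though that is usually fine for timing a pipeline.

```shell
# Print the first N reads of a FASTQ file (4 lines per read).
# Works on plain or gzipped input; take the same N from both mate
# files so that the pairs stay in sync.
head_fastq() {
    local in_fq="$1" n_lines=$(( $2 * 4 ))
    case "$in_fq" in
        *.gz) gzip -dc "$in_fq" | head -n "$n_lines" ;;
        *)    head -n "$n_lines" "$in_fq" ;;
    esac
}

# Example (hypothetical file names), one million read pairs:
# head_fastq sample_R1.fastq.gz 1000000 > test_R1.fastq
# head_fastq sample_R2.fastq.gz 1000000 > test_R2.fastq
```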

ADD REPLY • link 22 months ago by noodle ▴ 590

score 6 · Answer 1 · 2023-01-05

Yes, this may well be an I/O problem. The sort produces lots of intermediate files, and merging them requires accessing them all more or less in parallel, which is something HDDs in general, and external ones in particular, are slow at. The only way to improve that with the given setup would be to use fewer sorting threads and give each remaining thread more memory, to keep the number of temporary files low.
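To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes each temporary file holds roughly one `-m` worth of records, so the count is about the data size divided by `-m`. The 60 GiB figure is a made-up in-memory size, not a measurement of the OP's data, and `estimate_tmp_files` is a hypothetical helper.

```shell
# Rough estimate of how many temporary files `samtools sort` spills to disk.
# ASSUMPTION: every spilled file holds about one -m worth of records.
estimate_tmp_files() {
    local data_mib="$1"            # approximate in-memory size of all records, MiB
    local mem_per_thread_mib="$2"  # value given to -m, in MiB
    # ceiling division
    echo $(( (data_mib + mem_per_thread_mib - 1) / mem_per_thread_mib ))
}

# Hypothetical 60 GiB of records:
estimate_tmp_files 61440 768     # default -m (768 MiB): prints 80
estimate_tmp_files 61440 4096    # -m 4G: prints 15
```

Fewer, larger temporary files mean fewer simultaneous read streams during the merge, which is what a spinning disk struggles with.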

"merging XXX files, using 6 in-memory blocks", but it takes forever

samtools sort has a -T option to set the location (a path prefix) for these temporary files. Does the laptop have enough free space on the internal disk to at least hold these tmp files during the sort? They are removed automatically once sorting is done. SSDs are not super expensive these days, even external ones, so that could help. Does your institution not provide any sort of computational solution for this, like server or HPC access? Larger data require some computational power. Imagine you find out that you missed a parameter in the pipeline: do you want to wait 2 weeks again for the re-run?
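A minimal sketch of pointing the temporary files at the internal disk while the input and output stay on the external drive. The function name, thread count, and `-m 4G` are illustrative assumptions; pick values that fit the machine's RAM and free disk space.

```shell
# Sort a BAM while writing the temporary chunks under a separate directory.
# -T takes a path prefix; temporary files are created as PREFIX.nnnn.bam.
sort_with_local_tmp() {
    local in_bam="$1" out_bam="$2" tmp_dir="$3"
    mkdir -p "$tmp_dir" || return 1
    samtools sort -@ 2 -m 4G \
        -T "$tmp_dir/$(basename "$in_bam" .bam)" \
        -o "$out_bam" "$in_bam"
}

# Example (hypothetical paths):
# sort_with_local_tmp /Volumes/ext/s1.bam /Volumes/ext/s1.sorted.bam "$HOME/sort_tmp"
```

The temporary files can add up to roughly the size of the unsorted BAM or more, so check the free space on the internal disk first.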

but things slow down significantly on the sorting

That said, why do you need it sorted? Is this for standard DE analysis? Tools like featureCounts can make a count matrix directly from aligned, unsorted files. In fact, it would even re-sort the files by name if you gave it coordinate-sorted files. Alternatively, use something like salmon to get counts quickly and efficiently. Check whether your analysis really needs sorted files; I have not yet come across a situation where I needed sorted RNA-seq BAMs.
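For illustration, counting straight from the aligner's unsorted BAM might look like the sketch below. The wrapper name, thread count, and file names are placeholders, and the right strandedness and feature settings depend on the library and annotation.

```shell
# Count paired-end reads per gene directly from an unsorted BAM.
# -p marks the input as paired-end; Subread 2.0.2 and later also need
# --countReadPairs to count fragments rather than individual reads.
count_unsorted_bam() {
    local gtf="$1" out_txt="$2" bam="$3"
    featureCounts -p -T 4 -a "$gtf" -o "$out_txt" "$bam"
}

# Example (hypothetical paths):
# count_unsorted_bam genes.gtf counts.txt sample.unsorted.bam
```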

bowtie2 for mapping

Are you aware that bowtie2 is not splice-aware, and do you work with eukaryotes?

Alternatively, use tools that are much lighter in terms of memory, I/O consumption and total processing time for RNA-seq like salmon, unless you really need a genome alignment.

score 3 · Answer 2 · 2023-01-06

3

22 months ago

colindaven 7.0k

What ATpoint said, but with the following additions:

are you using parallel sorting: samtools sort -@ 8 to run on 8 threads, if your machine has this many?
you can install and use glances to check IO usage
an external ssd, even just 1-2 TB, will make a massive difference (I have seen 250 MB / second sustained IO instead of 50 MB/s with hard disks)
extra internal drives are likely to be even quicker, but check everything is connected via USB3
sambamba used to be quicker, but not any more

ADD COMMENT • link 22 months ago by colindaven 7.0k

2

are you using parallel sorting: samtools sort -@ 8 to run on 8 threads, if your machine has this many?

When there is an I/O bottleneck, extra threads usually don't help. To process 8x more data with 8 threads we need to read 8x more data and write 8x more temporary files, which gets us back to the I/O bottleneck.

Other than that it is obvious that a faster internal HD or SSD would solve the problem in the long term, but it doesn't help solve the current problem. The OP seems to understand where the slow-down comes from and was only seeking to verify the opinion.

ADD REPLY • link 22 months ago by Mensur Dlakic ★ 28k

score 1 · Answer 3 · 2023-01-06

1

22 months ago

size_t ▴ 120

try this: https://github.com/biod/sambamba

ADD COMMENT • link 22 months ago by size_t ▴ 120

0

Users should be aware that sambamba has many bugs, some documented and others less so.

ADD REPLY • link 22 months ago by noodle ▴ 590

score 1 · Answer 4 · 2023-01-16

The samtools sort merge step can really thrash the system if the storage is not performant (e.g. a non-RAID hard disk).

PRs https://github.com/samtools/samtools/pull/1701 and https://github.com/samtools/samtools/pull/1706 to samtools alleviate this, although the change has not yet been released as it just missed 1.16. It reduces the number of temporary files by pre-merging each batch of in-memory BAMs before writing out. It also permits a second stage of merge-and-spill-to-disk if the number of temporary files gets too large, to avoid thrashing disks. It may still not be as performant as sambamba for sort, but it is considerably better than before.

The main way of reducing the sort time, though, is maximising memory usage. Note the -m option is per thread, but setting it as high as you can will reduce the number of temporary files and speed up the merge step.
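Because -m is per thread, one way to pick it is to decide on a total RAM budget and divide by the thread count. A small sketch, where the 12 GiB budget is an assumed figure for a 16 GiB laptop and `per_thread_mem_mib` is a hypothetical helper:

```shell
# Split a total RAM budget (MiB) evenly across sort threads to get -m.
per_thread_mem_mib() {
    local budget_mib="$1" threads="$2"
    echo $(( budget_mib / threads ))
}

# Same 12 GiB budget, different thread counts:
echo "-@ 2 -m $(per_thread_mem_mib 12288 2)M"   # prints: -@ 2 -m 6144M
echo "-@ 8 -m $(per_thread_mem_mib 12288 8)M"   # prints: -@ 8 -m 1536M
```

Leave some headroom below physical RAM: -m is approximate, and swapping to disk would be worse than a few extra temporary files.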