samtools sort operation very slow: is the external hard drive the bottleneck?
22 months ago

Pretty new to this, so bear with me.

I'm running a couple dozen files (~250GB of paired fastq files) through an RNA-seq alignment pipeline in a bash script while-do loop: bowtie2 for mapping, then samtools for sam-to-bam, sort-bam and index-bam. There is not enough space on the internal drive of my 2020 MacBook Pro, so I'm running this on a new, fast 5TB external drive (USB 3.2 Gen 2).

Alignment proceeds swimmingly, as does the sam-to-bam conversion, but things slow down significantly on the sorting. I get the typical output "merging XXX files, using 6 in-memory blocks", but it takes forever. After ~15 hours the sorted file is only 2.8GB and growing slowly, while the original BAM file is 19GB. Typical sorted BAMs in this data set are ~6-8GB.

I had set this pipeline up to run over a 2-week vacation, expecting it to be done by now, but it's only about 60% complete. When I run other alignments on the internal drive, the alignment is by far the longest part of the process, usually 6-8 hours per paired alignment.

Is this an I/O bottleneck? Am I being impatient? I'm stuck with the external drive for the time being. Is there a way to improve the speed of the sort operation? I'll move to the cloud when funds/time are available.

Thanks in advance, and let me know if I need to add more info!

samtools BAM • 3.6k views
ADD COMMENT

If you aren't committed to using bowtie2 for alignment, use STAR. It's likely faster for alignment, and can directly write a sorted bam file.

ADD REPLY

It still does sorting, and once memory is full it has to write the files out somewhere. I cannot imagine that this would be any different. STAR is splice-aware, though, and generally a great aligner for RNA-seq, so using it is a good idea.

ADD REPLY

I cannot imagine that this would be any different

This is not clear. I strongly encourage blizard.wizard to use STAR; after all, it's a single command as opposed to the bowtie2/samtools -> sam -> bam -> sort pipeline. If blizard.wizard can afford to write everything locally first and move it afterwards, this might be much faster (only one way to find out). In any case, blizard.wizard should downsize the fastq files to test a pipeline before running it on everything.
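As a minimal sketch of the downsizing suggestion, the helper below keeps only the first N reads of a FASTQ file (4 lines per read). The function name `head_fastq` and the file names in the comments are made up for illustration. Taking the first N reads is not a random sample, but it should be enough to time each pipeline step before a full run.

```shell
# Keep the first N reads of a FASTQ file (each read occupies 4 lines).
# Accepts plain or gzip-compressed input and writes plain FASTQ to stdout.
head_fastq() {
  n_reads=$1
  file=$2
  n_lines=$(( n_reads * 4 ))
  case "$file" in
    *.gz) gzip -dc "$file" | head -n "$n_lines" ;;
    *)    head -n "$n_lines" "$file" ;;
  esac
}

# Hypothetical usage: use the same N for both mates so the pairs stay in sync.
# head_fastq 100000 sample_R1.fastq.gz > test_R1.fastq
# head_fastq 100000 sample_R2.fastq.gz > test_R2.fastq
```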

ADD REPLY
22 months ago
ATpoint 85k

Yes, this can well be an I/O problem. The sort produces lots of intermediate files, and merging them requires accessing them all more or less in parallel. That is something HDDs in general, and external ones in particular, are slow at. The only way to improve that with the given setup would be to use fewer sorting threads and more memory per thread to keep the number of temporary files low.

"merging XXX files, using 6 in-memory blocks", but it takes forever.

samtools sort has a -T option to define a temporary directory for these files. Does the laptop have enough free space on the internal disk to at least take these tmp files during the sort? They will automatically be removed once sorting is done. SSDs are not super expensive these days, even external ones, so that could help. Does your institution not provide any sort of computational solutions for this, like server or HPC access? Larger data require some computational power. Imagine you find out that you missed a parameter in the pipeline: do you want to wait 2 weeks again for the re-run?
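A rough sketch of the -T idea: keep the input and output BAM on the external drive, but point the temporary files at the internal disk. All paths and the function name `sort_with_local_tmp` are placeholders, not anything from the original pipeline. If samtools or the input file is missing, the function only prints the command it would run.

```shell
# Hypothetical paths: BAMs on the external drive, temp files on the internal disk.
IN_BAM="/Volumes/External/sample.bam"
OUT_BAM="/Volumes/External/sample.sorted.bam"
TMP_DIR="${TMPDIR:-/tmp}/samtools_sort_tmp"

sort_with_local_tmp() {
  # $1 = input BAM, $2 = sorted output BAM, $3 = directory for temporary files
  mkdir -p "$3" || return 1
  # -T takes a prefix; samtools writes its temporary files as PREFIX.nnnn.bam
  prefix="$3/$(basename "$1" .bam)"
  if command -v samtools >/dev/null 2>&1 && [ -f "$1" ]; then
    samtools sort -T "$prefix" -o "$2" "$1"
  else
    echo "dry run: samtools sort -T $prefix -o $2 $1"
  fi
}

sort_with_local_tmp "$IN_BAM" "$OUT_BAM" "$TMP_DIR"
```

Check beforehand that the internal disk has enough free space for the temporary files.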

but things slow down significantly on the sorting

That said, why do you need it sorted? Is this for standard DE analysis? Tools like featureCounts can make a count matrix directly from aligned, unsorted files. In fact, it would even re-sort the files by name if you gave it coordinate-sorted files. Alternatively, use something like salmon to get counts fast and efficiently. Check whether your analysis really needs sorted files; I have not yet come across a situation where I needed sorted RNA-seq BAMs.

bowtie2 for mapping

Are you aware that bowtie2 is not splice-aware, and do you work with eukaryotes?

Alternatively, use tools that are much lighter in terms of memory, I/O consumption and total processing time for RNA-seq like salmon, unless you really need a genome alignment.
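For reference, a salmon run on one paired sample might look roughly like the sketch below. The index directory, file names and thread count are placeholders, and the index is assumed to have been built beforehand from a transcriptome FASTA with `salmon index`. The function only prints the command if salmon or the index is not available.

```shell
# Hypothetical inputs: a prebuilt salmon index and one pair of FASTQ files.
SALMON_INDEX="salmon_index"
R1="sample_R1.fastq.gz"
R2="sample_R2.fastq.gz"
OUT_DIR="salmon_out/sample"

quantify_with_salmon() {
  # $1 = index directory, $2 = mate 1 FASTQ, $3 = mate 2 FASTQ, $4 = output directory
  if command -v salmon >/dev/null 2>&1 && [ -d "$1" ]; then
    # -l A asks salmon to infer the library type; -p sets the thread count
    salmon quant -i "$1" -l A -1 "$2" -2 "$3" -p 4 -o "$4"
  else
    echo "dry run: salmon quant -i $1 -l A -1 $2 -2 $3 -p 4 -o $4"
  fi
}

quantify_with_salmon "$SALMON_INDEX" "$R1" "$R2" "$OUT_DIR"
```

This skips the SAM/BAM/sort steps entirely, which is the point when only counts are needed.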

ADD COMMENT

Thank you for your comments. Very helpful.

samtools sort has a -T option to define a temporary directory for these files. Does the laptop have enough free space on the internal disk to at least take these tmp files during the sort? They will automatically be removed once sorting is done.

I do have the space for this! Definitely adding the -T option.

Does your institution not provide any sort of computational solutions for this, like server or HPC access? Larger data require some computational power. Imagine you find out that you missed a parameter in the pipeline, do you want to wait 2 weeks again for the re-run?

I'm just starting out with bioinformatics at the command-line level; I had previously worked with bioinformaticians on projects at a high/conceptual level. I'm trying to add it to my skill set, taking classes in my personal time, etc. If I can show there's added value, I get resources for cloud compute. Learned a lesson here for sure.

That said, why do you need it sorted?

Was told I needed it for use in a genome browser like seqmonk. Need to do more reading.

Are you aware that bowtie2 is not splice-aware, and do you work with eukaryotes?

Yes and yes. Was told not to worry about splicing for this run, for a few reasons... but, I will absolutely be switching to a splice-aware aligner in the future. STAR seems highly rated, but I've read RUM is also excellent.

Still very new to this and brute-forcing my way through learning. Seriously appreciate the comments.

ADD REPLY

Glad it helped. Definitely go for STAR, very established and robust. Never heard of RUM (not that this means something). True that genome browsers need sorted and indexed files.

ADD REPLY
22 months ago

What ATpoint said, but with the following additions

  • are you using parallel sorting: samtools sort -@ 8 to run on 8 threads, if your machine has that many?
  • you can install and use glances to check IO usage
  • an external ssd, even just 1-2 TB, will make a massive difference (I have seen 250 MB / second sustained IO instead of 50 MB/s with hard disks)
  • extra internal drives are likely to be even quicker, but check everything is connected via USB3
  • sambamba used to be quicker, but not any more
ADD COMMENT

are you using parallel sorting: samtools sort -@ 8 to run on 8 threads, if your machine has that many?

When there is an I/O bottleneck, extra threads usually don't help. To process 8x more data with 8 threads we need to read 8x more data and write 8x more temporary files, which gets us back to the I/O bottleneck.

Other than that it is obvious that a faster internal HD or SSD would solve the problem in the long term, but it doesn't help solve the current problem. The OP seems to understand where the slow-down comes from and was only seeking to verify the opinion.

ADD REPLY
22 months ago
size_t ▴ 120

try this: https://github.com/biod/sambamba

ADD COMMENT

Users should be aware that sambamba has many bugs, some documented and others less so.

ADD REPLY
22 months ago
jkbonfield ★ 1.3k

The samtools sort merge step can really thrash the system if the storage is not performant (e.g. a non-RAID hard disk).

PRs https://github.com/samtools/samtools/pull/1701 and https://github.com/samtools/samtools/pull/1706 to samtools alleviate this, although they have not been released yet, having just missed 1.16. They reduce the number of temporary files by pre-merging each batch of in-memory BAMs before writing out. They also permit a second stage of merge-and-spill-to-disk if the number of temporary files gets too large, to avoid thrashing disks. It may still not be as performant as sambamba for sort, but it is considerably better than before.

The main way of reducing the sort time, though, is maximising memory usage. Note that the -m option is per thread, but setting it as high as you can will reduce the number of temporary files and speed up the merge step.
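Since -m is per thread, one way to pick a value is to split a fixed RAM budget across the sort threads. A small sketch, where the 8 GB budget, the thread count and the file names are assumed figures for illustration only:

```shell
# Split a total RAM budget (in MB) evenly across the sort threads.
per_thread_mem_mb() {
  total_mb=$1
  threads=$2
  echo $(( total_mb / threads ))
}

THREADS=2                                      # fewer threads, more memory each
MEM_MB=$(per_thread_mem_mb 8192 "$THREADS")    # 8192 MB budget is an assumption

# Print the resulting command; input.bam and sorted.bam are placeholder names.
echo "samtools sort -@ $THREADS -m ${MEM_MB}M -o sorted.bam input.bam"
```

Leave some headroom for the operating system and anything else running, and lower the budget if the machine starts swapping.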

ADD COMMENT
