Pretty new to this, so bear with me.
I'm running a couple dozen samples (~250GB of paired FASTQ files) through an RNA-seq alignment pipeline in a bash script while-do loop. Essentially bowtie2 for mapping, then samtools for sam-to-bam, sort-bam, index-bam. There's not enough space on the internal drive of my 2020 MacBook Pro, so I'm running this on a new, fast 5TB external drive (USB 3.2 Gen 2).
Alignment proceeds swimmingly, as does the sam-to-bam step, but things slow down significantly on the sorting. I get the typical output "merging XXX files, using 6 in-memory blocks", but it takes forever... after ~15hrs the sorted file is only 2.8GB and growing slowly; the original BAM file is 19GB. Typical sorted BAMs in this data set are ~6-8GB.
I had set this pipe up to run over a 2-week vacation, expecting it to be done... it's about 60% right now. When I run other alignments on the native hard drive, the aligning is by far the longest part of the process. Usually 6-8hrs per paired-alignment.
Is this an I/O bottleneck? Am I being impatient? I'm stuck with the external drive for the time being... is there a way to improve the speed of the sort operation? I'll move to the cloud when funds/time are available.
Thanks in advance, and let me know if I need to add more info!
If you aren't committed to using bowtie2 for alignment, use STAR. It's likely faster for alignment, and can directly write a sorted bam file.
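For reference, a minimal sketch of what such a STAR call could look like. The thread count, RAM limit, and the `star_sorted` wrapper name are placeholders I made up for illustration, and it assumes a STAR genome index has already been built (`--runMode genomeGenerate`). Keep in mind that STAR loads its index into RAM, which can be tight on a laptop for large genomes.

```bash
# Sketch only: thread count and RAM limit are placeholders, adjust to your machine.
# Set DRY_RUN=1 to print the command instead of running it.
star_sorted() {
    # $1 = STAR index dir, $2 = R1 FASTQ, $3 = R2 FASTQ, $4 = output prefix
    # Drop the --readFilesCommand line if the FASTQ files are not gzipped.
    ${DRY_RUN:+echo} STAR \
        --runThreadN 8 \
        --genomeDir "$1" \
        --readFilesIn "$2" "$3" \
        --readFilesCommand gunzip -c \
        --outSAMtype BAM SortedByCoordinate \
        --limitBAMsortRAM 8000000000 \
        --outFileNamePrefix "$4"
}
```

Something like `star_sorted star_index/ s1_R1.fastq.gz s1_R2.fastq.gz out/s1_` would then write a coordinate-sorted BAM under the `out/s1_` prefix. As far as I know STAR also puts its temporary sort files under that prefix by default, so where the prefix points still matters for I/O.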
STAR still has to sort, and once memory is full it has to put the temporary files somewhere, so I cannot imagine the disk I/O would be any different. STAR is splice-aware though, and generally a great aligner for RNA-seq, so using it is a good idea.
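If the temporary files are the problem, one thing that might be worth trying with samtools itself is to give the sort more memory per thread (fewer temp files to merge) and point the temp files at the internal drive. A rough sketch, where the `sort_bam` wrapper, the thread count, and the memory value are my own placeholders:

```bash
# Sketch only: -@ and -m values are guesses, tune them to your RAM.
# Note that -m is per thread, so 4 threads x 1G is roughly 4GB in total.
# Set DRY_RUN=1 to print the command instead of running it.
sort_bam() {
    # $1 = input BAM, $2 = sorted output BAM, $3 = temp dir (e.g. on the internal SSD)
    mkdir -p "$3"
    ${DRY_RUN:+echo} samtools sort -@ 4 -m 1G \
        -T "$3/$(basename "$2" .bam)" \
        -o "$2" "$1"
}
```

For example, `sort_bam /Volumes/ext/s1.bam /Volumes/ext/s1.sorted.bam "$HOME/sort_tmp"` (hypothetical paths) keeps input and output on the external drive but does the temp-file churn locally. The temp files can add up to roughly the size of the unsorted BAM, so check that the internal drive has that much free.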
This is not clear. I strongly encourage blizard.wizard to use STAR; after all, it's a single command as opposed to the bowtie2/samtools -> sam -> bam -> sort pipeline. If blizard.wizard can afford to write everything locally first and move the results to the external drive afterwards, this might be much faster (only one way to find out). In any case, blizard.wizard should downsize the fastq files to test a pipeline before running it on everything.
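A quick way to downsize for a test run is to take the first N reads of each file (4 lines per read in FASTQ). This is a sketch with a made-up helper name, `subsample_fastq`; it takes the top of the file rather than a random sample, which is usually good enough for checking that a pipeline runs and for timing it.

```bash
# Sketch only: keeps the first N reads of a FASTQ file (plain or gzipped).
subsample_fastq() {
    # $1 = input FASTQ, $2 = number of reads to keep, $3 = output FASTQ (uncompressed)
    n_lines=$(( $2 * 4 ))   # each FASTQ record is 4 lines
    case "$1" in
        *.gz) gzip -dc "$1" | head -n "$n_lines" > "$3" ;;
        *)    head -n "$n_lines" "$1" > "$3" ;;
    esac
}
```

Run it with the same read count on both files of a pair (e.g. `subsample_fastq s1_R1.fastq.gz 100000 test_R1.fastq` and the same for R2) so the mates stay in sync.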