An unusual problem regarding bwa alignment has been bugging me for about a week which I have not been able to troubleshoot. If anyone can help me resolve this, I'll ask Santa to get you 3 kegs of beer!
I work with two plant species -- A & B -- and both have reference genomes. Ref seq sizes are ~11 and 14 Gbp for A and B, respectively. I indexed both references using ~/bwa-0.7.16a/bin/bwa index A_ref_seq.fa
and the output looks normal.
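For what it's worth, genomes this large need the bwtsw indexing algorithm; bwa's `[auto]` default selects it based on reference size, but it can be forced with `-a`. A minimal sketch — the helper function and size threshold below are my own illustration, not part of bwa:

```shell
# pick_index_algo: choose a bwa index algorithm from genome size in bp.
# Purely illustrative; bwa's [auto] default does the same selection
# internally (the 'is' algorithm only handles genomes under ~2 Gbp).
pick_index_algo() {
  if [ "$1" -gt 2000000000 ]; then
    echo bwtsw
  else
    echo is
  fi
}

# e.g. for the ~11 Gbp species A reference:
# ~/bwa-0.7.16a/bin/bwa index -a "$(pick_index_algo 11000000000)" A_ref_seq.fa
```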
Also from both species, I have a few hundred individuals that were sequenced on Illumina, as we want to do multi-sample SNP calling. The number of sequences in the fastq files varies, of course, but all have an average Q score > 30, and read lengths are 90-95 bp. I have not seen anything unusual in the fq files for individuals from either species.
A couple of weeks ago, I aligned all individuals from A with A_ref_seq:
~/bwa-0.7.16a/bin/bwa mem -t 16 A_ref_seq.fa A_Ind1.fq > A_Ind1.sam
This went smoothly: bwa took 4-5 minutes, on average, to write the SAM file for each individual fq file. After this, I converted SAM to BAM, sorted the BAM, then ran mpileup for SNP calling (version 1.6 for both samtools and bcftools). Everything ran fine, no problems at all.
However, when I ran bwa mem on individuals of species B last week, bwa took almost 2 hours for a fq file with roughly 1.3 million reads!! I am using the same software version, the same number of threads (16), and the same memory (24 GB), yet the time it takes is astronomically high. The SAM outputs from B individuals are similar in size to those from A individuals, so it's not as if bwa is writing junk during the hours it takes to churn out one SAM file.
What could be the reason? And what can I do to make this go faster?
For troubleshooting, I tried building a new index and creating new fastq files, but the problem persists. I even moved to a different storage area ("scratch" vs. home directory), but that did not help. If I run the pipeline for species A right now, it runs fast and works as expected, which indicates that hardware/software changes on our server are not the cause.
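In case it helps anyone sanity-checking their fq files the same way: since FASTQ is four lines per record, a quick read count is one awk call away (the function name is just my shorthand, and it assumes uncompressed files with no wrapped sequence lines):

```shell
# count_reads: number of records in an uncompressed FASTQ file,
# assuming 4 lines per record and no wrapped sequence lines
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# e.g.: count_reads B_Ind1.fq
```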
I would really appreciate any suggestions to improve this. Thank you.
Any chance you can allocate more memory? I'm guessing your 14Gbp reference will require just over 24gigs.
I had 5 GB allocated for the human genome and alignment was really slow. After I followed Heng's advice and ran with 6 GB it was normal speed. Scaling linearly: 3.1 Gbp human reference / 5.5 GB minimum allocation = 14 Gbp reference B / x, so x ≈ 24.8 GB minimum. No idea if it actually scales linearly like that, but it seems like a good place to start.
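That back-of-the-envelope scaling can be written out as a one-liner; the linearity is an assumption, as noted, and the 3.1 Gbp / 5.5 GB pair is just the human-genome data point from above:

```shell
# estimate_mem_gb: scale a known (reference size, memory) pair linearly
# to a new reference size. 3.1 Gbp -> 5.5 GB is the human-genome data
# point; whether the scaling is truly linear is an open question.
estimate_mem_gb() {
  awk -v ref_gbp="$1" 'BEGIN { printf "%.1f\n", ref_gbp * 5.5 / 3.1 }'
}

estimate_mem_gb 14   # estimate for the 14 Gbp species B reference
```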
After your suggestion (and reading your old post), I increased the memory to 32 GB and it is still just as slow. For the time being I have decided to run multiple jobs in parallel to compensate for the slowness. While that's going, I will also play with increasing nodes, processors per node, memory, etc., and report back here if anything improves the situation. I still find it very bizarre that this is happening despite the pipeline running really fast on another species of similar size.
Edit (12-13-2017): This morning I ran it again on a single node with 24 processors but requested 126 GB of memory. And it is printing one SAM file every 2 minutes, on average. This is a tremendous speed boost compared with the previous rate of 1 file every 2 freaking hours!! Seems like more memory is the answer.
Thanks Robert! Beer coming your way! ;-)
You are using two different algorithms, so you can't directly compare the running times of the two (even if all the rest of the stuff is the same). See this for more: When and why is bwa aln better then bwa mem? and also this. Your libraries could also have very different characteristics, which will complicate things.
I'll only take one. Need to leave some for Santa.
I too would put my money on library weirdness leading to goofy fq files, but I really haven't seen anything out of the ordinary enough to notice. What's insane is that I've done this numerous times in the past with species B but have never seen such slow output.
My only solution to this might be to submit multiple parallel jobs to account for the slowness.
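A minimal sketch of that fan-out using plain shell job control, with the actual bwa call commented out (path and file names assumed from the posts above) and replaced by an echo so the sketch runs on its own:

```shell
# align_one: wrap the per-sample alignment so several can be launched
# in the background with '&' and collected with 'wait'
align_one() {
  sam=${1%.fq}.sam
  # real command (install path assumed from above):
  # ~/bwa-0.7.16a/bin/bwa mem -t 4 B_ref_seq.fa "$1" > "$sam"
  echo "finished $sam"   # placeholder so the sketch is runnable as-is
}

# e.g. launch a batch in parallel, then wait for all of them:
# for fq in B_Ind1.fq B_Ind2.fq B_Ind3.fq; do align_one "$fq" & done; wait
align_one B_Ind1.fq
```

With 16 threads total, splitting them across four background jobs of `-t 4` each is one reasonable trade-off between per-file speed and throughput.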
I was not referring to goofy data files but to insert sizes, the parts of the genome represented in that library, etc.
2 h is not that bad. With 500M-read datasets, people wait a full day even with multiple threads.