I've been processing targeted resequencing data from a Haloplex panel run on a MiSeq using Agilent's Surecall software, and have generated variant calls. However, I want to generate calls with my own GATK pipeline to compare against Surecall.
I'm having trouble with the run time of bwa mem when it is run inside VirtualBox.
VirtualBox (note the huge difference between real time and CPU time):
[main] Real time: 21280.065 sec; CPU: 2547.937 sec
Surecall:
[main] Real time: 572.116 sec; CPU: 375.354 sec
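To put a number on that gap, the ratio of CPU time to real time can be pulled out of bwa's final log line. This is only a rough sketch: `cpu_real_ratio` is a hypothetical helper name, and it assumes the log line has exactly the layout shown above. On a multi-threaded run, a ratio far below 1 suggests the process spent most of its time waiting (for example on disk I/O or swapping) rather than computing.

```shell
# Hypothetical helper: read bwa log lines on stdin and print CPU/real ratio.
# Assumes the layout "[main] Real time: R sec; CPU: C sec".
cpu_real_ratio() {
    awk '/Real time:/ {
        real = $4    # fourth whitespace-separated field is the real time
        cpu  = $7    # seventh field is the CPU time
        printf "%.2f\n", cpu / real
    }'
}

# The two log lines quoted above (VirtualBox first, then Surecall).
echo "[main] Real time: 21280.065 sec; CPU: 2547.937 sec" | cpu_real_ratio
echo "[main] Real time: 572.116 sec; CPU: 375.354 sec" | cpu_real_ratio
```

For the two runs above this prints 0.12 for VirtualBox and 0.66 for Surecall.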
Surecall uses bwa version 0.7.5a-r405 (Windows port version 1.2) with the following command:
bwa.exe, mem, -M, -D, 0,0, -B, 4, -A, 1.0, -w, 100, -k, 19, -R, @RG\tID:X\tSM:X, -t, 4, hg19.fasta, R1_Cut.fastq, R2_Cut.fastq, >, X.sam
In VirtualBox I'm using bwa version 0.7.12-r103 and the following command:
bwa mem -t 6 -M -R '@RG\tID:X\tSM:X' human_g1k_v37.fasta.gz R1_trimpaired.fastq.gz R2_trimpaired.fastq.gz > X.sam
As for VirtualBox, I'm using Bio-Linux 8 with 8 cores and 5 GB of memory dedicated to it. I suspect 5 GB is not enough and the VM is paging, which would explain the slowdown. I might be able to increase the allocation slightly, but the computer itself only has 8 GB in total and the host OS needs some of that. The only other difference is the -D parameter used by Surecall with bwa 0.7.5. I dug through bwa's git and found:
-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [%.2f]\n", opt->drop_ratio);
I did not see that in the bwa manual, so I did not specify it in my command (0 could be the default anyway as Surecall specified a few parameters with default values).
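One way to check the default on the installed version, rather than reading the source, is to filter the usage text that bwa mem prints when run without arguments. A minimal sketch, assuming the usage text goes to stderr and lists each option at the start of its own line; `flag_help` is a hypothetical helper name, not part of bwa.

```shell
# Hypothetical helper: print the usage lines that describe one option.
# $1 = option name such as -D; the usage text is read from stdin.
flag_help() {
    grep -E -- "^[[:space:]]*$1[[:space:]]"
}

# Only query bwa if it is actually on the PATH.
if command -v bwa >/dev/null 2>&1; then
    # Merge stderr into stdout so the usage text reaches the filter.
    bwa mem 2>&1 | flag_help -D || echo "no -D line in this bwa's usage text"
fi
```

The value in square brackets at the end of the matching line, if there is one, is the default for that build.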
For the human genome, 5.5 GB is the minimum; 6 GB is better.
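A quick way to check this from inside the Linux guest before starting a long run is sketched below. It assumes a Linux-style /proc/meminfo, and `mem_ok` is a hypothetical helper name. Note that MemTotal is reported in kB (1024-byte units) and is usually a little lower than the amount allocated to the VM, because the kernel reserves some memory for itself.

```shell
# Hypothetical helper: compare MemTotal with a required amount of memory.
# $1 = path to a meminfo-style file, $2 = required memory in GB.
# Returns 0 if MemTotal is at least the requirement, non-zero otherwise.
mem_ok() {
    awk -v need="$2" '
        /^MemTotal:/ {
            have = $2 / 1048576            # kB -> GB (1024 * 1024 kB per GB)
            printf "MemTotal: %.1f GB (need %.1f GB)\n", have, need
            ok = (have >= need)
        }
        END { exit (ok ? 0 : 1) }
    ' "$1"
}

# Check the running system against the 5.5 GB minimum quoted above.
if [ -r /proc/meminfo ]; then
    mem_ok /proc/meminfo 5.5 || echo "below 5.5 GB: bwa mem on human is likely to swap"
fi
```

While bwa is running, non-zero values in the si and so columns of `vmstat 5` indicate that the guest is actively swapping.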
Thanks Heng! After running with 6GB allocated to the virtual machine the run times are much better.
Any chance you could elaborate on what the Surecall parameter
-D, 0,0
is doing? Is this the default?