I am using GATK HaplotypeCaller to genotype ~3000 SNPs from amplicon sequencing data, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES mode.
However, it take huge time to call these variants, ~20h.
And I found that adding -nct option did not work, and it even increase to 23h.
Is there any option to accelerate the calling process?
Hello. this is a few years late, but I hope this helps someone else out. I made an improved version of YaGalbi's code above so that it is much faster (9-fold faster). By adding & done; wait to the end of the @chr loop allows all processes to be run simultaneously. wait makes the machine stop before continuing onto the concatenation step. However, you must adjust your resource partitioning in --java-options "Xmx__g" and --native-pair-hmm-threads so that they do not exceed the machine limit for all chromosomes being computed. For example, I run on an HPC server and have 16gb RAM and 6 CPU for each of my 10 chromosomes. This means I allocated 172 gb RAM and 64 CPU (a little extra just in case). This only works if you can allocate these kinds of resources. Here is the updated code:
@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
{
gatk --java-options "-Xmx16g" HaplotypeCaller
--input example.bam
--output example.gvcf.gz
--reference human.fasta
--emit-ref-confidence GVCF
--dbsnp knownsnps.vcf
--native-pair-hmm-threads 6
--L $chr
} & done;
wait
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf
By adding & done; wait to the end of the @chr loop allows all processes to be run simultaneously. wait makes the machine stop before continuing onto the concatenation step.
please, use a workflow manager: make, nextflow, snakemake...
GATK4 HaplotypeCaller no longer has the option to use -nt or -nct. HaplotypeCaller in Spark is in development so that the program can be parallelised, however this is only in beta and is not yet recommended.
To speed things up, I am running HaplotypeCaller with the -L option. I run the program once for each chromosome and then concatenate the results with bcftools (cat won't work because of the file headers)
Pseudocode:
@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
{
gatk HaplotypeCaller
--input example.bam
--output example.gvcf.gz
--reference human.fasta
--emit-ref-confidence GVCF
--dbsnp knownsnps.vcf
--native-pair-hmm-threads 32 # not -nt or -nct , default = 4 (1/3 extra runtime)
--L $chr
}
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf
Thanks, YaGalbi!
I have a related question --
I am trying to figure out how to do exatly what you are talking about, but my reference genome is a denovo assembly with scaffolds, not chromosomes yet. Will that work with just scaffold numbers? I have ±400 scaffolds in a 300Mb genome reference so ideally would do it in 10 steps of 30Mb..
Thanks for you answer!
Actually I also found such a strange issue, when -nct was applied for WGS or WES data, it significantly decrease the runing time, but not for data sequenced from amplicons.
please, use a workflow manager: make, nextflow, snakemake...