Hello,
I have genotype-by-sequencing data for 400 samples and am trying to run a SNP calling pipeline with GATK. I managed fine up to the HaplotypeCaller step. However, when I proceed to CombineGVCFs to combine all 400 g.vcf files into one, GATK fails with the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483639 + 9 is too large
Then, I created 8 subset lists with 50 g.vcf files each and used the following command to combine just the 100 samples from two of the lists:
gatk --java-options "-Xmx120G" CombineGVCFs -R /mnt/SNP_calling/Reference/genome.fasta -V intermediate_1.g.vcf -V intermediate_2.g.vcf -O combined_1_2.g.vcf
Still, I get the same memory error. I tried increasing the -Xmx value up to 500G, but that did not resolve it. I am running GATK from a Docker image.
Can you please suggest a way to resolve this issue? I thought of the GenomicsDBImport approach, but my reference genome is scaffolded into 120 scaffolds, so going that way would be more cumbersome.
Even though it is reported as an OutOfMemoryError, this has nothing to do with how much memory you give the JVM. Java arrays cannot hold more than 2147483647 elements (Integer.MAX_VALUE), and in practice the JVM stops slightly earlier, at Integer.MAX_VALUE - 8 = 2147483639, which is exactly the length in your error. The request for 2147483639 + 9 elements is just past that hard cap, so increasing -Xmx can never fix it; CombineGVCFs is simply trying to build a single array larger than Java allows.
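If it helps, the arithmetic from the error message, plus a possible workaround, can be sketched in shell. This is a sketch only: samples.map, gvcf_db, and the file paths below are hypothetical placeholders, not from your post, and the gatk invocation is left commented out since it needs your real data.

```shell
# Sanity-check the numbers from the error: the JVM caps any array at
# Integer.MAX_VALUE = 2147483647 elements (HotSpot actually refuses a bit
# earlier, at Integer.MAX_VALUE - 8 = 2147483639, which is exactly the
# length in the error), so this request can never succeed at any -Xmx:
requested=$((2147483639 + 9))
echo "requested length: $requested"   # 2147483648, past the cap

# Workaround sketch (hypothetical names): GenomicsDBImport accepts an
# intervals file via -L, so the 120 scaffolds need not be handled one by
# one. Column 1 of the samtools faidx index (.fai) holds the names:
if [ -f genome.fasta.fai ]; then
    cut -f1 genome.fasta.fai > scaffolds.list
fi
# gatk GenomicsDBImport \
#     --sample-name-map samples.map \
#     --genomicsdb-workspace-path gvcf_db \
#     -L scaffolds.list
```

The --sample-name-map file is tab-separated (sample name, then g.vcf path), which also spares you from passing 400 separate -V flags.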