I am trying to impute missing values in whole-genome data (3955671 rows) for more than 700 samples. The script works fine on a smaller dataset (10000 rows) but fails with an out-of-memory error on the full genome.
Trial dataset:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 108 139 159 265 350
1 73 0 C A 40 PASS 0 GT:DP:GQ 0|0:5:40 0|0:9:40 0|0:6:38 ./.:.:. ./.:.:.
1 83 0 T C,A 40 PASS 0 GT:DP:GQ 1|1:5:40 1|1:9:40 0|0:8:38 ./.:.:. ./.:.:.
1 92 0 A C 40 PASS 0 GT:DP:GQ 1|1:8:40 1|1:11:40 0|0:9:40 ./.:.:. ./.:.:.
After imputation:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 108 139 159 265 350
1 73 0 C A 0 PASS 0 GT 0|0 0|0 0|0 0|0 0|0
1 83 0 T C,A 0 PASS 0 GT 1|1 1|1 0|0 0|0 0|0
1 92 0 A C 0 PASS 0 GT 1|1 1|1 0|0 0|0 0|0
Command and error for the full-genome dataset:
java -Xmx50g -jar beagle.16May19.351.jar gt=genotype_9.vcf.recode.vcf nthreads=96 out=results
beagle.16May19.351.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.16May19.351.jar" to list command line argument
Start time: 03:02 PM BST on 25 May 2019
Command line: java -Xmx45511m -jar beagle.16May19.351.jar
gt=genotype_9.vcf.recode.vcf
nthreads=96
out=results
No genetic map is specified: using 1 cM = 1 Mb
Reference samples: 0
Study samples: 666
Window 1 (1:73-30427620)
Study markers: 960,417
java.lang.OutOfMemoryError: Java heap space
at phase.PhaseBaum1.<init>(PhaseBaum1.java:107)
at phase.PhaseLS.run(PhaseLS.java:66)
at main.MainHelper.lsPhaseSingles(MainHelper.java:95)
at main.MainHelper.phase(MainHelper.java:72)
at main.Main.phaseData(Main.java:166)
at main.Main.main(Main.java:116)
java.lang.OutOfMemoryError: Java heap space
ERROR
terminating program.
I can use up to 102 cores, and here is the free memory information for my server:
total used free
Mem: 257823 786 53784
How much memory do I need to allocate for this task, or do I need to split my dataset and run the imputation on each subset separately?
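
For reference, the subsetting I have in mind would look roughly like the sketch below: stream the VCF once and write one file per chromosome, so that each file can be passed to Beagle in a separate run. The function name split_vcf_by_chrom and the chr<name>.vcf output naming are my own placeholders, and the sketch assumes an uncompressed VCF with all header lines before the first record.

```python
import os

def split_vcf_by_chrom(vcf_path, out_dir):
    """Stream a plain-text VCF and write one file per chromosome.

    Returns a dict mapping chromosome name to the number of records written.
    Hypothetical helper: the name and output layout are illustrative only.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []    # all '#' lines, copied to the top of every output file
    handles = {}   # chromosome name -> open output file
    counts = {}    # chromosome name -> number of records written
    try:
        with open(vcf_path) as vcf:
            for line in vcf:
                if not line.strip():
                    continue  # ignore blank lines
                if line.startswith("#"):
                    header.append(line)
                    continue
                # The first whitespace-delimited field is CHROM.
                chrom = line.split(None, 1)[0]
                if chrom not in handles:
                    path = os.path.join(out_dir, f"chr{chrom}.vcf")
                    handles[chrom] = open(path, "w")
                    handles[chrom].writelines(header)
                    counts[chrom] = 0
                handles[chrom].write(line)
                counts[chrom] += 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts
```

Only one line is held in memory at a time, so the split itself should not be limited by the size of the input. It keeps one open file per chromosome, which could hit the open-file limit on assemblies with many contigs. Whether per-chromosome files are small enough for Beagle is exactly what I am unsure about, since the failing window above already covers only chromosome 1.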