Hello,
I have 3264 GVCFs and an interval list for the reference genome that contains 20000 contigs. The interval list looks like this:
utg19_pilon_pilon:1-42237
utg22_pilon_pilon:1-49947
utg24_pilon_pilon:1-61707
utg30_pilon_pilon:1-459006
utg38_pilon_pilon:1-129173
utg40_pilon_pilon:1-101813
utg58_pilon_pilon:1-143918
utg93_pilon_pilon:1-186249
utg100_pilon_pilon:1-87875
utg104_pilon_pilon:1-49315
I am running GATK GenomicsDBImport as follows:
gatk --java-options "-Xmx220G" GenomicsDBImport --genomicsdb-shared-posixfs-optimizations true --genomicsdb-workspace-path PlantDB --intervals intervals.list --sample-name-map samples.sample_map --batch-size 100 --bypass-feature-reader --merge-input-intervals --overwrite-existing-genomicsdb-workspace true --reader-threads 10
But the process is quite slow. How can I modify the command to speed it up?
Parallelize: run the import/merge per contig instead of over all contigs in a single job.
So, I should loop over the interval list?
Use a workflow manager such as Snakemake or Nextflow.
In Nextflow (NF), that would be something like:
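The Nextflow code from this reply is not included here. As a stand-in, and not the original snippet, here is a minimal Python sketch of the same idea: split the interval list into a few chunk files and build one GenomicsDBImport command per chunk, each writing to its own workspace, so the commands can run side by side. The GATK flags are copied from the command in the question; the function names, the `chunk_NNNN.list` files, the `PlantDB_chunk_NNNN` workspace names and the `-Xmx16G` default are assumptions made for illustration only.

```python
from pathlib import Path


def read_intervals(path):
    """Return the non-empty lines of an interval list (one interval per line)."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def split_round_robin(items, n_chunks):
    """Deal items into at most n_chunks groups, dropping empty groups."""
    groups = [items[i::n_chunks] for i in range(n_chunks)]
    return [g for g in groups if g]


def build_command(chunk_file, workspace, sample_map, xmx="-Xmx16G"):
    """Build one GenomicsDBImport command (as an argument list) for one chunk.

    Flags mirror the command in the question. The heap size is a placeholder:
    it has to be chosen so that all parallel jobs fit in memory together.
    """
    return [
        "gatk", "--java-options", xmx, "GenomicsDBImport",
        "--genomicsdb-shared-posixfs-optimizations", "true",
        "--genomicsdb-workspace-path", str(workspace),
        "--intervals", str(chunk_file),
        "--sample-name-map", str(sample_map),
        "--batch-size", "100",
        "--bypass-feature-reader",
        "--merge-input-intervals",
        "--overwrite-existing-genomicsdb-workspace", "true",
        "--reader-threads", "10",
    ]


def plan_jobs(intervals_path, out_dir, n_chunks, sample_map):
    """Write one interval file per chunk and return the matching commands.

    Nothing is executed here; the caller decides how to run the commands
    (cluster scheduler, workflow manager, subprocess, ...).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = split_round_robin(read_intervals(intervals_path), n_chunks)
    commands = []
    for i, chunk in enumerate(chunks):
        chunk_file = out_dir / f"chunk_{i:04d}.list"        # hypothetical name
        chunk_file.write_text("\n".join(chunk) + "\n")
        workspace = out_dir / f"PlantDB_chunk_{i:04d}"      # hypothetical name
        commands.append(build_command(chunk_file, workspace, sample_map))
    return commands
```

For example, `plan_jobs("intervals.list", "chunks", 20, "samples.sample_map")` would return 20 argument lists, each of which could be handed to `subprocess.run` or written out as a job script. With this layout each chunk ends up in a separate workspace, so the downstream genotyping step would presumably also be run per workspace and the resulting VCFs combined afterwards.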
Ok, thank you. I understand.