Hi,
I have a large number of GVCF files that I'm trying to joint-genotype, starting by running GenomicsDBImport in GATK 4.1.4.0. When I say large, I mean 135 samples * 229 genomic intervals = 30,915 files.
Here's what I have:
java -Xmx80g -XX:ParallelGCThreads=20 -jar $GATKPATH GenomicsDBImport -L $LIST \
-V ${SLURM_ARRAY_TASK_ID}.1.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.2.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.3.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.4.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.5.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.6.raw.g.vcf \
...
-V ${SLURM_ARRAY_TASK_ID}.133.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.134.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.135.raw.g.vcf \
--merge-input-intervals true \
--genomicsdb-workspace-path /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_${SLURM_ARRAY_TASK_ID}
where $LIST points to the location of the scaffold list for each interval, and the task ID identifies the interval.
This runs for a while but then this happens:
13:43:09.139 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/n/holyscratch01/edwards_lab/rafa/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Mar 20, 2020 1:43:13 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:43:13.385 INFO GenomicsDBImport - ------------------------------------------------------------
13:43:13.385 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.4.0
13:43:13.385 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
13:43:14.389 INFO GenomicsDBImport - Executing as rmarcondes@holy2c02310.rc.fas.harvard.edu on Linux v3.10.0-957.12.1.el7.x86_64 amd64
13:43:14.389 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v10.0.1+10
13:43:14.389 INFO GenomicsDBImport - Start Date/Time: March 20, 2020 at 1:43:09 PM GMT-05:00
13:43:14.389 INFO GenomicsDBImport - ------------------------------------------------------------
13:43:14.389 INFO GenomicsDBImport - ------------------------------------------------------------
13:43:14.390 INFO GenomicsDBImport - HTSJDK Version: 2.20.3
13:43:14.390 INFO GenomicsDBImport - Picard Version: 2.21.1
13:43:14.390 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:43:14.390 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:43:14.390 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:43:14.390 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:43:14.390 INFO GenomicsDBImport - Deflater: IntelDeflater
13:43:14.390 INFO GenomicsDBImport - Inflater: IntelInflater
13:43:14.390 INFO GenomicsDBImport - GCS max retries/reopens: 20
13:43:14.390 INFO GenomicsDBImport - Requester pays: disabled
13:43:14.391 INFO GenomicsDBImport - Initializing engine
13:44:18.385 INFO IntervalArgumentCollection - Processing 48059334 bp from intervals
13:44:18.412 INFO GenomicsDBImport - Done initializing engine
13:44:18.806 INFO GenomicsDBImport - Shutting down engine
[March 20, 2020 at 1:44:18 PM GMT-05:00] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 1.16 minutes.
Runtime.totalMemory()=11559501824
***********************************************************************
A USER ERROR has occurred: Error creating GenomicsDB workspace: /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_177 already exists
Thanks for any pointers!
If you check the last line of the log, the error is spelled out: the GenomicsDB workspace directory db_177 already exists, presumably left over from a previous run.
You should consider a different naming strategy for the workspace directory, so that reruns of the same array task don't collide with leftovers from earlier attempts.
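For example, a minimal sketch assuming the same paths as in your command ($SLURM_ARRAY_JOB_ID is the standard SLURM variable identifying the parent array job, not something from your original script):

--genomicsdb-workspace-path /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}

That way each new submission of the array writes to a fresh set of workspaces.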
Alternatively, just remove the directory db_177 and try again, but make sure the parent genomic_DBs directory exists first.
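Something along these lines before the GATK call should work (a minimal sketch using the paths from your command; DB_ROOT is a hypothetical variable introduced here for readability):

DB_ROOT=/n/holyscratch01/edwards_lab/rafa/genomic_DBs
mkdir -p "$DB_ROOT"                           # create the parent directory if it doesn't exist yet
rm -rf "$DB_ROOT/db_${SLURM_ARRAY_TASK_ID}"   # remove any stale workspace left by a failed run

If I remember correctly, GenomicsDBImport also has an --overwrite-existing-genomicsdb-workspace flag that removes the old workspace for you, but deleting it explicitly in the job script is easier to reason about.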