Hi all,
I would like to try GenomicsDB in a sample VCF. This VCF is obtained from an imputation software (Gencove) and it is not a g.VCF. I have tried the following code:
docker run -it broadinstitute/gatk:latest
gatk GenomicsDBImport -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --genomicsdb-workspace-path my_database --intervals chr20,chr21
This gives the following error:
A USER ERROR has occurred: Failed to create reader from file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz because of the following error:
Unable to parse header with error: /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz, for input source: file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz
Some comments in GATK forums like here and here, suggest the following:
- Double check that there is a .vcf.gz.tbi index file corresponding to each .vcf.gz file and that both the index file and VCF file were compressed.
- Run ValidateVariants on the VCF file.
- Run HaplotypeCaller
I have checked the first point and the result from ls /data/genomicsdb/results/LUSE03081974
is the following:
human_g1k_v37.fasta
human_g1k_v37.fasta.fai
LUSE03081974.vcf.gz
LUSE03081974.vcf.gz.csi
LUSE03081974.vcf.gz.tbi
LUSE03081974.vcf.gz.tbi.gz
LUSE03081974.bai
LUSE03081974.bam
Therefore, I try the next two points, run ValidateVariants
:
gatk ValidateVariants -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --validation-type-to-exclude ALL --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'
Which gives me an error:
A USER ERROR has occurred: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkFastaPath(CachingIndexedFastaSequenceFile.java:173)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:143)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:125)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:110)
at org.broadinstitute.hellbender.engine.ReferenceFileSource.<init>(ReferenceFileSource.java:35)
at org.broadinstitute.hellbender.engine.ReferenceDataSource.of(ReferenceDataSource.java:27)
at org.broadinstitute.hellbender.engine.GATKTool.initializeReference(GATKTool.java:439)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:722)
at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:45)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
So there appears to be a problem with the fasta file. From comments in this forum and GATK, the fasta files need to have their corresponding .fai
index and .dict
. So, I try to create an index using:
gatk CreateSequenceDictionary -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fa
which produces a new error:
[Mon Jul 11 16:27:36 GMT 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1086849024
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Cannot write file: /data/genomicsdb/results/LUSE03081974/human_g1k_v37.dict. Neither file nor parent directory exist.
at htsjdk.samtools.util.IOUtil.assertFileIsWritable(IOUtil.java:554)
at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:223)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
So I am not sure where my problem is. I originally thought it could be that genomicsDB worked only with g.VCF and not VCFs, but then I ended up not being able to use ValidateVariants
or CreateSequenceDictionary
. It seems to me that I might have some issues with naming/referencing files, but I am not sure what is it.
Some additional information, if its useful: using gatk latest from docker and java version:
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
I appreciate your help!
what is the output of