Hello, I am having some issues using GATK GenomicsDB Import, specifically when a command GenomeLocParser raises the following warning:
The available sequence dictionary does not contain a sequence length for contig (2). Skipping validation of the genome loc end coordinate (380000).
This happens for any contig I try, not just 2.
In other posts people have mentioned that this was a problem of incorrect indexing of the fasta files. I have indexed the files myself, using the following commands, as suggested in GATK:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
bgzip -d human_g1k_v37.fasta.gz
samtools faidx /home/files/human_g1k_v37.fasta
gatk CreateSequenceDictionary -R human_g1k_v37.fasta
A head of the .dict file shows:
@HD VN:1.6
@SQ SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:file:/home/files/human_g1k_v37.fasta
So the .dict files have their length, but somehow GenomeLocParser does not see that.
This is the command I am using:
gatk GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta --sequence-dictionary /home/files/human_g1k_v37.dict --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'
and this is the complete log where the warning and error from GenomeLocParser:
gatk GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta --sequence-dictionary /home/files/human_g1k_v37.dict --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /gatk/gatk-package-4.2.6.1-local.jar GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta --sequence-dictionary /home/files/human_g1k_v37.dict
18:26:10.167 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:26:10.277 INFO GenomicsDBImport - ------------------------------------------------------------
18:26:10.277 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
18:26:10.277 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
18:26:10.278 INFO GenomicsDBImport - Executing as root@bfe698f49095 on Linux v5.4.0-109-generic amd64
18:26:10.278 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
18:26:10.278 INFO GenomicsDBImport - Start Date/Time: July 19, 2022 6:26:10 PM GMT
18:26:10.278 INFO GenomicsDBImport - ------------------------------------------------------------
18:26:10.278 INFO GenomicsDBImport - ------------------------------------------------------------
18:26:10.278 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
18:26:10.278 INFO GenomicsDBImport - Picard Version: 2.27.1
18:26:10.278 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
18:26:10.279 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:26:10.279 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:26:10.279 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:26:10.279 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:26:10.279 INFO GenomicsDBImport - Deflater: IntelDeflater
18:26:10.279 INFO GenomicsDBImport - Inflater: IntelInflater
18:26:10.279 INFO GenomicsDBImport - GCS max retries/reopens: 20
18:26:10.279 INFO GenomicsDBImport - Requester pays: disabled
18:26:10.279 INFO GenomicsDBImport - Initializing engine
18:26:10.652 WARN GenomeLocParser - The available sequence dictionary does not contain a sequence length for contig (2). Skipping validation of the genome loc end coordinate (380000).
18:26:10.653 INFO IntervalArgumentCollection - Processing 379801 bp from intervals
18:26:10.686 INFO GenomicsDBImport - Done initializing engine
18:26:10.978 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
18:26:10.979 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/files/my_database/vidmap.json
18:26:10.979 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/files/my_database/callset.json
18:26:10.979 INFO GenomicsDBImport - Complete VCF Header will be written to /home/files/my_database/vcfheader.vcf
18:26:10.979 INFO GenomicsDBImport - Importing to workspace - /home/files/my_database
18:26:11.159 INFO GenomicsDBImport - Importing batch 1 with 1 samples
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Position 200 queried for contig 2 which is of length 0; queried position is past end of contig
I appreciate any ideas or suggestions of where my problem might be.
Hi Pierre, thank you very much for your suggestion. This is the output using that command:
So, from what I understand, my dictionary and my vcf are named the same, right?
so , it's not a problem with your reference but with your input vcf , it should contain a valid dictionary with the length attribute.
but anyway, AFAIK GenomicDBImport reads GVCF files (not VCF), if your vcf had been processed with HaplotypeCaller in GVCF mode it should contain a valid vcf header.
Thanks, Pierre. I got confused because the original GenomicsDB app allows for VCF use.