GATK - GenomeLocParser - The available sequence dictionary does not contain a sequence length for contig
1
0
Entering edit mode
2.3 years ago
aheritas • 0

Hello, I am having some issues using GATK GenomicsDB Import, specifically when a command GenomeLocParser raises the following warning:

The available sequence dictionary does not contain a sequence length for contig (2). Skipping validation of the genome loc end coordinate (380000).

This happens for any contig I try, not just 2.

In other posts people have mentioned that this was a problem of incorrect indexing of the fasta files. I have indexed the files myself, using the following commands, as suggested in GATK:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz 
bgzip -d human_g1k_v37.fasta.gz 
samtools faidx /home/files/human_g1k_v37.fasta 
gatk CreateSequenceDictionary -R human_g1k_v37.fasta 

A head of the .dict file shows:

@HD VN:1.6
@SQ SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:2    LN:243199373    M5:a0d9851da00400dec1098a9255ac712e UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:3    LN:198022430    M5:fdfd811849cc2fadebc929bb925902e5 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:4    LN:191154276    M5:23dccd106897542ad87d2765d28a19a1 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:5    LN:180915260    M5:0740173db9ffd264d728f32784845cd7 UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:6    LN:171115067    M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:7    LN:159138663    M5:618366e953d6aaad97dbe4777c29375e UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:8    LN:146364022    M5:96f514a9929e410c6651697bded59aec UR:file:/home/files/human_g1k_v37.fasta
@SQ SN:9    LN:141213431    M5:3e273117f15e0a400f01055d9f393768 UR:file:/home/files/human_g1k_v37.fasta

So the .dict files have their length, but somehow GenomeLocParser does not see that.

This is the command I am using:

gatk GenomicsDBImport -V /home/files/patient.vcf.gz  --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta  --sequence-dictionary  /home/files/human_g1k_v37.dict --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

and this is the complete log where the warning and error from GenomeLocParser:

gatk GenomicsDBImport -V /home/files/patient.vcf.gz  --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta  --sequence-dictionary  /home/files/human_g1k_v37.dict --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'  
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /gatk/gatk-package-4.2.6.1-local.jar GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path /home/files/my_database --intervals 2:200-380000 --reference /home/files/human_g1k_v37.fasta --sequence-dictionary /home/files/human_g1k_v37.dict
18:26:10.167 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:26:10.277 INFO  GenomicsDBImport - ------------------------------------------------------------
18:26:10.277 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
18:26:10.277 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
18:26:10.278 INFO  GenomicsDBImport - Executing as root@bfe698f49095 on Linux v5.4.0-109-generic amd64
18:26:10.278 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
18:26:10.278 INFO  GenomicsDBImport - Start Date/Time: July 19, 2022 6:26:10 PM GMT
18:26:10.278 INFO  GenomicsDBImport - ------------------------------------------------------------
18:26:10.278 INFO  GenomicsDBImport - ------------------------------------------------------------
18:26:10.278 INFO  GenomicsDBImport - HTSJDK Version: 2.24.1
18:26:10.278 INFO  GenomicsDBImport - Picard Version: 2.27.1
18:26:10.278 INFO  GenomicsDBImport - Built for Spark Version: 2.4.5
18:26:10.279 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:26:10.279 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:26:10.279 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:26:10.279 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:26:10.279 INFO  GenomicsDBImport - Deflater: IntelDeflater
18:26:10.279 INFO  GenomicsDBImport - Inflater: IntelInflater
18:26:10.279 INFO  GenomicsDBImport - GCS max retries/reopens: 20
18:26:10.279 INFO  GenomicsDBImport - Requester pays: disabled
18:26:10.279 INFO  GenomicsDBImport - Initializing engine
18:26:10.652 WARN  GenomeLocParser - The available sequence dictionary does not contain a sequence length for contig (2). Skipping validation of the genome loc end coordinate (380000).
18:26:10.653 INFO  IntervalArgumentCollection - Processing 379801 bp from intervals
18:26:10.686 INFO  GenomicsDBImport - Done initializing engine
18:26:10.978 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
18:26:10.979 INFO  GenomicsDBImport - Vid Map JSON file will be written to /home/files/my_database/vidmap.json
18:26:10.979 INFO  GenomicsDBImport - Callset Map JSON file will be written to /home/files/my_database/callset.json
18:26:10.979 INFO  GenomicsDBImport - Complete VCF Header will be written to /home/files/my_database/vcfheader.vcf
18:26:10.979 INFO  GenomicsDBImport - Importing to workspace - /home/files/my_database
18:26:11.159 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
terminate called after throwing an instance of 'GenomicsDBConfigException'
  what():  GenomicsDBConfigException : Position 200 queried for contig 2 which is of length 0; queried position is past end of contig

I appreciate any ideas or suggestions of where my problem might be.

GenomicsDB GATK CreateSequenceDictionary • 1.1k views
ADD COMMENT
0
Entering edit mode
2.3 years ago

what is the output of

bcftools view --header-only /home/files/patient.vcf.gz  | grep "##contig" | grep -w 2
ADD COMMENT
0
Entering edit mode

Hi Pierre, thank you very much for your suggestion. This is the output using that command:

(bcfenv) root@fb9f52c2a474:/home/files# bcftools view --header-only /home/files/patient.vcf.gz  | grep "##contig" | grep -w 2
##contig=<ID=2>

So, from what I understand, my dictionary and my vcf are named the same, right?

ADD REPLY
1
Entering edit mode

so , it's not a problem with your reference but with your input vcf , it should contain a valid dictionary with the length attribute.

but anyway, AFAIK GenomicDBImport reads GVCF files (not VCF), if your vcf had been processed with HaplotypeCaller in GVCF mode it should contain a valid vcf header.

ADD REPLY
0
Entering edit mode

Thanks, Pierre. I got confused because the original GenomicsDB app allows for VCF use.

ADD REPLY

Login before adding your answer.

Traffic: 2107 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6