GATK4 haplotypecaller calls variants on regions without read coverage
0
0
Entering edit mode
3 months ago

Hey friends. I am running my best practice GATK4 workflow and it seems to work. However, puzzling to me is that variants appear in the vcf file after haplotypecaller that are found in genomic regions without read coverage. I realized this when running CNNScoreVariants using the bam file and vcf file (produced from the bam file) and reference file (path always copy pasted for each command) as input.

I get following output:

gatk --java-options "-Xmx25G" CNNScoreVariants -I mounted_data/bam/duplicate/ERR.mapped.merged.duplicates.bam -V mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz -R ref/.../hg19_main.fa -O mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller_scored.vcf.gz --tmp-dir mounted_data/tmp_2/ --tensor-type read_tensor
Using GATK jar /gatk/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx25G -jar /gatk/gatk-package-4.5.0.0-local.jar CNNScoreVariants -I mounted_data/bam/duplicate/ERR.mapped.merged.duplicates.bam -V mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz -R ref/human37_hg19/genome/jordi_chr/hg19_main.fa -O mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller_scored.vcf.gz --tmp-dir mounted_data/tmp_2/ --tensor-type read_tensor
08:34:08.037 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
08:34:08.730 INFO  CNNScoreVariants - ------------------------------------------------------------
08:34:08.731 INFO  CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.5.0.0
08:34:08.731 INFO  CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
08:34:08.731 INFO  CNNScoreVariants - Executing as root@f5028debf481 on Linux v5.15.0-117-generic amd64
08:34:08.731 INFO  CNNScoreVariants - Java runtime: OpenJDK 64-Bit Server VM v17.0.9+9-Ubuntu-122.04
08:34:08.731 INFO  CNNScoreVariants - Start Date/Time: August 22, 2024 at 8:34:08 AM GMT
08:34:08.731 INFO  CNNScoreVariants - ------------------------------------------------------------
08:34:08.731 INFO  CNNScoreVariants - ------------------------------------------------------------
08:34:08.732 INFO  CNNScoreVariants - HTSJDK Version: 4.1.0
08:34:08.732 INFO  CNNScoreVariants - Picard Version: 3.1.1
08:34:08.732 INFO  CNNScoreVariants - Built for Spark Version: 3.5.0
08:34:08.732 INFO  CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
08:34:08.732 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:34:08.732 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:34:08.732 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:34:08.732 INFO  CNNScoreVariants - Deflater: IntelDeflater
08:34:08.732 INFO  CNNScoreVariants - Inflater: IntelInflater
08:34:08.732 INFO  CNNScoreVariants - GCS max retries/reopens: 20
08:34:08.732 INFO  CNNScoreVariants - Requester pays: disabled
08:34:08.733 INFO  CNNScoreVariants - Initializing engine
08:34:08.822 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz
08:34:08.917 INFO  CNNScoreVariants - Done initializing engine
08:34:08.918 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/gatk/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
08:34:09.958 INFO  CNNScoreVariants - Using key:CNN_2D for CNN architecture:/gatk/mounted_data/tmp_2/small_2d.10497597046314882192.json and weights:/gatk/mounted_data/tmp_2/small_2d.17021895131602487097.hd5
08:34:10.376 INFO  ProgressMeter - Starting traversal
08:34:10.377 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
08:34:10.378 INFO  CNNScoreVariants - Starting pass 0 through the variants
08:34:18.580 WARN  CNNScoreVariants - No reads at contig:chr1 site:1654016
08:34:22.231 INFO  ProgressMeter -         chr1:2274956              0.2                  2000          10124.9
08:34:42.120 INFO  ProgressMeter -         chr1:4366538              0.5                  4000           7560.7
08:34:54.506 INFO  ProgressMeter -         chr1:5458871              0.7                  5000           6798.4
08:35:07.333 INFO  ProgressMeter -         chr1:6037603              0.9                  6000           6320.7
08:35:22.567 INFO  ProgressMeter -         chr1:7263923              1.2                  7000           5818.0
08:35:34.882 INFO  ProgressMeter -         chr1:8219442              1.4                  8000           5680.2
08:35:47.129 INFO  ProgressMeter -         chr1:9390616              1.6                  9000           5581.3
08:35:59.697 INFO  ProgressMeter -        chr1:11082054              1.8                 10000           5488.5
08:36:12.571 INFO  ProgressMeter -        chr1:12194451              2.0                 11000           5401.2
08:36:25.894 WARN  CNNScoreVariants - No reads at contig:chr1 site:13178527
08:36:26.063 INFO  ProgressMeter -        chr1:13217588              2.3                 12000           5306.4

...

This is confusing me. As Haplotypecaller itself called those variants from the exact same bam file using the exact same reference file (only using the main chromosomes). Same issue when doing bqsr step

Does anyone has an explanation for that interesting sight?

Cheers.

GATK4 CNNScoreVariants haplotypecaller • 620 views
ADD COMMENT
0
Entering edit mode

that variants appear in the vcf file after haplotypecaller that are found in genomic regions without read coverage.

show us such variant please

ADD REPLY
0
Entering edit mode

Sure.

Here is one:

chr1    48058280    .   C   CTGTCAGCTGGGGCAGGGACTCAAATCCCAATTTGGGTCTTTCAGTGGAAACAG  255.93  .   AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.26;SOR=1.329 GT:AD:DP:GQ:PL  1/1:0,6:6:18:270,18,0

and another:

chr1    165217699   .   T   A   102.84  .   AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=34.28;SOR=1.179    GT:AD:DP:GQ:PL  1/1:0,3:3:9:116,9,0

Both are from the BQSR approach.

Thanks.

ADD REPLY
0
Entering edit mode

you said there is no read but the (low) depth are 6 and 3 for those variants, with a poor genotype quality.

ADD REPLY
0
Entering edit mode

yes, that is confusing me. The tool tells me

WARN  CNNScoreVariants - No reads at contig:chrxxxx site:xxxx

And looking at the used bam file manually with

samtools view ERR.bqsr.bam chr1:165217699-165217699

won't show any output, meaining there is no read covering that called variant.

It does not happen often though. what is haplotypecaller doing to my precious bam file.

ADD REPLY
0
Entering edit mode

won't show any output

GATK realign reads, may be this is the reason. Check a larger region around this POS in IGV, and show the CLIPped regions.

ADD REPLY
0
Entering edit mode

thanks for this hint. unfortunately, no read reaches the variant of interest, even when clipping is considered (see fig below). this is no the only thing that confuses me about haplotypecaller. the major problem is a wrong variant call, claiming a homozygous alternative insertion (1/1) although the bam file shows a heterozygous insertion of one base at that position. maybe it's because the second base downstream is the beginning of a 2kb deletion and reads carrying the inserted base reach into that deletion and read that do not carry the inserted base abruptly stop before the deletion. I'll see if structural variant caller get the deletion right. but that is a story for another entry if i am not able to make the 1 base insertion appear as heterozygous instead of homozygous in the vcf file.

anyway, i appreciate your input and help. if there is another idea, i am looking forward to test it.

igv genomic are of interest

ADD REPLY

Login before adding your answer.

Traffic: 1918 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6