Hey friends. I am running my best practice GATK4 workflow and it seems to work. However, puzzling to me is that variants appear in the vcf file after haplotypecaller that are found in genomic regions without read coverage. I realized this when running CNNScoreVariants using the bam file and vcf file (produced from the bam file) and reference file (path always copy pasted for each command) as input.
I get following output:
gatk --java-options "-Xmx25G" CNNScoreVariants -I mounted_data/bam/duplicate/ERR.mapped.merged.duplicates.bam -V mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz -R ref/.../hg19_main.fa -O mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller_scored.vcf.gz --tmp-dir mounted_data/tmp_2/ --tensor-type read_tensor
Using GATK jar /gatk/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx25G -jar /gatk/gatk-package-4.5.0.0-local.jar CNNScoreVariants -I mounted_data/bam/duplicate/ERR.mapped.merged.duplicates.bam -V mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz -R ref/human37_hg19/genome/jordi_chr/hg19_main.fa -O mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller_scored.vcf.gz --tmp-dir mounted_data/tmp_2/ --tensor-type read_tensor
08:34:08.037 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
08:34:08.730 INFO CNNScoreVariants - ------------------------------------------------------------
08:34:08.731 INFO CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.5.0.0
08:34:08.731 INFO CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
08:34:08.731 INFO CNNScoreVariants - Executing as root@f5028debf481 on Linux v5.15.0-117-generic amd64
08:34:08.731 INFO CNNScoreVariants - Java runtime: OpenJDK 64-Bit Server VM v17.0.9+9-Ubuntu-122.04
08:34:08.731 INFO CNNScoreVariants - Start Date/Time: August 22, 2024 at 8:34:08 AM GMT
08:34:08.731 INFO CNNScoreVariants - ------------------------------------------------------------
08:34:08.731 INFO CNNScoreVariants - ------------------------------------------------------------
08:34:08.732 INFO CNNScoreVariants - HTSJDK Version: 4.1.0
08:34:08.732 INFO CNNScoreVariants - Picard Version: 3.1.1
08:34:08.732 INFO CNNScoreVariants - Built for Spark Version: 3.5.0
08:34:08.732 INFO CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
08:34:08.732 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:34:08.732 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:34:08.732 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:34:08.732 INFO CNNScoreVariants - Deflater: IntelDeflater
08:34:08.732 INFO CNNScoreVariants - Inflater: IntelInflater
08:34:08.732 INFO CNNScoreVariants - GCS max retries/reopens: 20
08:34:08.732 INFO CNNScoreVariants - Requester pays: disabled
08:34:08.733 INFO CNNScoreVariants - Initializing engine
08:34:08.822 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/mounted_data/vcf/vcf/vcf/raw_ERR/ERR_WGS_hg19_haplotypecaller.vcf.gz
08:34:08.917 INFO CNNScoreVariants - Done initializing engine
08:34:08.918 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/gatk/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
08:34:09.958 INFO CNNScoreVariants - Using key:CNN_2D for CNN architecture:/gatk/mounted_data/tmp_2/small_2d.10497597046314882192.json and weights:/gatk/mounted_data/tmp_2/small_2d.17021895131602487097.hd5
08:34:10.376 INFO ProgressMeter - Starting traversal
08:34:10.377 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
08:34:10.378 INFO CNNScoreVariants - Starting pass 0 through the variants
08:34:18.580 WARN CNNScoreVariants - No reads at contig:chr1 site:1654016
08:34:22.231 INFO ProgressMeter - chr1:2274956 0.2 2000 10124.9
08:34:42.120 INFO ProgressMeter - chr1:4366538 0.5 4000 7560.7
08:34:54.506 INFO ProgressMeter - chr1:5458871 0.7 5000 6798.4
08:35:07.333 INFO ProgressMeter - chr1:6037603 0.9 6000 6320.7
08:35:22.567 INFO ProgressMeter - chr1:7263923 1.2 7000 5818.0
08:35:34.882 INFO ProgressMeter - chr1:8219442 1.4 8000 5680.2
08:35:47.129 INFO ProgressMeter - chr1:9390616 1.6 9000 5581.3
08:35:59.697 INFO ProgressMeter - chr1:11082054 1.8 10000 5488.5
08:36:12.571 INFO ProgressMeter - chr1:12194451 2.0 11000 5401.2
08:36:25.894 WARN CNNScoreVariants - No reads at contig:chr1 site:13178527
08:36:26.063 INFO ProgressMeter - chr1:13217588 2.3 12000 5306.4
...
This is confusing me. As Haplotypecaller itself called those variants from the exact same bam file using the exact same reference file (only using the main chromosomes). Same issue when doing bqsr step
Does anyone has an explanation for that interesting sight?
Cheers.
show us such variant please
Sure.
Here is one:
and another:
Both are from the BQSR approach.
Thanks.
you said there is no read but the (low) depth are 6 and 3 for those variants, with a poor genotype quality.
yes, that is confusing me. The tool tells me
And looking at the used bam file manually with
won't show any output, meaining there is no read covering that called variant.
It does not happen often though. what is haplotypecaller doing to my precious bam file.
GATK realign reads, may be this is the reason. Check a larger region around this POS in IGV, and show the CLIPped regions.
thanks for this hint. unfortunately, no read reaches the variant of interest, even when clipping is considered (see fig below). this is no the only thing that confuses me about haplotypecaller. the major problem is a wrong variant call, claiming a homozygous alternative insertion (1/1) although the bam file shows a heterozygous insertion of one base at that position. maybe it's because the second base downstream is the beginning of a 2kb deletion and reads carrying the inserted base reach into that deletion and read that do not carry the inserted base abruptly stop before the deletion. I'll see if structural variant caller get the deletion right. but that is a story for another entry if i am not able to make the 1 base insertion appear as heterozygous instead of homozygous in the vcf file.
anyway, i appreciate your input and help. if there is another idea, i am looking forward to test it.