Hello,
I have a GATK pipeline following best practices workflow to call rare germline SNPs and INDELs. I wanted to reduce time and so I split the processing at chromosome level
In one analysis, I split the aligned and duplicates marked bam file by chromosomes and ran base recalibration(BQSR) and haplotype caller (regular mode, not gvcf) and then concatenated the vcf files of each chromosome of a sample into 1.
In another one, I ran the exact same commands without splitting at chromosome level. I filtered the non standard chromosomes from the result and when I tried to compare the results to the earlier vcf file , there were a couple 100 different variant calls, and in the calls that were matching handful of them had different genotype.
At the moment I am comparing the vcf calls using vcfR package and merging based on CHROM, POS, REF and ALT columns.
Is this expected behavior? I thought that base recalibration and haplotype caller don't consider regions across contigs for it's functionality.
Should I filter the non-standard chromosomes even before running base recalibration by providing the standard chromosomes as a list in -L in the analysis where I am not splitting the chromosomes?
Thank you,
Your guidance is greatly appreciated.
Show us some examples.
This is the script (I am running it on JupyterLab):
Output:
Common Variants:
head(merged_df)
Unique Variants:
thanks but that is not what I meant. Please show us those variants in the VCF files, i want to see the quality, the FILTER, the DEPTH, etc...
for example chr1-14932 on grch38 has usually a very bad quality https://gnomad.broadinstitute.org/variant/1-14932-G-T?dataset=gnomad_r4
furthermore, you'd better use
bcftools isec
to compare VCF files.I did not compare them after filtration. this was as soon as I called the haplotype caller to verify my pipeline's integrity after splitting by chromosomes.
So these are the calls that are common but had different genotypes:
This is the INFO field comparison for these:
When I did base recalibration(BQSR) without splitting the chromosomes, but split only at haplotype caller stage, the vcf files are a 100% match. So I can see that base recalibration is not specific to contigs and uses data across chromosomes. Is this right?
For anyone else with similar issues, I used BQSR to get the recal table for each chromosome but then I did not combine them together and instead used the individual recal table generated for the specific chromosome in the ApplyBQSR step... After combining them into a single recal table using GatherBQSRReports and using the same recal table for ApplyBQSR for each chromosome again, gave me the same results as not splitting by chromosomes.
Cheers