I'm looking into variant calling right now and still can't understand the results of my VCF output.
My steps so far are the following:
Step 1: Download all necessary files:
Download the sample from phase3 1KG : HG02024
Download the related reference genome file : hs37d5.fa.gz
Download the already phased samples file for chr1 : Phased file of phase 3 containing related individuals
Download a dbSNP version that is at least dbSNP 135. I used dbSNP 151
Step 2: Call the variants of chr1 of HG02024 I used the following simple calling pipeline in R with the HaploTypeCaller from GATK
call = str_glue(' python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller --native-pair-hmm-threads 24 -R {reference}/GRCh37/references_hs37d5_hs37d5.fa -I HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.bam -L 1 -O HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.vcf -ERC GVCF')
system(call)
This resulted in a file that contains 57 million positions. I still believe that these are to much or at least those can't be all SNPs right? One would expect something around less then 9 million snps I would say. So roughly 8% (share of chr 1 on the full human chr) of 113,862,023 SNPs in dbSNP 151
Step 3: Compare the variants with dbsnp and the phased file
I wrote a short bash script to do this :
#!/bin/bash
#Check if the required number of arguments is provided
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <input_vcf> <dbsnp_vcf> <output_vcf>"
exit 1
fi
input_vcf=$1
dbsnp_vcf=$2
output_vcf=$3
#Extract positions from the input VCF file
grep -v "^#" "$input_vcf" | awk '{print $1 "\t" $2}' > input_positions.txt
#Subset the dbSNP VCF file to overlap with the input VCF file
awk 'FNR==NR {a[$1,$2]; next} !($1,$2) in a' input_positions.txt "$dbsnp_vcf" > "$output_vcf"
#Clean up the temporary file
rm input_positions.txt
echo "Reduced dbSNP VCF file saved as $output_vcf"
When I compare the overlap between dbSNP and the called file I observe an overlap of 13,268,727 SNPS. The overlap between the called file and the phased file I observed was only 1,963,447 SNPs out of 6,468,094 SNPs.
What is the reasoning that the overlap between the called file and the phased file is that low?
EDIT: I also tried to call the 30x bam file of the sample and perform a liftover. Thereby the overlap between this called and lifted vcf and the phase file was even lower (932,179 SNPs).