Question

Small overlap of 1KG Sample file with dbSNP and same sample from phased file

0

Entering edit mode

23 months ago

berndmann ▴ 10

I'm looking into variant calling right now and still can't understand the results of my VCF output.

My steps so far are the following:

Step 1: Download all necessary files:

Download the sample from phase3 1KG : HG02024

Download the related reference genome file : hs37d5.fa.gz

Download the already phased samples file for chr1 : Phased file of phase 3 containing related individuals

Download a dbSNP version that is at least dbSNP 135. I used dbSNP 151

Step 2: Call the variants of chr1 of HG02024 I used the following simple calling pipeline in R with the HaploTypeCaller from GATK

call = str_glue(' python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller --native-pair-hmm-threads 24 -R  {reference}/GRCh37/references_hs37d5_hs37d5.fa -I HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.bam -L 1 -O HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.vcf -ERC GVCF')
system(call)

This resulted in a file that contains 57 million positions. I still believe that these are to much or at least those can't be all SNPs right? One would expect something around less then 9 million snps I would say. So roughly 8% (share of chr 1 on the full human chr) of 113,862,023 SNPs in dbSNP 151

Step 3: Compare the variants with dbsnp and the phased file

I wrote a short bash script to do this :

#!/bin/bash

#Check if the required number of arguments is provided
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <input_vcf> <dbsnp_vcf> <output_vcf>"
    exit 1
fi

input_vcf=$1
dbsnp_vcf=$2
output_vcf=$3

#Extract positions from the input VCF file
grep -v "^#" "$input_vcf" | awk '{print $1 "\t" $2}' > input_positions.txt

#Subset the dbSNP VCF file to overlap with the input VCF file
awk 'FNR==NR {a[$1,$2]; next} !($1,$2) in a' input_positions.txt "$dbsnp_vcf" > "$output_vcf"

#Clean up the temporary file
rm input_positions.txt

echo "Reduced dbSNP VCF file saved as $output_vcf"

When I compare the overlap between dbSNP and the called file I observe an overlap of 13,268,727 SNPS. The overlap between the called file and the phased file I observed was only 1,963,447 SNPs out of 6,468,094 SNPs.

What is the reasoning that the overlap between the called file and the phased file is that low?

EDIT: I also tried to call the 30x bam file of the sample and perform a liftover. Thereby the overlap between this called and lifted vcf and the phase file was even lower (932,179 SNPs).

HaploTypeCaller 1KG dbSNP • 619 views

ADD COMMENT • link 23 months ago by berndmann ▴ 10