I'm confused about some numbers I get in a VCF file. This is an example:
gi|218888746|ref|NC_011770.1| 214 . G C 202 . DP=45;AF1=1;AC1=2;DP4=0,0,25,13;MQ=55;FQ=-141 GT:PL:GQ 1/1:235,114,0:99
The raw depth coverage at this position is 53, as seen in samtools depth and confirmed in IGV, and DP=45, so I guess that 8 of the 53 reads are below the sequencing quality threshold set by samtools, which I think is q20 (Please correct me if this is wrong), My actual confusion comes when I look at the DP4 numbers which totals 38, so if DP is already quality filtered, shouldn't DP equal the sum of DP4 values? Otherwise, what is the different the "high-quality" filtering to get to the DP4 values and the filtering to get to the DP value.
OK. Now the question is totally clear. samtools mpileup conducted too much filtering, therefore, the depth from the mpileup usually far less than samtools depth.
DP = The number of reads covering or bridging POS.
DP4 = Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.
I don't know why you have a discrepancy in what samtools depth is telling you and what mpileup is telling you. Are you using data generated from the exact same BAM?
Did you set a quality threshold when you did samtools depth?
Besides trying to gain general knowledge, what are goal of your project? Are you just trying to call SNPs? I ask because maybe there is a tool biostar could recommend.
Yes, I am trying to call high quality SNPs from a bacterial genome. I am trying to do this because I want to do GATK's quality recalibration, and you need a set of known SNPs (there is no SNP database for my organism of interest). Besides trying to fully understand the data that is coming out of SAMTOOLs. Thanks
I have about 30 genomes of the same strain so I will pool the high quality SNPS from those, and select only the ones that are present in all genomes. Besides I will include some of the known SNPs (kind of manually). This may introduce a small bias but to recalibrate the alignment quality scores it shouldnt be too bad.
Samtools can take all 30 genomes simultaneously to help it make good calls for low coverage regions. You could just give samtools all the BAMs at once. Of course I am biased since I only use Samtools for variant calling. I wouldn't use the same genomes to inform GATK.
At this point Im not too worried about calling low coverage regions because I will call those for each genome using GATK;s SNP calling guidelines. At this point I just want to call the SNPs that are highly reliable. You are right, using SNPs from the same genomes is not the best approach but the lack of a dbSNP for my organism leaves little choice. My post was more directed towards understanding why my depth is different than my DP, which is different than the sum of DP4 values.
Yes, I am using the exact sameBAM. This is the command I use for depth
And this is the command I use for the VCF
Besides trying to gain general knowledge, what are goal of your project? Are you just trying to call SNPs? I ask because maybe there is a tool biostar could recommend.
Yes, I am trying to call high quality SNPs from a bacterial genome. I am trying to do this because I want to do GATK's quality recalibration, and you need a set of known SNPs (there is no SNP database for my organism of interest). Besides trying to fully understand the data that is coming out of SAMTOOLs. Thanks
Will you be using the same dataset to generate the list of SNPs and then use GATK to go back over and call them?
I have about 30 genomes of the same strain so I will pool the high quality SNPS from those, and select only the ones that are present in all genomes. Besides I will include some of the known SNPs (kind of manually). This may introduce a small bias but to recalibrate the alignment quality scores it shouldnt be too bad.
Samtools can take all 30 genomes simultaneously to help it make good calls for low coverage regions. You could just give samtools all the BAMs at once. Of course I am biased since I only use Samtools for variant calling. I wouldn't use the same genomes to inform GATK.
At this point Im not too worried about calling low coverage regions because I will call those for each genome using GATK;s SNP calling guidelines. At this point I just want to call the SNPs that are highly reliable. You are right, using SNPs from the same genomes is not the best approach but the lack of a dbSNP for my organism leaves little choice. My post was more directed towards understanding why my depth is different than my DP, which is different than the sum of DP4 values.