Question

Dp And Dp4 In Vcf

1

Entering edit mode

12.9 years ago

Juliofdiaz ▴ 140

I'm confused about some numbers I get in a VCF file. This is an example:

gi|218888746|ref|NC_011770.1|    214    .    G    C    202    .    DP=45;AF1=1;AC1=2;DP4=0,0,25,13;MQ=55;FQ=-141    GT:PL:GQ    1/1:235,114,0:99

The raw depth coverage at this position is 53, as seen in samtools depth and confirmed in IGV, and DP=45, so I guess that 8 of the 53 reads are below the sequencing quality threshold set by samtools, which I think is q20 (Please correct me if this is wrong), My actual confusion comes when I look at the DP4 numbers which totals 38, so if DP is already quality filtered, shouldn't DP equal the sum of DP4 values? Otherwise, what is the different the "high-quality" filtering to get to the DP4 values and the filtering to get to the DP value.

Thanks

samtools • 15k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 12.9 years ago by Juliofdiaz ▴ 140

Ram · Answer 1 · 2016-02-12

OK. Now the question is totally clear. samtools mpileup conducted too much filtering, therefore, the depth from the mpileup usually far less than samtools depth.

Mapping quality (samtools mpileup -q)
Base quality (samtools mpileup -Q)
Discard anomalous read pairs (samtools mpileup -A)

After remove these filtering, the number of depth from these two method will be exactly same.

samtools depth <SORTEDBAM>
samtools mpileup -A -q 1 -Q 1 <Reference> <SORTEDBAM>

Ram · Answer 2 · 2012-06-05

2

Entering edit mode

12.9 years ago

Zev.Kronenberg 12k

From samtools:

link

DP = The number of reads covering or bridging POS.

DP4 = Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.

I don't know why you have a discrepancy in what samtools depth is telling you and what mpileup is telling you. Are you using data generated from the exact same BAM?

Did you set a quality threshold when you did samtools depth?

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 12.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Yes, I am using the exact sameBAM. This is the command I use for depth

samtools depth <SORTEDBAM>

And this is the command I use for the VCF

samtools mpileup -uf <REFFILE> <SORTEDBAM>

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 12.9 years ago by Juliofdiaz ▴ 140

0

Entering edit mode

Besides trying to gain general knowledge, what are goal of your project? Are you just trying to call SNPs? I ask because maybe there is a tool biostar could recommend.

ADD REPLY • link 12.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Yes, I am trying to call high quality SNPs from a bacterial genome. I am trying to do this because I want to do GATK's quality recalibration, and you need a set of known SNPs (there is no SNP database for my organism of interest). Besides trying to fully understand the data that is coming out of SAMTOOLs. Thanks

ADD REPLY • link 12.9 years ago by Juliofdiaz ▴ 140

0

Entering edit mode

Will you be using the same dataset to generate the list of SNPs and then use GATK to go back over and call them?

ADD REPLY • link 12.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

I have about 30 genomes of the same strain so I will pool the high quality SNPS from those, and select only the ones that are present in all genomes. Besides I will include some of the known SNPs (kind of manually). This may introduce a small bias but to recalibrate the alignment quality scores it shouldnt be too bad.

ADD REPLY • link 12.9 years ago by Juliofdiaz ▴ 140

0

Entering edit mode

Samtools can take all 30 genomes simultaneously to help it make good calls for low coverage regions. You could just give samtools all the BAMs at once. Of course I am biased since I only use Samtools for variant calling. I wouldn't use the same genomes to inform GATK.

ADD REPLY • link 12.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

At this point Im not too worried about calling low coverage regions because I will call those for each genome using GATK;s SNP calling guidelines. At this point I just want to call the SNPs that are highly reliable. You are right, using SNPs from the same genomes is not the best approach but the lack of a dbSNP for my organism leaves little choice. My post was more directed towards understanding why my depth is different than my DP, which is different than the sum of DP4 values.

ADD REPLY • link 12.9 years ago by Juliofdiaz ▴ 140