VCF Quality Double for Alternate Loci vs. Reference Loci
11.1 years ago
travcollier ▴ 210

Hi... a relative newbie here

We're trying to generate conservative consensus sequences from HiSeq data, and I've noticed something odd in the mpileup output which I hope someone can explain for me.

The quality scores (QUAL) for loci with a fixed alternate allele appear to be double the scores for loci with the reference allele, even though depth, mapping quality, FQ, and forward/reverse ratios are exactly the same.

For example:

2L      9611    .       C       .       48      .       DP=6;AF1=0;AC1=0;DP4=3,3,0,0;MQ=37;FQ=-45       PL:DP:SP        0:6:0
2L      9612    .       C       A       97.1    .       DP=6;VDB=2.842717e-02;AF1=1;AC1=2;DP4=0,0,3,3;MQ=37;FQ=-45      GT:PL:DP:SP:GQ  1/1:130,18,0:6:0:33
2L      9613    .       A       .       48      .       DP=6;AF1=0;AC1=0;DP4=3,3,0,0;MQ=37;FQ=-45       PL:DP:SP        0:6:0

Why? This makes picking the right thresholds for calling the consensus rather more complicated than it should be. Any insight would be appreciated.
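To make the comparison concrete, here is a minimal Python sketch that parses the three records above and compares QUAL between the reference-only and alternate-allele sites. The helper name `parse_record` and the whitespace-separated layout are assumptions made for illustration (a real VCF is tab-delimited); it is not part of the samtools/bcftools pipeline.

```python
# The three records from the post, whitespace-separated for readability.
RECORDS = """\
2L 9611 . C . 48 . DP=6;AF1=0;AC1=0;DP4=3,3,0,0;MQ=37;FQ=-45 PL:DP:SP 0:6:0
2L 9612 . C A 97.1 . DP=6;VDB=2.842717e-02;AF1=1;AC1=2;DP4=0,0,3,3;MQ=37;FQ=-45 GT:PL:DP:SP:GQ 1/1:130,18,0:6:0:33
2L 9613 . A . 48 . DP=6;AF1=0;AC1=0;DP4=3,3,0,0;MQ=37;FQ=-45 PL:DP:SP 0:6:0
"""


def parse_record(line):
    """Hypothetical helper: pull out the fields compared in the post."""
    fields = line.split()
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
    # INFO is a ';'-separated list of KEY=VALUE pairs.
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "dp": int(info_map["DP"]),
        "mq": int(info_map["MQ"]),
        "fq": float(info_map["FQ"]),
    }


records = [parse_record(line) for line in RECORDS.splitlines() if line.strip()]

# ALT == "." marks a site called as reference-only.
ref_sites = [r for r in records if r["alt"] == "."]
alt_sites = [r for r in records if r["alt"] != "."]

for r in records:
    kind = "ref" if r["alt"] == "." else "alt"
    print(f'{r["chrom"]}:{r["pos"]} {kind} QUAL={r["qual"]} '
          f'DP={r["dp"]} MQ={r["mq"]} FQ={r["fq"]}')

ratio = alt_sites[0]["qual"] / ref_sites[0]["qual"]
print(f"QUAL ratio (alt / ref): {ratio:.2f}")
```

DP, MQ, and FQ are identical across the three sites, while QUAL for the alternate site is roughly twice that of its neighbours.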

BTW: We're mapping to a reference with BWA + stampy, and the mpileup+bcftools commands are just:

samtools mpileup -C 50 -DSBuf reference.fa markdup.bam > mpileup.bcf
bcftools view -cg mpileup.bcf > mpileup.vcf

I may not understand your experiment, but I would treat all three of the loci you show as suspect (i.e. of dubious quality), since the read depth for all three is only 6 and the GQ score for the middle one is 33 (i.e. very low). What do quality scores look like for more robust variants (i.e. those with >30x read depth and GQ = 99)?
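For reference, assuming QUAL and GQ are Phred-scaled as the VCF specification defines them (Q = -10 * log10(P(error))), the scores mentioned in this thread can be converted to error probabilities with a short Python sketch. The function name `phred_to_error_prob` is hypothetical and used only for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


# Scores taken from the records and the comment above.
scores = [("GQ=33", 33), ("QUAL=48", 48), ("QUAL=97.1", 97.1), ("GQ=99", 99)]

for label, q in scores:
    print(f"{label}: P(error) ~ {phred_to_error_prob(q):.2e}")
```

Because the scale is logarithmic, a doubling of QUAL corresponds to squaring the error probability rather than halving it.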


Yes, these are all marginal variants. If I had high coverage and uniformly high GQ scores, having a meaningful QUAL wouldn't matter. We are doing population genetics, so we need more samples and therefore have to deal with lower coverage. Worse still, we're not even working on a model species, much less humans. Our reference genome is decent in many ways, but is based on a hybrid strain which does not exist in nature (or even the lab at this point)... so biasing calling towards the reference alleles makes no sense.
