Question

(samtools) Filtering and interpreting >100 QUAL scores

0

Entering edit mode

8.9 years ago

umn_bist ▴ 390

I am in the process of filtering/selecting variants via samtools/bcftools for sample RNA sequence. As the title indicates I am getting variant calls (from my pileup) that have QUAL scores >100. I understand that there is not a theoretical limit to phred-scaled quality score but I wanted to make sure I am interpreting it correctly.

Question 1. I understand that QUAL column in a vcf file denotes Phred-scaled probability of all samples being homozygous reference and MQ (mapping quality score) quantify the probability that a read is misplaced. Does this mean that a LOW QUAL and HIGH MQ scores mean the alternative sequence is likely a true variant?

Question 2. I am wondering how you guys have dealt with QUAL scores >100 for annotating purposes?

EXAMPLE:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT
chr1 150898915   .      C      T    113.066    PASS    DP=17;VDB=0.915697;SGB=-0.636426;MQ0F=0;AF1=1;AC1=2;DP4=0,0,7,0;MQ=50;FQ=-47.986    GT:PL    1/1:146,21,0
chr1 150898931   .      T      C    207.999    PASS    DP=31;VDB=0.000846843;SGB=-0.692067;MQSB=0.944444;MQ0F=0;AF1=1;AC1=2;DP4=0,0,18,2;MQ=49;FQ=-86.9865    GT:PL    1/1:241,60,0

pileup vcf phred-scaled-QUAL samtools • 9.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by umn_bist ▴ 390

1

Entering edit mode

How did you call variants?

ADD REPLY • link 8.9 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

I called the variant using samtools mpileup and bcftool call options.

samtools mpileup -uf </path/to/ref.fasta> </path/to/sorted.bam> -l </path/to/region.bed> | bcftools call -vc - > </path/to/pileup.vcf>

Thank you for your help.

ADD REPLY • link 8.9 years ago by umn_bist ▴ 390

1

Entering edit mode

samtools/bcftools > v1.0 or < v1.0?

ADD REPLY • link 8.9 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

This was done with samtools/bcftools v1.3.

ADD REPLY • link 8.9 years ago by umn_bist ▴ 390

score 6 · Accepted Answer · 2016-01-25

You are invoking the old variant calling method with -c. So, unless I'm misunderstanding something, the QUAL field is still Phred-scaled probability of there being a non-homozygous reference call at that position. It also appears that this number takes other factors into account and can be quite large. Here's a comment from the GATK forums from a couple of years ago outlining this within their framework:

You may be thinking of Phred-scaled base qualities, which do indeed tend to be capped at certain values by convention. The various likelihood metrics emitted by GATK, which are also Phred-scaled, can easily go up into the thousands. The important thing to understand is that they are relative, not absolute values. So it does depend very much on the particular dataset you are working with.

The QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available (=depth of coverage at the site) (because we are more confident when we have more observations to rely on), the quality of the mapping of the reads and alignment of the bases (because if we are not sure the bases observed really belong there, they do not contribute much to our confidence), and the quality of the base calls (because if they look like machine errors, they also do not contribute much to our confidence). Filtering on the QUAL value should be done by adjusting the emit and call thresholds at the calling step.

GQ is a very different metric; it's not about whether the site displays variation. GQ tells you whether, given a site for which there is variation in the population, each sample has been assigned the correct genotype.

Regarding filtering, have a look at our Best Practices document; the very last section gives some base recommendations for hard-filtering if you cannot perform variant recalibration (VQSR).

Variant qualities, like GQ etc., take a lot of this information into account. If you are dealing with large datasets, you can be a bit more liberal on your filtering. But, in your case, a high mapping quality just means that a read has a higher probability of being appropriately placed, but a low QUAL means that there is not a lot of confidence that there is variation there for a variety of possible reasons.

I realize that there have been a couple of other interpretations floating around online, so I would love it if someone would correct me if I'm interpreting this incorrectly. I think what I'm saying is at least consistent with the latest VCF specification.