Question about GATK hardfiltering conditions
7.8 years ago
bitpir ▴ 250

Hello,

GATK recommends the following filter expression for hard filtering of VCFs: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

The website explains where the threshold in each condition comes from. However, I'm still trying to understand why the recommended filter expression uses "OR" instead of "AND"; i.e., why do we treat variants that satisfy "QD < 2.0 OR FS > 60.0 OR MQ < 40.0 etc." rather than "QD < 2.0 AND FS > 60.0 AND MQ < 40.0 etc." as the variants that pass the filter?

Also, how important are DP and GQ when filtering for good variants? Is it necessary to do further filtering after what GATK recommends? Will this cause over-filtering?

I only have 5 samples, which is why I am not using VQSR for the filtering step. Thanks a lot for your help!

gatk hardfiltering • 2.8k views
7.8 years ago
Vivek ★ 2.7k

That GATK filter expression does not describe the variants that PASS; you write it to describe the variants that FAIL. Whatever remains is tagged as PASS.


Thank you for the clarification! Using the OR option, variants that pass the filter will have QD >= 2.0 and FS <= 60.0 and MQ >= 40.0 and MQRankSum >= -12.5 and ReadPosRankSum >= -8.0.
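To make the relationship between the FAIL expression and the PASS set concrete, here is a minimal Python sketch of the logic. It is not GATK code: the function names and the example annotation values are hypothetical, every annotation is assumed to be present, and how GATK treats variants with missing annotations is not modeled. Only the thresholds come from the expression quoted in the question. Note that the exact negation of `<` is `>=` and of `>` is `<=`.

```python
def fails_hard_filter(v):
    """True if the variant matches the FAIL expression (any one condition is enough)."""
    return (
        v["QD"] < 2.0
        or v["FS"] > 60.0
        or v["MQ"] < 40.0
        or v["MQRankSum"] < -12.5
        or v["ReadPosRankSum"] < -8.0
    )


def passes_all_conditions(v):
    """The negation of the FAIL expression: every condition must be clean."""
    return (
        v["QD"] >= 2.0
        and v["FS"] <= 60.0
        and v["MQ"] >= 40.0
        and v["MQRankSum"] >= -12.5
        and v["ReadPosRankSum"] >= -8.0
    )


# Hypothetical annotation values, made up for illustration only.
variants = {
    "clean": {"QD": 18.0, "FS": 2.1, "MQ": 60.0, "MQRankSum": 0.3, "ReadPosRankSum": -0.5},
    "low_qd": {"QD": 1.4, "FS": 3.0, "MQ": 59.0, "MQRankSum": 0.0, "ReadPosRankSum": 0.2},
    "low_mq": {"QD": 15.0, "FS": 1.0, "MQ": 22.0, "MQRankSum": -1.0, "ReadPosRankSum": 0.1},
}

for name, v in variants.items():
    status = "FAIL" if fails_hard_filter(v) else "PASS"
    # By De Morgan's law the two functions always disagree.
    assert passes_all_conditions(v) == (not fails_hard_filter(v))
    print(name, status)
```

Writing the FAIL set with OR and the PASS set with AND describes the same partition of the variants, just from opposite sides.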

7.8 years ago
tpoterba ▴ 50

You have several questions in your post, but I can't answer all of them now. GATK uses OR in this expression because these cutoffs are designed to target independent error modes. For example, a variant in a mismapped homologous region will have a low MQ, but probably won't have an FS (Fisher Strand) value > 60. If you use AND in this expression, you probably won't filter out much at all.
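As a rough illustration of that point, the sketch below applies the same conditions to a few toy variants, once combined with OR and once with AND. The variant names and annotation values are invented for the example, and only three of the five conditions are used to keep it short; it is not GATK code.

```python
# Toy variants with made-up annotation values; each bad one shows a single error mode.
variants = [
    {"id": "clean", "QD": 20.0, "FS": 1.0, "MQ": 60.0},
    {"id": "mismapped", "QD": 15.0, "FS": 2.0, "MQ": 25.0},     # low MQ only
    {"id": "strand_bias", "QD": 12.0, "FS": 85.0, "MQ": 59.0},  # high FS only
    {"id": "low_qd", "QD": 1.2, "FS": 3.0, "MQ": 58.0},         # low QD only
]


def conditions(v):
    """Evaluate each failure condition separately."""
    return [v["QD"] < 2.0, v["FS"] > 60.0, v["MQ"] < 40.0]


# OR: a variant is filtered if any single condition fires.
filtered_or = [v["id"] for v in variants if any(conditions(v))]
# AND: a variant is filtered only if every condition fires at once.
filtered_and = [v["id"] for v in variants if all(conditions(v))]

print("filtered with OR: ", filtered_or)
print("filtered with AND:", filtered_and)
```

In this toy set, OR removes all three problematic variants while AND removes none, because no single variant trips every threshold at the same time.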

Other than this, a lot of QC is dataset-dependent. Are your samples PCR-free or PCR+? What is your coverage? Genome or exome sequencing? These questions and more should certainly influence your choices. Why aren't you using VQSR, though? Assuming your samples were joint-called with a larger cohort upstream and VQSR run on that larger sample set, VQSR is probably one of your best bets to do effective QC (if I'm wrong about this, ignore it).


Thank you for the answer! I get it now: the OR option will filter out more variants than the AND option. The filter expression is for variants that fail, as Vivek mentioned. My samples were not joint-called with a larger cohort upstream, and therefore VQSR is not possible for my current sample set.
