Question

Comparison Of Two Versions Of GATK Filtering For Exome Sequencing Data

12.6 years ago

michealsmith ▴ 800

I recently read several papers that identify disease-causing mutations from exome sequencing data using GATK. Most of their pipelines read something like:

Call SNP/indel using GATK UnifiedGenotyper. All calls with a read coverage ≤2× and a Phred-scaled SNP quality of ≤20 were filtered out.

So I'm curious which GATK filtering step applies the read depth ≤2 and quality ≤20 cutoffs. (GATK simply has too many steps!)

I compared the filters recommended for GATK2 and GATK3.

GATK2:

For exomes with deep coverage per sample
DATA_TYPE_SPECIFIC_FILTERS should be "QUAL < 30.0 || QD < 5.0 || HRun > 5 || SB > -0.10"

GATK3:

For SNPs
DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0". 

For Indels
DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0".

I would guess that most published papers used GATK2, so those of us who read papers based on GATK2 but use GATK3 for our own research may get confused. Let me check my understanding:

MQ < 40.0 in GATK3 is equivalent to QUAL < 30.0 in GATK2, and this is mapping quality filter, right?

QD < 2.0 in GATK3 is equivalent to QD < 5.0 in GATK2, and this is the read-depth filter, right?

FS > 60.0 in GATK3 is equivalent to SB > -0.10 in GATK2, and this is strand bias filter, right?

thx

score 2 · Answer 1 · 2012-04-11

I believe this filtering is done on the VCF files: variants with DP<=2 or QUAL<=20 were removed. I'm not sure what GATK2 and GATK3 have to do with what you describe from the paper.
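To make the DP/QUAL cutoff concrete, here is a minimal Python sketch of that kind of post-hoc VCF filtering. It is not what any particular paper ran: the function names and the three example records are made up for illustration, and it assumes the depth is stored as `DP` in the INFO column and that records missing QUAL or DP are simply dropped.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=14;MQ=58.2;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value  # flag entries without '=' map to an empty string
    return info


def passes_paper_filter(vcf_line, min_depth=2, min_qual=20.0):
    """Keep a record only if DP > min_depth and QUAL > min_qual.

    Equivalent to removing calls with DP <= 2 or QUAL <= 20.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]               # column 6 of a VCF record is QUAL
    info = parse_info(fields[7])   # column 8 is INFO
    if qual == "." or "DP" not in info:
        return False  # assumption: records lacking QUAL or DP are discarded
    return int(info["DP"]) > min_depth and float(qual) > min_qual


# Hypothetical records: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
records = [
    "chr1\t1000\t.\tA\tG\t55.3\t.\tDP=14;MQ=58.2",
    "chr1\t2000\t.\tC\tT\t18.0\t.\tDP=30;MQ=60.0",  # fails on QUAL
    "chr1\t3000\t.\tG\tA\t99.0\t.\tDP=2;MQ=60.0",   # fails on DP
]
kept = [r for r in records if passes_paper_filter(r)]
```

In practice you would read the lines from the VCF file and skip the header lines that start with `#`.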

As for your questions, QUAL and MQ are two different concepts: QUAL is the quality of the variant call, while MQ is the mapping quality of the reads. QD is "quality by depth", not a read-depth filter, and I do not think its definition changes between versions. As for FS and SB, I do not have a good answer as to whether these are equivalent or not.
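A small worked example of QD may help. This is a simplified sketch that treats QD as QUAL divided by read depth for a single sample; GATK's own calculation has more detail, and the numbers below are invented. The two cutoffs are the ones quoted in the question.

```python
def quality_by_depth(qual, depth):
    """Simplified QD: variant QUAL normalized by read depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return qual / depth


# Hypothetical call: QUAL 90 supported by 30 reads
qd = quality_by_depth(90.0, 30)

# Same annotation, two different cutoffs from the question
filtered_by_gatk2_cutoff = qd < 5.0   # "QD < 5.0"
filtered_by_gatk3_cutoff = qd < 2.0   # "QD < 2.0"
```

So the same call would be filtered under the stricter cutoff and kept under the looser one, even though QD itself is computed the same way.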

score 1 · Answer 2 · 2012-04-12

FS (Fisher strand) and SB (strand bias) are almost the same, I think.

SB is simply when the variation is only found on the forward or only the reverse strand.

FS is "Phred-scaled p-value using Fisher's Exact Test to detect strand bias in the reads. More bias is indicative of false positive calls." (From the GATK website)
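To illustrate that definition, here is a rough Python sketch of a Phred-scaled Fisher's exact test on a 2x2 table of reference/alternate reads by forward/reverse strand. It follows the quoted description only and is not GATK's implementation; the function names and read counts are assumptions for illustration. It needs Python 3.8+ for `math.comb`.

```python
from math import comb, log10


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    row1 = ref_fwd + ref_rev      # all reference reads
    row2 = alt_fwd + alt_rev      # all alternate reads
    col1 = ref_fwd + alt_fwd      # all forward-strand reads
    denom = comb(row1 + row2, col1)

    def table_prob(x):
        # Hypergeometric probability of x ref-forward reads, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = table_prob(ref_fwd)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one
    p = sum(
        table_prob(x)
        for x in range(lowest, highest + 1)
        if table_prob(x) <= observed * (1 + 1e-7)
    )
    return min(p, 1.0)


def fisher_strand(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scale the p-value: higher means stronger strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10.0 * log10(max(p, 1e-300))  # floor avoids log10(0)


# Hypothetical read counts
fs_balanced = fisher_strand(10, 10, 10, 10)  # alt reads on both strands
fs_biased = fisher_strand(20, 20, 10, 0)     # alt reads only on forward
```

The balanced table gives a score near zero, while the table with the alternate allele seen on only one strand gives a much higher score, which is the situation the SB description above is about.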

score 0 · Answer 3 · 2012-04-16

12.6 years ago

michealsmith ▴ 800

Thanks, guys. GATK is really complicated and confusing for newbies like me.

I think this is an important question for us. We cannot blindly use GATK without knowing what those parameters stand for. I hope someone else can give clear answers.
