Question

What is a bad GATK Genotype Quality?

5

9.9 years ago

devenvyas ▴ 770

I am preparing admixture analyses involving the Neanderthal and Denisovan genomes, and I have downloaded extended (=generally non-vcftools friendly) VCF files (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ and http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/). The files were originally made with GATK, but the authors greatly modified the files; thus, they aren't standard VCFs.

I've got them down to just the sites that I have modern data for, so they are only a fraction of the size of the original files.

A few hundred sites have been marked LowQual and have been removed using grep -v; however, LowQual reflects more than just Genotype Quality (i.e., LowQual sites have GQs as high as 59, but for the Neanderthal 8,767 of 511,858 non-LowQual sites have GQ<60).

I was wondering what would be a good GQ cut-off to use for the non-LowQual lines?

Also, any suggestions on how to filter them? (Remember I cannot use VCFtools or GATK or anything similar due to the non-standard formatting)

Here is an example line; GQ is shown in the subsequent code block.

1    5031561    rs7518523    A    G    909.02    .    AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1    GT:DP:GQ:PL:A:C:G:T:IR    1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0

GT 1/1
DP 24
GQ 72.23
PL 942,72,0
A 0,0
C 0,0
G 10,14
T 0,0
IR 0

quality SNP • 13k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.9 years ago by devenvyas ▴ 770

0

(Remember I can used VCFtools or GATK or anything similar due to the non-standard formatting)

can or cannot use GATK/vcftools?

ADD REPLY • link 2.4 years ago by Ram 45k

0

I cannot use either of them

ADD REPLY • link 9.9 years ago by devenvyas ▴ 770

5

9.9 years ago

vdauwera ★ 1.2k

The most important point here is to understand the difference between variant site-level (INFO) quality and sample-level (genotype/FORMAT) quality. Depending on what you're trying to learn from your data, the GQ may or may not matter. GQ describes how sure we are that we have the right genotype; for high-quality variant sites that have made it past INFO-level filtering, a low GQ just means we're confident there is variation at the site -- we're just not sure whether that variation is in the heterozygous or homozygous-variant form. Like I said, depending on what you're studying, that may or may not matter.
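To make the genotype-level point concrete, here is a minimal Python sketch using the PL value from the example line in the question. It assumes a biallelic site with PL ordered as 0/0, 0/1, 1/1 (the usual GATK convention) and that GQ roughly tracks the gap between the two smallest PL values. The helper name `second_best_genotype` is made up for illustration and is not part of GATK.

```python
# Hypothetical helper, not part of GATK: find the runner-up genotype from PL.
# Assumption: biallelic site, PL ordered as 0/0, 0/1, 1/1.
GENOTYPES = ["0/0", "0/1", "1/1"]

def second_best_genotype(pl_field):
    """Return (best, runner_up, gap) for a comma-separated PL string."""
    pls = [int(x) for x in pl_field.split(",")]
    # Indices of the genotypes, sorted from most to least likely (lowest PL first).
    order = sorted(range(len(pls)), key=lambda i: pls[i])
    best, runner_up = order[0], order[1]
    return GENOTYPES[best], GENOTYPES[runner_up], pls[runner_up] - pls[best]

best, runner_up, gap = second_best_genotype("942,72,0")
print(best, runner_up, gap)  # prints: 1/1 0/1 72
```

For the example line the runner-up is the heterozygous call, and the gap of 72 is close to the reported GQ of 72.23. A low GQ at such a site would therefore mean doubt about het versus hom-alt, not doubt that the site varies.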

The second most important point is that it sounds like whoever prepared the files only used the built-in QUAL-based filtering, not a proper filtering method like variant recalibration (VQSR). Rather than focusing on GQ, you should look into applying proper filtering at the site level.

My recommendation would be to figure out how to MacGyver this poorly formatted VCF file into shape so you can use GATK or other well-designed tools for filtering, rather than putting effort into working with it as it stands.

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.9 years ago by vdauwera ★ 1.2k

2

They are not poorly formatted VCF files. They use an extended, non-standard format (i.e., non-standard != poor). For example, the standard GATK format is not amenable to triallelic sites; they had sites that are biallelic in modern humans, but the Neanderthal/Denisovan was heterozygous with a third allele. (http://www.sciencemag.org/content/suppl/2012/08/29/science.1224344.DC1/Meyer.SM.pdf pp. 16-20; http://www.nature.com/nature/journal/v505/n7481/extref/nature12886-s1.pdf p. 14). I've contacted one of the creators of the files in the past, and he has said that trying to use GATK/vcftools with these files would not be a good idea and that python or pysam would be the best way to go.

You are jumping to a lot of conclusions about the filtering. Based on the Meyer link above, they used more than what you think with multiple iterations of genotyping. (These files are from large scale ancient DNA genome projects, they are not going to be that sloppy).

Given that the VCFs are from ancient DNA, GQ is probably important (and there are some analyses in those supplemental docs indicating that lower GQ values have biased some dating analyses).

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 9.9 years ago by devenvyas ▴ 770

Ram · Accepted Answer · 2015-06-26

Since I didn't get any suggestions, I looked into how this dataset has been used recently in the literature.

I found sources by Qin and Stoneking (dx.doi.org/10.1093/molbev/msv141) and Lazaridis et al. (dx.doi.org/10.1038/nature13673) (the former cites the latter), which suggest filtering out the LowQual sites as well as sites with GQ < 30 and QUAL < 50.

Just thought to pass this along for anyone else using these datasets.
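For anyone who wants to apply these cut-offs without VCFtools or GATK, a rough Python sketch follows. It assumes tab-separated lines with a single sample column, that flagged sites carry `LowQual` in the FILTER column, and that GQ is listed in the FORMAT column. The function names and the decision to drop lines with a missing or non-numeric QUAL/GQ are illustrative choices, not anything taken from the cited papers.

```python
def passes(line, min_gq=30.0, min_qual=50.0):
    """Return True if a single-sample VCF data line survives the cut-offs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return False  # assumption: short or malformed lines are dropped
    qual, filt, fmt, sample = cols[5], cols[6], cols[8], cols[9]
    if "LowQual" in filt.split(";"):
        return False
    # Pair each FORMAT key (GT, DP, GQ, ...) with its value in the sample column.
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    try:
        return float(qual) >= min_qual and float(fields["GQ"]) >= min_gq
    except (KeyError, ValueError):
        return False  # assumption: missing or non-numeric QUAL/GQ fails

def filter_lines(lines, **kwargs):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes(line, **kwargs):
            yield line

# The example line from the question, with the INFO column trimmed for brevity.
example = "\t".join([
    "1", "5031561", "rs7518523", "A", "G", "909.02", ".",
    "AC=2;AF=1.00;AN=2;DP=24",
    "GT:DP:GQ:PL:A:C:G:T:IR",
    "1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0",
])
print(passes(example))  # prints: True
```

To run it on a real file, pass an open file handle to `filter_lines` and write the yielded lines to the output; the thresholds can be adjusted through `min_gq` and `min_qual`.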