I am preparing admixture analyses involving the Neanderthal and Denisovan genomes, and I have downloaded extended (i.e., generally not VCFtools-friendly) VCF files (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ and http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/). The files were originally made with GATK, but the authors greatly modified the files; thus, they aren't standard VCFs.
I've got them down to just the sites that I have modern data for, so they are only a fraction as massive as the original files.
A few hundred sites have been marked LowQual, and I have removed them using grep -v; however, LowQual reflects more than just Genotype Quality (i.e., LowQual sites have GQs as high as 59, but for the Neanderthal 8,767 of 511,858 non-LowQual sites have GQ<60).
I was wondering what would be a good GQ cut-off to use for the non-LowQual lines?
Also, any suggestions on how to filter them? (Remember I cannot use VCFtools or GATK or anything similar due to the non-standard formatting)
Here is an example line; its genotype fields, including GQ, are broken out below it:
1 5031561 rs7518523 A G 909.02 . AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1 GT:DP:GQ:PL:A:C:G:T:IR 1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0
GT 1/1
DP 24
GQ 72.23
PL 942,72,0
A 0,0
C 0,0
G 10,14
T 0,0
IR 0
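Since the FORMAT column names the per-sample fields in order, GQ can be pulled out by position without VCFtools or GATK. The following is only a minimal Python sketch under stated assumptions: the function names `genotype_quality` and `filter_by_gq` are hypothetical, columns are assumed to be whitespace- or tab-separated with a single sample in column 10, and `min_gq` is left as a parameter because the appropriate cut-off is exactly the open question here.

```python
def genotype_quality(line):
    """Return the first sample's GQ as a float, or None if it cannot be read."""
    fields = line.split()  # VCF is tab-delimited; split() also copes with spaces
    if len(fields) < 10:
        return None
    keys = fields[8].split(":")    # FORMAT column, e.g. GT:DP:GQ:PL:A:C:G:T:IR
    values = fields[9].split(":")  # first (assumed only) sample column
    if "GQ" not in keys:
        return None
    idx = keys.index("GQ")
    if idx >= len(values):
        return None
    try:
        return float(values[idx])
    except ValueError:
        return None


def filter_by_gq(lines, min_gq):
    """Yield header lines plus records that are not LowQual and have GQ >= min_gq."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header lines untouched
            continue
        fields = line.split()
        if len(fields) > 6 and fields[6] == "LowQual":
            continue  # FILTER column marks the site as LowQual
        gq = genotype_quality(line)
        if gq is not None and gq >= min_gq:
            yield line


# The example record from this post, rebuilt with tab separators.
EXAMPLE = "\t".join([
    "1", "5031561", "rs7518523", "A", "G", "909.02", ".",
    "AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;"
    "MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;"
    "ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;"
    "OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1",
    "GT:DP:GQ:PL:A:C:G:T:IR",
    "1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0",
])

print(genotype_quality(EXAMPLE))
print(len(list(filter_by_gq([EXAMPLE], min_gq=60))))
```

To apply this to a whole file, the lines of an open file handle could be passed to `filter_by_gq` and the yielded lines written back out. The 60 used in the last line only mirrors the GQ<60 count mentioned above and is not a recommended threshold.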
Can you or can you not use GATK/VCFtools?