Hi all, I have a VCF that I made by following the GATK Best Practices workflow, and I filtered genotypes with low GQ (< 20). However, I understand that they are not removed; instead they are tagged as "FILTER_GQ-20" in my VCF.
gatk VariantFiltration \
-V all_jointcalls_sRecal_allPASS_PP.vcf \
-G-filter "GQ < 20" -G-filter-name "FILTER_GQ-20" \
-O all_jointcalls_sRecal_allPASS_PP2.vcf
I tried to remove all rows with FILTER_GQ-20 by doing a simple grep:
cat all_jointcalls_sRecal_allPASS_PP2.vcf | grep -v "FILTER_GQ-20" > all_jointcalls_sRecal_allPASS_GQ20orhiger.vcf
Then I checked how many good ones (GQ >= 20) remain:
cat all_jointcalls_sRecal_allPASS_GQ20orhiger.vcf | wc -l
212298
This seems way low when compared to the original vcf from Genotype Posteriors:
all_jointcalls_sRecal_allPASS_PP2.vcf which has 3598528 variants.
So my questions are:
How do I remove those variants with FILTER_GQ-20 tags properly, in a GATK way, if a simple Unix command did not do the job right? I checked SelectVariants, but I don't think excluding by filter is right here. I also looked at the other exclude options, but none seem right for what I need to do, hence the post!
Do I need to be worried about the low number passing the GQ filter? This is WES data.
Is it even necessary to remove them for downstream analysis like VariantAnnotator or Funcotator?
Also, on another note: is it an absolute requirement to have a ped file for annotation and Funcotator?
Thank you in advance.
Please have a look at the file itself. See if something is wrong (bad expression, variants badly filtered). Don't count the number of variants without excluding the header. Count the variants before and after filtering, etc.
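As a rough illustration of that check, here is a small sketch on a made-up two-sample VCF (the sample names S1/S2 and all three records are hypothetical). It assumes the genotype filter name ends up in the per-sample FT field, which is where VariantFiltration writes genotype-level filters. It counts records without the header and shows that a whole-line `grep -v` drops a site as soon as any one sample carries the tag.

```shell
# Build a tiny two-sample VCF (made-up data; fields are tab-separated).
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ:FT\t0/1:99:PASS\t0/1:45:PASS\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:GQ:FT\t0/1:99:PASS\t0/0:12:FILTER_GQ-20\n'
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:GQ:FT\t0/0:5:FILTER_GQ-20\t1/1:60:PASS\n'
} > "$vcf"

# Count variant records only: every header line starts with '#'.
total=$(grep -vc '^#' "$vcf")

# Records that survive a whole-line grep -v on the tag.
kept=$(grep -v '^#' "$vcf" | grep -vc 'FILTER_GQ-20')

# Individual genotypes (sample columns 10 onward) that carry the tag.
tagged_gt=$(awk -F'\t' '!/^#/ { for (i = 10; i <= NF; i++) if ($i ~ /FILTER_GQ-20/) n++ } END { print n + 0 }' "$vcf")

echo "records: $total, records left after grep -v: $kept, tagged genotypes: $tagged_gt"
rm -f "$vcf"
```

In this toy file only two genotypes are tagged, yet two of the three sites disappear, including good genotypes in the other sample. With many samples in a joint-called cohort, most sites are likely to have at least one tagged genotype, which would explain a large drop in the record count.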
huhh ?
I do like gatk but bcftools is fine and faster.
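For reference, a minimal sketch of how the tagged genotypes could be masked rather than whole rows deleted. It assumes GATK4's SelectVariants with its `--set-filtered-gt-to-nocall` option and, as an alternative, `bcftools filter` with `-S .`; check both against the `--help` of your installed versions. The output file names are hypothetical, and the script skips itself if the tool or the input file is missing.

```shell
in_vcf="all_jointcalls_sRecal_allPASS_PP2.vcf"
out_vcf="all_jointcalls_sRecal_allPASS_PP2_GQ20masked.vcf"   # hypothetical output name

if [ -f "$in_vcf" ] && command -v gatk >/dev/null 2>&1; then
    # Turn genotypes that failed a genotype filter into no-calls (./.);
    # the site itself is kept, so other samples' genotypes are untouched.
    gatk SelectVariants \
        -V "$in_vcf" \
        --set-filtered-gt-to-nocall \
        -O "$out_vcf"
    # Record count without the header, to compare with the input.
    grep -vc '^#' "$out_vcf"
elif [ -f "$in_vcf" ] && command -v bcftools >/dev/null 2>&1; then
    # bcftools alternative: set genotypes with GQ < 20 to missing.
    bcftools filter -e 'FMT/GQ<20' -S . -o "$out_vcf" "$in_vcf"
    bcftools view -H "$out_vcf" | wc -l
else
    echo "skipped: need $in_vcf plus gatk or bcftools on PATH" >&2
fi
```

Either way the number of records should stay the same as in the input; what changes is how many genotypes are called at each site.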
Your other questions depend on what you want to do with your data.
Hi Pierre, thank you for taking the time to reply!
I did take a look at the file before filtering and after filtering. Yes, I had counted without the header. The reason you are seeing the one-liner in my earlier post without a grep -v "##" is that when we use grep to filter VCF files, the header following ## is not retained. But it is the same number in the output:
(well, minus 1 here because this includes the header line starting with CHROM, POS, etc.)
I was hesitant to use bcftools options to filter and thought GATK might have a way of doing this, hence the post. I guess I will have to try bcftools and see if that works for me. Thank you again!