Variant Filtration Using GATK With AD And DP
11.0 years ago
ivivek_ngs ★ 5.2k

Dear All,

Has anyone used the VariantFiltration walker of GATK to hard-filter a variant file with JEXL expressions? I am interested in filtering my variants manually using AD (allelic depth) and DP (the depth of reads passing the quality filter). 70% of the bases in my exome data have been read more than 15 times. So, after variant recalibration, I want to filter my variants on the basis of reads that pass the quality filter, keeping DP >= 20 and AD >= 20. I am not sure whether the AD cutoff will be sufficient, but if DP is at least 20, then all my mutations that have been read 20 or more times will be selected. I am not interested in prioritizing my mutations on the basis of the functional and structural impact scores of mutations on proteins as given by Annovar, so I want to filter on DP and AD instead. Has anyone tried this? Any input would be helpful. I am aware that the VariantFiltration walker works with DP, but I am not sure whether it works with AD. I would like some suggestions.

Thanks

gatk exome-sequencing • 11k views
11.0 years ago
Vivek ★ 2.7k

I'm a little confused by your terminology and what you mean by "Mutations being read over 20 times". DP indicates the total number of reads at the variant site and AD indicates the allele depths for reference and alternate alleles. So if you have a DP = 20 at a heterozygous variant, you could have ADs of 10,10.

So you will be losing a subset of mutations when you use a filter like 'AD >= 20 && DP >= 20' and in my experience filtering by AD is unnecessary once you already filter on DP.
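To make the DP versus AD distinction concrete, here is a minimal Python sketch that reads both values from one sample column of a VCF record. The function names and the example genotype string are hypothetical, chosen only to mirror the DP = 20, AD = 10,10 case described above; this is not GATK code.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys (e.g. GT:AD:DP) onto the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def depths(format_field, sample_field):
    """Return (dp, ref_depth, alt_depths) for a single sample."""
    fields = parse_sample(format_field, sample_field)
    ad = [int(x) for x in fields["AD"].split(",")]  # AD is ordered REF, ALT1, ALT2, ...
    return int(fields["DP"]), ad[0], ad[1:]


# Hypothetical heterozygous call: 20 reads in total, 10 supporting each allele.
dp, ref_depth, alt_depths = depths("GT:AD:DP:GQ:PL", "0/1:10,10:20:99:255,0,255")

passes_dp = dp >= 20                   # the site is covered by 20 reads
passes_alt_ad = max(alt_depths) >= 20  # but only 10 of them carry the ALT allele
print(passes_dp, passes_alt_ad)
```

A well-supported heterozygous site like this one passes the DP cutoff but fails an AD cutoff of 20, which is the subset of calls that the combined filter would discard.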


Yes, sorry, the way I wrote it was not correct. What I meant is that the number of reads at the variant site that passed the quality filter should be at least 20, so DP >= 20. It is not the mutations themselves: it is a site carrying a mutation that has been read 20 or more times and passed the quality filter, which we can judge from the number of reads shown in DP. So filtering by DP would be ideal, but as you said, for a heterozygous variant I would lose calls whose AD values are 10. So if I filter only by DP, it should work well, right? My idea is to keep only the variant sites that have been read at least 20 times (preferably counting the ALT reads). Since DP does not make clear whether it is the REF or the ALT allele that is being read, I was trying to include AD as well. What I am really interested in is filtering on the number of times the ALT base is read: if it is read 20 or more times and passes the quality checks for being a true SNP, I will keep it for my downstream process. So will it be sufficient to filter only by DP, or should I introduce something else? Any suggestions?


You should usually filter by DP for read support at variant sites. If you need to discount heterozygous variant calls that are borderline homozygous reference, a better filter would be based on PL (the Phred-scaled genotype likelihoods), which GATK reports in its VCF files.

You would also get rid of a lot of false positives by filtering on strand bias.
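As a rough illustration of both suggestions, the Python sketch below checks a heterozygous call against two values: the PL of the homozygous-reference genotype (the first PL entry for a biallelic site) and the FS strand-bias annotation in the INFO column. The function names, the example records, and both thresholds are assumptions made for this sketch, not values taken from this thread; FS > 60 is often quoted as a starting point for hard-filtering SNPs, and the PL cutoff of 20 is purely illustrative.

```python
MIN_HOM_REF_PL = 20  # assumed cutoff: a lower PL means 0/0 is still plausible
MAX_FS = 60.0        # assumed cutoff: a higher FS means stronger strand bias


def info_value(info_field, key):
    """Return the value of one key from a VCF INFO column, or None if absent."""
    for item in info_field.split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return None


def hom_ref_pl(format_field, sample_field):
    """PL of the 0/0 genotype, i.e. the first PL entry, for one sample."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    return int(fields["PL"].split(",")[0])


def keep_het(info_field, format_field, sample_field):
    """Keep a het call only if it is clearly not 0/0 and not strand-biased."""
    fs = info_value(info_field, "FS")
    strand_ok = fs is None or float(fs) <= MAX_FS
    return strand_ok and hom_ref_pl(format_field, sample_field) >= MIN_HOM_REF_PL


fmt = "GT:AD:DP:GQ:PL"
print(keep_het("DP=40;FS=2.5", fmt, "0/1:18,22:40:99:600,0,520"))   # confident het
print(keep_het("DP=22;FS=1.0", fmt, "0/1:19,3:22:12:12,0,480"))     # borderline 0/0
print(keep_het("DP=40;FS=75.3", fmt, "0/1:18,22:40:99:600,0,520"))  # strand-biased
```

The second record shows why PL helps: the site has DP = 22, so a DP cutoff alone would keep it, yet only 3 reads carry the ALT allele and the homozygous-reference genotype is almost as likely as the heterozygous one.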


Is there any way I can use GATK VariantFiltration to select only high-quality variants, that is, variant sites with DP > 20 and AD for the ALT allele >= 10? I would then discard the variants with DP below 20. My depth of coverage is quite high, which is why I was thinking along these lines: 70% of my sequences were read more than 15 times when I ran the DepthOfCoverage statistics. So my idea is to filter out variants that are not read at least 20 times. The DP criterion should work well, since it gives the total number of reads that passed the quality metrics, but would it not be good to add the AD for the ALT allele as well? Is there a JEXL expression I can use to do that? And if I have to use the QUAL score as well, what kind of cutoff should I use? My command is given below; with it I am still getting over 40k variants.

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T VariantFiltration -V /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.vcf -o /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.recal.snps.filt.vcf --filterExpression " DP >= 20" --filterName "DepthofQaulity"

Any suggestions are welcome
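A note on the command above: VariantFiltration does not drop records. It writes the filter name into the FILTER column of every record that matches the expression and leaves the others as they were (typically PASS), so the number of records in the output does not change. With `DP >= 20` as the expression, it is the well-covered sites that end up flagged. One quick check is to count records per FILTER value. The sketch below is plain Python rather than anything from GATK, and the function name and example records are made up for illustration.

```python
from collections import Counter


def count_filter_values(vcf_lines):
    """Count records by FILTER column (7th field), skipping headers and blanks."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.rstrip("\n").split("\t")[6]] += 1
    return counts


# Hypothetical output of a run with --filterExpression "DP >= 20":
# the records that match the expression carry the filter name.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=12",
    "chr1\t200\t.\tC\tT\t80\tDepthofQaulity\tDP=35",
    "chr1\t300\t.\tG\tA\t90\tDepthofQaulity\tDP=41",
]
counts = count_filter_values(example)
print(counts)
```

To flag the low-depth sites instead, the expression would need to describe the records to reject, for example `DP < 20`.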


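Here's an example of an AD-based filter; you need to access AD as an array, not an int. Note that index 0 is the depth of the reference allele and index 1 is the depth of the first ALT allele.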

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

http://www.broadinstitute.org/gatk/guide/article?id=1255
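For comparison, the same array logic can be written outside GATK. The Python sketch below applies the thresholds asked about earlier in the thread (DP > 20 and ALT allele depth >= 10) to one sample column. The function name and example values are hypothetical, and the sketch assumes the first ALT allele is the one of interest.

```python
def passes_depth_filter(format_field, sample_field, min_dp=20, min_alt_ad=10):
    """True if DP > min_dp and the first ALT allele has at least min_alt_ad reads."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad_raw = fields.get("AD", ".")
    dp_raw = fields.get("DP", ".")
    if "." in ad_raw or dp_raw == ".":
        return False  # missing depth information: treat as not passing
    ad = [int(x) for x in ad_raw.split(",")]  # ad[0] is REF, ad[1] is the first ALT
    return int(dp_raw) > min_dp and len(ad) > 1 and ad[1] >= min_alt_ad


print(passes_depth_filter("GT:AD:DP", "0/1:14,12:26"))  # enough total and ALT reads
print(passes_depth_filter("GT:AD:DP", "0/1:20,4:24"))   # too few ALT reads
print(passes_depth_filter("GT:AD:DP", "./.:.:."))       # no call
```

Reading index 1 rather than index 0 is what makes this an ALT-depth filter; index 0 would test the reference allele instead.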
