QUAL and AF GATK
1
0
Entering edit mode
8.8 years ago
aleka ▴ 110

Hi.

I am a bit confused. I have my vcf files from GATK (exome data) and I want to filter out the low quality variants and extract the variants with minor allele frequency <0.5.

  1. I had the idea that qual field >30 or >40 is a good variant but in some papers that use GATK protocol I saw that they use qual<30. which is the correct? Others use just vqslod>0. I know that vqslod is the calibrated score. What do you suggest? filter for both? qual or vqslod?
  2. Also the allele frequency derived from gatk (example below) is not the minor allele frequency. right? from where it is derived? Also in the example below the AF=1.00. If it is 1.00, how is it a SNP?

    chr1    15274   rs201931625     A       T       10123.84        PASS    AC=56;AF=1.00;AN=56;DB;DP=360;FS=0.000;InbreedingCoeff=-0.0065;MLEAC=55;MLEAF=0.982;MQ=26.38;QD=29.60;SOR=10.341;VQSLOD=20.31;culprit=MQ        GT:AD:DP:GQ:PL  1/1:0,29:29:87:836,87,0 1/1:0,8:8:24:244,24,0
    -----------------------------------------------------------------------------------------------^____^-------------------------------------------------------------------------------------------------------------------------^^-----------------^^---------------------^
    

    I want to extract all the variants that have minor allele frequency <0.05 according to 1000 genome project. How do I do that? Any idea?

  3. Last question, in the above example there is the DP in the INFO field and also in the format field for each sample. What is the difference? and when I filter for DP >5, from which field is it taken INFO or FORMAT?

Sorry for the many questions.

Any help is appreciated

exome next-gen • 6.1k views
ADD COMMENT
2
Entering edit mode
8.8 years ago
SteveL ▴ 90

Hi @Aleka,

I had a longer answer, but lost it when I linked to follow the thread before posting, so briefly.

A) GATK forum is the best place to ask, and search for answer to these questions.

  1. You definitely want QUAL>30, and you will still have quite some false positives at the lower end
  2. The AF reported by GATK is the frequency of the ALT allele across your samples at this position - since all 28 of your samples are homozygous ALT, it is 1.0 - if one had been a heterozygote it would have been 0.982. The MAF you want is not provided by GATK - you would have to use an annotator such as ANNOVAR or dbNSFP, or something similar to add the annotation to your VCF, before filtering.
  3. The DP in the INFO field is the total DP across all samples, while the DP in the FORMAT field - i.e. for each sample, is sample specific. The sum of the latter is not always the former, since the number reported in the sample field is only for reads that were of sufficient quality to call the allele one way or another - you would have to ask GATK what exactly this means, but often there are a few reads that are regarded as ambiguous. So you want the FORMAT field, and you want >5. >7 is probably better
ADD COMMENT
0
Entering edit mode

Hi SteveL,

Thanks for your reply. Very useful.

Another related question:

I am trying to compare my final vcf file with the 1000G one from GATK bundle. It is a bit strange that I have some variants in my callset that do not exist in the 1000G site. One example is the chr1:14653:rs375086259:C:T in position 14653 that does exist in the NCBI database but does not exist in the 1000G site file from GATK bundle. More specifically, this file begins from chr1 position 51479.

Is there something that I miss? or something wrong?

ADD REPLY
1
Entering edit mode

Yes, I would agree with using the 1000G phase.1 variants for filtering, and also keeping any positions that were not observed in the 1000G project, as these should be rare too.

All the 1000G (phase 1) variants should be in dbSNP, but not the reverse, unless they removed some that they decided were not valid when they released phase 3. This is what I would use to filter for rare variants. In addition the ExAC dataset is very useful for coding regions, since they have sequenced 65k exomes and counting.

ADD REPLY
0
Entering edit mode

I am not that au fait with the 1000G calls to know why their first call is so far into Chr1. One thing to make sure of are that your calls were made on the same version of the genome. i.e. Version 37 not Version 38.

Regardless, you should not expect ALL your calls to have been observed in the 1000G call set - many of the genomes (in phase 1 at least) were called at very low depth, so not all variants will have been observed, and though it is a large and useful dataset, it by no means encompasses all of human variation, so you should find plenty of calls that are not in the 1000G set.

Regarding your original goal of keeping variants with MAF <0.5 do you mean <0.5%, because all variants will have MAF <0.5 ;-) If so, then you can assume that where you have a variant and there is no corresponding variant in the 1000G data that it is most likely rare enough to take forward.

ADD REPLY
0
Entering edit mode

I have used all the files from GATK bundle version hg19. therefore it is not the version of the genome a problem.

I noticed that the positions and the variants in my vcf file mainly correspond to the dbsnp138.vcf that I used as the annotation file.

I am just thinking if I should have used the 100G.phase1.sites.vcf from bundle since I wanted to find the rare variants based on the 1000G. What do you think?

Anyway, why the variants in dbsnp and 1000G are so different since they are both from hg19?

Hi. Considering the MAF, I am sorry, I wanted to write MAF<=0.15. I am looking for rare variants.

ADD REPLY
0
Entering edit mode

Hi SteveL,

Yes, to start I chose the 1000G.phase1.sites. I will also have a look in the Exac. Thanks.

I was wondering if you have any idea about how I can select variants based on the sample DP. The usual command SelectVariants from GATK selects for the total DP.

ADD REPLY

Login before adding your answer.

Traffic: 1796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6