I've annotated some variants using VEP and was looking at the minor allele frequencies. Some of the variants had very different MAFs in the annotation than I expected (I expected MAF < 1%, whereas some annotated MAFs were >50%). I looked up the same variants in the gnomAD v3 browser, and all the ones I checked had MAFs much more in line with my expectation, and therefore very different from the MAFs annotated by VEP. One example: 19:35768033:C:T was annotated with a MAF of 36% (NFE), whereas gnomAD v3.1.2 lists the NFE MAF as 0.0456%.
I ran the annotation using the following command:
vep --input_file input.vcf \
--output_file anno.tab \
--format vcf \
--tab --symbol --hgvs --tsl \
--terms SO \
--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
--offline \
--cache --dir_cache /anno_cache \
--plugin CADD,whole_genome_SNVs.tsv.gz,gnomad.genomes.r3.0.indel.tsv.gz \
--af_gnomad gnomAD_NFE_AF
Edit: Another example: 1:1719393:A:G, for which gnomAD lists an NFE MAF of 0.0. Here are the corresponding lines in the annotation:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND FLAGS SYMBOL SYMBOL_SOURCE HGNC_ID TSL HGVSc HGVSp HGVS_OFFSET gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF CLIN_SIG SOMATIC PHENO CADD_PHRED CADD_RAW
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000356200 Transcript missense_variant 423 188 63 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 5 ENST00000356200.7:c.188T>C ENSP00000348529.2:p.Val63Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000356937 Transcript non_coding_transcript_exon_variant 123 - - - - rs72909030,COSV62264367 MODIFIER - -1 - CDK11A HGNC HGNC:1730 1 ENST00000356937.7:n.123T>C - - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000357760 Transcript missense_variant 370 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 1 ENST00000357760.6:c.290T>C ENSP00000350403.2:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000358779 Transcript missense_variant 370 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 1 ENST00000358779.9:c.290T>C ENSP00000351629.5:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000378633 Transcript missense_variant 370 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 1 ENST00000378633.5:c.290T>C ENSP00000367900.1:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000378638 Transcript missense_variant 350 188 63 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 5 ENST00000378638.6:c.188T>C ENSP00000367905.1:p.Val63Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000401096 Transcript downstream_gene_variant - - - - - rs72909030,COSV62264367 MODIFIER 3315 -1 cds_end_NF CDK11A HGNC HGNC:1730 5 - - - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000404249 Transcript missense_variant 403 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 1 ENST00000404249.8:c.290T>C ENSP00000384442.3:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000460465 Transcript missense_variant,NMD_transcript_variant 370 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 1 ENST00000460465.5:c.290T>C ENSP00000462289.1:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000479362 Transcript missense_variant 536 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 cds_end_NF CDK11A HGNC HGNC:1730 1 ENST00000479362.1:c.290T>C ENSP00000423900.1:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000487462 Transcript downstream_gene_variant - - - - - rs72909030,COSV62264367 MODIFIER 3073 -1 - CDK11A HGNC HGNC:1730 5 - - - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000498810 Transcript non_coding_transcript_exon_variant 347 - - - - rs72909030,COSV62264367 MODIFIER - -1 - CDK11A HGNC HGNC:1730 2 ENST00000498810.1:n.347T>C - - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000008128 ENST00000509982 Transcript missense_variant,NMD_transcript_variant 304 290 97 V/A gTt/gCt rs72909030,COSV62264367 MODERATE - -1 - CDK11A HGNC HGNC:1730 5 ENST00000509982.5:c.290T>C ENSP00000422149.1:p.Val97Ala - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
1:1719393:A:G 1:1719393 G ENSG00000268575 ENST00000598846 Transcript non_coding_transcript_exon_variant 3054 - - - - rs72909030,COSV62264367 MODIFIER - -1 - - - - 2 ENST00000598846.1:n.3054T>C - - 0.4984 0.4963 0.4969 0.4982 0.4964 0.5 0.4992 0.4994 0.498 - 0,1 0,1 18.49 1.900858
Is there an explanation for these discrepancies? Am I making a mistake in my annotation and if so, might the other data fields (particularly gene and consequence) be affected?
VEP includes gnomAD r2.1.1 exomes only. To include gnomAD v3 data in the VEP output, you should use a custom annotation: https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
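As a rough sketch of what such a custom annotation could look like, the snippet below builds a `--custom` argument in the positional form described on that page (file, short name, file type, match type, coordinate flag, then INFO fields) and prints the resulting command for inspection. The gnomAD file name is a hypothetical placeholder, and `AF` and `AF_nfe` are the INFO keys I'd assume the gnomAD v3 genomes VCF provides, so check the VCF header and the syntax for your VEP version before running it.

```shell
# Hypothetical path: a bgzipped, tabix-indexed gnomAD v3 genomes sites VCF
# (gnomAD distributes these per chromosome; adjust to the files you downloaded).
GNOMAD_V3_VCF="gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz"

# Positional --custom form: file,short_name,file_type,match_type,coords,fields...
# "gnomADg" is an arbitrary short name chosen here; AF and AF_nfe are assumed
# INFO keys in the gnomAD v3 VCF.
CUSTOM_ARG="${GNOMAD_V3_VCF},gnomADg,vcf,exact,0,AF,AF_nfe"

# Print the full command rather than running it, so it can be reviewed first.
cat <<EOF
vep --input_file input.vcf \\
    --output_file anno.tab \\
    --format vcf --tab --symbol \\
    --offline --cache --dir_cache /anno_cache \\
    --custom ${CUSTOM_ARG}
EOF
```

With a short name of `gnomADg`, the requested fields should appear in the output as columns such as `gnomADg_AF` and `gnomADg_AF_nfe`, alongside the v2 exome columns from the cache.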
The frequencies returned by VEP are correct for that dataset: in your example, 1:1719393:A:G has an NFE frequency of 0.4992 in the v2 exomes, while v3.1.2 (genomes) lists 0.0.
I'm not sure why there is such a difference in frequency, but it could be due to differences in how well the region is covered in the two datasets. The gnomAD documentation explains a little about allele frequency differences between the datasets:
"Therefore gnomAD v2 is still our recommended dataset for most coding regions analyses. However, gnomAD v3.1 represents a very large increase in the number of genomes, and will therefore be a much better resource if your primary interest is in non-coding regions or if your coding region of interest is poorly captured in the gnomAD exome"
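To see which annotated variants carry an unexpectedly high v2 exome frequency, and are therefore worth comparing against v3, something like the following could help. It is only a sketch: `flag_common` is a hypothetical helper name, and it assumes the real `--tab` output is tab-separated with a `gnomAD_NFE_AF` column, as in the header shown in the question.

```shell
# flag_common FILE [MAX_AF]
# Print each variant ID once, with its gnomAD_NFE_AF, when that frequency
# exceeds MAX_AF (default 0.01). Missing values ("-") are skipped.
flag_common() {
  awk -F'\t' -v max_af="${2:-0.01}" '
    /^##/ { next }                          # skip VEP meta lines
    /^#/  {                                 # header: locate the column by name
      for (i = 1; i <= NF; i++) if ($i == "gnomAD_NFE_AF") col = i
      next
    }
    col && $col != "-" && $col + 0 > max_af + 0 && !seen[$1]++ {
      print $1 "\t" $col
    }
  ' "$1"
}
```

For example, `flag_common anno.tab 0.01` would list every variant annotated with an NFE frequency above 1%, one line per variant rather than one per transcript.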