Hello,
I have a list of SNPs (in the form of a VCF) found in our [very] targeted sequencing dataset of ~15,000 individuals.
I am looking to compare the MAFs of these SNPs within this 'population' (our dataset) to their MAFs across different populations, such as the populations defined in ExAC or the 1000 Genomes Project.
Is there an effective way that you would recommend to do this?
These samples were processed using GRCh38. I believe the ExAC variants have coordinates based on the previous build, GRCh37 (please correct me if I'm wrong), so I'm unsure about using the MAFs from the ExAC data.
The output table I have in mind would look something like this:
snpID MAF_mysamples MAF_european MAF_finnish MAF_african MAF_se_asian MAF_asian
As always, your input is greatly appreciated.
Hello,
Ensembl has lifted-over versions of the 1000 Genomes, ExAC and gnomAD exomes data for hg38. Have a look at this FTP directory.
You could use these to annotate your VCF and then extract all the information in whatever way you like. What should your final output look like?
fin swimmer
Thanks @finswimmer - just updated my post to clarify what I'm looking for as output.
Just use ANNOVAR, as it outputs allele frequencies for all of these populations, and it supports hg38. It even has a function that converts VCF to the format required for ANNOVAR, to assist you.
Regarding allele frequencies in your own sample cohort, you can calculate the AF (allele frequency) INFO tag from the allele count (AC) and allele number (AN) and encode it directly into your VCF using BCFtools. See: "How to use bcftools to calculate AF INFO field from AC and AN in VCF?"
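If it helps to see what that calculation amounts to, here is a minimal Python sketch of the arithmetic, assuming the INFO column already carries AC and AN. The function names are made up for illustration and are not part of BCFtools. Note that AF is the frequency of the ALT allele, so the minor allele frequency is the smaller of AF and 1 - AF.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'AC=3;AN=200;DB' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style entries (no '=') get an empty string
    return fields


def allele_frequencies(info):
    """Return one AF per ALT allele, computed as AC / AN."""
    fields = parse_info(info)
    an = int(fields["AN"])
    if an == 0:
        return []  # no called genotypes, so AF is undefined
    # AC holds one comma-separated count per ALT allele.
    return [int(ac) / an for ac in fields["AC"].split(",")]


def minor_allele_frequency(af):
    """Fold an ALT allele frequency so that it never exceeds 0.5."""
    return min(af, 1.0 - af)
```

For example, `allele_frequencies("AC=3;AN=200")` gives a single AF of 3/200 for a biallelic site, and a site with `AC=3,5` yields two values, one per ALT allele.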
To then extract the AF in an 'easy' format, use BCFtools query, something like the approach in: "A: Extracting certain columns from VCF file"
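Once the cohort AFs are in a tab-separated form, putting together the table from the question is mostly a join on the variant. Below is a rough Python sketch under a few assumptions: each cohort line has the columns CHROM, POS, REF, ALT, AF (one ALT per line, i.e. multi-allelic sites already split), and the population frequencies have already been loaded into a dict keyed by (chrom, pos, ref, alt). The names `build_table`, `variant_key` and `fold` and the population labels are hypothetical, and how you fill the reference dict depends on the annotation source you choose.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant to a tuple, dropping any 'chr' prefix."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref, alt)


def fold(af):
    """Return the MAF for an AF string, or 'NA' when it is missing."""
    if af in (None, "", "."):
        return "NA"
    value = float(af)
    return "{:.4g}".format(min(value, 1.0 - value))


def build_table(cohort_lines, reference, populations):
    """Join cohort AFs with per-population AFs into rows of strings."""
    header = ["snpID", "MAF_mysamples"] + ["MAF_" + p for p in populations]
    rows = [header]
    for line in cohort_lines:
        chrom, pos, ref, alt, af = line.rstrip("\n").split("\t")
        key = variant_key(chrom, pos, ref, alt)
        ref_afs = reference.get(key, {})  # empty if the variant is absent
        row = ["{}:{}:{}:{}".format(*key), fold(af)]
        row += [fold(ref_afs.get(p)) for p in populations]
        rows.append(row)
    return rows
```

Each row can then be written out with `"\t".join(row)`. Matching on chromosome, position, REF and ALT only makes sense if both datasets are on the same build (GRCh38 here) and use the same allele representation, so variants missing from the reference simply show up as NA.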