Hi Biostars,
I would like to learn the proper way to calculate minor allele frequencies (MAFs) in population genetics. With a large sample (more than 100) of whole-genome sequences in VCF format, I use vcftools. I use the optional arguments --maf and --max-maf to keep only SNPs within a particular MAF range. For example, I do not want SNPs with a MAF below 0.05, so I use this command:
vcftools --vcf xxx.vcf --out SNP --remove-indels --maf 0.05 --012
However, this produces a genotype file (012 format) in which the frequencies range from 0.05 to 1, because at some sites (genome positions) most samples carry the alternative allele rather than the reference allele. I thought that MAFs should range between 0 and 0.5. For example, the frequencies come out like this:
0 0 0 0 0 0 1 0 0 0 maf = 1/20 = 0.05
2 1 2 2 2 2 2 2 2 2 maf = 19/20 = 0.95
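To make the arithmetic in the two rows explicit, here is a minimal Python sketch. It assumes each value in a 012 row is the number of non-reference alleles in one diploid individual, with -1 marking a missing genotype. The function names are my own for illustration and are not part of vcftools.

```python
def alt_allele_frequency(genotypes):
    """Frequency of the non-reference allele in one 012-coded row.

    Each entry is the count of non-reference alleles (0, 1, or 2) in one
    diploid individual; -1 is treated as a missing genotype and skipped.
    """
    called = [g for g in genotypes if g >= 0]
    if not called:
        raise ValueError("no called genotypes at this site")
    # Each called diploid individual contributes two chromosomes.
    return sum(called) / (2 * len(called))


def minor_allele_frequency(genotypes):
    """Fold the non-reference frequency so the result lies in [0, 0.5]."""
    p = alt_allele_frequency(genotypes)
    return min(p, 1 - p)


row1 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
row2 = [2, 1, 2, 2, 2, 2, 2, 2, 2, 2]

for row in (row1, row2):
    # Prints the non-reference frequency, then the folded (minor) frequency.
    print(round(alt_allele_frequency(row), 2), round(minor_allele_frequency(row), 2))
```

With this folding, both rows end up with the same minor allele frequency, even though their non-reference frequencies are 1/20 and 19/20.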
Shouldn't the second row be considered to have a MAF of 0.05? If so, should I simply pass --maf 0.05 and --max-maf 0.95 to vcftools? Or is there a better way to do this?
Thanks!
Thanks for the info. Just for my understanding: if there are only two possible alleles at a position, would this be identical to calling --maf 0.05 and --max-maf 0.95 when using vcftools? Also, in this case, could we simply rescale the frequencies output by vcftools that fall between 0.5 and 1 into the range 0 to 0.5?
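As a sanity check on the arithmetic only (not on what vcftools does internally), here is a small Python sketch. It assumes a biallelic site where p is the frequency of the non-reference allele, and it uses exact fractions to avoid floating-point edge cases at the thresholds. The helper names are hypothetical.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 20)  # 0.05, the lower MAF cutoff from the question


def fold(p):
    """Map an allele frequency in [0, 1] to a minor allele frequency in [0, 0.5]."""
    return min(p, 1 - p)


def in_symmetric_range(p, lower=THRESHOLD):
    """Keep the site if the non-reference frequency is within [lower, 1 - lower]."""
    return lower <= p <= 1 - lower


def folded_maf_passes(p, lower=THRESHOLD):
    """Keep the site if the folded (minor) frequency is at least `lower`."""
    return fold(p) >= lower


# Every non-reference frequency possible with 20 chromosomes (10 diploid samples).
n_chromosomes = 20
alt_freqs = [Fraction(k, n_chromosomes) for k in range(n_chromosomes + 1)]

# For two alleles, the two filters select exactly the same sites.
assert all(in_symmetric_range(p) == folded_maf_passes(p) for p in alt_freqs)

# Frequencies above 0.5 fold onto 1 - p.
print([float(fold(p)) for p in (Fraction(1, 20), Fraction(19, 20))])
```

So for two alleles, filtering the non-reference frequency to the range 0.05 to 0.95 and filtering a folded MAF at 0.05 select the same sites, and the rescaling is just 1 - p for any p above 0.5. This does not hold as written for sites with more than two alleles.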