From the discussions in previous questions, I understand that REF and ALT need not necessarily correspond to major and minor alleles. REF is from the ref genome and could very well be the minor allele for the variant.
I'd like to find out if a REF allele is a minor allele for any variant in my region of interest. One of the ways I could do this is to find out COUNT(variants) where af > 0.5 in my region of interest.
Would I be correct in assuming this approach will definitely give me the right answer? Is there any underlying assumption I'm missing before I use this as my standard approach?
Any anomalies you might have noted in your experience would help me. Thank you!
Thank you, Jorge. Your reply on a different post was one of my references for the REF/ALT definition. We have a local DB created from reformatted/processed 1000genomes. I'll check on how the DBA dealt with multi-allelic variants.
If they haven't dealt with it the right way, I can always use your bcftools command with the raw VCF. Thank you :-)
The cases where filtering by AF > 0.5 wouldn't work because AF is calculated only from the first alternative allele are rare, but it is more appropriate to take them into consideration. Also, keep in mind that filtering the raw 1000genomes data by AF also picks up indels, which I'm not sure would help you achieve your goal.
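To make the multi-allelic caveat concrete, here is a minimal Python sketch. It assumes the INFO field carries one AF value per ALT allele as a comma-separated list (as in the raw 1000genomes VCFs), and it takes "REF is minor" to mean that at least one ALT allele is more frequent than REF. The function names `parse_af` and `ref_is_minor` are made up for illustration and are not part of any tool.

```python
def parse_af(info):
    """Extract per-ALT frequencies from a VCF INFO string, e.g. 'AC=3;AF=0.3,0.45'."""
    for field in info.split(";"):
        # Match the plain AF key only, not population-specific keys such as EAS_AF.
        if field.startswith("AF="):
            return [float(x) for x in field[3:].split(",")]
    return []


def ref_is_minor(alt_afs):
    """Return True if at least one ALT allele is more frequent than REF."""
    # Assumption: REF frequency is whatever is left after all ALT alleles.
    ref_af = 1.0 - sum(alt_afs)
    return any(af > ref_af for af in alt_afs)


# Bi-allelic sites: this is equivalent to AF > 0.5.
print(ref_is_minor(parse_af("AF=0.1")))
print(ref_is_minor(parse_af("AF=0.6")))

# Multi-allelic site: no ALT reaches 0.5, yet REF (0.25) is not the major allele.
print(ref_is_minor(parse_af("AC=3;AF=0.4,0.35")))
```

The last case is the kind of site that a plain AF > 0.5 filter would miss, whether it looks at the first ALT allele only or at each ALT allele separately.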
I ran a bunch of queries on my DB. There were no multi-allelic variants of any kind, and for all variants with only one REF and one ALT allele, I found no case where af was >= 0.5. I guess I can safely assume that all ALT alleles are minor alleles in my sample space.
There are indeed multi-allelic variants in the latest 1000genomes release (previously they were collapsed to bi-allelics), as stated in the callset readme file, and plenty of variants with AF > 0.5 too. If you don't find any yourself, then it depends on the way you've built your database, or on the region or the samples you are considering.
It is the region, I am quite certain. We store multi-allelic variants as multiple records, one record per ALT allele in an SQL database.
On the indels, I'm targeting only SNVs anyway.
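For a database laid out this way (one record per ALT allele), the check could be sketched roughly as below. The table name `variants` and its columns (`chrom`, `pos`, `ref`, `alt`, `af`) are hypothetical, as is the toy data; an in-memory SQLite database stands in for the real one. The query groups the records of each site, takes the REF frequency as 1 minus the sum of the ALT frequencies, and keeps sites where some ALT allele is more frequent than REF.

```python
import sqlite3

# Toy stand-in for the real database; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("1", 100, "A", "G", 0.10),   # bi-allelic, REF is major
        ("1", 200, "C", "T", 0.70),   # bi-allelic, REF is minor
        ("1", 300, "G", "A", 0.40),   # multi-allelic site, stored as two records
        ("1", 300, "G", "T", 0.35),
        ("1", 400, "T", "TA", 0.90),  # insertion, skipped by the SNV filter
    ],
)

# Note: the SNV filter is applied per record, so an indel ALT at the same
# position would not be counted when estimating the REF frequency.
QUERY = """
SELECT chrom, pos, ref
FROM variants
WHERE chrom = ? AND pos BETWEEN ? AND ?
  AND length(ref) = 1 AND length(alt) = 1
GROUP BY chrom, pos, ref
HAVING MAX(af) > 1.0 - SUM(af)
ORDER BY pos
"""


def sites_where_ref_is_minor(conn, chrom, start, end):
    """Return (chrom, pos, ref) for SNV sites where some ALT outnumbers REF."""
    return conn.execute(QUERY, (chrom, start, end)).fetchall()


hits = sites_where_ref_is_minor(conn, "1", 1, 1000)
print(hits)
```

With the toy rows, position 300 is reported even though neither of its records has af > 0.5, which is the case a per-record `af > 0.5` count would not catch.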