7.8 years ago
nkausthu
I have ~200 exomes, which include related and unrelated individuals. I have done the joint genotype calling and calculated the allele frequency using VCFtools. But as it's a mixed population, what is the ideal way to calculate allele frequency? I would like to make an in-house variant database from our available exomes, and the corresponding allele frequencies will be used to filter the variants. It would be helpful if you could give some further information about the methods to adjust for relatedness. Thank you.
You could use GATK SelectVariants to subset the VCF file accordingly and then calculate the AF.
Actually, I would like to know: if I include all these related individuals along with the unrelated individuals when calculating the allele frequency, will it be biased? Or is there a statistical way to avoid this bias?
I would expect that AFs will be more similar amongst related individuals. Depending on what you want to do, there are methods for adjusting for relatedness.
I would like to make an in-house variant database from our available exomes and the corresponding allele frequencies will be used to filter the variants. It would be helpful if you can give some further information about the methods to adjust relatedness. Thank you ..
Please add this information to your initial post and try to be as informative as possible when asking questions. Those details are very important.
For filtering, you don't want to inflate the allele frequencies because of related individuals. I think the only correct way of creating such a database would be to count a variant shared by, e.g., three sibs just once. You should count in how many families each variant is observed, because observations within a family are not independent. An easier way (but you will lose information) would be to not include related individuals (essentially just choose one individual per family, at random).
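To make the "one individual per family" option concrete, here is a minimal Python sketch. It assumes you already have a sample-to-family mapping and per-sample alternate allele counts (0, 1, 2, or None for a missing call) for a single variant; the function names, sample IDs, and family labels are hypothetical and not taken from VCFtools or GATK.

```python
import random


def pick_one_per_family(sample_to_family, seed=0):
    """Pick one sample per family at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    by_family = {}
    for sample, family in sorted(sample_to_family.items()):
        by_family.setdefault(family, []).append(sample)
    return sorted(rng.choice(members) for members in by_family.values())


def allele_frequency(genotypes, samples):
    """Alt allele frequency over the given samples for one diploid variant.

    genotypes maps sample -> alt allele count (0, 1, 2) or None if missing.
    Returns None when no sample has a called genotype.
    """
    alt = 0
    called_alleles = 0
    for sample in samples:
        alt_count = genotypes.get(sample)
        if alt_count is None:
            continue
        alt += alt_count
        called_alleles += 2
    return alt / called_alleles if called_alleles else None


# Hypothetical data: three relatives in famA, two unrelated singletons.
sample_to_family = {"s1": "famA", "s2": "famA", "s3": "famA",
                    "s4": "famB", "s5": "famC"}
genotypes = {"s1": 1, "s2": 1, "s3": 2, "s4": 0, "s5": 0}

chosen = pick_one_per_family(sample_to_family)
af_all = allele_frequency(genotypes, list(genotypes))
af_unrelated = allele_frequency(genotypes, chosen)
```

With this toy data the naive frequency over all five samples is inflated by the three relatives, while the subset frequency depends on which member of famA happens to be drawn. That dependence is the information loss mentioned above.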
You are absolutely right! Removing redundant variants from related individuals is something I thought about, but the problem is which zygosity I should keep. For example, if the same variant is het/het/hom in three related individuals, which genotype do I keep and which do I remove? As you already said, if I consider only one individual from each family then I will lose many variants. So I am a bit confused.
Don't count a variant twice if the two observations are from the same family, it still counts as one.
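One way to read this rule is as a family-level carrier count: each family contributes at most one observation, regardless of how many members carry the variant. The sketch below is only an illustration under that reading. The function name and data are hypothetical, and the result is a carrier frequency across families rather than an allele frequency, so zygosity within a family is deliberately discarded.

```python
def family_carrier_count(genotypes, sample_to_family):
    """Count families carrying a variant, one observation per family.

    genotypes maps sample -> alt allele count (0, 1, 2) or None if missing.
    Returns (families_carrying, families_with_a_called_genotype).
    """
    families_called = set()
    families_carrying = set()
    for sample, alt_count in genotypes.items():
        if alt_count is None:
            continue
        family = sample_to_family[sample]
        families_called.add(family)
        if alt_count > 0:
            families_carrying.add(family)
    return len(families_carrying), len(families_called)


# Hypothetical data: het/het/hom in one family, two unrelated non-carriers.
sample_to_family = {"s1": "famA", "s2": "famA", "s3": "famA",
                    "s4": "famB", "s5": "famC"}
genotypes = {"s1": 1, "s2": 1, "s3": 2, "s4": 0, "s5": 0}

carrying, called = family_carrier_count(genotypes, sample_to_family)
```

Here the variant is seen in one of three families, however many relatives share it.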
just consider the following 3 scenarios
I would simplify scenario 3 to 2 allele counts. But it's imperfect.
You can adjust the population allele frequency for relatedness using the ideas described above. From my experience, and from the 1000 Genomes Project, some people incorrectly report their relatedness and ethnic group. Because of this, I strongly recommend that you test for relatedness based on the VCF files you have. You can use KING (http://people.virginia.edu/~wc9c/KING/manual.html) for this. You can then use a priori probabilities given the relatedness to correct the counts, so that each allele is counted approximately once.
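Once pairwise kinship estimates are available, one simple way to turn them into inferred families is to link every pair above a cutoff and take the connected groups. The sketch below assumes the estimates have already been loaded as (sample_a, sample_b, kinship) tuples; it does not parse any KING output file. The default cutoff of 0.0884 is the commonly quoted lower bound for second-degree relatives on the kinship coefficient scale and should be treated as an assumption to tune, as should all names here.

```python
def cluster_by_kinship(samples, pairs, threshold=0.0884):
    """Group samples into inferred families using a union-find structure.

    pairs is an iterable of (sample_a, sample_b, kinship) tuples.
    Any pair with kinship >= threshold is placed in the same cluster.
    """
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, kinship in pairs:
        if kinship >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical kinship estimates for four samples.
samples = ["s1", "s2", "s3", "s4"]
pairs = [("s1", "s2", 0.25), ("s2", "s3", 0.24), ("s3", "s4", 0.01)]
families = cluster_by_kinship(samples, pairs)
```

The resulting clusters can stand in for self-reported families when counting each variant once per family.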
Another thing to consider is storing the number of alternative homozygotes and heterozygotes you saw with no pathology. These counts are useful for inheritance model and penetrance testing.
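As a rough illustration of what such a record could hold, the sketch below tallies genotype classes for one variant from per-sample alternate allele counts. The field names are hypothetical, and restricting the input to individuals without the pathology is assumed to happen upstream.

```python
from collections import Counter


def genotype_counts(genotypes):
    """Tally genotype classes for one diploid, biallelic variant.

    genotypes maps sample -> alt allele count (0, 1, 2) or None if missing.
    """
    counts = Counter()
    for alt_count in genotypes.values():
        if alt_count is None:
            counts["missing"] += 1
        elif alt_count == 0:
            counts["hom_ref"] += 1
        elif alt_count == 1:
            counts["het"] += 1
        else:
            counts["hom_alt"] += 1
    return counts


# Hypothetical unaffected individuals for a single variant.
genotypes = {"s1": 1, "s2": 1, "s3": 2, "s4": 0, "s5": None}
counts = genotype_counts(genotypes)
```

Keeping het and hom_alt separately, rather than only a frequency, lets a later filter treat a variant seen as homozygous in unaffected people differently from one seen only in heterozygotes.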
Your post does not explain who these people are or what you are trying to accomplish. Or even what kind of data you have. Please clarify it, in great detail.