Hi,
I am conducting rare variant tests for predefined genomic units (such as genes) genome-wide, each containing a subset of variants from a multi-sample VCF. I used SKAT-O for that purpose in R, that functions reasonably well to output the rare variant test association p-values per genomic unit.
However it does not report the cumulative allele frequencies of tested group of variants within a unit (let's say) in patient and controls; so I need to go back to the VCF and do it manually myself to learn about the cumulative frequencies and the enrichment of rare variants (in what direction, with what effect size?)
I am trying to come up with a script to solve this, but I wasn't successful so far. Is there any script/tool that can you suggest for that purpose?
For example, what I want is something like below:
1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1"
2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy controls, all in the same VCF [for this phenotypic information, use my table that indicates a binary phenotype for each sample, such as PLINK .fam/.ped file])
3) Calculate the cumulative frequencies of these mutations for each group
4) Output these frequencies and an enrichment ratio between patients vs controls
It'd be great if I can do this for a group of genes in a loop -> outputting a table containing info for each.
Any suggestions?
Thanks a lot Pierre! Your code helped me, but what I was looking for was something that is a bit different.
I want to query the (rare) variants falling into a group (usually a gene name) from a .tsv file, not from SnpEff annotation directly. For these variants, I want to obtain cumulative MAF (so what's the frequency of the total alternative alleles of these variants in the total possible successful calls within a group [total n of variants x n of Patients (-minus "./." missing genotypes) x 2]). Then I will get the ratio of MAFcumulative_patients vs MAFcumulative_controls, for instance.
Can it be also done with your scripts + bcftools query?