I should say up front that I don't know much about statistical genetics, though I do understand statistics.
There is a cohort of 400 patients with a disease. A set of genes is hypothesized to be causative, or to carry significantly more variants than usual.
Based on my understanding, it is necessary to have a control cohort and to show that this set of genes is enriched in the case set but not in the control set.
The NHLBI cohort contains almost 4000 individuals who are ethnically matched with the disease cohort.
I saw in this forum that people have mentioned using a "burden test" or a "Fisher test". To my understanding, these methods compare variant frequencies between the cohort populations.
For example, with variant frequencies per cohort:
variant_name      1KG    NHLBI   Disease
snp1_from_gene1   0.03   0.01    0.1
snp2_from_gene1   0.7    0.02    0.2
snp3_from_gene2   0.3    0.01    0.1
We then compare the distribution of these frequencies between NHLBI and Disease, or between 1KG and Disease, to show at some significance level that the two distributions differ. 1) Is this correct? Can you explain this part in more detail? Is this what a burden test is? (I don't think so.)
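To make question 1 concrete, here is a minimal sketch of what I think the two approaches amount to on the NHLBI and Disease columns of the table. Everything beyond the table is an assumption for illustration: cohort sizes of exactly 400 and 4000, allele counts reconstructed as frequency times 2N, and the hypothetical helper names `fisher_exact_two_sided` and `allele_counts`. The gene-level step simply sums allele counts within a gene, which is only a crude stand-in for a burden test; as far as I understand, proper burden tests collapse variants per individual from genotype data, which a frequency table does not provide.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    # Sum every table that is no more likely than the observed one.
    tail = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return tail / comb(row1 + row2, col1)

# Assumed cohort sizes; the post only says "400" and "almost 4000".
N_DISEASE, N_NHLBI = 400, 4000

# variant -> (gene, NHLBI frequency, Disease frequency), copied from the table.
FREQS = {
    "snp1_from_gene1": ("gene1", 0.01, 0.1),
    "snp2_from_gene1": ("gene1", 0.02, 0.2),
    "snp3_from_gene2": ("gene2", 0.01, 0.1),
}

def allele_counts(freq, n_people):
    """Assumption: diploid samples, so 2 * n_people chromosomes per variant."""
    total = 2 * n_people
    alt = round(freq * total)
    return alt, total - alt

per_variant = {}   # single-variant test: one p-value per SNP
gene_tables = {}   # crude collapse: allele counts summed within each gene
for name, (gene, f_ctrl, f_case) in FREQS.items():
    table = (*allele_counts(f_case, N_DISEASE), *allele_counts(f_ctrl, N_NHLBI))
    per_variant[name] = fisher_exact_two_sided(*table)
    previous = gene_tables.get(gene, (0, 0, 0, 0))
    gene_tables[gene] = tuple(x + y for x, y in zip(previous, table))

per_gene = {gene: fisher_exact_two_sided(*t) for gene, t in gene_tables.items()}
```

Printing `per_variant` and `per_gene` gives one p-value per SNP and one per gene. The single-variant test asks whether each SNP differs between cases and controls, while the collapsed version asks whether the gene as a whole carries more variant alleles in cases.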
2) Second, suppose I have three control sets, say NHLBI, 1KG, and another disease cohort (for example, autism). These three sets don't necessarily agree with each other, and a variant's frequency in NHLBI may differ significantly from its frequency in 1KG. In the example above, one can compare 1KG vs NHLBI and see significantly different distributions. One obvious reason is different variant-calling methods. So what is the best strategy for making such a comparison?
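To illustrate the concern in question 2, here is a rough sketch that compares the two control columns of the table against each other and flags variants where they disagree. It uses a two-proportion z-test with a Bonferroni-corrected threshold, which is my own choice for illustration and not an established recommendation. The sample sizes are assumptions (2504 individuals for 1KG, 4000 for NHLBI), and `two_proportion_p` is a hypothetical helper name.

```python
from math import erfc, sqrt

def two_proportion_p(f1, n1, f2, n2):
    """Two-sided p-value for H0: both cohorts share one underlying frequency."""
    pooled = (f1 * n1 + f2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both frequencies are 0 or 1, so nothing to compare
    z = (f1 - f2) / se
    return erfc(abs(z) / sqrt(2))  # normal approximation, two-sided

# Assumed chromosome counts (2 per individual); adjust to the real cohorts.
CHROMS = {"1KG": 2 * 2504, "NHLBI": 2 * 4000}

# Control frequencies copied from the table in the question.
FREQS = {
    "snp1_from_gene1": {"1KG": 0.03, "NHLBI": 0.01},
    "snp2_from_gene1": {"1KG": 0.7, "NHLBI": 0.02},
    "snp3_from_gene2": {"1KG": 0.3, "NHLBI": 0.01},
}

ALPHA = 0.05 / len(FREQS)  # Bonferroni correction over the variants tested

# Variants whose frequency differs between the two control sets.
discordant = [
    name
    for name, f in FREQS.items()
    if two_proportion_p(f["1KG"], CHROMS["1KG"], f["NHLBI"], CHROMS["NHLBI"]) < ALPHA
]
```

With the example numbers, every variant ends up in `discordant`, which is exactly the situation I am worried about: the controls disagree with each other before the disease cohort is even considered.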