These are purely statistical/methodological questions. I am conducting an association study between a gene (G) and a disease (D), as well as between individual SNPs (S) and the disease.
I wonder:
- Is it common for the association beta to vary among SNPs within the same gene?
That is, suppose gene G has, say, 100 missense or putative loss-of-function (pLOF) variants. The direction of association with the disease is not the same across these variants: some are positively associated with D, while others are negatively associated.
- Is it common for the p-values of these SNPs to vary a lot as well?
As with the first question, some of the SNPs are "significantly" associated (p < 0.05), but in my case only 9 of 299 SNPs are significant. I would like to know whether this is common, given that they all lie in the same gene. If it is common, how should we interpret the variation in the phenotype when the gene is "mutated"? I know one possible explanation is that some of the variants are gain-of-function while others are loss-of-function. Are there any journal articles I can cite to support this claim?
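To put the 9/299 figure in context, here is a quick sketch of what would be expected by chance alone. It assumes the 299 tests are independent (which linkage disequilibrium within a gene probably violates) and that no multiple-testing correction was applied; the helper `binom_sf` is my own, not from any library.

```python
from math import comb

n_snps = 299   # SNPs tested in the gene (from my data)
n_sig = 9      # SNPs with p < 0.05 (from my data)
alpha = 0.05   # nominal per-SNP significance level

# Expected number of nominally significant SNPs if none is truly associated.
# Assumption: tests are independent, which LD within a gene likely violates.
expected_null = n_snps * alpha

# Bonferroni-adjusted per-SNP threshold for 299 tests.
bonferroni = alpha / n_snps


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of seeing at least the observed number of hits under the null.
p_at_least_observed = binom_sf(n_sig, n_snps, alpha)

print(f"expected under null:  {expected_null:.2f}")
print(f"Bonferroni threshold: {bonferroni:.2e}")
print(f"P(>= {n_sig} hits | null): {p_at_least_observed:.3f}")
```

Under these assumptions, 9 nominal hits is fewer than the roughly 15 (299 × 0.05) expected by chance, which is part of why I am unsure how much weight the "significant" SNPs deserve.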
- In my case, all the significant SNPs are positively associated with the disease. Is it legitimate to conclude that carrying a missense/pLOF mutation in this gene is positively associated with the disease?
I know this conclusion is quite likely to be flawed. Is there a better way to summarize SNP-level results at the gene level? I can probably claim that a specific SNP is associated with the disease. However, as mentioned above, this does not seem generalizable to the gene level, because many other SNPs in the gene are not significantly associated with the disease.
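For concreteness, the kind of gene-level summary I have in mind is a collapsing ("burden"-style) comparison: mark each individual as a carrier if they have any qualifying variant in the gene, then test carrier status against disease status. The sketch below uses made-up toy genotypes (not my data) and a hand-written one-sided Fisher's exact test; the function name `fisher_one_sided` is hypothetical. Note that collapsing assumes the variants act in the same direction, which is exactly what I am unsure about.

```python
from math import comb

# Hypothetical toy data, NOT real results.
# Rows = individuals, columns = qualifying variants (1 = carries the rare allele).
genotypes = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
disease = [1, 1, 0, 1, 0, 0, 0, 1]  # 1 = case, 0 = control

# Collapse to one indicator per individual: carries any qualifying variant.
carrier = [int(any(row)) for row in genotypes]

# 2x2 table of carrier status by disease status.
a = sum(1 for c, d in zip(carrier, disease) if c == 1 and d == 1)  # carrier cases
b = sum(1 for c, d in zip(carrier, disease) if c == 1 and d == 0)  # carrier controls
c_ = sum(1 for c, d in zip(carrier, disease) if c == 0 and d == 1)  # non-carrier cases
d_ = sum(1 for c, d in zip(carrier, disease) if c == 0 and d == 0)  # non-carrier controls


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value: P(carrier cases >= a) with margins fixed."""
    n = a + b + c + d
    n_carriers = a + b
    n_cases = a + c
    upper = min(n_carriers, n_cases)
    tail = sum(
        comb(n_cases, k) * comb(n - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, n_carriers)


p_gene = fisher_one_sided(a, b, c_, d_)
print(f"table: [[{a}, {b}], [{c_}, {d_}]], one-sided p = {p_gene:.4f}")
```

A real analysis would presumably also need covariate adjustment, which this sketch leaves out.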
- Is there a way to objectively (and legitimately) filter the SNPs to be included in the analysis?
First, is filtering even necessary in a proper, formal data analysis? Given the heterogeneity described above, I would like to know how to perform such a filtering step, if it is legitimate. For SNP arrays, I know that extremely rare variants should probably be filtered out because of inaccurate variant calling. Does whole-exome/whole-genome sequencing require such a step as well?
Thanks in advance!