I have a very ignorant question. Let's say SNP X has an allele A with frequencies of 0.52 and 0.002 in populations 1 and 2, respectively. In some papers I have read that people remove SNPs with MAF < 5% in either population when calculating Fst. These values suggest that A is highly differentiated between pop1 and pop2. Indeed, I calculated Fst for SNP X and got a value of ~0.9. But if I apply the MAF > 5% criterion, I would remove this strong signal of population differentiation. This does not make much sense to me. I would very much appreciate some feedback. Thanks!
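To make the filter concrete, here is a minimal sketch of the per-population MAF check described above, applied to SNP X. The function names `maf` and `passes_maf_filter` are hypothetical helpers written for illustration, and the sketch assumes a biallelic SNP, so the minor allele frequency is simply the smaller of the allele frequency and its complement.

```python
def maf(freq):
    """Minor allele frequency of a biallelic SNP, given one allele's frequency."""
    return min(freq, 1.0 - freq)


def passes_maf_filter(freqs, threshold=0.05):
    """Keep the SNP only if its MAF is at least `threshold` in every population."""
    return all(maf(f) >= threshold for f in freqs)


# SNP X from the question: allele A at 0.52 in pop1 and 0.002 in pop2.
snp_x = [0.52, 0.002]
print([maf(f) for f in snp_x])
print(passes_maf_filter(snp_x))  # False: MAF in pop2 is 0.002 < 0.05
```

So the SNP is dropped purely because of its low frequency in pop2, even though its MAF in pop1 is close to 0.5.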
Thanks very much Kevin
You're welcome, my friend!
Hi Kevin,
Given your response, rare variants cannot be considered for population differentiation because they arose recently, yes? However, variants with an allele frequency < 5% are not necessarily rare; they are just not common. By removing variants with AF < 5%, we only assess population differentiation in terms of common variants, yet these variants may not play a significant role in the trait of interest, and populations may differ at low-frequency variants rather than common ones. Could you please correct me wherever I'm wrong and explain a bit more about removing variants with AF < 5% for the Fst calculation, which still does not make sense to me?
In my answer, I just state that the authors noted a difference when calculating Fst for 'low frequency' variants (MAF <= 0.05) versus 'most common' variants (0.45 < MAF <= 0.5). The title of this question is misleading because it implies that everybody should filter out variants with MAF <= 0.05 when calculating Fst.
Common variants can have a big role in disease. It is incorrect to assume that only rare variants contribute to complex disease phenotypes.
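As a rough illustration of comparing Fst across frequency classes instead of discarding variants, here is a minimal sketch. It makes several assumptions that are not from the paper: it uses a simple Nei-style Fst, (Ht - Hs) / Ht, for two equally weighted populations; it bins each SNP by the MAF of the pooled (mean) allele frequency; and the allele frequencies are made-up values. The names `nei_fst`, `fst_by_maf_bin` and `BINS` are hypothetical.

```python
def maf(freq):
    """Minor allele frequency of a biallelic SNP, given one allele's frequency."""
    return min(freq, 1.0 - freq)


def nei_fst(p1, p2):
    """Nei-style Fst = (Ht - Hs) / Ht for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                        # pooled heterozygosity
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-pop heterozygosity
    return 0.0 if ht == 0 else (ht - hs) / ht


# Frequency classes as half-open intervals (low, high] on the pooled MAF.
BINS = {"low": (0.0, 0.05), "most_common": (0.45, 0.5)}


def fst_by_maf_bin(snps, bins):
    """Mean Fst per MAF bin; `snps` is a list of (p1, p2) allele frequencies."""
    grouped = {name: [] for name in bins}
    for p1, p2 in snps:
        pooled_maf = maf((p1 + p2) / 2)
        for name, (low, high) in bins.items():
            if low < pooled_maf <= high:   # monomorphic SNPs (MAF 0) fall in no bin
                grouped[name].append(nei_fst(p1, p2))
    return {name: (sum(v) / len(v) if v else None) for name, v in grouped.items()}


# Made-up allele frequencies, for illustration only.
example_snps = [(0.02, 0.04), (0.50, 0.45), (0.30, 0.10)]
print(fst_by_maf_bin(example_snps, BINS))
```

The estimator and the binning rule both affect the result, so treat this as a way to look at the classes side by side and not as a reproduction of the authors' analysis.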
Thanks a lot for your explanation. So, in your opinion, is it better to calculate Fst separately for low-frequency and common variants rather than removing some variants?
I agree with you about common variants and disease; thanks for correcting me.
In this paper, the authors mention that Fst analysis is not appropriate for detecting genetic risk differentiation among populations, and that the Genetic Risk Variation (GRV) method they developed can overcome the problems with Fst in this situation; they showed its strength for detecting genetic risk differentiation in type 2 diabetes. However, I couldn't find any script/tool to run the GRV method. Could you please share your thoughts on it?
I am not in the best position to advise on that. It would be more a question for a statistician, or at least a bioinformatician who has worked in this area for a number of years. I will say that the literature frequently contradicts itself. Also, the authors' method (GRV) likely will not work in other situations / diseases. You may find more information by looking through CrossValidated / StackExchange.