I have been trying to understand the mechanism by which between-gene correlation inflates the false positive rate in gene set analyses that assume independence among genes. I see this discussed in most articles on gene set analysis, but I have not found a clear description of how it actually happens.
I don't have a strong background in statistics and I am learning on my own. I would appreciate it if someone could break it down with an example!
Let me try to explain it simply. If all genes are independent of each other and their expression levels do not differ systematically across samples, then each gene's expression level should be approximately normally distributed after log transformation (assuming you are using Pearson correlation, for which the counts need to be log-transformed). If you calculate Pearson correlation coefficients (PCCs) between all gene pairs, then, because all genes are independent of each other, the PCCs you obtain should be centered on zero and approximately normally distributed. By chance alone, you should expect about 5% of the PCCs to have a p-value below 0.05. So, if a gene set analysis assumes independence among genes, this is what the analysis expects.
However, if some genes depend on each other, the number of significant PCCs increases. This is because when two genes are strongly dependent on each other, you would expect their PCC to be close to 1 (or -1), and therefore their p-value to be very small. More than 5% of the gene pairs will then have a p-value below 0.05, which is more than the analysis expects. We usually call results with a p-value below 0.05 "positive", so in this case the false positive rate is inflated. Whether those "positives" are really false or not, however, depends on the experiment.
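To make the 5% argument concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: 30 samples, 2000 gene pairs, a correlation of 0.8 for the dependent pairs, and a Fisher z approximation for the PCC p-value rather than the exact t-test. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pcc_p_value(r, n):
    """Two-sided p-value for r via the Fisher z approximation (assumption)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_significant(n_pairs, n_samples, frac_dependent, rho, rng, alpha=0.05):
    """Fraction of simulated gene pairs whose PCC has p < alpha."""
    hits = 0
    for i in range(n_pairs):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        if i < frac_dependent * n_pairs:
            # Dependent pair: y is built to correlate with x at about rho.
            y = [rho * a + math.sqrt(1 - rho ** 2) * e for a, e in zip(x, noise)]
        else:
            # Independent pair: y is pure noise.
            y = noise
        if pcc_p_value(pearson(x, y), n_samples) < alpha:
            hits += 1
    return hits / n_pairs


rng = random.Random(0)
frac_all_independent = fraction_significant(2000, 30, 0.0, 0.8, rng)
frac_some_dependent = fraction_significant(2000, 30, 0.2, 0.8, rng)
print("all pairs independent:", frac_all_independent)
print("20% of pairs dependent:", frac_some_dependent)
```

With all pairs independent, the printed fraction should sit near 0.05. Once a fifth of the pairs are made dependent, it should be clearly higher, which is the excess of "positives" described above.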
Hi,
Thank you for your explanation. As a follow-up, let's consider a gene set with 5 genes. The 5 genes are not associated with the phenotype, so all 5 of them will have similarly low absolute t-scores. If we then take the sum of absolute t-scores as the gene set score (just as an example), the gene set score is going to be low. So, in this case (or similar ones), because the gene set score is low, there is a high chance of getting a similar or higher score by chance, and hence a higher p-value. Right? I don't see how correlation inflates the p-value here.
Please correct me if my understanding is wrong!
Krishna
If you consider the case where the genes are not associated with the phenotype, you would be considering the false negative rate in gene set analysis. So, let's focus on the case where a phenotype is wrongly assigned a low p-value.
Let's say there is a phenotype that involves a lot of genes. Imagine that one of the genes associated with this phenotype is highly correlated with many other genes. The gene set will then involve a lot of genes, because if one gene is differentially expressed, many correlated genes will also be differentially expressed. If a gene set analysis assumes independence among genes, it assumes that the genes in this gene set have no relationship with each other. Since a lot of genes appear to be related to this phenotype, the phenotype will be assigned a lower p-value than it should be. This is because when events are independent you can simply multiply their probabilities, which makes the joint probability of many genes being differentially expressed together look much smaller than it really is when those genes are correlated.
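As a rough sketch of this effect, the 5-gene example from the question can be simulated directly. The set score is the sum of absolute t-scores, none of the genes is associated with the phenotype, and the significance threshold is calibrated on sets of independent genes. The group size (10 per group), the correlation (0.8), the shared-factor model, and all function names are assumptions made only for illustration.

```python
import math
import random
from statistics import mean, variance


def t_score(a, b):
    """Two-sample Student t-score with pooled variance."""
    sp2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))


def null_set_score(n_genes, n_per_group, rho, rng):
    """Sum of |t| for a gene set with no true group difference.

    Genes share one latent factor, so any two genes correlate at about rho.
    """
    n = 2 * n_per_group
    shared = [rng.gauss(0, 1) for _ in range(n)]
    score = 0.0
    for _ in range(n_genes):
        expr = [math.sqrt(rho) * s + math.sqrt(1 - rho) * rng.gauss(0, 1) for s in shared]
        # The first half of the samples is group 1, the second half is group 2.
        score += abs(t_score(expr[:n_per_group], expr[n_per_group:]))
    return score


rng = random.Random(1)
n_sim = 2000

# Null distribution of the set score under the independence assumption.
reference = sorted(null_set_score(5, 10, 0.0, rng) for _ in range(n_sim))
threshold = reference[int(0.95 * n_sim)]  # score needed for p < 0.05


def false_positive_rate(rho):
    """Share of null gene sets whose score exceeds the independence threshold."""
    scores = [null_set_score(5, 10, rho, rng) for _ in range(n_sim)]
    return sum(s > threshold for s in scores) / n_sim


fpr_independent = false_positive_rate(0.0)
fpr_correlated = false_positive_rate(0.8)
print("independent genes:", fpr_independent)
print("correlated genes: ", fpr_correlated)
```

In this sketch the average score of a null gene set is about the same with or without correlation. What changes is the spread: correlated genes tend to have large |t| together or small |t| together, so extreme set scores turn up more often than the independence-based threshold allows for.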