Question

Confusion with fold change cutoff

5.3 years ago

Gene_MMP8 ▴ 240

I am performing microarray data analysis and have a set of DEGs defined by an adjusted p-value < 0.05, which gives a total of 152 DEGs. I am now trying different thresholds for the log fold change cutoff. With a log FC cutoff of 1, I get only 18 DEGs. But if I include all DEGs (without any fold change cutoff), functional analysis returns many more interesting biological process and pathway terms that are highly relevant to the disease. Should I use a fold change cutoff at all? Without one, I wouldn't be able to say which genes are up- or down-regulated.

R RNA-Seq • 3.2k views

updated 5.3 years ago by ATpoint 85k • written 5.3 years ago by Gene_MMP8 ▴ 240

Hello, can you show us your analysis pipeline? The number of DEGs is rather low in your case. You should use a fold change cutoff if you want to perform pathway enrichment. Alternatively, you can try GSEA, which takes all gene expression data as input (no FC cutoff).

5.3 years ago by MatthewP ★ 1.4k

score 7 · Accepted Answer · 2019-08-18

There is no strict rule. The default is a cutoff of zero, meaning that the statistical framework tests against a log fold change of zero as the null hypothesis. The question is whether small fold changes, even when statistically significant, are meaningful in a biological context, i.e. does a fold change of e.g. 1.1 (on the linear scale) have any biological effect?
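To keep the two scales apart, here is a minimal conversion sketch. It assumes the log fold changes are base 2, as is common for microarray output; the function names are made up for illustration.

```python
import math

def linear_to_log2(fold_change):
    """Convert a linear-scale fold change (a ratio) to the log2 scale."""
    return math.log2(fold_change)

def log2_to_linear(log2_fold_change):
    """Convert a log2 fold change back to a linear-scale ratio."""
    return 2.0 ** log2_fold_change

# A few linear fold changes and their log2 equivalents.
for fc in (1.1, 1.5, 2.0):
    print(f"linear FC {fc:.2f} -> log2 FC {linear_to_log2(fc):.3f}")
```

Under that assumption, a log FC cutoff of 1 corresponds to a 2-fold change, while a linear fold change of 1.1 is a log2 FC of only about 0.14.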

Often people filter the results table, e.g. |FC| > 1.5, to focus on what they believe are the biologically meaningful changes. The problem is that significance is also in part a function of the number of replicates. Given comparable replicates, smaller fold changes become significant at larger sample sizes, so at large n even FCs of around 5% can be significant although their biological impact is questionable.

A data-driven alternative is to test specifically against a chosen fold change (a user-defined null hypothesis), which is what e.g. glmTreat from edgeR does. The DESeq2 analogue is, I think, the lfcThreshold parameter of the results function; see the manual. Whether limma offers that for arrays I cannot say. From what I understand, this might be particularly useful if one has plenty of significant genes (thousands) and wants a data-driven way to reduce this number to the (probably) most meaningful candidates. As far as I understand, this approach requires greater statistical power and might not be suited to small sample sizes with modest effects.
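Both points can be illustrated with a simplified normal-approximation sketch. This is not the moderated-statistic machinery that edgeR or DESeq2 actually use, and every number in it (the observed log2 FC, the standard deviation, the group sizes, the threshold) is made up for illustration. For what it is worth, limma does have a `treat` function along the same lines as glmTreat.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used as a large-sample approximation

def standard_error(sd, n_per_group):
    """Standard error of a difference in means between two groups of equal size."""
    return sd * sqrt(2.0 / n_per_group)

def p_value_vs_zero(lfc, se):
    """Two-sided p-value for the usual null hypothesis: log fold change = 0."""
    z = abs(lfc) / se
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z))

def p_value_vs_threshold(lfc, se, threshold):
    """p-value for the null hypothesis |log fold change| <= threshold.

    Sum of two tail probabilities evaluated at the boundary of the null,
    in the spirit of threshold-based tests (normal approximation only).
    """
    upper = (abs(lfc) - threshold) / se
    lower = (abs(lfc) + threshold) / se
    return (1.0 - _STD_NORMAL.cdf(upper)) + (1.0 - _STD_NORMAL.cdf(lower))

observed_lfc = 0.07  # hypothetical: roughly a 5% change on the linear scale
sd = 0.2             # hypothetical per-group standard deviation (log2 scale)
threshold = 0.585    # log2(1.5), i.e. a 1.5-fold change

for n in (3, 30, 300):
    se = standard_error(sd, n)
    print(n, p_value_vs_zero(observed_lfc, se),
          p_value_vs_threshold(observed_lfc, se, threshold))
```

With these made-up numbers, the same small effect is far from significant at n = 3 but clearly significant against zero at n = 300, while the test against the 1.5-fold threshold stays non-significant at every n. With a threshold of zero, the second function reduces to the first.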

In your case, given that you have only a few candidates, I would probably take all 152 genes and proceed with the analysis. Any conclusions you draw from any NGS experiment should (imho) be confirmed by an independent approach anyway, be it further experiments of your own or similar results from published, reasonably related data that you download from NCBI and reanalyze.
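Regarding up- versus down-regulation: the sign of the log fold change gives the direction even without a magnitude cutoff. A minimal sketch, using an entirely hypothetical results table (gene names, values and helper functions are placeholders, not output from any real analysis):

```python
# Hypothetical result rows: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 1.40, 0.001),
    ("geneB", -0.35, 0.020),
    ("geneC", 0.20, 0.040),
    ("geneD", -1.10, 0.300),
]

def select_degs(rows, alpha=0.05, min_abs_lfc=0.0):
    """Keep rows with adjusted p-value < alpha and |log2 FC| >= min_abs_lfc."""
    return [row for row in rows if row[2] < alpha and abs(row[1]) >= min_abs_lfc]

def split_by_direction(rows):
    """Split genes into up- and down-regulated by the sign of the log2 FC."""
    up = [gene for gene, lfc, _ in rows if lfc > 0]
    down = [gene for gene, lfc, _ in rows if lfc < 0]
    return up, down

all_degs = select_degs(results)                       # significance only
strict_degs = select_degs(results, min_abs_lfc=1.0)   # significance plus |log2 FC| >= 1
up, down = split_by_direction(all_degs)
print("up:", up, "down:", down)
```

In this toy table the significance-only selection keeps three genes and still labels each as up or down, whereas the added cutoff of 1 leaves a single gene.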