Question

Statistics on differentially expressed genes

5.7 years ago

elb ▴ 260

Guys, I need feedback on how FDR is calculated in differential gene expression analysis. I often see people calculate the FDR (after the differential expression analysis) on lists of around 12,000 genes. Is this correct, or is it better to select genes before the differential expression analysis to make the FDR more reliable (i.e. to avoid running statistics on uninformative genes that are never expressed but still enter the FDR calculation)?

Thank you in advance

RNA-Seq • 1.8k views

Hello elb, you do not need to bookmark your own question, as you will be notified by email of any posts made on it anyway. Plus, you can always go to https://www.biostars.org/t/myposts/ to look at your posts :-)

5.7 years ago by Ram 44k

score 0 · Answer 1 · 2019-04-03

Can you please elaborate on how you are planning to make the selection? I think most DE tools are designed to take the entire dataset as it is, because that gives a better model fit. You will skew the data if you filter your genes with an arbitrary cutoff. If you are only planning to remove genes with zero counts, that might be okay depending on the tool you are using. Again, please read about the tool's model before making such decisions. I would recommend exploring the data first: look at the mean count distribution, check for outliers and do some QC.

score 0 · Answer 2 · 2019-04-03

Generally I filter for expression before inputting data for DE analysis (usually through DESeq2). I'll use log10(FPKM) or log10(TPM) to generate a density plot, then empirically set the expression cutoff at the trough of the bimodal distribution. Then, depending on the analysis I'm doing, I'll often require that >=50% of treated OR untreated samples pass this cutoff. That way, if a gene goes from unexpressed to expressed it will be included in the analysis, but if a gene goes from 0.01 FPKM to 0.1 FPKM I won't erroneously see a 10-fold increase. This filtering tends to take the ~53,000-gene Ensembl annotation down to ~10,000 genes or fewer depending on the samples, and is completely kosher (there's a discussion in the DESeq2 vignette).

I'm not sure if you had other filtering criteria in mind, but I would be wary. I would do all of the filtering before the differential expression analysis; that way you can't bias yourself by unintentionally removing unchanged genes afterwards and flattering your FDR.
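The filtering rule described in this answer (keep a gene if at least 50% of the treated OR the untreated samples pass the cutoff) could be sketched roughly as follows. This is only an illustration: the gene names, TPM values, the cutoff of 1.0 and the helper `passes_expression_filter` are all made up, and in practice the cutoff would come from the trough of the density plot.

```python
def passes_expression_filter(values, groups, cutoff, min_fraction=0.5):
    """True if at least min_fraction of the samples in ANY one group reach the cutoff."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    return any(
        sum(v >= cutoff for v in vals) / len(vals) >= min_fraction
        for vals in by_group.values()
    )

# Hypothetical sample layout and TPM values, for illustration only.
groups = ["treated", "treated", "treated", "untreated", "untreated", "untreated"]
tpm = {
    "geneA": [12.0, 15.0, 9.0, 0.0, 0.1, 0.0],     # switched on by treatment
    "geneB": [0.01, 0.1, 0.05, 0.02, 0.01, 0.0],   # noise level in both groups
    "geneC": [5.0, 0.2, 0.1, 0.3, 4.0, 0.1],       # only 1 of 3 samples per group passes
}
cutoff = 1.0  # assumed value; read it off the density plot in a real analysis

kept = [gene for gene, values in tpm.items()
        if passes_expression_filter(values, groups, cutoff)]
print(kept)
```

With these toy numbers only geneA survives: it is clearly expressed in one condition, while geneB never leaves the noise floor and geneC passes in too few samples of either group.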

score 0 · Answer 3 · 2019-04-03

5.7 years ago

Ido Tamir 5.2k

Yes, it's advisable to filter for the reason you mentioned (i.e. you gain sensitivity by excluding genes that have no chance of being DE or are not expressed at all in this experiment), and DESeq2, for example, does this automatically. See "Independent filtering of results" in the DESeq2 documentation (overview and theory).

score 0 · Answer 4 · 2019-04-03

Guys, thank you very much for this precious help. My point is simply that the zeros should not affect the FDR. Although this is intuitive, I see people run, for example, a differential gene expression analysis between treated and control using all the genes (so the FDR is calculated on the full-length list) and only then remove the poorly expressed genes by looking at the counts (cpm). The FDR they take and publish then looks as if it belongs to the final filtered list, but no one knows that it comes from the full-length gene list until they re-run the analysis.
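To make the difference concrete, here is a small sketch of the Benjamini-Hochberg adjustment applied to the same p-values in the two orders discussed in this thread: adjusting on the full list and filtering afterwards, versus filtering first and adjusting only the retained genes. All p-values are invented for illustration, and `bh_adjust` is a hand-written helper, not a function from DESeq2 or any other package.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values: 5 expressed genes plus 15 barely expressed genes.
expressed = [0.001, 0.008, 0.012, 0.04, 0.6]
low_expression = [0.65 + 0.02 * k for k in range(15)]
alpha = 0.05

# Order 1: adjust on the full list, then look only at the expressed genes.
full_adjusted = bh_adjust(expressed + low_expression)[:len(expressed)]
n_full = sum(p < alpha for p in full_adjusted)

# Order 2: filter first, then adjust only the expressed genes.
filtered_adjusted = bh_adjust(expressed)
n_filtered = sum(p < alpha for p in filtered_adjusted)

print(n_full, n_filtered)
```

In this toy case the same five genes give one hit when the adjustment is run over all 20 tests and three hits when it is run over the 5 retained tests, so which list the published FDR was computed on is worth stating explicitly.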