Question

Differential Expression for a predefined one or multiple genes and multiple testing

0

8 months ago

p-radwan.derbala • 0

My question concerns multiple testing in statistics, specifically in the context of DESeq2:

Let's imagine the following scenario:

I ran a differential expression analysis with DESeq2 on all genes (as usual) in a specific tumor type (lung) and identified one biomarker, e.g. ACT8, based on an adjusted p-value < 0.05.
If I then want to confirm that this marker is also differentially expressed in another tumor type (stomach) and run a second DESeq2 analysis, should I use the adjusted p-value for ACT8, or is an unadjusted p-value < 0.05 sufficient?

In other words, if my scientific question is specific, and is based on solid evidence that a marker or set of markers was significant in another tumor type, can the multiple-testing correction be limited to only the gene or set of genes that the question is about?

Deseq2 Multiple-testing Differential-Expression • 1.2k views

updated 7 months ago by Gordon Smyth ★ 8.3k • written 8 months ago by p-radwan.derbala • 0

1

If I understand you correctly, you're asking whether you should correct for multiple testing across all genes, only a subset, or, in the extreme case, only the genes you care about.

Basically, I (not being a statistician at all) think that you should correct across all the genes that went into the analysis. The power of DESeq2 and similar tools comes from sharing information across many genes to estimate variance accurately over the full range of average expression values. Without many genes it could not achieve this power. Cherry-picking afterwards which genes go into the multiple-testing (MT) correction seems inaccurate to me. I think you can filter a bit, for example keeping only protein-coding genes to lower the MT burden somewhat, but selecting a handful of genes seems off to me. Not a very scientific comment, I realize.

8 months ago by ATpoint 89k

score 2 · Accepted Answer · 2024-12-23

This question has been asked many times on the Bioconductor support forum in the context of limma and edgeR analyses, for example:

https://support.bioconductor.org/p/23568/ (16 years ago)
https://support.bioconductor.org/p/23611/ (16 years ago)
https://support.bioconductor.org/p/63166/ (10 years ago)
https://support.bioconductor.org/p/69725/ (9 years ago)

If you are doing a validation analysis where you are only interested in validating differential expression of a limited number of pre-specified genes, then you should still conduct the linear modelling and empirical Bayes analysis on the whole universe of genes, but you only need to apply the multiple-testing adjustment to the genes of interest.

In limma or edgeR, this is easy. You just conduct the full analysis as usual, then subset at the final step when applying multiple testing by

topTable(fit[genesofinterest,])

or

topTags(fit[genesofinterest,])

If there is only one gene of interest, then you would just be looking at the unadjusted p-value.
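
To make the arithmetic concrete, here is a minimal base-R sketch of what subsetting before adjustment does. The p-values are simulated and the gene names (`gene10`, `gene20`, `gene30`) are hypothetical. It uses Benjamini-Hochberg via `p.adjust` and is only an illustration of the adjustment step, not a replacement for the full model fit.

```r
# Simulated unadjusted p-values for a universe of 1000 genes (illustrative only).
set.seed(1)
pvals <- runif(1000)
names(pvals) <- paste0("gene", seq_along(pvals))

# Hypothetical pre-specified genes, given fixed p-values for the example.
genes_of_interest <- c("gene10", "gene20", "gene30")
pvals[genes_of_interest] <- c(0.001, 0.02, 0.04)

# Adjusting over the whole universe, then looking up the genes of interest.
padj_all <- p.adjust(pvals, method = "BH")[genes_of_interest]

# Subsetting first, then adjusting over the three genes only.
padj_subset <- p.adjust(pvals[genes_of_interest], method = "BH")

# With a single gene, the adjusted p-value equals the unadjusted one.
padj_single <- p.adjust(pvals["gene10"], method = "BH")

print(rbind(padj_all, padj_subset))
print(padj_single)
```

With three genes, the subset adjustment works out to 0.003, 0.03 and 0.04, and the single-gene case returns 0.001 unchanged.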

The genes of interest must be pre-specified. Choosing the genes from the same dataset would be double-dipping or cherry-picking.

These statistical principles would also apply to DESeq2, but I don't know enough about DESeq2 to say how to implement it.
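
For what it is worth, a rough sketch of how the same principle might be applied in DESeq2 follows. This is an assumption about one reasonable way to do it, not an official recipe: fit on all genes, take the unadjusted `pvalue` column from `results()`, subset to the pre-specified genes, and re-adjust with `p.adjust`. The data are simulated with `makeExampleDESeqDataSet()` and the gene names are hypothetical placeholders. Note that the `padj` column returned by `results()` is computed over all genes that pass independent filtering, so it is not the subset-adjusted value.

```r
# Sketch only: simulated data stand in for a real validation dataset.
suppressPackageStartupMessages(library(DESeq2))

set.seed(42)
# Simulated counts: 1000 genes, 6 samples, two-level 'condition' factor.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Fit on the whole universe of genes so that dispersion estimation
# can share information across all of them.
dds <- DESeq(dds, quiet = TRUE)
res <- results(dds)

# Hypothetical pre-specified genes, chosen before looking at this dataset.
genes_of_interest <- c("gene1", "gene2", "gene3")

# Subset at the final step, then adjust only the retained unadjusted p-values.
sub <- as.data.frame(res[genes_of_interest, c("log2FoldChange", "pvalue")])
sub$padj_subset <- p.adjust(sub$pvalue, method = "BH")
print(sub)
```

If there is a single gene of interest, `sub$pvalue` is used directly. A `pvalue` of `NA` (for example from outlier flagging or all-zero counts) stays `NA` after adjustment.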