11 months ago
yixinzeng
Hello, I've recently encountered a problem. I have a small gene set of interest, consisting of about a dozen genes. I'm interested in exploring differential expression of the genes in this set between two types of samples. Here are my questions:
- Should I perform a full differential expression analysis (for example, using DESeq2) and then filter the results down to the gene set? Or should I first subset the count matrices to the target genes and then proceed with the statistical analysis? (My guess is that testing only the small gene set would give better results than testing the entire transcriptome.)
- If it's the latter, what statistical method should I use? Are workflows like DESeq2 still applicable?
Thank you for your detailed answer! But I'm still worried about multiple testing correction.
Considering the impact of type I error, if I first normalize, then subset my matrices and perform a simple statistical test like a t-test, this would involve far fewer statistical tests than running on the full transcriptome. Is this feasible, and would it be more effective than running on the full transcriptome?
In general, I am only interested in a small subset of all genes. Is it really necessary to run on the full transcriptome? Would this introduce more errors?
If you're going to normalize with something like FPKM or TPM and then test with a Wilcoxon test or t-test, I think it's OK to run on just your gene list of interest. But if you want to use DESeq2, you'd better run it on the whole transcriptome.
DESeq2 relies on genes with a median expression ratio for normalization. It assumes that the abundance of most genes is similar among all samples, so these median genes stand in for housekeeping genes. If you run it on only the gene set of interest, this assumption may not hold.
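To make the normalization point concrete, here is a minimal sketch of the median-of-ratios idea in plain Python. It is a simplified illustration, not DESeq2's actual implementation; the function name `size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch).

    counts: list of genes, each a list of raw counts, one per sample.
    """
    n_samples = len(counts[0])
    # Keep genes with nonzero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Per-gene reference: log of the geometric mean across samples.
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene in sample j to that gene's reference.
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data (hypothetical): sample 2 is sequenced twice as deep as sample 1.
# The last gene is the only one that truly changes (10x up on top of depth).
counts = [
    [10, 20],
    [50, 100],
    [100, 200],
    [30, 60],
    [5, 10],
    [20, 400],
]

full = size_factors(counts)         # all genes
subset = size_factors(counts[-1:])  # only the gene of interest

print("depth ratio from all genes:", full[1] / full[0])
print("depth ratio from subset:   ", subset[1] / subset[0])
```

With all genes, the estimated depth ratio between the two samples is about 2, which matches how the toy data were built. With only the changing gene, the ratio comes out at about 20, so the real expression change is absorbed into the size factor and would be normalized away.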
Thank you very much!
No, the full transcriptome gives you much more power to do the normalisation and dispersion estimation accurately. Also, a t-test is not valid on count data.
But you shouldn't be worried about multiple testing correction. The idea of the code above is that if you have a gene list of 100 genes, you run the full 20,000 tests, but because you only ever look at 100 of them, you apply the multiple testing correction as if you had only ever done 100 tests. This is valid because you throw the other 19,900 tests out without looking at them.
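As a rough illustration of that adjustment step, here is a small Python sketch of a Benjamini-Hochberg correction applied to only the p-values of a gene list. The p-values and gene names are invented, and `bh_adjust` is a hypothetical helper written for this example, not a function from DESeq2. It assumes the gene list was fixed before looking at any results.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from a full-transcriptome run (10 genes here
# instead of 20,000, to keep the example readable).
raw = {
    "g1": 0.001, "g2": 0.004, "g3": 0.03, "g4": 0.2, "g5": 0.5,
    "g6": 0.6, "g7": 0.7, "g8": 0.8, "g9": 0.9, "g10": 0.95,
}
gene_set = ["g1", "g2", "g3"]  # chosen in advance, not from the results

# Adjust over every gene tested, then read off the gene set.
all_genes = list(raw)
adj_all = dict(zip(all_genes, bh_adjust([raw[g] for g in all_genes])))
full_adj = [adj_all[g] for g in gene_set]

# Adjust over the gene set only, discarding the other tests unseen.
subset_adj = bh_adjust([raw[g] for g in gene_set])

for g, a, b in zip(gene_set, full_adj, subset_adj):
    print(g, "adjusted over all genes:", round(a, 4), "over gene set:", round(b, 4))
```

The raw p-values still come from the model fitted on the full transcriptome; only the correction is restricted to the gene list. In R, the same step would use `p.adjust` with `method = "BH"` on the subsetted p-values.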
Thank you very much, I understand completely. Have a nice day! :)