I am doing a DGE analysis of a total RNAseq dataset of 2 timepoints (5 reps each). I am particularly interested in looking for changes in expression of 1000 genes.
Currently, I have done the analysis by analysing all genes and then picking out the 1000 genes I am interested in. However, my PI has suggested that I could try running the DGE analysis on just those 1000 genes. In theory, this should give smaller adjusted p-values, since the multiple-testing correction would cover only 1000 tests instead of every gene.
Is this an advisable way of doing the analysis? Since differential expression levels are fit to a negative binomial distribution (in the case of DESeq2), wouldn't this just mean most of the 1000 genes I input would end up not being differentially expressed?
Edit: We arrived at the list of 1000 genes as we were interested particularly in genes coding small proteins. Hence, we searched Uniprot for human proteins with a maximum length of 100 amino acids.
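To make the multiple-testing argument in the question concrete, here is a minimal sketch of how a Benjamini-Hochberg adjustment changes with the number of tests. The raw p-values and the subset positions are made-up illustrative numbers, not results from any dataset, and `bh_adjust` is a hypothetical helper written for this example. It only shows the mechanics and does not settle whether restricting the tests is advisable.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative raw p-values only (not from a real experiment).
raw_all = [0.001, 0.008, 0.02, 0.04, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9]
# Hypothetical positions of the predefined genes of interest.
subset_idx = [1, 3, 6, 9]

# Option A: adjust over every gene, then pick out the genes of interest.
adj_all = bh_adjust(raw_all)
adj_then_subset = [adj_all[i] for i in subset_idx]

# Option B: pick out the genes of interest first, then adjust over them only.
subset_then_adj = bh_adjust([raw_all[i] for i in subset_idx])

for i, a, b in zip(subset_idx, adj_then_subset, subset_then_adj):
    print(f"gene {i}: raw={raw_all[i]:.3f}  "
          f"adjusted over all={a:.3f}  adjusted over subset={b:.3f}")
```

Note that this only touches the adjustment step. In DESeq2 the size factors and the dispersion shrinkage are estimated across genes, so fitting the model on only the 1000 genes would change those estimates as well, not just the correction.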
Thanks for the reply, Papyrus.
The list of 1000 genes was compiled by searching for small proteins. We searched the Uniprot database for human proteins of max. length 100 amino acids. Since we are only interested in small proteins in the analysis, would this be a sufficient reason?
No. You could have done a different type of experiment if you really wanted to just focus on those 1000 genes. However, you chose RNA-seq and you therefore should stick to conventions in RNA-seq.
I would instead run an enrichment analysis on the full DEG results to see whether small-protein genes are over-represented among your differentially expressed genes. More generally, you can use gene-set methods (such as GSEA) to see how specific pathways or gene sets behave in your data.
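As a rough sketch of that enrichment idea, a one-sided hypergeometric (over-representation) test asks whether the small-protein genes turn up among the differentially expressed genes more often than chance would predict. The function name and the toy counts below are assumptions for illustration only. In practice the counts would come from the full DESeq2 results, and a dedicated enrichment tool may be preferable.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: number of genes tested (the background)
    successes:  number of background genes in the set of interest
    draws:      number of differentially expressed genes
    k:          number of DE genes that are also in the set of interest
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return hits / total


# Toy counts, purely illustrative: 10 genes tested, 4 of them in the
# small-protein list, 5 called DE, 3 of those 5 in the list.
p_enrich = hypergeom_upper_tail(k=3, population=10, successes=4, draws=5)
print(f"over-representation p-value: {p_enrich:.4f}")
```

The background here should be the set of genes that were actually tested, not the whole genome, because genes filtered out before testing could never have been called differentially expressed.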