Hello, I am writing an RNA-seq manuscript with a fellow lab member. I did everything except the bioinformatics. We have seven different tissues and three biological replicates for each, for a total of 21 libraries. We are comparing four of the tissues in a specific analysis. For this analysis, she normalized the counts of those 12 libraries (excluding the other 9) and then identified DEGs. I suggested that she instead normalize all 21 libraries together and then identify DEGs by extracting contrasts among the libraries in the specific analysis (the 12 libraries in this case).
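For concreteness, here is a minimal Python sketch of what I mean by the two schemes, using a DESeq-style median-of-ratios calculation (the file name and column labels are made up, and this is not her actual pipeline):

```python
import numpy as np
import pandas as pd

def median_of_ratios_size_factors(counts: pd.DataFrame) -> pd.Series:
    """DESeq-style size factors: per-library median ratio to a per-gene
    geometric-mean reference, computed over genes with no zero counts."""
    positive = counts[(counts > 0).all(axis=1)]
    log_counts = np.log(positive)
    log_geo_mean = log_counts.mean(axis=1)  # per-gene reference
    return np.exp(log_counts.sub(log_geo_mean, axis=0).median(axis=0))

# genes x libraries matrix for all 21 libraries (hypothetical file)
counts = pd.read_csv("all_21_libraries_counts.csv", index_col=0)
# the 12 libraries from the four tissues of interest (naming is assumed)
subset_cols = [c for c in counts.columns if c.split("_")[0] in {"A", "B", "C", "D"}]

sf_all = median_of_ratios_size_factors(counts)                  # all 21 libraries
sf_subset = median_of_ratios_size_factors(counts[subset_cols])  # only the 12

# The same 12 libraries get different size factors under the two schemes,
# which propagates into different DEG lists downstream.
print((sf_all[subset_cols] / sf_subset).describe())
```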
She ran both analyses and found that, when comparing tissue A to tissue B for example, 1,221 genes (79%) were identified as upregulated under both normalization schemes, 257 (16.6%) only under subset-sample-based normalization (i.e., using only the 12 libraries), and 67 (4.3%) only under all-sample-based normalization. She argues that "by normalizing with the samples/libraries of interest, we were able to identify more DEGs, which covered 94.5% of the DEGs identified by all-sample-based normalization. So, if we report the subset-sample-based results, we are missing 5% of DEGs that could be identified otherwise; if we report the all-sample-based results, we are missing 17% of DEGs that would be reported otherwise."
I think the DEGs identified only by subset-sample-based normalization are likely not real DEGs, because those genes are likely expressed in multiple tissues. The tissues we included are not all of the tissues in the plant, but the more tissues we include, the more complete the picture, correct?
Using the number of DEGs as evidence of a method's efficacy is not necessarily a good idea. For example, you get fewer DEGs with log fold change shrinkage, but that is because shrinkage reduces the bias whereby genes with low counts tend to show inflated log fold changes.
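Here is a toy numpy illustration of that bias; it simulates genes with no true differential expression and is not DESeq2's actual shrinkage estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 5000

# No true differential expression: both groups (3 replicates each) are
# drawn from the same Poisson mean, at a low and a high expression level.
for mean in (5, 500):
    a = rng.poisson(mean, size=(n_genes, 3)).mean(axis=1)
    b = rng.poisson(mean, size=(n_genes, 3)).mean(axis=1)
    lfc = np.log2((a + 0.5) / (b + 0.5))  # pseudocount avoids log(0)
    frac = (np.abs(lfc) > 1).mean()
    print(f"mean={mean}: fraction with |log2FC| > 1 purely by noise = {frac:.3f}")
```

Low-count genes cross a |log2FC| > 1 cutoff by noise alone far more often than high-count genes, which is exactly the bias shrinkage is designed to correct.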
As part of estimating the variance of a gene, a component of that variance is calculated using all samples, and variance estimates generally become more accurate the more samples you include. You could be getting fewer DEGs, for example, because more accurate variance estimates are removing false positives.
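A quick simulation of just the sample-size effect (this ignores the information sharing across genes that DESeq2/edgeR add on top of the per-gene estimates):

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-gene variance estimated from n replicates: the estimate itself is
# noisy when n is small, which destabilizes the downstream test statistics.
for n in (3, 12, 21):
    est = rng.normal(0.0, 1.0, size=(10_000, n)).var(axis=1, ddof=1)
    print(f"n={n}: sd of the variance estimate (true variance = 1) = {est.std():.3f}")
```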
The ideal workflow would be to fit the model on all samples, apply log fold change shrinkage before using the DEGs, and set a fold change threshold when making the contrast. This will generally give the highest-quality results. An exception would be if the samples come from vastly different sources, where the counts are expected to differ for most genes.
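If you work in Python, a sketch of that workflow with pydeseq2 could look like the following; the package's interface has shifted between versions, so treat the argument names as approximate and check the current docs (the file, column, and coefficient names here are assumptions):

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# counts: samples x genes; metadata: one row per sample with a "tissue" column
counts = pd.read_csv("counts_samples_by_genes.csv", index_col=0)
metadata = pd.read_csv("sample_metadata.csv", index_col=0)

# Fit the model on all 21 libraries, not just the 12 of interest
dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="tissue")
dds.deseq2()

# Extract the tissue A vs B contrast, testing against a fold change
# threshold instead of filtering afterwards (mirrors DESeq2's lfcThreshold)
stat_res = DeseqStats(
    dds,
    contrast=["tissue", "A", "B"],
    lfc_null=1.0,                # H0: |log2FC| <= 1
    alt_hypothesis="greaterAbs",
)
stat_res.summary()

# Shrink log fold changes before ranking or reporting DEGs
stat_res.lfc_shrink(coeff="tissue_A_vs_B")
results = stat_res.results_df
```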
As a side note, if you haven't done so already, run a PCA just to make sure the samples separate as expected. Another possibility is that outlier samples are reducing your power to detect DEGs with the full dataset.
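A minimal PCA sketch, in case it helps (the file and the tissue_replicate sample naming are assumptions; a variance-stabilizing transform such as DESeq2's VST is preferable to a plain log, but this is the quick version):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# genes x libraries matrix of normalized counts (hypothetical file)
normalized = pd.read_csv("normalized_counts.csv", index_col=0)

# log-transform and transpose so samples are rows
x = np.log2(normalized + 1).T.values
pcs = PCA(n_components=2).fit_transform(x)

tissues = np.array([c.split("_")[0] for c in normalized.columns])
for t in sorted(set(tissues)):
    mask = tissues == t
    plt.scatter(pcs[mask, 0], pcs[mask, 1], label=t)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.show()
```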
Thank you; this is very helpful.
We did a PCA and the samples separated as expected by tissue.