So we have RNAseq data for 3 treatments (A, B and C) that we compared to control. Each treatment should be compared to control (as in A vs. CT, B vs. CT, C vs. CT) and from then we should know the number of differentially expressed genes (padj < 0.1) and which genes are upregulated and downregulated.
When I run the DESeq2 pipeline with a table containing Control (CT) and the 3 treatments, we found about 1,000 differentially expressed genes (padj < 0.1) but the result CSV spreadsheet only had one log2fold column and padj column, where I was expecting to have 3 column log2fold and padj for each comparison (A vs. CT, B vs. CT and C vs. CT). Since I couldn't extract the log2fold and padj for each comparison, I re-run the DESeq2 pipeline on each treatment vs control separately, and now the number of differentially expressed genes is completely different, I obtained:
- A vs CT : 1000 genes
- B vs CT: 2000 genes
- C vs CT : 2500 genes
In total that's far beyond the original 1,000 genes I had when running DESeq2 with the 3 treatments vs Control in one spreadsheet.
- What causes this difference?
- And now I'm wondering which pipeline is the right one?
- Should I run it independently or all treatments together on one spreadsheet?
- If so, how can I extract the padj and log2 fold changes for each treatment if they are run together??
You'll need to show the code you used in the first vs. one of the second cases for us to help. My guess is that in the first instance you ended up comparing the full model against something like ~1, which isn't what you want. It's completely possible to extract fold-changes and adjusted p-values while keeping everything in. In fact, you'll get more reliable results that way, due to better variance estimation.