I use edgeR to perform DE analysis using the standard protocol. The steps are as follows.
- Alignment using HISAT2
- Count matrix generation using prepDE.py
- DE analysis using edgeR with the LRT
When I perform DE analysis with a count matrix containing only the two sample groups I need to compare, I get a larger number of differentially expressed genes than when I use a count matrix containing a large number of samples and compare the same two groups using the contrast parameter.
I am assuming that the presence of counts from the other groups' samples affects the normalisation and dispersion estimates for the samples in the two groups of interest.
My question is: which DE genes should I trust? The ones I get from the two-group count matrix, or the ones I get from the all-group count matrix?
Hi Devon,
Please let me correct myself. By two samples I meant two groups.
In other words, if I perform DE analysis between the C1 and T1 groups using a count matrix with only C1 and T1, I get more DE genes. If I perform DE analysis between C1 and T1 using a count matrix with C1, T1, C2, T2, C3 and T3, I get fewer DE genes.
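To make the second setup concrete, here is a minimal sketch of what "compare C1 and T1 inside the larger matrix" means in terms of a design matrix and a contrast vector. It is plain Python for illustration only, not edgeR code; the three replicates per group and the helper names `design_matrix` and `contrast` are assumptions, not something taken from my actual data.

```python
# Hypothetical layout: six groups, three replicates each (replicate count is an assumption).
groups = ["C1", "T1", "C2", "T2", "C3", "T3"]
replicates = 3
sample_groups = [g for g in groups for _ in range(replicates)]


def design_matrix(sample_groups, levels):
    """One indicator column per group, no intercept column."""
    return [[1 if g == level else 0 for level in levels] for g in sample_groups]


def contrast(levels, numerator, denominator):
    """Contrast vector encoding numerator minus denominator; other groups get 0."""
    return [1 if lv == numerator else -1 if lv == denominator else 0 for lv in levels]


design = design_matrix(sample_groups, groups)
t1_vs_c1 = contrast(groups, "T1", "C1")
print(t1_vs_c1)  # [-1, 1, 0, 0, 0, 0]
```

The other four groups get a zero in the contrast, so they do not enter the comparison itself, but their samples are still part of the fit.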
That's better then. Your ability to properly assess variance increases with sample number, so in general the design with more groups will be more reliable. My presumption is that in the two-group case, genes with extreme variance aren't being penalized as much.
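A rough way to see why more groups help with variance: the residual degrees of freedom available for estimating within-group variability grow with the number of samples. The sketch below is a simplified illustration in plain Python, not edgeR's actual dispersion estimation (which is more involved); the three replicates per group and the function names are assumptions.

```python
from statistics import variance


def residual_df(group_sizes):
    """Residual degrees of freedom: total samples minus number of groups."""
    return sum(group_sizes) - len(group_sizes)


def pooled_variance(groups):
    """Within-group variance pooled across groups, each weighted by (n - 1)."""
    weighted = sum((len(g) - 1) * variance(g) for g in groups)
    return weighted / residual_df([len(g) for g in groups])


# Assuming three replicates per group:
print(residual_df([3, 3]))   # two groups only -> 4
print(residual_df([3] * 6))  # six groups      -> 12
```

With only two groups there are far fewer degrees of freedom behind each gene's variance estimate, so the estimates are noisier.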
What if the library prep method for C1, T1, C2, T2 is different from the one used for C3, T3, C4, T4? Both are poly(A)-based kits, but from different manufacturers.
Would that lead to variability for technical reasons? In that case, should I make one count matrix for the first four groups and a separate count matrix for the last four groups?
Yes, in that case you're probably better off splitting things by prep kit.
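Splitting by kit could look something like the sketch below: partition the columns of the count matrix by prep kit and analyse each part on its own. This is plain Python for illustration; the kit labels `kitA`/`kitB`, the sample names, the gene names and the counts are all made-up placeholders, and `split_by_kit` is a hypothetical helper.

```python
def split_by_kit(samples, counts, kit_of_sample):
    """Return {kit: (sample_names, {gene: counts})} with columns partitioned by kit."""
    result = {}
    for kit in sorted(set(kit_of_sample.values())):
        idx = [i for i, s in enumerate(samples) if kit_of_sample[s] == kit]
        kept_samples = [samples[i] for i in idx]
        kept_counts = {gene: [row[i] for i in idx] for gene, row in counts.items()}
        result[kit] = (kept_samples, kept_counts)
    return result


# Placeholder data: one column per group here, just to keep the example small.
samples = ["C1", "T1", "C2", "T2", "C3", "T3", "C4", "T4"]
kit_of_sample = {s: "kitA" for s in samples[:4]}
kit_of_sample.update({s: "kitB" for s in samples[4:]})
counts = {
    "geneX": [10, 12, 9, 15, 30, 28, 33, 31],  # made-up counts
    "geneY": [0, 1, 0, 2, 5, 4, 6, 5],         # made-up counts
}

per_kit = split_by_kit(samples, counts, kit_of_sample)
print(per_kit["kitA"][0])  # ['C1', 'T1', 'C2', 'T2']
```

Each of the two resulting matrices would then go through the DE analysis separately.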