Question on DE analysis for developing cell populations


2.8 years ago

heir_of_isildur88 ▴ 30

Hi all, I don't know if this has been asked before, and it may seem a bit naive, but I just need to get my head around it. I am currently trying to perform DE analysis on bulk RNA-seq data from a group of sorted cells. I have 4 different cell types at different stages of development, in this order: A --> B --> C --> D.

I am trying to find out which genes are DE when the cells differentiate from one population to the next. At first, I did a DE analysis using edgeR, comparing populations B, C and D against the initial population A. With this method, the numbers of DE genes for each population are as follows (using cutoffs of FDR < 0.01 and Log2FC < -2 or > 2):

B vs A: 920 Down, 501 Up

C vs A: 1198 Down, 560 Up

D vs A: 1549 Down, 1100 Up

Next, I performed DE comparing each population against the one before it, using the same parameters and cutoffs, and my results are as follows:

B vs A: 920 Down, 501 Up

C vs B: 0 Down, 5 Up

D vs C: 15 Down, 178 Up

My question is why there's such a stark difference in the number of DE genes. Logically speaking, assuming the bulk of the DE genes of D vs A are the same as those of C vs A, the overall DE genes of D vs C should be around 300 Down and 500 Up; but that is not what I get when DE is performed comparing D vs C directly. Why is this so? Is my assumption faulty?

Differential Expression • 551 views

updated 2.8 years ago by Gordon Smyth ★ 7.7k • written 2.8 years ago by heir_of_isildur88 ▴ 30

score 0 · Answer 1 · 2022-02-23

Significance tests mean that the evidence for differential expression has to be very strong in order to assess a gene as significantly DE. There's no rule that numbers of DE genes have to add up. For example, if you had three groups and Group2 was intermediate between Group1 and Group3, it would be quite possible to construct a dataset with lots of DE genes for Group3 vs Group1 but none for Group3 vs Group2 or for Group2 vs Group1.
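The three-group scenario can be made concrete with a toy calculation. The sketch below is plain Python, not edgeR. It assumes hypothetical log2 group means (0, 1 and 2) for a single gene, a hypothetical standard error of 0.4 for every pairwise difference, and a simple two-sided z-test at a 0.01 cutoff standing in for edgeR's actual tests and FDR adjustment.

```python
from statistics import NormalDist

# Hypothetical log2 expression means for one gene; Group2 sits midway.
means = {"Group1": 0.0, "Group2": 1.0, "Group3": 2.0}
SE_DIFF = 0.4   # assumed standard error of any pairwise difference
ALPHA = 0.01    # assumed significance cutoff (no multiple-testing adjustment)


def two_sided_p(diff, se=SE_DIFF):
    """Two-sided p-value from a z-test of the difference against zero."""
    z = abs(diff) / se
    return 2 * (1 - NormalDist().cdf(z))


def is_de(reference, group):
    """True if `group` differs significantly from `reference`."""
    return two_sided_p(means[group] - means[reference]) < ALPHA


results = {
    "Group2 vs Group1": is_de("Group1", "Group2"),
    "Group3 vs Group2": is_de("Group2", "Group3"),
    "Group3 vs Group1": is_de("Group1", "Group3"),
}
for comparison, significant in results.items():
    print(comparison, "significant" if significant else "not significant")
```

Under these assumed numbers each single step has z = 2.5 and misses the cutoff, while the two steps combined have z = 5 and pass it easily, so the gene counts as DE only for Group3 vs Group1.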

From the DE counts that you give, it appears that most of the developmental expression changes already occur at B, but the changes continue to widen a bit more at C and a bit more at D. The fact that there are not many DE genes for C vs B or D vs C is quite common. It just means that the smaller changes at the two later stages are mostly not large enough to achieve the significance cutoff.

I think you're making the discrepancy worse by choosing such a large logFC cutoff. I recommend against using logFC cutoffs in any circumstance, but the very large cutoff you are using (log2FC = 2 corresponds to a 4-fold expression change) is to me much too large to be biologically useful and will tend to distort the results. Consider a gene with logFCs equal to 1.8, 0.1 and 0.1 for BvsA, CvsB and DvsC respectively. Such a gene would be DE for D vs A but it won't achieve the cutoff for any of the comparisons in your second analysis.
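The arithmetic behind that example gene can be checked with a short sketch. It is plain Python rather than edgeR, it looks only at the fold-change filter (significance is ignored), and it assumes that a logFC of exactly 2 counts as reaching the cutoff, which is how the example above reads. `Decimal` is used only to keep the sums exact.

```python
from decimal import Decimal
from itertools import accumulate

# |log2FC| cutoff from the question; treating the boundary as inclusive
# is an assumption made so that a logFC of exactly 2 passes.
CUTOFF = Decimal("2")

# Step-wise logFCs for the example gene.
step_logfc = {
    "B vs A": Decimal("1.8"),
    "C vs B": Decimal("0.1"),
    "D vs C": Decimal("0.1"),
}

# logFCs relative to A are the running sums of the step-wise logFCs.
vs_a_logfc = dict(zip(["B vs A", "C vs A", "D vs A"],
                      accumulate(step_logfc.values())))


def passes(logfc):
    """True if the absolute logFC reaches the cutoff."""
    return abs(logfc) >= CUTOFF


passes_vs_a = {name: passes(value) for name, value in vs_a_logfc.items()}
passes_step = {name: passes(value) for name, value in step_logfc.items()}

print("versus A:  ", passes_vs_a)
print("step-wise: ", passes_step)
```

The cumulative logFCs are 1.8, 1.9 and 2.0, so the gene passes the filter only for D vs A, and none of the three step-wise comparisons reaches the cutoff.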

If I were you, I would strongly consider doing an ANOVA-type test to find genes that are differential between any of the developmental stages, and then cluster the DE genes in terms of their trend patterns over the four stages. If you really do want to prioritize genes with large expression changes, then we recommend glmTreat() instead of a logFC cutoff.