Question

DESeq2: different log2FC results when using subset of data

1

Entering edit mode

10.1 years ago

weixiaokuan ▴ 140

Hi,

I have data with 4 levels group: condition1, treat1, condition2, treat2.

I am trying to use DESeq2 to identify differentially expressed genes between two levels: treat1 vs condition1.

I tried two methods. First, I only use subset of all data, condition1 and treat1 to get the differentially expressed genes between treat1 vs condition1. Second, I use all the data including condition1, treat1, condition2 and treat2; but I extract the differentially expressed genes between treat1 vs condition1 by results(dds, contrast("group","treat1","condition1")).

Interestingly, the identified differentially expressed genes are completely different. When I checked size factors for each sample in these two different methods, they are also completely different. Is this an expected behavior for DESeq2 pipeline? If so, which method should I use?

Thank you.

RNA-Seq DESeq2 • 5.8k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by weixiaokuan ▴ 140

2

Entering edit mode

I would actually expect the size factor to be different between the first test and the second. This is mainly because you have more samples to account for the difference (e.g. samples from condition2 and treatment 2), therefore the size factor will differ. Put it this way, if for example, the total number of reads in condition 1 is 10 and treatment 1 is 20, then the size factor can be 2 and 1 or something along that line. However, if then I have condition2 with read number at around 15 and treatment around 25, then the size factor can be 2.5, 1.25, 5/3 and 1 for condition 1, treatment1, condition 2 and treatment 2 respectively. This is a very naive example and the actual size factor calculation can be a bit different, but it gives you the concept of why the size factor can differ. As for the different result part, I suspect it might be something to do with the parameter estimation (e.g. the over dispersion parameter estimation), with more samples, the parameter estimation might be a bit different, therefore leads to the difference in results. However I am not 100% sure.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k

1

Entering edit mode

...the identified differentially expressed genes are completely different.

By what measure? Do they not even rank highly when compared between methods? Are you simply comparing the overlaps at a given cutoff? What if you plot the p-values for all genes as determined by method 1 vs method 2, is there any correlation? One can expect some difference between methods because variance for each gene in the entire data set will be different than in only a partial data set. But the degree of difference may depend on how different your conditions are. Are there huge differences in the data between conditions by other measures?

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by seidel 11k

0

Entering edit mode

Are condition1 and condition2 biological replicates? Likewise, are treat1 and treat2 biological replicates? Or do these represent 4 different groups (with or without biological replicates)?

ADD REPLY • link 10.1 years ago by h.mon 35k

0

Entering edit mode

Friends,

Thank you for your kind answers. I now did realize that this is expected and I did my due diligence search and find some other post discussing the similar questions. So, I'll use all the samples to build the model and extract different comparisons from the contrast matrix.

-X

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by weixiaokuan ▴ 140

score 5 · Accepted Answer · 2015-07-10

5

Entering edit mode

10.1 years ago

Michael Love ★ 2.6k

This is not surprising, because the model estimates (size factors, dispersion, priors, etc) use all samples. I suggest exploring a PCA plot to see the relationships between samples (see vignette).

ADD COMMENT • link 10.1 years ago by Michael Love ★ 2.6k