DESeq2 - Fewer DE genes with increasing sample size
2.5 years ago
bibrgr • 0

Hello all,

I am using DESeq2 to analyze a barcoded library and have noticed fewer differentially expressed genes as the sample size is increased. For background, this is a bacterial library with barcoded strains. The read counts for each strain under different conditions are used to determine the fitness advantage of the strain.

When I run the analysis on the full dataset (6 unselected-library samples vs. 17 treatment samples), I get 19 differentially expressed genes at padj < 0.05 (strictly, these are strains with different counts, but I would think the principle is similar). I have reason to believe that some genes that rank highly but have padj > 0.05 are false negatives, because I observed those genes in an independent replicate of the experiment. However, if I use only 3 treatment samples, I get 158 DE genes, and the number decreases as more treatment samples are added. I would expect more samples to give greater power to detect changes in counts, so this behavior seems unintuitive to me. Does anyone have an idea what might be going on? I tried disabling outlier replacement and the Cook's distance cutoff, which increases the number of genes detected, but the trend of more samples --> fewer genes remains.

Any help would be much appreciated.

DESeq2 differential-expression • 1.1k views

Have you run a PCA on the samples yet? If you haven't, it's an important QC step for checking whether samples separate as expected, or whether there are underlying problems with the data such as batch effects or sample quality issues.
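To make the PCA suggestion concrete, here is a minimal standard-library sketch of what such a check computes: PC1 scores per sample from log-transformed counts. The toy counts and the function names (`log_transform`, `first_pc_scores`) are made up for illustration; in practice one would probably use DESeq2's own `plotPCA` on variance-stabilized data rather than anything like this.

```python
import math

def log_transform(counts):
    """log2(count + 1) for a samples-by-features table (list of rows)."""
    return [[math.log2(c + 1) for c in row] for row in counts]

def first_pc_scores(samples, n_iter=200):
    """Return the PC1 score of each sample, found by power iteration.

    samples: one row per sample, one column per feature (strain/gene).
    """
    n, p = len(samples), len(samples[0])
    # Center each feature (column) across samples.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Gram matrix X X^T (n x n); its top eigenvector u and eigenvalue lam
    # give the PC1 scores as sqrt(lam) * u.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered]
            for ri in centered]
    # Arbitrary non-constant starting vector (an assumption of this sketch;
    # it must not be orthogonal to the leading eigenvector).
    vec = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            return [0.0] * n
        vec = [x / norm for x in new]
    lam = sum(v * sum(g * w for g, w in zip(row, vec))
              for v, row in zip(vec, gram))
    return [v * math.sqrt(max(lam, 0.0)) for v in vec]

# Hypothetical toy data: 3 samples from each of two conditions, 4 strains.
toy_counts = [
    [100, 80, 5, 0],
    [110, 75, 4, 1],
    [95, 85, 6, 0],
    [10, 5, 90, 120],
    [12, 6, 85, 110],
    [9, 4, 95, 130],
]
scores = first_pc_scores(log_transform(toy_counts))
for i, s in enumerate(scores):
    print(f"sample {i}: PC1 = {s:.2f}")
```

If the two conditions dominate the variance, the first three and last three samples land on opposite sides of zero on PC1. If samples from the same condition split instead, that points to something other than the treatment driving the data.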


Thanks for the suggestion! I'm new to DESeq2 and wasn't aware that that's something you can do. There was an outlier sample that I removed; however, there are still just 24 DE genes.

enter image description here

enter image description here


Most of the variance in your data comes from a separation within groups (the blue and yellow samples separating along the x-axis), so this is probably a strong batch effect. Were the samples that separate here processed on different days, or something similar? This is the issue that needs to be addressed first.


No, these were all from the same run. I checked, and the cluster of samples on the right has a lot of zeroes (<33% of strains detected), which could be due to dropout or library selection; I'm not sure.
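The check described above can be sketched as a per-sample detection rate, i.e. the fraction of strains with at least one read. The sample names, counts, and the helper names `detection_rate` and `flag_low_detection` are hypothetical; the 0.33 threshold simply mirrors the figure quoted above and is not a recommended cutoff.

```python
def detection_rate(sample_counts):
    """Fraction of features (strains) with at least one read in a sample."""
    if not sample_counts:
        return 0.0
    return sum(1 for c in sample_counts if c > 0) / len(sample_counts)

def flag_low_detection(counts_by_sample, threshold=0.33):
    """Return the names of samples whose detection rate is below threshold."""
    return [name for name, counts in counts_by_sample.items()
            if detection_rate(counts) < threshold]

# Hypothetical toy counts: one list of per-strain read counts per sample.
toy = {
    "library_1": [12, 30, 7, 45, 9, 21],
    "treated_1": [0, 55, 0, 0, 0, 0],
    "treated_2": [3, 40, 0, 12, 1, 8],
}

for name, counts in toy.items():
    print(f"{name}: {detection_rate(counts):.2f} of strains detected")
print("low-detection samples:", flag_low_detection(toy))
```

Tabulating this next to the PCA coordinates would show whether PC1 is simply tracking how many strains were detected in each sample.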


The MA plot looks a bit strange to me as well, with most of the DE genes having very high LFCs (I assume from a "gene" with almost all 0's in one condition but not the other).

[Image: MA plot]
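As a rough illustration of why a strain with almost all zeroes in one condition ends up with a very large LFC, here is a naive log2 ratio of group means with a pseudocount. This is not how DESeq2 estimates fold changes (it fits a negative binomial GLM, and `lfcShrink` can moderate such estimates); the function name, the counts, and the pseudocount values are all assumptions made for the sketch.

```python
import math

def naive_log2_fold_change(control, treated, pseudocount=0.5):
    """log2 ratio of group means, with a pseudocount to avoid dividing by zero."""
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

# Hypothetical strain: nearly absent in the library, present after treatment.
control = [0, 0, 0, 1]
treated = [40, 55, 38, 61]

for pc in (0.5, 5.0):
    lfc = naive_log2_fold_change(control, treated, pseudocount=pc)
    print(f"pseudocount {pc}: log2FC = {lfc:.2f}")
```

The estimate depends heavily on the pseudocount when one group is near zero, which is one reason such points sit at the extremes of an MA plot.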


I am not a bacterial omics person, but it might make sense either to account for the zero inflation in the model or to use an approach such as RUVSeq to tackle this unwanted technical variation. I assume it is making the dispersion estimates skyrocket, and in any case it seems to be the source of the PC1 variation.


Okay, I might try that approach. I'm not so concerned about the 6-hour condition, since the differences between the library and the 2-day conditions are what I'm really interested in, but it might be a useful reflection of the variance in the larger dataset.
