Question

DESeq2 - Fewer DE genes with increasing sample size

3.1 years ago

bibrgr • 0

Hello all,

I am using DESeq2 to analyze a barcoded library and have noticed fewer differentially expressed genes as the sample size is increased. For background, this is a bacterial library with barcoded strains. The read counts for each strain under different conditions are used to determine the fitness advantage of the strain.

When I run the analysis on the full dataset (6 unselected-library samples vs. 17 treatment samples), I get 19 differentially expressed genes (strictly speaking, strains with different counts, but I would think the principle is similar) at padj < 0.05. I have reason to believe that some of the genes that are high on the ranked list but have padj > 0.05 are false negatives, because I observed those genes in an independent replicate of the experiment. However, if I use only 3 treatment samples, I get 158 DE genes, and the number of genes decreases as the number of treatment samples increases. I would think that more samples give greater power to detect changes in counts, so this behavior seems unintuitive to me. Does anyone have an idea what might be going on? I tried disabling the outlier replacement and Cook's cutoff settings, which increases the number of genes detected, but the trend of more samples --> fewer genes remains.

Any help would be much appreciated.

DESeq2 differential-expression • 1.4k views

ADD COMMENT • link 3.1 years ago by bibrgr • 0

Have you run a PCA on the samples yet? If you haven't, it's an important QC step to check whether samples separate as expected, or whether there are underlying problems with the data such as batch effects or sample quality issues.

ADD REPLY • link 3.1 years ago by rpolicastro 13k
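For readers who want to see what the PCA check above is doing, here is a minimal sketch in plain Python. In DESeq2 itself this is usually done with `vst()` followed by `plotPCA()`. The code below is only a stand-in: it uses a log2(count + 1) transform instead of a variance-stabilizing transform, the function name `first_pc_scores` is hypothetical, and the counts are made-up toy data, not data from this thread.

```python
import math

def first_pc_scores(samples, iterations=200):
    """Return unit-scaled PC1 scores (one per sample) via power iteration.

    `samples` is a list of per-sample count vectors of equal length.
    """
    # Stand-in transform (assumption): log2(count + 1), not DESeq2's vst().
    logged = [[math.log2(c + 1) for c in s] for s in samples]
    n, p = len(logged), len(logged[0])
    # Center each strain (feature) across samples.
    means = [sum(row[j] for row in logged) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in logged]
    # Sample-by-sample Gram matrix; its top eigenvector is proportional
    # to the PC1 scores of the samples.
    gram = [[sum(x * y for x, y in zip(a, b)) for b in centered] for a in centered]
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec

# Hypothetical toy counts: two samples with all strains detected,
# two samples where most strains have (almost) no reads.
toy = [
    [100, 50, 80, 20],
    [110, 45, 75, 25],
    [105, 0, 0, 0],
    [95, 0, 0, 1],
]
scores = first_pc_scores(toy)
print(scores)
```

If samples from the same condition land on opposite sides of PC1, as the two pairs do in this toy example, the dominant source of variance is something other than the condition of interest.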

Thanks for the suggestion! I'm new to DESeq2 and wasn't aware that that's something you can do. There was an outlier sample that I removed; however, there are still just 24 DE genes.

[Figure: PCA plot of the samples]

ADD REPLY • link 3.1 years ago by bibrgr • 0

Most of the variance in your data comes from a separation within groups (the blue and yellow samples separating on the x-axis), so this is probably a strong batch effect. Were the samples that separate here processed on different days, or something like that? This is the issue that needs to be addressed first.

ADD REPLY • link 3.1 years ago by ATpoint 88k

No, these were all from the same run. I checked, and the cluster of samples on the right has a lot of zeroes (<33% detected), which could be due to dropout or library selection; I'm not sure.

ADD REPLY • link 3.1 years ago by bibrgr • 0
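The "<33% detected" check mentioned above can be sketched as a per-sample detection rate, that is, the fraction of strains with at least one read. This is only an illustration: the function name `detection_rate`, the sample names, and the counts are all hypothetical, and the 0.33 threshold is taken from the figure quoted in the post.

```python
def detection_rate(counts_by_sample):
    """Return the fraction of strains with a nonzero count for each sample."""
    return {
        sample: sum(1 for c in counts if c > 0) / len(counts)
        for sample, counts in counts_by_sample.items()
    }

# Hypothetical toy counts: one list of per-strain read counts per sample.
toy_counts = {
    "library_1": [120, 85, 40, 230, 15, 60],
    "treated_1": [300, 0, 55, 410, 2, 90],
    "treated_2": [0, 0, 0, 510, 0, 0],
}

rates = detection_rate(toy_counts)
# Flag samples below the 33% detection figure mentioned in the thread.
low_detection = sorted(s for s, r in rates.items() if r < 0.33)
print(rates)
print(low_detection)
```

Listing the flagged samples next to their positions on the PCA plot is a quick way to confirm whether the low-detection samples are the ones driving the separation.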

The MA plot looks a bit strange to me as well, with most of the DE genes having very high LFCs (I assume these come from "genes" with almost all 0's in one condition but not the other).

[Figure: MA plot]

ADD REPLY • link 3.1 years ago by bibrgr • 0

I am not a bacterial omics person, but it might make sense either to include the zero inflation in the model or to use an approach such as RUVSeq to tackle this unwanted technical variation. I assume it makes the dispersion estimates skyrocket, and in any case it seems to be the source of the PC1 variation.

ADD REPLY • link 3.1 years ago by ATpoint 88k
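To illustrate why extra samples can reduce the number of significant hits when they add within-group variance, here is a rough sketch for a single strain. It deliberately swaps in Welch's t-statistic on log2(count + 1) values as a stand-in; it is not DESeq2's negative binomial model or its dispersion estimation, and all counts are invented toy values.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for the difference in means between a and b."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def log2p1(counts):
    """Log-transform counts with a pseudocount of 1."""
    return [math.log2(c + 1) for c in counts]

# Hypothetical counts for one strain (not data from this thread).
library = log2p1([100, 110, 95, 105, 98, 102])
# Three treated samples that agree with each other.
treated_consistent = log2p1([400, 380, 420])
# The same strain after adding more treated samples, some with dropout zeros.
treated_with_dropout = log2p1([400, 380, 420, 0, 0, 410, 0, 390])

t_small = welch_t(treated_consistent, library)
t_large = welch_t(treated_with_dropout, library)
print(t_small, t_large)
```

In this toy case the larger group gives a much weaker test statistic, because the zeros inflate the within-group variance faster than the extra samples shrink the standard error. The same intuition applies to an inflated dispersion estimate, although the actual numbers in DESeq2 would differ.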

Okay, I might try that approach. I'm not so concerned about the 6-hour condition, since the differences between the library and the 2-day conditions are what I'm really interested in, but it might be a useful reflection of the variance in the larger dataset.

ADD REPLY • link 3.1 years ago by bibrgr • 0