Hi all,
My understanding is that DESeq2 should by default normalize for differences in the number of reads. However, I am concerned that the magnitude of this difference is a bit extreme and I wonder if a different and/or additional normalization is needed to account for this. I accounted for this in the design, but I still have such a high number of genes that are significant (~900 with padj < 0.01 out of ~3000 genes).
dds = DESeqDataSetFromMatrix(countData=Counts,
colData=metadata,
design=~Batch + HTSeq_assigned_reads_M + Treatment)
For reference, below is a plot that shows the number of input reads for each sample. The darker green is treatment A, the lighter green is treatment B.
Thank you in advance for any guidance!
For the life of me I cannot get the image to imbed, so here are two links (I was hoping I would have better luck using a different hosting site):
https://imgur.com/a/PFoHRj4
https://ibb.co/TwkKvT4
Not a statistician but I think putting read number as a continuous variable makes no sense as the effect of read depth should be normalized out after the initial normalization step. It is indeed not good to have such different read depths but this is probably nothing you want in the design. The ticky thing here is to distinguish between 0 counts due to biology and due to dropouts caused by low sequencing depth. You could check the percentage of genes with zero counts for each sample and then see if the light green samples are enriched for this. If not or only modestly then this might be ok and you probably can analyse as usual without depth in the design. If you have zero inflation would it be possible to re-sequence these samples to get more depth for the light-green ones? YOu can of course try to correct for things in silico but the best way to account for this would be to actually correct for it in the lab.
I may not have worded my original post well, but I agree that the reads should already be normalized by default. However, I guess I worry that the differences in sequencing depth is beyond the range of what is customary. So perhaps there would be an additional step in normalization? I understand that including this in the design is not customary, but my sense is that the imbalanced nature between sample sizes and sequencing depth is also abnormal. By adding this to the design I reduced the number of significant genes by ~100, but as I mentioned it still seems high.
I wholly agree with your point that the best thing to do would be to sequence further as that would definitely solve my problem! Unfortunately at this point though I do not think that is possible...