Hi all,
I have recently performed some DE analysis using DESeq2. I have noticed that my top hits are enriched for genes with a relatively low read count across many samples.
In some cases, a significant gene will have 0 counts in half of my controls with no single control having a read count > 10 whereas the cases will have reads ranging from 0 to 120 (with approx 20% having 0 counts).
A factor to consider is that I am working with fibroblasts which have been obtained from different biopsy sites. Specifically, the control samples were taken from the ear whereas the cases were taken from the limbs. This is an unavoidable confounder that I will just need to acknowledge as a limitation. However, it makes me wary of genes which have very low read counts in either cases or controls.
I am hesitant to apply low-count filtering myself, as my understanding is that DESeq2 already accounts for this. However, I know it is an option, albeit one with no gold standard as far as I can tell.
Would you have any advice?
No, it doesn't. What it does is remove outliers and apply independent filtering, which removes genes with low power. That is not strictly the same as low-count filtering. Anyway, I personally think that it is always a good idea to prefilter the counts. See the DESeq2 vignette on suggested prefilters, or use filterByExpr from edgeR, which works very reasonably. Prefiltering and independent filtering can go together; they're not at all mutually exclusive.
Hi ATpoint,
Many thanks for the clarification. That makes a lot of sense.
I have tried some pre-filtering now, and most of the genes I was worried about have not been filtered out. I think I would need to set the filtering threshold quite high to remove those genes and I don't think this would be a good idea.
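To make the threshold point concrete, here is a minimal sketch in plain Python (not DESeq2 or edgeR code). The counts are invented for illustration and only loosely mimic the pattern described above, the group sizes are hypothetical, and passes_prefilter is a made-up helper implementing a generic "at least min_samples samples with count >= min_count" rule of the kind commonly used for prefiltering.

```python
def passes_prefilter(counts, min_count=10, min_samples=3):
    """Return True if at least `min_samples` samples have a count >= `min_count`."""
    return sum(c >= min_count for c in counts) >= min_samples


# Hypothetical counts for a single gene (invented, not real data).
controls = [0, 0, 0, 0, 0, 3, 5, 7, 8, 9]        # half are zero, none reach 10
cases = [0, 0, 4, 12, 25, 37, 48, 60, 95, 120]   # 20% zero, range 0 to 120

gene = controls + cases

# A lenient rule keeps the gene, because 7 of the cases reach 10 counts.
print(passes_prefilter(gene, min_samples=3))    # True

# Only a much stricter rule (here: 10 samples) would remove it.
print(passes_prefilter(gene, min_samples=10))   # False
```

With a rule like this, a gene that is near zero in one group but clearly expressed in a good share of the other group passes unless min_samples is pushed close to the full group size, which would also discard many genes that are genuinely expressed in only one condition.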
Would you consider the case I describe above an issue? I suppose I could be looking at a real biological difference. My only concern is that although all of my samples are primary fibroblasts, the cases and controls have been sourced from different parts of the body, so that is a central confounder I cannot avoid.
I was wondering if the differences in low-count genes could be linked to that (i.e. a gene with very low counts in one group but not the other could simply reflect that it is not highly expressed in the part of the skin that group was sourced from).