Hello,
I am doing a standard differential analysis on DNase-seq data between two conditions using DESeq2. My results show a huge number of differential regions, with around 80% of peaks having an adjusted p-value < 0.01. Since I was not expecting such a large difference between conditions, I think there might be something wrong with my data.
After searching for similar issues, I have come across posts which mention that this could result from replicates being very similar and not capturing enough biological variability. Looking at IGV tracks for my replicates (3 reps per condition) shows this to be true for my data. My replicates for each condition look extremely similar, almost like technical replicates (the person who did the experiment is not around, and I am getting a bit suspicious about whether they really are biological replicates). When I visually compare the differential peaks in IGV, they are usually areas with very small peaks/low read counts that show small differences between conditions, which probably wouldn’t have been called significant if the replicates were better. Most of my differential analysis results seem to be just noise.
For now I am ignoring the adjusted p-value cutoff (as pretty much everything has extremely low p-values) and using a very high logFC cutoff, which gives better results. However, I was wondering if there is anything else I can do with the DESeq2 or edgeR pipeline to account for this lack of variability between replicates, so that noisy regions are not called significant?
Thanks
Can you show an MA-plot (plotMA function of DESeq2) and say what the samples are? Cell lines? If cell lines, and maybe taken from the same dish for a "triplicate" then what you describe can happen.
Thanks, I have added the imgbb link for the ma plot to the post (I couldn't get it to embed properly for some reason).
The data is from a cell line (mouse embryonic fibroblasts): one wild type and one knockout for a target protein. It is very much possible that the triplicates were taken from the same plate.
Yes, this looks like what I was suspecting: lots of changes at the chromatin level and lots of regions with very low effect sizes. You can use the `lfcThreshold` option in either `results` or `lfcShrink` to test specifically against a minimum effect size (the default is 0). Significant results then have good statistical evidence of an effect size (= logFC) greater than `lfcThreshold`. Doing so removes regions that are unlikely to have biological meaning because the effect size is so small. It is also a good filtering strategy for focusing on regions with large effect sizes. I would recommend using `lfcShrink` for it. What you see is (at least in my hands) very common for cell lines.
Thanks, that helps a lot. I need to read a bit more about the lfcShrink function, as I don't completely understand the idea behind the different types of shrinkage.
However, changing the lfc threshold in results does change my MA plot a lot and gives a much more reasonable number of differential regions. Is there anything I should keep in mind when deciding on a threshold, or should I just try different thresholds and choose one that gives results I consider 'reasonable'?
Generally one uses something that is not overly strict, maybe log2(1.5) given the large number of DE regions in your experiment. The DESeq2 vignette discusses lfcShrink in some detail, and there are many threads at Bioconductor (support.bioconductor.org) as well for background. Yes, `lfcThreshold` does not change the plot itself, just the number of DE genes; `lfcShrink` will change the logFCs, and thereby the plot, through shrinkage. See the manual for details.
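To make the idea of testing against a minimum effect size concrete, here is a small Python sketch of a thresholded Wald-type test, i.e. asking whether |logFC| exceeds a threshold rather than whether it differs from 0. This is only an illustration of the concept and not DESeq2's actual code; the function names, the region labels and the logFC/standard-error values are all made up for the example.

```python
import math

def normal_upper_tail(z):
    # P(Z > z) for a standard normal variable, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def wald_pvalue(lfc, se, threshold=0.0):
    # Two-sided test of |logFC| > threshold (hypothetical helper, not a DESeq2 function).
    # With threshold = 0 this reduces to the usual two-sided Wald test.
    z = (abs(lfc) - threshold) / se
    return min(1.0, 2.0 * normal_upper_tail(z))

threshold = math.log2(1.5)  # roughly 0.585, i.e. a 1.5-fold change

# Invented example regions: (label, log2 fold change, standard error)
regions = [
    ("small effect, tight replicates", 0.30, 0.05),
    ("large effect", 2.00, 0.30),
]

for label, lfc, se in regions:
    p_zero = wald_pvalue(lfc, se)
    p_thresholded = wald_pvalue(lfc, se, threshold)
    print(f"{label}: p (vs 0) = {p_zero:.3g}, p (vs threshold) = {p_thresholded:.3g}")
```

With very similar replicates the standard error is tiny, so even a small logFC is highly significant against 0, but it is no longer significant once the test is made against the threshold. The region with a large logFC stays significant either way.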