Hi all,
I'm using DESeq2 to find differentially expressed genes between two conditions from RNAseq data, with lots of replicates (46 in condition "1", 20 in condition "2").
I get results with significant adjusted p-values, but for most of them the gene expression values are highly variable between replicates.
For example for the gene with the lowest adjusted p-value, I've got all samples from both conditions with low normalized counts (around 10), and just one sample in one condition with >200000 normalized counts, which drives the differential expression toward this condition.
See the log2(normalized counts + 1) boxplot below (the adjusted p-value is 8.05e-12 and the log2FC is -5.87 between conditions "1" and "2" for this gene).
Here is the code I used:
dds <- DESeqDataSetFromTximport(tx_import_data, coldata, ~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds$condition <- relevel(dds$condition, ref = "R")
dds <- DESeq(dds)
res05 <- results(dds, alpha=0.05)
I'm wondering whether it is "normal" that DESeq2 keeps this kind of result and that I should filter it out myself if I find it irrelevant, or whether I made a mistake somewhere in the process and DESeq2 should only keep genes without such expression dispersion between replicates?
Thanks for your help
With only words but no plots illustrating your question it is difficult to make any statements. Please provide e.g. some boxplots of normalized counts or tables.
Ok I've just put a link with a boxplot illustrating my example.
log2 scale please ;-) and see How to add images to a Biostars post. You have to paste the link with the full suffix, like https...foo.png, into the image box.
Done ;) sorry, I never uploaded a plot before.
I would check whether these outlier samples also show outlier-like behaviour in a PCA, which may indicate a batch effect, and if so, think about removing them.
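A rough sketch of such a check in base R, in case it helps. The matrix here is simulated toy data with one artificially inflated sample ("s12"), and the cutoff of 5 MADs on PC1 is an arbitrary assumption, not a recommendation. On real data the matrix could come from counts(dds, normalized = TRUE), or you could go the usual DESeq2 route with vst(dds) followed by plotPCA(vsd, intgroup = "condition").

```r
# Toy normalized-count matrix (genes x samples); simulated, not real data.
set.seed(1)
norm_counts <- matrix(
  rpois(200 * 12, lambda = 50), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("s", 1:12))
)
# Make sample s12 an artificial outlier by inflating all of its counts.
norm_counts[, "s12"] <- norm_counts[, "s12"] * 20

# PCA on log2(normalized counts + 1), with samples as rows.
log_counts <- log2(norm_counts + 1)
pca <- prcomp(t(log_counts), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# Distance of each sample from the median PC1 score, in MAD units.
z <- abs(pc1 - median(pc1)) / mad(pc1)

# Arbitrary cutoff (assumption): more than 5 MADs away on PC1.
outlier_samples <- names(z)[z > 5]
print(outlier_samples)
```

This only looks at PC1. A sample that drives a single gene will usually not stand out here, so a clean PCA does not rule out gene-level outliers.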
Ok thanks, I've checked that and unfortunately they don't seem to be different from the other ones on the PCA.
In my experience, this kind of result typically stems from very high variability among samples of the same group (compared with the variability between groups). You may want to correct for possible covariates in your data (see svaseq) or simply filter out results with high dispersion.
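One simple way to do that kind of filtering, sketched in base R: flag genes where a single sample contributes almost all of the normalized counts, as in the example from the question. The helper name top_sample_share, the toy values and the 0.9 cutoff are all assumptions made for illustration. This is a stand-in for a dispersion-based filter, not something DESeq2 does by itself.

```r
# Hypothetical helper: fraction of a gene's total normalized count that
# comes from its single highest sample.
top_sample_share <- function(norm_counts) {
  totals <- rowSums(norm_counts)
  share <- apply(norm_counts, 1, max) / totals
  share[totals == 0] <- NA  # undefined for all-zero genes
  share
}

# Toy data: geneA mimics the case above (one sample with a huge count),
# geneB is a gene with a consistent difference between groups.
norm_counts <- rbind(
  geneA = c(10, 12, 9, 11, 200000, 8),
  geneB = c(100, 120, 90, 400, 380, 420)
)
colnames(norm_counts) <- paste0("s", 1:6)

share <- top_sample_share(norm_counts)

# Arbitrary cutoff (assumption): one sample holds more than 90% of counts.
flagged <- names(share)[!is.na(share) & share > 0.9]
print(flagged)
```

On real data you could build the matrix with counts(dds, normalized = TRUE) and then drop the flagged gene names from the results table before interpreting it. The cutoff would need tuning for your own data.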