Hi,
I am running an enrichment analysis on 3000 differentially expressed (DE) genes (mouse). I have successfully taken a DE gene set from DESeq2. I get two opposing pwf plots if I plot them separately for up- and downregulated genes. It does not matter which background I use, but for argument's sake I used the following code to get the gene set for upregulated genes.
genes <- rownames(subset(deseq2_result, padj <0.05 & log2FoldChange>0))
background <- rownames(subset(deseq2_result, padj >0.05))
I then proceeded to generate the data frame expected by GOSeq, i.e. DE genes being 1 and background genes 0. Interestingly, I get a very unusual pwf plot. The pwf plot for upregulated genes (log2FoldChange > 0) is similar to the one in the vignette, with long genes being more differentially expressed.
However, the plot for significantly downregulated genes is inverted: a high proportion of short genes are DE and a low proportion of long genes are DE.
If I plot all DE genes together, no sensible line can be drawn, as the bins cancel each other out (high scatter).
Any ideas why this might be?
Jakub
PS: apologies if I forgot some important background data in my first post
I think your background genes should include all the genes from your deseq2 results (just rownames(deseq2_result)). You are selecting only those with padj above 0.05 as background.
Many thanks.
Indeed, there are a number of ways to obtain the background (my preferred one is using "matchit"). Changing the background to the full list of genes does not make a difference to the result.
For completeness: the background has to be non-overlapping with the DE genes, so the background needs to be all genes except those in the "genes" set.
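A minimal sketch of that non-overlapping split, in base R only. The toy deseq2_result table and its values are made up for illustration (they are not my real data), and gene_vector is just a name I picked. The result is a named 0/1 vector (1 = DE, 0 = background), which is the kind of input goseq's nullp() takes.

```r
# Toy stand-in for a DESeq2 results table (hypothetical values, not real data).
deseq2_result <- data.frame(
  padj           = c(0.01, 0.20, 0.03, NA, 0.60),
  log2FoldChange = c(1.5, 0.2, -2.0, 0.1, -0.3),
  row.names      = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# Upregulated DE genes; subset() drops rows where the condition evaluates to NA.
genes <- rownames(subset(deseq2_result, padj < 0.05 & log2FoldChange > 0))

# Background: every gene in the table that is not in the DE set,
# so the two sets cannot overlap.
background <- setdiff(rownames(deseq2_result), genes)

# Named 0/1 vector: 1 = DE, 0 = background.
gene_vector <- c(rep(1L, length(genes)), rep(0L, length(background)))
names(gene_vector) <- c(genes, background)

print(gene_vector)
```

For the downregulated set, only the sign of the log2FoldChange condition changes; the setdiff() step stays the same. Note that with this split the significantly downregulated genes end up in the background of the upregulated analysis (and vice versa).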