Hello,
I am interested in performing a differential gene expression analysis using RNA-seq data (~300 samples) from ICGC and DESeq2 package. I want to compare two patient groups based on "high" and "low" expression of a gene and, then, to develop a GSEA preranked.
After creating a DESeqDataset object specifying only the condition of interest in the design (design = ~ gene_expression_level
) and applying a variance stabilizing transformation with vts(dds, blind = FALSE)
, I evaluated the presence of batch effects using plotPCA
function. For this, I generated a PCA plot using only my condition of interest and I observed these two groups defined by diagnonal (top left plot). I thougth that these groups could be explained by another variable and I observed that this division was produced by sex (top right plot). However, I think that PC1 and PC2 do not explain well the differences between these two groups and that a diagonal line could account better this situation.
Then, I checked other clinical variables and I observed clearly defined subroups inside each sex group (low left and low right plots). In this context, I have two questions:
1-How should be these PCA plots interpreted taking into account the diagonal pattern?
2-Should I include sex and the other two clinical variables in my design (design = ~ sex + clin1 + clin2 + gene_expression_level
? I think that probably not because it seems that "high" and "low" expression patients are balanced between the 3 variables, but I am not sure and I would appreciate an opinion.
Thank you so much
Thanks for your answer Jared! After trying both options and filtering DEGs by abs(log2FC) > 1.5 and p-value < 0.05, DESeq reported 423 DEGs whith the full design (
design = ~ sex + CLL + MUT + gene_expression_level
) and 406 with the single design (design = ~gene_expression_level
). Both approaches shared 245 DEGs. Maybe differences in results could be explained by the role of other variables.I'd apply the log fold change threshold in the
results
call usinglfcThreshold
rather than post-hoc filtering, as it changes the threshold during testing rather than making your p-values meaningless. 1.5 is a pretty high threshold on a log2 scale, I'd lower it to 1 or 0.58 if you do end up usinglfcThreshold
.