Entering edit mode
8.0 years ago
stan
▴
80
Hello,
I am using DESeq2 to perform some differential expression analysis between 3 levels of a disease condition. However, I am getting a strange looking MA-plot when i compare some of the conditions. I am trying to understand what could be the reason for the skewness in my MA-plot.
Any help will be greatly appreciated.
I have provided the PCA plot of the different samples as well as one of the MA-plots (comparing LTNP vs VC). The analysis steps I have taken are below:
sampleFiles <- grep("count",list.files(directory),value=TRUE)
sampleCondition<-c(rep("EC", 3), rep("LTNP", 4), rep("VC",4))
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
rld <- rlog(dds, blind = FALSE)
dds <- DESeq(ddsHTSeq2)
res <- results(dds, contrast=c("condition", "LTNP", "VC"))
Thanks, Stan
Can you add to your post to explain your preprocessing steps, and a bit more about your samples (species for example). Something's not right if 99% of your variance is explaining the difference between two conditions.
I have added the analysis steps. These are samples from HIV-positive individuals grouped according to viral load
My only guess is that something is going very wrong when it's computing the size factors. Can you post the output of
sizeFactors(dds)
?Edit: I should note that this is a long-shot explanation.
Holy crap my long-shot explanation has legs! In my experience, whenever you get more than ~10x difference in scale factors you start getting really weird results. Since you have closer to 100x differences, it's quite likely that this is the cause of the problem.
Were the VC samples really sequenced to such a different depth? Can you post the scatter plot of the raw counts of one of the VC samples vs. one of the non-VC samples?
Here is the scatter plot of the raw counts between one VC and LTNP samples.
OK, so there seems to be a HUGE asymmetric difference between VC and LTNP and that's screwing up the size normalization. Do you have ERCC spike-ins? This is one of those rare cases where it'd be better to use them. I should note that the end effect of getting this right will be a very asymmetric change where many many more genes are down-regulated than up-regulated in LTNP vs. VC.
Good point, but also just to be clear, this experiment was performed on the same sequencing run right? They're not random publicly available samples you've pulled down to compare? Also, if you're able, a multiQC report might help to look at.
Well the VC's are very different compared to the rest... that's a clue to solve this mystery :p
99% variance between the groups explained by 1 PC!?!? There's no way that can be real. Did you set the size factors in an atypical manner?
I just followed the steps as outlined in the DESeq2 manual. Could there be a problem with my group "VC" samples? which is causing the 99% variance?
You have 3 samples: how did you set the
design
variable? I struggled with it many times, and I know it can change a lot both in the plots and in the actual p-values.EDIT: also, are you using normalized counts? DESeq2 wants raw counts.
I used raw counts generated using HTSeq. I think my design is OK, it's as follows:
I only have one variable "condition", which has three levels (EC, VC, LTNP).
Can you add in what aligner you used, the code for that, and your code for HTSeqCount please.
don't forget that DESeq2 default (as of version 1.14.1) is to plot only the top 500 genes (highest row variance) in the PCA
Is this written somewhere? Can you link this information here for me as well? I never read that, maybe I missed it! Thanks.
From R, look at
?plotPCA
after importing DESeq2. It uses the top 500 by default, but this can be adapted using parameter.