Strange MA-plot using DESeq2
0
1
Entering edit mode
8.0 years ago
stan ▴ 80

Hello,

I am using DESeq2 to perform some differential expression analysis between 3 levels of a disease condition. However, I am getting a strange looking MA-plot when i compare some of the conditions. I am trying to understand what could be the reason for the skewness in my MA-plot.

Any help will be greatly appreciated.

I have provided the PCA plot of the different samples as well as one of the MA-plots (comparing LTNP vs VC). The analysis steps I have taken are below:

sampleFiles <- grep("count",list.files(directory),value=TRUE)
sampleCondition<-c(rep("EC", 3), rep("LTNP", 4), rep("VC",4))
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)

dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
rld <- rlog(dds, blind = FALSE)
dds <- DESeq(ddsHTSeq2)
res <- results(dds, contrast=c("condition", "LTNP", "VC"))

Thanks, Stan

MA-plot PCAplot

rna-seq R deseq2 • 4.7k views
ADD COMMENT
1
Entering edit mode

Can you add to your post to explain your preprocessing steps, and a bit more about your samples (species for example). Something's not right if 99% of your variance is explaining the difference between two conditions.

ADD REPLY
0
Entering edit mode

I have added the analysis steps. These are samples from HIV-positive individuals grouped according to viral load

ADD REPLY
0
Entering edit mode

My only guess is that something is going very wrong when it's computing the size factors. Can you post the output of sizeFactors(dds)?

Edit: I should note that this is a long-shot explanation.

ADD REPLY
0
Entering edit mode
>  sizeFactors(dds)
      EC_1.count  EC_2.count  EC_3.count  LTNP_1.count 
4.69209178  4.68776603  4.46989074  5.10407662 

    LTNP_2.count    LTNP_3.count    LTNP_4.count    VC_1.count
4.88086201      3.97763836      6.47169000      0.07196773

    VC_2.count  VC_3.count  VC_4.count 
0.06824966  0.05651823  0.05470625
ADD REPLY
4
Entering edit mode

Holy crap my long-shot explanation has legs! In my experience, whenever you get more than ~10x difference in scale factors you start getting really weird results. Since you have closer to 100x differences, it's quite likely that this is the cause of the problem.

Were the VC samples really sequenced to such a different depth? Can you post the scatter plot of the raw counts of one of the VC samples vs. one of the non-VC samples?

ADD REPLY
0
Entering edit mode

Here is the scatter plot of the raw counts between one VC and LTNP samples.

ADD REPLY
0
Entering edit mode

OK, so there seems to be a HUGE asymmetric difference between VC and LTNP and that's screwing up the size normalization. Do you have ERCC spike-ins? This is one of those rare cases where it'd be better to use them. I should note that the end effect of getting this right will be a very asymmetric change where many many more genes are down-regulated than up-regulated in LTNP vs. VC.

ADD REPLY
1
Entering edit mode

Good point, but also just to be clear, this experiment was performed on the same sequencing run right? They're not random publicly available samples you've pulled down to compare? Also, if you're able, a multiQC report might help to look at.

ADD REPLY
0
Entering edit mode

Well the VC's are very different compared to the rest... that's a clue to solve this mystery :p

ADD REPLY
0
Entering edit mode

99% variance between the groups explained by 1 PC!?!? There's no way that can be real. Did you set the size factors in an atypical manner?

ADD REPLY
0
Entering edit mode

I just followed the steps as outlined in the DESeq2 manual. Could there be a problem with my group "VC" samples? which is causing the 99% variance?

ADD REPLY
0
Entering edit mode

You have 3 samples: how did you set the design variable? I struggled with it many times, and I know it can change a lot both in the plots and in the actual p-values.

EDIT: also, are you using normalized counts? DESeq2 wants raw counts.

ADD REPLY
0
Entering edit mode

I used raw counts generated using HTSeq. I think my design is OK, it's as follows:

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)

I only have one variable "condition", which has three levels (EC, VC, LTNP).

ADD REPLY
0
Entering edit mode

Can you add in what aligner you used, the code for that, and your code for HTSeqCount please.

ADD REPLY
0
Entering edit mode

don't forget that DESeq2 default (as of version 1.14.1) is to plot only the top 500 genes (highest row variance) in the PCA

ADD REPLY
0
Entering edit mode

Is this written somewhere? Can you link this information here for me as well? I never read that, maybe I missed it! Thanks.

ADD REPLY
0
Entering edit mode

From R, look at ?plotPCA after importing DESeq2. It uses the top 500 by default, but this can be adapted using parameter.

ADD REPLY

Login before adding your answer.

Traffic: 1819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6