Hi, dear friends, I ran into something strange while doing a DESeq2 analysis; maybe I just made a silly mistake. Briefly: for one peak, sample B is much higher than samples A and C (all data normalized by DESeq2), yet DESeq2 missed this significant group. Could anyone please tell me why? Thank you very much!
My command lines are as follows:
library(DESeq2)

# sample table (colData) and raw count matrix
sample_test <- read.table("sample_test.txt", header=T)
counts_test <- read.table("counts_test.txt", header=T, row.names=1)

# drop all-zero peaks and build the dataset, modelling ~condition
dds <- DESeqDataSetFromMatrix(countData=counts_test[rowSums(counts_test)>0,],
                              colData=sample_test, design=~condition)
dds <- DESeq(dds)

# shrink the log2 fold changes for the B vs A comparison
library("apeglm")
res <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
write.table(res, "DESeq2_result.txt", sep="\t", quote=F)

# export the normalized counts
norm <- counts(dds, normalized=T)
write.table(norm, "test_normalize.txt", sep="\t", quote=F)
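As a sanity check, the normalized counts are simply the raw counts divided by each sample's size factor; this minimal sketch (using the dds object above) verifies that:

# per-sample size factors used for normalization
sizeFactors(dds)
# counts(dds, normalized=TRUE) equals raw counts scaled by the size factors
all.equal(counts(dds, normalized=TRUE), t(t(counts(dds)) / sizeFactors(dds)))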
The strange thing happens at Chr6_6392513_6392788_peak. The normalized counts for samples A, B, and C (3 replicates each) are:

Sample A: 178.648023305415, 1.12212517818275, 5.79671671533368
Sample B: 1246.9014826137, 1124.82594062888, 778.761332712122
Sample C: 0, 3.66331655600197, 0
Obviously, B is much higher than A and C.
But in DESeq2_result.txt, which compares B and A, I got the following result for this peak:

baseMean: 371.079881967738
log2FoldChange: 0.0246548548440441
lfcSE: 0.1784530598261
pvalue: 0.026487239202988
padj: 0.347601564343674
The padj is 0.35, which is not significant with the padj threshold set at 0.1.
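If anyone wants to look at this quickly, DESeq2's plotCounts can show this peak's normalized counts per condition (a minimal sketch, assuming the dds object built above):

# plot the peak's normalized counts, grouped by condition
plotCounts(dds, gene="Chr6_6392513_6392788_peak", intgroup="condition")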
The files are:
DESeq2_result.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/DESeq2_result.txt)
counts_test.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/counts_test.txt)
sample_test.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/sample_test.txt)
test_normalize.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/test_normalize.txt)
Thank you for your attention!
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

OK, thank you very much, Ram, I will do this next time.
I don't see any errors, just a "strange" result. Instead of

res <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")

try with something like specifying the comparison as a contrast.

Thank you h.mon, but the contrast argument does not fit with apeglm; I got an error when I tried it.
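For the record, lfcShrink with type="apeglm" only accepts a coef taken from resultsNames(dds); a contrast has to be paired with one of the other shrinkage types. A minimal sketch (assuming the dds object above; type="ashr" requires the ashr package):

# apeglm shrinkage works only with a coefficient name
resultsNames(dds)
res <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")

# a contrast can be used with "ashr" (or "normal") shrinkage instead
res2 <- lfcShrink(dds, contrast=c("condition", "B", "A"), type="ashr")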
My code has no errors, but the result is strange, since DESeq2 does not point out the obvious difference, and I don't know why.
I do not see anything unusual - DESeq2 actually did very well here, i.e., in not identifying this as statistically significant. Just take a look at the numbers that you've got:
Sample A --> not reproducible across your replicates; extremely high variance
Sample B --> more reproducible than A, but still not great; high variance
Sample C --> again, poor reproducibility across your replicates. Also, conducting reliable stats on low counts is difficult.
My human brain looking at this tells me that something's not right with your replicates. So, you should look over the wet lab protocols to see what may have happened.
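To put numbers on that, here is a minimal sketch (assuming the norm matrix above, and that the rows of sample_test are in the same order as the columns of norm) that quantifies the within-group variability:

# normalized counts for the peak, split by condition
x <- norm["Chr6_6392513_6392788_peak", ]
grp <- sample_test$condition

# per-condition mean, standard deviation, and coefficient of variation
tapply(x, grp, mean)
tapply(x, grp, sd)
tapply(x, grp, function(v) sd(v) / mean(v))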
Kevin
Thank you Kevin.
So in this case, what does the padj value of 0.35 mean? Does it mean sample A is the same as sample B, or does it mean DESeq2 simply failed to tell the difference? The first case seems a bit strange, because the smallest B value is much bigger than the biggest value in A and C, so can I really say sample A is the same as sample B? In the second case, is it practical that DESeq2 just excludes the undecidable genes/peaks, so that we can say gene/peak XX is similar to gene/peak XX?
From my perspective, at least, it says that they are not statistically different at a Benjamini-Hochberg adjusted threshold of 5% (0.05). It does not say much about how similar they are (you would have to come up with a different test).
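As an aside, DESeq2 can formally test for the absence of a large effect using lfcThreshold with altHypothesis="lessAbs"; a minimal sketch (the threshold of 1 is an arbitrary choice for illustration):

# test whether |log2 fold change| is less than 1 (an equivalence-style test)
res_sim <- results(dds, contrast=c("condition", "B", "A"),
                   lfcThreshold=1, altHypothesis="lessAbs")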
The question you should probably ask is why your replicates have such different values. Your case is a perfect example of why people should simply not be doing 1 versus 1 or 2 versus 2 comparisons: replicates do not always behave like true replicates.
Yes, my title may be somewhat misleading. You've mentioned that the three replicates of A and C have large variance and that I should look over the wet lab protocols. Actually, I did the experiment three separate times, each time with one A, one B, and one C, so the replicates are true biological replicates. If you have wet lab experience, you will understand that biological replicates have larger variance than technical replicates. Besides, Chr6_6392513_6392788_peak is just a rare case; most regions are consistent between replicates. You've also mentioned the key point that omics data should have at least three replicates. I saw a paper which suggested that we'd better do six replicates per sample.
Great, yes, 6 replicates would be even better. Unfortunately, some papers also state that using just 1 replicate is okay, which, from my perspective, is reckless.
If Chr6_6392513_6392788 is just a rare example among all of your genomic regions, then perhaps your wet lab work is good! You may want to explore what is specific about that region.
I spent long hours in the wet lab but eventually migrated to bioinformatics. I worked in shellfish microbiology, waste-water testing, and cancer circulating free DNA analysis on blood.
We have similar experience: I spent three years doing wet lab experiments, but now I am learning some basic bioinformatics skills because I have to analyse my samples. I work on plants, studying plant genome function. I think you are very good at bioinformatics, because you have a high reputation value, haha.
Cheers
Aifu