Answered by Michael Love on bioconductor - Bioconductor link
Hi,
I am using DESeq2 v1.22.2 to test for changes in tumors over time pre and post treatment. However when I use the results function, I get log2FC on the order of +/- 30, which is of course unreasonably large. If I follow up with lfcshrink + ashr shrinkage, these barely change and if I use lfcshrink + normal shrinkage, these effects are shrunken to nearly 0. I understand "ashr" generally performs better than "normal" and that it preserves large LFC, but these LFC are unrealistically large with ashr. I have 2 questions:
- Is this expected behavior or does log2FC on this scale imply there is something fundamentally wrong with my design or samples?
- If this is expected behavior, should I go with "ashr" or "normal"? I am not using apeglm because I need the contrast argument.
More background:
Not all DEGs have such massive LFC - it seems to almost be bimodal with significant genes either having L2FC > 25 or < 5.
There is significant variability in tumor type (breast, colorectal, melanoma etc), but I am hoping this effect is captured by a paired design. Samples generally cluster on PCA based on their patientID, not time. I have at minimum 2 replicates per timepoint, but usually 3 or more. There is also significant variability in RNA quality, which I have binned into "good" or "bad". My design is thus:
design=~PatientID+time+rbin
I then test the effect of time using the contrast argument. I have noticed that the extent of these massive LFC genes is slightly affected by the threshold I use for RIN, but nothing seems to eliminate the effect entirely. I have tried removing rbin, modeling RIN as a continuous variable, removing all tumors with low RIN, SVA (with 1, 2, and 3 SVs), filtering samples based on tumor type, and removing the PatientID variable with no luck.
I apologize if this question has been asked before and am happy to provide more info as needed. Thanks!
-Alex
Edit - more info
colData:
PatientID disease time rbin
PatientA A d1 low
PatientA A d100 high
PatientA A d200 low
PatientB A d1 high
PatientB A d100 low
PatientC A d1 high
PatientC A d100 low
PatientC A d200 low
PatientD B d1 high
PatientD B d100 high
PatientD B d200 low
PatientE A d1 low
PatientE A d100 high
PatientE A d200 low
PatientF C d1 low
PatientF C d200 low
PatientG D d1 high
PatientG D d100 low
PatientG D d200 high
Code:
# Read in metadata
meta <- read.csv("meta.csv")
# Import
txi <- tximport(quantfiles,
type = "salmon",
tx2gene = tx2gene,
importer = read_tsv,
ignoreTxVersion=TRUE,
countsFromAbundance="lengthScaledTPM")
# Create DESeqDataSet
dds <- DESeqDataSetFromTximport(txi = txi,
colData = meta,
design=~PatientID+rbin+time)
# Run DESeq
dds <- DESeq(dds)
# DGE
res <- list()
res$d100vd1_noShrink<-results(dds,contrast=c("time","d100","d1"))
res$d100vd1_normalShrink<-lfcShrink(dds,contrast=c("time","d100","d1"),type="normal")
res$d100vd1_ashrShrink<-lfcShrink(dds,contrast=c("time","d100","d1"),type="ashr")
PCA (color = disease, label = patient):
MA plots with various shrinkages:
Hello Alex. If you want to investigate "time" you should put "time" at right side of design
design=~Patient+rbin+time
(I think Treatment maybe better in your case, with "before" and "after" 2 levels.). Second I don't think add "RIN" as a variable is good idea, if you prepared those samples library together. "PatientID" neither should be a variable, you can set "CancerType" variable refers to different cancer sample. Maybedesign=~CancerType+Treatment
Thank you for your reply, MatthewP! I will move time/treatment to the right. The libraries were prepared together so I will take RIN out of the design. Could you elaborate a bit on why you think PatientID is a bad idea? I based that off the paired design section in the vignette. Here is another example of someone with a similar design using PatientID as a variable:
https://support.bioconductor.org/p/84241/
Before doing things like RIN modeling I would start with the basics. Does the PCA plot look reasonable and does the MA-plot indicate that normalization worked properly? Are the majority of genes centered at y=0 in a MA-plot? I agree with MatthewP that RIN is not a good variable to model, especially not on continuous scale as there is no guarantee for a linear dependency between RIN and gene expression. I personally would rather exclude bad samples rather than having them corrupting the analysis. Either the RNA is unsable or not (at least this is my imagination). Also agree that patient ID should probably not be in there. Unreasonable FCs can also be product of improper contrasts. Please show all code you used.
Can you also please show the colData, so a data frame that lists patient IDs, treatments, times etc... everything that characterizes the experiment, plus the above mentioned plots.
Thank you very much ATpoint! I have edited the post to include colData, code, and plots. I'd really appreciate any additional insight.
Hi Alex, I have been keeping a close eye on this post as I had a similar problem with inflated LFC values.
Using the
apeglm
package worked very well for me. I see on Bioconductor that you wanted to stick to usingashr
as it accepts the contrast argument, but isnt using thecoef
argument withapeglm
the same thing, you just have to specify the index of the contrast fromresultsNames()
?Hi Barry,
Thanks for your comment. Yes you're absolutely right - I thought I couldn't because resultsNames() outputs contrasts only with respect to the reference level and there are a few non-reference contrasts I'd like to make. However I just came across this section from the DGE workflow that describes how I could get around it:
I will try doing this and follow up!