Hello all,
Apologies in advance for my lack of expertise, as I am a student bioinformatician. I have tried to understand other answers, as well as the DESeq2 vignette and publications, and I have managed to run DESeq2 successfully; I just have a problem with choosing the settings for my particular experiment.
At the moment I have RNA-seq data (ENSG IDs and raw counts) for leukaemia patients in 4 different primary cytogenetic groups: inv(16) (n=17), t(8;21) (n=27), MLL (n=29) and Normal (n=19). Unfortunately I only have one tumour sample per patient and no matching normal tissue data. There is a lot of metadata on these patients - they are of different genders/races/ethnicities/FAB categories/protocols and some have other mutations involved. One variable which I am expected to use to form groups is Event status: 'Censored' patients form the control group, and 'Relapsed' patients and patients who 'Failed Induction Therapy' are compared to this group.
I have been working on the inv(16) patients first; within this group there are 5 Censored and 10 Relapsed patients. Two patients were removed as they were extreme outliers on the PCA plot and sample distances heatmap. So here is my code so far:
library(DESeq2)
#Setting up colData object
#Previously identified outliers C10 (4) and D1 (17) removed
ID<-as.matrix(colnames(inv16DF))
Gender<-as.matrix(inv16PD$Gender)
Race<-as.matrix(inv16PD$Race)
Ethnicity<-as.matrix(inv16PD$Ethnicity)
Event<-as.matrix(inv16PD$First.Event)
metadatainv16<-cbind(ID,Gender,Race,Ethnicity,Event)[-c(4,17),]
colnames(metadatainv16)<-c("Patient","Gender","Race","Ethnicity","Event")
#Setting up countData object
dds<-DESeqDataSetFromMatrix(inv16DF[,-c(4,17)], colData=metadatainv16, design=~Event)
dim(dds)
#Removing genes with a row sum of 5 reads or fewer
dds<-dds[rowSums(counts(dds))>5,]
#Running DESeq2
dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)
#Results in order of padj/how many genes < 0.05
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)
resinv16[order(resinv16$padj),]
table(resinv16$padj<0.05)
summary(resinv16)
#Merge with normalised counts and subset padj < 0.05
resdata<-merge(as.data.frame(resinv16), as.data.frame(counts(dds, normalized=TRUE)), by="row.names")
resdata<-resdata[order(resdata$padj),]
names(resdata)[1]<-"Gene"
resdatainv16sig<-subset(resdata, resdata$padj<0.05)
#P value histogram
hist(resdata$pvalue, breaks=40, col="skyblue", xlab="P value", main="Histogram of P values for inv(16)")
#Dispersion plot
plotDispEsts(dds, main="Dispersion Plot inv(16)",ylim=c(1e-6, 1e1))
#Regularised log transformation for heatmap and PCA
rld<-rlogTransformation(dds)
#PCA ntd
ntd<-normTransform(dds)
plotPCA(ntd, intgroup="Event")
#PCA rld
plotPCA(rld, intgroup="Event")
#Sample Distance Heatmap
library(RColorBrewer)
library(pheatmap)
heatmapdat<-assay(rld)
dists<-dist(t(heatmapdat))
distmat<-as.matrix(dists)
colours <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(distmat,
clustering_distance_rows=dists,
clustering_distance_cols=dists,
col=colours,
main="Sample-to-sample distances inv(16)")
#MA Plot
plotMA(resinv16, main="MA-plot for Event with inv(16)", ylim=range(resinv16$log2FoldChange, na.rm=TRUE))
#Count plot for gene with minimum padj
plotCounts(dds, gene=which.min(resinv16$padj), intgroup="Event")
Which gives the following outputs:
> summary(resinv16)
out of 28453 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 9, 0.032%
LFC < 0 (down) : 28, 0.098%
outliers [1] : 680, 2.4%
low counts [2] : 7463, 26%
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> resinv16[order(resinv16$padj),]
log2 fold change (MAP): Event Relapse vs Censored
Wald test p-value: Event Relapse vs Censored
DataFrame with 28548 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000159189 129.04193 -2.959093 0.6150973 -4.810772 1.503483e-06 0.007669642
ENSG00000160791 89.11385 -2.126363 0.4185705 -5.080059 3.773175e-07 0.007669642
ENSG00000232422 59.67893 -2.824722 0.5868172 -4.813633 1.482112e-06 0.007669642
ENSG00000264063 40.50715 -2.257767 0.4602950 -4.905043 9.340686e-07 0.007669642
ENSG00000234699 13.63526 -2.632995 0.5546159 -4.747420 2.060274e-06 0.008407977
... ... ... ... ... ... ...
ENSG00000267778 0.8509069 -0.1971960 0.4453060 -0.4428325 0.6578869 NA
ENSG00000267785 0.6669036 -0.4189023 0.4848863 -0.8639187 0.3876326 NA
ENSG00000267791 0.6508162 0.0626245 0.4256238 0.1471358 0.8830248 NA
ENSG00000267793 4.9778852 -0.6359067 0.5809850 -1.0945320 0.2737217 NA
ENSG00000267799 4.6509096 0.7252028 0.5571731 1.3015754 0.1930616 NA
and these graphical outputs:
https://postimg.cc/gallery/2qhrndj0u/
Despite trying my best to learn how to optimise the settings, I have a few questions, as my results seem to be quite poor.
1: I am unsure how to tell whether variables (gender/race/ethnicity/etc.) significantly contribute to my results. I could possibly include them in design= to account for variation here, but I can't see any clear connection between these variables and how the patients cluster or appear in the PCA plots.
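For context, this is the kind of check I have been considering (only a sketch, reusing the dds and rld objects from my code above; ddsLRT and resLRT are just names I am using here). The PCA calls colour the samples by each covariate in turn, and the likelihood-ratio test asks whether Gender explains variation beyond Event:
#Colour the PCA by each covariate in turn to look for structure
plotPCA(rld, intgroup="Gender")
plotPCA(rld, intgroup="Race")
plotPCA(rld, intgroup="Ethnicity")
#Likelihood-ratio test: full model with Gender vs reduced model without it
ddsLRT<-DESeqDataSetFromMatrix(inv16DF[,-c(4,17)], colData=metadatainv16, design=~Event+Gender)
ddsLRT<-DESeq(ddsLRT, test="LRT", reduced=~Event)
resLRT<-results(ddsLRT)
#small adjusted p-values here would suggest Gender contributes real variation
summary(resLRT)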
2: The patients don't cluster at all, which may be related to my first point, and I think this is a major problem when trying to find DEGs; however, I am still expected to group patients according to event status despite highlighting this issue. I've tried many combinations of samples and I've tried to edit the design, e.g. design=~Gender+Race+Ethnicity+Event, or different combinations of this. The PCA plot and the clustering within the sample distances heatmap do not seem to improve upon choosing only individuals of the same race and ethnicity. One thought I had was to manually pick out samples from the PCA plot which look similar, e.g. 3 Censored patients and 3 Relapsed patients, although I know this has negative implications for the power of the analysis and would probably be majorly biased. It did produce more DEGs and better clustering, e.g.:
> summary(resinv16)
out of 25289 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 3337, 13%
LFC < 0 (down) : 3239, 13%
outliers [1] : 256, 1%
low counts [2] : 4903, 19%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
and produce the following graphical outputs:
https://postimg.cc/gallery/3al8ck7om/
I don't know how to approach this analysis. Could anyone give me advice on this? Are the patients in the Censored/Relapsed/Induction Failure groups considered biological replicates?
3: I have tried minReplicatesForReplace=Inf and cooksCutoff=FALSE when I observed hundreds and thousands of outliers - is this correct? (A sketch of how I applied these settings is included after the boxplot below.) I have also made boxplots with these settings and found no other major outlying samples, so I assume I don't need to remove samples because of this. This is the code:
par(mar=c(8,5,2,2))
boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2, main="Cook's Distances inv(16)")
and boxplot:
https://postimg.cc/image/5g72qg1px/
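For completeness, this is how I applied those settings (only a sketch; resNoCooks is just a name I am using here):
#Turn off outlier replacement when fitting...
dds<-DESeq(dds, minReplicatesForReplace=Inf)
#...and turn off outlier flagging when extracting results
resNoCooks<-results(dds, cooksCutoff=FALSE, independentFiltering=TRUE, alpha=0.05)
summary(resNoCooks)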
4: Are the settings appropriate for this analysis and number of samples?
I apologise once again for any redundant questions I am asking and a massive thanks in advance for any help given!
Hi Kevin,
Your advice has been incredibly helpful - thank you for not being harsh with your criticism! (I expected to be slated...).
So just for clarification: you're suggesting that I completely remove my code above and use the 3 lines of code you mentioned (which include the lfcShrink function) if I am obtaining lists of DEGs to use for gene set enrichment, etc.? Do I also use this method before plotting the dispersion and MA plots, or before transforming the data (rlogTransformation(dds)) and plotting the PCA and sample distances heatmap? Apologies if I'm not totally following you there - I hadn't even considered lfcShrink beforehand.
I've tried your 3 lines of code and then used both blind=TRUE and blind=FALSE while plotting my rlogTransformation(dds) data, and didn't notice much difference in how the samples appeared, so I don't think there is reason to include an additional variable (apart from Event and perhaps Gender) in my design formula.
In terms of increasing the threshold for low counts, I increased it from a row sum of 5 (leaving 28548 genes) to a row sum of 50 (leaving 22123 genes), which decreased outliers from 680 to 410; this might have been a high threshold. However, I did find posts such as https://support.bioconductor.org/p/79855/ which make me think it isn't really necessary?
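For reference, this is roughly what I ran for that comparison (only a sketch; rldBlind and rldDesign are just names I am using here):
#Stricter pre-filter: keep genes with a row sum above 50, then re-run DESeq
dds<-dds[rowSums(counts(dds))>50,]
dds<-DESeq(dds)
#rlog with and without knowledge of the design, then compare the PCA plots
rldBlind<-rlogTransformation(dds, blind=TRUE)
rldDesign<-rlogTransformation(dds, blind=FALSE)
plotPCA(rldBlind, intgroup="Event")
plotPCA(rldDesign, intgroup="Event")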
Thanks so much for your time, it really is appreciated!
Lindsay
Great - I'm glad.
Yes, I would just use the lines that I provided, i.e., the ones that include lfcShrink(). The betaPrior parameter's default should be betaPrior=FALSE for the DESeq() function itself.
The dispersion and MA plots are based on the normalised counts prior to any transformation, so they will always remain the same irrespective of the design formula.
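As a rough sketch of the pattern I mean (not necessarily verbatim what I posted above; the coefficient name is illustrative and should be checked with resultsNames(dds)):
dds<-DESeq(dds)  #betaPrior defaults to FALSE here
resultsNames(dds)  #check the exact name of the Event coefficient
res<-results(dds, name="Event_Relapse_vs_Censored", alpha=0.05)
#shrink the log2 fold changes for ranking and downstream use
resShrunk<-lfcShrink(dds, coef="Event_Relapse_vs_Censored", type="normal")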
If you have heard otherwise regarding the filtering of low counts, in particular if it has come from Michael Love (DESeq 'Mastermind'), then I would always regard what he says as the absolute truth.
@Kevin Blighe: Why are you so awesome! Thank you very much!
Just doing my best and working hard to help others - that's all that can be asked of us.
Note that the blind = FALSE procedure is not 'fool proof' and will likely not get rid of any major batch effect in your data. If you need to directly modify the transformed expression levels for downstream analyses, then you should use removeBatchEffect(), which I outline here: A: Trying to understand the maths behind one Limma function
@Kevin Blighe: Just asking a somewhat naive question then. Apart from batch effects, when I use DESeq2 to model several other covariates (like age and sex) and do a differential expression analysis, are those transformed expression values usable to make a boxplot for a particular gene of interest? Or does one need to do further processing, like using removeBatchEffect(), to use the transformed expression values produced by DESeq2?
Hello again,
If, for example, we have a model formula like ~ age^2 + sex + condition, then, if we obtain test statistics for the condition term, these test statistics will be adjusted for age^2 and sex. Likewise, if we wanted to perform a DE analysis on sex, the test stats would be adjusted for condition and age^2.
If you then perform rlog (or vst) with blind = TRUE (the transformation is blind to the formula that you used), the transformation can be regarded as being performed on the normalised counts 'as is'. For QC purposes, box-and-whisker plots, scatter plots, heatmaps, PCA, etc., are then typically performed on these transformed expression levels.
If you want to use your data for other things, like k-means clustering, downstream regression modelling, etc., then transform via blind = FALSE (the transformation sees and utilises the formula that you used), which can sometimes act as a soft form of adjustment for variables in your design. Ultimately, what it's doing is controlling for dispersion. Obviously, if it completely adjusted the data, then we'd eliminate the condition effect too, which we don't want. I checked the vignette again and it is explained there (under the heading 'Blind dispersion estimation'): Analyzing RNA-seq data with DESeq2
Each dataset is also different, so there's no way to provide a 'catch all' answer. I would be checking PCA bi-plots with and without blind = TRUE|FALSE, and then, if a large batch effect exists, that can be removed via removeBatchEffect() performed on the vst or rlog counts. In certain other cases, it may even be more appropriate to stratify the entire experiment and perform analyses separately, e.g., if the study comprised data from completely disparate tissues with different transcriptional programs.
For things like age and sex, though, I'd say that simply using blind = FALSE is fine in terms of producing data for downstream box-and-whisker / scatter plots.
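A quick sketch of that idea ('Batch' here is a hypothetical column in your colData - your current metadata has no such column - and limma must be installed):
library(limma)
#transform first (aware of the design), then correct the matrix directly;
#the corrected matrix is for downstream plots/clustering only, not for the DESeq2 testing itself
vsd<-vst(dds, blind=FALSE)
mat<-assay(vsd)
#the design argument protects the Event effect from being removed along with the batch term
mat<-removeBatchEffect(mat, batch=vsd$Batch, design=model.matrix(~Event, data=as.data.frame(colData(vsd))))
assay(vsd)<-mat
plotPCA(vsd, intgroup="Event")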
Thank you, Kevin! I immensely appreciate your detailed explanation! Also, I found in the Vignette:
Dear Kevin, one small question: does your package "PCAtools" allow taking the DESeq2 object directly to make PCA plots, or does one need some intermediate steps? I find your package fascinating and want to use it, which is why I am wondering.
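In the meantime, this is roughly what I have been trying - just a sketch, passing a transformed matrix plus the sample metadata rather than the dds object itself, so I am not sure it is the intended usage:
library(PCAtools)
vsd<-vst(dds, blind=TRUE)
#note: the rownames of the metadata must match the column names of the matrix
p<-pca(assay(vsd), metadata=as.data.frame(colData(vsd)), removeVar=0.1)
biplot(p, colby="Event")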