Hi, I am trying to run PCA on my samples for initial quality control. I have 2 different sets of samples - one set was sequenced at 50 bp and the other at 75 bp (both sets contain disease and control cases). To do the PCA on those samples, I ran DESeq2 on them (which necessarily requires non-normalized counts), followed by vst and plotPCA. But in my PCA plot, I see two clusters - one for the 50 bp samples and the other for the 75 bp samples. This is not expected, since there is nothing different between the samples except for read length. Someone said I should normalize my data, but I think that will be taken care of by DESeq2, which anyway shouldn't be fed normalized counts. Here is the plot.
Any suggestions?
Here is the image link in case it doesn't show - https://imgur.com/CFXHky4
Here is my code. Here P and NP mean Pain and Non-Pain, which is the effect I am studying. Some files are 50 bp, others are 75 bp, and I have not provided that info to DESeq2 -
library(Rsubread)  # provides featureCounts
counts_all_fullGTF <- featureCounts(nthreads=3, isGTFAnnotationFile=TRUE, annot.ext='/Volumes/bam/DRG/annotations/Homo_sapiens.GRCh38.95.gtf', files=c('/Volumes/.../47T7L.fastqAligned.sortedByCoord.out.bam','/Volumes/.../47T7R.fastqAligned.sortedByCoord.out.bam'))$counts
sampleTable_all <- data.frame(condition=factor(c('P','NP','P','P','P','P','NP','P','P','NP','P','P','P','P','P','P','NP','P','P','NP','P','P','P','P','P','NP','NP','NP','NP','P','P','P','P','P')))
coldata <- sampleTable_all
library(DESeq2)
deseqdata_fullGTF <- DESeqDataSetFromMatrix(countData=counts_all_fullGTF, colData=coldata, design=~condition)
dds_fullGTF <- DESeq(deseqdata_fullGTF)
vsd_fullGTF <- vst(dds_fullGTF)
library(ggrepel)
plotPCA(vsd_fullGTF, intgroup="condition", ntop=5000) + geom_label_repel(aes(label=substr(name, start = 1, stop = 6)), colour = "white", fontface = "bold")
Thanks Kevin! I have provided my code now. I haven't mentioned anywhere in the code which samples are 50 bp and which are 75 bp. I know which files are which based on filename, and that's how I identified the clusters in the PCA plot. My main doubt is whether I should somehow separately normalize my counts (to correct for differences between the 2 batches), or whether it's fine to assume that DESeq2 will take care of that, even if I don't let it know that the samples are coming from separate batches. My main issue with normalizing counts (let's say by calculating the TPM values for each sample and feeding those to DESeq2) is that DESeq2 prefers raw, un-normalized counts.
BTW, I used STAR for alignment and featureCounts for counting (default settings for both)
Generally, the preferred procedure is to include the confounding variable (here, it is `ReadLength`) in the design formula. Then, when the program derives the test statistics for, e.g., `condition`, it will automatically be adjusting for the effect of `ReadLength`, based on the fact that `ReadLength` is included in the formula. This is how adjustment for covariates works in regression models.

Looking at your code, DESeq2 will already be normalising your raw counts (via the `DESeq()` function) and dealing with differences in library sizes across your samples; so, no need to worry there. However, this is not related to the differences introduced by different read lengths.

Could you try the following and see what happens?
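A minimal sketch of what that design could look like (the `ReadLength` column and its level assignments here are placeholders based on the thread's description - the order must match your actual samples; requires Bioconductor's DESeq2):

```r
library(DESeq2)

# Hypothetical metadata: condition plus the read-length batch, in the
# same order as the columns of counts_all_fullGTF (truncated here)
coldata <- data.frame(
  condition  = factor(c('P', 'NP', 'P', 'P')),           # ...all samples
  ReadLength = factor(c('50bp', '50bp', '75bp', '75bp')) # ...all samples
)

# Include ReadLength in the design so the model adjusts for it
deseqdata_fullGTF <- DESeqDataSetFromMatrix(countData = counts_all_fullGTF,
                                            colData   = coldata,
                                            design    = ~ condition + ReadLength)
dds_fullGTF <- DESeq(deseqdata_fullGTF)

# blind = FALSE lets vst use the design information
vsd_fullGTF <- vst(dds_fullGTF, blind = FALSE)
plotPCA(vsd_fullGTF, intgroup = "condition", ntop = 5000)
```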
Separately, this following code will directly remove the batch effect:
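A sketch of the direct approach (assuming the `vsd_fullGTF` object and the hypothetical `ReadLength` column from above; limma and PCAtools are Bioconductor packages):

```r
library(limma)
library(PCAtools)

# Work on the variance-stabilised values, never on raw counts
mat <- assay(vsd_fullGTF)

# Directly regress out the read-length batch effect
mat_corrected <- removeBatchEffect(mat, batch = vsd_fullGTF$ReadLength)

# PCA on the corrected matrix
p <- pca(mat_corrected, metadata = as.data.frame(colData(vsd_fullGTF)))
biplot(p, colby = 'condition')
```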
Thanks a lot Kevin for the answers at this strange hour on a Sunday! I ran the first chunk of code, this time specifying ReadLength as part of the design, but the PCA plot was basically the same - https://imgur.com/3LwzYc8
I ran the second chunk of code, which separately removes the batch effect; here is the result - https://imgur.com/Ovdjs2l
I am curious why the plots are different. We can see that those 3 wayward samples from the plotPCA output are also away from the main cluster in the PCAtools output. But in this case the other samples form more of a single cluster than two separate ones. If the first chunk of code also normalizes for read length across the two batches, then why are there still 2 separate clusters there?
I am in the Americas right now, so, it's not tooooo late. Well, most will already be asleep.
This first chunk of code is standard for any DESeq2 analysis and normalises your data using size factors that control for differences in sequencing depth across your samples - it also accounts for 'dispersion', which is a statistical measure related to variance. The normalisation will be the exact same irrespective of the design formula. No adjustment of counts is performed for any parameter listed in your design formula, i.e., this normalisation of the raw counts is always 100% blind to the design formula.
Try it out... if you run it with `~ condition` or `~ condition + ReadLength`, the values of `counts(dds_fullGTF, normalized = TRUE)` will be exactly the same.

-------------
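One way to check this claim is with DESeq2's built-in simulator (a sketch; `batch` here stands in for `ReadLength`, and the alternating levels avoid confounding with `condition`):

```r
library(DESeq2)

# Simulated dataset with a 'condition' column; add a fake batch variable
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds$batch <- factor(rep(c('50bp', '75bp'), times = 6))

# Fit once without, and once with, the batch term in the design
design(dds) <- ~ condition
dds1 <- DESeq(dds)

design(dds) <- ~ condition + batch
dds2 <- DESeq(dds)

# Size-factor normalisation is blind to the design:
# the normalised counts are identical in both fits
all.equal(counts(dds1, normalized = TRUE),
          counts(dds2, normalized = TRUE))  # TRUE
```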
If you were to then perform a differential expression analysis using the `DESeq2::results()` function, say, for `condition`, the p-values that you see will be adjusted for `ReadLength`. This is just how regression models work. In this case, each gene is being independently tested in a regression model of generic formula:

-------------
Going further, when you transform the normalised counts via `vst(..., blind = FALSE)`, some level of adjustment to the data will be made based on the model formula, but it's only a 'soft' adjustment. In your case, apparently, it's not enough to mitigate the effect of `ReadLength`. When we then use `limma::removeBatchEffect()`, we are taking a more aggressive and direct approach, i.e., by directly modifying our expression levels and eliminating the measured effect of `ReadLength`. This is why, after you use the limma function, the effect vanishes on PCA.

I am aware that all of this may be difficult to grasp if you are just starting in this field.
Thanks a lot Kevin! That was SUPER helpful
Hi Kevin, I'm trying to run PCA after accounting for a source of variation and am working with your proposed code. I just saw that, in your design formula, you named the condition (= source of variation of interest) BEFORE the time (= source of variation to be filtered out). Usually, the variation of interest should be mentioned last, as also discussed in other comments of this post. Was it switched on purpose or accidentally?
Thanks in advance!!