Should I normalize read counts if I want to run plotPCA?
4.6 years ago
c_u ▴ 520

Hi, I am trying to run a PCA on my samples for initial quality control. I have two different sets of samples: one set was sequenced with 50 bp reads, the other with 75 bp reads (both sets contain disease and control cases). To do the PCA, I ran DESeq2 on the samples (which necessarily requires non-normalized counts), followed by vst and plotPCA. But in my PCA plot I see two clusters: one for the 50 bp samples and one for the 75 bp samples. This is not expected, since nothing differs between the sample sets except read length. Someone said I should normalize my data, but I think that is taken care of by DESeq2, and it shouldn't be fed normalized counts anyway. Here is the plot.

Any suggestions?

Here is the image link in case it doesn't show - https://imgur.com/CFXHky4

Here is my code. P and NP mean Pain and Non-Pain, which is the effect I am studying. Some files are 50 bp, others are 75 bp, and I have not provided that information to DESeq2 -

library(Rsubread)  # featureCounts
library(DESeq2)

counts_all_fullGTF <- featureCounts(
  nthreads = 3,
  isGTFAnnotationFile = TRUE,
  annot.ext = '/Volumes/bam/DRG/annotations/Homo_sapiens.GRCh38.95.gtf',
  files = c('/Volumes/.../47T7L.fastqAligned.sortedByCoord.out.bam',
            '/Volumes/.../47T7R.fastqAligned.sortedByCoord.out.bam'))$counts

sampleTable_all <- data.frame(condition=factor(c('P','NP','P','P','P','P','NP','P','P','NP','P','P','P','P','P','P','NP','P','P','NP','P','P','P','P','P','NP','NP','NP','NP','P','P','P','P','P')))

coldata <- sampleTable_all
deseqdata_fullGTF <- DESeqDataSetFromMatrix(countData=counts_all_fullGTF, colData=coldata, design=~condition)
dds_fullGTF <- DESeq(deseqdata_fullGTF)
vsd_fullGTF <- vst(dds_fullGTF)
library(ggrepel)
plotPCA(vsd_fullGTF, intgroup = "condition", ntop = 5000) +
  geom_label_repel(aes(label = substr(name, start = 1, stop = 6)),
                   fontface = "bold")  # fixed styling belongs outside aes()
4.6 years ago

Firstly, yes, the data should be normalised and, ideally, transformed so that its distribution is approximately normal.

When you ran vst(), did you set blind = FALSE and did you have ReadLength (or batch) as a covariate in your design formula? Setting blind = FALSE 'exposes' the variance stabilising transformation to the design formula - it's useful for producing expression data for downstream applications outside of DESeq2. Read here for further information: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation
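
For illustration, a minimal sketch of the two modes (assuming the dds_fullGTF object from your code):

vsd_blind <- vst(dds_fullGTF, blind = TRUE)   # default: ignores the design formula
vsd_aware <- vst(dds_fullGTF, blind = FALSE)  # uses the design when estimating the dispersion trend
plotPCA(vsd_aware, intgroup = "condition")    # PCA on the design-aware transformation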

If you still need to remove that batch effect, then use limma::removeBatchEffect() on the variance-stabilised expression levels.

As to the origin of the observed difference: I think that multi-mapping will be a bigger issue for the 50 bp samples, given the shorter read length. However, multi-mapping may be dealt with by the aligner or pseudo-aligner that you use - which did you use?

Kevin


Thanks Kevin! I have provided my code now. I haven't indicated anywhere in the code which samples are 50 bp and which are 75 bp. I know which files are which from the filenames, and that's how I identified the clusters in the PCA plot. My main doubt is whether I should somehow separately normalize my counts (to correct for differences between the two batches), or whether it's fine to assume that DESeq2 will take care of that even if I don't tell it that the samples come from separate batches. My main issue with normalizing the counts myself (say, by calculating TPM values for each sample and feeding those to DESeq2) is that DESeq2 prefers raw, un-normalized counts.

BTW, I used STAR for alignment and featureCounts for counting (default settings for both)


My main doubt is whether I should somehow separately normalize my counts (to correct for differences between the two batches), or whether it's fine to assume that DESeq2 will take care of that even if I don't tell it that the samples come from separate batches.

Generally, the preferred procedure is to include the confounding variable (here, ReadLength) in the design formula. Then, when the program derives the test statistics for, e.g., condition, it automatically adjusts for the effect of ReadLength, because ReadLength is included in the formula. This is how adjustment for covariates works in regression models.

Looking at your code, DESeq2 will already be normalising your raw counts (via the DESeq() function) and dealing with differences in library sizes across your samples; so, no need to worry there. However, this is not related to the differences introduced by different read lengths.
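
As a quick check (a sketch, using your object names), you can inspect what that normalisation actually did:

sizeFactors(dds_fullGTF)                      # one library-size scaling factor per sample
head(counts(dds_fullGTF, normalized = TRUE))  # raw counts divided by the size factors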

Could you try the following and see what happens?

# NB: coldata must now contain a ReadLength factor, built from your
# filenames, e.g. (hypothetical values): coldata$ReadLength <- factor(c('50bp', '75bp', ...))
deseqdata_fullGTF <- DESeqDataSetFromMatrix(
  countData = counts_all_fullGTF,
  colData = coldata,
  design = ~ condition + ReadLength)
dds_fullGTF <- DESeq(deseqdata_fullGTF)
vsd_fullGTF <- vst(dds_fullGTF, blind = FALSE)

library(ggrepel)
plotPCA(vsd_fullGTF, intgroup = "condition", ntop = 5000) +
  geom_label_repel(aes(label = substr(name, start = 1, stop = 6)),
                   fontface = "bold")

Separately, the following code will directly remove the batch effect:

# remove the ReadLength effect directly from the variance-stabilised matrix
mat <- limma::removeBatchEffect(assay(vsd_fullGTF), colData(dds_fullGTF)[, 'ReadLength'])
# PCA and biplot on the corrected matrix
p <- PCAtools::pca(mat, metadata = colData(dds_fullGTF))
PCAtools::biplot(p, colby = 'condition', shape = 'ReadLength')

Thanks a lot, Kevin, for answering at this strange hour on a Sunday! I ran the first code, this time specifying ReadLength as part of the design, but the PCA plot was basically the same - https://imgur.com/3LwzYc8

I ran the second code, which removes the batch effect separately; here is the result - https://imgur.com/Ovdjs2l

I am curious why the plots are different. The three wayward samples from the plotPCA output are also away from the main cluster in the PCAtools output, but there the remaining samples form more of a single cluster than two. If the first code also adjusts for the read-length difference between the two batches, why do two separate clusters remain in its plot?


I am in the Americas right now, so, it's not tooooo late. Well, most will already be asleep.

This first chunk of code is standard for any DESeq2 analysis and normalises your data using size factors that control for differences in sequencing depth across your samples - it also accounts for 'dispersion', which is a statistical measure related to variance. The normalisation will be exactly the same irrespective of the design formula: no adjustment of the counts is performed for any parameter listed in the design formula, i.e., the normalisation of the raw counts is always 100% blind to it.

deseqdata_fullGTF <- DESeqDataSetFromMatrix(
  countData = counts_all_fullGTF,
  colData = coldata,
  design = ~ condition + ReadLength)
dds_fullGTF <- DESeq(deseqdata_fullGTF)

Try it out... whether you run it with ~ condition or with ~ condition + ReadLength, the values of counts(dds_fullGTF, normalized = TRUE) will be exactly the same.
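
A sketch of that sanity check (assuming coldata now carries the ReadLength factor mentioned above):

dds_a <- DESeq(DESeqDataSetFromMatrix(counts_all_fullGTF, coldata, design = ~ condition))
dds_b <- DESeq(DESeqDataSetFromMatrix(counts_all_fullGTF, coldata, design = ~ condition + ReadLength))
# TRUE: the size-factor normalisation is identical under both designs
all.equal(counts(dds_a, normalized = TRUE), counts(dds_b, normalized = TRUE))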

-------------

If you were to then perform a differential expression analysis using the DESeq2::results() function, say, for condition, the p-values that you see would be adjusted for ReadLength. This is just how regression models work: each gene is independently tested in a regression model of the generic form:

gene ~ condition + ReadLength
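
For example (a sketch using your condition levels P and NP), the test below is adjusted for ReadLength purely because ReadLength sits in the design formula:

res <- results(dds_fullGTF, contrast = c('condition', 'P', 'NP'))
head(res[order(res$padj), ])  # top genes; p-values adjusted for ReadLength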

-------------

Going further, when you transform the normalised counts via vst(..., blind = FALSE), some level of adjustment is made based on the model formula, but it is only a 'soft' adjustment - in your case, apparently not enough to mitigate the effect of ReadLength. When we then use limma::removeBatchEffect(), we take a more aggressive and direct approach, i.e., we directly modify the expression levels and eliminate the measured effect of ReadLength. This is why the effect vanishes on the PCA after you use the limma function.

I am aware that all of this may be difficult to grasp if you are just starting in this field.


Thanks a lot Kevin! That was SUPER helpful


Hi Kevin, I'm trying to run a PCA analysis after accounting for a source of variation and am working with your proposed code. I just noticed that, in your design formula, you put condition (the source of variation of interest) BEFORE the covariate (in my case, time; the source of variation to be filtered out). Usually the variable of interest should be listed last, as also discussed in other comments on this post. Was the order switched on purpose or accidentally?

Thanks in advance!!

4.6 years ago
dsull ★ 6.9k

What are you hoping to get out of your PCA analysis? To me, it seems like PCA did its job -- it showed that the two sets of samples cluster separately! This means you need to account for that in your downstream analysis -- for a DESeq2 differential gene expression analysis, you would want to model the two sets of samples as a covariate in your design, to account for this batch effect, since the two sets were obviously sequenced at different times and, possibly, on different machines.

In terms of normalizing counts, do you suspect that sequencing depth is what's mainly driving the difference in clustering? If so, get the normalized counts and re-run the PCA (a sketch follows below). As mentioned above, though, the clustering difference is likely also driven by the batch effect (so even after running PCA on the normalized counts, you'd likely still get two distinct clusters), in which case, if you are using the normalized counts for any downstream analysis, you should use an R package that removes the batch effect.
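
One way to do that PCA (a sketch, assuming a fitted DESeq2 object dds as elsewhere in this thread):

norm_counts <- counts(dds, normalized = TRUE)   # size-factor-normalized counts
pca <- prcomp(t(log2(norm_counts + 1)))         # PCA on log-transformed values
plot(pca$x[, 1], pca$x[, 2], xlab = 'PC1', ylab = 'PC2')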

A note on batch effects: batch effects suck and there's really no perfect way to deal with them (ideally, all of your samples would have been sequenced at once), but the suggestions above are how people typically try to account for them.

Edit: I re-read the post, and it seems that you did use normalized counts to run the PCA (meaning sequencing depth was accounted for), which confirms my suspicion of a batch effect (but see Kevin's suggestion).


Hi Delaney, thank you for your answer. From this PCA analysis I am hoping to find bad samples that may have been contaminated, and I think I do see three of them clustered far away. My concern is that I was not expecting clusters for the two batches, and I wanted to know whether I am seeing them because I didn't normalize the counts. I did not, in fact, normalize my counts before feeding them to DESeq2, since DESeq2 wants non-normalized counts. After DESeq2 I run vst, which I am not sure counts as normalization. I know that differences in sequencing depth will lead to different count levels, but I thought DESeq2 would do that normalization. Am I right in assuming that, or should I normalize the counts myself?


The vst output is normalized -- vst() incorporates the size factors (i.e., sequencing depth) AND applies a variance-stabilizing transformation that puts your count data on a log2-like scale.
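
A minimal sketch (assuming a fitted DESeq2 object dds): the vst output already incorporates the size factors and sits on a log2-like scale, so it can go straight into plotPCA.

vsd <- vst(dds, blind = FALSE)
summary(assay(vsd)[, 1])             # values are roughly in log2 space
plotPCA(vsd, intgroup = 'condition')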
