While the exact normalization can be up for debate (FPKM, CPM, etc.), having independence between your training and validation datasets is important.
If the DESeq2 rlog values are affected by the overall set of samples, you could be violating your independence for validation if your normalization can be carried out independently for each sample.
For example, consider the DESeq2 vignette:
rld <- rlog(dds, blind=FALSE)
The command above is used in the rlog section.
To make your life easier, I'll copy over the upstream parts of the example code:
library("pasilla")
pasCts <- system.file("extdata",
"pasilla_gene_counts.tsv",
package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
"pasilla_sample_annotation.csv",
package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
rownames(coldata) <- sub("fb", "", rownames(coldata))
cts <- cts[, rownames(coldata)]
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind=FALSE)
(I apologize that this is cut-and-paste from the website: however, a big thanks to Mike Love for making this a lot easier for me to show!)
Now, let's try using duplicates (instead of total n=7):
dds2 <- DESeqDataSetFromMatrix(countData = cts[,c(1,2,4,5)],
colData = coldata[c(1,2,4,5),],
design = ~ condition)
dds2 <- DESeq(dds2) #I think you can skip this...
rld2 <- rlog(dds2, blind=FALSE)
As you can see from the head(assay(rld))
and head(assay(rld2))
commands, the rlog values are different with different sets of samples (which I think is undesirable for what you are describing)
You can also see that the values still vary after setting blind=TRUE
(even though I would guess it wouldn't add a huge bias, and this is a small demo dataset with clear differences to begin with):
rld.blind <- rlog(dds, blind=TRUE)
rld2.blind <- rlog(dds2, blind=TRUE)
That said, if you are concerned about the variance stabilization (because the lower counts tend to have more variability), I think there are ways where a log-transformed FPKM values is arguably kind of like the rlog transformation (and there are probably other strategies that avoid normalizing expression in ways that vary depending upon the total samples considered).
For example, lets say you are using log2(FPKM + 0.1) values. If all FPKM values are below 0.1, you might want to exclude those, but I'm guessing they will also be removed during feature selection (filtering more genes overall). However, the point is that the expression value is calculated independently for each sample.
If you choose a low rounding factor, you may see a increase in variability for the lower expressed genes. However, one solution you may want to consider is increasing the rounding threshold (such as using FPKM of 1 instead of 0.1). Or, if you end up selecting a lower throughput assay with a smaller number of features (first identified from your high-throughput screen), then your scoring method that costs less to test in more samples would be different :)
I also understand that the vignette says "The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications," but I think they are mostly talking about a comparison of rlog to the normalized counts within DESeq2 (which I am OK with, but I would prefer to have a normalization that can be independently calculated for each sample).
For example, in the DESeq2 vignette, this is the example for the log-transformed counts:
ntd <- normTransform(dds)
library("vsn")
meanSdPlot(assay(ntd))
However, if you increase the rounding threshold (using a pseudocount of 100 instead of a pseudocount of 1), you get rid of the bump in increased variance:
ntd100 <- normTransform(dds, pc=100)
meanSdPlot(assay(ntd100))
In fact, that plot looks really similar to meanSdPlot(assay(rld))
, at least on my computer (they are similar to each other, but different than the plots on-line with the smaller pseudocount).
So, even when using the context of that code, using a higher rounding threshold (or pseudocount) can help stabilize variance (and help with filtering genes with lower counts / expression).
thanks Kevin,
I obtained the FPKM values with RSEM using the same count matrix that I use to calculate the rlog. Do you have any idea of why these values seem wrong?
Also, I have plotted the same values from here https://xenabrowser.net/datapages/?cohort=GTEX using the GTEX data and same value (log2(FPKM +0.001) and the plot looks pretty similar to my one. I would like to make both data sets comparable so that is why I follow similar pipeline that the project there, any idea of what is going on?
Thank you very much!!
Hey again Silvia, you probably should have added this as a comment to my answer above, but not to worry! :)
The method of normalisation that produces FPKM expression values has come under criticism in recent years, in part due to the issues that you have found when plotting these counts. FPKM does not manage very well the fact that, in RNA-seq, there can be a very wide range of count distribution in the samples being studied. As Igor mentioned, there is neither any cross-sample normalisation performed while producing FPKM expression values.
In your figure, samples C3 and H12 appear problematic.
Upps, I thought I replied directly to you! :)
Ok, I see the idea. Thank you very much for your help!!
Silvia
Absolutely no problem. You can reply to this thread again to get my attention if you need more help!