Hello everyone,
I'm analyzing some RNA-seq datasets for differential expression. We did not run the experiments ourselves, but the database where I got them from indicates that they were done adding additional spike-ins (ERCC92) for normalization. Originally, I did not use these spike-ins, and just went along with the default median ratio method in all the genes included in DESeq2. I saw many people arguing that using spike-ins is often not very robust, but given that I am analyzing data in which transcription-related factors are knocked-down and therefore a big change in the amount of mRNA is expected, I decided to reanalyze the data using the exogenous spike-ins for normalization. Another reason is that our results showed overexpression of many genes in specific experiments in which the opposite behavior was expected.
I couldn't find that much documentation on how to process RNA-seq data using spike-ins, but this is what I did:
- I joined the fasta files containing the reference genome and that containing the sequence for the spike-ins.
- I did the same with the respective .gtf annotation files.
- I proceeded to align everything as usual.
On the DE analysis, I just included an estimateSizeFactors()
step with the option control specifying the ERCC genes, the code as follows:
se_star <- DESeqDataSetFromHTSeqCount(sampleTable=sampletable, directory=data_path, design=~Condition)
se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes)
se_star2 <- DESeq(se_star)
de <- results(object=se_star2, name="Condition_KO_vs_WT", )
de_shrink <- lfcShrink(dds=se_star2, coef="Condition_KO_vs_WT", type="apeglm")
In two of the three conditions I'm analyzing, I got a reasonable result in terms of DE genes (~2000), which is not that different from what I was achieving without the spike-ins. On the third one, however, the number of DE genes is unusually low (~50), which makes me think that something might be wrong with my analysis. I also calculated the size factors using the iterate normalization:
se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes, type='iterate')
Doing so, the number of DE genes in this experiment becomes more reasonable again, and the other experiments remain in a similar order of magnitude. However, I do not just want stick to this other method without any explanation. I also examined the PCA, correlation and distances heatmap doing as follows:
vsd <- vst(se_star2)
sampleDistMatrix <- as.matrix(dist(t(assay(vsd))))
pheatmap(sampleDistMatrix)
corMatrix <- cor(assay(vsd))
pheatmap(corMatrix)
print(plotPCA(object=vsd, intgroup="Condition"))
`
Using the default normalization method, I got the following results:
As you can see, in the distance matrix, the samples are not properly clustered by condition (WT / KO). However, when using the iterative method:
In this case, the samples are properly clustered in all cases, and a much larger percentage of the variance falls in the first principal component of the PCA analysis. This plot is consistent with what I was getting when performing the differential expression analysis in absence of the spike-in.
Am I doing everything alright? Could it be that the spike-in is not correct in one of the experiments? Can I just stick to the 'iterate' method?
Thank you very much,
Manuel F.
since you have the spike in controls, can you verify that you do get the expected foldchanges between the different ERCC transcripts both within and across samples, that would be a way that is independent of the size estimation method.