Hi everyone, I'm hoping to integrate public RNA-seq datasets with some of my own RNA-seq datasets. However, the relevant public datasets have quite different sequencing depths and read lengths from mine and from each other. For instance, my datasets are ~50M reads/sample at 2x150bp, while others are ~30M reads/sample at 2x100bp and ~60M reads/sample at 2x76bp.
I downloaded the raw fastq files, processed them all the same way (trimgalore -> STAR -> salmon -> tximport into DESeq2), and have been using the raw counts in downstream analyses. However, I'm starting to realize that this isn't going to work and that I need to incorporate some sort of normalization. My end goal is to be able to compare all of these samples on a single PCA plot.
What's the best way of doing this? I have 15 samples with either 2 or 3 replicates split across 5 DESeq objects. Here's what I've considered so far:
- Use DESeq2 to normalize read counts based on `sizeFactors` and make a large dataframe of normalized counts prior to `vst`. (First sketch below.)
- Use `avgLength` in my DESeq object to calculate a rough TPM, then make a large dataframe of TPM values. (Second sketch below.) My only caveat is that I include the following before exporting my DESeq object, which might not matter since it only removes genes with zero counts in every sample:

  ```r
  keep   <- rowSums(counts(ddsTxi)) >= 1
  ddsTxi <- ddsTxi[keep, ]
  ```
- Regenerate the count matrices with tximport, but include `countsFromAbundance = "scaledTPM"` in the `tximport()` call. (Third sketch below.) I'm not really sure if `scaledTPM` is a better option than `lengthScaledTPM` for what I want to do, so I will gladly take some guidance there as well.

Any other options?
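To make the options concrete, here's roughly what I mean by each. First, the `sizeFactors`/`vst` route; `ddsTxi1`..`ddsTxi5` are placeholders for my five DESeq objects, and I'm assuming I'd need to rebuild a single DESeqDataSet so the size factors get estimated across all 15 samples rather than within each object:

```r
library(DESeq2)

## Placeholder names for my five existing DESeq objects
dds_list <- list(ds1 = ddsTxi1, ds2 = ddsTxi2, ds3 = ddsTxi3,
                 ds4 = ddsTxi4, ds5 = ddsTxi5)

## Combine raw counts (assumes all objects share the same gene rows
## in the same order, and have unique sample names)
counts_all <- do.call(cbind, lapply(dds_list, counts))
coldata <- data.frame(
  dataset   = rep(names(dds_list), times = sapply(dds_list, ncol)),
  row.names = colnames(counts_all)
)

## One object, so size factors reflect all samples at once
dds_all <- DESeqDataSetFromMatrix(counts_all, colData = coldata,
                                  design = ~ dataset)
dds_all <- estimateSizeFactors(dds_all)

norm_counts <- counts(dds_all, normalized = TRUE)  # normalized counts
vsd <- vst(dds_all, blind = TRUE)                  # variance-stabilize
plotPCA(vsd, intgroup = "dataset")
```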
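Second, the rough-TPM route. If I'm reading the docs right, the average transcript lengths live in the `avgTxLength` assay when the object comes from `DESeqDataSetFromTximport` (it's `txi$length` in the raw tximport output):

```r
## Rough per-sample TPM from counts and average transcript lengths
cts <- counts(ddsTxi)
len <- assays(ddsTxi)[["avgTxLength"]]   # gene x sample length matrix

rate <- cts / len                          # length-corrected rate
tpm  <- t(t(rate) / colSums(rate)) * 1e6   # rescale each sample to 1e6
```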
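Third, the tximport route; `files` and `tx2gene` below are placeholders for each dataset's salmon quant files and a transcript-to-gene map. My current understanding is that `lengthScaledTPM` also corrects for average transcript length, which seems relevant given the different read lengths, but I'd welcome correction on that:

```r
library(tximport)

## One call per dataset, then cbind the resulting count matrices
txi <- tximport(files, type = "salmon", tx2gene = tx2gene,
                countsFromAbundance = "lengthScaledTPM")
head(txi$counts)   # counts regenerated from abundance
```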
Kind of kicking myself, but any of these options should be fairly quick to incorporate.
As a very first step toward making the datasets comparable, you need to remove batch effects. You can use the `ComBat` approach from the `sva` R package to do so.
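A minimal sketch, reusing the hypothetical combined matrix and `dataset` labels from your first sketch above. `ComBat_seq` (available in recent `sva` releases) works on raw counts, while classic `ComBat` expects roughly Gaussian values such as vst-transformed data:

```r
library(sva)
library(DESeq2)

batch <- coldata$dataset   # which study each sample came from

## Count-level correction with ComBat_seq
adjusted_counts <- ComBat_seq(counts_all, batch = batch)

## Or correct on the vst scale with classic ComBat, then run PCA
vsd_mat    <- assay(vst(dds_all, blind = TRUE))
combat_mat <- ComBat(dat = vsd_mat, batch = batch)
pca <- prcomp(t(combat_mat))   # samples as rows for PCA
```

If the studies don't all contain the same biological groups, pass that design to `ComBat_seq`'s `group` argument (or `ComBat`'s `mod`) so the correction doesn't strip real signal.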