Warning: tl:dr will follow but I guess some background information might be appreciated.
=> Salmon quantifies against a transcriptome, so the quant.sf
are transcript abundance estimates. One commonly performs analysis with DESeq2 on the gene level though. The reason is that:
a) it is more straight-forward to interpret. Multiple transcripts might show different patterns (some go up, some go down) but it is often not clear what each transcript does in terms of function and how to interpret complex patterns. One is typically interested in the overall change of the gene, therefore sums tx counts to a single value representative for the gene.
b) Transcripts usually share most of the exon sequences between them. That means that is a notable mapping uncertainty that would need to be taken into account when doing tx-level analysis. DESeq2 does not support this. One would make use of the bootstrap replicates that salmon can produce (essentially it checks how reliable a read maps to each transcripts and then generates bootstrap replicates with alternative mapping locations). Specialized software such as swish
from the fishpond (Bioconductor) package can then make use of this information to decide how reliable the mapping, and by this observed changes between transcripts are. Tx analysis with DESeq2 (which does not use that information) is suboptimal and in most cases gene level analysis is more informative.
c) As you have fewer genes than transcripts runtime and memory requirements are lower for gene- than transcript analysis, especially when sample size is large, but these days that is probably not a relevant argument anymore.
In any case, you probably want to aggregate the transcript- to the gene level. A common package for this is tximport
from Bioconductor, see https://www.bioconductor.org/packages/release/bioc/html/tximport.html
You can then directly load the gene level tximport output into DESeq2 as described in the DESeq2 manual. It contains all necessary code, same goes for the tximport procedure, there is actually no need for external tutorials if you simply read the manuals at Bioconductor.
Finally, RStudio is just an interface, an IDE. R packages can be used from R command line or RStudio basically without difference.
I am personally not a fan of wrapper scripts for differential analysis as they do things under the hood you have no control over.
I would either use RStudio via the HPC (ssh -X
connection) or simply download it RStudio and run both tximport and DESeq2 on a local machine. It is not computationally demanding, any standard laptop can do it unless you have many hundreds of samples.
That makes it much easier to do exploratory analysis of your data, following the DESeq2 manual. You simply download the folders salmon produced to your local machines and can then analyse on it. You can gzip-compress the quant.sf files if you are limited in disk space, but generally this should not take much space.
Does that make sense to you?
yes, thank you for your suggestion. What about RSEM? Use STAR to generate mapping input and then use RSEM to create TPKM results? Is TPKM reliable as gene level?
DESeq2 starts from raw counts, please read the manual. tximport can read RSEM input as well.