Hi to all,
I have RNA-seq data from 31 tissues of 2 different individuals. My goal is to evaluate the expression of various genes of interest and to compare the expression level among these tissues. I read throughout a lot of papers and posts and I tried to set up my pipeline. I mean first of all after quality check of my reads, I trimmed them and mapped with HISAT2. To have an initial idea about gene expression I visualized the results on IGV (I perfectly known that with IGV it is not possible to quantify the expression but I did it just to have an idea). Then I calculated TPM with Salmon from raw reads and I realized different plots in Python (actually TPM from Salmon matched with IGV results). Now my intention is to proceed with Differential Expression Analysis with DESeq2. I have two questions:
1) is it a good pipeline to analyze data from RNA-seq data step by step? (I perfectly know that for expert bioinformaticians all these procedures seem to be like wasting time but unfortunately for me it is the first time I have to analyze RNA-seq data and I want to be sure step by step).
2) Reading different papers and posts I realized that TPM is not so good to compare gene expression among different samples (in my case tissues) and that there isn’t a real cut-off to consider a gene as low or high expressed. So it could be a good thing compare TPM of genes I m interesting in with ”housekeeping” genes to have an idea about low/medium/esxpression before DEA?
As I said before I want to proceed with DESeq2 as every people recommend to have stronger results.
Thank you in advance for your time and suggestion!
Please read the DESeq2 manual at Bioconductor, it tells you everything you have to know when starting with transcript abundance estimates from
salmon
towards proper differential analysis.Thank you for you suggestion. Actually I’m reading and following that pipeline. I’m not sure you replied exactly to my questions. I did this post to have ideas/feedbacks about my work and to know if I understood correctly that manual. Considering what I wrote, I’m right or wrong? Thank you again!
Sorry for not being precise. If you follow the DESeq2 manual you will be fine, so yes that pipeline is basically fine. You should realize thought that differential expression is commonly done on the gene level rather than the transcript level, the latter is what salmon outputs, therefore (as explained in the manual) you need to use tools like
tximport
to summarize transcript level abundances to the gene level. As for 2) no comparing transcript abundances is not ideal, use the normalized gene level counts from DESeq2, e.g. viavst
function. Alternatively use the normalized counts directly viacounts(dds, normalized=TRUE)
.Indeed I really appreciate your suggestions especially the second one. So basically txt import takes the output from Salmon (transcript level abundances) and transform this to the gene level to perform Differential Expression Analysis right?
Yes, that is the idea. The problem with transcripts is that individual isoforms per gene share large part of the sequence information (most exons are identical between isoforms), therefore it is difficult to know which isoform a read comes from. For tx-level DE analysis you need special approaches that take this so-called mapping uncertainty into account. For the gene level, which is more robust, one simply sums these per-tx counts and then performs DE on these values. The tximport manual has some literature references for a more extensive discussion. You can directly use the output of tximport for DESeq2, see https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-tximeta
In most cases one only wants to know whether a gene is differentially-expressed and not whether individual isoforms change, therefore the gene level analysis is the most common DE approach.
Thank you for your very precious advices. In this case I have to focus on differentially-expressed genes not on isoforms. Anyway this can be very useful for future analysis.