Hello all,
I am new to bioinformatics and working on a project. My mission is to build the first reference transcriptome of a species and perform differential expression analysis between 2 temperature conditions at the isoform level with DESeq2. I have a few questions about methodology.
So far I have a reference transcriptome (I filtered my Trinity FASTA by quality, redundancy, and transcript expression).
I am concerned because it seems it is not recommended to perform differential analysis at the isoform level (https://support.bioconductor.org/p/43395/#43400).
So I am wondering whether I should change tools to perform isoform-level analysis, or whether it is better to do a differential analysis at the gene level. I also wonder if I have to cluster my transcripts (using tools like Corset) prior to counting, since kallisto only gives counts at the transcript level, unless DESeq2 can use the transcript IDs to group them into genes?
Also, now that I am thinking about doing the analysis at the gene level, I am concerned about whether my filtering by transcript expression will skew my analysis.
Thank you for reading !
Please use full words: "level", not "lvl". Small things like these are the difference between being a professional and not being one.
I have no idea what you're doing for your reference transcriptome (the language is very unclear).
But to do gene-level analysis with DESeq2, you have to summarize the transcript-level estimates to gene-level (see: tximport).
If you want to do transcript-level differential expression analysis, I'd recommend using sleuth (note: sleuth can also do gene-level analysis).
Ok thanks, sorry for being so unclear, I have just edited my post to make it better.
I have decided to do gene-level analysis with DESeq2. So far I have followed the documentation. My transcript IDs look like this: TRINITY_DN0_c0_g1_i2. I am not sure it is the right thing, but I create my tx2gene table like this.
And when I look at my final count matrix, it contains, for each gene, the sum of the estimated counts of all its isoforms. Is that normal?
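For readers following along, here is a minimal sketch of one common way to derive a transcript-to-gene table from Trinity IDs. It assumes the Trinity "gene" is simply the transcript ID with the trailing `_i<N>` isoform suffix removed. The helper names `trinity_gene_id` and `build_tx2gene` are hypothetical, and this is not necessarily what the poster actually ran. If the transcripts were clustered with a tool such as Corset instead, the mapping would come from that tool's output.

```python
import re

# Assumption: a Trinity transcript ID ends in "_i<N>" (the isoform number),
# and everything before that suffix identifies the Trinity "gene".
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene_id(transcript_id: str) -> str:
    """Return the gene-level ID by stripping the trailing isoform suffix."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def build_tx2gene(transcript_ids):
    """Build a list of (transcript_id, gene_id) pairs, one per transcript."""
    return [(tx, trinity_gene_id(tx)) for tx in transcript_ids]


# Illustrative IDs only; real IDs would be read from the FASTA or kallisto output.
example_ids = [
    "TRINITY_DN0_c0_g1_i1",
    "TRINITY_DN0_c0_g1_i2",
    "TRINITY_DN1_c0_g1_i1",
]
tx2gene = build_tx2gene(example_ids)
```

Written out as a two-column table (transcript ID first, gene ID second), this is the shape of mapping that tximport expects for its `tx2gene` argument.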
Yes, if you use tximport, it sums the counts of all isoforms belonging to a gene and reports that as the gene-level count.
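To make the summation concrete, here is a small sketch of the idea for a single sample. It is not tximport's actual code; the function name `sum_to_gene_level` and the example counts are made up for illustration.

```python
from collections import defaultdict


def sum_to_gene_level(tx_counts, tx2gene):
    """Sum estimated transcript counts into one total per gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


# Hypothetical estimated counts for one sample (e.g. kallisto's est_counts).
tx_counts = {
    "TRINITY_DN0_c0_g1_i1": 10.5,
    "TRINITY_DN0_c0_g1_i2": 4.5,
    "TRINITY_DN1_c0_g1_i1": 7.0,
}
tx2gene = {
    "TRINITY_DN0_c0_g1_i1": "TRINITY_DN0_c0_g1",
    "TRINITY_DN0_c0_g1_i2": "TRINITY_DN0_c0_g1",
    "TRINITY_DN1_c0_g1_i1": "TRINITY_DN1_c0_g1",
}

gene_counts = sum_to_gene_level(tx_counts, tx2gene)
# gene_counts == {"TRINITY_DN0_c0_g1": 15.0, "TRINITY_DN1_c0_g1": 7.0}
```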
Ok thanks! I just found it a bit surprising. I would have expected it to take into account some other data, such as isoform length.
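On isoform length: as far as I understand it, tximport does keep track of it. Alongside the summed counts it reports an average transcript length per gene, weighted by each isoform's abundance, and DESeq2 can use that as an offset when the data is imported with DESeqDataSetFromTximport. Below is a rough sketch of that weighted average for one sample. The function name `weighted_gene_length` and the numbers are hypothetical, and genes with zero total abundance are simply skipped here, which is a simplification.

```python
from collections import defaultdict


def weighted_gene_length(tx_tpm, tx_length, tx2gene):
    """Abundance-weighted average transcript length per gene (one sample)."""
    weighted_sum = defaultdict(float)  # sum of TPM * length per gene
    total_tpm = defaultdict(float)     # sum of TPM per gene
    for tx, tpm in tx_tpm.items():
        gene = tx2gene[tx]
        weighted_sum[gene] += tpm * tx_length[tx]
        total_tpm[gene] += tpm
    # Simplification: genes with no abundance in this sample are left out.
    return {g: weighted_sum[g] / total_tpm[g] for g in weighted_sum if total_tpm[g] > 0}


# Hypothetical abundances (TPM) and lengths (bp) for two isoforms of one gene.
tx_tpm = {"TRINITY_DN0_c0_g1_i1": 30.0, "TRINITY_DN0_c0_g1_i2": 10.0}
tx_length = {"TRINITY_DN0_c0_g1_i1": 1000.0, "TRINITY_DN0_c0_g1_i2": 2000.0}
tx2gene = {
    "TRINITY_DN0_c0_g1_i1": "TRINITY_DN0_c0_g1",
    "TRINITY_DN0_c0_g1_i2": "TRINITY_DN0_c0_g1",
}

gene_length = weighted_gene_length(tx_tpm, tx_length, tx2gene)
# gene_length == {"TRINITY_DN0_c0_g1": 1250.0}
```

The point is that if isoform usage shifts between conditions, the effective gene length changes too, and a length-aware offset can account for that even though the counts themselves are plain sums.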
I think some other methods, such as genome-alignment-based ones (e.g. Subread + featureCounts), can also give accurate gene-level expression counts?
It's actually more accurate to get gene-level expression from transcript-level estimates.
Many papers have been written on this, e.g. "Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences" (this is the tximport paper).