Dear community,
I have a general question about differential expression analysis in non-model organisms. When genomic resources are absent, one of the first RNAseq analysis steps is generating a de novo transcriptome, which then serves as the reference for mapping and abundance estimation of the assembled transcripts. When these abundances are fed into differential expression analysis tools (e.g. DESeq2), it is recommended to use gene-level estimates rather than transcript/isoform abundances. With RNAseq data I normally use Trinity -> RSEM -> tximport/DESeq2, relying on Trinity's gene-isoform assignment to summarize abundances to gene level with tximport. This is of course not perfect, but it is a starting point.
The problem comes with functional enrichment (e.g. GO term enrichment), which needs both annotation data and the information about what is differentially expressed. The transcriptome is annotated at transcript level, but the differential expression results are at gene level. For isoforms that are annotated to the same gene product (as in most cases), this is not problematic. But how should one deal with isoforms from the same (Trinity) 'gene' that have different annotations (and would therefore get different GO terms)?
- Going back to transcript level expression analysis (to avoid the annotation problem)?
- Annotating not at transcript level but on one representative isoform per gene? If so, how should it be chosen: by clustering, by taking the longest or most highly expressed isoform, or by some other criterion?
- Summarizing all isoform annotations (and therefore all GO terms) for one gene?
Are there any experiences/thoughts/recommendations about that? A link to posts or literature I have overlooked would also be highly appreciated! Many thanks in advance, looking forward to your opinions/thoughts about that.
Hello, I am wondering the exact same thing. I have been mulling over very similar options and sort of taking every approach, but I would love to hear from someone experienced with this situation.
MB, did you find a resolution elsewhere on how to approach this?
Hey, sorry for the late response. Unfortunately not really. I prefer to work with Trinity 'genes' instead of transcripts because I know there is a lot of contig over-splitting in the assemblies, which affects the abundance estimation of individual transcripts. Thus, using gene-level counts is IMO a more flexible way of 'clustering' than traditional clustering with a fixed threshold.
I found that in most cases, the annotations are the same for all isoforms. Since it can be biologically valid for one gene to have different gene products, I will now use a summarising approach. But I guess it would also be fine to exclude ambiguous annotations. In most data sets, ambiguously annotated genes will not have enough impact to really change the outcome of a functional enrichment analysis. If they do, that means you cannot trust your annotation/enrichment data anyway. Since for most non-model organisms annotation data is too scarce for in-depth analysis anyway, I would recommend making the DE analysis as robust as possible and using annotation/functional enrichment just as a first insight into possibly regulated pathways.
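To make the summarising idea concrete, here is a minimal Python sketch, not a tested pipeline. It assumes the transcript IDs follow the usual Trinity naming scheme, where the isoform is a trailing `_i<N>` suffix on the gene ID, and that you already have a transcript-to-GO mapping as a dictionary. The function names and the example IDs and GO assignments are made up for illustration. It takes the union of all isoform GO terms per gene and also reports which genes have isoforms with differing annotations, so you can either keep or exclude them.

```python
from collections import defaultdict


def trinity_gene_id(transcript_id):
    """Drop the trailing isoform suffix (_i<N>) from a Trinity transcript ID.

    Assumes Trinity-style IDs such as TRINITY_DN100_c0_g1_i2; any ID that
    does not end in _i<digits> is returned unchanged.
    """
    gene, sep, isoform = transcript_id.rpartition("_i")
    return gene if sep and isoform.isdigit() else transcript_id


def summarize_go_by_gene(transcript_go):
    """Union the GO terms of all isoforms per gene and flag ambiguous genes.

    transcript_go: dict mapping transcript ID -> set of GO term IDs.
    Returns (gene_go, ambiguous): gene_go maps gene ID -> union of GO terms;
    ambiguous is the set of genes whose annotated isoforms disagree.
    Isoforms without any GO term are ignored when checking for disagreement.
    """
    per_gene = defaultdict(list)
    for transcript_id, terms in transcript_go.items():
        per_gene[trinity_gene_id(transcript_id)].append(frozenset(terms))

    gene_go = {}
    ambiguous = set()
    for gene, term_sets in per_gene.items():
        gene_go[gene] = set().union(*term_sets)
        distinct_annotations = {terms for terms in term_sets if terms}
        if len(distinct_annotations) > 1:
            ambiguous.add(gene)
    return gene_go, ambiguous


# Made-up example data, for illustration only.
transcript_go = {
    "TRINITY_DN100_c0_g1_i1": {"GO:0006412"},
    "TRINITY_DN100_c0_g1_i2": {"GO:0006412"},
    "TRINITY_DN200_c1_g2_i1": {"GO:0008152"},
    "TRINITY_DN200_c1_g2_i2": {"GO:0016020"},
    "TRINITY_DN200_c1_g2_i3": set(),
}
gene_go, ambiguous = summarize_go_by_gene(transcript_go)

# To exclude ambiguous genes instead of summarising them:
unambiguous_go = {g: t for g, t in gene_go.items() if g not in ambiguous}
```

Running the enrichment once with `gene_go` and once with `unambiguous_go` is a quick way to check whether the ambiguous genes change the outcome at all.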
All the best.