Different annotations for different isoforms from the same gene
10 weeks ago
Chikae • 0

Hello, I am working with soil metatranscriptomic data.

I ran BLASTX on the Trinity output (i.e., isoform sequences) and found that different isoforms from the same gene (e.g., TRINITY_DN0_c0_g1_i1 and TRINITY_DN0_c0_g1_i2) are annotated with different taxonomies. I wonder how people handle this inconsistency when they present gene expression data. For example, do people use only the annotation from the longest isoform, on the assumption that it is the most reliable?

Additionally, I ran the EggNOG mapper, and the same issue occurred; different isoforms from the same genes are annotated with different functions, such as different KO numbers and GO terms. Interestingly, the longest isoforms were not annotated with any KO numbers, while a middle-length isoform had the highest number of KO and GO terms. In this case, using only the longest isoform for annotation is not ideal. I wonder what people do for this issue, too. For example, do people combine all the isoform annotations when discussing at the gene level?

Thank you very much for your help!

metatranscriptomics Trinity RNA-seq Isoform • 459 views

How different are these annotations exactly? Do these purported isoforms all have annotations that are still essentially from within the same gene family?

I'd also presume that for a metatranscriptome, the gene-to-isoform associations provided by Trinity are unlikely to be robust (this isn't always even the case for a "normal" transcriptome in my experience). So is it possible, perhaps, in your case that these isoforms are not actually isoforms at all?
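One quick way to see how widespread the disagreement is: group the per-isoform labels by Trinity gene ID and list the genes whose isoforms carry more than one label. This is only a minimal sketch. It assumes the usual Trinity naming, where the gene ID is the transcript ID without the trailing `_i<N>`, and the labels in the example are made up for illustration.

```python
import re
from collections import defaultdict


def gene_id(transcript_id):
    """Strip the trailing _i<N> isoform suffix from a Trinity transcript ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def discordant_genes(isoform_to_label):
    """Return {gene: set of labels} for genes whose isoforms disagree."""
    labels = defaultdict(set)
    for transcript, label in isoform_to_label.items():
        labels[gene_id(transcript)].add(label)
    return {gene: found for gene, found in labels.items() if len(found) > 1}


# Hypothetical labels, for illustration only (not taken from the poster's data).
example = {
    "TRINITY_DN0_c0_g1_i1": "Proteobacteria",
    "TRINITY_DN0_c0_g1_i2": "Streptophyta",
    "TRINITY_DN1_c0_g1_i1": "Actinobacteria",
    "TRINITY_DN1_c0_g1_i2": "Actinobacteria",
}
result = discordant_genes(example)
print(result)
```

If a large share of multi-isoform genes turns up here, that would support treating the gene-to-isoform groupings with suspicion.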


Thank you very much for the comment!

The differences in taxonomic annotations based on Blastx are substantial. The taxonomies listed in the table below are assigned to different isoforms derived from the same genes.

[Image not preserved: table of BLASTX taxonomic annotations for isoforms of the same genes]

In terms of the EggNOG mapper output, it appears that one isoform can be representative. For example, that isoform contains all the KO numbers found in other isoforms, although most of the isoforms are not annotated. The situation is depicted in the image below.

[Image not preserved: EggNOG mapper annotations for isoforms of the same gene]

"I'd also presume that for a metatranscriptome, the gene-to-isoform associations provided by Trinity are unlikely to be robust (this isn't always even the case for a "normal" transcriptome in my experience). So is it possible, perhaps, in your case that these isoforms are not actually isoforms at all?"

This makes sense to me... In this case, do you suggest, or do you know of anyone who suggests, using 'isoform' similarly to 'genes' instead of combining isoforms to construct less-likely genes? I would also appreciate any other suggestions on how to handle this issue!


Thank you for these additional details.

I suspect that some of these "isoforms" (a misnomer at this juncture, I think) are short enough that they are, at best, matching just a domain or some other part of the target sequence. In such a case, the matched sequence(s) would not necessarily come from the nearest taxonomic neighbor unless the region in question is known to be sufficiently conserved yet sufficiently variable (e.g., as in the 16S rRNA gene in bacteria).
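To check whether the short sequences are only hitting part of the target, you could look at how much of each subject protein the alignment covers. The sketch below assumes BLASTX was run with a custom tabular layout, `-outfmt "6 qseqid sseqid pident length sstart send slen evalue bitscore"`, so that the subject length is available. The 0.5 cutoff and the demo rows are arbitrary placeholders, not recommendations.

```python
import csv
import io

# Column order assumed to match the custom -outfmt 6 string in the text above.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "sstart", "send", "slen", "evalue", "bitscore"]


def subject_coverage(row):
    """Fraction of the subject protein spanned by the alignment."""
    start, end = sorted((int(row["sstart"]), int(row["send"])))
    return (end - start + 1) / int(row["slen"])


def flag_partial_hits(handle, min_cov=0.5):
    """Return (query, subject, coverage, is_partial) for every hit."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    flagged = []
    for row in reader:
        cov = subject_coverage(row)
        flagged.append((row["qseqid"], row["sseqid"], round(cov, 3), cov < min_cov))
    return flagged


# Made-up rows standing in for a real BLASTX output file.
demo = io.StringIO(
    "TRINITY_DN0_c0_g1_i1\tprotA\t62.0\t80\t101\t180\t400\t1e-20\t95.1\n"
    "TRINITY_DN0_c0_g1_i2\tprotB\t88.5\t350\t11\t360\t380\t1e-150\t520.0\n"
)
hits = flag_partial_hits(demo)
for hit in hits:
    print(hit)
```

Hits flagged as partial are the ones where a taxonomic call from the best match deserves the least trust.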

Another possibility is that these sequences (at least the ones you've shown in the first table) are "contaminants" (perhaps belonging to the same gene family), in the sense that they come from off-target organisms that were present in your community. You mentioned that this is a soil community, so finding some sequences from plants (assuming the soil is from the rhizosphere, for example) would not necessarily be amiss. If you have reason to believe that these are contaminants, you might have a case for discarding them from your analysis.

"In this case, do you suggest, or do you know of anyone who suggests, using 'isoform' similarly to 'genes' instead of combining isoforms to construct less-likely genes?"

Are you simply trying to get rid of redundancy in the assembly? If that's the case, you should probably just cluster at 90% identity and 90% coverage (of the shorter sequence), or some other "sufficiently high" threshold, and be done with it. If you don't want to cluster the sequences, just treat each sequence as its own independent "gene" unless you have extremely compelling evidence to indicate that some two or more sequences are isoforms.
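If you go the "each sequence is its own gene" route, downstream tools that expect a Trinity-style gene-to-transcript map (two tab-separated columns, gene ID then transcript ID) can be given an identity map instead of the assembler's groupings. A minimal sketch, with a made-up two-record FASTA standing in for the real assembly:

```python
import io


def identity_gene_trans_map(fasta_handle):
    """Return (gene, transcript) pairs in which every transcript is its own gene."""
    pairs = []
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" with no identifier
                pairs.append((fields[0], fields[0]))
    return pairs


# Hypothetical FASTA content; in practice, open the assembly file instead.
demo_fasta = io.StringIO(
    ">TRINITY_DN0_c0_g1_i1 len=300\nATGCATGC\n"
    ">TRINITY_DN0_c0_g1_i2 len=250\nATGGATGG\n"
)
pairs = identity_gene_trans_map(demo_fasta)
map_text = "".join(f"{gene}\t{transcript}\n" for gene, transcript in pairs)
print(map_text, end="")
```

With such a map, "gene-level" summaries simply become per-sequence summaries, which is the point when the isoform groupings cannot be trusted.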

If you're doing some kind of functional analysis with the EggNOG annotations, you might end up discarding all the sequences without KO (KEGG Orthology) numbers anyway.
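Dropping the sequences without KO numbers could look roughly like this. The sketch assumes the tab-separated eggNOG-mapper annotations layout with `##` comment lines, a header row starting with `#query`, a `KEGG_ko` column, and `-` for missing values; check these against your own output file, since column names may differ between versions. The demo table is made up.

```python
import csv
import io


def ko_annotated(handle, ko_column="KEGG_ko"):
    """Return {query: set of KO IDs}, skipping queries with no KO assigned."""
    # Drop the "##" comment lines so the "#query" row is read as the header.
    lines = (line for line in handle if not line.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t")
    kept = {}
    for row in reader:
        raw = (row.get(ko_column) or "").strip()
        if raw in ("", "-"):
            continue
        kept[row["#query"]] = {ko.replace("ko:", "") for ko in raw.split(",")}
    return kept


# Hypothetical, heavily trimmed annotations table for illustration.
demo = io.StringIO(
    "## demo annotations\n"
    "#query\tseed_ortholog\tKEGG_ko\n"
    "TRINITY_DN0_c0_g1_i1\tx\t-\n"
    "TRINITY_DN0_c0_g1_i2\ty\tko:K00001,ko:K00002\n"
    "## 2 queries scanned\n"
)
kept = ko_annotated(demo)
print(kept)
```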

I think that, in any case, you have good reasons to justify disregarding the gene-isoform relationships proposed by the assembler for this data.
