Question

Gene trans map for Trinotate input using non-Trinity assembled transcriptome?

2.8 years ago

Corey • 0

I am following the Trinotate pipeline to functionally annotate an IsoSeq transcriptome for a non-model organism. I've performed all of the necessary TransDecoder steps and am at the stage of the Trinotate pipeline where I am loading data into the sqlite database. I have my assembled transcripts fasta and the transdecoder.pep file, but I am missing the tab-delimited gene to transcript mapping file. Were this a Trinity-assembled transcriptome, I know there is a script that would generate that mapping file, but I'm unsure how I would derive this file from my non-Trinity-assembled transcriptome. Is there a file that would have been generated earlier in this pipeline with the correct formatting, or is there a way I can generate this file?

RNAseq transcriptomics Trinotate assembly annotation • 3.2k views

ADD COMMENT • link updated 21 months ago by liyong ▴ 80 • written 2.8 years ago by Corey • 0

I do not think this is possible. The get_Trinity_gene_to_trans_map.pl script only uses the information in the transcript FASTA headers to generate the tab-delimited gene-to-transcript mapping file.

ADD REPLY • link 2.8 years ago by andres.firrincieli 3.8k

I am aware that the provided script doesn't work if the fasta headers aren't in the Trinity format, but Trinotate should be able to use non-Trinity assemblies if a gene-to-transcript text file can be generated. I would imagine there are other ways to do this short of manually creating that file in a text editor - the Trinotate pipeline doesn't give any details about how to do this, only stating that it's up to the user to provide it if non-Trinity assemblies are used. I have no experience on this end, but I would expect that a relatively simple script would be able to pull more generic fasta headers and map them to their own "gene" ID with a tab-delimited text output. Mine look like:

">UnnamedSample_HQ_transcript/0"

Followed by the sequence.
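
A minimal Python sketch of the kind of script described above, under the assumption that every transcript can be treated as its own "gene" (no isoform grouping). It takes the first whitespace-delimited token of each FASTA header as the transcript ID and writes it twice, tab-separated, one pair per line, with the gene ID in the first column. The function names are hypothetical, and the second header in the demo is made up for illustration; the IDs written out need to match the transcript IDs used in the other Trinotate inputs.

```python
import io


def fasta_ids(handle):
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" line with no ID
                yield fields[0]


def write_gene_trans_map(fasta_handle, out_handle):
    """Write 'gene<TAB>transcript' lines, using each transcript ID as its own gene ID.

    Assumption: no isoform grouping is known, so gene ID == transcript ID.
    Returns the number of lines written.
    """
    count = 0
    for transcript_id in fasta_ids(fasta_handle):
        out_handle.write(f"{transcript_id}\t{transcript_id}\n")
        count += 1
    return count


# Small in-memory demo; the "/1" record is a hypothetical second transcript.
demo_fasta = io.StringIO(
    ">UnnamedSample_HQ_transcript/0\nACGT\n"
    ">UnnamedSample_HQ_transcript/1\nGGCC\n"
)
demo_out = io.StringIO()
n_written = write_gene_trans_map(demo_fasta, demo_out)
print(demo_out.getvalue(), end="")
```

For a real file, the two StringIO objects would be replaced by open file handles, for example `open("transcripts.fasta")` and `open("gene_trans_map.txt", "w")` (both file names are placeholders).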

Alternatively, are there other annotation pipelines I could look into - ideally that would accept some of these outputs I've already generated?

ADD REPLY • link 2.8 years ago by Corey • 0

"I have no experience on this end, but I would expect that a relatively simple script would be able to pull more generic fasta headers and map them to their own "gene" ID with a tab-delimited text output."

It is not that simple. The information stored in the FASTA header of each Trinity transcript comes from Butterfly, the final step of the assembly pipeline. If multiple transcripts originate from the same de Bruijn graph, they are considered isoforms of the same gene. Technically speaking, the gene-to-transcript relationships are predicted during the assembly itself, not by the get_Trinity_gene_to_trans_map.pl script.

"Alternatively, are there other annotation pipelines I could look into - ideally that would accept some of these outputs I've already generated?"

One possibility is to map your transcripts to a closely related reference genome, if one is available, but I would post this question in the issues section of the Trinotate/Trinity GitHub repository.

ADD REPLY • link 2.8 years ago by andres.firrincieli 3.8k

Hey, did you solve it? I'm stuck on this now.

ADD REPLY • link 2.5 years ago by Pilar • 0

score 0 · Answer 1 · 2022-07-08

Hello, if I understand correctly, you need a file that maps genes to transcripts, and you have already run TransDecoder? Could you use the GFF3 file generated by TransDecoder and some commands like awk?

Something like:

Then edit and repeat for the transcript ID and make those outputs into two columns.

This particular script is based on my TransDecoder GFF output. Basically, it says: if the feature is a gene, go to the 9th column, where all the info and IDs are, then cut the second item delimited by a space and then the first item delimited by a semicolon. My gene IDs have quotation marks around them, so the sed command removes those.
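
The steps described above can be roughly sketched in Python. This is not the awk/sed command itself, only an illustration under some assumptions: the 9th column holds space-delimited, quoted attributes such as `gene_id "G1"; transcript_id "T1";`, and rows of feature type `transcript` carry both IDs. The key names, feature name, function names, and demo rows are all hypothetical and would need adjusting to the actual file. Reading both IDs from the same row, rather than extracting two separate columns and pasting them together, avoids any risk of the columns going out of alignment.

```python
import re


def attr_value(attributes, key):
    """Return the value for `key` in a 'key "value";' attribute string, or None."""
    pattern = r'(?:^|;)\s*' + re.escape(key) + r'\s+"?([^";]+)"?'
    match = re.search(pattern, attributes)
    return match.group(1) if match else None


def gene_trans_pairs(lines, feature="transcript"):
    """Collect (gene_id, transcript_id) pairs from rows of the given feature type.

    Assumption: the attribute keys are named gene_id and transcript_id.
    """
    pairs = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # only keep well-formed rows of the requested feature
        gene = attr_value(cols[8], "gene_id")
        transcript = attr_value(cols[8], "transcript_id")
        if gene and transcript:
            pairs.append((gene, transcript))
    return pairs


# Hypothetical rows for illustration only.
demo_rows = [
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "G1";',
    'chr1\tsrc\ttranscript\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
for gene, transcript in gene_trans_pairs(demo_rows):
    print(f"{gene}\t{transcript}")
```

If the attribute column uses `key=value` pairs instead, the pattern in `attr_value` would have to be changed accordingly.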

If you could provide a sample of your gff file we could probably write a specific script to isolate your gene and transcript IDs. I am still a beginner with this stuff but have found some solutions for similar issues using some basic shell scripting.

score 0 · Answer 2 · 2023-02-06

21 months ago

liyong ▴ 80

This thread (https://github.com/Trinotate/Trinotate.github.io/issues/63) might help.

ADD COMMENT • link 21 months ago by liyong ▴ 80