I am following the Trinotate pipeline to functionally annotate an IsoSeq transcriptome for a non-model organism. I've performed all of the necessary TransDecoder steps and am at the stage of the Trinotate pipeline where I am loading data into the sqlite database. I have my assembled transcripts fasta and the transdecoder.pep file, but I am missing the tab-delimited gene to transcript mapping file. Were this a Trinity-assembled transcriptome, I know there is a script that would generate that mapping file, but I'm unsure how I would derive this file from my non-Trinity-assembled transcriptome. Is there a file that would have been generated earlier in this pipeline with the correct formatting, or is there a way I can generate this file?
I do not think this is possible. The get_Trinity_gene_to_trans_map.pl script only uses the information in the transcript fasta headers to generate that tab-delimited gene-to-transcript mapping file.
I am aware that the provided script doesn't work if the fasta headers aren't in the Trinity format, but Trinotate should be able to use non-Trinity assemblies if a gene-to-transcript text file can be generated. I would imagine there are other ways to do this short of manually creating that file in a text editor. The Trinotate documentation doesn't give any details about how to do this; it only states that it's up to the user to provide the file if a non-Trinity assembly is used. I have no experience on this end, but I would expect that a relatively simple script could pull the IDs from more generic fasta headers and map each one to its own "gene" ID in a tab-delimited text output. Mine look like:
" >UnnamedSample_HQ_transcript/0 "
Followed by the sequence.
Alternatively, are there other annotation pipelines I could look into - ideally that would accept some of these outputs I've already generated?
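For what it's worth, here is a minimal sketch of the kind of simple script described above. It assumes every transcript is treated as its own "gene" (so no isoform grouping at all), that the transcript ID is the header text up to the first whitespace, and that the map has the gene ID in the first column and the transcript ID in the second. The function names are made up for illustration; none of this comes from Trinotate itself.

```python
import io


def fasta_ids(handle):
    """Yield each record ID: the header text after '>' up to the first whitespace."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def write_one_to_one_map(fasta_handle, out_handle):
    """Write 'gene_id<TAB>transcript_id' lines, reusing the transcript ID as the gene ID.

    Assumption: each transcript is its own gene, so isoforms are NOT grouped.
    Returns the number of records written.
    """
    count = 0
    for transcript_id in fasta_ids(fasta_handle):
        out_handle.write(f"{transcript_id}\t{transcript_id}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory example using headers shaped like the ones in this post.
    demo = io.StringIO(
        ">UnnamedSample_HQ_transcript/0\nACGT\n"
        ">UnnamedSample_HQ_transcript/1\nGGCC\n"
    )
    out = io.StringIO()
    write_one_to_one_map(demo, out)
    print(out.getvalue(), end="")
```

For a real file, the two StringIO objects would be replaced with open() handles for the transcripts fasta and the output map. The IDs written here have to match the IDs in the transcripts fasta exactly.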
It is not that simple. The information stored in the fasta header of each Trinity transcript is the result of Butterfly, which is the final step of the assembly pipeline. If multiple transcripts originate from the same de Bruijn graph, those transcripts are considered isoforms of the same gene. Technically speaking, the gene-to-transcript relationships are predicted during the assembly pipeline, not by the get_Trinity_gene_to_trans_map.pl script.
One possibility is to map your transcripts to a closely related reference genome, if one is available, but I would post this question in the issues section of the Trinotate/Trinity GitHub repository.
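To illustrate the point about the headers, here is a rough sketch of how a gene ID can be read off a Trinity-style transcript ID. It assumes the usual naming layout ending in _c<N>_g<N>_i<N>, where the part before the _i suffix is the gene. The regular expression is an approximation written for this example and is not the logic of the actual Perl script.

```python
import re

# Assumed Trinity ID layout: <prefix>_c<N>_g<N>_i<N>,
# e.g. TRINITY_DN1000_c115_g5_i1 (gene = everything before "_i1").
TRINITY_ID = re.compile(r"^(?P<gene>.+_c\d+_g\d+)_i\d+$")


def trinity_gene_id(transcript_id):
    """Return the gene part of a Trinity-style ID, or None if the ID does not fit."""
    match = TRINITY_ID.match(transcript_id)
    return match.group("gene") if match else None


print(trinity_gene_id("TRINITY_DN1000_c115_g5_i1"))      # TRINITY_DN1000_c115_g5
print(trinity_gene_id("UnnamedSample_HQ_transcript/0"))  # None
```

The second call shows why an IsoSeq header gives nothing to work with: the grouping was never encoded in the ID, so it has to come from somewhere else, such as the genome mapping suggested above.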
Hey, did you solve it? I'm stuck on this now.