Question

Converting an output de-novo transcriptome assembled with Trinity to a .gff3 file

0

5.3 years ago

Raito92 ▴ 100

Hello! I've assembled a de novo transcriptome with Trinity, resulting in Trinity.fasta, whose headers look like this:

>TRINITY_DN29256_c0_g1_i1 len=323 path=[0:0-322]

Each header is followed, on the next line, by the sequence.

To run an external downstream analysis with an R script, I need a .gff3 reference file (for the featureCounts function from Rsubread). Of course, for now, annotation isn't needed, just names and coordinates.

I've already performed a classic edgeR analysis with Trinity; I'm just trying something different and need this very specific input file.

Can anyone help me here? Thanks in advance!

Trinity • 4.8k views

ADD COMMENT • link updated 3 months ago by jon50250 • 0 • written 5.3 years ago by Raito92 ▴ 100

0

I do not have experience with Trinity, but I have seen similar cases where a GFF3 was obtained by mapping the Trinity fasta to the reference with GMAP. Maybe it can help in your case.

ADD REPLY • link 5.3 years ago by alex.zaccaron ▴ 470

0

I've tried to use GMAP, with the following code, but the script seems to freeze for no reason and I get an empty output file.

gmap -d Trinity.fasta -f 3 > meh.gff3

What do you mean by reference? It's a de-novo assembly, because my organism is not a model one, so I don't really have one.

ADD REPLY • link 5.3 years ago by Raito92 ▴ 100

1

GMAP maps transcripts against a reference genome. The GFF you get describes the location and structure of the transcripts within the reference genome. As you don't have a reference genome, it is of no use here.

What you can do is use TransDecoder to predict the coding regions within a transcript FASTA file. The GFF you get describes the feature type of the different regions in each sequence, i.e. the exon, what is coding (CDS), and what is non-coding (UTR).

ADD REPLY • link 5.1 years ago by Juke34 9.0k

0

Hi Juke34, if you're able, can you please clarify something for me (based on the answer you've given here)? Thanks a lot in advance. I also have a similar issue where I know that I need a gtf or gff file for downstream mapping, but not sure which approach is best. I've already conducted a de novo Trinity reconstruction for my non-model species, and I've completed the Transdecoder and Trinotate pipelines. We have a "good enough" genome for this species already, but I didn't use it for the reconstruction because we didn't want to be constrained by the genome. Downstream, we need to map some RNA-seq reads using this "good enough" genome, and my annotation is supposed to accompany this, but I'm unsure about the gtf file. Do I use the transdecoder one, or should I use GMAP to obtain one that's specific to my "good enough" genome? I would greatly appreciate some help. Thank you!

ADD REPLY • link 21 months ago by EkHe ▴ 10

2

You will need to map your TransDecoder FASTA file to your genome to make an annotation (GFF/GTF) that describes the location of your genes within this genome. You can use BRAKER, AUGUSTUS, MAKER, PASA, or another annotation tool.

ADD REPLY • link 21 months ago by Juke34 9.0k

0

This was helpful, thank you! For mapping the TransDecoder output back to the genome, say with BRAKER3, would this be the peptide FASTA? So the total inputs would include the genome and the raw FASTQ RNA-seq reads, as well as the peptides from TransDecoder?

ADD REPLY • link 3 months ago by jon50250 • 0

0

Maybe map with minimap2 instead, then bamtobed, then to gff (or maybe there's a direct bam->gff converter...)

ADD REPLY • link 5.1 years ago by cschu181 ★ 2.8k
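
The last step of that suggestion (BED to GFF) can be sketched with awk. This is only an illustration, assuming a six-column BED file such as `bedtools bamtobed` writes (chrom, start, end, name, score, strand) and assuming a reference genome exists to align against. The file names, the toy rows, the `minimap2` source label, and the `cDNA_match` feature type are choices made for this example, not something given in the thread. The one detail that is easy to get wrong is the coordinate shift: BED starts are 0-based, GFF3 starts are 1-based.

```shell
# Toy BED6 rows standing in for `bedtools bamtobed` output (hypothetical data).
printf 'chr1\t99\t422\tTRINITY_DN29256_c0_g1_i1\t60\t+\n' > toy_alignments.bed
printf 'chr2\t0\t150\tTRINITY_DN5_c0_g1_i1\t60\t-\n' >> toy_alignments.bed

# BED is 0-based, half-open; GFF3 is 1-based, inclusive.
# So the start is shifted by +1 and the end is copied unchanged.
awk 'BEGIN { FS = OFS = "\t"; print "##gff-version 3" }
     { print $1, "minimap2", "cDNA_match", $2 + 1, $3, ".", $6, ".", "Name=" $4 }' \
    toy_alignments.bed > toy_alignments.gff3

cat toy_alignments.gff3
```

Note that plain BED6 output collapses each alignment into a single interval, so intron structure is lost; `bedtools bamtobed` has `-split` and `-bed12` options if the blocks of a spliced alignment are needed.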

score 1 · Answer 1 · 2019-11-12

featureCounts assigns zero counts to multi-mapped reads. Trinity assemblies have a lot of "redundancy", as the assembler tries to recover all possible isoforms of a gene. This would mean a lot of the mapped reads would map to multiple locations (to several isoforms), and featureCounts would assign zero counts to all those reads. Better approaches to deal with this would be quantification with RSEM, Salmon or kallisto.
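
To get a feel for how much of this redundancy an assembly contains, the isoforms per gene can be counted straight from the FASTA headers. The sketch below assumes the usual Trinity naming, where the trailing `_i<N>` marks the isoform and everything before it identifies the gene (as in `TRINITY_DN29256_c0_g1_i1` from the question). The file names and toy sequences are made up for the example.

```shell
# Toy FASTA standing in for Trinity.fasta (hypothetical sequences).
cat > toy_isoforms.fasta <<'EOF'
>TRINITY_DN100_c0_g1_i1 len=8 path=[0:0-7]
ACGTACGT
>TRINITY_DN100_c0_g1_i2 len=6 path=[0:0-5]
ACGTAC
>TRINITY_DN100_c0_g1_i3 len=4 path=[0:0-3]
ACGT
>TRINITY_DN200_c1_g2_i1 len=5 path=[0:0-4]
ACGTA
EOF

# Build a gene<TAB>transcript map by stripping the trailing _i<N> from each ID.
awk '/^>/ {
       tx = substr($1, 2)            # transcript ID without the leading ">"
       gene = tx
       sub(/_i[0-9]+$/, "", gene)    # drop the isoform suffix to get the gene ID
       print gene "\t" tx
     }' toy_isoforms.fasta > toy_gene_trans_map.tsv

# Count isoforms per gene.
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g "\t" n[g] }' \
    toy_gene_trans_map.tsv | sort > toy_isoforms_per_gene.tsv

cat toy_isoforms_per_gene.tsv
```

Genes with many isoforms are where reads are most likely to align to several transcripts at once, which is the situation the quantifiers named above are designed to handle.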

score 1 · Answer 2 · 2022-02-17

Trinity ships a cdna_fasta_file_to_transcript_gtf.pl script, in the util/misc folder of the Trinity installation, that makes a GTF file out of a Trinity FASTA.

perl /<trinity_folder>/util/misc/cdna_fasta_file_to_transcript_gtf.pl Trinity.fasta | grep -w "exon" - > Trinity.gtf

You can also remove the pipe and what's after it; I include it because some software requires the GTF to contain only "exon" lines:

perl /<trinity_folder>/util/misc/cdna_fasta_file_to_transcript_gtf.pl Trinity.fasta > Trinity.gtf

You can then convert your gtf file into gff3 if necessary, using gffread from Cufflinks:

`gffread Trinity.gtf -o Trinity.gff3`

This essentially gives a GTF/GFF3 file with the start and end coordinates of each FASTA sequence. In software that requires GTF/GFF3, Trinity.fasta can then be used in place of the "genome" file when no reference genome is available to map the transcriptome to.
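
For readers without the Trinity utilities at hand, the same idea (one feature per contig, spanning position 1 to the contig length) can be sketched with plain awk. This is a hand-rolled illustration, not a reimplementation of the Trinity script: the `Trinity` source label, the `exon` feature type, the `+` strand, the `ID=` attribute, and the toy file names are all assumptions made for the example.

```shell
# Toy input standing in for Trinity.fasta (hypothetical sequences).
cat > toy_Trinity.fasta <<'EOF'
>TRINITY_DN1_c0_g1_i1 len=12 path=[0:0-11]
ACGTACGTACGT
>TRINITY_DN1_c0_g1_i2 len=8 path=[0:0-7]
ACGT
ACGT
EOF

# Emit one GFF3 line per contig: seqid = contig name, start = 1, end = length.
# Sequence lines are summed, so wrapped (multi-line) FASTA records also work.
awk '
  function emit() {
    if (id != "") {
      printf "%s\tTrinity\texon\t1\t%d\t.\t+\t.\tID=%s\n", id, len, id
    }
  }
  BEGIN { print "##gff-version 3" }
  /^>/  { emit(); id = substr($1, 2); len = 0; next }   # new record: flush the previous one
        { len += length($0) }                           # accumulate sequence length
  END   { emit() }                                      # flush the last record
' toy_Trinity.fasta > toy_Trinity.gff3

cat toy_Trinity.gff3
```

Whatever feature type and attribute key end up in the file have to match what the downstream tool is told to look for; in Rsubread's featureCounts these are the `GTF.featureType` and `GTF.attrType` arguments.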