Hi there,
I am wondering if anyone could offer a tip or any help with generating the transcript-to-gene map file needed for using salmon to align RNA-seq data against a reference transcriptome?
I would like to do this with the QUT Nicotiana benthamiana reference transcriptome. However, the way the GFF3 annotation file is constructed makes this impossible with the BUSpaRse package, and there is no GTF file where "transcript_id" and "gene_id" are helpfully specified.
In the attributes column of the GFF3 file, it's not obvious to me which tag denotes the transcript and which denotes the gene. I'm guessing that (for my purposes at least) "Nbv5tr6198039.mrna1", for example, may be considered the transcript ID, while "Nbv5tr6198039" may be considered the gene ID. Please see below some example lines from the GFF3 file.
Nbv0.5scaffold4004 Nbdbv05 gene 109116 109315 . - . ID=Nbv5tr6198039.path1;Name=not determined by homology or low homology during annotation
Nbv0.5scaffold4004 Nbdbv05 mRNA 109116 109315 . - . ID=Nbv5tr6198039.mrna1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.path1;coverage=100.0;identity=100.0
Nbv0.5scaffold4004 Nbdbv05 CDS 109168 109314 100 - 0 ID=Nbv5tr6198039.mrna1.cds1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.mrna1;Target=Nbv5tr6198039 2 148 +
Thanks in advance for any help.
Yes, the value of the ID attribute of the gene feature can be considered the gene_id, and the value of the ID attribute of the mRNA feature can be considered the transcript_id.
If you need a GTF file, you can convert your GFF file using one of these tools: https://agat.readthedocs.io/en/latest/gff_to_gtf.html
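If all you need is the two-column mapping, you may not need a GTF at all. Below is a minimal Python sketch, not a tested pipeline: it assumes the GFF3 is tab-separated (as the spec requires), that transcripts are the `mRNA` features, and that each mRNA's `Parent` attribute points at the ID of its gene feature, as in the example lines above. The function names are made up for illustration.

```python
import csv
import io

# The three example lines from the question, written with explicit tabs.
EXAMPLE_GFF3 = [
    "Nbv0.5scaffold4004\tNbdbv05\tgene\t109116\t109315\t.\t-\t.\t"
    "ID=Nbv5tr6198039.path1;Name=not determined by homology or low homology during annotation",
    "Nbv0.5scaffold4004\tNbdbv05\tmRNA\t109116\t109315\t.\t-\t.\t"
    "ID=Nbv5tr6198039.mrna1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.path1;coverage=100.0;identity=100.0",
    "Nbv0.5scaffold4004\tNbdbv05\tCDS\t109168\t109314\t100\t-\t0\t"
    "ID=Nbv5tr6198039.mrna1.cds1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.mrna1;Target=Nbv5tr6198039 2 148 +",
]


def parse_attributes(field):
    """Turn 'ID=a;Name=b;Parent=c' into {'ID': 'a', 'Name': 'b', 'Parent': 'c'}."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tx2gene_from_gff3(lines, transcript_types=("mRNA",)):
    """Return (transcript_id, gene_id) pairs taken from transcript-level features."""
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: ID is the transcript, Parent is the gene it belongs to.
        if "ID" in attrs and "Parent" in attrs:
            pairs.append((attrs["ID"], attrs["Parent"]))
    return pairs


def write_tx2gene(pairs, handle):
    """Write one 'transcript<TAB>gene' row per pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)


if __name__ == "__main__":
    out = io.StringIO()
    write_tx2gene(tx2gene_from_gff3(EXAMPLE_GFF3), out)
    print(out.getvalue(), end="")
```

For a real file you would pass an open file handle instead of `EXAMPLE_GFF3`. One thing to check before using the result: the first column has to match the sequence names in the transcriptome FASTA that the salmon index was built from, so if the FASTA headers use a different form of the ID, the mapping needs adjusting accordingly.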
Thank you very much for your help. I have now converted the GFF3 file into GTF format using gffread. However, before I use this GTF file to generate my transcript-to-gene mapping file (to use with salmon and eventual spliceosomal analysis), I think it may need further modification.
Lines in the GTF file currently look like this:
The gene_id is given a different extension (".pathX") depending on the transcript. Am I right in thinking this should not be the case, and that different transcript IDs of the same gene should map to exactly the same gene_id? If so, should I strip the ".pathX" extension from the gene_id in the file?
TL;DR: should I modify my newly generated GTF so that the lines above look more like the lines below?
I guess you should not make this modification; otherwise you risk merging different genes together. Looking at the locations of mrna1 and mrna2, I don't think they are part of the same gene.
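If you want to check this across the whole annotation rather than by eye, here is a rough Python sketch. It groups gene features by their ID with the ".pathN" suffix removed and flags any group whose members do not overlap on the same scaffold and strand. Treating "overlapping on the same scaffold and strand" as "same locus" is my own simplifying assumption, the function names are hypothetical, and the toy records (geneA, geneB, scafA, scafB and their coordinates) are invented purely for illustration, not taken from the real annotation.

```python
import re
from collections import defaultdict

PATH_SUFFIX = re.compile(r"\.path\d+$")

# Invented toy gene features (tab-separated), NOT real annotation data.
TOY_GFF3 = [
    "scafA\ttoy\tgene\t100\t500\t.\t+\t.\tID=geneA.path1",
    "scafB\ttoy\tgene\t9000\t9400\t.\t-\t.\tID=geneA.path2",
    "scafA\ttoy\tgene\t2000\t2600\t.\t+\t.\tID=geneB.path1",
    "scafA\ttoy\tgene\t2100\t2700\t.\t+\t.\tID=geneB.path2",
]


def loci_by_base_id(lines):
    """Group gene features by their ID with the '.pathN' suffix stripped."""
    loci = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match is None:
            continue
        gene_id = match.group(1)
        base_id = PATH_SUFFIX.sub("", gene_id)
        loci[base_id].append((gene_id, cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return loci


def same_locus(a, b):
    """Assumption: same scaffold, same strand and overlapping coordinates."""
    _, seq_a, start_a, end_a, strand_a = a
    _, seq_b, start_b, end_b, strand_b = b
    return (seq_a == seq_b and strand_a == strand_b
            and start_a <= end_b and start_b <= end_a)


def ids_unsafe_to_merge(lines):
    """Map each base ID to the pairs of '.pathN' genes that sit at different loci."""
    unsafe = {}
    for base_id, loci in loci_by_base_id(lines).items():
        distinct = [(a[0], b[0])
                    for i, a in enumerate(loci)
                    for b in loci[i + 1:]
                    if not same_locus(a, b)]
        if distinct:
            unsafe[base_id] = distinct
    return unsafe


if __name__ == "__main__":
    for base_id, pairs in ids_unsafe_to_merge(TOY_GFF3).items():
        print(base_id, pairs)
```

Any base ID reported here would have separate loci collapsed into one gene if the ".pathX" suffix were stripped, which is the risk described above. Whether that collapsing is acceptable depends on what the downstream analysis expects a "gene" to be.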