I am assembling a GTF file from a BAM file that I generated by aligning my RNA-seq reads with STAR. The assembly was done with StringTie, using the Ensembl annotation file for GRCh38 as the reference.
My problem is that the resulting GTF file does not contain all the attributes that are in the reference annotation. Crucially, it is missing the transcript biotype, which is what I am interested in.
For instance, the reference annotation has the following fields for an exon of a transcript:
1 havana exon 12975 13052 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "4"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; exon_id "ENSE00001799933"; exon_version "2"; tag "basic"; transcript_support_level "NA";
However, my assembled gtf file looks like this:
1 StringTie exon 12613 12721 1000 + . gene_id "MSTRG.1"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; ref_gene_id "ENSG00000223972";
I've also tried searching the entire file for "transcript_biotype" but nothing comes up.
From this previous post, I saw that a potential fix might be to convert the GTF to BED12 and then annotate the BED12 file using the Ensembl annotation file. However, I'm not sure exactly which bedtools function to use for that.
It would be great if anyone could point to a different solution.
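One alternative to bedtools that might work: the assembled transcripts that match the reference seem to keep their Ensembl transcript_id (as in the example line above), so the biotype could be looked up by ID and appended to the attribute column. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes both files are tab-delimited GTF and that the reference carries a transcript_biotype attribute; the function names and the file names in the usage comment are made up for illustration. Novel transcripts (IDs like MSTRG.x.y) have no reference entry and are left unchanged.

```python
import re

# Matches key "value" pairs in the attribute column (column 9) of a GTF line.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the attribute column as a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(attr_field))


def build_biotype_map(ref_lines):
    """Map transcript_id -> transcript_biotype from reference GTF lines."""
    biotypes = {}
    for line in ref_lines:
        if line.startswith("#"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        biotype = attrs.get("transcript_biotype")
        if tid and biotype:
            biotypes[tid] = biotype
    return biotypes


def annotate(assembled_lines, biotypes):
    """Yield assembled GTF lines, adding transcript_biotype where the ID is known."""
    for line in assembled_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            yield line  # pass comments and malformed lines through untouched
            continue
        tid = parse_attributes(cols[8]).get("transcript_id")
        biotype = biotypes.get(tid)
        if biotype:
            attr_text = cols[8].rstrip()
            if not attr_text.endswith(";"):
                attr_text += ";"
            cols[8] = f'{attr_text} transcript_biotype "{biotype}";'
        yield "\t".join(cols) + "\n"


def annotate_file(ref_path, assembled_path, out_path):
    """Convenience wrapper around the two steps above for files on disk."""
    with open(ref_path) as ref:
        biotypes = build_biotype_map(ref)
    with open(assembled_path) as src, open(out_path, "w") as dst:
        dst.writelines(annotate(src, biotypes))


# Hypothetical usage (file names are placeholders):
# annotate_file("reference.gtf", "stringtie.gtf", "stringtie.biotype.gtf")
```

The lookup is keyed on transcript_id rather than on coordinates, so it only helps for transcripts StringTie matched to a reference transcript; anything else would still need an overlap-based approach such as the bedtools one.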
Hey, same question here. Have you solved it?