What do you mean by "real one"?
A de novo assembled GTF keeps an attribute called ref_gene_id, which refers to the corresponding gene in the guide annotation. If you want to have the ref_gene_id instead of the MSTRG one, I'd suggest doing this in Python. Here's my approach:
Extract transcripts from the gtf
awk -F'\t' '{ if ($3=="transcript") print $0}' original.gtf > output.gtf
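If awk is not at hand, the same filter can be sketched in Python. This is only an illustration: the function name transcript_lines and the two sample records are made up, not taken from a real StringTie run.

```python
import io

def transcript_lines(handle):
    """Yield only the GTF records whose 3rd column (feature) is 'transcript'."""
    for line in handle:
        if line.startswith('#'):
            continue  # header/comment lines have no feature column
        columns = line.rstrip('\n').split('\t')
        if len(columns) >= 3 and columns[2] == 'transcript':
            yield line

# Made-up records, only to show the behaviour
sample = io.StringIO(
    '# header comment\n'
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n'
    'chr1\tStringTie\texon\t100\t400\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";\n'
)
kept = list(transcript_lines(sample))
print(len(kept))
```

With a real file you would pass open('original.gtf') instead of the StringIO object and write the yielded lines to output.gtf.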
Build a conversion file: run the script below, redirecting its output, like this:
python3 script.py > conversion_list.tsv
script.py
#!/usr/bin/python3
# Build a two-column table: de novo gene_id <TAB> ref_gene_id
fields = {}
with open('output.gtf', 'r') as gtf:
    for line in gtf:
        att_column = line.rstrip('\n').split('\t')[8]
        if "ref_gene_id" in att_column:
            # Turn 'key "value"; key "value";' into a dict, so that the
            # position of ref_gene_id within the column does not matter
            attrs = {}
            for kvp in att_column.split(';'):
                kvp = kvp.strip()
                if kvp:
                    key, value = kvp.split(' ', 1)
                    attrs[key] = value.strip('"')
            fields[attrs['gene_id']] = attrs['ref_gene_id']
# dict full, time to output
for k in fields.keys():
    print('{}\t{}'.format(k, fields[k]))
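As an alternative to splitting on ';' by hand, the attribute column can be parsed with a regular expression. The sketch below assumes the usual key "value"; layout of the 9th column; the helper name parse_attributes and the sample attribute string are hypothetical, so check them against your own file.

```python
import re

# One 'key "value"' pair; \w+ keeps the key free of ';' and spaces
ATTRIBUTE_RE = re.compile(r'(\w+) "([^"]*)"')

def parse_attributes(att_column):
    """Return the GTF attribute column as a {key: value} dict."""
    return dict(ATTRIBUTE_RE.findall(att_column))

# Made-up attribute column, not from a real assembly
sample_attributes = ('gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; '
                     'reference_id "REF_T1"; ref_gene_id "REF_G1"; cov "12.5";')

attrs = parse_attributes(sample_attributes)
mapping = {}
if 'ref_gene_id' in attrs:
    mapping[attrs['gene_id']] = attrs['ref_gene_id']
print(mapping)  # {'MSTRG.1': 'REF_G1'}
```

Looking attributes up by name means the result does not depend on where ref_gene_id sits in the column.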
Now use this file to rebuild the GTF (eid_conversion.py):
#!/usr/bin/python3
'''
This script converts de novo gene_id values into the ENSEMBL ones.
Code syntax is not the best, so apologies for that.
Briefly, it loads a tab-separated file composed of:
denovo_gene_id    ensembl_gene_id
to fill a dictionary, where *denovo_gene_id* is the key and *ensembl_gene_id* is the value.
Then it loads the GTF file: 1st to 8th column into an untouched variable -> fixed_fields
9th column to be edited -> attr_column
It splits the 9th column by semicolon to retrieve the de novo gene_id value, then searches
the dictionary for the corresponding value.
It prints to stdout, so it needs to be run like:
$ python3 eid_conversion.py > desired_output.gtf
'''
import sys

conv_dict = {}

def fill_dict(dict_to_fill):
    with open('conversion_list.tsv', 'r') as conv_table:
        for line in conv_table:
            line = line.rstrip()
            fields = line.split('\t')
            dict_to_fill[fields[0]] = fields[1]

fill_dict(conv_dict)
# dict full, time to convert
with open('original.gtf', 'r') as gtf_denovo:
    for line in gtf_denovo:
        line = line.rstrip()
        if line.startswith('#'):
            print(line)
        else:
            try:
                columns = line.split('\t')
                fixed_fields = '\t'.join(columns[:8])
                attr_column = columns[8].split(';')
                # the first attribute is expected to be: gene_id "MSTRG.x"
                gene_id = attr_column[0].split(' ')[1].replace('"', '')
                # the remaining attributes keep their leading space
                rest_of_line = ';'.join(attr_column[1:])
                try:
                    gene_id = conv_dict[gene_id]
                    replaced_line = 'gene_id "%s";%s' % (gene_id, rest_of_line)
                    print('{}\t{}'.format(fixed_fields, replaced_line))
                except KeyError:
                    # no reference gene for this id: keep the line unchanged
                    print(line)
            except IndexError:
                # report on stderr so the message does not end up inside the output GTF
                print('GTF file may be corrupted. Check this line %s' % line, file=sys.stderr)
                sys.exit(1)
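To check the result, you could count how many records still carry a de novo gene_id after the conversion. This is a rough sketch: the function name count_gene_id_prefixes, the 'MSTRG' default prefix and the two sample records are assumptions for illustration only.

```python
import io
import re
from collections import Counter

# \b stops the pattern from matching inside 'ref_gene_id'
GENE_ID_RE = re.compile(r'\bgene_id "([^"]*)"')

def count_gene_id_prefixes(handle, denovo_prefix='MSTRG'):
    """Count records whose gene_id is still de novo vs. taken from the reference."""
    counts = Counter()
    for line in handle:
        if line.startswith('#'):
            continue
        match = GENE_ID_RE.search(line)
        if match is None:
            continue
        kind = 'denovo' if match.group(1).startswith(denovo_prefix) else 'reference'
        counts[kind] += 1
    return counts

# Made-up converted output: one gene replaced, one left untouched
sample = io.StringIO(
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "REF_G1"; transcript_id "MSTRG.1.1"; ref_gene_id "REF_G1";\n'
    'chr1\tStringTie\ttranscript\t2000\t2900\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n'
)
counts = count_gene_id_prefixes(sample)
print(counts['reference'], counts['denovo'])  # 1 1
```

On the real output you would pass open('desired_output.gtf'). Records that keep an MSTRG id are the ones with no ref_gene_id in the conversion list, i.e. genes without a match in the guide annotation.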
Just to check: you don't need to bother with StringTie if you have a well-annotated genome. Are you working with a non-model organism?