Question

Convert MSTRG gene IDs to real ones?


3.5 years ago

HaroonPakistan • 0

Hi, Thanks to all seniors and juniors!

I ran StringTie to merge the GTF files from all my samples, but it automatically assigns gene IDs of the form MSTRG (the default). I want to keep the original gene names. The command I used is shown below. Could anyone suggest a command that retains the original gene names?

Command:

$ stringtie --merge -p 8 -G genome.gff -o stringtie_merged.gtf mergelist.txt

Gene MSTRG Stringtie • 2.7k views

Updated 3.5 years ago by swbarnes2 • written 3.5 years ago by HaroonPakistan


Just to check: you don't need to bother with StringTie if you have a well-annotated genome. Are you working with a non-model organism?

Reply 3.5 years ago by swbarnes2

score 0 · Answer 1 · 2021-06-08

What do you mean by "real one"? A de novo assembled GTF keeps an attribute called ref_gene_id, which refers to the corresponding gene in the guide annotation. If you want the ref_gene_id instead of the MSTRG ID, I'd suggest doing this in Python. Here's my approach:

Extract transcripts from the gtf

awk -F'\t' '{ if ($3=="transcript") print $0}' original.gtf > output.gtf

Build a conversion file by running the script below and redirecting its output:

python3 script.py > conversion_list.tsv

script.py:

#!/usr/bin/python3
# Build a "MSTRG gene_id -> ref_gene_id" table from the transcript lines.
import re

fields = {}
with open('output.gtf', 'r') as gtf:
    for line in gtf:
        att_column = line.rstrip('\n').split('\t')[8]
        # Look attributes up by name rather than by position: the position
        # of ref_gene_id in the attribute column is not guaranteed.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', att_column))
        if 'ref_gene_id' in attrs:
            # If one MSTRG gene maps to several reference genes, the last one wins.
            fields[attrs['gene_id']] = attrs['ref_gene_id']

# dict full, time to output
for k, v in fields.items():
    print('{}\t{}'.format(k, v))
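
To spot-check a single line before building the table, the attribute column can be parsed into key/value pairs. This is only a minimal sketch: the transcript line below is made up (its IDs and coordinates are hypothetical), and `parse_attributes` is an illustrative helper, not part of StringTie.

```python
import re


def parse_attributes(attr_column):
    """Return the GTF attribute column as a {key: value} dict."""
    # Each attribute has the form: key "value";
    return dict(re.findall(r'(\S+) "([^"]*)"', attr_column))


# Hypothetical transcript line, for illustration only
example = '\t'.join([
    'chr1', 'StringTie', 'transcript', '100', '900', '1000', '+', '.',
    'gene_id "MSTRG.1"; transcript_id "tx1"; gene_name "geneA"; ref_gene_id "gene0001";',
])

attrs = parse_attributes(example.split('\t')[8])
# .get() returns None for transcripts that have no reference match
print(attrs['gene_id'], attrs.get('ref_gene_id'))
```

If `ref_gene_id` is missing from the dict, that transcript has no counterpart in the guide annotation and will not appear in the conversion table.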

Now use this file to rebuild the GTF:

#!/usr/bin/python3
'''
This script converts de novo gene_id values into the Ensembl ones.
Code syntax is not the best, so apologies for that.
Briefly, it loads a tab-separated file with two columns:
denovo_gene_id  ensembl_gene_id
to fill a dictionary, where *denovo_gene_id* is the key and *ensembl_gene_id* is the value.

Then it reads the GTF file: columns 1 to 8 go into an untouched variable -> fixed_fields
                column 9 is the one to be edited -> attr_column
The attribute column is split on semicolons to retrieve the de novo gene_id value, which is
then looked up in the dictionary for the corresponding value.
It prints to stdout, so it needs to be run like:
$   python3 eid_conversion.py > desired_output.gtf
''' 


conv_dict={}
def fill_dict(dict_to_fill):
    with open('conversion_list.tsv','r') as conv_table:
        for line in conv_table:
            line=line.rstrip()
            fields=line.split('\t')
            dict_to_fill[fields[0]]=fields[1]

#dict full, time to convert

fill_dict(conv_dict)

with open('original.gtf', 'r') as gtf_denovo:
    for line in gtf_denovo:
        line=line.rstrip()
        if line.startswith('#'):
            print(line)
        else:
            try:
                columns = line.split('\t')
                fixed_fields = '\t'.join(columns[:8])
                attr_column = columns[8].split(';')
                # The first attribute is expected to be: gene_id "MSTRG.n"
                gene_id = attr_column[0].split(' ')[1].replace('"', '')
                rest_of_line = ';'.join(attr_column[1:]).lstrip()
            except IndexError:
                # SystemExit prints to stderr, so the message stays out of the output GTF
                raise SystemExit('GTF file may be corrupted. Check this line: %s' % line)
            if gene_id in conv_dict:
                replaced_line = 'gene_id "%s"; %s' % (conv_dict[gene_id], rest_of_line)
                print('{}\t{}'.format(fixed_fields, replaced_line))
            else:
                # No reference gene known for this ID: keep the line unchanged
                print(line)
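
Genes without a ref_gene_id are not in the conversion table, so they keep their MSTRG IDs after the rewrite. One way to see how many remain is to count the distinct gene IDs by prefix. This is a minimal sketch under stated assumptions: `count_gene_ids` is a hypothetical helper name, and the sample lines are made up for illustration.

```python
import re


def count_gene_ids(lines):
    """Return (distinct MSTRG gene IDs, distinct other gene IDs)."""
    mstrg, other = set(), set()
    for line in lines:
        if line.startswith('#'):
            continue
        # \b keeps the pattern from matching inside ref_gene_id
        match = re.search(r'\bgene_id "([^"]*)"', line)
        if match is None:
            continue
        gene_id = match.group(1)
        if gene_id.startswith('MSTRG.'):
            mstrg.add(gene_id)
        else:
            other.add(gene_id)
    return len(mstrg), len(other)


# Hypothetical converted lines, for illustration only
sample = [
    '# comment line',
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "gene0001"; transcript_id "tx1"; ref_gene_id "gene0001";',
    'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\t'
    'gene_id "gene0001"; transcript_id "tx1"; exon_number "1";',
    'chr1\tStringTie\ttranscript\t2000\t2500\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]

n_mstrg, n_other = count_gene_ids(sample)
print('still MSTRG: {}, converted: {}'.format(n_mstrg, n_other))
```

To run it on the real output, pass an open file handle instead of the sample list, for example the desired_output.gtf file written by the script above.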