Convert MSTRG gene IDs to real one?
1
0
Entering edit mode
3.5 years ago

Hi, Thanks to all seniors and juniors!

I have run a stringtie tool to merge the gtf files of my all samples. But, it automatically names gene IDs to MSTRG (default). But, I want to keep gene names original. In below, I have shared a command which I used. Would anyone like to provide the command which can retain the original names of genes?

Command:

$ stringtie --merge -p 8 -G genome.gff -o stringtie_merged.gtf mergelist.txt
Gene MSTRG Stringtie • 2.7k views
ADD COMMENT
0
Entering edit mode

Just to check...you don't have to bother with stringtie if you have a well annotated genome. You are working with a non-model organism?

ADD REPLY
0
Entering edit mode
3.5 years ago
Shred ★ 1.5k

What do you mean by "real one"? De novo assembled GTF keeps a value called ref_gene_id which refers to the correspondant gene into the guide annotation. I you want to have the ref_gene_id instead of the MSTRG one, I'd suggest to do this in Python. Here's my approach:

Extract transcripts from the gtf

awk -F'\t' '{ if ($3=="transcript") print $0}' original.gtf > output.gtf

Build a conversion file: launch this redirecting output, like :

python3 script.py > conversion_list.tsv

Script.py

#!/usr/bin/python3
fields={}
with open('output.gtf', 'r') as gtf:
        for line in gtf:
                att_column=line.split('\t')[8]
                if "ref_gene_id" in att_column:
                        kvp=att_column.split(';')[:-1]
                        gene_id_tarocc=kvp[0]
                        gene_id_tarocc= gene_id_tarocc.replace(' "', '="')
                        gene_id_tarocc=gene_id_tarocc.split('=')[1].replace('"','')
                        gene_id_ensembl=kvp[6]
                        gene_id_ensembl= gene_id_ensembl.replace(' "', '="')
                        gene_id_ensembl=gene_id_ensembl.split('=')[1].replace('"','')
                        fields[gene_id_tarocc]=gene_id_ensembl

#dict full, time to output
for k in fields.keys():
        print('{}\t{}'.format(k,fields[k]))

Now use this file to rebuild the gtf

#!/usr/bin/python3
'''
This script will convert denovo gene_id description into ENSEMBL one.
Code syntax is not the best, so apologies for that.
Briefly, it loads a file composed by:
denovo_gene_id  ensembl_gene_id
To fill a dictionary, where *denovo_gene_id* is the key and *ensembl_gene_id* is the value

Then it loads the GTF file: 1st to 8th column into an untouched variable -> fixed_fields
                9th column to be edited ->          attr_column
Splitting by semicolumn to retrieve denovo_geneid_value, then searching into dictionary for 
correspondant value. 
It prints to stdout, so it needs to be run like:
$   python3 eid_conversion.py > desired_output.gtf
''' 


conv_dict={}
def fill_dict(dict_to_fill):
    with open('conversion_list.tsv','r') as conv_table:
        for line in conv_table:
            line=line.rstrip()
            fields=line.split('\t')
            dict_to_fill[fields[0]]=fields[1]

#dict full, time to convert

fill_dict(conv_dict)

with open('original.gtf', 'r') as gtf_denovo:
    for line in gtf_denovo:
        line=line.rstrip()
        if line.startswith('#'):
            print(line)
        else:
            try:    
                fixed_fields=('\t'.join(line.split('\t')[:8]))
                attr_column=line.split('\t')[8]
                gene_id=""
                attr_column=attr_column.split(';')
                gene_id=attr_column[0].split(' ')[1]
                gene_id=gene_id.replace('"','')
                rest_of_line=(';'.join(attr_column[1:]))
                replaced_line=""
                try:
                    gene_id=conv_dict[gene_id]
                    replaced_line='gene_id "%s"; %s'%(gene_id,rest_of_line)
                    print('{}\t{}'.format(fixed_fields,replaced_line)) 
                except KeyError:
                    print(line)     
            except IndexError:
                print('GTF file may be corrupted. Check this line %s' %line)
                exit()
ADD COMMENT
0
Entering edit mode

I have stringtie_merged.gtf file which has transcript details of all samples. Then, I used the ballgown package to extract the gene information and their fpkm values. At this moment, it's necessary for me to only use the Ballgown package. You know that stringtie_merged.gtf file has gene names by MSTRG, and that goes in subsequent analysis. But, I am not understanding your answer. Maybe, I don't have enough knowledge. Can you explain more.

ADD REPLY
0
Entering edit mode

De novo assembled transcripts couldn't be always directly overlapped with known transcripts: for this reason, inside the gtf merged file you'll see the field ref_gene_id. Provided script is an implementation used to convert MSTRG gene id into reference one, using the ref_gene_id field. This has to be done before re-doing quantification with the merged gtf. For more details, check developer's suggest here -> https://github.com/gpertea/stringtie/issues/179#issuecomment-395860049

ADD REPLY

Login before adding your answer.

Traffic: 2399 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6