Adding 'gene_name' attribute to each row of GTF/GFF file (missing for CDS, transcript, and exon rows)
1
1
Entering edit mode
19 months ago
EkHe ▴ 10

Hello,

Can someone please help me with this issue I'm having? Thank you in advance!

I have a GFF file, and I have the gene_name attribute in my GFF file, but it's only present for each gene entry (i.e., it's absent from the transcript, cds, and exon rows). I want to add the gene_name attribute to every single row of my GFF file, so that for each feature type, the gene_name will be listed and it will match the existing gene_name attribute for the Parent gene feature. Can you please help me with this?

I tried using agat_sp_manage_attributes.pl from AGAT, but it only allows me to add new attributes as empty values.

This is what I have:

scaffold1       maker   gene    288062  292903  .       -       .       ID=XUN_000004;gene_id=XUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=XUN_000004-T1;Parent=XUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=XUN_000004;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=XUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=XUN_000004-T1.exon2;Parent=XUN_000004-T1;gene_id=XUN_000004;transcript_id=XUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=XUN_000004-T1.exon1;Parent=XUN_000004-T1;gene_id=XUN_000004;transcript_id=XUN_000004-T1
scaffold1       maker   CDS     288062  289918  .       -       0       ID=XUN_000004-T1.cds;Parent=XUN_000004-T1;gene_id=XUN_000004;transcript_id=XUN_000004-T1
scaffold1       maker   CDS     292892  292903  .       -       0       ID=XUN_000004-T1.cds;Parent=XUN_000004-T1;gene_id=XUN_000004;transcript_id=XUN_000004-T1
scaffold1       maker   gene    295971  297761  .       +       .       ID=XUN_000005;gene_id=XUN_000005;gene_name=XUN_000005
scaffold1       maker   transcript      295971  297761  .       +       .       ID=XUN_000005-T1;Parent=XUN_000005;gene_id=XUN_000005;original_biotype=mrna;product=hypothetical protein;transcript_id=XUN_000005-T1

This is what I need:

scaffold1       maker   gene    288062  292903  .       -       .       ID=XUN_000004;gene_id=XUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=XUN_000004-T1;Parent=XUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=XUN_000004;gene_name=LINGO2;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=XUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=XUN_000004-T1.exon2;Parent=XUN_000004-T1;gene_id=XUN_000004;gene_name=LINGO2;transcript_id=XUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=XUN_000004-T1.exon1;Parent=XUN_000004-T1;gene_id=XUN_000004;gene_name=LINGO2;transcript_id=XUN_000004-T1
scaffold1       maker   CDS     288062  289918  .       -       0       ID=XUN_000004-T1.cds;Parent=XUN_000004-T1;gene_id=XUN_000004;gene_name=LINGO2;transcript_id=XUN_000004-T1
scaffold1       maker   CDS     292892  292903  .       -       0       ID=XUN_000004-T1.cds;Parent=XUN_000004-T1;gene_id=XUN_000004;gene_name=LINGO2;transcript_id=XUN_000004-T1
scaffold1       maker   gene    295971  297761  .       +       .       ID=XUN_000005;gene_id=XUN_000005;gene_name=XUN_000005
scaffold1       maker   transcript      295971  297761  .       +       .       ID=XUN_000005-T1;Parent=XUN_000005;gene_id=XUN_000005;gene_name=XUN_000005;original_biotype=mrna;product=hypothetical protein;transcript_id=XUN_000005-T1
maker GTF transcripts GFF exon • 3.2k views
ADD COMMENT
0
Entering edit mode

does the solution have to be Perl or are you comfortable using other scripting languages?

ADD REPLY
0
Entering edit mode

Hi jv , I'm comfortable using any scripting language for this. Thank you

ADD REPLY
4
Entering edit mode
19 months ago
Buffo ★ 2.4k

An easy solution with python:

gene_names = {}

with open(sys.argv[1], 'r') as gff_input:
    for line in gff_input:
        if line.startswith('#'):
            continue
        line = line.rstrip('\n')
        col = line.split('\t')
        if col[2] == 'gene':
            gene_names[col[8].split(';')[0][3:]] = col[8].split('gene_name=')[1].strip()

with open(sys.argv[1], 'r') as gff_input_2:
    for line in gff_input_2:
        if line.startswith('#'):
            continue
        line = line.rstrip('\n')
        col = line.split('\t')
        if col[2] == 'gene':
            print(line)
        else:
            gene_name = gene_names[col[8][3:].split(';')[0].split('-')[0]]
            print('\t'.join(line.split('\t')) + ';gene_name=' + gene_name)

Save it as my_script.py, and run it as follows:

python my_script.py your_file.gff
ADD COMMENT
0
Entering edit mode

Works perfectly! Thank you so much @Buffo!

ADD REPLY
0
Entering edit mode

I have a request as well which im struggling

My input file which contains the new feature which I want to add which is Entrez_ID

gene_id Entrez_ID
ENSCAFG00845006432 399518
ENSCAFG00845002136 399530
ENSCAFG00845029798 399544
ENSCAFG00845011460 399545
ENSCAFG00845001610 399653
ENSCAFG00845013158 403157
ENSCAFG00845014982 403168
ENSCAFG00845021967 403170
ENSCAFG00845019241 403400 

My gtf file which I have

cat Canis_lupus_familiaris.ROS_Cfam_1.0.108.gtf | head
#!genome-build ROS_Cfam_1.0
#!genome-version ROS_Cfam_1.0
#!genome-date 2020-09
#!genome-build-accession GCA_014441545.1
#!genebuild-last-updated 2020-10
X       ensembl gene    24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
X       ensembl transcript      24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
X       ensembl exon    24552206        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSCAFE00845128634"; exon_version "1"; tag "Ensembl_canonical";
X       ensembl CDS     24552206        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00845021332"; protein_version "1"; tag "Ensembl_canonical";
X       ensembl start_codon     24552224        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

This file also contain a tag of gene_id which is the ensembl ID

Now my final goal is like this example output ,such as if there is match between my reference file and the gtf file ensembl id which is gene_id then there a Entrez_ID should be appended

#!genome-build ROS_Cfam_1.0
#!genome-version ROS_Cfam_1.0
#!genome-date 2020-09
#!genome-build-accession GCA_014441545.1
#!genebuild-last-updated 2020-10
X       ensembl gene    24550462        24552226        .       -       .       gene_id "ENSCAFG00845006432"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; Entrez_ID "399518";
X       ensembl transcript      24550462        24552226        .       -       .       gene_id "ENSCAFG00845006432"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; Entrez_ID "399518";
X       ensembl exon    24552206        24552226        .       -       .       gene_id "ENSCAFG00845006432"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSCAFE00845128634"; exon_version "1"; tag "Ensembl_canonical"; Entrez_ID "399518";
X       ensembl CDS     24552206        24552226        .       -       0       gene_id "ENSCAFG00845006432"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00845021332"; protein_version "1"; tag "Ensembl_canonical"; Entrez_ID "399518";
X       ensembl start_codon     24552224        24552226        .       -       0       gene_id "ENSCAFG00845006432"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; Entrez_ID "399518";

Any suggestion or help would be really appreciated

ADD REPLY
2
Entering edit mode

That would be easier:

  1. Load the first file into a dictionary (gene_id is the key, entrez_id is the value)
  2. Load file 2 (annotation) and match gene ids with the dictionary keys
  3. print

something like this:

gene_names = {}

with open(sys.argv[1], 'r') as file_input:
    for line in file_input:
        line = line.rstrip('\n')
        col = line.split('\t') #assuming that columns are separated by tabulations
        gene_names[col[0]] = col[1]

with open(sys.argv[2], 'r') as gff_input:
    for line in gff_input:
        if line.startswith('#'):
            continue
        line = line.rstrip('\n')
        col = line.split('\t')
        gene_entrez = str(gene_names[col[8].split(';')[0].split('"')[1]])
        print(line + '; Entrez_ID = "' + gene_entrez+ '"') #there was an error here

That should work (didn't test it): Save it as pyscript.py and run it as follows:

python pyscript.py name_of_file_1 gtf_file_name
ADD REPLY
2
Entering edit mode

EDITED: there was an error on the last line

ADD REPLY
1
Entering edit mode

what was the error?

I modified your script a little bit

import sys

def save_output(filename, data):
    with open(filename, 'w') as file_output:
        file_output.write(data)

gene_names = {}

with open(sys.argv[1], 'r') as file_input:
    for line in file_input:
        line = line.rstrip('\n')
        col = line.split('\t')  # assuming that columns are separated by tabulations
        gene_names[col[0]] = col[1]

output_data = ''
with open(sys.argv[2], 'r') as gff_input:
    for line in gff_input:
        if line.startswith('#'):
            continue
        line = line.rstrip('\n')
        col = line.split('\t')
        gene_name = col[8].split(';')[0].split('"')[1]
        if gene_name in gene_names:
            gene_entrez = str(gene_names[gene_name])
            output_data += line + '; Entrez_ID "' + gene_entrez + '";\n'

output_filename = sys.argv[3]
save_output(output_filename, output_data)
ADD REPLY
2
Entering edit mode

I wrote the gene_names instead of gene_entrez.

ADD REPLY
0
Entering edit mode

You script is working absolutely fine after the modification. Now I have got gtf file where it maps those gene_id which does have respective Entrez_ID. This reduces the no of genes in my gtf file. Because many of the genes doesn't contain Entrez_ID.

Now if I have to add Entrez_ID to gene_id which is present in my dictionary which is done by your script and also want to keep those gene_id who don't have Entrez_ID. How do I modify the script

ADD REPLY
2
Entering edit mode

ah.. it should raise an Error then, a KeyError? if so, you can create an exemption:

gene_names = {}

with open(sys.argv[1], 'r') as file_input:
    for line in file_input:
        line = line.rstrip('\n')
        col = line.split('\t') #assuming that columns are separated by tabulations
        gene_names[col[0]] = col[1]

with open(sys.argv[2], 'r') as gff_input:
    for line in gff_input:
        if line.startswith('#'):
            continue
        line = line.rstrip('\n')
        col = line.split('\t')
        try:
            gene_entrez = str(gene_names[col[8].split(';')[0].split('"')[1]])
            print(line + '; Entrez_ID = "' + gene_entrez+ '"') #there was an error here
        except KeyError:
            print(line)
ADD REPLY
0
Entering edit mode

"it should raise an Error then, a KeyError? if so, you can create an exemption:" well that i also okay but I want gene_id with Entrez_ID and also gene_id which doesn't have Entrez_ID. That is my final goal in the output gtf file.

ADD REPLY
1
Entering edit mode

I got it, so you mean that some genes might not be present in table 1 (file 1), but you want to include them in the output file, is that correct? The first script assumes all the gene_ids have entrez_id from table 1, if not, it should raise a KeyError because it won't be able to find them. The script number 2 should be able to bypass the error and include them in the output, if it still not including them, it's because I'm missing something. Please copy and paste a few lines of the genes missing in the output.

ADD REPLY
0
Entering edit mode

yes. You got me right. I want to put every thing in the output which includes gene_id with Entrez_ID and gene_id without Entrez_ID as well

ADD REPLY

Login before adding your answer.

Traffic: 2364 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6