Question

Please help : How to deal with 'Missing gene_name' of GTF file?

0

Entering edit mode

4.2 years ago

tofukaj ▴ 20

Dear all great helpers,

I'm very new to Drop-Seq analysis, and struggling to learn its pre-processing process. I'm trying to create a reFlat file from canine gtf file acquired from ensembl (ftp://ftp.ensembl.org/pub/release-101/gtf/canis_lupus_familiaris/Canis_lupus_familiaris.CanFam3.1.101.gtf.gz) using "ConvertToRefFlat" function provided by Drop-seq tools (https://github.com/broadinstitute/Drop-seq/). I would also like to note that any function using the gtf file also fail to produce the results.

The problem occur due to the 'Missing gene_name' of the canine gtf file. I've tried my best to find a tool to fix this issue without any success. Due to my poor background in coding, it's impossible for me to fix such issue without the specific tool.

It would be very kind if anyone could show me a tool to deal with this issue.

Best regards, Kaj

RNA-Seq GTF Drop-Seq • 4.1k views

ADD COMMENT • link updated 18 months ago by EkHe ▴ 10 • written 4.2 years ago by tofukaj ▴ 20

1

Entering edit mode

I opened the file and can see that gene_name is among the attribute of all the feature of gene records.
Could you clarify with an example why you say that gene_name is missing?

ADD REPLY • link 4.2 years ago by Juke34 8.9k

1

Entering edit mode

tofukaj : Using the annotation file from UCSC for Dog may solve this issue. refFlat nomenclature is used by UCSC and the code may be expecting a UCSC file. You can try exporting that file from UCSC genome browser in GTF format.

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

Entering edit mode

Thank you very much for your kind advice. I have tried download the canine GTF file from UCSC. It cannot be applied with the ConvertToRefFlat function of Drop-seq tools. I would try to ask the maintainer about this.

ADD REPLY • link 4.2 years ago by tofukaj ▴ 20

7

Entering edit mode

4.2 years ago

Juke34 8.9k

You can also use agat_sp_manage_attributes.pl from AGAT

agat_sp_manage_attributes.pl --gff input.gff --att gene_id/gene_name --cp

It will copy the gene_id value into the gene_name attribute (it creates the gene_name attribute if missing too). If gene_name already exists it will not touch/replace it.

ADD COMMENT • link 4.2 years ago by Juke34 8.9k

0

Entering edit mode

Hi Juke34 , this tool is awesome. However,I'm trying to implement it for a specific use case, and it's not working. I have the gene_name attribute in my GFF file, but it's only present for each gene entry (i.e., it's absent from the transcript, cds, and exon rows). I want to add the gene_name attribute to every single row of my GFF file, so that for each feature type, the gene_name will be listed and it will match the existing gene_name attribute for the gene feature. Can you please help me with this? This is what I have:

scaffold1       maker   gene    288062  292903  .       -       .       ID=DUN_000004;gene_id=DUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=DUN_000004-T1;Parent=DUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=DUN_000004;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=DUN_000004-T1.exon2;Parent=DUN_000004-T1;gene_id=DUN_000004;transcript_id=DUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=DUN_000004-T1.exon1;Parent=DUN_000004-T1;gene_id=DUN_000004;transcript_id=DUN_000004-T1

And this is what I need (for the entire file though):

scaffold1       maker   gene    288062  292903  .       -       .       ID=DUN_000004;gene_id=DUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=DUN_000004-T1;Parent=DUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=DUN_000004;gene_name=LINGO2;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=DUN_000004-T1.exon2;Parent=DUN_000004-T1;gene_id=DUN_000004;gene_name=LINGO2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=DUN_000004-T1.exon1;Parent=DUN_000004-T1;gene_id=DUN_000004;gene_name=LINGO2;transcript_id=DUN_000004-T1

ADD REPLY • link 18 months ago by EkHe ▴ 10

1

Entering edit mode

4.2 years ago

rpolicastro 13k

Looking at the GTF file, it looks like there are a lot of genes without a 'gene_name'. Here are some examples:

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_name\b" | wc -l
12483

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_name\b" | head -n 5
X   ensembl gene    21422   25435   .   +   .   gene_id "ENSCAFG00000039510"; gene_version "2"; gene_source "ensembl"; gene_biotype "lncRNA";
X   ensembl gene    28152   65551   .   -   .   gene_id "ENSCAFG00000029674"; gene_version "2"; gene_source "ensembl"; gene_biotype "protein_coding";
X   ensembl gene    66698   77743   .   +   .   gene_id "ENSCAFG00000041875"; gene_version "1"; gene_source "ensembl"; gene_biotype "lncRNA";
X   ensembl gene    119141  122206  .   +   .   gene_id "ENSCAFG00000044010"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
X   ensembl gene    201984  202346  .   -   .   gene_id "ENSCAFG00000041333"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";

However, all genes appear to have a 'gene_id'.

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_id\b" | wc -l
0

If possible, switch from 'gene_name' to 'gene_id' in the software.

ADD COMMENT • link 4.2 years ago by rpolicastro 13k

0

Entering edit mode

That's a brilliant idea! However, my very poor coding skill will never get me to that solution. Is it possible to have a code to fill the missing gene_name in these line with their gene_id?

ADD REPLY • link 4.2 years ago by tofukaj ▴ 20

1

Entering edit mode

Check the command line options for ConvertToRefFlat to see if there is an option to switch from using 'gene_name' to 'gene_id'.

ADD REPLY • link 4.2 years ago by rpolicastro 13k

0

Entering edit mode

Thank you very much!!

ADD REPLY • link 4.2 years ago by tofukaj ▴ 20

score 1 · Accepted Answer · 2020-08-28

According from all your kind suggestions, I've managed to create new gene_name and transcript_name using gene_id and transcript_id. This could be achieved by 'pygtftk' (https://github.com/dputhier/pygtftk). Assoc. Denis Puthier-whom I gave the overall credit to him kindly provided me guidance as follows:

`

###Get the GTF file from canis lupus familiaris
    gtftk retrieve -s canis_lupus_familiaris

###Uncompress
    gunzip Canis_lupus_familiaris.CanFam3.1.101.chr.gtf.gz

###Get the number of unique gene_id, gene_name, transcript_id, transcript_name (to check #gene_name < #gene_id; #transcript_name < transcript_id)
    gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_id,gene_name,transcript_id, transcript_name

###Create new gene_name from as gene_name|gene_id
    gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_name,gene_id -d A_NEW_KEY | gtftk del_attr -k gene_name | perl -npe 's/A_NEW_KEY/gene_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf

###Create new transcript_name from as transcript_name|transcript_id
    gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf -k transcript_name,transcript_id -d A_NEW_KEY | gtftk del_attr -k transcript_name | perl -npe 's/A_NEW_KEY/transcript_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf

###Check the number of gene_name and gen_id is the same
    gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf -k gene_id,gene_name,transcript_id, transcript_name

`

I find a number of people struggle to deal with this issue. Hope this could be useful for the others.

Thank you very much for your kind suggestion Kaj