Please help : How to deal with 'Missing gene_name' of GTF file?
3
0
Entering edit mode
4.2 years ago
tofukaj ▴ 20

Dear all great helpers,

I'm very new to Drop-Seq analysis, and struggling to learn its pre-processing process. I'm trying to create a reFlat file from canine gtf file acquired from ensembl (ftp://ftp.ensembl.org/pub/release-101/gtf/canis_lupus_familiaris/Canis_lupus_familiaris.CanFam3.1.101.gtf.gz) using "ConvertToRefFlat" function provided by Drop-seq tools (https://github.com/broadinstitute/Drop-seq/). I would also like to note that any function using the gtf file also fail to produce the results.

The problem occur due to the 'Missing gene_name' of the canine gtf file. I've tried my best to find a tool to fix this issue without any success. Due to my poor background in coding, it's impossible for me to fix such issue without the specific tool.

It would be very kind if anyone could show me a tool to deal with this issue.

Best regards, Kaj

RNA-Seq GTF Drop-Seq • 4.1k views
ADD COMMENT
1
Entering edit mode

I opened the file and can see that gene_name is among the attribute of all the feature of gene records.
Could you clarify with an example why you say that gene_name is missing?

ADD REPLY
1
Entering edit mode

tofukaj : Using the annotation file from UCSC for Dog may solve this issue. refFlat nomenclature is used by UCSC and the code may be expecting a UCSC file. You can try exporting that file from UCSC genome browser in GTF format.

ADD REPLY
0
Entering edit mode

Thank you very much for your kind advice. I have tried download the canine GTF file from UCSC. It cannot be applied with the ConvertToRefFlat function of Drop-seq tools. I would try to ask the maintainer about this.

ADD REPLY
1
Entering edit mode
4.2 years ago
tofukaj ▴ 20

According from all your kind suggestions, I've managed to create new gene_name and transcript_name using gene_id and transcript_id. This could be achieved by 'pygtftk' (https://github.com/dputhier/pygtftk). Assoc. Denis Puthier-whom I gave the overall credit to him kindly provided me guidance as follows:

`

###Get the GTF file from canis lupus familiaris
    gtftk retrieve -s canis_lupus_familiaris

###Uncompress
    gunzip Canis_lupus_familiaris.CanFam3.1.101.chr.gtf.gz

###Get the number of unique gene_id, gene_name, transcript_id, transcript_name (to check #gene_name < #gene_id; #transcript_name < transcript_id)
    gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_id,gene_name,transcript_id, transcript_name

###Create new gene_name from as gene_name|gene_id
    gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_name,gene_id -d A_NEW_KEY | gtftk del_attr -k gene_name | perl -npe 's/A_NEW_KEY/gene_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf

###Create new transcript_name from as transcript_name|transcript_id
    gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf -k transcript_name,transcript_id -d A_NEW_KEY | gtftk del_attr -k transcript_name | perl -npe 's/A_NEW_KEY/transcript_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf

###Check the number of gene_name and gen_id is the same
    gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf -k gene_id,gene_name,transcript_id, transcript_name

`

I find a number of people struggle to deal with this issue. Hope this could be useful for the others.

Thank you very much for your kind suggestion Kaj

ADD COMMENT
1
Entering edit mode

Thanks for sharing. Please consider to use the code button 10101 to highlight your code with markdown.

ADD REPLY
1
Entering edit mode

Hi Kaj, Just want to emphasize that I just commit a fix to pygtftk (master branch) that now allows to merge attributes using the same key as source and destination (i.e. use -k gene_name,gene_id -d gene_name in place of -d -d A_NEW_KEY). So the pipe "gtftk merge_attr ... | gtftk del_attr ... | perl -npe ..." can now be substituted by "gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_name,gene_id -d gene_name". This will be included in the next release (v1.1.5) which will be available in the upcoming weeks.

ADD REPLY
7
Entering edit mode
4.2 years ago
Juke34 8.9k

You can also use agat_sp_manage_attributes.pl from AGAT

agat_sp_manage_attributes.pl --gff input.gff --att gene_id/gene_name --cp

It will copy the gene_id value into the gene_name attribute (it creates the gene_name attribute if missing too). If gene_name already exists it will not touch/replace it.

ADD COMMENT
0
Entering edit mode

Hi Juke34 , this tool is awesome. However,I'm trying to implement it for a specific use case, and it's not working. I have the gene_name attribute in my GFF file, but it's only present for each gene entry (i.e., it's absent from the transcript, cds, and exon rows). I want to add the gene_name attribute to every single row of my GFF file, so that for each feature type, the gene_name will be listed and it will match the existing gene_name attribute for the gene feature. Can you please help me with this? This is what I have:

scaffold1       maker   gene    288062  292903  .       -       .       ID=DUN_000004;gene_id=DUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=DUN_000004-T1;Parent=DUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=DUN_000004;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=DUN_000004-T1.exon2;Parent=DUN_000004-T1;gene_id=DUN_000004;transcript_id=DUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=DUN_000004-T1.exon1;Parent=DUN_000004-T1;gene_id=DUN_000004;transcript_id=DUN_000004-T1

And this is what I need (for the entire file though):

scaffold1       maker   gene    288062  292903  .       -       .       ID=DUN_000004;gene_id=DUN_000004;gene_name=LINGO2
scaffold1       maker   transcript      288062  292903  .       -       .       ID=DUN_000004-T1;Parent=DUN_000004;Dbxref=PFAM:PF13306,PFAM:PF07679,PFAM:PF13927,PFAM:PF00047,PFAM:PF13855;gene_id=DUN_000004;gene_name=LINGO2;note=COG:T,EggNog:ENOG503B9G0;original_biotype=mrna;product=Leucine-rich repeat and immunoglobulin-like domain-containing nogo receptor-interacting protein 2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    288062  289918  .       -       .       ID=DUN_000004-T1.exon2;Parent=DUN_000004-T1;gene_id=DUN_000004;gene_name=LINGO2;transcript_id=DUN_000004-T1
scaffold1       maker   exon    292892  292903  .       -       .       ID=DUN_000004-T1.exon1;Parent=DUN_000004-T1;gene_id=DUN_000004;gene_name=LINGO2;transcript_id=DUN_000004-T1
ADD REPLY
1
Entering edit mode
4.2 years ago

Looking at the GTF file, it looks like there are a lot of genes without a 'gene_name'. Here are some examples:

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_name\b" | wc -l
12483

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_name\b" | head -n 5
X   ensembl gene    21422   25435   .   +   .   gene_id "ENSCAFG00000039510"; gene_version "2"; gene_source "ensembl"; gene_biotype "lncRNA";
X   ensembl gene    28152   65551   .   -   .   gene_id "ENSCAFG00000029674"; gene_version "2"; gene_source "ensembl"; gene_biotype "protein_coding";
X   ensembl gene    66698   77743   .   +   .   gene_id "ENSCAFG00000041875"; gene_version "1"; gene_source "ensembl"; gene_biotype "lncRNA";
X   ensembl gene    119141  122206  .   +   .   gene_id "ENSCAFG00000044010"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
X   ensembl gene    201984  202346  .   -   .   gene_id "ENSCAFG00000041333"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";

However, all genes appear to have a 'gene_id'.

grep "\bgene\b" Canis_lupus_familiaris.CanFam3.1.101.gtf | grep -v "\bgene_id\b" | wc -l
0

If possible, switch from 'gene_name' to 'gene_id' in the software.

ADD COMMENT
0
Entering edit mode

That's a brilliant idea! However, my very poor coding skill will never get me to that solution. Is it possible to have a code to fill the missing gene_name in these line with their gene_id?

ADD REPLY
1
Entering edit mode

Check the command line options for ConvertToRefFlat to see if there is an option to switch from using 'gene_name' to 'gene_id'.

ADD REPLY
0
Entering edit mode

Thank you very much!!

ADD REPLY

Login before adding your answer.

Traffic: 1519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6