The problem occurs because of a 'Missing gene_name' error with the canine GTF file. I have tried my best to find a tool that fixes this, without success. With my limited coding background, I cannot fix it without a dedicated tool.
It would be very kind if anyone could point me to a tool for dealing with this issue.
written 4.2 years ago by tofukaj • updated 18 months ago by EkHe
I opened the file and can see that gene_name is among the attributes of all the gene feature records.
Could you clarify with an example why you say that gene_name is missing?
tofukaj: Using the dog annotation file from UCSC may solve this issue. UCSC uses the refFlat nomenclature, and the code may be expecting a UCSC file. You can try exporting that file from the UCSC Genome Browser in GTF format.
Thank you very much for your kind advice. I tried downloading the canine GTF file from UCSC, but it does not work with the ConvertToRefFlat function of Drop-seq tools. I will ask the maintainer about this.
Following all your kind suggestions, I managed to create new gene_name and transcript_name values using gene_id and transcript_id. This can be done with 'pygtftk' (https://github.com/dputhier/pygtftk). Assoc. Denis Puthier, to whom I give full credit, kindly provided the following guidance:
```
### Get the GTF file for Canis lupus familiaris
gtftk retrieve -s canis_lupus_familiaris
### Uncompress
gunzip Canis_lupus_familiaris.CanFam3.1.101.chr.gtf.gz
### Count the unique values of gene_id, gene_name, transcript_id and transcript_name (expect #gene_name < #gene_id and #transcript_name < #transcript_id)
gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_id,gene_name,transcript_id,transcript_name
### Create a new gene_name as gene_name|gene_id
gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_name,gene_id -d A_NEW_KEY | gtftk del_attr -k gene_name | perl -npe 's/A_NEW_KEY/gene_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf
### Create a new transcript_name as transcript_name|transcript_id
gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn.gtf -k transcript_name,transcript_id -d A_NEW_KEY | gtftk del_attr -k transcript_name | perl -npe 's/A_NEW_KEY/transcript_name/' > Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf
### Check that the numbers of unique gene_name and gene_id values are now the same
gtftk count_key_values -u -i Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf -k gene_id,gene_name,transcript_id,transcript_name
```
I have seen a number of people struggle with this issue. I hope this is useful to others.
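The before/after count check in the commands above can also be cross-checked without pygtftk. The following is only a minimal sketch, assuming a standard tab-separated GTF whose ninth column holds `key "value";` pairs; `count_unique` is a hypothetical helper name, not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: count the distinct values of one attribute key in a GTF.
# Assumes GTF-style attributes (key "value";) in column 9.
count_unique() {
  # $1 = attribute key (e.g. gene_id), $2 = path to the GTF file
  awk -v key="$1" 'BEGIN { FS = "\t" }
  /^#/ { next }                                  # skip header/comment lines
  {
    re = "(^|[; ])" key " \"[^\"]*\""            # matches: key "value"
    if (match($9, re)) {
      v = substr($9, RSTART, RLENGTH)
      sub(/^[^"]*"/, "", v); sub(/"$/, "", v)    # keep only the value
      seen[v] = 1
    }
  }
  END { n = 0; for (k in seen) n++; print n }' "$2"
}

# Usage (file name taken from the commands above):
# count_unique gene_id   Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf
# count_unique gene_name Canis_lupus_familiaris.CanFam3.1.101.chr_with_gn_tn.gtf
```

If the two numbers match after the merge step, every gene_id has its own gene_name.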
Hi Kaj,
I just want to point out that I have committed a fix to pygtftk (master branch) that now allows merging attributes using the same key as source and destination (i.e. use -k gene_name,gene_id -d gene_name in place of -d A_NEW_KEY). So the pipe "gtftk merge_attr ... | gtftk del_attr ... | perl -npe ..." can now be replaced by "gtftk merge_attr -i Canis_lupus_familiaris.CanFam3.1.101.chr.gtf -k gene_name,gene_id -d gene_name". This will be included in the next release (v1.1.5), which will be available in the coming weeks.
It will copy the gene_id value into the gene_name attribute (it also creates the gene_name attribute if it is missing). If gene_name already exists, it will not touch or replace it.
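For readers who want to see what that behaviour amounts to, here is a rough awk-based sketch of the same idea. It is not the tool's own implementation. It assumes a tab-separated GTF with `key "value";` attributes in column 9, and `fill_gene_name` is a hypothetical function name.

```shell
# Hypothetical helper: add gene_name (copied from gene_id) to rows that lack it.
# Rows that already carry gene_name are printed unchanged.
fill_gene_name() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }                           # keep header/comment lines
  {
    if ($9 !~ /(^|[; ])gene_name "/ && match($9, /gene_id "[^"]*"/)) {
      id = substr($9, RSTART, RLENGTH)           # e.g. gene_id "ENSCAFG..."
      sub(/^gene_id/, "gene_name", id)           # reuse the value under the new key
      sub(/;?[ ]*$/, "", $9)                     # drop the trailing semicolon
      $9 = $9 "; " id ";"
    }
    print
  }'
}

# Usage (reads stdin, writes stdout; file names are placeholders):
# fill_gene_name < input.gtf > input_with_gene_name.gtf
```

Rows without a gene_id are passed through untouched, so the sketch never invents a value.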
Hi Juke34, this tool is awesome. However, I'm trying to use it for a specific use case, and it's not working.
I have the gene_name attribute in my GFF file, but it's only present for each gene entry (i.e., it's absent from the transcript, cds, and exon rows). I want to add the gene_name attribute to every single row of my GFF file, so that for each feature type, the gene_name will be listed and it will match the existing gene_name attribute for the gene feature. Can you please help me with this? This is what I have:
That's a brilliant idea! However, my coding skills are too limited to get me to that solution. Is there code that would fill in the missing gene_name on these lines with their gene_id?
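For the other case raised in this thread, where gene_name exists on the gene rows but not on their transcript, exon and CDS rows, a two-pass approach may work. This is only a sketch under stated assumptions: every row carries a GTF-style gene_id attribute (a GFF3 file that links rows through ID/Parent would need a different lookup), and `propagate_gene_name` is a hypothetical function name.

```shell
# Hypothetical helper: copy gene_name from rows that have it to rows
# sharing the same gene_id. Assumes GTF-style attributes in column 9.
propagate_gene_name() {
  # $1 = path to the annotation file (read twice)
  awk 'BEGIN { FS = OFS = "\t" }
  function attr(s, key,    re, v) {              # value of key, or "" if absent
    re = "(^|[; ])" key " \"[^\"]*\""
    if (!match(s, re)) return ""
    v = substr(s, RSTART, RLENGTH)
    sub(/^[^"]*"/, "", v); sub(/"$/, "", v)
    return v
  }
  NR == FNR {                                    # pass 1: build gene_id -> gene_name
    id = attr($9, "gene_id"); nm = attr($9, "gene_name")
    if (id != "" && nm != "") name[id] = nm
    next
  }
  /^#/ { print; next }                           # pass 2: keep comment lines
  {
    id = attr($9, "gene_id")
    if (attr($9, "gene_name") == "" && (id in name)) {
      sub(/;?[ ]*$/, "", $9)                     # drop the trailing semicolon
      $9 = $9 "; gene_name \"" name[id] "\";"
    }
    print
  }' "$1" "$1"
}

# Usage (file names are placeholders):
# propagate_gene_name input.gtf > input_gene_name_everywhere.gtf
```

Reading the file twice keeps the sketch independent of row order, at the cost of a second pass over the file.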