gff to gtf missing gene id
Hi,
I was trying my hand at annotating a genome using Prokka, and I converted the output GFF file to GTF (gffread file.gff -T -o file.gtf). This is what my GTF file looks like:
CP001095.1 prokka transcript 210 1712 . + . transcript_id "LCLPEOGO_00001_gene"; gene_id "LCLPEOGO_00001_gene"; gene_name "dnaA"
CP001095.1 prokka CDS 210 1712 . + 0 transcript_id "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1 prokka transcript 2447 3571 . + . transcript_id "LCLPEOGO_00002_gene"; gene_id "LCLPEOGO_00002_gene"; gene_name "dnaN_1"
CP001095.1 prokka CDS 2447 3571 . + 0 transcript_id "LCLPEOGO_00002_gene"; gene_name "dnaN_1";
Every second line is missing the gene_id, and the GTF format descriptions online look different from mine. Is there something wrong with my output, or can I continue to work with it? I would really like to use it in featureCounts. (Sorry in advance if this is a really noob question. Any help is appreciated xx)
transcriptomics
annotation
You can fix that using AGAT:
agat_convert_sp_gff2gtf.pl --gff test.gtf
##gtf-version 3
CP001095.1 prokka gene 210 1712 . + . gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "nbisL1-gene-1"; gene_name "dnaA";
CP001095.1 prokka transcript 210 1712 . + . gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "LCLPEOGO_00001_gene"; Parent "nbisL1-gene-1"; gene_name "dnaA";
CP001095.1 prokka exon 210 1712 . + . gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "nbis-exon-1"; Parent "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1 prokka CDS 210 1712 . + 0 gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "cds-1"; Parent "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1 prokka gene 2447 3571 . + . gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "nbisL1-gene-2"; gene_name "dnaN_1";
CP001095.1 prokka transcript 2447 3571 . + . gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "LCLPEOGO_00002_gene"; Parent "nbisL1-gene-2"; gene_name "dnaN_1";
CP001095.1 prokka exon 2447 3571 . + . gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "nbis-exon-2"; Parent "LCLPEOGO_00002_gene"; gene_name "dnaN_1";
CP001095.1 prokka CDS 2447 3571 . + 0 gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "cds-2"; Parent "LCLPEOGO_00002_gene"; gene_name "dnaN_1";
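To check whether a converted file still has records without a gene_id (before handing it to featureCounts), something like the following could be used. This is only a rough sketch: it assumes a tab-separated, 9-column GTF, and the function name lines_missing_gene_id is made up for illustration, not part of gffread, AGAT or Prokka.

```python
import re

# Matches a GTF attribute such as: gene_id "LCLPEOGO_00001_gene"
GENE_ID = re.compile(r'\bgene_id "([^"]+)"')


def lines_missing_gene_id(gtf_lines):
    """Return (line_number, feature_type) for every record that has no gene_id.

    Hypothetical helper; assumes tab-separated, 9-column GTF records.
    """
    missing = []
    for number, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        if not GENE_ID.search(fields[8]):
            missing.append((number, fields[2]))
    return missing
```

Calling it on an open file handle, e.g. lines_missing_gene_id(open("file.gtf")), should return an empty list for the AGAT output above, and one entry per CDS line for the gffread output in the question.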
This seems to be a known issue with GFF files generated by Prokka. See https://github.com/tseemann/prokka/issues/338 and https://github.com/gpertea/gffread/issues/45
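You could specify -t CDS in the featureCounts command (matching the feature type in the third column) if you want raw counts for each CDS. Note that the CDS lines in the gffread output carry no gene_id, and featureCounts groups features by gene_id by default, so you would also need -g transcript_id.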
Thanks for the suggestion! If nothing else works, I'll just specify CDS and see what happens :)
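If installing AGAT is not an option, the gffread output from the question could probably also be patched by hand, since each CDS line shares its transcript_id with a transcript line that does carry a gene_id. Below is a minimal sketch of that idea, not a replacement for AGAT: it assumes a tab-separated, 9-column GTF, and fill_gene_id is a hypothetical name chosen for this example.

```python
import re

TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]+)"')
GENE_ID = re.compile(r'\bgene_id "([^"]+)"')


def fill_gene_id(gtf_lines):
    """Copy gene_id from transcript records onto records that lack it.

    Hypothetical helper; assumes tab-separated, 9-column GTF records.
    Records whose transcript_id has no known gene_id are left untouched.
    """
    lines = [line.rstrip("\n") for line in gtf_lines]

    # First pass: learn transcript_id -> gene_id from records that have both.
    gene_of = {}
    for line in lines:
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        transcript = TRANSCRIPT_ID.search(fields[8])
        gene = GENE_ID.search(fields[8])
        if transcript and gene:
            gene_of[transcript.group(1)] = gene.group(1)

    # Second pass: prepend the gene_id to records that are missing it.
    fixed = []
    for line in lines:
        fields = line.split("\t")
        if not line.startswith("#") and len(fields) >= 9:
            transcript = TRANSCRIPT_ID.search(fields[8])
            if (transcript and not GENE_ID.search(fields[8])
                    and transcript.group(1) in gene_of):
                gene = gene_of[transcript.group(1)]
                fields[8] = 'gene_id "{}"; {}'.format(gene, fields[8])
                line = "\t".join(fields)
        fixed.append(line)
    return fixed
```

The returned lines can be written back out joined with newlines. Unlike AGAT, this does not add gene or exon records, so featureCounts would still need -t CDS (or -t transcript) with such a file.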