gff to gtf missing gene id
19 months ago
plain_text • 0

Hi,

I was trying my hand at annotating a genome with Prokka and converted the output GFF file to GTF with gffread (gffread file.gff -T -o file.gtf). This is what my GTF file looks like:

CP001095.1  prokka  transcript  210 1712    .   +   .   transcript_id "LCLPEOGO_00001_gene"; gene_id "LCLPEOGO_00001_gene"; gene_name "dnaA"
CP001095.1  prokka  CDS 210 1712    .   +   0   transcript_id "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1  prokka  transcript  2447    3571    .   +   .   transcript_id "LCLPEOGO_00002_gene"; gene_id "LCLPEOGO_00002_gene"; gene_name "dnaN_1"
CP001095.1  prokka  CDS 2447    3571    .   +   0   transcript_id "LCLPEOGO_00002_gene"; gene_name "dnaN_1";

Every second line (the CDS lines) is missing the gene_id, and the GTF format descriptions online look different from mine. Is there something wrong with my output, or can I continue to work with it? I would really like to use it in featureCounts. (Sorry in advance if this is a really noob question. Any help is appreciated xx)
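To see exactly which lines are affected, a quick check along these lines may help. It is only a sketch: the function names are made up for illustration, and it assumes the file is tab-separated with nine columns, as a real GTF should be (the spaces in the excerpt above are probably just how the forum rendered the tabs).

```python
import re

# Matches one GTF attribute pair such as: gene_id "LCLPEOGO_00001_gene"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(field):
    """Turn 'key "value"; key "value";' into a dict (hypothetical helper)."""
    return dict(ATTR_RE.findall(field))


def lines_missing_gene_id(gtf_lines):
    """Return (line_number, feature_type) for feature lines without gene_id."""
    missing = []
    for number, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        if "gene_id" not in parse_attributes(columns[8]):
            missing.append((number, columns[2]))
    return missing


# Two lines from the question, with tabs written out explicitly.
sample = [
    'CP001095.1\tprokka\ttranscript\t210\t1712\t.\t+\t.\t'
    'transcript_id "LCLPEOGO_00001_gene"; gene_id "LCLPEOGO_00001_gene"; gene_name "dnaA"',
    'CP001095.1\tprokka\tCDS\t210\t1712\t.\t+\t0\t'
    'transcript_id "LCLPEOGO_00001_gene"; gene_name "dnaA";',
]

print(lines_missing_gene_id(sample))  # prints [(2, 'CDS')]
```

For a whole file, passing an open file handle instead of the list should work the same way, since the function only iterates over lines.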

transcriptomics annotation • 1.6k views

This seems to be a known issue with GFF files generated by Prokka. See https://github.com/tseemann/prokka/issues/338 and https://github.com/gpertea/gffread/issues/45

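One could specify -t CDS in the featureCounts command (written as it appears in column 3 of the GTF) to calculate the raw counts for each CDS. Since the CDS lines carry no gene_id, you would probably also need -g transcript_id so that featureCounts groups on an attribute those lines actually have.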
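Another possible workaround, sketched here under some assumptions, is to patch the GTF so that every feature line has a gene_id. In the excerpt above gene_id and transcript_id are identical on the transcript lines, so this sketch simply copies transcript_id into gene_id wherever gene_id is absent. That is not what gffread or Prokka do themselves, the function name is made up, and it only makes sense for annotations like this one with one transcript per gene.

```python
import re


def add_missing_gene_id(gtf_line):
    """Copy transcript_id into gene_id when a feature line lacks gene_id.

    Assumes tab-separated GTF with nine columns; other lines pass through.
    """
    stripped = gtf_line.rstrip("\n")
    columns = stripped.split("\t")
    if stripped.startswith("#") or len(columns) != 9:
        return stripped  # header, comment, or malformed line: leave as-is
    attributes = columns[8]
    if re.search(r'(^|;\s*)gene_id\s+"', attributes):
        return stripped  # gene_id already present
    match = re.search(r'transcript_id\s+"([^"]*)"', attributes)
    if match is None:
        return stripped  # nothing to copy from
    columns[8] = f'gene_id "{match.group(1)}"; {attributes}'
    return "\t".join(columns)


# CDS line from the question, with tabs written out explicitly.
cds_line = (
    'CP001095.1\tprokka\tCDS\t210\t1712\t.\t+\t0\t'
    'transcript_id "LCLPEOGO_00001_gene"; gene_name "dnaA";'
)
transcript_line = (
    'CP001095.1\tprokka\ttranscript\t210\t1712\t.\t+\t.\t'
    'transcript_id "LCLPEOGO_00001_gene"; gene_id "LCLPEOGO_00001_gene"; gene_name "dnaA"'
)

print(add_missing_gene_id(cds_line))
print(add_missing_gene_id(transcript_line))  # unchanged, gene_id already there
```

With gene_id present on the CDS lines, the default -g gene_id grouping in featureCounts should then find the attribute, though -t would still need to name a feature type that exists in the file.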


Thanks for the suggestion. If nothing else works, I'll just specify the CDS and see what happens :)

19 months ago
Juke34 8.9k

You can fix that using AGAT:

agat_convert_sp_gff2gtf.pl --gff test.gtf 

##gtf-version 3
CP001095.1  prokka  gene    210 1712    .   +   .   gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "nbisL1-gene-1"; gene_name "dnaA";
CP001095.1  prokka  transcript  210 1712    .   +   .   gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "LCLPEOGO_00001_gene"; Parent "nbisL1-gene-1"; gene_name "dnaA";
CP001095.1  prokka  exon    210 1712    .   +   .   gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "nbis-exon-1"; Parent "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1  prokka  CDS 210 1712    .   +   0   gene_id "LCLPEOGO_00001_gene"; transcript_id "LCLPEOGO_00001_gene"; ID "cds-1"; Parent "LCLPEOGO_00001_gene"; gene_name "dnaA";
CP001095.1  prokka  gene    2447    3571    .   +   .   gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "nbisL1-gene-2"; gene_name "dnaN_1";
CP001095.1  prokka  transcript  2447    3571    .   +   .   gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "LCLPEOGO_00002_gene"; Parent "nbisL1-gene-2"; gene_name "dnaN_1";
CP001095.1  prokka  exon    2447    3571    .   +   .   gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "nbis-exon-2"; Parent "LCLPEOGO_00002_gene"; gene_name "dnaN_1";
CP001095.1  prokka  CDS 2447    3571    .   +   0   gene_id "LCLPEOGO_00002_gene"; transcript_id "LCLPEOGO_00002_gene"; ID "cds-2"; Parent "LCLPEOGO_00002_gene"; gene_name "dnaN_1";

I haven't looked at AGAT before. Will try that, thanks!
