Question

Making gene annotation from a GTF file

10.6 years ago

Karyo ▴ 10

Hi,

I have downloaded a GTF-formatted file from a database. As you know, it is a tab-delimited file with 9 columns, and it looks like this:

#RefSeq_name Source Feature Start End Score Strand Frame Attribute
Chromosome5 file_source start_codon 4470284 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source stop_codon 4469688 4469690 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470173 4470286 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470173 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470034 4470120 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470034 4470120 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4469273 4469969 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4469691 4469969 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source start_codon 4455593 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source stop_codon 4453288 4453290 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455560 4455595 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455560 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455321 4455372 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455321 4455372 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454682 4455003 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454682 4455003 . - 2 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454473 4454620 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454473 4454620 . - 1 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4453288 4454397 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4453291 4454397 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";

A protein model FASTA file was made from this GTF file; its number of sequences equals the number of transcript_id values in the attributes column. Because of splicing, one gene_id can have more than one transcript_id, so the numbers of gene_id and transcript_id values differ. I would like to parse this GTF file into a simpler GTF format like this:

Chromosome5 file_source gene 4469688 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source gene 4453288 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";

Here the feature (column 3) is set to gene, and columns 4 and 5 hold the gene start and end positions, respectively, for each transcript_id. It seems easy, but some genes do not have start_codon or stop_codon features.

Does anyone know of a GTF parser that can build such a gene annotation file using only the start_codon, stop_codon, CDS and exon records for each transcript_id? Please let me know.

GFF gene genome GTF • 7.4k views

Updated 3.2 years ago by Ram 44k • written 10.6 years ago by Karyo ▴ 10

Ram · Accepted Answer · 2014-04-29

10.6 years ago

Ryan Dale 5.0k

Inferring gene extent from GTF files can be done with gffutils (github, docs).

The docs for importing GTF files have some more detail for handling more difficult cases, but your example file looks straightforward. The following gist shows how to write a new file containing the inferred genes:
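The gist is not embedded in this copy of the thread. As a stand-in, here is a minimal plain-Python sketch of the underlying idea. It is not the gffutils code from the gist. It assumes a tab-separated, 9-column GTF, and the function name `infer_genes` and its `features` parameter are made up for illustration. It collects the smallest start and largest end seen for each transcript_id and emits one `gene` line per transcript.

```python
import re

# Matches attribute pairs such as: gene_id "ABC_00005";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def infer_genes(lines, features=None):
    """Return one 'gene' GTF line per transcript_id.

    lines    -- iterable of GTF lines (assumed tab-separated, 9 columns)
    features -- optional set of feature types to consider, e.g.
                {"CDS", "start_codon", "stop_codon"}; None means all
    """
    spans = {}  # transcript_id -> [seqid, source, start, end, strand, gene_id]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        seqid, source, feature, start, end, _score, strand, _frame, attrs = fields
        if features is not None and feature not in features:
            continue
        attr = dict(ATTR_RE.findall(attrs))
        tid = attr.get("transcript_id")
        if tid is None:
            continue
        start, end = int(start), int(end)
        if tid not in spans:
            spans[tid] = [seqid, source, start, end, strand,
                          attr.get("gene_id", tid)]
        else:
            rec = spans[tid]
            rec[2] = min(rec[2], start)  # leftmost coordinate so far
            rec[3] = max(rec[3], end)    # rightmost coordinate so far

    out = []
    for tid, (seqid, source, start, end, strand, gid) in spans.items():
        attrs = f'gene_id "{gid}"; transcript_id "{tid}";'
        # Frame is written as "." because a gene-level record has no frame.
        out.append("\t".join([seqid, source, "gene", str(start), str(end),
                              ".", strand, ".", attrs]))
    return out
```

With all feature types included, the span covers untranslated exon sequence too, so ABC_00005 in the question's example would start at 4469273 rather than 4469688. Passing `features={"CDS", "start_codon", "stop_codon"}` restricts the span to the coding region and gives the coordinates shown in the question. Either way, transcripts that lack start_codon or stop_codon records still get an extent from their remaining features.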

Updated 4.9 years ago by Ram 44k • written 10.6 years ago by Ryan Dale 5.0k

Wow! It works. Thank you, @Daler!

10.6 years ago by Karyo ▴ 10