Hi,
I have downloaded a GTF-formatted file from a database. As you know, it is a tab-delimited file with 9 columns, and it looks like this:
#RefSeq_name Source Feature Start End Score Strand Frame Attribute
Chromosome5 file_source start_codon 4470284 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source stop_codon 4469688 4469690 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470173 4470286 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470173 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470034 4470120 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470034 4470120 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4469273 4469969 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4469691 4469969 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source start_codon 4455593 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source stop_codon 4453288 4453290 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455560 4455595 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455560 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455321 4455372 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455321 4455372 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454682 4455003 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454682 4455003 . - 2 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454473 4454620 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454473 4454620 . - 1 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4453288 4454397 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4453291 4454397 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
From this GTF file, a protein model FASTA file is generated, so the number of protein sequences equals the number of transcript_id values in the attributes column. Because of alternative splicing, one gene_id can have more than one transcript_id, so the numbers of gene_id and transcript_id values differ. I would like to parse this GTF file into a simpler GTF format like this:
Chromosome5 file_source gene 4469688 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source gene 4453288 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Column 3 (feature) is set to gene, and columns 4 and 5 give the gene start and end positions, respectively, for each transcript_id. It seems easy, but some genes do not have start_codon or stop_codon features.
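To make the idea concrete, here is a rough Python sketch of the collapsing I have in mind. It is not an existing tool, only an illustration under some assumptions: the file is truly tab-delimited, every line carries a transcript_id attribute, and the span is taken as the minimum start and maximum end over the CDS, start_codon and stop_codon features, so a missing start_codon or stop_codon simply falls back to the CDS extent. The function name collapse_transcripts and its parameters are made up for this example. It writes "." in the frame column by default, since frame has no meaning for a whole span; pass frame="0" to reproduce the example lines above exactly.

```python
import re

# Pulls the transcript_id value out of the attribute column (column 9).
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def collapse_transcripts(lines,
                         features=("CDS", "start_codon", "stop_codon"),
                         frame="."):
    """Return one 'gene' GTF line per transcript_id.

    The span runs from the smallest start to the largest end among the
    selected feature types. Transcripts with none of those features are
    skipped (an assumption of this sketch).
    """
    spans = {}  # transcript_id -> [seqname, source, start, end, strand, attrs]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] not in features:
            continue  # malformed line, or a feature type we ignore (e.g. exon)
        match = TRANSCRIPT_RE.search(fields[8])
        if match is None:
            continue  # no transcript_id attribute on this line
        tid = match.group(1)
        start, end = int(fields[3]), int(fields[4])
        if tid not in spans:
            spans[tid] = [fields[0], fields[1], start, end,
                          fields[6], fields[8]]
        else:
            spans[tid][2] = min(spans[tid][2], start)
            spans[tid][3] = max(spans[tid][3], end)

    # Emit the collapsed records in order of first appearance.
    return ["\t".join([seqname, source, "gene", str(start), str(end),
                       ".", strand, frame, attrs])
            for seqname, source, start, end, strand, attrs in spans.values()]
```

Calling collapse_transcripts on an open file handle and printing each returned line would give the simplified file. To use exon coordinates as well (which would include the UTRs), "exon" can be added to the features tuple.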
Does anyone know of a GTF parser that can build such a gene annotation file using only the start_codon, stop_codon, CDS and exon information for each transcript_id? Please let me know.
Wow! It works. Thank you, @Daler!