Question

How to fix my broken GTF file?

0

Entering edit mode

8.1 years ago

scchess ▴ 640

I have an incomplete GTF file with lines such as:

chr1    hg38_ct_UserTrack_3545  exon    94353   94355   2109    +   .   gene_id "R2_66"; transcript_id "R2_66_1";

This describes an exon. All the lines in my incomplete GTF file describe an exon or CDS.

Question:

I want to fix my GTF file. For instance, I need to do something like:

chrI   hg38_ct_UserTrack_3545  gene      6790136 6808198 .       +       .       gene_id "R1_102";

and

chrI   hg38_ct_UserTrack_3545  transcript      6790136 6808198 .       +       .       transcript_id "R1_102";

I would like to add annotation at the transcript and gene level. What's the best way?

gtf • 4.2k views

ADD COMMENT • link updated 8.1 years ago by Ryan Dale 5.0k • written 8.1 years ago by scchess ▴ 640

0

Entering edit mode

I think you can get gene region on genome by your gtf file. Ignore UTR region. You can try to get the start and end position of a gene. The start and end position should in the first and last exon's boundary, these exons should belong to the same gene.

ADD REPLY • link 8.1 years ago by l0o0 ▴ 220

1

Entering edit mode

8.1 years ago

gandrescabrera ▴ 80

Have you tried BioMart @ Ensembl? You can DL the GTF file with the annotations from this http://www.ensembl.org/info/data/ftp/index.html

Then you could load this to R, if you use, or make a AWK with Bash: for example

awk -F"\t" '$3 ~ /exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS/{print}' data > data3

Where -F = field separator, in this case a tab $3 = the field you want to look in /pattern/ = the pattern to search {print} = action to take data = file to look pattern at data3 = output file

hope it works fine!

ADD COMMENT • link 8.1 years ago by gandrescabrera ▴ 80

score 2 · Accepted Answer · 2017-04-27

The gffutils python package can help with this: https://daler.github.io/gffutils

It will infer the gene and transcript extents; doing so is the default behavior for GTF files. The following will import the GTF into a sqlite3 database (which you can use for other things if needed) and then output a new file with everything including gene and transcript features:

import gffutils
db = gffutils.create_db('my.gtf', 'my.gtf.db')
with open('fixed.gtf', 'w') as fout:
    for feature in db.all_features():
        fout.write(str(feature) + '\n')

I should point out that what you call an "incomplete" file is actually an on-spec GTF file -- the specification says not to include gene or transcript features.