How to fix my broken GTF file?
2
0
Entering edit mode
7.6 years ago
scchess ▴ 640

I have an incomplete GTF file with lines such as:

chr1    hg38_ct_UserTrack_3545  exon    94353   94355   2109    +   .   gene_id "R2_66"; transcript_id "R2_66_1";

This describes an exon. All the lines in my incomplete GTF file describe an exon or CDS.

Question:

I want to fix my GTF file. For instance, I need to do something like:

chrI   hg38_ct_UserTrack_3545  gene      6790136 6808198 .       +       .       gene_id "R1_102";

and

chrI   hg38_ct_UserTrack_3545  transcript      6790136 6808198 .       +       .       transcript_id "R1_102";

I would like to add annotation at the transcript and gene level. What's the best way?

gtf • 3.8k views
ADD COMMENT
0
Entering edit mode

I think you can get gene region on genome by your gtf file. Ignore UTR region. You can try to get the start and end position of a gene. The start and end position should in the first and last exon's boundary, these exons should belong to the same gene.

ADD REPLY
2
Entering edit mode
7.6 years ago
Ryan Dale 5.0k

The gffutils python package can help with this: https://daler.github.io/gffutils

It will infer the gene and transcript extents; doing so is the default behavior for GTF files. The following will import the GTF into a sqlite3 database (which you can use for other things if needed) and then output a new file with everything including gene and transcript features:

import gffutils
db = gffutils.create_db('my.gtf', 'my.gtf.db')
with open('fixed.gtf', 'w') as fout:
    for feature in db.all_features():
        fout.write(str(feature) + '\n')

I should point out that what you call an "incomplete" file is actually an on-spec GTF file -- the specification says not to include gene or transcript features.

ADD COMMENT
1
Entering edit mode
7.6 years ago

Have you tried BioMart @ Ensembl? You can DL the GTF file with the annotations from this http://www.ensembl.org/info/data/ftp/index.html

Then you could load this to R, if you use, or make a AWK with Bash: for example

awk -F"\t" '$3 ~ /exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS/{print}' data > data3

Where -F = field separator, in this case a tab $3 = the field you want to look in /pattern/ = the pattern to search {print} = action to take data = file to look pattern at data3 = output file

hope it works fine!

ADD COMMENT

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6