How to change gene coordinates in a GTF file?
2.7 years ago
Info.shi ▴ 30

Hi, I have a gtf file in which I need to change the gene coordinates, for both + and - strand genes, so that the UTR regions are excluded and each gene starts and ends at its CDS start and end coordinates.

My primary gtf file:

Chr_3a transdecoder gene 26355 34213 . - . ID=MSTRG.7.5

Chr_3a transdecoder cds 33198 33363 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 30850 31322 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder cds 29756 30785 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 29426 29679 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder gene 13108235 13128245 . + . ID=MSTRG.1

Chr_3a transdecoder cds 13113822 13113951 . + 0 ID=MSTRG.1

Chr_3a transdecoder cds 13114050 13114146 . + 2 ID=MSTRG.1

Chr_3a transdecoder cds 13114259 13114432 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13116046 13116286 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13117096 13120860 . + 0 ID=MSTRG.1

Expected format

On the - strand

Chr_3a transdecoder gene 29426 33363 . - . ID=MSTRG.7.5

Chr_3a transdecoder cds 33198 33363 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 30850 31322 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder cds 29756 30785 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 29426 29679 . - 2 ID=MSTRG.7.5

And on the + strand

Chr_3a transdecoder gene 13113822 13120860 . + . ID=MSTRG.1

Chr_3a transdecoder cds 13113822 13113951 . + 0 ID=MSTRG.1

Chr_3a transdecoder cds 13114050 13114146 . + 2 ID=MSTRG.1

Chr_3a transdecoder cds 13114259 13114432 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13116046 13116286 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13117096 13120860 . + 0 ID=MSTRG.1

Kindly suggest how I can get this output. I am not good at programming, and changing the coordinates manually for all genes is not feasible.

Thank you

perl python R • 1.1k views

This problem seems to be not about removing UTRs but about finding the two ends of the CDS regions, a kind of merging of the CDS regions that belong to the same transcript.
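As a small worked example of that idea, here is a sketch using the CDS coordinates of MSTRG.7.5 copied from the question. The variable names are made up for illustration. The point is that the merged span is simply the smallest CDS start and the largest CDS end, whichever strand the gene is on.

```python
# CDS (start, end) pairs for MSTRG.7.5, copied from the question.
cds = [(33198, 33363), (30850, 31322), (29756, 30785), (29426, 29679)]

# GTF start/end are always given left to right on the reference,
# so the strand does not change which value is the minimum or maximum.
gene_start = min(start for start, _ in cds)
gene_end = max(end for _, end in cds)

print(gene_start, gene_end)  # 29426 33363
```

These are the gene coordinates shown in the expected output for the - strand gene.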

Look at posts like these:

2.7 years ago
Shred ★ 1.5k

If you're not good at programming, this may be a hard task. I've worked on something similar, so I can suggest a way to do it.

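Split your gtf file into strand-specific files (input.gtf is a placeholder for your own file name; without an input file awk would wait on standard input):

awk -F'\t' '{if ($7=="+") print $0}' input.gtf > forward.gtf

awk -F'\t' '{if ($7=="-") print $0}' input.gtf > reverse.gtf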

Then write a parser (in Python would be easier) where you define a class to store each gtf feature. Something like:

Gene x/
├─ Transcript X.1/
├─ Transcript X.2/
│  ├─ 5'UTR
│  ├─ CDS
│  ├─ 3' UTR
├─ Transcript X.n/
│  ├─ ..
│  ├─ ..

Then iterate over each gene to access each transcript: here you'll subtract the UTR coordinates from the gene's coordinates and rewrite the record. If you use dictionaries in Python to store the gene/transcript features, insertion order is preserved, so you can edit only the first/last CDS according to the UTR coordinates.

I wrote a parser some time ago, intended to do the opposite thing: add 3' UTRs where they are missing from a GTF file. There you can find this data structure implemented; it is basically a nested OrderedDict in Python 3. But as you've said that your programming skills are not that good, maybe a better idea would be to pass this concept on to someone able to implement it.
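For the specific file in the question, which has only gene and cds rows and no UTR features, a much smaller script may be enough. The following is a minimal sketch, not the parser mentioned above. It assumes that a gene row and its cds rows carry exactly the same attribute column (for example ID=MSTRG.1), and the function names are hypothetical. Because it takes the minimum CDS start and maximum CDS end per ID, it should not need the strand split.

```python
def split_fields(line):
    # Real GTF is tab-delimited; fall back to whitespace for pasted text.
    # At most 9 fields, so spaces inside the attribute column survive.
    return line.split("\t") if "\t" in line else line.split(None, 8)


def trim_genes_to_cds(lines):
    """Return the lines with every gene row shrunk to its CDS span."""
    rows = []   # parsed rows (lists) or untouched lines (str), input order
    span = {}   # attribute column -> (min CDS start, max CDS end)

    for raw in lines:
        line = raw.rstrip("\n")
        fields = split_fields(line)
        # Pass comments, blank lines and malformed rows through unchanged.
        if line.startswith("#") or len(fields) < 9:
            rows.append(line)
            continue
        rows.append(fields)
        if fields[2].lower() == "cds":
            start, end = int(fields[3]), int(fields[4])
            lo, hi = span.get(fields[8], (start, end))
            span[fields[8]] = (min(lo, start), max(hi, end))

    out = []
    for row in rows:
        if isinstance(row, str):
            out.append(row)
            continue
        # ASSUMPTION: gene and cds rows share an identical attribute column.
        if row[2].lower() == "gene" and row[8] in span:
            row[3], row[4] = (str(v) for v in span[row[8]])
        out.append("\t".join(row))
    return out
```

To use it, read the file into a list of lines (for example with open("input.gtf") and readlines(), where input.gtf is a placeholder name), pass the list to trim_genes_to_cds, and write the returned lines to a new file joined by newlines. Genes without any cds row are left as they are. Note that IDs must match exactly, so a row with a typo such as ID=MSTRG..1 would not be grouped with ID=MSTRG.1.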
