6.4 years ago
sangram_keshari
I am trying to get FASTA sequences from a GTF file containing transcripts with multiple exons (filtered from merged_transcripts.gtf from the Cufflinks pipeline; to be precise, the transcripts with class_code "u").
So I am running gffread with this filtered GTF file and the genome file to get the sequences for those transcripts.
But looking at the output, for some transcripts the sequence starts from the 2nd exon, and for others from the 3rd exon.
Can anyone suggest what may be going wrong here?
Hi Sangram
Can you post some examples?
Sure,
A few lines from the GTF file (intergenic.gtf):
gffread syntax:
So when retrieving the sequence with gffread, I should get a sequence containing both exons, but the output contains only the 2nd exon.
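One way to confirm which transcripts are affected is to compare the length of each FASTA record against the summed exon lengths in the GTF. The following is a minimal sketch, not part of the original post; the function name `expected_lengths` is hypothetical, and it assumes a standard 9-column, tab-separated GTF with `exon` features and a `transcript_id` attribute.

```python
import re
from collections import defaultdict

def expected_lengths(gtf_lines):
    """Return {transcript_id: (exon_count, total_exon_bases)} from GTF lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        entry = totals[match.group(1)]
        entry[0] += 1
        # GTF coordinates are 1-based and inclusive on both ends.
        entry[1] += end - start + 1
    return {tid: (n, bases) for tid, (n, bases) in totals.items()}
```

A transcript whose FASTA sequence is shorter than its total exon bases is missing at least one exon in the output.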
I suspect the GTF filtering is the cause. Could you try the following to be sure?
Run gffread -w on the whole GTF to get the complete transcripts FASTA.
Then get the list of transcripts with class code 'u' from the GTF.
Then extract only those transcript sequences from the FASTA.
Check whether the issue persists.
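The steps above can be sketched in Python, assuming the full transcripts FASTA has already been written with something like `gffread -w transcripts.fa -g genome.fa merged_transcripts.gtf` (the file names `transcripts.fa` and `genome.fa` are placeholders). The helper names `class_u_ids` and `filter_fasta` are hypothetical. The sketch selects by transcript ID, so it does not matter whether the class_code attribute is present on every exon line.

```python
import re

def class_u_ids(gtf_lines):
    """Collect transcript IDs from GTF lines that carry class_code "u"."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = fields[8]
        if 'class_code "u"' in attrs:
            match = re.search(r'transcript_id "([^"]+)"', attrs)
            if match:
                ids.add(match.group(1))
    return ids

def filter_fasta(fasta_lines, wanted):
    """Yield (record_id, sequence) for FASTA records whose ID is in `wanted`.

    The record ID is taken as the first whitespace-separated token of the header.
    """
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name in wanted:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name = tokens[0] if tokens else None
            chunks = []
        elif line:
            chunks.append(line)
    if name in wanted:
        yield name, "".join(chunks)
```

In practice both functions can be fed open file handles, for example `filter_fasta(open("transcripts.fa"), class_u_ids(open("merged_transcripts.gtf")))`.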
Even after doing this, I get similar results. I tried hard to understand why, but couldn't.
So I took a different approach: after the filtering step, I used bedtools to get the sequence of each exon of the transcripts, and a Python script to combine them. It's rather complicated, but it works for now.
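The exon-combining idea can be illustrated with a small pure-Python sketch. This is not the script used in the post: it skips bedtools and reads exon coordinates straight from the GTF, and it assumes the genome is already loaded as a dict mapping chromosome name to sequence. The function name `spliced_transcripts` is hypothetical.

```python
import re
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def spliced_transcripts(gtf_lines, genome):
    """Return {transcript_id: spliced sequence} built from GTF exon features.

    `genome` maps chromosome name -> sequence string (an assumption of this sketch).
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )
    spliced = {}
    for tid, parts in exons.items():
        # Join exons in ascending genomic order, whatever order the file lists them in.
        parts.sort(key=lambda exon: exon[1])
        # GTF is 1-based inclusive, so exon [start, end] is the slice [start-1:end].
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            # Minus-strand transcripts are the reverse complement of the joined exons.
            seq = seq.translate(_COMPLEMENT)[::-1]
        spliced[tid] = seq
    return spliced
```

Sorting by start coordinate before joining means the result does not depend on exon order in the file, which makes it a useful cross-check against the gffread output.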
Anyway, thank you Jeiffin :)
Okay. When you get time, give it a try with some other GTF and see how it turns out.