Entering edit mode
4.2 years ago
fuhaolll2
▴
30
how could i extract each transcript's last exon from GTF ,get a new GTF only contain last exon.
how could i extract each transcript's last exon from GTF ,get a new GTF only contain last exon.
I would script this sort of thing. I'm sure there is an R way to do it, but I tend to use python for this sort of thing. There might be a clever way to do it in awk, but I don't know it.
If you are happy with installing things, cgatapps has code for handling GTF files. This sort of thing could be achieved with:
from cgat import GTF
from cgatcore import iotools
input_file = iotools.open_file("my_transcripts.gtf.gz")
gtf_entries = gtf.iterator(input_file)
transcripts = gtf.transcript_iterator(gtf_enteries)
for transcript in transcripts:
exons = [e for e in transcript if e.feature == "exon"]
if len(exons) == 0:
continue
if transcript[0].strand == "-":
exons = sorted(exons, key=lambda x: -1*x.end)
else:
exons = sorted(exons, key=lambda x: x.start
last_exon = exons[-1]
print (str(last_exon) + "\n")
The same thing can be achieved with pysam
with a little more effort if cgat
is too big an installation.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What have you tried? awk? R?
for first exon,i know exon number is 1,can get it .but last exon number maybe 6,7,8.
not exactly an answer to your question but I would not trust the numbering of the exons. Sometimes people will start numbering the "last" exon as 1 if the gene is on the reverse strand for instance.
just a little warning ;)