Extract longest mRNA from .GFF when I have multiple rows of same seqID
Hello,
I have a GFF file that has multiple rows for each seqID.
Using Python, I want to extract the longest mRNA for each seqID and write the result to a new tab-delimited file with these columns:
<seqID> <start> <end>
The original GFF is formatted as follows:
<seqid> <SOURCE> <FEATURE> <START> <END> <SCORE> <STRAND> <FRAME> <ATTRIBUTES> <COMMENTS>
I'm very new to this and could use a hand with what is likely easy for someone with more experience.
Thanks for any help you can provide. I'm unlikely to be able to modify someone else's script appropriately on my own.
sequence
genome
There are some GTF parsers in Python, for example pyGTF, with which you can get the length of each seqID. Alternatively, get a BED file of the seqIDs and use groupBy to get the first and last coordinates, then compute the length. If you provide the head of your file, we could help a bit further.
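The group-by idea above can be sketched in plain Python without any external tool. This is only an illustration under stated assumptions: the function name `span_per_group` is made up for this example, the input is a list of `(key, start, end)` tuples, and coordinates are treated as 1-based and inclusive as in GFF (for BED-style coordinates the length would be `last - first`).

```python
from collections import defaultdict


def span_per_group(records):
    """Group (key, start, end) records by key and return
    {key: (first_start, last_end, length)}.

    Hypothetical helper: it stands in for a group-by step and assumes
    1-based, inclusive coordinates (GFF convention).
    """
    grouped = defaultdict(list)
    for key, start, end in records:
        grouped[key].append((start, end))

    spans = {}
    for key, intervals in grouped.items():
        first = min(start for start, _ in intervals)
        last = max(end for _, end in intervals)
        spans[key] = (first, last, last - first + 1)
    return spans


# Two parts of one hit and a single part of another (illustrative values).
records = [
    ("hit_A", 4053, 4314),
    ("hit_A", 8263, 8297),
    ("hit_B", 4055, 4327),
]
spans = span_per_group(records)
for key, (first, last, length) in spans.items():
    print(key, first, last, length, sep="\t")
```

Note that this measures the genomic span from the first to the last coordinate, so it includes any gaps between the grouped parts.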
Thanks for your help. This was my first-ever post. I'll be sure to put things in the appropriate place moving forward. Should the head portion of this post be considered code?
##gff-version 3
BHEC01006359.1 . contig 1 807 . . . ID=BHEC01006359.1;Name=BHEC01006359.1
###
BHEC01012377.1 . contig 1 805 . . . ID=BHEC01012377.1;Name=BHEC01012377.1
###
BHEC01004863.1 . contig 1 8509 . . . ID=BHEC01004863.1;Name=BHEC01004863.1
BHEC01004863.1 snap_masked match 4053 8297 29.048 + . ID=BHEC01004863.1:hit:1841:4.5.0.0;Name=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1
BHEC01004863.1 snap_masked match_part 4053 4314 12.637 + . ID=BHEC01004863.1:hsp:6435:4.5.0.0;Parent=BHEC01004863.1:hit:1841:4.5.0.0;Target=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1 1 262 +;Gap=M262
BHEC01004863.1 snap_masked match_part 8263 8297 16.411 + . ID=BHEC01004863.1:hsp:6436:4.5.0.0;Parent=BHEC01004863.1:hit:1841:4.5.0.0;Target=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1 263 297 +;Gap=M35
BHEC01004863.1 augustus_masked match 4055 4327 0.83 + . ID=BHEC01004863.1:hit:1842:4.5.0.0;Name=augustus_masked-BHEC01004863.1-abinit-gene-0.0-mRNA-1
BHEC01004863.1 augustus_masked match_part 4055 4327 0.83 + . ID=BHEC01004863.1:hsp:6437:4.5.0.0;Parent=BHEC01004863.1:hit:1842:4.5.0.0;Target=augustus_masked-BHEC01004863.1-abinit-gene-0.0-mRNA-1 1 273 +;Gap=M273
###
BHEC01004863.1 est_gff:est2genome expressed_sequence_match 8071 8351 805 - . ID=BHEC01004863.1:hit:1840:3.12.0.0;Name=Csept_BB_C55352;score=805
BHEC01004863.1 est_gff:est2genome match_part 8071 8351 805 - . ID=BHEC01004863.1:hsp:6434:3.12.0.0;Parent=BHEC01004863.1:hit:1840:3.12.0.0;Target=Csept_BB_C55352 7 284 +;Gap=M281
BHEC01053345.1 . contig 1 2142 . . . ID=BHEC01053345.1;Name=BHEC01053345.1
###
BHEC01052641.1 . contig 1 803 . . . ID=BHEC01052641.1;Name=BHEC01052641.1
###
BHEC01000922.1 . contig 1 1466 . . . ID=BHEC01000922.1;Name=BHEC01000922.1
###
BHEC01008444.1 . contig 1 9527 . . . ID=BHEC01008444.1;Name=BHEC01008444.1
BHEC01008444.1 snap_masked match 3239 9054 54.086 + . ID=BHEC01008444.1:hit:1739:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1
BHEC01008444.1 snap_masked match_part 3239 3312 11.140 + . ID=BHEC01008444.1:hsp:6367:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 1 74 +;Gap=M74
BHEC01008444.1 snap_masked match_part 5275 5320 11.772 + . ID=BHEC01008444.1:hsp:6368:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 75 120 +;Gap=M46
BHEC01008444.1 snap_masked match_part 5469 5541 6.534 + . ID=BHEC01008444.1:hsp:6369:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 121 193 +;Gap=M73
BHEC01008444.1 snap_masked match_part 8183 8261 10.882 + . ID=BHEC01008444.1:hsp:6370:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 194 272 +;Gap=M79
BHEC01008444.1 snap_masked match_part 9024 9054 13.758 + . ID=BHEC01008444.1:hsp:6371:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 273 303 +;Gap=M31
BHEC01008444.1 snap_masked match 7114 7425 -11.831 - . ID=BHEC01008444.1:hit:1740:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.2-mRNA-1
BHEC01008444.1 snap_masked match_part 7114 7425 -11.831 - . ID=BHEC01008444.1:hsp:6372:4.5.0.0;Parent=BHEC01008444.1:hit:1740:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.2-mRNA-1 1 312 +;Gap=M312
BHEC01008444.1 augustus_masked match 7114 7425 0.52 - . ID=BHEC01008444.1:hit:1741:4.5.0.0;Name=augustus_masked-BHEC01008444.1-abinit-gene-0.0-mRNA-1
BHEC01008444.1 augustus_masked match_part 7114 7425 0.52 - . ID=BHEC01008444.1:hsp:6373:4.5.0.0;Parent=BHEC01008444.1:hit:1741:4.5.0.0;Target=augustus_masked-BHEC01008444.1-abinit-gene-0.0-mRNA-1 1 312 +;Gap=M312
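A possible starting point, using only the standard library, is sketched below. It rests on several assumptions that should be checked against the real file: the sample above contains no rows whose feature column is literally `mRNA`, so the sketch treats `match` rows (whose Name ends in `mRNA-1`) as the records of interest; "longest" is taken to mean the genomic span `end - start + 1`, not the summed length of the `match_part` rows; and the function names are invented for this example. Real GFF3 is tab-delimited, but the pasted sample is space-separated, so the parser accepts both.

```python
import sys


def longest_feature_per_seqid(lines, feature_type="match"):
    """Return {seqid: (start, end)} for the longest row of feature_type.

    Assumption: feature_type="match" stands in for mRNA records here;
    pass "mRNA" if the real file has such rows.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, the ##gff-version header and ### separators.
        if not line or line.startswith("#"):
            continue
        # Prefer tabs (real GFF3); fall back to whitespace for pasted text.
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 5 or fields[2] != feature_type:
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        length = end - start + 1  # GFF coordinates are 1-based, inclusive
        if seqid not in best or length > best[seqid][1] - best[seqid][0] + 1:
            best[seqid] = (start, end)
    return best


def write_table(best, handle):
    """Write one tab-delimited line per seqID: seqID, start, end."""
    for seqid, (start, end) in best.items():
        handle.write(f"{seqid}\t{start}\t{end}\n")


# A few rows copied from the sample above (space-separated, as pasted).
SAMPLE = [
    "##gff-version 3",
    "BHEC01004863.1 . contig 1 8509 . . . ID=BHEC01004863.1;Name=BHEC01004863.1",
    "BHEC01004863.1 snap_masked match 4053 8297 29.048 + . ID=BHEC01004863.1:hit:1841:4.5.0.0;Name=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1",
    "BHEC01004863.1 snap_masked match_part 4053 4314 12.637 + . ID=BHEC01004863.1:hsp:6435:4.5.0.0;Parent=BHEC01004863.1:hit:1841:4.5.0.0;Target=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1 1 262 +;Gap=M262",
    "BHEC01004863.1 augustus_masked match 4055 4327 0.83 + . ID=BHEC01004863.1:hit:1842:4.5.0.0;Name=augustus_masked-BHEC01004863.1-abinit-gene-0.0-mRNA-1",
    "###",
    "BHEC01008444.1 snap_masked match 3239 9054 54.086 + . ID=BHEC01008444.1:hit:1739:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1",
    "BHEC01008444.1 snap_masked match 7114 7425 -11.831 - . ID=BHEC01008444.1:hit:1740:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.2-mRNA-1",
]

best = longest_feature_per_seqid(SAMPLE)
write_table(best, sys.stdout)
```

For a real file, the same functions can be given open file handles, for example `longest_feature_per_seqid(open("input.gff"))` for reading and a handle opened for writing passed to `write_table`; the file name is a placeholder. Contigs with no `match` rows simply do not appear in the output.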
This should be a Question type post, not a Forum type post. I've made the necessary change but please be more careful in the future. Reading posts under the how-to tag will help.
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
Please avoid the use of emojis and such in a professional/scientific setting.