I have a GFF annotation for my sequences where many protein sub-features are overlapping. I want to extract the longest of those sub-features (i.e. one sub-feature for each sequence).
For example, the following is how my gff file looks:
chrA01:14912487-14917603(-)_Name=Name1 Source1 repeat 6 5116 . - . ID=repeat_region1
chrA01:14912487-14917603(-)_Name=Name1 Source1 inverted 6 7 . - . Parent=repeat_region1
chrA01:14912487-14917603(-)_Name=Name1 Source1 LTR 6 5116 . - . ID=LTR_retrotransposon1;Parent=repeat_region1;similarity=95.70;seq_number=666
chrA01:14912487-14917603(-)_Name=Name1 Source1 ltrs 6 302 . - . Parent=LTR_retrotransposon1
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 535 835 1.7e-16 - . Parent=LTR_retrotransposon1;reading_frame=0;name=INT_crm
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 535 874 0 - . Parent=LTR_retrotransposon1;reading_frame=0;name=INT_reina
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 547 772 1.9e-28 - . Parent=LTR_retrotransposon1;reading_frame=0;name=INT_del
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 1812 2124 8.3e-41 - . Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_galadriel
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 1812 2127 4.6e-15 - . Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_v_clade
chrA01:14912487-14917603(-)_Name=Name1 Source2 protein 1812 2127 1.2e-16 - . Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_athila
chrA01:14912487-14917603(-)_Name=Name1 Source1 ltrs 4815 5116 . - . Parent=LTR_retrotransposon1
chrA01:14912487-14917603(-)_Name=Name1 Source1 inverted 5115 5116 . - . Parent=repeat_region1
###
chrA01:18337410-18342821(-)_Name=Name2 Source1 repeat 1 5411 . - . ID=repeat_region2
chrA01:18337410-18342821(-)_Name=Name2 Source1 inverted 1 2 . - . Parent=repeat_region2
chrA01:18337410-18342821(-)_Name=Name2 Source1 LTR 1 5411 . - . ID=LTR_retrotransposon2;Parent=repeat_region2;similarity=98.93;seq_number=794
chrA01:18337410-18342821(-)_Name=Name2 Source1 ltrs 1 374 . - . Parent=LTR_retrotransposon2
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 427 1435 0 - . Parent=LTR_retrotransposon2;reading_frame=1;name=INT_crm
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 439 1390 1.8e-38 - . Parent=LTR_retrotransposon2;reading_frame=1;name=INT_TF
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 481 1435 0 - . Parent=LTR_retrotransposon2;reading_frame=1;name=INT_v_clade
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 1657 1990 3e-18 - . Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_v_clade
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 1657 1990 6.8e-17 - . Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_csrn1
chrA01:18337410-18342821(-)_Name=Name2 Source2 protein 1657 1990 2.7e-35 - . Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_reina
chrA01:18337410-18342821(-)_Name=Name2 Source1 ltrs 5038 5411 . - . Parent=LTR_retrotransposon2
chrA01:18337410-18342821(-)_Name=Name2 Source1 inverted 5410 5411 . - . Parent=repeat_region2
###
I want to extract the longest or the best RNaseH regions from each sequence. I am able to parse the gff to get RNase annotations. But, I am not sure how to proceed next:
from gffutils.iterators import DataIterator
from Bio import SeqIO
input_filename = 'mygff.gff'
infasta = list(SeqIO.parse('myseqs.fasta', 'fasta'))
features=DataIterator(input_filename)
for feature in features:
if "name=RNaseH" in str(feature):
print str(feature)
#check lengths
#check e-values