I have GFF3 files for several genomes in which, for some annotation features, the end coordinate is one base larger than the sequence length. When such a file is uploaded to some tools, they give an error saying that the length of the contig is greater than the sequence length.
bgc_001 GN contig 1 2606 . . . ID=AQ2860336_000071;Name=AQ2860336_000071
bgc_002 GN contig 1 26678 . . . ID=AQ2860336_000072;Name=AQ2860336_000072
bgc_003 GN contig 1 22812 . . . ID=AQ2860336_000073;Name=AQ2860336_000073
bgc_001 . cluster 1 2606 . . . id=bgc_001.region001
bgc_001 . element 1 2607 . + . contig_edge='True'
bgc_001 . region 1 2607 . + . candidate_cluster_numbers='1'
bgc_002 . cluster 1 26678 . . . id=002.region001
bgc_002 . element 1 26679 . + . contig_edge='True';
bgc_002 . region 1 26679 . + . candidate_cluster_numbers='1'
bgc_003 . cluster 643 22812 . . . id=bgc_003.region001
bgc_003 . element 643 22812 . + . contig_edge='False'
bgc_003 . region 643 22812 . + . candidate_cluster_numbers='1'
In the sample above there are three contigs: bgc_001, bgc_002 and bgc_003. For bgc_001 the sequence length is 2606, but for the annotation features "element" and "region" the end is 2607. The same happens for bgc_002: the sequence length is 26678, but the end of "element" and "region" is 26679. For bgc_003 everything looks fine.
How can I read the sequence length from the top (contig) rows and check the end coordinate of the "element" and "region" features? If the end is greater than the sequence length, it should be corrected to the sequence length.
For example, the output would be:
bgc_001 GN contig 1 2606 . . . ID=AQ2860336_000071;Name=AQ2860336_000071
bgc_002 GN contig 1 26678 . . . ID=AQ2860336_000072;Name=AQ2860336_000072
bgc_003 GN contig 1 22812 . . . ID=AQ2860336_000073;Name=AQ2860336_000073
bgc_001 . cluster 1 2606 . . . id=bgc_001.region001
bgc_001 . element 1 2606 . + . contig_edge='True'
bgc_001 . region 1 2606 . + . candidate_cluster_numbers='1'
bgc_002 . cluster 1 26678 . . . id=002.region001
bgc_002 . element 1 26678 . + . contig_edge='True';
bgc_002 . region 1 26678 . + . candidate_cluster_numbers='1'
bgc_003 . cluster 643 22812 . . . id=bgc_003.region001
bgc_003 . element 643 22812 . + . contig_edge='False'
bgc_003 . region 643 22812 . + . candidate_cluster_numbers='1'
Should be doable with an array, but before overthinking this: to me it looks like an issue with 0-based versus 1-based coordinates, if the delta is always exactly 1. In that case, I would personally aim at shifting all entries in the GFF3 by -1 rather than selectively shortening those that are too long.
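One way to check whether the overshoot really is always exactly 1 is to tally it per feature type before changing anything. The following is only a minimal Python sketch: the function name `end_overshoots` is made up for illustration, and it assumes a tab-separated file in which each sequence has one row of type "contig" whose end column equals the sequence length.

```python
from collections import Counter


def end_overshoots(lines, length_type="contig"):
    """Count features whose end exceeds the contig length.

    Returns a Counter keyed by (feature_type, overshoot_in_bases).
    Assumption: tab-separated GFF3 with nine columns per feature row.
    """
    lengths = {}  # sequence ID -> length, read from the contig rows
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        rows.append(fields)
        if fields[2] == length_type:
            lengths[fields[0]] = int(fields[4])

    counts = Counter()
    for fields in rows:
        seqid, ftype, end = fields[0], fields[2], int(fields[4])
        if seqid in lengths and end > lengths[seqid]:
            counts[(ftype, end - lengths[seqid])] += 1
    return counts
```

If every key in the result has an overshoot of 1, that is at least consistent with a coordinate-convention mismatch; any other values would point to a different cause.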
Not all features are incorrect. The other features are represented correctly; only these two are off. I have given only a few features and a short file as an example. Can you point me to how to use an array to solve this?
That is the point. You don't notice because only some features exceed the maximum length, but if you used different tools, one of them might work with 0-based and another with 1-based coordinates. Either way, I have posted a possible solution for the approach you envisioned.
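For reference, the array-based clamping approach asked about can be sketched in Python with a dictionary as the lookup table. This is not necessarily the solution referred to above. The function name `clamp_gff3_ends` and its parameters are hypothetical, and the sketch assumes a tab-separated GFF3 file in which the "contig" row of each sequence carries the true length in its end column.

```python
def clamp_gff3_ends(lines, length_type="contig", fix_types=("element", "region")):
    """Clamp the end of selected feature types to the contig length.

    Assumption: tab-separated GFF3 with nine columns per feature row,
    and one `length_type` row per sequence giving its length.
    """
    lines = list(lines)

    # First pass: build the lookup table, sequence ID -> length.
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(fields) >= 9 and fields[2] == length_type:
            lengths[fields[0]] = int(fields[4])

    # Second pass: rewrite only rows of the selected types; every other
    # line is passed through unchanged.
    fixed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if (not line.startswith("#") and len(fields) >= 9
                and fields[2] in fix_types and fields[0] in lengths):
            fields[4] = str(min(int(fields[4]), lengths[fields[0]]))
            fixed.append("\t".join(fields) + "\n")
        else:
            fixed.append(line)
    return fixed
```

To apply it to a file, read the lines with `open(path)`, pass them to `clamp_gff3_ends`, and write the returned list with `writelines`. Two passes are used so that the contig rows do not have to come before the features they describe.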