Hello,
The Join operator is used for back-splicing events/split genes. Could anyone please provide additional details on how to interpret "Join" in the case of the pgRNA of NC_003977?
In NC_003977, pgRNA (pre-genomic RNA; encodes core and polymerase) is denoted as
join(1820..3182,1..1932)
Why does pgRNA contain overlapping coordinates? Is it because it codes for 2 proteins and each set of coordinates indicates individual proteins?
When I exported NC_003977 to GFF3 format using the "Send to | gff3" option, the coordinates in the "Join" operator were summed, causing them to exceed the genome size (refer to image).
Is it a bug or intended behavior?
Is there documentation on how to interpret these types of coordinates?
Yes, this is a circular genome and this gene runs past the reference origin (position 1), but the two coordinate ranges overlap, so the region 1820..1932 is counted twice. Do we read this overlapping region only once when we make the gene sequence linear?
How should this be interpreted in GFF3, given that the coordinates exceed the total genome size?
It does not matter that it overlaps: it starts at a start codon and goes until the next stop codon. If reading toward that stop codon runs into the gene again, that is fine; it is in a different reading frame now, so it produces different amino acids.
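To make the "read it twice" point concrete, here is a minimal Python sketch of how a join location can be turned into a linear sequence: each segment is extracted in order and concatenated, so an overlapping region simply appears twice. The function names `parse_join` and `splice` are my own for illustration, the parser only handles simple forward-strand `join(a..b,c..d)` strings, and the 10-letter "genome" is a toy stand-in, not real sequence.

```python
def parse_join(location):
    """Parse 'join(a..b,c..d)' into 1-based inclusive (start, end) tuples.

    Hypothetical helper: handles only simple forward-strand joins.
    """
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        start, end = part.split("..")
        segments.append((int(start), int(end)))
    return segments


def splice(genome, segments):
    """Concatenate the segments in order (1-based inclusive coordinates)."""
    return "".join(genome[start - 1:end] for start, end in segments)


# The pgRNA location from the question.
pgrna = parse_join("join(1820..3182,1..1932)")
pgrna_length = sum(end - start + 1 for start, end in pgrna)
print(pgrna, pgrna_length)

# Toy circular "genome" of length 10 with a join that wraps and overlaps.
toy = "ABCDEFGHIJ"
print(splice(toy, parse_join("join(8..10,1..9)")))
```

For the pgRNA this gives 1363 + 1932 = 3295 positions, which is longer than one pass around the genome; in the toy example the letters at positions 8 and 9 show up at both ends of the result.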
The reason it goes past the genome size is to show you properly how long the feature is. A typical use case is to compute end minus start to figure out the length of the feature. Since this feature wraps around the origin, they want to capture that.
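As a rough illustration of that arithmetic, the sketch below derives a single start/end pair from the join segments so that end minus start plus one equals the feature length. It assumes the genome length is 3182 (which the first segment's end implies) and that the exporter follows the GFF3 convention of letting the end run past the landmark length for features crossing the origin; `gff3_span` and `wrap` are names I made up for this example.

```python
def gff3_span(segments):
    """Return (start, end) such that end - start + 1 is the total feature length."""
    start = segments[0][0]
    length = sum(seg_end - seg_start + 1 for seg_start, seg_end in segments)
    return start, start + length - 1


def wrap(position, genome_length):
    """Map a position that runs past the genome end back onto the circle (1-based)."""
    return (position - 1) % genome_length + 1


GENOME_LENGTH = 3182  # assumption: implied by the segment 1820..3182
start, end = gff3_span([(1820, 3182), (1, 1932)])
print(start, end, end - start + 1)
print(wrap(end, GENOME_LENGTH))
```

Under these assumptions the span is 1820 to 5114, and wrapping 5114 back onto the circle lands on 1932, the end of the second join segment.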
But you are correct in assuming that this may cause problems. Many tools that expect linear genomes will fail to operate properly on this particular GFF file.
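If a downstream tool only understands linear coordinates, one possible workaround is to split the wrapped interval back into pieces that each stay within the genome. This is only a sketch under the same assumptions (genome length 3182, end coordinate allowed to exceed it); `unwrap_interval` is a hypothetical helper, not part of any GFF library.

```python
def unwrap_interval(start, end, genome_length):
    """Split a 1-based interval that runs past the genome end into in-range pieces."""
    pieces = []
    position = start
    while position <= end:
        # Where this position falls on the circular genome.
        local_start = (position - 1) % genome_length + 1
        remaining = end - position + 1
        to_edge = genome_length - local_start + 1
        take = min(remaining, to_edge)
        pieces.append((local_start, local_start + take - 1))
        position += take
    return pieces


print(unwrap_interval(1820, 5114, 3182))
print(unwrap_interval(100, 200, 3182))
```

The first call recovers the two original join segments; the second shows that an interval that never crosses the origin comes back unchanged as a single piece.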