I have a GFF3 file produced by an analysis step (specifically InterPro, but that's not terribly relevant here). Since that tool took a FASTA file of proteins, all of the analysis results have coordinates relative to the analysed protein sequences.
I'm trying to get these results to show up in a browser like JBrowse. The easiest way I found of doing this was to rebase the coordinates against the parent genome. E.g. if there was a match_part from 1..100 of cdsA, which comprises bases 200..300, then we'd update the match_part to be 200..299 and change the parent reference to the parent genome.
I have a small tool that does this, but was wondering if anyone has a better solution (I just want to display them properly in JBrowse), or knows of a fully featured existing implementation of a rebasing tool like this?
Example
I have a gff file with my gene calls, like so:
##gff-version 3
##sequence-region Merlin 1 172788
Merlin GeneMark.hmm gene 2 691 -856.563659 + . ID=Merlin_1
Merlin GeneMark.hmm gene 1067 2011 -1229.683915 - . ID=Merlin_3
From this, those gene sequences were extracted, translated to protein sequences, and then run through an analysis step which generated some results/matches. In this case they're InterProScan results:
Merlin feature polypeptide 1 229 . + . ID=Merlin_1
Merlin Gene3D protein_match 2 50 2.9E-21 + . ID=match%2477_2_50;Name=G3DSA:3.90.176.10;Target=Merlin_1 2 50;date=23-02-2015;status=T
For these results to display properly in JBrowse, the coordinates need to be adjusted so that they are relative to the parent genome.
The feature with ID=Merlin_1 should be moved 1 base to the right, as the gene that was analysed to produce that match starts at base 2.
A hit with ID=Merlin_3 going from bases 1..11 (in the InterPro results) would need to be changed to the minus strand and moved to 2001..2011 on the parent genome.
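To make the arithmetic concrete, here is a minimal Python sketch of the rebasing described above. It is not the tool mentioned in this post; `load_genes`, `aa_to_nt` and `rebase` are hypothetical names made up for illustration. It assumes 1-based inclusive coordinates, tab-delimited GFF3, and child coordinates already expressed in bases. If the hits are actually in amino acids, they would first need converting to bases, which `aa_to_nt` does under the assumption that residue p covers bases 3p-2..3p of the CDS.

```python
def load_genes(gff_lines):
    """Index gene features as ID -> (start, end, strand) from tab-delimited GFF3."""
    genes = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "ID" in attrs:
            genes[attrs["ID"]] = (int(cols[3]), int(cols[4]), cols[6])
    return genes


def aa_to_nt(aa_start, aa_end):
    """Assumption: residue p spans bases 3p-2..3p of its coding sequence."""
    return 3 * aa_start - 2, 3 * aa_end


def rebase(child_start, child_end, parent_start, parent_end, parent_strand):
    """Map 1-based inclusive child coordinates (in bases) onto the parent genome."""
    if parent_strand == "-":
        # On the minus strand, child position 1 sits at the parent's END.
        return parent_end - child_end + 1, parent_end - child_start + 1, "-"
    return parent_start + child_start - 1, parent_start + child_end - 1, "+"


genes = load_genes([
    "##gff-version 3\n",
    "Merlin\tGeneMark.hmm\tgene\t2\t691\t-856.563659\t+\t.\tID=Merlin_1\n",
    "Merlin\tGeneMark.hmm\tgene\t1067\t2011\t-1229.683915\t-\t.\tID=Merlin_3\n",
])

# A 1..11 hit on the minus-strand gene Merlin_3 (1067..2011):
print(rebase(1, 11, *genes["Merlin_3"]))  # (2001, 2011, '-')
```

With inclusive coordinates, an 11-base hit that starts at position 1 of a minus-strand gene ending at 2011 covers 2001..2011. The same function maps a 1..100 match on a minus-strand feature at 1000..1200 to 1101..1200.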
Thanks! I'd looked at GT a while back, but didn't know about the offsetfile option.
Do you know if it handles strandedness? E.g. the analysed feature is on the minus strand at 1000-1200 and the match_part is 1-100, so the final location should be minus strand, 1101-1200.
Now I'm not so sure I'm thinking about the same thing you are. Perhaps a couple of examples would help clarify things.
Updated my post with a more descriptive example.