Question

Rebase analysed GFF3 against parent data

9.6 years ago

rasche.eric ▴ 70

I have a gff3 file produced by an analysis step (specifically InterPro, but that's not terribly relevant here). Since that tool took a fasta file of proteins, all of the analysis results have coordinates relative to the analysed protein sequences.

I'm trying to get these results to show up in a browser like JBrowse. The easiest way of doing this that I found was to rebase the coordinates against the parent genome. E.g. if there was a match_part from 1-100 of cdsA, which comprises bases 200..300, then we'd update the match_part to be 200..299 and change the parent reference to the parent genome.

I have a small tool that does this, but I was wondering whether anyone has a better solution (I just want to display the results properly in JBrowse) or knows of a full-featured existing implementation of a rebasing tool like this.

Example

I have a gff file with my gene calls, like so:

##gff-version 3
##sequence-region Merlin 1 172788
Merlin  GeneMark.hmm    gene    2   691 -856.563659 +   .   ID=Merlin_1
Merlin  GeneMark.hmm    gene    1067    2011    -1229.683915    -   .   ID=Merlin_3

From this, the gene sequences were extracted, translated to protein sequences, and then run through an analysis step which generated some results/matches. In this case they're InterProScan results:

Merlin  feature polypeptide 1   229 .   +   .   ID=Merlin_1
Merlin  Gene3D  protein_match   2   50  2.9E-21 +   .   ID=match%2477_2_50;Name=G3DSA:3.90.176.10;Target=Merlin_1 2 50;date=23-02-2015;status=T

For these results to display properly in JBrowse, their coordinates need to be adjusted so that they are relative to the parent genome.

The feature with ID=Merlin_1 should be moved 1 base to the right, as the gene that was analysed to produce that match starts at base 2.

A hit with ID=Merlin_3, going from bases 1..11 (in the InterPro results), would need to be changed to the minus strand and moved to 2001..2011 in the parent genome.
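The rebasing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool mentioned in the question: the function names `rebase_interval`, `load_parents`, and `rebase_line` are made up for this example, input lines are assumed to be tab-separated GFF3, and the analysed sequence is assumed to be named by the first token of `Target` (falling back to `ID`). The `scale` parameter is also an assumption: with the default of 1 it reproduces the arithmetic in the question, whereas coordinates counted in amino acids would need `scale=3` to land on the right nucleotides.

```python
def rebase_interval(start, end, parent_start, parent_end, parent_strand, scale=1):
    """Map a 1-based, inclusive interval on a parent feature onto the genome."""
    if parent_strand == "-":
        # Local position 1 is the parent's right-most genomic base, so count leftwards.
        return parent_end - end * scale + 1, parent_end - (start - 1) * scale
    return parent_start + (start - 1) * scale, parent_start + end * scale - 1


def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into a dict (no URL-decoding; enough for a sketch)."""
    return dict(kv.split("=", 1) for kv in column9.split(";") if "=" in kv)


def load_parents(gff_lines):
    """Index the gene calls by ID: ID -> (seqid, start, end, strand)."""
    parents = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            parents[attrs["ID"]] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return parents


def rebase_line(line, parents, scale=1):
    """Rewrite one analysis-result line in genome coordinates."""
    line = line.rstrip("\n")
    if line.startswith("#") or not line.strip():
        return line
    cols = line.split("\t")
    attrs = parse_attributes(cols[8])
    # Assumption: the analysed sequence is the first token of Target, else the ID.
    key = attrs.get("Target", attrs.get("ID", "")).split(" ")[0]
    if key not in parents:
        return line  # nothing to rebase against; leave untouched
    seqid, p_start, p_end, p_strand = parents[key]
    new_start, new_end = rebase_interval(
        int(cols[3]), int(cols[4]), p_start, p_end, p_strand, scale
    )
    cols[0], cols[3], cols[4], cols[6] = seqid, str(new_start), str(new_end), p_strand
    return "\t".join(cols)


if __name__ == "__main__":
    genes = [
        "Merlin\tGeneMark.hmm\tgene\t2\t691\t-856.563659\t+\t.\tID=Merlin_1",
        "Merlin\tGeneMark.hmm\tgene\t1067\t2011\t-1229.683915\t-\t.\tID=Merlin_3",
    ]
    hit = "Merlin\tGene3D\tprotein_match\t2\t50\t2.9E-21\t+\t.\tID=m1;Target=Merlin_1 2 50"
    print(rebase_line(hit, load_parents(genes)))
```

With the default `scale=1`, the Gene3D match at 2..50 on Merlin_1 is rewritten to 3..51 on the plus strand, and a hit at 1..11 on Merlin_3 is rewritten to 2001..2011 on the minus strand.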


Ram · Answer 1 · 2015-05-04

9.6 years ago

Daniel Standage 4.1k

UPDATE: It turns out I misunderstood the original question. The response below is for transforming all coordinates for a sequence uniformly.

The gt gff3 command in the GenomeTools library has an -offset option that shifts feature coordinates by a fixed amount. This applies the same offset to all the data; alternatively, if you want to specify a separate offset for each sequence, you can use the -offsetfile option.
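To make concrete what a single uniform offset amounts to, here is a minimal Python sketch. It is not GenomeTools code and makes no claim about how `gt gff3 -offset` is implemented; `shift_gff3_line` is a hypothetical helper that assumes tab-separated GFF3 feature lines and leaves directives such as `##sequence-region` untouched.

```python
def shift_gff3_line(line, offset):
    """Add a fixed offset to the start and end columns of one GFF3 feature line."""
    line = line.rstrip("\n")
    if line.startswith("#") or not line.strip():
        return line  # directives, comments and blank lines pass through unchanged
    cols = line.split("\t")
    cols[3] = str(int(cols[3]) + offset)  # column 4: start
    cols[4] = str(int(cols[4]) + offset)  # column 5: end
    return "\t".join(cols)


if __name__ == "__main__":
    hit = "Merlin\tGene3D\tprotein_match\t2\t50\t2.9E-21\t+\t.\tID=m1"
    print(shift_gff3_line(hit, 1))
```

Every feature moves by the same amount and keeps its strand, so a shift of this kind cannot by itself place each match relative to its own parent feature or flip matches onto the minus strand.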


Thanks! I'd looked at GT a while back, but didn't know about the offsetfile option.

Do you know if it handles strandedness? E.g. if the analysed feature is on the minus strand at 1000-1200 and the match_part is 1-100, the final location should be minus strand, 1101-1200.

Reply written 9.6 years ago by rasche.eric

Now I'm not so sure I'm thinking about the same thing you are. Perhaps a couple of examples would help clarify things.

Reply written 9.6 years ago by Daniel Standage

Updated my post with a more descriptive example.

Reply written 9.6 years ago by rasche.eric