Question

GFF3 file format

0

8.9 years ago

rakeshmbb • 0

Hi everyone. I am currently working with a GFF3 file. When a feature is on the minus strand, why is its genomic start coordinate lower than its end coordinate? Shouldn't it be the reverse? For example, if a gene is on the minus strand, I would expect its start coordinate to be higher than its end coordinate.

Please help, I am confused. What I actually want is to measure the intergenic distances between a set of genes for further analysis.

Thank you in advance

sequence • 4.7k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.9 years ago by rakeshmbb • 0

score 1 · Answer 1 · 2016-02-05

1

8.9 years ago

Devon Ryan 105k

The start/end coordinates are on the "+" strand, regardless of whether the feature is on the "-" strand or not (i.e., for - strand features, the end is the start and the start is the end). This makes sorting and otherwise handling the files easier.
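As a minimal sketch of what this means in practice, the biological 5' and 3' ends of a feature can be recovered from the stored coordinates and the strand column. The function names `parse_gff3_line`, `five_prime_end` and `three_prime_end` are hypothetical helpers written for this illustration, not part of any library, and the example line is made up.

```python
def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into the fields used here."""
    cols = line.rstrip("\n").split("\t")
    # GFF3 columns: seqid, source, type, start, end, score, strand, phase, attributes
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),  # always <= end, given on the "+" strand
        "end": int(cols[4]),
        "strand": cols[6],
    }


def five_prime_end(feature):
    """Biological start: the 'end' column for '-' features, otherwise 'start'."""
    return feature["end"] if feature["strand"] == "-" else feature["start"]


def three_prime_end(feature):
    """Biological end: the 'start' column for '-' features, otherwise 'end'."""
    return feature["start"] if feature["strand"] == "-" else feature["end"]


# Made-up example line for a gene on the minus strand
example = "chr1\texample\tgene\t1000\t2000\t.\t-\t.\tID=gene1"
gene = parse_gff3_line(example)
```

For this made-up gene, `five_prime_end(gene)` is the larger coordinate, which matches the intuition in the question, while the file itself still stores start <= end.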

ADD COMMENT • link 8.9 years ago by Devon Ryan 105k

0

Thank you, Devon Ryan. I was thinking along the same lines, but I was not sure.

ADD REPLY • link 8.9 years ago by rakeshmbb • 0

Ram · Answer 2 · 2016-02-05

Your GFF file is correct. By the format's definition, the start coordinate must be less than or equal to the end coordinate, and all parsing libraries should handle the coordinates and strand correctly. In fact, this encoding makes sequence extraction with standard substring functions more efficient, because sequence data are always given in one direction only:

## Pseudocode, given start and end already ordered
for each feature in gff.features:
    subseq = substring(chromosome, feature.start, feature.end)
    # assumes substring() takes 1-based, inclusive positions, as GFF3 does;
    # convert the coordinates first if your language is 0-based
    subseq = reverse.complement(subseq) if feature.strand == "-"

## given start and end in no particular order, identified only by strand
for each feature in gff.features:
    (start, end) = sort(feature.start, feature.end)
    ## this is the operation we save each time we extract a feature
    ## it can be implemented in many ways, but always costs 1 or 4 redundant
    ## machine register operations:
    ## 1. a > b ?  2.-4. swap: tmp=a; a=b; b=tmp;
    subseq = substring(chromosome, start, end)
    subseq = reverse.complement(subseq) if feature.strand == "-"

Because extracting features is far more common than writing a GFF file (once vs. every time someone uses the genome file), guaranteeing start <= end saves operations on every feature access: 3 if we still sanity-check the order (the swap never happens), or 4 if we skip the check entirely. I am not saying this really was the reason, but it sounds plausible.
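For reference, the first loop could look roughly like this in Python. This is only a sketch: `reverse_complement` and `extract_feature` are hypothetical names, the chromosome is assumed to be an in-memory string of plain A/C/G/T/N bases, and the 1-based inclusive GFF3 coordinates are converted to Python's 0-based, half-open slices.

```python
# Translation table for complementing plain DNA bases (no IUPAC ambiguity codes)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(chromosome, start, end, strand):
    """Extract a feature given GFF3-style coordinates (1-based, inclusive, start <= end)."""
    # 1-based inclusive [start, end] becomes the 0-based half-open slice [start-1, end)
    subseq = chromosome[start - 1:end]
    return reverse_complement(subseq) if strand == "-" else subseq


# Toy sequence, not real data
chromosome = "AACCGGTT"
plus_seq = extract_feature(chromosome, 3, 5, "+")
minus_seq = extract_feature(chromosome, 3, 5, "-")
```

No sorting or comparison of the two coordinates is needed; the strand only decides whether the extracted slice is reverse-complemented.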

score 0 · Answer 3 · 2016-02-05

0

8.9 years ago

Thibault D. ▴ 700

Hi rakeshmbb,

The GFF3 specification does not require it, but most GFF3 files are sorted by ascending position. This ordering "reverses" the order of the features of genes on the minus strand.
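Since the goal in the question is intergenic distance, here is a rough sketch of how the position-sorted coordinates could be used for that. It assumes the gene records have already been reduced to `(seqid, start, end)` tuples with 1-based, inclusive coordinates; `intergenic_distances` is a hypothetical helper, strand is deliberately ignored, and overlapping genes are treated as having no gap between them.

```python
def intergenic_distances(genes):
    """Return (seqid, left_end, right_start, gap_length) for consecutive genes.

    genes: iterable of (seqid, start, end) with 1-based inclusive coordinates
    and start <= end, as stored in GFF3.
    """
    by_seq = {}
    for seqid, start, end in genes:
        by_seq.setdefault(seqid, []).append((start, end))

    gaps = []
    for seqid, intervals in by_seq.items():
        intervals.sort()  # ascending start position
        prev_end = intervals[0][1]
        for start, end in intervals[1:]:
            if start > prev_end:
                # number of bases strictly between the two genes
                gaps.append((seqid, prev_end, start, start - prev_end - 1))
            # keep the furthest end seen so far, so nested or overlapping genes are skipped
            prev_end = max(prev_end, end)
    return gaps


# Made-up coordinates for illustration
genes = [("chr1", 100, 200), ("chr1", 301, 400), ("chr1", 350, 500), ("chr2", 10, 20)]
gaps = intergenic_distances(genes)
```

Because both coordinates are given on the "+" strand with start <= end, the same subtraction works for genes on either strand.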

ADD COMMENT • link 8.9 years ago by Thibault D. ▴ 700