Question

How to group features based on ID in gff

1

Entering edit mode

9.0 years ago

markcharder ▴ 10

Hi all,

What I want to do seems like it should be relatively simple but I lack the programming know-how to do it and have not found any examples anywhere on the internet, so I really hope someone can help!

I currently have a GFF file that is missing parent lines, with examples like so:

sequence  source  EST_match  1  20  .  -  0  ID=ID1;
sequence  source  EST_match  25  30  .  -  0  ID=ID2;
sequence  source  EST_match  35  50  .  -  0  ID=ID2;
sequence  source  EST_match  90  110  .  -  0  ID=ID3;

In this case, the two features labelled ID=2 should be a part of the same parent feature. So what I want to do is remove IDs from features with the same IDs and create a parent feature that appears before them in the gff with coordinates spanning the child feature coordinates, like so:

sequence  source  EST_match  1  20  .  -  0  ID=ID1;
sequence  source  EST_match  25  50  .  -  0  ID=2;
sequence  source  EST_match  25  30  .  -  0  Parent=2
sequence  source  EST_match  35  50  .  -  0  Parent=2
sequence  source  EST_match  90  110  .  -  0  ID=ID3;

I have written something in perl that compares the current line with the previous and next lines and skips the ID field if lines contain the same IDs but I cannot seem to work out how to create parent IDs.

Looking forward to any suggestions!

Thank you.

Mark

GFF3 GFF • 3.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by markcharder ▴ 10

0

Entering edit mode

I've seen a few different conventions for encoding alignments in GFF3. The first example you gave is perfectly valid (as far as the GFF3 specification is concerned: individual tools may have specific requirements).

sequence  source  EST_match  1   20   .  -  .  ID=ID1;
sequence  source  EST_match  25  30   .  -  .  ID=ID2;
sequence  source  EST_match  35  50   .  -  .  ID=ID2;
sequence  source  EST_match  90  110  .  -  .  ID=ID3;

The two lines with ID=ID2 represent a single alignment. In GFF3, this is called a "multi-feature", a feature that is discontiguous and thus requires multiple entries to specify its structure. You see this frequently with CDS features in protein-coding genes.

Another common convention is as follows.

sequence  source  EST_match  1   20   .  -  .  ID=ID1;
sequence  source  match_part 1   20   .  -  .  Parent=ID1;
sequence  source  EST_match  25  50   .  -  .  ID=ID2;
sequence  source  match_part 25  30   .  -  .  Parent=ID2;
sequence  source  match_part 35  50   .  -  .  Parent=ID2;
sequence  source  EST_match  90  110  .  -  .  ID=ID3;
sequence  source  match_part 90  110  .  -  .  Parent=ID3;

I don't think it's valid (with respect to the Sequence Ontology) to have an EST_match feature be a child of another EST_match feature. Also, the 8th column (phase) is only relevant to CDS features, and should be a period/full stop for all other feature types, not 0.

ADD REPLY • link 9.0 years ago by Daniel Standage 4.1k

Ram · Answer 1 · 2015-11-06

4

Entering edit mode

9.0 years ago

Andrzej Zielezinski 11k

Save the following Python script as parentsGFF.py

from itertools import groupby
import re
import sys

fh = open(sys.argv[1])
oh = open(sys.argv[1]+'.out', 'w')

for id, ls in groupby(fh, lambda x: re.split('\s+', x)[8]):
   ls = [re.split('\s+', l.strip()) for l in ls]
   if len(ls)==1:
       oh.write("\t".join(ls[0])+'\n')
   else:
       st = min([int(l[3]) for l in ls])
       en = max([int(l[4]) for l in ls])
       fl = ls[0][:]
       fl[3] = str(st)
       fl[4] = str(en)
       id = fl[8].split('=')[1]
       oh.write("\t".join(fl)+'\n')
       for l in ls:
           l[8] = 'Parent='+id
           oh.write("\t".join(l)+'\n')
fh.close()
oh.close()

Input GFF file: test.gff

sequence  source  EST_match  1  20  .  -  0  ID=ID1;
sequence  source  EST_match  25  30  .  -  0  ID=ID2;
sequence  source  EST_match  35  50  .  -  0  ID=ID2;
sequence  source  EST_match  90  110  .  -  0  ID=ID3;

Run it:

python parentsGFF.py test.gff

Output GFF file: test.gff.out

sequence    source    EST_match    1    20    .    -    0    ID=ID1;
sequence    source    EST_match    25    50    .    -    0    ID=ID2;
sequence    source    EST_match    25    30    .    -    0    Parent=ID2;
sequence    source    EST_match    35    50    .    -    0    Parent=ID2;
sequence    source    EST_match    90    110    .    -    0    ID=ID3;

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

I dont know what is the purpose of this file, but some tools will complain if not all the features (lines) have an ID. So it would be good to add that. Moreover, in similar cases of yours, usualy the childs are called match_part (second row). But I don't think it will affect dowstream analysis.

ADD REPLY • link 9.0 years ago by Juke34 8.9k

0

Entering edit mode

Hi, thanks a lot for that, it's almost there! Unfortunately however, it writes the coordinates of the first child sequence as "25 50" in the example, when it should be "25 30". I have no idea how to understand Python as have just started with Perl, sorry.

ADD REPLY • link 9.0 years ago by markcharder ▴ 10

0

Entering edit mode

Sorry about that. I just modified my answer.

ADD REPLY • link 9.0 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Hi, thanks again, managed to fix the bug (below). This is a huge problem-solver for me. Cheers,

Mark

ADD REPLY • link 9.0 years ago by markcharder ▴ 10