Hi all,
What I want to do seems like it should be relatively simple, but I lack the programming know-how to do it and have not found any examples online, so I really hope someone can help!
I currently have a GFF file that is missing parent lines, with examples like so:
sequence source EST_match 1 20 . - 0 ID=ID1;
sequence source EST_match 25 30 . - 0 ID=ID2;
sequence source EST_match 35 50 . - 0 ID=ID2;
sequence source EST_match 90 110 . - 0 ID=ID3;
In this case, the two features labelled ID=ID2 should be part of the same parent feature. So what I want to do is remove the IDs from features that share an ID and create a parent feature that appears before them in the GFF, with coordinates spanning the child features' coordinates, like so:
sequence source EST_match 1 20 . - 0 ID=ID1;
sequence source EST_match 25 50 . - 0 ID=2;
sequence source EST_match 25 30 . - 0 Parent=2
sequence source EST_match 35 50 . - 0 Parent=2
sequence source EST_match 90 110 . - 0 ID=ID3;
I have written something in Perl that compares the current line with the previous and next lines and skips the ID field if the lines contain the same ID, but I cannot work out how to create the parent IDs.
Looking forward to any suggestions!
Thank you.
Mark
I've seen a few different conventions for encoding alignments in GFF3. The first example you gave is perfectly valid as far as the GFF3 specification is concerned, although individual tools may have specific requirements.
The two lines with ID=ID2 represent a single alignment. In GFF3, this is called a "multi-feature", a feature that is discontiguous and thus requires multiple entries to specify its structure. You see this frequently with CDS features in protein-coding genes.
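Another common convention is to represent the alignment as a parent feature with one child feature per aligned segment, each child pointing to the parent through its Parent attribute.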
I don't think it's valid (with respect to the Sequence Ontology) to have an EST_match feature be a child of another EST_match feature. Also, the 8th column (phase) is only relevant to CDS features, and should be a period/full stop for all other feature types, not 0.
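If you do need the parent/child layout, here is a rough Python sketch of the grouping step. It is only a sketch under some assumptions: the file is tab-delimited with nine columns, every feature has an ID attribute, and the children are retyped to match_part (that choice, like the function name add_parents and its child_type parameter, is mine and not something your tool necessarily expects). It also sets the phase column to a period for anything that is not a CDS. Comment and directive lines are dropped, and features are grouped by ID in first-seen order, so lines that share an ID end up adjacent even if they were not in the input.

```python
from collections import OrderedDict


def add_parents(lines, child_type="match_part"):
    """Group features sharing an ID under a new parent line spanning them.

    Assumes tab-delimited, 9-column GFF3 lines that all carry an ID attribute.
    """
    groups = OrderedDict()  # ID -> list of column lists, in first-seen order
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if cols[2] != "CDS":
            cols[7] = "."  # phase is only meaningful for CDS features
        attrs = dict(
            kv.strip().split("=", 1)
            for kv in cols[8].rstrip(";").split(";")
            if "=" in kv
        )
        groups.setdefault(attrs["ID"], []).append(cols)

    out = []
    for feat_id, feats in groups.items():
        if len(feats) == 1:
            # Single-segment feature: keep it as it is.
            out.append("\t".join(feats[0]))
            continue
        # Parent spans the smallest start to the largest end of its segments.
        parent = list(feats[0])
        parent[3] = str(min(int(f[3]) for f in feats))
        parent[4] = str(max(int(f[4]) for f in feats))
        parent[8] = "ID=" + feat_id
        out.append("\t".join(parent))
        for f in feats:
            child = list(f)
            child[2] = child_type  # assumed child type, change as needed
            child[8] = "Parent=" + feat_id
            out.append("\t".join(child))
    return out
```

You could call it as add_parents(open("input.gff")) and print each returned line. Pass a different child_type if your downstream tool expects another feature type for the segments.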