I have inherited some GFF files in which features often share a start or end coordinate. I need to collapse each such set of features to its longest instance. An example problem and the expected solution are below. Are there readily available tools that can do this, or does someone have a Perl / Python / Bioconductor script for it? Thanks!
PROBLEM
case 1 - plus strand, common start coord
chr1 fBS CDS 1000 2000 + . PfamID1
chr1 fBS CDS 1000 3000 + . PfamID1
chr1 fBS CDS 1000 4000 + . PfamID1
chr1 fBS CDS 1000 5000 + . PfamID1
normal
chr2 fBS CDS 9000 10000 + . PfamID1
case 2 - minus strand, common end coord
chr4 fBS CDS 5000 1000 - . PfamID1
chr4 fBS CDS 4000 1000 - . PfamID1
chr4 fBS CDS 3000 1000 - . PfamID1
normal
chr9 fBS CDS 6431 15000 + . PfamID1
case 3 - plus strand, common end coord
chr10 fBS CDS 1000 5000 + . PfamID2
chr10 fBS CDS 2000 5000 + . PfamID2
chr10 fBS CDS 3000 5000 + . PfamID2
chr10 fBS CDS 4000 5000 + . PfamID2
case 4 - minus strand, common start coord
chr12 fBS CDS 5000 4000 - . PfamID2
chr12 fBS CDS 5000 3000 - . PfamID2
chr12 fBS CDS 5000 2000 - . PfamID2
SOLUTION - should contain only 6 lines after collapsing each of cases 1, 2, 3 and 4 into one line each
chr1 fBS CDS 1000 5000 + . PfamID1
chr2 fBS CDS 9000 10000 + . PfamID1
chr4 fBS CDS 5000 1000 - . PfamID1
chr9 fBS CDS 6431 15000 + . PfamID1
chr10 fBS CDS 1000 5000 + . PfamID2
chr12 fBS CDS 5000 2000 - . PfamID2
If I am not mistaken, in GFF the "start" coordinate must always be less than or equal to the "end" coordinate, so the minus-strand lines above (for example start 5000, end 1000) are not valid GFF.
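For what it's worth, here is a minimal Python sketch of the collapsing step, not a tested tool. It assumes the column layout used in the example in the question (8 whitespace-separated columns: seqid, source, type, start, end, strand, score, attribute), which is not the standard tab-separated GFF layout. It also assumes that features are only comparable when seqid, source, type, strand and attribute all match, and that features linked transitively (A shares a start with B, B shares an end with C) belong to one set. The function name `collapse_features` is made up for this illustration. Length is taken as `abs(end - start)`, so it works whether or not the minus-strand coordinates are swapped.

```python
def collapse_features(lines):
    """Keep only the longest record from each set of records that share
    a start or an end coordinate (within the same seqid/source/type/
    strand/attribute). Column layout is assumed to match the example
    in the question, not standard GFF."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip anything that is not an 8-column record with numeric coords
        # (for example the "case 1 ..." and "normal" labels).
        if len(fields) != 8 or not (fields[3].isdigit() and fields[4].isdigit()):
            continue
        records.append(fields)

    # Union-find over record indices; the root is always the smallest
    # index in the set, i.e. the first record seen.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Link records that share a start or an end within the same group.
    first_seen = {}
    for idx, f in enumerate(records):
        group = (f[0], f[1], f[2], f[5], f[7])
        for which, coord in (("start", f[3]), ("end", f[4])):
            key = group + (which, coord)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # For each set, remember the longest record (first one wins on ties).
    best = {}
    for idx, f in enumerate(records):
        root = find(idx)
        length = abs(int(f[4]) - int(f[3]))
        if root not in best or length > best[root][0]:
            best[root] = (length, idx)

    # Emit survivors in order of first appearance of their set.
    return [" ".join(records[best[root][1]]) for root in sorted(best)]


if __name__ == "__main__":
    demo = [
        "chr1 fBS CDS 1000 2000 + . PfamID1",
        "chr1 fBS CDS 1000 5000 + . PfamID1",
        "chr2 fBS CDS 9000 10000 + . PfamID1",
    ]
    for row in collapse_features(demo):
        print(row)
```

Because the sketch splits on whitespace, it would break on attribute columns that contain spaces; for real GFF files you would split on tabs and adjust the column indices accordingly.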