Is There A Diff For Gff Files (Gffdiff)?
2
5
Entering edit mode
11.2 years ago
Michael 55k

I was wondering if there is a tool to compare two genome annotations or more specifically GFF3 files and output a set of differences in the style of the unix tool diff. The matching of genomic features in both files should be based on the IDs for all features, but could also include optional positional matching for features that do not have IDs (however this might be a bit tricky). Certainly, the order of features and lines in the file should not matter (that's why diff wouldn't work), and changes to position, sequence type, and tags should be taken into account. The output should be decomposed into a set of 'delete', 'insert', and optionally 'update' operations, yielding two or three GFF files containing features to delete, insert, and update. (Writing this, it sounds like a database backend might be part of an efficient solution).

It would be neat if the tools also provides output in a more verbose format like: "In file2 feature exon_1 of type exon was removed; feature transcript_1 of type transcript was shortened by 100bp at the 5' end".

Does such a script exist, and if not, it doesn't look too difficult, so should we make it?

gff gff3 • 5.3k views
ADD COMMENT
0
Entering edit mode

I guess it would essentially be a bedtools intersect type tool that is ID aware?

**edit. oh cool. When did Biostar start doing auto text links for popular software? I didn't link bedtools to the google code website and it did it auto-magically.

ADD REPLY
0
Entering edit mode

Yeah, that is a good feature, also works for SAM (links to sam format or, samtools, but it might also mean "significance analysis of microarrays").

OT it might be something like bedtools intersect+IDs+tags+type, but I'd prefer that it reads GFF natively.

ADD REPLY
2
Entering edit mode

Yeah letting people see what changed between annotation versions would be useful.

I can imagine some ambiguous cases where it might be harder to distinguish between multiple type of updates. For example, a fused gene annotation that was split into two separate genes in a later version. How do we distinguish between split and a deletion of old gene + insertion of two new genes. I guess the tool wouldn't have to be that sophisticated.

ADD REPLY
0
Entering edit mode

How would a fused gene look in terms of IDs? I thought of pushing that burden up to the annotator, who decides on the choice of appropriate stable IDs. A fused gene might look like: GeneID1(start1, end1) + GeneID2(start2, end2) -> GeneID1(start1, end2), but I guess technically, there is no way to be sure that the annotator meant to fuse the genes.

ADD REPLY
0
Entering edit mode

I guess you can categorize fused gene annotation -> 2 split gene annotations as a "refinement" update. Whenever annotations in the current version are completely within an annotation in the old version, mark it as a "refinement". So this could be a gene annotation with too many extra bases in the UTRs trimmed to a shorter annotation or fused gene becoming 2 separate genes.

The opposite case I guess can be categorized as an "expansion" where an annotation in the older version is completely within the annotation of a newer version.

ADD REPLY
0
Entering edit mode

So, I am going to assume that such a tool does not exist?

ADD REPLY
0
Entering edit mode

Found this: http://www.kivinen.iki.fi/nwn/docs/gffdiff.html (comes up in a goole search at the top rank), just in case this comes up for someone, it has nothing to do with this question. It is for a very different sort of gff file:http://social.bioware.com/wiki/datoolset/index.php/GFF

ADD REPLY
2
Entering edit mode
10.8 years ago
Michael 55k

Here is one quick hack that works on the IDs of gff features, because I needed one right now. It is useful for e.g. the CHADO bulk loader gmod_bulk_load_gff3, because it has two options to either upload or delete features, but not to import only new or changed features.

Usage:

gffdiff file1.gff3 file2.gff3

Output: 3 files

  • gffdiff_delete.gff3: features to delete from the database containing features from file1 (deleted and changed)
  • gffdiff_create.gff3: features to create (new features or changed features)
  • gffdiff_conflicts: Conflicting position/ID features (features which have identical position but no ID in file 1 and new ID in file2)

It has a very rudimentary conflict detection, currently only for identical positioned features which get a new or loose their ID. Renaming a feature leads to a delete and create entry for the old and new feature. Using this method, it will be hard or impossible to delete features that have no ID depending on the bulk loader.

I'll put a little improved version with documentation and some command-line parameters on github later.


#!/usr/bin/env perl
use strict;
use warnings;

use Bio::Tools::GFF;
use Bio::SeqFeatureI;

my %fstore1 = ();
my %fstore2 = ();
my %conflicts = ();

# specify 2 input files to compare
my $gffio1 = Bio::Tools::GFF->new(-file => $ARGV[0], -gff_version => 3);
my $gffio2 = Bio::Tools::GFF->new(-file => $ARGV[1], -gff_version => 3);
# gff file that contains new features in file 2
my $create = Bio::Tools::GFF->new(-gff_version => 3,
                                  -file        => ">gffdiff_create.gff3");
# gff file that contains the fieatures missing in file 2 from file 1 
my $delete = Bio::Tools::GFF->new(-gff_version => 3,
                                  -file        => ">gffdiff_delete.gff3");
# file with conflicting features, e.g. position identical but different ID
my $conffh;
open ($conffh, '>' , "gffdiff_conflicts") or die "couldn't open gffdiff_conflicts $!\n";

# loop over the input stream file 1
while(my $feature = $gffio1->next_feature()) {
   my $id = ($feature->primary_id) ? 
     $feature->primary_id : new_id($feature);
   ### some sanity checks 
   die "something went wrong with the ID" unless $id;
   die "duplicate feature ID $id" if exists $fstore1{$id};  
   $fstore1{$id} = $feature;
}
print "read ".scalar ( keys %fstore1)." unique features from file 1\n"; 
$gffio1->close();
# read features of second file
my $new = my $changed = my $conflicts = my $deleted = 0;
while(my $feature = $gffio2->next_feature()) {
  my $id = ($feature->primary_id) ? 
    $feature->primary_id : new_id($feature);
  $fstore2{$id} = $feature;
  ### some sanity checks 
  die "something went wrong with the ID" unless $id;
  my @what = ();
  if (my $a = is_new($id, $feature)) {
    # print "Feature $id is new\n";
    if (ref $a) {
      $conflicts++;
      ## store conflict not to delete it later
      ## TODO: disable this block via command-line option --ignore-conflicts
      $conflicts{new_id($a)} = 1;
      print $conffh " >>>>> ".$ARGV[0]."\nC: ".$a->gff_string()."\n-----\n";
      print $conffh " <<<<< " .$ARGV[1]."\nC: ".$feature->gff_string()."\n<<<<<\n";
    } else {
      $create->write_feature($feature);
      $new++;
    }
  } 
   elsif (@what = what_changed($id, $feature)) {
     ### TODO: Allow for a 'natural language' report: 
     #print "Feature $id has different ". (join " and ", @what)."\n";
     #print "- ".$fstore1{$id}->gff_string."\n";
     $delete->write_feature($fstore1{$id});
     #print "+ ".$feature->gff_string()."\n";
     $create->write_feature($feature);
     $changed++;

   } else {     
    # print "$id unchanged\n";
   }

}
 $gffio2->close();
print "read ".scalar ( keys %fstore2)." unique features from file 2\n"; 

## check for features present in file1 and missing in file 2
foreach my $key (keys %fstore1) {
  if (! exists $fstore2{$key}) {
    # protect feature from deletion if it is already known as an ID conflict
    if (! exists $conflicts{new_id($fstore1{$key})} ) {
      $delete->write_feature($fstore1{$key});
      $deleted++;
    }
  }

}
$create->close;
$delete->close;
close $conffh or die $!;
print "$new new features, $changed changed, $conflicts conflicting, $deleted deleted features\n";

## create a new id based on landmark, type, strand, start and end.
## allows to identify exactly matching features
## need a little more refined mechanism of range overlap (best as in IRanges though)
sub new_id {
  my $feature = shift;
  my $start = $feature->start;
  my $end = $feature->end;
  my $strand = $feature->strand;   
  my $type = $feature->primary_tag();
  my $seq_id = $feature->seq_id();
  return "$seq_id:$type:$start:$end:$strand";
}

sub is_new {
 my $id = shift;
 my $feature = shift;
 if (! exists $fstore1{$id}) {
   if (! exists $fstore1{new_id($feature)}) {
     return 1;
   } else {
     return $fstore1{new_id($feature)}
   }
 }
 return 0;
}

sub what_changed {
   my ($id, $feature) = @_;
   die "no id" unless $id; 
   die "no feature for id $id" unless ref $feature;
   unless  (ref  $fstore1{$id}) {
     warn "$id not found, shouldn't happen:".$fstore1{$id};
     return ();
   }
   my @what = ();
   ## quick and dirty check
   return () if $feature->gff_string eq $fstore1{$id}->gff_string;
   ## now we have to find out what's different (we ignore most tags for now):
   #$feature->strand ne $fstore1{$id}->strand ||
   push @what, "position/strand" unless ($feature->equals($fstore1{$id}), 'strong');
   push @what, "phase" if   $feature->phase ne $fstore1{$id}->phase;
   push @what, "primary_tag" if  $feature->primary_tag() ne $fstore1{$id}->primary_tag();
   push @what, "source_tag" if  $feature->source_tag() ne $fstore1{$id}->source_tag();
   push @what, "display_id" if  $feature->display_id() ne $fstore1{$id}->display_id();
   ## TODO: check all tags: TODO:
   ### my %tags1 = map $fstore1{$id}->get_all_tags();
   return @what;
}
ADD COMMENT
0
Entering edit mode
10.2 years ago

Sorry for the late response, but the ParsEval program from the AEGeAn Toolkit was designed precisely for this use case.

ADD COMMENT

Login before adding your answer.

Traffic: 2671 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6