Here is one quick hack that works on the IDs of gff features, because I needed one right now. It is useful for e.g. the CHADO bulk loader gmod_bulk_load_gff3
, because
it has two options to either upload or delete features, but not to import only new or changed features.
Usage:
gffdiff file1.gff3 file2.gff3
Output: 3 files
- gffdiff_delete.gff3: features to delete from the database containing features from file1 (deleted and changed)
- gffdiff_create.gff3: features to create (new features or changed features)
- gffdiff_conflicts: Conflicting position/ID features (features which have identical position but no ID in file 1 and new ID in file2)
It has a very rudimentary conflict detection, currently only for identical positioned features which get a new or loose
their ID. Renaming a feature leads to a delete and create entry for the old and new feature. Using this method, it
will be hard or impossible to delete features that have no ID depending on the bulk loader.
I'll put a little improved version with documentation and some command-line parameters on github later.
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Tools::GFF;
use Bio::SeqFeatureI;
my %fstore1 = ();
my %fstore2 = ();
my %conflicts = ();
# specify 2 input files to compare
my $gffio1 = Bio::Tools::GFF->new(-file => $ARGV[0], -gff_version => 3);
my $gffio2 = Bio::Tools::GFF->new(-file => $ARGV[1], -gff_version => 3);
# gff file that contains new features in file 2
my $create = Bio::Tools::GFF->new(-gff_version => 3,
-file => ">gffdiff_create.gff3");
# gff file that contains the fieatures missing in file 2 from file 1
my $delete = Bio::Tools::GFF->new(-gff_version => 3,
-file => ">gffdiff_delete.gff3");
# file with conflicting features, e.g. position identical but different ID
my $conffh;
open ($conffh, '>' , "gffdiff_conflicts") or die "couldn't open gffdiff_conflicts $!\n";
# loop over the input stream file 1
while(my $feature = $gffio1->next_feature()) {
my $id = ($feature->primary_id) ?
$feature->primary_id : new_id($feature);
### some sanity checks
die "something went wrong with the ID" unless $id;
die "duplicate feature ID $id" if exists $fstore1{$id};
$fstore1{$id} = $feature;
}
print "read ".scalar ( keys %fstore1)." unique features from file 1\n";
$gffio1->close();
# read features of second file
my $new = my $changed = my $conflicts = my $deleted = 0;
while(my $feature = $gffio2->next_feature()) {
my $id = ($feature->primary_id) ?
$feature->primary_id : new_id($feature);
$fstore2{$id} = $feature;
### some sanity checks
die "something went wrong with the ID" unless $id;
my @what = ();
if (my $a = is_new($id, $feature)) {
# print "Feature $id is new\n";
if (ref $a) {
$conflicts++;
## store conflict not to delete it later
## TODO: disable this block via command-line option --ignore-conflicts
$conflicts{new_id($a)} = 1;
print $conffh " >>>>> ".$ARGV[0]."\nC: ".$a->gff_string()."\n-----\n";
print $conffh " <<<<< " .$ARGV[1]."\nC: ".$feature->gff_string()."\n<<<<<\n";
} else {
$create->write_feature($feature);
$new++;
}
}
elsif (@what = what_changed($id, $feature)) {
### TODO: Allow for a 'natural language' report:
#print "Feature $id has different ". (join " and ", @what)."\n";
#print "- ".$fstore1{$id}->gff_string."\n";
$delete->write_feature($fstore1{$id});
#print "+ ".$feature->gff_string()."\n";
$create->write_feature($feature);
$changed++;
} else {
# print "$id unchanged\n";
}
}
$gffio2->close();
print "read ".scalar ( keys %fstore2)." unique features from file 2\n";
## check for features present in file1 and missing in file 2
foreach my $key (keys %fstore1) {
if (! exists $fstore2{$key}) {
# protect feature from deletion if it is already known as an ID conflict
if (! exists $conflicts{new_id($fstore1{$key})} ) {
$delete->write_feature($fstore1{$key});
$deleted++;
}
}
}
$create->close;
$delete->close;
close $conffh or die $!;
print "$new new features, $changed changed, $conflicts conflicting, $deleted deleted features\n";
## create a new id based on landmark, type, strand, start and end.
## allows to identify exactly matching features
## need a little more refined mechanism of range overlap (best as in IRanges though)
sub new_id {
my $feature = shift;
my $start = $feature->start;
my $end = $feature->end;
my $strand = $feature->strand;
my $type = $feature->primary_tag();
my $seq_id = $feature->seq_id();
return "$seq_id:$type:$start:$end:$strand";
}
sub is_new {
my $id = shift;
my $feature = shift;
if (! exists $fstore1{$id}) {
if (! exists $fstore1{new_id($feature)}) {
return 1;
} else {
return $fstore1{new_id($feature)}
}
}
return 0;
}
sub what_changed {
my ($id, $feature) = @_;
die "no id" unless $id;
die "no feature for id $id" unless ref $feature;
unless (ref $fstore1{$id}) {
warn "$id not found, shouldn't happen:".$fstore1{$id};
return ();
}
my @what = ();
## quick and dirty check
return () if $feature->gff_string eq $fstore1{$id}->gff_string;
## now we have to find out what's different (we ignore most tags for now):
#$feature->strand ne $fstore1{$id}->strand ||
push @what, "position/strand" unless ($feature->equals($fstore1{$id}), 'strong');
push @what, "phase" if $feature->phase ne $fstore1{$id}->phase;
push @what, "primary_tag" if $feature->primary_tag() ne $fstore1{$id}->primary_tag();
push @what, "source_tag" if $feature->source_tag() ne $fstore1{$id}->source_tag();
push @what, "display_id" if $feature->display_id() ne $fstore1{$id}->display_id();
## TODO: check all tags: TODO:
### my %tags1 = map $fstore1{$id}->get_all_tags();
return @what;
}
I guess it would essentially be a bedtools intersect type tool that is ID aware?
**edit. oh cool. When did Biostar start doing auto text links for popular software? I didn't link bedtools to the google code website and it did it auto-magically.
Yeah, that is a good feature, also works for SAM (links to sam format or, samtools, but it might also mean "significance analysis of microarrays").
OT it might be something like bedtools intersect+IDs+tags+type, but I'd prefer that it reads GFF natively.
Yeah letting people see what changed between annotation versions would be useful.
I can imagine some ambiguous cases where it might be harder to distinguish between multiple type of updates. For example, a fused gene annotation that was split into two separate genes in a later version. How do we distinguish between split and a deletion of old gene + insertion of two new genes. I guess the tool wouldn't have to be that sophisticated.
How would a fused gene look in terms of IDs? I thought of pushing that burden up to the annotator, who decides on the choice of appropriate stable IDs. A fused gene might look like:
GeneID1(start1, end1) + GeneID2(start2, end2) -> GeneID1(start1, end2)
, but I guess technically, there is no way to be sure that the annotator meant to fuse the genes.I guess you can categorize fused gene annotation -> 2 split gene annotations as a "refinement" update. Whenever annotations in the current version are completely within an annotation in the old version, mark it as a "refinement". So this could be a gene annotation with too many extra bases in the UTRs trimmed to a shorter annotation or fused gene becoming 2 separate genes.
The opposite case I guess can be categorized as an "expansion" where an annotation in the older version is completely within the annotation of a newer version.
So, I am going to assume that such a tool does not exist?
Found this: http://www.kivinen.iki.fi/nwn/docs/gffdiff.html (comes up in a goole search at the top rank), just in case this comes up for someone, it has nothing to do with this question. It is for a very different sort of gff file:http://social.bioware.com/wiki/datoolset/index.php/GFF