Question

When annotating features on sequences with BioPerl, what is the best way to pass the annotated sequences between scripts?

0

14.6 years ago

Ryan Thompson ★ 3.6k

I am writing a series of scripts that perform and then report miRNA binding predictions, and I need to get the data from one script to the next. The input to the whole pipeline is any rich sequence format, such as EMBL or GenBank. The first script reads the input sequences and adds features to them, and the second script reports on those features. My original plan was to have the first script simply write the sequences out in GenBank format. However, I realized that I couldn't save subfeatures if I did this. Furthermore, if I annotate the predicted binding sites using Bio::SeqFeature::Computation instead of Bio::SeqFeature::Generic, I doubt that the second script will magically detect that the features in the GenBank file are supposed to be converted into Bio::SeqFeature::Computation objects.

So is something like Storable or Data::Dump a viable alternative for data interchange between my scripts? Are there any caveats I should be aware of when freezing/thawing blessed objects instead of naked data structures?

bioperl data perl sequence • 3.1k views

updated 14.6 years ago by Biosidd ▴ 40 • written 14.6 years ago by Ryan Thompson ★ 3.6k

score 1 · Answer 1 · 2010-05-08

1

14.6 years ago

Biosidd ▴ 40

You could use SeqFeature storage (Bio::DB::SeqFeature::Store) to store the output of one script and have the next script pick it up from there. Any of the file-based storage backends, such as berkeleydb3, bdb, or sqlite, would be sufficient. Internally, the SeqFeature store uses Data::Dumper or Storable to serialize the features.
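For the plain Storable route raised in the question, here is a minimal sketch of a round trip for a blessed object with a subfeature. It uses only core Perl modules. `My::Feature` and `round_trip` are hypothetical stand-ins invented for illustration, not BioPerl classes, and I have not checked that every BioPerl feature class survives the round trip equally well.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Hypothetical stand-in for an annotated feature class. A real pipeline
# would use BioPerl classes such as Bio::SeqFeature::Generic instead.
{
    package My::Feature;

    sub new {
        my ($class, %args) = @_;
        return bless { %args, subfeatures => [] }, $class;
    }

    sub add_subfeature {
        my ($self, $sub) = @_;
        push @{ $self->{subfeatures} }, $sub;
        return $self;
    }

    sub subfeatures { return @{ $_[0]{subfeatures} } }
    sub start       { return $_[0]{start} }
}

# Write an object to a temporary file and read it back, mimicking the
# hand-off from the first script to the second.
sub round_trip {
    my ($object) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    close $fh;
    nstore($object, $path);    # what the first script would do
    return retrieve($path);    # what the second script would do
}

my $site = My::Feature->new(
    primary_tag => 'miRNA_binding_site', start => 120, end => 141,
);
$site->add_subfeature(
    My::Feature->new(primary_tag => 'seed_match', start => 134, end => 141),
);

my $copy = round_trip($site);

# The class name and the nested subfeature both survive the round trip.
printf "%s with %d subfeature(s)\n", ref $copy, scalar($copy->subfeatures);
# prints "My::Feature with 1 subfeature(s)"
```

Some caveats with blessed objects: Storable restores the blessing but does not load the class for you, so the reading script has to `use` the relevant modules itself before calling methods on the retrieved objects. Code references inside an object are not serialized by default. `nstore` writes in network byte order, which helps if the two scripts run on different machines.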

0

Nice idea, but that doesn't cover the sequences. Ideally, I'd like to store the sequences with all their annotations in a single file. Hence my original decision to use GenBank format.

14.6 years ago by Ryan Thompson ★ 3.6k

0

I believe it does load the sequences if your GFF3 file has them defined under the ##FASTA directive or if your SeqFeature has an attached sequence. In your case, it seems you could just run the EMBL/GenBank file through the Bio::DB::Flat family of modules and then access it as a database.

14.6 years ago by Biosidd ▴ 40

0

I ended up implementing my own custom "unflatten" subroutine for my particular data, but this is probably a better answer.

14.5 years ago by Ryan Thompson ★ 3.6k