In BioPerl, a sequence object can have any number of features, and each of these can have subfeatures nested within them. For example, a feature may be a complete coding sequence of a gene, and its subfeatures might be individual exons that are concatenated to form the full coding sequence.
However, when I use BioPerl to write a sequence object to a file in genbank or embl format, only the top-level features are written to the file, not the sub-features nested within the top-level features. How can I store my subfeatures in sequence files? Should I just convert all my subfeatures into top-level features, and then reconstruct the tree structure next time I read in the sequence?
Example code:
#!/usr/bin/perl
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
# Read in a sequence
my $sequence = Bio::SeqIO->newFh(
-file => $ARGV[0],
)->getline();
# Add the main feature
my $main_feature = Bio::SeqFeature::Generic->new(
-start => 1,
-end => 100,
-primary_tag => 'main_feature',
);
$sequence->add_SeqFeature($main_feature);
# Add some subfeatures
for my $i (1..10) {
my $subfeature = Bio::SeqFeature::Generic->new(
-start => ($i * 10 - 9),
-end => ($i * 10),
-primary_tag => "Sub-feature $i",
);
$main_feature->add_SeqFeature($subfeature);
}
# Write the sequence to the output file. Subfeatures are not saved.
Bio::SeqIO->new(
-fh => *STDOUT{IO},
-format => "genbank",
)->write_seq($sequence) or die "FAIL";
Run the example code on any sequence file. It will print a genbank sequence record to STDOUT. This record will contain "main_feature" but not any of the ten "Sub-feature $i". If you use something like Data::Dump
to dump the perl structure corresponding to $sequence
, you will see that it does contain the sub features. It simply does not output them.
Edit: I just realised that there's another aspect to my question. I'm trying to write two scripts: one to analyze the sequences and add the subfeatures, and a second script to read the sequences and report on the subfeatures. So an additional requirement is that I can easily read the resulting sequences back into BioPerl including the subfeatures.
Creating a RichSeq object from scratch can be a little tricky. You're probably writing you subfeatures in a limited scope. But, without a piece of code (or the modules you're using) a correct answer will be difficult.
Ok, I'll post some example code.
What are all the features you are trying to print ? Can you please paste the full code to pastebin and provide a link ? I wont be able to test the code in its current format.
I posted a full snippet of example code. You just need an arbitrary sequence file to run it on.
Another question: are you trying to convert between file formats or do you really need to annotate raw sequences. For the first option, SeqIO do the whole job (subfeatures included). For the second case, you will need my complete non-trivial answer.
I'm trying to find a sequence format to use as intermediate data storage between an annotator script and a reporter script.