Question

How Can I Save Bioperl Sequence Nested Features In Genbank Or Embl Format?

4

Entering edit mode

15.3 years ago

Ryan Thompson ★ 3.7k

In BioPerl, a sequence object can have any number of features, and each of these can have subfeatures nested within them. For example, a feature may be a complete coding sequence of a gene, and its subfeatures might be individual exons that are concatenated to form the full coding sequence.

However, when I use BioPerl to write a sequence object to a file in genbank or embl format, only the top-level features are written to the file, not the sub-features nested within the top-level features. How can I store my subfeatures in sequence files? Should I just convert all my subfeatures into top-level features, and then reconstruct the tree structure next time I read in the sequence?

Example code:

#!/usr/bin/perl

use Bio::SeqIO;
use Bio::SeqFeature::Generic;

# Read in a sequence
my $sequence = Bio::SeqIO->newFh(
    -file => $ARGV[0],
)->getline();

# Add the main feature
my $main_feature = Bio::SeqFeature::Generic->new(
    -start => 1,
    -end => 100,
    -primary_tag => 'main_feature',
);
$sequence->add_SeqFeature($main_feature);

# Add some subfeatures
for my $i (1..10) {
    my $subfeature = Bio::SeqFeature::Generic->new(
        -start => ($i * 10 - 9),
        -end => ($i * 10),
        -primary_tag => "Sub-feature $i",
    );
    $main_feature->add_SeqFeature($subfeature);
}

# Write the sequence to the output file. Subfeatures are not saved.
Bio::SeqIO->new(
    -fh => *STDOUT{IO},
    -format => "genbank",
)->write_seq($sequence) or die "FAIL";

Run the example code on any sequence file. It will print a genbank sequence record to STDOUT. This record will contain "main_feature" but not any of the ten "Sub-feature $i". If you use something like Data::Dump to dump the perl structure corresponding to $sequence, you will see that it does contain the sub features. It simply does not output them.

Edit: I just realised that there's another aspect to my question. I'm trying to write two scripts: one to analyze the sequences and add the subfeatures, and a second script to read the sequences and report on the subfeatures. So an additional requirement is that I can easily read the resulting sequences back into BioPerl including the subfeatures.

bioperl sequence data • 7.6k views

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

Creating a RichSeq object from scratch can be a little tricky. You're probably writing you subfeatures in a limited scope. But, without a piece of code (or the modules you're using) a correct answer will be difficult.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Entering edit mode

Ok, I'll post some example code.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

What are all the features you are trying to print ? Can you please paste the full code to pastebin and provide a link ? I wont be able to test the code in its current format.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

I posted a full snippet of example code. You just need an arbitrary sequence file to run it on.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

Another question: are you trying to convert between file formats or do you really need to annotate raw sequences. For the first option, SeqIO do the whole job (subfeatures included). For the second case, you will need my complete non-trivial answer.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Entering edit mode

I'm trying to find a sequence format to use as intermediate data storage between an annotator script and a reporter script.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

Ram · Answer 1 · 2010-04-30

4

Entering edit mode

15.3 years ago

Neilfws 49k

I think the problem may be that whilst Bioperl has a concept of features + subfeatures, a GenBank file does not: anything with a location is simply a feature, regardless of whether it lies within another feature. Sub-features are created by Bioperl when "rich" sequences are read and only for specific tags (such as gene/CDS/exon).

In your code, if you change the loop which generates the "subfeatures" so as the last line is:

$sequence->add_SeqFeature($subfeature);

i.e. subfeatures are added to the sequence object, not the main feature object, then the subfeatures will appear in the output (as features).

Where a feature in a GenBank file contains several lines - those are annotations, not features.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.3 years ago by Neilfws 49k

0

Entering edit mode

Ok, so is there a sequence format the does support subfeatures? Should I forget SeqIO and just use Data::Dump?

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

Also, this is the workaround I'm using right now.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

I guess GFF is the closest to a representation of sub-features, with its "parent" attribute. There's a gff_string method in Bio::SeqFeature, so you'd just add "parent" + the ID of the main feature as a tag of the sub-feature.

ADD REPLY • link 15.3 years ago by Neilfws 49k

0

Entering edit mode

asciitree, chadoxml e gff do support nested references/features. There is no parser for asciitree, chadoxml is well supported in bioperl. Biopython can read/write gff (up to gff3 I guess) and bioperl can write gff. Chadoxml seems to have a bug in somes *nix distributions. Tried asciitree and worked perfectly with your code.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Entering edit mode

I'll return to the flatten/unflatten problem. I've generated fake genbank records some time ago from simulated sequences.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Entering edit mode

feel free to accept this as best answer then :-)

ADD REPLY • link 15.3 years ago by Neilfws 49k

0

Entering edit mode

Yeah. I did. I ended up implementing a custom "unflatten" subroutine in the second script.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k

score 2 · Answer 2 · 2010-04-30

Neil is correct, GenBank/EMBL do not layer features (e.g. everything is top-level), and BioPerl makes no assumptions based on that: WYSIWYG. There are a few scripts bp_genbank2gff3.pl) that help along these lines, in that it helps convert to a format that allows hierarchal features, gene > mRNA > CDS/exon/UTR and so on. The caveat is there is enough variability between records that finding a consistent way to do this is still difficult.

Ram · Answer 3 · 2010-04-29

1

Entering edit mode

15.3 years ago

Khader Shameer 18k

AFAIK you have to define the subfeatures explicitly in the script to generate the genbank or embl format. Have you referred to the Feature-Annotation HOWTO. If you can show the code, it will be better.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

What I mean is, I have a script that reads in some sequences, annotates them with features and subfeatures, and then writes the sequences to an output file. But when I do that, the subfeatures are lost.

ADD REPLY • link 15.3 years ago by Ryan Thompson ★ 3.7k